LiteAttention: A Temporal Sparse Attention for Diffusion Transformers

Dor Shmilovich, Tony Wu, Aviad Dahan, Yuval Domb

Overview

The LiteAttention framework addresses computational bottlenecks in state-of-the-art video generation models. Without requiring any fine-tuning of pre-trained models, it leverages the temporal coherence of sparsity patterns across denoising timesteps to significantly speed up the inference process.

Through a co-design of algorithms and systems, LiteAttention provides an end-to-end solution that achieves measurable speedups on real-world hardware, offering greater efficiency and lower costs for video generation tasks.

Key Features

Evolutionary Computation Skips

Identify non-essential tiles once during early denoising and propagate skip decisions forward through the entire trajectory.

Full-Stage Elimination

Skip the entire attention iteration (QK product, softmax, PV product) for marked tiles, not just partial stages.

Error Calibration

Assign different error bounds to different timesteps, with stricter bounds for earlier timesteps that have greater influence.
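A minimal sketch of what such a calibration could look like, assuming a simple linear ramp over denoising steps. The endpoints and the linear shape are illustrative assumptions, not the calibration used by LiteAttention:

def threshold_schedule(step, num_steps, start=-10.0, end=-3.0):
    # Hypothetical per-timestep skip threshold in log-probability space.
    # Early steps shape global structure, so they get a stricter (more
    # negative) threshold that marks fewer tiles as skippable; later
    # steps relax toward `end`. The ramp and endpoints are assumptions.
    frac = step / max(num_steps - 1, 1)
    return start + (end - start) * frac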

Zero Training Required

Production-ready, requires no model retraining or architectural modifications.

Efficient Sparse Attention via QK-Skip Algorithm

LiteAttention introduces evolutionary computation skips that leverage temporal coherence in diffusion attention.

QK-Skip Algorithm

Unlike dynamic methods that repeatedly re-profile sparsity at every step (incurring 10-20% overhead), LiteAttention maintains a persistent Skip-Mask that is only extended from one timestep to the next. As the diffusion process progresses, the number of tiles marked for skipping grows monotonically.

Once a tile is marked as skippable, the entire attention iteration is bypassed for subsequent timesteps, eliminating redundant computations without repeated profiling.
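A minimal PyTorch reference of the per-timestep logic is sketched below. It mimics the skip behavior with dense masking, whereas the real implementation fuses it into a FlashAttention-style kernel that never touches skipped tiles (no QK product, no softmax, no PV accumulation). The tile size, the log-probability thresholding rule, and the always-active diagonal are illustrative assumptions; the paper's exact skip criterion may differ:

import torch

def qk_skip_step(q, k, v, skip_mask, scale, tile=64, log_threshold=-6.0):
    # One denoising timestep of tile-level attention with a persistent
    # Skip-Mask. q, k, v: (seq, dim) tensors.
    # skip_mask: (seq//tile, seq//tile) bool, True = tile is skipped.
    n = q.shape[0]
    scores = (q @ k.T) * scale
    # Expand the tile-level mask to token resolution; skipped tiles
    # contribute nothing (the fused kernel would simply never compute them).
    tok_mask = skip_mask.repeat_interleave(tile, 0).repeat_interleave(tile, 1)
    probs = scores.masked_fill(tok_mask, float("-inf")).softmax(dim=-1)
    out = probs @ v
    # Mark tiles whose peak attention weight stays below the threshold.
    # The mask only grows, so a tile skipped once stays skipped later.
    tile_peak = probs.reshape(n // tile, tile, n // tile, tile).amax(dim=(1, 3))
    skip_mask |= tile_peak.log() < log_threshold
    skip_mask.fill_diagonal_(False)  # assumption: diagonal tiles stay active,
                                     # so every query row attends somewhere
    return out, skip_mask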

This approach combines temporal propagation of skip decisions, full-stage elimination for skipped tiles, and per-timestep error calibration, all without retraining the model. The sketch below ties these pieces together.
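A hedged end-to-end sketch, reusing the illustrative qk_skip_step and threshold_schedule defined above. Shapes and step count are arbitrary, and a real model would apply this per layer and per attention head:

import torch

seq, dim, tile, num_steps = 256, 64, 64, 50
scale = dim ** -0.5
skip_mask = torch.zeros(seq // tile, seq // tile, dtype=torch.bool)

for step in range(num_steps):
    # In a diffusion transformer, q, k, v come from the current denoising step.
    q, k, v = (torch.randn(seq, dim) for _ in range(3))
    out, skip_mask = qk_skip_step(q, k, v, skip_mask, scale, tile,
                                  log_threshold=threshold_schedule(step, num_steps))
    # The mask only grows, so attention gets cheaper as denoising proceeds.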

Quantitative Evaluation

LiteAttention achieves state-of-the-art video quality with significant speedups compared to other sparse attention methods, evaluated using VBench metrics on production video diffusion models.

[Figure: VBench quality vs. runtime for the FlashAttention3 baseline, LiteAttention, RadialAttention, and SparseVideoGen.]

Wan2.2-14B Detailed Comparison

Method           AQ ↑    BC ↑    DD ↑    IQ ↑    SC ↑    TF ↑    TS ↑    Sparsity ↑   Runtime ↓
FlashAttention3  0.693   0.977   0.583   72.73   0.970   0.953   0.133   0%           1473 sec
SparseVideoGen   0.689   0.962   0.417   72.24   0.961   0.952   0.061   66%          1022 sec
RadialAttention  0.682   0.974   0.500   72.73   0.967   0.947   0.061   66%          1207 sec
LiteAttention    0.698   0.977   0.500   71.44   0.969   0.953   0.135   32%          893 sec

VBench Metrics: AQ (Aesthetic Quality), BC (Background Consistency), DD (Dynamic Degree), IQ (Imaging Quality), SC (Subject Consistency), TF (Temporal Flickering), TS (Temporal Style)

Speedup Analysis

LiteAttention achieves the best runtime on both evaluated models while maintaining quality metrics that match or exceed those of SparseVideoGen and RadialAttention, translating into significant end-to-end speedups over the FlashAttention3 baseline.

Ablation Study: Sparsity vs Runtime

Our ablation studies demonstrate that runtime improvement scales with attention sparsity:

Sparsity   Self-Attention Runtime   Runtime Improvement
0%         695 sec                  0% (baseline)
21%        573 sec                  18%
42%        418 sec                  40%
57%        308 sec                  56%
77%        163 sec                  77%

The near-linear scaling between sparsity and runtime improvement demonstrates the efficiency of our QK-Skip algorithm.
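As a quick sanity check of the near-linear claim, the measured runtimes closely track the idealized model runtime ≈ baseline × (1 − sparsity):

# Idealized model: self-attention runtime ≈ baseline * (1 - sparsity).
baseline = 695.0  # seconds at 0% sparsity
for sparsity, measured in [(0.21, 573), (0.42, 418), (0.57, 308), (0.77, 163)]:
    predicted = baseline * (1 - sparsity)
    print(f"{sparsity:.0%}: predicted {predicted:.0f} sec, measured {measured} sec")

The small, consistent gap between predicted and measured times is plausibly the residual cost of the skip checks themselves.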

Gallery - Wan2.1-14B Generation Times

LiteAttention provides significant speedups on video generation tasks. Below are generation times at different skip-threshold settings (more negative thresholds skip fewer tiles and preserve more computation):

Baseline (no skip):   23m 51s
Threshold -10:        14m 19s
Threshold -3:         11m 46s
Threshold 0:          8m 31s

Quick Start

Installation

git clone https://github.com/moonmath-ai/LiteAttention.git
cd LiteAttention/hopper
python setup.py install

Basic Usage

from lite_attention import LiteAttention

# Initialize with a skip threshold (more negative = fewer tiles skipped)
attn = LiteAttention(threshold=-6.0)

# Use in your model
output = attn(query, key, value, scale)
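For a video diffusion pipeline, a natural pattern is to reuse one LiteAttention instance across the denoising loop so skip decisions can carry over between timesteps. The following is a hedged sketch: the tensor shapes, dtype, device, and the stateful-reuse pattern are our assumptions, not confirmed API behavior; only the constructor and call signature above come from the documented example.

import torch
from lite_attention import LiteAttention

attn = LiteAttention(threshold=-6.0)  # one instance for the whole trajectory
seq, dim = 4096, 128                  # assumed shapes for illustration
scale = dim ** -0.5

for step in range(50):                # denoising loop of the diffusion model
    # q, k, v would come from the transformer at the current timestep.
    q, k, v = (torch.randn(seq, dim, device="cuda", dtype=torch.bfloat16)
               for _ in range(3))
    out = attn(q, k, v, scale)        # skipped tiles accumulate across steps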

See the GitHub repository for detailed documentation and examples.

BibTeX

@misc{shmilovich2025liteattentiontemporalsparseattention,
  title={LiteAttention: A Temporal Sparse Attention for Diffusion Transformers},
  author={Dor Shmilovich and Tony Wu and Aviad Dahan and Yuval Domb},
  year={2025},
  eprint={2511.11062},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.11062},
}

Acknowledgements

LiteAttention is built on top of FlashAttention3 by Tri Dao and contributors. We thank the FlashAttention team for their foundational work on efficient attention mechanisms.

We also thank the teams behind SparseVideoGen, RadialAttention, SageAttention, Wan2.1, and LTX-Video for their insights and benchmarking support.