The LiteAttention framework addresses computational bottlenecks in state-of-the-art video generation models. Without requiring any fine-tuning of pre-trained models, it leverages the temporal coherence of sparsity patterns across denoising timesteps to significantly speed up the inference process.
Through a co-design of algorithms and systems, LiteAttention provides an end-to-end solution that achieves measurable speedups on real-world hardware, offering greater efficiency and lower costs for video generation tasks.
- Identify non-essential tiles once during early denoising and propagate skip decisions forward through the entire trajectory.
- Skip the entire attention iteration (QK product, softmax, PV product) for marked tiles, not just partial stages.
- Assign different error bounds to different timesteps, with stricter bounds for earlier timesteps, which have greater influence on the final output (see the sketch after this list).
- Production-ready: requires no model retraining or architectural modifications.
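As a rough illustration of the timestep-dependent error bounds, here is a minimal sketch; the linear schedule, parameter values, and log-space interpretation below are assumptions for illustration, not the paper's exact formula:

```python
def error_bound(step: int, num_steps: int,
                base: float = -6.0, slack: float = 2.0) -> float:
    """Hypothetical per-timestep error bound (log-space, as suggested by the
    negative default threshold). Early denoising steps shape the global
    structure of the video, so they get a stricter (more negative) bound;
    later steps are allowed a looser one. Linear ramp for illustration only.
    """
    progress = step / max(num_steps - 1, 1)  # 0.0 at the first step, 1.0 at the last
    return base - slack * (1.0 - progress)   # e.g. -8.0 early -> -6.0 late
```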
LiteAttention introduces evolutionary computation skips that leverage temporal coherence in diffusion attention.
Unlike dynamic methods that re-profile sparsity from scratch at every step (incurring 10-20% overhead), LiteAttention maintains a Skip-Mask that evolves monotonically across timesteps: as the diffusion process progresses, the set of tiles marked for skipping only grows.
Once a tile is marked as skippable, the entire attention iteration is bypassed for subsequent timesteps, eliminating redundant computations without repeated profiling.
This approach combines the adaptivity of dynamic sparsity methods with the low overhead of static ones.
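A minimal sketch of the monotone Skip-Mask update is below; the tile granularity, the stand-in importance scores, and the threshold value are illustrative assumptions, not the actual kernel logic:

```python
import torch

def update_skip_mask(skip_mask: torch.Tensor,
                     tile_scores: torch.Tensor,
                     threshold: float) -> torch.Tensor:
    # Monotone update: once a (query-tile, key-tile) pair falls below the
    # threshold it stays marked, so the mask only grows across timesteps
    # and no per-step re-profiling is needed.
    return skip_mask | (tile_scores < threshold)

# Toy trajectory over a few denoising timesteps.
num_q_tiles, num_k_tiles = 8, 8
skip_mask = torch.zeros(num_q_tiles, num_k_tiles, dtype=torch.bool)
for step in range(4):
    # Stand-in for a real per-tile importance estimate produced by the kernel.
    tile_scores = torch.randn(num_q_tiles, num_k_tiles)
    skip_mask = update_skip_mask(skip_mask, tile_scores, threshold=-1.0)
    print(f"step {step}: {skip_mask.float().mean():.0%} of tiles skipped")
# For marked tiles, the entire QK product, softmax, and PV product are
# bypassed on all later steps.
```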
LiteAttention achieves state-of-the-art video quality with significant speedups compared to other sparse attention methods, evaluated using VBench metrics on production video diffusion models.
| Method | AQ ↑ | BC ↑ | DD ↑ | IQ ↑ | SC ↑ | TF ↑ | TS ↑ | Sparsity ↑ | Runtime ↓ |
|---|---|---|---|---|---|---|---|---|---|
| FlashAttention3 | *0.693* | **0.977** | **0.583** | **72.73** | **0.970** | **0.953** | *0.133* | 0% | 1473 sec |
| SparseVideoGen | 0.689 | 0.962 | 0.417 | *72.24* | 0.961 | *0.952* | 0.061 | **66%** | *1022 sec* |
| RadialAttention | 0.682 | *0.974* | *0.500* | **72.73** | 0.967 | 0.947 | 0.061 | **66%** | 1207 sec |
| LiteAttention | **0.698** | **0.977** | *0.500* | 71.44 | *0.969* | **0.953** | **0.135** | *32%* | **893 sec** |
*Best results in bold, second-best in italic.*

VBench metrics: AQ (Aesthetic Quality), BC (Background Consistency), DD (Dynamic Degree), IQ (Imaging Quality), SC (Subject Consistency), TF (Temporal Flickering), TS (Temporal Style).
LiteAttention achieves significant speedups over the FlashAttention3 baseline (1473 sec → 893 sec in the table above, roughly a 1.65× speedup) and the best runtime on both evaluated models, while maintaining quality metrics superior to SparseVideoGen and RadialAttention.
Our ablation studies demonstrate that runtime improvement scales with attention sparsity:
| Sparsity | Self-Attention Runtime | Runtime Improvement |
|---|---|---|
| 0% | 695 sec | 0% (baseline) |
| 21% | 573 sec | 18% |
| 42% | 418 sec | 40% |
| 57% | 308 sec | 56% |
| 77% | 163 sec | 77% |
The near-linear scaling between sparsity and runtime improvement demonstrates the efficiency of our QK-Skip algorithm.
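The near-linear relationship can be checked directly from the table above:

```python
baseline = 695  # seconds at 0% sparsity
measurements = {0.21: 573, 0.42: 418, 0.57: 308, 0.77: 163}  # sparsity -> runtime (sec)
for sparsity, runtime in measurements.items():
    improvement = 1 - runtime / baseline
    print(f"sparsity {sparsity:.0%} -> improvement {improvement:.0%}")
# sparsity 21% -> improvement 18%
# sparsity 42% -> improvement 40%
# sparsity 57% -> improvement 56%
# sparsity 77% -> improvement 77%
```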
LiteAttention provides significant speedups on video generation tasks; see the GitHub repository for generation times and visual comparisons at different threshold settings.
```bash
git clone https://github.com/moonmath-ai/LiteAttention.git
cd LiteAttention/hopper
python setup.py install
```
```python
from lite_attention import LiteAttention

# Initialize with threshold
attn = LiteAttention(threshold=-6.0)

# Use in your model
output = attn(query, key, value, scale)
```
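A fuller, self-contained usage sketch follows; the tensor layout, dtype, device, and scale value are assumptions for illustration, so check the repository documentation for the exact expected shapes:

```python
import torch
from lite_attention import LiteAttention

attn = LiteAttention(threshold=-6.0)

# Assumed (batch, heads, sequence, head_dim) layout on a Hopper GPU,
# matching the FlashAttention3 backend this package builds on.
B, H, N, D = 1, 16, 4096, 64
query = torch.randn(B, H, N, D, device="cuda", dtype=torch.bfloat16)
key = torch.randn(B, H, N, D, device="cuda", dtype=torch.bfloat16)
value = torch.randn(B, H, N, D, device="cuda", dtype=torch.bfloat16)

output = attn(query, key, value, D ** -0.5)  # scale = 1/sqrt(head_dim)
assert output.shape == query.shape
```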
See the GitHub repository for detailed documentation and examples.
```bibtex
@misc{shmilovich2025liteattentiontemporalsparseattention,
  title={LiteAttention: A Temporal Sparse Attention for Diffusion Transformers},
  author={Dor Shmilovich and Tony Wu and Aviad Dahan and Yuval Domb},
  year={2025},
  eprint={2511.11062},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.11062},
}
```
LiteAttention is built on top of FlashAttention3 by Tri Dao and contributors. We thank the FlashAttention team for their foundational work on efficient attention mechanisms.
We also thank the teams behind SparseVideoGen, RadialAttention, SageAttention, Wan2.1, and LTX-Video for their insights and benchmarking support.