LiteAttention: A Temporal Sparse Attention for Diffusion Transformers

Dor Shmilovich, Tony Wu, Aviad Dahan, Yuval Domb

Overview

The LiteAttention framework addresses computational bottlenecks in state-of-the-art video generation models. Without requiring any fine-tuning of pre-trained models, it leverages the temporal coherence of sparsity patterns across denoising timesteps to significantly speed up the inference process.

Through a co-design of algorithms and systems, LiteAttention provides an end-to-end solution that achieves measurable speedups on real-world hardware, offering greater efficiency and lower costs for video generation tasks.

Key Features

Evolutionary Computation Skips

Identify non-essential tiles once during early denoising and propagate skip decisions forward through the entire trajectory.

Full-Stage Elimination

Skip the entire attention iteration (QK product, softmax, PV product) for marked tiles, not just partial stages.

Error Calibration

Assign different error bounds to different timesteps, with stricter bounds for earlier timesteps that have greater influence.
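A minimal sketch of what such a calibration could look like, assuming a simple linear ramp over denoising steps. The endpoints and the linear shape are illustrative assumptions, not the calibration used by LiteAttention:

def threshold_schedule(step, num_steps, start=-10.0, end=-3.0):
    # Hypothetical per-timestep skip threshold in log-probability space.
    # Early steps shape global structure, so they get a stricter (more
    # negative) threshold that marks fewer tiles as skippable; later
    # steps relax toward `end`. The ramp and endpoints are assumptions.
    frac = step / max(num_steps - 1, 1)
    return start + (end - start) * frac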

Zero Training Required

Production-ready, requires no model retraining or architectural modifications.

Efficient Sparse Attention via QK-Skip Algorithm

LiteAttention introduces evolutionary computation skips that leverage temporal coherence in diffusion attention.

QK-Skip Algorithm

Unlike dynamic methods that repeatedly re-profile sparsity at every step (incurring 10-20% overhead), LiteAttention maintains a persistent Skip-Mask that is only extended from one timestep to the next. As the diffusion process progresses, the number of tiles marked for skipping grows monotonically.

Once a tile is marked as skippable, the entire attention iteration is bypassed for subsequent timesteps, eliminating redundant computations without repeated profiling.
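A minimal PyTorch reference of the per-timestep logic is sketched below. It mimics the skip behavior with dense masking, whereas the real implementation fuses it into a FlashAttention-style kernel that never touches skipped tiles (no QK product, no softmax, no PV accumulation). The tile size, the log-probability thresholding rule, and the always-active diagonal are illustrative assumptions; the paper's exact skip criterion may differ:

import torch

def qk_skip_step(q, k, v, skip_mask, scale, tile=64, log_threshold=-6.0):
    # One denoising timestep of tile-level attention with a persistent
    # Skip-Mask. q, k, v: (seq, dim) tensors.
    # skip_mask: (seq//tile, seq//tile) bool, True = tile is skipped.
    n = q.shape[0]
    scores = (q @ k.T) * scale
    # Expand the tile-level mask to token resolution; skipped tiles
    # contribute nothing (the fused kernel would simply never compute them).
    tok_mask = skip_mask.repeat_interleave(tile, 0).repeat_interleave(tile, 1)
    probs = scores.masked_fill(tok_mask, float("-inf")).softmax(dim=-1)
    out = probs @ v
    # Mark tiles whose peak attention weight stays below the threshold.
    # The mask only grows, so a tile skipped once stays skipped later.
    tile_peak = probs.reshape(n // tile, tile, n // tile, tile).amax(dim=(1, 3))
    skip_mask |= tile_peak.log() < log_threshold
    skip_mask.fill_diagonal_(False)  # assumption: diagonal tiles stay active,
                                     # so every query row attends somewhere
    return out, skip_mask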

This approach combines temporal propagation of skip decisions, full-stage elimination for skipped tiles, and per-timestep error calibration, all without retraining the model. The sketch below ties these pieces together.
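A hedged end-to-end sketch, reusing the illustrative qk_skip_step and threshold_schedule defined above. Shapes and step count are arbitrary, and a real model would apply this per layer and per attention head:

import torch

seq, dim, tile, num_steps = 256, 64, 64, 50
scale = dim ** -0.5
skip_mask = torch.zeros(seq // tile, seq // tile, dtype=torch.bool)

for step in range(num_steps):
    # In a diffusion transformer, q, k, v come from the current denoising step.
    q, k, v = (torch.randn(seq, dim) for _ in range(3))
    out, skip_mask = qk_skip_step(q, k, v, skip_mask, scale, tile,
                                  log_threshold=threshold_schedule(step, num_steps))
    # The mask only grows, so attention gets cheaper as denoising proceeds.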

Quantitative Evaluation

LiteAttention achieves state-of-the-art video quality with significant speedups compared to other sparse attention methods, evaluated using VBench metrics on production video diffusion models.

[Figure: VBench quality vs. runtime for the FlashAttention3 baseline, LiteAttention, RadialAttention, and SparseVideoGen.]

Wan2.2-14B Detailed Comparison

Method           AQ ↑    BC ↑    DD ↑    IQ ↑    SC ↑    TF ↑    TS ↑    Sparsity ↑   Runtime ↓
FlashAttention3  0.693   0.977   0.583   72.73   0.970   0.953   0.133   0%           1473 sec
SparseVideoGen   0.689   0.962   0.417   72.24   0.961   0.952   0.061   66%          1022 sec
RadialAttention  0.682   0.974   0.500   72.73   0.967   0.947   0.061   66%          1207 sec
LiteAttention    0.698   0.977   0.500   71.44   0.969   0.953   0.135   32%          893 sec

VBench Metrics: AQ (Aesthetic Quality), BC (Background Consistency), DD (Dynamic Degree), IQ (Imaging Quality), SC (Subject Consistency), TF (Temporal Flickering), TS (Temporal Style)

Speedup Analysis

LiteAttention achieves the best runtime on both evaluated models while maintaining quality metrics that match or exceed those of SparseVideoGen and RadialAttention, translating into significant end-to-end speedups over the FlashAttention3 baseline.

Ablation Study: Sparsity vs Runtime

Our ablation studies demonstrate that runtime improvement scales with attention sparsity:

Sparsity   Self-Attention Runtime   Runtime Improvement
0%         695 sec                  0% (baseline)
21%        573 sec                  18%
42%        418 sec                  40%
57%        308 sec                  56%
77%        163 sec                  77%

The near-linear scaling between sparsity and runtime improvement demonstrates the efficiency of our QK-Skip algorithm.
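As a quick sanity check of the near-linear claim, the measured runtimes closely track the idealized model runtime ≈ baseline × (1 − sparsity):

# Idealized model: self-attention runtime ≈ baseline * (1 - sparsity).
baseline = 695.0  # seconds at 0% sparsity
for sparsity, measured in [(0.21, 573), (0.42, 418), (0.57, 308), (0.77, 163)]:
    predicted = baseline * (1 - sparsity)
    print(f"{sparsity:.0%}: predicted {predicted:.0f} sec, measured {measured} sec")

The small, consistent gap between predicted and measured times is plausibly the residual cost of the skip checks themselves.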

Gallery - Wan2.1-14B Generation Times

LiteAttention provides significant speedups on video generation tasks. Below are generation times at different skip-threshold settings (more negative thresholds skip fewer tiles and preserve more computation):

Baseline (no skip):   23m 51s
Threshold -10:        14m 19s
Threshold -3:         11m 46s
Threshold 0:          8m 31s

Quick Start

Installation

git clone https://github.com/moonmath-ai/LiteAttention.git
cd LiteAttention/hopper
python setup.py install

Basic Usage

from lite_attention import LiteAttention

# Initialize with a skip threshold (more negative = fewer tiles skipped)
attn = LiteAttention(threshold=-6.0)

# Use in your model
output = attn(query, key, value, scale)
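For a video diffusion pipeline, a natural pattern is to reuse one LiteAttention instance across the denoising loop so skip decisions can carry over between timesteps. The following is a hedged sketch: the tensor shapes, dtype, device, and the stateful-reuse pattern are our assumptions, not confirmed API behavior; only the constructor and call signature above come from the documented example.

import torch
from lite_attention import LiteAttention

attn = LiteAttention(threshold=-6.0)  # one instance for the whole trajectory
seq, dim = 4096, 128                  # assumed shapes for illustration
scale = dim ** -0.5

for step in range(50):                # denoising loop of the diffusion model
    # q, k, v would come from the transformer at the current timestep.
    q, k, v = (torch.randn(seq, dim, device="cuda", dtype=torch.bfloat16)
               for _ in range(3))
    out = attn(q, k, v, scale)        # skipped tiles accumulate across steps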

See the GitHub repository for detailed documentation and examples.

BibTeX

@misc{shmilovich2025liteattentiontemporalsparseattention,
  title={LiteAttention: A Temporal Sparse Attention for Diffusion Transformers},
  author={Dor Shmilovich and Tony Wu and Aviad Dahan and Yuval Domb},
  year={2025},
  eprint={2511.11062},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.11062},
}

Acknowledgements

LiteAttention is built on top of FlashAttention3 by Tri Dao and contributors. We thank the FlashAttention team for their foundational work on efficient attention mechanisms.

We also thank the teams behind SparseVideoGen, RadialAttention, SageAttention, Wan2.1, and LTX-Video for their insights and benchmarking support.