VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
Abstract
VideoMLA reduces memory usage in video diffusion models by replacing per-head keys and values with shared low-rank content and decoupled 3D-RoPE positional keys, maintaining quality while achieving significant compression and improved throughput.
Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.
Community
VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation (2026)
- Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion (2026)
- AdaState: Self-Evolving Anchors for Streaming Video Generation (2026)
- DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation (2026)
- WorldKV: Efficient World Memory with World Retrieval and Compression (2026)
- Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity (2026)
- CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2605.30351 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper