AI Sparse Attention

A family of attention mechanisms that compute interactions between only a subset of token pairs instead of every pair, breaking the O(n²) cost of full self-attention. The dense attention used in vanilla Transformers forces every token to attend to every other token; sparse attention restricts that i

Canonical version: AI Sparse Attention.

A family of attention mechanisms that compute interactions between only a subset of token pairs instead of every pair, breaking the O(n²) cost of full self-attention. The dense attention used in vanilla Transformers forces every token to attend to every other token; sparse attention restricts that interaction graph by structure (fixed patterns) or by content (learned routing), trading a small drop in modelling power for very large gains in speed and memory.

Sparse attention is the architectural move that makes million-token context windows economically viable — both at training time (FLOPs) and at inference time (KV cache size).

Why it matters

Full attention scales O(n²) in compute and O(n) in KV cache per layer. At a 1M-token context, those costs dominate everything else. Sparse attention attacks both axes:

  • Compute: each token attends to k ≪ n neighbours, so per-layer cost drops from O(n²) toward O(n·k) or O(n·log n).
  • Memory: only the kept keys/values need to live in the cache; in some schemes the cache itself is compressed.

The trade-off is information loss when the dropped pairs would have mattered. The whole design space is "which pairs can I safely skip?"

Three families

  • Pattern-based (fixed sparsity). Each token attends to a deterministic subset: sliding window (Longformer), local + global landmarks (BigBird), strided / dilated (Sparse Transformer), block-diagonal. Cheap and predictable but blind to content.
  • Content-based (learned sparsity). The model learns which pairs matter. Routing transformers cluster tokens; Reformer uses LSH; Native Sparse Attention and DeepSeek Sparse Attention (DSA, see DeepSeek v4) compress and route at the token level. More expensive logic per layer, but far better quality at the same sparsity ratio.
  • Approximated / linearised attention. Performer, Linformer, and Linear Attention rewrite the softmax so attention becomes O(n) via kernel approximation or low-rank projection. Technically not "sparse" but pursues the same goal — drop the quadratic — through a different door.

Relationship to MoE

Sparse attention and MoE are both forms of sparsity but operate on different axes. MoE is sparse in the parameter dimension: only a few experts fire per token. Sparse attention is sparse in the sequence dimension: only a few token pairs interact per layer. Modern frontier open models (DeepSeek v4, Native Sparse Attention models, Kimi K2) stack the two — sparse experts plus sparse attention — to compound the efficiency gains.

What this unlocks

  • Million-token contexts that were technically supported but economically impractical.
  • Long-context inference on smaller deployments, because the KV cache stops dominating GPU memory.
  • Better cost / quality positioning for open-weights models vs closed frontier APIs — most of DeepSeek v4's price advantage traces back to DSA.

Limitations

  • Quality on long-range, content-dependent tasks (multi-hop retrieval over a 500K-token context) can lag full attention; benchmarks like RULER and InfiniteBench expose this.
  • Pattern-based schemes underperform when the structure is wrong for the data.
  • Content-based schemes add training complexity and can be hard to load-balance, similar to MoE routing failure modes.

References


About Sébastien

I'm Sébastien Dubois, and I'm on a mission to help knowledge workers escape information overload. After 20+ years in IT and seeing too many brilliant minds drowning in digital chaos, I've decided to help people build systems that actually work. Through the Knowii Community, my courses, products & services and my Website/Newsletter, I share practical and battle-tested systems.

I write about Knowledge Work, Personal Knowledge Management, Note-taking, Lifelong Learning, Personal Organization, Productivity, and more. I also craft lovely digital products and tools.

If you want to follow my work, then become a member and join our community.

Ready to get to the next level?

If you're tired of information overwhelm and ready to build a reliable knowledge system:

Found this valuable? Share it with someone who needs it.

Join 6,000+ readers. Get practical systems for knowledge & AI. Free.

Subscribe ✨

Free: Knowledge System Checklist

A clear roadmap to building your own knowledge system. Subscribe and get it straight to your inbox.

6,000+ readers. No spam. Unsubscribe anytime.