DeepSeek v4
Fourth-generation flagship release from DeepSeek (April 24, 2026). It ships as two open-weight variants: V4-Pro (1.6T total / 49B active parameters) and V4-Flash (284B total / 13B active). Both are built on a MoE architecture, ship with a 1M-token Context Window by default, and fold what was the separate R-series reasoning line into a single model with switchable Thinking / Non-Thinking modes.
V4-Pro is the largest open-weights model released to date.
What's actually new
- DeepSeek Sparse Attention (DSA) + token-wise compression. The headline architectural innovation; a content-based variant of AI Sparse Attention. V4-Pro uses ~27% of the single-token FLOPs and ~10% of the KV cache size of DeepSeek V3.2 at the same context length. Against vanilla full attention the gap is far larger (early reader notes on the paper estimate ~1% of native attention FLOPs and KV size, with throughput improvements on the order of ~50× still to be independently validated). This is an efficiency-first release, not a scale-first one.
- KV cache footprint that fits on commodity hardware. A full 1M-token context fits in roughly 5.7 GB of KV cache at FP8. For comparison, a Llama-3-405B-class native-attention model would need on the order of ~500 GB to hold the same context. That is what makes 1M-token inference economically real, not just paper-feasible; practitioners report running V4-Flash fully in GPU RAM at 1M context on setups that previously had to spill V3.2 into system memory at 256k.
- Reasoning is no longer a separate model. The R series is folded into V4 (see AI Reasoning Models). Both Pro and Flash expose a `reasoning_effort`-style toggle.
- Bitwise batch-invariant, deterministic kernels. Same input → same output across batch sizes. Most frontier labs trade reproducibility for throughput; DeepSeek deliberately doesn't.
- API surface compatibility. Native support for both the OpenAI ChatCompletions and Anthropic API formats out of the box, lowering migration friction.
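To make the KV cache figures in the list above concrete, here is a minimal back-of-the-envelope sketch. It uses only the two numbers quoted in this note (~5.7 GB for V4 and ~500 GB for a Llama-3-405B-class model, both at 1M tokens). Treating GB as decimal gigabytes is an assumption, and the note does not say which V4 variant the 5.7 GB figure refers to.

```python
# Back-of-the-envelope arithmetic using only the figures quoted in this note.
GB = 1e9  # assumption: decimal gigabytes (the note does not say GB vs GiB)

context_tokens = 1_000_000
v4_kv_bytes = 5.7 * GB      # quoted: ~5.7 GB of KV cache at FP8 for 1M tokens
dense_kv_bytes = 500 * GB   # quoted: ~500 GB for a Llama-3-405B-class model

# Average KV cache cost of holding one token in context.
v4_bytes_per_token = v4_kv_bytes / context_tokens
dense_bytes_per_token = dense_kv_bytes / context_tokens

# How many times smaller the V4 cache is at the same context length.
reduction = dense_kv_bytes / v4_kv_bytes

print(f"V4 KV cache per token:    {v4_bytes_per_token / 1e3:.1f} KB")     # 5.7 KB
print(f"Dense KV cache per token: {dense_bytes_per_token / 1e3:.0f} KB")  # 500 KB
print(f"Reduction factor:         {reduction:.0f}x")                      # 88x
```

Under these assumptions, each token in context costs a few kilobytes rather than hundreds, which is why the full window can stay in GPU memory.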
Pricing (per million tokens, input / output)
| Model | Input | Output |
|---|---|---|
| DeepSeek V4-Flash | $0.14 | $0.28 |
| DeepSeek V4-Pro | $1.74 | $3.48 |
| Claude Opus (ref.) | $5 | $25 |
| GPT-5.5 (ref.) | $5 | $30 |
V4-Pro is the cheapest of the larger frontier models by a wide margin; V4-Flash undercuts even OpenAI's cheapest tier. DeepSeek has signalled further reductions once Huawei Ascend deployment lands in mid-2026.
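The table translates into per-request cost with simple arithmetic. The sketch below copies the list prices from the table; the workload (200k input tokens, 20k output tokens) is a hypothetical example, and the calculation ignores caching, discounts, and any tiered pricing.

```python
# Prices in USD per million tokens (input, output), copied from the table above.
PRICES = {
    "DeepSeek V4-Flash": (0.14, 0.28),
    "DeepSeek V4-Pro": (1.74, 3.48),
    "Claude Opus (ref.)": (5.00, 25.00),
    "GPT-5.5 (ref.)": (5.00, 30.00),
}

def request_cost(model, input_tokens, output_tokens):
    """USD cost of one request at list price (ignores caching and discounts)."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical workload: 200k input tokens, 20k output tokens.
for model in PRICES:
    print(f"{model:20s} ${request_cost(model, 200_000, 20_000):.4f}")
```

For this example workload the sketch gives about $0.03 for V4-Flash, $0.42 for V4-Pro, $1.50 for Claude Opus, and $1.60 for GPT-5.5.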
Performance positioning
V4-Pro rivals top closed-source frontier models and beats all current open models on Math / STEM / Coding benchmarks while preserving stronger world knowledge than other open releases. Independent assessments (Simon Willison, HN practitioners, PicoCreator's reading notes on the paper) consistently place it "between Sonnet and Opus" in feel: roughly 3–6 months behind absolute SOTA, but close enough that the price gap dominates the decision in most agentic / batch workloads. V4-Flash's reasoning capability is reported to closely approach Pro's at a fraction of the cost.
Token-economy caveat. The headline per-token price is the wrong number on its own. On the Artificial Analysis intelligence index, V4-Pro spends ~190M tokens to complete the suite (and Kimi K2.6 ~170M) versus ~45M for GPT-5.5 (high). The 5–15× per-token advantage shrinks (but does not disappear) once you account for verbosity on hard reasoning tasks; the cheaper-per-token model can occasionally cost roughly the same in dollars on the worst cases. The current discount on the official DeepSeek API also makes early comparisons rosier than the steady-state pricing will be; the open-weights release means alternative hosts (OpenRouter, Fireworks, etc.) can fill the gap when official capacity is throttled.
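The caveat can be worked through with the numbers quoted in this note. The sketch below assumes, as a simplification, that every token in the suite is billed at the output rate; the real input/output split is not given here, so treat the result as indicative only.

```python
# Figures quoted in this note; billing everything at the output rate is an assumption.
OUTPUT_PRICE = {"V4-Pro": 3.48, "GPT-5.5 (high)": 30.00}  # USD per 1M output tokens
SUITE_TOKENS_M = {"V4-Pro": 190, "GPT-5.5 (high)": 45}    # millions of tokens used

def suite_cost(model):
    """Approximate USD cost of the suite if all tokens were output tokens."""
    return SUITE_TOKENS_M[model] * OUTPUT_PRICE[model]

price_ratio = OUTPUT_PRICE["GPT-5.5 (high)"] / OUTPUT_PRICE["V4-Pro"]
token_ratio = SUITE_TOKENS_M["V4-Pro"] / SUITE_TOKENS_M["GPT-5.5 (high)"]
effective_ratio = suite_cost("GPT-5.5 (high)") / suite_cost("V4-Pro")

print(f"Per-token price advantage: {price_ratio:.1f}x")      # 8.6x
print(f"Extra tokens spent:        {token_ratio:.1f}x")      # 4.2x
print(f"Effective cost advantage:  {effective_ratio:.1f}x")  # 2.0x
```

Under this simplification, a roughly 8.6× per-token advantage becomes roughly a 2× advantage in dollars for the full suite, which is the "shrinks but does not disappear" effect described above.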
Why this matters
DeepSeek v4 is the clearest signal yet that the frontier is bifurcating along a cost / quality plane rather than a single capability axis. A 6-month-behind, 5-to-15× cheaper open model is the right tool for almost everything that isn't the absolute hardest reasoning step. The DSA + KV-cache reduction also makes ultra-long-context inference economically realistic, not just technically possible — the AI Inference cost curve just shifted.
Early practitioner reports back this up. A non-trivial TypeScript codebase audit (multi-file traversal, type analysis, refactor proposal across two prompts) ran end-to-end on V4-Pro for $0.09; the same task is reported to have cost on the order of $9–$13 on Claude Opus before recent price hikes. A full day of refactor work (many subagents, thousands of changed lines) totalled under $1. The cost ratio collapses on the workloads where verbosity bites (see token caveat above), but on the long tail of "good enough" engineering work it is roughly two orders of magnitude.
The real constraint, on day one, is operational: V4-Pro was hit hard by timeouts and rate limits at launch (including via OpenRouter at peak hours), so V4-Flash, or a third-party host, is the more reliable choice for iterative agent loops until capacity catches up.
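One way to live with that constraint in an agent loop is to retry transient failures and then fall back to a second model. The following is a minimal sketch, not tied to any particular SDK: `TransientError`, `call_with_fallback`, the `call` callback, and the model names are all hypothetical placeholders that you would map onto your own client's timeout and rate-limit errors.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for timeouts and rate-limit responses."""

def call_with_fallback(call, models, prompt, retries=3, base_delay=1.0, sleep=time.sleep):
    """Try each model in order; retry transient failures with exponential backoff."""
    last_error = None
    for model in models:
        for attempt in range(retries):
            try:
                return model, call(model, prompt)
            except TransientError as err:
                last_error = err
                if attempt < retries - 1:
                    # Exponential backoff with up to 10% jitter before retrying.
                    sleep(base_delay * (2 ** attempt) * (1 + random.random() * 0.1))
    raise RuntimeError("all models exhausted") from last_error

# Demo with a fake client: the first (hypothetical) model is always rate limited.
def fake_call(model, prompt):
    if model == "v4-pro":
        raise TransientError("429 rate limited")
    return f"[{model}] ok"

print(call_with_fallback(fake_call, ["v4-pro", "v4-flash"], "hi", sleep=lambda s: None))
# prints ('v4-flash', '[v4-flash] ok')
```

Passing `sleep` as a parameter keeps the demo instant and makes the helper easy to test; in real use the default `time.sleep` applies.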
References
- Official announcement: https://api-docs.deepseek.com/news/news260424
- Announcement post (X): https://x.com/deepseek_ai/status/2047516922263285776
- Model collection: https://huggingface.co/collections/deepseek-ai/deepseek-v4
- Technical report: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
- Simon Willison's writeup: https://simonwillison.net/2026/Apr/24/deepseek-v4/
- PicoCreator's raw reading notes on the V4 paper (X): https://x.com/picocreator/status/2047625988125954386
- Hacker News, launch-day discussions: https://news.ycombinator.com/item?id=47884971 and https://news.ycombinator.com/item?id=47885014
- Hacker News, V4 in practice (cost, token economy, local deployment): https://news.ycombinator.com/item?id=47977026
- Artificial Analysis pages: https://artificialanalysis.ai/models/deepseek-v4-pro and https://artificialanalysis.ai/models/deepseek-v4-flash
Related
- Deepseek
- Large Language Models (LLMs)
- AI Mixture of Experts (MoE)
- AI Open Weight Models
- AI KV Cache
- AI Inference
- Context Window
- Sparse AI Models
- Dense AI Models
- Chain-of-Thought (CoT) prompting
- HuggingFace
- Claude
- ChatGPT
- Anthropic
- OpenAI
- Mistral Small 4
- Kimi K2.6
- GPT-5
- OpenRouter
- OpenCode
- Simon Willison
About Sébastien
I'm Sébastien Dubois, and I'm on a mission to help knowledge workers escape information overload. After 20+ years in IT and seeing too many brilliant minds drowning in digital chaos, I've decided to help people build systems that actually work. Through the Knowii Community, my courses, products & services and my Website/Newsletter, I share practical and battle-tested systems.
I write about Knowledge Work, Personal Knowledge Management, Note-taking, Lifelong Learning, Personal Organization, Productivity, and more. I also craft lovely digital products and tools.