
How AI Remembers: DeepSeek V4 and the Million-Token Breakthrough

by Lud3ns 2026. 4. 25.
๋ฐ˜์‘ํ˜•

How AI Remembers: DeepSeek V4 and the Million-Token Breakthrough

TL;DR

  • On April 24, 2026, DeepSeek released V4-Pro and V4-Flash, both supporting 1-million-token context windows natively.
  • V4-Pro uses just 27% of the compute and 10% of the memory cache compared to its predecessor at the same context length.
  • The breakthrough is not bigger models but smarter attention: a hybrid system that compresses what AI remembers without losing the ability to recall.
  • This changes what AI can do: entire codebases, full novels, and year-long conversation histories now fit in a single prompt.
  • The real lesson is not about DeepSeek. It is about how compression โ€” not raw scale โ€” is becoming the frontier of useful AI.

When you ask a chatbot about something you said ten messages ago and it gives you a confused answer, you have hit the wall every modern AI runs into. That wall has a name: the attention mechanism. And on April 24, 2026, a Chinese lab named DeepSeek pushed it back by an order of magnitude.

The release of DeepSeek V4 is being covered as a benchmark story: V4-Pro-Max scored 3206 on Codeforces, beating GPT-5.4-xHigh's 3168. Huawei announced "full support" via its Ascend chip line within hours. Bloomberg framed it as China's answer to American frontier labs. All true, all surface. The deeper story is what V4 teaches about how AI actually remembers, and why that matters far more than which lab is winning this quarter.

Why AI "Forgets" in Long Conversations

To understand the breakthrough, you need to understand the limit it broke. Modern AI models are built on a mechanism called attention, introduced in the 2017 paper Attention Is All You Need. Attention is how the model decides which parts of your input matter for the next word it predicts.

The problem is that attention is quadratic. Every token (roughly, every word-chunk) in your prompt looks at every other token. Double the context, and computation roughly quadruples. Triple it, and computation grows ninefold.

| Context Length | Tokens | Relative Compute |
|---|---|---|
| Short chat | 2,000 | 1× |
| Long document | 32,000 | 256× |
| Whole book | 200,000 | 10,000× |
| Million-token | 1,000,000 | 250,000× |
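
A back-of-the-envelope calculation makes the blow-up visible. The sketch below simply squares the ratio of each context length to the 2,000-token baseline; it illustrates the scaling, not real FLOP counts:

```python
# Quadratic attention: every token attends to every other token,
# so cost grows with the square of the context length.
BASELINE_TOKENS = 2_000  # the "short chat" row above

def relative_attention_cost(tokens: int) -> float:
    return (tokens / BASELINE_TOKENS) ** 2

for label, tokens in [("Short chat", 2_000), ("Long document", 32_000),
                      ("Whole book", 200_000), ("Million-token", 1_000_000)]:
    print(f"{label:14s} {tokens:>9,} tokens -> {relative_attention_cost(tokens):>9,.0f}x")
```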

This is why context windows are the real bottleneck for AI, not parameter count. A 2-trillion-parameter model that can only remember 8,000 tokens at a time is, for many real tasks, less useful than a smaller model that can hold a full project in mind.

What Is the KV Cache, and Why Does It Matter?

The "memory" your AI uses during a conversation lives in something called the KV cache โ€” short for key-value cache. Think of it as the model's working memory: a stored snapshot of every token's role in the conversation so far, used to inform every new word it generates.

The KV cache is what makes long context expensive. Every new token added to the conversation must reference everything in the cache. When a 200-page document gets fed into a model, the KV cache balloons until inference cost becomes prohibitive, or the model simply truncates the older tokens and "forgets."
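
To see why the cache balloons, here is a minimal sketch of how KV-cache size is commonly estimated. The layer count, grouped-query head count, and head dimension below are illustrative assumptions for a 70B-class model, not any vendor's published specification:

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Keys and values are stored for every layer and every token (fp16 = 2 bytes)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Assumed configuration: 80 layers, 8 KV heads, head dimension 128, fp16 values.
for tokens in (32_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128) / 2**30
    print(f"{tokens:>9,} tokens -> ~{gib:,.0f} GiB of KV cache")
```

Even under these modest assumptions, the cache alone runs to hundreds of gibibytes at a million tokens, well past the memory of a single GPU before the model weights are even counted.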

Reducing the KV cache without losing information is the holy grail of long-context AI. That is exactly what DeepSeek V4 does.

How DeepSeek V4 Compresses AI Memory

V4 introduces a hybrid architecture that combines two compression strategies inside the same model. Both work on the principle that not every token deserves equal storage.

Compressed Sparse Attention (CSA)

CSA bundles every m tokens of the KV cache into a single compressed entry, using a small learned module. Then a component called the Lightning Indexer picks only the top-k compressed entries that the current query token needs to look at. Everything else is ignored.

Imagine a research assistant who, instead of re-reading the entire transcript of a meeting for every new question, keeps a one-line summary of every five minutes, and only re-opens the summaries that look relevant.
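
A toy version of the idea might look like the sketch below. This is not DeepSeek's code: a simple mean over each block stands in for the learned compression module, and a plain dot-product score stands in for the Lightning Indexer.

```python
import numpy as np

def compressed_sparse_attention(query, keys, values, m=4, top_k=2):
    """Toy CSA: pool every m tokens into one KV entry, score the pooled
    entries against the query, and attend only to the top-k of them."""
    n, d = keys.shape
    n_blocks = n // m
    # 1. Compress: mean pooling stands in for the small learned module.
    k_blocks = keys[:n_blocks * m].reshape(n_blocks, m, d).mean(axis=1)
    v_blocks = values[:n_blocks * m].reshape(n_blocks, m, d).mean(axis=1)
    # 2. Index: a dot-product score plays the role of the Lightning Indexer.
    scores = k_blocks @ query
    keep = np.argsort(scores)[-top_k:]
    # 3. Attend only over the selected compressed entries.
    weights = np.exp(scores[keep] - scores[keep].max())
    weights /= weights.sum()
    return weights @ v_blocks[keep]

rng = np.random.default_rng(0)
out = compressed_sparse_attention(rng.normal(size=64),
                                  rng.normal(size=(1024, 64)),
                                  rng.normal(size=(1024, 64)))
print(out.shape)  # (64,)
```

The saving in this toy comes from attending to top-k pooled entries instead of every raw token; the real system replaces the mean pooling and dot-product scoring with learned components, as described above.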

Heavily Compressed Attention (HCA)

HCA goes further. It bundles much larger groups of tokens into a single entry, then uses dense attention, attending to all of those compressed entries at once, without the sparse top-k selection step. Lossier, but cheaper.

Both layers also include a sliding window for recent tokens, so the model still has high-fidelity access to what was just said.
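
In the same toy setting, HCA plus the sliding window could be sketched like this, with the block size and window length chosen arbitrarily for illustration:

```python
import numpy as np

def hca_with_window(query, keys, values, block=64, window=128):
    """Toy hybrid: dense attention over heavily pooled old tokens,
    plus full-resolution attention over the most recent `window` tokens."""
    d = keys.shape[1]
    old_k, old_v = keys[:-window], values[:-window]
    n_blocks = len(old_k) // block
    # Heavily compress everything outside the window into a few pooled entries.
    k_pool = old_k[:n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    v_pool = old_v[:n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    # Recent tokens keep full fidelity.
    k_all = np.concatenate([k_pool, keys[-window:]])
    v_all = np.concatenate([v_pool, values[-window:]])
    scores = k_all @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_all

rng = np.random.default_rng(1)
print(hca_with_window(rng.normal(size=64),
                      rng.normal(size=(4096, 64)),
                      rng.normal(size=(4096, 64))).shape)  # (64,)
```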

The combined effect on V4-Pro is striking:

| Metric | V3.2 baseline | V4-Pro | V4-Flash |
|---|---|---|---|
| Single-token inference FLOPs | 100% | 27% | 10% |
| KV cache size | 100% | 10% | 7% |

V4-Flash uses one-tenth the compute and seven-hundredths the memory of the older model at million-token context length; all percentages are measured against the V3.2 baseline.

What Million-Token Context Actually Enables

Numbers do not capture the change. A million tokens is roughly 750,000 words, which is about ten average novels, or a complete enterprise codebase, or every email exchange a team has had over a year.

Today, most "long-document" AI use cases are forced into a workaround called retrieval-augmented generation (RAG): chop the document into chunks, search for relevant chunks at query time, feed only those into the model. RAG works, but it loses cross-chunk context. The model never sees the whole document at once, so it cannot connect chapter three to chapter nineteen.

When the entire document fits, the workflow changes:

  • Code review across whole repositories instead of file-by-file
  • Legal analysis on full case files, not paragraph-level retrieval
  • Medical reasoning over an entire patient history rather than recent visits
  • Long-form creative work where the AI actually remembers the protagonist's arc on page 400

This is not a quantitative upgrade. It is a qualitative one.

How Does V4 Stack Up Against the Frontier?

V4 is not strictly the best at everything, but it is competitive at a fraction of the cost.

| Benchmark | V4-Pro-Max | Top US Model |
|---|---|---|
| Codeforces (coding) | 3206 | 3168 (GPT-5.4-xHigh) |
| MRCR 1M (long-context recall) | 83.5 | 76.3 (Gemini-3.1-Pro-High) |
| CorpusQA 1M (long-context QA) | 62.0 | 53.8 (Gemini-3.1-Pro-High) |
| MRCR (short context) | 83.5 | 92.9 (Claude Opus 4.6 Max) |

DeepSeek's claim is not that V4 dominates. It is that V4 delivers comparable or better performance at long context for a small fraction of the inference cost, and it ships open-source weights, available on Hugging Face the day of release.

Why "Compression" Is the New Frontier

Step back from the specific architecture. The deeper pattern is this: for the past decade, AI progress has been driven mostly by scaling (bigger models, more data, more compute). That worked, but it is hitting energy and economic walls. Training-grade GPUs are scarce. Data-center power demand is straining grids. The marginal return on adding another trillion parameters is shrinking.

So the frontier is shifting. Instead of "make it bigger," the question becomes "make the same capability cost less." That is a compression problem. It is what neuro-symbolic methods do for reasoning. It is what mixture-of-experts does for parameters. It is what V4's hybrid attention does for context.

This pattern has economic consequences. When a capability gets dramatically cheaper, total usage tends to grow rather than shrink, a phenomenon called Jevons paradox. Cheaper long-context AI does not mean less AI compute is sold; it means more people use long-context AI for things that were never worth doing before.

Why Open-Source Releases Like This Reset the Game

DeepSeek made V4-Pro and V4-Flash openly available, with V4-Flash priced identically to its 2024 V2 model, roughly the floor of cutting-edge pricing. Within hours of release, Huawei pledged hardware support, and the open-source ecosystem (vLLM, SGLang, NVIDIA Blackwell endpoints) absorbed the model.

The lesson here is not geopolitical. It is structural. Open-weight releases at the frontier compress the time between "only one company can do this" and "everyone can." A capability that was a competitive moat for a US lab in early 2026 became a commodity feature within days of V4's release. This will keep happening.

For users, the implication is that strategy built on "Lab X has the best model" decays fast. Strategy built on knowing what these models can structurally do, and what they structurally cannot, keeps its value.

Common Questions About AI Context Windows

Does a bigger context window always make AI smarter?
No. Context size and reasoning quality are separate dimensions. A model can have a million-token window and still struggle with three-step logic, just as a model with an 8,000-token window can be sharp but forgetful. What context buys you is the option to use more information at once. Whether the model uses it well depends on training, not size.

Will million-token context make RAG obsolete?
Not entirely. RAG remains useful for two reasons: cost (sending a million tokens is still expensive even when the model can handle it) and freshness (RAG can pull live, updated information that the model has never seen). What changes is the default. For static, bounded documents, such as a contract, a codebase snapshot, or a research paper, feeding the whole thing in is now often better than chunking it.

Why is the KV cache the bottleneck rather than the model size?
Because the KV cache grows with conversation length, while model size is fixed once trained. A 70-billion-parameter model running on a single GPU might handle 32,000 tokens fine and run out of memory at 128,000 tokens, even though the model itself never changed. The cache is what scales with use.

What This Means for You

If you use AI day-to-day, three concrete things change as million-token context becomes standard:

  1. Stop chunking. Workflows where you feed AI small pieces of a document because you "had to" will be obsolete. Feed the whole thing and ask the synthesizing question.
  2. Test recall, not just intelligence. A model with a million-token context that loses information at position 700,000 is worse than a model with 200,000 tokens that remembers all of them. The MRCR-style "needle in a haystack" benchmarks matter more than raw IQ-style scores; a do-it-yourself version is sketched after this list.
  3. Watch the cost curve, not the leaderboard. When the same task drops 90% in cost, your cost-benefit math changes, even if the "best" model has not.
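
A recall test of that kind is simple to set up yourself. The sketch below plants one fact at a chosen depth in a pile of filler text; the filler sentences, the access-code "needle", and the send-to-model step are placeholders to adapt to your own stack.

```python
import random

def needle_in_haystack_prompt(fillers: list[str], needle: str,
                              total_sentences: int = 5_000, depth: float = 0.7) -> str:
    """Build a long prompt with one planted fact at a chosen relative depth."""
    haystack = [random.choice(fillers) for _ in range(total_sentences)]
    haystack.insert(int(total_sentences * depth), needle)
    return " ".join(haystack)

prompt = needle_in_haystack_prompt(
    fillers=["The weather report mentioned light rain.",
             "The committee adjourned at noon without objections."],
    needle="The access code for the archive is 7421.",
    depth=0.7,  # plant the fact 70% of the way into the context
)
question = "What is the access code for the archive?"
# Send prompt + question to the model under test and check whether 7421 comes back,
# then repeat at different depths and context lengths to map where recall degrades.
```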

Conclusion

The DeepSeek V4 release is not really about DeepSeek. It is a clean illustration of where AI is going: not infinitely larger, but radically more efficient at remembering. The attention mechanism that gave us modern AI is the same mechanism whose limits define what AI cannot yet do, and the labs solving those limits, rather than chasing benchmark records, are the ones reshaping what is possible.

A million tokens of context is not the end. But it is the moment AI stopped being something you handed snippets to, and started being something you could hand whole worlds to.

