DeepSeek (V3 / R1)
What makes it distinctive
DeepSeek's bet: a giant brain that's cheap to run, and reasoning you can train into the model instead of writing it by hand.
Two ideas carry the whole family. Multi-head Latent Attention (MLA) squeezes the memory each token leaves behind (its KV cache, the saved keys and values) into one small shared latent vector, slashing the biggest cost of long context. DeepSeekMoE swaps the single feed-forward block for hundreds of small, specialized experts plus one always-on shared expert, so a 671B-parameter model only fires ~37B per token. Then R1-Zero took the V3 base and used pure reinforcement learning (reward good answers, no human-written examples first), and step-by-step reasoning emerged on its own. R1 follows the same recipe but adds a small supervised warm-up before the RL stage.
This page assumes the shared transformer recipe. For the common bricks — RoPE, RMSNorm, SwiGLU, MoE — start with How Open-Source LLMs Are Built, or the basics in What is an LLM?
The family so far
DeepSeek-AI iterated fast: a big MoE base model, then a reasoning model built on top of it, then small distilled versions anyone can run.
V2 (May 2024) introduced MLA + DeepSeekMoE. V3 (Dec 2024) scaled it up. R1 & R1-Zero (Jan 2025) added reasoning via RL; R1-0528 refreshed it (May 2025). V4 (preview, Apr 2026) folded reasoning in as a switchable mode — V4-Pro (1.6T total) and V4-Flash (284B) with 1M context, still MIT; the long-rumoured R2 never shipped.
V3 & R1 are sparse MoE: 671B total parameters, only ~37B run per token. V2 was 236B / 21B active.
R1-Distill: dense (non-MoE) models trained to imitate R1's reasoning, built on Qwen2.5 and Llama3 backbones, from 1.5B to 70B parameters. Easy to run locally.
128K-token context (V2/V3/R1), reached via YaRN scaling (4K → 32K → 128K). V3 weights and R1/R1-Distill are MIT-licensed.
Signature #1 — Multi-head Latent Attention (MLA)
Normal attention is memory-hungry. Every token has to remember its keys and values (the K and V vectors) for every attention head, for every layer, so later tokens can look back at it. With 128 heads and long context, that saved KV cache dominates the memory bill.
MLA's trick: don't store the full keys and values. Instead, compress them into one small latent vector c per token (in V3, the KV side squeezes down to dimension 512), then reconstruct K and V on demand when attention runs. A small decoupled RoPE key (64 dims, shared across heads) carries the position signal separately and is cached alongside c. Back in V2, this cut the KV cache by about 93% versus DeepSeek's earlier 67B dense model.
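The compress-then-reconstruct idea can be sketched in a few lines of NumPy. This is a toy illustration, not DeepSeek's implementation: the sizes are shrunk, the projection matrices (`W_down`, `W_up_k`, `W_up_v`) are hypothetical names with random values, and the decoupled RoPE piece is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 1024   # toy hidden size (assumption; real models are far wider)
n_heads = 8      # toy head count (the models in this article use 128)
d_head = 64      # toy per-head dimension (assumption)
d_latent = 128   # toy latent size (V3 uses 512 on the KV side)

# Hypothetical projection matrices, randomly initialised for illustration.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)

def compress(h):
    """Project hidden states (tokens, d_model) down to latents (tokens, d_latent)."""
    return h @ W_down

def reconstruct(c):
    """Expand cached latents back into per-head keys and values."""
    tokens = c.shape[0]
    k = (c @ W_up_k).reshape(tokens, n_heads, d_head)
    v = (c @ W_up_v).reshape(tokens, n_heads, d_head)
    return k, v

hidden = rng.standard_normal((10, d_model))   # 10 tokens
latent_cache = compress(hidden)               # this is all that gets stored
keys, values = reconstruct(latent_cache)      # rebuilt when attention runs

full_cache_floats = keys.size + values.size
print(latent_cache.shape, keys.shape, values.shape)
print(f"stored {latent_cache.size} floats instead of {full_cache_floats}")
```

The point of the sketch is only the shape bookkeeping: the cache holds `latent_cache`, while the much larger `keys` and `values` exist only transiently.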
Less KV memory means you can hold far longer context, serve more users on the same GPU, and decode faster — without dropping to fewer attention heads (which would hurt quality). MLA keeps all 128 heads and shrinks the cache.
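A rough back-of-the-envelope calculation shows why the cache shrinks so much. The sketch below assumes a per-head width of 128 (not stated above) and assumes the 64-dim RoPE piece is stored once per token. It compares MLA against a hypothetical full multi-head cache with the same 128 heads, which is a different baseline from the one behind the V2 figure of about 93%. The helper names and the layer count in the usage line are illustrative.

```python
def mha_cache_values_per_token(n_heads, head_dim):
    # Standard attention stores one key and one value vector per head.
    return 2 * n_heads * head_dim

def mla_cache_values_per_token(latent_dim, rope_dim):
    # MLA stores the shared latent plus the small decoupled RoPE key.
    return latent_dim + rope_dim

def tokens_that_fit(budget_bytes, values_per_token, n_layers, bytes_per_value=2):
    # How many tokens of cache fit in a fixed memory budget.
    return budget_bytes // (values_per_token * n_layers * bytes_per_value)

N_HEADS = 128      # from the text
HEAD_DIM = 128     # assumption: per-head width
LATENT_DIM = 512   # from the text (V3 KV latent)
ROPE_DIM = 64      # from the text (decoupled RoPE piece)

mha = mha_cache_values_per_token(N_HEADS, HEAD_DIM)
mla = mla_cache_values_per_token(LATENT_DIM, ROPE_DIM)
saving = 1 - mla / mha
print(mha, mla, f"{saving:.1%}")   # 32768 576 98.2%

# Hypothetical serving budget: 10 GiB of cache, 60 layers, 2 bytes per value.
budget = 10 * 2**30
print(tokens_that_fit(budget, mha, n_layers=60), "tokens with full MHA")
print(tokens_that_fit(budget, mla, n_layers=60), "tokens with MLA")
```

Under these assumptions the same memory budget holds dozens of times more tokens, which is where the longer context and higher serving throughput come from.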
Signature #2 — DeepSeekMoE: fine-grained + a shared expert
A Mixture-of-Experts layer replaces one big feed-forward block with many small ones and a router that picks a few per token. DeepSeek pushes this two ways. First, fine-grained: split into lots of small experts (V3 has 256 routed experts per MoE layer, with the top 8 chosen per token) so each can specialize narrowly. Second, a single shared expert that every token always uses — it absorbs the common, general knowledge so the routed experts don't waste capacity re-learning it.
One token's trip through a DeepSeekMoE layer: the router scores all routed experts, the top 8 are activated, the shared expert always runs, and the routed outputs are weighted by their router scores and added to the shared expert's output.
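That trip can be written out as a small NumPy sketch. It is a simplified illustration under stated assumptions: each "expert" is a single random matrix rather than a SwiGLU block, the router uses sigmoid scores normalised over the chosen experts (an assumption here), and the residual connection is omitted. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16           # toy hidden size (assumption)
N_ROUTED = 256   # routed experts per layer, as in the text
TOP_K = 8        # experts chosen per token, as in the text

# Each toy "expert" is one matrix; real experts are SwiGLU feed-forward blocks.
routed_experts = rng.standard_normal((N_ROUTED, D, D)) / np.sqrt(D)
shared_expert = rng.standard_normal((D, D)) / np.sqrt(D)
router_weights = rng.standard_normal((D, N_ROUTED)) / np.sqrt(D)

def moe_layer(x):
    """Route one token vector x of shape (D,) through the toy layer."""
    # 1. The router scores every routed expert (sigmoid scoring is an assumption).
    scores = 1.0 / (1.0 + np.exp(-(x @ router_weights)))
    # 2. Keep the TOP_K highest-scoring experts.
    chosen = np.argsort(scores)[-TOP_K:]
    # 3. Normalise the chosen scores so the gate weights sum to 1.
    gates = scores[chosen] / scores[chosen].sum()
    # 4. Blend: shared expert output plus the gated sum of the chosen experts.
    routed_out = sum(g * (x @ routed_experts[i]) for g, i in zip(gates, chosen))
    return x @ shared_expert + routed_out, chosen, gates

token = rng.standard_normal(D)
out, chosen, gates = moe_layer(token)
print(out.shape, sorted(chosen.tolist()), round(float(gates.sum()), 6))
```

Only 8 of the 256 routed matrices are touched for this token, which is the mechanism behind the gap between total and active parameters.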
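Most MoE models add an auxiliary loss to stop the router from over-using a few favorite experts, but that nudge can hurt quality. DeepSeek-V3 uses auxiliary-loss-free balancing: each expert gets a small bias that is added to its router score only when choosing the top 8. After each training step the bias is nudged down for overloaded experts and up for underloaded ones, so the load evens out without relying on a large quality-hurting penalty.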
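A toy simulation makes the balancing rule concrete. This is a sketch under simplifying assumptions, not the training code: router scores are random numbers with a built-in skew toward some experts, and the step size `STEP`, the sizes, and the function names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS, TOP_K, N_TOKENS = 16, 2, 512   # toy sizes (assumptions)
STEP = 0.01                                # hypothetical bias update step

# Skewed base scores: some experts start out as router "favorites".
favoritism = np.linspace(0.0, 1.0, N_EXPERTS)

def route(bias):
    """Return how many tokens each expert receives for one random batch."""
    scores = rng.random((N_TOKENS, N_EXPERTS)) + favoritism
    # The bias affects only which experts are selected; blending weights
    # would still come from the unbiased scores.
    chosen = np.argsort(scores + bias, axis=1)[:, -TOP_K:]
    return np.bincount(chosen.ravel(), minlength=N_EXPERTS)

def update_bias(bias, load):
    """Nudge overloaded experts down and underloaded experts up."""
    return bias - STEP * np.sign(load - load.mean())

bias = np.zeros(N_EXPERTS)
load_before = route(bias)
for _ in range(300):
    bias = update_bias(bias, route(bias))
load_after = route(bias)

print("max/min load before:", load_before.max(), load_before.min())
print("max/min load after: ", load_after.max(), load_after.min())
```

Because the correction happens through the selection bias rather than through the loss, the gradient that trains the experts is left untouched.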
Bonus — how R1 learned to reason
R1 started from the V3 base and was trained with reinforcement learning: let the model generate an answer, reward it when the answer is correct (math/code can be checked automatically), and update so good behavior gets more likely. The striking part: in R1-Zero, long chains of reasoning emerged with no supervised examples first. The whole idea is that three-step loop: generate, score, update.
Repeat millions of times and reasoning behavior (checking its own work, trying again) grows on its own.
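The generate, score, update loop can be shown on a deliberately tiny problem. In this sketch the "model" is just one logit per candidate answer to a single arithmetic prompt, and rewards are compared within each sampled group, loosely in the spirit of the group-relative method described in DeepSeek's R1 report. The group size, learning rate, and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

CANDIDATES = ["4", "5", "6", "7"]   # toy answer space for the prompt "2 + 3 = ?"
CORRECT = "5"
GROUP_SIZE = 8                       # answers sampled per update (assumption)
LEARNING_RATE = 0.5                  # hypothetical step size

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reward(answer):
    """Automatic checker: 1 for a correct answer, 0 otherwise."""
    return 1.0 if answer == CORRECT else 0.0

logits = np.zeros(len(CANDIDATES))   # the toy "policy": one logit per answer
p_start = softmax(logits)[CANDIDATES.index(CORRECT)]

for _ in range(200):
    probs = softmax(logits)
    # 1. Generate: sample a group of answers from the current policy.
    picks = rng.choice(len(CANDIDATES), size=GROUP_SIZE, p=probs)
    # 2. Score: check each answer automatically.
    rewards = np.array([reward(CANDIDATES[i]) for i in picks])
    # 3. Compare within the group: better-than-average answers get positive weight.
    advantages = rewards - rewards.mean()
    if rewards.std() > 0:
        advantages = advantages / rewards.std()
    # 4. Update: raise the log-probability of answers with positive advantage.
    grad = np.zeros_like(logits)
    for i, adv in zip(picks, advantages):
        one_hot = np.eye(len(CANDIDATES))[i]
        grad += adv * (one_hot - probs)   # gradient of log p(i) w.r.t. the logits
    logits += LEARNING_RATE * grad / GROUP_SIZE

p_end = softmax(logits)[CANDIDATES.index(CORRECT)]
print(f"P(correct) before: {p_start:.2f}, after: {p_end:.2f}")
```

A real run replaces the four logits with a full language model and the single prompt with many checkable math and code problems, but the feedback signal has the same shape.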
vs the shared recipe
DeepSeek keeps the standard transformer skeleton and swaps two big bricks. Here's what stays familiar and what's distinctly DeepSeek.
Stays familiar:
- Decoder-only transformer, RMSNorm pre-norm.
- SwiGLU feed-forward inside each expert.
- RoPE rotary positions; long context via YaRN.
- ~128K-token byte-level BPE vocabulary.
- Sparse MoE (top-k routing) like Mixtral, Qwen3-MoE.
Distinctly DeepSeek:
- MLA instead of plain MHA/GQA: tiny KV cache, but a more involved attention block.
- Shared expert always on, unlike Qwen3-MoE, which uses routed experts only.
- Aux-loss-free balancing via a per-expert routing bias, not a balancing penalty.
- Multi-Token Prediction training (predict more than one next token) for sharper learning.
- 671B total means heavy to host: the full model needs serious hardware (distills are the lightweight path).
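To picture the Multi-Token Prediction item in the list above, the sketch below builds the training targets for a prediction depth of 2: each position is paired with the next two tokens instead of just one. It shows target construction only. The function name and `depth` parameter are hypothetical, and the extra prediction modules that would consume these targets are not shown.

```python
def mtp_targets(tokens, depth=2):
    """Pair each position with the next `depth` tokens it should predict.

    Positions too close to the end to have all `depth` targets are dropped.
    """
    usable = len(tokens) - depth
    return [
        (tokens[i], tuple(tokens[i + 1 : i + 1 + depth]))
        for i in range(usable)
    ]

sequence = ["The", "cat", "sat", "on", "the", "mat"]
for current, targets in mtp_targets(sequence):
    print(f"{current!r} -> {targets}")
```

With `depth=1` this reduces to ordinary next-token prediction, which is a quick way to see that the extra targets are an add-on to the usual objective.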
Gotchas / good to know
- V3/R1 are huge. 671B total parameters — even at 37B active, you need lots of GPU memory to load the weights. For a laptop, use an R1-Distill (1.5B–70B) instead.
- Distills are dense, not MoE. R1-Distill models are ordinary Qwen2.5/Llama3 backbones taught to imitate R1's reasoning — they don't have MLA or DeepSeekMoE inside.
- R1 "thinks out loud." Reasoning models emit a long chain of thought before the final answer — great for hard problems, but slower and more tokens for simple ones.
- MLA is non-standard. Latent attention needs framework support; not every inference engine implements it as efficiently as plain attention.
- MTP is a training objective. Multi-Token Prediction sharpens learning and can speed generation, but using it for speedups at inference depends on your serving setup.
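The first gotcha is easy to check with arithmetic. The sketch below estimates memory for the weights alone at a few common storage precisions. The precision table and function name are illustrative, and the figures ignore the KV cache, activations, and runtime overhead.

```python
TOTAL_PARAMS = 671e9    # from the text
ACTIVE_PARAMS = 37e9    # from the text

# Bytes per parameter for a few common storage formats.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "4-bit": 0.5}

def weight_gigabytes(n_params, fmt):
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9

for fmt in BYTES_PER_PARAM:
    print(f"{fmt:>5}: {weight_gigabytes(TOTAL_PARAMS, fmt):,.0f} GB to load all experts")

print(f"active share per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

Every expert has to be resident because the router may pick any of them for the next token, so the small active share lowers compute per token but not the memory needed to load the model.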
Related
- How Open-Source LLMs Are Built: The shared recipe, the bricks every model on this hub reuses.
- Qwen: Also fine-grained MoE, but no shared expert. A clean contrast to DeepSeekMoE.
- gpt-oss: Another MoE family with its own attention tricks (attention sinks).
- What is an LLM?: Start from zero: tokens, prediction, and how these models learn.