DeepSeek V3 / R1: Latent Attention, DeepSeekMoE, and RL Reasoning

What makes it distinctive

DeepSeek's bet: a giant brain that's cheap to run, and reasoning you can train into the model instead of writing it by hand.

Two ideas carry the whole family. Multi-head Latent Attention (MLA) squeezes the memory each token leaves behind (its KV cache, the saved keys and values) into one tiny shared vector — slashing the biggest cost of long context. DeepSeekMoE swaps the single feed-forward block for hundreds of small, specialized experts plus one always-on shared expert, so a 671B-parameter model only fires ~37B per token. Then R1 learned to reason via reinforcement learning — and in R1-Zero, step-by-step reasoning emerged from pure RL with no human-written examples first.

New here?

This page assumes the shared transformer recipe. For the common bricks (RoPE, RMSNorm, SwiGLU, MoE), start with "How Open-Source LLMs Are Built", or the basics in "What is an LLM?"

The family so far

DeepSeek-AI iterated fast: a big MoE base model, then a reasoning model built on top of it, then small distilled versions anyone can run.

Timeline: V2 → V3 → R1 → V4

V2 (May 2024) introduced MLA and combined it with DeepSeekMoE (from DeepSeek's Jan 2024 MoE paper). V3 (Dec 2024) scaled it up. R1 & R1-Zero (Jan 2025) added reasoning via RL; R1-0528 refreshed it (May 2025). V4 (preview, Apr 2026) folded reasoning in as a switchable mode — V4-Pro (1.6T total) and V4-Flash (284B) with 1M context, still MIT; the long-rumoured R2 never shipped.

Sizes: 671B total / 37B active

V3 & R1 are sparse MoE: 671B total parameters, only ~37B run per token. V2 was 236B / 21B active.
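The sparsity claim comes down to simple arithmetic. The sketch below uses the parameter counts quoted above; the bytes-per-parameter values are generic precision sizes assumed for illustration, and the estimate covers weights only (no KV cache, no activations).

```python
# Back-of-the-envelope numbers for a sparse MoE model.
TOTAL_PARAMS = 671e9   # total parameters (from the text)
ACTIVE_PARAMS = 37e9   # approximate parameters used per token (from the text)

# Generic storage sizes; which precision a deployment uses is an assumption.
BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}


def active_fraction(total: float, active: float) -> float:
    """Share of the parameters that actually run for one token."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


print(f"active fraction: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.1%}")
for name, size in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_memory_gb(TOTAL_PARAMS, size):,.0f} GB of weights")
```

Only a small share of the parameters does work for any one token, yet every expert still has to sit in memory. That is why compute per token is modest while hosting stays heavy.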

Small versions: R1-Distill 1.5B–70B

Dense (non-MoE) models that learned to imitate R1's reasoning, built on Qwen2.5 and Llama3 backbones. Easy to run locally.

Context & license: 128K · MIT

128K-token context (V2/V3/R1), reached via YaRN scaling (4K → 32K → 128K). V3 weights and R1/R1-Distill are MIT-licensed.

Signature #1 — Multi-head Latent Attention (MLA)

Normal attention is memory-hungry. Every token has to remember its keys and values (the K and V vectors) for every attention head, for every layer, so later tokens can look back at it. With 128 heads and long context, that saved KV cache dominates the memory bill.

MLA's trick: don't store the full keys and values. Instead, compress them into one small latent vector c per token (in V3, the KV side squeezes down to dimension 512), then reconstruct K and V from c on demand when attention runs. Because a rotated key cannot be folded into that compressed form, a small decoupled-RoPE key (64 dims, shared across heads) carries the position signal separately and is cached alongside c. Back in V2, this cut the KV cache by about 93% versus DeepSeek's earlier 67B dense model.
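The compress-then-reconstruct step can be sketched in a few lines. This is a toy illustration, not DeepSeek's implementation: the sizes, the random weights, and the names `W_down`, `W_up_k`, and `W_up_v` are all hypothetical, and the decoupled-RoPE key is left out to keep the example short.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (hypothetical; the text cites a 512-dim latent and 128 heads for V3).
D_MODEL, D_LATENT, N_HEADS, HEAD_DIM = 16, 4, 2, 8

rng = random.Random(0)
W_down = random_matrix(D_LATENT, D_MODEL, rng)             # hidden state -> latent c
W_up_k = random_matrix(N_HEADS * HEAD_DIM, D_LATENT, rng)  # latent c -> all keys
W_up_v = random_matrix(N_HEADS * HEAD_DIM, D_LATENT, rng)  # latent c -> all values


def compress(hidden):
    """What gets cached for a token: one small latent vector."""
    return matvec(W_down, hidden)


def split_heads(flat):
    """Cut a flat vector into one slice per attention head."""
    return [flat[h * HEAD_DIM:(h + 1) * HEAD_DIM] for h in range(N_HEADS)]


def reconstruct(latent):
    """Rebuild per-head keys and values from the cached latent."""
    return split_heads(matvec(W_up_k, latent)), split_heads(matvec(W_up_v, latent))


hidden = [rng.uniform(-1, 1) for _ in range(D_MODEL)]
latent = compress(hidden)            # this is all the cache keeps
keys, values = reconstruct(latent)   # recomputed when attention runs
print(len(latent), "cached floats instead of", 2 * N_HEADS * HEAD_DIM)
```

Every head still gets its own key and value; only the stored representation shrinks.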

Why it matters

Less KV memory means you can hold far longer context, serve more users on the same GPU, and decode faster — without dropping to fewer attention heads (which would hurt quality). MLA keeps all 128 heads and shrinks the cache.
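To see the scale of the saving, here is a rough per-token cache count. The 128 heads, 512-dim latent, and 64-dim RoPE key come from the text; the layer count (61) and per-head dimension (128) are assumptions taken from the published V3 configuration and should be treated as illustrative. The baseline is plain multi-head attention with the same head count, which is not the baseline behind the 93% figure quoted above, so the two ratios differ.

```python
def kv_elements_mha(n_layers: int, n_heads: int, head_dim: int) -> int:
    """Cached numbers per token for standard multi-head attention (K and V)."""
    return 2 * n_heads * head_dim * n_layers


def kv_elements_mla(n_layers: int, latent_dim: int, rope_dim: int) -> int:
    """Cached numbers per token for MLA: one latent plus one shared RoPE key."""
    return (latent_dim + rope_dim) * n_layers


N_LAYERS, N_HEADS, HEAD_DIM = 61, 128, 128   # layer count and head_dim are assumptions
LATENT_DIM, ROPE_DIM = 512, 64               # from the text

mha = kv_elements_mha(N_LAYERS, N_HEADS, HEAD_DIM)
mla = kv_elements_mla(N_LAYERS, LATENT_DIM, ROPE_DIM)

print(f"MHA: {mha:,} cached values per token")
print(f"MLA: {mla:,} cached values per token")
print(f"MLA keeps {mla / mha:.2%} of the MHA cache")
```

Multiply either count by the context length and the bytes per value to get a memory estimate for one sequence.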

Signature #2 — DeepSeekMoE: fine-grained + a shared expert

A Mixture-of-Experts layer replaces one big feed-forward block with many small ones and a router that picks a few per token. DeepSeek pushes this two ways. First, fine-grained: split into lots of small experts (V3 has 256 routed experts per MoE layer, with the top 8 chosen per token) so each can specialize narrowly. Second, a single shared expert that every token always uses — it absorbs the common, general knowledge so the routed experts don't waste capacity re-learning it.

One token's trip through a DeepSeekMoE layer runs like this: the router scores all experts, the top 8 are activated, the shared expert always runs, and the outputs are blended into a single vector.
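The trip described above can be sketched as a toy layer. Everything here is a stand-in: the sizes are small, each "expert" merely scales its input (real experts are SwiGLU feed-forward blocks), and the sigmoid scoring with normalisation over the selected experts is an assumption about how the gate weights are formed. The residual connection is omitted.

```python
import math
import random

N_EXPERTS, TOP_K, DIM = 16, 4, 8   # toy sizes; the text cites 256 experts, top 8 for V3

rng = random.Random(0)
router_weights = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_EXPERTS)]
expert_scales = [rng.uniform(0.5, 1.5) for _ in range(N_EXPERTS)]  # stand-in experts
SHARED_SCALE = 1.0                                                  # stand-in shared expert


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def route(token):
    """Score every expert, keep the top-k, and normalise their gate weights."""
    scores = [sigmoid(sum(w * x for w, x in zip(row, token))) for row in router_weights]
    chosen = sorted(range(N_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


def moe_layer(token):
    """Blend the always-on shared expert with the gated top-k routed experts."""
    gates = route(token)
    out = [SHARED_SCALE * x for x in token]        # shared expert: every token
    for i, gate in gates.items():                  # routed experts: top-k only
        out = [o + gate * expert_scales[i] * x for o, x in zip(out, token)]
    return out, gates


token = [rng.uniform(-1, 1) for _ in range(DIM)]
output, gates = moe_layer(token)
print("experts used:", sorted(gates))
print("gate weights sum to", round(sum(gates.values()), 6))
```

The other experts are never evaluated for this token, which is where the compute saving comes from.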

Balancing without the usual tax

Most MoE models add an auxiliary loss to stop the router from over-using a few favorite experts, but that nudge can hurt quality. DeepSeek uses auxiliary-loss-free balancing instead: each expert gets a small bias that is added to its routing score only when the top experts are chosen, and that bias is nudged down when the expert is overloaded and up when it is underused. Load evens out without a quality-hurting penalty in the training loss.
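A small simulation shows the bias mechanism at work. It is a sketch under stated assumptions: the skewed `POPULARITY` scores stand in for a router that favors two experts, and the sizes, noise range, and step size `GAMMA` are hypothetical values chosen so the effect is easy to see.

```python
import random

N_EXPERTS, TOP_K, GAMMA = 8, 2, 0.01   # toy sizes and update step (hypothetical)
# A router that likes experts 0 and 1 too much: a stand-in for learned affinities.
POPULARITY = [0.9, 0.8, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]


def batch_load(bias, rng, n_tokens=256):
    """Route a batch and count how many tokens each expert receives."""
    load = [0] * N_EXPERTS
    for _ in range(n_tokens):
        scores = [p + rng.uniform(-0.1, 0.1) for p in POPULARITY]
        # The bias changes which experts are selected, not the scores themselves.
        ranked = sorted(range(N_EXPERTS), key=lambda i: scores[i] + bias[i], reverse=True)
        for i in ranked[:TOP_K]:
            load[i] += 1
    return load


def update_bias(bias, load):
    """Nudge overloaded experts down and the rest up by a fixed step."""
    mean_load = sum(load) / len(load)
    return [b - GAMMA if n > mean_load else b + GAMMA for b, n in zip(bias, load)]


rng = random.Random(0)
bias = [0.0] * N_EXPERTS
load_before = batch_load(bias, rng)
for _ in range(200):
    bias = update_bias(bias, batch_load(bias, rng))
load_after = batch_load(bias, rng)

print("before:", load_before)
print("after: ", load_after)
```

No gradient flows through the bias: it is adjusted from load counts alone, so the training loss carries no balancing term.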

Bonus — how R1 learned to reason

R1 started from the V3 base and was trained with reinforcement learning: let the model generate an answer, reward it when the answer is correct (math and code can be checked automatically), and update the model so that rewarded behavior becomes more likely. The striking part: in R1-Zero, long chains of reasoning emerged with no supervised examples first. The whole idea is one loop: generate, check, reward, update.

Repeat millions of times and reasoning behavior (checking its own work, trying again) grows on its own.
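The reward step of that loop can be sketched as follows. The R1 paper describes using GRPO, which scores each sampled answer relative to the other answers drawn for the same prompt; the code shows only that reward-and-advantage calculation. The sampler `sample_answers` is a hypothetical stand-in for the model, and the policy update itself is not shown.

```python
import random
import statistics


def check_answer(answer: int, expected: int) -> float:
    """Rule-based reward: 1.0 if the final answer is right, else 0.0."""
    return 1.0 if answer == expected else 0.0


def sample_answers(expected: int, group_size: int, rng: random.Random) -> list:
    """Hypothetical stand-in for the model: right about half the time."""
    answers = []
    for _ in range(group_size):
        wrong = expected + rng.randint(1, 9)
        answers.append(expected if rng.random() < 0.5 else wrong)
    return answers


def group_advantages(rewards: list) -> list:
    """Score each sample relative to its group: (reward - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:   # every sample equally good or bad: no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


rng = random.Random(0)
answers = sample_answers(expected=42, group_size=8, rng=rng)
rewards = [check_answer(a, 42) for a in answers]
advantages = group_advantages(rewards)

for a, r, adv in zip(answers, rewards, advantages):
    print(f"answer={a:3d} reward={r:.0f} advantage={adv:+.2f}")
```

Answers with a positive advantage are made more likely by the update and the rest less likely, with no human-written reasoning traces involved.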

vs the shared recipe

DeepSeek keeps the standard transformer skeleton and swaps two big bricks. Here's what stays familiar and what's distinctly DeepSeek.

Keeps the standard bricks

Decoder-only transformer, RMSNorm pre-norm.
SwiGLU feed-forward inside each expert.
RoPE rotary positions; long context via YaRN.
~128K-token byte-level BPE vocabulary.
Sparse MoE (top-k routing) like Mixtral, Qwen3-MoE.

Changes & trade-offs

MLA instead of plain MHA/GQA — tiny KV cache, but a more involved attention block.
Shared expert always on, unlike Qwen3-MoE, which uses fine-grained routed experts with no shared one.
Aux-loss-free balancing via a dynamically adjusted bias, not a balancing penalty.
Multi-Token Prediction training (predict more than one next token) for sharper learning.
671B total means heavy to host — the full model needs serious hardware (distills are the lightweight path).
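For the Multi-Token Prediction item above, this minimal sketch shows how training targets line up when each position predicts more than one future token. It illustrates only the target layout; the helper `mtp_targets` and the choice of two future tokens are hypothetical, and the extra prediction modules a real model would use are not shown.

```python
def mtp_targets(tokens, n_future=2):
    """Pair each position with the next n_future tokens it is trained to predict."""
    last = len(tokens) - n_future
    return [(tokens[i], tokens[i + 1:i + 1 + n_future]) for i in range(last)]


sentence = ["The", "cat", "sat", "on", "the", "mat"]
for current, targets in mtp_targets(sentence):
    print(f"{current!r} -> {targets}")
```

With `n_future=1` this reduces to ordinary next-token prediction.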

Gotchas / good to know

Before you reach for it

V3/R1 are huge. All 671B parameters must be loaded even though only about 37B run per token, so you need a lot of GPU memory just to hold the weights. For local use, pick an R1-Distill (1.5B–70B) instead; the smaller ones fit on a laptop.
Distills are dense, not MoE. R1-Distill models are ordinary Qwen2.5/Llama3 backbones taught to imitate R1's reasoning — they don't have MLA or DeepSeekMoE inside.
R1 "thinks out loud." Reasoning models emit a long chain of thought before the final answer: great for hard problems, but slower and more token-hungry on simple ones.
MLA is non-standard. Latent attention needs framework support; not every inference engine implements it as efficiently as plain attention.
MTP is a training objective. Multi-Token Prediction sharpens learning, and the extra predictions can also speed up generation through speculative decoding, but whether you get that speedup at inference depends on your serving setup.
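Because R1 "thinks out loud", applications usually want to separate the reasoning from the final answer. The sketch below assumes the reasoning arrives wrapped in `<think>` and `</think>` tags within a single string; some serving stacks return it in a separate field instead, so check what yours produces. The helper name `split_reasoning` is hypothetical.

```python
def split_reasoning(text: str, open_tag: str = "<think>", close_tag: str = "</think>"):
    """Return (reasoning, answer); reasoning is '' when no closing tag is found."""
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


raw = "<think>2 + 2 is 4, and doubling gives 8.</think>The answer is 8."
reasoning, answer = split_reasoning(raw)
print("reasoning:", reasoning)
print("answer:   ", answer)
```

Showing users only the answer keeps simple replies short, while the reasoning stays available for logging or debugging.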