Phi (Microsoft): Small Models, Textbook-Quality Data

What makes it distinctive

Most LLM families chase a clever new block in the network. Phi does the opposite: the transformer is a plain Llama-style design, and the whole bet is on what it trains on — textbook-quality data instead of raw internet sludge.

Microsoft Research's slogan for the line, "Textbooks Are All You Need," is the title of the Phi-1 paper (a play on the famous "Attention Is All You Need"). The team heavily filters web text down to genuinely educational pages, adds large amounts of GPT-generated synthetic "practice problems," and trains a small model on that curated mix. The result is compact models (a few billion parameters) that punch far above their weight. The architecture is deliberately boring so the data can be the star.

New here?

Every family on this site reuses the same handful of bricks (attention, RoPE, SwiGLU, RMSNorm). Read "How Open-Source LLMs Are Built" first for the shared recipe, then come back. Phi mostly reuses that recipe and changes the data.

The family so far

Phi grew from a 1.3B experiment into a small lineup of "small but strong" models. The sizes stay modest on purpose.

Timeline: 2023 → 2026

2023: Phi-1 and Phi-1.5 (1.3B), Phi-2 (2.7B).
2024: Phi-3 family (mini 3.8B, small 7B, medium 14B); Phi-3.5 (Aug 2024); Phi-4 (14B, Dec 2024).
2025: Phi-4-mini (~3.8B) and Phi-4 reasoning variants.
2026: Phi-4-reasoning-vision (15B, Mar 2026), which decides on its own when to think before answering.

Sizes: 1.3B – 15B

Tiny by LLM standards. Almost all are dense (every parameter runs for every token). The one exception is Phi-3.5-MoE (~42B total, ~6.6B active), whose total parameter count sits outside the dense range above.

Context window: 4K – 128K

Phi-2 was 2K. Phi-3-mini and Phi-3-medium are natively 4K, with 128K variants via LongRoPE. Phi-3.5 is also extended to 128K via LongRoPE, and Phi-3.5-MoE is 128K. Phi-4 (14B) is 16K.

License: MIT

The headline models — Phi-2, the full Phi-3 family, Phi-3.5 (incl. MoE), and Phi-4 — ship under the permissive MIT license. Easy to use and build on.

The signature idea: same architecture, better data

Here is the core move. Take an ordinary transformer. Instead of feeding it the messy whole internet, feed it a small, clean, textbook-quality diet plus synthetic practice problems. The little model ends up performing like a much bigger one.

The contrast is between the usual approach (scrape everything, scale the model up) and Phi's (curate the data, keep the model small). Watch the data: that is where the magic lives, not in the model.
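The curate-then-mix idea can be sketched in a few lines. This is a toy illustration only: `educational_score`, the cue-word list, and the threshold are hypothetical stand-ins invented here, not Phi's actual filter, which the text above describes only at a high level.

```python
import random
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    source: str  # "web" or "synthetic"


# Hypothetical cue words standing in for a learned quality classifier.
EDU_CUES = ("example", "definition", "therefore", "exercise", "step")


def educational_score(text: str) -> float:
    """Fraction of words that look like 'teaching' cues (toy heuristic)."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,:;!?") in EDU_CUES)
    return hits / len(words)


def curate(web_docs, synthetic_docs, threshold=0.05, seed=0):
    """Keep only web docs above the threshold, then mix in synthetic docs."""
    kept = [d for d in web_docs if educational_score(d.text) >= threshold]
    mix = kept + list(synthetic_docs)
    random.Random(seed).shuffle(mix)  # deterministic shuffle for repeatability
    return mix
```

A real pipeline would replace the cue-word heuristic with a trained classifier; the shape (score, filter, mix with synthetic exercises) is the part that mirrors the description above.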

Small but mighty

Because the data does the heavy lifting, a tiny Phi model can land near much larger models on a rough quality axis.

The chart is illustrative: positions are relative, not measured benchmark scores (up = more capable, right = more parameters). The point is the shape: Phi clusters with bigger models while using a fraction of the parameters.

The lone MoE member

Almost every Phi is dense. The one exception is Phi-3.5-MoE: a Mixture-of-Experts model with 16 experts where a router picks top-2 per token. So ~42B parameters exist in total, but only ~6.6B actually fire for any given token — much of the model "sleeps" each step.

Phi-3.5-MoE: a router sends each token to just 2 of 16 experts (top-2 routing).
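To make top-2 routing concrete, here is a minimal sketch with scalar "experts". The expert count and top-2 choice come from the text; renormalizing the two selected scores with a softmax is one common convention and is an assumption here, not a statement about how Phi-3.5-MoE computes its gate weights.

```python
import math

NUM_EXPERTS = 16  # from the text
TOP_K = 2         # from the text


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top2(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:TOP_K]
    # Assumption: weights are a softmax over the selected logits only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(token, router_logits, experts):
    """Weighted sum of the two chosen experts; the other 14 are never called."""
    return sum(w * experts[idx](token) for idx, w in route_top2(router_logits))


# Toy experts: expert k multiplies its input by (k + 1).
experts = [lambda x, k=k: x * (k + 1) for k in range(NUM_EXPERTS)]
```

Only the experts returned by `route_top2` run for a given token, which is why the active parameter count is far below the total.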

Phi vs the shared recipe

Phi keeps almost every standard brick and changes the data plus a few efficiency knobs. Here is what it borrows and what it tweaks.

Keeps (standard Llama-style)

Decoder-only transformer — predict the next token, nothing exotic.
RMSNorm, pre-norm — normalize before each block (Llama-style); Phi-3-small and the early Phi-1/1.5/2 use LayerNorm.
SwiGLU feed-forward layers (the gated MLP Llama uses) in the Phi-3-mini/medium, Phi-3.5 and Phi-4 lineage — Phi-3-small swaps in a GEGLU variant.
RoPE rotary position embeddings for token order.
MIT license on the headline models — friendly to build on.
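For readers who want to see the two normalization styles and the two gated activations side by side, here is a minimal sketch on plain Python lists. It omits the learned scale and bias parameters and the linear projections that produce `gate` and `up`; those names are illustrative, not taken from any Phi codebase.

```python
import math


def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale/bias: subtract mean, divide by std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-5):
    """RMSNorm without learned scale: divide by root-mean-square, no centering."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def silu(v):
    """SiLU (swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def gelu(v):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def gated_unit(gate, up, act):
    """Elementwise act(gate) * up: SwiGLU when act=silu, GEGLU when act=gelu."""
    return [act(g) * u for g, u in zip(gate, up)]
```

The only difference between the SwiGLU and GEGLU paths in this sketch is the activation passed to `gated_unit`; RMSNorm differs from LayerNorm by skipping the mean subtraction.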

Changes / trade-offs

Data is the innovation, not the network — filtered + synthetic textbook-quality corpus.
Stays small on purpose — capability per parameter, not raw scale.
GQA in larger models; Phi-3-small adds alternating dense + blocksparse attention to shrink the KV-cache.
LongRoPE stretches context to 128K in several variants; Phi-4 instead raises the RoPE base (~250K) for a 16K window.
Two tokenizers — older minis use a Llama-2-style SentencePiece (~32K), while Phi-3-small and Phi-4 use a tiktoken BPE (~100K).

Attention: MHA → GQA

Phi-3-mini (3.8B): standard MHA, 32 heads. Phi-3-medium (14B): GQA, 40 query / 10 KV heads. Phi-3-small (7B): GQA where 4 query heads share 1 KV, plus alternating dense/blocksparse layers. Phi-4 (14B): GQA, 40 query / 10 KV, full attention.
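The payoff of GQA is a smaller KV-cache, and the arithmetic is simple enough to check by hand. The head counts below (40 query heads vs 10 KV heads) come from the text; the layer count, head dimension, sequence length, and 16-bit storage are assumptions chosen for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2 covers one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed for illustration: 40 layers, head_dim 128, 4096 tokens, 2 bytes/value.
N_LAYERS, HEAD_DIM, SEQ_LEN = 40, 128, 4096

mha_bytes = kv_cache_bytes(N_LAYERS, n_kv_heads=40, head_dim=HEAD_DIM, seq_len=SEQ_LEN)
gqa_bytes = kv_cache_bytes(N_LAYERS, n_kv_heads=10, head_dim=HEAD_DIM, seq_len=SEQ_LEN)

# The ratio depends only on the head counts, not on the assumed values above.
ratio = mha_bytes / gqa_bytes  # 40 / 10 = 4.0
```

Whatever the real layer count and head size, sharing each KV head across 4 query heads cuts the cache to a quarter of what full MHA would need.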

Positional: RoPE + LongRoPE

RoPE everywhere. LongRoPE unlocks 128K in Phi-3-mini/medium and Phi-3.5. Phi-4 keeps 16K but raises the RoPE base (~250K) instead.
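Why does raising the RoPE base help with a longer window? Each pair of rotary dimensions rotates at its own frequency, and a larger base stretches the slowest rotations over more tokens. The sketch below uses the standard RoPE frequency formula; the head dimension of 128 is an assumption, 10,000 is the common default base, and 250,000 is the approximate figure quoted above.

```python
import math


def rope_wavelengths(head_dim, base):
    """Wavelength in tokens for each rotary pair: 2*pi * base**(2i / head_dim)."""
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


HEAD_DIM = 128  # assumption for illustration

default_wl = rope_wavelengths(HEAD_DIM, 10_000.0)
raised_wl = rope_wavelengths(HEAD_DIM, 250_000.0)

# The fastest pair (i = 0) is unchanged; the slowest pair gets much longer.
fastest_unchanged = math.isclose(default_wl[0], raised_wl[0])
slowest_stretch = raised_wl[-1] / default_wl[-1]
```

The fastest-rotating pair keeps its resolution for nearby tokens, while the slowest pairs take longer to wrap around, so distant positions stay distinguishable.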

Tokenizer: two lineages

Phi-3-mini / 3.5-mini / 3.5-MoE: SentencePiece, ~32,064 vocab. Phi-3-small & Phi-4: tiktoken (cl100k-style) BPE, ~100,352 vocab.
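One practical consequence of the vocabulary gap is the size of the embedding table, which scales linearly with vocab size. The vocab sizes below come from the text; the hidden size of 3,072 is an assumption, held fixed for both cases to isolate the effect of the vocabulary.

```python
def embedding_params(vocab_size, hidden_size):
    """Parameters in one token-embedding table (one row per vocabulary entry)."""
    return vocab_size * hidden_size


HIDDEN_SIZE = 3_072  # assumption for illustration, same for both cases

sentencepiece_table = embedding_params(32_064, HIDDEN_SIZE)
tiktoken_table = embedding_params(100_352, HIDDEN_SIZE)

growth = tiktoken_table / sentencepiece_table  # roughly 3.1x larger table
```

Token IDs from one tokenizer are also meaningless to a model trained with the other, which is why prompts and fine-tuning data need re-tokenizing when switching lineages.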

Mixture-of-Experts: only 3.5-MoE

Everything dense except Phi-3.5-MoE: 16 experts, top-2 active, ~42B total / ~6.6B active per token.
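The total-versus-active split has a practical reading: compute per token tracks the active parameters, but memory tracks the total, because every expert must be loaded. A quick back-of-the-envelope sketch, using the approximate counts from the text and assuming 16-bit weights:

```python
TOTAL_PARAMS = 42e9    # approximate, from the text
ACTIVE_PARAMS = 6.6e9  # approximate, from the text
BYTES_PER_PARAM = 2    # assumption: 16-bit weights


def weights_gb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS  # about 0.157
resident_gb = weights_gb(TOTAL_PARAMS)          # all experts must be resident
touched_gb = weights_gb(ACTIVE_PARAMS)          # weights used for one token
```

Under these assumptions only about 16% of the weights are used for a given token, yet the full set still has to fit in memory, which is the main way serving the MoE model differs from serving a dense model of similar active size.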

Gotchas / good to know

Read before you reach for Phi

Small means small. These are SLMs (small language models). They are great per-parameter, but a 3.8B or 14B model still won't match a frontier-scale model on the hardest, broadest tasks.
Synthetic data is a double-edged sword. A lot of training text is GPT-generated. That sharpens reasoning patterns but can also bake in the quirks and blind spots of the model that produced it.
"128K context" isn't free. The long-context variants rely on LongRoPE; quality on truly long inputs can lag the short-context sweet spot. Phi-4 deliberately sticks to 16K.
Mind the tokenizer. Two different tokenizers across the family (~32K vs ~100K vocab) means prompts, token counts, and fine-tuning code aren't always interchangeable between members.
MoE is the odd one out. Only Phi-3.5-MoE uses experts; tooling, memory needs, and serving for it differ from the dense models even though it shares the brand.