Phi (Microsoft)
What makes it distinctive
Most LLM families chase a clever new block in the network. Phi does the opposite: the transformer is a plain Llama-style design, and the whole bet is on what it trains on — textbook-quality data instead of raw internet sludge.
The line takes its slogan from the title of the Phi-1 paper, "Textbooks Are All You Need" (a play on the famous "Attention Is All You Need"). Microsoft Research heavily filters web text down to genuinely educational pages, adds large amounts of GPT-generated synthetic "practice problems," and trains a small model on that curated mix. The result is compact models (a few billion parameters) that punch far above their weight. The architecture is deliberately boring so the data can be the star.
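To make the filter-then-mix idea concrete, here is a toy sketch. It is not Microsoft's pipeline: the keyword-based `educational_score`, the `Document` type, `build_corpus`, and the 0.3 threshold are all hypothetical stand-ins for whatever quality classifier and mixing rules the real process uses.

```python
from dataclasses import dataclass

# Hypothetical marker words. A real pipeline would use a trained quality
# classifier, not a keyword list.
EDUCATIONAL_MARKERS = ("definition", "example", "theorem", "exercise", "therefore", "step")


@dataclass
class Document:
    text: str
    source: str  # "web" or "synthetic"


def educational_score(text: str) -> float:
    """Toy proxy: fraction of marker words that appear in the text."""
    lowered = text.lower()
    hits = sum(1 for marker in EDUCATIONAL_MARKERS if marker in lowered)
    return hits / len(EDUCATIONAL_MARKERS)


def build_corpus(web_docs, synthetic_docs, threshold=0.3):
    """Keep web documents scoring at or above the threshold, then append synthetic ones."""
    kept = [d for d in web_docs if educational_score(d.text) >= threshold]
    return kept + list(synthetic_docs)


web = [
    Document("Definition: a prime has exactly two divisors. Example: 7. Exercise: test 91.", "web"),
    Document("CLICK HERE for the best deals on sneakers!!!", "web"),
]
synthetic = [Document("Exercise: write a function that reverses a list, step by step.", "synthetic")]

corpus = build_corpus(web, synthetic)
print([d.source for d in corpus])
```

The ad-like page scores zero and is dropped, while the textbook-style page and the synthetic exercise survive. The shape of the process (score, filter, then mix in synthetic data) is the point; the scoring rule itself is only a placeholder.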
Every family on this site reuses the same handful of bricks (attention, RoPE, SwiGLU, RMSNorm). Read How Open-Source LLMs Are Built first for the shared recipe, then come back — Phi mostly reuses that recipe and changes the data.
The family so far
Phi grew from a 1.3B experiment into a small lineup of "small but strong" models. The sizes stay modest on purpose.
Phi-1 & Phi-1.5 (1.3B), Phi-2 (2.7B) in 2023. Phi-3 family in 2024: mini 3.8B, small 7B, medium 14B. Phi-3.5 (Aug 2024). Phi-4 (14B, Dec 2024). Later Phi-4-mini (~3.8B) and Phi-4 reasoning variants in 2025; Phi-4-reasoning-vision (15B, Mar 2026) decides on its own when to think before answering.
Small by LLM standards. Almost all are dense (every parameter is used for every token). The one exception is Phi-3.5-MoE (~42B total, ~6.6B active).
Context windows: Phi-4 (14B) handles 16K tokens. Phi-3-mini/medium and Phi-3.5 ship 128K variants via LongRoPE. Phi-3.5-MoE also handles 128K.
The headline models — Phi-2, the full Phi-3 family, Phi-3.5 (incl. MoE), and Phi-4 — ship under the permissive MIT license. Easy to use and build on.
The signature idea: same architecture, better data
Here is the core move. Take an ordinary transformer. Instead of feeding it the messy whole internet, feed it a small, clean, textbook-quality diet plus synthetic practice problems. The little model ends up performing like a much bigger one.
The usual approach is to scrape everything and scale the model up. Phi's approach is to curate the data and keep the model small. Watch the data: that is where the magic lives, not in the model itself.
Small but mighty
Because the data does the heavy lifting, a tiny Phi model can land near much larger models on a rough quality axis.
The chart is illustrative: it shows relative positions, not measured benchmark scores (up = more capable, right = more parameters). The point is the shape: Phi clusters with bigger models while using a fraction of the parameters.
Almost every Phi is dense. The one exception is Phi-3.5-MoE: a Mixture-of-Experts model with 16 experts where a router picks top-2 per token. So ~42B parameters exist in total, but only ~6.6B actually fire for any given token — much of the model "sleeps" each step.
Phi-3.5-MoE: a router sends each token to just 2 of 16 experts (top-2 routing).
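A minimal sketch of top-2 routing over 16 experts, in plain Python. The expert count and top-2 choice come from the text above; everything else is an assumption for illustration. In particular, renormalising the two selected router scores with a softmax is a common choice in MoE layers, not a confirmed detail of Phi-3.5-MoE, and the toy "experts" here just scale their input.

```python
import math
import random

NUM_EXPERTS = 16  # from the text
TOP_K = 2         # from the text


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight), ...] for the top_k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalise over the chosen experts only, so the weights sum to 1 (assumed scheme).
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(token, router_logits, experts):
    """Weighted sum of the selected experts' outputs; the other experts are never called."""
    output = [0.0] * len(token)
    for index, weight in route_token(router_logits):
        expert_out = experts[index](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


# Toy experts: expert i scales the token by (i + 1). Real experts are feed-forward networks.
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)
print("experts used:", [i for i, _ in selected])
print("fraction of experts active:", TOP_K / NUM_EXPERTS)
```

Only 2 of 16 experts run per token, which is 12.5% of the expert weights. The quoted ~6.6B active out of ~42B total is closer to 16%, presumably because the parts of the network outside the experts (attention, embeddings) run for every token.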
Phi vs the shared recipe
Phi keeps almost every standard brick and changes the data plus a few efficiency knobs. Here is what it borrows and what it tweaks.
- Decoder-only transformer — predict the next token, nothing exotic.
- RMSNorm, pre-norm — normalize before each block (Llama-style).
- SwiGLU feed-forward layers (the gated MLP Llama uses).
- RoPE rotary position embeddings for token order.
- MIT license on the headline models — friendly to build on.
- Data is the innovation, not the network — filtered + synthetic textbook-quality corpus.
- Stays small on purpose — capability per parameter, not raw scale.
- GQA in larger models; Phi-3-small adds alternating dense + blocksparse attention to shrink the KV-cache.
- LongRoPE stretches context to 128K in several variants; Phi-4 instead raises the RoPE base frequency (to about 250,000) for a 16K window.
- Two tokenizers — older minis use a Llama-2-style SentencePiece (~32K), while Phi-3-small and Phi-4 use a tiktoken BPE (~100K).
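Why does raising the RoPE base help with longer windows? A rough sketch, using the standard RoPE frequency formula: each pair of dimensions rotates with a wavelength of 2π · base^(2i/d) tokens. The head dimension of 128 is an assumption for illustration, and 10,000 is the base from the original RoPE formulation, used here only as a comparison point.

```python
import math


def rope_wavelengths(head_dim, base):
    """Wavelength in tokens of each rotary frequency pair: 2*pi * base**(2i/d)."""
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


HEAD_DIM = 128  # assumed for illustration

for base in (10_000, 250_000):
    longest = rope_wavelengths(HEAD_DIM, base)[-1]
    print(f"base={base:>7}: longest wavelength ~ {longest:,.0f} tokens")
```

A larger base stretches the slow-rotating pairs, so positions far apart still get distinguishable rotations. This only shows the geometric effect; it says nothing about how much extra training a model needs to use the longer window well.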
Phi-3-mini (3.8B): standard MHA, 32 heads. Phi-3-medium: 40 heads. Phi-3-small (7B): GQA where 4 query heads share 1 KV head, plus alternating dense/blocksparse layers. Phi-4 (14B): GQA, 40 query heads / 10 KV heads, full attention.
RoPE everywhere. LongRoPE unlocks 128K in Phi-3-mini/medium and Phi-3.5. Phi-4 keeps 16K but raises the RoPE base frequency (to about 250,000) instead.
Phi-3-mini / 3.5-mini / 3.5-MoE: SentencePiece, ~32,064 vocab. Phi-3-small & Phi-4: tiktoken (cl100k-style) BPE, ~100,352 vocab.
Everything dense except Phi-3.5-MoE: 16 experts, top-2 active, ~42B total / ~6.6B active per token.
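The GQA numbers listed above translate directly into KV-cache savings. Here is a back-of-the-envelope sketch comparing 40 KV heads (what plain MHA would need with 40 query heads) against the 10 KV heads quoted for Phi-4. The layer count, head dimension, and 2-byte values are assumptions for illustration; only the head counts and the 16K window come from the text.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes for keys plus values for one sequence (the leading 2 covers K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


LAYERS = 40       # assumed for illustration
HEAD_DIM = 128    # assumed for illustration
SEQ_LEN = 16_384  # the 16K window from the text

mha = kv_cache_bytes(LAYERS, num_kv_heads=40, head_dim=HEAD_DIM, seq_len=SEQ_LEN)
gqa = kv_cache_bytes(LAYERS, num_kv_heads=10, head_dim=HEAD_DIM, seq_len=SEQ_LEN)

print(f"40 KV heads: {mha / 2**30:.2f} GiB")
print(f"10 KV heads: {gqa / 2**30:.2f} GiB")
print("ratio:", mha // gqa)
```

Whatever the real layer count and head size are, the ratio stays at 4: sharing one KV head across four query heads cuts the cache to a quarter.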
Gotchas / good to know
- Small means small. These are SLMs (small language models). They are great per-parameter, but a 3.8B or 14B model still won't match a frontier-scale model on the hardest, broadest tasks.
- Synthetic data is a double-edged sword. A lot of training text is GPT-generated. That sharpens reasoning patterns but can also bake in the quirks and blind spots of the model that produced it.
- "128K context" isn't free. The long-context variants rely on LongRoPE; quality on truly long inputs can lag the short-context sweet spot. Phi-4 deliberately sticks to 16K.
- Mind the tokenizer. Two different tokenizers across the family (~32K vs ~100K vocab) means prompts, token counts, and fine-tuning code aren't always interchangeable between members.
- MoE is the odd one out. Only Phi-3.5-MoE uses experts; tooling, memory needs, and serving for it differ from the dense models even though it shares the brand.
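The tokenizer gotcha is easy to see with a little arithmetic. The sketch below uses the two vocabulary sizes quoted earlier; the labels in `VOCAB_SIZES`, the helper names, and the hidden size of 3072 are illustrative assumptions, not values read from any model config.

```python
VOCAB_SIZES = {
    "sentencepiece-32k": 32_064,  # Phi-3-mini / 3.5-mini / 3.5-MoE, per the text
    "tiktoken-100k": 100_352,     # Phi-3-small and Phi-4, per the text
}


def embedding_params(vocab_size, hidden_size):
    """Number of weights in a vocab_size x hidden_size embedding table."""
    return vocab_size * hidden_size


def ids_fit(token_ids, vocab_size):
    """True if every token id is a valid row index for this vocabulary."""
    return all(0 <= t < vocab_size for t in token_ids)


HIDDEN = 3072  # assumed for illustration, same for both so only vocab size differs

for name, vocab in VOCAB_SIZES.items():
    print(f"{name}: {embedding_params(vocab, HIDDEN):,} embedding weights")

# Token ids produced by the larger vocabulary can be out of range for the smaller one.
ids_from_large_vocab = [17, 50_000, 99_999]
print(ids_fit(ids_from_large_vocab, VOCAB_SIZES["tiktoken-100k"]))
print(ids_fit(ids_from_large_vocab, VOCAB_SIZES["sentencepiece-32k"]))
```

Ids that are valid for the ~100K vocabulary can point past the end of a ~32K embedding table, which is why pre-tokenized datasets and cached token counts should not be reused across the two groups.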
Related
Where to go next.
How Open-Source LLMs Are Built
The shared recipe — attention, RoPE, SwiGLU, RMSNorm, MoE — that Phi reuses almost wholesale.
Llama (Meta)
The Llama-style decoder-only design Phi borrows: pre-norm RMSNorm, SwiGLU, RoPE, GQA.
Qwen (Alibaba)
Another family spanning small-to-large sizes — a useful contrast in how to scale a lineup.
What is an LLM?
The fundamentals — next-token prediction, tokens, and why scale and data both matter.