How Open-Source LLMs Are Built
The big picture
Almost every open-source LLM you've heard of — Llama, Mistral, Qwen, DeepSeek, Gemma, gpt-oss, Phi — is built from the same handful of LEGO bricks. They all stack the same kind of layer dozens of times. What makes them different is which version of each brick they pick.
This page is the map. First we walk the shared skeleton — the decoder-only transformer — one layer at a time, with an animation for each brick. Then a side-by-side table shows exactly which brick each model chose. Finally, pick any model for its own deep-dive.
This guide assumes you know roughly what attention is. If not, read "How Transformers Work" and "What is an LLM?" first; they build the foundation we reuse here.
One shared recipe
A modern open LLM is a tall sandwich. Text comes in at the bottom, gets turned into numbers, flows up through N identical blocks (often 28–80 of them), and a prediction for the next token comes out the top. Every block does two jobs: let tokens talk to each other (attention) and think about each token on its own (a feed-forward network). Normalization and "shortcut" (residual) connections wrap both jobs so the tall stack stays trainable.
1. Tokenize and embed. Split text into tokens (sub-word chunks), then look up a learned vector for each. The vocabulary is typically 32k–256k tokens.
2. Transformer block (repeated N times). Norm → attention (tokens mix) → add shortcut → norm → feed-forward (per-token thinking) → add shortcut.
3. Output head. One last normalization, then a linear layer projects to a score for every token in the vocabulary. Softmax turns scores into probabilities.
4. Generate. Pick the next token, glue it to the input, and run the whole stack again. One token at a time is how all the text you read gets written.
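The generation loop can be sketched in a few lines of Python. This is a toy under stated assumptions: `toy_model` is a hypothetical stand-in for the whole stack (it just favours "the token after the last one"), and greedy picking is only one of several ways to choose the next token.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(token_ids, vocab_size):
    """Hypothetical stand-in for embeddings + N blocks + output head.

    Returns one score per vocabulary entry, favouring (last token + 1).
    """
    target = (token_ids[-1] + 1) % vocab_size
    return [5.0 if i == target else 0.0 for i in range(vocab_size)]

def generate(prompt_ids, steps, vocab_size):
    ids = list(prompt_ids)
    for _ in range(steps):
        probs = softmax(toy_model(ids, vocab_size))
        next_id = max(range(vocab_size), key=lambda i: probs[i])  # greedy pick
        ids.append(next_id)  # glue it to the input, then run the stack again
    return ids

print(generate([3, 4], steps=4, vocab_size=8))
```

A real model re-runs the full stack at every step (with a KV cache to avoid repeating work), but the shape of the loop is the same.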
Brick 1 — RoPE: how the model knows word order
Attention by itself is order-blind: shuffle the words and it can't tell. Modern open models fix this with RoPE (Rotary Position Embedding). Instead of adding a position signal, RoPE rotates each token's query and key vectors by an angle that grows with its position. The clever part: when two tokens compare, only the angle between them matters — so the model learns relative distance, and it stretches to longer texts more gracefully.
Move the whole sentence later in the document and every vector spins, but the gap between any two tokens stays fixed. Take two tokens that sit two positions apart: as the pair moves later in the document both rotate, yet the angle between them never changes. That fixed angle is what attention actually reads.
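Here is a minimal sketch of that property, assuming a single 2-D pair of dimensions and one rotation frequency (`theta = 0.1` is an arbitrary illustrative value; real models apply many frequencies across many dimension pairs). The attention score for a query and key is the same whether the pair sits early or late, as long as the distance between them is unchanged.

```python
import math

def rotate(vec, position, theta=0.1):
    """Rotate a 2-D vector by position * theta radians."""
    angle = position * theta
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def score(query, key, q_pos, k_pos):
    """Dot product of the rotated query and the rotated key."""
    q = rotate(query, q_pos)
    k = rotate(key, k_pos)
    return q[0] * k[0] + q[1] * k[1]

query, key = (1.0, 0.5), (0.3, 0.8)

early = score(query, key, q_pos=5, k_pos=3)      # two positions apart
late = score(query, key, q_pos=105, k_pos=103)   # same gap, 100 tokens later
closer = score(query, key, q_pos=5, k_pos=4)     # different gap

print(abs(early - late) < 1e-9)    # same gap gives the same score
print(abs(early - closer) > 1e-6)  # a different gap gives a different score
```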
Brick 2 — RMSNorm & pre-norm
Deep stacks are fragile: numbers can blow up or vanish as they pass through dozens of layers. Normalization rescales each token's vector to a sane size before each sub-layer. Modern models use RMSNorm — a cheaper, simpler cousin of LayerNorm that just divides by the vector's root-mean-square magnitude (no mean-subtraction). They also normalize before each sub-layer ("pre-norm") rather than after, which makes very deep models far easier to train.
LayerNorm: subtract the mean, divide by the standard deviation, then scale and shift. Two learned vectors, more compute.
RMSNorm: skip the mean entirely; just divide by the RMS magnitude and scale. Fewer operations, essentially the same quality. Used by nearly every model on this page.
Pre-norm: normalizing the input to each sub-layer (not its output) keeps a clean "residual highway" running straight up the stack, so gradients flow freely.
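The three ideas above fit in a short pure-Python sketch. The function names and the `eps` value are illustrative choices, not any particular library's API, and vectors are plain lists to keep the example self-contained.

```python
import math

def layer_norm(x, gain, bias, eps=1e-6):
    """Subtract the mean, divide by the standard deviation, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain, eps=1e-6):
    """Divide by the root-mean-square magnitude, then scale. No mean, no shift."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

def pre_norm_sublayer(x, sublayer, gain):
    """Pre-norm residual step: x + sublayer(norm(x)).

    The shortcut carries x through untouched; only the sub-layer input is normalized.
    """
    out = sublayer(rms_norm(x, gain))
    return [a + b for a, b in zip(x, out)]

x = [3.0, 4.0]
print(rms_norm(x, gain=[1.0, 1.0]))
print(layer_norm(x, gain=[1.0, 1.0], bias=[0.0, 0.0]))
```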
Brick 3 — SwiGLU: the feed-forward network
After tokens mix via attention, each token is processed alone by a small two-layer network — the feed-forward network (FFN), where most of a dense model's parameters live. Modern models replace the plain FFN with a gated one: SwiGLU (and Gemma's close relative GeGLU). A gate lets the network decide, per dimension, how much signal to let through — a small change that reliably improves quality.
A normal FFN is "expand → squash → shrink." A gated FFN adds a second branch that acts like a dimmer switch: hidden = Swish(xW₁) ⊙ (xW₃), then output = hidden·W₂ projects back down. The ⊙ (element-wise product) is the gate.
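A minimal sketch of that gated FFN for a single token, using plain lists. The tiny weight matrices are made-up values chosen only so the arithmetic is easy to follow; `w1`, `w3` and `w2` mirror W₁, W₃ and the down-projection.

```python
import math

def swish(z):
    """Swish (also called SiLU): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def swiglu_ffn(x, w1, w3, w2):
    gate = [swish(z) for z in matvec(x, w1)]    # gate branch: Swish(x W1)
    up = matvec(x, w3)                          # value branch: x W3
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gate
    return matvec(hidden, w2)                   # project back down to model width

# Model width 2, hidden width 3 (illustrative sizes only).
x = [1.0, 2.0]
w1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
w3 = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

print(swiglu_ffn(x, w1, w3, w2))
```

Swapping `swish` for GELU in the gate branch gives the GeGLU variant mentioned above.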
Brick 4 — the attention family: MHA → MQA → GQA → MLA
Attention needs to remember a Key and Value for every past token — the KV cache. On long chats that cache becomes the memory bottleneck. The big efficiency story of open LLMs is shrinking it. Each step keeps the same number of query heads but shares or compresses the keys and values.
Query heads stay the same; keys/values are shared (MQA/GQA) or squeezed into a latent (MLA). Less to store = longer context, faster generation.
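MHA (multi-head attention): the original. Highest quality, biggest KV cache. (Original transformer, GPT-2.)
MQA (multi-query attention): all query heads share one key/value head. Tiny cache, fastest, slight quality loss. (PaLM, early Gemini.)
GQA (grouped-query attention): query heads share keys/values in groups. The sweet spot, used by most models here. (Llama, Mistral, Qwen, Gemma.)
MLA (multi-head latent attention): DeepSeek's trick. Store a small shared latent, reconstruct K/V on the fly. Smallest cache at near-full quality.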
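A back-of-the-envelope calculation makes the cache savings concrete. Every number in `config` is a hypothetical example (32 layers, 32 query heads, head size 128, a 32,768-token context, 2 bytes per value), and `latent_dim=576` is likewise just an illustrative latent size, not a claim about any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Keys plus values, for every layer and every cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def latent_cache_bytes(n_layers, latent_dim, seq_len, bytes_per_value=2):
    """MLA-style cache: one shared latent vector per token per layer."""
    return n_layers * latent_dim * seq_len * bytes_per_value

config = dict(n_layers=32, head_dim=128, seq_len=32_768)  # hypothetical model

mha = kv_cache_bytes(n_kv_heads=32, **config)  # one K/V head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **config)   # 4 query heads share each K/V head
mqa = kv_cache_bytes(n_kv_heads=1, **config)   # all query heads share one K/V head
mla = latent_cache_bytes(n_layers=32, latent_dim=576, seq_len=32_768)

GIB = 2 ** 30
for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa), ("MLA", mla)]:
    print(f"{name}: {size / GIB:.3f} GiB")
```

The cache scales linearly with the number of key/value heads, which is why sharing them pays off so quickly on long contexts.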
Brick 5 — Mixture-of-Experts (MoE)
How do you make a model "know more" without making every token cost more to compute? Mixture-of-Experts. Replace the single FFN with many expert FFNs, and add a tiny router that sends each token to just a few of them. The model has a huge total parameter count, but only a small active slice runs per token. Mixtral, DeepSeek-V3, gpt-oss, Qwen3-MoE and Llama 4 all use this.
Follow one token through the layer: the router scores all experts, the top few are selected, they process the token, and their outputs are blended using the router's weights.
DeepSeek-V3 has 671B total parameters but activates only ~37B per token: roughly the per-token compute of a 37B dense model, with the knowledge of something far larger (all the weights still have to be held in memory). That's the MoE bargain: pay compute only for what you use.
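The routing step can be sketched as follows. This is a toy under stated assumptions: the "experts" are hypothetical functions that just scale their input, the router scores are hard-coded rather than computed from the token, and renormalizing the top-k weights so they sum to 1 is one common convention among several.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_scores, top_k):
    """Return (expert_index, weight) pairs for the top_k experts."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in chosen)
    return [(i, probs[i] / total) for i in chosen]  # weights sum to 1

def moe_layer(x, router_scores, experts, top_k=2):
    """Run only the selected experts and blend their outputs."""
    out = [0.0] * len(x)
    for idx, weight in route(router_scores, top_k):
        y = experts[idx](x)  # the unselected experts never run
        out = [o + weight * v for o, v in zip(out, y)]
    return out

# Four hypothetical experts: expert i multiplies its input by (i + 1).
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_scores = [0.1, 2.0, 0.3, 2.0]  # in a real model these come from a learned linear layer

print(route(router_scores, top_k=2))
print(moe_layer([1.0, -1.0], router_scores, experts, top_k=2))
```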
Brick 6 — long context without quadratic cost
Full attention compares every token to every other, so cost grows with the square of the length. To read long documents cheaply, models limit who each token may look at. Sliding-window attention (Mistral) lets a token see only the last W tokens. Local + global layers (Gemma, gpt-oss) alternate cheap local layers with occasional full-range ones. In the windowed layers, cost grows roughly linearly with length instead.
Picture each mask as a grid: a lit cell means "this query (row) is allowed to look at this key (column)." Fewer lit cells mean less compute and memory per token. The shape of the mask is one of the biggest levers on context length.
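The two mask shapes are easy to build and count directly. In this sketch the window is assumed to include the current token itself, which is one reasonable reading of "the last W tokens"; the sizes are arbitrary small examples.

```python
def causal_mask(n):
    """Full causal attention: query q may look at every key k <= q."""
    return [[k <= q for k in range(n)] for q in range(n)]

def sliding_window_mask(n, window):
    """Query q may look only at the last `window` keys, itself included."""
    return [[q - window < k <= q for k in range(n)] for q in range(n)]

def lit_cells(mask):
    """Count the allowed (query, key) pairs."""
    return sum(sum(row) for row in mask)

def show(mask):
    for row in mask:
        print("".join("#" if cell else "." for cell in row))

n = 8
show(causal_mask(n))
print(lit_cells(causal_mask(n)))
show(sliding_window_mask(n, window=3))
print(lit_cells(sliding_window_mask(n, window=3)))
```

With full causal attention the count grows like n²/2; with a fixed window it grows like n × W, which is where the roughly linear cost comes from.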
Brick 7 — tokenizer & vocabulary
Before any of this, text must become tokens. Models use byte-level BPE or SentencePiece to split text into sub-word pieces. A bigger vocabulary means fewer tokens per sentence (cheaper, longer effective context) but a larger embedding table. Vocabularies have grown a lot: Llama 2 used 32k, Llama 3 jumped to 128k, and Gemma uses 256k — great for multilingual and code text.
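The "larger embedding table" cost is simple arithmetic: one learned vector per vocabulary entry. The sketch below assumes a hypothetical model width of 4096 and uses rounded vocabulary sizes; real models differ in both.

```python
def embedding_params(vocab_size, d_model):
    """Parameters in the token embedding table: one vector per vocabulary entry."""
    return vocab_size * d_model

D_MODEL = 4096  # hypothetical model width, for illustration only

for label, vocab_size in [("32k", 32_000), ("128k", 128_000), ("256k", 256_000)]:
    params = embedding_params(vocab_size, D_MODEL)
    print(f"{label} vocabulary: {params / 1e6:.0f}M embedding parameters")
```

If the output layer does not share weights with the embedding table, the same cost is paid a second time at the top of the stack.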
Same bricks, different choices
Here's the whole field at a glance. Every model is a decoder-only transformer with RoPE and RMSNorm; the differences are in attention, the FFN, and whether they use Mixture-of-Experts. Each row has its own deep-dive further down.
| Model | Type | Attention | Norm | FFN | Position | MoE | Context | Vocab |
|---|---|---|---|---|---|---|---|---|
| Llama (Meta) | Dense (Llama 4: MoE) | GQA | RMSNorm | SwiGLU | RoPE | Llama 4 | 128K (Llama 3) → 1M+ (Llama 4) | 128K |
| Mistral / Mixtral (Mistral AI) | Dense & MoE | GQA + sliding window | RMSNorm | SwiGLU | RoPE | Mixtral (8, top-2) | 32K → 256K | 32K → 131K |
| Qwen (Alibaba) | Dense & MoE | GQA + QK-Norm | RMSNorm | SwiGLU | RoPE + YaRN | Qwen3-MoE (128, top-8) | 128K → 256K | ~151K |
| DeepSeek (V3 / R1) | MoE | MLA | RMSNorm | SwiGLU + DeepSeekMoE | RoPE | Yes (fine-grained + shared) | 128K | ~129K |
| Gemma (Google) | Dense | GQA + local/global | RMSNorm (pre+post) | GeGLU | RoPE | No | 128K (Gemma 3) | 256K |
| gpt-oss (OpenAI) | MoE | GQA + alt dense/sparse + sinks | RMSNorm | SwiGLU (MoE) | RoPE + YaRN | Yes | 128K | o200k (harmony) |
| Phi (Microsoft) | Dense (Phi-3.5: MoE) | GQA | RMSNorm | SwiGLU | RoPE + LongRoPE | Phi-3.5-MoE | 128K | ~100K |
Pick a model
Each deep-dive shows the family's timeline, its signature architecture trick with its own animation, and how it differs from the shared recipe.
Llama — Meta
The model that set the modern open recipe: GQA + RoPE + SwiGLU. Now going Mixture-of-Experts with Llama 4.
Mistral & Mixtral
Sliding-window attention for cheap long context, and the first widely adopted open MoE: 8 experts, 2 active per token.
Qwen — Alibaba
A huge family from tiny to 235B, dense and MoE, with a large multilingual vocabulary and YaRN long context.
DeepSeek — V3 / R1
Multi-head Latent Attention shrinks the KV cache; DeepSeekMoE and RL reasoning power R1.
Gemma — Google
Alternating local/global attention, GeGLU, double RMSNorm, and a giant 256k vocabulary.
gpt-oss — OpenAI
OpenAI's open-weight MoE models with attention sinks and MXFP4-quantized experts.
Phi — Microsoft
Small models that punch far above their weight — the architecture is standard; the data is the secret.