How Open-Source LLMs Are Built

decoder-only · RoPE · RMSNorm · SwiGLU · GQA · MLA · Mixture-of-Experts

The big picture

Almost every open-source LLM you've heard of — Llama, Mistral, Qwen, DeepSeek, Gemma, gpt-oss, Phi — is built from the same handful of LEGO bricks. They all stack the same kind of layer dozens of times. What makes them different is which version of each brick they pick.

This page is the map. First we walk the shared skeleton — the decoder-only transformer — one layer at a time, with an animation for each brick. Then a side-by-side table shows exactly which brick each model chose. Finally, pick any model for its own deep-dive.

New to transformers?

This guide assumes you know roughly what attention is. If not, read "How Transformers Work" and "What is an LLM?" first; they build the foundation we reuse here.

One shared recipe

A modern open LLM is a tall sandwich. Text comes in at the bottom, gets turned into numbers, flows up through N identical blocks (often 28 to 80 of them), and a prediction for the next token comes out the top. Every block does two jobs: let tokens talk to each other (attention) and think about each token on its own (a feed-forward network). Normalization and "shortcut" (residual) connections wrap both jobs so the tall stack stays trainable.

Tokenizer + Embedding: text → vectors

Split text into tokens (sub-word chunks), then look up a learned vector for each. The vocabulary is typically 32k–256k tokens.

N × Decoder block: attention + FFN

The repeated unit. Norm → attention (tokens mix) → add shortcut → norm → feed-forward (per-token thinking) → add shortcut.

Final norm + LM head: vectors → word

One last normalization, then a linear layer projects to a score for every token in the vocabulary. Softmax turns scores into probabilities.

Autoregressive loop: predict, append, repeat

Pick the next token, glue it to the input, and run the whole stack again. One token at a time is how all the text you read gets written.

The animation below walks that whole journey. Press Play, or step through with Next.
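
The last two steps of that journey (softmax, then predict, append, repeat) can be sketched in a few lines of Python. This is only a toy under stated assumptions: `toy_model` is a hypothetical stand-in for the whole stack, the vocabulary has 8 made-up token ids, and the loop always takes the most likely token (greedy decoding), which is just one of several ways to pick.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(tokens, vocab_size):
    """Hypothetical stand-in for embedding + N decoder blocks + LM head.

    Returns one score (logit) per vocabulary entry. This toy simply
    favours the token id that follows the last one.
    """
    favourite = (tokens[-1] + 1) % vocab_size
    return [5.0 if i == favourite else 0.0 for i in range(vocab_size)]

def generate(prompt, steps, vocab_size=8):
    tokens = list(prompt)
    for _ in range(steps):
        probs = softmax(toy_model(tokens, vocab_size))
        next_token = max(range(vocab_size), key=lambda i: probs[i])  # greedy pick
        tokens.append(next_token)  # glue it to the input and run again
    return tokens

print(generate([3], steps=4))  # [3, 4, 5, 6, 7]
```

Note that the whole sequence is fed back in on every step. Real inference code avoids redoing that work by caching keys and values, which is where the KV cache in Brick 4 comes from.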

Brick 1 — RoPE: how the model knows word order

Attention by itself is order-blind: shuffle the words and it can't tell. Modern open models fix this with RoPE (Rotary Position Embedding). Instead of adding a position signal, RoPE rotates each token's query and key vectors by an angle that grows with its position. The clever part: when two tokens compare, only the angle between them matters — so the model learns relative distance, and it stretches to longer texts more gracefully.

Drag the slider to move the whole sentence later in the document. Watch the vectors spin — but the gap between any two tokens stays fixed. That fixed gap is what attention actually reads.

Two tokens, two positions apart. As the pair moves later in the document both rotate — but the angle between them never changes.
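
The "fixed gap" claim is easy to check numerically. The sketch below is a simplification, not a production RoPE: it rotates a single 2-D vector by one made-up frequency (`theta=0.5`), whereas real models rotate many pairs of dimensions, each at its own frequency. The function names are ours.

```python
import math

def rotate(vec, position, theta=0.5):
    """Rotate a 2-D vector by an angle proportional to its position."""
    angle = position * theta
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def score(query, key, q_pos, k_pos):
    """Attention score (dot product) after rotating both vectors."""
    q = rotate(query, q_pos)
    k = rotate(key, k_pos)
    return q[0] * k[0] + q[1] * k[1]

query, key = (1.0, 0.5), (0.3, 0.9)
near = score(query, key, q_pos=2, k_pos=0)       # gap of 2, early in the text
far = score(query, key, q_pos=1002, k_pos=1000)  # same gap, much later
print(abs(near - far) < 1e-9)  # True
```

Change the gap (say `q_pos=3, k_pos=0`) and the score changes; slide both positions by the same amount and it does not.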

Brick 2 — RMSNorm & pre-norm

Deep stacks are fragile: numbers can blow up or vanish as they pass through dozens of layers. Normalization rescales each token's vector to a sane size before each sub-layer. Modern models use RMSNorm — a cheaper, simpler cousin of LayerNorm that just divides by the vector's root-mean-square magnitude (no mean-subtraction). They also normalize before each sub-layer ("pre-norm") rather than after, which makes very deep models far easier to train.

LayerNorm: older (BERT, GPT-2)

Subtract the mean, divide by standard deviation, then scale and shift. Two learned vectors, more compute.

RMSNorm: today's default

Skip the mean entirely — just divide by RMS magnitude and scale. Fewer operations, basically the same quality. Used by nearly every model on this page.

Pre-norm: norm → sublayer → add

Normalizing the input to each block (not the output) keeps a clean "residual highway" running straight up the stack, so gradients flow freely.
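
Here is a minimal sketch of the three cards above, working on one token's vector as a plain Python list. The `attention` and `ffn` arguments are placeholder callables we made up so the wiring can be shown; in a real block, attention reads every token in the sequence, not just one vector.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root-mean-square, then apply a learned scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def layer_norm(x, gain, bias, eps=1e-6):
    """LayerNorm: subtract the mean, divide by the std, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for g, v, b in zip(gain, x, bias)]

def pre_norm_block(x, attention, ffn, gain_attn, gain_ffn):
    """norm -> sublayer -> add, twice. The un-normalized x is what gets added back."""
    attn_out = attention(rms_norm(x, gain_attn))
    x = [a + b for a, b in zip(x, attn_out)]
    ffn_out = ffn(rms_norm(x, gain_ffn))
    x = [a + b for a, b in zip(x, ffn_out)]
    return x

x = [3.0, -4.0, 12.0, 0.0]
ones = [1.0] * len(x)
normed = rms_norm(x, ones)
print(math.sqrt(sum(v * v for v in normed) / len(normed)))  # very close to 1.0
```

Because each sub-layer only ever adds to `x`, a sub-layer that outputs zeros leaves the vector untouched. That is the "residual highway" in code form.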

Brick 3 — SwiGLU: the feed-forward network

After tokens mix via attention, each token is processed alone by a small two-layer network — the feed-forward network (FFN), where most of a dense model's parameters live. Modern models replace the plain FFN with a gated one: SwiGLU (and Gemma's close relative GeGLU). A gate lets the network decide, per dimension, how much signal to let through — a small change that reliably improves quality.

In plain words

A normal FFN is "expand → squash → shrink." A gated FFN adds a second branch that acts like a dimmer switch: output = (Swish(xW₁) ⊙ xW₃)W₂. The Swish branch is the gate, ⊙ multiplies the two branches element by element, and W₂ projects the result back down.
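
To make the dimmer switch concrete, here is a small pure-Python sketch of a gated FFN for one token. It assumes no bias terms, and `W2` is our name for the matrix that does the final "project back down" step. The helper names are ours too.

```python
import math

def swish(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swiglu_ffn(x, W1, W3, W2):
    gate = [swish(z) for z in matvec(x, W1)]    # dimmer-switch branch
    up = matvec(x, W3)                          # plain "expand" branch
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise product
    return matvec(hidden, W2)                   # project back down

# Toy sizes: model width 2, hidden width 3. All weights are made up.
x = [1.0, 2.0]
W1 = [[0.5, -1.0, 0.0], [0.25, 0.5, 1.0]]
W3 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(swiglu_ffn(x, W1, W3, W2))
```

If the gate branch outputs zero for some hidden dimension, that dimension contributes nothing, no matter how large the other branch is.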

Brick 4 — the attention family: MHA → MQA → GQA → MLA

Attention needs to remember a Key and Value for every past token — the KV cache. On long chats that cache becomes the memory bottleneck. The big efficiency story of open LLMs is shrinking it. Each step keeps the same number of query heads but shares or compresses the keys and values.

Click each option to see the keys and values collapse — and the KV-cache bar shrink with them.

Query heads stay the same; keys/values are shared (MQA/GQA) or squeezed into a latent (MLA). Less to store = longer context, faster generation.

MHA (Multi-Head): one K/V per query

The original. Highest quality, biggest KV cache. (Original transformer, GPT-2.)

MQA (Multi-Query): all queries share 1 K/V

Tiny cache, fastest, slight quality loss. (PaLM, early Gemini.)

GQA (Grouped-Query): groups share K/V

The sweet spot, used by most models here. (Llama, Mistral, Qwen, Gemma.)

MLA (Latent Attention): K/V compressed to a latent

DeepSeek's trick: store a small shared latent, reconstruct K/V on the fly. A far smaller cache than MHA at full-ish quality.
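
Some back-of-envelope arithmetic shows how much each option saves. Everything below is a hypothetical configuration chosen for round numbers (32 layers, 32 query heads of size 128, an 8,192-token chat, 2 bytes per stored value), not the shape of any model on this page. The MLA line is simplified as well: it counts only a latent of assumed width 512 and ignores any extra positional part a real implementation may cache.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 covers K and V; bytes_per_value=2 assumes fp16/bf16.
    """
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value

def mla_cache_bytes(seq_len, n_layers, latent_dim, bytes_per_value=2):
    """Simplified MLA: one shared latent per token per layer instead of K and V."""
    return seq_len * n_layers * latent_dim * bytes_per_value

def kv_head_for(q_head, n_q_heads, n_kv_heads):
    """Which K/V head a given query head reads from under GQA."""
    return q_head // (n_q_heads // n_kv_heads)

shape = dict(seq_len=8192, n_layers=32, head_dim=128)
mha = kv_cache_bytes(n_kv_heads=32, **shape)  # one K/V head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **shape)   # 4 query heads share each K/V head
mqa = kv_cache_bytes(n_kv_heads=1, **shape)   # every query head shares one K/V head
mla = mla_cache_bytes(seq_len=8192, n_layers=32, latent_dim=512)

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa), ("MLA", mla)]:
    print(f"{name}: {size / 2**30:.3f} GiB")
# MHA: 4.000 GiB
# GQA: 1.000 GiB
# MQA: 0.125 GiB
# MLA: 0.250 GiB
```

The exact ratios depend entirely on the chosen head counts and latent width. The point is the shape of the formula: the cache scales with the number of K/V heads (or the latent width), not with the number of query heads.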

Brick 5 — Mixture-of-Experts (MoE)

How do you make a model "know more" without making every token cost more to compute? Mixture-of-Experts. Replace the single FFN with many expert FFNs, and add a tiny router that sends each token to just a few of them. The model has huge total parameters, but only a small active slice runs per token. Mixtral, DeepSeek-V3, gpt-oss, Qwen3-MoE and Llama 4 all use this.

Watch one token get routed: the gate scores all experts, the top few light up, they process, and their outputs are blended.

Why it's a big deal

DeepSeek-V3 has 671B total parameters but activates only ~37B per token — roughly the cost of a 37B dense model, with the knowledge of something far larger. That's the MoE bargain: pay for what you use.
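
The routing step can be sketched like this. It is one possible scheme under stated assumptions: the router scores are passed in directly (in a real model a small learned layer computes them from the token), the blend weights are a softmax over just the chosen experts, and the four "experts" are made-up functions that only scale their input. Real models differ in how they score and normalize.

```python
import math

def route(router_logits, top_k):
    """Pick the top_k experts and softmax their scores into blend weights."""
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, experts, router_logits, top_k=2):
    """Run only the chosen experts and blend their outputs."""
    weights = route(router_logits, top_k)
    out = [0.0] * len(x)
    for i, w in weights.items():
        y = experts[i](x)  # experts that were not chosen never run
        out = [o + w * v for o, v in zip(out, y)]
    return out

# Four hypothetical experts; expert i multiplies its input by i + 1.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]

print(sorted(route(logits, top_k=2)))         # [1, 3]
print(moe_layer([1.0, 1.0], experts, logits))  # [3.0, 3.0]
```

Only two of the four experts did any work for this token. At DeepSeek-V3's scale the same idea means about 37 / 671, or roughly 5.5%, of the parameters are active per token.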

Brick 6 — long context without quadratic cost

Full attention compares every token to every other, so cost grows with the square of the length. To read long documents cheaply, models limit who each token may look at. Sliding-window attention (Mistral) lets a token see only the last W tokens. Local + global layers (Gemma, gpt-oss) alternate cheap local layers with occasional full-range ones. In the windowed layers, cost grows roughly linearly with length instead.

Flip between mask types and drag the window. Lit cells = "this query (row) is allowed to look at this key (column)."

Fewer lit cells = less compute and memory per token. The shape of the mask is one of the biggest levers on context length.
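
Counting lit cells is simple enough to do in code. The sketch below builds the two basic masks as lists of booleans, with rows as queries and columns as keys. One assumption to flag: here the window of size W includes the current token, and conventions on that vary between implementations.

```python
def causal_mask(n):
    """Full causal attention: each query sees itself and every earlier token."""
    return [[k <= q for k in range(n)] for q in range(n)]

def sliding_window_mask(n, window):
    """Causal attention limited to the most recent `window` tokens."""
    return [[q - window < k <= q for k in range(n)] for q in range(n)]

def lit_cells(mask):
    """Number of (query, key) pairs the mask allows."""
    return sum(cell for row in mask for cell in row)

def render(mask):
    """Draw the mask as text: '#' is lit, '.' is blocked."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in mask)

n, window = 8, 3
print(lit_cells(causal_mask(n)))                  # 36
print(lit_cells(sliding_window_mask(n, window)))  # 21
print(render(sliding_window_mask(5, 2)))
```

Double `n` and the causal count roughly quadruples, while the windowed count only about doubles. That is the quadratic versus linear difference described above.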

Brick 7 — tokenizer & vocabulary

Before any of this, text must become tokens. Models use byte-level BPE or SentencePiece to split text into sub-word pieces. A bigger vocabulary means fewer tokens per sentence (cheaper, longer effective context) but a larger embedding table. Vocabularies have grown a lot: Llama 2 used 32k, Llama 3 jumped to 128k, and Gemma uses 256k — great for multilingual and code text.
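
The "larger embedding table" cost is one multiplication: vocabulary size times model width. In the sketch below the vocabulary sizes are the rounded figures from the paragraph above, and the width of 4096 is a hypothetical value picked for illustration, not a claim about any particular model.

```python
def embedding_params(vocab_size, d_model):
    """One learned vector of length d_model per vocabulary entry."""
    return vocab_size * d_model

d_model = 4096  # hypothetical model width
for name, vocab in [("32k", 32_000), ("128k", 128_000), ("256k", 256_000)]:
    millions = embedding_params(vocab, d_model) / 1e6
    print(f"{name} vocab: {millions:.0f}M embedding parameters")
# 32k vocab: 131M embedding parameters
# 128k vocab: 524M embedding parameters
# 256k vocab: 1049M embedding parameters
```

If the LM head does not share weights with the embedding table, the same amount is paid a second time at the top of the stack.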

Same bricks, different choices

Here's the whole field at a glance. Every model is a decoder-only transformer with RoPE and RMSNorm — the differences are in attention, the FFN, and whether they go Mixture-of-Experts. Click a row to highlight it and open that model's deep-dive.

| Model | Type | Attention | Norm | FFN | Position | MoE | Context | Vocab |
|---|---|---|---|---|---|---|---|---|
| Llama (Meta) | Dense (L4: MoE) | GQA | RMSNorm | SwiGLU | RoPE | Llama 4 | 128K (L3) → 1M+ (L4) | 128K |
| Mistral / Mixtral (Mistral AI) | Dense & MoE | GQA + sliding window | RMSNorm | SwiGLU | RoPE | Mixtral (8, top-2) | 32K → 256K | 32K → 131K |
| Qwen (Alibaba) | Dense & MoE | GQA + QK-Norm | RMSNorm | SwiGLU | RoPE + YaRN | Qwen3-MoE (128, top-8) | 128K → 256K | ~151K |
| DeepSeek (V3 / R1) | MoE | MLA | RMSNorm | SwiGLU + DeepSeekMoE | RoPE | Yes (fine-grained + shared) | 128K | ~129K |
| Gemma (Google) | Dense | GQA + local/global | RMSNorm (pre+post) | GeGLU | RoPE | No | 128K (Gemma 3) | 256K |
| gpt-oss (OpenAI) | MoE | GQA + alt dense/sparse + sinks | RMSNorm | SwiGLU (MoE) | RoPE + YaRN | Yes | 128K | o200k (harmony) |
| Phi (Microsoft) | Dense (3.5: MoE) | GQA | RMSNorm | SwiGLU | RoPE + LongRoPE | Phi-3.5-MoE | 128K | ~100K |

Shared by all: decoder-only · RoPE · RMSNorm · gated FFN · GQA-or-better
The differentiators: MoE · attention variant · window pattern · vocab size

Pick a model

Each deep-dive shows the family's timeline, its signature architecture trick with its own animation, and how it differs from the shared recipe.