Llama (Meta)
What makes it distinctive
Llama is the model that wrote the modern open recipe. When Meta shipped Llama 1 in February 2023, it bundled three choices — RMSNorm, SwiGLU, and RoPE — into one clean decoder stack that almost every open LLM since has copied.
After that it kept pushing the recipe forward: Grouped-Query Attention (GQA) for cheaper, faster inference; a much larger 128K-token vocabulary in Llama 3; and, with Llama 4, a jump to Mixture-of-Experts plus a new attention scheme called iRoPE that stretches context into the multi-million-token range.
Llama shares its skeleton with every other model here. Start with How Open-Source LLMs Are Built for the shared recipe — RoPE, RMSNorm, SwiGLU, GQA and MoE — then come back to see Llama's specific choices.
The family so far
Just over two years of releases, from a 7B research weight drop to a near-2-trillion-parameter MoE herd. The shape stays familiar; the scale and the tricks keep growing.
Versions: Llama 1 (Feb 2023), Llama 2 (Jul 2023), Llama 3 (Apr 2024), 3.1 / 3.2 / 3.3, then the Llama 4 MoE herd (Apr 2025). As of mid-2026 Llama 4 is still the last open-weight Llama; in April 2026 Meta moved its frontier work to the closed-weight Muse Spark line.
Sizes: Dense from 1B up to 405B (Llama 3.1). Llama 4 goes sparse: Scout 109B total, Maverick 400B total, Behemoth ~2T (in training).
Context: Llama 3.1 / 3.2 / 3.3 reach 128K tokens. Llama 4 Maverick hits 1M; Scout's headline window is 10M tokens.
License: Custom Llama Community License, free to use and build on, with extra terms for platforms over 700M monthly users. Not OSI "open source".
The signature: how each generation changed the block
Every Llama is the same decoder block stacked many times: turn tokens into vectors, normalize, do attention, normalize again, run a feed-forward layer. What changed across generations is which version of each brick Meta swapped in.
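The block order described above can be sketched in a few lines of pure Python. This is a minimal illustration, not Meta's code: it processes a single token vector, uses plain lists as matrices, and takes a hypothetical `attn` callable as a stand-in for the attention sub-layer. All function and parameter names are assumptions made for the example.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale x by its root-mean-square, then apply a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * scale for v, w in zip(x, weight)]

def matvec(matrix, x):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

def silu(z):
    """SiLU (swish) activation used inside SwiGLU."""
    return z / (1.0 + math.exp(-z))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: three weight matrices (gate, up, down)."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def decoder_block(x, attn, norm1_w, norm2_w, w_gate, w_up, w_down):
    """Pre-norm block: normalize, attend, add; normalize, feed forward, add."""
    h = [a + b for a, b in zip(x, attn(rms_norm(x, norm1_w)))]
    ffn_out = swiglu_ffn(rms_norm(h, norm2_w), w_gate, w_up, w_down)
    return [a + b for a, b in zip(h, ffn_out)]

# Tiny usage example: model width 2, FFN width 3, identity "attention" stub.
x = [1.0, 2.0]
ones = [1.0, 1.0]
w_gate = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
w_up = [[0.2, 0.1], [0.4, 0.3], [0.6, 0.5]]
w_down = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
y = decoder_block(x, lambda v: v, ones, ones, w_gate, w_up, w_down)
```

Both normalizations sit before their sub-layer and each sub-layer's output is added back to its input, which is what "pre-norm" refers to.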
Llama 1 set the base. Llama 2 introduced GQA (sharing each key/value head, the attention "memory", across a group of query heads) on its 70B model. Llama 3 made GQA standard, switched to a 128K-token tokenizer, and raised RoPE's base frequency to 500,000 for longer context. Llama 4 replaced the single feed-forward layer with many experts and re-wired attention into iRoPE. In each generation, everything that was not swapped is inherited from the version before.
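To make the GQA idea concrete, here is a small sketch of how query heads map onto shared key/value heads and what that does to the KV cache. The shape used (64 query heads, 8 KV heads, 80 layers, head dimension 128, 2-byte values) is an assumption chosen to resemble a 70B-class model, and the function names are hypothetical.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Return the index of the key/value head that a query head shares."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values stored for one token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Assumed shape, for illustration only.
N_LAYERS, N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 80, 64, 8, 128

mha_bytes = kv_cache_bytes_per_token(N_LAYERS, N_Q_HEADS, HEAD_DIM)   # one KV head per query head
gqa_bytes = kv_cache_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM)  # 8 query heads per KV head
print(f"MHA: {mha_bytes / 1024:.0f} KiB per token, GQA: {gqa_bytes / 1024:.0f} KiB per token")
```

Under these assumptions the cache shrinks by the ratio of query heads to KV heads, which is where the inference savings come from.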
Llama 4's trick: iRoPE interleaves local and global layers
To read millions of tokens without melting a GPU, Llama 4 mixes two kinds of attention layers. Most layers look only at a nearby chunk; a few special layers look at everything.
In RoPE local layers, a token attends only inside an 8,192-token window (a banded mask) and carries position information via RoPE, which keeps these layers cheap and fast. Roughly every fourth layer is a NoPE global layer: no positional encoding, but full attention over the whole sequence (the usual lower-triangular causal mask), with an inference-time attention-temperature adjustment so it generalizes to very long inputs.
Read either pattern as an attention mask: each row is the token doing the looking, each column is a token it may look at, and a filled cell means attention is allowed.
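The two mask shapes can be drawn with a short sketch. It follows the description in this section rather than Meta's implementation: a causal banded window for local layers and a plain lower-triangular mask for global layers. The function names, the tiny sizes, and the choice to place the global layer last in each group of four are assumptions for illustration.

```python
def global_mask(n):
    """Full causal mask: each token sees itself and every earlier token."""
    return [[col <= row for col in range(n)] for row in range(n)]

def local_mask(n, window):
    """Causal banded mask: each token sees itself and the previous window - 1 tokens."""
    return [[0 <= row - col < window for col in range(n)] for row in range(n)]

def layer_kinds(n_layers, global_every=4):
    """Label layers so that every global_every-th one is global (assumed placement)."""
    return ["global" if (i + 1) % global_every == 0 else "local" for i in range(n_layers)]

def render(mask):
    """Draw a mask with '#' for allowed and '.' for blocked."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in mask)

# Toy sizes: 6 tokens and a window of 3 instead of 8,192.
print(render(local_mask(6, 3)))
print()
print(render(global_mask(6)))
print(layer_kinds(8))
```

In the local mask the number of filled cells per row is capped by the window, so cost grows linearly with sequence length; in the global mask it grows with the row index, which is why only a minority of layers use it.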
vs the shared recipe
Llama mostly kept the recipe it invented — and made a few deliberate swaps for speed and scale.
- RMSNorm, pre-norm — same in every generation.
- SwiGLU FFN (3 weight matrices) — since Llama 1.
- RoPE rotary positions — since Llama 1.
- Decoder-only stack — the standard text-generation shape.
- GQA — now the default for cheap inference, copied widely.
- RoPE base → 500,000 at Llama 3 to support longer context.
- Tokenizer grew from 32K (SentencePiece BPE) to 128K (tiktoken-style): roughly 15% fewer tokens for the same text, at the cost of a bigger embedding table.
- Llama 4 → MoE: huge total params, small active slice; routing adds complexity.
- iRoPE replaces uniform RoPE with local + global interleave for million-token context.
- Llama 4 is multimodal — text and vision in one model.
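The effect of the RoPE base change in the list above can be checked with a little arithmetic. In the standard RoPE formulation, dimension pair i rotates by position × base^(-2i/d) radians, so its wavelength is 2π × base^(2i/d) positions. The sketch below assumes a head dimension of 128 and uses 10,000 as the earlier base; neither value is stated in this article, so treat both as assumptions.

```python
import math

def rope_wavelengths(head_dim, base):
    """Wavelength, in token positions, of each rotary dimension pair."""
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]

HEAD_DIM = 128  # assumed head dimension

old = rope_wavelengths(HEAD_DIM, 10_000)    # assumed earlier base
new = rope_wavelengths(HEAD_DIM, 500_000)   # base named in the text for Llama 3

print(f"longest wavelength, base 10,000:  {old[-1]:,.0f} positions")
print(f"longest wavelength, base 500,000: {new[-1]:,.0f} positions")
```

A larger base stretches the slowest-rotating pairs from tens of thousands of positions to millions, so distant positions still get distinguishable rotations. The fastest pair is unchanged, so short-range resolution is preserved.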
Gotchas / good to know
- "Open" needs an asterisk. The Llama Community License is source-available, not OSI open source — and platforms over 700M monthly active users need a separate agreement.
- Headline context is a ceiling, not a guarantee. Scout's 10M tokens is the architectural max; quality at extreme lengths depends heavily on how you use it.
- MoE "total" vs "active" params differ a lot. Maverick is 400B total but only ~17B active per token — memory footprint tracks total, compute tracks active.
- Tokenizer changed at Llama 3. Code and token counts from the 32K SentencePiece era don't carry over to the 128K tiktoken-style vocab.
- iRoPE is new and subtle. NoPE global layers and attention-temperature scaling are recent ideas; treat extreme-length behavior as still maturing.
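The total-versus-active point above comes down to simple arithmetic. The sketch below uses the Maverick figures from this page (400B total, about 17B active) and two common rules of thumb that are assumptions here: 2 bytes per parameter for 16-bit weights, and about 2 FLOPs per active parameter per token for a forward pass. It ignores the KV cache, activations, and quantization.

```python
def weight_memory_gb(total_params, bytes_per_param=2):
    """Memory needed just to hold the weights, in gigabytes (10**9 bytes)."""
    return total_params * bytes_per_param / 1e9

def forward_flops_per_token(active_params):
    """Rough forward-pass cost: about 2 FLOPs per active parameter."""
    return 2 * active_params

TOTAL_PARAMS = 400e9   # Maverick total, from the text
ACTIVE_PARAMS = 17e9   # Maverick active per token, from the text

print(f"weights at 16-bit: {weight_memory_gb(TOTAL_PARAMS):.0f} GB")
print(f"forward compute:   {forward_flops_per_token(ACTIVE_PARAMS) / 1e9:.0f} GFLOPs per token")
print(f"active fraction:   {ACTIVE_PARAMS / TOTAL_PARAMS:.2%}")
```

Under these assumptions the model needs the memory of a 400B model but does per-token compute closer to that of a 17B dense model.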
Related
How Open-Source LLMs Are Built
The shared recipe — RoPE, RMSNorm, SwiGLU, GQA, MoE — that Llama and its peers all draw from.
Mistral / Mixtral
Another GQA + MoE family — compare its sliding-window attention and sparse experts to Llama 4's iRoPE.
DeepSeek
A different MoE + attention path (MLA, fine-grained experts) — a useful contrast to Llama's choices.
Transformer Architecture
The attention-and-FFN foundation every decoder-only LLM, Llama included, is built on.