Llama (Meta): How the Architecture Works

What makes it distinctive

Llama is the model family that set the modern open recipe. When Meta shipped Llama 1 in February 2023, it bundled three existing techniques (RMSNorm, SwiGLU, and RoPE) into one clean decoder stack that most open LLMs since have followed.

After that it kept pushing the recipe forward: Grouped-Query Attention (GQA) for cheap, fast inference; a much bigger 128K tokenizer at Llama 3; and with Llama 4, a jump to Mixture-of-Experts plus a new attention scheme called iRoPE that stretches context into the multi-million-token range.

New to the building blocks?

Llama shares its skeleton with every other model here. Start with How Open-Source LLMs Are Built for the shared recipe — RoPE, RMSNorm, SwiGLU, GQA and MoE — then come back to see Llama's specific choices.

The family so far

Roughly two years of versions, from a 7B research weight drop to a near-2-trillion-parameter MoE herd. The shape stays familiar; the scale and the tricks keep growing.

Timeline: 2023 → 2025

Llama 1 (Feb 2023), Llama 2 (Jul 2023), Llama 3 (Apr 2024), 3.1 / 3.2 / 3.3, then the Llama 4 MoE herd (Apr 2025). As of mid-2026 Llama 4 is still the last open-weight Llama release.

Sizes: 1B → ~2T

Dense from 1B up to 405B (Llama 3.1). Llama 4 goes sparse: Scout 109B total, Maverick 400B, Behemoth ~2T (in training).

Context window: 128K → 10M

Llama 3.1 / 3.2 / 3.3 reach 128K tokens. Llama 4 Maverick hits 1M; Scout's headline window is 10M tokens.

License: Source-available

Custom Llama Community License: free to use and build on, with extra terms for platforms over 700M monthly active users. Not OSI "open source".

The signature: how each generation changed the block

Every Llama turns tokens into vectors once, then runs them through the same decoder block stacked many times: normalize, do attention, normalize again, run a feed-forward layer, with a residual connection around each of the two steps. What changed across generations is which version of each brick Meta swapped in.
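To make the block order concrete, here is a minimal pure-Python sketch of one pre-norm block acting on a single token vector. It is an illustration under stated assumptions, not Meta's code: the function names, the toy weights, the `eps` value, and the stubbed-out attention are all made up for this example.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """RMSNorm: divide x by its root-mean-square, then apply a learned gain.
    The eps default is an assumed value for illustration."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def matvec(matrix, x):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward with three weight matrices:
    down( silu(gate(x)) * up(x) )."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def decoder_block(x, attention, params):
    """Pre-norm block: normalize, attend, add; normalize, FFN, add."""
    attn_out = attention(rms_norm(x, params["attn_norm"]))
    h = [a + b for a, b in zip(x, attn_out)]
    ffn_out = swiglu_ffn(rms_norm(h, params["ffn_norm"]),
                         params["w_gate"], params["w_up"], params["w_down"])
    return [a + b for a, b in zip(h, ffn_out)]

# Toy usage: 4-dim token vector, 6-dim hidden layer, attention stubbed to zero.
dim, hidden = 4, 6
params = {
    "attn_norm": [1.0] * dim,
    "ffn_norm": [1.0] * dim,
    "w_gate": [[0.1] * dim for _ in range(hidden)],
    "w_up": [[0.2] * dim for _ in range(hidden)],
    "w_down": [[0.05] * hidden for _ in range(dim)],
}
x = [1.0, -2.0, 0.5, 3.0]
out = decoder_block(x, attention=lambda v: [0.0] * len(v), params=params)
print(out)
```

The two residual additions are why the normalization sits before each sub-layer ("pre-norm"): the raw vector `x` always has a direct path to the output, and only the branch inputs are normalized.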

Llama 1 set the base. Llama 2 introduced GQA on its 70B model: groups of query heads share one set of key/value heads, which shrinks the attention "memory" (the KV cache). Llama 3 made GQA standard, switched to a 128K-token tokenizer, and raised RoPE's base frequency to 500,000 for longer context. Llama 4 replaced the single feed-forward layer with many experts and re-wired attention into iRoPE.
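A rough sketch of what GQA buys you: the mapping from query heads to shared key/value heads, and the KV-cache size that falls out of it. The helper names are hypothetical, and the layer and head counts are illustrative values for a large dense model rather than figures taken from this article.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Return the index of the K/V head that a given query head reads from."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes of cached keys and values for one sequence.
    The leading 2 counts K and V; bytes_per_value=2 assumes 16-bit storage."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative configuration (assumed, not from the article).
n_layers, n_q_heads, n_kv_heads, head_dim, seq_len = 80, 64, 8, 128, 4096

# Classic multi-head attention keeps one K/V head per query head.
mha = kv_cache_bytes(n_layers, n_q_heads, head_dim, seq_len)
# GQA keeps only n_kv_heads K/V heads, each shared by a group of query heads.
gqa = kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len)

print(f"MHA cache: {mha / 2**30:.2f} GiB, GQA cache: {gqa / 2**30:.2f} GiB")
print([kv_head_for_query_head(q, n_q_heads, n_kv_heads) for q in range(0, 64, 8)])
```

With these assumed numbers the cache drops from 10 GiB to 1.25 GiB per 4,096-token sequence, an 8x saving that matches the ratio of query heads to K/V heads.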


Llama 4's trick: iRoPE interleaves local and global layers

To read millions of tokens without melting a GPU, Llama 4 mixes two kinds of attention layers. Most layers look only at a nearby chunk; a few special layers look at everything.

In RoPE local layers, a token attends only inside an 8,192-token window (a chunked mask that resets at fixed 8,192-token chunk boundaries) and carries position info via RoPE. Because no token looks further back than its own chunk, these layers stay cheap and fast. Every ~4th layer is a NoPE global layer: no positional encoding, but full attention over the whole sequence (the usual lower-triangular causal mask), with an inference-time temperature tweak so it generalizes to very long inputs.

The grid is an attention mask: row = the token doing the looking, column = the token it may look at. Filled cell = allowed.
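The two mask shapes described above can be reproduced in a few lines. This is a toy sketch: the chunk size is shrunk from 8,192 to 3 so the grid fits on screen, the function names are hypothetical, and placing the global layer at exactly every 4th position is an assumption (the text only says "every ~4th layer").

```python
def global_causal_mask(n):
    """Full causal mask: token i may look at every token j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def chunked_local_mask(n, chunk):
    """Chunked causal mask: token i may look at j <= i only inside its own chunk,
    so visibility resets at every chunk boundary."""
    return [[j <= i and j // chunk == i // chunk for j in range(n)]
            for i in range(n)]

def render(mask):
    """Draw the mask as text: row = looking token, column = token looked at."""
    return "\n".join("".join("#" if cell else "." for cell in row)
                     for row in mask)

def layer_kinds(n_layers, global_every=4):
    """Label layers, assuming every `global_every`-th one is a NoPE global layer."""
    return ["global-NoPE" if (i + 1) % global_every == 0 else "local-RoPE"
            for i in range(n_layers)]

print("local (chunk = 3):")
print(render(chunked_local_mask(6, 3)))
print("global:")
print(render(global_causal_mask(6)))
print(layer_kinds(8))
```

In the local grid the filled cells form small triangles along the diagonal, one per chunk; in the global grid they form one large lower triangle. The cost difference comes from the filled area: the local mask grows linearly with sequence length, the global one quadratically.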

vs the shared recipe

Llama mostly kept the recipe it popularized, and made a few deliberate swaps for speed and scale.

Keeps (and popularized)

RMSNorm, pre-norm — same in every generation.
SwiGLU FFN (3 weight matrices) — since Llama 1.
RoPE rotary positions — since Llama 1.
Decoder-only stack — the standard text-generation shape.
GQA — now the default for cheap inference, copied widely.
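Of the pieces in the list above, RoPE is the least obvious, so here is a small sketch of it. It rotates adjacent pairs of dimensions, which is one common convention (some implementations pair the first half of the vector with the second half instead). The function names are hypothetical, and the 10,000 default base is the widely used original RoPE value rather than a figure from this article.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """One rotation frequency per pair of dimensions: base^(-2i / head_dim).
    head_dim is assumed to be even."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def apply_rope(vec, position, base=10000.0):
    """Rotate each consecutive (even, odd) pair of vec by position * frequency."""
    out = []
    for i, freq in enumerate(rope_frequencies(len(vec), base)):
        angle = position * freq
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(angle) - y * math.sin(angle),
                x * math.sin(angle) + y * math.cos(angle)]
    return out

def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))

def longest_wavelength(head_dim, base):
    """Positions needed for the slowest-rotating pair to complete one full turn."""
    return 2.0 * math.pi / rope_frequencies(head_dim, base)[-1]

q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -1.0, 0.2]

# Same offset (2 positions apart) at two very different absolute positions.
near = dot(apply_rope(q, 5), apply_rope(k, 3))
far = dot(apply_rope(q, 1005), apply_rope(k, 1003))
print(abs(near - far) < 1e-9)

# A larger base stretches the slowest rotation over many more positions.
print(longest_wavelength(128, 10000.0), longest_wavelength(128, 500000.0))
```

The first check shows the property that makes RoPE attractive: the score between a query and a key depends only on how far apart they are, not on where they sit in the sequence. The second shows why raising the base helps long context: the slowest pair takes far longer to wrap around, so distant positions stay distinguishable.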

Changes / trade-offs

RoPE base → 500,000 at Llama 3 to support longer context.
Tokenizer grew from 32K (SentencePiece BPE) to 128K (tiktoken-style): roughly 15% fewer tokens for the same text, at the cost of a bigger embedding table.
Llama 4 → MoE: huge total params, small active slice; routing adds complexity.
iRoPE replaces uniform RoPE with local + global interleave for million-token context.
Llama 4 is multimodal — text and vision in one model.
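To show what "routing" means in the MoE swap above, here is a generic top-k router sketch. It is not Llama 4's actual router, whose gating details are not described here; the softmax gate, the function names, and the toy experts are all assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of router scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=1):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_ffn(x, router_logits, experts, k=1):
    """Run only the selected experts and return the weighted sum of their outputs."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out

# Toy experts: each one just scales its input by a different factor.
experts = [lambda v, s=s: [s * t for t in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 0.5]  # in a real model these come from a learned layer

print(route_top_k(router_logits, k=1))
print(moe_ffn([1.0, 2.0], router_logits, experts, k=1))
print(route_top_k(router_logits, k=2))
```

Only the chosen experts are ever called, which is the whole point: the model can hold many experts' worth of parameters while each token pays for just a few of them. The added complexity is everything around this function, such as keeping experts evenly loaded during training.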

Gotchas / good to know

Read before you build on it

"Open" needs an asterisk. The Llama Community License is source-available, not OSI open source — and platforms over 700M monthly active users need a separate agreement.
Headline context is a ceiling, not a guarantee. Scout's 10M tokens is the architectural max; quality at extreme lengths is not guaranteed, so test on your own long inputs before relying on it.
MoE "total" vs "active" params differ a lot. Maverick is 400B total but only ~17B active per token — memory footprint tracks total, compute tracks active.
Tokenizer changed at Llama 3. Tokenizer-dependent code, token IDs, and token counts from the 32K SentencePiece era don't carry over to the 128K tiktoken-style vocab.
iRoPE is new and subtle. NoPE global layers and attention-temperature scaling are recent ideas; treat extreme-length behavior as still maturing.
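The total-versus-active gap in the list above is easy to put numbers on. The sketch below uses the Maverick figures from this article (400B total, ~17B active) with two assumptions that are not from the source: weights stored at 2 bytes each, and the common rule of thumb of about 2 FLOPs per active parameter per token, which ignores attention cost. The function names are hypothetical.

```python
def weight_memory_gb(total_params, bytes_per_param=2):
    """Memory needed just to hold the weights, in GB (10^9 bytes).
    Every expert must be resident, so this scales with TOTAL parameters."""
    return total_params * bytes_per_param / 1e9

def forward_flops_per_token(active_params):
    """Rule-of-thumb forward cost: roughly 2 FLOPs per ACTIVE parameter per token."""
    return 2 * active_params

maverick_total = 400e9   # total parameters, from the article
maverick_active = 17e9   # approximate active parameters per token, from the article

print(f"weights: {weight_memory_gb(maverick_total):.0f} GB at 2 bytes/param")
print(f"compute: {forward_flops_per_token(maverick_active):.2e} FLOPs per token")
print(f"total / active: {maverick_total / maverick_active:.1f}x")
```

Under these assumptions the model needs about 800 GB for weights alone but computes like a model roughly 23 times smaller. Plan hardware around the first number and latency around the second.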