Gemma (Google): Local/Global Attention and a 256K Vocabulary

What makes it distinctive

Gemma is Google DeepMind's open-weights family, and its trick is making long context cheap. Instead of every layer attending to every token, most layers only look at a short window of nearby tokens; just occasional layers look at the whole text.

That one design choice, interleaving many local sliding-window layers (each sees only nearby tokens) with a few global layers (each sees everything), keeps the KV-cache small while still reaching a 128K-token context. Pair that with a huge 256K-token vocabulary (with input and output embeddings tied) for strong multilingual coverage, and you have Gemma's two signatures. The rest is the familiar decoder-only recipe.

New to the shared recipe?

Every model here starts from the same blueprint: attention, feed-forward, normalization, positions. See "How Open-Source LLMs Are Built" for the bricks, then come back to see what Gemma swaps out.

The family so far

Four generations in a little over two years, each aiming to do more per parameter than the last, plus an on-device variant.

Timeline & sizes: 4 generations

Gemma 1 (2B / 7B, Feb 2024) → Gemma 2 (2B / 9B / 27B, mid-2024) → Gemma 3 (1B / 4B / 12B / 27B, Mar 2025). Gemma 3n (E2B / E4B, 2025) is the on-device version. Gemma 4 (Apr–Jun 2026: E2B / E4B / 12B / 26B MoE / 31B) adds native audio, 256K context and 140+ languages.

Multimodal: Gemma 3 (4B+)

Gemma 3 adds a vision encoder so the 4B/12B/27B models read images, not just text. The 1B model stays text-only.

Context window: 128K (Gemma 3)

Gemma 3 reaches 128K tokens (4B/12B/27B; 32K for 1B). Gemma 2 was 8K. The local/global trick is what makes the jump affordable. Gemma 4's larger models extend this to 256K.

License: Gemma Terms

Gemma 1-3 use Google's custom Gemma Terms of Use: commercial use is allowed, subject to a prohibited-use policy, but this is not an OSI-approved license such as Apache 2.0 or MIT. Gemma 4 switched to standard Apache 2.0.

Signature: mostly-local attention

Normal attention lets every token attend to every earlier token. That is accurate, but the memory cost (the KV-cache, the stored keys and values for past tokens) grows linearly with context length in every layer. Gemma's fix: make most layers local (each token only attends to a short sliding window of recent tokens), and sprinkle in occasional global layers that see the whole sequence so information can still travel far.
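As a minimal sketch of the difference between the two layer types, the attention masks can be built directly. The function names `causal_mask` and `show`, and the convention that the window counts the current token, are assumptions for illustration, not Gemma's actual implementation.

```python
def causal_mask(seq_len, window=None):
    """Return a seq_len x seq_len grid; mask[q][k] is True if query q may attend to key k.

    window=None models a global (full causal) layer. An integer window models a
    local layer where each token sees itself and at most window - 1 earlier tokens.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q                              # causal: never look ahead
            if window is not None:
                visible = visible and (q - k) < window    # local: only recent tokens
            row.append(visible)
        mask.append(row)
    return mask


def show(mask):
    """Render a mask as text: '#' = visible, '.' = masked."""
    return "\n".join("".join("#" if v else "." for v in row) for row in mask)


print("global layer:")
print(show(causal_mask(6)))
print("local layer, window=3:")
print(show(causal_mask(6, window=3)))
```

With `window=3`, no row lights more than three cells, so a local layer only ever needs the most recent `window` keys and values in its cache, regardless of how long the sequence gets.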

Gemma 2 alternated 1 local : 1 global with a 4096-token window. Gemma 3 pushed the ratio to 5 local : 1 global and shrank the window to 1024 tokens. That means far fewer expensive global layers, so the KV-cache stays small even at a 128K-token context.

[Interactive figure: layer stacks for Gemma 2 and Gemma 3, marking local sliding-window layers and global full-attention layers, with a small grid showing each layer's attention mask (lit = a token it can look at).]
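To get a feel for why the ratio matters, the rough sketch below counts how many token positions the KV-cache must hold across all layers. The 48-layer depth, the choice to put the local layers first in each repeating group, and applying the 1:1 pattern at 128K are assumptions for illustration only (Gemma 2 itself stopped at 8K). Real memory use also depends on the number of KV heads, head size, and precision, which are left out here.

```python
def layer_pattern(num_layers, locals_per_global):
    """Repeat `locals_per_global` local layers followed by one global layer."""
    period = locals_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(num_layers)]


def cached_positions(pattern, context_len, window):
    """Total token positions held in the KV-cache, summed over layers."""
    total = 0
    for kind in pattern:
        # A global layer keeps every past token; a local layer keeps only its window.
        total += context_len if kind == "global" else min(window, context_len)
    return total


NUM_LAYERS = 48          # hypothetical depth, not taken from any Gemma config
CONTEXT = 128 * 1024     # 128K tokens

full = cached_positions(["global"] * NUM_LAYERS, CONTEXT, window=CONTEXT)
one_to_one = cached_positions(layer_pattern(NUM_LAYERS, 1), CONTEXT, window=4096)
five_to_one = cached_positions(layer_pattern(NUM_LAYERS, 5), CONTEXT, window=1024)

for name, positions in [
    ("all global", full),
    ("1:1, window 4096", one_to_one),
    ("5:1, window 1024", five_to_one),
]:
    share = positions / full
    print(f"{name:>18}: {positions:>9,} cached positions ({share:.0%} of full attention)")
```

Under these assumptions the 5:1 pattern caches roughly a sixth of what full attention would, and nearly all of that remainder comes from the handful of global layers.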

Signature: double RMSNorm + GeGLU

Two more Gemma details. First, normalization: most models normalize only the input of each sub-layer (pre-norm). Gemma normalizes both the input and the output of every attention and feed-forward block, a "double" RMSNorm sandwich that stabilizes training. (Gemma 3 also adds QK-norm, which normalizes the query and key vectors and replaces Gemma 2's attention soft-capping.)
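A minimal sketch of the sandwich, using plain Python lists as vectors. The names `rms_norm` and `sandwich_sublayer`, the plain multiplicative gain, and the `eps` value are assumptions for illustration; real implementations work on batched tensors and may parameterize the gain differently.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Rescale x so its root-mean-square is about 1, then apply a per-channel gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def sandwich_sublayer(x, sublayer, pre_weight, post_weight):
    """Residual sub-layer with RMSNorm on both the input and the output of `sublayer`."""
    h = rms_norm(x, pre_weight)             # pre-norm: normalize what goes in
    h = sublayer(h)                         # attention or feed-forward (any callable here)
    h = rms_norm(h, post_weight)            # post-norm: normalize what comes out
    return [a + b for a, b in zip(x, h)]    # residual add happens after the post-norm


# Toy usage: a stand-in sub-layer that inflates its input by 100x.
x = [1.0, 2.0, 3.0, 4.0]
ones = [1.0] * len(x)
inflate = lambda h: [100.0 * v for v in h]

out = sandwich_sublayer(x, inflate, ones, ones)
update = [o - v for o, v in zip(out, x)]
update_rms = math.sqrt(sum(u * u for u in update) / len(update))
print(f"RMS of the update added to the residual stream: {update_rms:.4f}")
```

Even though the stand-in sub-layer multiplies its input by 100, the update added to the residual stream keeps a root-mean-square close to 1, which is the kind of scale control the post-norm is there to provide.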

Second, the feed-forward network uses GeGLU, a gated unit where the gate passes through GELU (a smooth activation). Llama-style models use SwiGLU instead (gate through SiLU/Swish). Same gated idea, different activation.
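The two gated units differ only in the activation applied to the gate, which a small sketch makes concrete. The function names, the tiny weight matrices, and the use of the exact erf form of GELU are assumptions for illustration (some implementations use a tanh approximation of GELU instead).

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    """SiLU / Swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def gated_ffn(x, w_gate, w_up, w_down, activation):
    """Gated feed-forward: down( activation(gate(x)) * up(x) )."""
    gate = [activation(g) for g in matvec(w_gate, x)]   # the only step that differs
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]          # element-wise gating
    return matvec(w_down, hidden)


# Hypothetical toy weights: 2 inputs, 3 hidden units, 2 outputs.
x = [1.0, -2.0]
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
w_down = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]

geglu_out = gated_ffn(x, w_gate, w_up, w_down, gelu)    # Gemma-style gate
swiglu_out = gated_ffn(x, w_gate, w_up, w_down, silu)   # Llama-style gate
print("GeGLU :", geglu_out)
print("SwiGLU:", swiglu_out)
```

Passing the activation in as an argument keeps the rest of the block identical, which mirrors the point in the text: the structure is shared and only the gate's nonlinearity changes.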

vs the shared recipe

Map each generic brick to Gemma's choice — what it keeps from the standard decoder-only blueprint, and where it diverges.

Keeps from the recipe

Decoder-only, dense. No mixture-of-experts through Gemma 3; every parameter runs for every token (Gemma 4 adds a 26B MoE option).
GQA. Grouped-query attention lets several query heads share one key/value head, which shrinks the KV-cache, as in peer models.
RoPE positions. Rotary position encoding throughout.
Gated FFN. A gated feed-forward unit (GeGLU is the GELU-gated cousin of SwiGLU).
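For the grouped-query attention item in the list above, the sharing can be sketched as a simple index mapping from query heads to key/value heads. The head counts and the function name `kv_head_for` are hypothetical, chosen only to keep the example small.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Index of the key/value head that a given query head reads from."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must split evenly across KV heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


NUM_Q_HEADS = 8    # hypothetical
NUM_KV_HEADS = 2   # hypothetical

mapping = [kv_head_for(q, NUM_Q_HEADS, NUM_KV_HEADS) for q in range(NUM_Q_HEADS)]
cache_ratio = NUM_KV_HEADS / NUM_Q_HEADS   # KV-cache size relative to one KV head per query head

print("query head -> KV head:", mapping)
print(f"KV-cache size vs. full multi-head attention: {cache_ratio:.0%}")
```

Only keys and values are cached, so the cache scales with the number of KV heads rather than the number of query heads.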

Changes / trade-offs

Local + global layers. Mostly short windows (5:1 in Gemma 3) instead of full attention everywhere — saves memory, adds design complexity.
Double RMSNorm. Pre- AND post-norm on every sub-layer, not just pre-norm.
Per-layer RoPE base. Gemma 3 uses base 1,000,000 on global layers, 10,000 on local — tuned for 128K context.
Huge 256K vocab. Great multilingual reach, but a big embedding table; mitigated by tying input and output embeddings.
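To see what the per-layer RoPE base in the list above changes, the sketch below computes the wavelength, in tokens, of each rotary frequency pair using the standard RoPE formula, where pair i rotates at base^(-2i/d) radians per token. The head size of 128 and the function names are assumptions for illustration.

```python
import math


def rope_wavelengths(head_dim, base):
    """Wavelength in tokens of each rotary pair: 2*pi * base**(2i / head_dim)."""
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


def rope_base_for(layer_kind):
    """Per-layer base as described for Gemma 3: 1,000,000 global, 10,000 local."""
    return 1_000_000 if layer_kind == "global" else 10_000


HEAD_DIM = 128   # hypothetical head size

for kind in ("local", "global"):
    base = rope_base_for(kind)
    longest = max(rope_wavelengths(HEAD_DIM, base))
    print(f"{kind:>6} layer, base {base:>9,}: longest wavelength ~ {longest:,.0f} tokens")
```

With these assumed numbers, the longest wavelength on a global layer comfortably exceeds a 128K-token context, while the base-10,000 setting already spans far more than a local layer's 1024-token window. A larger base stretches the slow-rotating pairs so that distant positions remain distinguishable.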

Gotchas / good to know

Read before you build on it

Licensing. Gemma 1-3 ship under the custom Gemma Terms of Use (commercial allowed, prohibited-use policy); Gemma 4 moved to Apache 2.0 — check which generation you are using.
Not sparse. Gemma 1-3 are fully dense (Gemma 4 later adds a 26B MoE variant). Gemma 3n uses MatFormer nested feed-forward networks plus Per-Layer Embeddings for elastic on-device sizing; that is nested-dense, not a sparse mixture-of-experts.
The 1B model is the exception. Only Gemma 3 4B and up are multimodal and reach 128K; the 1B stays text-only at 32K.
Big vocab, big embeddings. The 256K token vocabulary means a large embedding matrix; tied input/output embeddings help, but it still shapes the smallest models' parameter budget.
Generation differences matter. Gemma 2 used attention soft-capping, 4096-token windows, and a 1:1 local:global ratio; Gemma 3 replaced soft-capping with QK-norm, shrank the window to 1024 tokens, and moved to 5:1. Don't assume one config across versions.
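The "Big vocab, big embeddings" point above is simple arithmetic: the table holds vocabulary size times model width parameters, and tying means the output projection reuses that table instead of adding a second copy. The sketch below assumes "256K" means 256 × 1024 tokens and uses hypothetical model widths, not values taken from any Gemma config.

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameters spent on token embeddings plus the output projection."""
    table = vocab_size * d_model
    # Tied: the output layer reuses the input table. Untied: a second table of the same size.
    return table if tied else 2 * table


VOCAB_SIZE = 256 * 1024   # assumption: "256K" read as 262,144 tokens

for d_model in (1024, 2048, 4096):   # hypothetical model widths
    tied = embedding_params(VOCAB_SIZE, d_model, tied=True)
    untied = embedding_params(VOCAB_SIZE, d_model, tied=False)
    print(f"d_model={d_model}: tied {tied / 1e6:,.0f}M params, untied {untied / 1e6:,.0f}M params")
```

Because the cost scales with width rather than depth, the embedding table takes a proportionally larger bite out of a small model's budget than a large one's.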