Gemma (Google)
What makes it distinctive
Gemma is Google DeepMind's open-weights model family, and its trick is making long context cheap. Instead of every layer attending to every token, most layers look only at a short window of nearby tokens; only occasional layers look at the whole text.
That one design choice keeps the model's memory cache (the KV-cache) small while still reaching a 128K-token context: interleave many local sliding-window layers (each sees only nearby tokens) with a few global layers (each sees everything). Pair that with a huge 256K-token vocabulary (embeddings shared between input and output) for strong multilingual coverage, and you have Gemma's two signatures. The rest is the familiar decoder-only recipe.
Every model here starts from the same blueprint — attention, feed-forward, normalization, positions. See How Open-Source LLMs Are Built for the bricks, then come back to see what Gemma swaps out.
The family so far
Three generations arrived in roughly a year (Gemma 1 through Gemma 3), each more capable at a comparable size, alongside an on-device variant and, later, Gemma 4.
Gemma 1 (2B / 7B, Feb 2024) → Gemma 2 (2B / 9B / 27B, mid-2024) → Gemma 3 (1B / 4B / 12B / 27B, Mar 2025). Gemma 3n (E2B / E4B, 2025) is the on-device version. Gemma 4 (Apr–Jun 2026: E2B / E4B / 12B / 26B MoE / 31B) adds native audio, 256K context and 140+ languages.
Gemma 3 adds a vision encoder so the 4B/12B/27B models read images, not just text. The 1B model stays text-only.
Gemma 3 reaches 128K tokens (4B/12B/27B; 32K for 1B). Gemma 2 was 8K. The local/global trick is what makes the jump affordable.
Custom open-weights license. Commercial use is allowed under a prohibited-use policy. It is not OSI/Apache/MIT — read the terms.
Signature: mostly-local attention
Normal attention lets every token attend to every earlier token — accurate, but the memory cost (the KV-cache, the stored keys/values for past tokens) grows with context length. Gemma's fix: make most layers local (each token only attends to a short sliding window of recent tokens), and sprinkle in occasional global layers that see the whole sequence so information can still travel far.
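To make the two layer types concrete, here is a minimal sketch of the attention masks they imply. The function names are hypothetical, and the convention that the window counts the current token is an assumption; real implementations differ on that detail.

```python
def causal_mask(n):
    """Global layer: token i may attend to every token j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n, window):
    """Local layer: token i may attend only to the last `window` tokens.

    Assumption: the window includes token i itself.
    """
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def render(mask):
    """Draw a mask as text: '#' = allowed, '.' = blocked."""
    return "\n".join("".join("#" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    print(render(causal_mask(6)))
    print()
    print(render(sliding_window_mask(6, window=3)))
```

The global mask is a full lower triangle, so its size grows with the sequence. The local mask is a fixed-width diagonal band, which is why a local layer only ever needs the last `window` keys and values.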
Gemma 2 alternated 1 local : 1 global with a 4096-token window. Gemma 3 pushed the ratio to 5 local : 1 global and shrank the window to 1024. With far fewer expensive global layers, the KV-cache stays small even at 128K.
Figure (interactive in the original): the layer stack and attention masks for Gemma 2 and Gemma 3. Blue = local sliding-window layer · Orange = global full-attention layer. The small grids show one layer's attention mask (lit = a token it can look at).
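The effect of the ratio and window on the cache can be estimated with simple counting. The sketch below is a back-of-the-envelope illustration, not a measurement: the 12-layer depth is a hypothetical value chosen to divide evenly, the function names are made up for this example, and both configurations are evaluated at 128K purely for comparison.

```python
def layer_pattern(num_layers, local_per_global):
    """Repeat `local_per_global` local layers, then one global layer."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local"
            for i in range(num_layers)]


def kv_cache_positions(pattern, context_len, window):
    """Token positions kept in the KV-cache, summed over all layers.

    A global layer caches every position; a local layer needs only the
    last `window`. Multiply by 2 * kv_heads * head_dim * bytes_per_value
    to turn this count into bytes.
    """
    return sum(context_len if kind == "global" else min(window, context_len)
               for kind in pattern)


if __name__ == "__main__":
    NUM_LAYERS = 12        # hypothetical depth, chosen to divide evenly
    CONTEXT = 128 * 1024   # 128K tokens
    all_global = kv_cache_positions(["global"] * NUM_LAYERS, CONTEXT, CONTEXT)
    for name, ratio, window in [("1:1, window 4096", 1, 4096),
                                ("5:1, window 1024", 5, 1024)]:
        used = kv_cache_positions(layer_pattern(NUM_LAYERS, ratio), CONTEXT, window)
        print(f"{name}: {used / all_global:.1%} of an all-global cache")
```

Under these assumptions the 1:1 layout keeps roughly half of an all-global cache, while the 5:1 layout keeps under a fifth. At long context the global layers dominate the total, so the ratio matters more than the window size.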
Signature: double RMSNorm + GeGLU
Two more Gemma details. First, normalization: most models normalize only the input of each sub-layer (pre-norm). Since Gemma 2, Gemma normalizes both the input and the output of every attention and feed-forward block, a "double" RMSNorm sandwich that stabilizes training. (Gemma 3 also adds QK-norm, which normalizes the query/key vectors and replaces Gemma 2's attention soft-capping.)
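The difference between plain pre-norm and the double-norm sandwich is easiest to see side by side. This is a simplified sketch on plain Python lists: `sublayer` stands in for attention or the feed-forward network, the function names are hypothetical, and the learned scale is applied as a plain multiplier, which may not match how a given Gemma implementation parameterizes it.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x so its root-mean-square is about 1, then apply a learned scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def pre_norm_block(x, sublayer, w_pre):
    """Common layout: normalize the input, run the sub-layer, add the residual."""
    y = sublayer(rms_norm(x, w_pre))
    return [a + b for a, b in zip(x, y)]


def sandwich_norm_block(x, sublayer, w_pre, w_post):
    """Double-norm layout: also normalize the sub-layer output before the residual add."""
    y = rms_norm(sublayer(rms_norm(x, w_pre)), w_post)
    return [a + b for a, b in zip(x, y)]


if __name__ == "__main__":
    ones = [1.0, 1.0]
    loud = lambda h: [100.0 * v for v in h]  # a sub-layer with a very large output
    print(pre_norm_block([3.0, 4.0], loud, ones))
    print(sandwich_norm_block([3.0, 4.0], loud, ones, ones))
```

With pre-norm alone, the oversized sub-layer output swamps the residual stream. With the second norm, the branch is rescaled to unit RMS before it is added back, which is one intuition for why the sandwich helps stability.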
Second, the feed-forward network uses GeGLU, a gated unit whose gate passes through GELU (a smooth activation). Llama-style models use SwiGLU instead, with the gate passing through SiLU/Swish. Same gated idea, different activation.
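A gated feed-forward unit fits in a few lines, with the activation as the only thing that changes between GeGLU and SwiGLU. The sketch below uses tiny hand-written matrices and hypothetical names (`W_gate`, `W_up`, `W_down`). It uses the exact erf form of GELU; implementations often use a tanh approximation instead.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    """SiLU / Swish: x times sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def gated_ffn(x, W_gate, W_up, W_down, activation):
    """down( activation(gate(x)) * up(x) ), the shared GLU pattern."""
    gate = [activation(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)


if __name__ == "__main__":
    x = [1.0, 2.0]
    W_gate = [[1.0, 0.0], [0.0, 1.0]]
    W_up = [[2.0, 0.0], [0.0, 2.0]]
    W_down = [[1.0, 1.0]]
    print("GeGLU :", gated_ffn(x, W_gate, W_up, W_down, gelu))
    print("SwiGLU:", gated_ffn(x, W_gate, W_up, W_down, silu))
```

Passing `gelu` gives the GeGLU variant and passing `silu` gives SwiGLU; the weights and data flow are identical.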
vs the shared recipe
Map each generic brick to Gemma's choice — what it keeps from the standard decoder-only blueprint, and where it diverges.
- Decoder-only, dense. No mixture-of-experts through Gemma 3; every parameter runs for every token.
- GQA attention. Grouped-query attention shrinks the KV-cache, like its peers.
- RoPE positions. Rotary position encoding throughout.
- Gated FFN. A gated feed-forward unit (its GeGLU is a GELU cousin of SwiGLU).
- Local + global layers. Mostly short windows (5:1 in Gemma 3) instead of full attention everywhere — saves memory, adds design complexity.
- Double RMSNorm. Pre- AND post-norm on every sub-layer, not just pre-norm.
- Per-layer RoPE base. Gemma 3 uses base 1,000,000 on global layers, 10,000 on local — tuned for 128K context.
- Huge 256K vocab. Great multilingual reach, but a big embedding table; mitigated by tying input and output embeddings.
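The per-layer RoPE base in the list above can be illustrated by computing the rotation frequencies each base produces. This is a sketch of the standard RoPE frequency formula only; the head dimension of 128 is an illustrative assumption rather than a Gemma setting, and the names are hypothetical.

```python
import math

# Bases quoted above for Gemma 3: small on local layers, large on global ones.
ROPE_BASE = {"local": 10_000.0, "global": 1_000_000.0}


def rope_inv_freq(head_dim, base):
    """Standard RoPE frequencies: one per pair of dimensions, base^(-2k/d)."""
    return [base ** (-2.0 * k / head_dim) for k in range(head_dim // 2)]


def longest_wavelength(head_dim, base):
    """Positions needed for the slowest-rotating pair to complete one full turn."""
    return 2.0 * math.pi / rope_inv_freq(head_dim, base)[-1]


if __name__ == "__main__":
    HEAD_DIM = 128  # illustrative assumption
    for kind, base in ROPE_BASE.items():
        wavelength = longest_wavelength(HEAD_DIM, base)
        print(f"{kind:6s} base={base:>11,.0f} longest wavelength ~ {wavelength:,.0f} tokens")
```

A larger base slows the slowest rotations, so positions far apart remain distinguishable. That matters for global layers spanning 128K tokens, while a local layer with a 1024-token window is well served by the smaller base.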
Gotchas / good to know
- Licensing. Gemma ships under Google's custom Gemma Terms of Use, not Apache/MIT. Commercial use is allowed but bound by a prohibited-use policy — check it for your case.
- Gemma 3n is not sparse. Gemma 1 through 3 are fully dense, and Gemma 3n uses MatFormer nested feed-forwards plus Per-Layer Embeddings for elastic on-device sizing. That is nested-dense, not a sparse mixture-of-experts.
- The 1B model is the exception. Only Gemma 3 4B and up are multimodal and reach 128K; the 1B stays text-only at 32K.
- Big vocab, big embeddings. The 256K token vocabulary means a large embedding matrix; tied input/output embeddings help, but it still shapes the smallest models' parameter budget.
- Generation differences matter. Gemma 2 used soft-capping + 4096 windows + 1:1 ratio; Gemma 3 replaced soft-capping with QK-norm, shrank windows to 1024, and went 5:1. Don't assume one config across versions.
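The swap from soft-capping to QK-norm mentioned in the last bullet is two different ways of keeping attention logits bounded. The sketch below shows both in simplified form. The cap of 50.0 is an illustrative value, the QK-norm here omits the learned scale a real layer would carry, and all function names are hypothetical.

```python
import math


def soft_cap(logits, cap):
    """Soft-capping: squash logits smoothly into (-cap, cap) with tanh."""
    return [cap * math.tanh(z / cap) for z in logits]


def rms_normalize(v, eps=1e-6):
    """Rescale a vector to unit root-mean-square (no learned scale here)."""
    rms = math.sqrt(sum(c * c for c in v) / len(v) + eps)
    return [c / rms for c in v]


def qk_norm_logit(q, k):
    """QK-norm: normalize query and key first, then take the scaled dot product."""
    qn, kn = rms_normalize(q), rms_normalize(k)
    return sum(a * b for a, b in zip(qn, kn)) / math.sqrt(len(q))


if __name__ == "__main__":
    print(soft_cap([1.0, 40.0, 1000.0], cap=50.0))  # large values flatten out near the cap
    big = [100.0, 0.0, 0.0, 0.0]
    print(qk_norm_logit(big, big))                   # bounded however large q and k are
```

Soft-capping bounds the logit after the dot product, while QK-norm bounds the inputs so the dot product cannot blow up in the first place. Code written for one generation's attention will not reproduce the other's outputs.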
Related
How Open-Source LLMs Are Built
The shared decoder-only recipe and the bricks every model swaps.
Mistral & Mixtral
The other sliding-window family — useful contrast for Gemma's local layers.
Llama (Meta)
Single pre-norm + SwiGLU — the baseline Gemma's double-norm and GeGLU diverge from.
Transformer Architecture
Attention, feed-forward, normalization — the fundamentals behind every brick here.