Gemma (Google)

decoder-only · dense (no MoE) · local + global attention · GeGLU · double RMSNorm · 256K vocab · 128K context

What makes it distinctive

Gemma is Google DeepMind's open-weights family — and its trick is making long context cheap. Instead of every layer looking at every word, most layers only peek at a short nearby window; just occasional layers look at the whole text.

That one design choice, interleaving many local sliding-window layers (each sees only nearby tokens) with a few global layers (each sees everything), keeps the model's KV-cache small while still reaching a 128K-token context. Pair that with a huge 256K-token vocabulary (with embeddings shared between input and output) for strong multilingual coverage, and you have Gemma's two signatures. The rest is the familiar decoder-only recipe.

New to the shared recipe?

Every model here starts from the same blueprint: attention, feed-forward, normalization, positions. See "How Open-Source LLMs Are Built" for the bricks, then come back to see what Gemma swaps out.

The family so far

Three generations in roughly a year (Gemma 1 through Gemma 3), plus an on-device variant and, later, Gemma 4.

Timeline & sizes · 4 generations

Gemma 1 (2B / 7B, Feb 2024) → Gemma 2 (2B / 9B / 27B, mid-2024) → Gemma 3 (1B / 4B / 12B / 27B, Mar 2025). Gemma 3n (E2B / E4B, 2025) is the on-device version. Gemma 4 (Apr–Jun 2026: E2B / E4B / 12B / 26B MoE / 31B) adds native audio, 256K context and 140+ languages.

Multimodal · Gemma 3 (4B+)

Gemma 3 adds a vision encoder so the 4B/12B/27B models read images, not just text. The 1B model stays text-only.

Context window · up to 128K

Gemma 3 reaches 128K tokens (4B/12B/27B; 32K for 1B). Gemma 2 was 8K. The local/global trick is what makes the jump affordable.

License · Gemma Terms

Custom open-weights license. Commercial use is allowed under a prohibited-use policy. It is not OSI/Apache/MIT — read the terms.

Signature: mostly-local attention

Normal attention lets every token attend to every earlier token — accurate, but the memory cost (the KV-cache, the stored keys/values for past tokens) grows with context length. Gemma's fix: make most layers local (each token only attends to a short sliding window of recent tokens), and sprinkle in occasional global layers that see the whole sequence so information can still travel far.
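The difference between a global and a local layer can be sketched as an attention mask. This is a minimal illustration, not Gemma's implementation: the function names are made up for this example, and the convention that the window counts the current token is an assumption.

```python
from typing import List, Optional

def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    window=None models a global (full causal) layer. An integer models a local
    sliding-window layer that sees only the last `window` tokens, itself
    included (an assumed convention for this sketch).
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q  # causal: never look at later tokens
            if window is not None:
                visible = visible and (q - k) < window  # local: stay inside the window
            row.append(visible)
        mask.append(row)
    return mask

def show(mask: List[List[bool]]) -> str:
    """Render a mask as text: '#' = visible, '.' = hidden."""
    return "\n".join("".join("#" if v else "." for v in row) for row in mask)

global_mask = causal_mask(6)           # every earlier token is visible
local_mask = causal_mask(6, window=3)  # only the 3 most recent tokens are visible

print(show(global_mask))
print()
print(show(local_mask))
```

In the global mask each row keeps growing with position, so the keys and values for every past token must stay cached. In the local mask no row ever has more than `window` visible entries, which is why a local layer only needs to cache the most recent `window` tokens.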

Gemma 2 alternated 1 local : 1 global with a 4096-token window. Gemma 3 pushed the ratio to 5 local : 1 global and shrank the window to 1024. With far fewer expensive global layers, the KV-cache stays small even at 128K.

Figure (layer stack and attention masks): blue = local sliding-window layer · orange = global full-attention layer. The small grids show one layer's attention mask (lit = a token it can look at).
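To see why the ratio and window size matter, it helps to count how many token positions each layer has to keep in its KV-cache. The sketch below does only that counting. The depth of 48 layers is a hypothetical value chosen so both ratios divide evenly, the position of the global layer within each group is an assumption, and the Gemma 2 style pattern is evaluated at 128K purely for comparison (Gemma 2 itself had an 8K context). Per-position cost (K and V, number of KV heads, head size, bytes per value) is a constant factor and is left out.

```python
def layer_pattern(num_layers, locals_per_global):
    """Repeat `locals_per_global` local layers followed by one global layer."""
    period = locals_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(num_layers)]

def cached_positions(pattern, context_len, window):
    """Total token positions held in the KV-cache, summed over all layers."""
    total = 0
    for kind in pattern:
        if kind == "global":
            total += context_len               # global layers cache everything
        else:
            total += min(window, context_len)  # local layers cache one window
    return total

NUM_LAYERS = 48        # hypothetical depth, not a real Gemma configuration
CONTEXT = 128 * 1024   # 131072 tokens

all_global = cached_positions(["global"] * NUM_LAYERS, CONTEXT, window=0)
gemma2_style = cached_positions(layer_pattern(NUM_LAYERS, 1), CONTEXT, window=4096)
gemma3_style = cached_positions(layer_pattern(NUM_LAYERS, 5), CONTEXT, window=1024)

for name, count in [("all global", all_global),
                    ("1:1, window 4096", gemma2_style),
                    ("5:1, window 1024", gemma3_style)]:
    print(f"{name:>18}: {count:>9} positions ({count / all_global:.1%} of all-global)")
```

Under these assumptions the 1:1 pattern caches about 52% of what an all-global stack would, and the 5:1 pattern about 17%. Almost all of what remains comes from the global layers, which is why reducing their number has more effect than shrinking the window.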

Signature: double RMSNorm + GeGLU

Two more Gemma details. First, normalization: most models normalize the input of each sub-layer (pre-norm). Gemma normalizes both the input and the output of every attention and feed-forward block — a "double" RMSNorm sandwich that stabilizes training. (Gemma 3 also adds QK-norm, normalizing the query/key vectors, replacing Gemma 2's attention soft-capping.)
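The pre-norm versus "sandwich" arrangement can be sketched on a single vector. This is an illustration under simplifying assumptions: the helper names are invented for this example, the norm uses a plain learned scale, and `loud_sublayer` is a toy stand-in for an attention or feed-forward block whose output is deliberately far too large.

```python
import math

def rms(x):
    """Root-mean-square of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def rms_norm(x, weight, eps=1e-6):
    """Scale x to roughly unit RMS, then apply a per-feature weight."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / scale for v, w in zip(x, weight)]

def pre_norm_sublayer(x, sublayer, w_pre):
    """Standard pre-norm: normalize the input only, then add the residual."""
    branch = sublayer(rms_norm(x, w_pre))
    return [a + b for a, b in zip(x, branch)]

def sandwich_sublayer(x, sublayer, w_pre, w_post):
    """Double norm: normalize the input AND the sub-layer output before the residual add."""
    branch = sublayer(rms_norm(x, w_pre))
    branch = rms_norm(branch, w_post)
    return [a + b for a, b in zip(x, branch)]

def loud_sublayer(h):
    """Toy sub-layer (hypothetical) whose output is 100x too large."""
    return [100.0 * v for v in h]

x = [1.0, 2.0, 3.0, 4.0]
ones = [1.0] * len(x)

pre_only = pre_norm_sublayer(x, loud_sublayer, ones)
sandwich = sandwich_sublayer(x, loud_sublayer, ones, ones)

print("pre-norm only:", [round(v, 2) for v in pre_only])
print("sandwich     :", [round(v, 2) for v in sandwich])
```

With pre-norm alone, the oversized branch swamps the residual stream. With the second norm, the branch is rescaled to roughly unit RMS before it is added back, so its contribution stays bounded regardless of what the sub-layer produces.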

Second, the feed-forward network uses GeGLU, a gated unit where the gate passes through GELU (a smooth activation). Llama-style models use SwiGLU instead (gate through SiLU/Swish). Same gated idea, different activation.
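A gated feed-forward unit can be written so that GeGLU and SwiGLU differ only in the activation passed in. The sketch below uses plain Python lists and tiny hand-picked weights. The function and weight names are assumptions for illustration, and it uses the exact erf form of GELU, whereas real implementations may use a tanh approximation.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    """SiLU / Swish: x times sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def gated_ffn(x, w_gate, w_up, w_down, activation):
    """Gated feed-forward: down( activation(gate(x)) * up(x) )."""
    gate = [activation(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise gating
    return matvec(w_down, hidden)

# Tiny hypothetical weights: model width 2, hidden width 1.
x = [1.0, -2.0]
w_gate = [[1.0, 0.0]]  # gate pre-activation = x[0]
w_up = [[0.0, 1.0]]    # up projection       = x[1]
w_down = [[1.0]]

geglu_out = gated_ffn(x, w_gate, w_up, w_down, gelu)   # Gemma-style gate
swiglu_out = gated_ffn(x, w_gate, w_up, w_down, silu)  # Llama-style gate

print("GeGLU :", geglu_out)
print("SwiGLU:", swiglu_out)
```

The two calls share every weight and every step except the activation on the gate, which is the whole difference between the two variants.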

vs the shared recipe

Map each generic brick to Gemma's choice — what it keeps from the standard decoder-only blueprint, and where it diverges.

Keeps from the recipe
  • Decoder-only, dense. No mixture-of-experts; every parameter runs for every token.
  • GQA attention. Grouped-query attention shrinks the KV-cache, like its peers.
  • RoPE positions. Rotary position encoding throughout.
  • Gated FFN. A gated feed-forward unit (its GeGLU is a GELU cousin of SwiGLU).
Changes / trade-offs
  • Local + global layers. Mostly short windows (5:1 in Gemma 3) instead of full attention everywhere — saves memory, adds design complexity.
  • Double RMSNorm. Pre- AND post-norm on every sub-layer, not just pre-norm.
  • Per-layer RoPE base. Gemma 3 uses base 1,000,000 on global layers, 10,000 on local — tuned for 128K context.
  • Huge 256K vocab. Great multilingual reach, but a big embedding table; mitigated by tying input and output embeddings.
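The per-layer RoPE base can be made concrete by looking at rotation frequencies. In this sketch the two bases are the ones quoted above. The head dimension of 128 and the helper names are assumptions for illustration, and the frequency formula is the standard RoPE one rather than anything Gemma-specific.

```python
import math

ROPE_BASE = {"local": 10_000.0, "global": 1_000_000.0}  # bases quoted in the text
HEAD_DIM = 128       # hypothetical head size for this sketch
CONTEXT = 128 * 1024

def rope_inv_freq(head_dim, base):
    """Standard RoPE inverse frequencies, one per pair of dimensions."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def longest_wavelength(head_dim, base):
    """Positions needed for the slowest-rotating pair to complete one full turn."""
    return 2.0 * math.pi / min(rope_inv_freq(head_dim, base))

def rope_angles(position, layer_kind, head_dim=HEAD_DIM):
    """Rotation angle per dimension pair at `position` for a local or global layer."""
    return [position * f for f in rope_inv_freq(head_dim, ROPE_BASE[layer_kind])]

local_wl = longest_wavelength(HEAD_DIM, ROPE_BASE["local"])
global_wl = longest_wavelength(HEAD_DIM, ROPE_BASE["global"])

print(f"local  layers: slowest pair repeats after ~{local_wl:,.0f} positions")
print(f"global layers: slowest pair repeats after ~{global_wl:,.0f} positions")
print(f"target context: {CONTEXT:,} positions")
```

With the assumed head size, the slowest rotation on a base-10,000 layer wraps around well before 128K positions, which is harmless for a layer that only ever sees a 1024-token window. The larger base stretches the slowest rotation past the full context, so distant positions stay distinguishable on the global layers.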

Gotchas / good to know

Read before you build on it
  • Licensing. Gemma ships under Google's custom Gemma Terms of Use, not Apache/MIT. Commercial use is allowed but bound by a prohibited-use policy — check it for your case.
  • Gemma 3n is not sparse. Gemma 1 through Gemma 3 are fully dense. Gemma 3n uses MatFormer nested feed-forwards + Per-Layer Embeddings for elastic on-device sizing; that is nested-dense, not a sparse mixture-of-experts. The one MoE entry in the timeline above is Gemma 4's 26B model.
  • The 1B model is the exception. Only Gemma 3 4B and up are multimodal and reach 128K; the 1B stays text-only at 32K.
  • Big vocab, big embeddings. The 256K token vocabulary means a large embedding matrix; tied input/output embeddings help, but it still shapes the smallest models' parameter budget.
  • Generation differences matter. Gemma 2 used soft-capping + 4096 windows + 1:1 ratio; Gemma 3 replaced soft-capping with QK-norm, shrank windows to 1024, and went 5:1. Don't assume one config across versions.
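The two ways of taming attention logits mentioned above can be compared on toy numbers. This is a sketch under stated assumptions: the function names are invented for this example, the cap of 50.0 is an illustrative value, and the QK-norm shown omits the learned scale a real model would apply.

```python
import math

def soft_cap(logits, cap):
    """Soft-capping: squash logits smoothly into (-cap, cap) with tanh."""
    return [cap * math.tanh(v / cap) for v in logits]

def rms_normalize(vec, eps=1e-6):
    """Rescale a vector to roughly unit RMS (no learned scale in this sketch)."""
    scale = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / scale for v in vec]

def raw_logit(q, k):
    """Plain scaled dot-product attention logit."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

def qk_norm_logit(q, k):
    """QK-norm: normalize query and key first, then take the scaled dot product."""
    return raw_logit(rms_normalize(q), rms_normalize(k))

# A query/key pair with very large activations (hypothetical values).
q = [1000.0, -2000.0, 3000.0, 4000.0]
k = list(q)

uncapped = raw_logit(q, k)
capped = soft_cap([uncapped], cap=50.0)[0]  # bound the logit after computing it
normed = qk_norm_logit(q, k)                # bound it by normalizing q and k first

print("raw logit      :", uncapped)
print("soft-capped    :", capped)
print("QK-norm logit  :", round(normed, 4))
```

Both approaches keep the logit bounded, but at different points: soft-capping clips the score after the dot product, while QK-norm limits the magnitude of the query and key before it. A config or kernel written for one generation therefore does not carry over to the other.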

Related