gpt-oss (OpenAI)

Mixture-of-Experts Attention sinks MXFP4 4-bit experts 128K context Apache 2.0 Aug 5, 2025

What makes it distinctive

gpt-oss is OpenAI's first open-weight release since GPT-2 — a sparse Mixture-of-Experts (only a few sub-networks fire per token) reasoning model engineered to run on hardware you can actually rent.

The headline trick: store the bulky expert weights in MXFP4 (a 4-bit number format) so the 116.8B-parameter gpt-oss-120b squeezes onto a single 80GB GPU. On top of that, gpt-oss layers a few sharp ideas — attention layers that alternate between full and windowed, a learned "attention sink" that lets a head attend to nothing, and the structured harmony chat format.

New to this?

Every open model shares the same skeleton. See How Open-Source LLMs Are Built for the common recipe, then come back to see where gpt-oss bends it.

The family so far

Released Aug 5, 2025

OpenAI's first open-weight models since GPT-2 — weights you can download and run yourself. Safety-tuned gpt-oss-safeguard variants followed later in 2025; as of mid-2026 this pair is still the current open-weight family.

Two sizes 120b · 20b

gpt-oss-120b: 116.8B total / ~5.1B active, 36 layers. gpt-oss-20b: 20.9B total / ~3.6B active, 24 layers.

Context 131,072 tokens

128K window via RoPE positions stretched with YaRN (a long-context scaling trick) on the dense layers.

License Apache 2.0

Permissive — commercial use allowed (plus a usage policy). Tokenizer: o200k_harmony, 201,088 tokens.

Signature feature: top-4 experts, stored in 4 bits

A dense model runs all its weights on every token. gpt-oss is a Mixture-of-Experts: each block holds many parallel expert FFNs (feed-forward sub-networks), and a small linear router picks just the top-4 for each token. The 120b model has 128 experts per block (the 20b has 32), and these expert weights are 90%+ of all parameters.

That sparsity is why only ~5.1B of 116.8B params fire per token. The second half of the magic is MXFP4: those expert weights are quantized to 4 bits each instead of 32, shrinking memory ~8x so the whole 120b fits on one 80GB GPU. Step through it below.

Alternating attention + attention sinks

gpt-oss uses GQA (grouped-query attention): 64 query heads share just 8 key/value heads (group size 8), cutting memory. The twist is that layers alternate: one layer does full dense attention (every token sees all earlier tokens), the next does banded-sparse attention (each token only sees a sliding window of 128 tokens, GPT-3 style).

Each head also gets a learned per-head bias added in the softmax denominator — an attention sink. Think of it as an always-present junk slot a head can dump its weight into when nothing in the actual context is relevant, instead of being forced to over-attend to the first token. This keeps long-context attention stable. Toggle the two layer types below.

Rows = a query token; shaded squares = which earlier tokens it may attend to. The right-most column is the always-available attention sink.

vs the shared recipe

Keeps from the standard recipe
  • Decoder-only transformer — same autoregressive backbone everyone uses.
  • RMSNorm with pre-norm (normalize before each block) for stable training.
  • RoPE rotary positions, here stretched to 128K with YaRN.
  • SwiGLU gated feed-forward inside each expert (plus clamping + residual).
  • GQA attention to shrink the key/value cache.
Changes & trade-offs
  • Sparse MoE not dense — great FLOPs/quality, but you still must load all 128 experts into memory.
  • MXFP4 4-bit experts — fits 120b on one GPU, but quantization can cost some precision.
  • Alternating dense / banded-sparse attention — efficient, but window layers can't directly see far-back tokens.
  • Attention sinks — an extra learned bias most models skip.
  • harmony format required — convenient roles, but it is non-optional structure.

Gotchas / good to know

Read before you deploy
  • "Fits on 80GB" needs MXFP4. The single-GPU claim for 120b depends on the 4-bit expert weights; full-precision needs far more memory.
  • Total ≠ active. You must hold all 116.8B params in memory even though only ~5.1B compute per token — MoE saves FLOPs, not storage.
  • Use the harmony format. gpt-oss expects the structured chat format (roles: System, Developer, User, Assistant, Tool). Feeding raw text usually degrades output.
  • Banded layers have a 128-token window. Long-range reasoning leans on the alternating dense layers; don't assume every layer sees the whole context.
  • Apache 2.0 plus a usage policy. Permissive, but still read the policy for your use case.

Related