gpt-oss (OpenAI): Open-Weight MoE with Attention Sinks

What makes it distinctive

gpt-oss is OpenAI's first open-weight language model release since GPT-2 — a sparse Mixture-of-Experts (only a few sub-networks fire per token) reasoning model engineered to run on hardware you can actually rent.

The headline trick: store the bulky expert weights in MXFP4 (a 4-bit number format) so the 116.8B-parameter gpt-oss-120b squeezes onto a single 80GB GPU. On top of that, gpt-oss layers a few sharp ideas — attention layers that alternate between full and windowed, a learned "attention sink" that lets a head attend to nothing, and the structured harmony chat format.

New to this?

Every open model shares the same skeleton. See How Open-Source LLMs Are Built for the common recipe, then come back to see where gpt-oss bends it.

The family so far

Released Aug 5, 2025

OpenAI's first open-weight language models since GPT-2 — weights you can download and run yourself. Safety-tuned gpt-oss-safeguard variants followed later in 2025; as of mid-2026 this pair is still the current open-weight family.

Two sizes 120b · 20b

gpt-oss-120b: 116.8B total / ~5.1B active, 36 layers. gpt-oss-20b: 20.9B total / ~3.6B active, 24 layers.

Context 131,072 tokens

128K window via RoPE positions stretched with YaRN (a long-context scaling trick) on the dense layers.

License Apache 2.0

Permissive — commercial use allowed (plus a usage policy). Tokenizer: o200k_harmony, 201,088 tokens.

Signature feature: top-4 experts, stored in 4 bits

A dense model runs all its weights on every token. gpt-oss is a Mixture-of-Experts: each block holds many parallel expert FFNs (feed-forward sub-networks), and a small linear router picks just the top-4 for each token. The 120b model has 128 experts per block (the 20b has 32), and these expert weights are 90%+ of all parameters.

That sparsity is why only ~5.1B of 116.8B params fire per token. The second half of the magic is MXFP4: those expert weights are quantized to 4 bits each instead of the usual 16-bit (BF16), shrinking expert memory ~4x so the whole 120b fits on one 80GB GPU. Step through it below.

Alternating attention + attention sinks

gpt-oss uses GQA (grouped-query attention): 64 query heads share just 8 key/value heads (group size 8), cutting memory. The twist is that layers alternate: one layer does full dense attention (every token sees all earlier tokens), the next does banded-sparse attention (each token only sees a sliding window of 128 tokens, GPT-3 style).

Each head also gets a learned per-head bias added in the softmax denominator — an attention sink. Think of it as an always-present junk slot a head can dump its weight into when nothing in the actual context is relevant, instead of being forced to over-attend to the first token. This keeps long-context attention stable. Toggle the two layer types below.

Rows = a query token; shaded squares = which earlier tokens it may attend to. The right-most column is the always-available attention sink.

vs the shared recipe

Keeps from the standard recipe

Decoder-only transformer — same autoregressive backbone everyone uses.
RMSNorm with pre-norm (normalize before each block) for stable training.
RoPE rotary positions, here stretched to 128K with YaRN.
SwiGLU gated feed-forward inside each expert (plus clamping + residual).
GQA attention to shrink the key/value cache.

Changes & trade-offs

Sparse MoE not dense — great FLOPs/quality, but you still must load all 128 experts into memory.
MXFP4 4-bit experts — fits 120b on one GPU, but quantization can cost some precision.
Alternating dense / banded-sparse attention — efficient, but window layers can't directly see far-back tokens.
Attention sinks — an extra learned bias most models skip.
harmony format required — convenient roles, but it is non-optional structure.

Gotchas / good to know

Read before you deploy

"Fits on 80GB" needs MXFP4. The single-GPU claim for 120b depends on the 4-bit expert weights; full-precision needs far more memory.
Total ≠ active. You must hold all 116.8B params in memory even though only ~5.1B compute per token — MoE saves FLOPs, not storage.
Use the harmony format. gpt-oss expects the structured chat format (roles: System, Developer, User, Assistant, Tool). Feeding raw text usually degrades output.
Banded layers have a 128-token window. Long-range reasoning leans on the alternating dense layers; don't assume every layer sees the whole context.
Apache 2.0 plus a usage policy. Permissive, but still read the policy for your use case.