gpt-oss (OpenAI)
What makes it distinctive
gpt-oss is OpenAI's first open-weight release since GPT-2 — a sparse Mixture-of-Experts (only a few sub-networks fire per token) reasoning model engineered to run on hardware you can actually rent.
The headline trick: store the bulky expert weights in MXFP4 (a 4-bit number format) so the 116.8B-parameter gpt-oss-120b squeezes onto a single 80GB GPU. On top of that, gpt-oss layers a few sharp ideas — attention layers that alternate between full and windowed, a learned "attention sink" that lets a head attend to nothing, and the structured harmony chat format.
Every open model shares the same skeleton. See How Open-Source LLMs Are Built for the common recipe, then come back to see where gpt-oss bends it.
The family so far
OpenAI's first open-weight models since GPT-2 — weights you can download and run yourself. Safety-tuned gpt-oss-safeguard variants followed later in 2025; as of mid-2026 this pair is still the current open-weight family.
gpt-oss-120b: 116.8B total / ~5.1B active, 36 layers. gpt-oss-20b: 20.9B total / ~3.6B active, 24 layers.
128K window via RoPE positions stretched with YaRN (a long-context scaling trick) on the dense layers.
Permissive — commercial use allowed (plus a usage policy). Tokenizer: o200k_harmony, 201,088 tokens.
Signature feature: top-4 experts, stored in 4 bits
A dense model runs all its weights on every token. gpt-oss is a Mixture-of-Experts: each block holds many parallel expert FFNs (feed-forward sub-networks), and a small linear router picks just the top-4 for each token. The 120b model has 128 experts per block (the 20b has 32), and these expert weights are 90%+ of all parameters.
That sparsity is why only ~5.1B of 116.8B params fire per token. The second half of the magic is MXFP4: those expert weights are quantized to 4 bits each instead of 32, shrinking memory ~8x so the whole 120b fits on one 80GB GPU. Step through it below.
Alternating attention + attention sinks
gpt-oss uses GQA (grouped-query attention): 64 query heads share just 8 key/value heads (group size 8), cutting memory. The twist is that layers alternate: one layer does full dense attention (every token sees all earlier tokens), the next does banded-sparse attention (each token only sees a sliding window of 128 tokens, GPT-3 style).
Each head also gets a learned per-head bias added in the softmax denominator — an attention sink. Think of it as an always-present junk slot a head can dump its weight into when nothing in the actual context is relevant, instead of being forced to over-attend to the first token. This keeps long-context attention stable. Toggle the two layer types below.
Rows = a query token; shaded squares = which earlier tokens it may attend to. The right-most column is the always-available attention sink.
vs the shared recipe
- Decoder-only transformer — same autoregressive backbone everyone uses.
- RMSNorm with pre-norm (normalize before each block) for stable training.
- RoPE rotary positions, here stretched to 128K with YaRN.
- SwiGLU gated feed-forward inside each expert (plus clamping + residual).
- GQA attention to shrink the key/value cache.
- Sparse MoE not dense — great FLOPs/quality, but you still must load all 128 experts into memory.
- MXFP4 4-bit experts — fits 120b on one GPU, but quantization can cost some precision.
- Alternating dense / banded-sparse attention — efficient, but window layers can't directly see far-back tokens.
- Attention sinks — an extra learned bias most models skip.
- harmony format required — convenient roles, but it is non-optional structure.
Gotchas / good to know
- "Fits on 80GB" needs MXFP4. The single-GPU claim for 120b depends on the 4-bit expert weights; full-precision needs far more memory.
- Total ≠ active. You must hold all 116.8B params in memory even though only ~5.1B compute per token — MoE saves FLOPs, not storage.
- Use the harmony format. gpt-oss expects the structured chat format (roles: System, Developer, User, Assistant, Tool). Feeding raw text usually degrades output.
- Banded layers have a 128-token window. Long-range reasoning leans on the alternating dense layers; don't assume every layer sees the whole context.
- Apache 2.0 plus a usage policy. Permissive, but still read the policy for your use case.
Related
How Open-Source LLMs Are Built
The shared recipe behind every open model — start here for the bricks gpt-oss reuses.
DeepSeek
Another big Mixture-of-Experts family — compare routing and expert counts.
Mistral & Mixtral
MoE plus sliding-window attention — close cousins to gpt-oss's design choices.
What is an LLM?
The fundamentals — tokens, prediction, and why these models work at all.