Mistral & Mixtral

Mistral AI · Sliding-Window Attention · Sparse Mixture-of-Experts · GQA + RoPE + SwiGLU · Mostly Apache 2.0 · Decoder-only

What makes it distinctive

Mistral's claim to fame is doing more with less: a small 7-billion-parameter model that punched far above its weight, and Mixtral, which made sparse Mixture-of-Experts (only part of the network runs per token) genuinely popular in open weights.

Two ideas carry the family. First, sliding-window attention (each token only looks back a fixed number of steps) paired with a rolling-buffer KV cache (a fixed-size memory that overwrites the oldest slot): long context without the memory bill growing forever. Second, Mixtral's MoE: 8 expert sub-networks, but only 2 fire per token, so a 47B model needs roughly the per-token compute of a 13B model (all 47B still have to sit in memory). Everything else is the standard modern recipe.

New here?

This page assumes the shared transformer recipe. For the generic bricks — embeddings, attention, normalization, the feed-forward block — start with How Open-Source LLMs Are Built, then come back for the parts Mistral changes.

The family so far

From a single 7B model in 2023 to a 675B Mixture-of-Experts flagship in 2025 — but the design DNA stays remarkably consistent.

Timeline: 2023 → 2026

Mistral 7B (Sep 2023) → Mixtral 8x7B (Dec 2023) → Mixtral 8x22B (Apr 2024) → Mistral NeMo 12B (Jul 2024) → Mistral Small 3 (24B, 2025) → the Mistral 3 family (Dec 2025) → Mistral Small 4 (Mar 2026), one Apache model folding reasoning (Magistral), vision (Pixtral) and coding (Devstral) together.

Sizes & shapes: 3B → 675B

Dense Ministral 3 models (3B / 8B / 14B) at the small end, up to Mistral Large 3, a sparse MoE with 675B total parameters but only ~41B active per token.

Context length: 32K → 256K

32K tokens for 7B and Mixtral 8x7B (64K for Mixtral 8x22B); 128K for NeMo; up to 256K for Mistral Large 3. The later models drop sliding windows to reach this.

License: Apache 2.0

Most weights — 7B, Mixtral, NeMo, Small 3, the Mistral 3 family — ship under permissive Apache 2.0: free to use, modify, and ship commercially.

Signature feature 1 — sliding-window attention + a rolling KV cache

Plain attention lets every token look at every earlier token, so the KV cache grows linearly with sequence length and attention compute grows quadratically. Mistral 7B and Mixtral instead use a sliding window: each token attends only to the last W = 4096 tokens, which shows up as a diagonal band in the attention mask. Memory stays flat because the rolling buffer (a fixed-size cache) reuses slots: once it fills, the newest key/value pair overwrites the oldest.

The clever part: information still travels past the window. Stacking layers extends reach: at each layer a token can pull in something W steps back, and that token already absorbed something W steps before it. After several layers, the effective reach is roughly layers × W.
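
To make the layers × W estimate concrete, here is a small simulation under simplifying assumptions: each layer lets a token read from up to `window` positions back, and information always propagates perfectly. The function name is hypothetical, and the 32-layer figure for Mistral 7B comes from its published configuration rather than from this page. The result is a theoretical upper bound, not a measure of how well the model actually uses distant tokens.

```python
def earliest_source_per_layer(position, window, num_layers):
    """Earliest input position that can influence `position` after each layer."""
    earliest = list(range(position + 1))  # layer 0: every token only knows itself
    history = []
    for _ in range(num_layers):
        # Each token inherits the earliest source of anything inside its window.
        earliest = [
            min(earliest[max(0, i - window): i + 1])
            for i in range(position + 1)
        ]
        history.append(earliest[position])
    return history


history = earliest_source_per_layer(position=20, window=4, num_layers=4)
reach = [20 - e for e in history]  # how far back token 20 can see per layer

# Same arithmetic at Mistral 7B scale (assumed: 32 layers, W = 4096).
full_model_reach = 32 * 4096
```

In the toy run the reach grows by one window per layer (4, 8, 12, 16 positions), and the same arithmetic at 7B scale gives 131,072 positions.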

Why it matters

Combined with GQA (grouped-query attention — 32 query heads share just 8 key/value heads), the rolling buffer keeps the KV cache small and constant. That is what made cheap, long-ish context practical on a 7B model.
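
A back-of-the-envelope calculation shows how much each choice saves. The sketch below assumes 32 layers, 16-bit (2-byte) cache entries, and a batch of one sequence; those three values are assumptions for illustration, while the head counts, head dimension, and window come from this page. `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, cached_tokens,
                   bytes_per_value=2):
    """Size of the KV cache for one sequence; the leading 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * cached_tokens * bytes_per_value


MIB = 1024 ** 2

# GQA (8 KV heads) with a rolling buffer capped at W = 4096 tokens.
gqa_rolling = kv_cache_bytes(32, 8, 128, 4096)

# Same buffer, but with one KV head per query head (no GQA).
mha_rolling = kv_cache_bytes(32, 32, 128, 4096)

# GQA without the rolling buffer, caching a full 32K-token context.
gqa_full_32k = kv_cache_bytes(32, 8, 128, 32768)

sizes_mib = [gqa_rolling // MIB, mha_rolling // MIB, gqa_full_32k // MIB]
```

Under these assumptions the capped GQA cache is 512 MiB, dropping GQA quadruples it to 2 GiB, and dropping the rolling buffer at 32K tokens grows it to 4 GiB.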

Signature feature 2 — Mixtral's 8 experts, 2 active

Mixtral replaces each layer's single feed-forward block with 8 experts (8 parallel sub-networks). A tiny router scores all 8 for each token, picks the top 2, runs only those, and blends their outputs by the router's weights. So Mixtral 8x7B holds 47B parameters total but only ~13B do work on any given token.

Different tokens route to different experts — but always exactly 2 of 8 light up.
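
The routing step can be sketched for a single token as follows. This is a toy version in plain Python: `top2_route` and `moe_layer` are hypothetical names, the experts are stand-in functions rather than SwiGLU blocks, and the blend weights are assumed to be a softmax over the two selected router scores.

```python
import math


def top2_route(router_logits):
    """Return [(expert_id, weight), ...] for the two highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    chosen = ranked[:2]
    # Softmax over only the two selected logits (shifted for numerical stability).
    peak = max(router_logits[e] for e in chosen)
    exps = [math.exp(router_logits[e] - peak) for e in chosen]
    total = sum(exps)
    return [(e, x / total) for e, x in zip(chosen, exps)]


def moe_layer(x, router_logits, experts):
    """Run only the selected experts and blend their outputs by router weight."""
    out = [0.0] * len(x)
    for expert_id, weight in top2_route(router_logits):
        y = experts[expert_id](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy experts: expert k simply scales its input by (k + 1).
experts = [lambda v, k=k: [(k + 1) * vi for vi in v] for k in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.7]

routing = top2_route(logits)
output = moe_layer([1.0, 2.0], logits, experts)
```

Here experts 1 and 4 win; the other six are never called, which is where the compute saving comes from.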

Mixtral 8x7B: 47B total / ~13B active

8 experts, top-2 routing. 47B total parameters, but only ~13B active per token.
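
The name suggests 8 × 7B = 56B, so it helps to see where 47B and ~13B come from: only the feed-forward blocks are replicated per expert, while attention and embeddings are shared. The count below is a rough sketch that assumes a hidden size of 4096, 32 layers, untied input and output embeddings, and ignores the small normalization weights; those values are taken from the published configuration, not from this page.

```python
# Assumed shapes (hidden size, layer count and untied embeddings are assumptions).
HIDDEN, LAYERS, VOCAB = 4096, 32, 32_000
FFN, Q_HEADS, KV_HEADS, HEAD_DIM = 14_336, 32, 8, 128
EXPERTS, ACTIVE = 8, 2

expert = 3 * HIDDEN * FFN  # SwiGLU uses three matrices: gate, up, down
attention = (2 * HIDDEN * Q_HEADS * HEAD_DIM      # query and output projections
             + 2 * HIDDEN * KV_HEADS * HEAD_DIM)  # key and value projections
router = HIDDEN * EXPERTS
embeddings = 2 * VOCAB * HIDDEN  # input table plus output head

shared = LAYERS * (attention + router) + embeddings
total = shared + LAYERS * EXPERTS * expert
active = shared + LAYERS * ACTIVE * expert

total_b = round(total / 1e9, 1)
active_b = round(active / 1e9, 1)
```

This lands at about 46.7B total and 12.9B active, matching the rounded 47B / ~13B figures. At 2 bytes per weight, the full set of weights would take roughly 93 GB.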

Mixtral 8x22B: 141B total / ~39B active

Same recipe scaled up: 8 experts, top-2, 141B total, ~39B active.

Mistral Large 3: 675B total / ~41B active

Granular sparse MoE (many small experts, ~16:1 total-to-active ratio). Exact expert count not disclosed.

vs the shared recipe

Mistral keeps almost all of the standard decoder-only transformer and swaps two bricks. Here is what stays vs what changes.

Keeps from the shared recipe
  • Decoder-only stack — predict the next token, left to right.
  • RMSNorm, applied pre-norm (normalize before each sub-block).
  • RoPE for positions (7B uses rope_theta = 10000; long-context models use a much larger theta).
  • SwiGLU feed-forward (a SiLU-gated MLP; 7B intermediate size 14336).
  • GQA throughout: on the 7B, 32 query heads share 8 KV heads, head dim 128.

Changes / trade-offs
  • Sliding-window attention (window 4096) + rolling-buffer KV cache on 7B / Mixtral — cheap context, but reach beyond the window depends on stacking layers.
  • Newer models (NeMo, Small 3, Mistral 3) drop SWA for full attention to hit 128K–256K — simpler, but the KV cache grows again.
  • Sparse MoE (Mixtral, Mistral Large 3) — big capacity, cheap inference, but all experts must sit in memory and routing can imbalance load.
  • Tokenizer jump: 7B / Mixtral use a 32,000-token SentencePiece BPE vocabulary; NeMo onward use the ~131K-token "Tekken" tokenizer (tiktoken-style BPE) for 100+ languages.
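
To see why the rope_theta value mentioned in the list above matters for long context, recall that standard RoPE rotates each pair of head dimensions at frequency theta^(-2i/d). The sketch below computes the resulting wavelengths in tokens. The function name is hypothetical, and 1,000,000 is used only as an illustrative "much larger theta", not as a value this page confirms for any particular model.

```python
import math


def rope_wavelengths(head_dim, theta):
    """Wavelength in tokens of each rotary frequency pair (standard RoPE)."""
    return [2 * math.pi * theta ** (2 * i / head_dim)
            for i in range(head_dim // 2)]


short = rope_wavelengths(128, 10_000.0)     # theta used by the 7B
long = rope_wavelengths(128, 1_000_000.0)   # illustrative larger theta

slowest_short = short[-1]  # slowest-rotating pair at theta = 10000
slowest_long = long[-1]    # slowest-rotating pair at the larger theta
```

A larger theta stretches the slowest rotations across far more positions, so distant tokens stay distinguishable. Loading weights trained with one theta while configuring another changes every rotation angle the model learned to read.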

Gotchas / good to know

Read before you build on it
  • "47B" is not your VRAM budget. A sparse MoE only computes ~13B per token, but you must load all 47B of weights into memory — the savings are in compute, not storage.
  • Sliding window is a 7B / Mixtral thing. Don't assume the whole family uses it; NeMo, Small 3, and the Mistral 3 models use full attention instead.
  • RoPE theta matters for long context. The 7B's rope_theta = 10000 was tuned for ~32K; longer-context models retrain with a much larger theta. Mixing them up breaks position handling.
  • Two different tokenizers. A prompt's token count and any tokenizer-specific code differ between the old 32K SentencePiece vocab and the newer 131K Tekken vocab.
  • Router load balance. MoE training needs an auxiliary loss so experts get used evenly; left alone, a few experts hog all the tokens and capacity is wasted.
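
For the router load-balance point in the list above, one common formulation is the Switch-Transformer-style auxiliary loss, sketched here in plain Python. This page does not say which loss Mistral used, so treat this purely as an illustration of the idea; `load_balance_loss` is a hypothetical name.

```python
def load_balance_loss(router_probs, top_k=2):
    """num_experts * sum_e(f_e * P_e).

    f_e: fraction of routing slots dispatched to expert e.
    P_e: mean router probability assigned to expert e.
    Equals 1.0 when both are uniform and grows as routing concentrates.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    dispatched = [0] * num_experts
    prob_mass = [0.0] * num_experts
    for probs in router_probs:
        chosen = sorted(range(num_experts),
                        key=lambda e: probs[e], reverse=True)[:top_k]
        for e in chosen:
            dispatched[e] += 1
        for e, p in enumerate(probs):
            prob_mass[e] += p
    frac_tokens = [d / (num_tokens * top_k) for d in dispatched]
    mean_prob = [m / num_tokens for m in prob_mass]
    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))


# Four tokens, four experts: each expert is picked equally often.
balanced = load_balance_loss([
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.4, 0.3, 0.2],
    [0.2, 0.1, 0.4, 0.3],
    [0.3, 0.2, 0.1, 0.4],
])

# Every token prefers the same two experts.
collapsed = load_balance_loss([[0.6, 0.3, 0.05, 0.05]] * 4)
```

The balanced batch scores 1.0 and the collapsed one scores higher, so adding a small multiple of this term to the training loss nudges the router toward even usage.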
