Mistral & Mixtral
What makes it distinctive
Mistral's claim to fame is doing more with less: a small 7-billion-parameter model that punched far above its weight, and Mixtral, which made sparse Mixture-of-Experts (only part of the network runs per token) genuinely popular in open weights.
Two ideas carry the family. First, sliding-window attention (each token only looks back a fixed number of steps) paired with a rolling-buffer KV cache (a fixed-size memory that overwrites the oldest slot) — long context without the memory bill growing forever. Second, Mixtral's MoE: 8 expert sub-networks, but only 2 fire per token, so a 47B model costs roughly what a 13B model costs to run. Everything else is the standard modern recipe.
This page assumes the shared transformer recipe. For the generic bricks — embeddings, attention, normalization, the feed-forward block — start with How Open-Source LLMs Are Built, then come back for the parts Mistral changes.
The family so far
From a single 7B model in 2023 to a 675B Mixture-of-Experts flagship in 2025 — but the design DNA stays remarkably consistent.
Mistral 7B (Sep 2023) → Mixtral 8x7B (Dec 2023) → Mixtral 8x22B (Apr 2024) → Mistral NeMo 12B (Jul 2024) → Mistral Small 3 (24B, 2025) → the Mistral 3 family (Dec 2025) → Mistral Small 4 (Mar 2026), one Apache model folding reasoning (Magistral), vision (Pixtral) and coding (Devstral) together.
Sizes range from the dense Ministral 3 models (3B / 8B / 14B) up to Mistral Large 3, a sparse MoE with 675B total parameters but only ~41B active per token.
Context windows: 32K tokens for 7B and Mixtral; 128K for NeMo; up to 256K for Mistral Large 3. The later models drop sliding windows to reach this.
Most weights — 7B, Mixtral, NeMo, Small 3, the Mistral 3 family — ship under permissive Apache 2.0: free to use, modify, and ship commercially.
Signature feature 1 — sliding-window attention + a rolling KV cache
Plain attention lets every token look at every earlier token, so the KV cache grows linearly with sequence length and attention compute grows quadratically. Mistral 7B and Mixtral instead use a sliding window: each token attends only to the last W = 4096 tokens, a diagonal band in the attention mask. Memory stays flat because the rolling buffer (a fixed-size cache) reuses slots: when it fills, the newest key/value entry overwrites the oldest.
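To make the two pieces concrete, here is a minimal sketch in plain Python. It assumes the window includes the current token and that position i lands in slot i mod W; the names sliding_window_mask and RollingKVCache are made up for illustration, and strings stand in for real key/value tensors.

```python
from typing import List, Optional


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


class RollingKVCache:
    """Fixed-size cache (illustrative): position i is stored in slot i % window."""

    def __init__(self, window: int) -> None:
        self.window = window
        self.slots: List[Optional[str]] = [None] * window
        self.next_pos = 0  # absolute position of the next token

    def append(self, kv: str) -> None:
        # Once the buffer is full, this overwrites the oldest entry.
        self.slots[self.next_pos % self.window] = kv
        self.next_pos += 1

    def contents(self) -> List[Optional[str]]:
        """Cached entries in logical order, oldest first."""
        n = min(self.next_pos, self.window)
        start = self.next_pos - n
        return [self.slots[p % self.window] for p in range(start, self.next_pos)]


mask = sliding_window_mask(seq_len=6, window=3)
print(mask[5])  # [False, False, False, True, True, True]

cache = RollingKVCache(window=3)
for pos in range(5):
    cache.append(f"kv{pos}")
print(cache.slots)       # ['kv3', 'kv4', 'kv2']  (raw slots after wrap-around)
print(cache.contents())  # ['kv2', 'kv3', 'kv4']  (logical order)
```

The cache never holds more than W entries, no matter how many tokens have been generated, which is the whole point of the rolling buffer.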
The clever part: information still travels past the window. Stacking layers extends reach: at each layer a token can pull in something W steps back, and that earlier token already absorbed something W steps before it. After several layers, the effective reach is roughly (number of layers) × W.
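The stacking argument can be checked with a toy simulation. This is a sketch under the assumption that the window counts the current token, so each layer adds W - 1 steps of reach (close to the "layers × W" rule of thumb). The helper earliest_visible is hypothetical, and the 32-layer figure for Mistral 7B is an assumption not stated on this page.

```python
from typing import List


def earliest_visible(seq_len: int, window: int, layers: int) -> int:
    """Earliest position whose information can reach the final token."""
    # Before any attention layer, each token only knows about itself.
    earliest: List[int] = list(range(seq_len))
    for _ in range(layers):
        # Each token inherits the oldest information held by any token in its window.
        earliest = [min(earliest[max(0, i - window + 1): i + 1])
                    for i in range(seq_len)]
    return earliest[-1]


seq_len, window = 40, 4
for layers in (1, 2, 4):
    first = earliest_visible(seq_len, window, layers)
    print(f"{layers} layer(s): reach {seq_len - 1 - first} tokens back")
# 1 layer(s): reach 3 tokens back
# 2 layer(s): reach 6 tokens back
# 4 layer(s): reach 12 tokens back

# Rule-of-thumb upper bound for 32 layers (assumed) with W = 4096:
print(32 * 4096)  # 131072
```

This only shows that information can flow that far; how much of it survives many hops is a separate question.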
Combined with GQA (grouped-query attention — 32 query heads share just 8 key/value heads), the rolling buffer keeps the KV cache small and constant. That is what made cheap, long-ish context practical on a 7B model.
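A rough back-of-the-envelope calculation shows how much each trick saves. This sketch assumes 32 layers and 16-bit cache values (neither is stated on this page) and uses a head dimension of 128; kv_cache_bytes is an illustrative helper, not a library function.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   cached_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes held by the KV cache for one sequence (keys plus values)."""
    return 2 * layers * kv_heads * head_dim * cached_tokens * bytes_per_value


MIB = 1024 ** 2
LAYERS = 32      # assumption: layer count of the 7B model
HEAD_DIM = 128
WINDOW = 4096

gqa_window = kv_cache_bytes(LAYERS, kv_heads=8, head_dim=HEAD_DIM, cached_tokens=WINDOW)
mha_window = kv_cache_bytes(LAYERS, kv_heads=32, head_dim=HEAD_DIM, cached_tokens=WINDOW)
gqa_full_32k = kv_cache_bytes(LAYERS, kv_heads=8, head_dim=HEAD_DIM, cached_tokens=32768)

print(gqa_window // MIB)    # 512   (GQA + rolling buffer)
print(mha_window // MIB)    # 2048  (rolling buffer, but one KV head per query head)
print(gqa_full_32k // MIB)  # 4096  (GQA, but caching all 32K tokens)
```

Under these assumptions GQA cuts the cache by 4x and the rolling buffer caps it at the window size, so the two savings multiply.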
Signature feature 2 — Mixtral's 8 experts, 2 active
Mixtral replaces the single feed-forward block in each layer with 8 experts (8 parallel sub-networks). A tiny router scores all 8 for each token, picks the top 2, runs only those, and blends their outputs by the router's weights. So Mixtral 8x7B holds 47B parameters total but only ~13B do work on any given token.
Different tokens route to different experts — but always exactly 2 of 8 light up.
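The routing step can be sketched in a few lines of plain Python. This is a toy version, not Mixtral's actual code: the router logits are passed in directly (in a real model they come from a small linear layer applied to the token), the experts are stand-in functions, and the names top2_route and moe_forward are made up here. The weights are a softmax over just the two selected logits.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def top2_route(router_logits: Sequence[float]) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) for the two highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    top2 = ranked[:2]
    peak = max(router_logits[e] for e in top2)
    exps = [math.exp(router_logits[e] - peak) for e in top2]  # stable softmax
    total = sum(exps)
    return [(e, x / total) for e, x in zip(top2, exps)]


def moe_forward(token: Vector, router_logits: Sequence[float],
                experts: Sequence[Callable[[Vector], Vector]]) -> Vector:
    """Run only the two chosen experts and blend their outputs."""
    out = [0.0] * len(token)
    for e, weight in top2_route(router_logits):
        expert_out = experts[e](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


# Toy experts: expert k just scales its input by (k + 1).
experts = [lambda x, k=k: [(k + 1) * v for v in x] for k in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

routes = top2_route(logits)
print(routes)                                    # [(1, 0.5), (4, 0.5)]
print(moe_forward([1.0, 2.0], logits, experts))  # [3.5, 7.0]
```

The other six experts are never called for this token, which is where the compute saving comes from.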
Mixtral 8x7B: 8 experts, top-2 routing. 47B total parameters, but only ~13B active per token.
Mixtral 8x22B: the same recipe scaled up. 8 experts, top-2, 141B total, ~39B active.
Mistral Large 3: granular sparse MoE (many small experts, ~16:1 total-to-active ratio). Exact expert count not disclosed.
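The 47B / 13B split for Mixtral 8x7B can be reproduced with simple arithmetic. The sketch below assumes a hidden size of 4096, 32 layers, a 32,000-token vocabulary, untied input and output embeddings, and three weight matrices per SwiGLU expert. Treat these configuration values as assumptions rather than something this page states; mixtral_style_params is an illustrative helper.

```python
from typing import Tuple


def mixtral_style_params(hidden: int = 4096, ffn: int = 14336, layers: int = 32,
                         vocab: int = 32000, q_heads: int = 32, kv_heads: int = 8,
                         head_dim: int = 128, experts: int = 8,
                         active_experts: int = 2) -> Tuple[int, int]:
    """Return (total, active-per-token) parameter counts for a Mixtral-style model."""
    attn = 2 * hidden * q_heads * head_dim + 2 * hidden * kv_heads * head_dim  # q, o + k, v
    expert = 3 * hidden * ffn       # gate, up and down projections
    router = hidden * experts       # one score per expert
    norms = 2 * hidden              # two RMSNorm weight vectors per layer
    embeddings = 2 * vocab * hidden # input embedding + output head (assumed untied)
    shared = layers * (attn + router + norms) + embeddings + hidden  # + final norm
    total = shared + layers * experts * expert
    active = shared + layers * active_experts * expert
    return total, active


total, active = mixtral_style_params()
print(round(total / 1e9, 1))      # 46.7
print(round(active / 1e9, 1))     # 12.9
print(round(total * 2 / 1e9, 1))  # 93.4  (GB of weights at 2 bytes per parameter)
```

Almost all of the parameters sit in the experts, so running 2 of 8 cuts per-token compute sharply, while the last line shows that every weight still has to be stored.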
vs the shared recipe
Mistral keeps almost all of the standard decoder-only transformer and swaps two bricks. Here is what stays vs what changes.
- Decoder-only stack — predict the next token, left to right.
- RMSNorm, applied pre-norm (normalize before each sub-block).
- RoPE for positions (7B uses rope_theta = 10000; long-context models use a much larger theta).
- SwiGLU feed-forward (a SiLU-gated MLP; 7B intermediate size 14336).
- GQA everywhere (on the 7B: 32 query heads, 8 KV heads, head dim 128).
- Sliding-window attention (window 4096) + rolling-buffer KV cache on 7B / Mixtral — cheap context, but reach beyond the window depends on stacking layers.
- Newer models (NeMo, Small 3, Mistral 3) drop SWA for full attention to hit 128K–256K — simpler, but the KV cache grows again.
- Sparse MoE (Mixtral, Mistral Large 3) — big capacity, cheap inference, but all experts must sit in memory and routing can imbalance load.
- Tokenizer jump: 7B / Mixtral use a 32,000-token SentencePiece BPE vocabulary; NeMo onward use the ~131K "Tekken" tokenizer (tiktoken-style BPE) for 100+ languages.
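To see why rope_theta matters, here is a small sketch of rotary position embeddings in plain Python. It pairs adjacent dimensions for readability (real implementations often pair dimension i with i + d/2 instead), and the value 1,000,000 is only an illustrative "much larger theta". The function names are made up for this example.

```python
import math
from typing import List


def rope_wavelengths(head_dim: int, theta: float) -> List[float]:
    """Wavelength, in tokens, of each rotary frequency pair."""
    inv_freq = [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]
    return [2 * math.pi / f for f in inv_freq]


def rope_rotate(vec: List[float], position: int, theta: float) -> List[float]:
    """Rotate each adjacent (even, odd) pair by a position-dependent angle."""
    d = len(vec)
    out: List[float] = []
    for i in range(d // 2):
        angle = position * theta ** (-2 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(angle) - y * math.sin(angle),
                x * math.sin(angle) + y * math.cos(angle)]
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# A larger theta stretches the slowest rotations over many more tokens.
print(max(rope_wavelengths(128, 10_000.0)) < max(rope_wavelengths(128, 1_000_000.0)))  # True

# Attention scores depend only on the distance between positions, not on where they sit.
q, k = [0.3, -1.2, 0.8, 0.5], [1.0, 0.4, -0.7, 0.2]
near = dot(rope_rotate(q, 5, 10_000.0), rope_rotate(k, 3, 10_000.0))
far = dot(rope_rotate(q, 105, 10_000.0), rope_rotate(k, 103, 10_000.0))
print(abs(near - far) < 1e-9)  # True
```

Because theta fixes every rotation angle, running a checkpoint with a theta other than the one it was trained with changes how all positions are encoded.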
Gotchas / good to know
- "47B" is not your VRAM budget. A sparse MoE only computes ~13B per token, but you must load all 47B of weights into memory — the savings are in compute, not storage.
- Sliding window is a 7B / Mixtral thing. Don't assume the whole family uses it; NeMo, Small 3, and the Mistral 3 models use full attention instead.
- RoPE theta matters for long context. The 7B's rope_theta = 10000 was tuned for ~32K; longer-context models retrain with a much larger theta. Mixing them up breaks position handling.
- Two different tokenizers. A prompt's token count and any tokenizer-specific code differ between the old 32K SentencePiece vocab and the newer 131K Tekken vocab.
- Router load balance. MoE training needs an auxiliary loss so experts get used evenly; left alone, a few experts hog all the tokens and capacity is wasted.
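As an illustration of the load-balance point, here is a sketch of one common auxiliary loss, the Switch Transformer-style formulation adapted to top-2 routing. This page does not say which loss Mixtral used, so treat this as a stand-in rather than Mixtral's own recipe. The loss multiplies each expert's share of routed tokens by its average router probability; it comes out at about 1.0 when load is even and grows as routing collapses.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def load_balance_loss(batch_logits: List[List[float]], top_k: int = 2) -> float:
    """Switch-style auxiliary loss: n_experts * sum_i(token_share_i * mean_prob_i)."""
    n_experts = len(batch_logits[0])
    n_tokens = len(batch_logits)
    counts = [0] * n_experts
    prob_sums = [0.0] * n_experts
    for logits in batch_logits:
        chosen = sorted(range(n_experts), key=lambda e: logits[e], reverse=True)[:top_k]
        for e in chosen:
            counts[e] += 1
        for e, p in enumerate(softmax(logits)):
            prob_sums[e] += p
    token_share = [c / (n_tokens * top_k) for c in counts]
    mean_prob = [s / n_tokens for s in prob_sums]
    return n_experts * sum(f * p for f, p in zip(token_share, mean_prob))


def prefer(pair: List[int], n_experts: int = 8) -> List[float]:
    """Toy router logits that strongly favor the given pair of experts."""
    return [5.0 if e in pair else 0.0 for e in range(n_experts)]


balanced = [prefer([0, 1]), prefer([2, 3]), prefer([4, 5]), prefer([6, 7])]
collapsed = [prefer([0, 1]) for _ in range(4)]

print(round(load_balance_loss(balanced), 6))  # 1.0
print(load_balance_loss(collapsed) > 3.0)     # True
```

Adding a small multiple of this value to the training loss nudges the router toward spreading tokens across all experts.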
Related
How Open-Source LLMs Are Built
The shared recipe these bricks plug into — start here if any term felt unfamiliar.
Llama
The dense recipe Mistral builds on — same GQA + RoPE + SwiGLU backbone.
DeepSeek
Another MoE design — compare Mixtral's top-2 routing to DeepSeek's many-expert approach.
gpt-oss
An open MoE family — a third take on sparse experts to contrast.
Transformer Architecture
The fundamentals — attention, blocks, and how a decoder predicts the next token.