Mistral & Mixtral: Sliding Windows and Mixture-of-Experts

What makes it distinctive

Mistral's claim to fame is doing more with less: a small 7-billion-parameter model that punched far above its weight, and Mixtral, which made sparse Mixture-of-Experts (only part of the network runs per token) genuinely popular in open weights.

Two ideas carry the family. First, sliding-window attention (each token only looks back a fixed number of steps) paired with a rolling-buffer KV cache (a fixed-size memory that overwrites the oldest slot) — long context without the memory bill growing forever. Second, Mixtral's MoE: 8 expert sub-networks, but only 2 fire per token, so a 47B model costs roughly what a 13B model costs to run. Everything else is the standard modern recipe.

New here?

This page assumes the shared transformer recipe. For the generic bricks — embeddings, attention, normalization, the feed-forward block — start with How Open-Source LLMs Are Built, then come back for the parts Mistral changes.

The family so far

From a single 7B model in 2023 to a 675B Mixture-of-Experts flagship in 2025 — but the design DNA stays remarkably consistent.

Timeline: 2023 → 2026

Mistral 7B (Sep 2023) → Mixtral 8x7B (Dec 2023) → Mixtral 8x22B (Apr 2024) → Mistral NeMo 12B (Jul 2024) → Mistral Small 3 (24B, 2025) → the Mistral 3 family (Dec 2025) → Mistral Small 4 (Mar 2026), a single Apache-licensed model that folds reasoning (Magistral), vision (Pixtral) and coding (Devstral) together.

Sizes & shapes: 3B → 675B

From the dense Ministral 3 models (3B / 8B / 14B) up to Mistral Large 3, a sparse MoE with 675B total parameters but only ~41B active per token.

Context length: 8K → 256K

8K for the original sliding-window Mistral 7B (v0.1); 32K for Mistral 7B v0.2+ (which drops SWA) and Mixtral; 128K for NeMo; up to 256K for Mistral Large 3. The later models drop sliding windows to reach this.

License: Apache 2.0

Most weights — 7B, Mixtral, NeMo, Small 3, the Mistral 3 family — ship under permissive Apache 2.0: free to use, modify, and ship commercially.

Signature feature 1 — sliding-window attention + a rolling KV cache

Plain attention lets every token look at every earlier token, so the KV cache grows linearly with sequence length and attention compute grows quadratically. Mistral 7B (v0.1) instead uses a sliding window: each token attends only to the last W = 4096 tokens, which shows up as a diagonal band in the attention mask. (Mixtral, by contrast, uses fully dense 32K attention.) Memory stays flat because the rolling buffer (a fixed-size cache) reuses slots: once it fills, the newest key/value pair overwrites the oldest.
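To make the band and the rolling buffer concrete, here is a minimal pure-Python sketch. The names `sliding_window_mask` and `RollingKVCache` are hypothetical, the window is shrunk to 3 for readability, and it assumes the window counts the current token; real implementations work on tensors rather than lists.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


class RollingKVCache:
    """Fixed-size cache: the entry for position p lives in slot p % window."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window  # each slot holds (position, key, value)

    def append(self, position, key, value):
        # Once the buffer is full, this write replaces the oldest entry.
        self.slots[position % self.window] = (position, key, value)

    def visible(self):
        """Cached entries ordered from oldest to newest position."""
        filled = [s for s in self.slots if s is not None]
        return sorted(filled, key=lambda s: s[0])


mask = sliding_window_mask(seq_len=6, window=3)
cache = RollingKVCache(window=3)
for pos in range(6):
    cache.append(pos, key=f"k{pos}", value=f"v{pos}")
positions = [p for p, _, _ in cache.visible()]
```

After six appends with a window of 3, the cache holds only positions 3, 4 and 5, which are exactly the columns marked True in row 5 of the mask.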

The clever part: information still travels past the window. Stacking layers extends reach. At each layer a token can pull in something from up to W steps back, and the token it pulls from had already absorbed something from W steps before that. After k layers, the effective reach is roughly k × W.
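The "layers × W" rule of thumb can be checked with a few lines of arithmetic. This is a sketch under two assumptions: the window includes the current token, so each layer adds W - 1 positions of reach, and the model has 32 layers (a figure taken from Mistral 7B's published configuration, not stated on this page). The function names are hypothetical.

```python
def earliest_source(position, window, layers):
    """Earliest input position that can influence `position` after `layers` layers."""
    earliest = position
    for _ in range(layers):
        # One layer lets a token read from up to window - 1 positions further back.
        earliest = max(0, earliest - (window - 1))
    return earliest


def receptive_field(window, layers):
    """Upper bound on how many positions back information can travel."""
    return layers * (window - 1)


toy = earliest_source(position=100, window=4, layers=3)
reach_7b = receptive_field(window=4096, layers=32)
```

With W = 4096 and 32 layers the bound comes to 131,040 positions, which is where the "roughly 131K" theoretical span comes from. It is an upper bound on what can flow through the stack, not a promise that the model uses all of it equally well.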

Why it matters

Combined with GQA (grouped-query attention — 32 query heads share just 8 key/value heads), the rolling buffer keeps the KV cache small and constant. That is what made cheap, long-ish context practical on a 7B model.
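A quick back-of-the-envelope calculation shows how much the two tricks save. The sketch below assumes 32 layers and 16-bit cache values (neither is stated on this page), and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    # Factor of 2: one key vector and one value vector per head, per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


MIB = 1024 ** 2

# Rolling buffer capped at W = 4096 tokens, GQA with 8 KV heads.
gqa_rolling = kv_cache_bytes(tokens=4096, layers=32, kv_heads=8, head_dim=128)
# Same window, but one KV head per query head (no GQA).
mha_rolling = kv_cache_bytes(tokens=4096, layers=32, kv_heads=32, head_dim=128)
# GQA, but full attention over a 32K context (no rolling buffer).
gqa_full_32k = kv_cache_bytes(tokens=32768, layers=32, kv_heads=8, head_dim=128)
```

Under these assumptions the rolling GQA cache is 512 MiB per sequence and stays there however long generation runs. Dropping GQA multiplies that by 4, and dropping the window lets it grow to 4 GiB at 32K tokens.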

Signature feature 2 — Mixtral's 8 experts, 2 active

Mixtral replaces each layer's single feed-forward block with 8 experts (8 parallel sub-networks). A tiny router scores all 8 for each token, picks the top 2, runs only those, and blends their outputs by the router's weights. So Mixtral 8x7B holds 47B parameters total but only ~13B do work on any given token.

Different tokens route to different experts — but always exactly 2 of 8 light up.
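The routing step can be sketched in plain Python. This is an illustration under assumptions, not Mixtral's actual code: the experts are toy functions, the names `top2_route` and `moe_layer` are hypothetical, and the blend weights come from a softmax over the two selected logits.

```python
import math


def top2_route(router_logits):
    """Pick the 2 highest-scoring experts and softmax over just those 2 logits."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:2]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_layer(x, router_logits, experts):
    """Blend the outputs of the 2 selected experts; the other experts never run."""
    out = [0.0] * len(x)
    for index, weight in top2_route(router_logits):
        y = experts[index](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out


# Toy experts: expert k simply scales its input by (k + 1).
experts = [lambda x, k=k: [(k + 1) * v for v in x] for k in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
routes = top2_route(logits)
output = moe_layer([1.0, -2.0], logits, experts)
```

Here experts 1 and 4 tie for the highest score, so each gets weight 0.5 and the output is the average of their two results. The other six experts are never called, which is where the compute saving comes from.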

Mixtral 8x7B: 47B total / ~13B active

8 experts, top-2 routing. 47B total parameters, but only ~13B active per token.

Mixtral 8x22B: 141B total / ~39B active

Same recipe scaled up: 8 experts, top-2, 141B total, ~39B active.

Mistral Large 3: 675B total / ~41B active

Granular sparse MoE (many small experts, ~16:1 total-to-active ratio). Exact expert count not disclosed.
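A common question is why "8x7B" adds up to 47B rather than 56B. Only the feed-forward block is replicated per expert, while attention and embeddings are shared. The sketch below counts the large matrices for Mixtral 8x7B. It assumes a hidden size of 4096 (32 heads × 128), 32 layers, the 14336 intermediate size, a 32,000-entry vocabulary and untied input/output embeddings, and it ignores the small normalization weights. The function name is hypothetical.

```python
def mixtral_style_params(hidden, intermediate, layers, q_heads, kv_heads,
                         head_dim, vocab, experts, active_experts):
    # Attention: Q and O use all query heads; K and V use only the KV heads.
    attn = 2 * hidden * q_heads * head_dim + 2 * hidden * kv_heads * head_dim
    expert = 3 * hidden * intermediate   # SwiGLU: gate, up and down matrices
    router = hidden * experts            # one score per expert
    embeddings = 2 * vocab * hidden      # input embedding + output head
    shared = layers * (attn + router) + embeddings
    total = shared + layers * experts * expert
    active = shared + layers * active_experts * expert
    return total, active


total, active = mixtral_style_params(
    hidden=4096, intermediate=14336, layers=32, q_heads=32, kv_heads=8,
    head_dim=128, vocab=32000, experts=8, active_experts=2,
)
total_b = round(total / 1e9, 1)
active_b = round(active / 1e9, 1)
```

Under these assumptions the count lands at about 46.7B total and 12.9B active, in line with the 47B / ~13B figures above.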

vs the shared recipe

Mistral keeps almost all of the standard decoder-only transformer and swaps two bricks. Here is what stays vs what changes.

Keeps from the shared recipe

Decoder-only stack — predict the next token, left to right.
RMSNorm, applied pre-norm (normalize before each sub-block).
RoPE for positions (7B uses rope_theta = 10000; long-context models use a much larger theta).
SwiGLU feed-forward (a SiLU-gated MLP; 7B intermediate size 14336).
GQA throughout (on the 7B: 32 query heads, 8 KV heads, head dim 128).
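The rope_theta entry in the list above is easier to reason about as a set of wavelengths. In the standard RoPE formulation, dimension pair i rotates with frequency theta^(-2i/d), so its wavelength in tokens is 2π × theta^(2i/d). The sketch below compares theta = 10000 with 1e6, the latter being just an example of a larger value; `rope_wavelengths` is a hypothetical helper name.

```python
import math


def rope_wavelengths(head_dim, theta):
    """Wavelength in tokens for each rotated pair of dimensions."""
    return [2 * math.pi * theta ** (2 * i / head_dim)
            for i in range(head_dim // 2)]


short = rope_wavelengths(head_dim=128, theta=10_000.0)
long = rope_wavelengths(head_dim=128, theta=1_000_000.0)
ratio = long[-1] / short[-1]
```

The fastest pair is identical in both cases (a wavelength of 2π tokens), while the slowest pair stretches by close to two orders of magnitude with the larger theta. That longer slowest wavelength is what lets far-apart positions stay distinguishable.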

Changes / trade-offs

Sliding-window attention (window 4096) + rolling-buffer KV cache on Mistral 7B v0.1 — cheap context, but reach beyond the window depends on stacking layers.
Newer models (NeMo, Small 3.1+, the Mistral 3 family) drop SWA for full attention to hit 128K–256K — simpler, but the KV cache grows again.
Sparse MoE (Mixtral, Mistral Large 3) — big capacity, cheap inference, but all experts must sit in memory and routing can imbalance load.
Tokenizer jump: 7B / Mixtral use a 32,000-token SentencePiece BPE vocabulary; NeMo onward use the ~131K "Tekken" tokenizer (tiktoken-style BPE) for 100+ languages.
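The memory side of the MoE trade-off in the list above is simple arithmetic. The sketch below estimates the space needed for the weights alone, ignoring the KV cache, activations and any quantization overhead. `weight_memory_gb` is a hypothetical helper, and the 4-bit row is an idealized figure.

```python
def weight_memory_gb(total_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


mixtral_fp16 = weight_memory_gb(47e9, 16)     # all 8 experts must be resident
dense_13b_fp16 = weight_memory_gb(13e9, 16)   # what the active count might suggest
mixtral_4bit = weight_memory_gb(47e9, 4)      # idealized 4-bit quantization
```

At 16 bits per weight, Mixtral 8x7B needs about 94 GB for weights, not the 26 GB a dense 13B model would take. Even an idealized 4-bit version sits around 23.5 GB.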

Gotchas / good to know

Read before you build on it

Active parameters are not your VRAM budget. A sparse MoE only computes with ~13B parameters per token, but you must load all 47B of weights into memory: the savings are in compute, not storage.
Sliding window is a Mistral 7B v0.1 thing. Don't assume the rest of the family uses it; Mixtral, 7B v0.2+, NeMo, Small 3, and the Mistral 3 models use full attention instead.
RoPE theta matters for long context. The 7B's rope_theta = 10000 is the standard small default; longer-context models are trained with a much larger theta (e.g. 1e6) to keep positions well-resolved at long range. Running a checkpoint with the wrong theta breaks position handling.
Two different tokenizers. A prompt's token count and any tokenizer-specific code differ between the old 32K SentencePiece vocab and the newer 131K Tekken vocab.
Router load balance. MoE training usually adds an auxiliary loss so experts get used evenly (DeepSeek instead uses aux-loss-free bias balancing); left alone, a few experts hog all the tokens and capacity is wasted.
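For the load-balance point, here is a sketch of one widely used auxiliary loss, in the style popularized by Switch Transformer: the number of experts times the sum over experts of f_i × P_i, where f_i is the share of routing slots expert i receives and P_i is its mean router probability. It is shown as a generic illustration, not as the exact loss any Mistral model was trained with, and all function names are hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balance_loss(batch_logits, top_k=2):
    """num_experts * sum_i f_i * P_i; equals 1.0 when routing is perfectly even."""
    num_experts = len(batch_logits[0])
    counts = [0] * num_experts
    prob_sums = [0.0] * num_experts
    for logits in batch_logits:
        probs = softmax(logits)
        ranked = sorted(range(num_experts), key=lambda i: logits[i], reverse=True)
        for i in ranked[:top_k]:
            counts[i] += 1
        prob_sums = [s + p for s, p in zip(prob_sums, probs)]
    tokens = len(batch_logits)
    f = [c / (tokens * top_k) for c in counts]  # share of routing slots per expert
    p = [s / tokens for s in prob_sums]         # mean router probability per expert
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


def one_pair(first, num_experts=8, high=5.0):
    """Router logits for a token that strongly prefers experts first and first + 1."""
    return [high if i in (first, first + 1) else 0.0 for i in range(num_experts)]


balanced = load_balance_loss([one_pair(0), one_pair(2), one_pair(4), one_pair(6)])
collapsed = load_balance_loss([one_pair(0)] * 4)
```

When the four toy tokens spread across all eight experts the loss sits at its floor of 1.0. When every token picks the same two experts it rises to nearly 4, so adding this term to the training loss pushes the router back toward even usage.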