Mixture of Experts (MoE)
More parameters, same compute
Bigger models tend to be more capable but slower, because in a dense model every token passes through every weight. Mixture of Experts breaks that link. It replaces one big feed-forward layer with many smaller experts, and a small router sends each token to just a couple of them. The model can hold a very large number of parameters while only a small slice of them is active for any one token.
The experts are many feed-forward networks; different experts specialize in different patterns.
The router scores the experts for each token and routes it to the best one or two.
Activation is sparse: only k of N experts run for each token, so per-token compute stays low even as the total parameter count grows.
Route, activate, combine
Follow one token: the router scores all the experts, the top two light up, and the rest stay dark. The outputs of the two active experts are then blended using the router's weights.
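The three steps above can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model's implementation: the sizes, the random weights, and the helper names (`make_expert`, `moe_forward`) are all made up for the example. It also assumes the gate weights are a softmax over only the selected experts' scores; some designs instead apply softmax over all experts before picking the top k.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def make_expert(rng, d_model, d_hidden):
    """Hypothetical toy expert: a two-layer feed-forward network with ReLU."""
    w_in = [[rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.gauss(0, 0.5) for _ in range(d_hidden)] for _ in range(d_model)]

    def expert(x):
        hidden = [max(0.0, h) for h in matvec(w_in, x)]
        return matvec(w_out, hidden)

    return expert

def moe_forward(x, router_weights, experts, k=2):
    # 1. Route: the router produces one score (logit) per expert.
    logits = matvec(router_weights, x)
    # 2. Activate: keep only the k highest-scoring experts.
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    # Assumption: gate weights are a softmax over the selected scores only.
    gates = softmax([logits[i] for i in top])
    # 3. Combine: weighted sum of the selected experts' outputs.
    out = [0.0] * len(x)
    for gate, i in zip(gates, top):
        y = experts[i](x)  # the other experts are never evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out, top, gates

rng = random.Random(0)
D_MODEL, D_HIDDEN, N_EXPERTS = 4, 8, 8  # illustrative sizes
experts = [make_expert(rng, D_MODEL, D_HIDDEN) for _ in range(N_EXPERTS)]
router_weights = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, chosen, gates = moe_forward(token, router_weights, experts)
print("experts used:", chosen, "gate weights:", [round(g, 3) for g in gates])
```

Only two of the eight expert functions are ever called for this token, which is where the compute saving comes from.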
Why it matters — and the catch
Why it matters:
- Huge parameter count, modest inference cost
- Experts specialize → better quality per FLOP
- Used in several prominent LLMs (for example Mixtral and DeepSeek-V3)
The catch:
- All experts must sit in memory (VRAM-hungry)
- Routing can become unbalanced (a few experts get most tokens while others sit idle)
- Trickier to train and serve
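To make the imbalance problem concrete, here is a minimal sketch of how routing load might be measured. The function names are hypothetical. The loss follows the auxiliary load-balancing term described for the Switch Transformer, N · Σ f_i · P_i, where f_i is the fraction of tokens sent to expert i and P_i is the mean router probability for expert i. For simplicity the demo assumes P_i equals f_i, which is not true in general.

```python
from collections import Counter

def expert_load(assignments, n_experts):
    """Fraction of routed tokens that each expert received."""
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(i, 0) / total for i in range(n_experts)]

def load_balance_loss(load, mean_router_prob):
    """Auxiliary loss N * sum(f_i * P_i); it is 1.0 when both are uniform."""
    n = len(load)
    return n * sum(f * p for f, p in zip(load, mean_router_prob))

N_EXPERTS = 4

# Balanced routing: tokens are spread evenly over the experts.
balanced = [i % N_EXPERTS for i in range(1000)]
# Skewed routing: expert 0 receives 70% of the tokens.
skewed = [0] * 700 + [1] * 100 + [2] * 100 + [3] * 100

balanced_load = expert_load(balanced, N_EXPERTS)
skewed_load = expert_load(skewed, N_EXPERTS)

# Simplifying assumption for the demo: router probabilities match the load.
balanced_loss = load_balance_loss(balanced_load, balanced_load)
skewed_loss = load_balance_loss(skewed_load, skewed_load)

print("balanced load:", balanced_load, "loss:", round(balanced_loss, 2))
print("skewed load:  ", skewed_load, "loss:", round(skewed_loss, 2))
```

The loss grows as routing concentrates on fewer experts, so adding a small multiple of it to the training objective nudges the router toward spreading tokens more evenly.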
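A model advertised as, say, 8 experts × 7B parameters has a big total size, but with top-2 routing only about two experts' worth of feed-forward compute runs per token. Because the experts typically replace only the feed-forward layers while attention and embeddings are shared, the true total is usually somewhat below the headline product. That gap between total and active parameters is the whole point of MoE.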
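The total-versus-active arithmetic can be written out as a short sketch. The parameter counts below are invented round numbers chosen for illustration, not the specification of any real model, and `moe_param_counts` is a hypothetical helper. It assumes the parameters split cleanly into a shared part (attention, embeddings, router) and identical per-expert feed-forward parts.

```python
def moe_param_counts(shared_params, params_per_expert, n_experts, top_k):
    """Return (total, active-per-token) parameter counts for a simple MoE model."""
    total = shared_params + n_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active

# Hypothetical figures, summed over all layers of the model.
SHARED = 1_500_000_000      # attention, embeddings, router
PER_EXPERT = 5_500_000_000  # one expert's feed-forward weights
N_EXPERTS, TOP_K = 8, 2

total, active = moe_param_counts(SHARED, PER_EXPERT, N_EXPERTS, TOP_K)
print(f"total:  {total / 1e9:.1f}B parameters held in memory")
print(f"active: {active / 1e9:.1f}B parameters used per token")
print(f"active fraction: {active / total:.0%}")
```

Under these assumptions the memory footprint follows the total count, while per-token compute follows the active count.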