Mixture of Experts (MoE) · Suman Bhadra Notes

More parameters, same compute

Bigger models tend to be more capable but slower, because in a dense model every token passes through every weight. Mixture of Experts breaks that link. It replaces one big feed-forward layer with many smaller experts, and a tiny router sends each token to just a couple of them. The model can hold a huge total parameter count while only a small slice of it runs per token.

Experts: parallel sub-nets

Many feed-forward networks; different experts specialize in different patterns.

Router (gate): picks top-k

Scores every expert for each token and routes the token to the k highest-scoring ones, typically one or two.

Sparse: few active

Only k of N experts run, so compute stays low even as total parameters explode.

Route, activate, combine

Follow one token: the router scores all the experts, the top two light up, the rest stay dark, and the outputs of the two active experts are blended using the router's weights.
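The route, activate, combine steps can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the names `route_top_k` and `moe_forward`, the toy experts, and the router logits are all hypothetical, and renormalizing the top-k weights so they sum to 1 is an assumption (implementations differ on this detail).

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Route: return [(expert_index, weight)] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # assumption: renormalize over the chosen k
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_logits, k=2):
    """Activate and combine: run only the chosen experts and blend their outputs."""
    out = [0.0] * len(x)
    for i, w in route_top_k(router_logits, k):
        y = experts[i](x)  # experts that were not chosen are never called
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy experts (hypothetical): each one just scales its input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, 0.3, 1.0]  # made-up scores for a single token

chosen = route_top_k(router_logits)
output = moe_forward([1.0, -1.0], experts, router_logits)
print(chosen)
print(output)
```

In a real layer the router logits would come from a small learned linear map applied to the token's hidden state, and each expert would be a full feed-forward network rather than a scaling function.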

Why it matters — and the catch

Wins

Huge parameter count, modest inference cost
Experts specialize → better quality per FLOP
Behind many frontier LLMs

Costs

All experts must sit in memory (VRAM-hungry)
Routing can become unbalanced (some experts idle)
Trickier to train and serve
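To make the imbalance cost concrete, one simple diagnostic is the fraction of routed token slots each expert receives. The sketch below is illustrative only: `expert_load` is a hypothetical helper and the assignments are made-up data, not measurements from any model.

```python
from collections import Counter

def expert_load(assignments, n_experts):
    """Return the fraction of routed token slots that each expert received."""
    counts = Counter(e for token in assignments for e in token)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_experts
    return [counts.get(i, 0) / total for i in range(n_experts)]

# Hypothetical top-2 assignments for six tokens over four experts.
assignments = [(0, 1), (0, 1), (0, 2), (0, 1), (1, 0), (0, 1)]

load = expert_load(assignments, n_experts=4)
idle = [i for i, f in enumerate(load) if f == 0.0]
print(load)
print(idle)
```

With perfectly balanced routing every entry would be 1/N. Here expert 0 takes half of all slots while expert 3 receives none, which is the "some experts idle" failure described in the list above.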

"8×7B" doesn't mean 56B active

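A model advertised as, say, 8 experts × 7B parameters has a big total size, but with top-2 routing only about 2 experts' worth of compute runs per token. The total is also usually somewhat below 8 × 7B = 56B, because only the feed-forward blocks are replicated per expert while attention layers and embeddings are shared. That gap between total and active parameters is the whole point of MoE.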
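The total-versus-active arithmetic can be written out as a short calculation. This sketch assumes the parameters split into a shared part (attention, embeddings) that every token uses and a per-expert part of which only the top-k run. The function name `moe_param_counts` and the sizes below are hypothetical round numbers chosen for illustration, not the figures of any real model.

```python
def moe_param_counts(shared, per_expert, n_experts, top_k):
    """Return (total, active) parameter counts for a simplified MoE model."""
    total = shared + n_experts * per_expert   # everything that must sit in memory
    active = shared + top_k * per_expert      # what actually runs for one token
    return total, active

# Illustrative sizes only (assumption): 1.5B shared, 5.5B per expert.
total, active = moe_param_counts(
    shared=1_500_000_000,
    per_expert=5_500_000_000,
    n_experts=8,
    top_k=2,
)
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

Under these assumed sizes the model stores 45.5B parameters but uses 12.5B per token, so memory scales with the total while per-token compute scales with the active count.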