Dropout (2014) — Training with Random Amnesia
The world before this paper
In the early 2010s, the best neural networks had a dirty habit: they cheated. Give a deep net enough parameters and it stops learning patterns and starts memorizing answers — training accuracy soars while test accuracy sinks. Everyone knew the reliable cure: train many independent models and average them. Everyone also knew you couldn't afford it. Training one large net took days on the GPUs of the era; training fifty was a fantasy. Worse, the neurons inside a single net were quietly conspiring with each other — and that conspiracy was the heart of the problem.
- Big networks nailed their training sets and flopped on new data: great train accuracy, poor test accuracy.
- The proven fix (many independently trained nets, averaged) cost far too much at neural-net scale.
- Neurons learn to fix each other's mistakes during training, producing fragile features that don't survive new data.
The key idea
The fix came out of Geoffrey Hinton's lab in Toronto, from Nitish Srivastava, Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov — "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", JMLR 2014. Their bet sounded like sabotage: on every training step, flip a coin for each neuron and silence the losers. A neuron that might vanish at any moment can't lean on a partner to cover for it — it has to learn features that are useful on their own. The paper even nods to evolution for intuition: genes get shuffled into random new combinations every generation, so the ones that survive are the ones that work with any teammate.
The trick had already proven itself in the wild — it kept the giant fully-connected layers of AlexNet from drowning in their own parameters during the 2012 ImageNet run. The 2014 paper is the full story: the recipe, the intuition, and the receipts.
During training, silence each neuron independently at random (typically with probability 0.5 for hidden units) so no neuron can rely on its neighbors. Every step trains a different "thinned" subnetwork, and the full net at test time, with its weights scaled down, approximates an average of exponentially many of them.
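To make the coin flip concrete, here is a minimal sketch of one dropout step on a single layer's activations. It is an illustration under stated assumptions, not the paper's code: the function name `dropout_train`, the example activation values, and the fixed seed are all invented here, and it uses only Python's standard library.

```python
import random

def dropout_train(activations, p_drop=0.5, rng=random):
    """Silence each activation independently with probability p_drop.

    Returns the thinned activations and the 0/1 mask that was sampled.
    (Hypothetical helper for illustration; names are not from the paper.)
    """
    # One coin flip per neuron: 0 means silenced, 1 means kept.
    mask = [0 if rng.random() < p_drop else 1 for _ in activations]
    thinned = [a * m for a, m in zip(activations, mask)]
    return thinned, mask

# Made-up activations for a six-unit hidden layer.
hidden = [0.7, 1.2, 0.3, 0.9, 1.5, 0.4]

# Seeded generator so the run is repeatable; the seed is arbitrary.
rng = random.Random(0)

# Two training steps draw two independent masks, so each step
# generally trains a different thinned subnetwork.
step1, mask1 = dropout_train(hidden, rng=rng)
step2, mask2 = dropout_train(hidden, rng=rng)

print("mask 1:", mask1, "->", step1)
print("mask 2:", mask2, "->", step2)
```

This sketch leaves the kept activations unscaled during training and relies on scaling the weights at test time. Many libraries instead rescale the kept activations during training ("inverted dropout"), so check which convention your framework uses.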
Want the full mechanics — masks, scaling, where to place it? See Dropout mechanics.
The results that mattered
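The evidence was blunt. Dropout improved results across vision, speech recognition, document classification, and computational biology, reaching the state of the art on many benchmarks of the day. The price was patience: the random masks make the gradients noisy, and the paper reports that a dropout network typically takes 2-3 times longer to train than a standard network of the same architecture.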
- About half the hidden units are silenced on every training step (drop probability 0.5); input units are kept more often, with probability around 0.8–0.9.
- Thinned subnetworks are implicitly averaged at test time: all neurons on, weights scaled down by the keep probability. An approximate ensemble for free.
- Slower convergence: noisy gradients mean more steps. That is the cost of the regularization.
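The "ensemble for free" claim can be checked by hand on a toy case. The sketch below enumerates every thinned subnetwork of a single linear unit, weights each one by the probability of drawing its mask, and compares that average with the full unit run on scaled weights. The function names, weights, and inputs are made up for illustration. The match is exact here only because the unit is linear; for a deep nonlinear network, weight scaling is an approximation to the ensemble average.

```python
from itertools import product

def unit_output(weights, inputs, mask):
    """Output of one linear unit when only the masked-in inputs are kept."""
    return sum(w * x * m for w, x, m in zip(weights, inputs, mask))

def exact_ensemble_average(weights, inputs, p_keep):
    """Average over all 2**n masks, each weighted by its probability."""
    n = len(weights)
    total = 0.0
    for mask in product((0, 1), repeat=n):
        kept = sum(mask)
        # Probability of this exact mask under independent coin flips.
        prob = (p_keep ** kept) * ((1 - p_keep) ** (n - kept))
        total += prob * unit_output(weights, inputs, mask)
    return total

def scaled_full_output(weights, inputs, p_keep):
    """Test-time shortcut: keep every input, scale each weight by p_keep."""
    return sum((p_keep * w) * x for w, x in zip(weights, inputs))

# Hypothetical values chosen only for the demonstration.
weights = [0.5, -1.0, 2.0, 0.25]
inputs = [1.0, 2.0, 0.5, 4.0]
p_keep = 0.5

ensemble = exact_ensemble_average(weights, inputs, p_keep)
scaled = scaled_full_output(weights, inputs, p_keep)
num_subnetworks = 2 ** len(weights)

print(f"{num_subnetworks} subnetworks averaged: {ensemble:.6f}")
print(f"one pass with scaled weights: {scaled:.6f}")
```

With only four inputs the full enumeration covers 16 subnetworks. At realistic layer widths the count grows as 2^n, which is why the single scaled pass is the practical substitute for explicit averaging.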
Legacy — and the catch
Legacy:
- Dead-simple, one hyperparameter, works across domains
- Made very large nets trainable on modest datasets
- The "ensemble for free" framing reshaped how we think about regularization
The catch:
- Slows training; interacts awkwardly with batch norm
- Needs tuning per architecture (where, and how much)
- Web-scale pretraining often drops dropout entirely
Read the original JMLR 2014 paper — it's unusually readable. Then revisit the concepts it leans on: Dropout mechanics, Overfitting vs underfitting, Ensemble learning, and Early stopping — the older cure dropout often replaced. Today dropout is rarer in massive pretraining (the data itself is the regularizer) but still routine in fine-tuning and smaller models. Next paper: Batch Normalization (2015).