Dropout (2014) — Training with Random Amnesia

The world before this paper

In the early 2010s, the best neural networks had a dirty habit: they cheated. Give a deep net enough parameters and it stops learning patterns and starts memorizing answers — training accuracy soars while test accuracy sinks. Everyone knew the reliable cure: train many independent models and average them. Everyone also knew you couldn't afford it. Training one large net took days on the GPUs of the era; training fifty was a fantasy. Worse, the neurons inside a single net were quietly conspiring with each other — and that conspiracy was the heart of the problem.

Overfitting: memorize, not learn

Big networks nailed their training sets and flopped on new data: great train accuracy, poor test accuracy.

Ensembles: too expensive

The proven fix (many independently trained nets, averaged) cost far too much at neural-net scale.

Co-adaptation: brittle features

Neurons learn to fix each other's mistakes during training, producing fragile features that don't survive new data.

The key idea

The fix came out of Geoffrey Hinton's lab in Toronto, from Nitish Srivastava, Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov — "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", JMLR 2014. Their bet sounded like sabotage: on every training step, flip a coin for each neuron and silence the losers. A neuron that might vanish at any moment can't lean on a partner to cover for it — it has to learn features that are useful on their own. The paper even nods to evolution for intuition: genes get shuffled into random new combinations every generation, so the ones that survive are the ones that work with any teammate.

The trick had already proven itself in the wild — it kept the giant fully-connected layers of AlexNet from drowning in their own parameters during the 2012 ImageNet run. The 2014 paper is the full story: the recipe, the intuition, and the receipts.

The paper in one sentence

During training, silence each hidden neuron at random, typically with probability 0.5, so no neuron can rely on its neighbors; every step trains a different "thinned" subnetwork, and the full net at test time, with its weights scaled down to compensate, approximates an average of exponentially many of them.
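Written out as equations, one layer's training-time forward pass and the test-time rule look roughly like this. Here p is the probability that a unit is kept (so p = 0.5 matches the drop probability above), y^(l) is the output of layer l, f is the activation function, and ⊙ is the element-wise product. The symbol choices are an assumption based on common convention and should be checked against the paper's own notation.

```latex
\begin{aligned}
r_j^{(l)} &\sim \mathrm{Bernoulli}(p), \\
\tilde{\mathbf{y}}^{(l)} &= \mathbf{r}^{(l)} \odot \mathbf{y}^{(l)}, \\
z_i^{(l+1)} &= \mathbf{w}_i^{(l+1)} \tilde{\mathbf{y}}^{(l)} + b_i^{(l+1)}, \\
y_i^{(l+1)} &= f\bigl(z_i^{(l+1)}\bigr), \\
W_{\text{test}}^{(l)} &= p \, W^{(l)}.
\end{aligned}
```

The first four lines describe training, where a fresh mask r is sampled on every step; the last line is the test-time rule, where no units are dropped and the weights are scaled by p instead.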

Want the full mechanics — masks, scaling, where to place it? See Dropout mechanics.

Watch a network train with amnesia

Step through it: a full forward pass, two training steps under two different random masks, the test-time trick, and the payoff in the error curves.
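The same walk-through can be roughly sketched in NumPy. This is a minimal illustration, not the paper's code: the two-layer ReLU net, the layer sizes, the seed, and the helper name `forward` are all hypothetical, and only forward passes are shown (no backpropagation or weight updates).

```python
import numpy as np

def forward(x, W1, W2, mask=None, scale=1.0):
    """Tiny two-layer ReLU net (hypothetical). `mask` zeroes hidden units;
    `scale` rescales hidden activations, which is equivalent to scaling
    the outgoing weights W2 by the same factor."""
    h = np.maximum(0.0, x @ W1)   # hidden activations
    if mask is not None:
        h = h * mask              # training: silence the dropped units
    return (h * scale) @ W2

rng = np.random.default_rng(0)    # arbitrary seed, for repeatability only
p_keep = 0.5                      # keep probability for hidden units
x = rng.normal(size=(1, 4))
W1 = rng.normal(size=(4, 6))
W2 = rng.normal(size=(6, 2))

# 1. Full forward pass: every hidden unit active, no dropout.
full = forward(x, W1, W2)

# 2. and 3. Two training steps, each with its own freshly sampled mask,
# so each step runs a different "thinned" subnetwork.
mask_a = rng.random(6) < p_keep
mask_b = rng.random(6) < p_keep
step_a = forward(x, W1, W2, mask=mask_a)
step_b = forward(x, W1, W2, mask=mask_b)

# 4. Test-time trick: all units on, activations scaled by p_keep.
test = forward(x, W1, W2, scale=p_keep)

print("mask A:", mask_a.astype(int), "mask B:", mask_b.astype(int))
print("full:", full, "test:", test)
```

Because scaling happens after the ReLU in this sketch, the test-time output is exactly `p_keep` times the full output; in a deeper net the scaling would be applied at every layer that used dropout.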

The results that mattered

The evidence was blunt. Dropout improved results on benchmarks in vision, speech recognition, document classification, and computational biology, setting the state of the art on many of them. The price was patience: the gradients get noisy, so training takes longer to converge.

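Drop rate: 0.5

Each hidden unit is silenced with probability 0.5 on every training step, so about half the network is off at any moment; input units are kept more often, with probability ~0.8–0.9.

Implicit ensemble: 2ⁿ

A net with n droppable units contains 2ⁿ thinned subnetworks, implicitly averaged at test time by turning all neurons on and scaling the weights by the keep probability. An ensemble for free, or nearly: the average is approximate.

The price: 2–3×

Slower convergence: noisy gradients mean more steps, so training typically takes 2–3× longer than the same net without dropout. The cost of the regularization.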
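The "ensemble for free" claim can be checked by brute force in the one case where it holds exactly: a single linear unit with keep probability 0.5. The sketch below enumerates every one of the 2ⁿ masks, averages the thinned outputs, and compares that with one pass through the unit with its weights halved. The sizes, the seed, and the variable names are hypothetical; for deep nonlinear networks the scaled-weight pass is only an approximation to the true average.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed
n = 4                            # number of droppable units feeding the output
x = rng.normal(size=n)           # activations of those units
w = rng.normal(size=n)           # weights of one linear output unit

# Every possible on/off pattern over n units: 2^n thinned subnetworks.
masks = np.array(list(itertools.product([0.0, 1.0], repeat=n)))

# Output of each thinned subnetwork. With keep probability 0.5 all masks
# are equally likely, so the ensemble average is a plain mean.
outputs = (masks * x) @ w
ensemble_mean = outputs.mean()

# Test-time shortcut: all units on, weights scaled by the keep probability.
scaled = (0.5 * w) @ x

print(len(masks), ensemble_mean, scaled)
```

The two numbers agree up to floating-point error because each unit is present in exactly half of the masks, so its contribution to the mean is halved.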

Legacy — and the catch

What it unlocked

Dead-simple, one hyperparameter, works across domains
Made very large nets trainable on modest datasets
The "ensemble for free" framing reshaped how we think about regularization

The limits

Slows training; interacts awkwardly with batch norm
Needs tuning per architecture (where, and how much)
Web-scale pretraining often drops dropout entirely

Go deeper

Read the original JMLR 2014 paper — it's unusually readable. Then revisit the concepts it leans on: Dropout mechanics, Overfitting vs underfitting, Ensemble learning, and Early stopping — the older cure dropout often replaced. Today dropout is rarer in massive pretraining (the data itself is the regularizer) but still routine in fine-tuning and smaller models. Next paper: Batch Normalization (2015).