ResNet (2015) — Going Deeper with Shortcuts
The world before this paper
By 2015, deep learning had hit a wall that made no sense: adding layers made networks worse. AlexNet had 8 layers, VGG pushed to 19, and everyone assumed the road to better accuracy was simply deeper. Then the experiments came back. On CIFAR-10, a plain 56-layer network lost to a 20-layer one, not just on test data but on training error, the one number that extra capacity should only ever improve.
Past VGG's 19 layers, plain stacks got worse. A 56-layer net couldn't even fit the training set as well as its 20-layer cousin.
It wasn't overfitting — training error itself rose. And it wasn't vanishing gradients alone — BatchNorm was already in play, keeping signals healthy.
The paper's hypothesis: deep stacks of layers struggle to learn even the identity function. "Just pass the input through" is surprisingly hard for a pile of nonlinear layers.
The key idea
At Microsoft Research, Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun asked a sneaky question: if the extra layers of a 56-layer net only need to do nothing to match the 20-layer one, why is doing nothing so hard? Their bet (He, Zhang, Ren & Sun — "Deep Residual Learning for Image Recognition", CVPR 2016): stop asking each block to learn a full mapping. Let it learn only the residual — the small correction on top of its input — and wire the input straight to the output with a skip connection.
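In the paper's notation, the reformulation can be written in one line. Here H(x) stands for the mapping a few stacked layers are meant to fit, F(x) for what those layers actually learn, and x and y for the block's input and output:

```latex
\[
\mathcal{F}(x) := \mathcal{H}(x) - x
\qquad\Longrightarrow\qquad
y = \mathcal{F}(x) + x
\]
```

When the best mapping is close to identity, the layers only need to push F(x) toward zero rather than reproduce x through several nonlinearities.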
A residual block is just two or three layers plus an identity shortcut; the deepest nets use bottleneck blocks (1×1 → 3×3 → 1×1) to stay cheap. An identity shortcut adds no parameters and almost no compute, only an element-wise addition (a 1×1 projection is needed just where the dimensions change). It only changes what the layers have to learn, and that changed everything.
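To see why the bottleneck shape stays cheap, it helps to count weights. The sketch below is a back-of-the-envelope calculation, not code from the paper: the helper names are made up for illustration, biases and batch norm are ignored, and the 256 → 64 → 256 widths are one example configuration. The comparison block (two 3×3 convolutions kept at the full 256 channels) is a hypothetical baseline.

```python
def conv_params(kernel, c_in, c_out):
    """Weights in a kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def full_width_block_params(channels):
    # Hypothetical baseline: two 3x3 convolutions at full width.
    return 2 * conv_params(3, channels, channels)


def bottleneck_block_params(channels, reduced):
    # 1x1 reduces the width, 3x3 works at the reduced width,
    # and a second 1x1 restores the original width.
    return (conv_params(1, channels, reduced)
            + conv_params(3, reduced, reduced)
            + conv_params(1, reduced, channels))


full_width = full_width_block_params(256)
bottleneck = bottleneck_block_params(256, 64)
print(full_width, bottleneck, round(full_width / bottleneck, 1))
```

The expensive 3×3 convolution runs on only 64 channels, so under these assumptions the bottleneck block needs a small fraction of the weights of the full-width pair.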
Make every block learn only the residual, the difference from identity. With a skip connection the block's output is F(x) + x, so "do nothing" becomes the easy default (F(x) = 0), and gradients flow backwards through the shortcut untouched.
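Both claims can be checked on a toy model. The sketch below is an illustration under loud assumptions, not the paper's architecture: each "block" acts on a single scalar, its branch is F(x) = w2 * relu(w1 * x), and all function names are hypothetical. It compares a stack of plain blocks (output F(x)) with a stack of residual blocks (output F(x) + x), first with all-zero weights and then with small weights, using a finite-difference estimate of the input-to-output slope as a stand-in for the backward gradient.

```python
def relu(v):
    return max(0.0, v)


def branch(x, w1, w2):
    """Toy branch F(x) = w2 * relu(w1 * x) on a scalar (an assumption)."""
    return w2 * relu(w1 * x)


def plain_block(x, w1, w2):
    return branch(x, w1, w2)        # output is F(x)


def residual_block(x, w1, w2):
    return branch(x, w1, w2) + x    # output is F(x) + x


def run_stack(block, x, depth, w1, w2):
    """Apply the same block `depth` times in sequence."""
    for _ in range(depth):
        x = block(x, w1, w2)
    return x


def slope(block, x, depth, w1, w2, eps=1e-6):
    """Central finite-difference estimate of d(output)/d(input)."""
    hi = run_stack(block, x + eps, depth, w1, w2)
    lo = run_stack(block, x - eps, depth, w1, w2)
    return (hi - lo) / (2 * eps)


# With all-zero weights, the plain stack erases its input,
# while the residual stack passes it through unchanged.
plain_zero = run_stack(plain_block, 1.5, 50, 0.0, 0.0)        # 0.0
residual_zero = run_stack(residual_block, 1.5, 50, 0.0, 0.0)  # 1.5

# With small weights, each plain block scales the signal by about 0.01,
# so the slope through 50 blocks collapses; each residual block scales
# it by about 1.01, so the slope stays on the order of 1.
plain_slope = slope(plain_block, 1.0, 50, 0.1, 0.1)
residual_slope = slope(residual_block, 1.0, 50, 0.1, 0.1)
print(plain_zero, residual_zero, plain_slope, residual_slope)
```

In this toy, each residual block contributes a factor of 1 + F'(x) to the slope, so the shortcut's 1 survives even when F'(x) is tiny; a plain block contributes only F'(x).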
Want the full mechanics? See Classic CNN architectures.
The results that mattered
ResNet didn't edge out the competition at ILSVRC 2015 — it lapped it, and then kept winning everywhere else.
152 layers: 8× deeper than VGG, yet, thanks to bottleneck blocks, lower compute than VGG-19 (11.3 vs 19.6 billion FLOPs).
3.57%: the winning top-5 error at ILSVRC 2015, from an ensemble of residual nets, past the oft-quoted ≈5.1% human baseline.
First place in ILSVRC 2015 classification, ImageNet detection and localization, and COCO detection and segmentation that year.
Legacy — and the catch
- Made depth trainable: networks of 100+ layers fit the training set better than shallow ones, and the degradation problem largely disappeared
- Residual connections became a universal design pattern (transformers included)
- Still a workhorse backbone for vision a decade later
- Depth ≠ understanding: the paper's own 1202-layer CIFAR-10 model trained fine but tested worse than its 110-layer one
- Why residuals work so well took years of theory to (partially) explain
- Vision's frontier eventually moved to attention (ViT) anyway
Read the original: arXiv:1512.03385. For the mechanics behind the story, see Classic CNN architectures, Vanishing & exploding gradients and Backpropagation. Next paper: Attention Is All You Need (2017).