ResNet (2015) — Going Deeper with Shortcuts

The world before this paper

By 2015, deep learning had hit a wall that seemed to make no sense: adding layers made networks worse. AlexNet had 8 layers, VGG pushed to 19, and everyone assumed the road to better accuracy was simply deeper. Then the experiments came back. On CIFAR-10, a plain 56-layer network lost to a 20-layer one, and not just on test data but on training error, the one number that extra capacity should never make worse.

The degradation problem: 56 < 20

Past VGG's 19 layers, plain stacks got worse. A 56-layer net couldn't even fit the training set as well as its 20-layer cousin.

Not the usual suspects: ruled out

It wasn't overfitting — training error itself rose. And it wasn't vanishing gradients alone — BatchNorm was already in play, keeping signals healthy.

The suspicion: identity is hard

Deep stacks of layers struggle to learn even the identity function. "Just pass the input through" is surprisingly hard for a pile of nonlinear layers.

The key idea

At Microsoft Research, Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun asked a sneaky question: if the extra layers of a 56-layer net only need to do nothing to match the 20-layer one, why is doing nothing so hard? Their bet (He, Zhang, Ren & Sun — "Deep Residual Learning for Image Recognition", CVPR 2016): stop asking each block to learn a full mapping. Let it learn only the residual — the small correction on top of its input — and wire the input straight to the output with a skip connection.

A residual block is just two or three layers plus an identity shortcut; the deepest nets use bottleneck blocks (1×1 → 3×3 → 1×1) to stay cheap. An identity shortcut adds no parameters and only a negligible element-wise addition; where the dimensions change, the paper matches them with zero-padding or a 1×1 projection. The trick changes only what the layers have to learn, and that changed everything.
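To make the wiring concrete, here is a minimal sketch of a residual block on plain Python lists. It assumes small fully connected layers in place of the paper's convolutions and leaves out batch normalization; the names `relu`, `linear`, `plain_block` and `residual_block` are illustrative, not taken from the paper.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(weights, v):
    # weights is a list of rows; returns the matrix-vector product.
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def plain_block(w1, w2, x):
    # Two stacked layers must produce the whole output on their own.
    return relu(linear(w2, relu(linear(w1, x))))

def residual_block(w1, w2, x):
    # The same two layers compute F(x); the shortcut adds x back
    # before the final ReLU.
    f = linear(w2, relu(linear(w1, x)))
    return relu([fi + xi for fi, xi in zip(f, x)])

zeros = [[0.0] * 3 for _ in range(3)]
x = [1.0, 2.0, 3.0]  # non-negative, as activations after a ReLU would be

print(plain_block(zeros, zeros, x))     # [0.0, 0.0, 0.0]
print(residual_block(zeros, zeros, x))  # [1.0, 2.0, 3.0]
```

With all weights at zero the plain block wipes out its input, while the residual block passes it through unchanged. In this toy setting, that is the sense in which "doing nothing" becomes the default rather than something the layers must learn.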

The paper in one sentence

Make every block learn only the residual — the difference from identity — so that with a skip connection the output is F(x) + x, "do nothing" becomes the easy default, and gradients flow backwards through the shortcut untouched.
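The gradient claim follows from one application of the chain rule. As a sketch, write a block as y = F(x) + x and the training loss as L (symbols assumed here for illustration; the ReLU applied after the addition is ignored for simplicity):

```latex
y = F(x) + x
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x}
= \frac{\partial \mathcal{L}}{\partial y}
  \left( I + \frac{\partial F}{\partial x} \right)
= \frac{\partial \mathcal{L}}{\partial y}
+ \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}
```

The first term is the upstream gradient passed through the shortcut as is, so it reaches x however small the contribution from F becomes.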

Want the full mechanics? See Classic CNN architectures.

Watch the shortcut work

Five scenes: the bar chart that embarrassed deep nets, a residual block drawing its skip arc, why "nothing" becomes free, the gradient highway back to layer 1, and the depth it unlocked.

The results that mattered

ResNet didn't edge out the competition at ILSVRC 2015 — it lapped it, and then kept winning everywhere else.

Depth: 152 layers

8× deeper than VGG — yet, thanks to bottleneck blocks, lower compute than VGG-19.
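Both numbers can be checked with a little arithmetic. The sketch below assumes the stage layout from the paper's architecture table (3, 8, 36 and 3 bottleneck blocks for the 152-layer net) and counts only weighted layers on the main path; `count_layers` and `conv_weights` are illustrative helper names.

```python
def count_layers(blocks_per_stage, layers_per_block=3):
    # First conv + every conv inside the blocks + final fully connected layer.
    return 1 + layers_per_block * sum(blocks_per_stage) + 1

def conv_weights(kernel, c_in, c_out):
    # Weights of a kernel x kernel convolution, biases ignored.
    return kernel * kernel * c_in * c_out

print(count_layers([3, 8, 36, 3]))  # 152
print(count_layers([3, 4, 6, 3]))   # 50

# Bottleneck at 256 channels: 1x1 down to 64, 3x3 at 64, 1x1 back up to 256.
bottleneck = (conv_weights(1, 256, 64)
              + conv_weights(3, 64, 64)
              + conv_weights(1, 64, 256))
# Two plain 3x3 convolutions kept at 256 channels, for comparison.
two_3x3 = 2 * conv_weights(3, 256, 256)

print(bottleneck)  # 69632
print(two_3x3)     # 1179648
```

At equal width the bottleneck needs roughly 17 times fewer weights than two full 3×3 convolutions, which gives a sense of how a much deeper network can still cost less than VGG-19.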

Accuracy: 3.57%

Winning top-5 error at ILSVRC 2015, from an ensemble of residual nets — past the oft-quoted ≈5.1% human baseline.
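For readers unfamiliar with the metric: a prediction counts as correct under top-5 error if the true class appears among the model's five highest-scoring classes. A minimal sketch, with a hypothetical `top_k_error` helper and made-up scores for two images over six classes:

```python
def top_k_error(scores, labels, k=5):
    # scores: one list of per-class scores per image.
    # labels: the true class index for each image.
    misses = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)

scores = [
    [0.10, 0.25, 0.30, 0.15, 0.05, 0.20],  # true class 4 is ranked last
    [0.50, 0.10, 0.12, 0.08, 0.11, 0.09],  # true class 0 is ranked first
]
labels = [4, 0]

print(top_k_error(scores, labels))       # 0.5
print(top_k_error(scores, labels, k=6))  # 0.0
```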

Clean sweep: 5 challenges

First place in five tracks that year — ImageNet classification, localization and detection, plus COCO detection and segmentation.

Legacy — and the catch

What it unlocked

Made depth trainable: residual nets with more than 100 layers optimized without the degradation seen in plain stacks
Residual connections became a universal design pattern (transformers included)
Still a workhorse backbone for vision a decade later

The limits

Depth ≠ understanding: gains saturated by ~1000 layers, where the paper's 1202-layer CIFAR-10 net tested worse than its 110-layer one (7.93% vs 6.43% error)
Why residuals work so well took years of theory to (partially) explain
Vision's frontier eventually moved to attention (ViT) anyway

Go deeper

Read the original: arXiv:1512.03385. For the mechanics behind the story, see Classic CNN architectures, Vanishing & exploding gradients and Backpropagation. Next paper: Attention Is All You Need (2017).