Residual Connections
Deeper was making things worse
By 2015 the recipe for better vision models was "add more layers" — until it stopped working. Stack enough plain layers and the deeper network trains worse than a shallower one: higher training error, not just test error, so it isn't overfitting. This is the degradation problem. ResNet (He et al., Dec 2015) largely fixed it with an almost embarrassingly simple move: let every block keep a copy of its input.
In the ResNet paper's CIFAR-10 experiments, a 56-layer plain net had higher training error than a 20-layer one. The deeper model couldn't even fit the data it was trained on. In principle it could copy the 20-layer solution and make the extra layers identities, but in practice a stack of plain nonlinear layers finds it hard to learn the identity.
Learn the change, not the whole mapping
Instead of asking a block to learn the full mapping H(x), let it learn only the residual — the difference from its input — and add the input back at the end:
y = F(x) + x
F(x) is a couple of weight layers (the part being learned); the + x is a skip (shortcut) connection that routes the input around them and adds no extra parameters.
The block only has to learn how the input should differ — often a small nudge, which is easier than the whole transformation.
The input flows around the block untouched and is simply added back. Nothing to learn, nothing to break.
Pushing the block's weights toward zero drives F(x) toward zero, which leaves y = x and turns the block into a no-op. "Change nothing" is easy to learn, so extra depth can't easily hurt.
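The block described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: it assumes F(x) is two bias-free weight layers with a ReLU between them, leaves out batch norm and any activation after the addition, and the names `residual_branch` and `residual_block` are made up for this example.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (a list of weight rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def residual_branch(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """F(x): two weight layers with a ReLU in between (biases omitted)."""
    return matvec(w2, relu(matvec(w1, x)))


def residual_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """y = F(x) + x: the skip connection adds the input back and has no weights."""
    fx = residual_branch(x, w1, w2)
    return [f + xi for f, xi in zip(fx, x)]


x = [1.0, -2.0, 0.5]

# With every weight at zero, F(x) is zero and the block returns its input.
zeros = [[0.0] * 3 for _ in range(3)]
y_identity = residual_block(x, zeros, zeros)

# Small weights produce a small nudge away from the input.
small = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
y_nudged = residual_block(x, small, small)

print(y_identity)
print(y_nudged)
```

With zero weights the output equals the input exactly, which is the "change nothing" case; with small weights the output stays close to the input.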
Why gradients love the shortcut
Backprop multiplies local derivatives along every path from the loss back to a weight. In a plain stack that's a long chain of factors, and if each one is a bit less than 1 the product collapses — the classic vanishing gradient story. The skip connection changes the math:
With y = F(x) + x, ∂y/∂x = ∂F/∂x + 1. That +1 means part of the gradient passes straight through, unscaled. Chain a hundred residual blocks and there is still an identity path the gradient can ride end to end — which is why networks with hundreds of layers stay trainable.
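A scalar toy calculation makes the contrast concrete. This is a sketch under stated assumptions rather than a measurement from a real network: each layer is reduced to a single local derivative, the plain stack uses 0.9 per layer, and each residual branch contributes a small derivative of +0.05 or -0.05. All of these numbers are picked for illustration.

```python
from math import prod
from typing import Sequence


def plain_gradient(local_derivs: Sequence[float]) -> float:
    """Gradient through a plain stack: the product of the local derivatives."""
    return prod(local_derivs)


def residual_gradient(branch_derivs: Sequence[float]) -> float:
    """Gradient through stacked residual blocks: each factor is dF/dx + 1."""
    return prod(d + 1.0 for d in branch_derivs)


depth = 100

# Plain stack: every local derivative is a bit less than 1 (0.9 is assumed).
plain = plain_gradient([0.9] * depth)

# Residual stack: each branch derivative is small with alternating sign
# (+/-0.05, also assumed), so every factor stays close to 1.
branch = [0.05 if i % 2 == 0 else -0.05 for i in range(depth)]
residual = residual_gradient(branch)

print(f"plain:    {plain:.3e}")
print(f"residual: {residual:.3f}")
```

Under these assumptions the plain product shrinks by several orders of magnitude while the residual product stays of order 1. The +1 does not guarantee a well-scaled gradient on its own (branch derivatives that are all positive would make the product grow), but it does guarantee that an identity path exists.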
Where you've already seen it
The original CNN family: ResNet-18/34/50/101/152. An ensemble of these residual nets, with depths up to 152 layers, won the ILSVRC 2015 classification task.
A standard transformer block wraps each of its two sublayers, attention and the feed-forward network, in its own residual connection.
Stacked shortcuts form a single information stream, often called the residual stream, that every layer reads from and writes small updates to.
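The transformer wiring can be sketched as follows. This is a hedged illustration of the residual plumbing only: it assumes the pre-norm arrangement, works on a single vector instead of a sequence of tokens, and uses hypothetical stand-ins (`toy_attention`, `toy_feed_forward`, `silent`) where a real model would have learned sublayers.

```python
from math import sqrt
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / sqrt(var + eps) for v in x]


def add(a: Vector, b: Vector) -> Vector:
    return [ai + bi for ai, bi in zip(a, b)]


def transformer_block(x: Vector, attention: Sublayer, feed_forward: Sublayer) -> Vector:
    """Pre-norm wiring: each sublayer reads a normalized copy of the stream
    and writes its output back by addition."""
    x = add(x, attention(layer_norm(x)))
    x = add(x, feed_forward(layer_norm(x)))
    return x


# Hypothetical stand-ins; real sublayers have learned weights and mix tokens.
def toy_attention(h: Vector) -> Vector:
    return [0.1 * v for v in h]


def toy_feed_forward(h: Vector) -> Vector:
    return [0.1 * max(0.0, v) for v in h]


def silent(h: Vector) -> Vector:
    """A sublayer that writes nothing to the stream."""
    return [0.0 for _ in h]


start = [1.0, 2.0, 3.0, 4.0]

stream = list(start)
for _ in range(3):  # three stacked blocks sharing one stream
    stream = transformer_block(stream, toy_attention, toy_feed_forward)

# If both sublayers write zeros, the stream passes through unchanged.
untouched = transformer_block(list(start), silent, silent)
```

Because each sublayer only adds to `x`, a sublayer that outputs zeros leaves the stream exactly as it was, and the three stacked blocks each contribute a small update to the same running vector.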
Practical notes
- The shortcut is free when input and output shapes match
- Pairs naturally with normalization — Conv → BN → ReLU inside the block is the classic recipe
- Makes depth a resource instead of a liability
- Dimensions must match to add — use a 1×1 conv (or a linear projection) on the shortcut when channels or spatial size change
- Where you put the norm (pre-norm vs post-norm) changes training stability in transformers
- Residuals ease optimization; they don't replace a sensible learning rate or good data
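The shape-matching note above can be illustrated with a projected shortcut. This sketch assumes a plain linear map `w_s` on the shortcut, standing in for the 1×1 convolution a CNN would use; the function name `projected_residual_block` and all weights are invented for the example.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (a list of weight rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def projected_residual_block(x: Vector, w1: Matrix, w2: Matrix, w_s: Matrix) -> Vector:
    """y = F(x) + W_s x, where W_s maps the input to the branch's output size."""
    fx = matvec(w2, relu(matvec(w1, x)))
    shortcut = matvec(w_s, x)
    if len(fx) != len(shortcut):
        raise ValueError("branch and shortcut outputs must have the same size")
    return [f + s for f, s in zip(fx, shortcut)]


x = [1.0, 2.0, 3.0]                        # 3 input features
w1 = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]    # branch layer 1: 3 -> 2
w2 = [[1.0, 0.0], [0.0, 1.0]]              # branch layer 2: 2 -> 2
w_s = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # shortcut projection: 3 -> 2

y = projected_residual_block(x, w1, w2, w_s)
print(y)
```

Unlike the identity shortcut, the projection carries weights of its own, so it is usually reserved for the places where the shapes actually change.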
"Deep Residual Learning for Image Recognition" — He, Zhang, Ren & Sun, December 2015 (arXiv). It didn't make deep training trivially easy, but it largely fixed the degradation problem — and the skip connection became one of the most-copied ideas in all of deep learning.