Residual Connections · Suman Bhadra Notes

Deeper was making things worse

By 2015 the recipe for better vision models was "add more layers" — until it stopped working. Stack enough plain layers and the deeper network trains worse than a shallower one: higher training error, not just test error, so it isn't overfitting. This is the degradation problem. ResNet (He et al., Dec 2015) largely fixed it with an almost embarrassingly simple move: let every block keep a copy of its input.

Not overfitting — failing to optimize

In the ResNet paper's CIFAR-10 experiments, a 56-layer plain net had higher training error than a 20-layer one. The deeper model couldn't even fit the data it was trained on. In principle it could copy the 20-layer solution and set the extra layers to the identity, so it should do no worse. In practice, optimizers struggle to steer a stack of nonlinear layers to an exact identity mapping.

Learn the change, not the whole mapping

Instead of asking a block to learn the full mapping H(x), let it learn only the residual F(x) = H(x) - x, the difference between the desired output and its input, and add the input back at the end:

The residual block

y = F(x) + x — F(x) is a couple of weight layers (the part being learned); the + x is a skip (shortcut) connection that routes the input around them, costing zero extra parameters.

The residual: F(x) = the change

The block only has to learn how the input should differ — often a small nudge, which is easier than the whole transformation.

The shortcut: + x, no params

The input flows around the block untouched and is simply added back. Nothing to learn, nothing to break.

A safe default: F(x) = 0 → identity

Pushing the block's weights toward zero turns it into a no-op. "Change nothing" is easy to learn — so extra depth can't easily hurt.
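
The three points above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the helper names (`linear`, `relu`, `residual_block`) and the choice of F(x) = W2 · relu(W1 · x) on a 3-element vector are assumptions made for the example.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(weights, v):
    # weights is a list of rows; returns the matrix-vector product
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    """y = F(x) + x, with the assumed branch F(x) = W2 . relu(W1 . x)."""
    fx = linear(w2, relu(linear(w1, x)))   # the learned residual F(x)
    return [f + a for f, a in zip(fx, x)]  # the shortcut: add the input back

x = [1.0, -2.0, 0.5]

# All-zero weights: F(x) = 0, so the block is exactly the identity.
zeros = [[0.0] * 3 for _ in range(3)]
identity_out = residual_block(x, zeros, zeros)

# Small weights: the output is the input plus a small nudge.
small = [[0.01 if i == j else 0.0 for j in range(3)] for i in range(3)]
nudged = residual_block(x, small, small)

print(identity_out)  # same values as x
print(nudged)        # close to x, shifted slightly by F(x)
```

Note that the shortcut itself contributes no weights: only `w1` and `w2` would be trained.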

Why gradients love the shortcut

Backprop multiplies local derivatives along every path from the loss back to a weight. In a plain stack that's a long chain of factors, and if each one is a bit less than 1 the product collapses — the classic vanishing gradient story. The skip connection changes the math:
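
To put a number on "the product collapses", here is a small sketch that multiplies the same local derivative along a chain. The value 0.9 and the helper name `chain_gradient` are illustrative assumptions; real layers have different derivatives at every step.

```python
def chain_gradient(local_derivative, depth):
    """Product of `depth` identical local derivatives along one path."""
    grad = 1.0
    for _ in range(depth):
        grad *= local_derivative
    return grad

shallow = chain_gradient(0.9, 10)  # roughly 0.35
deep = chain_gradient(0.9, 50)     # roughly 0.005

print(shallow, deep)
```

Even with every factor as high as 0.9, fifty layers leave only about half a percent of the gradient signal.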

The math in one line

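With y = F(x) + x, ∂y/∂x = ∂F/∂x + 1 (for vector inputs, the identity matrix I takes the place of 1). That +1 means part of the gradient passes straight through, unscaled. Chain a hundred residual blocks and the end-to-end derivative is a product of such (∂F/∂x + 1) factors. Expanding that product always leaves a pure identity term, a path the gradient can ride end to end. That is a large part of why networks with hundreds of layers stay trainable.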
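
A rough numerical check of this, using scalars and finite differences. Everything here is a toy assumption: a fixed `0.5 * tanh(x)` stands in for the learned branch F, and the names `branch`, `plain_stack`, `residual_stack` and `numerical_derivative` are made up for the sketch.

```python
import math

def branch(x, scale=0.5):
    # Hypothetical stand-in for the weight layers F(x); its slope is at most 0.5.
    return scale * math.tanh(x)

def plain_stack(x, depth):
    for _ in range(depth):
        x = branch(x)          # y = F(x)
    return x

def residual_stack(x, depth):
    for _ in range(depth):
        x = branch(x) + x      # y = F(x) + x
    return x

def numerical_derivative(fn, x, depth, eps=1e-6):
    # Central finite difference of the whole stack with respect to its input.
    return (fn(x + eps, depth) - fn(x - eps, depth)) / (2 * eps)

plain_grad = numerical_derivative(plain_stack, 0.3, 30)
residual_grad = numerical_derivative(residual_stack, 0.3, 30)

print(plain_grad)     # tiny: thirty factors of at most 0.5 multiplied together
print(residual_grad)  # at least 1: every factor is 1 plus the branch slope
```

In this toy the branch slope is never negative, so each residual factor is at least 1. In a real network ∂F/∂x can have any sign, but the identity term in the product is still there.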

Where you've already seen it

ResNets: 18 → 152 layers

The original CNN family: ResNet-18/34/50/101/152. An ensemble of residual nets of different depths, including 152-layer models, won the ILSVRC 2015 classification task.

Transformers: 2 per block

Every transformer block wraps both attention and the feed-forward network in a residual connection.

The residual stream: one shared lane

Stacked shortcuts form a single information stream that every layer reads from and writes small updates to.
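
The "read from and write to one stream" picture can be sketched as follows. This is a toy under loud assumptions: `attention_stub` and `ffn_stub` are hypothetical placeholders, not real attention or feed-forward layers, and normalization is left out entirely. The only point is the structure: two residual additions per block, and a final stream equal to the input plus the sum of every update.

```python
def attention_stub(stream):
    # Hypothetical stand-in for attention: a small update pulling toward the mean.
    mean = sum(stream) / len(stream)
    return [0.1 * (mean - s) for s in stream]

def ffn_stub(stream):
    # Hypothetical stand-in for the feed-forward network.
    return [0.1 * max(0.0, s) for s in stream]

def transformer_block(stream):
    stream = [s + u for s, u in zip(stream, attention_stub(stream))]  # residual 1
    stream = [s + u for s, u in zip(stream, ffn_stub(stream))]        # residual 2
    return stream

def run_stream(x, n_blocks):
    stream = list(x)
    updates = []  # what each block wrote into the stream
    for _ in range(n_blocks):
        new_stream = transformer_block(stream)
        updates.append([n - s for n, s in zip(new_stream, stream)])
        stream = new_stream
    return stream, updates

x = [1.0, -1.0, 2.0]
final, updates = run_stream(x, 4)

# The final stream is the original input plus every block's written update.
reconstructed = [xi + sum(u[i] for u in updates) for i, xi in enumerate(x)]
print(final)
print(reconstructed)  # matches final up to floating-point rounding
```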

Practical notes

Good to know

The shortcut is free when input and output shapes match
Pairs naturally with normalization — Conv → BN → ReLU inside the block is the classic recipe
Made depth a resource instead of a liability

Watch out

Dimensions must match to add — use a 1×1 conv (or a linear projection) on the shortcut when channels or spatial size change
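Where you put the norm changes training stability in transformers: pre-norm (normalize inside the branch, before attention or the feed-forward network) is generally easier to train at depth than post-norm (normalize after the addition), which typically needs learning-rate warmup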
Residuals ease optimization; they don't replace a sensible learning rate or good data
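
The dimension-matching note above can be illustrated with a projection shortcut on plain vectors. This is a sketch, assuming a single linear-plus-ReLU branch that widens 2 features to 3. The names `projected_residual_block`, `w_branch` and `w_shortcut` are invented for the example, and in a CNN the projection would be a 1×1 convolution rather than a matrix product.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(weights, v):
    # weights is a list of rows; returns the matrix-vector product
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def projected_residual_block(x, w_branch, w_shortcut):
    """y = F(x) + W_s . x, for when F changes the number of features."""
    fx = relu(linear(w_branch, x))    # branch output: len(w_branch) features
    shortcut = linear(w_shortcut, x)  # projection so the two shapes match
    return [f + s for f, s in zip(fx, shortcut)]

x = [1.0, 2.0]                                      # 2 features in
w_branch = [[0.5, 0.0], [0.0, 0.5], [1.0, -1.0]]    # 3 features out
w_shortcut = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # pads x with a zero feature

y = projected_residual_block(x, w_branch, w_shortcut)
print(y)  # [1.5, 3.0, 0.0]
```

Unlike the plain identity shortcut, `w_shortcut` would normally be learned, so this variant is no longer parameter-free.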

The paper

"Deep Residual Learning for Image Recognition" — He, Zhang, Ren & Sun, December 2015 (arXiv). It didn't make deep training trivially easy, but it largely fixed the degradation problem — and the skip connection became one of the most-copied ideas in all of deep learning.