Pooling Layers — Max & Average · Suman Bhadra Notes

Summarize and shrink

After a convolution, feature maps are still large. A pooling layer shrinks them by replacing each small region with a single summary number — usually the maximum.

Max pooling: take the max

Keep the strongest activation in each window — "was the feature present here at all?" The popular default.

Average pooling: take the mean

Average the window — a smoother summary. Common as global average pooling before the final layer.

Watch a 4×4 map become 2×2

A 2×2 pooling window with stride 2 sweeps the feature map, taking the max (then the average) of each region.
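To make the sweep concrete, here is a minimal NumPy sketch of the same operation. The helper name `pool2d` and the values in `fmap` are made up for illustration; the window size and stride match the 2×2, stride-2 setup described above, and no padding is assumed.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Pool a 2-D feature map with a square window (no padding).

    Hypothetical helper for illustration, not a library function.
    """
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=float)
    reduce = np.max if mode == "max" else np.mean
    for i in range(out_h):
        for j in range(out_w):
            # Slice out the current window and summarize it with one number.
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = reduce(window)
    return out

# Illustrative 4x4 feature map (values chosen arbitrarily).
fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 1],
                 [0, 2, 5, 7],
                 [1, 1, 3, 9]])

max_out = pool2d(fmap, mode="max")  # [[6, 2], [2, 9]]
avg_out = pool2d(fmap, mode="avg")  # [[3.5, 1.0], [1.0, 6.0]]
print(max_out)
print(avg_out)
```

Each output cell summarizes one non-overlapping 2×2 block of the input, so the 4×4 map becomes 2×2 under either rule.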

Why pool?

Less compute: smaller maps

Halving both height and width cuts the number of activations per channel by 4×, so the next layer has fewer numbers to process.
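The arithmetic behind the 4× figure can be checked with a short sketch. The function name `pooled_size` and the 32×32 input are assumptions for illustration; the size formula is the usual one for an unpadded window.

```python
def pooled_size(n, size=2, stride=2):
    """Output length along one dimension for an unpadded pooling window."""
    return (n - size) // stride + 1

h, w = 32, 32                      # illustrative input size
out_h, out_w = pooled_size(h), pooled_size(w)

# Ratio of activations per channel before and after pooling.
reduction = (h * w) / (out_h * out_w)
print(out_h, out_w, reduction)     # 16 16 4.0
```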

Translation tolerance: small shifts ok

The max of a region usually stays the same if the feature shifts by a pixel inside the window, which adds some robustness to small translations.
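A small experiment shows both the tolerance and its limit. This is a sketch with a hypothetical helper `max_pool_2x2` and a toy map containing a single activation. It assumes even height and width so the reshape trick works.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 via reshape (needs even height and width)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.zeros((4, 4)); a[0, 0] = 1.0   # feature at column 0
b = np.zeros((4, 4)); b[0, 1] = 1.0   # shifted one pixel, same pooling window
c = np.zeros((4, 4)); c[0, 2] = 1.0   # shifted again, now in the next window

pool_a, pool_b, pool_c = max_pool_2x2(a), max_pool_2x2(b), max_pool_2x2(c)

print(np.array_equal(pool_a, pool_b))  # True: shift stayed inside the window
print(np.array_equal(pool_a, pool_c))  # False: shift crossed a window boundary
```

The pooled output is unchanged only while the shift stays within one window, which is why the tolerance is partial rather than full invariance.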

No parameters: just a rule

Pooling has nothing to learn — it's a fixed operation, so it's cheap.

A note on modern practice

Pooling vs strided conv

Some modern architectures replace pooling with strided convolutions, which learn how to downsample. Global average pooling, which averages each whole feature map down to one number, is a common way to turn the feature maps into a vector before the classifier head, replacing big dense layers.
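As a rough illustration of why global average pooling shrinks the classifier head, the sketch below compares it with plain flattening. The shapes (8 channels of 7×7) and the 10-class head are assumptions chosen for the example, not values from these notes, and bias terms are ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative feature maps laid out as (channels, height, width).
features = rng.standard_normal((8, 7, 7))

# Global average pooling: one number per channel.
gap = features.mean(axis=(1, 2))       # shape (8,)

# Plain flattening keeps every activation.
flat = features.reshape(-1)            # shape (392,)

num_classes = 10                       # assumed head size
gap_head_weights = gap.size * num_classes     # 80 weights
flat_head_weights = flat.size * num_classes   # 3920 weights
print(gap.shape, flat.shape, gap_head_weights, flat_head_weights)
```

With these assumed sizes, the dense layer after global average pooling needs 49× fewer weights than one applied to the flattened maps.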