The Convolution Operation · Suman Bhadra Notes

Slide, multiply, sum

A convolution takes a small grid of numbers — the kernel — and slides it across the image. At each stop, it multiplies overlapping values and adds them into a single number. The grid of results is a feature map.

One position

Line the kernel up over a patch, multiply element by element, and sum the products to get one output pixel. Step over and repeat. That sum is a dot product between the kernel and the patch, with both read as flat lists of numbers.
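As a minimal sketch of one position, the snippet below multiplies a patch by a kernel element by element and sums the products. The patch and kernel values, and the helper name `patch_response`, are made up for illustration.

```python
# Hypothetical 3x3 patch and kernel; the numbers are illustrative only.
patch = [
    [1, 2, 0],
    [0, 1, 3],
    [2, 1, 1],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

def patch_response(patch, kernel):
    """Multiply overlapping values element by element and sum them."""
    return sum(
        p * k
        for patch_row, kernel_row in zip(patch, kernel)
        for p, k in zip(patch_row, kernel_row)
    )

output_pixel = patch_response(patch, kernel)
print(output_pixel)  # -1
```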

Watch the window slide

A 3×3 kernel sweeps across a 5×5 input. Each stop computes one cell of the output feature map.
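The sweep can be sketched as two nested loops over every position where the kernel fits inside the input. This is a plain-Python illustration, not an optimized implementation; the function name `conv2d_valid`, the counting-up input, and the all-ones kernel are assumptions chosen so the result is easy to check by hand.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over every position where it fits entirely inside the image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # One stop of the window: dot product of kernel and patch.
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Hypothetical 5x5 input whose values count up from 0 to 24.
image = [[5 * r + c for c in range(5)] for r in range(5)]
# Hypothetical 3x3 kernel of ones, so each output is the sum of its patch.
kernel = [[1, 1, 1] for _ in range(3)]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
```

Each of the nine output cells comes from one stop of the window, so the 5×5 input produces a 3×3 feature map.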

What it produces

Feature map: activation grid

High values where the patch matches the kernel's pattern; low where it doesn't.

Shrinks the size: 5×5 → 3×3

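A valid (unpadded, stride-1) 3×3 convolution over a 5×5 input yields a 3×3 output, because the kernel fits in only 5 - 3 + 1 = 3 positions along each axis. See padding & stride to control this.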
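The size arithmetic for the valid case can be written as a one-line helper. The name `valid_output_size` is hypothetical, and the sketch assumes no padding and a stride of 1.

```python
def valid_output_size(input_size, kernel_size):
    """Positions along one axis where the kernel fits fully inside the input.

    Assumes no padding and a stride of 1.
    """
    return input_size - kernel_size + 1

print(valid_output_size(5, 3))   # 3
print(valid_output_size(28, 5))  # 24
```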

Many filters: many maps

A conv layer has many kernels, each producing its own feature map (a channel).
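A rough sketch of the idea: apply several kernels to the same input and collect one feature map per kernel. The three kernels, their names, and the 5×5 test image are illustrative assumptions, and `conv2d_valid` is a hypothetical helper that handles square inputs only.

```python
def conv2d_valid(image, kernel):
    """Valid (no padding, stride 1) sliding dot product for square inputs."""
    k = len(kernel)
    size = len(image) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(size)
        ]
        for r in range(size)
    ]

# Hypothetical 5x5 image: a bright 3x3 square (1s) on a dark border (0s).
image = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)] for r in range(5)]

# Three hand-written kernels standing in for a layer's learned kernels.
kernels = {
    "vertical_edge": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
    "horizontal_edge": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "box_sum": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
}

# One feature map (channel) per kernel.
feature_maps = {name: conv2d_valid(image, k) for name, k in kernels.items()}
print(len(feature_maps), "channels, each", len(feature_maps["box_sum"]), "x", len(feature_maps["box_sum"][0]))
```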

Then a non-linearity

Each feature-map value then passes through an activation function (usually ReLU), just as each output of a dense layer does.
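For ReLU, the activation step amounts to clamping negative values to zero, one value at a time. The feature-map values below are made up for illustration.

```python
def relu(x):
    """Rectified linear unit: pass positives through, clamp negatives to 0."""
    return max(0, x)

# Hypothetical 3x3 feature map with mixed signs.
feature_map = [
    [-3, 0, 3],
    [-1, 2, -2],
    [4, -5, 1],
]

activated = [[relu(v) for v in row] for row in feature_map]
print(activated)  # [[0, 0, 3], [0, 2, 0], [4, 0, 1]]
```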

Why it's powerful

The kernel's numbers are learned, not hand-set. Training discovers which patterns are worth detecting. Because the same kernel is reused at every position (weight sharing), it detects that pattern wherever it appears in the image, and it does so with very few parameters: on a single-channel image, a 3×3 kernel has just 9 weights, whatever the image size.
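To make the parameter saving concrete, the sketch below counts weights for a single-channel 3×3 convolution and for a dense layer that maps the same 5×5 input to the same 9 outputs. Both helper names are hypothetical, and each kernel or output unit is assumed to carry one bias term.

```python
def conv_params(kernel_size, num_kernels=1, bias=True):
    """Weights in a single-input-channel conv layer: shared across all positions."""
    per_kernel = kernel_size * kernel_size + (1 if bias else 0)
    return per_kernel * num_kernels

def dense_params(num_inputs, num_outputs, bias=True):
    """Weights in a dense layer: every output has its own weight per input."""
    return (num_inputs + (1 if bias else 0)) * num_outputs

print(conv_params(3))        # 10: 9 weights plus 1 bias, reused at every position
print(dense_params(25, 9))   # 234: 5x5 input fully connected to 3x3 outputs
```

The convolution count does not change when the image grows, while the dense count grows with both the input and the output size.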

Prove it to yourself: take one image (a bright square and a diagonal stroke) and convolve it with five different kernels. The kernel's numbers are the detector: switch them and the same image yields a completely different feature map. Every output cell is produced by exactly one 3×3 input patch.

The vertical-edge kernel fires along the square's sides — red where brightness rises to the right, green where it falls — but stays silent on the flat interior; the horizontal-edge kernel finds the top and bottom instead. Sharpen and blur are the same kernels image editors use. A CNN learns dozens of these — by gradient descent, not by hand.
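The edge behaviour can be reproduced with a simplified test image. The sketch below uses only a bright square (no diagonal stroke), and the two kernels are common hand-written edge detectors rather than necessarily the exact ones used above. The sign convention is an assumption: positive where brightness rises to the right (or downward), negative where it falls.

```python
def conv2d_valid(image, kernel):
    """Valid (no padding, stride 1) sliding dot product for square inputs."""
    k = len(kernel)
    size = len(image) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(size)
        ]
        for r in range(size)
    ]

# Hypothetical 8x8 image: a bright 4x4 square (1s) on a dark background (0s).
image = [[1 if 2 <= r <= 5 and 2 <= c <= 5 else 0 for c in range(8)] for r in range(8)]

vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
horizontal_edge = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]

vertical_map = conv2d_valid(image, vertical_edge)
horizontal_map = conv2d_valid(image, horizontal_edge)

# A row through the middle of the square: left side, interior, right side.
print(vertical_map[3])                               # [3, 3, 0, 0, -3, -3]
# A column through the middle of the square: top, interior, bottom.
print([horizontal_map[r][3] for r in range(6)])      # [3, 3, 0, 0, -3, -3]
```

The vertical-edge map is nonzero only near the left and right sides of the square and zero across its flat interior; the horizontal-edge map does the same for the top and bottom.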