The Convolution Operation
Slide, multiply, sum
A convolution takes a small grid of numbers — the kernel — and slides it across the image. At each stop, it multiplies overlapping values and adds them into a single number. The grid of results is a feature map.
Line the kernel up over a patch, multiply element-by-element, sum it all → one output pixel. Step over, repeat. That sum is the dot product of the kernel and the patch, each read as a flat list of numbers.
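The slide, multiply, sum loop can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `conv2d_valid` and the toy 5×5 input are assumptions made for the example, and it uses no padding and a step of one pixel. Like most deep-learning libraries, it does not flip the kernel before multiplying.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            patch = image[r:r + kh, c:c + kw]    # the window under the kernel
            out[r, c] = np.sum(patch * kernel)   # multiply element-wise, then sum
    return out

# Hypothetical toy data: a 5x5 image holding 0..24 and an all-ones 3x3 kernel.
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3))
feature_map = conv2d_valid(image, kernel)

# The same number, computed as a dot product of the flattened patch and kernel.
top_left_as_dot = np.dot(image[0:3, 0:3].ravel(), kernel.ravel())

print(feature_map.shape)   # (3, 3)
print(feature_map[0, 0])   # 54.0
```

With an all-ones kernel each output cell is simply the sum of its 3×3 patch, which makes the result easy to check by hand.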
Watch the window slide
A 3×3 kernel sweeps across a 5×5 input. Each stop computes one cell of the output feature map.
What it produces
High values where the patch matches the kernel's pattern; low where it doesn't.
A valid (no padding, stride 1) 3×3 conv over a 5×5 input yields a 3×3 output, since 5 − 3 + 1 = 3; see padding & stride to control this.
A conv layer has many kernels, each producing its own feature map (a channel).
Each feature-map value passes through an activation (usually ReLU), just as in a dense layer.
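The points above can be put together in one small sketch: several kernels applied to the same input, one feature map per kernel, each passed through ReLU. The name `conv_layer`, the choice of four kernels, and the random values are assumptions for illustration only. A real layer would also handle multi-channel inputs and usually a bias term, both left out here.

```python
import numpy as np

def conv_layer(image, kernels):
    """Apply each kernel to a 2-D image (no padding, stride 1), then ReLU."""
    n, kh, kw = kernels.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((n, oh, ow))
    for k in range(n):                      # one feature map (channel) per kernel
        for r in range(oh):
            for c in range(ow):
                out[k, r, c] = np.sum(image[r:r + kh, c:c + kw] * kernels[k])
    return np.maximum(out, 0.0)             # ReLU: negative responses become 0

rng = np.random.default_rng(0)
image = rng.random((5, 5))                  # hypothetical 5x5 input
kernels = rng.standard_normal((4, 3, 3))    # four 3x3 kernels (random stand-ins)
maps = conv_layer(image, kernels)

print(maps.shape)   # (4, 3, 3)
```

The output has one 3×3 map per kernel, so the number of kernels sets the number of output channels.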
Why it's powerful
The kernel's numbers are learned, not hand-set. Training discovers which patterns are worth detecting — and because the same kernel is reused everywhere (weight sharing), it finds that pattern anywhere in the image, with very few parameters.
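Weight sharing can be checked with a small experiment: place the same pattern at two different positions and see where the kernel responds most strongly. The plus-shaped pattern, the 7×7 image size, and the helper names below are assumptions chosen for the example. The kernel is hand-set here only to keep the demonstration short; in a trained network its values would be learned.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Hypothetical pattern: a 3x3 plus sign, used both as the kernel and as the target.
plus = np.array([[0, 1, 0],
                 [1, 1, 1],
                 [0, 1, 0]], dtype=float)

def image_with_plus(top, left, size=7):
    """Blank image with the plus sign drawn with its top-left corner at (top, left)."""
    img = np.zeros((size, size))
    img[top:top + 3, left:left + 3] = plus
    return img

def peak(feature_map):
    """Row and column of the strongest response."""
    r, c = np.unravel_index(np.argmax(feature_map), feature_map.shape)
    return int(r), int(c)

peak_a = peak(conv2d_valid(image_with_plus(0, 0), plus))
peak_b = peak(conv2d_valid(image_with_plus(3, 2), plus))
print(peak_a, peak_b)   # (0, 0) (3, 2)

# Parameter count: one shared 3x3 kernel vs. a dense layer wiring every one of
# the 7x7 inputs to every one of the 5x5 outputs.
conv_weights = plus.size              # 9
dense_weights = (7 * 7) * (5 * 5)     # 1225
```

The peak moves with the pattern even though the nine kernel weights never change, which is the sense in which one kernel detects its pattern anywhere in the image.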
Prove it to yourself: here is one image (a bright square and a diagonal stroke) convolved with five different kernels. The kernel's numbers are the detector — switch them and the same image yields a completely different feature map. Click any output cell to see exactly which 3×3 input patch produced it.
The vertical-edge kernel fires along the square's sides — red where brightness rises to the right, green where it falls — but stays silent on the flat interior; the horizontal-edge kernel finds the top and bottom instead. Sharpen and blur are the same kernels image editors use. A CNN learns dozens of these — by gradient descent, not by hand.
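The edge behaviour described above can be reproduced numerically. The kernel values below are one common hand-written choice (a Prewitt-style filter) and are an assumption: the demo's actual kernels are not given in the text. The 9×9 image with a 5×5 bright square is likewise a stand-in, without the diagonal stroke.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Hypothetical test image: dark background with a bright 5x5 square.
image = np.zeros((9, 9))
image[2:7, 2:7] = 1.0

# Assumed kernels: right column minus left column, and its transpose.
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=float)
horizontal_edge = vertical_edge.T

v_map = conv2d_valid(image, vertical_edge)
h_map = conv2d_valid(image, horizontal_edge)

# Middle row of the vertical-edge map: positive where brightness rises to the
# right (left side of the square), zero in the flat interior, negative where
# it falls (right side).
print(v_map[3])      # [ 3.  3.  0.  0.  0. -3. -3.]

# The horizontal-edge map shows the same profile running top to bottom.
print(h_map[:, 3])   # [ 3.  3.  0.  0.  0. -3. -3.]
```

The positive and negative values correspond to the two colours in the demo, and the zeros in the middle are the silent flat interior.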