Why CNNs for Images? · Suman Bhadra Notes

Dense networks drown on images

Feed an image to a fully-connected network and two problems hit at once: an explosion of weights, and total blindness to spatial structure.

A modest 200×200 colour image is 200 × 200 × 3 = 120,000 numbers. Connect that to a single hidden layer of 1,000 neurons and you already have 120 million weights, for one layer alone. Worse, flattening the image throws away the fact that nearby pixels belong together. Because every pixel position gets its own independent weights, a cat in the top-left and the same cat in the bottom-right look completely unrelated to a dense net.
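The arithmetic is easy to check. The sketch below uses a hypothetical helper, `dense_weight_count`, and ignores bias terms. The 400×400 case is an illustrative extra, not a figure from these notes.

```python
def dense_weight_count(height: int, width: int, channels: int, hidden_units: int) -> int:
    """Weights in one fully-connected layer fed by a flattened image (biases ignored)."""
    inputs = height * width * channels  # flattening: one input per pixel per channel
    return inputs * hidden_units        # every input connects to every hidden neuron


# The example from the text: 200x200 colour image, 1,000 hidden neurons.
weights_200 = dense_weight_count(200, 200, 3, 1000)

# Illustrative extra case: doubling each side of the image quadruples the weights.
weights_400 = dense_weight_count(400, 400, 3, 1000)

print(f"200x200 image: {weights_200:,} weights")
print(f"400x400 image: {weights_400:,} weights")
```

The count scales with the number of pixels, so larger images make the problem worse quickly.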

See the contrast

A dense layer wires every pixel to every neuron; a convolution slides one tiny shared filter across the image. The parameter counts, and the spatial awareness, diverge sharply.
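To put rough numbers on the contrast, here is a minimal sketch. The dense layer is the 200×200 colour image feeding 1,000 neurons from earlier; the convolutional layer with 64 filters of size 3×3 is an assumption chosen purely for illustration, and both helper names are hypothetical. Biases are included in both counts.

```python
def dense_params(in_features: int, out_features: int) -> int:
    """One weight per (input, output) pair, plus one bias per output."""
    return in_features * out_features + out_features


def conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Each filter has kernel_size x kernel_size x in_channels weights, plus one bias.

    Note that the image height and width do not appear anywhere in this formula.
    """
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


dense = dense_params(200 * 200 * 3, 1000)
conv = conv_params(kernel_size=3, in_channels=3, out_channels=64)  # assumed layer shape

print(f"dense layer: {dense:,} parameters")
print(f"conv layer:  {conv:,} parameters")
print(f"ratio:       about {dense / conv:,.0f}x")
```

The two layers are not equivalent in what they compute, so the ratio is only a sense of scale, not a like-for-like benchmark.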

The three CNN superpowers

Local connectivity: small patches

Each neuron looks at a small region, not the whole image — matching how visual features are local.
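Local connectivity can be seen directly in a small experiment. The sketch below uses a naive loop-based helper with a hypothetical name, `correlate2d_valid`, and an assumed 3×3 averaging filter on a random 8×8 image. It changes one far-away pixel and checks which outputs react.

```python
import numpy as np


def correlate2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive 'valid' 2-D cross-correlation (the operation CNN layers call convolution)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Output (i, j) depends only on the kh x kw patch starting at (i, j).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = np.ones((3, 3)) / 9.0  # assumed filter: a simple 3x3 average

before = correlate2d_valid(image, kernel)

# Change a single pixel in the bottom-right corner.
perturbed = image.copy()
perturbed[7, 7] += 10.0
after = correlate2d_valid(perturbed, kernel)

# Output (0, 0) only sees image[0:3, 0:3], so the distant change cannot reach it.
top_left_unchanged = before[0, 0] == after[0, 0]
# Only the output whose patch contains pixel (7, 7) is affected.
changed_outputs = int(np.count_nonzero(before != after))

print("top-left output unchanged:", top_left_unchanged)
print("outputs affected by the pixel change:", changed_outputs)
```

In a dense layer the same pixel change would feed into every neuron.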

Weight sharing: one filter, reused

The same small filter is reused at every position, so a feature is detected wherever it appears, and the parameter count depends on the filter size rather than the image size.
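A minimal sketch of weight sharing, under stated assumptions: a hand-written 3×3 vertical-edge filter (a learned filter would be found by training instead), a 10×10 image with the same vertical bar drawn in two places, and a hypothetical helper `correlate2d_valid`.

```python
import numpy as np


def correlate2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive 'valid' 2-D cross-correlation with a single shared kernel."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The same nine weights are applied at every position (i, j).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Assumed hand-made filter: responds to dark-to-bright transitions from left to right.
edge_filter = np.array([[-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])

image = np.zeros((10, 10))
image[1:4, 3] = 1.0  # a short vertical bar near the top-left
image[6:9, 8] = 1.0  # the same bar near the bottom-right

response = correlate2d_valid(image, edge_filter)

# The window at (1, 1) covers the first bar, the window at (6, 6) covers the second.
print("response at (1, 1):", response[1, 1])
print("response at (6, 6):", response[6, 6])
print("weights in the filter:", edge_filter.size)
```

Both bars trigger the same response from the same nine weights; a dense layer would need separate weights for each location.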

Translation equivariance: position-agnostic

Because the filter is shared, shifting the input shifts the feature map by the same amount (equivariance). Add pooling and you get approximate invariance: a cat is a cat, top-left or bottom-right.
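The difference between equivariance and invariance is easier to see in code. This is a sketch under assumptions: an arbitrary fixed 3×3 filter, a small 2×2 blob kept away from the image borders (so border effects do not interfere), and a global max standing in for pooling. The helper name `correlate2d_valid` is hypothetical.

```python
import numpy as np


def correlate2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive 'valid' 2-D cross-correlation."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


kernel = np.arange(9, dtype=float).reshape(3, 3)  # arbitrary fixed filter

image = np.zeros((8, 8))
image[2:4, 2:4] = 1.0  # a 2x2 blob, away from the borders

# Move the blob two pixels down and two pixels right.
shift = (2, 2)
shifted_image = np.roll(image, shift=shift, axis=(0, 1))

features = correlate2d_valid(image, kernel)
shifted_features = correlate2d_valid(shifted_image, kernel)

# Equivariance: the feature map of the shifted image is the shifted feature map.
is_equivariant = np.array_equal(np.roll(features, shift=shift, axis=(0, 1)),
                                shifted_features)

# Invariance after pooling: a global max ignores where the peak sits.
pooled, shifted_pooled = features.max(), shifted_features.max()

print("feature map shifts with the input:", is_equivariant)
print("global max before and after shift:", pooled, shifted_pooled)
```

Real pooling layers use small windows rather than a global max, which is why the resulting invariance is only approximate.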

Hierarchy of features

Stack convolutions and the network learns edges → textures → parts → objects, layer by layer: each layer combines the features of the one before it, so deeper layers respond to larger and more complex patterns.
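One concrete reason stacking helps is that each extra layer sees a larger patch of the original image. The sketch below computes that patch size (the receptive field) for a stack of layers, assuming no dilation. The function name and the three-layer 3×3 example are illustrative assumptions; which features each layer ends up learning is decided by training, not by this calculation.

```python
def receptive_fields(layers: list[tuple[int, int]]) -> list[int]:
    """Receptive-field width after each layer, given (kernel_size, stride) pairs."""
    field = 1  # a single input pixel sees only itself
    jump = 1   # distance in input pixels between neighbouring outputs
    sizes = []
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump  # each layer widens the view
        jump *= stride                     # strides spread later layers further apart
        sizes.append(field)
    return sizes


# Assumed example: three stacked 3x3 convolutions with stride 1.
three_convs = receptive_fields([(3, 1), (3, 1), (3, 1)])
print("receptive field per layer:", three_convs)
```

The first layer can only respond to 3×3 patterns such as edges, while later layers have enough context to respond to larger structures.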

What's next

The next articles build the CNN piece by piece: the convolution operation, the filters it learns, padding & stride, and pooling — then we assemble a full CNN.