Putting It Together: A CNN Architecture

Stacking the blocks

We have the parts — convolution, ReLU, pooling. A CNN simply stacks them into a pipeline: a feature extractor of repeated conv-pool blocks, then a classifier head of dense layers.

The classic pattern

[Conv → ReLU → Pool] × N → Flatten → Dense → Softmax

Follow an image through

Consider a 32×32 colour image flowing through two conv-pool blocks: the spatial size shrinks while the channel count grows, and the result is flattened into a dense classifier that outputs "cat".
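To make the shrinking and growing concrete, the sketch below traces tensor shapes through two conv-pool blocks. The text does not fix the layer settings, so the 3×3 kernels with padding 1, the 2×2 pooling with stride 2, and the filter counts of 16 and 32 are all assumptions chosen for illustration. The helper names conv_out and trace_shapes are hypothetical as well.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial axis for a conv or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(height, width, channels, filters_per_block):
    """Return (layer name, shape) pairs for a stack of conv-pool blocks."""
    shapes = [("input", (height, width, channels))]
    for i, filters in enumerate(filters_per_block, start=1):
        # Assumed: 3x3 convolution, stride 1, padding 1 (H and W unchanged).
        height = conv_out(height, kernel=3, stride=1, padding=1)
        width = conv_out(width, kernel=3, stride=1, padding=1)
        channels = filters  # one output channel per filter
        shapes.append((f"conv{i} + relu", (height, width, channels)))
        # Assumed: 2x2 max pooling, stride 2 (H and W halved).
        height = conv_out(height, kernel=2, stride=2)
        width = conv_out(width, kernel=2, stride=2)
        shapes.append((f"pool{i}", (height, width, channels)))
    # Flatten: every remaining value becomes one entry of a 1D vector.
    shapes.append(("flatten", (height * width * channels,)))
    return shapes


# Illustrative filter counts (16, then 32); not taken from the text.
shapes = trace_shapes(32, 32, 3, filters_per_block=[16, 32])
for name, shape in shapes:
    print(f"{name:<14}{shape}")
```

Under these assumptions the image goes from 32×32×3 to 16×16×16 after the first block and 8×8×32 after the second, so the dense classifier receives a vector of 2048 values.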

The two halves

Feature extractor (conv + pool)

Repeated conv-pool blocks turn raw pixels into a stack of high-level feature maps. Spatial size ↓, channels ↑.
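A single conv-pool block can be sketched in plain Python to show both effects at once. The 6×6 test image, the two hand-written edge kernels, and the function names are illustrative assumptions; a trained network would learn its kernels rather than have them written by hand. As in most deep learning libraries, the "convolution" here is implemented as cross-correlation (the kernel is not flipped).

```python
def conv2d_valid(image, kernel):
    """Slide a kernel over a 2D image with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(fmap):
    """Replace negative activations with zero."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2; a trailing odd row or column is dropped."""
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, len(fmap[0]) - 1, 2)
        ]
        for r in range(0, len(fmap) - 1, 2)
    ]


# Hypothetical single-channel 6x6 image with a vertical edge near the right side.
image = [[0, 0, 0, 0, 1, 1] for _ in range(6)]

# Two hand-written 3x3 kernels stand in for learned filters.
kernels = {
    "vertical_edge": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
    "horizontal_edge": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
}

# One block: conv -> ReLU -> pool, applied once per kernel.
feature_maps = {
    name: max_pool_2x2(relu(conv2d_valid(image, kernel)))
    for name, kernel in kernels.items()
}
for name, fmap in feature_maps.items():
    print(name, fmap)
```

One 6×6 channel goes in and two 2×2 feature maps come out: the channel count has grown and the spatial size has shrunk. The vertical-edge map still responds on its right-hand side, but it no longer records which exact column the edge was in.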

Flatten / GAP (3D → 1D)

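Between the two halves, the final 3D stack of feature maps is collapsed into a 1D vector, either by flattening (keeping every value) or by global average pooling (GAP, keeping one mean per channel).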
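The difference between the two options is easiest to see in the length of the resulting vector. The following is a minimal sketch using nested lists in a channels-first layout; the feature-map values and function names are made up for illustration.

```python
def flatten(feature_maps):
    """Concatenate every value of every channel into one long vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def global_average_pool(feature_maps):
    """Reduce each channel to its mean, giving one value per channel."""
    return [
        sum(v for row in fmap for v in row) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


# Hypothetical final feature maps: 3 channels, each 2x2.
feature_maps = [
    [[1, 2], [3, 4]],
    [[0, 0], [0, 8]],
    [[5, 5], [5, 5]],
]

flat = flatten(feature_maps)             # length = channels * height * width
gap = global_average_pool(feature_maps)  # length = channels

print(len(flat), flat)
print(len(gap), gap)
```

Flattening keeps all 12 values, so the first dense layer needs a weight for each of them. GAP keeps only 3 values, which makes the classifier head much smaller but discards the remaining spatial layout.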

Classifier head (dense + softmax)

One or two dense layers map the feature vector to one score per class; softmax converts those scores into probabilities that sum to 1.
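As a rough sketch of the head, the code below applies a single dense layer and a softmax to a three-value feature vector. The class names, weights, and biases are invented for the example; in a real network they would be learned during training.

```python
import math


def dense(features, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical inputs: a 3-value feature vector and hand-picked parameters.
classes = ["cat", "dog", "bird"]
features = [2.5, 2.0, 5.0]
weights = [
    [0.2, 0.1, 0.4],   # weights for the "cat" score
    [0.1, 0.3, 0.1],   # weights for the "dog" score
    [-0.2, 0.0, 0.1],  # weights for the "bird" score
]
biases = [0.0, 0.1, 0.0]

scores = dense(features, weights, biases)
probs = softmax(scores)
prediction = classes[probs.index(max(probs))]

for name, p in zip(classes, probs):
    print(f"{name}: {p:.3f}")
print("prediction:", prediction)
```

With these made-up parameters the largest score belongs to "cat", so that class also receives the highest probability. Softmax preserves the ranking of the scores and only rescales them.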

The recurring pattern

Wider but smaller

As data flows deeper, each block shrinks the spatial size (via pooling or strided convolution) while growing the channel count. The network trades "where" for "what": it loses precise location but gains rich, abstract features. This same recipe underlies LeNet, AlexNet, VGG, and ResNet.