Putting It Together: A CNN Architecture
Stacking the blocks
We have the parts — convolution, ReLU, pooling. A CNN simply stacks them into a pipeline: a feature extractor of repeated conv-pool blocks, then a classifier head of dense layers.
[Conv → ReLU → Pool] × N → Flatten → Dense → Softmax
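To make the "× N" in the pattern concrete, here is a minimal sketch that expands it into an ordered list of layer names. The function name cnn_layer_sequence and the layer labels are illustrative assumptions, not part of any library.

```python
def cnn_layer_sequence(n_blocks):
    """Expand [Conv -> ReLU -> Pool] x N -> Flatten -> Dense -> Softmax into names."""
    if n_blocks < 1:
        raise ValueError("need at least one conv-pool block")
    layers = []
    # Feature extractor: the same three-layer block, repeated N times.
    for i in range(1, n_blocks + 1):
        layers += [f"conv{i}", f"relu{i}", f"pool{i}"]
    # Classifier head: collapse to a vector, score the classes, normalise.
    layers += ["flatten", "dense", "softmax"]
    return layers


print(" -> ".join(cnn_layer_sequence(2)))
# prints: conv1 -> relu1 -> pool1 -> conv2 -> relu2 -> pool2 -> flatten -> dense -> softmax
```

Changing n_blocks only lengthens the feature extractor; the head stays the same three steps.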
Follow an image through
A 32×32 colour image flows through two conv-pool blocks, its spatial size shrinking and its channel count growing at each step, and is then flattened into a dense classifier that outputs "cat".
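The shape changes in this walk-through can be traced with a few lines of arithmetic. The sketch below assumes details the text does not specify: 3×3 convolutions with stride 1 and padding 1 (which preserve spatial size), 2×2 pooling with stride 2, and 16 then 32 output channels. Only the 32×32 colour input and the two blocks come from the description above.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output length of a pooling layer along one spatial axis."""
    return (size - window) // stride + 1


def trace_shapes(height, width, channels, block_channels):
    """Return (layer name, shape) pairs as (H, W, C) for each stage."""
    shapes = [("input", (height, width, channels))]
    for i, out_channels in enumerate(block_channels, start=1):
        # Convolution sets the channel count; ReLU leaves the shape unchanged.
        height, width, channels = conv_out(height), conv_out(width), out_channels
        shapes.append((f"conv{i}+relu", (height, width, channels)))
        # Pooling halves each spatial axis and keeps the channels.
        height, width = pool_out(height), pool_out(width)
        shapes.append((f"pool{i}", (height, width, channels)))
    shapes.append(("flatten", (height * width * channels,)))
    return shapes


for name, shape in trace_shapes(32, 32, 3, block_channels=[16, 32]):
    print(f"{name:12s} {shape}")
```

Under these assumptions the image goes from 32×32×3 to 16×16×16 after the first block and 8×8×32 after the second, so the dense classifier receives a vector of 8 × 8 × 32 = 2048 values.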
The two halves
Feature extractor: repeated conv-pool blocks turn raw pixels into a stack of high-level feature maps. Spatial size ↓, channels ↑.
Bridge between the halves: the final feature maps are collapsed into a vector, either by flattening or by global average pooling.
Classifier head: one or two dense layers map the features to class scores, and softmax turns the scores into probabilities.
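The hand-off from feature maps to probabilities can be sketched in plain Python. Everything concrete here is an assumption for illustration: the toy 2-channel 2×2 feature maps, the hand-picked weights, and the three classes. A real network would learn the weights and use a tensor library.

```python
import math


def flatten(feature_maps):
    """feature_maps[c][y][x] -> one vector of length C*H*W."""
    return [v for channel in feature_maps for row in channel for v in row]


def global_average_pool(feature_maps):
    """feature_maps[c][y][x] -> one mean per channel (length C)."""
    return [
        sum(v for row in channel for v in row) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]


def dense(x, weights, bias):
    """One dense layer: a weighted sum of the inputs plus a bias per output."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(scores):
    """Turn scores into probabilities; subtracting the max avoids overflow."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy feature maps: 2 channels, each 2x2 (hypothetical values).
feature_maps = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 2.0], [2.0, 0.0]]]

print(len(flatten(feature_maps)))          # 8 values: C*H*W
features = global_average_pool(feature_maps)
print(features)                            # [4.0, 1.0]: one value per channel

# Hypothetical dense layer mapping 2 features to 3 class scores.
weights = [[0.5, -1.0], [-0.5, 1.0], [0.0, 0.0]]
bias = [0.0, 0.0, 0.0]
probs = softmax(dense(features, weights, bias))
print(probs)
```

Flattening keeps every spatial position (8 inputs here), while global average pooling keeps one number per channel (2 inputs), so the dense layer that follows needs far fewer weights.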
The recurring pattern
As data flows deeper, the blocks progressively shrink the spatial size (via pooling or strided convolution) while growing the channel count. The network trades "where" for "what": it loses precise location but gains rich, abstract features. This same recipe underlies LeNet, AlexNet, VGG, and ResNet.
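A short sketch of the trade, under assumed settings: a 32-pixel-wide input, 16 channels in the first block, and channels doubling per block. None of these numbers come from a specific architecture. It also compares the two downsampling options, 2×2 pooling with stride 2 and a 3×3 convolution with stride 2 and padding 1.

```python
def downsample_by_pool(size, window=2, stride=2):
    """Spatial size after a pooling layer."""
    return (size - window) // stride + 1


def downsample_by_strided_conv(size, kernel=3, stride=2, padding=1):
    """Spatial size after a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


size, channels = 32, 16  # assumed starting width and channel count
rows = []
for block in range(1, 5):
    pooled = downsample_by_pool(size)
    strided = downsample_by_strided_conv(size)
    rows.append((block, pooled, strided, channels))
    # Next block: half the width, double the channels (an assumed schedule).
    size, channels = pooled, channels * 2

for block, pooled, strided, ch in rows:
    print(f"block {block}: width {pooled} (pool) / {strided} (strided conv), channels {ch}")
```

With these settings both downsampling options give the same widths (16, 8, 4, 2) while the channel count climbs from 16 to 128: fewer positions to say "where", more channels to say "what".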