Data Augmentation · Suman Bhadra Notes

The cheapest extra data is the data you already have

Deep networks are data-hungry, and labeled data is expensive. But here's the trick: a cat photo flipped, cropped, or slightly darkened is still a cat — yet to the network, each version is a brand-new training example. Data augmentation multiplies your dataset by applying random label-safe transformations during training.

Because the model never sees the exact same pixels twice, it can't just memorize individual photos; it's forced to learn what actually makes a cat a cat. That makes augmentation a nearly free regularizer (it costs some compute but no new labels), in the same family as dropout, and one of the first things to reach for when fighting overfitting.

Flips & crops: position invariance

A cat is a cat on the left or right of the frame, mirrored or zoomed in. The label doesn't change when the framing does.
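As a minimal sketch of flips and crops, assuming images are NumPy arrays of shape height x width x channels: the function name `random_flip_and_crop`, the 28 x 28 crop size, and the random stand-in "photo" are all illustrative choices, not part of the original notes.

```python
import numpy as np

def random_flip_and_crop(image, crop_size, rng):
    """Return a randomly mirrored, randomly positioned crop of an H x W x C image."""
    height, width = image.shape[:2]
    crop_h, crop_w = crop_size
    # Mirror left-to-right about half of the time.
    if rng.random() < 0.5:
        image = image[:, ::-1]
    # Pick the crop's top-left corner uniformly among all positions that fit.
    top = rng.integers(0, height - crop_h + 1)
    left = rng.integers(0, width - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

rng = np.random.default_rng(0)
# Random pixels stand in for a real cat photo (assumption for the demo).
cat = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
# One source image, eight differently framed training views.
views = [random_flip_and_crop(cat, (28, 28), rng) for _ in range(8)]
```

Every view keeps the label "cat"; only the framing changes.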

Rotations & scale: pose invariance

Slight tilts and rescales: same object, new pixels. The label doesn't change when the camera angle wobbles.
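A slight tilt can be sketched with plain NumPy using nearest-neighbor sampling. This is an illustrative implementation under stated assumptions (rotation about the image center, uncovered corners filled with zeros, a tilt range of plus or minus 10 degrees); real pipelines typically use a library routine with proper interpolation, and rescaling is left out to keep the sketch short.

```python
import numpy as np

def rotate_nearest(image, angle_deg):
    """Rotate an H x W (x C) image about its center; uncovered pixels become 0."""
    height, width = image.shape[:2]
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:height, 0:width]
    # Inverse mapping: for each output pixel, find the source pixel to copy from.
    src_x = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    src_y = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    src_x = np.rint(src_x).astype(int)
    src_y = np.rint(src_y).astype(int)
    # Only copy pixels whose source location lies inside the original image.
    inside = (src_x >= 0) & (src_x < width) & (src_y >= 0) & (src_y < height)
    rotated = np.zeros_like(image)
    rotated[inside] = image[src_y[inside], src_x[inside]]
    return rotated

rng = np.random.default_rng(0)
photo = rng.random((32, 32, 3))  # stand-in for a real photo
tilted = rotate_nearest(photo, rng.uniform(-10.0, 10.0))
```

Keeping the angle range small is the point: a wobble of a few degrees leaves the object recognizable, while a large rotation may not.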

Color jitter & noise: lighting invariance

Brightness, contrast, blur, sensor noise — the label survives a bad camera, so the model should too.
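Brightness, contrast, and sensor noise can be sketched in a few lines, assuming float images with values in [0, 1]. The parameter names and default ranges (`max_brightness`, `max_contrast`, `noise_std`) are illustrative assumptions, and blur is omitted for brevity.

```python
import numpy as np

def jitter_and_noise(image, rng, max_brightness=0.2, max_contrast=0.2, noise_std=0.02):
    """Randomly perturb brightness and contrast, then add Gaussian noise.

    Expects a float image with values in [0, 1]; returns the same shape and range.
    """
    brightness = rng.uniform(-max_brightness, max_brightness)
    contrast = rng.uniform(1.0 - max_contrast, 1.0 + max_contrast)
    mean = image.mean()
    # Scale deviations from the mean (contrast), then shift everything (brightness).
    out = (image - mean) * contrast + mean + brightness
    # Additive Gaussian noise as a crude model of a noisy sensor.
    out = out + rng.normal(0.0, noise_std, size=image.shape)
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
photo = rng.random((32, 32, 3))  # stand-in for a real photo
variant = jitter_and_noise(photo, rng)
```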

Each transform you choose teaches the network an invariance: "the label doesn't change when…". That's the whole game.

One image becomes eight

A single cat photo can fan out into a batch of eight augmented variants. Some transforms keep the label true, others quietly break it, and the effect of augmentation shows up when you compare training curves with and without it.

The label-preserving rule

A transform is only fair game if the label survives it. Turn a handwritten 6 upside down and you've just labeled a 9 as a 6. Mirror a chest X-ray and the heart moves to the wrong side of the body: the image looks fine, but the medical "label" is now a lie. Flips that are harmless for cats are poison for digits and radiology.

Augmentation encodes domain knowledge

Picking augmentations is really a way of telling the model which variations are meaningless in your domain. Horizontal flips: safe for animals, unsafe for text, digits, and anatomy. There is no universal list; you have to know your data.
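One way to make that domain knowledge explicit is a per-domain allowlist. The sketch below is hypothetical: the names `TRANSFORMS`, `LABEL_SAFE`, and `build_augmenter` are invented for illustration, and the policy only encodes the horizontal-flip rule stated above.

```python
import numpy as np

# Registry of available transforms (only one here, to keep the sketch small).
TRANSFORMS = {
    "hflip": lambda image: image[:, ::-1],
}

# Which transforms leave the label intact, per domain (taken from the text above).
LABEL_SAFE = {
    "animals": {"hflip"},
    "text": set(),
    "digits": set(),
    "anatomy": set(),
}

def build_augmenter(domain, rng, p=0.5):
    """Return a function that applies each label-safe transform with probability p."""
    allowed = sorted(LABEL_SAFE[domain])  # unknown domains raise KeyError on purpose

    def augment(image):
        for name in allowed:
            if rng.random() < p:
                image = TRANSFORMS[name](image)
        return image

    return augment

rng = np.random.default_rng(0)
augment_animals = build_augmenter("animals", rng)
augment_digits = build_augmenter("digits", rng)  # never flips
```

Failing loudly on an unlisted domain is a deliberate choice here: it forces someone to decide what is safe rather than inheriting a default.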

Beyond simple transforms

Mixup: blend examples

Overlay two images and their labels: a blend of 70% cat pixels and 30% dog pixels gets the soft label (0.7, 0.3). This smooths decision boundaries surprisingly well.
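A minimal sketch of the blend, assuming float images of equal shape and one-hot labels. The function name `mixup` and the stand-in random images are illustrative; drawing the weight from a Beta(alpha, alpha) distribution follows the original mixup paper, but `alpha = 0.2` is an assumed value.

```python
import numpy as np

def mixup(image_a, label_a, image_b, label_b, lam):
    """Blend two images and their one-hot labels with the same weight lam."""
    mixed_image = lam * image_a + (1.0 - lam) * image_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_image, mixed_label

rng = np.random.default_rng(0)
cat_img, dog_img = rng.random((32, 32, 3)), rng.random((32, 32, 3))  # stand-ins
cat_lbl, dog_lbl = np.array([1.0, 0.0]), np.array([0.0, 1.0])       # (cat, dog)

# The fixed 70/30 blend from the text: the soft label is approximately (0.7, 0.3).
image, label = mixup(cat_img, cat_lbl, dog_img, dog_lbl, lam=0.7)

# During training the weight is usually random rather than fixed.
alpha = 0.2  # assumed hyperparameter
lam = rng.beta(alpha, alpha)
random_image, random_label = mixup(cat_img, cat_lbl, dog_img, dog_lbl, lam=lam)
```

The key detail is that pixels and labels share the same weight, so the target always describes the mixture the network actually sees.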

Cutout / random erasing: mask patches

Black out random rectangles so the network can't lean on one feature (say, just the ears) — it must use the whole image.
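Masking a patch is a short operation. This sketch assumes a single rectangle per image, a fill value of 0, and a maximum side length of 40% of the image; these are illustrative defaults, not values from the notes.

```python
import numpy as np

def random_erase(image, rng, max_fraction=0.4, fill=0):
    """Return a copy of an H x W x C image with one random rectangle set to fill."""
    height, width = image.shape[:2]
    # Rectangle sides are at least 1 pixel and at most max_fraction of each dimension.
    erase_h = rng.integers(1, max(2, int(height * max_fraction) + 1))
    erase_w = rng.integers(1, max(2, int(width * max_fraction) + 1))
    top = rng.integers(0, height - erase_h + 1)
    left = rng.integers(0, width - erase_w + 1)
    out = image.copy()  # leave the stored training image untouched
    out[top:top + erase_h, left:left + erase_w] = fill
    return out

rng = np.random.default_rng(0)
photo = np.ones((32, 32, 3))  # all-ones stand-in so the erased patch is easy to spot
erased = random_erase(photo, rng)
```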

Text & audio: other modalities

Text is harder: synonym swaps and back-translation can break meaning. Audio is friendlier: time shifts, pitch shifts, and added background noise usually leave the label intact.
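For audio, a time shift plus background noise can be sketched on a raw 1-D waveform. The 16 kHz sample rate, the shift limit, and the noise level are assumptions for the demo; pitch shifting needs resampling and is left out.

```python
import numpy as np

def augment_waveform(wave, rng, max_shift=1600, noise_std=0.005):
    """Shift a 1-D waveform in time and add Gaussian background noise."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.roll(wave, shift)  # np.roll returns a new array
    # np.roll wraps samples around the ends; replace the wrapped part with silence.
    if shift > 0:
        shifted[:shift] = 0.0
    elif shift < 0:
        shifted[shift:] = 0.0
    return shifted + rng.normal(0.0, noise_std, size=wave.shape)

sample_rate = 16000  # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)  # one second of a 440 Hz tone as stand-in audio
rng = np.random.default_rng(0)
noisy = augment_waveform(tone, rng)
```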

Train-time only

Augmentation belongs in the training loop, applied on the fly each epoch. Never augment your validation or test set — those exist to measure performance on real, untouched data, so distorting them just corrupts the measurement (see train/test split). One footnote: test-time augmentation — averaging predictions over augmented copies of a test image — does exist, but it's a separate inference trick, not training.
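Test-time augmentation can be sketched as averaging a model's class probabilities over a few views of the same image. Here the only extra view is a horizontal mirror, and `toy_predict` is a hypothetical stand-in for a trained model, included only so the example runs.

```python
import numpy as np

def predict_with_tta(predict_fn, image):
    """Average class probabilities over the image and its horizontal mirror."""
    views = [image, image[:, ::-1]]
    probs = np.stack([predict_fn(view) for view in views])
    return probs.mean(axis=0)

def toy_predict(image):
    """Hypothetical 3-class model: softmax over simple region intensities."""
    half_w = image.shape[1] // 2
    half_h = image.shape[0] // 2
    logits = np.array([
        image[:, :half_w].mean(),   # left half
        image[:, half_w:].mean(),   # right half
        image[:half_h, :].mean(),   # top half
    ])
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
photo = rng.random((32, 32, 3))  # stand-in test image
averaged = predict_with_tta(toy_predict, photo)
```

Note that the test image itself is never replaced or relabeled; the views exist only for the duration of one prediction.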

Do

Augment on the fly, fresh randomness each epoch
Pick transforms your domain actually allows
Combine with transfer learning on small datasets

Don't

Augment the validation/test set
Use flips on digits, text, or X-rays
Crank distortions so far the image is unrecognizable
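The first "Do" and the first "Don't" can be sketched together as one batch iterator with a `train` switch. The name `iterate_batches`, the flip-only augmentation, and the random stand-in dataset are assumptions for illustration.

```python
import numpy as np

def iterate_batches(images, labels, batch_size, rng, train):
    """Yield (images, labels) batches; shuffle and augment only when train is True."""
    order = rng.permutation(len(images)) if train else np.arange(len(images))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batch = images[idx]  # fancy indexing copies, so stored data is never modified
        if train:
            # Fresh coin flips every time a batch is drawn: mirror about half the images.
            flip = rng.random(len(idx)) < 0.5
            batch[flip] = batch[flip][:, :, ::-1]
        yield batch, labels[idx]

rng = np.random.default_rng(0)
images = rng.random((16, 8, 8, 3))  # stand-in dataset: 16 tiny images
labels = np.arange(16) % 2

# Two passes over the same data produce different pixels each epoch.
epoch_1 = [batch for batch, _ in iterate_batches(images, labels, 4, rng, train=True)]
epoch_2 = [batch for batch, _ in iterate_batches(images, labels, 4, rng, train=True)]
# Validation data passes through untouched and in order.
val = [batch for batch, _ in iterate_batches(images, labels, 4, rng, train=False)]
```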

A 2012 secret weapon

Random crops and horizontal flips were a key ingredient in AlexNet's ImageNet win. By the authors' own count they enlarged the training set by a factor of 2048, though the resulting examples are highly interdependent. Augmentation has been standard practice in computer vision ever since.