Batch vs Mini-Batch vs Stochastic Gradient Descent

How much data per step?

Gradient descent needs a gradient before each step. The question: compute it from all the data, one example, or some? That single choice sets the cost of each step, the noise in the gradient, and how many updates fit into one pass over the data.

Batch GD: all N examples

Exact gradient of the training loss and a smooth path, but each step needs a full pass over the data. Slow per update, and memory-heavy on large datasets.

Stochastic GD: 1 example

One example per step. Lightning fast updates, but a noisy, zig-zagging path.

Mini-Batch GD: e.g. 32–256 examples

A small batch per step. The practical default — fast, GPU-friendly, reasonably smooth.
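
The three variants differ only in how many examples feed each gradient, so one loop with a batch_size knob can cover all of them. The sketch below is a minimal illustration, assuming a linear model with mean squared error and synthetic data; the names train and mse_gradient, the learning rate, and the data itself are made up for this example.

```python
import numpy as np

def mse_gradient(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    residual = X @ w - y
    return X.T @ residual / len(y)

def train(X, y, batch_size, lr=0.05, epochs=20, seed=0):
    """Gradient descent where batch_size selects the variant."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.zeros(X.shape[1])
    steps = 0
    for _ in range(epochs):
        order = rng.permutation(n)               # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w -= lr * mse_gradient(w, X[idx], y[idx])
            steps += 1
    return w, steps

# Synthetic data (an assumption for illustration): y = 2*x0 - 3*x1 + noise
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(512, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w + 0.1 * data_rng.normal(size=512)

w_batch, steps_batch = train(X, y, batch_size=len(y))  # batch GD: all N examples
w_sgd, steps_sgd = train(X, y, batch_size=1)           # stochastic GD: 1 example
w_mini, steps_mini = train(X, y, batch_size=32)        # mini-batch GD

for name, w, steps in [("batch", w_batch, steps_batch),
                       ("sgd", w_sgd, steps_sgd),
                       ("mini-batch", w_mini, steps_mini)]:
    error = np.linalg.norm(w - true_w)
    print(f"{name:>10}: {steps:>5} updates, distance to true weights {error:.4f}")
```

With the same 20 epochs, batch GD gets only 20 updates while the other two get hundreds or thousands, which is why it tends to lag at a fixed learning rate in this toy setup.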

See the three paths

On the same loss surface, batch GD glides straight in, SGD bounces around noisily, and mini-batch GD takes a steady middle road.

The trade-offs

Smaller batches

Faster, more frequent updates
Gradient noise can help the optimizer escape shallow local minima
Fit in memory easily

Larger batches

Smoother, more stable gradients
Better hardware utilization per step
But fewer updates per epoch; very large batches can sometimes generalize worse
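
The "smoother gradients" side of this trade-off can be measured directly. The sketch below is a rough illustration on synthetic data (the dataset, the helper gradient_noise, and the trial count are all assumptions): it draws many random mini-batches at a fixed weight vector and reports how far their gradients land from the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4096
X = rng.normal(size=(N, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)
w = np.zeros(3)  # all gradients are evaluated at this fixed point

def gradient(w, X, y):
    """Gradient of 0.5 * mean squared error for a linear model."""
    return X.T @ (X @ w - y) / len(y)

full_grad = gradient(w, X, y)

def gradient_noise(batch_size, trials=2000):
    """RMS distance between mini-batch gradients and the full-batch gradient."""
    squared_errors = []
    for _ in range(trials):
        idx = rng.choice(N, size=batch_size, replace=False)
        g = gradient(w, X[idx], y[idx])
        squared_errors.append(np.sum((g - full_grad) ** 2))
    return float(np.sqrt(np.mean(squared_errors)))

noise = {b: gradient_noise(b) for b in (1, 16, 256)}
for b, value in noise.items():
    print(f"batch size {b:>3}: gradient noise ~ {value:.3f}")
```

In this setup the noise shrinks roughly with the square root of the batch size, so a 16x larger batch buys only about a 4x smoother gradient while costing 16x more compute per step.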

Vocabulary

An epoch = one full pass over the data. With batch size B and N examples, that's N/B steps per epoch, rounded up when B does not divide N evenly (the last batch is smaller). "SGD" in practice almost always means mini-batch SGD.
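
The steps-per-epoch arithmetic is easy to check in code. This small helper is only a sketch; the function name and the drop_last flag (discard the final partial batch instead of keeping it) are illustrative choices, not part of any particular library here.

```python
def steps_per_epoch(num_examples, batch_size, drop_last=False):
    """Number of gradient updates in one full pass over the data."""
    if drop_last:
        return num_examples // batch_size                 # partial batch discarded
    return (num_examples + batch_size - 1) // batch_size  # ceiling division

print(steps_per_epoch(50_000, 128))                  # keeps the final batch of 80
print(steps_per_epoch(50_000, 128, drop_last=True))  # drops it
print(steps_per_epoch(50_000, 50_000))               # batch GD
print(steps_per_epoch(50_000, 1))                    # stochastic GD
```

For example, 50,000 examples with B = 128 gives 390 full batches plus one batch of 80, so 391 steps per epoch.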

Practical defaults

Start here

Mini-batch of 32–256 (powers of 2 by convention), reshuffled each epoch. Tune the learning rate alongside the batch size, since bigger batches often want a bigger learning rate. Then consider an adaptive optimizer such as Adam.
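
A minimal sketch of these defaults, assuming NumPy arrays for the data: minibatches reshuffles on every call (one call per epoch), and scaled_learning_rate applies the common linear scaling heuristic, where the learning rate grows in proportion to the batch size. Both function names are hypothetical, and the scaling rule is only a starting point for tuning, not a guarantee.

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield shuffled mini-batches that cover the data exactly once (one epoch)."""
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    """Linear scaling heuristic: learning rate proportional to batch size."""
    return base_lr * batch_size / base_batch_size

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)  # toy data: 10 examples
y = np.arange(10, dtype=float)

# Each call reshuffles, so the two epochs see the examples in different orders.
epoch_one = [yb for _, yb in minibatches(X, y, batch_size=4, rng=rng)]
epoch_two = [yb for _, yb in minibatches(X, y, batch_size=4, rng=rng)]

print([len(b) for b in epoch_one])          # batch sizes within one epoch
print(scaled_learning_rate(0.01, 32, 256))  # starting guess for batch size 256
```

Treat the scaled value as an initial guess and still sweep the learning rate around it, since the heuristic tends to break down at very large batch sizes.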