Batch vs Mini-Batch vs Stochastic Gradient Descent
How much data per step?
Gradient descent needs a gradient before each step. The question: compute it from all the data, one example, or some? That single choice changes everything about how training feels.
Batch: exact gradient and a smooth path, but every step needs a full pass over the data. Slow and memory-heavy on large datasets.
Stochastic (SGD): one example per step. Very cheap, frequent updates, but a noisy, zig-zagging path.
Mini-batch: a small batch per step. The practical default: fast, GPU-friendly, and reasonably smooth.
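In code, batch, stochastic, and mini-batch gradient descent are the same loop with a different batch size. Here is a minimal sketch, assuming a toy one-parameter model y = w * x with mean squared error. The names `train` and `mse_gradient`, the data, and the learning rate are illustrative choices, not taken from any library.

```python
import random

def mse_gradient(w, batch):
    """Gradient of mean squared error for the toy model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, lr=0.2, epochs=100, seed=0):
    """Gradient descent using `batch_size` examples per step.

    batch_size == len(data) -> batch gradient descent
    batch_size == 1         -> stochastic gradient descent
    anything in between     -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    steps = 0
    for _ in range(epochs):
        rng.shuffle(data)  # new example order every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= lr * mse_gradient(w, batch)
            steps += 1
    return w, steps

# Toy data (an assumption for illustration): y = 3x plus a little noise
data_rng = random.Random(42)
data = [(x, 3.0 * x + data_rng.gauss(0, 0.1))
        for x in (data_rng.uniform(-1, 1) for _ in range(64))]

w_batch, steps_batch = train(data, batch_size=len(data))  # 1 step per epoch
w_sgd, steps_sgd = train(data, batch_size=1)              # 64 steps per epoch
w_mini, steps_mini = train(data, batch_size=16)           # 4 steps per epoch
```

All three runs see the same data for the same number of epochs. What differs is how many updates they make along the way and how noisy each update is.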
See the three paths
On the same loss surface, watch batch glide straight in, SGD bounce around noisily, and mini-batch take a steady middle road.
The trade-offs
Smaller batches
- Faster, more frequent updates
- Gradient noise can help escape shallow local minima
- Fit in memory easily
Larger batches
- Smoother, more stable gradients
- Better hardware utilization per step
- But fewer updates per epoch, and they can generalize worse
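The "noisy versus smooth" trade-off can be measured directly: hold the weight fixed, draw many random batches, and see how much the gradient estimate varies. The sketch below assumes the same toy model y = w * x as an illustration. `gradient_spread` is a hypothetical helper, and the dataset is synthetic.

```python
import random
import statistics

rng = random.Random(0)
# Synthetic data (an assumption for illustration): y = 3x plus noise
data = [(x, 3.0 * x + rng.gauss(0, 0.5))
        for x in (rng.uniform(-1, 1) for _ in range(1024))]

def batch_gradient(w, batch):
    """Mean squared error gradient for y_hat = w * x on one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def gradient_spread(w, batch_size, trials=500):
    """Standard deviation of the gradient estimate across random batches."""
    grads = [batch_gradient(w, rng.sample(data, batch_size))
             for _ in range(trials)]
    return statistics.stdev(grads)

# Same weight, different batch sizes; 1024 is the full dataset
spreads = {b: gradient_spread(w=0.0, batch_size=b) for b in (1, 8, 64, 1024)}
for b, s in spreads.items():
    print(f"batch size {b:>4}: gradient std = {s:.4f}")
```

The spread shrinks as the batch grows and is essentially zero at full batch, where every "sample" is the whole dataset. That shrinking noise is what makes large-batch paths smooth, and its presence is what lets small-batch paths jitter out of shallow dips.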
An epoch is one full pass over the data. With batch size B and N examples, that's N/B steps per epoch (rounded up when B does not divide N evenly). "SGD" in practice almost always means mini-batch SGD.
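The steps-per-epoch arithmetic is worth a quick sanity check. This sketch uses a hypothetical helper, `steps_per_epoch`. The `drop_last` flag is an assumption that mirrors a common convention of discarding a final, smaller batch.

```python
import math

def steps_per_epoch(n_examples, batch_size, drop_last=False):
    """Number of gradient steps in one full pass over the data."""
    if drop_last:
        # Discard the final, incomplete batch
        return n_examples // batch_size
    # Keep the final batch even if it is smaller than batch_size
    return math.ceil(n_examples / batch_size)

print(steps_per_epoch(50_000, 50_000))               # batch: 1
print(steps_per_epoch(50_000, 1))                    # stochastic: 50000
print(steps_per_epoch(50_000, 128))                  # mini-batch: 391
print(steps_per_epoch(50_000, 128, drop_last=True))  # mini-batch: 390
```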
Practical defaults
Start with a mini-batch of 32–256 (powers of 2), reshuffled each epoch. Tune the learning rate together with the batch size, since bigger batches often want a bigger learning rate. Then reach for a smarter optimizer like Adam.
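Two of these defaults are easy to sketch: reshuffling each epoch and adjusting the learning rate with the batch size. The helpers `minibatches` and `scaled_lr` below are hypothetical. The linear scaling in `scaled_lr` is one common heuristic, assumed here for illustration. It is a starting point for tuning, not a rule that always holds.

```python
import random

def minibatches(data, batch_size, rng):
    """Yield shuffled mini-batches that cover `data` exactly once (one epoch)."""
    indices = list(range(len(data)))
    rng.shuffle(indices)  # call once per epoch to get a fresh order
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

def scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear scaling heuristic (an assumption): lr grows with batch size."""
    return base_lr * batch_size / base_batch_size

data = list(range(10))
rng = random.Random(0)
epoch = list(minibatches(data, batch_size=4, rng=rng))
print([len(batch) for batch in epoch])  # [4, 4, 2]

# If lr=0.1 worked at batch size 32, try about 0.4 at batch size 128
print(scaled_lr(0.1, base_batch_size=32, batch_size=128))
```

Every example appears exactly once per epoch, in a fresh random order, and the last batch is simply smaller when the batch size does not divide the dataset evenly.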