Batch vs Mini-Batch vs Stochastic Gradient Descent

How much data per step?

Gradient descent needs a gradient before each step. The question is how much data to compute it from: all of it, a single example, or a small subset. That choice sets the cost of each step and how noisy the gradient estimate is.

Batch GD: all N examples

Exact gradient and a smooth path, but each step needs a full pass over the data. Slow per step and memory-heavy.

Stochastic GD: 1 example

One example per step. Each update is very cheap, but the path is noisy and zig-zags.

Mini-Batch GD: e.g. 32–256 examples

A small batch per step. The practical default: fast, GPU-friendly, reasonably smooth.
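
The three variants differ only in how many examples feed each gradient, so a single training loop with a `batch_size` parameter can cover all of them. The following is a minimal NumPy sketch under stated assumptions: the linear model, the mean-squared-error loss, the synthetic data, and the learning rates are all illustrative choices, not something the text above prescribes.

```python
import numpy as np

def train(X, y, batch_size, lr=0.1, epochs=20, seed=0):
    """Gradient descent on mean squared error for a linear model y ~ X @ w."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # last batch may be smaller
            Xb, yb = X[idx], y[idx]
            # Gradient of the MSE computed on this batch only.
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
            w -= lr * grad
    return w

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

# Synthetic data (hypothetical): 512 examples, 3 features, small label noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(512, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=512)

w_batch = train(X, y, batch_size=len(X))       # batch GD: 1 step per epoch
w_sgd = train(X, y, batch_size=1, lr=0.01)     # stochastic GD: N steps per epoch
w_mini = train(X, y, batch_size=32)            # mini-batch GD: N/32 steps per epoch

for name, w in [("batch", w_batch), ("sgd", w_sgd), ("mini-batch", w_mini)]:
    print(f"{name:>10}: final MSE = {mse(X, y, w):.4f}")
```

The single-example run uses a smaller learning rate here because its per-step gradients are much noisier; that value is a guess for this toy problem rather than a general recommendation.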

See the three paths

On the same loss surface, watch batch glide straight in, SGD bounce around noisily, and mini-batch take a steady middle road.

The trade-offs

Smaller batches
  • Faster, more frequent updates
  • Gradient noise can help escape shallow local minima
  • Fit in memory easily
Larger batches
  • Smoother, more stable gradients
  • Better hardware utilization per step
  • Fewer updates per epoch, and sometimes worse generalization
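
The "smoother gradients" side of this trade-off can be checked numerically. The sketch below, which assumes a toy linear-regression problem with made-up data, measures how far mini-batch gradients stray from the full-data gradient at a fixed parameter value. For batches that are small relative to the dataset, the deviation is expected to shrink roughly like 1/sqrt(B).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression data: 4096 examples, 5 features.
X = rng.normal(size=(4096, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=4096)
w = np.zeros(5)  # fixed point at which all gradients are evaluated

def grad(idx):
    """MSE gradient computed on the examples selected by idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

full_grad = grad(np.arange(len(X)))

def gradient_noise(batch_size, trials=500):
    """RMS distance between random mini-batch gradients and the full gradient."""
    errors = []
    for _ in range(trials):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        errors.append(np.linalg.norm(grad(idx) - full_grad))
    return float(np.sqrt(np.mean(np.square(errors))))

noise_by_batch = {b: gradient_noise(b) for b in (1, 16, 256)}
for b, noise in noise_by_batch.items():
    print(f"batch_size={b:>3}  gradient noise = {noise:.3f}")
```

The exact numbers depend on the data and the seed; the point is only the ordering, with larger batches giving estimates closer to the full gradient.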

Vocabulary

An epoch is one full pass over the data. With batch size B and N examples, that's ⌈N/B⌉ steps per epoch (exactly N/B when B divides N). In practice, "SGD" almost always means mini-batch SGD.
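
As a quick worked example of the step count, here is a small helper. The function name and the `drop_last` flag are hypothetical, introduced only to show what happens when N is not a multiple of B: either the final, smaller batch is kept (rounding up) or it is discarded (rounding down).

```python
import math

def steps_per_epoch(n_examples, batch_size, drop_last=False):
    """Number of gradient updates in one full pass over the data."""
    if drop_last:
        return n_examples // batch_size          # discard the partial batch
    return math.ceil(n_examples / batch_size)    # keep the partial batch

print(steps_per_epoch(50_000, 128))                  # 391 (390 full batches + one of 80)
print(steps_per_epoch(50_000, 128, drop_last=True))  # 390
print(steps_per_epoch(50_000, 50_000))               # 1      (batch GD)
print(steps_per_epoch(50_000, 1))                    # 50000  (stochastic GD)
```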

Practical defaults

Start here

Use a mini-batch of 32–256 (powers of 2), reshuffled each epoch. Tune the learning rate alongside the batch size, since bigger batches often want a bigger learning rate. Once that works, consider an adaptive optimizer such as Adam.
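
One common way to pick a first learning rate when changing the batch size is the linear scaling heuristic: multiply the learning rate by the same factor as the batch size. The sketch below assumes that heuristic, which is not stated in the text above, and uses a hypothetical base setting of 0.01 at batch size 32. Treat the results as starting points for tuning, not as final values.

```python
def scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear scaling heuristic (an assumption): lr grows in proportion to batch size."""
    return base_lr * batch_size / base_batch_size

# Hypothetical tuned starting point: lr = 0.01 at batch size 32.
base_lr, base_batch_size = 0.01, 32

candidates = {b: scaled_lr(base_lr, base_batch_size, b) for b in (32, 64, 128, 256)}
for b, lr in candidates.items():
    print(f"batch_size={b:>3}  starting lr = {lr:.3f}")
```

The heuristic tends to break down at very large batch sizes and may not transfer to adaptive optimizers, so each candidate is still worth validating on held-out data.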