ReLU & Its Variants
The modern default
ReLU, the Rectified Linear Unit, is almost embarrassingly simple: ReLU(x) = max(0, x). Positive inputs pass straight through; negative inputs are clamped to zero. Yet it remains the default activation in a large share of deep networks.
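The definition translates almost word for word into code. A minimal sketch in plain Python, applied to a handful of illustrative inputs (the sample values are arbitrary):

```python
def relu(x: float) -> float:
    """ReLU(x) = max(0, x): positives pass through, negatives become 0."""
    return max(0.0, x)


inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in inputs]
print(outputs)  # [0.0, 0.0, 0.0, 0.5, 2.0]
```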
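Why did it dethrone sigmoid and tanh? Both of those saturate: their gradients shrink toward zero for large |x| (sigmoid's never exceeds 0.25), so the signal fades as it is multiplied through many layers. ReLU's gradient is exactly 1 for every positive input, so it doesn't saturate there and gradients flow freely through deep stacks. It's also blazing fast: just a comparison, no exp.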
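To make the gradient argument concrete, here is a rough sketch that multiplies the same local derivative across a stack of layers. It is a toy model: it ignores weight matrices entirely and assumes every layer sees the same input, and the helper name `chained_gradient` and the depth of 20 are made up for illustration.

```python
import math


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for x > 0, else 0 (using 0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0


def chained_gradient(grad_fn, x: float, depth: int) -> float:
    """Product of `depth` identical local derivatives (toy model, no weights)."""
    return grad_fn(x) ** depth


depth = 20
# Sigmoid evaluated at x = 0, where its derivative is largest (0.25).
sigmoid_chain = chained_gradient(sigmoid_grad, 0.0, depth)
# ReLU evaluated at any positive input, where its derivative is 1.
relu_chain = chained_gradient(relu_grad, 1.0, depth)

print(f"sigmoid, best case: {sigmoid_chain:.3e}")
print(f"relu, positive input: {relu_chain}")
```

Even in sigmoid's best case the product collapses to a vanishingly small number after 20 layers, while the ReLU product stays at 1 as long as the inputs are positive.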
ReLU, its flaw, and the fixes
See ReLU's kink, the "dying ReLU" problem where neurons get stuck at zero, and the variants that keep a little life on the negative side.
The family
ReLU: Fast, simple, non-saturating for x > 0. The go-to default.
Leaky ReLU: Lets a trickle through for x < 0 (e.g. 0.01x), so neurons never fully die.
ELU: Exponential curve for x < 0, of the form alpha * (exp(x) - 1). Smooth, can output negatives, pushes mean activations toward zero.
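The three family members described above (ReLU, Leaky ReLU, and ELU, in that order) can be sketched side by side. The default values `negative_slope=0.01` and `alpha=1.0` are assumptions chosen to match the 0.01x example and a common ELU setting; they are tunable, not fixed parts of the definitions.

```python
import math


def relu(x: float) -> float:
    """max(0, x): zero output and zero slope for x < 0."""
    return max(0.0, x)


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """Identity for x > 0; a small linear leak (negative_slope * x) otherwise."""
    return x if x > 0 else negative_slope * x


def elu(x: float, alpha: float = 1.0) -> float:
    """Identity for x > 0; alpha * (exp(x) - 1) otherwise, bounded below by -alpha."""
    return x if x > 0 else alpha * math.expm1(x)


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):+.4f}  "
          f"leaky={leaky_relu(x):+.4f}  elu={elu(x):+.4f}")
```

All three agree for positive inputs; they differ only in what they do with negatives.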
The dying ReLU problem
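If a neuron's weights push its pre-activation negative for every input, ReLU outputs 0 every time, and its gradient is 0 too, so the weights never update. The neuron is dead. Leaky ReLU, ELU, and GELU keep a non-zero gradient for negative inputs, which gives the neuron a way to recover. Leaky ReLU's negative-side slope is constant, while ELU's and GELU's shrink toward zero for very negative inputs.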
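A small experiment can illustrate the difference. The sketch below trains a single neuron with plain gradient descent on squared error, starting from a deliberately bad bias so that the pre-activation is negative for every input. The dataset, starting weights, learning rate, and the `train` helper are all hypothetical, chosen only to trigger the effect.

```python
def relu(z: float) -> float:
    return max(0.0, z)


def relu_grad(z: float) -> float:
    return 1.0 if z > 0 else 0.0


def leaky_relu(z: float, slope: float = 0.01) -> float:
    return z if z > 0 else slope * z


def leaky_relu_grad(z: float, slope: float = 0.01) -> float:
    return 1.0 if z > 0 else slope


def train(act, act_grad, steps: int = 100, lr: float = 0.05):
    """Gradient descent on one neuron: out = act(w * x + b), squared-error loss."""
    w, b = 0.5, -10.0  # bias far below zero: z < 0 for every x in the data
    data = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]  # toy (input, target) pairs
    for _ in range(steps):
        for x, target in data:
            z = w * x + b
            out = act(z)
            dloss_dz = 2.0 * (out - target) * act_grad(z)  # chain rule
            w -= lr * dloss_dz * x
            b -= lr * dloss_dz
    return w, b


relu_w, relu_b = train(relu, relu_grad)
leaky_w, leaky_b = train(leaky_relu, leaky_relu_grad)

print("ReLU:      ", relu_w, relu_b)    # parameters never move
print("Leaky ReLU:", leaky_w, leaky_b)  # parameters drift toward recovery
```

With ReLU the gradient is zero on every sample, so `w` and `b` stay exactly where they started. With the leaky variant each update is small but non-zero, so the neuron slowly moves back toward the active region.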
Use ReLU when
- You want a fast, strong default
- Standard CNNs and feed-forward nets
- Pair with good weight init
Reach for a variant when
- Many neurons are dying (dead ReLUs)
- You're building a transformer → GELU
- You want smoother training → ELU/GELU
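GELU is recommended in the list above but never defined in this section. As a hedged sketch, its exact form is x * Phi(x), where Phi is the standard normal CDF; the version below writes Phi with `math.erf`. Many libraries also offer a tanh-based approximation, which is not shown here.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x: float) -> float:
    """Derivative of GELU: Phi(x) + x * phi(x), with phi the standard normal PDF."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  grad={gelu_grad(x):+.4f}")
```

Unlike ReLU, the curve has no kink at zero, and the gradient for moderately negative inputs is small but not zero.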