ReLU & Its Variants

The modern default

ReLU (the Rectified Linear Unit) is almost embarrassingly simple: ReLU(x) = max(0, x). It passes positive inputs straight through and clamps negative inputs to zero. Yet it remains a default hidden-layer activation across a large share of deep networks.
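
As a minimal sketch of the definition above, here is ReLU in plain Python. The function name `relu` and the sample inputs are illustrative choices, not taken from any particular library.

```python
def relu(x: float) -> float:
    """Rectified Linear Unit: pass positives through, clamp negatives to zero."""
    return max(0.0, x)


# Apply the function to a few sample values on both sides of zero.
inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in inputs]
print(outputs)  # [0.0, 0.0, 0.0, 0.5, 2.0]
```

In practice a framework would apply the same rule element-wise to whole tensors; the scalar version is only meant to show the rule itself.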

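Why did it dethrone sigmoid and tanh? For positive inputs its gradient is exactly 1, so it does not saturate there and the activation itself does not shrink gradients as they pass back through deep stacks. Sigmoid and tanh, by contrast, flatten out for large |x|, where their gradients approach zero. ReLU is also cheap to compute: a single comparison, with no exponential.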
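
To make the gradient argument concrete, the sketch below multiplies activation derivatives across a stack of layers. It is a deliberately simplified illustration: it ignores weight matrices entirely, it evaluates sigmoid at x = 0 (where its derivative is largest, 0.25), and the names `sigmoid_grad`, `relu_grad`, and `depth` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    # Derivative of sigmoid: s * (1 - s), which peaks at 0.25 when x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x: float) -> float:
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0


depth = 10  # assumed number of stacked layers

# Best case for sigmoid: every layer sits at x = 0.
sigmoid_chain = sigmoid_grad(0.0) ** depth
# Active path for ReLU: every layer sees a positive input.
relu_chain = relu_grad(1.0) ** depth

print(f"sigmoid chain over {depth} layers: {sigmoid_chain:.2e}")
print(f"ReLU chain over {depth} layers:    {relu_chain:.2e}")
```

Even in sigmoid's best case the activation factor alone shrinks the gradient by about six orders of magnitude over ten layers, while an active ReLU path contributes a factor of exactly 1.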

ReLU, its flaw, and the fixes

See ReLU's kink, the "dying ReLU" problem where neurons get stuck at zero, and the variants that keep a little life on the negative side.

The family

ReLU: max(0, x)

Fast, simple, non-saturating for x > 0. The go-to default.

Leaky ReLU: small negative slope

Lets a trickle through for x < 0 (e.g. 0.01x), so the gradient there is small but never exactly zero and neurons do not fully die.

ELU: smooth negative side

Exponential curve for x < 0, α(exp(x) − 1), which is smooth, can output negatives, and pushes mean activations toward zero.

GELU: x · Φ(x)

A smooth, probabilistic gate, where Φ is the standard normal CDF. The default in transformers like BERT and GPT.
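
The four family members can be sketched side by side in plain Python. The default values `slope=0.01` and `alpha=1.0` are common choices assumed here for illustration, and GELU is written in its exact form using the standard normal CDF expressed through `math.erf`.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def leaky_relu(x: float, slope: float = 0.01) -> float:
    # Small linear slope on the negative side instead of a hard zero.
    return x if x > 0 else slope * x


def elu(x: float, alpha: float = 1.0) -> float:
    # Exponential curve on the negative side, approaching -alpha.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def gelu(x: float) -> float:
    # x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Print a small comparison table across negative and positive inputs.
print("    x     relu    leaky      elu     gelu")
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{x:5.1f} {relu(x):8.4f} {leaky_relu(x):8.4f} {elu(x):8.4f} {gelu(x):8.4f}")
```

All four agree closely for large positive inputs; they differ mainly in how they treat the negative side.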

The dying ReLU problem

When a neuron dies

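If a neuron's weights and bias push its pre-activation negative for every input, ReLU outputs 0 every time, and its gradient is 0 too, so the weights never update. The neuron is dead. Leaky ReLU keeps a constant non-zero slope on the negative side, which prevents this. ELU and GELU also have non-zero gradients for moderately negative inputs, though those gradients shrink toward zero as the input becomes very negative.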
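
A toy single-neuron example can show the effect. The weight, bias, and inputs below are made-up values chosen so that the pre-activation z = w·x + b is negative for every input; the helper names are hypothetical.

```python
def relu_grad(z: float) -> float:
    # Derivative of ReLU with respect to its input.
    return 1.0 if z > 0 else 0.0


def leaky_relu_grad(z: float, slope: float = 0.01) -> float:
    # Derivative of Leaky ReLU: the negative side keeps a small slope.
    return 1.0 if z > 0 else slope


# Assumed neuron parameters: the large negative bias keeps z below zero.
w, b = 1.0, -5.0
inputs = [0.1, 0.5, 0.9]

# Gradient of the activation output with respect to w is act'(z) * x.
relu_w_grads = [relu_grad(w * x + b) * x for x in inputs]
leaky_w_grads = [leaky_relu_grad(w * x + b) * x for x in inputs]

print("ReLU gradients wrt w:      ", relu_w_grads)
print("Leaky ReLU gradients wrt w:", leaky_w_grads)
```

With ReLU every gradient is zero, so no update can move the neuron back into its active range. With Leaky ReLU the gradients are small but non-zero, which leaves a path to recovery.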

Use ReLU when
  • You want a fast, strong default
  • You're building standard CNNs or feed-forward nets
  • You can pair it with good weight initialization (e.g. He initialization)

Reach for a variant when
  • Many neurons are dying (dead ReLUs)
  • You're building a transformer → GELU
  • You want smoother training → ELU/GELU