Why Do We Need Activation Functions?

The linear trap

Here's a humbling fact: stack a million linear layers and you still have one linear function. Without something non-linear in between, depth buys you nothing.

Why? A linear layer computes Wx + b. Feed its output into another linear layer and you get W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂). That is just W'x + b' with W' = W₂W₁ and b' = W₂b₁ + b₂: a single linear layer. The whole tower collapses.
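The collapse is easy to check numerically. The sketch below uses plain Python lists and arbitrary made-up weights (the helper names `matvec`, `matmul`, `add`, and `linear` are illustrative, not from any library) and compares two stacked layers against the single layer with W' = W₂W₁ and b' = W₂b₁ + b₂.

```python
# Two stacked linear layers versus the single layer they collapse into.
# All weights below are arbitrary illustrative values.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def linear(W, b, x):
    """One linear layer: Wx + b."""
    return add(matvec(W, x), b)

W1, b1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]], [0.5, -1.0, 2.0]  # 2 inputs -> 3
W2, b2 = [[2.0, 0.0, 1.0], [-1.0, 4.0, 0.5]], [1.0, -2.0]         # 3 -> 2 outputs

# Collapsed parameters: W' = W2 W1 and b' = W2 b1 + b2
W_prime = matmul(W2, W1)
b_prime = add(matvec(W2, b1), b2)

x = [3.0, -2.0]
stacked = linear(W2, b2, linear(W1, b1, x))
collapsed = linear(W_prime, b_prime, x)
print(stacked, collapsed)  # the two outputs agree
```

The same argument repeats for any number of layers, so the depth of a purely linear stack never adds expressive power.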

The fix

Insert a non-linear activation function after each layer. That kink is what stops the collapse and lets the network bend, fold, and carve curved boundaries.
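As a minimal sketch of what the non-linearity buys, the hand-built network below (hypothetical, with weights chosen by hand rather than trained) puts ReLU between two linear layers and ends up computing |x|. No function of the form wx + b can do that, because a straight line always passes through the midpoint of any two of its points.

```python
def relu(z):
    """ReLU activation: pass positives through, clamp negatives to zero."""
    return max(0.0, z)

def tiny_net(x):
    """Hidden layer of two ReLU units, then a linear output that sums them."""
    h1 = relu(1.0 * x)    # active for positive x
    h2 = relu(-1.0 * x)   # active for negative x
    return h1 + h2        # equals |x|

# For any line f(x) = w*x + b, f at the midpoint equals the average of the ends.
a, b = -1.0, 1.0
midpoint_value = tiny_net((a + b) / 2)
average_of_ends = (tiny_net(a) + tiny_net(b)) / 2
print(midpoint_value, average_of_ends)  # prints: 0.0 1.0
```

The mismatch between the two printed values shows the network is no longer a single linear layer in disguise.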

See the collapse — and the rescue

First, stacked linear layers still produce a straight-line boundary. Next comes a curved dataset that no straight line can split. Finally, an activation lets the boundary curve and separate the classes.
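The same three steps can be sketched in code. As an assumption for illustration, the example swaps the curved dataset for XOR, the smallest classic dataset that no single line can split. The coarse grid search over lines is illustrative rather than a proof, and the ReLU network uses hand-picked weights, not trained ones.

```python
def relu(z):
    return max(0.0, z)

# XOR: the label is 1 when exactly one input is 1.
points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def line_classifier(w1, w2, b, x):
    """Predict 1 on one side of the line w1*x1 + w2*x2 + b = 0, else 0."""
    return 1 if w1 * x[0] + w2 * x[1] + b > 0 else 0

# Try many lines and keep the best score out of 4 points.
grid = [i / 2 for i in range(-6, 7)]
best = max(
    sum(line_classifier(w1, w2, b, x) == y for x, y in points)
    for w1 in grid for w2 in grid for b in grid
)

def relu_net(x):
    """One hidden layer with two ReLU units and a linear output."""
    h1 = relu(x[0] + x[1])         # counts how many inputs are on
    h2 = relu(x[0] + x[1] - 1.0)   # active only when both inputs are on
    return h1 - 2.0 * h2           # gives 0, 1, 1, 0 on the four points

net_correct = sum(round(relu_net(x)) == y for x, y in points)
print(best, net_correct)  # prints: 3 4
```

Every line gets at most three of the four points right, while the two-unit ReLU network gets all four.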

What a good activation needs

Non-linear: the whole point

Otherwise the network collapses to a single linear layer.

Differentiable: for backprop

Training needs gradients, so the function must have a usable derivative almost everywhere. An isolated kink, like ReLU's at zero, is fine in practice.

Cheap to compute: runs billions of times

Applied at every neuron, every step — speed matters.

Well-behaved: gradients don't vanish

Gradients that shrink to zero stall learning — a real problem for deep nets.
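The gradient requirements above can be made concrete with a rough sketch. It assumes a deliberately simplified chain in which every layer sees the same pre-activation and all weights are 1, so the backpropagated gradient is just a product of activation derivatives. Real networks also multiply by weights, so treat the numbers as an illustration of the trend only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid; its maximum is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU; by convention 0 at the kink z = 0."""
    return 1.0 if z > 0 else 0.0

def chained_gradient(grad_fn, z, depth):
    """Product of `depth` activation derivatives, with all weights taken as 1."""
    g = 1.0
    for _ in range(depth):
        g *= grad_fn(z)
    return g

# Check the analytic derivative against a central finite difference.
eps = 1e-6
numeric = (sigmoid(0.5 + eps) - sigmoid(0.5 - eps)) / (2 * eps)
print(numeric, sigmoid_grad(0.5))

sig_20 = chained_gradient(sigmoid_grad, 0.0, 20)  # sigmoid's best case
relu_20 = chained_gradient(relu_grad, 1.0, 20)    # ReLU units that are active
print(sig_20, relu_20)
```

Even in sigmoid's best case the gradient after 20 layers is 0.25 to the power 20, which is below one trillionth, while an active ReLU path passes the gradient through unchanged.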

The menu

Different activations make different trade-offs — each gets its own article:

Sigmoid & Tanh: squashers

Smooth S-curves that squash any input into a bounded range, (0, 1) for sigmoid and (-1, 1) for tanh. See Sigmoid & Tanh.

ReLU & variants: the modern default

Simple, fast, gradient-friendly — see ReLU & Variants.

Softmax: output layer

Turns scores into class probabilities — see Softmax.
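For a quick feel of the menu, here is a minimal standard-library sketch of each family. The input scores are made up, and subtracting the maximum inside `softmax` is a common numerical-stability trick rather than something this article prescribes.

```python
import math

def sigmoid(z):
    """Squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Zero for negative inputs, identity for positive ones."""
    return max(0.0, z)

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)                       # shift by the max to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Squashers stay inside their bounded range even for large inputs.
print(sigmoid(-10.0), sigmoid(10.0))      # both strictly between 0 and 1
print(math.tanh(-10.0), math.tanh(10.0))  # both strictly between -1 and 1
print(relu(-10.0), relu(10.0))            # prints: 0.0 10.0

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))                  # largest score gets the largest share
```

Sigmoid, tanh, and ReLU act on one value at a time, whereas softmax works on a whole vector of scores, which is why it sits at the output layer.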

The big result

With non-linear activations, a network with even one hidden layer is a universal approximator: given enough neurons, it can approximate any continuous function on a closed, bounded input range as closely as you like.
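One way to get a feel for this result is to build such a network by hand. The sketch below is an illustration, not the proof of the theorem. It assumes a one-dimensional target (sin on [0, π]), ReLU hidden units, and weights set directly from the target's values at evenly spaced knots instead of by training. The helper names `build_relu_net` and `max_error` are hypothetical.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_net(g, lo, hi, n_hidden):
    """One hidden ReLU layer that linearly interpolates g at evenly spaced knots.

    Hidden unit i computes relu(x - knot_i); its output weight is the change
    in slope at that knot. Weights are set by hand, with no training.
    """
    h = (hi - lo) / n_hidden
    knots = [lo + i * h for i in range(n_hidden)]
    slopes = [(g(k + h) - g(k)) / h for k in knots]
    out_weights = [slopes[0]] + [
        slopes[i] - slopes[i - 1] for i in range(1, n_hidden)
    ]
    bias = g(lo)

    def net(x):
        return bias + sum(c * relu(x - k) for c, k in zip(out_weights, knots))

    return net

def max_error(net, g, lo, hi, samples=1001):
    """Largest absolute gap between net and g on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / (samples - 1) for i in range(samples)]
    return max(abs(net(x) - g(x)) for x in xs)

for n in (5, 50):
    net = build_relu_net(math.sin, 0.0, math.pi, n)
    print(n, max_error(net, math.sin, 0.0, math.pi))
```

Going from 5 to 50 hidden units shrinks the worst-case error by roughly a factor of 100, which matches the "given enough neurons" part of the claim.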