Why Do We Need Activation Functions?
The linear trap
Here's a humbling fact: stack a million linear layers and you still have one linear function. Without something non-linear in between, depth buys you nothing.
Why? A linear layer computes Wx + b. Feed its output into another linear layer and you get W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂), which is just W'x + b' with W' = W₂W₁ and b' = W₂b₁ + b₂: a single linear layer. The same step repeats for every extra layer, so the whole tower collapses.
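The collapse is easy to check numerically. The sketch below uses plain Python lists and arbitrary illustrative weights (the values of `W1`, `b1`, `W2`, `b2` and the input `x` are assumptions, not taken from any real model) to compare two stacked linear layers against the single merged layer.

```python
# Hypothetical 2-D example: all weights and biases are arbitrary illustrative values.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def linear(W, b, x):
    """One linear layer: Wx + b."""
    return [y + bi for y, bi in zip(matvec(W, x), b)]

W1, b1 = [[2.0, -1.0], [0.5, 3.0]], [1.0, -2.0]
W2, b2 = [[1.0, 4.0], [-3.0, 0.5]], [0.0, 5.0]

# Merge the two layers into one: W' = W2 W1 and b' = W2 b1 + b2.
W_merged = matmul(W2, W1)
b_merged = [y + bi for y, bi in zip(matvec(W2, b1), b2)]

x = [0.7, -1.3]
two_layers = linear(W2, b2, linear(W1, b1, x))
one_layer = linear(W_merged, b_merged, x)

# The two results agree up to floating-point rounding.
print(two_layers, one_layer)
```

Any other input `x` gives the same agreement, because the merged weights do not depend on the input.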
The fix: insert a non-linear activation function after each layer. That non-linear kink is what stops the collapse and lets the network bend, fold, and carve curved boundaries.
See the collapse — and the rescue
Three steps: first, stacked linear layers still produce a straight line; then comes a curved dataset that no line can split; finally, an activation lets the boundary curve and separate it.
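The same rescue can be sketched in a few lines. Instead of a curved dataset, the example below swaps in XOR, the smallest dataset that no single line can split. The network `tiny_net` and its hand-picked weights are hypothetical, chosen for illustration rather than learned by training.

```python
def relu(z):
    """ReLU activation: pass positives through, clamp negatives to zero."""
    return max(0.0, z)

def identity(z):
    """No activation at all, so the network stays purely linear."""
    return z

def tiny_net(x1, x2, activation):
    # Hidden layer: two neurons with the same input weights but different biases.
    h1 = activation(x1 + x2)
    h2 = activation(x1 + x2 - 1.0)
    # Output layer: a linear combination of the two hidden units.
    return h1 - 2.0 * h2

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_targets = [0, 1, 1, 0]

with_relu = [tiny_net(a, b, relu) for a, b in points]
without_activation = [tiny_net(a, b, identity) for a, b in points]

print(with_relu)           # matches xor_targets
print(without_activation)  # a linear function of x1 + x2; misses XOR at (0, 0)
```

With the identity in place of ReLU, the output simplifies to 2 - (x1 + x2), a linear function of the inputs, so no choice of output weights could recover XOR.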
What a good activation needs
Non-linear: otherwise the network collapses to a single linear layer.
Differentiable: training needs gradients, so the function must have a usable derivative.
Cheap to compute: it is applied at every neuron on every step, so speed matters.
Resistant to vanishing gradients: gradients that shrink toward zero stall learning, a real problem for deep nets.
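To make the last requirement concrete, here is a rough sketch of how gradients can shrink with depth. It multiplies an activation's local derivative across 20 layers and ignores the weights entirely, which is a deliberate simplification. The helper `chained_gradient` and the depth of 20 are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid; it peaks at 0.25 when z == 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU (convention assumed here: 0 at z == 0)."""
    return 1.0 if z > 0 else 0.0

def chained_gradient(grad_fn, z, depth):
    """Product of the local derivative across `depth` layers, weights ignored."""
    return grad_fn(z) ** depth

depth = 20
# Best case for the sigmoid: every layer sits at z == 0, where its slope is largest.
sigmoid_chain = chained_gradient(sigmoid_grad, 0.0, depth)
# ReLU on a positive input has slope 1 at every layer.
relu_chain = chained_gradient(relu_grad, 1.0, depth)

print(sigmoid_chain, relu_chain)
```

Even in the sigmoid's best case the chained factor is 0.25 raised to the 20th power, while the ReLU path keeps a factor of 1. Real networks also multiply by weights, so this shows the trend rather than an exact gradient.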
The menu
Different activations make different trade-offs, and each gets its own article.
With non-linear activations, a network with even one hidden layer is a universal approximator: given enough neurons, it can approximate any continuous function on a bounded input range as closely as you like.
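A small constructive sketch gives a feel for this claim. It builds a one-hidden-layer ReLU network by hand so that it linearly interpolates a target function, then measures how the worst-case error falls as neurons are added. This is an illustration under stated assumptions, not a proof and not a trained model: the target `math.sin` on [0, π], the helper names `build_relu_net` and `max_error`, and the neuron counts are all choices made for the example.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_net(f, lo, hi, n_hidden):
    """Return a one-hidden-layer ReLU net that linearly interpolates f
    at n_hidden + 1 evenly spaced knots on [lo, hi]."""
    step = (hi - lo) / n_hidden
    knots = [lo + i * step for i in range(n_hidden + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) / step for i in range(n_hidden)]
    # The output weight of neuron i is the change in slope at its knot.
    weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_hidden)]
    bias = f(lo)

    def net(x):
        # Hidden unit i computes relu(x - knots[i]); the output is a weighted sum.
        return bias + sum(w * relu(x - k) for w, k in zip(weights, knots))

    return net

def max_error(f, net, lo, hi, samples=1000):
    """Largest absolute gap between f and net on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / samples for i in range(samples + 1)]
    return max(abs(f(x) - net(x)) for x in xs)

errors = {
    n: max_error(math.sin, build_relu_net(math.sin, 0.0, math.pi, n), 0.0, math.pi)
    for n in (4, 16, 64)
}
for n, err in errors.items():
    print(n, err)
```

The error drops as the hidden layer grows, which is the "given enough neurons" part of the statement. The guarantee covers a bounded interval only; outside [0, π] this network simply continues along its last linear segment.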