Sigmoid & Tanh Activations
The classic S-curves
Sigmoid and tanh are the two original "squashing" activations. Both take any real number and smoothly compress it into a bounded range.
Sigmoid
σ(x) = 1 / (1 + e⁻ˣ)
Output in (0, 1). Reads like a probability — same curve as logistic regression.
Tanh
tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
Output in (−1, 1) and zero-centered — usually preferred over sigmoid for hidden layers.
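The two definitions above can be sketched in plain Python with only the standard math module. The function name sigmoid and the branch on the sign of x are illustrative choices rather than anything from the original; the branch is there only so that math.exp never receives a large positive argument.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^-x), arranged to avoid overflow in math.exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) is at most 1, so this equivalent form cannot overflow.
    z = math.exp(x)
    return z / (1.0 + z)

# Compare both squashing functions at a few sample inputs.
for x in (-5.0, 0.0, 5.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  tanh={math.tanh(x):+.4f}")
```

The two curves are closely related: tanh(x) = 2σ(2x) − 1, so tanh is a sigmoid stretched to the range (−1, 1) and shifted to pass through the origin.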
See the curves and their derivatives
Both curves, side by side, then the gradient that flattens at the tails — the root of the vanishing-gradient problem.
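For readers without the plot, the flattening can be checked numerically. The sketch below uses the standard closed forms σ'(x) = σ(x)(1 − σ(x)) and tanh'(x) = 1 − tanh²(x); the helper names sigmoid_grad and tanh_grad are hypothetical, chosen for this example.

```python
import math

def sigmoid(x: float) -> float:
    # Plain form; fine for the moderate inputs used below.
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s * (1 - s), where s = sigmoid(x)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

# Both derivatives peak at x = 0 and shrink rapidly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  tanh'={tanh_grad(x):.2e}")
```

The sigmoid derivative never exceeds 0.25 and the tanh derivative never exceeds 1; both maxima occur at x = 0.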
The catch: saturation
Vanishing gradients
At the flat tails, the derivative is almost zero. Backpropagation multiplies one such derivative per layer, so in a deep network the product shrinks toward zero and early layers stop learning. This is why ReLU largely replaced sigmoid and tanh in hidden layers.
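A deliberately simplified sketch of the effect: a chain of scalar sigmoid units with every weight fixed at 1 and no biases. Real networks have learned weights that also scale each factor, so this is an illustration under those assumptions, not a model of any particular architecture. The function name chain_gradient is hypothetical.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def chain_gradient(x: float, depth: int) -> float:
    """d(output)/d(input) for `depth` stacked scalar sigmoids (weights assumed to be 1)."""
    grad = 1.0
    a = x
    for _ in range(depth):
        s = sigmoid(a)
        grad *= s * (1.0 - s)  # chain rule: multiply by this layer's local derivative
        a = s                  # this layer's output is the next layer's input
    return grad

# Each local derivative is at most 0.25, so the product is bounded by 0.25 ** depth.
for depth in (1, 5, 10, 20):
    print(f"depth={depth:2d}  gradient={chain_gradient(0.0, depth):.3e}  "
          f"bound={0.25 ** depth:.3e}")
```

Even in this best case, where no unit is saturated, the gradient reaching the first layer shrinks geometrically with depth.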
Sigmoid downsides
- Saturates → vanishing gradients
- Not zero-centered (outputs all positive)
- exp is relatively costly
Tanh is better, but
- Zero-centered — gradients flow more nicely
- Stronger gradient near 0 than sigmoid (peak slope of 1 vs. 0.25)
- Still saturates at the tails
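The zero-centering difference in the lists above can be checked with a small experiment. This sketch feeds both functions the same symmetric grid of inputs and compares the mean output; the grid on [−3, 3] is an arbitrary choice for illustration.

```python
import math

# Symmetric grid of inputs on [-3, 3] (arbitrary illustrative range).
inputs = [i / 10.0 for i in range(-30, 31)]

sig_out = [1.0 / (1.0 + math.exp(-x)) for x in inputs]
tanh_out = [math.tanh(x) for x in inputs]

mean_sig = sum(sig_out) / len(sig_out)
mean_tanh = sum(tanh_out) / len(tanh_out)

# Sigmoid outputs are all positive and average about 0.5;
# tanh outputs are symmetric around 0.
print(f"sigmoid: min={min(sig_out):.4f}  mean={mean_sig:.4f}")
print(f"tanh:    min={min(tanh_out):+.4f}  mean={mean_tanh:+.4f}")
```

With inputs balanced around zero, sigmoid still passes a strictly positive signal to the next layer, while tanh preserves the balance.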