Sigmoid & Tanh Activations · Suman Bhadra Notes

The classic S-curves

Sigmoid and tanh are the two original "squashing" activations. Both take any real number and smoothly compress it into a bounded range.

Sigmoid σ(x) = 1 / (1 + e⁻ˣ)

Output in (0, 1), so it can be read as a probability. It is the same function logistic regression applies to its linear score.

Tanh tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)

Output in (−1, 1) and zero-centered — usually preferred over sigmoid for hidden layers.
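A minimal Python sketch of both functions, using only the standard `math` module. The helper name `sigmoid` and the sample inputs are illustrative choices, not part of the notes. The loop also checks the standard identity tanh(x) = 2σ(2x) − 1, which shows that tanh is a rescaled and shifted sigmoid.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow in exp for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so exp(x) is in (0, 1) and cannot overflow
    return z / (1.0 + z)

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    s, t = sigmoid(x), math.tanh(x)
    # tanh is a rescaled sigmoid: tanh(x) = 2*sigmoid(2x) - 1
    assert abs(t - (2.0 * sigmoid(2.0 * x) - 1.0)) < 1e-12
    print(f"x={x:+.1f}  sigmoid={s:.4f}  tanh={t:+.4f}")
```

The two-branch form of `sigmoid` is one common way to stay numerically safe; a single-expression version works just as well for moderate inputs.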

See the curves and their derivatives

Both curves, side by side, then the gradient that flattens at the tails — the root of the vanishing-gradient problem.

The catch: saturation

Vanishing gradients

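At the flat tails, the derivative is almost zero. Backpropagation multiplies in one such derivative per layer, so in a deep network the product shrinks toward zero and the early layers stop learning. Even at its steepest point, the sigmoid derivative is only 0.25. This is a main reason ReLU largely replaced both functions in hidden layers.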
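To make the shrinking product concrete, here is a rough sketch that multiplies the sigmoid derivative by itself once per layer. It is a deliberate simplification: the helper `chained_grad` is hypothetical, weights are ignored, and every layer is assumed to see the same pre-activation value.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Derivative of sigmoid: s * (1 - s), which peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def chained_grad(pre_activation: float, depth: int) -> float:
    """Product of `depth` identical sigmoid derivatives (weights ignored)."""
    return sigmoid_grad(pre_activation) ** depth

for depth in (1, 5, 10, 20):
    best = chained_grad(0.0, depth)   # steepest point of the curve
    tail = chained_grad(4.0, depth)   # out on the flat tail
    print(f"depth={depth:2d}  at x=0: {best:.3e}  at x=4: {tail:.3e}")
```

Even in the best case (x = 0 at every layer), ten layers scale the gradient by 0.25¹⁰, which is less than one millionth. On the tails the collapse is far faster.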

Sigmoid downsides

Saturates → vanishing gradients
Not zero-centered (outputs all positive)
Computing exp costs more than ReLU's simple threshold

Tanh is better, but

Zero-centered, so the next layer receives inputs of both signs
Stronger gradient near 0 than sigmoid (peak derivative 1 versus 0.25)
Still saturates at the tails
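The comparison in the two lists above can be checked numerically. The sketch below evaluates both derivatives at a few hand-picked points; the function names and sample inputs are illustrative assumptions.

```python
import math

def sigmoid_grad(x: float) -> float:
    """Derivative of sigmoid: s * (1 - s)."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2, which peaks at 1 when x = 0."""
    return 1.0 - math.tanh(x) ** 2

for x in (0.0, 1.0, 3.0, 6.0):
    print(f"x={x:.1f}  sigmoid'={sigmoid_grad(x):.5f}  tanh'={tanh_grad(x):.5f}")
```

At x = 0 tanh's slope is four times sigmoid's, but far from zero tanh's derivative is actually the smaller of the two, so zero-centering does not remove the saturation problem.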

Where they're still used

Sigmoid binary output

The final neuron of a binary classifier, to produce a probability.
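As a small illustration of that final neuron, the sketch below computes a weighted sum (the logit) and passes it through sigmoid. The helper `predict_proba`, the weights, the input, and the 0.5 decision threshold are all hypothetical values picked for the example.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def predict_proba(features, weights, bias):
    """Final neuron: weighted sum (the logit) passed through sigmoid."""
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(logit)

# Hypothetical weights and input, chosen only for illustration.
weights, bias = [1.5, -2.0], 0.5
p = predict_proba([1.0, 0.5], weights, bias)   # logit = 1.5 - 1.0 + 0.5 = 1.0
label = 1 if p >= 0.5 else 0                   # 0.5 is a common default threshold
print(f"P(class=1) = {p:.4f}, predicted label = {label}")
```

In practice, deep-learning frameworks often fold the sigmoid into the loss function for numerical stability, so the model itself may emit the raw logit during training.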

Tanh RNN candidate values

Computes the candidate cell/hidden values inside LSTM/GRU cells (the gates themselves use sigmoid).
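The division of labor between the two activations is easiest to see in a single LSTM step. The sketch below uses the standard LSTM update equations, reduced to scalars for readability. The function `lstm_step`, the gate names in the parameter dictionary, and all parameter values are assumptions made for this example.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; `p` maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))         # gates use sigmoid: values in (0, 1)
    f = sigmoid(pre("forget"))
    o = sigmoid(pre("output"))
    g = math.tanh(pre("candidate"))   # candidate uses tanh: value in (-1, 1)

    c = f * c_prev + i * g            # gates scale how much of each term is kept
    h = o * math.tanh(c)
    return h, c

# Hypothetical parameters, chosen only for illustration.
params = {name: (0.5, 0.5, 0.0)
          for name in ("input", "forget", "output", "candidate")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, p=params)
print(f"h={h:.4f}, c={c:.4f}")
```

Sigmoid suits the gates because a value in (0, 1) acts as a soft on/off switch, while tanh lets the candidate push the cell state up or down.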

Hidden layers use ReLU instead

For deep feed-forward nets, ReLU is the usual default. Its gradient is exactly 1 for positive inputs, so it does not saturate on that side.
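A rough sketch of why this helps: multiplying ReLU's derivative across layers leaves the gradient unchanged for positive inputs, while the sigmoid derivative keeps shrinking it. The depth, the sample pre-activation, and the assumption that every layer sees the same value are simplifications for illustration.

```python
import math

def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def sigmoid_grad(x: float) -> float:
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

depth = 20
x = 2.0  # a positive pre-activation, assumed identical at every layer
print("ReLU   :", relu_grad(x) ** depth)      # stays at 1.0
print("sigmoid:", sigmoid_grad(x) ** depth)   # shrinks toward 0
print("ReLU at x=-2:", relu_grad(-2.0))       # negative inputs get zero gradient
```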