Sigmoid & Tanh Activations

The classic S-curves

Sigmoid and tanh are the two original "squashing" activations. Both take any real number and smoothly compress it into a bounded range.

Sigmoid σ(x) = 1 / (1 + e⁻ˣ)

Output in (0, 1). Reads like a probability — same curve as logistic regression.

Tanh tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)

Output in (−1, 1) and zero-centered — usually preferred over sigmoid for hidden layers.
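A minimal sketch of both functions in plain Python, using only the standard library. The branch inside `sigmoid` is an added numerical-stability trick, not part of the formula itself: it keeps `exp` from overflowing on large negative inputs.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, arranged so exp() only ever sees a non-positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so z is in (0, 1) and cannot overflow
    return z / (1.0 + z)

def tanh(x: float) -> float:
    """Hyperbolic tangent; the standard library provides it directly."""
    return math.tanh(x)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  tanh={tanh(x):+.4f}")
```

The two curves are the same shape up to scaling and shifting: tanh(x) = 2σ(2x) − 1.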

See the curves and their derivatives

[Figure: sigmoid and tanh plotted side by side, followed by their derivatives. The derivatives flatten toward zero at the tails, which is the root of the vanishing-gradient problem.]

The catch: saturation

Vanishing gradients

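At the flat tails, the derivative is almost zero. Backpropagation multiplies in one such derivative per layer (the chain rule), so in a deep network the product shrinks toward zero and the early layers stop learning. Even at its peak, the sigmoid derivative is only 0.25. This is why ReLU largely replaced sigmoid and tanh in hidden layers.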
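To put rough numbers on this, the sketch below multiplies sigmoid derivatives across a stack of layers. It is a deliberate simplification: the weights are ignored (treated as 1) and every layer is assumed to see the same pre-activation, so `chained_gradient` is a hypothetical helper, not a real backward pass.

```python
import math

def sigmoid(x: float) -> float:
    # Naive form; fine for the small inputs used in this sketch.
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Derivative of sigmoid: s * (1 - s). Its maximum is 0.25, reached at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def chained_gradient(pre_activation: float, depth: int) -> float:
    """Product of `depth` identical sigmoid derivatives (weights assumed to be 1)."""
    return sigmoid_grad(pre_activation) ** depth

for depth in (1, 5, 10, 20):
    centre = chained_gradient(0.0, depth)  # best case: every unit sits at x = 0
    tail = chained_gradient(4.0, depth)    # every unit partly saturated
    print(f"depth={depth:2d}  at x=0: {centre:.3e}  at x=4: {tail:.3e}")
```

Even in the best case the factor per layer is 0.25, so ten layers already scale the gradient by less than one millionth.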

Sigmoid downsides
  • Saturates → vanishing gradients
  • Not zero-centered (outputs all positive)
  • exp is relatively costly

Tanh is better, but
  • Zero-centered, so the next layer's weight updates are not all pushed in the same direction
  • Stronger gradient near 0 than sigmoid (peak derivative 1 vs 0.25)
  • Still saturates at the tails
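The points in both lists can be checked numerically. This is a small sketch over a hand-picked symmetric set of inputs (an assumption made for illustration): it compares the mean output of each function and the size of each derivative at 0 and in the tail.

```python
import math

def sigmoid(x: float) -> float:
    # Naive form; fine for the small inputs used in this sketch.
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]  # symmetric around zero

# Zero-centering: sigmoid outputs average to 0.5, tanh outputs to 0.
mean_sigmoid = sum(sigmoid(x) for x in inputs) / len(inputs)
mean_tanh = sum(math.tanh(x) for x in inputs) / len(inputs)
print(f"mean sigmoid output: {mean_sigmoid:.3f}")
print(f"mean tanh output:    {mean_tanh:.3f}")

# Gradient strength: tanh is four times steeper at 0, yet both flatten in the tail.
for x in (0.0, 5.0):
    print(f"x={x}: sigmoid'={sigmoid_grad(x):.5f}  tanh'={tanh_grad(x):.5f}")
```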

Where they're still used

Sigmoid binary output

The final neuron of a binary classifier, to produce a probability.
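As an illustration, the sketch below turns a raw score (a logit) into a probability, thresholds it at 0.5, and scores it with binary cross-entropy. The logit value, the threshold, and the clipping constant `eps` are all assumptions chosen for the example.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def binary_cross_entropy(logit: float, label: int, eps: float = 1e-12) -> float:
    """Loss for one example; `label` is 0 or 1. Clipping guards against log(0)."""
    p = min(max(sigmoid(logit), eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

logit = 2.0                      # hypothetical output of the final neuron
probability = sigmoid(logit)     # interpreted as P(class = 1)
predicted = int(probability >= 0.5)

print(f"P(class=1) = {probability:.3f}, predicted class = {predicted}")
print(f"loss if true label is 1: {binary_cross_entropy(logit, 1):.3f}")
print(f"loss if true label is 0: {binary_cross_entropy(logit, 0):.3f}")
```

Deep learning libraries usually fuse the sigmoid and the loss into one operation that works on the logit directly, for better numerical behaviour.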

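Tanh recurrent state

Inside LSTM/GRU cells and other recurrent units, tanh squashes the candidate and hidden-state values. The gates themselves use sigmoid, since a gate needs a value in (0, 1).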
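A single LSTM step with scalar input and state shows where each activation sits in the standard formulation: sigmoid on the three gates, tanh on the candidate value and on the cell state before output. The `weights` dictionary and its values are hypothetical, chosen only so the sketch runs.

```python
import math

def sigmoid(x: float) -> float:
    # Naive form; fine for the small values used in this sketch.
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, weights):
    """One LSTM step for scalar x, h, c.

    `weights` maps each name to a (w_x, w_h, bias) tuple.
    """
    def pre(name):
        w_x, w_h, b = weights[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # gate in (0, 1): how much old cell state to keep
    i = sigmoid(pre("input"))        # gate in (0, 1): how much new candidate to admit
    o = sigmoid(pre("output"))       # gate in (0, 1): how much cell state to expose
    g = math.tanh(pre("candidate"))  # candidate value in (-1, 1)

    c = f * c_prev + i * g           # new cell state
    h = o * math.tanh(c)             # new hidden state, bounded in (-1, 1)
    return h, c

# Hypothetical weights: identical for every gate, purely for illustration.
weights = {name: (0.5, 0.5, 0.0)
           for name in ("forget", "input", "output", "candidate")}

h, c = 0.0, 0.0
for x in (1.0, -1.0, 2.0):
    h, c = lstm_step(x, h, c, weights)
    print(f"x={x:+.1f}  h={h:+.4f}  c={c:+.4f}")
```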

Hidden layers use ReLU instead

For deep feed-forward nets, ReLU avoids saturation for positive inputs: its derivative there is exactly 1, so gradients pass through without shrinking.
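A short sketch of ReLU and its derivative, set against the best-case sigmoid derivative of 0.25. As a simplification, the chain of `depth` layers assumes every unit is active (positive input) and ignores the weights.

```python
def relu(x: float) -> float:
    """Rectified linear unit: passes positive inputs through, zeroes the rest."""
    return max(0.0, x)

def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

depth = 20
relu_chain = relu_grad(3.0) ** depth  # active units: the product stays at 1
sigmoid_chain = 0.25 ** depth         # sigmoid's best case, for comparison

print(f"ReLU chain over {depth} layers:    {relu_chain}")
print(f"sigmoid chain over {depth} layers: {sigmoid_chain:.3e}")

# The trade-off: a unit whose input is negative passes no gradient at all.
print(f"relu_grad(-3.0) = {relu_grad(-3.0)}")
```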