Softmax Activation · Suman Bhadra Notes

Scores into a probability distribution

For multi-class classification, the network's last layer outputs a raw score (a logit) per class. Softmax turns those scores into probabilities that are all positive and sum to 1.

The formula

softmax(zᵢ) = e^{zᵢ} / Σⱼ e^{zⱼ} — exponentiate each score, then divide by the total.

Two steps: exponentiate (makes everything positive and amplifies the largest score), then normalize (so the outputs form a valid probability distribution).
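The two steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `softmax` and the example logits are assumptions chosen for the demo, and subtracting the maximum before exponentiating is a common stability trick that does not change the result.

```python
import math

def softmax(logits):
    """Exponentiate each logit, then divide by the total."""
    # Shifting by the max leaves the ratios unchanged but keeps exp() from overflowing.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]   # step 1: exponentiate (all positive)
    total = sum(exps)
    return [e / total for e in exps]           # step 2: normalize (sums to 1)

# Hypothetical logits for three classes.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # → [0.659, 0.242, 0.099]
```

The largest logit (2.0) ends up with about two thirds of the probability even though it is only one unit above the next score.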

Watch the transform

Three class logits become probabilities: exponentiate, then normalize. Notice how the biggest logit walks away with most of the probability.

Key properties

Sums to 1 (a distribution)

Outputs are a valid probability distribution over the classes.

Amplifies the max (soft argmax)

The exponential makes the largest logit dominate — a "soft" version of picking the max.

Temperature (sharpness knob)

Dividing logits by a temperature T sharpens (T<1) or flattens (T>1) the distribution — used in LLM sampling.
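As a rough sketch of the temperature knob, the snippet below divides the logits by T before applying softmax. The helper name `softmax_with_temperature` and the sample logits are assumptions made for illustration only.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits / T; T must be positive."""
    scaled = [z / T for z in logits]
    m = max(scaled)                            # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                       # hypothetical scores
sharp = softmax_with_temperature(logits, T=0.2)  # T < 1: sharper
base = softmax_with_temperature(logits, T=1.0)   # T = 1: ordinary softmax
flat = softmax_with_temperature(logits, T=3.0)   # T > 1: flatter

for name, p in (("T=0.2", sharp), ("T=1.0", base), ("T=3.0", flat)):
    print(name, [round(x, 3) for x in p])
```

With these logits the top class takes over 99% of the probability at T = 0.2 and under half of it at T = 3.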

Now drive it yourself. Slide each logit and watch the probabilities react: the classes compete, so pushing one up steals share from the others. Then turn the temperature down toward 0.2 and watch softmax approach a hard argmax, or up to 3 and watch it flatten.


Try: push dog up to 4 — cat's share collapses without cat's logit changing. Then set T = 0.2: winner takes (nearly) all. At T = 3 even fox gets a real share.
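The competition effect can also be checked without the sliders. In this sketch the starting logits (cat = 2, dog = 1, fox = 0) are assumed values, not the page's actual defaults; only dog's logit is changed between the two calls.

```python
import math

def softmax(logits):
    m = max(logits)                            # shift by the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

names = ["cat", "dog", "fox"]
before = dict(zip(names, softmax([2.0, 1.0, 0.0])))  # assumed starting logits
after = dict(zip(names, softmax([2.0, 4.0, 0.0])))   # only dog's logit raised to 4

print(round(before["cat"], 3), round(after["cat"], 3))  # → 0.665 0.117
```

Cat's logit is 2.0 in both calls, yet its share drops because the shared denominator grew.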

Softmax vs sigmoid, and the loss pairing

Softmax

Multi-class, mutually exclusive labels
Classes compete — probabilities sum to 1
Pairs with categorical cross-entropy

Sigmoid (per class)

Multi-label — a sample can have several labels
Each class scored independently
Pairs with binary cross-entropy per output
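A small side-by-side sketch of the difference, using made-up logits: softmax couples the outputs so they sum to 1, while a per-class sigmoid scores each logit on its own and the outputs can sum to anything between 0 and the number of classes.

```python
import math

def softmax(logits):
    m = max(logits)                            # shift by the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

logits = [2.0, 1.5, -1.0]                      # hypothetical scores for three classes

soft = softmax(logits)                         # mutually exclusive: one label wins
sig = [sigmoid(z) for z in logits]             # multi-label: each class judged alone

print([round(p, 3) for p in soft], round(sum(soft), 3))
print([round(p, 3) for p in sig], round(sum(sig), 3))
```

Here the first two sigmoid outputs are both above 0.5, so a multi-label model could assign both labels, whereas softmax forces them to split the probability.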

In practice

Frameworks fuse softmax with cross-entropy into one numerically stable op (e.g. PyTorch's CrossEntropyLoss, which expects raw logits), so during training you usually feed it logits directly and skip an explicit softmax layer.
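To see why the fused form is more stable, here is a sketch of cross-entropy computed straight from logits with the log-sum-exp trick, next to a naive softmax-then-log version. Both function names are hypothetical, and this illustrates the idea only; it is not how any particular framework implements its loss.

```python
import math

def cross_entropy_from_logits(logits, target):
    """-log softmax(logits)[target], computed without forming the probabilities."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def naive_cross_entropy(logits, target):
    """Explicit softmax followed by log; breaks down for large logits."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[target] / sum(exps))

small = [2.0, 1.0, 0.1]
print(round(cross_entropy_from_logits(small, 0), 4))   # → 0.417
print(round(naive_cross_entropy(small, 0), 4))         # → 0.417

large = [1000.0, 0.0]
print(cross_entropy_from_logits(large, 1))             # → 1000.0
# naive_cross_entropy(large, 1) raises OverflowError, because math.exp(1000.0) overflows.
```

The two versions agree on moderate logits; only the fused one survives extreme values.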