Softmax Activation
Scores into a probability distribution
For multi-class classification, the network's last layer outputs a raw score (a logit) per class. Softmax turns those scores into probabilities that are all positive and sum to 1.
softmax(zᵢ) = e^{zᵢ} / Σⱼ e^{zⱼ} — exponentiate each score, then divide by the total.
Two steps: exponentiate (makes everything positive and amplifies the largest score), then normalize (so the outputs form a valid probability distribution).
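The two steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `softmax` and the example logits are assumptions, and subtracting the maximum before exponentiating is an added safeguard against overflow that leaves the result unchanged.

```python
import math

def softmax(logits):
    """Turn a list of raw scores (logits) into probabilities."""
    # Step 1: exponentiate. Shifting by the max is an added safeguard
    # against overflow; the shift cancels out in the division below.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    # Step 2: normalize so the outputs sum to 1.
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three classes.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

Every output is positive, the outputs sum to 1, and the largest logit ends up with the largest share.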
Watch the transform
Three class logits become probabilities: exponentiate, then normalize. Notice how the biggest logit walks away with most of the probability.
Key properties
Outputs are a valid probability distribution over the classes.
The exponential makes the largest logit dominate — a "soft" version of picking the max.
Dividing the logits by a temperature T before applying softmax sharpens (T < 1) or flattens (T > 1) the distribution; this is used in LLM sampling.
Now drive it yourself. Slide each logit and watch the probabilities react: the classes compete, so pushing one up takes share from the others. Then turn the temperature down toward 0.2 and watch softmax approach a hard argmax, or up to 3 and watch it flatten toward uniform.
Try: push dog up to 4 — cat's share collapses without cat's logit changing. Then set T = 0.2: winner takes (nearly) all. At T = 3 even fox gets a real share.
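The same experiment can be run without the sliders. The sketch below assumes hypothetical starting logits of 2, 1, and 0 for cat, dog, and fox (the page's actual defaults are not given here), and the helper name `softmax_t` is an illustration only.

```python
import math

def softmax_t(logits, T=1.0):
    """Softmax with temperature: divide the logits by T, then apply softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # shift by the max for numerical safety
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "fox"]
logits = [2.0, 1.0, 0.0]  # hypothetical starting values

base = softmax_t(logits)               # T = 1
dog_up = softmax_t([2.0, 4.0, 0.0])    # only dog's logit changes
sharp = softmax_t(logits, T=0.2)       # low temperature
flat = softmax_t(logits, T=3.0)        # high temperature

for name, dist in [("base", base), ("dog=4", dog_up), ("T=0.2", sharp), ("T=3", flat)]:
    print(name, {c: round(p, 3) for c, p in zip(classes, dist)})
```

With these assumed values, raising dog to 4 drops cat from about 0.67 to about 0.12 even though cat's logit is untouched. At T = 0.2 cat takes more than 0.99 of the probability, and at T = 3 fox still holds roughly 0.23.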
Softmax vs sigmoid, and the loss pairing
Softmax
- Multi-class, mutually exclusive labels
- Classes compete: probabilities sum to 1
- Pairs with categorical cross-entropy
Sigmoid
- Multi-label: a sample can have several labels
- Each class scored independently
- Pairs with binary cross-entropy per output
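A small comparison makes the difference concrete. In this sketch the logits are hypothetical and the function names are illustrative; it raises only the first logit and checks what happens to the second class under each activation.

```python
import math

def softmax(logits):
    m = max(logits)  # shift by the max for numerical safety
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

before = [2.0, 1.5, -1.0]  # hypothetical logits for three labels
after = [4.0, 1.5, -1.0]   # only the first logit is raised

soft_before, soft_after = softmax(before), softmax(after)
sig_before = [sigmoid(z) for z in before]
sig_after = [sigmoid(z) for z in after]

print("softmax sums:", round(sum(soft_before), 6), round(sum(soft_after), 6))
print("sigmoid sums:", round(sum(sig_before), 3), round(sum(sig_after), 3))
print("second class, softmax:", round(soft_before[1], 3), "->", round(soft_after[1], 3))
print("second class, sigmoid:", round(sig_before[1], 3), "->", round(sig_after[1], 3))
```

The softmax outputs always sum to 1, so the second class loses share when the first logit rises. The sigmoid outputs are not tied together: their sum is unconstrained and the second class's score does not move.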
Frameworks fuse softmax with cross-entropy into one numerically stable op (e.g. PyTorch's CrossEntropyLoss), so during training you usually pass raw logits straight to the loss and skip an explicit softmax layer.
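To see why fusing helps, here is a minimal sketch of the idea in plain Python. It is not how any particular framework implements its loss; the function names and logits are assumptions. The fused version computes the log-probability of the target class directly as the target logit minus the log-sum-exp of all logits, so it never forms an overflowing exponential.

```python
import math

def cross_entropy_from_logits(logits, target):
    """Cross-entropy for one sample, computed directly from raw logits."""
    # log softmax(z)[target] = z[target] - logsumexp(z); the max shift
    # keeps every exponent at or below zero.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]

def naive_cross_entropy(logits, target):
    """Explicit softmax followed by -log; breaks down on large logits."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[target] / sum(exps))

small = [2.0, 1.0, 0.1]         # hypothetical, well-behaved logits
large = [1000.0, 0.0, -1000.0]  # hypothetical, extreme logits

fused_small = cross_entropy_from_logits(small, 0)
naive_small = naive_cross_entropy(small, 0)
fused_large = cross_entropy_from_logits(large, 0)

try:
    naive_cross_entropy(large, 0)
    naive_survived = True
except OverflowError:
    # math.exp(1000.0) is too large for a float.
    naive_survived = False

print(round(fused_small, 4), round(naive_small, 4))
print(fused_large, naive_survived)
```

On the well-behaved logits both versions agree. On the extreme logits the naive version overflows while the fused version returns a finite loss.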