Loss Functions — MSE & Cross-Entropy
The number training tries to shrink
After forward propagation makes a prediction, a loss function scores how wrong it is. That single number is the target the whole training process works to minimize.
The loss must match the task. Predicting a number? Measure the gap. Predicting a class? Measure how badly the probability was off. The wrong loss can cripple learning.
See what each loss penalizes
MSE grows with the square of the error, while cross-entropy grows without bound as the model becomes confidently wrong. The difference shows up as the prediction drifts away from the truth.
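To make the comparison concrete, here is a minimal NumPy sketch that scores a single example whose true label is 1 as the predicted probability drifts toward 0. The helper names (`squared_error`, `bce_per_example`), the sample predictions, and the `eps` clipping value are illustrative assumptions, not part of the original lesson.

```python
import numpy as np

def squared_error(y, y_hat):
    # Per-example squared error: (y - y_hat)^2
    return (y - y_hat) ** 2

def bce_per_example(y, y_hat, eps=1e-12):
    # Clip so log(0) never occurs; eps is an assumed safety margin.
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_true = 1.0
# Predictions drifting from "nearly right" to "confidently wrong".
predictions = np.array([0.9, 0.5, 0.1, 0.01, 0.001])

se = squared_error(y_true, predictions)
ce = bce_per_example(y_true, predictions)

for p, s, c in zip(predictions, se, ce):
    print(f"y_hat={p:<6} squared error={s:.4f}  cross-entropy={c:.4f}")
```

For a probability output the squared error can never exceed 1, whereas the cross-entropy keeps climbing as the prediction approaches 0.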
The three you'll use most
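Mean squared error (MSE): mean (y − ŷ)². Penalizes large errors heavily, because each error is squared. Used for predicting continuous numbers (see regression metrics).
Binary cross-entropy: −[y·log ŷ + (1−y)·log(1−ŷ)]. Pairs with a sigmoid output. It is the same loss that logistic regression minimizes.
Categorical cross-entropy: −Σ yᵢ·log ŷᵢ. Pairs with a softmax output. With a one-hot label only the true class's term survives, so the loss punishes low probability on the true class.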
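The three formulas can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation: the function names, the `EPS` clipping constant, and the choice to average over the batch are assumptions made here for the example.

```python
import numpy as np

EPS = 1e-12  # assumed clipping margin to keep log() finite

def mse(y, y_hat):
    # mean (y - y_hat)^2 over all examples
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean((y - y_hat) ** 2)

def binary_cross_entropy(y, y_hat):
    # y in {0, 1}; y_hat is a sigmoid-style probability in (0, 1)
    y = np.asarray(y, float)
    y_hat = np.clip(np.asarray(y_hat, float), EPS, 1 - EPS)
    return np.mean(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

def categorical_cross_entropy(y_onehot, y_hat):
    # y_onehot: (batch, classes) one-hot labels
    # y_hat:    (batch, classes) softmax-style probabilities
    y_onehot = np.asarray(y_onehot, float)
    y_hat = np.clip(np.asarray(y_hat, float), EPS, 1.0)
    return np.mean(-np.sum(y_onehot * np.log(y_hat), axis=-1))

reg = mse([3.0, 5.0], [2.5, 5.5])
binary = binary_cross_entropy([1, 0], [0.9, 0.2])
multi = categorical_cross_entropy([[0, 1, 0]], [[0.2, 0.7, 0.1]])
print(reg, binary, multi)
```

In the categorical example only the true class (probability 0.7) contributes, so the result is simply −log 0.7.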
Why cross-entropy for classification?
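Cross-entropy's log term makes the loss grow toward infinity as the probability assigned to the correct class approaches zero. A confident mistake is therefore punished far more than a hesitant one, which is exactly the signal a classifier should get. MSE on probabilities gives weak, flat gradients by comparison: the loss is bounded, and for a sigmoid output its gradient carries an extra factor of ŷ(1−ŷ) that shrinks toward zero as the output saturates.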
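The gradient argument can be checked numerically. The sketch below assumes a single sigmoid output ŷ = σ(z) with true label y = 1 and compares the gradient of each loss with respect to the logit z. For binary cross-entropy that gradient simplifies to ŷ − y. For the squared error (y − ŷ)² it is 2(ŷ − y)·ŷ(1 − ŷ). The function names and sample logits are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_ce_wrt_logit(y, z):
    # d/dz of binary cross-entropy with y_hat = sigmoid(z)
    return sigmoid(z) - y

def grad_mse_wrt_logit(y, z):
    # d/dz of (y - y_hat)^2 with y_hat = sigmoid(z);
    # the chain rule adds the sigmoid derivative y_hat * (1 - y_hat)
    y_hat = sigmoid(z)
    return 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)

y = 1.0
# Increasingly negative logits mean an increasingly confident wrong answer.
for z in [0.0, -2.0, -4.0, -6.0]:
    print(f"z={z:5.1f}  y_hat={sigmoid(z):.4f}  "
          f"dCE/dz={grad_ce_wrt_logit(y, z):+.4f}  "
          f"dMSE/dz={grad_mse_wrt_logit(y, z):+.4f}")
```

As the model becomes more confidently wrong, the cross-entropy gradient approaches −1 while the squared-error gradient shrinks toward 0, so the squared-error signal is weakest exactly where a correction is needed most.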
- Linear output + MSE → regression
- Sigmoid + binary CE → yes/no
- Softmax + categorical CE → pick-one
- MAE / Huber — robust regression losses
- Hinge — for SVM-style margins
- "Loss" = one example; "cost" = average over the batch (a common convention, though many texts use the two terms interchangeably)
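The alternative losses in the list above can be sketched the same way. This is a rough illustration under stated assumptions: the function names, the Huber threshold `delta=1.0`, the ±1 label encoding for hinge, and the toy data with one deliberate outlier are all chosen here for the example. Each function returns the batch average, i.e. the "cost" in the terminology above.

```python
import numpy as np

def mae(y, y_hat):
    # Mean absolute error: grows linearly with the residual.
    return np.mean(np.abs(np.asarray(y, float) - np.asarray(y_hat, float)))

def huber(y, y_hat, delta=1.0):
    # Quadratic for small residuals, linear beyond delta.
    r = np.abs(np.asarray(y, float) - np.asarray(y_hat, float))
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return np.mean(np.where(r <= delta, quadratic, linear))

def hinge(y_pm1, score):
    # Labels are +1 / -1; score is the raw (unsquashed) model output.
    y_pm1, score = np.asarray(y_pm1, float), np.asarray(score, float)
    return np.mean(np.maximum(0.0, 1.0 - y_pm1 * score))

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 14.0])  # last prediction is an outlier

mse_cost = np.mean((y - y_hat) ** 2)
mae_cost = mae(y, y_hat)
huber_cost = huber(y, y_hat)
hinge_cost = hinge([1, -1], [2.0, 0.5])
print(mse_cost, mae_cost, huber_cost, hinge_cost)
```

On this toy data the single outlier dominates the MSE, while MAE and Huber stay an order of magnitude smaller, which is the sense in which they are "robust".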