Loss Functions — MSE & Cross-Entropy

The number training tries to shrink

After forward propagation makes a prediction, a loss function scores how wrong it is. That single number is the target the whole training process works to minimize.

The loss must match the task. Predicting a number? Measure the gap. Predicting a class? Measure how badly the probability was off. The wrong loss can cripple learning.

See what each loss penalizes

MSE grows with the squared error; cross-entropy explodes when the model is confidently wrong. For a true label of 1, a prediction of 0.01 has a squared error just under 1 but a cross-entropy of about 4.6.
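To make the contrast concrete, here is a minimal sketch that tabulates both losses for a single example whose true label is 1 while the predicted probability drifts away from it. The function names and the list of predictions are illustrative choices, not part of the lesson.

```python
import math

def squared_error(y, y_hat):
    # Per-example squared error: (y - y_hat)^2
    return (y - y_hat) ** 2

def log_loss(y, y_hat):
    # Per-example binary cross-entropy; assumes 0 < y_hat < 1
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

y_true = 1.0
predictions = [0.9, 0.5, 0.1, 0.01, 0.001]  # drifting away from the truth

rows = [(p, squared_error(y_true, p), log_loss(y_true, p)) for p in predictions]
for p, se, ce in rows:
    print(f"y_hat={p:<6} squared_error={se:.4f} cross_entropy={ce:.4f}")
```

The squared error can never exceed 1 here, while the cross-entropy keeps climbing as the prediction approaches 0.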

The three you'll use most

MSE (regression)

mean((y − ŷ)²): the average of the squared errors. Penalizes big errors heavily, because each error is squared before averaging. Used for predicting continuous numbers; see regression metrics.
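A minimal sketch of MSE in plain Python, assuming two equal-length lists of numbers. The sample values are made up to show that one large error costs more than several small ones with the same total absolute error.

```python
def mse(y_true, y_pred):
    # Mean of the squared differences between targets and predictions
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

targets = [3.0, 5.0, 2.0]
small_errors = [2.5, 5.5, 2.5]    # three errors of 0.5 each
one_big_error = [3.0, 5.0, 3.5]   # a single error of 1.5

print(mse(targets, small_errors))   # 0.25
print(mse(targets, one_big_error))  # 0.75
```

Both prediction sets are off by 1.5 in total, yet the single large miss triples the loss.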

Binary cross-entropy (2-class)

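−[y·log ŷ + (1−y)·log(1−ŷ)], where y is the label (0 or 1) and ŷ is the predicted probability of class 1. Pairs with a sigmoid output. It is the same loss that logistic regression minimizes.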
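The formula can be sketched as follows. The clipping constant `eps` is an assumption added here so that log(0) never occurs; the example logit of 2.0 is arbitrary.

```python
import math

def sigmoid(z):
    # Squashes a raw score (logit) into a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Clip y_hat away from 0 and 1 so the logarithms stay finite
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

p = sigmoid(2.0)  # model is fairly confident the answer is "yes"
loss_if_yes = binary_cross_entropy(1, p)  # label agrees with the model
loss_if_no = binary_cross_entropy(0, p)   # label disagrees with the model
print(f"p={p:.4f} loss_if_yes={loss_if_yes:.4f} loss_if_no={loss_if_no:.4f}")
```

Libraries usually compute this loss directly from the logit for better numerical stability; the clipped version above is only meant to mirror the formula.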

Categorical cross-entropy (multi-class)

−Σ yᵢ·log ŷᵢ. Pairs with softmax. With a one-hot label y, the sum reduces to −log of the probability on the true class, so low probability there is punished hard.
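A small sketch of softmax followed by categorical cross-entropy, assuming one-hot labels. The logits are invented for illustration, and the `eps` floor is an added safeguard against log(0).

```python
import math

def softmax(logits):
    # Subtract the max logit first so exp() cannot overflow
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    # Only the term for the true class (y = 1) contributes
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_onehot, probs))

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

loss_if_class0 = categorical_cross_entropy([1, 0, 0], probs)  # model's top pick is right
loss_if_class2 = categorical_cross_entropy([0, 0, 1], probs)  # model's least likely class is right
print([round(p, 3) for p in probs], round(loss_if_class0, 3), round(loss_if_class2, 3))
```

The same predicted distribution yields a small loss when the favored class is correct and a much larger one when the true class received little probability.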

Why cross-entropy for classification?

Confidence matters

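Cross-entropy's log makes the loss grow without bound as the probability assigned to the correct class approaches zero. A confident mistake is punished far more than a hesitant one, which is exactly the signal a classifier should get. MSE on probabilities is bounded, and behind a sigmoid its gradient carries a factor ŷ(1−ŷ) that shrinks toward zero as the output saturates, so a confidently wrong model receives only a weak, flat gradient.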
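The gradient argument can be checked numerically. This sketch assumes a single sigmoid output with logit z and compares the derivative of each loss with respect to z: for squared error it is 2(ŷ − y)·ŷ(1 − ŷ), for binary cross-entropy it simplifies to ŷ − y. The chosen logits are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse_wrt_logit(y, z):
    # d/dz of (sigmoid(z) - y)^2, via the chain rule
    y_hat = sigmoid(z)
    return 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)

def grad_bce_wrt_logit(y, z):
    # d/dz of binary cross-entropy with a sigmoid output
    return sigmoid(z) - y

y = 1.0  # true label
for z in [0.0, -2.0, -6.0]:  # increasingly confident in the wrong answer
    print(f"z={z:>5} y_hat={sigmoid(z):.4f} "
          f"mse_grad={grad_mse_wrt_logit(y, z):.4f} "
          f"bce_grad={grad_bce_wrt_logit(y, z):.4f}")
```

As the model becomes more confidently wrong, the squared-error gradient fades toward zero while the cross-entropy gradient approaches −1, so the correction signal stays strong.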

Match the output

Linear output + MSE → regression
Sigmoid + binary CE → yes/no
Softmax + categorical CE → pick-one

Also good to know

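MAE / Huber: robust regression losses, less sensitive to outliers than MSE
Hinge: max(0, 1 − y·s) for SVM-style margins, with labels y of −1 or +1 and raw score s
"Loss" = one example; "cost" = average over the batch (a common convention; the two words are often used interchangeably)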