Naive Bayes
What it is
Naive Bayes classifies by asking, for each class, "how likely is this class given the evidence?" — and picking the winner.
It is built on Bayes' theorem, a rule for flipping a conditional probability around. We rarely know P(spam | these words) directly, but from a labelled inbox we can easily count P(these words | spam). Bayes' theorem lets us turn one into the other.
P(class | evidence) ∝ P(evidence | class) × P(class)
In words: the posterior (what we want) is proportional to the likelihood (how well the class explains the evidence) times the prior (how common the class is to begin with).
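The proportionality above can be turned into actual probabilities by dividing each class score by the sum of all scores. Here is a minimal sketch of that step; the function name `posterior` and all the numbers are made up for illustration and do not come from any real inbox.

```python
def posterior(priors, likelihoods):
    """Return P(class | evidence) for each class via Bayes' theorem."""
    # Unnormalised score per class: likelihood times prior.
    scores = {c: likelihoods[c] * priors[c] for c in priors}
    # The sum of the scores plays the role of P(evidence).
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Hypothetical numbers: 40% of mail is spam; the word "free" shows up
# in 30% of spam messages and 5% of non-spam ("ham") messages.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {"spam": 0.30, "ham": 0.05}

result = posterior(priors, likelihoods)
print(result)
```

With these assumed numbers the spam score is 0.12 and the ham score is 0.03, so the message comes out at roughly 80% spam even though spam is the less common class.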
Why "naive"?
To compute P(all the words | spam) we'd need to know how words combine — a hopeless amount of counting. Naive Bayes makes one bold simplifying assumption: every feature is independent of the others, given the class. Then the joint likelihood is just the product of each feature's likelihood.
P(w₁, w₂, … | spam) ≈ P(w₁|spam) × P(w₂|spam) × …
The assumption is almost always false — "new" and "year" are not independent — yet the classifier works remarkably well in practice, especially for text.
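To make the independence shortcut concrete, the sketch below approximates the joint likelihood of a message as a plain product of per-word likelihoods. The table `word_likelihood` and the helper `joint_likelihood` are hypothetical, and the values in the table are invented for illustration.

```python
import math

# Hypothetical per-word likelihoods P(word | class), invented for illustration.
word_likelihood = {
    "spam": {"win": 0.20, "free": 0.30, "cash": 0.25},
    "ham": {"win": 0.02, "free": 0.05, "cash": 0.01},
}

def joint_likelihood(words, cls):
    """Naive approximation of P(words | cls): the product of per-word terms."""
    return math.prod(word_likelihood[cls][w] for w in words)

message = ["win", "free", "cash"]
spam_like = joint_likelihood(message, "spam")
ham_like = joint_likelihood(message, "ham")
print(spam_like, ham_like)
```

Only one number per word and class is needed, rather than one number per possible combination of words, which is what makes the counting tractable.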
Watch it classify an email
We trained on a tiny inbox. The animation scores the message "win free cash" for both classes, multiplying each word's likelihood by the prior, then normalises to a final probability.
The recipe
Priors: the fraction of training examples that fall in each class.
Likelihoods: for each feature, how often it appears within each class.
Scores: multiply the prior by every feature's likelihood to get one score per class.
Prediction: the class with the highest score wins; divide each score by the sum of all scores to get a probability.
Smoothing: a word never seen in a class would give a likelihood of zero and wipe out the whole product. Adding a small count (usually +1, known as Laplace or add-one smoothing) to every word's count in every class keeps each likelihood above zero.
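The whole recipe, smoothing included, fits in a few lines of plain Python. The sketch below is one possible implementation under stated assumptions, not a reference one: the training inbox `TRAIN`, the function names `train` and `classify`, and the parameter `alpha` are all hypothetical, and the messages are invented for illustration.

```python
from collections import Counter

# Hypothetical training inbox: (message, label) pairs, invented for illustration.
TRAIN = [
    ("win cash now", "spam"),
    ("free cash prize", "spam"),
    ("win a free phone", "spam"),
    ("lunch at noon", "ham"),
    ("project meeting now", "ham"),
]

def train(examples, alpha=1.0):
    """Count words per class; return priors and a smoothed likelihood function."""
    class_counts = Counter(label for _, label in examples)
    word_counts = {c: Counter() for c in class_counts}
    for text, label in examples:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}

    # Priors: the fraction of training examples in each class.
    priors = {c: n / len(examples) for c, n in class_counts.items()}

    # Likelihoods: smoothed relative frequency of a word within a class.
    # alpha is added to every word count so no likelihood is exactly zero.
    def likelihood(word, cls):
        total = sum(word_counts[cls].values())
        return (word_counts[cls][word] + alpha) / (total + alpha * len(vocab))

    return priors, likelihood

def classify(text, priors, likelihood):
    """Score each class as prior times word likelihoods, then normalise."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for word in text.split():
            score *= likelihood(word, cls)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

priors, likelihood = train(TRAIN)
probs = classify("win free cash", priors, likelihood)
prediction = max(probs, key=probs.get)
print(prediction, probs)
```

None of the three words appears in the invented ham messages, so without smoothing the ham score would be exactly zero; with `alpha=1.0` it stays small but positive, and the message is still labelled spam.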
Flavours
Multinomial: the default for text. Features are how many times each word occurs.
Bernoulli: features are yes/no, recording whether each word appears at all.
Gaussian: for numeric features. Assumes each feature follows a bell curve (normal distribution) within each class.
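The three flavours, commonly called multinomial, Bernoulli and Gaussian, differ only in how a single class turns feature values into a likelihood. The sketch below shows one way each likelihood could be computed; the function names and every parameter value are hypothetical, chosen only to illustrate the difference.

```python
import math

def multinomial_likelihood(counts, word_probs):
    """Count features: each occurrence of a word multiplies in P(word | class).

    The multinomial coefficient is left out because it is the same for
    every class and so does not change which class wins.
    """
    return math.prod(word_probs[w] ** n for w, n in counts.items())

def bernoulli_likelihood(present, word_probs):
    """Yes/no features: a word that is absent contributes (1 - p)."""
    return math.prod(
        p if w in present else 1.0 - p for w, p in word_probs.items()
    )

def gaussian_likelihood(x, mean, variance):
    """Numeric feature: density of a normal (bell) curve evaluated at x."""
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return math.exp(exponent) / math.sqrt(2.0 * math.pi * variance)

# Hypothetical parameters for a single class, invented for illustration.
probs = {"win": 0.5, "free": 0.3, "cash": 0.2}
m = multinomial_likelihood({"win": 2, "cash": 1}, probs)
b = bernoulli_likelihood({"win", "cash"}, probs)
g = gaussian_likelihood(1.0, mean=0.0, variance=1.0)
print(m, b, g)
```

Note that the Bernoulli version is the only one where a missing word counts as evidence: here the absence of "free" multiplies the likelihood by 0.7, while the multinomial version ignores words that do not occur.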
Strengths and weaknesses
Strengths
- Fast to train and predict, since both come down to counting
- Works with tiny datasets and many features
- A strong, hard-to-beat baseline for text
Weaknesses
- The independence assumption is unrealistic
- Probabilities are often poorly calibrated
- Correlated features get double-counted
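The double-counting problem is easy to see in a toy calculation. In the sketch below, the helper `spam_probability` and the likelihood ratio of 3 are hypothetical; the point is only that feeding the same evidence in twice makes the classifier more confident than it should be.

```python
def spam_probability(likelihood_ratios, prior_spam=0.5):
    """Posterior P(spam) from per-feature ratios P(f | spam) / P(f | ham)."""
    odds = prior_spam / (1.0 - prior_spam)
    for ratio in likelihood_ratios:
        # Every feature is treated as a fresh, independent piece of evidence.
        odds *= ratio
    return odds / (1.0 + odds)

# Hypothetical: one feature that is 3 times likelier in spam than in ham.
once = spam_probability([3.0])
# An exact copy of that feature adds no new information, yet it counts again.
twice = spam_probability([3.0, 3.0])
print(once, twice)
```

With these assumed values the posterior moves from 0.75 to 0.9 although nothing new was learned, which is also one reason the reported probabilities tend to be pushed toward 0 or 1.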
The NLP track uses Naive Bayes for sentiment analysis — a classic first text-classification project.