Naive Bayes

Tags: ML, classification, probability, Bayes' theorem, supervised

What it is

Naive Bayes classifies by asking, for each class, "how likely is this class given the evidence?" — and picking the winner.

It is built on Bayes' theorem, a rule for flipping a conditional probability around. We rarely know P(spam | these words) directly, but from a labelled inbox we can easily count P(these words | spam). Bayes' theorem lets us turn one into the other.

Bayes' theorem

P(class | evidence) ∝ P(evidence | class) × P(class)

In words: the posterior (what we want) is proportional to the likelihood (how well the class explains the evidence) times the prior (how common the class is to begin with).
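To make the rule concrete, here is a minimal sketch of the flip for a single piece of evidence. The function name `bayes_posterior` and all the numbers (30% of mail is spam, "free" appears in 60% of spam and 5% of ham) are made up for illustration; they are not from a real inbox.

```python
def bayes_posterior(priors, likelihoods):
    """Return P(class | evidence) for every class.

    priors:      P(class)
    likelihoods: P(evidence | class)
    """
    # Likelihood times prior gives a score that is only *proportional* to the posterior.
    unnormalised = {c: likelihoods[c] * priors[c] for c in priors}
    # Dividing by the sum of the scores, which is P(evidence), makes them add up to 1.
    evidence = sum(unnormalised.values())
    return {c: score / evidence for c, score in unnormalised.items()}


# Hypothetical numbers, chosen only for the example.
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {"spam": 0.6, "ham": 0.05}  # P(message contains "free" | class)

result = bayes_posterior(priors, likelihoods)
print(result)
```

With these assumed numbers the spam score is 0.6 × 0.3 = 0.18 and the ham score is 0.05 × 0.7 = 0.035, so spam ends up with roughly 84% of the total even though it started as the less common class.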

Why "naive"?

To compute P(all the words | spam) directly, we'd need to count how often every possible combination of words occurs in spam, and the number of combinations grows exponentially with the vocabulary. Naive Bayes makes one bold simplifying assumption: every feature is independent of the others, given the class. Then the joint likelihood is just the product of each feature's likelihood.

The naive assumption

P(w₁, w₂, … | spam) ≈ P(w₁|spam) × P(w₂|spam) × …

The assumption is almost always false — "new" and "year" are not independent — yet the classifier works remarkably well in practice, especially for text.
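A small sketch of what the assumption buys us. The first function is the product from the formula above; the other two compare how many numbers we would have to estimate per class with and without the assumption, for yes/no (present or absent) word features. All function names and probabilities here are hypothetical.

```python
from math import prod


def naive_joint_likelihood(words, word_probs):
    """P(w1, w2, ... | class) under the naive assumption: a plain product."""
    return prod(word_probs[w] for w in words)


def params_full_joint(vocab_size):
    """Free parameters per class for a full joint table over yes/no word features."""
    return 2 ** vocab_size - 1


def params_naive(vocab_size):
    """Free parameters per class when each yes/no feature is modelled on its own."""
    return vocab_size


# Hypothetical per-word likelihoods for the spam class.
spam_probs = {"win": 0.2, "free": 0.2, "cash": 0.3}
joint = naive_joint_likelihood(["win", "free", "cash"], spam_probs)
print(joint)

# Even a 20-word vocabulary makes the full table impractical.
print(params_full_joint(20), "vs", params_naive(20))
```

For a 20-word vocabulary the full table needs 1,048,575 numbers per class, while the naive model needs 20.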

Watch it classify an email

We trained on a tiny inbox. For each class, the animation scores the message "win free cash" by multiplying the prior by each word's likelihood, then normalises the two scores into a final probability.

The recipe

1. Priors: P(class)

What fraction of training examples are in each class.

2. Likelihoods: P(feature | class)

For each feature, how often it appears within each class.

3. Multiply: prior × ∏ likelihoods

Multiply the prior by every feature's likelihood to get a score for each class.

4. Pick the max (argmax)

Whichever class scores highest wins. To report a probability, divide each class score by the sum of the scores over all classes.
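The four steps can be sketched in a few lines of plain Python. The inbox below is a hypothetical one invented for this example (it is not the inbox the lesson trained on), and `train` and `classify` are illustrative names. The sketch uses raw, unsmoothed counts, so a word that never appears in a class gives that class a score of zero.

```python
from collections import Counter, defaultdict
from math import prod

# Hypothetical toy inbox, invented purely for illustration.
TRAIN = [
    ("win free cash now", "spam"),
    ("free cash prize", "spam"),
    ("win big cash", "spam"),
    ("lunch is free today", "ham"),
    ("did we win the match", "ham"),
    ("bring cash for lunch", "ham"),
    ("meeting moved to noon", "ham"),
]


def train(examples):
    """Steps 1 and 2: estimate priors and per-class word likelihoods by counting."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(text.split())

    priors = {c: n / len(examples) for c, n in class_counts.items()}
    likelihoods = {
        c: {w: n / sum(counts.values()) for w, n in counts.items()}
        for c, counts in word_counts.items()
    }
    return priors, likelihoods


def classify(text, priors, likelihoods):
    """Steps 3 and 4: score every class, pick the max, and normalise the scores."""
    words = text.split()
    scores = {
        c: priors[c] * prod(likelihoods[c].get(w, 0.0) for w in words)
        for c in priors
    }
    total = sum(scores.values())
    if total == 0:
        raise ValueError("every class scored zero; unsmoothed counts cannot rank them")
    probabilities = {c: s / total for c, s in scores.items()}
    return max(scores, key=scores.get), probabilities


priors, likelihoods = train(TRAIN)
label, probabilities = classify("win free cash", priors, likelihoods)
print(label, probabilities)
```

On this made-up inbox the message is labelled spam: all three words are common in the spam messages and rare in the ham ones, which outweighs ham's slightly larger prior.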

Laplace smoothing

A word never seen in a class would give a likelihood of zero and wipe out the whole product. Add a small count (usually +1) to every word's count in every class, and enlarge the class total to match, so no likelihood is ever exactly zero.
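A minimal sketch of add-one smoothing for one class. The counts are hypothetical, and `smoothed_likelihoods` and its `alpha` parameter are names chosen for this example. Note that the denominator grows by `alpha` times the vocabulary size, which keeps the likelihoods summing to 1.

```python
from collections import Counter


def smoothed_likelihoods(counts, vocabulary, alpha=1.0):
    """P(word | class) with `alpha` added to every word's count."""
    total = sum(counts[w] for w in vocabulary)
    # Every word in the vocabulary gets `alpha` extra, so the total grows by alpha * V.
    denominator = total + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denominator for w in vocabulary}


# Hypothetical word counts for the spam class; "lunch" was never seen in spam.
spam_counts = Counter({"win": 2, "free": 2, "cash": 3, "now": 1, "prize": 1, "big": 1})
vocabulary = sorted(set(spam_counts) | {"lunch"})

probs = smoothed_likelihoods(spam_counts, vocabulary)
print(probs["lunch"])  # small, but no longer zero
print(probs["cash"])
```

Here the unseen word "lunch" gets (0 + 1) / (10 + 7) = 1/17 instead of 0, so a single unfamiliar word can lower a class's score without eliminating it.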

Flavours

Multinomial (word counts)

The default for text — features are how many times each word occurs.

Bernoulli (word present?)

Features are yes/no — whether each word appears at all.

Gaussian (continuous)

For numeric features. Assumes each feature follows a bell curve (a normal distribution) within each class.
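The three flavours differ only in how they compute P(feature | class). Here is a rough sketch of each likelihood for a single class, with hypothetical function names and made-up probabilities. The multinomial version leaves out the multinomial coefficient, which is the same for every class and so does not affect which class wins.

```python
from math import exp, pi, prod, sqrt


def multinomial_likelihood(word_counts, word_probs):
    """Each word contributes once per occurrence: the product of P(w | c) ** count."""
    return prod(word_probs[w] ** n for w, n in word_counts.items())


def bernoulli_likelihood(present_words, presence_probs):
    """Every vocabulary word contributes: p if it is present, (1 - p) if it is absent."""
    return prod(
        p if w in present_words else 1.0 - p
        for w, p in presence_probs.items()
    )


def gaussian_likelihood(x, mean, std):
    """Normal density of a numeric feature value, given the class's mean and std."""
    return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * sqrt(2 * pi))


# Hypothetical numbers for one class.
m = multinomial_likelihood({"cash": 2, "win": 1}, {"cash": 0.3, "win": 0.2, "free": 0.5})
b = bernoulli_likelihood({"cash"}, {"cash": 0.8, "win": 0.5})
g = gaussian_likelihood(0.0, mean=0.0, std=1.0)
print(m, b, g)
```

One difference worth noticing: the Bernoulli version also scores the words that are absent, while the multinomial version ignores them.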

Strengths and weaknesses

Strengths
  • Fast to train and predict — just counting
  • Works with tiny datasets and many features
  • A strong, hard-to-beat baseline for text

Weaknesses
  • The independence assumption is unrealistic
  • Probabilities are often poorly calibrated
  • Correlated features get double-counted
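The double-counting weakness is easy to see with a small sketch. Suppose one clue is twice as likely in spam as in ham, and the same clue then shows up as three perfectly correlated features. The numbers and the function name `spam_posterior` are hypothetical.

```python
from math import prod


def spam_posterior(feature_likelihoods, prior_spam=0.5):
    """feature_likelihoods: list of (P(f | spam), P(f | ham)) pairs."""
    spam = prior_spam * prod(s for s, _ in feature_likelihoods)
    ham = (1 - prior_spam) * prod(h for _, h in feature_likelihoods)
    return spam / (spam + ham)


clue = (0.8, 0.4)  # hypothetical: twice as likely in spam as in ham

once = spam_posterior([clue])        # the clue counted once
thrice = spam_posterior([clue] * 3)  # the same clue entered as three features
print(once, thrice)
```

The evidence is identical in both calls, yet the posterior rises from about 0.67 to about 0.89, because the model treats each copy as fresh, independent evidence. This is also one reason the reported probabilities tend to be overconfident.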

See it applied

The NLP track uses Naive Bayes for sentiment analysis — a classic first text-classification project.