Mini Project: Spam Detection · Suman Bhadra Notes

The first real NLP project

Spam detection is the "hello world" of text classification: take an email, decide spam or not spam. It ties together everything in this track so far.

The recipe: clean and tokenize the text, turn it into TF-IDF vectors, then train a logistic regression classifier to output a spam probability. Bonus: logistic regression's weights are readable, so you can see which words scream "spam".

Watch the pipeline run

From labelled emails, to TF-IDF vectors, to a trained model and its learned word weights, to a verdict on a fresh email.

Step by step

1. Data: labelled emails

A corpus such as the SMS Spam Collection or the Enron-Spam dataset, with each message tagged spam or ham (legitimate mail).

2. Preprocess: clean + tokenize

Lowercase, strip noise, tokenize — see the pipeline.
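A minimal sketch of this step using only the standard library. The cleaning rules (the `urltoken` placeholder and the token pattern) are illustrative assumptions, not a fixed recipe; real projects tune them to their data.

```python
import re

# Hypothetical cleaning rules, chosen for illustration.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")

def preprocess(text):
    """Lowercase, replace URLs with a placeholder, and split into word tokens."""
    text = text.lower()
    text = URL_RE.sub(" urltoken ", text)  # keep the signal "a link was here"
    return TOKEN_RE.findall(text)          # drops punctuation and other noise

tokens = preprocess("WIN a FREE prize!!! Visit http://spam.example/now or call 0800-123")
print(tokens)
# ['win', 'a', 'free', 'prize', 'visit', 'urltoken', 'or', 'call', '0800', '123']
```

Replacing a URL with a placeholder rather than deleting it is a design choice: the presence of a link can itself be a useful spam feature.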

3. Vectorize: TF-IDF

Fit a TF-IDF vectorizer on the training set only, then use it to transform every message (training and test alike) so no test-set statistics leak into the features.
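A sketch of the fit-then-transform pattern, assuming scikit-learn (the notes do not name a library). The tiny corpus is made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, invented for this example.
train_texts = [
    "win a free prize now",
    "free cash claim your prize",
    "are we still meeting for lunch",
    "see you at the meeting tomorrow",
]
test_texts = ["claim your free lunch"]

vectorizer = TfidfVectorizer(lowercase=True)
X_train = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF from training data only
X_test = vectorizer.transform(test_texts)        # reuses them; words unseen in training are ignored

print(X_train.shape, X_test.shape)  # same number of columns in both
```

Both matrices are sparse: each row stores only the handful of terms that actually occur in that message.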

4. Train: logistic regression

Learn one weight per vocabulary term; with spam as the positive class, positive weights push toward spam and negative weights toward ham.

5. Evaluate: precision/recall

Use a confusion matrix and precision/recall — false positives (real mail junked) are costly.
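To make the metrics concrete, here is a small standard-library sketch that counts the four confusion-matrix cells and derives precision and recall. The label lists are invented for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes with spam (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # real mail junked
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam let through
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0  # of flagged mail, how much was spam
    recall = tp / (tp + fn) if tp + fn else 0.0     # of all spam, how much was caught
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, fn, tn)     # 3 1 1 3
print(precision, recall)  # 0.75 0.75
```

Precision is the number that drops when legitimate mail gets junked, which is why it deserves the closer watch here.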

6. Predict: probability

Threshold the output probability to flag new messages.
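In code, thresholding is a one-line comparison. The probabilities below are made-up model outputs, and `flag_spam` is a hypothetical helper name.

```python
def flag_spam(probabilities, threshold=0.5):
    """Return True for each message whose spam probability meets the threshold."""
    return [p >= threshold for p in probabilities]

probs = [0.97, 0.62, 0.08]     # hypothetical model outputs
print(flag_spam(probs))        # [True, True, False]
print(flag_spam(probs, 0.9))   # [True, False, False]
```

Raising the threshold flags fewer messages, trading missed spam for fewer junked legitimate emails.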

Why logistic regression here?

Great fit

Handles high-dimensional sparse TF-IDF well
Outputs a probability (often reasonably calibrated, but verify/calibrate before trusting it)
Weights are interpretable — see the spammy words
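Why the weights are readable: the model adds up weight times TF-IDF value for each word, adds a bias, and squashes the sum through a sigmoid. A sketch with hypothetical weights (not learned from any real data):

```python
import math

# Hypothetical learned weights, for illustration only.
weights = {"free": 2.1, "prize": 1.7, "meeting": -1.9}
bias = -0.5

def spam_probability(tfidf):
    """tfidf maps word -> TF-IDF value for one message."""
    # Each word contributes weight * value to the log-odds of spam.
    z = bias + sum(weights.get(word, 0.0) * value for word, value in tfidf.items())
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid turns log-odds into a probability

p_spammy = spam_probability({"free": 0.6, "prize": 0.5})
p_hammy = spam_probability({"meeting": 0.8})
print(round(p_spammy, 3), round(p_hammy, 3))
```

Because the contributions are additive, a single large positive weight can be read directly as "this word pushes the message toward spam".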

Watch for

Pick the threshold from the cost of a false positive relative to a missed spam
Spam drifts — retrain regularly
Use regularization to avoid overfitting to rare words
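One way to turn "cost of a false positive" into a threshold: assign a cost to each kind of error and pick the candidate threshold with the lowest total on held-out data. The 10:1 cost ratio, the labels, and the probabilities below are all assumptions for illustration.

```python
def total_cost(y_true, probs, threshold, fp_cost=10.0, fn_cost=1.0):
    """Sum error costs at a given threshold (1 = spam, 0 = ham)."""
    cost = 0.0
    for truth, p in zip(y_true, probs):
        flagged = p >= threshold
        if flagged and truth == 0:
            cost += fp_cost   # real mail junked: assumed 10x worse
        elif not flagged and truth == 1:
            cost += fn_cost   # spam reaches the inbox
    return cost

# Hypothetical validation labels and model probabilities.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
probs = [0.95, 0.80, 0.40, 0.55, 0.10, 0.05, 0.70, 0.30]

candidates = [0.3, 0.5, 0.6, 0.9]
best = min(candidates, key=lambda th: total_cost(y_true, probs, th))
print(best)  # 0.6
```

With false positives priced this high, the chosen threshold lands above 0.5: the model has to be fairly sure before it junks anything.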

Try the variant

Swap in Naive Bayes — historically the classic spam classifier — and compare. The sibling project, Sentiment Analysis, uses the same recipe for a different label.
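A sketch of the swap, assuming scikit-learn. Multinomial Naive Bayes is paired here with raw word counts, a common choice rather than a requirement, and the toy corpus is invented, so the printed numbers only show the mechanics.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for this example.
texts = [
    "win a free prize now", "free cash claim your prize",
    "urgent claim your free cash", "win cash now",
    "are we still meeting for lunch", "see you at the meeting tomorrow",
    "lunch tomorrow sounds good", "can you send the meeting notes",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

candidates = {
    "logistic regression": make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
    "naive bayes": make_pipeline(CountVectorizer(), MultinomialNB()),
}

new_message = ["claim your free prize"]
results = {}
for name, pipeline in candidates.items():
    pipeline.fit(texts, labels)
    # Column 1 of predict_proba is the probability of class 1 (spam).
    results[name] = pipeline.predict_proba(new_message)[0, 1]
    print(f"{name}: P(spam) = {results[name]:.2f}")
```

For a fair comparison on real data, evaluate both pipelines on the same held-out split with the same precision/recall metrics.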