Mini Project: Spam Detection
The first real NLP project
Spam detection is the "hello world" of text classification: take an email, decide spam or not spam. It ties together everything in this track so far.
The recipe: clean and tokenize the text, turn it into TF-IDF vectors, then train a logistic regression classifier to output a spam probability. Bonus: logistic regression's weights are readable, so you can see which words scream "spam".
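The cleaning and TF-IDF steps of the recipe can be sketched in plain Python to show what the vectors actually contain. The helper names (`tokenize`, `fit_idf`, `tfidf_vector`), the regex tokenizer, the smoothed IDF variant, and the three toy messages are all assumptions chosen for illustration. In a real project a library vectorizer such as scikit-learn's `TfidfVectorizer` would normally do this work.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def fit_idf(docs):
    """Learn inverse document frequencies from tokenized training messages."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word once per message
    # Smoothed IDF (one common variant): rare words get larger weights.
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse, L2-normalized TF-IDF vector; words unseen in training are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


# Toy training messages (made up for illustration).
train = ["Win a FREE prize now", "Are we still on for lunch", "Free entry, claim your prize"]
idf = fit_idf([tokenize(m) for m in train])
vec = tfidf_vector(tokenize("free lunch"), idf)
```

Because "free" appears in two of the three toy messages and "lunch" in only one, "lunch" ends up with the larger IDF and therefore the larger share of the vector.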
The pipeline at a glance
The pipeline runs from labelled emails, to TF-IDF vectors, to a trained model and its learned word weights, to a verdict on a fresh email.
Step by step
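Start with a labelled corpus such as the SMS Spam Collection or the Enron-Spam dataset, with each message tagged spam or ham (not spam).
Fit a TF-IDF vectorizer on the training set only, then use it to transform every message, training and test alike.
Train the classifier: it learns one weight per word, and positive weights push toward spam.
Evaluate with a confusion matrix and precision/recall. False positives (real mail junked) are costly, so watch precision closely.
To flag new messages, threshold the output probability.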
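The training, evaluation and thresholding steps above can be sketched as follows. This is a minimal from-scratch version, not the lesson's own implementation: the function names, the learning rate, the L2 strength, and the hand-made feature vectors (stand-ins for real TF-IDF output) are all assumptions. A library model such as scikit-learn's `LogisticRegression` would normally replace `train_logreg`.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, vec):
    """Spam probability for one sparse vector (dict of word -> value)."""
    return sigmoid(bias + sum(weights.get(w, 0.0) * v for w, v in vec.items()))


def train_logreg(vectors, labels, epochs=200, lr=0.5, l2=0.01):
    """Stochastic gradient descent on log loss; labels are 1 = spam, 0 = ham.

    Simplification: the L2 penalty is applied only to words present in the
    current message.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            error = predict_proba(weights, bias, vec) - y
            for w, v in vec.items():
                old = weights.get(w, 0.0)
                weights[w] = old - lr * (error * v + l2 * old)
            bias -= lr * error
    return weights, bias


def confusion(y_true, probs, threshold=0.5):
    """Return (tp, fp, fn, tn) with spam as the positive class."""
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, probs):
        flagged = p >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hand-made stand-ins for TF-IDF vectors (values are illustrative only).
X_train = [
    {"free": 0.7, "prize": 0.7},
    {"claim": 0.6, "free": 0.8},
    {"lunch": 0.7, "tomorrow": 0.7},
    {"meeting": 0.8, "tomorrow": 0.6},
]
y_train = [1, 1, 0, 0]
X_test, y_test = [{"free": 1.0}, {"tomorrow": 1.0}], [1, 0]

weights, bias = train_logreg(X_train, y_train)
probs = [predict_proba(weights, bias, v) for v in X_test]
tp, fp, fn, tn = confusion(y_test, probs, threshold=0.5)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0

# The most positive weights are the words the model treats as spam signals.
top_spam_words = sorted(weights, key=weights.get, reverse=True)[:2]
```

Words seen only in spam messages end up with positive weights and words seen only in ham with negative ones, which is what makes the model easy to inspect. The metrics here come from two toy test messages, so they only illustrate the mechanics.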
Why logistic regression here?
- Handles high-dimensional sparse TF-IDF well
- Outputs a probability rather than just a label, and one that is usually reasonably well calibrated
- Weights are interpretable — see the spammy words
Practical tips
- Pick the threshold by the cost of a false positive
- Spam drifts, so retrain regularly
- Use regularization to avoid overfitting rare words
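One way to pick the threshold by cost is to try each candidate on validation data and keep the cheapest. The sketch below assumes a hypothetical `pick_threshold` helper, made-up validation scores, and an arbitrary 10:1 cost ratio between a junked real message and a missed spam; real costs depend on the application.

```python
def pick_threshold(y_true, probs, fp_cost=10.0, fn_cost=1.0):
    """Return (threshold, total_cost) for the cheapest candidate threshold.

    A message is flagged as spam when its probability is >= threshold.
    Ties go to the lowest candidate threshold.
    """
    best_t, best_cost = 1.0, float("inf")
    # Candidates: every observed score, plus one above 1 (flag nothing).
    for t in sorted(set(probs)) + [1.01]:
        fp = sum(1 for y, p in zip(y_true, probs) if p >= t and y == 0)
        fn = sum(1 for y, p in zip(y_true, probs) if p < t and y == 1)
        cost = fp_cost * fp + fn_cost * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost


# Made-up validation labels (1 = spam) and model scores.
y_val = [0, 0, 1, 0, 1, 1]
p_val = [0.1, 0.3, 0.45, 0.6, 0.8, 0.95]

costly_fp = pick_threshold(y_val, p_val, fp_cost=10.0, fn_cost=1.0)  # (0.8, 1.0)
equal_cost = pick_threshold(y_val, p_val, fp_cost=1.0, fn_cost=1.0)  # (0.45, 1.0)
```

With false positives ten times as expensive, the chosen threshold rises from 0.45 to 0.8: the model lets one spam message through rather than junk the real message scored 0.6.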
Swap in Naive Bayes — historically the classic spam classifier — and compare. The sibling project, Sentiment Analysis, uses the same recipe for a different label.
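For the comparison, a multinomial Naive Bayes classifier can be sketched in a few lines. This version is an illustrative assumption rather than a prescribed implementation: it works on raw token counts instead of TF-IDF values, uses add-one smoothing, and expects both classes to appear in the training data. The names `train_nb` and `nb_spam_proba` and the toy messages are hypothetical; scikit-learn's `MultinomialNB` is the usual library counterpart.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Multinomial Naive Bayes with add-alpha smoothing.

    docs are token lists; labels are 1 = spam, 0 = ham. Both classes must
    be present, otherwise the class prior would be log(0).
    """
    vocab = {t for d in docs for t in d}
    model = {"vocab": vocab, "prior": {}, "loglik": {}}
    for c in (0, 1):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        counts = Counter(t for d in class_docs for t in d)
        total = sum(counts.values()) + alpha * len(vocab)
        model["prior"][c] = math.log(len(class_docs) / len(docs))
        model["loglik"][c] = {t: math.log((counts[t] + alpha) / total) for t in vocab}
    return model


def nb_spam_proba(model, tokens):
    """Posterior P(spam | tokens); words outside the vocabulary are ignored."""
    scores = {}
    for c in (0, 1):
        known = (t for t in tokens if t in model["vocab"])
        scores[c] = model["prior"][c] + sum(model["loglik"][c][t] for t in known)
    # Normalize the two log scores, shifting by the max for stability.
    m = max(scores.values())
    exp = {c: math.exp(s - m) for c, s in scores.items()}
    return exp[1] / (exp[0] + exp[1])


# Toy tokenized messages (made up for illustration).
docs = [["free", "prize"], ["free", "claim"], ["lunch", "tomorrow"], ["meeting", "tomorrow"]]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)

p_free = nb_spam_proba(model, ["free"])          # 0.75 on this toy data
p_tomorrow = nb_spam_proba(model, ["tomorrow"])  # 0.25 on this toy data
```

To compare fairly with logistic regression, score both models on the same held-out messages and look at precision and recall at the same threshold.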