The NLP Pipeline

From raw text to a prediction

Text almost never goes straight into a model. It passes through an assembly line of stages, each one nudging the messy raw text closer to clean numbers a model can learn from.

This page is the map for the whole NLP track — every later article zooms into one of these stages.

Watch a sentence flow through

Follow one sentence as it is tokenized, cleaned, normalized, vectorized, and finally fed to a model that predicts its sentiment.

The stages

1. Tokenize: split into pieces

Break text into words or sub-words — see Tokenization.
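
As a rough illustration of word-level splitting, here is a minimal sketch that uses only Python's standard `re` module. The `TOKEN_PATTERN` regex and the `tokenize` helper are assumptions made for this example, not the tokenizer the track itself uses, and sub-word tokenization is not shown.

```python
import re

# Hypothetical pattern: a word (optionally with internal apostrophes),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("The movie wasn't bad, honestly!")
print(tokens)
# ['The', 'movie', "wasn't", 'bad', ',', 'honestly', '!']
```

Keeping "wasn't" as one token is a design choice in this sketch; other tokenizers split contractions differently.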

2. Clean: remove noise

Lowercase, strip punctuation and HTML — see Text Cleaning.
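
The three cleaning operations named above can be sketched with the standard library alone. The `clean_text` helper and the simple tag regex are assumptions for illustration; the regex is not a full HTML parser, and the sketch works on a whole string for simplicity even though the same lowercasing and punctuation filtering could be applied token by token.

```python
import html
import re
import string

TAG_PATTERN = re.compile(r"<[^>]+>")                      # naive HTML tag matcher
PUNCT_TABLE = str.maketrans("", "", string.punctuation)   # deletes ASCII punctuation

def clean_text(text):
    """Strip HTML tags, decode entities, lowercase, and remove punctuation."""
    text = TAG_PATTERN.sub(" ", text)    # replace tags with a space
    text = html.unescape(text)           # e.g. "&amp;" becomes "&"
    text = text.lower()
    text = text.translate(PUNCT_TABLE)   # drop punctuation characters
    return " ".join(text.split())        # collapse leftover whitespace

print(clean_text("<p>Great film &amp; GREAT cast!</p>"))
# great film great cast
```

Note that this version also deletes apostrophes, so "wasn't" becomes "wasnt"; whether that is acceptable depends on the task.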

3. Normalize: reduce to root forms

Drop stop words, then stem or lemmatize to base forms.
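
To make the two steps concrete, here is a deliberately crude sketch. The tiny `STOP_WORDS` set and the suffix-stripping `crude_stem` function are toy assumptions written for this page; a real project would more likely use an established stemmer (for example the Porter stemmer, available in NLTK as `PorterStemmer`) or a lemmatizer.

```python
# Toy stop-word list, assumed for this example only.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "in"}

# Toy suffix rules, checked in order; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def normalize(tokens):
    """Drop stop words, then stem what remains."""
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(normalize(["the", "actors", "were", "acting", "and", "the", "crowd", "cheered"]))
# ['actor', 'were', 'act', 'crowd', 'cheer']
```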

4. Vectorize: text → numbers

Turn tokens into vectors — Bag of Words, TF-IDF, or embeddings.
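
Of the three options, Bag of Words is the simplest to sketch by hand. The helpers `fit_vocabulary` and `bag_of_words` below are hypothetical names for this example; TF-IDF weighting and embeddings are not shown.

```python
from collections import Counter

def fit_vocabulary(tokenized_docs):
    """Map each distinct training token to a column index (sorted for stability)."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: i for i, tok in enumerate(vocab)}

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary token appears; unseen tokens are ignored."""
    counts = Counter(tokens)
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        idx = vocabulary.get(tok)
        if idx is not None:
            vector[idx] = n
    return vector

train_docs = [["great", "film"], ["bad", "film", "bad", "plot"]]
vocabulary = fit_vocabulary(train_docs)        # columns: bad, film, great, plot
print([bag_of_words(d, vocabulary) for d in train_docs])
# [[0, 1, 1, 0], [2, 1, 0, 1]]
print(bag_of_words(["great", "acting"], vocabulary))
# [0, 0, 1, 0]
```

The last call shows what happens at inference time: "acting" was never seen during fitting, so it simply contributes nothing to the vector.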

5. Model: learn / predict

Feed the vectors to a classifier or neural network.
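
The page does not name a specific model, so the sketch below swaps in a nearest-centroid classifier purely because it fits in a few lines. The function names, the toy vectors, and the "pos"/"neg" labels are all assumptions for illustration.

```python
def fit_centroids(vectors, labels):
    """Learn: average the training vectors of each class."""
    sums, counts = {}, {}
    for vector, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vector)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def predict(vector, centroids):
    """Predict: return the class whose centroid is closest (squared distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(vector, centroids[label]))

# Toy bag-of-words vectors; columns stand for: bad, film, great, plot
train_vectors = [[0, 1, 1, 0], [0, 1, 2, 0], [2, 1, 0, 1], [1, 1, 0, 0]]
train_labels = ["pos", "pos", "neg", "neg"]

centroids = fit_centroids(train_vectors, train_labels)
print(predict([0, 0, 1, 0], centroids))  # pos
print(predict([1, 0, 0, 1], centroids))  # neg
```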

6. Evaluate: measure

Score on held-out data with accuracy, precision, recall, and F1, then iterate.
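
For a binary task, the four metrics can be computed directly from counts of true positives, false positives, and false negatives. The `binary_metrics` helper and the example labels are assumptions for this sketch.

```python
def binary_metrics(y_true, y_pred, positive="pos"):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = ["pos", "pos", "neg", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "pos", "pos"]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

Here there are two true positives, one false positive, and one false negative, so accuracy is 3/5 while precision, recall, and F1 all come out to 2/3.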

A few guardrails

Good practice
  • Fit every transform on training data only, so nothing from validation or test data leaks into vocabularies or statistics
  • Keep the same steps for train and inference
  • Bundle stages into a single pipeline object
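
One way to follow all three practices at once is to wrap the steps in a single object with separate `fit` and `transform` methods. The `BagOfWordsPipeline` class and `simple_preprocess` function below are hypothetical, written only for this sketch; libraries such as scikit-learn offer a `Pipeline` class built on the same idea.

```python
from collections import Counter

class BagOfWordsPipeline:
    """Hypothetical single object that owns every preprocessing step."""

    def __init__(self, preprocess):
        self.preprocess = preprocess   # callable: str -> list of tokens
        self.vocabulary = None

    def fit(self, train_texts):
        """Learn the vocabulary from training texts only."""
        tokens = {tok for text in train_texts for tok in self.preprocess(text)}
        self.vocabulary = {tok: i for i, tok in enumerate(sorted(tokens))}
        return self

    def transform(self, texts):
        """Apply the same preprocessing and the fitted vocabulary to any texts."""
        if self.vocabulary is None:
            raise RuntimeError("call fit() on training data first")
        vectors = []
        for text in texts:
            counts = Counter(self.preprocess(text))
            vectors.append([counts.get(tok, 0) for tok in self.vocabulary])
        return vectors

def simple_preprocess(text):
    return text.lower().split()

pipeline = BagOfWordsPipeline(simple_preprocess).fit(["Good film", "Bad film"])
print(pipeline.transform(["Good film", "Bad film"]))  # [[0, 1, 1], [1, 1, 0]]
print(pipeline.transform(["GOOD acting"]))            # [[0, 0, 1]]
```

Because inference goes through the same `transform` method, the lowercasing applied in training is applied at serving time too, and the unseen word "acting" cannot change the vocabulary.
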
Watch out
  • Modern transformers ship with their own subword tokenizers and need less cleaning, so don't over-strip
  • Aggressive normalization can destroy signal (e.g. negation)
  • Mismatched train/serve steps cause silent bugs: predictions still come back, just quietly worse
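
The negation warning is easy to demonstrate. In the sketch below, both stop-word sets are toy assumptions; the point is only that a list containing "not" makes two opposite reviews indistinguishable.

```python
# Toy stop-word lists, assumed for this example only.
STOP_WORDS_AGGRESSIVE = {"the", "a", "was", "not", "no"}
STOP_WORDS_CAREFUL = STOP_WORDS_AGGRESSIVE - {"not", "no"}   # keep negation words

def drop_stop_words(text, stop_words):
    return [t for t in text.lower().split() if t not in stop_words]

positive = "the film was good"
negative = "the film was not good"

# Aggressive list: both reviews collapse to the same tokens.
print(drop_stop_words(positive, STOP_WORDS_AGGRESSIVE))  # ['film', 'good']
print(drop_stop_words(negative, STOP_WORDS_AGGRESSIVE))  # ['film', 'good']

# Careful list: the negation survives.
print(drop_stop_words(negative, STOP_WORDS_CAREFUL))     # ['film', 'not', 'good']
```

Once the two reviews map to identical tokens, no downstream model can tell them apart, however good it is.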