BERT — Encoder-Only Transformers

A transformer that reads, not writes

BERT (2018) is a stack of transformer encoder layers. Its defining feature is that it reads the entire input in both directions at once: every token attends to every other token, left and right, so each token's representation reflects its full context.

That bidirectional view (the B in BERT) makes it superb at understanding tasks: classification, entity recognition, question answering, search. It doesn't generate text left-to-right — that's GPT's job.

How it learns: fill in the blank

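BERT pretrains with masked language modeling: select about 15% of the input tokens and make the model predict the originals from the context on both sides. In the original recipe, 80% of the selected tokens are replaced with [MASK], 10% with a random token, and 10% are left unchanged.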
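The corruption step can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the function name `mask_tokens`, the toy `vocab`, and the choice to select each token independently with probability 0.15 are assumptions made for the example.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at positions the model must
    predict, and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL:                 # never corrupt special tokens
            continue
        if rng.random() >= mask_prob:      # token not selected
            continue
        labels[i] = tok                    # the model must recover this
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK            # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corrupted, labels = mask_tokens(tokens, vocab=["dog", "ran", "tree"])
print(corrupted)
print(labels)
```

The training loss is computed only at positions where `labels` is not None, so the model gets no credit for copying the untouched tokens.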

Pretrain, then fine-tune

1. Pretrain: masked LM

On a huge text corpus, learn language by filling in masked words — no labels needed (self-supervised).

2. Fine-tune: your task

Add a small task-specific head on top of the pretrained encoder and train briefly on your labelled data, much like transfer learning in vision.

[CLS] token: sentence summary

A special token prepended to every input. Its final hidden vector serves as a summary of the whole sequence, which makes it a convenient input to a classification head.
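As a rough sketch of how a task head might consume that vector, the snippet below applies a single linear layer and a softmax to the first position of the encoder output. The function `classify_from_cls`, the toy vectors, and the weights are all hypothetical; real BERT classifiers typically add a pooling layer and dropout, which are omitted here.

```python
import math

def classify_from_cls(hidden_states, weights, bias):
    """Softmax classifier over the final vector at position 0 ([CLS]).

    hidden_states: per-token output vectors from the encoder (toy values here)
    weights: one weight vector per class; bias: one scalar per class
    """
    cls_vector = hidden_states[0]          # [CLS] is the first token
    logits = [
        sum(w * x for w, x in zip(class_w, cls_vector)) + b
        for class_w, b in zip(weights, bias)
    ]
    peak = max(logits)                     # subtract max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Three tokens with 2-dimensional vectors; only the first one is used.
hidden = [[1.0, 0.0], [5.0, 5.0], [-3.0, 2.0]]
probs = classify_from_cls(hidden, weights=[[2.0, 0.0], [0.0, 2.0]], bias=[0.0, 0.0])
print(probs)
```

During fine-tuning, the head's weights and the encoder's weights are usually updated together, so the [CLS] vector learns to carry whatever the task needs.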

BERT vs GPT

BERT (encoder)
  • Bidirectional — sees both sides
  • Great for understanding: classify, tag, search
  • Trained by filling blanks

GPT (decoder)
  • Left-to-right — sees only the past
  • Great for generation: write, chat, code
  • Trained by predicting the next word
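The "sees both sides" versus "sees only the past" distinction comes down to which positions each token is allowed to attend to. The sketch below builds both attention masks for a short sequence; the helper names are illustrative and not taken from any library.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def show(mask):
    """Format a mask as rows of 0s and 1s (row = query, column = key)."""
    return "\n".join(" ".join(str(v) for v in row) for row in mask)

print("BERT-style (bidirectional):")
print(show(bidirectional_mask(4)))
print("GPT-style (causal):")
print(show(causal_mask(4)))
```

A 1 in row i, column j means token i can use token j as context. The causal mask is lower triangular, which is what lets a decoder generate one token at a time without peeking ahead.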

Use it today

BERT and its many descendants (RoBERTa, DistilBERT) are widely used for search and classification. Pretrained checkpoints can be loaded in a few lines with the Hugging Face transformers library.
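One way this might look is shown below, using the `fill-mask` pipeline from `transformers`. It assumes the library and a backend such as PyTorch are installed, and that the `bert-base-uncased` checkpoint can be downloaded on first use. The wrapper `top_predictions` is a hypothetical helper written for this example.

```python
def top_predictions(text, model_name="bert-base-uncased", k=5):
    """Return the k most likely fillers for the single [MASK] token in `text`."""
    # Imported lazily so that defining this function does not require
    # the library; calling it does (pip install transformers torch).
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_name)
    return [(p["token_str"], round(p["score"], 3)) for p in fill(text, top_k=k)]

# Example call (downloads the checkpoint the first time it runs):
# for word, score in top_predictions("The capital of France is [MASK]."):
#     print(word, score)
```

Swapping `model_name` for another masked-LM checkpoint, such as a DistilBERT variant, should work the same way as long as the input uses that model's own mask token.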