BERT — Encoder-Only Transformers
A transformer that reads, not writes
BERT (2018) is a stack of transformer encoders. Its superpower is that it reads the entire sentence in both directions at once — every word attends to every other word, left and right — building deep, context-rich representations.
That bidirectional view (the B in BERT) makes it superb at understanding tasks: classification, entity recognition, question answering, search. It doesn't generate text left-to-right — that's GPT's job.
How it learns: fill in the blank
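BERT pretrains with masked language modeling: pick ~15% of the input tokens, hide most of them behind a [MASK] token, and make the model predict the originals from the context on both sides. In the original recipe, 80% of the picked tokens become [MASK], 10% are swapped for a random token, and 10% are left as they are, so the model doesn't learn to rely on always seeing [MASK].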
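The fill-in-the-blank corruption step can be sketched in a few lines of plain Python. This is a minimal illustration, not BERT's actual preprocessing code: the function name `mask_tokens`, the whitespace "tokenizer", and the toy vocabulary are all assumptions made for the example, and each token is picked independently with probability 0.15 rather than picking a fixed share per sequence. The 80/10/10 split follows the original BERT recipe.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (corrupted, labels); labels[i] is None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:
            # Not picked: the token passes through and carries no training target.
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)               # 80%: hide behind [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: swap in a random token
        else:
            corrupted.append(tok)                # 10%: leave unchanged
    return corrupted, labels

# Toy usage: whitespace splitting stands in for a real subword tokenizer.
sentence = "the cat sat on the mat because it was tired".split()
toy_vocab = sorted(set(sentence))
corrupted, labels = mask_tokens(sentence, toy_vocab, seed=0)
```

Only the positions with a non-`None` label contribute to the training loss; everywhere else the model just reads context.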
Pretrain, then fine-tune
- Pretrain: on a huge text corpus, learn language by filling in masked words. No labels needed (self-supervised).
- Fine-tune: add a small task head and train briefly on your labelled data, exactly like transfer learning for vision.
- [CLS]: a special token placed at the start of every input, whose final vector represents the whole input. Handy for classification.
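To make the "small task head" concrete, here is a rough sketch of a classification head reading the final vector of the special first token ([CLS] in BERT's vocabulary). Everything in it is illustrative: `classify_from_cls`, the 4-dimensional hidden states, and the weights are made-up stand-ins, and the head is a single linear layer plus softmax. Real implementations usually pass the [CLS] vector through an extra dense layer with tanh first, and they learn the weights during fine-tuning.

```python
import math

def classify_from_cls(hidden_states, weights, bias):
    """Turn the final-layer vector at position 0 ([CLS]) into class probabilities.

    hidden_states: one vector per input token (assumed encoder output).
    weights: one row of per-feature weights per class; bias: one value per class.
    """
    cls_vector = hidden_states[0]  # [CLS] sits at the start of the input
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    # Numerically stable softmax over the class logits.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical encoder output for a three-token input (hidden size 4).
hidden_states = [
    [0.2, -0.1, 0.4, 0.0],  # [CLS]
    [0.9, 0.3, -0.5, 0.1],  # a word token
    [0.0, 0.0, 0.0, 0.0],   # [SEP]
]
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]  # two classes
bias = [0.0, 0.0]
probs = classify_from_cls(hidden_states, weights, bias)
```

The head only looks at position 0, so the encoder has to pack whatever the task needs into that one vector. Fine-tuning is what pushes it to do so.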
BERT vs GPT
BERT (encoder-only):
- Bidirectional: sees both sides
- Great for understanding: classify, tag, search
- Trained by filling blanks
GPT (decoder-only):
- Left-to-right: sees only the past
- Great for generation: write, chat, code
- Trained by predicting the next word
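The bidirectional versus left-to-right contrast comes down to which positions each token is allowed to attend to. A small sketch, with the helper name `attention_mask` invented for this example, builds both patterns as 0/1 grids where row i lists the positions that token i can see:

```python
def attention_mask(seq_len, causal):
    """mask[i][j] == 1 means position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

bert_mask = attention_mask(4, causal=False)  # every token sees every token
gpt_mask = attention_mask(4, causal=True)    # each token sees itself and the past

for name, mask in [("BERT-style", bert_mask), ("GPT-style", gpt_mask)]:
    print(name)
    for row in mask:
        print(" ", row)
```

The BERT-style grid is all ones. The GPT-style grid is lower-triangular, which is what makes next-word prediction a fair test: the answer is never visible in the input.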
BERT and its many descendants (RoBERTa, DistilBERT, and others) are widely used for search and classification. Pretrained checkpoints load in a few lines via the Hugging Face transformers library.
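As a rough sketch of what "a few lines" looks like, the snippet below wraps the transformers `fill-mask` pipeline around the `bert-base-uncased` checkpoint. The wrapper `predict_masked` is a name made up for this example. It assumes the `transformers` package and a backend such as PyTorch are installed, and that the checkpoint can be downloaded on first use.

```python
def predict_masked(text, model_name="bert-base-uncased", top_k=3):
    """Return the top_k (token, score) guesses for the single [MASK] in text."""
    # Imported here so this file still loads where transformers is not installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_name)
    return [(p["token_str"], p["score"]) for p in fill_mask(text, top_k=top_k)]
```

Calling `predict_masked("The capital of France is [MASK].")` returns BERT's highest-scoring candidates for the blank along with their probabilities. Swapping `model_name` for another masked-language-model checkpoint should work the same way, as long as the mask token in the text matches that model's tokenizer.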