GPT — Decoder-Only Transformers
One simple objective, scaled enormously
GPT (Generative Pre-trained Transformer) is a stack of transformer decoder blocks trained to do exactly one thing: predict the next token. That deceptively simple objective, scaled to billions of parameters and trillions of tokens of text, produces ChatGPT-style fluency.
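To make the next-token objective concrete, here is a minimal NumPy sketch of the loss it implies. The function name `next_token_loss` and the array shapes are illustrative assumptions, not code from any GPT implementation; the point is only that the prediction at position t is scored against the token at position t+1.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Average cross-entropy for predicting each token from the ones before it.

    logits:    (seq_len, vocab_size) array of model scores, one row per position.
    token_ids: (seq_len,) integer array holding the training sequence.
    """
    # Position t is scored against token t+1, so drop the last row of
    # predictions and the first token of the targets.
    preds = logits[:-1]
    targets = token_ids[1:]
    # Numerically stable log-softmax over the vocabulary.
    shifted = preds - preds.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each true next token, averaged.
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll.mean()
```

With uniform scores over a vocabulary of size V, this loss equals log V; training pushes it down by putting more probability on the token that actually comes next.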
The key constraint is causal (masked) attention: each position may attend only to itself and earlier tokens, never to later ones. That is what lets GPT generate: it can't peek at words it hasn't written yet.
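The causal mask can be sketched in a few lines of NumPy. This is a simplified single-head illustration under assumed names (`causal_attention_weights`, a square `scores` matrix); real implementations add batching, multiple heads, and scaling, but the masking step works the same way.

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask out future positions in raw attention scores, then softmax each row.

    scores: (seq_len, seq_len) array; scores[i, j] is how strongly query
    position i wants to attend to key position j.
    """
    seq_len = scores.shape[0]
    # True above the diagonal marks future positions (j > i).
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Row-wise softmax; exp(-inf) is 0, so future positions get zero weight.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# With equal scores, row i spreads its attention evenly over positions 0..i.
weights = causal_attention_weights(np.zeros((4, 4)))
```

The result is lower triangular: the first position can attend only to itself, while the last position can attend to the whole sequence.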
Watch it generate, token by token
GPT reads the prompt, predicts the next token, appends it, and feeds the whole thing back in — the autoregressive loop that writes text.
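The loop just described can be sketched with a toy stand-in for the model. Everything here is hypothetical: `VOCAB`, `BIGRAM_LOGITS`, and `toy_next_token_logits` replace a trained network with a fixed lookup table, and the decoding is greedy (always take the top-scoring token) to keep the example deterministic.

```python
import numpy as np

VOCAB = ["<eos>", "the", "cat", "sat", "down"]

# Hypothetical stand-in for a trained model: row i holds the scores for
# whichever token follows token i.
BIGRAM_LOGITS = np.array([
    [5.0, 0.0, 0.0, 0.0, 0.0],  # after "<eos>": "<eos>"
    [0.0, 0.0, 5.0, 0.0, 0.0],  # after "the":   "cat"
    [0.0, 0.0, 0.0, 5.0, 0.0],  # after "cat":   "sat"
    [0.0, 0.0, 0.0, 0.0, 5.0],  # after "sat":   "down"
    [5.0, 0.0, 0.0, 0.0, 0.0],  # after "down":  "<eos>"
])

def toy_next_token_logits(token_ids):
    """Toy model that looks only at the last token; a real GPT attends to all of them."""
    return BIGRAM_LOGITS[token_ids[-1]]

def generate(prompt_ids, max_new_tokens=10, eos_id=0):
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_next_token_logits(token_ids)  # score every vocabulary entry
        next_id = int(np.argmax(logits))           # greedy: take the top token
        if next_id == eos_id:                      # stop at end-of-sequence
            break
        token_ids.append(next_id)                  # feed the longer sequence back in
    return token_ids

output_ids = generate([1])  # prompt: "the"
print(" ".join(VOCAB[i] for i in output_ids))  # prints "the cat sat down"
```

Each pass through the loop calls the model on the full sequence so far, which is why generation cost grows with output length.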
The ingredients
A mask blocks attention to future positions, so prediction can't cheat by seeing ahead.
Each generated token is appended and fed back in to predict the next — one at a time.
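Pick the next token by sampling from the softmax distribution over the vocabulary; temperature controls the trade-off, with low values making output more predictable and high values making it more varied.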
Pretrain on raw text (next-token prediction), then align with instruction tuning and RLHF so it follows directions helpfully — see What is an LLM and fine-tuning.
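Of the ingredients above, sampling with temperature is the easiest to show in code. The sketch below assumes the common formulation in which logits are divided by the temperature before the softmax; the function names are illustrative, not taken from any particular library.

```python
import numpy as np

def temperature_softmax(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

def sample_next_token(logits, temperature=1.0, rng=None):
    """Draw one token id from the temperature-adjusted distribution."""
    if rng is None:
        rng = np.random.default_rng()
    probs = temperature_softmax(logits, temperature)
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.0]
cold = temperature_softmax(logits, 0.1)   # sharply peaked on the top token
hot = temperature_softmax(logits, 10.0)   # close to uniform
token_id = sample_next_token(logits, temperature=0.7, rng=np.random.default_rng(0))
```

A low temperature concentrates nearly all probability on the highest-scoring token, approaching greedy decoding, while a high temperature flattens the distribution so that less likely tokens are chosen more often.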
The decoder-only family
The same architecture writes essays, code, and answers, just by predicting the next token. Its performance improves steadily with scale, and larger models pick up abilities that smaller ones lack. GPT, Llama, Claude, Gemini, and most modern LLMs are decoder-only transformers. For the contrast with understanding-focused models, see BERT; for the architecture in full, see How Transformers Work.