GPT — Decoder-Only Transformers

One simple objective, scaled enormously

GPT (Generative Pre-trained Transformer) is a stack of transformer decoder blocks trained to do exactly one thing: predict the next token. That deceptively simple objective, scaled to billions of parameters and trillions of tokens of training text, produces ChatGPT-style fluency.
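To make the objective concrete, here is a minimal sketch of next-token loss: the average cross-entropy of predicting each token from the tokens before it. The function names, the toy vocabulary of four token ids, and the uniform "model" are all assumptions for illustration; a real GPT computes these probabilities with a neural network and processes every position in parallel.

```python
import math

def next_token_loss(token_ids, predict_probs):
    """Average cross-entropy of predicting each token from the tokens before it."""
    total = 0.0
    for t in range(1, len(token_ids)):
        # The model only sees the prefix token_ids[:t].
        probs = predict_probs(token_ids[:t])
        # Penalize low probability assigned to the token that actually came next.
        total += -math.log(probs[token_ids[t]])
    return total / (len(token_ids) - 1)

def uniform_model(context, vocab_size=4):
    """Hypothetical model that knows nothing: every token is equally likely."""
    return [1.0 / vocab_size] * vocab_size

# With a uniform guess over 4 tokens, the loss is log(4) per token.
loss = next_token_loss([0, 2, 1, 3], uniform_model)
```

Training pushes this number down by adjusting the model so that the observed next token gets more probability.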

The key constraint is causal (masked) attention: each position may attend only to itself and earlier tokens, never to later ones. That is what lets GPT generate: it can't peek at words it hasn't written yet.

Watch it generate, token by token

GPT reads the prompt, predicts the next token, appends it, and feeds the whole thing back in — the autoregressive loop that writes text.

The ingredients

Causal attention: past only

A mask blocks attention to future positions, so prediction can't cheat by seeing ahead.
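As a rough illustration of the mask, the sketch below builds a lower-triangular "may attend" matrix and applies it before a row-wise softmax. The helper names and the uniform scores are assumptions made for the example; real implementations do the same thing on tensors, with scores computed from queries and keys.

```python
import math

def causal_mask(n):
    """n x n matrix: True where query position i may attend to key position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores; masked entries end up with weight 0."""
    weights = []
    for row, allowed in zip(scores, mask):
        # Blocked positions are set to -inf, so exp(-inf) contributes 0.
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        top = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical uniform scores for a 4-token sequence.
scores = [[0.0] * 4 for _ in range(4)]
attn = masked_softmax(scores, causal_mask(4))
```

With these scores, the first token attends only to itself, the second splits its attention over the first two tokens, and no row places any weight on a later position.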

Autoregressive: loop the output

Each generated token is appended and fed back in to predict the next — one at a time.
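The loop itself is short. In this sketch, toy_next_token is a hypothetical lookup table standing in for a trained model, and the "<eos>" stop token and max_new_tokens limit are assumptions for the example; only the append-and-feed-back structure reflects how generation works.

```python
def toy_next_token(tokens):
    """Hypothetical stand-in for a trained model: picks the next token from the last one."""
    table = {"the": "cat", "cat": "sat", "sat": "down", "down": "<eos>"}
    return table.get(tokens[-1], "<eos>")

def generate(prompt, next_token_fn, max_new_tokens=10, eos="<eos>"):
    """Autoregressive generation: predict one token, append it, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = next_token_fn(tokens)  # the model sees the whole sequence so far
        if nxt == eos:
            break                    # stop when the model signals the end
        tokens.append(nxt)           # feed the output back in as input
    return tokens

result = generate(["the"], toy_next_token)
```

Here result grows one token per iteration until the stand-in model emits the stop token or the length limit is reached.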

Sampling: temperature

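Sample the next token from the softmax distribution over the vocabulary; temperature controls how creative versus predictable the output is. Lower temperatures sharpen the distribution toward the most likely token, and higher temperatures flatten it.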
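A minimal sketch of temperature sampling, assuming the model has already produced one raw score (logit) per vocabulary entry. The function names and the three example logits are made up for illustration; dividing the logits by the temperature before the softmax is the standard technique.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; temperature rescales the logits first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index at random, weighted by the tempered probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]                       # hypothetical scores for 3 tokens
cold = softmax_with_temperature(logits, 0.5)   # sharper: favors the top token
warm = softmax_with_temperature(logits, 2.0)   # flatter: spreads probability out
```

Comparing cold and warm shows the effect: the top token receives more probability at temperature 0.5 than at 2.0, while the least likely token receives less.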

Pretrain → align

Pretrain on raw text (next-token prediction), then align with instruction tuning and reinforcement learning from human feedback (RLHF) so the model follows directions helpfully; see What is an LLM and fine-tuning.

The decoder-only family

Why it took over

The same architecture writes essays, code, and answers, all by predicting the next token. It scales well, and larger models show abilities that smaller ones lack. GPT, Llama, Claude, Gemini, and most modern LLMs are decoder-only transformers. For the contrast with understanding-focused models, see BERT; for the architecture in full, see How Transformers Work.