What is an LLM

What it is

A large language model is a neural network — almost always a transformer — trained on a vast pile of text to do exactly one thing: given some text so far, predict what token comes next. Everything else (chatting, translating, writing code, reasoning) is that one trick applied over and over.

"Large" refers to the parameter count. Modern frontier LLMs sit anywhere from a few billion to a trillion-plus parameters. "Language model" is the older statistical term — a probability distribution over sequences of tokens. The combination of huge scale and the transformer architecture is what unlocked the abilities we now call generative AI.

In one sentence

An LLM is a next-token predictor — pour text in, sample tokens out, loop until done.

The pipeline, end to end

From the text you type to the text the model writes back, every LLM follows roughly the same five stages. The model treats your prompt and its own reply as one growing sequence — it just keeps appending one token at a time.

1 — Tokenize: text → token IDs

A tokenizer (typically Byte Pair Encoding) splits the input into sub-word pieces and maps each to an integer ID. "unbelievable" might become ["un", "believ", "able"]. Vocabulary sizes are usually 30k–200k.
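
To make this concrete, here is a minimal sketch using OpenAI's tiktoken library (a GPT-era BPE tokenizer). The exact pieces and IDs depend on the vocabulary, so treat the "un" / "believ" / "able" split above as illustrative rather than literal.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("unbelievable")           # a handful of integer IDs
pieces = [enc.decode([i]) for i in ids]    # the sub-word piece each ID maps to

print(ids)       # the token IDs the model actually sees
print(pieces)    # sub-word strings — the real split may differ from the example
print(enc.n_vocab)  # vocabulary size (~100k for this encoding)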

2 — Embed: IDs → vectors

Each token ID indexes into a learned embedding table, producing a dense vector of a few thousand dimensions. Position information is added (sinusoidal, learned, or RoPE) so the model knows which token came where.
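
A minimal PyTorch sketch of this step, with made-up sizes and learned positional embeddings (RoPE and sinusoidal encodings are the common alternatives):

import torch
import torch.nn as nn

vocab_size, d_model, max_len = 50_000, 768, 2048

tok_emb = nn.Embedding(vocab_size, d_model)   # learned embedding table
pos_emb = nn.Embedding(max_len, d_model)      # learned positional embeddings

token_ids = torch.tensor([[1012, 734, 88, 2201, 15]])     # (batch=1, seq=5)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)  # [[0, 1, 2, 3, 4]]

x = tok_emb(token_ids) + pos_emb(positions)   # (1, 5, 768): one vector per token
print(x.shape)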

3 — Transformer stack: attention + FFN, ×N layers

Embeddings flow through dozens of identical transformer blocks. Each block lets every token mix information with every other token via attention, then runs a per-token MLP. After N blocks, every position holds a deeply contextualised vector.
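
A rough sketch of the stack using PyTorch's built-in encoder layer plus a causal mask, which makes it behave like a decoder-only stack. The layer sizes and count are illustrative; the inside of one block is sketched further down.

import torch
import torch.nn as nn

d_model, n_heads, n_layers, seq_len = 768, 12, 12, 5

blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
    for _ in range(n_layers)
)

# Causal mask: -inf above the diagonal, so each token attends only to the past
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

x = torch.randn(1, seq_len, d_model)       # embeddings from step 2
for block in blocks:                       # N structurally identical blocks
    x = block(x, src_mask=causal_mask)     # attention + per-token FFN
print(x.shape)                             # (1, 5, 768): contextualised vectors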

4 — Logits: vector → vocab scores

The final position's vector is multiplied by an output matrix to produce one score per vocabulary token — typically tens of thousands of numbers. A softmax turns those scores into a probability distribution over "what comes next."
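
In code this step is one matrix multiply and a softmax. The sizes and the random vectors below are stand-ins for a real model's weights and hidden state:

import torch

vocab_size, d_model = 50_000, 768

W_out = torch.randn(d_model, vocab_size)   # learned output (unembedding) matrix
h_last = torch.randn(d_model)              # final position's vector from step 3

logits = h_last @ W_out                    # one raw score per vocabulary token
probs = torch.softmax(logits, dim=-1)      # probability distribution, sums to 1

print(probs.shape, probs.sum())            # torch.Size([50000]), ~1.0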

5 — Sample: distribution → one token

Pick one token from that distribution (greedy, top-k, top-p, with a temperature). Append it to the sequence. Feed the new sequence back into step 2. Repeat until a stop token or length limit.
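
Putting the loop together — fake_model below is a stand-in that returns random logits, so the "text" is gibberish, but the append-and-repeat structure is the real one:

import torch

vocab_size, eos_id, max_new_tokens = 50_000, 0, 20

def fake_model(token_ids):                        # a real model returns logits too
    return torch.randn(vocab_size)                # scores for the *next* token

tokens = [101, 2023, 2003]                        # the tokenized prompt
for _ in range(max_new_tokens):
    logits = fake_model(tokens)                   # steps 2–4 of the pipeline
    probs = torch.softmax(logits / 0.8, dim=-1)   # temperature 0.8
    next_id = torch.multinomial(probs, 1).item()  # sample one token
    tokens.append(next_id)                        # append and loop
    if next_id == eos_id:                         # stop on the end-of-sequence token
        break
print(tokens)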

Watch one token get generated

The animation walks the prompt "The cat sat on the" through every stage of the pipeline — text, tokens, embeddings, the transformer stack, the vocabulary distribution, and the sampled next token "mat" being appended back into the sequence.

Why next-token prediction is enough

It is genuinely surprising that "predict the next word" turns into translation, summarization, code, math, and chat. The trick is in the data and the scale. To predict the next token well across a corpus that contains code, dialogue, articles, books, equations, and translations, the model is forced to internalise grammar, facts, reasoning patterns, and the implicit structure of all of those domains.

At small scale this looks like a bag of tricks. At large scale, capabilities that were absent in smaller models emerge — arithmetic, multi-step reasoning, instruction following, in-context learning. The training objective never changes; only the data and the parameter count grow.

Why one objective covers everything

Translation is "predict the French given the English." Summarization is "predict the short version given the long version." Code is "predict the next line given the function so far." All of it fits inside next-token prediction once the prompt is framed right.

How a chat LLM is trained

A useful chat LLM is built in three stages, each layered on top of the last. Each stage uses a different kind of data and teaches a different kind of skill.

1 — Pre-training: trillions of tokens, self-supervised

Train on raw text from the web, books, and code. The objective is plain next-token prediction. This is where the model learns language, facts, and reasoning patterns. It is also by far the most expensive stage — the bulk of the compute bill.
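
The objective itself fits in a few lines: cross-entropy between the model's prediction at each position and the token that actually came next. The random logits below stand in for a real model's output:

import torch
import torch.nn.functional as F

vocab_size = 50_000
token_ids = torch.tensor([[464, 3797, 3332, 319, 262, 2603]])  # a short training sequence

logits = torch.randn(1, token_ids.size(1), vocab_size)  # model output, one per position

inputs = logits[:, :-1, :]          # predictions at positions 0..n-2
targets = token_ids[:, 1:]          # the token that actually came next at each position

loss = F.cross_entropy(inputs.reshape(-1, vocab_size), targets.reshape(-1))
print(loss)   # this single number is what pre-training minimises, at vast scale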

2 — Supervised fine-tuning (SFT): curated instruction pairs

Continue training on a smaller dataset of (instruction, ideal response) pairs written or filtered by humans. This is what teaches the base model to follow instructions and reply in the helpful, on-topic style of a chatbot.
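
Roughly, each pair gets rendered into plain text with a chat template and then trained on like any other text. The template below is made up for illustration — every model family defines its own format:

# Hypothetical chat template; real models use their own special tokens.
pair = {
    "instruction": "Explain what a tokenizer does in one sentence.",
    "response": "A tokenizer splits text into sub-word pieces and maps each to an integer ID.",
}

text = (
    "<|user|>\n" + pair["instruction"] + "\n"
    "<|assistant|>\n" + pair["response"] + "<|end|>"
)
# The model is trained to predict the next token of `text`, usually with the
# loss applied only to the response tokens, not the user's instruction.
print(text)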

3 — Preference tuning (RLHF / DPO): human- or AI-ranked outputs

Show the model two candidate replies and a signal for which one is better. RLHF trains a reward model and uses RL; DPO skips the reward model and optimises preferences directly. This is where helpfulness, honesty, and harmlessness get sharpened.
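
As a sketch of the DPO side: the loss rewards the policy for assigning relatively more probability to the preferred reply than a frozen reference model does. The log-probabilities below are made-up numbers standing in for sums over real reply tokens:

import torch
import torch.nn.functional as F

beta = 0.1   # how strongly the policy may deviate from the reference model

# Log-probability of each full reply under the current policy and the reference
logp_chosen,  logp_rejected = torch.tensor(-42.0), torch.tensor(-55.0)
ref_chosen,   ref_rejected  = torch.tensor(-44.0), torch.tensor(-53.0)

margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
loss = -F.logsigmoid(beta * margin)   # minimising this prefers the chosen reply
print(loss)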

Decoding strategies

At step 5 of the pipeline, the model hands you a probability distribution over the entire vocabulary. How you pick a token from that distribution shapes the output's tone — whether it sounds robotic, creative, or unhinged.

Greedy: argmax

Always take the single most likely token. Deterministic and fast, but tends to loop and produces flat, repetitive text.

Temperature: sharpen or flatten

Divide the logits by T before softmax. T < 1 sharpens the distribution (more conservative); T > 1 flattens it (more random). T = 0 degenerates to greedy.

Top-k: keep the k most likely

Zero out everything except the top k tokens, renormalise, then sample. Cuts off the long tail of nonsense without forcing greedy behaviour.

Top-p (nucleus): smallest set summing to p

Keep the smallest set of tokens whose combined probability exceeds p (e.g. 0.9), then sample from that set. Adapts to the shape of the distribution: narrow when the model is confident, wide when it isn't.
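
Here are all four strategies applied to a single logits vector, as a rough Python sketch. The logits are random and the thresholds (T = 0.7, k = 50, p = 0.9) are just common-looking defaults, not recommendations:

import torch

logits = torch.randn(50_000)                      # raw scores from step 4

# Greedy: always the single most likely token
greedy_id = torch.argmax(logits).item()

# Temperature: rescale logits before softmax (T < 1 sharper, T > 1 flatter)
T = 0.7
probs = torch.softmax(logits / T, dim=-1)

# Top-k: keep only the k most likely tokens, renormalise, sample
k = 50
topk_probs, topk_ids = torch.topk(probs, k)
topk_probs = topk_probs / topk_probs.sum()
topk_id = topk_ids[torch.multinomial(topk_probs, 1)].item()

# Top-p (nucleus): smallest set of tokens whose cumulative probability >= p
p = 0.9
sorted_probs, sorted_ids = torch.sort(probs, descending=True)
cum = torch.cumsum(sorted_probs, dim=0)
cutoff = int((cum < p).sum()) + 1                 # size of the nucleus
nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
nucleus_id = sorted_ids[torch.multinomial(nucleus, 1)].item()

print(greedy_id, topk_id, nucleus_id)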

The family of LLM architectures

Every modern LLM is a transformer, but not every transformer looks the same. The choice of which half of the original encoder-decoder stack you keep — and how you wire the attention — defines what the model is good at.

Decoder-only: GPT, Claude, Llama, Gemini, Mistral

A stack of decoder blocks with causal (masked) self-attention — every token sees only the past. Trained autoregressively to predict the next token. Dominant for chat, generation, code, and reasoning. The architecture behind virtually every assistant you've used.

Encoder-only: BERT, RoBERTa, DeBERTa

Bidirectional self-attention — every token sees every other. Trained with masked-language-modeling (fill in the blank). Cannot generate fluent text, but produces excellent embeddings for search, classification, and re-ranking. Still the workhorse for retrieval and NLU.

Encoder–decoder: T5, BART, original Transformer

An encoder reads the input, a decoder writes the output, and cross-attention bridges the two. Natural fit when input and output are clearly distinct sequences — translation, summarisation, structured-to-text. Trained on sequence-to-sequence objectives.

Mixture-of-Experts (MoE): Mixtral, DeepSeek, GPT-4-class

Not a separate architecture — a variant usually layered onto decoder-only. Replace each block's single feed-forward layer with many parallel "experts," and a router that activates only a few per token. Total parameters explode, but per-token compute stays cheap. The trick behind very large modern models.
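
A toy sketch of one MoE layer — eight experts, top-2 routing, and a plain Python loop instead of the fused kernels real systems use. All sizes and the routing depth are illustrative:

import torch
import torch.nn as nn

d_model, d_ff, n_experts, top_k = 768, 3072, 8, 2

experts = nn.ModuleList(
    nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
    for _ in range(n_experts)
)
router = nn.Linear(d_model, n_experts)    # scores each expert for each token

x = torch.randn(10, d_model)              # 10 tokens' vectors
weights = torch.softmax(router(x), dim=-1)
top_w, top_idx = torch.topk(weights, top_k, dim=-1)   # 2 experts per token

out = torch.zeros_like(x)
for t in range(x.size(0)):                # loop for clarity; real kernels batch this
    for w, i in zip(top_w[t], top_idx[t]):
        out[t] += w * experts[int(i)](x[t])   # only the chosen experts do any work
print(out.shape)                          # (10, 768): same shape as a dense FFN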

Compare the four families visually

The animation flips through the four families one at a time, highlighting the key wiring difference: how attention is masked, whether two stacks talk to each other, and where MoE's router fits in.

What lives inside one transformer block

Zoom into any single block of the stack and the recipe is the same across every LLM family — only the masking and the cross-attention plumbing differ.

Self-attention: tokens look at each other

Each token computes a query, key, and value, then blends in information from other tokens by query-key similarity. Causal mask in decoders, full attention in encoders.

Feed-forward (FFN): per-token MLP

A two-layer MLP applied independently at each position. This is where most of the parameters live and where most of the per-token "thinking" happens.

Residuals + Norm: skip + layer-norm

Wrap each sub-layer with a residual connection and a normalisation step. Without these, deep stacks don't train.
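
Here is a minimal pre-norm decoder block in PyTorch that puts the three pieces together. Sizes are illustrative, and real models differ in details such as normalisation placement and activation choice:

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):                   # x: (batch, seq, d_model)
        n = x.size(1)
        # Causal mask: -inf above the diagonal, so tokens only see the past
        mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                    # residual around attention
        x = x + self.ffn(self.norm2(x))     # residual around the per-token MLP
        return x

x = torch.randn(1, 5, 768)                  # 5 token vectors from step 2
print(DecoderBlock()(x).shape)              # same shape out: (1, 5, 768)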

For the full mechanics — Q/K/V projections, multi-head attention, positional encoding, the encoder/decoder block diagrams, and an animated walk-through — see How Transformers Work.

What an LLM is not good at

Limits worth keeping in mind
  • Hallucination — the model is sampling from a distribution, not looking facts up. When the right answer isn't sharply represented in its training data, it can produce confident but invented details.
  • Knowledge cutoff — pre-training data is a frozen snapshot. The model doesn't know about anything that happened after its cutoff date unless you give it tools or context.
  • Context window — attention is O(n²) in sequence length. Models have a hard limit on how many tokens they can attend to at once. Past that, information has to be summarised or retrieved.
  • No grounding by default — out of the box an LLM has no internet, no database, no calculator. Tool use and retrieval (see How RAG Works) are bolted on to fix this.
  • Stateless between calls — the model has no memory across separate conversations. Whatever context you want it to use must be passed in every time.