What is an LLM
What it is
A large language model is a neural network — almost always a transformer — trained on a vast pile of text to do exactly one thing: given some text so far, predict what token comes next. Everything else (chatting, translating, writing code, reasoning) is that one trick applied over and over.
"Large" refers to the parameter count. Modern frontier LLMs sit anywhere from a few billion to a trillion-plus parameters. "Language model" is the older statistical term — a probability distribution over sequences of tokens. The combination of huge scale and the transformer architecture is what unlocked the abilities we now call generative AI.
An LLM is a next-token predictor — pour text in, sample tokens out, loop until done.
The pipeline, end to end
From the text you type to the text the model writes back, every LLM follows roughly the same five stages. The model treats your prompt and its own reply as one growing sequence — it just keeps appending one token at a time.
1. Tokenization: a tokenizer (typically Byte Pair Encoding) splits the input into sub-word pieces and maps each to an integer ID. "unbelievable" might become ["un", "believ", "able"]. Vocabulary sizes are usually 30k–200k.
2. Embedding: each token ID indexes into a learned embedding table, producing a dense vector of a few thousand dimensions. Position information is added (sinusoidal, learned, or RoPE) so the model knows which token came where.
3. Transformer stack: embeddings flow through dozens of identical transformer blocks. Each block lets every token mix information with every other token via attention, then runs a per-token MLP. After N blocks, every position holds a deeply contextualised vector.
4. Output projection: the final position's vector is multiplied by an output matrix to produce one score per vocabulary token — typically tens of thousands of numbers. A softmax turns those scores into a probability distribution over "what comes next."
5. Sampling: pick one token from that distribution (greedy, top-k, top-p, with a temperature). Append it to the sequence. Feed the new sequence back into step 2. Repeat until a stop token or length limit.
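Step 1 can be sketched in a few lines. The vocabulary and the longest-match rule below are toy stand-ins (real BPE tokenizers learn their merge rules from data), but the text-to-IDs mapping is the same idea:

```python
# Toy longest-match subword tokenizer. VOCAB is invented for illustration;
# real BPE vocabularies are learned and hold tens of thousands of pieces.
VOCAB = {"un": 0, "believ": 1, "able": 2, "the": 3, " ": 4}

def tokenize(text):
    """Greedily match the longest vocabulary piece at each position."""
    ids = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest piece first
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return ids

print(tokenize("unbelievable"))  # → [0, 1, 2]
```

Real tokenizers also include byte-level fallback pieces so that no input string can fail to tokenize.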
Watch one token get generated
The animation walks the prompt "The cat sat on the" through every stage of the pipeline — text, tokens, embeddings, the transformer stack, the vocabulary distribution, and the sampled next token "mat" being appended back into the sequence.
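The loop the animation traces can be sketched as follows, with a stand-in for the whole embed-transform-unembed stack (the real model returns learned logits; `dummy_model` and its tiny five-token vocabulary are invented for illustration):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def generate(model, ids, max_new_tokens, eos_id, rng):
    """Steps 2-5 of the pipeline: score, softmax, sample, append, repeat.
    `model(ids)` must return one logit per vocabulary token."""
    for _ in range(max_new_tokens):
        probs = softmax(model(ids))                     # distribution over vocab
        next_id = int(rng.choice(len(probs), p=probs))  # sample one token
        ids = ids + [next_id]                           # append and loop
        if next_id == eos_id:
            break
    return ids

# Dummy "model": puts nearly all probability mass on token 3 (our stand-in EOS).
def dummy_model(ids):
    logits = np.full(5, -10.0)
    logits[3] = 10.0
    return logits

out = generate(dummy_model, [1, 2], max_new_tokens=8,
               eos_id=3, rng=np.random.default_rng(0))
```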
Why next-token prediction is enough
It is genuinely surprising that "predict the next word" turns into translation, summarization, code, math, and chat. The trick is in the data and the scale. To predict the next token well across a corpus that contains code, dialogue, articles, books, equations, and translations, the model is forced to internalise grammar, facts, reasoning patterns, and the implicit structure of all of those domains.
At small scale this looks like a bag of tricks. At large scale, capabilities that were absent in smaller models emerge — arithmetic, multi-step reasoning, instruction following, in-context learning. The training objective never changes; only the data and the parameter count grow.
Translation is "predict the French given the English." Summarization is "predict the short version given the long version." Code is "predict the next line given the function so far." All of it fits inside next-token prediction once the prompt is framed right.
How a chat LLM is trained
A useful chat LLM is built in three stages, each layered on top of the last. Each stage uses a different kind of data and teaches a different kind of skill.
Pre-training: train on raw text from the web, books, and code. The objective is plain next-token prediction. This is where the model learns language, facts, and reasoning patterns. It is also by far the most expensive stage — the bulk of the compute bill.
Supervised fine-tuning: continue training on a smaller dataset of (instruction, ideal response) pairs written or filtered by humans. This is what teaches the base model to follow instructions and reply in the helpful, on-topic style of a chatbot.
Preference tuning: show the model two candidate replies and a signal for which one is better. RLHF trains a reward model and uses RL; DPO skips the reward model and optimises preferences directly. This is where helpfulness, honesty, and harmlessness get sharpened.
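As a concrete sketch of the preference stage, the DPO loss for one preference pair looks like this. The log-probabilities are assumed inputs here; in practice they come from scoring the chosen and rejected replies under the policy being trained and under a frozen reference model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for a single preference pair. beta controls how far the
    policy may drift from the reference model."""
    # How much more the policy prefers the chosen reply, relative to the reference.
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log(sigmoid)

# When the policy already prefers the chosen reply more strongly than the
# reference does, the margin is positive and the loss is small.
loss = dpo_loss(logp_chosen=-10.0, logp_rejected=-30.0,
                ref_chosen=-12.0, ref_rejected=-12.0, beta=0.1)
```

With a zero margin the loss is exactly log 2; it shrinks as the policy's preference for the chosen reply grows.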
Decoding strategies
At step 5 of the pipeline, the model hands you a probability distribution over the entire vocabulary. How you pick a token from that distribution shapes the output's tone — whether it sounds robotic, creative, or unhinged.
Greedy: always take the single most likely token. Deterministic and fast, but tends to loop and produces flat, repetitive text.
Temperature: divide the logits by T before softmax. T < 1 sharpens the distribution (more conservative); T > 1 flattens it (more random). T = 0 degenerates to greedy.
Top-k: zero out everything except the top k tokens, renormalise, then sample. Cuts off the long tail of nonsense without forcing greedy behaviour.
Top-p (nucleus): keep the smallest set of tokens whose combined probability exceeds p (e.g. 0.9), then sample from that set. Adapts to the shape of the distribution: narrow when the model is confident, wide when it isn't.
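All four strategies can be sketched directly on a vector of logits. The four-token vocabulary below is invented for illustration:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def greedy(logits):
    return int(np.argmax(logits))

def sample_temperature(logits, T, rng):
    probs = softmax(logits / T)              # T < 1 sharpens, T > 1 flattens
    return int(rng.choice(len(probs), p=probs))

def sample_top_k(logits, k, rng):
    keep = np.argsort(logits)[-k:]           # indices of the k highest logits
    probs = softmax(logits[keep])            # renormalise over the survivors
    return int(keep[rng.choice(k, p=probs)])

def sample_top_p(logits, p, rng):
    order = np.argsort(logits)[::-1]         # most to least likely
    probs = softmax(logits)[order]
    cut = int(np.searchsorted(np.cumsum(probs), p)) + 1  # smallest nucleus ≥ p
    keep, nucleus = order[:cut], probs[:cut]
    return int(keep[rng.choice(cut, p=nucleus / nucleus.sum())])

logits = np.array([2.0, 1.0, 0.1, -3.0])
rng = np.random.default_rng(0)
```

With a confident distribution like this one, a low temperature, k=1, or p=0.5 all collapse to the greedy choice; the strategies only diverge when the distribution is flatter.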
The family of LLM architectures
Every modern LLM is a transformer, but not every transformer looks the same. The choice of which half of the original encoder-decoder stack you keep — and how you wire the attention — defines what the model is good at.
Decoder-only: a stack of decoder blocks with causal (masked) self-attention — every token sees only the past. Trained autoregressively to predict the next token. Dominant for chat, generation, code, and reasoning. The architecture behind virtually every assistant you've used.
Encoder-only: bidirectional self-attention — every token sees every other. Trained with masked-language-modeling (fill in the blank). Cannot generate fluent text, but produces excellent embeddings for search, classification, and re-ranking. Still the workhorse for retrieval and NLU.
Encoder-decoder: an encoder reads the input, a decoder writes the output, and cross-attention bridges the two. Natural fit when input and output are clearly distinct sequences — translation, summarisation, structured-to-text. Trained on sequence-to-sequence objectives.
Mixture of Experts (MoE): not a separate architecture — a variant usually layered onto decoder-only. Replace each block's single feed-forward layer with many parallel "experts," and a router that activates only a few per token. Total parameters explode, but per-token compute stays cheap. The trick behind very large modern models.
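A minimal sketch of the routing idea, with plain matrices standing in for the experts' two-layer MLPs (the dimensions and the top-2 choice below are illustrative, not taken from any particular model):

```python
import numpy as np

def moe_layer(x, expert_weights, router_weights, top_k=2):
    """Sparse mixture-of-experts feed-forward for one token vector x.
    The router scores every expert, only the top_k actually run, and
    their outputs are blended by renormalised router probabilities."""
    scores = router_weights @ x                # one score per expert
    top = np.argsort(scores)[-top_k:]          # experts that activate
    gates = np.exp(scores[top] - scores[top].max())
    gates = gates / gates.sum()                # softmax over the chosen experts
    out = np.zeros_like(x)
    for gate, e in zip(gates, top):
        out += gate * (expert_weights[e] @ x)  # only top_k experts compute
    return out

rng = np.random.default_rng(0)
d, n_experts = 4, 8
experts = rng.normal(size=(n_experts, d, d))   # stand-ins for expert MLPs
router = rng.normal(size=(n_experts, d))
y = moe_layer(rng.normal(size=d), experts, router, top_k=2)
```

All eight experts' parameters exist, but each token only pays for two of them — that is the whole parameters-versus-compute trade.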
Compare the four families visually
The animation flips through the four families one at a time, highlighting the key wiring difference: how attention is masked, whether two stacks talk to each other, and where MoE's router fits in.
What lives inside one transformer block
Zooming into any single block of the stack, the recipe is the same across every LLM family — only the masking and the cross-attention plumbing differ.
Self-attention: each token computes a query, key, and value, then blends in information from other tokens by query-key similarity. Causal mask in decoders, full attention in encoders.
Feed-forward network: a two-layer MLP applied independently at each position. This is where most of the parameters live and where most of the per-token "thinking" happens.
Residuals and normalisation: wrap each sub-layer with a residual connection and a normalisation step. Without these, deep stacks don't train.
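Putting the three pieces together, a single pre-norm decoder block can be sketched in NumPy. Single-head attention, ReLU, and the tiny dimensions are simplifications for readability; real blocks use many heads and learned normalisation parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def causal_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention: each position attends only to
    itself and earlier positions."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores[np.triu(np.ones(scores.shape, dtype=bool), k=1)] = -np.inf  # hide the future
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)                 # row-wise softmax
    return weights @ v

def transformer_block(x, attn_w, mlp_w1, mlp_w2):
    """Pre-norm block: attention then MLP, each wrapped in a
    normalisation step and a residual connection."""
    x = x + causal_attention(layer_norm(x), *attn_w)
    h = layer_norm(x) @ mlp_w1
    return x + np.maximum(h, 0) @ mlp_w2        # two-layer MLP with ReLU

rng = np.random.default_rng(0)
seq, d, d_ff = 5, 8, 32
x = rng.normal(size=(seq, d))
attn_w = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
out = transformer_block(x, attn_w, rng.normal(size=(d, d_ff)) * 0.1,
                        rng.normal(size=(d_ff, d)) * 0.1)
```

Swapping the causal mask for full attention turns this into an encoder block; everything else is identical.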
For the full mechanics — Q/K/V projections, multi-head attention, positional encoding, the encoder/decoder block diagrams, and an animated walk-through — see How Transformers Work.
What an LLM is not good at
- Hallucination — the model is sampling from a distribution, not looking facts up. When the right answer isn't sharply represented in its training data, it can produce confident but invented details.
- Knowledge cutoff — pre-training data is a frozen snapshot. The model doesn't know about anything that happened after its cutoff date unless you give it tools or context.
- Context window — attention is O(n²) in sequence length. Models have a hard limit on how many tokens they can attend to at once. Past that, information has to be summarised or retrieved.
- No grounding by default — out of the box an LLM has no internet, no database, no calculator. Tool use and retrieval (see How RAG Works) are bolted on to fix this.
- Stateless between calls — the model has no memory across separate conversations. Whatever context you want it to use must be passed in every time.