Multimodal Models
Same transformer, new senses
How does the model that completes your sentences also describe your photos? The trick is almost embarrassingly simple: a transformer never cared what a token means. It just takes a sequence of vectors and lets attention work out the relationships. An LLM turns text into vectors through an embedding table — but nothing says the vectors have to come from text. If you can turn an image, a sound, or a video frame into token-like vectors, the exact same architecture can read it.
So the whole field of multimodal models boils down to one engineering question: how do you tokenize things that aren't words?
Images become tokens
The standard answer for images is the Vision Transformer (ViT) recipe: chop the image into a grid of small patches — say 16×16 pixels each — flatten every patch into one vector, and add a position embedding so the model knows where each patch sat. A 224×224 photo becomes a sequence of 196 patch vectors, and from the transformer's point of view that's just a 196-token "sentence". A vision encoder then refines those patches into image tokens that carry meaning, not just raw pixel values.
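The patch arithmetic above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's code: the random image, the embedding width `d_model`, and the random weight matrices are stand-ins for a real photo and learned parameters. The linear projection before the position embedding is the usual ViT step, added here as an assumption because the text only mentions flattening.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    # One flat vector per patch: (rows * cols, patch * patch * C)
    return grid.reshape(rows * cols, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))      # stand-in for a real photo
patches = patchify(image)              # 14 x 14 grid of 16x16x3 patches

d_model = 64                           # hypothetical embedding width
# Random stand-ins for parameters that a real model would learn.
w_embed = rng.normal(0, 0.02, (patches.shape[1], d_model))
pos_embed = rng.normal(0, 0.02, (patches.shape[0], d_model))

# Project each flat patch to d_model and add its position embedding.
tokens = patches @ w_embed + pos_embed
print(patches.shape, tokens.shape)     # (196, 768) (196, 64)
```

Each of the 196 rows in `tokens` plays the role a word embedding plays in a text model.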
That is a real break from how CNNs see images. A CNN slides small filters across the image and builds up features layer by layer, whereas a ViT treats patches like words and lets attention relate any patch to any other within a single layer.
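To make the "any patch to any other" point concrete, here is a rough sketch of one attention head's weights over 196 patch tokens. The token values and the `w_q` / `w_k` matrices are random placeholders rather than trained weights, so only the shapes and the all-pairs connectivity are meaningful.

```python
import numpy as np

def attention_weights(x: np.ndarray, w_q: np.ndarray, w_k: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention weights for a single head."""
    q, k = x @ w_q, x @ w_k
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_patches, d_model = 196, 64                       # 14 x 14 patches, hypothetical width
x = rng.normal(size=(n_patches, d_model))          # stand-in for patch tokens
w_q = rng.normal(0, d_model ** -0.5, (d_model, d_model))
w_k = rng.normal(0, d_model ** -0.5, (d_model, d_model))

weights = attention_weights(x, w_q, w_k)

# weights[i, j] is how much patch i attends to patch j.
# Even opposite corners of the image are connected in this one layer.
top_left, bottom_right = 0, n_patches - 1
print(weights.shape, weights[top_left, bottom_right] > 0)
```

A convolution with a 3×3 filter, by contrast, would need many stacked layers before the top-left corner could influence the bottom-right one.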
Gluing vision onto a language model
The vision encoder is usually pretrained contrastively on millions of image–text pairs, so its image features already line up with language.
The projector is a tiny MLP that maps image tokens into the LLM's embedding space. It is often the only freshly trained part.
In the LLM, image tokens and text tokens flow through the same model, so your words can attend directly to patches.
This three-piece recipe — popularized by LLaVA — is how most open vision-language models are built: take a strong vision encoder, take a strong LLM, train a small projector to bridge them, then fine-tune on image-question-answer data. Newer families like Llama 4 and Gemma 3 go further and are natively multimodal — trained on interleaved image and text from early on rather than glued together afterwards.
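The bridge between the two pretrained pieces can be sketched as follows. This is a toy illustration under stated assumptions, not LLaVA's actual code: the dimensions, the token IDs, and every weight matrix are hypothetical random stand-ins, and ReLU is used for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_llm, vocab_size = 32, 64, 1000         # hypothetical toy sizes

def projector(image_tokens, w1, b1, w2, b2):
    """Two-layer MLP mapping vision features into the LLM's embedding space."""
    hidden = np.maximum(image_tokens @ w1 + b1, 0.0)   # ReLU, chosen here for brevity
    return hidden @ w2 + b2

# Stand-in for the vision encoder's output: 196 image tokens.
image_tokens = rng.normal(size=(196, d_vision))

# Projector parameters: in this recipe, the part trained from scratch.
w1 = rng.normal(0, 0.02, (d_vision, d_llm))
b1 = np.zeros(d_llm)
w2 = rng.normal(0, 0.02, (d_llm, d_llm))
b2 = np.zeros(d_llm)

# Text side: an ordinary embedding-table lookup.
embedding_table = rng.normal(0, 0.02, (vocab_size, d_llm))
question_ids = np.array([12, 845, 3, 77, 9])       # hypothetical token IDs
text_embeds = embedding_table[question_ids]

# Image tokens and text tokens become one sequence for the LLM.
image_embeds = projector(image_tokens, w1, b1, w2, b2)
sequence = np.concatenate([image_embeds, text_embeds], axis=0)
print(sequence.shape)                              # (201, 64)
```

After the concatenation the LLM sees 201 vectors of the same width and has no built-in notion of which ones came from pixels.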
From pixels to tokens to answers
A photo is sliced into patches, encoded into image tokens, and concatenated with a text question into one sequence that runs through one transformer. Audio and video then join the very same pipeline.
Understanding vs generating
Understanding covers VQA (visual question answering), OCR, and chart and screenshot reading: encode the image into tokens, then generate text about it.
Image generation is usually handled by a separate diffusion model steered by text conditioning: the LLM reads, diffusion paints.
Other modalities follow the same pattern: audio becomes spectrogram patches, and video becomes per-frame patch tokens. Tokenize it, and the model can read it.
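The audio and video cases can be sketched with the same patching idea. This is a simplified illustration: the 16 kHz sample rate, the frame and hop sizes, and the grayscale 64×64 video are all assumptions, and a plain magnitude spectrogram stands in for the log-mel features that speech models more commonly use.

```python
import numpy as np

def to_patches(grid: np.ndarray, patch: int) -> np.ndarray:
    """Cut a 2-D array into flattened patch x patch tiles, dropping ragged edges."""
    rows, cols = grid.shape[0] // patch, grid.shape[1] // patch
    grid = grid[: rows * patch, : cols * patch]
    tiles = grid.reshape(rows, patch, cols, patch).transpose(0, 2, 1, 3)
    return tiles.reshape(rows * cols, patch * patch)

def spectrogram(wave: np.ndarray, frame: int = 256, hop: int = 128) -> np.ndarray:
    """Magnitude spectrogram from overlapping Hann-windowed frames."""
    n_frames = 1 + (len(wave) - frame) // hop
    frames = np.stack([wave[i * hop : i * hop + frame] for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * np.hanning(frame), axis=1))

rng = np.random.default_rng(0)

# Audio: one second of noise at an assumed 16 kHz sample rate.
wave = rng.normal(size=16000)
spec = spectrogram(wave)                    # time frames x frequency bins
audio_tokens = to_patches(spec, patch=16)   # each row is one spectrogram patch

# Video: 8 grayscale 64x64 frames, patched frame by frame and concatenated.
video = rng.random((8, 64, 64))
video_tokens = np.concatenate([to_patches(f, patch=16) for f in video])

print(spec.shape, audio_tokens.shape, video_tokens.shape)
```

In both cases the result is a 2-D array of patch vectors, ready for the same projection and position-embedding step used for images.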
Unified "any-to-any" models that both read and generate every modality inside one network do exist, but the split design — encoder for understanding, diffusion for generation — is still the workhorse.
The honest catches
A single detailed image can cost hundreds to thousands of tokens — a few screenshots can eat a context window alarmingly fast. And fine visual details still trip models: tiny text, counting objects, precise spatial relations. Higher resolution helps but multiplies the token bill, so every VLM picks a point on the resolution vs token-budget tradeoff — many tile large images into multiple crops, paying for each one.
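A back-of-the-envelope calculator shows how quickly tiling multiplies the bill. The scheme below is hypothetical: it assumes 224×224 tiles with 16×16 patches (196 tokens per tile, matching the earlier example) and no token compression, whereas real models differ in tile size, tile caps, and pooling.

```python
import math

def image_token_cost(width: int, height: int, tile: int = 224, patch: int = 16) -> int:
    """Rough token count when an image is covered by fixed-size tiles.

    Hypothetical scheme: every tile is encoded separately and costs
    (tile // patch) ** 2 tokens, with no pooling or tile cap.
    """
    tokens_per_tile = (tile // patch) ** 2
    n_tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return n_tiles * tokens_per_tile

for w, h in [(224, 224), (896, 672), (1920, 1080)]:
    print(f"{w}x{h}: {image_token_cost(w, h)} tokens")
# 224x224: 196 tokens
# 896x672: 2352 tokens
# 1920x1080: 8820 tokens
```

Under these assumptions a single full-HD screenshot costs 45 times as much as the 224×224 thumbnail, which is why many models cap the number of tiles or downsample first.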