Multimodal Models · Suman Bhadra Notes

Same transformer, new senses

How does the model that completes your sentences also describe your photos? The trick is almost embarrassingly simple: a transformer never cared what a token means. It just takes a sequence of vectors and lets attention work out the relationships. An LLM turns text into vectors through an embedding table — but nothing says the vectors have to come from text. If you can turn an image, a sound, or a video frame into token-like vectors, the exact same architecture can read it.

So the whole field of multimodal models boils down to one engineering question: how do you tokenize things that aren't words?

Images become tokens

The standard answer for images is the Vision Transformer (ViT) recipe: chop the image into a grid of small patches — say 16×16 pixels each — flatten every patch into one vector, and add a position embedding so the model knows where each patch sat. A 224×224 photo becomes a sequence of 196 patch vectors, and from the transformer's point of view that's just a 196-token "sentence". A vision encoder then refines those patches into image tokens that carry meaning, not just raw pixel values.
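
The patch recipe above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's implementation: the random image, the embedding width `d_model`, and the random `w_embed` and `pos_embed` matrices are all stand-ins chosen for the example. It also assumes the usual ViT step of linearly projecting each flattened patch before the position embedding is added.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be a multiple of patch size"
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    # One row per patch, read left to right, top to bottom.
    return grid.reshape(rows * cols, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3), dtype=np.float32)  # stand-in for a real photo

patches = patchify(image)  # 14 x 14 = 196 patches, each 16*16*3 = 768 values

d_model = 64  # hypothetical embedding width, kept small for the example
w_embed = rng.normal(0, 0.02, (patches.shape[1], d_model)).astype(np.float32)
pos_embed = rng.normal(0, 0.02, (patches.shape[0], d_model)).astype(np.float32)

# Linear projection plus a position embedding per patch slot.
tokens = patches @ w_embed + pos_embed

print(patches.shape)  # (196, 768)
print(tokens.shape)   # (196, 64)
```

In a trained model `w_embed` and `pos_embed` are learned parameters; here they are random, so only the shapes are meaningful.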

This is a real break from how CNNs see images. A CNN slides small filters across the picture and builds up features layer by layer, so distant regions only interact deep in the network. A ViT treats patches like words and lets attention relate any patch to any other within a single layer.
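
To make the "any patch to any other" point concrete, here is a rough single-head self-attention sketch over 196 patch tokens. The token values and the `w_q`, `w_k`, `w_v` matrices are random placeholders, and the width of 32 is an arbitrary choice for the example. Real ViTs use multiple heads plus residual connections and MLP blocks, which are omitted here.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # one score per (patch, patch) pair
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n_patches, d_model = 196, 32                 # 14 x 14 grid; toy width
tokens = rng.normal(size=(n_patches, d_model))
w_q, w_k, w_v = (
    rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(3)
)

out, weights = self_attention(tokens, w_q, w_k, w_v)
print(weights.shape)  # (196, 196)
```

The `(196, 196)` weight matrix is the key contrast with a convolution: the top-left patch gets a direct weight on the bottom-right patch in one layer, with no need to stack layers until receptive fields overlap.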

Gluing vision onto a language model

Vision encoder (CLIP-style)

Usually pretrained contrastively on hundreds of millions of image–text pairs, so its image features already line up with language.

Projector (small adapter)

A tiny MLP (sometimes just a single linear layer) that maps image tokens into the LLM's embedding space. Often the only freshly trained part.

One transformer (shared)

Image tokens and text tokens flow through the same model — your words can attend directly to patches.
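
The contrastive pretraining mentioned for the vision encoder can be sketched as a symmetric cross-entropy over an image-by-text similarity matrix. This is a simplified illustration of a CLIP-style objective, not the original training code: the embeddings are random stand-ins for encoder outputs, and the fixed `temperature` of 0.07 is an assumption (in practice it is usually learned).

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matching pair."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature   # (batch, batch) cosine similarities
    diag = np.arange(len(logits))
    # Each image should pick its own caption, and each caption its own image.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 16))                     # stand-in text embeddings
aligned = text + 0.01 * rng.normal(size=(8, 16))    # images that nearly match their captions
unrelated = rng.normal(size=(8, 16))                # images with no relation to the captions

loss_aligned = clip_style_loss(aligned, text)
loss_unrelated = clip_style_loss(unrelated, text)
print(loss_aligned < loss_unrelated)
```

Minimizing this loss pulls matching image and text embeddings together and pushes mismatched ones apart, which is why the resulting image features "line up with language".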

This three-piece recipe — popularized by LLaVA — is how most open vision-language models are built: take a strong vision encoder, take a strong LLM, train a small projector to bridge them, then fine-tune on image-question-answer data. Newer families like Llama 4 go further and are natively multimodal — trained on interleaved image and text from early on — while others like Gemma 3 bake a (frozen) vision encoder into pretraining rather than gluing it on afterwards.
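
The bridging step in this recipe can be sketched as follows. Everything here is a toy stand-in: the dimensions, the random "vision encoder output", the random embedding table, and the question token ids are hypothetical. The two-layer MLP with a GELU in between is one common projector shape, assumed here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_llm, vocab = 48, 64, 100   # hypothetical toy sizes

def gelu(x: np.ndarray) -> np.ndarray:
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

class Projector:
    """Two-layer MLP mapping vision features into the LLM embedding space."""
    def __init__(self, d_in: int, d_out: int, rng: np.random.Generator):
        self.w1 = rng.normal(0, 0.02, (d_in, d_out))
        self.b1 = np.zeros(d_out)
        self.w2 = rng.normal(0, 0.02, (d_out, d_out))
        self.b2 = np.zeros(d_out)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return gelu(x @ self.w1 + self.b1) @ self.w2 + self.b2

image_tokens = rng.normal(size=(196, d_vision))     # stand-in for vision encoder output
embedding_table = rng.normal(size=(vocab, d_llm))   # stand-in for the LLM's embedding table
question_ids = np.array([5, 17, 42, 8])             # hypothetical token ids for a question

projector = Projector(d_vision, d_llm, rng)
image_embeds = projector(image_tokens)              # (196, 64), now in LLM space
text_embeds = embedding_table[question_ids]         # (4, 64), ordinary embedding lookup

# One sequence for one transformer: image tokens first, then the question.
sequence = np.concatenate([image_embeds, text_embeds], axis=0)
print(sequence.shape)  # (200, 64)
```

Once concatenated, the LLM sees 200 vectors of the same width and has no structural way to tell which came from pixels and which from words; only the projector's weights had to be trained from scratch.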

From pixels to tokens to answers

The full pipeline: a photo is sliced into patches, encoded into image tokens, and concatenated with a text question into one sequence that runs through one transformer. Audio and video join the very same pipeline.

Understanding vs generating

Reading images (encoder path)

Visual question answering (VQA), OCR, chart and screenshot reading: encode the image into tokens, then generate text about it.

Making images (diffusion)

Generation is usually a separate diffusion model steered by text conditioning — the LLM reads, diffusion paints.

Audio & video (same trick)

Audio becomes spectrogram patches; video becomes per-frame patch tokens. Tokenize it, and the model can read it.

Unified "any-to-any" models that both read and generate every modality inside one network do exist, but the split design — encoder for understanding, diffusion for generation — is still the workhorse.
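
For a feel of what "tokenize it" means for audio and video, here is a back-of-the-envelope token count. All defaults are assumptions picked for illustration: 100 spectrogram frames per second (a 10 ms hop), 128 mel bins, non-overlapping 16×16 patches, and video sampled at 1 frame per second at 224×224. Real models differ and often overlap, pool, or compress tokens.

```python
def audio_tokens(seconds: float, frames_per_second: int = 100,
                 mel_bins: int = 128, patch: int = 16) -> int:
    """Patches in a (time x frequency) spectrogram, non-overlapping."""
    frames = int(seconds * frames_per_second)
    return (frames // patch) * (mel_bins // patch)

def video_tokens(seconds: float, sampled_fps: float = 1.0,
                 frame_size: int = 224, patch: int = 16) -> int:
    """Per-frame patch tokens, summed over the sampled frames."""
    frames = int(seconds * sampled_fps)
    per_frame = (frame_size // patch) ** 2
    return frames * per_frame

print(audio_tokens(10))  # 496   (62 time patches x 8 frequency patches)
print(video_tokens(60))  # 11760 (60 frames x 196 patches)
```

Under these assumptions a single minute of video already costs over ten thousand tokens, which is why frame sampling rate is one of the first knobs video models tune.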

The honest catches

Images are expensive tokens

A single detailed image can cost hundreds to thousands of tokens — a few screenshots can eat a context window alarmingly fast. And fine visual details still trip models: tiny text, counting objects, precise spatial relations. Higher resolution helps but multiplies the token bill, so every VLM picks a point on the resolution vs token-budget tradeoff — many tile large images into multiple crops, paying for each one.
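
The resolution versus token-budget tradeoff is easy to see with a rough cost estimate. The tiling scheme below is a hypothetical one: 336×336 tiles, 14×14 patches, and one extra downscaled overview tile whenever the image needs more than one tile. Actual models use their own tile sizes and often merge or pool tokens, so treat the numbers as illustrative only.

```python
import math

def image_token_cost(width: int, height: int, tile: int = 336, patch: int = 14):
    """Estimate tiles and tokens for an image under a simple tiling scheme."""
    tiles_x = math.ceil(width / tile)
    tiles_y = math.ceil(height / tile)
    n_tiles = tiles_x * tiles_y
    if n_tiles > 1:
        n_tiles += 1                        # assumed low-resolution overview tile
    tokens_per_tile = (tile // patch) ** 2  # 24 x 24 = 576 with the defaults
    return n_tiles, n_tiles * tokens_per_tile

for w, h in [(336, 336), (1280, 720), (1920, 1080)]:
    n_tiles, tokens = image_token_cost(w, h)
    print(f"{w}x{h}: {n_tiles} tiles, {tokens} tokens")
# 336x336: 1 tiles, 576 tokens
# 1280x720: 13 tiles, 7488 tokens
# 1920x1080: 25 tiles, 14400 tokens
```

Going from a small crop to a full-HD screenshot multiplies the cost by 25 in this sketch, so three or four screenshots would fill a large share of a 64k-token context.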