Diffusion Models — Image Generation Intuition
Generate by removing noise
Diffusion models, the approach behind Stable Diffusion, DALL·E, and Midjourney, create images with a counter-intuitive trick: they learn to undo noise. Generation then sculpts a picture out of pure random static, a little at a time.
The method has two halves. Forward: take a real image and add a small amount of Gaussian noise, over and over, until only static remains; this half is fixed and involves no learning. Reverse: train a network to predict the noise added at each step so it can be removed. Once the network can denoise, you start from random static and run the reverse process, and out comes a brand-new image.
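To make the two halves concrete, here is a minimal sketch of the forward noising step and the training target. It relies on several assumptions not stated above: a linear noise schedule with 1000 steps, the standard closed-form shortcut that jumps straight to step t instead of adding noise t times, and a mean-squared-error loss on the noise. All function names are illustrative, and a short list of floats stands in for an image.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances rising linearly (an assumed, commonly used choice)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_signal(betas):
    """alpha_bar[t]: running product of (1 - beta), the share of signal variance left at step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(image, t, alpha_bars, rng):
    """Jump to step t: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    keep = math.sqrt(alpha_bars[t])
    spread = math.sqrt(1.0 - alpha_bars[t])
    noisy = [keep * x + spread * e for x, e in zip(image, noise)]
    return noisy, noise

def training_loss(predicted_noise, true_noise):
    """Mean squared error between the network's noise guess and the noise actually added."""
    return sum((p - n) ** 2 for p, n in zip(predicted_noise, true_noise)) / len(true_noise)

rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_signal(betas)

image = [0.5, -0.25, 1.0, 0.0]  # toy "image": four pixel values in [-1, 1]
noisy, noise = add_noise(image, t=999, alpha_bars=alpha_bars, rng=rng)

# An untrained guess of "no noise at all" gives a baseline loss to improve on.
baseline = training_loss([0.0] * len(noise), noise)
```

During training, a real system would pick a random step t for each image, call something like `add_noise`, feed the noisy result and t to the network, and minimise `training_loss` on its prediction.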
Forward, then reverse
Watch a picture dissolve into noise (forward), then a fresh sample emerge from random static as the model denoises step by step (reverse).
The pieces
Forward process: a fixed schedule corrupts an image to pure Gaussian noise over many steps. No learning here.
Denoiser: a network trained to predict the noise that was added, so it can be subtracted off.
Sampling: start from random static and denoise repeatedly to get a new image the model "imagines".
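The three pieces fit together in a sampling loop like the sketch below. It assumes the DDPM-style update rule and a linear schedule, neither of which is specified above. Because there is no trained network here, `oracle_denoiser` is a hypothetical stand-in: it is the exact noise predictor for a "dataset" containing a single target image, so the loop always lands back on that target. A real trained network would instead produce varied new images.

```python
import math
import random

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear betas plus their cumulative signal fractions (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    betas = [beta_start + i * step for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars

def oracle_denoiser(target, alpha_bars):
    """Hypothetical stand-in for a trained network: exact noise for a one-image dataset."""
    def predict_noise(x_t, t):
        keep = math.sqrt(alpha_bars[t])
        spread = math.sqrt(1.0 - alpha_bars[t])
        return [(x - keep * x0) / spread for x, x0 in zip(x_t, target)]
    return predict_noise

def sample(predict_noise, size, betas, alpha_bars, rng):
    """Start from pure static and remove a little predicted noise at every step."""
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t)
        scale = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - scale * e) / math.sqrt(1.0 - betas[t]) for xi, e in zip(x, eps)]
        if t > 0:
            # Re-inject a small amount of fresh noise on every step except the last.
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x

rng = random.Random(0)
betas, alpha_bars = make_schedule(1000)
target = [0.5, -0.25, 1.0, 0.0]  # the single toy "image" the oracle knows about
generated = sample(oracle_denoiser(target, alpha_bars), len(target), betas, alpha_bars, rng)
```

The loop calls the denoiser once per step, which is why sampling cost grows with the number of steps.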
Steering it with a prompt
To make "an astronaut riding a horse", the denoiser is conditioned on a text embedding (from a model like CLIP). At every denoising step, that prompt nudges the image toward matching the words. Modern systems also denoise in a compressed latent space for speed (latent diffusion).
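The text above does not say how the prompt "nudges" each step. One widely used mechanism is classifier-free guidance, sketched here as an assumption rather than as the method of any particular system: the denoiser is run twice, once with the prompt and once without, and the two noise predictions are combined. The function name and the `guidance_scale` parameter are illustrative.

```python
def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Push the noise prediction away from the unconditional guess, toward the prompted one.

    guidance_scale = 0 ignores the prompt, 1 uses the prompted prediction as-is,
    and larger values follow the prompt more strongly.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy predictions for a two-pixel image: the prompt only changes the first pixel.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 1.0]
strongly_guided = guided_noise(eps_uncond, eps_cond, guidance_scale=2.0)
```

The combined prediction then takes the place of the plain noise prediction in each denoising step, so the prompt influences every step of sampling.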
Strengths
- High-quality, diverse images
- More stable to train than older GANs
- Flexible conditioning (text, images, sketches)
Limitations
- Slow to sample: each image takes many denoising steps
- Heavy compute to train
- Less direct control than editing tools