Diffusion Models — Image Generation Intuition

Tags: Gen AI, image generation, denoising, diffusion

Generate by removing noise

Diffusion models, the technique behind Stable Diffusion, DALL·E, and Midjourney, create images with a counter-intuitive trick: they learn to undo noise. Generating an image is then like sculpting: the model starts from pure random static and carves a picture out of it, a little at a time.

Training has two halves. Forward: take a real image and add a small amount of Gaussian noise, over and over, until only static is left. Reverse: train a network to predict the noise present at each step so that it can be removed. Once the network can denoise, you start from random static and run the reverse process, and out comes a brand-new image.

Forward, then reverse

[Interactive demo: a picture dissolves into noise (forward), then a fresh sample emerges from random static as the model denoises step by step (reverse).]

The pieces

Forward process: add noise

A fixed noise schedule corrupts an image into pure Gaussian noise over many small steps. Nothing is learned in this stage.
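
To make the forward process concrete, here is a minimal NumPy sketch. It assumes a linear schedule with 1000 steps and beta values from 1e-4 to 0.02 (a common choice, not something this page specifies). The names make_schedule and add_noise are made up for illustration. A handy property of Gaussian noise is that you can jump straight to any step t instead of looping through every step before it.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (assumed values, not from the text above)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    # alpha_bars[t] is how much of the original signal survives up to step t.
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Jump directly to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))         # stand-in "image" with values in [-1, 1]
x_mid, eps_mid = add_noise(x0, 499, alpha_bars, rng)  # partly noisy
x_end, eps_end = add_noise(x0, 999, alpha_bars, rng)  # almost pure static
```

By the last step alpha_bars is close to zero, so almost nothing of the original image remains in x_end.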

Denoiser: a U-Net

A network trained to predict the noise that was added, so it can be subtracted off.
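
A rough sketch of what "trained to predict the noise" can look like as a loss. It assumes the standard noise-prediction objective (mean squared error between the true and predicted noise) and the same linear schedule as a typical setup. The function denoising_loss and the stand-in zero_denoiser are hypothetical. A real system would use a U-Net and a deep learning framework that updates the weights from this loss.

```python
import numpy as np

def denoising_loss(predict_noise, x0, alpha_bars, rng):
    """One random sample of the noise-prediction loss for a single image."""
    t = int(rng.integers(0, len(alpha_bars)))   # pick a random step
    eps = rng.standard_normal(x0.shape)         # the noise that gets added
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    eps_hat = predict_noise(x_t, t)             # the network's guess at that noise
    return float(np.mean((eps_hat - eps) ** 2))

# Hypothetical stand-in for an untrained network: always guesses "no noise".
def zero_denoiser(x_t, t):
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0 = rng.uniform(-1.0, 1.0, size=(16, 16))                    # stand-in "image"
loss = denoising_loss(zero_denoiser, x0, alpha_bars, rng)
```

Guessing zero everywhere gives a loss near 1, the variance of the added noise. Training pushes the loss below that as the network learns what noise looks like at each step.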

Reverse process: sample

Start from random static and denoise repeatedly → a new image the model "imagines".
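
The sampling loop can be sketched as follows. It assumes the basic DDPM-style update with a linear schedule. Real samplers vary and often use far fewer steps. The function sample is illustrative, and oracle_denoiser is a hypothetical stand-in for a trained network: it behaves like a perfect denoiser for a "dataset" containing one known image, which makes the loop easy to check.

```python
import numpy as np

def sample(predict_noise, shape, betas, rng):
    """Start from static and denoise step by step (DDPM-style update, assumed)."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                  # pure random static
    for t in range(len(betas) - 1, -1, -1):         # walk the steps backwards
        eps_hat = predict_noise(x, t)
        # Subtract this step's share of the predicted noise, then rescale.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Re-inject a little fresh noise on every step except the last.
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean
    return x

betas = np.linspace(1e-4, 0.02, 1000)               # assumed schedule
alpha_bars = np.cumprod(1.0 - betas)
target = np.linspace(-1.0, 1.0, 64).reshape(8, 8)   # the only "image" in the toy dataset

def oracle_denoiser(x_t, t):
    # Hypothetical: exact noise prediction when every training image equals `target`.
    return (x_t - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

image = sample(oracle_denoiser, target.shape, betas, np.random.default_rng(0))
```

With this toy oracle the loop lands back on target. A network trained on many images has no single answer to return, so different starting static leads to different new images.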

Steering it with a prompt

Text conditioning

To make "an astronaut riding a horse", the denoiser is conditioned on a text embedding (from a model like CLIP). At every denoising step, that prompt nudges the image toward matching the words. Modern systems also denoise in a compressed latent space for speed (latent diffusion).
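
The text above doesn't say how the prompt does its nudging. One widely used mechanism is classifier-free guidance, which is assumed here: the denoiser runs twice per step, once with the prompt embedding and once with an "empty" embedding, and the difference between the two predictions is amplified. The names guided_noise, toy_denoiser, text_emb, and null_emb are hypothetical, and the default scale of 7.5 is a commonly used value, not one taken from this page.

```python
import numpy as np

def guided_noise(predict_noise, x_t, t, text_emb, null_emb, guidance_scale=7.5):
    """Classifier-free guidance (assumed mechanism): push the prediction toward the prompt."""
    eps_uncond = predict_noise(x_t, t, null_emb)   # prediction that ignores the prompt
    eps_cond = predict_noise(x_t, t, text_emb)     # prediction that sees the prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Hypothetical toy denoiser: its output shifts with the embedding it is given.
def toy_denoiser(x_t, t, emb):
    return 0.1 * x_t + emb.mean()

rng = np.random.default_rng(0)
x_t = rng.standard_normal((8, 8))      # current noisy image (or latent)
text_emb = np.full(4, 0.5)             # stand-in for a text-encoder output
null_emb = np.zeros(4)                 # stand-in for the "empty prompt" embedding

eps_plain = guided_noise(toy_denoiser, x_t, 10, text_emb, null_emb, guidance_scale=1.0)
eps_strong = guided_noise(toy_denoiser, x_t, 10, text_emb, null_emb)
```

A scale of 1 just returns the prompt-conditioned prediction. Larger values follow the prompt more closely, usually at some cost in variety.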

Strengths
  • Stunning, diverse image quality
  • Stable to train (vs older GANs)
  • Flexible conditioning (text, images, sketches)
Trade-offs
  • Slow to sample: each image takes many sequential denoising steps
  • Heavy compute to train
  • Less precise control than manual editing tools: a prompt steers the result but does not specify it exactly