DDPM (2020) — Images out of Pure Noise

The world before this paper

In 2020, the best images a computer could dream up came out of a knife fight. GANs produced stunning samples, but only through an adversarial duel between two networks that was notoriously easy to destabilize. The stable alternatives produced blurry samples. And buried in a 2015 paper (Sohl-Dickstein et al.) sat a beautiful idea, generation by gradual denoising, that nobody had yet made competitive.

GANs: sharp but savage

They ruled image generation but were hard to train: instability, mode collapse, and an endless bag of tricks to keep the two networks in balance.

VAEs & flows: stable but soft

Both trained calmly and covered the data, but VAE samples looked blurry, and flows demanded restrictive, invertible architectures.

Diffusion, v0: idea since 2015

Slowly noise an image, learn to reverse it — elegant on paper, but the early versions simply couldn't compete on sample quality.

The key idea

Enter Jonathan Ho, Ajay Jain and Pieter Abbeel at UC Berkeley, with "Denoising Diffusion Probabilistic Models" (NeurIPS 2020). They dug up the five-year-old diffusion idea and made one bet: stop asking the network for the clean image. Ask it for the noise. Pick a random step of the corruption process and train a U-Net (told which timestep it's looking at) to predict ε, the total Gaussian noise blended into that frame so far, using nothing but a plain MSE loss. No adversary. No duel. Just regression, across all T = 1000 noise levels.
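A minimal sketch of that noise-prediction objective, using only the Python standard library and treating an "image" as a flat list of floats. The linear schedule endpoints (1e-4 to 0.02) are the values reported in the DDPM paper rather than anything stated here, `predict_noise` is a hypothetical stand-in for the timestep-conditioned U-Net, and every function name is illustrative.

```python
import math
import random

T = 1000  # number of diffusion steps, as in the text


def make_alpha_bars(num_steps=T, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear schedule (assumed endpoints)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_image(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e for x, e in zip(x0, eps)]
    return xt, eps


def training_loss(predict_noise, x0, alpha_bars, rng):
    """One training example: random timestep, then MSE between true and predicted noise."""
    t = rng.randrange(len(alpha_bars))
    xt, eps = noise_image(x0, t, alpha_bars, rng)
    eps_hat = predict_noise(xt, t)  # hypothetical model call
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(eps)
```

In a real training loop this loss would be averaged over a batch and minimized by gradient descent on the network's weights. A predictor that always outputs zeros would score about 1 on average, which is the variance of the noise it fails to predict.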

That one simplification turned a curiosity into a contender. Each denoising step is so small that the network's job is almost trivial — and a thousand almost-trivial jobs, chained together, can do something that looks like magic.

The paper in one sentence

Define a fixed forward process that destroys an image with ~1000 tiny doses of Gaussian noise, train a network to look at any noisy frame and predict the total noise mixed into it — then generate by running the film backwards, denoising pure static step by step into a brand-new image.
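The "running the film backwards" half can be sketched as the loop below. It is a rough standard-library rendering of ancestral sampling, assuming the linear schedule from the DDPM paper and the simple choice of per-step variance equal to beta_t; `predict_noise` is again a hypothetical placeholder for the trained network, not a real API.

```python
import math
import random


def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule; the endpoint values are assumptions taken from the DDPM paper."""
    return [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
            for i in range(num_steps)]


def sample(predict_noise, dim, betas, rng):
    """Start from pure Gaussian static and denoise one step at a time."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)

    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # x_T: pure noise
    for t in reversed(range(len(betas))):
        beta, abar = betas[t], alpha_bars[t]
        eps_hat = predict_noise(x, t)  # hypothetical model call
        # Subtract the scaled noise estimate to get the mean of the slightly cleaner frame.
        coef = beta / math.sqrt(1.0 - abar)
        mean = [(xi - coef * e) / math.sqrt(1.0 - beta) for xi, e in zip(x, eps_hat)]
        # Re-inject a little fresh noise (variance beta_t), except at the final step.
        sigma = math.sqrt(beta) if t > 0 else 0.0
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x
```

One way to sanity-check the arithmetic is to plug in an exact noise predictor for a dataset containing a single point: the final step then lands on that point. Producing varied, novel images is entirely the job of the learned network, which this sketch does not include.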

Want the full mechanics — noise schedules, the objective, the sampling loop? See Diffusion mechanics.

Watch the film run both ways

One animation, both directions: a photo drowns in noise, the network learns to predict the total noise mixed into any frame — then a fresh patch of static gets denoised into an image that never existed.

The results that mattered

On CIFAR-10, the boring training loop beat the drama. The numbers said diffusion wasn't a curiosity anymore — it was a peer of the best GANs, with none of their failure modes.

FID 3.17: GAN-grade quality

An FID of 3.17 on unconditional CIFAR-10 (lower is better), on par with the best GANs of the day, minus the instability, the collapse, the tricks.

T = 1000: tiny steps

A thousand small noising steps, each one an easy learning problem. The chain does the heavy lifting, not any single step.
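To make "small steps, big cumulative effect" concrete, the sketch below walks a linear schedule and reports the largest single-step noise variance alongside the fraction of the original signal variance that survives all T steps. The schedule endpoints are assumed from the DDPM paper, and the function name is illustrative.

```python
def signal_remaining(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Return (largest per-step noise variance, fraction of signal variance left at step T)."""
    alpha_bar = 1.0
    largest_beta = 0.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        largest_beta = max(largest_beta, beta)
        alpha_bar *= 1.0 - beta  # each step keeps a (1 - beta) share of the signal variance
    return largest_beta, alpha_bar
```

Under these assumptions no single step adds noise with variance above 0.02, yet the compounded product leaves less than 0.01% of the signal variance, which is why the end of the chain is treated as pure static.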

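1 MSE loss: the whole objective

Predict the noise, compare, repeat. No adversary and no minimax game, so training is stable. And because the loss derives from a likelihood bound over the whole dataset, it pushes the model to cover every mode rather than collapse onto a few.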

Legacy — and the catch

What it unlocked

Stable training and broad mode coverage: GAN pain, gone
Scaled into the text-to-image revolution (Stable Diffusion, DALL·E 2, Imagen)
A clean probabilistic framework with knobs theory can hold

The limits

Sampling took ~1000 sequential network passes, versus a GAN's single pass
Pixel-space diffusion at high resolution is brutally expensive
Took latent diffusion and faster samplers (DDIM) to become practical
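As a rough illustration of how a faster sampler cuts the step count, here is a DDIM-style deterministic update (the eta = 0 case) that visits only an evenly spaced subset of the timesteps. This is a sketch of the general technique, not code from either paper: the schedule values are assumed, `predict_noise` is a hypothetical model call, and the function names are illustrative.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear schedule (assumed endpoints)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        running *= 1.0 - (beta_start + (beta_end - beta_start) * i / (num_steps - 1))
        alpha_bars.append(running)
    return alpha_bars


def ddim_sample(predict_noise, dim, alpha_bars, num_steps, rng):
    """Deterministic sampling over num_steps of the original timesteps (num_steps >= 2)."""
    total = len(alpha_bars)
    # Descending, evenly spaced subset of timesteps, e.g. 50 of the original 1000.
    steps = sorted({round(i * (total - 1) / (num_steps - 1)) for i in range(num_steps)},
                   reverse=True)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise
    for i, t in enumerate(steps):
        abar = alpha_bars[t]
        abar_prev = alpha_bars[steps[i + 1]] if i + 1 < len(steps) else 1.0
        eps_hat = predict_noise(x, t)  # one network evaluation per visited step
        # Estimate the clean image implied by the predicted noise...
        x0_hat = [(xi - math.sqrt(1.0 - abar) * e) / math.sqrt(abar)
                  for xi, e in zip(x, eps_hat)]
        # ...then re-noise it to the earlier timestep using the same predicted noise.
        x = [math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * e
             for x0, e in zip(x0_hat, eps_hat)]
    return x
```

Calling `ddim_sample(model, dim, make_alpha_bars(), 50, random.Random(0))` would evaluate the network 50 times instead of 1000. How few steps a given model tolerates before quality drops is an empirical question that this sketch does not answer.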

Go deeper

Read the original: arXiv:2006.11239. For the moving parts, see Diffusion mechanics; for the models it dethroned, see GANs and Variational autoencoders. Next paper: Chinchilla (2022).