DDPM (2020) — Images out of Pure Noise
The world before this paper
In 2020, the best images a computer could dream up came out of a knife fight. GANs produced stunning samples, but only after a brutal duel between two networks that collapsed if you looked at it wrong. The stable alternatives produced mush. And buried in a 2015 paper sat a beautiful idea — generation by gradual denoising — that nobody could make competitive.
GANs ruled image generation but trained like a knife fight: instability, mode collapse, and an endless bag of tricks to keep the duel alive.
VAEs and normalizing flows trained calmly and covered the data, but VAE samples looked blurry and flows demanded restrictive, invertible architectures.
Early diffusion (2015) slowly noised an image and learned to reverse it: elegant on paper, but those first versions simply couldn't compete on sample quality.
The key idea
Enter Jonathan Ho, Ajay Jain and Pieter Abbeel at Berkeley, with "Denoising Diffusion Probabilistic Models" (NeurIPS 2020). They dug up the five-year-old diffusion idea and made one bet: stop asking the network for the clean image. Ask it for the noise. Pick any step of the corruption process, noise a training image to that level, and train a U-Net (told which timestep it's looking at) to predict ε, the exact Gaussian noise that was mixed into the clean image to get there, using nothing but a plain MSE loss. No adversary. No duel. Just regression, across T = 1000 noise levels.
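A minimal sketch of that training signal, in plain Python with short lists standing in for image tensors. The linear noise schedule (β running from 1e-4 to 0.02) is the one reported in the DDPM paper, not something stated above, and `zero_model` is a hypothetical stand-in for the U-Net.

```python
import math
import random

T = 1000
# Linear beta schedule as reported in the DDPM paper (an assumption here).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bars[t] is the running product of (1 - beta_s) for all s <= t.
alpha_bars = []
running = 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bars.append(running)


def noisy_sample(x0, t, eps):
    """Jump straight to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a = math.sqrt(alpha_bars[t])
    s = math.sqrt(1.0 - alpha_bars[t])
    return [a * x + s * e for x, e in zip(x0, eps)]


def training_loss(model, x0, rng):
    """One Monte Carlo term of the loss: MSE between true and predicted noise."""
    t = rng.randrange(T)                       # random timestep
    eps = [rng.gauss(0.0, 1.0) for _ in x0]    # the noise the network must recover
    x_t = noisy_sample(x0, t, eps)
    eps_hat = model(x_t, t)                    # the network is told which timestep it sees
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(x0)


def zero_model(x_t, t):
    # Hypothetical placeholder for the U-Net: always predicts "no noise".
    return [0.0 for _ in x_t]


rng = random.Random(0)
x0 = [0.5, -0.25, 0.0, 1.0]   # a toy four-pixel "image"
loss = training_loss(zero_model, x0, rng)
```

In a real setup the same loss would be averaged over a batch and minimized by gradient descent on the network's weights; the placeholder model here only shows where the prediction plugs in.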
That one simplification turned a curiosity into a contender. Each denoising step is so small that the network's job is almost trivial — and a thousand almost-trivial jobs, chained together, can do something that looks like magic.
Define a fixed forward process that destroys an image with ~1000 tiny doses of Gaussian noise, train a network to predict the noise present at each step, then generate by running the film backwards, denoising pure static step by step into a brand-new image.
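Running the film backwards can be sketched as the loop below. It follows the ancestral sampling update from the DDPM paper with the step variance set to β_t, and again assumes the paper's linear schedule. `TARGET` and `oracle_model` are hypothetical: the "dataset" is a single constant image, for which the ideal noise prediction can be written in closed form, so no trained network is needed to try the loop.

```python
import math
import random

T = 1000
# Linear schedule as reported in the DDPM paper (an assumption here).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars = []
running = 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)


def sample(model, dim, rng):
    """Start from pure static and remove the predicted noise one step at a time."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in reversed(range(T)):
        eps_hat = model(x, t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        scale = 1.0 / math.sqrt(alphas[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no fresh noise on the final step
        x = [scale * (xi - coef * ei) + sigma * rng.gauss(0.0, 1.0)
             for xi, ei in zip(x, eps_hat)]
    return x


TARGET = 0.7  # hypothetical dataset: every pixel of every image equals 0.7


def oracle_model(x_t, t):
    # Ideal noise predictor for that one-point dataset; a trained U-Net would go here.
    a = math.sqrt(alpha_bars[t])
    s = math.sqrt(1.0 - alpha_bars[t])
    return [(xi - a * TARGET) / s for xi in x_t]


image = sample(oracle_model, dim=4, rng=random.Random(0))
```

With the oracle in place the loop lands on the constant image, which makes it a convenient sanity check before swapping in a learned model.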
Want the full mechanics — noise schedules, the objective, the sampling loop? See Diffusion mechanics.
Watch the film run both ways
One animation, both directions: a photo drowns in noise, the network learns to predict each dose — then a fresh patch of static gets denoised into an image that never existed.
The results that mattered
On CIFAR-10, the boring training loop beat the drama. The numbers said diffusion wasn't a curiosity anymore — it was a peer of the best GANs, with none of their failure modes.
CIFAR-10 sample quality matching the best GANs of the day — minus the instability, the collapse, the tricks.
A thousand small noising steps, each one an easy learning problem. The chain does the heavy lifting, not any single step.
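To see how small each step is, the sketch below compares how much of the image survives one noising step with how much survives the whole chain. It assumes the linear β schedule from the DDPM paper (1e-4 to 0.02), which this page does not spell out.

```python
import math

T = 1000
# Linear schedule as reported in the DDPM paper (an assumption here).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# Fraction of signal amplitude that survives a single step: sqrt(1 - beta_t).
per_step_keep = [math.sqrt(1.0 - b) for b in betas]

# Fraction that survives all T steps: the product of the per-step factors.
total_keep = math.prod(per_step_keep)

print(f"harshest single step keeps {min(per_step_keep):.4f} of the signal")
print(f"all {T} steps together keep {total_keep:.4f}")
```

Under this schedule even the harshest step leaves roughly 99% of the signal intact, while the full chain leaves well under 1%, which is why the end of the forward process is effectively pure static.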
Predict the noise, compare, repeat. No adversary, no minimax, no instability — and mode coverage by construction.
Legacy — and the catch
Legacy:
- Stable training and full mode coverage: GAN pain, gone
- Scaled into the text-to-image revolution (Stable Diffusion, DALL·E 2, Imagen)
- A clean probabilistic framework with knobs theory can hold
The catch:
- Sampling took ~1000 sequential network passes, against a GAN's single pass
- Pixel-space diffusion at high resolution is brutally expensive
- It took latent diffusion and faster samplers (DDIM) to become practical
Read the original: arXiv:2006.11239. For the moving parts, see Diffusion mechanics; for the models it dethroned, see GANs and Variational autoencoders.