Diffusion model
Diffusion models (also called denoising diffusion probabilistic models or score-based generative models) are a class of deep generative model that learn to generate data by reversing a gradual noising process. Since 2021 they have largely displaced generative adversarial networks as the dominant paradigm for image synthesis, and underpin systems such as DALL·E 2, Stable Diffusion, Imagen, and Midjourney.
Core idea
A diffusion model defines two processes:
- Forward (noising) process: starting from a real data sample x0, Gaussian noise is added over T time steps to produce a sequence x1, x2, …, xT, where xT is approximately pure Gaussian noise. Each step follows q(xt | xt−1) = N(xt; √(1−βt) xt−1, βtI), where βt is a variance schedule.
- Reverse (denoising) process: a neural network is trained to predict the noise added at each step and progressively remove it, recovering a clean sample from pure noise. The model learns pθ(xt−1 | xt), parameterised by a network that typically predicts the noise ε.
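In the DDPM formulation, the variational bound reduces to a weighted mean squared error between the true noise ε and the predicted noise εθ(xt, t). Ho et al. found that dropping the weights, which leaves a plain mean squared error, works better in practice.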
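As a rough illustration of the forward process and the noise-prediction loss, the sketch below uses the standard closed form q(xt | x0) = N(√ᾱt x0, (1 − ᾱt) I), where ᾱt is the product of (1 − βs) for s ≤ t, so that xt can be drawn in a single step. The linear schedule endpoints, the array shapes, the function names, and the all-zero "prediction" are assumptions chosen for illustration, not details taken from this article.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule (an assumed choice) and cumulative products alpha_bar_t."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) directly; t is a 0-based index into the schedule."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

def simple_loss(eps_pred, eps):
    """Unweighted mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = rng.standard_normal((4, 8, 8))   # stand-in for a batch of images
xt, eps = q_sample(x0, t=500, alpha_bars=alpha_bars, rng=rng)

# Placeholder for a network output: a real model would compute eps_theta(xt, t).
eps_pred = np.zeros_like(eps)
loss = simple_loss(eps_pred, eps)
```

In an actual training loop, t would be sampled uniformly for each example and eps_pred would come from the denoising network.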
History
The theoretical foundations were laid by Jascha Sohl-Dickstein and colleagues in their 2015 paper "Deep Unsupervised Learning using Nonequilibrium Thermodynamics", which drew an analogy between data generation and the reversal of a thermodynamic diffusion process. However, the approach initially could not match GANs in image quality.
Two key advances in 2019–2020 made diffusion models practical:
- Score matching with Langevin dynamics (Song & Ermon, NeurIPS 2019): reframed the problem as learning the score function (the gradient of the log-density) at multiple noise levels and sampling via annealed Langevin dynamics, achieving sample quality comparable to GANs on CIFAR-10 (Inception score 8.87).
- Denoising Diffusion Probabilistic Models (DDPM) (Ho, Jain & Abbeel, NeurIPS 2020): demonstrated that a simplified training objective (predicting the added noise) with a U-Net architecture could generate high-fidelity images, reaching an FID of 3.17 on unconditional CIFAR-10 and sample quality comparable to ProgressiveGAN on 256×256 LSUN. This paper is widely regarded as the catalyst for the subsequent surge of work on diffusion models.
In 2021, Dhariwal & Nichol ("Diffusion Models Beat GANs on Image Synthesis") showed that diffusion models with an improved architecture and classifier guidance surpassed BigGAN-deep on ImageNet in FID, establishing diffusion models as the state of the art for class-conditional image synthesis.
Architecture
U-Net backbone
The original DDPM and most early diffusion models used a U-Net architecture — a convolutional encoder-decoder with skip connections, augmented with self-attention layers at lower resolutions and sinusoidal time-step embeddings. The U-Net processes the noisy image xt together with the time step t to predict the noise component.
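The sinusoidal time-step embedding can be sketched as follows. This is a minimal NumPy version under assumed conventions (cosine features first, a maximum period of 10000, an even embedding dimension); real implementations differ in ordering and scaling, and usually pass the result through a small MLP before injecting it into the network.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Map time steps of shape (N,) to sinusoidal features of shape (N, dim).

    Assumes dim is even. Frequencies are spaced geometrically from 1 down
    towards 1 / max_period.
    """
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.cos(args), np.sin(args)], axis=1)

emb = timestep_embedding([0, 10, 999], dim=128)
```

Each time step receives a distinct, smoothly varying vector, which lets a single network share weights across all noise levels.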
Diffusion Transformer (DiT)
In 2023, Peebles & Xie proposed the Diffusion Transformer (DiT), replacing the U-Net with a vision transformer. DiT processes image patches as tokens and uses adaptive layer normalisation (adaLN-Zero) to condition on the time step and class label. DiT-XL/2 achieved a new state-of-the-art FID of 2.27 on ImageNet 256×256. Subsequent systems including Stable Diffusion 3 and FLUX adopted transformer-based backbones.
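To make the tokenisation and conditioning concrete, here is a small sketch of splitting a latent into patch tokens and applying an adaLN-style scale and shift. The function names, shapes, and random input are illustrative assumptions. The sketch omits the learned linear projection of each patch, the regression of shift and scale from the conditioning vector, and the zero-initialised gate that gives adaLN-Zero its name.

```python
import numpy as np

def patchify(x, p):
    """Split a (C, H, W) array into (H/p * W/p) tokens, each of length p*p*C."""
    c, h, w = x.shape
    assert h % p == 0 and w % p == 0, "spatial size must be divisible by patch size"
    x = x.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 2, 4, 0)            # (H/p, W/p, p, p, C)
    return x.reshape((h // p) * (w // p), p * p * c)

def adaln_modulate(tokens, shift, scale, eps=1e-6):
    """Layer-normalise each token (no learned affine), then scale and shift it.

    In a real model, shift and scale would be predicted from the time-step
    and class embeddings rather than passed in directly.
    """
    mean = tokens.mean(axis=-1, keepdims=True)
    var = tokens.var(axis=-1, keepdims=True)
    normed = (tokens - mean) / np.sqrt(var + eps)
    return normed * (1.0 + scale) + shift

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 32, 32))     # assumed latent for a 256x256 image
tokens = patchify(latent, p=2)                # 256 tokens of length 16
out = adaln_modulate(tokens, shift=np.zeros(16), scale=np.zeros(16))
```

With patch size 2 (the "/2" in DiT-XL/2), a 32×32 latent yields 256 tokens. With zero shift and scale, the modulation reduces to a plain layer normalisation.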
Guidance
Classifier-free guidance (Ho & Salimans, 2022) became the standard technique for trading off sample diversity against fidelity. During training, the class or text condition is randomly dropped (replaced with a null token) some fraction of the time. At inference, the model's unconditional and conditional predictions are combined: ε̃ = εunconditional + w · (εconditional − εunconditional), where w > 1 increases adherence to the condition.
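The two halves of the technique, dropping the condition during training and combining predictions at inference, can be sketched as below. The function names, the integer null label, and the toy arrays are assumptions for illustration; in practice the two noise predictions come from the same network evaluated with and without the condition.

```python
import numpy as np

def drop_condition(labels, null_label, p_drop, rng):
    """Training time: replace each label with the null label with probability p_drop."""
    mask = rng.random(len(labels)) < p_drop
    return np.where(mask, null_label, labels)

def guided_noise(eps_uncond, eps_cond, w):
    """Inference time: extrapolate from the unconditional towards the conditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
labels = np.array([3, 1, 4, 1, 5])
train_labels = drop_condition(labels, null_label=10, p_drop=0.1, rng=rng)

eps_uncond = np.array([0.0, 0.0])
eps_cond = np.array([1.0, 2.0])
eps_w1 = guided_noise(eps_uncond, eps_cond, w=1.0)   # equals eps_cond
eps_w3 = guided_noise(eps_uncond, eps_cond, w=3.0)   # pushed beyond eps_cond
```

Setting w = 1 recovers the purely conditional prediction, and w = 0 recovers the unconditional one.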
Latent diffusion
Latent diffusion models (Rombach et al., CVPR 2022) perform the diffusion process not in pixel space but in the latent space of a pretrained variational autoencoder (VAE). This dramatically reduces computational cost — a 512×512 image might be encoded to a 64×64 latent — while preserving perceptual quality. The text-to-image system Stable Diffusion is a latent diffusion model conditioned on CLIP text embeddings, and its open-source release in August 2022 made high-quality image generation widely accessible.
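The size of the saving can be estimated with simple arithmetic. The sketch below assumes an RGB input, a spatial downsampling factor of 8, and 4 latent channels; the downsampling factor matches the 512×512 to 64×64 example, while the channel counts are assumptions about a typical configuration rather than figures from this article.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (C, H, W) of the latent for an image of the given size."""
    return (latent_channels, height // downsample, width // downsample)

def element_ratio(height, width, image_channels=3, downsample=8, latent_channels=4):
    """How many times fewer values the denoiser processes in latent space."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (image_channels * height * width) / (c * h * w)

shape = latent_shape(512, 512)     # (4, 64, 64)
ratio = element_ratio(512, 512)    # 48.0
```

Under these assumptions the denoiser operates on 48 times fewer values per image than it would in pixel space.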
Sampling acceleration
Standard DDPM sampling requires hundreds to thousands of denoising steps. Several methods reduce this:
- DDIM (Song, Meng & Ermon, ICLR 2021): deterministic sampling using a non-Markovian process, producing good results in 20–50 steps.
- DPM-Solver (Lu et al., NeurIPS 2022): a high-order ODE solver achieving quality samples in 10–20 steps.
- Consistency models (Song et al., ICML 2023): distill a diffusion model into a single-step or few-step generator by enforcing self-consistency along the probability flow ODE trajectory.
- Rectified flow and flow matching (Lipman et al., 2023; Liu et al., 2023): train a velocity field that transports noise to data along straighter, near-linear paths, a formulation closely related to optimal transport; straighter paths can be integrated accurately in fewer steps.
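As one concrete example from the list above, a deterministic DDIM update can be sketched as follows. The linear schedule, the evenly spaced step subsequence, and the helper names are assumptions. The "oracle" noise predictor is a toy stand-in that is exact only because the data distribution here is a single known point; it is used so that the sampler can be checked without training a network.

```python
import numpy as np

def ddim_step(xt, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM update from noise level ab_t to ab_prev."""
    x0_pred = (xt - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred

def ddim_sample(eps_model, shape, alpha_bars, num_steps, rng):
    """Sample using an evenly spaced, descending subsequence of time steps."""
    T = len(alpha_bars)
    steps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = rng.standard_normal(shape)
    for i, t in enumerate(steps):
        ab_t = alpha_bars[t]
        # After the last step there is no noise left, so alpha_bar is taken as 1.
        ab_prev = alpha_bars[steps[i + 1]] if i + 1 < len(steps) else 1.0
        x = ddim_step(x, eps_model(x, t), ab_t, ab_prev)
    return x

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
target = np.array([1.5, -0.5, 2.0])           # the single "data point"

def oracle_eps(x, t):
    """Exact noise predictor when all data equals `target` (toy assumption)."""
    return (x - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

rng = np.random.default_rng(0)
sample = ddim_sample(oracle_eps, target.shape, alpha_bars, num_steps=20, rng=rng)
```

With a learned network in place of oracle_eps, the same loop trades sample quality against the number of steps.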
Key systems
| System | Organisation | Year | Notes |
|---|---|---|---|
| DALL·E 2 | OpenAI | 2022 | Pixel-space diffusion decoder conditioned on CLIP image embeddings, followed by diffusion upsamplers |
| Imagen | Google Brain | 2022 | T5-conditioned pixel-space cascade; zero-shot FID of 7.27 on COCO, state of the art at release |
| Stable Diffusion | Stability AI / CompVis | 2022 | Open-source latent diffusion; most widely used text-to-image model |
| Midjourney | Midjourney Inc. | 2022 | Proprietary; noted for artistic style |
| SDXL | Stability AI | 2023 | Latent diffusion with a 3.5B-parameter base model; 6.6B parameters including the optional refiner |
| DALL·E 3 | OpenAI | 2023 | Trained on synthetic captions for improved prompt following |
| Stable Diffusion 3 | Stability AI | 2024 | MMDiT (multi-modal DiT) backbone with flow matching |
| FLUX | Black Forest Labs | 2024 | Transformer-based, flow matching; developed by a company founded by former Stability AI researchers |
| Sora | OpenAI | 2024 | Video generation using diffusion transformers on spacetime patches |
Applications beyond images
- Video: Sora (OpenAI), Lumiere (Google), Runway Gen-2, and Stable Video Diffusion generate video by treating temporal sequences as additional dimensions in the diffusion process.
- Audio and music: AudioLDM, Riffusion, and Stable Audio use latent diffusion on spectrograms or audio representations.
- 3D generation: DreamFusion (Poole et al., 2022) uses score distillation sampling to optimise a NeRF using a 2D diffusion prior.
- Molecular design: Diffusion models generate candidate drug molecules by denoising 3D molecular coordinates and atom types, e.g. DiffSBDD for structure-based drug design.
- Protein structure: RFdiffusion (Watson et al., Nature 2023) designs novel protein structures by diffusing backbone coordinates.
- Text: Discrete diffusion models apply the framework to token sequences, though autoregressive models remain dominant for text generation.
Theoretical connections
Diffusion models are connected to several other frameworks:
- Score matching: the reverse process can be viewed as following the score function ∇x log p(x) via Langevin dynamics (Song & Ermon, 2019).
- Stochastic differential equations: Song et al. (ICLR 2021) unified discrete-step DDPM and continuous score-based models under a common SDE/ODE framework.
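- Variational autoencoders: a diffusion model can be viewed as a hierarchical VAE whose encoder (the forward process) is fixed rather than learned; the DDPM loss is a reweighted form of the corresponding variational lower bound.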
- Optimal transport: flow matching interprets diffusion as learning a velocity field that transports noise to data along near-optimal paths.
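The score-matching connection can be illustrated with unadjusted Langevin dynamics on a distribution whose score is known in closed form. The sketch below assumes a one-dimensional Gaussian target, a fixed step size, and hypothetical function names; in a score-based generative model the analytic score would be replaced by a learned network, typically with a schedule of decreasing noise levels.

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Score (gradient of the log-density) of N(mu, sigma^2)."""
    return -(x - mu) / sigma**2

def langevin_sample(score_fn, x_init, step_size, num_steps, rng):
    """Unadjusted Langevin dynamics: x <- x + (h/2) * score(x) + sqrt(h) * z."""
    x = np.array(x_init, dtype=np.float64)
    for _ in range(num_steps):
        noise = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * noise
    return x

rng = np.random.default_rng(0)
x_init = rng.standard_normal(5000)            # 5000 chains started from N(0, 1)
samples = langevin_sample(
    lambda x: gaussian_score(x, mu=2.0, sigma=0.5),
    x_init, step_size=0.01, num_steps=1000, rng=rng,
)
```

After enough steps the chains are distributed approximately as the target, up to a small bias from the finite step size.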
Limitations
- Sampling speed: even with acceleration, diffusion models are slower than single-pass generators (GANs, VAEs). Real-time applications often require distillation.
- Compute cost: training large diffusion models requires thousands of GPU-days and large-scale datasets.
- Memorisation and copyright: studies have shown diffusion models can memorise and reproduce training images near-verbatim, raising legal and ethical questions about training data.
- Text rendering: early text-to-image diffusion models struggled to render legible text in images, though later systems (DALL·E 3, FLUX) have partially addressed this.
See also
- Generative adversarial network
- Deep learning
- Transformer (machine learning)
- Artificial neural network
- Large language model
- Natural language processing
References
- Sohl-Dickstein, J.; Weiss, E. A.; Maheswaranathan, N.; Ganguli, S. (2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics". ICML 2015. arXiv:1503.03585.
- Song, Y.; Ermon, S. (2019). "Generative Modeling by Estimating Gradients of the Data Distribution". NeurIPS 2019. arXiv:1907.05600.
- Ho, J.; Jain, A.; Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models". NeurIPS 2020. arXiv:2006.11239.
- Dhariwal, P.; Nichol, A. (2021). "Diffusion Models Beat GANs on Image Synthesis". NeurIPS 2021. arXiv:2105.05233.
- Song, Y.; Sohl-Dickstein, J.; Kingma, D. P.; Kumar, A.; Ermon, S.; Poole, B. (2021). "Score-Based Generative Modeling through Stochastic Differential Equations". ICLR 2021. arXiv:2011.13456.
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models". CVPR 2022. arXiv:2112.10752.
- Ho, J.; Salimans, T. (2022). "Classifier-Free Diffusion Guidance". arXiv:2207.12598.
- Peebles, W.; Xie, S. (2023). "Scalable Diffusion Models with Transformers". ICCV 2023. arXiv:2212.09748.
- Song, Y.; Dhariwal, P.; Chen, M.; Sutskever, I. (2023). "Consistency Models". ICML 2023. arXiv:2303.01469.
- Watson, J. L. et al. (2023). "De novo design of protein structure and function with RFdiffusion". Nature, 620(7976), 1089–1100.