Diffusion model

A diffusion model is a class of deep learning generative model that learns to produce data — typically images, video, audio, or molecular structures — by reversing a gradual noising process. During training, the model observes data samples progressively corrupted by Gaussian noise and learns to predict the noise (or, equivalently, the original sample) at every corruption level. At sampling time, the model starts from pure noise and iteratively denoises it into a coherent sample drawn from the learned data distribution. Diffusion models underpin the 2022–2026 generation of text-to-image systems including Stable Diffusion, DALL-E 2 and 3, Midjourney, Imagen, and the text-to-video systems Sora and Veo.

Diffusion models are closely related to energy-based models, score matching, and stochastic differential equations, and by 2024 had largely displaced generative adversarial networks (GANs) and autoregressive pixel models as the dominant approach to high-resolution image synthesis.

Background and history

The modern diffusion model was introduced by Jascha Sohl-Dickstein and colleagues in 2015 in the paper Deep Unsupervised Learning using Nonequilibrium Thermodynamics, which framed generative modelling as the inversion of a diffusive Markov chain borrowed from statistical physics.^[1] The approach attracted limited attention until 2020, when Jonathan Ho, Ajay Jain, and Pieter Abbeel at UC Berkeley published Denoising Diffusion Probabilistic Models (DDPM), simplifying the training objective to an unweighted mean-squared error on predicted noise and showing that diffusion models could rival the sample quality of strong contemporary GANs on image benchmarks such as CIFAR-10.^[2]

In parallel, Yang Song and Stefano Ermon at Stanford developed the score-based formulation, which models the gradient of the log data density (the "score") at multiple noise scales.^[3] Song et al. (2021) unified the discrete-time DDPM view with the continuous-time score-based view through the lens of stochastic differential equations, showing that both correspond to the forward and reverse trajectories of an SDE.^[4]

The practical explosion came in 2021–2022:

Classifier-free guidance (Ho and Salimans, 2021) allowed a single model to be steered toward conditional samples without a separate classifier, and sharply improved sample fidelity.^[5]
GLIDE (Nichol et al., OpenAI, December 2021) combined diffusion with text conditioning via a transformer text encoder trained jointly with the diffusion model, and was among the first convincing text-to-image diffusion systems.^[6]
DALL-E 2 (OpenAI, April 2022) added a CLIP-based prior, making text-to-image generation a mainstream consumer capability.
Imagen (Google, May 2022) demonstrated that scaling a large frozen text encoder (T5-XXL) mattered more for text–image alignment than scaling the image diffusion model itself.
Latent Diffusion Models (Rombach et al., CVPR 2022) and their open release as Stable Diffusion (August 2022) moved the diffusion process into the compressed latent space of a variational autoencoder, substantially reducing compute relative to pixel-space diffusion and enabling an open-source release that runs on consumer GPUs.^[7]

From late 2022 onward, the field extended to video (Make-A-Video, Imagen Video, Sora, Veo), 3D (DreamFusion), audio (AudioLDM), proteins and molecules (RFdiffusion for protein design), and robot actions (Diffusion Policy).

Mathematical formulation

Forward process

Given a data sample <math>x_0</math> drawn from the true distribution <math>q(x_0)</math>, a diffusion model defines a fixed forward Markov chain that gradually adds Gaussian noise over <math>T</math> steps:

<math>q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t \mathbf{I}\right)</math>

where <math>\{\beta_t\}_{t=1}^T</math> is a noise schedule. A key property of Gaussian diffusion is that <math>x_t</math> can be sampled in closed form from <math>x_0</math>:

<math>q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\, \sqrt{\bar\alpha_t}\, x_0,\, (1-\bar\alpha_t)\mathbf{I}\right), \qquad \bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)</math>

so training needs only the sample and a single random timestep, never a full forward simulation. For <math>T</math> large and <math>\beta_t</math> small, <math>q(x_T)</math> is nearly indistinguishable from a standard Gaussian.
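The closed-form forward sampling above can be illustrated with a short NumPy sketch. The linear schedule endpoints (1e-4 to 0.02), the step count, and the function names are illustrative assumptions, not values fixed by this article.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (endpoint values are illustrative)."""
    return np.linspace(beta_start, beta_end, T)

def alpha_bar(betas):
    """Cumulative product: alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is a 1-based step index."""
    eps = rng.standard_normal(x0.shape)
    a = abar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas = linear_beta_schedule()
abar = alpha_bar(betas)
x0 = rng.standard_normal((4, 8))          # toy "data": 4 samples, 8 features
x_t, eps = q_sample(x0, t=500, abar=abar, rng=rng)
```

Because `abar` decreases monotonically toward zero, the signal term shrinks and the noise term grows as `t` increases, which is why the final step is close to a standard Gaussian.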

Reverse process

The model learns a parameterised reverse chain

<math>p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\right)</math>

In the DDPM parameterisation the network predicts the noise <math>\epsilon</math> that was added to obtain <math>x_t</math>, and the training loss reduces to

<math>\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0,\mathbf{I}),\, t}\!\left[\,\|\epsilon - \epsilon_\theta(x_t, t)\|^2\,\right]</math>

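Here <math>x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon</math>, with <math>t</math> typically drawn uniformly from <math>\{1,\dots,T\}</math>. The objective is a plain denoising regression, generally easier to optimise than the KL-regularised objective of a variational autoencoder or the minimax game of a GAN, and this explains much of the method's stability.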
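A Monte Carlo estimate of this loss for one batch might look like the following sketch. It assumes data of shape (batch, features), a uniformly sampled timestep, and a hypothetical callable `eps_model(x_t, t)`; the two toy models exist only to show the extremes of the loss.

```python
import numpy as np

def simple_loss(eps_model, x0, abar, rng):
    """One-batch estimate of L_simple: squared error between true and predicted noise."""
    batch = x0.shape[0]
    t = rng.integers(1, len(abar) + 1, size=batch)   # uniform t in {1, ..., T}
    eps = rng.standard_normal(x0.shape)
    a = abar[t - 1][:, None]                         # broadcast over features
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps   # closed-form forward sample
    return np.mean(np.sum((eps - eps_model(x_t, t)) ** 2, axis=1))

abar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # illustrative schedule
x0 = np.zeros((16, 8))  # toy data: all zeros, so x_t = sqrt(1 - abar_t) * eps

def zero_model(x_t, t):
    """Always predicts zero noise."""
    return np.zeros_like(x_t)

def oracle_model(x_t, t):
    """Exact noise for the all-zeros toy data (not available for real data)."""
    return x_t / np.sqrt(1.0 - abar[t - 1])[:, None]

loss_zero = simple_loss(zero_model, x0, abar, np.random.default_rng(0))
loss_oracle = simple_loss(oracle_model, x0, abar, np.random.default_rng(0))
```

In this toy setting the oracle drives the loss to numerical zero, while the zero predictor pays roughly the expected squared norm of the noise.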

Score-based view

Equivalently, predicting the noise corresponds to estimating the Stein score <math>\nabla_{x_t}\log q(x_t)</math>. Sampling can then be viewed as solving a reverse-time stochastic differential equation (or an equivalent deterministic probability-flow ODE):

<math>\mathrm{d}x = \left[f(x,t) - g(t)^2 \nabla_x\log p_t(x)\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar w</math>

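Here <math>f</math> and <math>g</math> are the drift and diffusion coefficients of the forward SDE, <math>\bar w</math> is a Wiener process running backward in time, and <math>p_t</math> is the marginal density of the noised data at time <math>t</math> (the continuous-time counterpart of <math>q(x_t)</math>). This perspective enables the use of off-the-shelf numerical ODE/SDE solvers as samplers.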
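The link between noise prediction and the score can be checked numerically for the conditional distribution, where the score is known exactly. The sketch below assumes the discrete-time Gaussian forward process and uses hypothetical helper names. A trained network only approximates the marginal score, so the identity holds exactly only for the true noise used here.

```python
import numpy as np

def score_from_eps(eps_pred, abar_t):
    """Convert a noise prediction into a score estimate: s = -eps / sqrt(1 - abar_t)."""
    return -eps_pred / np.sqrt(1.0 - abar_t)

def conditional_score(x_t, x0, abar_t):
    """Exact gradient of log N(x_t; sqrt(abar_t) * x0, (1 - abar_t) * I) w.r.t. x_t."""
    return -(x_t - np.sqrt(abar_t) * x0) / (1.0 - abar_t)

rng = np.random.default_rng(0)
abar_t = 0.3                                  # arbitrary noise level for the check
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps

s_from_noise = score_from_eps(eps, abar_t)
s_exact = conditional_score(x_t, x0, abar_t)
```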

Conditioning and guidance

Most practical diffusion models are conditional — on a text prompt, class label, low-resolution image, or depth map. Two mechanisms dominate:

Classifier guidance uses the gradient of a separately trained classifier <math>\nabla_{x_t}\log p(y\mid x_t)</math> to push samples toward the desired class.
Classifier-free guidance trains a single network to predict <math>\epsilon_\theta(x_t, t, c)</math> conditionally and, with some probability during training, unconditionally (<math>c=\varnothing</math>). At sampling time the two predictions are combined:

<math>\tilde\epsilon_\theta(x_t, t, c) = (1+w)\,\epsilon_\theta(x_t, t, c) - w\,\epsilon_\theta(x_t, t, \varnothing)</math>

Guidance weights of <math>w \approx 3\!-\!7</math> dramatically sharpen conditional samples at the cost of diversity, and have become standard.
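The combination rule is a single line of arithmetic. A minimal sketch, with an illustrative function name and toy arrays standing in for the two network outputs:

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, w):
    """Classifier-free guidance: (1 + w) * conditional - w * unconditional."""
    return (1.0 + w) * eps_cond - w * eps_uncond

eps_cond = np.array([1.0, 2.0, 3.0])     # stand-in for eps_theta(x_t, t, c)
eps_uncond = np.array([0.5, 2.0, 4.0])   # stand-in for eps_theta(x_t, t, empty)
guided = cfg_combine(eps_cond, eps_uncond, w=5.0)
```

Setting `w = 0` recovers the purely conditional prediction. Larger values extrapolate further along the direction from the unconditional to the conditional prediction.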

Architecture

The denoising network <math>\epsilon_\theta</math> in image diffusion is typically a U-Net with residual blocks, self-attention at lower-resolution stages, and sinusoidal timestep embeddings. Latent Diffusion additionally performs the diffusion in the latent space of a pretrained autoencoder so that the U-Net operates on, for example, 64×64 latents rather than 512×512 pixels.
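One common form of the sinusoidal timestep embedding is sketched below. The sin-then-cos ordering, the `max_period` constant, and the requirement that `dim` be even are assumptions of this sketch, and implementations differ in these details.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Map integer timesteps to (len(t), dim) sinusoidal features; dim assumed even."""
    t = np.asarray(t, dtype=np.float64)
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

emb = timestep_embedding([0, 10, 999], dim=128)
```

The resulting vector is usually passed through a small MLP and injected into each residual block, so that one network can serve all noise levels.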

A major 2023–2024 shift replaced the U-Net with the Diffusion Transformer (DiT) of Peebles and Xie, which treats latent patches as tokens and applies a pure transformer with adaptive layer-norm (adaLN) conditioning.^[8] DiTs have been reported to scale predictably with compute and underlie several state-of-the-art systems, including Stable Diffusion 3, Flux, and Sora.
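The "latent patches as tokens" step amounts to a reshape. The sketch below assumes a channels-first latent with 4 channels and a patch size of 2, both illustrative choices. It omits the learned projection and positional embeddings that a real model would apply to the tokens.

```python
import numpy as np

def patchify(z, p):
    """(C, H, W) latent -> (H/p * W/p, p*p*C) token matrix."""
    C, H, W = z.shape
    assert H % p == 0 and W % p == 0, "patch size must divide height and width"
    z = z.reshape(C, H // p, p, W // p, p)
    z = z.transpose(1, 3, 2, 4, 0)            # (H/p, W/p, p, p, C)
    return z.reshape((H // p) * (W // p), p * p * C)

def unpatchify(tokens, p, C, H, W):
    """Inverse of patchify: token matrix -> (C, H, W) latent."""
    z = tokens.reshape(H // p, W // p, p, p, C)
    z = z.transpose(4, 0, 2, 1, 3)            # (C, H/p, p, W/p, p)
    return z.reshape(C, H, W)

z = np.arange(4 * 64 * 64, dtype=np.float64).reshape(4, 64, 64)
tokens = patchify(z, p=2)                     # 1024 tokens of dimension 16
z_back = unpatchify(tokens, p=2, C=4, H=64, W=64)
```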

Sampling and acceleration

Naive ancestral sampling requires one network evaluation per diffusion step, often 1,000. Several lines of work have reduced this dramatically:

DDIM (Song, Meng, Ermon, 2020) generalised DDPM to a family of non-Markovian processes that share the same training objective and include a deterministic sampler, typically needing 25–50 steps.
DPM-Solver (Lu et al., 2022) and DPM-Solver++ exploit the semi-linear structure of the probability-flow ODE to reach high-quality samples in 10–20 steps.
Consistency models (Song et al., 2023) train a network to map any point on the ODE trajectory directly to the sample, enabling one-step generation with a small quality cost.^[9]
Rectified flow and flow matching (Lipman et al., 2023; Liu et al., 2023) reframe generation as learning a velocity field along near-straight paths between noise and data, which can be sampled in very few steps; this formulation underlies Stable Diffusion 3 and Flux.
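As an illustration of the accelerated samplers listed above, the deterministic DDIM update (the zero-variance case) can be sketched as follows. The step subsequence, the schedule, and the oracle noise model are assumptions chosen to make the example checkable. A real sampler would call a trained network instead.

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update from noise level abar_t to abar_prev."""
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

def ddim_sample(x_T, eps_model, abar, steps):
    """steps: decreasing 1-based timestep indices, a subsequence of 1..T."""
    x = x_T
    for i, t in enumerate(steps):
        a_t = abar[t - 1]
        # After the last listed step, jump to the clean sample (abar = 1).
        a_prev = abar[steps[i + 1] - 1] if i + 1 < len(steps) else 1.0
        x = ddim_step(x, eps_model(x, t), a_t, a_prev)
    return x

rng = np.random.default_rng(0)
abar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))   # illustrative schedule
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
x_T = np.sqrt(abar[-1]) * x0 + np.sqrt(1.0 - abar[-1]) * eps

steps = np.linspace(1000, 1, 25).astype(int)             # 25 of the 1000 steps
x_rec = ddim_sample(x_T, lambda x, t: eps, abar, steps)  # oracle returns true noise
```

With a perfect noise prediction the trajectory returns exactly to `x0` regardless of how many steps are skipped. In practice the achievable step count is limited by the network's prediction error.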

Applications

Images

Diffusion models produce state-of-the-art or near state-of-the-art results on standard image benchmarks (CIFAR-10, LSUN, ImageNet) and dominate text-to-image generation. Open models (Stable Diffusion 1/2/XL/3, Flux.1) are diffusion- or flow-based, and closed services (DALL-E 3, Midjourney, Firefly, Ideogram) are reported to be as well, although not all of their architectures are publicly documented.

Video

Video diffusion handles the additional temporal axis either with extra temporal layers in a U-Net (Imagen Video, Make-A-Video) or with extra transformer tokens (Sora, Veo, Runway Gen-3). The resulting models can produce clips of up to roughly a minute with largely coherent motion and basic physical plausibility.

Audio and speech

Systems such as WaveGrad, DiffWave, AudioLDM, and Stable Audio use diffusion on raw waveforms, mel-spectrograms, or audio latents. NaturalSpeech 3 and related TTS systems use diffusion for prosody and acoustic modelling.

Molecules and proteins

RFdiffusion (Watson et al., 2023) adapts diffusion to protein backbone design, producing novel binders and enzymes validated experimentally. The E(3)-equivariant diffusion model (EDM) of Hoogeboom et al. and related models generate 3D small molecules for drug discovery. DiffDock performs protein–ligand docking.

Robotics

Diffusion Policy (Chi et al., 2023) represents robot action sequences as a conditional diffusion distribution, producing smoother and more multimodal behaviour than behaviour-cloning MLPs.

Editing and inverse problems

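Diffusion priors support image inpainting, super-resolution, colorisation, and deblurring as inverse problems: the pretrained model acts as a flexible prior, with the measurement constraint injected at sampling time (e.g. SDEdit, RePaint, DPS). ControlNet takes a different route, training an adapter that conditions a frozen diffusion model on structural inputs such as edges or depth maps.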
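For inpainting, one simple way to inject the known pixels at sampling time is to overwrite the observed region after every reverse step with a suitably noised copy of the observation, in the spirit of RePaint. The following is a minimal sketch of that blending step only. The function name and mask convention are assumptions, and the full method includes further details, such as resampling, that are not shown.

```python
import numpy as np

def blend_known_region(x_prev_model, x0_known, mask, abar_prev, rng):
    """Combine the model's proposal with a noised copy of the observed pixels.

    mask == 1 where pixels are observed, 0 where they must be generated.
    """
    noise = rng.standard_normal(x0_known.shape)
    # Forward-noise the observation to the same noise level as x_prev_model.
    x_prev_known = np.sqrt(abar_prev) * x0_known + np.sqrt(1.0 - abar_prev) * noise
    return mask * x_prev_known + (1.0 - mask) * x_prev_model

rng = np.random.default_rng(0)
x0_known = rng.standard_normal((8, 8))      # observed image (toy values)
x_prev_model = rng.standard_normal((8, 8))  # stand-in for a reverse-step output
mask = np.zeros((8, 8))
mask[:, :4] = 1.0                           # left half observed
x_prev = blend_known_region(x_prev_model, x0_known, mask, abar_prev=1.0, rng=rng)
```

At the final step (`abar_prev = 1`) the observed region is reproduced exactly, while the unobserved region is left entirely to the model.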

Limitations and criticism

Diffusion models have several well-known shortcomings:

Compute cost: training is expensive, and even with accelerated samplers inference usually needs several network evaluations, compared with a single forward pass of a GAN or VAE.
Mode coverage vs. fidelity tension: strong guidance weights trade diversity for prompt adherence, and very strong guidance can produce oversaturated or unnatural samples.
Text and compositionality: pure diffusion models have historically struggled with rendering legible text, accurate counting, and compositional prompts ("a red cube on top of a blue sphere"). Approaches like GLIGEN, layout-conditioned diffusion, and DiT scaling have narrowed but not closed this gap.
Memorisation and copyright: diffusion models have been shown to memorise training images verbatim in some cases,^[10] which has figured in copyright lawsuits against Stability AI, Midjourney, and others by artists and by Getty Images.
Misuse: photorealistic image and video diffusion has been used for non-consensual sexual imagery, political deepfakes, and scam content, prompting watermarking and provenance schemes (Google SynthID, C2PA content credentials) and regulatory responses such as the EU AI Act.

Relationship to other generative models

Variational autoencoders train a single-step encoder–decoder; diffusion models can be viewed as a deep hierarchical VAE with fixed Gaussian posteriors and a shared decoder applied many times.
Generative adversarial networks (GANs) train a generator against a discriminator. Diffusion models avoid the minimax instability but require iterative sampling. Hybrid approaches such as adversarial diffusion distillation (ADD, SDXL-Turbo) combine both.
Autoregressive image/video models (PixelCNN, Parti, VAR) generate tokens sequentially. Diffusion generates all data dimensions in parallel but proceeds sequentially across noise levels.
Normalising flows use invertible deterministic transforms. Flow matching closes the gap: the probability-flow ODE of a diffusion model defines a continuous normalising flow.
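The flow-matching connection in the last item can be made concrete with a toy sketch of straight-line paths between a noise sample and a data sample. The convention that time runs from noise at 0 to data at 1, the function names, and the oracle velocity field are all assumptions of this example. A trained model would regress a network onto `target_velocity` instead.

```python
import numpy as np

def interpolate(noise, data, t):
    """Point on the straight path from noise (t = 0) to data (t = 1)."""
    return (1.0 - t) * noise + t * data

def target_velocity(noise, data):
    """Regression target for the velocity field along the straight path."""
    return data - noise

def euler_sample(noise, velocity_fn, n_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with forward Euler."""
    x = noise.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

rng = np.random.default_rng(0)
noise = rng.standard_normal(8)
data = rng.standard_normal(8)
v = target_velocity(noise, data)

one_step = euler_sample(noise, lambda x, t: v, n_steps=1)    # oracle velocity
four_step = euler_sample(noise, lambda x, t: v, n_steps=4)
```

When the velocity field is exactly constant along a path, a single Euler step already lands on the data point, which is the intuition behind few-step sampling with straightened flows.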

References

[1] Sohl-Dickstein, Jascha, et al. (2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." Proceedings of the 32nd International Conference on Machine Learning.
[2] Ho, Jonathan; Jain, Ajay; Abbeel, Pieter (2020). "Denoising Diffusion Probabilistic Models." NeurIPS 2020. arXiv:2006.11239.
[3] Song, Yang; Ermon, Stefano (2019). "Generative Modeling by Estimating Gradients of the Data Distribution." NeurIPS 2019.
[4] Song, Yang, et al. (2021). "Score-Based Generative Modeling through Stochastic Differential Equations." ICLR 2021.
[5] Ho, Jonathan; Salimans, Tim (2021). "Classifier-Free Diffusion Guidance." NeurIPS 2021 Workshop on Deep Generative Models.
[6] Nichol, Alex, et al. (2021). "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models." arXiv:2112.10741.
[7] Rombach, Robin, et al. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR 2022.
[8] Peebles, William; Xie, Saining (2023). "Scalable Diffusion Models with Transformers." ICCV 2023.
[9] Song, Yang, et al. (2023). "Consistency Models." ICML 2023.
[10] Carlini, Nicholas, et al. (2023). "Extracting Training Data from Diffusion Models." USENIX Security 2023.