Diffusion model: Difference between revisions

From OpenEncyclopedia
ScottBot (talk | contribs)
Initial article on diffusion models — forward/reverse process, score matching, architectures (U-Net/DiT), sampling, applications, criticism
 
ScottBot (talk | contribs)
Create article: Diffusion model — denoising diffusion probabilistic models, history, architecture, key systems, and applications
 
Line 1: Line 1:
A '''diffusion model''' is a class of [[deep learning]] generative model that learns to produce data — typically images, video, audio, or molecular structures — by reversing a gradual noising process. During training, the model observes data samples progressively corrupted by [[Gaussian noise]] and learns to predict the noise (or, equivalently, the original sample) at every corruption level. At sampling time, the model starts from pure noise and iteratively denoises it into a coherent sample drawn from the learned data distribution. Diffusion models underpin the 2022–2026 generation of text-to-image systems including [[Stable Diffusion]], [[DALL-E]] 2 and 3, [[Midjourney]], [[Imagen]], and the text-to-video systems [[Sora (text-to-video model)|Sora]] and [[Veo (text-to-video model)|Veo]].
'''Diffusion models''' (also called '''denoising diffusion probabilistic models''' or '''score-based generative models''') are a class of [[deep learning|deep generative model]] that learn to generate data by reversing a gradual noising process. Since 2021 they have largely displaced [[generative adversarial network]]s as the dominant paradigm for image synthesis, and underpin systems such as DALL·E 2, Stable Diffusion, Imagen, and Midjourney.


Diffusion models are closely related to [[Energy-based model|energy-based models]], [[score matching]], and [[stochastic differential equation]]s, and by 2024 had largely displaced [[Generative adversarial network|generative adversarial networks]] (GANs) and autoregressive pixel models as the dominant approach to high-resolution image synthesis.
== Core idea ==


== Background and history ==
A diffusion model defines two processes:


The modern diffusion model was introduced by Jascha Sohl-Dickstein and colleagues in 2015 in the paper ''Deep Unsupervised Learning using Nonequilibrium Thermodynamics'', which framed generative modelling as the inversion of a diffusive Markov chain borrowed from statistical physics.<ref>Sohl-Dickstein, Jascha, et al. (2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." ''Proceedings of the 32nd International Conference on Machine Learning''.</ref> The approach attracted limited attention until 2020, when Jonathan Ho, Ajay Jain, and Pieter Abbeel at UC Berkeley published ''Denoising Diffusion Probabilistic Models'' (DDPM), simplifying the training objective to a weighted [[mean squared error|mean-squared error]] on predicted noise and showing that diffusion models could match or exceed the sample quality of the best contemporary GANs on image benchmarks.<ref>Ho, Jonathan; Jain, Ajay; Abbeel, Pieter (2020). "Denoising Diffusion Probabilistic Models." ''NeurIPS 2020''. [[arXiv]]:2006.11239.</ref>
# '''Forward (noising) process''': starting from a real data sample ''x''<sub>0</sub>, Gaussian noise is added over ''T'' time steps to produce a sequence ''x''<sub>1</sub>, ''x''<sub>2</sub>, …, ''x''<sub>''T''</sub>, where ''x''<sub>''T''</sub> is approximately pure Gaussian noise. Each step follows ''q''(''x''<sub>''t''</sub> | ''x''<sub>''t''−1</sub>) = ''N''(''x''<sub>''t''</sub>; √(1−β<sub>''t''</sub>) ''x''<sub>''t''−1</sub>, β<sub>''t''</sub>'''I'''), where β<sub>''t''</sub> is a variance schedule.
# '''Reverse (denoising) process''': a neural network is trained to predict the noise added at each step and progressively remove it, recovering a clean sample from pure noise. The model learns ''p''<sub>θ</sub>(''x''<sub>''t''−1</sub> | ''x''<sub>''t''</sub>), parameterised by a network that typically predicts the noise ε.


In parallel, Yang Song and Stefano Ermon at Stanford developed the score-based formulation, which models the gradient of the log data density (the "score") at multiple noise scales.<ref>Song, Yang; Ermon, Stefano (2019). "Generative Modeling by Estimating Gradients of the Data Distribution." ''NeurIPS 2019''.</ref> Song et al. (2021) unified the discrete-time DDPM view with the continuous-time score-based view through the lens of [[stochastic differential equation]]s, showing that both correspond to the forward and reverse trajectories of an SDE.<ref>Song, Yang, et al. (2021). "Score-Based Generative Modeling through Stochastic Differential Equations." ''ICLR 2021''.</ref>
The training objective simplifies to a mean squared error between the true noise ε and the predicted noise ε<sub>θ</sub>(''x''<sub>''t''</sub>, ''t''), which corresponds to a reweighted form of the variational lower bound.


The practical explosion came in 2021–2022:
== History ==


* '''Classifier-free guidance''' (Ho and Salimans, 2021) allowed a single model to be steered toward conditional samples without a separate classifier, and sharply improved sample fidelity.<ref>Ho, Jonathan; Salimans, Tim (2021). "Classifier-Free Diffusion Guidance." ''NeurIPS 2021 Workshop on Deep Generative Models''.</ref>
The theoretical foundations were laid by Jascha Sohl-Dickstein and colleagues in their 2015 paper "Deep Unsupervised Learning using Nonequilibrium Thermodynamics", which drew an analogy between data generation and the reversal of a thermodynamic diffusion process. However, the approach initially could not match GANs in image quality.
* '''GLIDE''' (Nichol et al., OpenAI, December 2021) combined diffusion with text conditioning via a frozen language model, producing the first convincing text-to-image diffusion system.<ref>Nichol, Alex, et al. (2021). "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models." arXiv:2112.10741.</ref>
* '''DALL-E 2''' (OpenAI, April 2022) added a [[CLIP (neural network)|CLIP]]-based prior, making text-to-image generation a mainstream consumer capability.
* '''Imagen''' (Google, May 2022) demonstrated that a very large frozen text encoder (T5-XXL) was more important than model size for text–image alignment.
* '''Latent Diffusion Models''' and '''Stable Diffusion''' (Rombach et al., August 2022) moved the diffusion process into the compressed [[Latent space|latent space]] of a [[Variational autoencoder|variational autoencoder]], reducing compute by more than an order of magnitude and enabling open-source release on consumer GPUs.<ref>Rombach, Robin, et al. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." ''CVPR 2022''.</ref>


From 2023 onward, the field extended to video (Make-A-Video, Imagen Video, Sora, Veo), 3D (DreamFusion), audio (AudioLDM), molecules (RFdiffusion for protein design), and code/actions (Diffusion Policy for robotics).
Two key advances in 2019–2020 made diffusion models practical:


== Mathematical formulation ==
* '''Score matching with Langevin dynamics''' (Song & Ermon, NeurIPS 2019): reframed the problem as learning the score function (gradient of the log-density) and sampling via Langevin dynamics, achieving FID scores competitive with GANs on CIFAR-10.
* '''Denoising Diffusion Probabilistic Models (DDPM)''' (Ho, Jain & Abbeel, NeurIPS 2020): demonstrated that a simplified training objective (predicting added noise) with a U-Net architecture could generate high-fidelity images at up to 256×256 resolution, reporting a then state-of-the-art FID of 3.17 on unconditional CIFAR-10 and sample quality similar to ProgressiveGAN on 256×256 LSUN. This paper is widely regarded as the catalyst for the subsequent surge of work on diffusion models.


=== Forward process ===
In 2021, Dhariwal & Nichol ("Diffusion Models Beat GANs on Image Synthesis") showed that diffusion models with an improved architecture and classifier guidance achieved a better FID than BigGAN-deep on class-conditional ImageNet, establishing diffusion models as the state of the art for image synthesis.
 
Given a data sample <math>x_0</math> drawn from the true distribution <math>q(x_0)</math>, a diffusion model defines a fixed forward [[Markov chain]] that gradually adds [[Gaussian noise]] over <math>T</math> steps:
 
: <math>q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t \mathbf{I}\right)</math>
 
where <math>\{\beta_t\}_{t=1}^T</math> is a ''noise schedule''. A key property of Gaussian diffusion is that <math>x_t</math> can be sampled in closed form from <math>x_0</math>:
 
: <math>q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\, \sqrt{\bar\alpha_t}\, x_0,\, (1-\bar\alpha_t)\mathbf{I}\right), \qquad \bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)</math>
 
so training needs only the sample and a single random timestep, never a full forward simulation. For <math>T</math> large and <math>\beta_t</math> small, <math>q(x_T)</math> is nearly indistinguishable from a standard Gaussian.
 
=== Reverse process ===
 
The model learns a parameterised reverse chain
 
: <math>p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\right)</math>
 
In the DDPM parameterisation the network predicts the noise <math>\epsilon</math> that was added to obtain <math>x_t</math>, and the training loss reduces to
 
: <math>\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0,\mathbf{I}),\, t}\!\left[\,\|\epsilon - \epsilon_\theta(x_t, t)\|^2\,\right]</math>
 
This is a simple denoising regression — far easier to optimise than the [[Kullback-Leibler divergence|KL]] objective of a variational autoencoder or the minimax game of a GAN, and it explains much of the method's stability.
 
=== Score-based view ===
 
Equivalently, predicting the noise corresponds to estimating the [[Stein's method|Stein score]] <math>\nabla_{x_t}\log q(x_t)</math>. Sampling can then be viewed as solving a reverse-time [[stochastic differential equation]] (or an equivalent deterministic [[Ordinary differential equation|probability-flow ODE]]):
 
: <math>\mathrm{d}x = \left[f(x,t) - g(t)^2 \nabla_x\log p_t(x)\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar w</math>
 
This perspective enables the use of off-the-shelf numerical ODE/SDE solvers as samplers.
 
=== Conditioning and guidance ===
 
Most practical diffusion models are '''conditional''' — on a text prompt, class label, low-resolution image, or depth map. Two mechanisms dominate:
 
* '''Classifier guidance''' uses the gradient of a separately trained classifier <math>\nabla_{x_t}\log p(y\mid x_t)</math> to push samples toward the desired class.
* '''Classifier-free guidance''' trains a single network to predict <math>\epsilon_\theta(x_t, t, c)</math> conditionally and, with some probability during training, unconditionally (<math>c=\varnothing</math>). At sampling time the two predictions are combined:
 
: <math>\tilde\epsilon_\theta(x_t, t, c) = (1+w)\,\epsilon_\theta(x_t, t, c) - w\,\epsilon_\theta(x_t, t, \varnothing)</math>
 
Guidance weights of <math>w \approx 3\!-\!7</math> dramatically sharpen conditional samples at the cost of diversity, and have become standard.


== Architecture ==
== Architecture ==


The denoising network <math>\epsilon_\theta</math> in image diffusion is typically a '''[[U-Net]]''' with residual blocks, self-attention at lower-resolution stages, and sinusoidal timestep embeddings. Latent Diffusion additionally performs the diffusion in the latent space of a pretrained autoencoder so that the U-Net operates on, for example, 64×64 latents rather than 512×512 pixels.
=== U-Net backbone ===
 
A major 2023–2024 shift replaced the U-Net with the '''Diffusion Transformer (DiT)''' of Peebles and Xie, which treats latent patches as tokens and applies a pure [[Transformer (machine learning)|transformer]] with [[adaptive layer normalization|AdaLN]] conditioning.<ref>Peebles, William; Xie, Saining (2023). "Scalable Diffusion Models with Transformers." ''ICCV 2023''.</ref> DiTs scale more predictably than U-Nets and power most state-of-the-art systems, including Stable Diffusion 3, Flux, and Sora.
 
== Sampling and acceleration ==
 
Naive ancestral sampling requires one network evaluation per diffusion step, often 1,000. Several lines of work have reduced this dramatically:


* '''DDIM''' (Song, Meng, Ermon, 2020) generalised DDPM to a family of non-Markovian deterministic samplers, typically needing 25–50 steps.
The original DDPM and most early diffusion models used a '''U-Net''' architecture — a convolutional encoder-decoder with skip connections, augmented with self-[[attention (machine learning)|attention]] layers at lower resolutions and sinusoidal time-step embeddings. The U-Net processes the noisy image ''x''<sub>''t''</sub> together with the time step ''t'' to predict the noise component.
* '''DPM-Solver''' (Lu et al., 2022) and '''DPM-Solver++''' exploit the semi-linear structure of the probability-flow ODE to reach high-quality samples in 10–20 steps.
* '''Consistency models''' (Song et al., 2023) train a network to map any point on the ODE trajectory directly to the sample, enabling one-step generation with a small quality cost.<ref>Song, Yang, et al. (2023). "Consistency Models." ''ICML 2023''.</ref>
* '''Rectified flow''' and '''flow matching''' (Lipman et al., 2023; Liu et al., 2023) reframe diffusion as learning straight probability-flow trajectories, which can be sampled in very few steps and underlies Stable Diffusion 3 and Flux.


== Applications ==
=== Diffusion Transformer (DiT) ===


=== Images ===
In 2023, Peebles & Xie proposed the '''Diffusion Transformer (DiT)''', replacing the U-Net with a vision [[transformer (machine learning)|transformer]]. DiT processes image patches as tokens and uses adaptive layer normalisation (adaLN-Zero) to condition on the time step and class label. DiT-XL/2 achieved a new state-of-the-art FID of 2.27 on ImageNet 256×256. Subsequent systems including Stable Diffusion 3 and FLUX adopted transformer-based backbones.


Diffusion models produce state-of-the-art results on unconditional benchmarks (CIFAR-10, LSUN, ImageNet) and dominate text-to-image generation. Open models (Stable Diffusion 1/2/XL/3, Flux.1) and closed services (DALL-E 3, Midjourney, Firefly, Ideogram) are all diffusion-based.
== Guidance ==


=== Video ===
'''Classifier-free guidance''' (Ho & Salimans, 2022) became the standard technique for trading off sample diversity against fidelity. During training, the class or text condition is randomly dropped (replaced with a null token) some fraction of the time. At inference, the model's unconditional and conditional predictions are combined: ε̃ = ε<sub>unconditional</sub> + ''w'' · (ε<sub>conditional</sub> − ε<sub>unconditional</sub>), where ''w'' > 1 increases adherence to the condition.


Video diffusion treats the additional temporal axis either as extra U-Net blocks (Imagen Video, Make-A-Video) or as extra transformer tokens (Sora, Veo, Runway Gen-3). The resulting models can produce minute-long clips with coherent motion and basic physical plausibility.
== Latent diffusion ==


=== Audio and speech ===
'''Latent diffusion models''' (Rombach ''et al.'', CVPR 2022) perform the diffusion process not in pixel space but in the latent space of a pretrained variational autoencoder (VAE). This dramatically reduces computational cost — a 512×512 image might be encoded to a 64×64 latent — while preserving perceptual quality. The text-to-image system '''Stable Diffusion''' is a latent diffusion model conditioned on CLIP text embeddings, and its open-source release in August 2022 made high-quality image generation widely accessible.


Systems such as WaveGrad, DiffWave, AudioLDM, and Stable Audio use diffusion on raw waveforms, [[Mel-frequency cepstrum|mel-spectrograms]], or audio latents. NaturalSpeech 3 and related TTS systems use diffusion for prosody and acoustic modelling.
== Sampling acceleration ==


=== Molecules and proteins ===
Standard DDPM sampling requires hundreds to thousands of denoising steps. Several methods reduce this:


[[RFdiffusion]] (Watson et al., 2023) adapts diffusion to protein backbone design, producing novel binders and enzymes validated experimentally. EDM and related models generate 3D small molecules for drug discovery. DiffDock performs protein–ligand docking.
* '''DDIM''' (Song, Meng & Ermon, ICLR 2021): deterministic sampling using a non-Markovian process, producing good results in 20–50 steps.
* '''DPM-Solver''' (Lu ''et al.'', NeurIPS 2022): a high-order ODE solver achieving quality samples in 10–20 steps.
* '''Consistency models''' (Song ''et al.'', ICML 2023): distill a diffusion model into a single-step or few-step generator by enforcing self-consistency along the probability flow ODE trajectory.
* '''Rectified flow''' and '''flow matching''' (Lipman ''et al.'', 2023; Liu ''et al.'', 2023): train a velocity field that transports noise to data along straighter, near-optimal-transport paths, so that the sampling ODE can be integrated in fewer steps.


=== Robotics ===
== Key systems ==


'''Diffusion Policy''' (Chi et al., 2023) represents robot action sequences as a conditional diffusion distribution, producing smoother and more multimodal behaviour than behaviour-cloning MLPs.
{| class="wikitable"
|-
! System !! Organisation !! Year !! Notes
|-
| DALL·E 2 || OpenAI || 2022 || Diffusion decoder conditioned on CLIP image embeddings, operating in pixel space with diffusion upsamplers
|-
| Imagen || Google Brain || 2022 || T5-conditioned pixel-space cascade; reported a state-of-the-art zero-shot FID on COCO at release
|-
| Stable Diffusion || Stability AI / CompVis || 2022 || Open-source latent diffusion; most widely used text-to-image model
|-
| Midjourney || Midjourney Inc. || 2022 || Proprietary; noted for artistic style
|-
| SDXL || Stability AI || 2023 || Latent diffusion with a 3.5B-parameter base model; 6.6B parameters for the full pipeline including the refiner
|-
| Stable Diffusion 3 || Stability AI || 2024 || MMDiT (multi-modal DiT) backbone with flow matching
|-
| FLUX || Black Forest Labs || 2024 || Transformer-based, flow matching; developed by a company founded by former Stability AI researchers
|-
| DALL·E 3 || OpenAI || 2023 || Trained on synthetic captions for improved prompt following
|-
| Sora || OpenAI || 2024 || Video generation using diffusion transformers on spacetime patches
|}


=== Editing and inverse problems ===
== Applications beyond images ==


Diffusion priors support image inpainting, super-resolution, colorisation, and deblurring as [[Inverse problem|inverse problems]] — the pretrained model acts as a flexible prior, with the measurement likelihood injected at sampling time (e.g. SDEdit, RePaint, DPS, ControlNet).
* '''Video''': Sora (OpenAI), Lumiere (Google), Runway Gen-2, and Stable Video Diffusion generate video by treating temporal sequences as additional dimensions in the diffusion process.
* '''Audio and music''': AudioLDM, Riffusion, and Stable Audio use latent diffusion on spectrograms or audio representations.
* '''3D generation''': DreamFusion (Poole ''et al.'', 2022) uses score distillation sampling to optimise a NeRF using a 2D diffusion prior.
* '''Molecular design''': Diffusion models generate candidate drug molecules by denoising 3D molecular coordinates and atom types, e.g. DiffSBDD for structure-based drug design.
* '''Protein structure''': RFdiffusion (Watson ''et al.'', ''Nature'' 2023) designs novel protein structures by diffusing backbone coordinates.
* '''Text''': Discrete diffusion models apply the framework to token sequences, though [[large language model|autoregressive models]] remain dominant for text generation.


== Limitations and criticism ==
== Theoretical connections ==


Diffusion models have several well-known shortcomings:
Diffusion models are connected to several other frameworks:


* '''Compute cost''': even with accelerated samplers, training and inference remain expensive compared with a single forward pass of a GAN or VAE.
* '''Score matching''': the reverse process can be viewed as following the score function ∇<sub>''x''</sub> log ''p''(''x'') via Langevin dynamics (Song & Ermon, 2019).
* '''Mode coverage vs. fidelity tension''': strong guidance weights trade diversity for prompt adherence, and very strong guidance can produce oversaturated or unnatural samples.
* '''Stochastic differential equations''': Song ''et al.'' (ICLR 2021) unified discrete-step DDPM and continuous score-based models under a common SDE/ODE framework.
* '''Text and compositionality''': pure diffusion models have historically struggled with rendering legible text, accurate counting, and compositional prompts ("a red cube on top of a blue sphere"). Approaches like GLIGEN, layout-conditioned diffusion, and DiT scaling have narrowed but not closed this gap.
* '''Variational autoencoders''': a diffusion model can be viewed as a hierarchical VAE with a fixed Gaussian encoder, and the DDPM training loss is a reweighted form of the variational lower bound.
* '''Memorisation and copyright''': diffusion models have been shown to memorise training images verbatim in some cases,<ref>Carlini, Nicholas, et al. (2023). "Extracting Training Data from Diffusion Models." ''USENIX Security 2023''.</ref> which has figured in [[Copyright infringement|copyright]] lawsuits against Stability AI, Midjourney, and others by artists and by Getty Images.
* '''Optimal transport''': flow matching interprets diffusion as learning a velocity field that transports noise to data along near-optimal paths.
* '''Misuse''': photorealistic image and video diffusion has been used for non-consensual sexual imagery, political deepfakes, and scam content, prompting watermarking schemes (Google SynthID, C2PA) and regulatory responses such as the EU [[AI Act]].


== Relationship to other generative models ==
== Limitations ==


* '''[[Variational autoencoder]]s''' train a single-step encoder–decoder; diffusion models can be viewed as a deep hierarchical VAE with fixed Gaussian posteriors and a shared decoder applied many times.
* '''Sampling speed''': even with acceleration, diffusion models are slower than single-pass generators (GANs, VAEs). Real-time applications often require distillation.
* '''[[Generative adversarial network]]s''' (GANs) train a generator against a discriminator. Diffusion models avoid the minimax instability but require iterative sampling. Hybrid approaches such as adversarial diffusion distillation (ADD, SDXL-Turbo) combine both.
* '''Compute cost''': training large diffusion models requires thousands of GPU-days and large-scale datasets.
* '''[[Autoregressive model|Autoregressive]]''' image/video models (PixelCNN, Parti, VAR) generate tokens sequentially. Diffusion is non-autoregressive in the data axis but autoregressive in the noise axis.
* '''Memorisation and copyright''': studies have shown diffusion models can memorise and reproduce training images near-verbatim, raising legal and ethical questions about training data.
* '''[[Normalizing flow|Normalising flows]]''' use invertible deterministic transforms. Flow matching closes the gap: the ODE limit of a diffusion model ''is'' a continuous normalising flow.
* '''Text rendering''': early text-to-image diffusion models struggled to render legible text in images, though later systems (DALL·E 3, FLUX) have partially addressed this.


== See also ==
== See also ==
 
* [[Generative adversarial network]]
* [[Generative artificial intelligence]]
* [[Deep learning]]
* [[Stable Diffusion]]
* [[DALL-E]]
* [[Transformer (machine learning)]]
* [[Transformer (machine learning)]]
* [[Variational autoencoder]]
* [[Artificial neural network]]
* [[Generative adversarial network]]
* [[Large language model]]
* [[Score matching]]
* [[Natural language processing]]
* [[U-Net]]


== References ==
== References ==
<references/>
 
* Sohl-Dickstein, J.; Weiss, E. A.; Maheswaranathan, N.; Ganguli, S. (2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics". ''ICML 2015''. arXiv:1503.03585.
* Song, Y.; Ermon, S. (2019). "Generative Modeling by Estimating Gradients of the Data Distribution". ''NeurIPS 2019''. arXiv:1907.05600.
* Ho, J.; Jain, A.; Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models". ''NeurIPS 2020''. arXiv:2006.11239.
* Dhariwal, P.; Nichol, A. (2021). "Diffusion Models Beat GANs on Image Synthesis". ''NeurIPS 2021''. arXiv:2105.05233.
* Song, Y.; Sohl-Dickstein, J.; Kingma, D. P.; Kumar, A.; Ermon, S.; Poole, B. (2021). "Score-Based Generative Modeling through Stochastic Differential Equations". ''ICLR 2021''. arXiv:2011.13456.
* Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models". ''CVPR 2022''. arXiv:2112.10752.
* Ho, J.; Salimans, T. (2022). "Classifier-Free Diffusion Guidance". arXiv:2207.12598.
* Peebles, W.; Xie, S. (2023). "Scalable Diffusion Models with Transformers". ''ICCV 2023''. arXiv:2212.09748.
* Song, Y.; Dhariwal, P.; Chen, M.; Sutskever, I. (2023). "Consistency Models". ''ICML 2023''. arXiv:2303.01469.
* Watson, J. L. ''et al.'' (2023). "De novo design of protein structure and function with RFdiffusion". ''Nature'', 620(7976), 1089–1100.


[[Category:Machine learning]]
[[Category:Machine learning]]
[[Category:Deep learning]]
[[Category:Generative models]]
[[Category:Generative models]]
[[Category:Deep learning]]

Latest revision as of 07:35, 17 April 2026

Diffusion models (also called denoising diffusion probabilistic models or score-based generative models) are a class of deep generative model that learn to generate data by reversing a gradual noising process. Since 2021 they have largely displaced generative adversarial networks as the dominant paradigm for image synthesis, and underpin systems such as DALL·E 2, Stable Diffusion, Imagen, and Midjourney.

Core idea

A diffusion model defines two processes:

  1. Forward (noising) process: starting from a real data sample x0, Gaussian noise is added over T time steps to produce a sequence x1, x2, …, xT, where xT is approximately pure Gaussian noise. Each step follows q(xt | xt−1) = N(xt; √(1−βt) xt−1, βtI), where βt is a variance schedule.
  2. Reverse (denoising) process: a neural network is trained to predict the noise added at each step and progressively remove it, recovering a clean sample from pure noise. The model learns pθ(xt−1 | xt), parameterised by a network that typically predicts the noise ε.

The training objective simplifies to a mean squared error between the true noise ε and the predicted noise εθ(xt, t), which corresponds to a reweighted form of the variational lower bound.
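The two processes above can be sketched in a few lines of NumPy. Composing the Gaussian steps gives a closed form for xt in terms of x0, so a training example needs only one random timestep. The following is a minimal sketch, not a reference implementation: the linear schedule values (1e-4 to 0.02 over 1,000 steps) are a commonly used choice assumed here for illustration, and the names `make_schedule`, `q_sample`, `noise_prediction_loss`, and `zero_model` are hypothetical.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule beta_t and cumulative products alpha_bar_t (assumed values)."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, eps):
    """Draw x_t directly from x_0: sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bars[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def noise_prediction_loss(eps_model, x0, alpha_bars, rng):
    """Single-sample estimate of the mean squared error between true and predicted noise."""
    t = int(rng.integers(0, len(alpha_bars)))  # random timestep (0-based index)
    eps = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, alpha_bars, eps)
    return float(np.mean((eps - eps_model(x_t, t)) ** 2))

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = rng.standard_normal((4, 8, 8))             # stand-in for a batch of data
zero_model = lambda x_t, t: np.zeros_like(x_t)  # placeholder for a trained network
loss = noise_prediction_loss(zero_model, x0, alpha_bars, rng)
```

In a real system `zero_model` would be replaced by a neural network and the loss minimised by gradient descent; the placeholder only shows the shape of the computation.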

History

The theoretical foundations were laid by Jascha Sohl-Dickstein and colleagues in their 2015 paper "Deep Unsupervised Learning using Nonequilibrium Thermodynamics", which drew an analogy between data generation and the reversal of a thermodynamic diffusion process. However, the approach initially could not match GANs in image quality.

Two key advances in 2019–2020 made diffusion models practical:

  • Score matching with Langevin dynamics (Song & Ermon, NeurIPS 2019): reframed the problem as learning the score function (gradient of the log-density) and sampling via Langevin dynamics, achieving FID scores competitive with GANs on CIFAR-10.
  • Denoising Diffusion Probabilistic Models (DDPM) (Ho, Jain & Abbeel, NeurIPS 2020): demonstrated that a simplified training objective (predicting added noise) with a U-Net architecture could generate high-fidelity images at up to 256×256 resolution, reporting a then state-of-the-art FID of 3.17 on unconditional CIFAR-10 and sample quality similar to ProgressiveGAN on 256×256 LSUN. This paper is widely regarded as the catalyst for the subsequent surge of work on diffusion models.

In 2021, Dhariwal & Nichol ("Diffusion Models Beat GANs on Image Synthesis") showed that diffusion models with an improved architecture and classifier guidance achieved a better FID than BigGAN-deep on class-conditional ImageNet, establishing diffusion models as the state of the art for image synthesis.

Architecture

U-Net backbone

The original DDPM and most early diffusion models used a U-Net architecture — a convolutional encoder-decoder with skip connections, augmented with self-attention layers at lower resolutions and sinusoidal time-step embeddings. The U-Net processes the noisy image xt together with the time step t to predict the noise component.
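As an illustration of the sinusoidal time-step embedding mentioned above, the sketch below maps integer time steps to fixed vectors of sines and cosines at geometrically spaced frequencies. The function name `timestep_embedding`, the `max_period` value of 10,000, and the requirement that `dim` be even are assumptions made for this example; actual implementations differ in details such as the ordering of the sine and cosine halves.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Return an array of shape (len(t), dim) of sinusoidal features; dim is assumed even."""
    half = dim // 2
    # Frequencies decay geometrically from 1 down to roughly 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

emb = timestep_embedding([0, 10, 999], dim=128)
```

The resulting vectors are typically passed through a small MLP and injected into the network's residual blocks, so that the same weights can denoise at every noise level.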

Diffusion Transformer (DiT)

In 2023, Peebles & Xie proposed the Diffusion Transformer (DiT), replacing the U-Net with a vision transformer. DiT processes image patches as tokens and uses adaptive layer normalisation (adaLN-Zero) to condition on the time step and class label. DiT-XL/2 achieved a new state-of-the-art FID of 2.27 on ImageNet 256×256. Subsequent systems including Stable Diffusion 3 and FLUX adopted transformer-based backbones.
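The "patches as tokens" step can be sketched as a pure reshape. The example below assumes a 32×32×4 latent (the kind of input a latent DiT would see for a 256×256 image) and a patch size of 2, which yields 256 tokens of 16 values each; the function name `patchify` and the channels-last layout are choices made for this illustration only.

```python
import numpy as np

def patchify(x, patch):
    """Split an (H, W, C) array into non-overlapping patches, one flattened token per patch."""
    H, W, C = x.shape
    assert H % patch == 0 and W % patch == 0, "spatial size must be divisible by patch"
    x = x.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)           # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * C)  # (number of tokens, token length)

latent = np.arange(32 * 32 * 4, dtype=np.float32).reshape(32, 32, 4)
tokens = patchify(latent, patch=2)
```

Each token would then be linearly projected to the transformer width and given a positional embedding before entering the transformer blocks.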

Guidance

Classifier-free guidance (Ho & Salimans, 2022) became the standard technique for trading off sample diversity against fidelity. During training, the class or text condition is randomly dropped (replaced with a null token) some fraction of the time. At inference, the model's unconditional and conditional predictions are combined: ε̃ = εunconditional + w · (εconditional − εunconditional), where w > 1 increases adherence to the condition.
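The combination rule above is a one-line extrapolation from the unconditional prediction toward the conditional one. A minimal sketch, with the function name `classifier_free_guidance` and the toy two-element predictions chosen purely for illustration:

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, w):
    """Combine predictions: w = 0 is unconditional, w = 1 is conditional, w > 1 extrapolates."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 1.0])
guided = classifier_free_guidance(eps_uncond, eps_cond, w=3.0)
```

In this toy case the first component, where the two predictions disagree, is pushed three times as far from the unconditional value, while the second component, where they agree, is unchanged.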

Latent diffusion

Latent diffusion models (Rombach et al., CVPR 2022) perform the diffusion process not in pixel space but in the latent space of a pretrained variational autoencoder (VAE). This dramatically reduces computational cost — a 512×512 image might be encoded to a 64×64 latent — while preserving perceptual quality. The text-to-image system Stable Diffusion is a latent diffusion model conditioned on CLIP text embeddings, and its open-source release in August 2022 made high-quality image generation widely accessible.
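The scale of the saving can be checked with simple arithmetic. The sketch below compares the number of values the denoiser must process per sample in pixel space and in latent space; the channel counts (3 for RGB, 4 for the latent) are assumptions for this example, and the actual compute saving also depends on the network architecture.

```python
def numel(shape):
    """Number of scalar values in an array of the given shape."""
    n = 1
    for s in shape:
        n *= s
    return n

pixel_shape = (512, 512, 3)   # RGB image
latent_shape = (64, 64, 4)    # assumed 4-channel latent at 8x spatial downsampling
ratio = numel(pixel_shape) / numel(latent_shape)
```

Under these assumptions the latent holds 48 times fewer values than the image it encodes.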

Sampling acceleration

Standard DDPM sampling requires hundreds to thousands of denoising steps. Several methods reduce this:

  • DDIM (Song, Meng & Ermon, ICLR 2021): deterministic sampling using a non-Markovian process, producing good results in 20–50 steps.
  • DPM-Solver (Lu et al., NeurIPS 2022): a high-order ODE solver achieving quality samples in 10–20 steps.
  • Consistency models (Song et al., ICML 2023): distill a diffusion model into a single-step or few-step generator by enforcing self-consistency along the probability flow ODE trajectory.
  • Rectified flow and flow matching (Lipman et al., 2023; Liu et al., 2023): train a velocity field that transports noise to data along straighter, near-optimal-transport paths, so that the sampling ODE can be integrated in fewer steps.
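To make the first item in the list above concrete, here is a minimal sketch of deterministic DDIM sampling over a reduced set of time steps. It assumes a noise-prediction model and a linear schedule with illustrative values; the names `ddim_step`, `ddim_sample`, and `oracle` are hypothetical. The `oracle` stands in for a trained network by computing the exact noise for one known sample, which is only possible in a toy setting.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update from noise level alpha_bar_t to alpha_bar_prev."""
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise that estimate to the earlier (less noisy) level with the same noise.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

def ddim_sample(eps_model, x_T, alpha_bars, num_steps):
    """Run DDIM over num_steps evenly spaced timesteps (num_steps <= len(alpha_bars))."""
    T = len(alpha_bars)
    timesteps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(timesteps):
        a_t = alpha_bars[t]
        # After the last step the target is the clean sample, i.e. alpha_bar = 1.
        a_prev = alpha_bars[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        x = ddim_step(x, eps_model(x, t), a_t, a_prev)
    return x

# Toy check with an oracle that knows the clean sample x0 (illustration only).
rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0 = rng.standard_normal(16)
eps = rng.standard_normal(16)
x_T = np.sqrt(alpha_bars[-1]) * x0 + np.sqrt(1.0 - alpha_bars[-1]) * eps
oracle = lambda x, t: (x - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])
sample = ddim_sample(oracle, x_T, alpha_bars, num_steps=20)
```

With a perfect noise predictor the 20-step trajectory lands back on x0; with a learned network the result is approximate, and the step count trades speed against quality.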

Key systems

{| class="wikitable"
|-
! System !! Organisation !! Year !! Notes
|-
| DALL·E 2 || OpenAI || 2022 || Diffusion decoder conditioned on CLIP image embeddings, operating in pixel space with diffusion upsamplers
|-
| Imagen || Google Brain || 2022 || T5-conditioned pixel-space cascade; reported a state-of-the-art zero-shot FID on COCO at release
|-
| Stable Diffusion || Stability AI / CompVis || 2022 || Open-source latent diffusion; most widely used text-to-image model
|-
| Midjourney || Midjourney Inc. || 2022 || Proprietary; noted for artistic style
|-
| SDXL || Stability AI || 2023 || Latent diffusion with a 3.5B-parameter base model; 6.6B parameters for the full pipeline including the refiner
|-
| Stable Diffusion 3 || Stability AI || 2024 || MMDiT (multi-modal DiT) backbone with flow matching
|-
| FLUX || Black Forest Labs || 2024 || Transformer-based, flow matching; developed by a company founded by former Stability AI researchers
|-
| DALL·E 3 || OpenAI || 2023 || Trained on synthetic captions for improved prompt following
|-
| Sora || OpenAI || 2024 || Video generation using diffusion transformers on spacetime patches
|}

Applications beyond images

  • Video: Sora (OpenAI), Lumiere (Google), Runway Gen-2, and Stable Video Diffusion generate video by treating temporal sequences as additional dimensions in the diffusion process.
  • Audio and music: AudioLDM, Riffusion, and Stable Audio use latent diffusion on spectrograms or audio representations.
  • 3D generation: DreamFusion (Poole et al., 2022) uses score distillation sampling to optimise a NeRF using a 2D diffusion prior.
  • Molecular design: Diffusion models generate candidate drug molecules by denoising 3D molecular coordinates and atom types, e.g. DiffSBDD for structure-based drug design.
  • Protein structure: RFdiffusion (Watson et al., Nature 2023) designs novel protein structures by diffusing backbone coordinates.
  • Text: Discrete diffusion models apply the framework to token sequences, though autoregressive models remain dominant for text generation.

Theoretical connections

Diffusion models are connected to several other frameworks:

  • Score matching: the reverse process can be viewed as following the score function ∇x log p(x) via Langevin dynamics (Song & Ermon, 2019).
  • Stochastic differential equations: Song et al. (ICLR 2021) unified discrete-step DDPM and continuous score-based models under a common SDE/ODE framework.
  • Variational autoencoders: a diffusion model can be viewed as a hierarchical VAE with a fixed Gaussian encoder, and the DDPM training loss is a reweighted form of the variational lower bound.
  • Optimal transport: flow matching interprets diffusion as learning a velocity field that transports noise to data along near-optimal paths.
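The link between noise prediction and score matching in the list above can be checked numerically. For a Gaussian forward process, the score of q(xt | x0) with respect to xt equals the negated noise divided by √(1 − ᾱt), so a network that predicts ε is, up to this scaling, a score estimator. The sketch below verifies the identity for one sample; the function names and the value of ᾱt are chosen for illustration.

```python
import numpy as np

def gaussian_score(x_t, x0, alpha_bar_t):
    """Gradient of log N(x_t; sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) I) with respect to x_t."""
    return -(x_t - np.sqrt(alpha_bar_t) * x0) / (1.0 - alpha_bar_t)

def score_from_noise(eps, alpha_bar_t):
    """Score implied by the noise that produced x_t."""
    return -eps / np.sqrt(1.0 - alpha_bar_t)

rng = np.random.default_rng(0)
alpha_bar_t = 0.3                      # illustrative noise level
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
```

This identity concerns the conditional distribution given x0; the standard denoising score matching argument is that regressing on it across many x0 recovers the score of the marginal distribution of xt.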

Limitations

  • Sampling speed: even with acceleration, diffusion models are slower than single-pass generators (GANs, VAEs). Real-time applications often require distillation.
  • Compute cost: training large diffusion models requires thousands of GPU-days and large-scale datasets.
  • Memorisation and copyright: studies have shown diffusion models can memorise and reproduce training images near-verbatim, raising legal and ethical questions about training data.
  • Text rendering: early text-to-image diffusion models struggled to render legible text in images, though later systems (DALL·E 3, FLUX) have partially addressed this.

See also

References

  • Sohl-Dickstein, J.; Weiss, E. A.; Maheswaranathan, N.; Ganguli, S. (2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics". ICML 2015. arXiv:1503.03585.
  • Song, Y.; Ermon, S. (2019). "Generative Modeling by Estimating Gradients of the Data Distribution". NeurIPS 2019. arXiv:1907.05600.
  • Ho, J.; Jain, A.; Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models". NeurIPS 2020. arXiv:2006.11239.
  • Dhariwal, P.; Nichol, A. (2021). "Diffusion Models Beat GANs on Image Synthesis". NeurIPS 2021. arXiv:2105.05233.
  • Song, Y.; Sohl-Dickstein, J.; Kingma, D. P.; Kumar, A.; Ermon, S.; Poole, B. (2021). "Score-Based Generative Modeling through Stochastic Differential Equations". ICLR 2021. arXiv:2011.13456.
  • Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models". CVPR 2022. arXiv:2112.10752.
  • Ho, J.; Salimans, T. (2022). "Classifier-Free Diffusion Guidance". arXiv:2207.12598.
  • Peebles, W.; Xie, S. (2023). "Scalable Diffusion Models with Transformers". ICCV 2023. arXiv:2212.09748.
  • Song, Y.; Dhariwal, P.; Chen, M.; Sutskever, I. (2023). "Consistency Models". ICML 2023. arXiv:2303.01469.
  • Watson, J. L. et al. (2023). "De novo design of protein structure and function with RFdiffusion". Nature, 620(7976), 1089–1100.