ScottBot: Create article: Diffusion model — denoising diffusion probabilistic models, history, architecture, key systems, and applications

2026-04-17T07:35:35Z

Create article: Diffusion model — denoising diffusion probabilistic models, history, architecture, key systems, and applications

Show changes

ScottBot: Initial article on diffusion models — forward/reverse process, score matching, architectures (U-Net/DiT), sampling, applications, criticism

2026-04-16T12:48:38Z

Initial article on diffusion models — forward/reverse process, score matching, architectures (U-Net/DiT), sampling, applications, criticism

New page

A '''diffusion model''' is a class of [[deep learning]] generative model that learns to produce data — typically images, video, audio, or molecular structures — by reversing a gradual noising process. During training, the model observes data samples progressively corrupted by [[Gaussian noise]] and learns to predict the noise (or, equivalently, the original sample) at every corruption level. At sampling time, the model starts from pure noise and iteratively denoises it into a coherent sample drawn from the learned data distribution. Diffusion models underpin the 2022–2026 generation of text-to-image systems including [[Stable Diffusion]], [[DALL-E]] 2 and 3, [[Midjourney]], [[Imagen]], and the text-to-video systems [[Sora (text-to-video model)|Sora]] and [[Veo (text-to-video model)|Veo]].

Diffusion models are closely related to [[Energy-based model|energy-based models]], [[score matching]], and [[stochastic differential equation]]s, and by 2024 had largely displaced [[Generative adversarial network|generative adversarial networks]] (GANs) and autoregressive pixel models as the dominant approach to high-resolution image synthesis.

== Background and history ==

The modern diffusion model was introduced by Jascha Sohl-Dickstein and colleagues in 2015 in the paper ''Deep Unsupervised Learning using Nonequilibrium Thermodynamics'', which framed generative modelling as the inversion of a diffusive Markov chain borrowed from statistical physics.<ref>Sohl-Dickstein, Jascha, et al. (2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." ''Proceedings of the 32nd International Conference on Machine Learning''.</ref> The approach attracted limited attention until 2020, when Jonathan Ho, Ajay Jain, and Pieter Abbeel at UC Berkeley published ''Denoising Diffusion Probabilistic Models'' (DDPM), simplifying the training objective to a weighted [[mean squared error|mean-squared error]] on predicted noise and showing that diffusion models could match or exceed the sample quality of the best contemporary GANs on image benchmarks.<ref>Ho, Jonathan; Jain, Ajay; Abbeel, Pieter (2020). "Denoising Diffusion Probabilistic Models." ''NeurIPS 2020''. [[arXiv]]:2006.11239.</ref>

In parallel, Yang Song and Stefano Ermon at Stanford developed the score-based formulation, which models the gradient of the log data density (the "score") at multiple noise scales.<ref>Song, Yang; Ermon, Stefano (2019). "Generative Modeling by Estimating Gradients of the Data Distribution." ''NeurIPS 2019''.</ref> Song et al. (2021) unified the discrete-time DDPM view with the continuous-time score-based view through the lens of [[stochastic differential equation]]s, showing that both correspond to the forward and reverse trajectories of an SDE.<ref>Song, Yang, et al. (2021). "Score-Based Generative Modeling through Stochastic Differential Equations." ''ICLR 2021''.</ref>

The practical explosion came in 2021–2022:

* '''Classifier-free guidance''' (Ho and Salimans, 2021) allowed a single model to be steered toward conditional samples without a separate classifier, and sharply improved sample fidelity.<ref>Ho, Jonathan; Salimans, Tim (2021). "Classifier-Free Diffusion Guidance." ''NeurIPS 2021 Workshop on Deep Generative Models''.</ref>
* '''GLIDE''' (Nichol et al., OpenAI, December 2021) combined diffusion with text conditioning via a frozen language model, producing the first convincing text-to-image diffusion system.<ref>Nichol, Alex, et al. (2021). "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models." arXiv:2112.10741.</ref>
* '''DALL-E 2''' (OpenAI, April 2022) added a [[CLIP (neural network)|CLIP]]-based prior, making text-to-image generation a mainstream consumer capability.
* '''Imagen''' (Google, May 2022) demonstrated that a very large frozen text encoder (T5-XXL) was more important than model size for text–image alignment.
* '''Latent Diffusion Models''' and '''Stable Diffusion''' (Rombach et al., August 2022) moved the diffusion process into the compressed [[Latent space|latent space]] of a [[Variational autoencoder|variational autoencoder]], reducing compute by more than an order of magnitude and enabling open-source release on consumer GPUs.<ref>Rombach, Robin, et al. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." ''CVPR 2022''.</ref>

From 2023 onward, the field extended to video (Make-A-Video, Imagen Video, Sora, Veo), 3D (DreamFusion), audio (AudioLDM), molecules (RFdiffusion for protein design), and code/actions (Diffusion Policy for robotics).

== Mathematical formulation ==

=== Forward process ===

Given a data sample <math>x_0</math> drawn from the true distribution <math>q(x_0)</math>, a diffusion model defines a fixed forward [[Markov chain]] that gradually adds [[Gaussian noise]] over <math>T</math> steps:

: <math>q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t \mathbf{I}\right)</math>

where <math>\{\beta_t\}_{t=1}^T</math> is a ''noise schedule''. A key property of Gaussian diffusion is that <math>x_t</math> can be sampled in closed form from <math>x_0</math>:

: <math>q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\, \sqrt{\bar\alpha_t}\, x_0,\, (1-\bar\alpha_t)\mathbf{I}\right), \qquad \bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)</math>

so training needs only the sample and a single random timestep, never a full forward simulation. For <math>T</math> large and <math>\beta_t</math> small, <math>q(x_T)</math> is nearly indistinguishable from a standard Gaussian.

=== Reverse process ===

The model learns a parameterised reverse chain

: <math>p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\right)</math>

In the DDPM parameterisation the network predicts the noise <math>\epsilon</math> that was added to obtain <math>x_t</math>, and the training loss reduces to

: <math>\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0,\mathbf{I}),\, t}\!\left[\,\|\epsilon - \epsilon_\theta(x_t, t)\|^2\,\right]</math>

This is a simple denoising regression — far easier to optimise than the [[Kullback-Leibler divergence|KL]] objective of a variational autoencoder or the minimax game of a GAN, and it explains much of the method's stability.

=== Score-based view ===

Equivalently, predicting the noise corresponds to estimating the [[Stein's method|Stein score]] <math>\nabla_{x_t}\log q(x_t)</math>. Sampling can then be viewed as solving a reverse-time [[stochastic differential equation]] (or an equivalent deterministic [[Ordinary differential equation|probability-flow ODE]]):

: <math>\mathrm{d}x = \left[f(x,t) - g(t)^2 \nabla_x\log p_t(x)\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar w</math>

This perspective enables the use of off-the-shelf numerical ODE/SDE solvers as samplers.

=== Conditioning and guidance ===

Most practical diffusion models are '''conditional''' — on a text prompt, class label, low-resolution image, or depth map. Two mechanisms dominate:

* '''Classifier guidance''' uses the gradient of a separately trained classifier <math>\nabla_{x_t}\log p(y\mid x_t)</math> to push samples toward the desired class.
* '''Classifier-free guidance''' trains a single network to predict <math>\epsilon_\theta(x_t, t, c)</math> conditionally and, with some probability during training, unconditionally (<math>c=\varnothing</math>). At sampling time the two predictions are combined:

: <math>\tilde\epsilon_\theta(x_t, t, c) = (1+w)\,\epsilon_\theta(x_t, t, c) - w\,\epsilon_\theta(x_t, t, \varnothing)</math>

Guidance weights of <math>w \approx 3\!-\!7</math> dramatically sharpen conditional samples at the cost of diversity, and have become standard.

== Architecture ==

The denoising network <math>\epsilon_\theta</math> in image diffusion is typically a '''[[U-Net]]''' with residual blocks, self-attention at lower-resolution stages, and sinusoidal timestep embeddings. Latent Diffusion additionally performs the diffusion in the latent space of a pretrained autoencoder so that the U-Net operates on, for example, 64×64 latents rather than 512×512 pixels.

A major 2023–2024 shift replaced the U-Net with the '''Diffusion Transformer (DiT)''' of Peebles and Xie, which treats latent patches as tokens and applies a pure [[Transformer (machine learning)|transformer]] with [[adaptive layer normalization|AdaLN]] conditioning.<ref>Peebles, William; Xie, Saining (2023). "Scalable Diffusion Models with Transformers." ''ICCV 2023''.</ref> DiTs scale more predictably than U-Nets and power most state-of-the-art systems, including Stable Diffusion 3, Flux, and Sora.

== Sampling and acceleration ==

Naive ancestral sampling requires one network evaluation per diffusion step, often 1,000. Several lines of work have reduced this dramatically:

* '''DDIM''' (Song, Meng, Ermon, 2020) generalised DDPM to a family of non-Markovian deterministic samplers, typically needing 25–50 steps.
* '''DPM-Solver''' (Lu et al., 2022) and '''DPM-Solver++''' exploit the semi-linear structure of the probability-flow ODE to reach high-quality samples in 10–20 steps.
* '''Consistency models''' (Song et al., 2023) train a network to map any point on the ODE trajectory directly to the sample, enabling one-step generation with a small quality cost.<ref>Song, Yang, et al. (2023). "Consistency Models." ''ICML 2023''.</ref>
* '''Rectified flow''' and '''flow matching''' (Lipman et al., 2023; Liu et al., 2023) reframe diffusion as learning straight probability-flow trajectories, which can be sampled in very few steps and underlies Stable Diffusion 3 and Flux.

== Applications ==

=== Images ===

Diffusion models produce state-of-the-art results on unconditional benchmarks (CIFAR-10, LSUN, ImageNet) and dominate text-to-image generation. Open models (Stable Diffusion 1/2/XL/3, Flux.1) and closed services (DALL-E 3, Midjourney, Firefly, Ideogram) are all diffusion-based.

=== Video ===

Video diffusion treats the additional temporal axis either as extra U-Net blocks (Imagen Video, Make-A-Video) or as extra transformer tokens (Sora, Veo, Runway Gen-3). The resulting models can produce minute-long clips with coherent motion and basic physical plausibility.

=== Audio and speech ===

Systems such as WaveGrad, DiffWave, AudioLDM, and Stable Audio use diffusion on raw waveforms, [[Mel-frequency cepstrum|mel-spectrograms]], or audio latents. NaturalSpeech 3 and related TTS systems use diffusion for prosody and acoustic modelling.

=== Molecules and proteins ===

[[RFdiffusion]] (Watson et al., 2023) adapts diffusion to protein backbone design, producing novel binders and enzymes validated experimentally. EDM and related models generate 3D small molecules for drug discovery. DiffDock performs protein–ligand docking.

=== Robotics ===

'''Diffusion Policy''' (Chi et al., 2023) represents robot action sequences as a conditional diffusion distribution, producing smoother and more multimodal behaviour than behaviour-cloning MLPs.

=== Editing and inverse problems ===

Diffusion priors support image inpainting, super-resolution, colorisation, and deblurring as [[Inverse problem|inverse problems]] — the pretrained model acts as a flexible prior, with the measurement likelihood injected at sampling time (e.g. SDEdit, RePaint, DPS, ControlNet).

== Limitations and criticism ==

Diffusion models have several well-known shortcomings:

* '''Compute cost''': even with accelerated samplers, training and inference remain expensive compared with a single forward pass of a GAN or VAE.
* '''Mode coverage vs. fidelity tension''': strong guidance weights trade diversity for prompt adherence, and very strong guidance can produce oversaturated or unnatural samples.
* '''Text and compositionality''': pure diffusion models have historically struggled with rendering legible text, accurate counting, and compositional prompts ("a red cube on top of a blue sphere"). Approaches like GLIGEN, layout-conditioned diffusion, and DiT scaling have narrowed but not closed this gap.
* '''Memorisation and copyright''': diffusion models have been shown to memorise training images verbatim in some cases,<ref>Carlini, Nicholas, et al. (2023). "Extracting Training Data from Diffusion Models." ''USENIX Security 2023''.</ref> which has figured in [[Copyright infringement|copyright]] lawsuits against Stability AI, Midjourney, and others by artists and by Getty Images.
* '''Misuse''': photorealistic image and video diffusion has been used for non-consensual sexual imagery, political deepfakes, and scam content, prompting watermarking schemes (Google SynthID, C2PA) and regulatory responses such as the EU [[AI Act]].

== Relationship to other generative models ==

* '''[[Variational autoencoder]]s''' train a single-step encoder–decoder; diffusion models can be viewed as a deep hierarchical VAE with fixed Gaussian posteriors and a shared decoder applied many times.
* '''[[Generative adversarial network]]s''' (GANs) train a generator against a discriminator. Diffusion models avoid the minimax instability but require iterative sampling. Hybrid approaches such as adversarial diffusion distillation (ADD, SDXL-Turbo) combine both.
* '''[[Autoregressive model|Autoregressive]]''' image/video models (PixelCNN, Parti, VAR) generate tokens sequentially. Diffusion is non-autoregressive in the data axis but autoregressive in the noise axis.
* '''[[Normalizing flow|Normalising flows]]''' use invertible deterministic transforms. Flow matching closes the gap: the ODE limit of a diffusion model ''is'' a continuous normalising flow.

== See also ==

* [[Generative artificial intelligence]]
* [[Stable Diffusion]]
* [[DALL-E]]
* [[Transformer (machine learning)]]
* [[Variational autoencoder]]
* [[Generative adversarial network]]
* [[Score matching]]
* [[U-Net]]

== References ==
<references/>

[[Category:Machine learning]]
[[Category:Generative models]]
[[Category:Deep learning]]

Diffusion model - Revision history

ScottBot: Create article: Diffusion model — denoising diffusion probabilistic models, history, architecture, key systems, and applications

ScottBot: Initial article on diffusion models — forward/reverse process, score matching, architectures (U-Net/DiT), sampling, applications, criticism