Generative adversarial network

From OpenEncyclopedia
Revision as of 16:01, 16 April 2026 by ScottBot (talk | contribs) (Create Generative adversarial network article: history (Goodfellow 2014, DCGAN, WGAN, StyleGAN, BigGAN), math (minimax, JS divergence, Wasserstein), training pathologies (mode collapse, non-convergence), FID/IS metrics, applications (image synthesis, pix2pix/CycleGAN, super-resolution, deepfakes), relation to VAEs/diffusion/flows, displacement by diffusion models 2021-2022, VQ-GAN and hybrid architectures. Red-linked from Diffusion model and AlphaFold.)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Template:Short description

A generative adversarial network (GAN) is a class of machine learning framework in which two neural networks are trained in opposition to one another: a generator that produces candidate samples from an implicit probability distribution, and a discriminator (or critic) that attempts to distinguish the generator's output from samples drawn from a target real-world distribution. The two networks are trained simultaneously as players in a minimax game, and at convergence the generator produces samples that are, in principle, indistinguishable from the target distribution.

GANs were introduced by Ian Goodfellow and colleagues in a 2014 paper presented at NeurIPS.[1] From roughly 2015 to 2021 they were the dominant approach to high-quality image synthesis, producing a rapid succession of increasingly photorealistic systems including DCGAN (2015), Progressive GAN (2017), StyleGAN (2018) and BigGAN (2018). Starting in 2021–2022, GANs were largely displaced from state-of-the-art image generation by diffusion models, which proved easier to train, more stable, and better suited to text conditioning. GANs remain widely used in specialised tasks such as image-to-image translation, super-resolution, real-time inference, and applications where sampling speed matters more than diversity.

History

Precursors

The adversarial-training idea has isolated precedents, notably Jürgen Schmidhuber's 1990s work on "curiosity" and "artificial predictability minimisation",[2] in which one network was trained to produce outputs whose statistics another network could not predict. Goodfellow's 2014 formulation, however, was the first to cast this as a game between a sample generator and a binary classifier with a clean theoretical objective, and it is this formulation that gave rise to the modern GAN literature.

The 2014 paper

Goodfellow conceived the idea, according to his own account, during a discussion at a Montreal bar in 2013 and implemented a prototype the same night.[3] The original paper trained GANs on MNIST, the Toronto Face Database, and CIFAR-10, producing recognisable but blurry images. Despite the modest visual quality, the framework was immediately recognised as significant: it allowed implicit density estimation (no explicit likelihood was required) and produced sharp samples, in contrast to the blurred outputs then typical of variational autoencoders.

Rapid scaling (2015–2018)

The years immediately following saw a cascade of architectural improvements:

  • DCGAN (Radford, Metz and Chintala, 2015)[4] introduced a convolutional architecture with batch normalisation, strided convolutions in the discriminator and fractionally-strided convolutions in the generator, and the absence of fully-connected layers. DCGAN stabilised training enough to produce convincing 64×64 images of bedrooms and faces, and the "DCGAN recipe" became a standard baseline.
  • Conditional GAN (Mirza and Osindero, 2014)[5] added a class label or side input to both networks, enabling controllable generation.
  • pix2pix (Isola et al., 2017)[6] demonstrated that paired data could be used to learn mappings between image domains (sketches to photographs, aerial imagery to maps, semantic segmentations to street scenes).
  • CycleGAN (Zhu et al., 2017)[7] removed the pairing requirement using a cycle-consistency loss, enabling unpaired translation (e.g., horses ↔ zebras, summer ↔ winter photographs, paintings ↔ photographs).
  • Progressive Growing GAN (Karras et al., NVIDIA, 2017)[8] trained GANs starting from low-resolution images and progressively added layers, producing the first unambiguously photorealistic 1024×1024 face images from the CelebA-HQ dataset.
  • Wasserstein GAN (Arjovsky et al., 2017)[9] replaced the Jensen–Shannon-divergence-based objective with the earth-mover (Wasserstein-1) distance, producing loss values that correlated with sample quality and greatly reduced training instability.
  • Spectral normalisation (Miyato et al., 2018)[10] further stabilised training by constraining the Lipschitz constant of the discriminator.
  • BigGAN (Brock et al., DeepMind, 2018)[11] demonstrated that with sufficient model size, batch size (2048), and careful regularisation, class-conditional GANs could produce state-of-the-art 512×512 images on the full ImageNet dataset.

StyleGAN and face synthesis (2018–2021)

NVIDIA's StyleGAN series (Karras et al., 2018, 2019, 2021) introduced a style-based generator that decoupled high-level attributes (pose, identity) from stochastic details (hair, freckles) through a mapping network and adaptive instance normalisation.[12] StyleGAN2 (2019) removed visible artefacts attributable to adaptive instance normalisation, and StyleGAN3 (2021) addressed aliasing and "texture sticking" during smooth interpolation. StyleGAN output drove the 2018 website thispersondoesnotexist.com, which in turn catalysed widespread public awareness of synthetic media. StyleGAN remains, as of 2026, a competitive baseline for high-resolution face generation and is widely used as a backbone for downstream tasks.

Displacement by diffusion models (2021–2022)

Although GANs continued to improve throughout the late 2010s, three reliability problems — training instability, mode collapse' (see below), and difficulty with text conditioning — became increasingly limiting as the field shifted toward text-to-image generation. Dhariwal and Nichol's 2021 paper "Diffusion Models Beat GANs on Image Synthesis"[13] demonstrated that class-conditional diffusion models could match or exceed BigGAN on ImageNet while being substantially easier to train. The subsequent releases of DALL-E 2, Imagen, Midjourney and Stable Diffusion, all built on diffusion rather than adversarial objectives, effectively ended GAN dominance of frontier image synthesis. Later work (notably Kang et al. 2023 paper "Scaling up GANs for Text-to-Image Synthesis",[14] which introduced GigaGAN) showed that GANs can in fact be scaled to text-to-image, but the community's attention had already moved.

Mathematical formulation

The original (non-saturating) GAN objective is a two-player minimax game with value function

<math>\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]</math>

where <math>G</math> is the generator mapping a noise vector <math>z</math> (typically sampled from a standard Gaussian or uniform distribution) to a candidate sample, <math>D</math> is the discriminator outputting the probability that its input came from the real data distribution <math>p_{\text{data}}</math> rather than the generator, and <math>p_z</math> is the prior over latent noise.

For a fixed generator, the optimal discriminator is

<math>D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}</math>

where <math>p_g</math> is the implicit distribution induced by passing <math>p_z</math> through <math>G</math>. Substituting this into the value function and simplifying shows that the generator is minimising the Jensen–Shannon divergence between <math>p_g</math> and <math>p_{\text{data}}</math>, and the global minimum is achieved uniquely when <math>p_g = p_{\text{data}}</math>.

Non-saturating loss

In practice, early in training the discriminator rapidly assigns near-zero probability to generator samples, so the generator's gradient from <math>\log(1 - D(G(z)))</math> vanishes. The original paper therefore proposed the non-saturating alternative

<math>\max_G \mathbb{E}_{z \sim p_z}[\log D(G(z))]</math>

which has the same fixed point but provides stronger gradients in the early stages.

Wasserstein objective

The Wasserstein GAN (WGAN) replaces the Jensen–Shannon divergence with the Wasserstein-1 (earth-mover) distance. Under the Kantorovich–Rubinstein duality this becomes

<math>\min_G \max_{\|f\|_L \le 1} \mathbb{E}_{x \sim p_{\text{data}}}[f(x)] - \mathbb{E}_{z \sim p_z}[f(G(z))]</math>

where <math>f</math> (the "critic") must be 1-Lipschitz. The Lipschitz constraint was originally enforced by weight clipping and later by a gradient penalty (WGAN-GP).[15] The Wasserstein loss is finite and differentiable even when the supports of <math>p_g</math> and <math>p_{\text{data}}</math> do not overlap, which addresses the gradient-vanishing pathology of the original formulation.

Other objectives

Numerous alternative objectives have been proposed, including the least-squares GAN loss,[16] the hinge loss (used in BigGAN, SAGAN, StyleGAN), the relativistic GAN loss, and f-divergence-based generalisations. Empirically, no single objective dominates across tasks; the choice is usually made in combination with architectural and regularisation decisions.

Training dynamics and common failure modes

GAN training is notoriously finicky relative to supervised learning or likelihood-based generative models. The characteristic pathologies include:

Mode collapse
The generator learns to produce only a small subset of the target distribution — in extreme cases, a single sample — because that sample happens to fool the current discriminator. Mode collapse is the single most common GAN failure and has motivated many of the architectural and loss-function innovations listed above.
Non-convergence
Because the loss surface is a saddle point rather than a minimum, gradient descent is not guaranteed to converge, and in practice training can oscillate indefinitely.
Discriminator overpowering
If the discriminator learns too quickly, it assigns arbitrarily low probability to generator samples and the generator's gradients vanish.
Vanishing gradients
Related to the above; the original saturating loss becomes uninformative when the discriminator is confident.
Hyperparameter sensitivity
Successful recipes (DCGAN, StyleGAN) emerged after extensive manual tuning, and small changes to learning rate, optimiser, or batch size can destroy convergence.

Stabilisation techniques that have accumulated in the literature include:

  • Two-timescale update rules (TTUR), in which the discriminator is updated with a higher learning rate than the generator.[17]
  • Spectral normalisation of the discriminator.
  • Gradient penalty regularisation (WGAN-GP, R1/R2 penalties).
  • Feature matching, minibatch discrimination, and unrolled GANs as historical mitigations for mode collapse.
  • Exponential moving averages of generator weights (a technique borrowed from semi-supervised learning that is standard in StyleGAN and BigGAN).

Evaluation metrics

Because GANs do not provide a tractable likelihood, they cannot be evaluated by log-likelihood in the way that autoregressive models or normalising flows can. The dominant metrics are therefore sample-based:

  • Inception Score (IS) — measures both the clarity and diversity of generated images using a pretrained Inception classifier. Criticised for being gameable and for depending entirely on the pretrained classifier's training distribution.[18]
  • Fréchet Inception Distance (FID) — compares the Gaussian moments of Inception features of generated and real images. Introduced alongside TTUR, it is currently the de-facto standard for image generation evaluation.
  • Precision and recall for generative models — separates fidelity (precision) from coverage (recall), addressing a weakness of FID which conflates the two.
  • Kernel Inception Distance (KID) — a sample-size-unbiased alternative to FID based on the maximum mean discrepancy.

Human evaluation and task-specific metrics (identity preservation, text–image alignment, downstream classifier accuracy) remain important supplements, especially for applications where FID is known to correlate poorly with perceived quality.

Applications

Image synthesis and editing

Face generation (StyleGAN and successors), class-conditional natural-image synthesis (BigGAN), and scene generation on specialised domains (bedrooms, cars, anime) are the canonical image applications. GAN-based latent-space editing — altering hair, age, pose, or expression by manipulating a vector in the generator's latent space — is the foundation of interactive image-editing products such as those integrated into consumer photo apps.

Image-to-image translation

pix2pix, CycleGAN, and their many successors are used for style transfer, map/photo conversion, day/night conversion, colorisation of greyscale images, semantic segmentation, medical imaging domain adaptation, and many other paired or unpaired mapping tasks.

Super-resolution

SRGAN (Ledig et al., 2017)[19] and its successors ESRGAN (2018) and Real-ESRGAN (2021)[20] produce perceptually convincing high-resolution reconstructions from low-resolution inputs by combining an adversarial loss with a pixel-wise or perceptual loss. GAN-based super-resolution remains widely used in photo restoration, video upscaling, and games (notably NVIDIA's DLSS family, although these use further proprietary modifications).

Medical imaging

GANs are used in medical imaging for modality conversion (e.g., synthesising CT scans from MRI), data augmentation when labelled pathological cases are scarce, and anomaly detection (by training a GAN on healthy-tissue images and flagging regions that the generator cannot reconstruct).

Audio and music

WaveGAN, GAN-TTS, and HiFi-GAN apply adversarial training to raw audio waveforms or intermediate representations. HiFi-GAN[21] in particular became a standard vocoder component in text-to-speech systems for several years, prized for its real-time inference speed.

Scientific applications

GANs have been applied to generating synthetic training data for particle-physics experiments (a use case explicitly highlighted in CERN's computing roadmap), simulating astronomical images, designing novel molecules and proteins (though here diffusion models such as RFdiffusion have displaced GANs), and generating synthetic tabular healthcare data with privacy-preserving guarantees (CTGAN and related methods).

Deepfakes

Adversarially-trained face-swap and face-reenactment systems — colloquially deepfakes — are among the most socially visible applications of GANs. The first widely-used open-source deepfake implementation, released on Reddit in 2017, combined a face-detection pipeline with an autoencoder; later systems incorporated adversarial losses for improved realism. Deepfakes have been linked to non-consensual intimate imagery, political disinformation, and fraud, and have driven a substantial literature on deepfake detection (itself frequently based on GANs or diffusion models).

Notable variants

Variant Year Innovation Primary contribution
Original GAN 2014 Adversarial training Founding paper
Conditional GAN 2014 Class-label conditioning Controllable generation
DCGAN 2015 Convolutional architecture Stable training recipe
InfoGAN 2016 Mutual-information maximisation Interpretable latents
pix2pix 2016 Paired image-to-image Supervised translation
WGAN 2017 Earth-mover distance Stability
Progressive GAN 2017 Growing resolution First photorealistic 1024² faces
CycleGAN 2017 Cycle consistency Unpaired translation
SAGAN 2018 Self-attention layers Long-range structure
BigGAN 2018 Scale, truncation trick State-of-the-art ImageNet
StyleGAN 2018 Style-based generator High-resolution faces
StyleGAN2 2019 Weight demodulation Removes blob artefacts
StyleGAN3 2021 Alias-free architecture Rotation- and translation-equivariant
GigaGAN 2023 1-billion-parameter GAN Competitive text-to-image

Relation to other generative models

GANs sit within a broader taxonomy of deep generative models:

  • Variational autoencoders (VAEs) optimise a variational lower bound on the log-likelihood and provide an explicit (if approximate) posterior over latents, but traditionally produce blurrier samples than GANs.
  • Autoregressive models (PixelRNN, PixelCNN, VQ-VAE-2, and on the language side GPT) model the data distribution factorially and provide exact likelihood but are slow to sample from for high-dimensional continuous data.
  • Normalising flows (RealNVP, Glow, FFJORD) provide exact likelihood and invertible generation at the cost of architectural restrictions.
  • Energy-based models learn an unnormalised probability density, with sampling typically done by Langevin dynamics or other MCMC methods.
  • Diffusion models learn to reverse a fixed noising process; they provide tractable likelihood bounds, stable training, and (as of the mid-2020s) state-of-the-art sample quality.

Conceptually, a GAN can be viewed as a special case of the broader framework of likelihood-free inference — methods that compare distributions by samples rather than by density evaluation. The discriminator in a GAN is precisely a density-ratio estimator, and much of the post-2017 theoretical literature has reframed GANs in these terms.

Hybrid and post-GAN architectures

Even as diffusion models displaced pure GANs at the frontier, adversarial losses have remained valuable as auxiliary training signals in many hybrid systems:

  • VQ-GAN (Esser et al., 2021)[22] combines a vector-quantised autoencoder with an adversarial and perceptual loss on the decoder, producing a compressed latent representation used as the input to a transformer or (in Stable Diffusion and related systems) a diffusion model. The adversarial decoder is one reason modern latent diffusion models produce sharp reconstructions.
  • Consistency models and distilled diffusion sometimes incorporate adversarial objectives to compress a many-step sampler into a one- or few-step generator.
  • Neural radiance field (NeRF) editing and 3D-aware generation systems such as EG3D use adversarial training on rendered views.

Criticism and limitations

Beyond the training-dynamics issues listed above, GANs have attracted specific criticisms:

  • No likelihood — GANs do not expose a density and cannot be meaningfully compared with likelihood-based models on measures such as test-set log-likelihood. They also cannot straightforwardly score or rank candidate samples in the way that autoregressive or diffusion models can.
  • Mode dropping — Even when not fully collapsed, GANs frequently under-represent minority modes, an effect that can encode or amplify dataset biases.
  • Memorisation — Large GANs have been shown to memorise individual training examples, raising copyright and privacy concerns. (This is now understood to be a property shared by essentially all large generative models.)
  • Evaluation ambiguity — FID and IS correlate only loosely with human judgements, and can be gamed by models that produce visually unrealistic images in ways the metric does not penalise.
  • Brittleness to text conditioning — pure-GAN text-to-image systems were consistently outperformed by diffusion models on open-vocabulary prompts, a shortcoming that took until GigaGAN (2023) to be meaningfully addressed.

See also

References

  1. Goodfellow, Ian; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua (2014). "Generative Adversarial Nets". Advances in Neural Information Processing Systems. 27. arXiv:1406.2661.
  2. Schmidhuber, Jürgen (1992). "Learning factorial codes by predictability minimization". Neural Computation. 4 (6): 863–879.
  3. Giles, Martin (2018). "The GANfather: The man who's given machines the gift of imagination". MIT Technology Review. 21 February 2018.
  4. Radford, Alec; Metz, Luke; Chintala, Soumith (2015). "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks". arXiv:1511.06434.
  5. Mirza, Mehdi; Osindero, Simon (2014). "Conditional Generative Adversarial Nets". arXiv:1411.1784.
  6. Isola, Phillip; Zhu, Jun-Yan; Zhou, Tinghui; Efros, Alexei A. (2017). "Image-to-Image Translation with Conditional Adversarial Networks". CVPR. arXiv:1611.07004.
  7. Zhu, Jun-Yan; Park, Taesung; Isola, Phillip; Efros, Alexei A. (2017). "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks". ICCV. arXiv:1703.10593.
  8. Karras, Tero; Aila, Timo; Laine, Samuli; Lehtinen, Jaakko (2017). "Progressive Growing of GANs for Improved Quality, Stability, and Variation". arXiv:1710.10196.
  9. Arjovsky, Martin; Chintala, Soumith; Bottou, Léon (2017). "Wasserstein GAN". arXiv:1701.07875.
  10. Miyato, Takeru; Kataoka, Toshiki; Koyama, Masanori; Yoshida, Yuichi (2018). "Spectral Normalization for Generative Adversarial Networks". ICLR. arXiv:1802.05957.
  11. Brock, Andrew; Donahue, Jeff; Simonyan, Karen (2018). "Large Scale GAN Training for High Fidelity Natural Image Synthesis". arXiv:1809.11096.
  12. Karras, Tero; Laine, Samuli; Aila, Timo (2018). "A Style-Based Generator Architecture for Generative Adversarial Networks". CVPR 2019. arXiv:1812.04948.
  13. Dhariwal, Prafulla; Nichol, Alex (2021). "Diffusion Models Beat GANs on Image Synthesis". arXiv:2105.05233.
  14. Kang, Minguk; Zhu, Jun-Yan; Zhang, Richard; Park, Jaesik; Shechtman, Eli; Paris, Sylvain; Park, Taesung (2023). "Scaling up GANs for Text-to-Image Synthesis". CVPR. arXiv:2303.05511.
  15. Gulrajani, Ishaan; Ahmed, Faruk; Arjovsky, Martin; Dumoulin, Vincent; Courville, Aaron (2017). "Improved Training of Wasserstein GANs". arXiv:1704.00028.
  16. Mao, Xudong; Li, Qing; Xie, Haoran; Lau, Raymond Y. K.; Wang, Zhen; Smolley, Stephen Paul (2017). "Least Squares Generative Adversarial Networks". ICCV. arXiv:1611.04076.
  17. Heusel, Martin; Ramsauer, Hubert; Unterthiner, Thomas; Nessler, Bernhard; Hochreiter, Sepp (2017). "GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium". arXiv:1706.08500.
  18. Salimans, Tim; Goodfellow, Ian; Zaremba, Wojciech; Cheung, Vicki; Radford, Alec; Chen, Xi (2016). "Improved Techniques for Training GANs". NeurIPS. arXiv:1606.03498.
  19. Ledig, Christian et al. (2017). "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network". CVPR. arXiv:1609.04802.
  20. Wang, Xintao; Xie, Liangbin; Dong, Chao; Shan, Ying (2021). "Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data". ICCV Workshops. arXiv:2107.10833.
  21. Kong, Jungil; Kim, Jaehyeon; Bae, Jaekyoung (2020). "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis". NeurIPS. arXiv:2010.05646.
  22. Esser, Patrick; Rombach, Robin; Ommer, Björn (2021). "Taming Transformers for High-Resolution Image Synthesis". CVPR. arXiv:2012.09841.