// Read these in sequence. Each one builds on the last.
2014
Foundational
GAN
Generative Adversarial Nets — Goodfellow et al.
The paper that introduced GANs. Two networks compete: a generator that creates images and a discriminator that judges them.
📖 Why read it: The idea of adversarial training is foundational. Short paper (~9 pages). Very readable.
→ arxiv.org/abs/1406.2661
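The minimax objective from the paper can be sketched in a few lines. This is a loss-only sketch, not a training loop; `d_real` and `d_fake` are hypothetical discriminator outputs in (0, 1):

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # D maximizes log D(x) + log(1 - D(G(z))); we minimize the negative.
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake):
    # Non-saturating variant from the paper: G maximizes log D(G(z)).
    return -np.mean(np.log(d_fake))

d_real = np.array([0.9, 0.8])  # discriminator confident on real images
d_fake = np.array([0.1, 0.2])  # discriminator rejects generated images
d_loss = discriminator_loss(d_real, d_fake)
g_loss = generator_loss(d_fake)
```

In practice each network's parameters are updated by gradient descent on its own loss, alternating between the two.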
2013
Foundational
Auto-Encoding Variational Bayes — Kingma & Welling
Introduced VAEs — learning probabilistic latent representations. The math is dense, but the concept is critical.
📖 Why read it: The VAE's latent-space concept is used directly in Stable Diffusion's architecture.
→ arxiv.org/abs/1312.6114
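The paper's two key mechanics fit in a few lines: the reparameterization trick (which keeps sampling differentiable) and the closed-form KL term of the ELBO for a Gaussian posterior. `mu` and `log_var` stand in for hypothetical encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    # z = mu + sigma * eps: gradients flow through mu and sigma, not the sample.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # KL(q(z|x) || N(0, I)) = -0.5 * sum(1 + log_var - mu^2 - exp(log_var))
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

mu, log_var = np.zeros(4), np.zeros(4)
kl = kl_to_standard_normal(mu, log_var)  # 0 when q already matches the prior
z = reparameterize(mu, log_var)
```

The full VAE loss adds a reconstruction term (e.g., MSE between input and decoder output) to this KL penalty.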
2018
GAN
A Style-Based Generator Architecture for GANs (StyleGAN) — Karras et al.
Introduced disentangled style control, W-space, and adaptive instance normalization. Photorealistic face generation.
📖 Why read it: StyleGAN's concept of disentangled latent spaces directly influenced how diffusion model conditioning works.
→ arxiv.org/abs/1812.04948
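Adaptive instance normalization (AdaIN), the mechanism that injects style, is simple: normalize each channel's statistics, then rescale and shift them with style-derived values. Shapes and names here are illustrative, not from the paper's code:

```python
import numpy as np

def adain(x, style_scale, style_bias, eps=1e-5):
    # x: (channels, height, width). Normalize per channel, then restyle.
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    normed = (x - mu) / (sigma + eps)
    return style_scale[:, None, None] * normed + style_bias[:, None, None]

x = np.random.default_rng(0).standard_normal((3, 8, 8))
out = adain(x, np.ones(3), np.zeros(3))  # identity style: just normalizes
```

In StyleGAN, `style_scale` and `style_bias` come from an affine transform of the W-space vector, applied at every generator layer.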
2017
Attention
Attention Is All You Need — Vaswani et al.
The Transformer paper. Introduced self-attention, multi-head attention, and positional encodings. Revolutionized all of AI.
📖 Why read it: Every modern generation model uses transformers. This is non-negotiable reading.
→ arxiv.org/abs/1706.03762
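The core equation, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, is worth implementing once by hand. A single head, no masking or batching:

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: each query attends over all keys.
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)               # (seq_q, seq_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v                            # weighted sum of values

q = k = v = np.eye(4)  # toy sequence of 4 one-hot tokens
out = attention(q, k, v)
```

Multi-head attention just runs several of these in parallel on learned projections of Q, K, V and concatenates the results.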
2021
CLIP
Learning Transferable Visual Models From Natural Language Supervision (CLIP) — Radford et al.
Trained image and text encoders together via contrastive learning on 400M image-text pairs. Enables text-to-image generation.
📖 Why read it: CLIP is the text-understanding backbone of Stable Diffusion, DALL·E 2, and most modern models.
→ arxiv.org/abs/2103.00020
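The symmetric contrastive loss is the heart of the paper: matched image/text pairs sit on the diagonal of a similarity matrix, and each row and column is treated as a softmax classification. A sketch with a fixed temperature (the paper learns its logit scale):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (n, n); diagonal = matched pairs
    labels = np.arange(len(logits))

    def ce(l):  # cross-entropy of each row against the diagonal label
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image->text and text->image directions.
    return (ce(logits) + ce(logits.T)) / 2

aligned = clip_loss(np.eye(3), np.eye(3))  # perfectly matched toy embeddings
```

Aligned embeddings give a near-zero loss; shuffling the text rows makes it blow up, which is exactly the gradient signal that pulls matched pairs together.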
2020
Diffusion
Denoising Diffusion Probabilistic Models (DDPM) — Ho et al.
The paper that made diffusion models practical. Simplified the training objective to predicting the noise added to an image.
📖 Why read it: This is THE paper. Everything in modern image generation descends from this. Read it twice.
→ arxiv.org/abs/2006.11239
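The training recipe is compact: noise an image in closed form, then regress the noise with an MSE loss. The linear beta schedule endpoints below match the paper (1e-4 to 0.02 over T = 1000 steps); the flat `x0` array stands in for an image:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # noise schedule from the paper
alphas_bar = np.cumprod(1.0 - betas)      # cumulative signal fraction

def q_sample(x0, t, eps):
    # Closed-form forward process: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps

def loss_simple(eps_pred, eps):
    # The "simple" objective: just MSE between predicted and true noise.
    return np.mean((eps_pred - eps) ** 2)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)   # stand-in for a training image
eps = rng.standard_normal(16)
x_t = q_sample(x0, 500, eps)   # the noisy input the network would see
```

By t = T the signal fraction `alphas_bar[-1]` is nearly zero, i.e., x_T is essentially pure noise; that's what makes sampling from a Gaussian at inference time valid.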
2020
Diffusion
Denoising Diffusion Implicit Models (DDIM) โ Song et al.
Non-Markovian diffusion process that enables deterministic sampling in far fewer steps (50 instead of 1000).
๐ Why read it: Practical diffusion models use DDIM-style sampling. You'll configure this constantly.
โ arxiv.org/abs/2010.02502
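One deterministic DDIM update (the eta = 0 case) is short: reconstruct a clean-image estimate from the noise prediction, then jump to an earlier timestep. `alphas_bar` is the same cumulative-product schedule as in DDPM; here the true noise plays the role of the network's prediction so the algebra is exact:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alphas_bar = np.cumprod(1.0 - betas)

def ddim_step(x_t, eps, t, t_prev):
    a_t, a_prev = alphas_bar[t], alphas_bar[t_prev]
    # Invert the forward process to estimate the clean image...
    x0_hat = (x_t - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)
    # ...then re-noise it to the (earlier) target timestep, deterministically.
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1 - a_prev) * eps

rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(8), rng.standard_normal(8)
t, t_prev = 500, 480
x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps
x_prev = ddim_step(x_t, eps, t, t_prev)
```

Because each step is deterministic, you can take large jumps (e.g., every 20th timestep) instead of walking all 1,000 steps; that is the speedup.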
2022
LDM
High-Resolution Image Synthesis with Latent Diffusion Models — Rombach et al.
Stable Diffusion's architecture. Moves diffusion into a compressed latent space using a pre-trained VAE. Huge efficiency gain.
📖 Why read it: This IS Stable Diffusion. Understanding this paper means you understand how SD 1.5, SD 2, and SDXL work.
→ arxiv.org/abs/2112.10752
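The efficiency argument is just arithmetic: the 8x spatial compression used in common SD configurations shrinks the tensor the U-Net must denoise by a factor of 64 per spatial dimension pair. A sketch with a stand-in "encoder" (block averaging, purely illustrative, not the actual VAE):

```python
import numpy as np

DOWN = 8  # spatial compression factor of the common SD configuration

def encode(image):
    # Stand-in for the pre-trained VAE encoder: 8x8 block averaging.
    h, w, c = image.shape
    return image.reshape(h // DOWN, DOWN, w // DOWN, DOWN, c).mean(axis=(1, 3))

image = np.zeros((512, 512, 3))
z = encode(image)  # the U-Net denoises z, not the 512x512 pixels
savings = image.shape[0] * image.shape[1] / (z.shape[0] * z.shape[1])
```

A 512x512 image becomes a 64x64 latent: 64x fewer spatial positions for every attention and convolution layer in the denoiser. The decoder runs once, at the end.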
2023
Control
Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet) — Zhang & Agrawala
Adds spatial conditioning (edge maps, depth, pose) by training a copy of the encoder weights, connected back through zero-initialized convolution layers.
📖 Why read it: ControlNet is everywhere in practical applications. The architecture idea is elegant and widely reused.
→ arxiv.org/abs/2302.05543
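The "zero convolution" trick is the elegant part: the trainable copy's output passes through a conv whose weights start at zero, so at initialization the control branch contributes nothing and the frozen base model behaves exactly as before. Names and shapes here are illustrative:

```python
import numpy as np

def zero_conv_1x1(x, weight, bias):
    # x: (c_in, h, w); weight: (c_out, c_in). A 1x1 convolution.
    return np.einsum('oc,chw->ohw', weight, x) + bias[:, None, None]

c_in = c_out = 4
weight = np.zeros((c_out, c_in))  # zero-initialized; learned during training
bias = np.zeros(c_out)

rng = np.random.default_rng(0)
control_features = rng.standard_normal((c_in, 8, 8))   # from the trainable copy
base_features = rng.standard_normal((c_out, 8, 8))     # from the frozen U-Net
out = base_features + zero_conv_1x1(control_features, weight, bias)
```

Training gradually moves the zero weights away from zero, letting the spatial condition influence generation without ever destabilizing the pre-trained model.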
2022
Personalization
DreamBooth — Ruiz et al.
Fine-tune a diffusion model on 3-5 images of a subject using a unique identifier token. Teaches the model a new "concept."
📖 Why read it: Foundation of model personalization — your face, your dog, your product in any style.
→ arxiv.org/abs/2208.12242
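The objective pairs the usual denoising loss on the few subject images with a weighted "prior preservation" loss on images generated from the plain class prompt, so the model learns the subject without forgetting the class. A loss-structure sketch; `lam` and the names are illustrative, not from the paper's code:

```python
import numpy as np

def dreambooth_loss(subject_err, prior_err, lam=1.0):
    # Each *_err stands in for (eps_pred - eps) on the respective batch:
    # subject_err on "a [V] dog" images, prior_err on generic "a dog" images.
    return np.mean(subject_err**2) + lam * np.mean(prior_err**2)

loss = dreambooth_loss(np.zeros(4), np.zeros(4))  # perfect predictions -> 0
```

Without the prior term, fine-tuning on 3-5 photos quickly collapses every "dog" the model can draw into your dog; the prior term is what prevents that drift.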
2022
Video
Video Diffusion Models — Ho et al.
Extends image diffusion to video with a 3D U-Net (joint spatial-temporal attention). First strong diffusion-based video results.
📖 Why read it: Gateway into video generation research — all later video models reference this.
→ arxiv.org/abs/2204.03458
2022
DiT
Scalable Diffusion Models with Transformers (DiT) — Peebles & Xie
Replaces the U-Net backbone in diffusion models with a Vision Transformer. Scales better with compute; underpins Sora and SD3.
📖 Why read it: DiT is the emerging standard architecture. Sora, Stable Diffusion 3, and most frontier models use this design.
→ arxiv.org/abs/2212.09748
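The first step of DiT is to "patchify" the latent image into a sequence of tokens a transformer can process. Patch size and dimensions below are illustrative:

```python
import numpy as np

def patchify(x, p):
    # x: (h, w, c) image -> (num_patches, p*p*c) sequence of flattened patches.
    h, w, c = x.shape
    x = x.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * c)

x = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)  # tiny 4x4, 2-channel image
tokens = patchify(x, 2)  # four 2x2 patches, each flattened to 8 values
```

From there it is a standard Vision Transformer over the token sequence, with the diffusion timestep and class/text conditioning injected via adaptive layer norm; that uniformity is why it scales so cleanly.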
2023
Speed
Consistency Models — Song et al.
Train models to generate high-quality images in a single step by enforcing consistency along diffusion trajectories.
📖 Why read it: The direction the field is heading — single-step generation without quality loss.
→ arxiv.org/abs/2303.01469
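The defining property is easy to state in code: a consistency model f(x, t) should map any two points on the same diffusion trajectory to the same clean output. Only the loss structure below follows the paper; the toy "model" is hypothetical:

```python
import numpy as np

def consistency_loss(f, x_t, x_t_next, t, t_next):
    # Pull the model's outputs at adjacent points on one trajectory together.
    return np.mean((f(x_t_next, t_next) - f(x_t, t)) ** 2)

# A toy f that already maps every trajectory point to one fixed clean output,
# so the consistency loss is exactly zero.
perfect_f = lambda x, t: np.zeros_like(x)
x_a, x_b = np.ones(4), 2 * np.ones(4)  # two noise levels of one trajectory
loss = consistency_loss(perfect_f, x_a, x_b, 0.1, 0.2)
```

Once trained, generation is a single call f(noise, T): no iterative denoising loop at all, which is the promised speedup.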