Zero to Research-Level

Image & Video
Generation AI

A complete, structured roadmap from absolute beginner to building your own generative models, with every paper, course, and project you need.

5
Phases
12–18
Months
30+
Key Papers
20+
Projects
Phase 1
Foundations
Phase 2
Deep Learning
Phase 3
Generative
Phase 4
Diffusion
Phase 5
Advanced
PHASE 01

Foundations

Estimated time: 6–8 weeks

Build the math and programming bedrock. Everything else depends on this. Don't skip it.

Python Programming
Learn Python syntax, NumPy for arrays, Matplotlib for plotting, and Jupyter notebooks. This is your coding language for everything.
Course: CS50P – Harvard Python (free on edX)
Course: fast.ai Practical Deep Learning – setup chapters
Video: Corey Schafer – Python YouTube series
Book: Python Crash Course – Eric Matthes
Linear Algebra
Vectors, matrices, matrix multiplication, dot products, eigenvalues. Neural networks are literally just matrix math.
Video: 3Blue1Brown – Essence of Linear Algebra (YouTube)
Course: MIT 18.06 – Gilbert Strang (free on OCW)
Book: Mathematics for Machine Learning – Cambridge (free PDF)
Calculus & Probability
Derivatives, chain rule (backprop!), partial derivatives. Probability distributions, Bayes' theorem, the Gaussian: all used constantly in generative models.
Video: 3Blue1Brown – Essence of Calculus (YouTube)
Course: Khan Academy – Probability & Statistics (free)
Book: Deep Learning Book, Ch. 3 – Goodfellow (free online)
Classic Machine Learning
Linear regression, logistic regression, decision trees, SVMs, overfitting, train/val/test splits. Understanding this prevents tons of confusion later.
Course: Andrew Ng – ML Specialization (Coursera, free to audit)
Book: Hands-On ML with Scikit-Learn & TF – Aurélien Géron
Repo: scikit-learn documentation + examples
Beginner Tip: Don't try to master everything here before moving on. Get 70–80% comfortable, then proceed. You'll revisit these concepts dozens of times as you build things; that's when it actually sticks.
PHASE 02

Deep Learning Core

Estimated time: 8–10 weeks

Neural networks, CNNs, and the frameworks that power every model you'll ever train.

Neural Networks from Scratch
Forward pass, loss functions, backpropagation, gradient descent. Build a neural net in pure NumPy: painful but deeply educational.
Video: Andrej Karpathy – Neural Networks: Zero to Hero (YouTube)
Course: fast.ai Part 1 – Practical Deep Learning (free)
Book: Deep Learning – Goodfellow, Bengio, Courville (free online)
Video: 3Blue1Brown – Neural Networks series
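The loop you'll build in these resources (forward pass, loss, backprop, gradient descent) fits on one page of NumPy. A minimal sketch, assuming a made-up toy task (regressing sin(x)) and variable names of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: learn y = sin(x) on [-pi, pi]
X = rng.uniform(-np.pi, np.pi, size=(256, 1))
y = np.sin(X)

# Two-layer MLP: 1 -> 16 -> 1, tanh hidden activation
W1 = rng.normal(0, 0.5, (1, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, (16, 1)); b2 = np.zeros(1)

lr = 0.05
losses = []
for step in range(500):
    # Forward pass
    h = np.tanh(X @ W1 + b1)          # hidden activations
    pred = h @ W2 + b2                # network output
    loss = np.mean((pred - y) ** 2)   # MSE loss
    losses.append(loss)

    # Backward pass: chain rule, by hand
    d_pred = 2 * (pred - y) / len(X)        # dL/dpred
    dW2 = h.T @ d_pred; db2 = d_pred.sum(0)
    d_h = d_pred @ W2.T * (1 - h ** 2)      # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ d_h; db1 = d_h.sum(0)

    # Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

If the loss curve goes down, your chain rule is right; if it explodes, you have the exact debugging experience these courses are designed around.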
Convolutional Neural Networks (CNNs)
Convolutions, pooling, feature maps, ResNet, VGG. CNNs are the backbone of almost all image generation architectures; know them thoroughly.
Course: Stanford CS231n – CNNs for Visual Recognition (free notes + videos)
Paper: Deep Residual Learning for Image Recognition (ResNet) – He et al., 2015
Video: Yannic Kilcher – ResNet paper walkthrough
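Before diving into CS231n, it helps to compute one convolution by hand. A small sketch, assuming single-channel, valid-mode convolution (really cross-correlation, as in DL frameworks) with a Sobel-style edge kernel chosen for illustration:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation (what DL frameworks call 'convolution')."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output value is a dot product of kernel and image window
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge detector applied to a step image
img = np.zeros((5, 5))
img[:, 3:] = 1.0                      # left side dark, right side bright
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
fmap = conv2d(img, sobel_x)           # feature map peaks at the edge
print(fmap)
```

The feature map is zero in flat regions and large at the brightness step, which is exactly the "feature detector" intuition behind learned conv filters.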
PyTorch Framework
Tensors, autograd, datasets, dataloaders, training loops, GPU usage. PyTorch is the standard for research. Learn it well.
Course: Official PyTorch Tutorials (pytorch.org)
Video: Patrick Loeber – PyTorch for Deep Learning (YouTube)
Book: Deep Learning with PyTorch – Eli Stevens (free online)
Repo: pytorch/examples on GitHub
Attention & Transformers
Self-attention, multi-head attention, positional encoding, ViT (Vision Transformer). Transformers now dominate generative AI; understand them deeply.
Paper: Attention Is All You Need – Vaswani et al., 2017
Blog: The Illustrated Transformer – Jay Alammar
Video: Andrej Karpathy – Let's build GPT from scratch
Paper: An Image is Worth 16x16 Words (ViT) – Dosovitskiy et al., 2020
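Self-attention reduces to a few matrix products. A NumPy sketch of single-head scaled dot-product attention, with sequence lengths and dimensions that are arbitrary choices of mine:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)    # each query's weights sum to 1
    return weights @ V, weights           # weighted mix of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 query tokens, d_k = 8
K = rng.normal(size=(6, 8))    # 6 key tokens
V = rng.normal(size=(6, 16))   # 6 value vectors, d_v = 16
out, w = attention(Q, K, V)
print(out.shape, w.shape)      # (4, 16) (4, 6)
```

Multi-head attention is just this computation run several times in parallel on learned projections of Q, K, V, with the outputs concatenated.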
๐Ÿ‹๏ธ
Training Best Practices
Batch normalization, dropout, learning rate schedules, mixed precision (fp16), gradient clipping, wandb/TensorBoard logging.
BlogAndrej Karpathy โ€“ A Recipe for Training Neural Networks
Coursefast.ai Part 2 โ€“ Deep Learning from the Foundations
Repowandb.ai tutorials on experiment tracking
Autoencoders & Latent Spaces
Encoders, decoders, bottleneck representations. The concept of "latent space" is central to VAEs, GANs, and diffusion models.
Video: Serrano Academy – Autoencoders explained (YouTube)
Paper: Auto-Encoding Variational Bayes (VAE) – Kingma & Welling, 2013
Blog: Understanding VAEs – Lilian Weng (lilianweng.github.io)
PHASE 03

Generative Models

Estimated time: 8–10 weeks

Now the fun begins. GANs, VAEs, flow models: learn how machines learn to generate images.

GANs – Generative Adversarial Networks
Generator vs. discriminator, adversarial loss, mode collapse, training instability. The architecture that started the modern image generation era.
Paper: Generative Adversarial Nets – Goodfellow et al., 2014
Paper: DCGAN – Radford et al., 2015
Video: Ari Seff – GAN lecture (YouTube)
Blog: GANs – Lilian Weng's comprehensive overview
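The adversarial objective is two binary cross-entropy losses pulling in opposite directions. A toy numerical sketch; the D(x) probabilities below are invented stand-ins for a trained discriminator's outputs, not real model predictions:

```python
import numpy as np

def bce(probs, labels):
    """Binary cross-entropy, the workhorse of the GAN objective."""
    eps = 1e-12  # avoid log(0)
    return -np.mean(labels * np.log(probs + eps)
                    + (1 - labels) * np.log(1 - probs + eps))

# Suppose the discriminator outputs D(x) = P(x is real)
d_real = np.array([0.9, 0.8, 0.95])  # D on real images: should be near 1
d_fake = np.array([0.1, 0.2, 0.05])  # D on generated images: should be near 0

# Discriminator loss: classify real as 1 and fake as 0
d_loss = bce(d_real, np.ones(3)) + bce(d_fake, np.zeros(3))

# Non-saturating generator loss: maximize log D(G(z)),
# i.e. label the fakes as "real" (1) and minimize BCE
g_loss = bce(d_fake, np.ones(3))

print(d_loss, g_loss)  # a confident D has low loss; G's loss is high
```

When D confidently rejects the fakes, the generator loss is large, which is the gradient signal G trains on; mode collapse and instability come from this tug-of-war getting unbalanced.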
Advanced GANs
Progressive growing, StyleGAN's W-space, conditional generation, image-to-image translation. Understanding these makes diffusion models click faster.
Paper: Progressive Growing of GANs – Karras et al., 2017
Paper: A Style-Based Generator Architecture (StyleGAN) – Karras et al., 2018
Paper: Pix2Pix – Isola et al., 2016
Paper: CycleGAN – Zhu et al., 2017
Variational Autoencoders (VAEs)
ELBO loss, reparameterization trick, KL divergence, structured latent spaces. VAEs underpin the latent space in Stable Diffusion's design.
Paper: Auto-Encoding Variational Bayes – Kingma & Welling, 2013
Video: Aladdin Persson – VAE from scratch in PyTorch (YouTube)
Blog: From Autoencoder to Beta-VAE – Lilian Weng
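The two named tricks each fit in a line or two. A sketch of the reparameterization trick and the closed-form KL term of the ELBO; the mu/logvar values are placeholders standing in for real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Encoder outputs: per-dimension mean and log-variance of q(z|x)
mu = np.array([0.5, -1.0, 0.0])
logvar = np.array([0.1, -0.2, 0.0])

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).
# The randomness is moved outside the computation graph, so gradients
# can flow through mu and sigma during training.
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * logvar) * eps

# Closed-form KL( q(z|x) || N(0, I) ): the ELBO's regularizer,
# pulling the latent code toward a standard Gaussian
kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
print(z, kl)
```

Note the KL term is exactly zero when mu = 0 and logvar = 0, i.e. when the posterior already matches the prior; that is the structure that makes the latent space smooth enough to interpolate in.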
Normalizing Flows
Invertible transformations, exact likelihood, Glow, RealNVP. No longer as widely used, but they build critical intuition for probability-based generation.
Paper: Glow – Kingma & Dhariwal, 2018
Blog: Flow-based Deep Generative Models – Lilian Weng
Video: Pieter Abbeel – Flow Models lecture (Berkeley)
CLIP – Connecting Text & Images
Contrastive learning, text-image alignment, zero-shot classification. CLIP is what gives Stable Diffusion and DALL·E their text understanding.
Paper: Learning Transferable Visual Models From Natural Language Supervision (CLIP) – Radford et al., 2021
Video: Yannic Kilcher – CLIP paper walkthrough
Repo: openai/CLIP on GitHub
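At inference time, CLIP's zero-shot classification comes down to cosine similarity plus a softmax. A sketch with random arrays standing in for embeddings; in real CLIP these come from the trained image and text towers, and the 512-dim size and large logit scale follow CLIP's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    """Project embeddings onto the unit sphere (cosine similarity prep)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for CLIP encoder outputs: 2 images, 3 candidate captions
image_emb = normalize(rng.normal(size=(2, 512)))
text_emb = normalize(rng.normal(size=(3, 512)))

# Zero-shot classification: cosine similarity of each image with every
# caption, scaled by a learned temperature, then softmax over captions
logit_scale = 100.0                              # roughly CLIP's learned scale
logits = logit_scale * image_emb @ text_emb.T    # (2, 3)
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)
pred = probs.argmax(-1)    # best-matching caption index per image
print(pred)
```

Contrastive training is what makes matching image-text pairs land close on that sphere; the classification step itself is just this geometry.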
PHASE 04

Diffusion Models: The State of the Art

Estimated time: 10–12 weeks

The dominant paradigm for image and video generation. Stable Diffusion, DALL·E, Sora: all diffusion-based.

โ„๏ธ
Diffusion Fundamentals
Forward noising process, reverse denoising, Markov chains, score matching, noise schedules. The mathematical heart of everything modern.
PaperDDPM โ€“ Ho et al., 2020 (the paper that started it)
BlogWhat are Diffusion Models? โ€“ Lilian Weng
VideoOutlier โ€“ Diffusion Models Explained (YouTube)
BlogThe Annotated Diffusion Model โ€“ Hugging Face
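The forward noising process has a closed form, which is what makes training tractable: you can jump straight to any timestep t without simulating the chain. A NumPy sketch using the DDPM paper's linear beta schedule, with a random array standing in for an image:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule, as in the DDPM paper (T = 1000)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)   # abar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, noise):
    """Closed-form sample from q(x_t | x_0):
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

x0 = rng.normal(size=(8, 8))          # stand-in "image"
noise = rng.standard_normal(x0.shape)

x_early = q_sample(x0, 10, noise)     # still mostly the image
x_late = q_sample(x0, 999, noise)     # essentially pure noise
print(alpha_bar[10], alpha_bar[999])
```

Training then amounts to: pick a random t, produce x_t this way, and ask the network to predict the noise that was mixed in; that single regression objective is the whole of DDPM.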
๐ŸŽ๏ธ
Faster Sampling โ€” DDIM, PNDM
DDPM is slow (1000 steps). DDIM enables 50-step generation with deterministic sampling. Understanding samplers lets you tune quality vs speed.
PaperDDIM โ€“ Song et al., 2020
PaperPNDM โ€“ Liu et al., 2022
BlogHugging Face Diffusers โ€“ Schedulers docs
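A deterministic DDIM update can be sketched in two lines: recover the model's estimate of the clean image from its noise prediction, then re-noise that estimate to the previous step's noise level. The sanity check below substitutes the true noise for a model prediction, so the step lands exactly where the formula says it should:

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0).
    First estimate x0 from the noise prediction, then re-noise it
    to the previous timestep's noise level."""
    x0_pred = (x_t - np.sqrt(1 - abar_t) * eps_pred) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1 - abar_prev) * eps_pred

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4))          # known clean "image"
eps = rng.standard_normal(x0.shape)   # known true noise (a "perfect model")
abar_t, abar_prev = 0.5, 0.8          # illustrative noise levels

x_t = np.sqrt(abar_t) * x0 + np.sqrt(1 - abar_t) * eps
x_prev = ddim_step(x_t, eps, abar_t, abar_prev)

# With the true noise, the step lands exactly on the abar_prev level
expected = np.sqrt(abar_prev) * x0 + np.sqrt(1 - abar_prev) * eps
print(np.allclose(x_prev, expected))
```

Because each step is deterministic, you can skip most of the 1,000 timesteps and still follow the same trajectory, which is why 50-step DDIM sampling works.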
Latent Diffusion (Stable Diffusion)
Compress images into a latent space first (with a VAE), then run diffusion there. This is exactly how Stable Diffusion works: the VAE downsamples each spatial side by 8x, making diffusion far cheaper than in pixel space.
Paper: High-Resolution Image Synthesis with Latent Diffusion Models – Rombach et al., 2022
Repo: CompVis/stable-diffusion on GitHub
Video: Tanishq Abraham – Stable Diffusion deep dive (YouTube)
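The efficiency gain is easy to quantify. Assuming Stable Diffusion's usual shapes (512x512 RGB images compressed to 64x64 latents with 4 channels), a quick count of the values the denoiser must process per step:

```python
# Stable Diffusion's VAE downsamples each spatial side by a factor of 8:
# a 512 x 512 x 3 image becomes a 64 x 64 x 4 latent.
pixel_elems = 512 * 512 * 3      # values per step, pixel-space diffusion
latent_elems = 64 * 64 * 4       # values per step, latent-space diffusion
print(pixel_elems // latent_elems)   # 48x fewer values to denoise
```

The "8x" refers to the per-side downsampling factor; in raw element counts the latent is dozens of times smaller, which is why SD trains and samples on consumer-grade hardware at all.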
Guidance: Classifier & CFG
Classifier guidance, classifier-free guidance (CFG), guidance scale. This is how you steer generation toward your text prompt. Critical for text-to-image.
Paper: Diffusion Models Beat GANs on Image Synthesis – Dhariwal & Nichol, 2021
Paper: Classifier-Free Diffusion Guidance – Ho & Salimans, 2021
Blog: CFG explained – Hugging Face blog
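CFG itself is a one-line formula: run the model twice per step (with and without the prompt) and extrapolate from the unconditional prediction toward the conditional one. A sketch with random arrays standing in for the two noise predictions; the 7.5 scale is the common default in Stable Diffusion pipelines:

```python
import numpy as np

def cfg(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the text-conditioned one.
    scale = 1 -> plain conditional; scale > 1 -> stronger prompt adherence."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.normal(size=(4, 4))   # model output with empty prompt
eps_cond = rng.normal(size=(4, 4))     # model output with the text prompt

# Edge cases behave as expected
assert np.allclose(cfg(eps_uncond, eps_cond, 0.0), eps_uncond)
assert np.allclose(cfg(eps_uncond, eps_cond, 1.0), eps_cond)

guided = cfg(eps_uncond, eps_cond, 7.5)   # a common default guidance scale
print(guided.shape)
```

Turning the scale up pushes samples harder toward the prompt at the cost of diversity and, eventually, image quality: the quality/adherence slider you'll tune constantly.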
ControlNet & Conditioning
Add spatial conditioning (edges, depth, pose) to diffusion models. Enables precise control over composition, which is huge in creative applications.
Paper: Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet) – Zhang et al., 2023
Repo: lllyasviel/ControlNet on GitHub
Video: Yannic Kilcher – ControlNet walkthrough
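ControlNet's key trick is the zero convolution: the trainable branch is wired into the frozen model through 1x1 convolutions initialized to zero, so at the start of training the extra branch contributes nothing and the pre-trained behavior is exactly preserved. A minimal NumPy sketch, modeling the 1x1 conv as a matrix multiply; the class name is mine:

```python
import numpy as np

class ZeroConv1x1:
    """1x1 convolution with zero-initialized weights and bias, as used to
    connect ControlNet's trainable branch to the frozen backbone."""
    def __init__(self, channels):
        self.w = np.zeros((channels, channels))  # zero-initialized weights
        self.b = np.zeros(channels)

    def __call__(self, x):      # x: (..., channels)
        return x @ self.w + self.b

rng = np.random.default_rng(0)
base_features = rng.normal(size=(16, 32))      # frozen U-Net features
control_features = rng.normal(size=(16, 32))   # features from the control branch

zconv = ZeroConv1x1(32)
out = base_features + zconv(control_features)  # residual injection
print(np.allclose(out, base_features))         # identical at init
```

Training then grows the control signal from zero, so the model never sees a sudden distribution shift; that is why ControlNet fine-tunes stably on top of a frozen Stable Diffusion.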
Fine-tuning: LoRA & DreamBooth
Personalize pre-trained models with a few images. LoRA is parameter-efficient; DreamBooth teaches a model a new concept (your face, an object, a style).
Paper: DreamBooth – Ruiz et al., 2022
Paper: LoRA – Hu et al., 2021
Repo: huggingface/diffusers training examples
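LoRA's parameter efficiency comes from a low-rank update: freeze the pre-trained weight W and learn W + (alpha/r)·BA, where A and B are tiny. A sketch; the dimensions and scale are illustrative choices, not values from any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 64, 4                      # feature dim, LoRA rank (r << d)
W = rng.normal(size=(d, d))       # frozen pre-trained weight

# LoRA factors: only A and B are trained.
# B starts at zero, so fine-tuning begins exactly at the base model.
A = rng.normal(0, 0.01, size=(r, d))
B = np.zeros((d, r))
alpha = 8.0                       # scaling hyperparameter

def lora_forward(x):
    """Forward pass through W plus the low-rank LoRA update."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d))
assert np.allclose(lora_forward(x), x @ W.T)   # identical to base at init

# Trainable parameters: 2*d*r for LoRA vs d*d for full fine-tuning
print(2 * d * r, d * d)
```

This is why LoRA checkpoints for Stable Diffusion are megabytes instead of gigabytes: you ship only A and B and add them onto the frozen weights at load time.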
Video Generation – Fundamentals
Temporal consistency, 3D U-Nets, video diffusion, optical flow. Extending image generation to video is the current research frontier.
Paper: Video Diffusion Models – Ho et al., 2022
Paper: Imagen Video – Ho et al., 2022
Paper: Make-A-Video – Singer et al., 2022 (Meta)
Paper: Align Your Latents – Blattmann et al., 2023
Hugging Face Ecosystem
The diffusers library, transformers, the model hub, pipelines, training scripts. Hugging Face is how you actually build with these models in practice.
Course: Hugging Face Diffusion Models Course (free)
Repo: huggingface/diffusers (official library)
Blog: Hugging Face Blog – new posts weekly
Key Insight: Diffusion models do one thing: predict and remove noise. Everything else (guidance, ControlNet, video, LoRA) is built on top of this single idea. Truly understanding DDPM makes all of Phase 4 much easier.
PHASE 05

Advanced & Research Level

Estimated time: ongoing

Cutting-edge architectures, video models like Sora, consistency models, and reading live research. This is where you contribute.

Consistency Models & Flow Matching
One-step generation, consistency training, rectified flow. The next evolution beyond DDPM: dramatically faster, and a newer research direction.
Paper: Consistency Models – Song et al., 2023
Paper: Flow Matching for Generative Modeling – Lipman et al., 2022
Paper: Stable Diffusion 3 (MM-DiT) – Esser et al., 2024
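Flow matching (in its rectified-flow form, as used in SD3) replaces diffusion's curved trajectory with a straight line between noise and data, and trains the network to predict that line's constant velocity. A sketch of one training pair:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rectified-flow training pair: interpolate linearly between a noise
# sample and a data sample; the regression target is the path's velocity.
x1 = rng.normal(size=(8, 8))           # "data" sample (stand-in image)
x0 = rng.standard_normal(x1.shape)     # noise sample
t = 0.3                                # a time drawn from [0, 1]

x_t = (1 - t) * x0 + t * x1            # point on the straight path
v_target = x1 - x0                     # velocity the network must predict

# A model that predicts v exactly can jump to data in ONE step from any t:
x1_recovered = x_t + (1 - t) * v_target
print(np.allclose(x1_recovered, x1))
```

Straight paths are what make few-step (even one-step) sampling plausible: with no curvature, a single Euler step along the true velocity already lands on the data.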
DiT – Diffusion Transformers
Replace the U-Net with a Transformer inside the diffusion model. DiT is the architecture behind Sora, SD3, and most frontier video models; it is the current dominant design.
Paper: Scalable Diffusion Models with Transformers (DiT) – Peebles & Xie, 2022
Video: Yannic Kilcher – DiT walkthrough
Repo: facebookresearch/DiT on GitHub
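A DiT treats the (latent) image the way ViT treats pixels: cut it into non-overlapping patches and feed them to a Transformer as a token sequence. A NumPy sketch of patchification, with sizes chosen to mimic a small latent (the function name and shapes are mine):

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) array into non-overlapping p x p patch tokens,
    as in ViT / DiT: the image becomes a sequence of flattened patches."""
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0
    tokens = img.reshape(h // p, p, w // p, p, c)
    tokens = tokens.transpose(0, 2, 1, 3, 4)       # group by patch position
    return tokens.reshape(-1, p * p * c)           # (num_patches, patch_dim)

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32, 4))   # e.g. a 32x32 latent with 4 channels
tokens = patchify(img, p=2)
print(tokens.shape)                  # 16x16 grid of patches, each 2*2*4 values
```

Once the image is a token sequence, everything from the Transformer section of Phase 2 applies unchanged; Sora's "spacetime patches" extend the same idea by patchifying across time as well.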
Sora & Long Video Generation
Video patches (spacetime tokens), world simulation, temporal coherence at scale. Understanding Sora's architecture (a video DiT over spacetime patches) is frontier knowledge.
Paper: Sora Technical Report – OpenAI, 2024
Blog: Sora first-principles analyses – various (arXiv)
Paper: CogVideoX – Zhipu AI, 2024 (open-source Sora alternative)
๐Ÿ…
RLHF & Alignment for Generation
Reinforcement learning from human feedback applied to image models. InstructPix2Pix, reward models for aesthetics โ€” how models learn human preferences.
PaperInstructPix2Pix โ€“ Brooks et al., 2022
PaperImageReward โ€“ Xu et al., 2023
BlogRLHF for diffusion โ€“ Hugging Face blog
Keep Up With Research
The field moves incredibly fast. Daily arXiv reading, the Twitter/X ML community, Papers With Code: staying current is a skill in itself.
Blog: arxiv.org/list/cs.CV/recent (daily)
Repo: paperswithcode.com/sota/image-generation
Video: Yannic Kilcher on YouTube – weekly paper reviews
Blog: The Batch – DeepLearning.AI newsletter

Essential Papers – In Order

// Read these in sequence. Each one builds on the last.

2014 Foundational GAN
Generative Adversarial Nets – Goodfellow et al.
The paper that introduced GANs. Two networks compete: a generator that creates images and a discriminator that judges them.
Why read it: The idea of adversarial training is foundational. Short paper (~9 pages). Very readable.
→ arxiv.org/abs/1406.2661
2013 Foundational
Auto-Encoding Variational Bayes – Kingma & Welling
Introduced VAEs: learning probabilistic latent representations. The math is dense but the concept is critical.
Why read it: The VAE's latent space concept is used directly in Stable Diffusion's architecture.
→ arxiv.org/abs/1312.6114
2018 GAN
A Style-Based Generator Architecture for GANs (StyleGAN) – Karras et al.
Introduced disentangled style control, W-space, and adaptive instance normalization. Photorealistic face generation.
Why read it: StyleGAN's concept of disentangled latent spaces directly influenced how diffusion model conditioning works.
→ arxiv.org/abs/1812.04948
2017 Attention
Attention Is All You Need – Vaswani et al.
The Transformer paper. Introduced self-attention, multi-head attention, and positional encodings. Revolutionized all of AI.
Why read it: Every modern generation model uses transformers. This is non-negotiable reading.
→ arxiv.org/abs/1706.03762
2021 CLIP
Learning Transferable Visual Models From Natural Language Supervision (CLIP) – Radford et al.
Trained image and text encoders together via contrastive learning on 400M image-text pairs. Enables text-to-image.
Why read it: CLIP is the text-understanding backbone of Stable Diffusion, DALL·E 2, and most modern models.
→ arxiv.org/abs/2103.00020
2020 Diffusion
Denoising Diffusion Probabilistic Models (DDPM) – Ho et al.
The paper that made diffusion models practical. Simplified the training objective to just predicting the noise added to an image.
Why read it: This is THE paper. Everything in modern image generation descends from this. Read it twice.
→ arxiv.org/abs/2006.11239
2020 Diffusion
Denoising Diffusion Implicit Models (DDIM) – Song et al.
Non-Markovian diffusion process that enables deterministic sampling in far fewer steps (50 instead of 1,000).
Why read it: Practical diffusion models use DDIM-style sampling. You'll configure this constantly.
→ arxiv.org/abs/2010.02502
2022 LDM
High-Resolution Image Synthesis with Latent Diffusion Models – Rombach et al.
Stable Diffusion's architecture. Moves diffusion into a compressed latent space using a pre-trained VAE. Huge efficiency gain.
Why read it: This IS Stable Diffusion. Understanding this paper means you understand how SD 1.5, SD 2, and SDXL work.
→ arxiv.org/abs/2112.10752
2023 Control
Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet) – Zhang & Agrawala
Adds spatial conditioning (edge maps, depth, pose) by copying encoder weights into a trainable copy connected through zero-convolution layers.
Why read it: ControlNet is everywhere in practical applications. The architecture idea is elegant and widely reused.
→ arxiv.org/abs/2302.05543
2022 Personalization
DreamBooth – Ruiz et al.
Fine-tune a diffusion model on 3–5 images of a subject using a unique identifier token. Teaches the model a new "concept."
Why read it: The foundation of model personalization: your face, your dog, your product in any style.
→ arxiv.org/abs/2208.12242
2022 Video
Video Diffusion Models – Ho et al.
Extends image diffusion to video with a 3D U-Net (joint spatial-temporal attention). First strong diffusion-based video results.
Why read it: The gateway into video generation research; all later video models reference this.
→ arxiv.org/abs/2204.03458
2022 DiT
Scalable Diffusion Models with Transformers (DiT) – Peebles & Xie
Replaces the U-Net backbone in diffusion models with a Vision Transformer. Scales better, and powers Sora and SD3.
Why read it: DiT is where the field is heading: Sora, Stable Diffusion 3, and most frontier models use this design.
→ arxiv.org/abs/2212.09748
2023 Speed
Consistency Models – Song et al.
Train models to generate high-quality images in a single step by enforcing consistency along diffusion trajectories.
Why read it: The direction the field is heading: single-step generation without quality loss.
→ arxiv.org/abs/2303.01469

Projects – Build as You Learn

// Concrete things to build at each stage. Theory without building is nothing.

01
Digit Generator with VAE
Train a Variational Autoencoder on MNIST. Interpolate between digits in latent space. Visualize the latent manifold with t-SNE.
PyTorch · VAE · MNIST
Beginner
02
DCGAN for Face Generation
Train a Deep Convolutional GAN on CelebA or Anime Faces dataset. Learn about training stability, mode collapse, and discriminator balancing.
PyTorch · GAN · CelebA
Beginner
03
Unconditional DDPM from Scratch
Implement DDPM on CIFAR-10 or MNIST following the original paper. Build the noise scheduler, U-Net, training loop, and sampling loop yourself.
PyTorch · Diffusion · CIFAR-10
Intermediate
04
Build a Text-to-Image App with Stable Diffusion
Use Hugging Face diffusers to build a web app (Gradio or Streamlit) that generates images from text. Experiment with CFG scale, samplers, seeds, negative prompts.
Diffusers · Gradio · Stable Diffusion
Intermediate
05
Fine-tune SD with DreamBooth on Custom Subject
Take 5-10 photos of yourself (or an object), fine-tune Stable Diffusion with DreamBooth, then generate yourself in various artistic styles.
DreamBooth · Fine-tuning · LoRA
Intermediate
06
Image-to-Image Pipeline with ControlNet
Build a pipeline that takes a sketch or edge map and generates a full image. Try canny edge, depth map, and human pose conditioning.
ControlNet · Diffusers · OpenCV
Intermediate
07
Video Interpolation with Diffusion
Use a pre-trained video diffusion model (like AnimateDiff or SVD) to generate short video clips from images. Build a Gradio interface. Analyze temporal coherence.
AnimateDiff · SVD · Video
Advanced
08
Train a Mini DiT Diffusion Transformer
Implement a small-scale DiT architecture from the paper. Train on a simple dataset (flowers, icons). Compare quality and compute vs U-Net baseline.
DiT · Transformers · Research
Expert

Tools & Libraries

// Your complete toolkit

PyTorch – Primary deep learning framework. The research standard.
Hugging Face – The diffusers and transformers libraries, plus the model hub.
Google Colab – Free GPU for training small models. Start here.
Lambda Labs – Affordable GPU cloud for bigger training runs.
Weights & Biases – Experiment tracking, loss curves, model comparison.
Gradio – Build demo UIs for your models in minutes.
GitHub – Version control. Clone research repos. Share your work.
AUTOMATIC1111 – Local Stable Diffusion WebUI. Experiment with models easily.
ComfyUI – Node-based Stable Diffusion interface. Great for complex workflows.
The Most Important Advice: Don't wait until you "know enough" to start building. Pick a project from Phase 1 and start now, even if you don't understand everything. The confusion you feel when building is what forces real understanding. Read papers alongside building: you'll absorb them 10x faster when you have context from your own experiments. The goal isn't to finish this roadmap; it's to become someone who ships real generative AI projects and reads research every week.