Zero to Research-Level

Image & Video
Generation AI

A complete, structured roadmap from absolute beginner to building your own generative models, with every paper, course, and project you need.

5
Phases
12–18
Months
30+
Key Papers
20+
Projects
Phase 1
Foundations
Phase 2
Deep Learning
Phase 3
Generative
Phase 4
Diffusion
Phase 5
Advanced
PHASE 01

Foundations

Estimated time: 6–8 weeks

Build the math and programming bedrock. Everything else depends on this. Don't skip it.

Python Programming
Learn Python syntax, NumPy for arrays, Matplotlib for plotting, and Jupyter notebooks. This is your coding language for everything.
Course: CS50P – Harvard Python (free on edX)
Course: fast.ai Practical Deep Learning – setup chapters
Video: Corey Schafer – Python YouTube series
Book: Python Crash Course – Eric Matthes
Linear Algebra
Vectors, matrices, matrix multiplication, dot products, eigenvalues. Neural networks are literally just matrix math.
Video: 3Blue1Brown – Essence of Linear Algebra (YouTube)
Course: MIT 18.06 – Gilbert Strang (free on OCW)
Book: Mathematics for Machine Learning – Cambridge (free PDF)
Calculus & Probability
Derivatives, chain rule (backprop!), partial derivatives. Probability distributions, Bayes' theorem, the Gaussian: all used constantly in generative models.
Video: 3Blue1Brown – Essence of Calculus (YouTube)
Course: Khan Academy – Probability & Statistics (free)
Book: Deep Learning Book, Ch. 3 – Goodfellow (free online)
Classic Machine Learning
Linear regression, logistic regression, decision trees, SVMs, overfitting, train/val/test splits. Understanding this prevents tons of confusion later.
Course: Andrew Ng – ML Specialization (Coursera, free to audit)
Book: Hands-On ML with Scikit-Learn & TF – Aurélien Géron
Repo: scikit-learn documentation + examples
Beginner Tip: Don't try to master everything here before moving on. Get 70–80% comfortable, then proceed. You'll revisit these concepts dozens of times as you build things; that's when it actually sticks.
PHASE 02

Deep Learning Core

Estimated time: 8–10 weeks

Neural networks, CNNs, and the frameworks that power every model you'll ever train.

Neural Networks from Scratch
Forward pass, loss functions, backpropagation, gradient descent. Build a neural net in pure NumPy: painful but deeply educational.
Video: Andrej Karpathy – Neural Networks: Zero to Hero (YouTube)
Course: fast.ai Part 1 – Practical Deep Learning (free)
Book: Deep Learning – Goodfellow, Bengio, Courville (free online)
Video: 3Blue1Brown – Neural Networks series
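The loop you'll build in these resources (forward pass, loss, backprop, gradient descent) fits on one page of NumPy. A minimal sketch, assuming a made-up toy task (regressing sin(x)) and variable names of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: learn y = sin(x) on [-pi, pi]
X = rng.uniform(-np.pi, np.pi, size=(256, 1))
y = np.sin(X)

# Two-layer MLP: 1 -> 16 -> 1, tanh hidden activation
W1 = rng.normal(0, 0.5, (1, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, (16, 1)); b2 = np.zeros(1)

lr = 0.05
losses = []
for step in range(500):
    # Forward pass
    h = np.tanh(X @ W1 + b1)          # hidden activations
    pred = h @ W2 + b2                # network output
    loss = np.mean((pred - y) ** 2)   # MSE loss
    losses.append(loss)

    # Backward pass: chain rule, by hand
    d_pred = 2 * (pred - y) / len(X)        # dL/dpred
    dW2 = h.T @ d_pred; db2 = d_pred.sum(0)
    d_h = d_pred @ W2.T * (1 - h ** 2)      # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ d_h; db1 = d_h.sum(0)

    # Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

If the loss curve goes down, your chain rule is right; if it explodes, you have the exact debugging experience these courses are designed around.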
Convolutional Neural Networks (CNNs)
Convolutions, pooling, feature maps, ResNet, VGG. CNNs are the backbone of almost all image generation architectures; know them thoroughly.
Course: Stanford CS231n – CNNs for Visual Recognition (free notes + videos)
Paper: Deep Residual Learning for Image Recognition (ResNet) – He et al., 2015
Video: Yannic Kilcher – ResNet paper walkthrough
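Before diving into CS231n, it helps to compute one convolution by hand. A small sketch, assuming single-channel, valid-mode convolution (really cross-correlation, as in DL frameworks) with a Sobel-style edge kernel chosen for illustration:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation (what DL frameworks call 'convolution')."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output value is a dot product of kernel and image window
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge detector applied to a step image
img = np.zeros((5, 5))
img[:, 3:] = 1.0                      # left side dark, right side bright
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
fmap = conv2d(img, sobel_x)           # feature map peaks at the edge
print(fmap)
```

The feature map is zero in flat regions and large at the brightness step, which is exactly the "feature detector" intuition behind learned conv filters.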
PyTorch Framework
Tensors, autograd, datasets, dataloaders, training loops, GPU usage. PyTorch is the standard for research. Learn it well.
Course: Official PyTorch Tutorials (pytorch.org)
Video: Patrick Loeber – PyTorch for Deep Learning (YouTube)
Book: Deep Learning with PyTorch – Eli Stevens (free online)
Repo: pytorch/examples on GitHub
Attention & Transformers
Self-attention, multi-head attention, positional encoding, ViT (Vision Transformer). Transformers now dominate generative AI; understand them deeply.
Paper: Attention Is All You Need – Vaswani et al., 2017
Blog: The Illustrated Transformer – Jay Alammar
Video: Andrej Karpathy – Let's build GPT from scratch
Paper: An Image is Worth 16x16 Words (ViT) – Dosovitskiy et al., 2020
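Self-attention reduces to a few matrix products. A NumPy sketch of single-head scaled dot-product attention, with sequence lengths and dimensions that are arbitrary choices of mine:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)    # each query's weights sum to 1
    return weights @ V, weights           # weighted mix of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 query tokens, d_k = 8
K = rng.normal(size=(6, 8))    # 6 key tokens
V = rng.normal(size=(6, 16))   # 6 value vectors, d_v = 16
out, w = attention(Q, K, V)
print(out.shape, w.shape)      # (4, 16) (4, 6)
```

Multi-head attention is just this computation run several times in parallel on learned projections of Q, K, V, with the outputs concatenated.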
๐Ÿ‹๏ธ
Training Best Practices
Batch normalization, dropout, learning rate schedules, mixed precision (fp16), gradient clipping, wandb/TensorBoard logging.
BlogAndrej Karpathy โ€“ A Recipe for Training Neural Networks
Coursefast.ai Part 2 โ€“ Deep Learning from the Foundations
Repowandb.ai tutorials on experiment tracking
Autoencoders & Latent Spaces
Encoders, decoders, bottleneck representations. The concept of "latent space" is central to VAEs, GANs, and diffusion models.
Video: Serrano Academy – Autoencoders explained (YouTube)
Paper: Auto-Encoding Variational Bayes (VAE) – Kingma & Welling, 2013
Blog: Understanding VAEs – Lilian Weng (lilianweng.github.io)
PHASE 03

Generative Models

Estimated time: 8–10 weeks

Now the fun begins. GANs, VAEs, flow models: learn how machines learn to generate images.

GANs – Generative Adversarial Networks
Generator vs. discriminator, adversarial loss, mode collapse, training instability. The architecture that started the modern image generation era.
Paper: Generative Adversarial Nets – Goodfellow et al., 2014
Paper: DCGAN – Radford et al., 2015
Video: Ari Seff – GAN lecture (YouTube)
Blog: GANs – Lilian Weng's comprehensive overview
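The adversarial objective is two binary cross-entropy losses pulling in opposite directions. A toy numerical sketch; the D(x) probabilities below are invented stand-ins for a trained discriminator's outputs, not real model predictions:

```python
import numpy as np

def bce(probs, labels):
    """Binary cross-entropy, the workhorse of the GAN objective."""
    eps = 1e-12  # avoid log(0)
    return -np.mean(labels * np.log(probs + eps)
                    + (1 - labels) * np.log(1 - probs + eps))

# Suppose the discriminator outputs D(x) = P(x is real)
d_real = np.array([0.9, 0.8, 0.95])  # D on real images: should be near 1
d_fake = np.array([0.1, 0.2, 0.05])  # D on generated images: should be near 0

# Discriminator loss: classify real as 1 and fake as 0
d_loss = bce(d_real, np.ones(3)) + bce(d_fake, np.zeros(3))

# Non-saturating generator loss: maximize log D(G(z)),
# i.e. label the fakes as "real" (1) and minimize BCE
g_loss = bce(d_fake, np.ones(3))

print(d_loss, g_loss)  # a confident D has low loss; G's loss is high
```

When D confidently rejects the fakes, the generator loss is large, which is the gradient signal G trains on; mode collapse and instability come from this tug-of-war getting unbalanced.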
Advanced GANs
Progressive growing, StyleGAN's W-space, conditional generation, image-to-image translation. Understanding these makes diffusion models click faster.
Paper: Progressive Growing of GANs – Karras et al., 2017
Paper: A Style-Based Generator Architecture (StyleGAN) – Karras et al., 2018
Paper: Pix2Pix – Isola et al., 2016
Paper: CycleGAN – Zhu et al., 2017
Variational Autoencoders (VAEs)
ELBO loss, reparameterization trick, KL divergence, structured latent spaces. VAEs underpin the latent space in Stable Diffusion's design.
Paper: Auto-Encoding Variational Bayes – Kingma & Welling, 2013
Video: Aladdin Persson – VAE from scratch in PyTorch (YouTube)
Blog: From Autoencoder to Beta-VAE – Lilian Weng
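The two named tricks each fit in a line or two. A sketch of the reparameterization trick and the closed-form KL term of the ELBO; the mu/logvar values are placeholders standing in for real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Encoder outputs: per-dimension mean and log-variance of q(z|x)
mu = np.array([0.5, -1.0, 0.0])
logvar = np.array([0.1, -0.2, 0.0])

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).
# The randomness is moved outside the computation graph, so gradients
# can flow through mu and sigma during training.
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * logvar) * eps

# Closed-form KL( q(z|x) || N(0, I) ): the ELBO's regularizer,
# pulling the latent code toward a standard Gaussian
kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
print(z, kl)
```

Note the KL term is exactly zero when mu = 0 and logvar = 0, i.e. when the posterior already matches the prior; that is the structure that makes the latent space smooth enough to interpolate in.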
Normalizing Flows
Invertible transformations, exact likelihood, Glow, RealNVP. No longer as widely used, but they build critical intuition for probability-based generation.
Paper: Glow – Kingma & Dhariwal, 2018
Blog: Flow-based Deep Generative Models – Lilian Weng
Video: Pieter Abbeel – Flow Models lecture (Berkeley)
CLIP – Connecting Text & Images
Contrastive learning, text-image alignment, zero-shot classification. CLIP is what gives Stable Diffusion and DALL·E their text understanding.
Paper: Learning Transferable Visual Models From Natural Language Supervision (CLIP) – Radford et al., 2021
Video: Yannic Kilcher – CLIP paper walkthrough
Repo: openai/CLIP on GitHub
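At inference time, CLIP's zero-shot classification comes down to cosine similarity plus a softmax. A sketch with random arrays standing in for embeddings; in real CLIP these come from the trained image and text towers, and the 512-dim size and large logit scale follow CLIP's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    """Project embeddings onto the unit sphere (cosine similarity prep)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for CLIP encoder outputs: 2 images, 3 candidate captions
image_emb = normalize(rng.normal(size=(2, 512)))
text_emb = normalize(rng.normal(size=(3, 512)))

# Zero-shot classification: cosine similarity of each image with every
# caption, scaled by a learned temperature, then softmax over captions
logit_scale = 100.0                              # roughly CLIP's learned scale
logits = logit_scale * image_emb @ text_emb.T    # (2, 3)
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)
pred = probs.argmax(-1)    # best-matching caption index per image
print(pred)
```

Contrastive training is what makes matching image-text pairs land close on that sphere; the classification step itself is just this geometry.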
PHASE 04

Diffusion Models: The State of the Art

Estimated time: 10–12 weeks

The dominant paradigm for image and video generation. Stable Diffusion, DALL·E, Sora: all diffusion-based.

โ„๏ธ
Diffusion Fundamentals
Forward noising process, reverse denoising, Markov chains, score matching, noise schedules. The mathematical heart of everything modern.
PaperDDPM โ€“ Ho et al., 2020 (the paper that started it)
BlogWhat are Diffusion Models? โ€“ Lilian Weng
VideoOutlier โ€“ Diffusion Models Explained (YouTube)
BlogThe Annotated Diffusion Model โ€“ Hugging Face
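The forward noising process has a closed form, which is what makes training tractable: you can jump straight to any timestep t without simulating the chain. A NumPy sketch using the DDPM paper's linear beta schedule, with a random array standing in for an image:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule, as in the DDPM paper (T = 1000)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)   # abar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, noise):
    """Closed-form sample from q(x_t | x_0):
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

x0 = rng.normal(size=(8, 8))          # stand-in "image"
noise = rng.standard_normal(x0.shape)

x_early = q_sample(x0, 10, noise)     # still mostly the image
x_late = q_sample(x0, 999, noise)     # essentially pure noise
print(alpha_bar[10], alpha_bar[999])
```

Training then amounts to: pick a random t, produce x_t this way, and ask the network to predict the noise that was mixed in; that single regression objective is the whole of DDPM.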
๐ŸŽ๏ธ
Faster Sampling โ€” DDIM, PNDM
DDPM is slow (1000 steps). DDIM enables 50-step generation with deterministic sampling. Understanding samplers lets you tune quality vs speed.
PaperDDIM โ€“ Song et al., 2020
PaperPNDM โ€“ Liu et al., 2022
BlogHugging Face Diffusers โ€“ Schedulers docs
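A deterministic DDIM update can be sketched in two lines: recover the model's estimate of the clean image from its noise prediction, then re-noise that estimate to the previous step's noise level. The sanity check below substitutes the true noise for a model prediction, so the step lands exactly where the formula says it should:

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0).
    First estimate x0 from the noise prediction, then re-noise it
    to the previous timestep's noise level."""
    x0_pred = (x_t - np.sqrt(1 - abar_t) * eps_pred) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1 - abar_prev) * eps_pred

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4))          # known clean "image"
eps = rng.standard_normal(x0.shape)   # known true noise (a "perfect model")
abar_t, abar_prev = 0.5, 0.8          # illustrative noise levels

x_t = np.sqrt(abar_t) * x0 + np.sqrt(1 - abar_t) * eps
x_prev = ddim_step(x_t, eps, abar_t, abar_prev)

# With the true noise, the step lands exactly on the abar_prev level
expected = np.sqrt(abar_prev) * x0 + np.sqrt(1 - abar_prev) * eps
print(np.allclose(x_prev, expected))
```

Because each step is deterministic, you can skip most of the 1,000 timesteps and still follow the same trajectory, which is why 50-step DDIM sampling works.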
Latent Diffusion (Stable Diffusion)
Compress images into a latent space first (with a VAE), then run diffusion there. This is exactly how Stable Diffusion works: the VAE downsamples each spatial side by 8x, making diffusion far cheaper than in pixel space.
Paper: High-Resolution Image Synthesis with Latent Diffusion Models – Rombach et al., 2022
Repo: CompVis/stable-diffusion on GitHub
Video: Tanishq Abraham – Stable Diffusion deep dive (YouTube)
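The efficiency gain is easy to quantify. Assuming Stable Diffusion's usual shapes (512x512 RGB images compressed to 64x64 latents with 4 channels), a quick count of the values the denoiser must process per step:

```python
# Stable Diffusion's VAE downsamples each spatial side by a factor of 8:
# a 512 x 512 x 3 image becomes a 64 x 64 x 4 latent.
pixel_elems = 512 * 512 * 3      # values per step, pixel-space diffusion
latent_elems = 64 * 64 * 4       # values per step, latent-space diffusion
print(pixel_elems // latent_elems)   # 48x fewer values to denoise
```

The "8x" refers to the per-side downsampling factor; in raw element counts the latent is dozens of times smaller, which is why SD trains and samples on consumer-grade hardware at all.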
Guidance: Classifier & CFG
Classifier guidance, classifier-free guidance (CFG), guidance scale. This is how you steer generation toward your text prompt. Critical for text-to-image.
Paper: Diffusion Models Beat GANs on Image Synthesis – Dhariwal & Nichol, 2021
Paper: Classifier-Free Diffusion Guidance – Ho & Salimans, 2021
Blog: CFG explained – Hugging Face blog
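CFG itself is a one-line formula: run the model twice per step (with and without the prompt) and extrapolate from the unconditional prediction toward the conditional one. A sketch with random arrays standing in for the two noise predictions; the 7.5 scale is the common default in Stable Diffusion pipelines:

```python
import numpy as np

def cfg(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the text-conditioned one.
    scale = 1 -> plain conditional; scale > 1 -> stronger prompt adherence."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.normal(size=(4, 4))   # model output with empty prompt
eps_cond = rng.normal(size=(4, 4))     # model output with the text prompt

# Edge cases behave as expected
assert np.allclose(cfg(eps_uncond, eps_cond, 0.0), eps_uncond)
assert np.allclose(cfg(eps_uncond, eps_cond, 1.0), eps_cond)

guided = cfg(eps_uncond, eps_cond, 7.5)   # a common default guidance scale
print(guided.shape)
```

Turning the scale up pushes samples harder toward the prompt at the cost of diversity and, eventually, image quality: the quality/adherence slider you'll tune constantly.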
ControlNet & Conditioning
Add spatial conditioning (edges, depth, pose) to diffusion models. Enables precise control over composition, which is huge in creative applications.
Paper: Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet) – Zhang et al., 2023
Repo: lllyasviel/ControlNet on GitHub
Video: Yannic Kilcher – ControlNet walkthrough
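ControlNet's key trick is the zero convolution: the trainable branch is wired into the frozen model through 1x1 convolutions initialized to zero, so at the start of training the extra branch contributes nothing and the pre-trained behavior is exactly preserved. A minimal NumPy sketch, modeling the 1x1 conv as a matrix multiply; the class name is mine:

```python
import numpy as np

class ZeroConv1x1:
    """1x1 convolution with zero-initialized weights and bias, as used to
    connect ControlNet's trainable branch to the frozen backbone."""
    def __init__(self, channels):
        self.w = np.zeros((channels, channels))  # zero-initialized weights
        self.b = np.zeros(channels)

    def __call__(self, x):      # x: (..., channels)
        return x @ self.w + self.b

rng = np.random.default_rng(0)
base_features = rng.normal(size=(16, 32))      # frozen U-Net features
control_features = rng.normal(size=(16, 32))   # features from the control branch

zconv = ZeroConv1x1(32)
out = base_features + zconv(control_features)  # residual injection
print(np.allclose(out, base_features))         # identical at init
```

Training then grows the control signal from zero, so the model never sees a sudden distribution shift; that is why ControlNet fine-tunes stably on top of a frozen Stable Diffusion.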
Fine-tuning: LoRA & DreamBooth
Personalize pre-trained models with a few images. LoRA is parameter-efficient; DreamBooth teaches a model a new concept (your face, an object, a style).
Paper: DreamBooth – Ruiz et al., 2022
Paper: LoRA – Hu et al., 2021
Repo: huggingface/diffusers training examples
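LoRA's parameter efficiency comes from a low-rank update: freeze the pre-trained weight W and learn W + (alpha/r)·BA, where A and B are tiny. A sketch; the dimensions and scale are illustrative choices, not values from any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 64, 4                      # feature dim, LoRA rank (r << d)
W = rng.normal(size=(d, d))       # frozen pre-trained weight

# LoRA factors: only A and B are trained.
# B starts at zero, so fine-tuning begins exactly at the base model.
A = rng.normal(0, 0.01, size=(r, d))
B = np.zeros((d, r))
alpha = 8.0                       # scaling hyperparameter

def lora_forward(x):
    """Forward pass through W plus the low-rank LoRA update."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d))
assert np.allclose(lora_forward(x), x @ W.T)   # identical to base at init

# Trainable parameters: 2*d*r for LoRA vs d*d for full fine-tuning
print(2 * d * r, d * d)
```

This is why LoRA checkpoints for Stable Diffusion are megabytes instead of gigabytes: you ship only A and B and add them onto the frozen weights at load time.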
Video Generation – Fundamentals
Temporal consistency, 3D U-Nets, video diffusion, optical flow. Extending image generation to video is the current research frontier.
Paper: Video Diffusion Models – Ho et al., 2022
Paper: Imagen Video – Ho et al., 2022
Paper: Make-A-Video – Singer et al., 2022 (Meta)
Paper: Align Your Latents – Blattmann et al., 2023
Hugging Face Ecosystem
The diffusers library, transformers, the model hub, pipelines, training scripts. Hugging Face is how you actually build with these models in practice.
Course: Hugging Face Diffusion Models Course (free)
Repo: huggingface/diffusers (official library)
Blog: Hugging Face Blog – new posts weekly
Key Insight: Diffusion models do one thing: predict and remove noise. Everything else (guidance, ControlNet, video, LoRA) is built on top of this single idea. Truly understanding DDPM makes all of Phase 4 much easier.
PHASE 05

Advanced & Research Level

Estimated time: ongoing

Cutting-edge architectures, video models like Sora, consistency models, and reading live research. This is where you contribute.

Consistency Models & Flow Matching
One-step generation, consistency training, rectified flow. The next evolution beyond DDPM: dramatically faster, and a newer research direction.
Paper: Consistency Models – Song et al., 2023
Paper: Flow Matching for Generative Modeling – Lipman et al., 2022
Paper: Stable Diffusion 3 (MM-DiT) – Esser et al., 2024
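Flow matching (in its rectified-flow form, as used in SD3) replaces diffusion's curved trajectory with a straight line between noise and data, and trains the network to predict that line's constant velocity. A sketch of one training pair:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rectified-flow training pair: interpolate linearly between a noise
# sample and a data sample; the regression target is the path's velocity.
x1 = rng.normal(size=(8, 8))           # "data" sample (stand-in image)
x0 = rng.standard_normal(x1.shape)     # noise sample
t = 0.3                                # a time drawn from [0, 1]

x_t = (1 - t) * x0 + t * x1            # point on the straight path
v_target = x1 - x0                     # velocity the network must predict

# A model that predicts v exactly can jump to data in ONE step from any t:
x1_recovered = x_t + (1 - t) * v_target
print(np.allclose(x1_recovered, x1))
```

Straight paths are what make few-step (even one-step) sampling plausible: with no curvature, a single Euler step along the true velocity already lands on the data.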
DiT – Diffusion Transformers
Replace the U-Net with a Transformer inside the diffusion model. DiT is the architecture behind Sora, SD3, and most frontier video models; it is the current dominant design.
Paper: Scalable Diffusion Models with Transformers (DiT) – Peebles & Xie, 2022
Video: Yannic Kilcher – DiT walkthrough
Repo: facebookresearch/DiT on GitHub
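A DiT treats the (latent) image the way ViT treats pixels: cut it into non-overlapping patches and feed them to a Transformer as a token sequence. A NumPy sketch of patchification, with sizes chosen to mimic a small latent (the function name and shapes are mine):

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) array into non-overlapping p x p patch tokens,
    as in ViT / DiT: the image becomes a sequence of flattened patches."""
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0
    tokens = img.reshape(h // p, p, w // p, p, c)
    tokens = tokens.transpose(0, 2, 1, 3, 4)       # group by patch position
    return tokens.reshape(-1, p * p * c)           # (num_patches, patch_dim)

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32, 4))   # e.g. a 32x32 latent with 4 channels
tokens = patchify(img, p=2)
print(tokens.shape)                  # 16x16 grid of patches, each 2*2*4 values
```

Once the image is a token sequence, everything from the Transformer section of Phase 2 applies unchanged; Sora's "spacetime patches" extend the same idea by patchifying across time as well.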
Sora & Long Video Generation
Video patches (spacetime tokens), world simulation, temporal coherence at scale. Understanding Sora's architecture (a video DiT over spacetime patches) is frontier knowledge.
Paper: Sora Technical Report – OpenAI, 2024
Blog: Sora first-principles analyses – various (arXiv)
Paper: CogVideoX – Zhipu AI, 2024 (open-source Sora alternative)
๐Ÿ…
RLHF & Alignment for Generation
Reinforcement learning from human feedback applied to image models. InstructPix2Pix, reward models for aesthetics โ€” how models learn human preferences.
PaperInstructPix2Pix โ€“ Brooks et al., 2022
PaperImageReward โ€“ Xu et al., 2023
BlogRLHF for diffusion โ€“ Hugging Face blog
Keep Up With Research
The field moves incredibly fast. Daily arXiv reading, the Twitter/X ML community, Papers With Code: staying current is a skill in itself.
Blog: arxiv.org/list/cs.CV/recent (daily)
Repo: paperswithcode.com/sota/image-generation
Video: Yannic Kilcher on YouTube – weekly paper reviews
Blog: The Batch – DeepLearning.AI newsletter

Essential Papers – In Order

// Read these in sequence. Each one builds on the last.

2014 Foundational GAN
Generative Adversarial Nets – Goodfellow et al.
The paper that introduced GANs. Two networks compete: a generator that creates images and a discriminator that judges them.
Why read it: The idea of adversarial training is foundational. Short paper (~9 pages). Very readable.
→ arxiv.org/abs/1406.2661
2013 Foundational
Auto-Encoding Variational Bayes – Kingma & Welling
Introduced VAEs: learning probabilistic latent representations. The math is dense but the concept is critical.
Why read it: The VAE's latent space concept is used directly in Stable Diffusion's architecture.
→ arxiv.org/abs/1312.6114
2018 GAN
A Style-Based Generator Architecture for GANs (StyleGAN) – Karras et al.
Introduced disentangled style control, W-space, and adaptive instance normalization. Photorealistic face generation.
Why read it: StyleGAN's concept of disentangled latent spaces directly influenced how diffusion model conditioning works.
→ arxiv.org/abs/1812.04948
2017 Attention
Attention Is All You Need – Vaswani et al.
The Transformer paper. Introduced self-attention, multi-head attention, and positional encodings. Revolutionized all of AI.
Why read it: Every modern generation model uses transformers. This is non-negotiable reading.
→ arxiv.org/abs/1706.03762
2021 CLIP
Learning Transferable Visual Models From Natural Language Supervision (CLIP) – Radford et al.
Trained image and text encoders together via contrastive learning on 400M image-text pairs. Enables text-to-image.
Why read it: CLIP is the text-understanding backbone of Stable Diffusion, DALL·E 2, and most modern models.
→ arxiv.org/abs/2103.00020
2020 Diffusion
Denoising Diffusion Probabilistic Models (DDPM) – Ho et al.
The paper that made diffusion models practical. Simplified the training objective to just predicting the noise added to an image.
Why read it: This is THE paper. Everything in modern image generation descends from this. Read it twice.
→ arxiv.org/abs/2006.11239
2020 Diffusion
Denoising Diffusion Implicit Models (DDIM) – Song et al.
Non-Markovian diffusion process that enables deterministic sampling in far fewer steps (50 instead of 1,000).
Why read it: Practical diffusion models use DDIM-style sampling. You'll configure this constantly.
→ arxiv.org/abs/2010.02502
2022 LDM
High-Resolution Image Synthesis with Latent Diffusion Models – Rombach et al.
Stable Diffusion's architecture. Moves diffusion into a compressed latent space using a pre-trained VAE. Huge efficiency gain.
Why read it: This IS Stable Diffusion. Understanding this paper means you understand how SD 1.5, SD 2, and SDXL work.
→ arxiv.org/abs/2112.10752
2023 Control
Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet) – Zhang & Agrawala
Adds spatial conditioning (edge maps, depth, pose) by copying encoder weights into a trainable copy connected through zero-convolution layers.
Why read it: ControlNet is everywhere in practical applications. The architecture idea is elegant and widely reused.
→ arxiv.org/abs/2302.05543
2022 Personalization
DreamBooth – Ruiz et al.
Fine-tune a diffusion model on 3–5 images of a subject using a unique identifier token. Teaches the model a new "concept."
Why read it: The foundation of model personalization: your face, your dog, your product in any style.
→ arxiv.org/abs/2208.12242
2022 Video
Video Diffusion Models – Ho et al.
Extends image diffusion to video with a 3D U-Net (joint spatial-temporal attention). First strong diffusion-based video results.
Why read it: The gateway into video generation research; all later video models reference this.
→ arxiv.org/abs/2204.03458
2022 DiT
Scalable Diffusion Models with Transformers (DiT) – Peebles & Xie
Replaces the U-Net backbone in diffusion models with a Vision Transformer. Scales better, and powers Sora and SD3.
Why read it: DiT is where the field is heading: Sora, Stable Diffusion 3, and most frontier models use this design.
→ arxiv.org/abs/2212.09748
2023 Speed
Consistency Models – Song et al.
Train models to generate high-quality images in a single step by enforcing consistency along diffusion trajectories.
Why read it: The direction the field is heading: single-step generation without quality loss.
→ arxiv.org/abs/2303.01469

Projects – Build as You Learn

// Concrete things to build at each stage. Theory without building is nothing.

01
Digit Generator with VAE
Train a Variational Autoencoder on MNIST. Interpolate between digits in latent space. Visualize the latent manifold with t-SNE.
PyTorch · VAE · MNIST
Beginner
02
DCGAN for Face Generation
Train a Deep Convolutional GAN on CelebA or Anime Faces dataset. Learn about training stability, mode collapse, and discriminator balancing.
PyTorch · GAN · CelebA
Beginner
03
Unconditional DDPM from Scratch
Implement DDPM on CIFAR-10 or MNIST following the original paper. Build the noise scheduler, U-Net, training loop, and sampling loop yourself.
PyTorch · Diffusion · CIFAR-10
Intermediate
04
Build a Text-to-Image App with Stable Diffusion
Use Hugging Face diffusers to build a web app (Gradio or Streamlit) that generates images from text. Experiment with CFG scale, samplers, seeds, negative prompts.
Diffusers · Gradio · Stable Diffusion
Intermediate
05
Fine-tune SD with DreamBooth on Custom Subject
Take 5-10 photos of yourself (or an object), fine-tune Stable Diffusion with DreamBooth, then generate yourself in various artistic styles.
DreamBooth · Fine-tuning · LoRA
Intermediate
06
Image-to-Image Pipeline with ControlNet
Build a pipeline that takes a sketch or edge map and generates a full image. Try canny edge, depth map, and human pose conditioning.
ControlNet · Diffusers · OpenCV
Intermediate
07
Video Interpolation with Diffusion
Use a pre-trained video diffusion model (like AnimateDiff or SVD) to generate short video clips from images. Build a Gradio interface. Analyze temporal coherence.
AnimateDiff · SVD · Video
Advanced
08
Train a Mini DiT Diffusion Transformer
Implement a small-scale DiT architecture from the paper. Train on a simple dataset (flowers, icons). Compare quality and compute vs U-Net baseline.
DiT · Transformers · Research
Expert

Tools & Libraries

// Your complete toolkit

PyTorch – Primary deep learning framework. The research standard.
Hugging Face – The diffusers and transformers libraries, plus the model hub.
Google Colab – Free GPU for training small models. Start here.
Lambda Labs – Affordable GPU cloud for bigger training runs.
Weights & Biases – Experiment tracking, loss curves, model comparison.
Gradio – Build demo UIs for your models in minutes.
GitHub – Version control. Clone research repos. Share your work.
AUTOMATIC1111 – Local Stable Diffusion WebUI. Experiment with models easily.
ComfyUI – Node-based Stable Diffusion interface. Great for complex workflows.
The Most Important Advice: Don't wait until you "know enough" to start building. Pick a project from Phase 1 and start now, even if you don't understand everything. The confusion you feel when building is what forces real understanding. Read papers alongside building: you'll absorb them 10x faster when you have context from your own experiments. The goal isn't to finish this roadmap; it's to become someone who ships real generative AI projects and reads research every week.