Deep Learning Architecture Design: A Comprehensive Learning Roadmap
A structured guide to mastering the skills needed to design, build, and train neural network architectures that combine CNNs, Transformers, and generative models for real-world applications.
Table of Contents
- Overview
- Phase 1: Mathematical Foundations
- Phase 2: Programming and Tools
- Phase 3: Machine Learning Fundamentals
- Phase 4: Deep Learning Core Concepts
- Phase 5: Architecture Deep Dives
- Phase 6: Generative Models
- Phase 7: Advanced Training Techniques
- Phase 8: Domain Specialization
- Phase 9: Research and Innovation
- Project Milestones
- Resource Summary
Overview
What This Roadmap Covers
By following this roadmap, you will learn to:
- Understand the mathematical foundations behind neural networks
- Implement core architectures (MLP, CNN, RNN, Transformer) from scratch
- Design hybrid architectures combining multiple techniques
- Build and train generative models (VAE, GAN, Diffusion, Flow Matching)
- Apply these skills to real-world domains (vision, audio, text, multimodal)
- Scale training to production-level systems
Prerequisites
- Basic programming experience (any language)
- High school mathematics
- Curiosity and persistence
Time Estimate
| Track | Duration |
| --- | --- |
| Part-time (10-15 hrs/week) | 18-24 months |
| Full-time (40+ hrs/week) | 6-9 months |
Phase 1: Mathematical Foundations
Duration: 4-6 weeks
1.1 Linear Algebra (Essential)
Nearly every computation in deep learning reduces to matrix and tensor operations.
Topics to Master
| Topic | Why It Matters |
| --- | --- |
| Vectors and matrices | Data representation |
| Matrix multiplication | Core of neural networks |
| Transpose, inverse | Weight manipulation |
| Eigenvalues/eigenvectors | Understanding PCA, stability |
| Norms (L1, L2) | Loss functions, regularization |
| Dot product, cosine similarity | Attention mechanisms |
| Broadcasting | Efficient tensor operations |
Resources
| Resource | Type | Cost |
| --- | --- | --- |
| 3Blue1Brown "Essence of Linear Algebra" | Video series | Free |
| MIT 18.06 Linear Algebra (Gilbert Strang) | Full course | Free |
| "Linear Algebra Done Right" by Axler | Textbook | ~$40 |
| Khan Academy Linear Algebra | Interactive | Free |
Checkpoint
You should be able to:
- [ ] Multiply matrices by hand and understand dimensions
- [ ] Explain what eigenvalues represent geometrically
- [ ] Implement matrix operations in NumPy without looking up syntax
1.2 Calculus (Essential)
Backpropagation is just the chain rule applied systematically.
Topics to Master
| Topic | Why It Matters |
| --- | --- |
| Derivatives | Gradient computation |
| Partial derivatives | Multi-variable functions |
| Chain rule | Backpropagation |
| Gradients and Jacobians | Vector calculus for NNs |
| Basic integrals | Probability distributions |
Resources
| Resource | Type | Cost |
| --- | --- | --- |
| 3Blue1Brown "Essence of Calculus" | Video series | Free |
| Khan Academy Calculus | Interactive | Free |
| "Calculus Made Easy" by Thompson | Textbook | Free online |
Checkpoint
You should be able to:
- [ ] Compute derivatives of common functions
- [ ] Apply chain rule to composite functions
- [ ] Calculate gradients of multi-variable functions
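One way to check chain-rule work is to compare an analytic derivative against a numerical estimate. The sketch below assumes an illustrative composite function, f(x) = sin(x²), which is not taken from this roadmap; the helper names are hypothetical.

```python
import math

def f(x):
    # Composite function: outer sin(u) applied to inner u = x^2
    return math.sin(x ** 2)

def f_prime(x):
    # Chain rule: d/dx sin(u) = cos(u) * du/dx, with du/dx = 2x
    return math.cos(x ** 2) * 2 * x

def numerical_derivative(fn, x, h=1e-6):
    # Central difference approximation of fn'(x)
    return (fn(x + h) - fn(x - h)) / (2 * h)

x = 1.3
analytic = f_prime(x)
numeric = numerical_derivative(f, x)
print(analytic, numeric)  # the two values should agree to several decimals
```

The same comparison, applied per parameter, is the usual "gradient check" for a hand-written backward pass.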
1.3 Probability and Statistics (Essential)
Understanding uncertainty, distributions, and model evaluation.
Topics to Master
| Topic | Why It Matters |
| --- | --- |
| Probability basics | Bayesian thinking |
| Random variables | Modeling uncertainty |
| Common distributions (Gaussian, Bernoulli, Categorical) | Generative models |
| Expectation and variance | Loss functions |
| Bayes' theorem | Probabilistic models |
| Maximum likelihood estimation | Training objective |
| KL divergence | VAE, information theory |
| Sampling methods | Generative inference |
Resources
| Resource | Type | Cost |
| --- | --- | --- |
| Khan Academy Statistics & Probability | Interactive | Free |
| "Think Stats" by Downey | Textbook | Free online |
| StatQuest YouTube channel | Video series | Free |
| "Pattern Recognition and ML" Ch. 1-2 by Bishop | Textbook | ~$80 |
Checkpoint
You should be able to:
- [ ] Explain Bayes' theorem with an example
- [ ] Sample from a Gaussian distribution and explain parameters
- [ ] Derive maximum likelihood for simple distributions
- [ ] Explain what KL divergence measures
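For the KL divergence checkpoint, a small numerical example may help build intuition. This is a minimal sketch for discrete distributions; the two distributions `p` and `q` are made-up illustrations, and the function assumes every `q_i` is positive.

```python
import numpy as np

def kl_divergence(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i), assuming all q_i > 0
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.7, 0.2, 0.1]          # illustrative "true" distribution
q = [1 / 3, 1 / 3, 1 / 3]    # illustrative approximation (uniform)
forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)
print(forward, reverse)
```

The two directions give different values, which illustrates that KL divergence is not symmetric and therefore not a true distance.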
1.4 Optimization (Important)
How neural networks actually learn.
Topics to Master
| Topic | Why It Matters |
| --- | --- |
| Gradient descent | Core optimization algorithm |
| Convexity | Understanding loss landscapes |
| Local vs global minima | Training challenges |
| Learning rate | Hyperparameter tuning |
| Momentum, Adam | Modern optimizers |
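Gradient descent and momentum from the table above can be tried on a toy problem before touching a neural network. The sketch below assumes a hypothetical convex quadratic, f(w) = 0.5 * (w₁² + 10 w₂²), chosen only because its gradient is easy to write by hand; the learning rate and step count are arbitrary.

```python
import numpy as np

def grad(w):
    # Gradient of f(w) = 0.5 * (1 * w[0]**2 + 10 * w[1]**2)
    return np.array([1.0, 10.0]) * w

def gradient_descent(w0, lr=0.05, momentum=0.0, steps=200):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)          # velocity; stays zero when momentum == 0
    for _ in range(steps):
        v = momentum * v - lr * grad(w)
        w = w + v
    return w

w_plain = gradient_descent([5.0, 5.0])
w_momentum = gradient_descent([5.0, 5.0], momentum=0.9)
print(w_plain, w_momentum)  # both should end close to the minimum at (0, 0)
```

Changing `lr` above roughly 0.2 makes the steep direction diverge, which is a quick way to see why the learning rate is treated as a key hyperparameter.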
Resources
| Resource | Type | Cost |
| --- | --- | --- |
| "Convex Optimization" Ch. 1-3 by Boyd | Textbook | Free online |
| Sebastian Ruder's "An overview of gradient descent optimization algorithms" | Blog post | Free |
Phase 2: Programming and Tools
Duration: 3-4 weeks
2.1 Python Proficiency
Topics to Master
| Topic | Why It Matters |
| --- | --- |
| Data structures (lists, dicts, sets) | Efficient coding |
| Object-oriented programming | Building models |
| Functional programming (map, lambda) | Data pipelines |
| Decorators, context managers | Advanced patterns |
| Debugging and profiling | Finding bottlenecks |
Resources
| Resource | Type | Cost |
| --- | --- | --- |
| "Automate the Boring Stuff with Python" | Book | Free online |
| "Fluent Python" by Ramalho | Book | ~$50 |
| Real Python tutorials | Website | Free/Paid |
2.2 NumPy (Essential)
The foundation of scientific Python.
Topics to Master
# You should be fluent in:
np.array, np.zeros, np.ones, np.random
np.reshape, np.transpose, np.squeeze, np.expand_dims
np.matmul, np.dot, @
Broadcasting rules
Indexing and slicing
np.sum, np.mean, np.max (with axis parameter)
np.concatenate, np.stack
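A short tour of several operations from the list above, as a sketch on a small made-up array. The shapes in the comments are what each call should produce.

```python
import numpy as np

x = np.arange(6).reshape(2, 3)            # shape (2, 3): [[0, 1, 2], [3, 4, 5]]

col_means = x.mean(axis=0)                # shape (3,): mean over rows
centered = x - col_means                  # (2, 3) - (3,) broadcasts across rows
row_sums = x.sum(axis=1, keepdims=True)   # shape (2, 1), keeps the reduced axis

w = np.ones((3, 4))
y = x @ w                                 # (2, 3) @ (3, 4) -> (2, 4)

batched = np.expand_dims(x, axis=0)       # shape (1, 2, 3)
squeezed = np.squeeze(batched, axis=0)    # back to shape (2, 3)
stacked = np.stack([x, x])                # new leading axis: shape (2, 2, 3)
joined = np.concatenate([x, x], axis=0)   # existing axis grows: shape (4, 3)
```

The difference between `np.stack` (adds an axis) and `np.concatenate` (extends an axis) is a frequent source of shape bugs, so it is worth checking `.shape` after each call while learning.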
Resources
| Resource | Type | Cost |
| --- | --- | --- |
| NumPy official tutorial | Documentation | Free |
| "From Python to NumPy" by Rougier | Book | Free online |
2.3 PyTorch (Essential)
The dominant framework for research.
Topics to Master
| Topic | Priority |
| --- | --- |
| Tensors and operations | Essential |
| Autograd (automatic differentiation) | Essential |
| nn.Module and building models | Essential |
| DataLoader and datasets | Essential |
| Training loops | Essential |
| GPU usage (.to(device)) | Essential |
| Saving/loading models | Important |
| Custom layers and functions | Important |
| Hooks for debugging | Useful |
| torch.compile (PyTorch 2.0) | Useful |
Resources
| Resource | Type | Cost |
| --- | --- | --- |
| PyTorch official tutorials | Documentation | Free |
| "Deep Learning with PyTorch" by Stevens | Book | Free online |
| fast.ai course | Video course | Free |
Checkpoint Project
Implement from scratch (no nn.Linear, etc.):
- [ ] A linear layer with forward and backward pass
- [ ] A simple 2-layer MLP
- [ ] Train on MNIST using manual gradient computation
2.4 Development Environment
| Tool | Purpose |
| --- | --- |
| Git | Version control |
| VS Code / PyCharm | IDE |
| Jupyter notebooks | Experimentation |
| conda / venv | Environment management |
| Docker (later) | Reproducibility |

| Tool | Purpose |
| --- | --- |
| TensorBoard | Visualization |
| Weights & Biases | Experiment tracking |
| Hydra | Configuration management |
| pytest | Testing |
Phase 3: Machine Learning Fundamentals
Duration: 4-6 weeks
3.1 Core Concepts
Topics to Master
| Topic | Description |
| --- | --- |
| Supervised learning | Learning from labeled data |
| Unsupervised learning | Finding patterns without labels |
| Train/validation/test splits | Proper evaluation |
| Cross-validation | Robust evaluation |
| Overfitting and underfitting | Model capacity |
| Bias-variance tradeoff | Understanding errors |
| Regularization (L1, L2) | Preventing overfitting |
| Feature engineering | Domain knowledge integration |
| Hyperparameter tuning | Model optimization |
3.2 Classical Algorithms
Understanding these helps appreciate what deep learning improves upon.
| Algorithm | Learn To |
| --- | --- |
| Linear/Logistic regression | Implement from scratch |
| Decision trees | Understand concept |
| Random forests | Understand ensembling |
| SVM | Understand kernels |
| K-means clustering | Implement from scratch |
| PCA | Understand dimensionality reduction |
3.3 Resources
| Resource | Type | Cost |
| --- | --- | --- |
| "Hands-On Machine Learning" by Géron (Ch. 1-8) | Book | ~$60 |
| Andrew Ng's Machine Learning (Coursera) | Course | Free to audit |
| "The Elements of Statistical Learning" | Book | Free online |
| Scikit-learn tutorials | Documentation | Free |
3.4 Checkpoint Project
- [ ] Implement logistic regression from scratch with gradient descent
- [ ] Build a complete ML pipeline: data loading, preprocessing, training, evaluation
- [ ] Achieve >95% on MNIST with a non-deep-learning method
Phase 4: Deep Learning Core Concepts
Duration: 6-8 weeks
4.1 Neural Network Fundamentals
Topics to Master
| Topic | Description | Priority |
| --- | --- | --- |
| Perceptron | Single neuron | Essential |
| Multi-layer perceptron (MLP) | Basic architecture | Essential |
| Activation functions | ReLU, GELU, Sigmoid, Tanh, Softmax | Essential |
| Loss functions | MSE, Cross-entropy, etc. | Essential |
| Backpropagation | How networks learn | Essential |
| Weight initialization | Xavier, He, etc. | Essential |
| Batch normalization | Training stability | Essential |
| Layer normalization | Transformer standard | Essential |
| Dropout | Regularization | Essential |
| Residual connections | Deep networks | Essential |
Activation Functions Deep Dive
ReLU: max(0, x) - Simple, sparse, can "die"
LeakyReLU: max(0.01x, x) - Prevents dying ReLU
GELU: x * Φ(x) - Smooth, used in Transformers
Sigmoid: 1/(1+e^(-x)) - Output in (0,1), gradients vanish
Tanh: (e^x - e^(-x))/... - Output in (-1,1), centered
Softmax: e^xi / Σe^xj - Probability distribution
SiLU/Swish: x * sigmoid(x) - Self-gated, smooth
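The formulas above translate almost directly into NumPy. This is a sketch rather than a reference implementation: `gelu_tanh` uses the common tanh approximation of GELU instead of the exact x * Φ(x), and `softmax` subtracts the row maximum for numerical stability, a detail the formula list does not mention.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    # SiLU / Swish: x * sigmoid(x)
    return x * sigmoid(x)

def gelu_tanh(x):
    # Tanh approximation of GELU (assumption: not the exact x * Phi(x))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def softmax(x, axis=-1):
    # Subtracting the max does not change the result but avoids overflow
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for fn in (relu, leaky_relu, sigmoid, np.tanh, silu, gelu_tanh, softmax):
    print(fn.__name__, fn(x))
```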
Understanding Backpropagation
This is critical. You must understand:
- Forward pass: compute outputs
- Loss computation: compare to target
- Backward pass: compute gradients via chain rule
- Parameter update: gradient descent step
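The four steps above can be written out by hand for a tiny network. The sketch below assumes a 2-layer MLP with ReLU, a mean-squared-error loss, and random toy data; all sizes, the seed, and the learning rate are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))   # toy inputs: 8 samples, 4 features
y = rng.normal(size=(8, 1))   # toy regression targets

W1, b1 = rng.normal(size=(4, 5)) * 0.5, np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)) * 0.5, np.zeros(1)
lr = 0.05
losses = []

for step in range(100):
    # 1. Forward pass
    z1 = X @ W1 + b1
    h = np.maximum(0.0, z1)              # ReLU
    pred = h @ W2 + b2
    # 2. Loss computation (mean squared error)
    loss = np.mean((pred - y) ** 2)
    losses.append(loss)
    # 3. Backward pass: chain rule, layer by layer
    d_pred = 2.0 * (pred - y) / pred.size
    dW2 = h.T @ d_pred
    db2 = d_pred.sum(axis=0)
    d_h = d_pred @ W2.T
    d_z1 = d_h * (z1 > 0)                # ReLU gradient is 1 where z1 > 0, else 0
    dW1 = X.T @ d_z1
    db1 = d_z1.sum(axis=0)
    # 4. Parameter update (plain gradient descent)
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(losses[0], losses[-1])  # the loss should go down over training
```

Each backward line mirrors one forward line in reverse order, which is the pattern autograd systems automate.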
4.2 Optimizers
| Optimizer | Key Idea | When to Use |
| --- | --- | --- |
| SGD | Basic gradient descent | Simple baselines |
| SGD + Momentum | Accumulate velocity | Better convergence |
| Adam | Adaptive learning rates | Default choice |
| AdamW | Adam + weight decay fix | Current best practice |
| LAMB | Large batch training | Distributed training |
4.3 Normalization Techniques
| Technique | Normalizes Across | Used In |
| --- | --- | --- |
| Batch Norm | Batch dimension | CNNs |
| Layer Norm | Feature dimension | Transformers |
| Instance Norm | Spatial dimensions | Style transfer |
| Group Norm | Channel groups | Small batches |
| RMS Norm | Simplified Layer Norm | LLMs |
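The main difference between these techniques is which axis the statistics are computed over. The sketch below shows that for a 2D input of shape (batch, features); it deliberately omits the learnable scale and shift parameters and the running statistics that real BatchNorm layers keep for inference.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # One mean/variance per feature, computed across the batch axis
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    # One mean/variance per sample, computed across the feature axis
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # Like layer norm, but without mean subtraction: divide by root-mean-square
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms

x = np.random.default_rng(0).normal(loc=2.0, scale=3.0, size=(16, 8))
print(batch_norm(x).mean(axis=0))    # ~0 for every feature
print(layer_norm(x).mean(axis=-1))   # ~0 for every sample
```

Because layer norm and RMS norm never look across the batch, they behave the same at batch size 1, which is one reason they are preferred in sequence models.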
4.4 Resources
| Resource | Type | Cost |
| --- | --- | --- |
| "Deep Learning" by Goodfellow (Ch. 6-8) | Book | Free online |
| "Dive into Deep Learning" (d2l.ai) | Interactive book | Free |
| CS231n Stanford (first 5 lectures) | Course | Free |
| Andrej Karpathy's "Neural Networks: Zero to Hero" | YouTube | Free |
4.5 Checkpoint Projects
- [ ] Implement backpropagation manually for a 2-layer MLP
- [ ] Train MLP on MNIST, achieve >98% accuracy
- [ ] Implement BatchNorm and LayerNorm from scratch
- [ ] Experiment with different optimizers, document findings
Phase 5: Architecture Deep Dives
Duration: 10-12 weeks
This is the core of understanding modern architectures.
5.1 Convolutional Neural Networks (CNNs)
Duration: 3-4 weeks
Core Concepts
| Concept | Description |
| --- | --- |
| Convolution operation | Sliding filter over input |
| Kernels/filters | Learnable patterns |
| Stride and padding | Output size control |
| Pooling (max, average) | Downsampling |
| Receptive field | What each neuron "sees" |
| Feature maps | Intermediate representations |
| 1D vs 2D vs 3D convolutions | Different data types |
| Depthwise separable convolutions | Efficient variant |
| Dilated/atrous convolutions | Increased receptive field |
| Transposed convolutions | Upsampling |
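Two of the concepts in the table, output size control and receptive field, come down to short arithmetic. The helpers below are a sketch with hypothetical names; `conv_output_size` covers standard (non-transposed) convolutions along one spatial dimension, and `receptive_field` assumes a plain stack of layers without dilation.

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    # Output length for an input of length n along one spatial dimension
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

def receptive_field(layers):
    # layers: list of (kernel, stride) pairs, applied in order
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the view by (k - 1) input-space steps
        jump *= stride              # stride spreads later layers further apart in input space
    return rf

print(conv_output_size(32, kernel=3, padding=1))              # "same" padding keeps 32
print(conv_output_size(224, kernel=7, stride=2, padding=3))   # halves 224 to 112
print(receptive_field([(3, 1), (3, 1), (3, 1)]))              # three 3x3 convs see 7 pixels
```

The last line shows why stacks of 3x3 filters became popular: three of them cover the same 7-pixel span as one 7x7 filter.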
Architecture Evolution
Study these in order:
| Year | Architecture | Key Innovation |
| --- | --- | --- |
| 2012 | AlexNet | GPU training, ReLU, dropout |
| 2014 | VGGNet | Deeper with small (3x3) filters |
| 2014 | GoogLeNet/Inception | Multi-scale processing |
| 2015 | ResNet | Skip connections (revolutionary!) |
| 2016 | DenseNet | Dense connections |
| 2017 | MobileNet | Depthwise separable convolutions |
| 2019 | EfficientNet | Compound scaling |
| 2022 | ConvNeXt | Modernized with Transformer tricks |
ConvNeXt Deep Dive
Since ConvNeXt represents modern CNN design, understand:
ConvNeXt Block:
1. Depthwise Conv 7x7 (large kernel, each channel separate)
2. LayerNorm (from Transformers)
3. Linear → 4x expansion (inverted bottleneck)
4. GELU activation (from Transformers)
5. Linear → back to original dim
6. Residual connection
Key modernizations:
- Larger kernels (7x7), comparable to the local attention windows used in Swin Transformer
- LayerNorm instead of BatchNorm
- GELU instead of ReLU
- Inverted bottleneck (expand then shrink)
- Fewer activation and normalization layers per block, with wider stages
Resources
| Resource | Type | Cost |
| --- | --- | --- |
| CS231n CNN lectures | Course | Free |
| "A ConvNet for the 2020s" (ConvNeXt paper) | Paper | Free |
| d2l.ai CNN chapters | Book | Free |
Checkpoint Projects
- [ ] Implement Conv2d from scratch
- [ ] Build and train LeNet-5 on MNIST
- [ ] Implement ResNet-18 from scratch
- [ ] Implement ConvNeXt block from scratch
- [ ] Train on CIFAR-10, achieve >90% accuracy
5.2 Recurrent Neural Networks (RNNs)
Duration: 2 weeks
Less critical now but provides important context.
Core Concepts
| Concept | Description |
| --- | --- |
| Vanilla RNN | Basic recurrence |
| Hidden state | Memory mechanism |
| BPTT | Backpropagation through time |
| Vanishing/exploding gradients | Why vanilla RNNs fail |
| LSTM | Gated memory cells |
| GRU | Simplified gating |
| Bidirectional RNNs | Both directions |
| Sequence-to-sequence | Encoder-decoder |
Why Learn RNNs?
- Historical context for understanding Transformer motivation
- Still used in some applications
- Understanding sequential processing concepts
Resources
| Resource | Type | Cost |
| --- | --- | --- |
| "The Unreasonable Effectiveness of RNNs" by Karpathy | Blog | Free |
| d2l.ai RNN chapters | Book | Free |
| Chris Olah's "Understanding LSTM Networks" | Blog | Free |
Checkpoint Projects
- [ ] Implement vanilla RNN from scratch
- [ ] Implement LSTM cell from scratch
- [ ] Train character-level language model
5.3 Transformers
Duration: 4-5 weeks
This is the most important architecture to master deeply.
Core Concepts
| Concept | Description | Priority |
| --- | --- | --- |
| Self-attention | Each position attends to all others | Essential |
| Query, Key, Value | Attention mechanism components | Essential |
| Scaled dot-product attention | QK^T/√d_k | Essential |
| Multi-head attention | Multiple attention patterns | Essential |
| Positional encoding | Injecting position information | Essential |
| Feed-forward network | Per-position MLP | Essential |
| Encoder vs Decoder | Different attention masks | Essential |
| Causal/masked attention | Autoregressive generation | Essential |
| Cross-attention | Attending to different sequence | Essential |
| Layer normalization placement | Pre-norm vs post-norm | Important |
| Residual connections | Gradient flow | Essential |
Attention Mechanism in Depth
Input: X (sequence of embeddings)
Step 1: Project to Q, K, V
Q = X @ W_q (queries: what am I looking for?)
K = X @ W_k (keys: what do I contain?)
V = X @ W_v (values: what do I output?)
Step 2: Compute attention scores
scores = Q @ K.T / sqrt(d_k)
Step 3: Apply softmax (per query)
weights = softmax(scores)
Step 4: Weighted sum of values
output = weights @ V
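The four steps above map onto a few lines of NumPy. This is a minimal single-head sketch for one unbatched sequence; the optional `causal` flag, the random projection matrices, and the sizes are assumptions added for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v, causal=False):
    # Step 1: project the input to queries, keys, and values
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # Step 2: scaled dot-product scores, shape (seq_len, seq_len)
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        # Block attention to future positions (strictly upper triangle)
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    # Step 3: softmax over keys, separately for each query
    weights = softmax(scores, axis=-1)
    # Step 4: weighted sum of values
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                              # 5 tokens, embedding size 8
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
output, weights = attention(X, W_q, W_k, W_v, causal=True)
print(output.shape, weights.round(2))
```

With `causal=True` the printed weight matrix is lower triangular, which is the masking a decoder uses for autoregressive generation.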
Multi-Head Attention
Instead of one attention:
- Split into h heads
- Each head has a smaller dimension (d_k = d_model / h)
- Concatenate outputs
- Project back to original dimension
Why? Different heads learn different patterns:
- Head 1: syntactic relationships
- Head 2: semantic relationships
- Head 3: positional patterns
- etc.
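The split, attend, concatenate, and project sequence can be sketched with reshapes. The code below assumes a single unbatched sequence, a model width `d_model` that divides evenly by `num_heads`, and random weights; it is an illustration of the data flow, not a tuned implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads        # per-head dimension

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (heads, seq, d_head)
    # Concatenate heads back to (seq_len, d_model), then project
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, num_heads = 8, 2
X = rng.normal(size=(5, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)
```

Using one large projection followed by a reshape is equivalent to giving each head its own smaller projection matrix, and is how most libraries implement it.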
Architecture Variants
| Architecture | Type | Key Use |
| --- | --- | --- |
| Original Transformer | Encoder-Decoder | Machine translation |
| BERT | Encoder only | Understanding, embeddings |
| GPT | Decoder only | Generation |
| T5 | Encoder-Decoder | Text-to-text |
| Vision Transformer (ViT) | Encoder only | Image classification |
| DETR | Encoder-Decoder | Object detection |
Modern Improvements
| Improvement | Description |
| --- | --- |
| RoPE | Rotary position embeddings |
| ALiBi | Attention with Linear Biases |
| Flash Attention | Memory-efficient attention |
| Grouped Query Attention | Efficient KV sharing |
| RMS Norm | Simplified normalization |
| SwiGLU | Gated activation |
| Pre-norm | Normalization applied before each sublayer (attention and feed-forward) |
Resources
| Resource | Type | Cost |
| --- | --- | --- |
| "The Illustrated Transformer" by Jay Alammar | Blog | Free |
| "Attention Is All You Need" (original paper) | Paper | Free |
| Andrej Karpathy's "Let's build GPT" | YouTube | Free |
| "The Annotated Transformer" by Harvard NLP | Blog | Free |
| d2l.ai Attention chapters | Book | Free |
| Lilian Weng's "The Transformer Family" | Blog | Free |
Checkpoint Projects
- [ ] Implement scaled dot-product attention from scratch
- [ ] Implement multi-head attention from scratch
- [ ] Implement positional encoding (sinusoidal and learned)
- [ ] Build complete Transformer encoder from scratch
- [ ] Build complete Transformer decoder from scratch
- [ ] Train small GPT on character-level text
- [ ] Implement cross-attention for encoder-decoder model
5.4 Hybrid Architectures
Duration: 2 weeks
Understanding how to combine architectures.
Common Combinations
| Combination | Example | Use Case |
| --- | --- | --- |
| CNN + Transformer | ViT, Swin Transformer | Vision |
| CNN + Attention | CBAM, SE-Net | Vision enhancement |
| Transformer + CNN head | Detection Transformers | Object detection |
| ConvNeXt + Cross-Attention | SupertonicTTS style | Multimodal |
Design Principles
- Match architecture to data structure
  - Images: Start with CNN for local features
  - Sequences: Transformer for global relationships
  - Both: Hybrid
- Consider computational budget
  - Attention: O(n²) in sequence length
  - Convolution: O(n) in sequence length
  - Use CNN for long sequences, attention where needed
- Conditioning mechanisms
  - Cross-attention: Flexible, powerful
  - Concatenation: Simple, limited
  - FiLM/AdaLN: Efficient for style transfer
  - Addition: Global conditioning
Phase 6: Generative Models
Duration: 10-12 weeks
This phase covers how to generate new data.
6.1 Autoencoders
Duration: 2 weeks
The foundation of generative latent models.
Concepts
| Concept | Description |
| --- | --- |
| Encoder | Compress input to latent |
| Decoder | Reconstruct from latent |
| Latent space | Compressed representation |
| Bottleneck | Forces meaningful compression |
| Reconstruction loss | Training objective |
Types
| Type | Description | Use Case |
| --- | --- | --- |
| Vanilla AE | Basic encoder-decoder | Compression |
| Denoising AE | Reconstruct from noisy input | Robust features |
| Sparse AE | L1 penalty on latent | Interpretable features |
| Contractive AE | Penalty on Jacobian | Robust latent |
Checkpoint
- [ ] Implement autoencoder for MNIST
- [ ] Visualize latent space
- [ ] Implement denoising autoencoder
6.2 Variational Autoencoders (VAEs)
Duration: 2-3 weeks
Probabilistic generative model.
Key Concepts
| Concept | Description |
| --- | --- |
| Probabilistic encoder | Output distribution, not point |
| Reparameterization trick | Enable backprop through sampling |
| KL divergence | Regularize latent to prior |
| ELBO | Evidence Lower Bound objective |
| Latent prior | Usually N(0, I) |
| Posterior collapse | When decoder ignores latent |
The Math
Encoder outputs: μ, σ (parameters of q(z|x))
Sampling: z = μ + σ * ε, where ε ~ N(0, I)
Loss = Reconstruction + β * KL(q(z|x) || p(z))
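The sampling step and the KL term above have simple closed forms when q(z|x) is a diagonal Gaussian and the prior is N(0, I). The sketch below assumes the encoder outputs `mu` and `log_var` (log σ², a common parameterization that this roadmap does not specify) and uses made-up values in place of a real encoder.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps, with eps ~ N(0, I); randomness lives in eps only
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
mu = np.array([[0.0, 0.0], [1.0, -1.0]])       # stand-in encoder outputs, batch of 2
log_var = np.array([[0.0, 0.0], [0.0, 0.0]])   # log sigma^2 = 0 means sigma = 1
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
print(z.shape, kl)
```

The first sample already matches the prior, so its KL term is zero; the second is shifted away from the origin and is penalized accordingly.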
Variants
| Variant | Innovation |
| --- | --- |
| β-VAE | Disentangled representations |
| VQ-VAE | Discrete latents |
| VAE-GAN | Combined with adversarial loss |
| Hierarchical VAE | Multiple latent levels |
Resources
| Resource | Type | Cost |
| --- | --- | --- |
| "Auto-Encoding Variational Bayes" (original paper) | Paper | Free |
| "Tutorial on VAEs" by Carl Doersch | Paper | Free |
| Lilian Weng's VAE blog | Blog | Free |
Checkpoint Projects
- [ ] Implement VAE from scratch
- [ ] Train on MNIST, generate samples
- [ ] Implement reparameterization trick
- [ ] Interpolate in latent space
- [ ] Implement β-VAE, compare disentanglement
6.3 Generative Adversarial Networks (GANs)
Duration: 2-3 weeks
Adversarial training paradigm.
Key Concepts
| Concept | Description |
| --- | --- |
| Generator | Creates fake samples |
| Discriminator | Distinguishes real vs fake |
| Adversarial training | Two-player game |
| Mode collapse | Generator produces limited variety |
| Training instability | GANs are hard to train |
The Training Loop
For each batch:
1. Train Discriminator:
- Real samples → D should output 1
- Fake samples (from G) → D should output 0
2. Train Generator:
- Generate fake samples
- D(fake) should output 1 (fool discriminator)
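The targets in the loop above correspond to binary cross-entropy losses on the discriminator's outputs. The sketch below computes only the two loss values from given discriminator probabilities; the networks themselves and the optimizer steps are left out, and the generator loss shown is the commonly used non-saturating form.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    # Binary cross-entropy; clipping avoids log(0)
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)))

def discriminator_loss(d_real, d_fake):
    # Real samples should score 1, fake samples should score 0
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_loss(d_fake):
    # Non-saturating form: the generator wants D(fake) to be 1
    return bce(d_fake, np.ones_like(d_fake))

# Made-up discriminator outputs for illustration
confident = discriminator_loss(np.array([0.99, 0.98]), np.array([0.01, 0.02]))
confused = discriminator_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
print(confident, confused, generator_loss(np.array([0.5, 0.5])))
```

A discriminator stuck near 0.5 on everything gives a loss of about 2 * ln 2, a useful number to recognize when reading GAN training curves.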
Evolution of GANs
| Year | Model | Innovation |
| --- | --- | --- |
| 2014 | GAN | Original |
| 2015 | DCGAN | Convolutional architecture |
| 2017 | WGAN | Wasserstein distance |
| 2017 | WGAN-GP | Gradient penalty |
| 2018 | BigGAN | Large scale |
| 2019 | StyleGAN | Style-based generator |
| 2020 | StyleGAN2 | Improved quality |
Resources
| Resource | Type | Cost |
| --- | --- | --- |
| "Generative Adversarial Networks" (original) | Paper | Free |
| "NIPS 2016 GAN Tutorial" by Goodfellow | Tutorial | Free |
| "GAN Hacks" repository | GitHub | Free |
Checkpoint Projects
- [ ] Implement vanilla GAN from scratch
- [ ] Train on MNIST
- [ ] Implement DCGAN
- [ ] Implement Wasserstein loss
- [ ] Diagnose and fix mode collapse
6.4 Diffusion Models
Duration: 3-4 weeks
Current state of the art for many image, audio, and video generation tasks.
Key Concepts
| Concept | Description |
| --- | --- |
| Forward process | Gradually add noise to data |
| Reverse process | Learn to denoise |
| Noise schedule | How fast to add noise |
| Score function | Gradient of log probability |
| U-Net | Common architecture |
| Classifier-free guidance | Conditional generation |
The Process
Forward (fixed): noise ε is added at every step
x_0 → x_1 → x_2 → ... → x_T (pure noise)
Reverse (learned): the model predicts the noise at every step
x_T → x_{T-1} → ... → x_1 → x_0 (clean data)
Training
1. Sample clean data x_0
2. Sample timestep t
3. Sample noise ε
4. Create noisy version: x_t = √(ᾱ_t) * x_0 + √(1-ᾱ_t) * ε
5. Predict noise: ε_pred = model(x_t, t)
6. Loss = ||ε - ε_pred||²
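The training steps above can be sketched in NumPy. The linear schedule below (β from 1e-4 to 0.02 over T = 1000 steps) follows the values reported in the DDPM paper; `zero_model` is a hypothetical placeholder standing in for a real noise-prediction network, and timesteps are 0-indexed here.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)         # cumulative product: ᾱ_t

def q_sample(x0, t, eps):
    # Step 4: jump straight to the noisy version at timestep t
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def training_loss(model, x0, rng):
    t = int(rng.integers(0, T))              # step 2: sample a timestep
    eps = rng.standard_normal(x0.shape)      # step 3: sample noise
    x_t = q_sample(x0, t, eps)               # step 4: create noisy input
    eps_pred = model(x_t, t)                 # step 5: predict the noise
    return float(np.mean((eps - eps_pred) ** 2))   # step 6: MSE between noises

def zero_model(x_t, t):
    # Placeholder "network" that always predicts zero noise
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0 = np.ones((4, 8))                         # stand-in for a batch of clean data
print(training_loss(zero_model, x0, rng))
print(alpha_bar[0], alpha_bar[-1])           # near 1 at the start, near 0 at the end
```

Because `q_sample` has a closed form, training never has to simulate the forward chain step by step.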
Inference
1. Start with pure noise x_T
2. For t = T, T-1, ..., 1:
- Predict noise: ε_pred = model(x_t, t)
- Remove noise: x_{t-1} = denoise(x_t, ε_pred, t)
3. Return x_0
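The `denoise` step above is left abstract; one concrete choice is the ancestral sampling update from the DDPM paper, sketched below with 0-indexed timesteps. To make the loop checkable without a trained network, `oracle_model` is a hypothetical stand-in that returns the exact noise for a dataset in which every point equals 2.0.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def ddpm_step(x_t, eps_pred, t, rng):
    # Posterior mean: remove the predicted noise and rescale
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mean                      # no noise is added on the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

def sample(model, shape, rng):
    x = rng.standard_normal(shape)       # start from pure noise
    for t in reversed(range(T)):
        x = ddpm_step(x, model(x, t), t, rng)
    return x

def oracle_model(x_t, t, x0_value=2.0):
    # Exact noise when all data equals x0_value (illustration only)
    return (x_t - np.sqrt(alpha_bar[t]) * x0_value) / np.sqrt(1.0 - alpha_bar[t])

samples = sample(oracle_model, (4,), np.random.default_rng(0))
print(samples)  # with the oracle, every sample lands on 2.0
```

DDIM and other fast samplers replace `ddpm_step` with a different update but keep the same outer loop.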
Important Papers
| Paper | Contribution |
| --- | --- |
| "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" | Original idea |
| "Denoising Diffusion Probabilistic Models" (DDPM) | Practical formulation |
| "Score-Based Generative Modeling through SDEs" | Unified framework |
| "Denoising Diffusion Implicit Models" (DDIM) | Faster sampling |
| "Classifier-Free Diffusion Guidance" | Better conditioning |
| "High-Resolution Image Synthesis with Latent Diffusion Models" | Latent diffusion (basis of Stable Diffusion) |
Resources
| Resource | Type | Cost |
| --- | --- | --- |
| "What are Diffusion Models?" by Lilian Weng | Blog | Free |
| "The Annotated Diffusion Model" | Blog | Free |
| Hugging Face Diffusers course | Tutorial | Free |
| "Understanding Diffusion Models: A Unified Perspective" | Paper | Free |
Checkpoint Projects
- [ ] Implement forward diffusion process
- [ ] Implement simple noise prediction network
- [ ] Train DDPM on MNIST
- [ ] Implement DDIM sampling
- [ ] Implement classifier-free guidance
- [ ] Train on CIFAR-10
6.5 Flow Matching
Duration: 2-3 weeks
Modern alternative to diffusion.
Key Concepts
| Concept | Description |
| --- | --- |
| Continuous Normalizing Flows | ODE-based transformation |
| Vector field | Direction at each point |
| Optimal transport | Straight paths |
| Simulation-free training | Efficient training |
Comparison with Diffusion
| Aspect | Diffusion | Flow Matching |
| --- | --- | --- |
| Predicts | Noise ε | Direction v |
| Update | Subtract noise | Add direction |
| Paths | Can be curved | Often straighter |
| Steps needed | Often 50-1000 | Often 10-50 |
| Training | Noise prediction | Direction prediction |
The Math
Interpolation:
x_t = (1-t) * x_0 + t * x_1
where x_0 ~ noise, x_1 ~ data
True velocity:
v = x_1 - x_0
Training:
Loss = ||v_pred - v||²
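The interpolation, target velocity, and loss above fit in a few lines, and sampling is an ODE integration from noise (t = 0) to data (t = 1). The sketch below uses a plain Euler solver and a hypothetical `oracle_velocity` that is exact when every data point equals 2.0, so the result can be checked without training a network.

```python
import numpy as np

def interpolate(x0, x1, t):
    # Straight-line path from noise x0 (t = 0) to data x1 (t = 1)
    return (1.0 - t) * x0 + t * x1

def flow_matching_loss(v_model, x0, x1, t):
    x_t = interpolate(x0, x1, t)
    target = x1 - x0                     # true velocity along the straight path
    return float(np.mean((v_model(x_t, t) - target) ** 2))

def euler_sample(v_model, x0, num_steps=20):
    # Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-size Euler steps
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * v_model(x, k * dt)
    return x

def oracle_velocity(x_t, t, data_value=2.0):
    # Exact velocity field when all data equals data_value (illustration only)
    return (data_value - x_t) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.standard_normal((4,))
data = np.full((4,), 2.0)
print(flow_matching_loss(oracle_velocity, noise, data, t=0.3))  # ~0 for the oracle
print(euler_sample(oracle_velocity, noise))                     # ends at the data value
```

Swapping `euler_sample` for a higher-order solver is one way to explore the trade-off between step count and sample quality.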
Resources
| Resource | Type | Cost |
| --- | --- | --- |
| "Flow Matching for Generative Modeling" | Paper | Free |
| "Flow Straight and Fast" | Paper | Free |
| Meta's flow matching tutorial | Blog | Free |
Checkpoint Projects
- [ ] Implement flow matching training
- [ ] Compare with diffusion on same task
- [ ] Experiment with different ODE solvers
- [ ] Implement conditional flow matching
6.6 Latent Generative Models
Duration: 2 weeks
Combining autoencoders with generative models.
Architecture Pattern
Training:
1. Train autoencoder: data ↔ latent
2. Train generative model in latent space
Inference:
1. Generate latent from noise
2. Decode latent to data
Why Latent Space?
| Benefit | Description |
| --- | --- |
| Computational efficiency | Smaller space to model |
| Semantic compression | Latent captures meaning |
| Decoupled training | Separate reconstruction from generation |
Examples
| Model | Autoencoder | Generative |
| --- | --- | --- |
| Stable Diffusion | VAE | Diffusion |
| DALL-E | VQ-VAE | Transformer |
| AudioLDM | VAE | Diffusion |
Checkpoint Projects
- [ ] Train VAE, then train diffusion in latent space
- [ ] Compare quality vs training in pixel space
- [ ] Experiment with different latent dimensions
Phase 7: Advanced Training Techniques
Duration: 4-6 weeks
Scaling to real-world systems.
7.1 Large-Scale Training
Topics
| Topic | Description |
| --- | --- |
| Multi-GPU training | DataParallel, DistributedDataParallel |
| Mixed precision | FP16/BF16 training |
| Gradient accumulation | Simulate larger batches |
| Gradient checkpointing | Trade compute for memory |
| Model parallelism | Split model across GPUs |
| Pipeline parallelism | Sequential stages |
Resources
| Resource | Type | Cost |
| --- | --- | --- |
| PyTorch Distributed tutorial | Documentation | Free |
| Hugging Face Accelerate | Library + docs | Free |
| DeepSpeed documentation | Library + docs | Free |
7.2 Stabilizing Training
Techniques
| Technique | Purpose |
| --- | --- |
| Learning rate warmup | Stable early training |
| Learning rate scheduling | Cosine, linear decay |
| Gradient clipping | Prevent explosions |
| Weight decay | Regularization |
| EMA (Exponential Moving Average) | Smooth checkpoints |
| Proper initialization | Stable forward pass |
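Warmup, cosine decay, and EMA from the table are each only a few lines of arithmetic. The sketch below uses standard-library Python only; the function names and default values (base learning rate, warmup length, total steps, decay) are illustrative assumptions, not recommendations.

```python
import math

def lr_at_step(step, base_lr=3e-4, warmup_steps=1000, total_steps=10000, min_lr=0.0):
    if step < warmup_steps:
        # Linear warmup from near zero up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def ema_update(ema_value, new_value, decay=0.999):
    # Exponential moving average: mostly keep the old value, blend in a little of the new
    return decay * ema_value + (1.0 - decay) * new_value

for step in (0, 500, 999, 1000, 5500, 10000):
    print(step, lr_at_step(step))
```

In practice `ema_update` is applied to every model weight after each optimizer step, and the averaged copy is the one used for evaluation or sampling.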
7.3 Hyperparameter Optimization
Approaches
| Approach | Description |
| --- | --- |
| Grid search | Exhaustive (expensive) |
| Random search | Often better than grid |
| Bayesian optimization | Sample efficient |
| Population-based training | Adaptive |
7.4 Debugging Deep Learning
Common Issues and Solutions
| Issue | Diagnosis | Solution |
| --- | --- | --- |
| Loss NaN | Gradient explosion | Gradient clipping, lower LR |
| Loss plateau | Stuck in local minimum | LR schedule, different init |
| Overfitting | Train/val gap | Regularization, data augmentation |
| Underfitting | High train loss | Larger model, longer training |
| Mode collapse (GAN) | Limited variety | Spectral norm, WGAN |
# Things to monitor:
- Loss curves (train and val)
- Gradient norms per layer
- Activation statistics
- Weight distributions
- Learning rate
- GPU memory usage
Phase 8: Domain Specialization
Choose one or more domains to specialize in.
8.1 Computer Vision
Duration: 4-6 weeks
Topics
| Topic | Models |
| --- | --- |
| Image classification | ResNet, ViT, ConvNeXt |
| Object detection | YOLO, DETR |
| Semantic segmentation | U-Net, SegFormer |
| Instance segmentation | Mask R-CNN |
| Image generation | GAN, Diffusion |
| Image-to-image | pix2pix, CycleGAN |
Resources
| Resource | Type | Cost |
| --- | --- | --- |
| CS231n Stanford | Course | Free |
| "Deep Learning for Vision" (d2l.ai) | Book | Free |
8.2 Natural Language Processing
Duration: 4-6 weeks
Topics
| Topic | Models |
| --- | --- |
| Text classification | BERT |
| Named entity recognition | BERT + CRF |
| Machine translation | Transformer, T5 |
| Language modeling | GPT |
| Text generation | GPT, LLaMA |
| Question answering | BERT, T5 |
Resources
| Resource | Type | Cost |
| --- | --- | --- |
| CS224n Stanford | Course | Free |
| Hugging Face NLP course | Course | Free |
| "Speech and Language Processing" | Book | Free online |
8.3 Audio and Speech
Duration: 4-6 weeks
Fundamentals
| Topic | Description |
| --- | --- |
| Digital audio | Sampling, quantization |
| Fourier transform | Time to frequency |
| Spectrograms | Time-frequency representation |
| Mel scale | Perceptual frequency scale |
| MFCC | Classic features |
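The step from waveform to spectrogram, and from hertz to the mel scale, can be tried with NumPy alone. The sketch below assumes a synthetic 440 Hz tone at a 16 kHz sample rate, a Hann window, and the widely used 2595 * log10(1 + f / 700) mel formula; libraries such as librosa offer several mel variants, so treat this as one convention among others.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def stft_magnitude(signal, frame_size=1024, hop=256):
    # Slice into overlapping frames, window each one, and take the FFT magnitude
    window = np.hanning(frame_size)
    starts = range(0, len(signal) - frame_size + 1, hop)
    frames = np.stack([signal[s:s + frame_size] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=-1))   # (num_frames, frame_size // 2 + 1)

sr = 16000
t = np.arange(sr) / sr                            # one second of samples
tone = np.sin(2 * np.pi * 440.0 * t)              # synthetic 440 Hz sine wave
spec = stft_magnitude(tone)
freqs = np.fft.rfftfreq(1024, d=1.0 / sr)
peak_hz = freqs[spec.mean(axis=0).argmax()]
print(spec.shape, peak_hz, hz_to_mel(peak_hz))
```

The peak lands on the frequency bin nearest 440 Hz; the bin spacing of sr / frame_size is the frequency resolution that the frame size trades against time resolution.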
Tasks
| Task | Description |
| --- | --- |
| Speech recognition (ASR) | Audio → text |
| Text-to-speech (TTS) | Text → audio |
| Voice conversion | Change speaker identity |
| Music generation | Create music |
| Audio classification | Classify sounds |
| Source separation | Unmix audio |
Key Models
| Model | Task |
| --- | --- |
| Wav2Vec 2.0 | Speech representation |
| Whisper | Speech recognition |
| Tacotron | TTS |
| VITS | TTS |
| HiFi-GAN | Vocoder |
| Vocos | Vocoder (ConvNeXt) |
Resources
| Resource | Type | Cost |
| --- | --- | --- |
| "Speech and Language Processing" | Book | Free online |
| ESPnet tutorials | Code | Free |
| librosa documentation | Library | Free |
8.4 Multimodal Learning
Duration: 4-6 weeks
Combining multiple modalities.
Topics
| Topic | Examples |
| --- | --- |
| Vision-Language | CLIP, BLIP |
| Text-to-Image | Stable Diffusion, DALL-E |
| Image-to-Text | Image captioning |
| Audio-Visual | Video understanding |
| Any-to-any | Unified models |
Phase 9: Research and Innovation
9.1 Reading Papers Effectively
Strategy
- First pass (5 min): Title, abstract, figures, conclusion
- Second pass (30 min): Introduction, method overview, experiments
- Third pass (2+ hrs): Full details, math, implementation
Where to Find Papers
| Source | Description |
| --- | --- |
| arXiv | Preprints |
| Papers With Code | Papers + implementations |
| Google Scholar | Search + citations |
| Semantic Scholar | AI-powered search |
| Conference proceedings | ICML, NeurIPS, ICLR, CVPR |
9.2 Implementing Papers
Process
- Read paper multiple times
- Find official code (if available)
- Identify key components
- Implement incrementally
- Reproduce results on small scale first
- Debug against reference implementation
9.3 Staying Current
Resources
| Resource | Type |
| --- | --- |
| Twitter/X ML community | Social |
| r/MachineLearning | Forum |
| Papers With Code newsletter | Email |
| Two Minute Papers (YouTube) | Video |
| The Batch (Andrew Ng) | Newsletter |
Project Milestones
Beginner Projects
| Project | Skills Practiced |
| --- | --- |
| MNIST classifier (MLP) | Basics, training loop |
| CIFAR-10 classifier (CNN) | Convolutions, augmentation |
| Sentiment analysis (RNN) | Sequential processing |
| Character-level LM (Transformer) | Attention, generation |
Intermediate Projects
| Project | Skills Practiced |
| --- | --- |
| Image autoencoder | Encoder-decoder, latent space |
| VAE for faces | Probabilistic models |
| DCGAN for images | Adversarial training |
| BERT fine-tuning | Transfer learning |
| Small GPT training | Language modeling |
Advanced Projects
| Project | Skills Practiced |
| --- | --- |
| Diffusion model for images | Modern generative models |
| Latent diffusion model | Combining AE + diffusion |
| Vision Transformer from scratch | Full architecture |
| Multimodal model | Cross-attention, multiple encoders |
| Custom TTS system | Full pipeline integration |
Capstone Ideas
| Project | Description |
| --- | --- |
| Novel architecture design | Combine techniques in new way |
| Reproduce recent paper | Validate understanding |
| Domain application | Apply to specific problem |
| Efficiency improvement | Make existing model faster/smaller |
Resource Summary
Essential Books
| Book | Focus | Cost |
| --- | --- | --- |
| "Deep Learning" by Goodfellow et al. | Theory | Free online |
| "Dive into Deep Learning" (d2l.ai) | Practical | Free online |
| "Hands-On ML" by Géron | Applied | ~$60 |
| "Understanding Deep Learning" by Prince | Modern | Free online |
Essential Courses
| Course | Provider | Cost |
| --- | --- | --- |
| CS231n (CNNs) | Stanford | Free |
| CS224n (NLP) | Stanford | Free |
| Fast.ai | Fast.ai | Free |
| Deep Learning Specialization | Coursera | Free to audit |
Essential Blogs
| Blog | Focus |
| --- | --- |
| Jay Alammar | Visual explanations |
| Lilian Weng | Comprehensive surveys |
| Chris Olah | Neural network intuition |
| Sebastian Ruder | NLP |
| The Gradient | Research summaries |
Essential YouTube Channels
| Channel | Focus |
| --- | --- |
| 3Blue1Brown | Math intuition |
| Andrej Karpathy | Implementation |
| Yannic Kilcher | Paper reviews |
| Two Minute Papers | Research summaries |
| StatQuest | Statistics |
Essential Code Repositories
| Repository | Content |
| --- | --- |
| lucidrains | Clean implementations |
| huggingface/transformers | Production models |
| karpathy/nanoGPT | Minimal GPT |
| huggingface/diffusers | Diffusion models |
| labmlai/annotated_deep_learning_paper_implementations | Paper implementations |
Final Notes
Keys to Success
- Consistency over intensity: Regular practice beats sporadic cramming
- Implement everything: Reading is not enough
- Start simple: Get basics working before adding complexity
- Debug systematically: Use visualization, logging, unit tests
- Join communities: Learn from others, ask questions
- Read code: Study implementations, not just papers
- Build projects: Apply knowledge to real problems
Common Pitfalls to Avoid
| Pitfall | Solution |
| --- | --- |
| Tutorial hell | Build original projects |
| Skipping math | Invest time in foundations |
| Only reading, not coding | Implement everything |
| Jumping to advanced topics | Master basics first |
| Working in isolation | Join communities |
| Ignoring debugging skills | Learn systematic debugging |
Measure Your Progress
You're ready for the next level when you can:
- [ ] Basics: Implement MLP, train on MNIST without tutorials
- [ ] CNN: Implement ResNet block, explain receptive fields
- [ ] Transformer: Implement attention from scratch, explain every component
- [ ] Generative: Train VAE, GAN, or diffusion model from scratch
- [ ] Integration: Design hybrid architecture for a new problem
- [ ] Research: Read papers and implement them independently
Last updated: November 2025
Remember: The goal is not to memorize, but to understand deeply enough to create something new.