CS5720 - Week 3
Slide 51 of 60

Optimization Algorithms Beyond SGD

The Optimization Landscape

While SGD laid the foundation, modern deep learning requires more sophisticated optimization algorithms to train complex networks efficiently and reliably.
Evolution of Optimizers:

1950s-1980s: Basic gradient descent
1990s: Momentum methods
2000s: Adaptive learning rates
2010s: Adam and variants dominate
2020s: Optimizer combinations & scheduling
Key Insight:
Different optimizers excel at different tasks. Understanding their strengths helps you choose the right tool for your problem.

Why SGD Isn't Enough

  • 📊 Fixed Learning Rate Dilemma: one learning rate doesn't fit all parameters or training stages (see the SGD sketch after this list)
  • 🎯 Sparse Gradients: some parameters are rarely updated (e.g., word embeddings)
  • 📈 Noisy Gradients: mini-batch sampling introduces variance into each gradient estimate
  • ⚖️ Different Parameter Scales: weights in different layers need different effective learning rates
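To make the fixed-learning-rate dilemma concrete, here is a minimal sketch of the vanilla SGD update (plain NumPy on a made-up two-parameter toy problem, not code from the lecture). One global step size is applied to every parameter, so a parameter with large, noisy gradients and a parameter with sparse gradients are treated identically.

```python
import numpy as np

# Toy problem (an assumption for illustration): both parameters want to reach 3.0,
# but the first gets large, noisy gradients every step while the second receives a
# gradient only ~10% of the time (a "sparse" feature, like a rare word embedding).
rng = np.random.default_rng(0)
theta = np.zeros(2)
lr = 0.1                                  # one global learning rate for everything

for step in range(100):
    g = np.array([
        10.0 * (theta[0] - 3.0) + rng.normal(scale=5.0),   # large and noisy
        (theta[1] - 3.0) if rng.random() < 0.1 else 0.0,    # sparse
    ])
    theta -= lr * g                       # vanilla SGD: same step size for both

print(theta)   # theta[0] bounces around 3.0 due to noise; theta[1] is still well short of it
```

Adaptive methods address exactly this mismatch: they keep per-parameter statistics and scale each coordinate's step size individually.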

Modern Optimization Algorithms

• 📐 AdaGrad: adaptive learning rates per parameter
• 🎚️ RMSprop: fixes AdaGrad's diminishing learning rates
• 👑 Adam: combines momentum with adaptive learning rates (see the sketch after this list)
• AdamW: Adam with decoupled weight decay
• 🚀 NAdam: Nesterov momentum meets Adam
• 🐑 LAMB: layer-wise adaptive learning rates for large-batch training
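Since Adam and AdamW are the workhorses in practice, the sketch below shows a single update step in NumPy (an illustrative implementation of the standard published update rules, not official reference code). The only difference between the two variants is where the weight decay term enters.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One Adam / AdamW step on a parameter vector theta (t starts at 1).

    decoupled=False -> classic Adam with an L2 term folded into the gradient
    decoupled=True  -> AdamW: weight decay applied directly to the weights
    """
    if weight_decay and not decoupled:
        grad = grad + weight_decay * theta          # L2 term enters the adaptive statistics

    m = beta1 * m + (1 - beta1) * grad              # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2         # second moment (adaptive scaling)
    m_hat = m / (1 - beta1 ** t)                    # bias correction
    v_hat = v / (1 - beta2 ** t)

    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    if weight_decay and decoupled:
        theta = theta - lr * weight_decay * theta   # AdamW: decay outside the adaptive update
    return theta, m, v

# Hypothetical usage: one AdamW-style step on a 3-parameter vector.
theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
theta, m, v = adam_step(theta, grad=np.array([0.1, -0.2, 0.3]), m=m, v=v, t=1,
                        weight_decay=1e-2, decoupled=True)
```

In PyTorch these two variants correspond to torch.optim.Adam and torch.optim.AdamW. AdaGrad would replace the exponential moving average v with a running sum of squared gradients, which is why its effective learning rate keeps shrinking; RMSprop keeps the moving average but, in its basic form, drops the momentum term m.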

Optimizer Paths Visualization

[Figure: trajectories of SGD, Momentum, Adam, and RMSprop on a 2D loss surface]
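A figure like this can be regenerated in a few lines of PyTorch. The sketch below makes some assumptions about the setup (the Rosenbrock function as a stand-in loss surface, illustrative learning rates, 500 steps); it runs each optimizer from the same starting point and records its path, which can then be plotted over the loss contours.

```python
import torch

def rosenbrock(p):
    # Classic 2D test surface often used for optimizer-path figures
    # (an assumption here; the slide's actual surface may differ).
    x, y = p[0], p[1]
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def trace(make_opt, steps=500, start=(-1.5, 2.0)):
    p = torch.tensor(start, requires_grad=True)
    opt = make_opt([p])
    path = []
    for _ in range(steps):
        opt.zero_grad()
        rosenbrock(p).backward()
        opt.step()
        path.append(p.detach().clone())
    return torch.stack(path)

# Learning rates are illustrative and would need tuning for a fair comparison.
paths = {
    "SGD":      trace(lambda ps: torch.optim.SGD(ps, lr=1e-4)),
    "Momentum": trace(lambda ps: torch.optim.SGD(ps, lr=1e-4, momentum=0.9)),
    "RMSprop":  trace(lambda ps: torch.optim.RMSprop(ps, lr=1e-2)),
    "Adam":     trace(lambda ps: torch.optim.Adam(ps, lr=1e-2)),
}
for name, path in paths.items():
    print(name, path[-1].tolist())   # where each optimizer ends up after 500 steps
```

Plotting each recorded path over the surface's contours (e.g., with matplotlib) gives the kind of comparison the figure shows: how quickly and along what route each optimizer approaches the minimum.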
Prepared by Dr. Gorkem Kar