CS5720 - Week 3
Slide 51 of 60
Optimization Algorithms Beyond SGD
The Optimization Landscape
While SGD laid the foundation, modern deep learning requires more sophisticated optimization algorithms to train complex networks efficiently and reliably.
Evolution of Optimizers:
• 1950s-1980s: Basic gradient descent
• 1990s: Momentum methods
• 2000s: Adaptive learning rates
• 2010s: Adam and variants dominate
• 2020s: Optimizer combinations & scheduling
Key Insight:
Different optimizers excel at different tasks. Understanding their strengths helps you choose the right tool for your problem.
Why SGD Isn't Enough
📊 Fixed Learning Rate Dilemma: one learning rate doesn't fit all parameters or training stages
🎯 Sparse Gradients: some parameters rarely update (e.g., word embeddings)
📈 Noisy Gradients: mini-batch sampling introduces variance
⚖️ Different Parameter Scales: weights in different layers need different learning rates (see the sketch below)
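To make the fixed-learning-rate dilemma concrete, here is a minimal NumPy sketch on a hypothetical two-parameter quadratic (the loss, learning rates, and step count are illustrative assumptions, not values from the slide). A rate small enough for the steep parameter leaves the shallow one crawling, while a rate suited to the shallow parameter makes the steep one diverge.

```python
import numpy as np

# Toy loss f(w) = 0.5 * (1000 * w1^2 + w2^2): the two parameters live on very
# different scales (hypothetical example, not from the slide).
def grad(w):
    return np.array([1000.0 * w[0], 1.0 * w[1]])

def sgd(lr, steps=100):
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# lr tuned to the steep w1 direction: w1 converges, but w2 is still ~0.9
# after 100 steps.
print("lr=0.001:", sgd(0.001))

# lr suited to the shallow w2 direction: each step multiplies w1 by
# (1 - 1000 * lr), so w1 overshoots and blows up.
print("lr=0.5:  ", sgd(0.5))
```

Adaptive methods such as AdaGrad, RMSprop, and Adam address exactly this mismatch by giving each parameter its own effective step size.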
Modern Optimization Algorithms
📐 AdaGrad: adaptive learning rates per parameter
🎚️ RMSprop: fixes AdaGrad's diminishing learning rates
👑 Adam: best of momentum and adaptive rates (a minimal update sketch follows this list)
⚡ AdamW: Adam with decoupled weight decay
🚀 NAdam: Nesterov momentum meets Adam
🐑 LAMB: layer-wise adaptation for large batches
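As a reference point, here is a minimal NumPy sketch of the Adam update rule with the standard default hyperparameters; the toy quadratic it is run on is an assumption for illustration, not an example from the slide.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m) combined with a per-parameter adaptive rate (v)."""
    m = beta1 * m + (1 - beta1) * g           # 1st moment: running average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2      # 2nd moment: running average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero-initialised moments
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient is simply w (illustrative only).
w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 501):
    w, m, v = adam_step(w, w, m, v, t, lr=0.01)   # lr raised for this toy problem
print(w)   # ends up near the minimum at the origin

# AdamW changes only where weight decay enters: it is applied directly to the
# weights (w -= lr * weight_decay * w) rather than added to the gradient.
```

RMSprop keeps the same second-moment scaling but drops the first moment and bias correction; NAdam replaces the momentum term with its Nesterov variant.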
Optimizer Paths Visualization: trajectories of SGD, Momentum, RMSprop, and Adam on the same loss surface.
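To reproduce this kind of path comparison, the sketch below runs four optimizers from torch.optim on a small ill-conditioned quadratic and records their trajectories; the loss surface, learning rates, and starting point are assumptions chosen for illustration, not the settings behind the slide's figure.

```python
import torch

# Ill-conditioned quadratic bowl: a common setting for optimizer-path demos.
def loss_fn(w):
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)

def run(optimizer_cls, steps=100, **kwargs):
    w = torch.tensor([-2.0, 2.0], requires_grad=True)
    opt = optimizer_cls([w], **kwargs)
    path = [w.detach().clone()]
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(w).backward()
        opt.step()
        path.append(w.detach().clone())
    return torch.stack(path)        # (steps + 1, 2) trajectory, ready to plot

paths = {
    "SGD":      run(torch.optim.SGD,     lr=0.05),
    "Momentum": run(torch.optim.SGD,     lr=0.05, momentum=0.9),
    "RMSprop":  run(torch.optim.RMSprop, lr=0.05),
    "Adam":     run(torch.optim.Adam,    lr=0.05),
}
for name, p in paths.items():
    print(f"{name:9s} final point: {p[-1].tolist()}")
```

Plotting each trajectory over the contours of loss_fn typically shows momentum overshooting and curving back, while the adaptive methods make more even progress along both coordinates.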
Prepared by Dr. Gorkem Kar