CS5720 - Week 3
Slide 51 of 60

Optimization Algorithms Beyond SGD

The Optimization Landscape

While SGD laid the foundation, modern deep learning requires more sophisticated optimization algorithms to train complex networks efficiently and reliably.
Evolution of Optimizers:

1950s-1980s: Basic gradient descent
1990s: Momentum methods
2000s: Adaptive learning rates
2010s: Adam and variants dominate
2020s: Optimizer combinations & scheduling
Key Insight:
Different optimizers excel at different tasks. Understanding their strengths helps you choose the right tool for your problem.

Why SGD Isn't Enough

  • 📊 Fixed Learning Rate Dilemma: one learning rate doesn't fit all parameters or training stages (see the SGD sketch after this list)
  • 🎯 Sparse Gradients: some parameters are rarely updated (e.g., word embeddings)
  • 📈 Noisy Gradients: mini-batch sampling introduces variance into each gradient estimate
  • ⚖️ Different Parameter Scales: weights in different layers need different effective learning rates
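To make the fixed-learning-rate dilemma concrete, here is a minimal sketch of the vanilla SGD update (plain NumPy on a made-up two-parameter toy problem, not code from the lecture). One global step size is applied to every parameter, so a parameter with large, noisy gradients and a parameter with sparse gradients are treated identically.

```python
import numpy as np

# Toy problem (an assumption for illustration): both parameters want to reach 3.0,
# but the first gets large, noisy gradients every step while the second receives a
# gradient only ~10% of the time (a "sparse" feature, like a rare word embedding).
rng = np.random.default_rng(0)
theta = np.zeros(2)
lr = 0.1                                  # one global learning rate for everything

for step in range(100):
    g = np.array([
        10.0 * (theta[0] - 3.0) + rng.normal(scale=5.0),   # large and noisy
        (theta[1] - 3.0) if rng.random() < 0.1 else 0.0,    # sparse
    ])
    theta -= lr * g                       # vanilla SGD: same step size for both

print(theta)   # theta[0] bounces around 3.0 due to noise; theta[1] is still well short of it
```

Adaptive methods address exactly this mismatch: they keep per-parameter statistics and scale each coordinate's step size individually.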

Modern Optimization Algorithms

• 📐 AdaGrad: adaptive learning rates per parameter
• 🎚️ RMSprop: fixes AdaGrad's diminishing learning rates
• 👑 Adam: combines momentum with adaptive learning rates (see the sketch after this list)
• AdamW: Adam with decoupled weight decay
• 🚀 NAdam: Nesterov momentum meets Adam
• 🐑 LAMB: layer-wise adaptive learning rates for large-batch training
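Since Adam and AdamW are the workhorses in practice, the sketch below shows a single update step in NumPy (an illustrative implementation of the standard published update rules, not official reference code). The only difference between the two variants is where the weight decay term enters.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One Adam / AdamW step on a parameter vector theta (t starts at 1).

    decoupled=False -> classic Adam with an L2 term folded into the gradient
    decoupled=True  -> AdamW: weight decay applied directly to the weights
    """
    if weight_decay and not decoupled:
        grad = grad + weight_decay * theta          # L2 term enters the adaptive statistics

    m = beta1 * m + (1 - beta1) * grad              # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2         # second moment (adaptive scaling)
    m_hat = m / (1 - beta1 ** t)                    # bias correction
    v_hat = v / (1 - beta2 ** t)

    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    if weight_decay and decoupled:
        theta = theta - lr * weight_decay * theta   # AdamW: decay outside the adaptive update
    return theta, m, v

# Hypothetical usage: one AdamW-style step on a 3-parameter vector.
theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
theta, m, v = adam_step(theta, grad=np.array([0.1, -0.2, 0.3]), m=m, v=v, t=1,
                        weight_decay=1e-2, decoupled=True)
```

In PyTorch these two variants correspond to torch.optim.Adam and torch.optim.AdamW. AdaGrad would replace the exponential moving average v with a running sum of squared gradients, which is why its effective learning rate keeps shrinking; RMSprop keeps the moving average but, in its basic form, drops the momentum term m.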

Optimizer Paths Visualization

[Figure: trajectories of SGD, Momentum, Adam, and RMSprop on a 2D loss surface]
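A figure like this can be regenerated in a few lines of PyTorch. The sketch below makes some assumptions about the setup (the Rosenbrock function as a stand-in loss surface, illustrative learning rates, 500 steps); it runs each optimizer from the same starting point and records its path, which can then be plotted over the loss contours.

```python
import torch

def rosenbrock(p):
    # Classic 2D test surface often used for optimizer-path figures
    # (an assumption here; the slide's actual surface may differ).
    x, y = p[0], p[1]
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def trace(make_opt, steps=500, start=(-1.5, 2.0)):
    p = torch.tensor(start, requires_grad=True)
    opt = make_opt([p])
    path = []
    for _ in range(steps):
        opt.zero_grad()
        rosenbrock(p).backward()
        opt.step()
        path.append(p.detach().clone())
    return torch.stack(path)

# Learning rates are illustrative and would need tuning for a fair comparison.
paths = {
    "SGD":      trace(lambda ps: torch.optim.SGD(ps, lr=1e-4)),
    "Momentum": trace(lambda ps: torch.optim.SGD(ps, lr=1e-4, momentum=0.9)),
    "RMSprop":  trace(lambda ps: torch.optim.RMSprop(ps, lr=1e-2)),
    "Adam":     trace(lambda ps: torch.optim.Adam(ps, lr=1e-2)),
}
for name, path in paths.items():
    print(name, path[-1].tolist())   # where each optimizer ends up after 500 steps
```

Plotting each recorded path over the surface's contours (e.g., with matplotlib) gives the kind of comparison the figure shows: how quickly and along what route each optimizer approaches the minimum.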
Prepared by Dr. Gorkem Kar