CS5720 - Week 3

Adam Optimizer: Adaptive Learning

The Adam Algorithm

Adam (Adaptive Moment Estimation) combines momentum and RMSprop: it maintains a momentum term and an adaptive learning rate for each parameter.
First moment: mₜ = β₁·mₜ₋₁ + (1 − β₁)·gₜ
Second moment: vₜ = β₂·vₜ₋₁ + (1 − β₂)·gₜ²
Bias correction: m̂ₜ = mₜ/(1 − β₁ᵗ), v̂ₜ = vₜ/(1 − β₂ᵗ)
Update rule: θₜ₊₁ = θₜ − η·m̂ₜ/(√v̂ₜ + ε)
Default hyperparameters (work well for most problems):
• Learning rate (η): 0.001
• β₁ (momentum): 0.9
• β₂ (RMSprop): 0.999
• ε (stability): 10⁻⁸
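
The four equations above translate directly into code. Below is a minimal NumPy sketch of a single Adam step using the default hyperparameters; the function name adam_step and its signature are illustrative, not taken from any particular library.

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponential moving average of gradients
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponential moving average of squared gradients
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: compensates for m and v starting at zero (t starts at 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Update: per-parameter adaptive step, scaled by the corrected momentum
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v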

Key Components

📊 First Moment (Momentum)
Exponential moving average of gradients. Provides velocity and helps escape saddle points.
📈 Second Moment (Adaptive LR)
Exponential moving average of squared gradients. Adapts learning rate per parameter.
⚖️ Bias Correction
Corrects for initialization bias in early training steps. Critical for stability.
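To see why this matters with concrete numbers: with β₁ = 0.9 and m₀ = 0, the first step gives m₁ = 0.1·g₁, an estimate ten times smaller than the true gradient. Dividing by (1 − β₁¹) = 0.1 restores m̂₁ = g₁, and as t grows, (1 − β₁ᵗ) → 1, so the correction fades away on its own.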

Adam in Action

[Interactive demo: sliders for η (0.001), β₁ (0.9), β₂ (0.999), and step t, with live readouts of the effective learning rate, momentum strength, and gradient adaptation.]
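
As a small stand-in for the demo, the toy loop below reuses the adam_step sketch from earlier to minimize f(θ) = (θ − 3)², whose gradient is 2(θ − 3); the setup is illustrative only.

theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 10001):
    grad = 2 * (theta - 3)              # gradient of (theta - 3)^2
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)                            # theta approaches 3.0

Early on, m̂ₜ/√v̂ₜ is close to ±1, so each step moves roughly η = 0.001, which is why Adam's initial progress is nearly independent of the raw gradient scale.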
Prepared by Dr. Gorkem Kar