CS5720 - Week 3

Momentum: Adding Memory to SGD

The Momentum Method

Momentum accelerates SGD by accumulating a velocity vector in directions of persistent reduction in the loss, like a ball rolling down a hill.
v_t = β v_{t-1} + η ∇L(θ_t)
θ_{t+1} = θ_t - v_t

where β ∈ [0,1] is the momentum coefficient and η is the learning rate (see the sketch after the list below)
  • 🚀 Accelerates Convergence
    Builds up speed in consistent gradient directions
  • 🎯 Reduces Oscillations
    Dampens zigzag motion in narrow valleys
  • 🏔️ Escapes Local Minima
    Momentum can carry the optimizer past shallow local minima
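
A minimal sketch of the update rule above, assuming NumPy arrays for the parameters and gradient; the name momentum_step and the default values are illustrative, not from any library:

```python
import numpy as np

def momentum_step(theta, v, grad, eta=0.01, beta=0.9):
    # v_t = β v_{t-1} + η ∇L(θ_t): accumulate a decaying history of gradients
    v = beta * v + eta * grad
    # θ_{t+1} = θ_t - v_t: step against the accumulated velocity
    theta = theta - v
    return theta, v
```

In a training loop, v starts at zeros and is threaded through successive calls; setting beta = 0 recovers vanilla SGD.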

Physical Intuition

🏀 Ball Rolling Down a Hill
Think of optimization as a ball rolling down a loss surface. Momentum gives the ball inertia: it remembers its previous motion and tends to keep moving in the same direction.
Key Insight: Without momentum, the optimizer makes decisions based only on the current gradient. With momentum, it considers the accumulated history of gradients, leading to smoother and faster convergence.
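
Unrolling the recurrence (with v_0 = 0) makes that memory explicit: each past gradient enters the velocity with a weight that decays geometrically in β,

v_t = η Σ_{k=0}^{t-1} β^k ∇L(θ_{t-k})

For a constant gradient g, the velocity approaches η g / (1 - β), so β = 0.9 can amplify steps in consistent directions by up to a factor of 10.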

SGD vs Momentum Comparison

[Interactive demo: side-by-side trajectories of Vanilla SGD and SGD with Momentum; learning rate 0.01, momentum 0.9]
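
The live demo itself doesn't survive in text, but a minimal numeric stand-in conveys the same comparison. The narrow-valley quadratic, starting point, and step count below are invented for illustration; the hyperparameters match the slide (η = 0.01, β = 0.9):

```python
import numpy as np

def run(beta, eta=0.01, steps=500):
    # Minimize the narrow-valley quadratic L(x, y) = 0.5*(x**2 + 50*y**2),
    # starting far out along the flat x-direction.
    theta = np.array([10.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        grad = np.array([theta[0], 50.0 * theta[1]])  # ∇L = (x, 50y)
        v = beta * v + eta * grad
        theta = theta - v
    return np.linalg.norm(theta)  # distance from the minimum at (0, 0)

print("Vanilla SGD (beta=0.0):", run(beta=0.0))  # ~6.6e-2: crawls along the valley
print("Momentum    (beta=0.9):", run(beta=0.9))  # orders of magnitude closer
```

Setting β = 0 reduces the update to plain SGD, so the two runs differ only in the momentum term.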
Prepared by Dr. Gorkem Kar