CS5720 - Week 3

Momentum: Adding Memory to SGD

The Momentum Method

Momentum accelerates SGD by accumulating a velocity vector in directions of persistent reduction in the loss, like a ball rolling down a hill.
v_t = β v_{t-1} + η ∇L(θ_t)
θ_{t+1} = θ_t - v_t

where β ∈ [0,1] is the momentum coefficient and η is the learning rate (see the sketch after the list below)
  • 🚀 Accelerates Convergence
    Builds up speed in consistent gradient directions
  • 🎯 Reduces Oscillations
    Dampens zigzag motion in narrow valleys
  • 🏔️ Escapes Local Minima
    Momentum can carry the optimizer past shallow local minima
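
A minimal sketch of the update rule above, assuming NumPy arrays for the parameters and gradient; the name momentum_step and the default values are illustrative, not from any library:

```python
import numpy as np

def momentum_step(theta, v, grad, eta=0.01, beta=0.9):
    # v_t = β v_{t-1} + η ∇L(θ_t): accumulate a decaying history of gradients
    v = beta * v + eta * grad
    # θ_{t+1} = θ_t - v_t: step against the accumulated velocity
    theta = theta - v
    return theta, v
```

In a training loop, v starts at zeros and is threaded through successive calls; setting beta = 0 recovers vanilla SGD.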

Physical Intuition

🏀 Ball Rolling Down a Hill
Think of optimization as a ball rolling down a loss surface. Momentum gives the ball inertia: it remembers its previous motion and tends to keep moving in the same direction.
Key Insight: Without momentum, the optimizer makes decisions based only on the current gradient. With momentum, it considers the accumulated history of gradients, leading to smoother and faster convergence.
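
Unrolling the recurrence (with v_0 = 0) makes that memory explicit: each past gradient enters the velocity with a weight that decays geometrically in β,

v_t = η Σ_{k=0}^{t-1} β^k ∇L(θ_{t-k})

For a constant gradient g, the velocity approaches η g / (1 - β), so β = 0.9 can amplify steps in consistent directions by up to a factor of 10.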

SGD vs Momentum Comparison

[Interactive demo: side-by-side trajectories of Vanilla SGD and SGD with Momentum; learning rate 0.01, momentum 0.9]
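
The live demo itself doesn't survive in text, but a minimal numeric stand-in conveys the same comparison. The narrow-valley quadratic, starting point, and step count below are invented for illustration; the hyperparameters match the slide (η = 0.01, β = 0.9):

```python
import numpy as np

def run(beta, eta=0.01, steps=500):
    # Minimize the narrow-valley quadratic L(x, y) = 0.5*(x**2 + 50*y**2),
    # starting far out along the flat x-direction.
    theta = np.array([10.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        grad = np.array([theta[0], 50.0 * theta[1]])  # ∇L = (x, 50y)
        v = beta * v + eta * grad
        theta = theta - v
    return np.linalg.norm(theta)  # distance from the minimum at (0, 0)

print("Vanilla SGD (beta=0.0):", run(beta=0.0))  # ~6.6e-2: crawls along the valley
print("Momentum    (beta=0.9):", run(beta=0.9))  # orders of magnitude closer
```

Setting β = 0 reduces the update to plain SGD, so the two runs differ only in the momentum term.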
Prepared by Dr. Gorkem Kar