The Momentum Method
Momentum accelerates SGD by accumulating a velocity vector in directions of persistent reduction in the loss, much like a ball rolling down a hill gathers speed.
$$v_t = \beta v_{t-1} + \eta \nabla L(\theta_t)$$
$$\theta_{t+1} = \theta_t - v_t$$
where $\beta \in [0, 1]$ is the momentum coefficient (typically around 0.9) and $\eta$ is the learning rate.
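A minimal sketch of this update rule, assuming NumPy and a toy quadratic loss chosen purely for illustration (the function and variable names here are not from any particular library):

```python
import numpy as np

def momentum_step(theta, v, grad, beta=0.9, eta=0.01):
    """One momentum update: v_t = beta * v_{t-1} + eta * grad; theta_{t+1} = theta_t - v_t."""
    v = beta * v + eta * grad
    theta = theta - v
    return theta, v

# Toy quadratic loss L(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)
for _ in range(100):
    theta, v = momentum_step(theta, v, grad=theta)
print(theta)  # approaches the minimum at the origin
```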
- 🚀 Accelerates Convergence: builds up speed in consistent gradient directions.
- 🎯 Reduces Oscillations: dampens zigzag motion in narrow valleys (see the demo after this list).
- 🏔️ Escapes Local Minima: momentum can carry the optimizer past shallow local minima.
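The first two claims can be checked with a small experiment; the narrow-valley loss below and all its constants are assumptions chosen for illustration. At this learning rate, plain SGD zigzags across the steep direction while crawling along the shallow one, whereas momentum reaches a far lower loss with the same learning rate and step budget.

```python
import numpy as np

# Narrow-valley loss: L(x, y) = 0.5 * (x**2 + 100 * y**2).
# Steep in y (the valley walls), shallow in x (the valley floor).
def grad(theta):
    return np.array([theta[0], 100.0 * theta[1]])

def optimize(beta, eta=0.019, steps=150):
    theta = np.array([10.0, 1.0])
    v = np.zeros_like(theta)
    y_path = []
    for _ in range(steps):
        v = beta * v + eta * grad(theta)
        theta = theta - v
        y_path.append(theta[1])
    final_loss = 0.5 * (theta[0] ** 2 + 100.0 * theta[1] ** 2)
    return final_loss, y_path

sgd_loss, sgd_y = optimize(beta=0.0)
mom_loss, _ = optimize(beta=0.9)

# Plain SGD's y-coordinate flips sign every step here
# (its per-step multiplier is 1 - eta * 100 = -0.9): the zigzag.
print("SGD first y values: ", np.round(sgd_y[:4], 3))
print(f"SGD final loss:      {sgd_loss:.4f}")
print(f"momentum final loss: {mom_loss:.6f}")
```

With identical hyperparameters, momentum accumulates velocity along the shallow x direction and ends orders of magnitude closer to the minimum.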
Physical Intuition
🏀 Ball Rolling Down a Hill
Think of optimization as a ball rolling down the loss surface. Momentum gives the ball inertia: it remembers its previous motion and tends to keep moving in the same direction.
Key Insight: Without momentum, the optimizer makes decisions based only on the current gradient. With momentum, it considers the accumulated history of gradients, leading to smoother and faster convergence.
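This accumulated history can be made precise by unrolling the velocity recursion (assuming $v_0 = 0$):

$$v_t = \eta \sum_{k=0}^{t-1} \beta^{k}\, \nabla L(\theta_{t-k})$$

Each past gradient is weighted by $\beta^k$, so the velocity is an exponentially weighted sum of the gradient history. If the gradient is a constant $g$, the velocity approaches the terminal value $\eta g / (1 - \beta)$, a $10\times$ larger effective step than plain SGD for $\beta = 0.9$; if the gradient alternates in sign, consecutive terms partially cancel (the geometric sum becomes $\eta g / (1 + \beta)$), which is exactly the damping of zigzag motion described above.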