CS5720 - Week 2
Slide 38 of 40

Gradient Descent Variants

📚
Batch Gradient Descent
Uses the entire dataset to compute gradients for each update step (a code sketch follows this block).
  • Data per update: All training examples
  • Update frequency: Once per epoch
  • Memory usage: High
  • Convergence: Smooth but slow
✅ Pros
Stable convergence, exact gradient of the training loss
❌ Cons
Slow updates, high memory
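
A minimal NumPy sketch of full-batch gradient descent, assuming a mean-squared-error linear model (the function name, data shapes, and hyperparameters are illustrative choices, not from the slide):

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.01, epochs=100):
    """Full-batch gradient descent for MSE linear regression."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        # One gradient computed from ALL n examples -> exactly one update per epoch
        grad = (2.0 / n) * X.T @ (X @ w - y)
        w -= lr * grad
    return w
```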
🎲
Stochastic Gradient Descent
Uses one example at a time to compute gradients and update weights (sketched in code after this block).
  • Data per update: Single example
  • Update frequency: After each example
  • Memory usage: Very low
  • Convergence: Noisy but fast
✅ Pros
Fast updates, low memory; gradient noise can help escape shallow local minima
❌ Cons
Noisy convergence, unstable
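
A comparable sketch of stochastic gradient descent under the same assumed MSE linear model; the per-example step is what produces the noisy but fast updates described above:

```python
import numpy as np

def stochastic_gradient_descent(X, y, lr=0.01, epochs=100, seed=0):
    """SGD for MSE linear regression: one example per weight update."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):            # visit examples in random order
            xi, yi = X[i], y[i]
            grad = 2.0 * xi * (xi @ w - yi)     # gradient from a single example
            w -= lr * grad                      # n updates per epoch
    return w
```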
⚖️
Mini-Batch Gradient Descent
Uses small batches (typically 32-256 examples) for each update; see the code sketch after this block.
  • Data per update: 32-256 examples
  • Update frequency: Multiple per epoch
  • Memory usage: Moderate
  • Convergence: Balanced
✅ Pros
Balances speed and stability; batched computation is GPU efficient
❌ Cons
Batch size tuning needed
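
A mini-batch sketch under the same assumptions; the batch_size parameter is the tuning knob noted in the cons above:

```python
import numpy as np

def minibatch_gradient_descent(X, y, lr=0.01, epochs=100, batch_size=64, seed=0):
    """Mini-batch gradient descent for MSE linear regression."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            Xb, yb = X[b], y[b]
            # Gradient averaged over one small batch -> ceil(n / batch_size) updates per epoch
            grad = (2.0 / len(b)) * Xb.T @ (Xb @ w - yb)
            w -= lr * grad
    return w
```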

Interactive Gradient Descent Comparison

[Interactive demo: loss-landscape navigation with per-variant performance metrics (updates per epoch, noise level, memory usage, convergence behavior); a rough offline comparison follows below.]
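
For readers without the live demo, a rough offline comparison using the three sketch functions above; the synthetic data, learning rate, and epoch count are arbitrary illustrative choices:

```python
import numpy as np

# Synthetic linear-regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

for name, fn, kwargs, updates in [
    ("Batch",      batch_gradient_descent,      {},                 1),
    ("Stochastic", stochastic_gradient_descent, {},                 1000),
    ("Mini-batch", minibatch_gradient_descent,  {"batch_size": 64}, int(np.ceil(1000 / 64))),
]:
    w = fn(X, y, lr=0.01, epochs=50, **kwargs)
    mse = np.mean((X @ w - y) ** 2)
    print(f"{name:11s} updates/epoch = {updates:4d}, final MSE = {mse:.4f}")
```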
Prepared by Dr. Gorkem Kar