CS5720 - Week 2
Slide 38 of 40
Gradient Descent Variants
📚 Batch Gradient Descent
Uses the entire dataset to compute the gradient for each update step (minimal sketch below).
Data per update: all training examples
Update frequency: once per epoch
Memory usage: high
Convergence: smooth but slow
✅ Pros: stable convergence; each update follows the exact gradient of the full training loss
❌ Cons: slow updates, high memory cost
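As a rough illustration, here is a minimal NumPy sketch of batch gradient descent on a toy linear-regression problem; the data, learning rate, and epoch count are invented for demonstration and are not part of the course material.

```python
import numpy as np

# Hypothetical toy data: 200 examples, 3 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

w = np.zeros(3)   # model weights
lr = 0.1          # assumed learning rate

for epoch in range(100):
    # Batch GD: the gradient is computed over ALL training examples,
    # so there is exactly one parameter update per epoch.
    grad = 2.0 / len(X) * X.T @ (X @ w - y)   # gradient of mean squared error
    w -= lr * grad

print("learned weights:", w)
```

Because every update touches the whole dataset, each step is expensive but moves in the true descent direction of the training loss, which is why convergence is smooth but slow.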
🎲 Stochastic Gradient Descent
Uses one example at a time to compute the gradient and update the weights (minimal sketch below).
Data per update: a single example
Update frequency: after each example
Memory usage: very low
Convergence: noisy but fast
✅ Pros: fast updates, low memory, noise can help escape local minima
❌ Cons: noisy, unstable convergence
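For contrast, a minimal sketch of stochastic gradient descent on the same kind of toy problem (again with made-up data and hyperparameters); the only change is that each gradient comes from a single, randomly ordered example.

```python
import numpy as np

# Same hypothetical toy data as in the batch GD sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

w = np.zeros(3)
lr = 0.01   # smaller assumed step size to tame the noisy updates

for epoch in range(20):
    # Shuffle so each epoch visits the examples in a different order.
    for i in rng.permutation(len(X)):
        x_i, y_i = X[i], y[i]
        # SGD: the gradient uses one example, so the weights are updated
        # once per example (cheap but noisy).
        grad = 2.0 * x_i * (x_i @ w - y_i)
        w -= lr * grad

print("learned weights:", w)
```

The per-example noise is what makes the loss curve jagged, but it also lets the iterate jump out of shallow local minima.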
⚖️ Mini-Batch Gradient Descent
Uses small batches (typically 32-256 examples) for each update (minimal sketch below).
Data per update: 32-256 examples
Update frequency: multiple updates per epoch
Memory usage: moderate
Convergence: balanced
✅ Pros: combines the stability of batch GD with the speed of SGD, and maps well onto GPU parallelism
❌ Cons: batch size is an extra hyperparameter to tune
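Finally, a minimal sketch of mini-batch gradient descent under the same assumptions; the batch size of 32 here is an illustrative choice, not a prescribed value.

```python
import numpy as np

# Same hypothetical toy data as in the previous sketches.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

w = np.zeros(3)
lr = 0.05
batch_size = 32   # typical size; in practice tuned per problem and hardware

for epoch in range(50):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        X_b, y_b = X[idx], y[idx]
        # Mini-batch GD: average the gradient over a small batch, giving
        # several updates per epoch with far less noise than pure SGD.
        grad = 2.0 / len(X_b) * X_b.T @ (X_b @ w - y_b)
        w -= lr * grad

print("learned weights:", w)
```

Averaging over a batch reduces gradient variance roughly in proportion to the batch size, and the batched matrix operations are exactly the shape of work GPUs execute efficiently.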
Interactive Gradient Descent Comparison
[Interactive demo: buttons for 📚 Batch GD, 🎲 Stochastic GD, and ⚖️ Mini-Batch GD animate how each variant navigates the loss landscape and report its updates per epoch, noise level, memory usage, and convergence behavior.]
Prepared by Dr. Gorkem Kar