CS5720 - Week 6

Vanishing Gradients in RNNs

The Vanishing Gradient Problem

Vanishing gradients occur when gradients shrink exponentially as they propagate backward through time, making it extremely difficult to learn long-term dependencies.
Why It Happens:

Chain-rule multiplication of many per-step Jacobians across time
Small activation-function derivatives (at most 0.25 for sigmoid, at most 1 for tanh)
Recurrent weight matrix whose largest eigenvalue (spectral radius) is below 1
These factors compound, so the gradient decays exponentially over long sequences
📉 The Math:
For a 30-step sequence, the gradient reaching step 1 is a product of ~30 per-step factors, each at most 0.25 (sigmoid): ∂L/∂h₁ = ∂L/∂h₃₀ × ∏ₜ (∂hₜ₊₁/∂hₜ) ≈ 1.0 × 0.25³⁰ ≈ 10⁻¹⁸
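A tiny numeric sketch of this decay (the 0.25 factor is the sigmoid-derivative upper bound from above; the loop simply repeats the chain-rule multiplication):

```python
# Each backward step through a sigmoid multiplies the gradient by at most 0.25,
# so after 30 steps the gradient is roughly 0.25**30 ≈ 1e-18.
factor = 0.25       # upper bound on the sigmoid derivative
grad = 1.0          # gradient magnitude at the final step, dL/dh_30
for _ in range(30):
    grad *= factor  # one chain-rule factor per time step
print(f"Gradient reaching step 1: {grad:.2e}")  # ≈ 8.67e-19
```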

Solutions & Approaches

  • 🔧 LSTM & GRU: gated architectures that keep gradients flowing across many steps (see the sketch after this list)
  • ✂️ Gradient Clipping: caps the gradient norm to stop gradients from exploding (it does not cure vanishing)
  • 🎯 Better Initialization: Xavier/He initialization keeps gradient magnitudes at a stable scale
  • 🔀 Residual Connections: skip connections give gradients a direct path backward
  • 👁️ Attention Mechanisms: direct connections to all previous states
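A minimal PyTorch sketch (PyTorch, the layer sizes, and the hyperparameters are assumptions for illustration, not from the slide) combining three of the remedies above: an LSTM, Xavier initialization, and gradient-norm clipping.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

# Xavier initialization for input-to-hidden and hidden-to-hidden weights
for name, param in model.named_parameters():
    if "weight" in name:
        nn.init.xavier_uniform_(param)
    else:                          # biases
        nn.init.zeros_(param)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 50, 16)         # toy input: (batch, time, features)
target = torch.randn(8, 50, 32)

output, _ = model(x)
loss = nn.functional.mse_loss(output, target)
loss.backward()

# Clip the global gradient norm to 1.0 (guards against exploding gradients)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```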

Gradient Magnitude Visualization

Illustrative gradient magnitudes when backpropagating from T=8 to T=1:

Time step (backward)         T=8    T=7    T=6    T=5    T=4    T=3    T=2    T=1
Standard RNN (decay)         1.0    0.7    0.5    0.3    0.2    0.1    0.05   0.01
LSTM (gates preserve flow)   1.0    0.9    0.85   0.8    0.75   0.7    0.65   0.6
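The pattern in this table can be checked empirically. The sketch below (a PyTorch illustration; it is not the source of the numbers above) backpropagates from only the final hidden state and prints the gradient norm reaching each input step for a vanilla RNN versus an LSTM. Exact values depend on the random initialization, so treat the output as qualitative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, D, H = 8, 4, 8   # time steps, input size, hidden size

for name, cell in [("RNN ", nn.RNN(D, H, batch_first=True)),
                   ("LSTM", nn.LSTM(D, H, batch_first=True))]:
    x = torch.randn(1, T, D, requires_grad=True)
    out, _ = cell(x)                        # hidden state at every step: (1, T, H)
    out[:, -1].sum().backward()             # loss depends only on the final step
    norms = x.grad.norm(dim=-1).squeeze()   # gradient norm reaching each input step
    print(name, [f"{n:.4f}" for n in norms.tolist()])
```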
Impact on Learning Ability
Standard RNN
• Can learn 3-5 step dependencies
• Forgets early information
• Training is slow/unstable
LSTM/GRU
• Can learn 100+ step dependencies
• Maintains long-term memory
• Stable, efficient training
Prepared by Dr. Gorkem Kar