CS5720 - Week 6

Vanishing Gradients in RNNs

The Vanishing Gradient Problem

Vanishing gradients occur when gradients shrink exponentially as they propagate backward through time, making it extremely difficult to learn long-term dependencies.
Why It Happens:

Chain-rule multiplication of many per-step Jacobians across time
Small activation-function derivatives (at most 0.25 for sigmoid, at most 1 for tanh)
Recurrent weight matrix whose largest eigenvalue (spectral radius) is below 1
These factors compound, so the gradient decays exponentially over long sequences
📉 The Math:
For a 30-step sequence, the gradient reaching step 1 is a product of ~30 per-step factors, each at most 0.25 (sigmoid): ∂L/∂h₁ = ∂L/∂h₃₀ × ∏ₜ (∂hₜ₊₁/∂hₜ) ≈ 1.0 × 0.25³⁰ ≈ 10⁻¹⁸
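A tiny numeric sketch of this decay (the 0.25 factor is the sigmoid-derivative upper bound from above; the loop simply repeats the chain-rule multiplication):

```python
# Each backward step through a sigmoid multiplies the gradient by at most 0.25,
# so after 30 steps the gradient is roughly 0.25**30 ≈ 1e-18.
factor = 0.25       # upper bound on the sigmoid derivative
grad = 1.0          # gradient magnitude at the final step, dL/dh_30
for _ in range(30):
    grad *= factor  # one chain-rule factor per time step
print(f"Gradient reaching step 1: {grad:.2e}")  # ≈ 8.67e-19
```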

Solutions & Approaches

  • 🔧 LSTM & GRU: gated architectures that keep gradients flowing across many steps (see the sketch after this list)
  • ✂️ Gradient Clipping: caps the gradient norm to stop gradients from exploding (it does not cure vanishing)
  • 🎯 Better Initialization: Xavier/He initialization keeps gradient magnitudes at a stable scale
  • 🔀 Residual Connections: skip connections give gradients a direct path backward
  • 👁️ Attention Mechanisms: direct connections to all previous states
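A minimal PyTorch sketch (PyTorch, the layer sizes, and the hyperparameters are assumptions for illustration, not from the slide) combining three of the remedies above: an LSTM, Xavier initialization, and gradient-norm clipping.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

# Xavier initialization for input-to-hidden and hidden-to-hidden weights
for name, param in model.named_parameters():
    if "weight" in name:
        nn.init.xavier_uniform_(param)
    else:                          # biases
        nn.init.zeros_(param)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 50, 16)         # toy input: (batch, time, features)
target = torch.randn(8, 50, 32)

output, _ = model(x)
loss = nn.functional.mse_loss(output, target)
loss.backward()

# Clip the global gradient norm to 1.0 (guards against exploding gradients)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```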

Gradient Magnitude Visualization

Illustrative gradient magnitudes when backpropagating from T=8 to T=1:

Time step (backward)         T=8    T=7    T=6    T=5    T=4    T=3    T=2    T=1
Standard RNN (decay)         1.0    0.7    0.5    0.3    0.2    0.1    0.05   0.01
LSTM (gates preserve flow)   1.0    0.9    0.85   0.8    0.75   0.7    0.65   0.6
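The pattern in this table can be checked empirically. The sketch below (a PyTorch illustration; it is not the source of the numbers above) backpropagates from only the final hidden state and prints the gradient norm reaching each input step for a vanilla RNN versus an LSTM. Exact values depend on the random initialization, so treat the output as qualitative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, D, H = 8, 4, 8   # time steps, input size, hidden size

for name, cell in [("RNN ", nn.RNN(D, H, batch_first=True)),
                   ("LSTM", nn.LSTM(D, H, batch_first=True))]:
    x = torch.randn(1, T, D, requires_grad=True)
    out, _ = cell(x)                        # hidden state at every step: (1, T, H)
    out[:, -1].sum().backward()             # loss depends only on the final step
    norms = x.grad.norm(dim=-1).squeeze()   # gradient norm reaching each input step
    print(name, [f"{n:.4f}" for n in norms.tolist()])
```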
Impact on Learning Ability
Standard RNN
• Can learn 3-5 step dependencies
• Forgets early information
• Training is slow/unstable
LSTM/GRU
• Can learn 100+ step dependencies
• Maintains long-term memory
• Stable, efficient training
Prepared by Dr. Gorkem Kar