The Silent Killer of Deep Networks

⚠️ Critical Problem
As gradients backpropagate through many layers, they become exponentially smaller, eventually becoming too small to cause any meaningful weight updates. This effectively stops learning in early layers!
∂L/∂W₁ = ∂L/∂hₙ × ∂hₙ/∂hₙ₋₁ × ⋯ × ∂h₂/∂h₁ × ∂h₁/∂W₁
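A quick way to see the effect is to multiply the chain-rule factors directly. A minimal NumPy sketch, where the layer count, the random pre-activations, and the unit-weight assumption are all hypothetical choices for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy setup (assumed): scalar activations and unit weights, so the gradient
# reaching layer k is just the product of the sigmoid derivatives above it.
rng = np.random.default_rng(0)
n_layers = 10
pre_acts = rng.normal(size=n_layers)        # hypothetical pre-activation values
s = sigmoid(pre_acts)
local_derivs = s * (1.0 - s)                # each factor is at most 0.25

grad = 1.0                                  # dL/dh_n at the output
for layer in range(n_layers, 0, -1):
    grad *= local_derivs[layer - 1]         # one chain-rule factor per layer
    print(f"gradient magnitude at layer {layer:2d}: {grad:.2e}")
```

Ten factors of at most 0.25 already shrink the gradient by a factor of a million or more.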
The Chain Rule Culprit:
Each layer multiplies the gradient by a factor that is typically less than 1:
  • Sigmoid derivatives: at most 0.25
  • Tanh derivatives: at most 1.0, and well below 1 away from zero
  • Many small multiplications → vanishing gradient (illustrated in the sketch below)
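These derivative bounds are easy to confirm numerically; a small sketch (the grid range is arbitrary):

```python
import numpy as np

# Check the derivative bounds quoted above:
#   sigmoid'(x) = s(1 - s)      peaks at x = 0 with value 0.25
#   tanh'(x)    = 1 - tanh(x)²  peaks at x = 0 with value 1.0
x = np.linspace(-10.0, 10.0, 10_001)
s = 1.0 / (1.0 + np.exp(-x))
print("max sigmoid derivative:", (s * (1.0 - s)).max())        # -> 0.25
print("max tanh derivative:   ", (1.0 - np.tanh(x)**2).max())  # -> 1.0
```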

Gradient Flow Visualization

Layer      Role     Gradient magnitude
Layer 10   Output   1.000
Layer 7    Hidden   0.125
Layer 4    Hidden   0.016
Layer 1    Input    0.002

The gradient roughly halves at every layer on the way back, so layer 1 receives only about 0.2% of the output-layer signal.
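The same decay shows up in a real backward pass. A minimal PyTorch sketch, where the 10-layer sigmoid MLP, the width of 32, and the squared-output loss are assumptions chosen to mirror the table above:

```python
import torch
import torch.nn as nn

# Assumed architecture: a 10-layer sigmoid MLP, so we can watch the
# weight-gradient norm shrink as the gradient flows back toward layer 1.
torch.manual_seed(0)
blocks = []
for _ in range(10):
    blocks += [nn.Linear(32, 32), nn.Sigmoid()]
net = nn.Sequential(*blocks)

x = torch.randn(8, 32)
loss = net(x).pow(2).mean()     # arbitrary scalar loss for illustration
loss.backward()

for i, module in enumerate(net):
    if isinstance(module, nn.Linear):
        layer = i // 2 + 1      # linear layers sit at even indices
        print(f"layer {layer:2d} weight-grad norm: "
              f"{module.weight.grad.norm().item():.2e}")
```

Layers near the output keep usable gradients, while the norms at the earliest layers are typically orders of magnitude smaller, matching the pattern in the table.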