The Silent Killer of Deep Networks

⚠️ Critical Problem
As gradients backpropagate through many layers, they become exponentially smaller, eventually becoming too small to cause any meaningful weight updates. This effectively stops learning in early layers!
∂L/∂W₁ = ∂L/∂hₙ × ∂hₙ/∂hₙ₋₁ × ⋯ × ∂h₂/∂h₁ × ∂h₁/∂W₁
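A quick way to see the effect is to multiply the chain-rule factors directly. A minimal NumPy sketch, where the layer count, the random pre-activations, and the unit-weight assumption are all hypothetical choices for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy setup (assumed): scalar activations and unit weights, so the gradient
# reaching layer k is just the product of the sigmoid derivatives above it.
rng = np.random.default_rng(0)
n_layers = 10
pre_acts = rng.normal(size=n_layers)        # hypothetical pre-activation values
s = sigmoid(pre_acts)
local_derivs = s * (1.0 - s)                # each factor is at most 0.25

grad = 1.0                                  # dL/dh_n at the output
for layer in range(n_layers, 0, -1):
    grad *= local_derivs[layer - 1]         # one chain-rule factor per layer
    print(f"gradient magnitude at layer {layer:2d}: {grad:.2e}")
```

Ten factors of at most 0.25 already shrink the gradient by a factor of a million or more.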
The Chain Rule Culprit:
Each layer multiplies the gradient by a factor that is typically less than 1:
  • Sigmoid derivatives: at most 0.25
  • Tanh derivatives: at most 1.0, and well below 1 away from zero
  • Many small multiplications → vanishing gradient (illustrated in the sketch below)
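These derivative bounds are easy to confirm numerically; a small sketch (the grid range is arbitrary):

```python
import numpy as np

# Check the derivative bounds quoted above:
#   sigmoid'(x) = s(1 - s)      peaks at x = 0 with value 0.25
#   tanh'(x)    = 1 - tanh(x)²  peaks at x = 0 with value 1.0
x = np.linspace(-10.0, 10.0, 10_001)
s = 1.0 / (1.0 + np.exp(-x))
print("max sigmoid derivative:", (s * (1.0 - s)).max())        # -> 0.25
print("max tanh derivative:   ", (1.0 - np.tanh(x)**2).max())  # -> 1.0
```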

Gradient Flow Visualization

Layer      Role     Gradient magnitude
Layer 10   Output   1.000
Layer 7    Hidden   0.125
Layer 4    Hidden   0.016
Layer 1    Input    0.002

The gradient roughly halves at every layer on the way back, so layer 1 receives only about 0.2% of the output-layer signal.
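The same decay shows up in a real backward pass. A minimal PyTorch sketch, where the 10-layer sigmoid MLP, the width of 32, and the squared-output loss are assumptions chosen to mirror the table above:

```python
import torch
import torch.nn as nn

# Assumed architecture: a 10-layer sigmoid MLP, so we can watch the
# weight-gradient norm shrink as the gradient flows back toward layer 1.
torch.manual_seed(0)
blocks = []
for _ in range(10):
    blocks += [nn.Linear(32, 32), nn.Sigmoid()]
net = nn.Sequential(*blocks)

x = torch.randn(8, 32)
loss = net(x).pow(2).mean()     # arbitrary scalar loss for illustration
loss.backward()

for i, module in enumerate(net):
    if isinstance(module, nn.Linear):
        layer = i // 2 + 1      # linear layers sit at even indices
        print(f"layer {layer:2d} weight-grad norm: "
              f"{module.weight.grad.norm().item():.2e}")
```

Layers near the output keep usable gradients, while the norms at the earliest layers are typically orders of magnitude smaller, matching the pattern in the table.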