The Degradation Problem
🚨 The Shocking Discovery
Deeper networks performed worse than shallower ones - even on the training data! This wasn't overfitting; it was an optimization problem.
Before ResNet:
• 20-layer network: 91.25% accuracy
• 56-layer network: 72.07% accuracy
• Training error was higher for deeper networks!
💡 Why This Happens
As plain (non-residual) networks get deeper, gradients degrade as they propagate back through many layers, and optimization becomes much harder. Strikingly, the extra layers struggle even to learn an identity mapping - which is all they would need to do to match the shallower network!
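A minimal NumPy sketch of the gradient-degradation effect (the depth, width, and small weight scale here are illustrative assumptions, not values from any real network): push an input through many plain tanh layers and accumulate the input-output Jacobian by the chain rule. Its norm collapses toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 32

x = rng.standard_normal(width)
grad = np.eye(width)  # Jacobian of the output w.r.t. the input, built layer by layer

for _ in range(depth):
    # Small random weights (illustrative assumption) keep activations bounded
    W = rng.standard_normal((width, width)) * 0.05
    x = np.tanh(W @ x)
    # Local Jacobian of tanh(W x) is diag(1 - tanh^2) @ W; chain it in
    grad = (np.diag(1.0 - x**2) @ W) @ grad

# After 50 plain layers the gradient signal is vanishingly small
print(np.linalg.norm(grad))
```

Each layer multiplies the Jacobian by a factor with norm below 1, so the product shrinks geometrically with depth: exactly the regime where a solver receives almost no learning signal.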
The Identity Shortcut Solution
ResNet's Genius Insight: Instead of forcing the layers to learn the full mapping H(x), let them learn the residual F(x) = H(x) - x, then add x back through a shortcut connection.
The Magic Formula:
• Traditional: y = H(x)
• ResNet: y = F(x) + x
• Where F(x) learns the "residual", i.e. the difference H(x) - x
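The formula above can be sketched in a few lines of NumPy. This is a simplified residual block (plain matrix weights, no batch norm or convolutions, which the real architecture uses); the two-layer Linear -> ReLU -> Linear shape of F is the standard pattern, but all dimensions and weight scales here are illustrative assumptions.

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = F(x) + x, where F is a small two-layer net: Linear -> ReLU -> Linear."""
    f = np.maximum(0.0, x @ W1) @ W2  # the residual branch F(x)
    return f + x                      # identity shortcut adds x back

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W1 = rng.standard_normal((8, 8)) * 0.1
W2 = rng.standard_normal((8, 8)) * 0.1

y = residual_block(x, W1, W2)
```

Note that the shortcut adds no parameters: the only learned part is F, and the output is always the input plus whatever correction F produces.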
🎯 Why Identity Works
If the optimal function is close to the identity, F(x) just needs to be pushed toward zero - and driving weights to zero is far easier for the solver than fitting an identity mapping with a stack of nonlinear layers!
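This claim is easy to verify directly. A self-contained check (same simplified two-layer residual block as above; shapes are illustrative): set every weight in F to zero, and the block reduces exactly to the identity.

```python
import numpy as np

def residual_block(x, W1, W2):
    # Sketch of y = F(x) + x with F: Linear -> ReLU -> Linear
    return np.maximum(0.0, x @ W1) @ W2 + x

x = np.random.default_rng(1).standard_normal(8)
zeros = np.zeros((8, 8))

y = residual_block(x, zeros, zeros)  # F(x) = 0 when all weights are zero
assert np.allclose(y, x)             # the block passes x through unchanged
```

So "do nothing" is the default behavior of a residual block, reachable by simply shrinking weights - whereas a plain stack of layers must actively construct the identity out of nonlinearities.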