The Degradation Problem
Shocking Discovery:
Adding more layers made plain networks perform WORSE - not just on test data (which would suggest overfitting), but on the training data too!
The Evidence:
• A 56-layer plain network had higher training error than a 20-layer one
• Not overfitting - the training error itself was worse!
• The cause is optimization difficulty, not a lack of model capacity
• Deeper plain networks couldn't even learn identity mappings
The Paradox:
If we take a trained 20-layer network and stack 36 identity layers on top, the resulting 56-layer network computes the same function, so it should perform at least as well. But optimization couldn't find this solution!
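The construction behind the paradox is easy to verify mechanically. A toy sketch (illustrative names, pure NumPy): stack 36 exact identity layers on a fixed feature vector standing in for a 20-layer network's output, and confirm the function is unchanged - so a 56-layer solution at least as good as the 20-layer one provably exists.

```python
import numpy as np

def identity_layer(x):
    # A layer whose weights happen to realize H(x) = x exactly
    return x

# Stand-in for the output of a trained 20-layer network
features_20 = np.array([0.3, -1.2, 0.8, 2.5])

# Stack 36 identity layers on top to reach 56 layers
out = features_20
for _ in range(36):
    out = identity_layer(out)

# The deeper network computes exactly the same function
print(np.array_equal(out, features_20))  # True
```

The point of the paradox is that plain gradient descent fails to discover even this trivial solution.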
The Skip Connection Solution
Revolutionary Idea:
Instead of forcing layers to learn the desired mapping H(x) directly, let them learn the residual F(x) = H(x) - x
The original mapping is then recovered as H(x) = F(x) + x
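A minimal sketch of this formulation in NumPy (the function name, the two-layer form of F, and the 4-dimensional sizes are illustrative assumptions; real ResNet blocks use convolutions and batch normalization):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Compute H(x) = F(x) + x, with F a small two-layer transform."""
    f = relu(x @ W1) @ W2  # the residual F(x)
    return f + x           # skip connection adds the input back

# With all-zero weights, F(x) = 0 and the block is exactly the identity:
x = np.array([[1.0, -2.0, 3.0, 0.5]])
W_zero = np.zeros((4, 4))
print(np.allclose(residual_block(x, W_zero, W_zero), x))  # True
```

Note that doing nothing (zero weights) already yields an identity mapping, which is exactly the solution the plain deep networks failed to find.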
Why This Works:
• It is easier to learn F(x) = 0 than to learn H(x) = x
• Gradients flow directly through the shortcut connections
• The network behaves like an ensemble of many shallower paths
• The identity mapping becomes the default behavior
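The gradient-flow bullet can be made concrete with a scalar toy model (the specific weight value and depth are assumptions chosen only to make the contrast visible):

```python
# Scalar toy: compare d(out)/d(in) through 50 weak layers.
w, depth = 0.01, 50

# Plain chain: each layer computes w*x, so the per-layer derivative is w
plain_grad = w ** depth          # 0.01**50 = 1e-100: gradient vanishes

# Residual chain: each layer computes x + w*x, so the derivative is 1 + w
resid_grad = (1.0 + w) ** depth  # ~1.64: the identity path keeps it alive

print(plain_grad, resid_grad)
```

Even when each layer's learned transform contributes almost nothing, the +1 from the shortcut keeps the product of derivatives from collapsing to zero.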
The Result:
An ensemble of ResNets (up to 152 layers deep) won ILSVRC 2015 with 3.57% top-5 error - surpassing the estimated human error rate of 5.1%!