CS5720 - Week 5
Slide 87 of 100

Why ResNet Works: The Identity Shortcut

The Degradation Problem

🚨 The Shocking Discovery
Deeper networks performed worse than shallower ones, even on training data. This wasn't overfitting; it was an optimization problem.
Before ResNet:
• 20-layer network: 91.25% accuracy
• 56-layer network: 72.07% accuracy
• Training error was higher for deeper networks!
💡 Why This Happens
As networks get deeper, gradients shrink as they are multiplied back through many layers, and optimization becomes much harder. A plain stack of nonlinear layers even struggles to learn the identity mapping!

The Identity Shortcut Solution

ResNet's Genius Insight: Instead of learning H(x), learn the residual F(x) = H(x) - x, then add x back.
The Magic Formula:
• Traditional: y = H(x)
• ResNet: y = F(x) + x
• Where F(x) learns the "residual" or difference
🎯 Why Identity Works
If the optimal function is identity, F(x) just needs to learn zero. This is much easier than learning the identity mapping from scratch!
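The "learn zero" intuition can be checked numerically. Below is a minimal NumPy sketch (the tiny two-layer residual function F and the all-zero weights are illustrative, not from the slide): when F's weights are zero, the block outputs exactly its input.

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = F(x) + x, where F is a tiny two-layer net with ReLU."""
    h = np.maximum(0, W1 @ x)   # stand-in for Conv + ReLU
    f = W2 @ h                  # second layer (no ReLU before the add)
    return f + x                # identity shortcut

x = np.array([1.0, -2.0, 3.0])
W_zero = np.zeros((3, 3))

# If the optimal mapping is identity, F only has to output zero:
y = residual_block(x, W_zero, W_zero)
print(np.allclose(y, x))  # → True
```

A plain block would instead have to tune all its weights so that two stacked nonlinear layers reproduce x exactly, which gradient descent finds much harder.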

Residual Block vs Standard Block

Standard Deep Block
Input x → Conv + ReLU → Conv + ReLU → Output H(x)
Hard to optimize when deep!
ResNet Residual Block
Input x → Conv + ReLU → Conv → add skip connection: F(x) + x → ReLU → Output
Easy to optimize at any depth!
The ResNet Formula
y = F(x, {W_i}) + x
Where F(x, {W_i}) is the residual mapping learned by the stacked layers
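The formula maps directly onto arithmetic. In this sketch the target output H is a made-up example value, chosen only to show that the block reconstructs H by adding the residual back to x:

```python
import numpy as np

# Suppose the desired mapping on this input is H(x):
x = np.array([1.0, 2.0, 3.0])
H = np.array([1.5, 1.0, 3.5])   # hypothetical target output

# The stacked layers are only asked to learn the residual:
F = H - x                        # what F(x, {W_i}) must produce
y = F + x                        # the block's actual output
print(np.allclose(y, H))         # → True
```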
🌊 Gradient Flow
Skip connections provide direct gradient paths, mitigating vanishing gradients
🎯 Easier Optimization
Learning residuals is easier than learning full mappings from scratch
🔄 Ensemble Effect
Network acts like an ensemble of many shorter networks of different lengths
🆔 Identity Preservation
Can always learn identity function when deeper layers aren't helpful
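The gradient-flow point can be verified directly: since y = F(x) + x, the Jacobian of the block is dF/dx + I, so the identity term keeps gradients alive even when dF/dx is tiny. A NumPy sketch with deliberately small (illustrative) weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W1 = 0.01 * rng.standard_normal((n, n))  # small weights => tiny dF/dx
W2 = 0.01 * rng.standard_normal((n, n))

def F(x):
    # Residual mapping: two layers with a ReLU in between
    return W2 @ np.maximum(0, W1 @ x)

x = rng.standard_normal(n)

# Numerical Jacobian of y = F(x) + x via central differences
eps = 1e-6
J = np.zeros((n, n))
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    J[:, j] = (F(x + e) + (x + e) - F(x - e) - (x - e)) / (2 * eps)

# The Jacobian stays close to the identity: the skip path dominates,
# so backpropagated gradients cannot vanish through this block.
print(np.allclose(J, np.eye(n), atol=1e-2))  # → True
```

In a plain block the Jacobian is only dH/dx, a product of many weight matrices when blocks are stacked, which is exactly what drives gradients toward zero.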
Prepared by Dr. Gorkem Kar