CS5720 - Week 5
Slide 87 of 100

Why ResNet Works: The Identity Shortcut

The Degradation Problem

🚨 The Shocking Discovery
Deeper networks performed worse than shallower ones, even on training data. This wasn't overfitting; it was an optimization problem.
Before ResNet:
• 20-layer network: 91.25% accuracy
• 56-layer network: 72.07% accuracy
• Training error was higher for deeper networks!
💡 Why This Happens
As networks get deeper, gradients shrink as they are multiplied back through many layers, and optimization becomes much harder. A plain stack of nonlinear layers even struggles to learn the identity mapping!

The Identity Shortcut Solution

ResNet's Genius Insight: Instead of learning H(x), learn the residual F(x) = H(x) - x, then add x back.
The Magic Formula:
• Traditional: y = H(x)
• ResNet: y = F(x) + x
• Where F(x) learns the "residual" or difference
🎯 Why Identity Works
If the optimal function is identity, F(x) just needs to learn zero. This is much easier than learning the identity mapping from scratch!
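The "learn zero" intuition can be checked numerically. Below is a minimal NumPy sketch (the tiny two-layer residual function F and the all-zero weights are illustrative, not from the slide): when F's weights are zero, the block outputs exactly its input.

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = F(x) + x, where F is a tiny two-layer net with ReLU."""
    h = np.maximum(0, W1 @ x)   # stand-in for Conv + ReLU
    f = W2 @ h                  # second layer (no ReLU before the add)
    return f + x                # identity shortcut

x = np.array([1.0, -2.0, 3.0])
W_zero = np.zeros((3, 3))

# If the optimal mapping is identity, F only has to output zero:
y = residual_block(x, W_zero, W_zero)
print(np.allclose(y, x))  # → True
```

A plain block would instead have to tune all its weights so that two stacked nonlinear layers reproduce x exactly, which gradient descent finds much harder.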

Residual Block vs Standard Block

Standard Deep Block
Input x → Conv + ReLU → Conv + ReLU → Output H(x)
Hard to optimize when deep!
ResNet Residual Block
Input x → Conv + ReLU → Conv → add skip connection: F(x) + x → ReLU → Output
Easy to optimize at any depth!
The ResNet Formula
y = F(x, {W_i}) + x
Where F(x, {W_i}) is the residual mapping learned by the stacked layers
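The formula maps directly onto arithmetic. In this sketch the target output H is a made-up example value, chosen only to show that the block reconstructs H by adding the residual back to x:

```python
import numpy as np

# Suppose the desired mapping on this input is H(x):
x = np.array([1.0, 2.0, 3.0])
H = np.array([1.5, 1.0, 3.5])   # hypothetical target output

# The stacked layers are only asked to learn the residual:
F = H - x                        # what F(x, {W_i}) must produce
y = F + x                        # the block's actual output
print(np.allclose(y, H))         # → True
```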
🌊 Gradient Flow
Skip connections provide direct gradient paths, mitigating vanishing gradients
🎯 Easier Optimization
Learning residuals is easier than learning full mappings from scratch
🔄 Ensemble Effect
Network acts like an ensemble of many shorter networks of different lengths
🆔 Identity Preservation
Can always learn identity function when deeper layers aren't helpful
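The gradient-flow point can be verified directly: since y = F(x) + x, the Jacobian of the block is dF/dx + I, so the identity term keeps gradients alive even when dF/dx is tiny. A NumPy sketch with deliberately small (illustrative) weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W1 = 0.01 * rng.standard_normal((n, n))  # small weights => tiny dF/dx
W2 = 0.01 * rng.standard_normal((n, n))

def F(x):
    # Residual mapping: two layers with a ReLU in between
    return W2 @ np.maximum(0, W1 @ x)

x = rng.standard_normal(n)

# Numerical Jacobian of y = F(x) + x via central differences
eps = 1e-6
J = np.zeros((n, n))
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    J[:, j] = (F(x + e) + (x + e) - F(x - e) - (x - e)) / (2 * eps)

# The Jacobian stays close to the identity: the skip path dominates,
# so backpropagated gradients cannot vanish through this block.
print(np.allclose(J, np.eye(n), atol=1e-2))  # → True
```

In a plain block the Jacobian is only dH/dx, a product of many weight matrices when blocks are stacked, which is exactly what drives gradients toward zero.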
Prepared by Dr. Gorkem Kar