He Initialization: Designed for ReLU

Kaiming He Formula:
W ~ N(0, σ²) where σ = √(2/n_in) and n_in is the layer's fan-in (number of input units)
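As a minimal sketch of this formula (the layer sizes and the helper name `he_init` are illustrative, not from the original), sampling He-initialized weights in NumPy might look like:

```python
import numpy as np

def he_init(n_in, n_out, seed=0):
    """Sample a weight matrix from N(0, 2/n_in), i.e. std = sqrt(2/n_in)."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

W = he_init(512, 256)
# Sample std should be close to sqrt(2/512) = 0.0625
print(W.std())
```

Deep-learning frameworks ship the same scheme under names like "Kaiming normal" initialization.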

The ReLU Problem:

ReLU zeroes out all negative pre-activations, so for zero-mean inputs roughly half the units output zero at any given time, halving the expected signal strength (second moment) at each layer. He initialization compensates by doubling the weight variance, so the signal neither shrinks nor explodes as it passes through many layers.

Key Insight:

Since ReLU(x) = max(0, x) zeroes out negative inputs, the weight variance must be twice as large (equivalently, the standard deviation √2 times larger) to maintain the same output signal strength as a linear activation.
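This insight can be checked empirically. For zero-mean Gaussian pre-activations z, the He derivation tracks the second moment E[ReLU(z)²] = Var(z)/2, so a weight variance of 1/n_in loses half the signal after ReLU while 2/n_in preserves it. A sketch (layer and batch sizes are arbitrary choices, not from the original):

```python
import numpy as np

rng = np.random.default_rng(42)
n_in, n_out, batch = 1000, 1000, 200

x = rng.normal(0.0, 1.0, size=(batch, n_in))  # unit-variance inputs

# Weight variance 1/n_in (Xavier-style) vs 2/n_in (He)
W_xavier = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))
W_he     = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

relu = lambda z: np.maximum(0.0, z)

# Second moment (mean square) of the post-ReLU activations
m2_xavier = np.mean(relu(x @ W_xavier) ** 2)  # ReLU halves the signal
m2_he     = np.mean(relu(x @ W_he) ** 2)      # factor of 2 restores it

print(m2_xavier)  # close to 0.5
print(m2_he)      # close to 1.0
```

Repeating this across many stacked layers shows the same effect compounding: without the factor of 2, the signal shrinks geometrically with depth.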

ReLU vs Other Activations

| Activation | Max Derivative | Active Neurons |
|------------|----------------|----------------|
| Sigmoid    | 0.25           | 100%           |
| ReLU       | 1.0            | ~50%           |