He Initialization: Designed for ReLU

Kaiming He Formula:
W ~ N(0, σ²) where σ = √(2/n_in) and n_in is the layer's fan-in (number of input units)
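As a minimal sketch of this formula (the layer sizes and the helper name `he_init` are illustrative, not from the original), sampling He-initialized weights in NumPy might look like:

```python
import numpy as np

def he_init(n_in, n_out, seed=0):
    """Sample a weight matrix from N(0, 2/n_in), i.e. std = sqrt(2/n_in)."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

W = he_init(512, 256)
# Sample std should be close to sqrt(2/512) = 0.0625
print(W.std())
```

Deep-learning frameworks ship the same scheme under names like "Kaiming normal" initialization.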

The ReLU Problem:

ReLU zeroes out all negative pre-activations, so for zero-mean inputs roughly half the units output zero at any given time, halving the expected signal strength (second moment) at each layer. He initialization compensates by doubling the weight variance, so the signal neither shrinks nor explodes as it passes through many layers.

Key Insight:

Since ReLU(x) = max(0, x) zeroes out negative inputs, the weight variance must be twice as large (equivalently, the standard deviation √2 times larger) to maintain the same output signal strength as a linear activation.
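This insight can be checked empirically. For zero-mean Gaussian pre-activations z, the He derivation tracks the second moment E[ReLU(z)²] = Var(z)/2, so a weight variance of 1/n_in loses half the signal after ReLU while 2/n_in preserves it. A sketch (layer and batch sizes are arbitrary choices, not from the original):

```python
import numpy as np

rng = np.random.default_rng(42)
n_in, n_out, batch = 1000, 1000, 200

x = rng.normal(0.0, 1.0, size=(batch, n_in))  # unit-variance inputs

# Weight variance 1/n_in (Xavier-style) vs 2/n_in (He)
W_xavier = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))
W_he     = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

relu = lambda z: np.maximum(0.0, z)

# Second moment (mean square) of the post-ReLU activations
m2_xavier = np.mean(relu(x @ W_xavier) ** 2)  # ReLU halves the signal
m2_he     = np.mean(relu(x @ W_he) ** 2)      # factor of 2 restores it

print(m2_xavier)  # close to 0.5
print(m2_he)      # close to 1.0
```

Repeating this across many stacked layers shows the same effect compounding: without the factor of 2, the signal shrinks geometrically with depth.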

ReLU vs Other Activations

| Activation | Max Derivative | Active Neurons |
|------------|----------------|----------------|
| Sigmoid    | 0.25           | 100%           |
| ReLU       | 1.0            | ~50%           |