Designed for ReLU
Kaiming He Formula:
W ~ N(0, σ²) where σ = √(2/n_in)
The ReLU Problem:
ReLU sets all negative pre-activations to zero, so on average half the units output zero and the variance of the forward signal is halved at each layer. He initialization compensates by doubling the weight variance, keeping signal strength roughly constant across layers.
Key Insight:
Since ReLU(x) = max(0, x) zeroes out negative inputs, the weights need 2× larger variance (i.e., √2× larger standard deviation) than with linear activations to maintain the same output variance.
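A minimal NumPy sketch of the argument above, with assumed layer sizes and unit-variance inputs: drawing weights with σ = √(2/n_in) makes the pre-activation variance 2, and ReLU brings the second moment of the output back down to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 512, 512  # assumed layer sizes, for illustration

# He initialization: W ~ N(0, sigma^2) with sigma = sqrt(2 / n_in)
W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

x = rng.normal(0.0, 1.0, size=(n_in, 10_000))  # unit-variance inputs
pre = W @ x                                    # pre-activations
post = np.maximum(0.0, pre)                    # ReLU

print(round(x.var(), 1))             # 1.0  (input variance)
print(round(pre.var(), 1))           # 2.0  (doubled by He scaling)
print(round((post ** 2).mean(), 1))  # 1.0  (ReLU halves it back)
```

Note the check uses the second moment E[ReLU(z)²], as in He et al.'s derivation, since that is the quantity preserved layer to layer.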
ReLU vs Other Activations
Activation   Max Derivative   Active Neurons
Sigmoid      0.25             100%
ReLU         1.0              ~50%
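The two comparison numbers can be verified directly; a short sketch (sample sizes are arbitrary): the sigmoid derivative σ(z)(1 − σ(z)) peaks at 0.25 at z = 0, and ReLU passes roughly half of zero-mean pre-activations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Max derivative of sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z)),
# maximized at z = 0 where it equals 0.25.
z = np.linspace(-6, 6, 10_001)
sig_grad = sigmoid(z) * (1.0 - sigmoid(z))
print(round(sig_grad.max(), 2))  # 0.25

# Fraction of neurons active under ReLU for zero-mean pre-activations.
rng = np.random.default_rng(0)
pre = rng.normal(size=100_000)
print(round((pre > 0).mean(), 2))  # ~0.5
```

The ~50% figure assumes zero-mean, symmetric pre-activations, which holds for zero-mean weight initializations like He's.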