The Mathematical Foundation
Xavier Normal:
W ~ N(0, σ²) where σ = √(2/(n_in + n_out))
Core Principle:
Xavier initialization keeps the variance of activations and gradients approximately equal across all layers, which prevents signals from vanishing or exploding as they propagate through a deep network.
Key Assumptions:
- Activations are linear around zero
- Weights and inputs are independent
- Inputs have zero mean
- Network uses tanh or sigmoid activation
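The formula above can be sketched directly in NumPy. This is a minimal illustration, not a reference implementation; the layer sizes (512 in, 256 out) are hypothetical:

```python
import numpy as np

def xavier_normal(n_in, n_out, rng=None):
    """Sample W ~ N(0, sigma^2) with sigma = sqrt(2 / (n_in + n_out))."""
    rng = np.random.default_rng(0) if rng is None else rng
    sigma = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, sigma, size=(n_in, n_out))

# Hypothetical layer sizes, chosen only for illustration
W = xavier_normal(512, 256)
print(W.std())  # close to sqrt(2/768) ~= 0.051
```

With enough weights, the empirical standard deviation lands very near the target sigma.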
Variance Flow Visualization
[Figure: signal flows through successive layers; variance is approximately preserved from input (Var = 1.00) to output (Var = 1.00).]
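Under the linear-around-zero assumption listed earlier, this variance preservation can be checked numerically. A sketch with a hypothetical width of 256 and ten stacked linear layers (no nonlinearity, to stay in the linear regime the derivation assumes):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256                             # hypothetical layer width
x = rng.normal(size=(10000, n))     # unit-variance inputs, Var ~= 1.00
h = x
for _ in range(10):                 # ten stacked layers
    sigma = np.sqrt(2.0 / (n + n))  # Xavier normal: n_in = n_out = n
    W = rng.normal(0.0, sigma, size=(n, n))
    h = h @ W                       # linear regime: no activation applied
print(h.var())                      # stays on the order of 1.0
```

With a naive scale (e.g. sigma = 0.01 or 1.0) the same loop would drive the variance toward zero or infinity within a few layers.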