The Mathematical Foundation

Xavier Normal:
W ~ N(0, σ²) where σ = √(2/(n_in + n_out))
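As a minimal sketch of sampling from this distribution (assuming NumPy; `xavier_normal` is an illustrative helper name, not a library function):

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    """Sample a weight matrix from N(0, sigma^2), sigma = sqrt(2 / (n_in + n_out))."""
    sigma = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(loc=0.0, scale=sigma, size=(n_in, n_out))

rng = np.random.default_rng(0)
W = xavier_normal(256, 128, rng)
print(W.std())  # close to sqrt(2/384) ≈ 0.072
```

With 256 × 128 samples, the empirical standard deviation lands very close to the target σ.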

Core Principle:

Xavier initialization keeps the variance of activations and of gradients approximately constant from layer to layer, so signals neither vanish nor explode as they propagate through a deep network.

Key Assumptions:

  • The activation function is approximately linear around zero
  • Weights and inputs are independent
  • Inputs have zero mean
  • The network uses tanh or sigmoid activations (both near-linear at the origin)
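Under these assumptions, a single linear layer y = xW has Var(y) = n_in · σ² · Var(x) = (2·n_in / (n_in + n_out)) · Var(x); this equals Var(x) exactly only when n_in = n_out, since Xavier is a compromise between preserving forward variance (σ² = 1/n_in) and backward gradient variance (σ² = 1/n_out). A quick empirical check (assuming NumPy; the batch size and layer sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out = 256, 128

sigma = np.sqrt(2.0 / (n_in + n_out))
W = rng.normal(0.0, sigma, size=(n_in, n_out))

# Zero-mean inputs, independent of W, matching the assumptions above
x = rng.normal(0.0, 1.0, size=(10_000, n_in))
y = x @ W

print(np.var(x))  # ≈ 1.0
print(np.var(y))  # ≈ 2 * n_in / (n_in + n_out) ≈ 1.33
```

For this deliberately unbalanced layer (256 → 128) the output variance sits between the input variance and what a purely forward-preserving rule would give, illustrating the compromise.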

Variance Flow Visualization

[Figure: variance flow through a single layer with fan in = 256 and fan out = 128; input and output variance both shown as ≈ 1.00.]
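The preserved-variance behaviour the visualization depicts can be reproduced for a deeper stack. A minimal sketch (assuming NumPy; the width, depth, and batch size are arbitrary choices), using equal layer widths so that σ² = 2/(n + n) = 1/n and forward variance is preserved exactly in expectation:

```python
import numpy as np

rng = np.random.default_rng(2)

def xavier_normal(n_in, n_out, rng):
    sigma = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, sigma, size=(n_in, n_out))

width, depth = 256, 6                          # illustrative sizes
h = rng.normal(0.0, 1.0, size=(4096, width))   # unit-variance input batch
for _ in range(depth):
    # Linear layers, per the near-linearity assumption above
    h = h @ xavier_normal(width, width, rng)
    print(round(float(np.var(h)), 3))          # stays close to 1.0 at every layer
```

With a naive fixed σ instead (say 0.1 or 0.01), the printed variances would grow or shrink geometrically with depth, which is the signal degradation the scheme is designed to avoid.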