What is Batch Normalization?
Batch Normalization (BatchNorm) is a technique that normalizes the inputs to each layer, making deep neural networks much easier to train by addressing internal covariate shift.
🚨 The Problem It Solves
During training, the distribution of each layer's inputs changes as the parameters of the previous layers change. This "internal covariate shift" makes training deep networks difficult and slow.
Key Insight:
If we normalize inputs to our network, why not normalize inputs to every layer within the network?
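That input-level normalization is a one-liner worth seeing concretely. Below is a minimal NumPy sketch with made-up toy values (the dataset and shapes are purely illustrative); BatchNorm applies this same idea at every layer, computed per mini-batch instead of once over the whole training set:

```python
import numpy as np

# Toy dataset: 3 samples, 2 features (illustrative values)
X = np.array([[ 2.3, -0.5],
              [ 1.8,  3.2],
              [-1.2,  0.9]])

# Standardize each feature to zero mean and unit variance
mean = X.mean(axis=0)
std = X.std(axis=0)
X_norm = (X - mean) / std   # each column now has mean 0, std 1
```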
Why BatchNorm Works
- 🎯 Stabilizes learning by normalizing layer inputs
- ⚡ Allows higher learning rates
- 🎲 Reduces sensitivity to weight initialization
- 🛡️ Acts as a regularizer, reducing the need for dropout
- 📈 Helps prevent vanishing and exploding gradients
How Batch Normalization Transforms Data
Step-by-Step Math
Step 1: Compute the mini-batch mean
μ_B = (1/m) Σ x_i
Example batch: [2.3, -0.5, 1.8, 3.2, -1.2]
μ_B = (2.3 + (-0.5) + 1.8 + 3.2 + (-1.2)) / 5 = 1.12
Step 2: Compute the mini-batch variance
σ²_B = (1/m) Σ (x_i - μ_B)²
Squared deviations: [1.18², (-1.62)², 0.68², 2.08², (-2.32)²]
σ²_B = (1.39 + 2.62 + 0.46 + 4.33 + 5.38) / 5 = 2.84
Step 3: Normalize
x̂_i = (x_i - μ_B) / √(σ²_B + ε)
where ε is a small constant (e.g. 10⁻⁵) added for numerical stability. Normalized values:
x̂₁ = (2.3 - 1.12) / √2.84 = 0.70
x̂₂ = (-0.5 - 1.12) / √2.84 = -0.96
..., giving a batch with mean ≈ 0 and std ≈ 1.
Step 4: Scale and shift
y_i = γx̂_i + β
With learnable parameters γ (scale) = 1.2 and β (shift) = 0.3:
y₁ = 1.2 × 0.70 + 0.3 = 1.14
The network learns the optimal γ and β during training, so it can even undo the normalization if the raw distribution turns out to be what a layer needs.
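The four steps above can be sketched as a short NumPy function. This is a minimal forward pass over a single 1-D mini-batch (the function name and ε value are my choices, not from the page), run on the worked example's numbers:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """BatchNorm forward pass over a 1-D mini-batch."""
    mu = x.mean()                           # Step 1: mini-batch mean
    var = x.var()                           # Step 2: mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # Step 3: normalize
    return gamma * x_hat + beta             # Step 4: scale and shift

x = np.array([2.3, -0.5, 1.8, 3.2, -1.2])
y = batchnorm_forward(x, gamma=1.2, beta=0.3)
# y[0] ≈ 1.14, matching the worked example
```

A real layer would also keep running estimates of μ and σ² for use at inference time, when no mini-batch statistics are available.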
Gradient Flow
Without BatchNorm: ❌ Vanishing gradients
With BatchNorm: ✅ Healthy gradient flow
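As a rough illustration of why this happens, the sketch below pushes a batch through a stack of tanh layers with a deliberately small weight scale (a hypothetical setup chosen to trigger the failure, not the page's actual demo). Without per-layer normalization the forward signal collapses toward zero, and by symmetry the backward gradients shrink the same way; with normalization the activation spread stays healthy at every depth:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, depth, normalize):
    """Pass a batch through `depth` tanh layers, optionally
    standardizing each layer's pre-activations (BatchNorm-style,
    without the learnable γ and β)."""
    for _ in range(depth):
        # Deliberately small init so the signal shrinks each layer
        W = rng.normal(scale=0.1, size=(x.shape[1], x.shape[1]))
        z = x @ W
        if normalize:
            z = (z - z.mean(axis=0)) / (z.std(axis=0) + 1e-5)
        x = np.tanh(z)
    return x

batch = rng.normal(size=(64, 32))
plain = forward(batch, depth=25, normalize=False)
bn = forward(batch, depth=25, normalize=True)

# Without normalization the activations (and hence gradients)
# collapse toward zero; with it the spread stays near constant.
print("plain std:", plain.std(), " bn std:", bn.std())
```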