CS5720 - Week 3
Slide 49 of 60

Batch Normalization - Concept

What is Batch Normalization?

Batch Normalization (BatchNorm) is a technique that normalizes the inputs to each layer, making deep neural networks much easier to train by addressing internal covariate shift.
🚨 The Problem It Solves
During training, the distribution of each layer's inputs changes as the parameters of the previous layers change. This "internal covariate shift" makes training deep networks difficult and slow.
Key Insight:
If we normalize inputs to our network, why not normalize inputs to every layer within the network?
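That insight takes only a few lines to demonstrate. A minimal NumPy sketch (the batch values are made up for illustration): standardizing a mini-batch of pre-activations gives them zero mean and unit variance, exactly the treatment we already give the network's raw inputs.

```python
import numpy as np

def standardize(x, eps=1e-5):
    """Zero-mean, unit-variance normalization over the batch axis."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# A tiny, made-up batch of pre-activations for one feature
batch = np.array([[2.0], [-1.0], [0.5], [3.5]])
z = standardize(batch)
print(z.mean(), z.std())  # approximately 0 and 1
```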

Why BatchNorm Works

  • 🎯 Stabilizes learning by normalizing layer inputs
  • ⚡ Allows higher learning rates
  • 🎲 Reduces sensitivity to initialization
  • 🛡️ Acts as a regularizer (reduces need for dropout)
  • 📈 Helps prevent vanishing/exploding gradients

How Batch Normalization Transforms Data

[Interactive demo: scatter plot of the input distribution before and after BatchNorm, with tabs for Multi-Layer Flow, Step-by-Step Math, and Gradient Flow]
Without BatchNorm (top) vs With BatchNorm (bottom):
  • Without BatchNorm: Input (μ=0, σ=1) → Layer 1 (μ=2.3, σ=0.5) → Layer 2 (μ=-1.5, σ=3.2) → Output: Unstable!
  • With BatchNorm: Input (μ=0, σ=1) → BN → Layer 1 (μ=0, σ=1) → BN → Layer 2 (μ=0, σ=1) → Output: Stable!
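The drift sketched above can be reproduced numerically. In this minimal NumPy sketch (layer width, weight scale, and depth are illustrative choices, not from the slide), a batch pushed through plain linear layers lets its statistics wander, while re-standardizing after every layer pins them at μ=0, σ=1:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(256, 64))                  # input: mu≈0, sigma≈1
weights = [rng.normal(0.0, 0.5, size=(64, 64)) for _ in range(4)]

def norm(h, eps=1e-5):
    """Per-feature standardization over the batch, as BatchNorm does."""
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

h_plain, h_bn = x, x
for i, W in enumerate(weights, 1):
    h_plain = h_plain @ W                                 # statistics drift freely
    h_bn = norm(h_bn @ W)                                 # re-centered every layer
    print(f"layer {i}: plain sigma={h_plain.std():.2f}  bn sigma={h_bn.std():.2f}")
```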
Step 1: Calculate Batch Mean
μ_B = (1/m) Σ x_i
Example: Batch = [2.3, -0.5, 1.8, 3.2, -1.2]
μ_B = (2.3 + (-0.5) + 1.8 + 3.2 + (-1.2)) / 5 = 1.12

Step 2: Calculate Batch Variance
σ²_B = (1/m) Σ (x_i - μ_B)²
Deviations: [1.18², (-1.62)², 0.68², 2.08², (-2.32)²]
σ²_B = (1.39 + 2.62 + 0.46 + 4.33 + 5.38) / 5 = 2.84

Step 3: Normalize Each Value
x̂_i = (x_i - μ_B) / √(σ²_B + ε)
Normalized values:
x̂₁ = (2.3 - 1.12) / √2.84 = 0.70
x̂₂ = (-0.5 - 1.12) / √2.84 = -0.96
... (mean ≈ 0, std ≈ 1)

Step 4: Scale and Shift
y_i = γx̂_i + β
Learnable parameters: γ (scale) = 1.2, β (shift) = 0.3
y₁ = 1.2 × 0.70 + 0.3 = 1.14
The network learns the optimal γ and β during training.
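The four steps above can be run end to end on the example batch. A minimal NumPy sketch (the function name is mine; γ and β are the slide's example values, not learned here):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean()                            # step 1: batch mean
    var = x.var()                            # step 2: batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # step 3: normalize
    return gamma * x_hat + beta              # step 4: scale and shift

x = np.array([2.3, -0.5, 1.8, 3.2, -1.2])
y = batchnorm_forward(x, gamma=1.2, beta=0.3)
print(round(x.mean(), 2), round(x.var(), 2))  # 1.12 2.84
print(round(y[0], 2))                         # 1.14
```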
Without BatchNorm:
  • Layer 20: |∇| = 1.0
  • Layer 15: |∇| = 0.01
  • Layer 10: |∇| = 0.0001
  • Layer 5: |∇| = 0.000001
  • Layer 1: |∇| ≈ 0
  ❌ Vanishing Gradients!
With BatchNorm:
  • Layer 20: |∇| = 1.0
  • Layer 15: |∇| = 0.95
  • Layer 10: |∇| = 0.90
  • Layer 5: |∇| = 0.85
  • Layer 1: |∇| = 0.80
  ✅ Healthy Gradient Flow!
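This contrast can be simulated directly. The sketch below (all sizes and weight scales are illustrative assumptions) backpropagates through a 20-layer tanh network twice: once plain and once through the exact backward pass of the normalization step. With small weights the plain gradient shrinks geometrically, while the normalized one remains at a usable magnitude; the exact numbers will differ from the slide's.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, depth = 128, 32, 20
x = rng.normal(size=(m, d))
Ws = [rng.normal(0, 0.5 / np.sqrt(d), size=(d, d)) for _ in range(depth)]

def bn_forward(z, eps=1e-5):
    inv_std = 1.0 / np.sqrt(z.var(axis=0) + eps)
    return (z - z.mean(axis=0)) * inv_std, inv_std

def bn_backward(g, z_hat, inv_std):
    # exact gradient of the normalization step (gamma = 1)
    n = z_hat.shape[0]
    return inv_std / n * (n * g - g.sum(axis=0) - z_hat * (g * z_hat).sum(axis=0))

# forward passes, caching activations for backprop
h_plain, h_bn, caches = [x], [x], []
for W in Ws:
    h_plain.append(np.tanh(h_plain[-1] @ W))
    z_hat, inv_std = bn_forward(h_bn[-1] @ W)
    caches.append((z_hat, inv_std))
    h_bn.append(np.tanh(z_hat))

# backward passes: track mean |gradient| at each layer
g_p, g_b = np.ones((m, d)), np.ones((m, d))
norms_plain, norms_bn = [], []
for i in reversed(range(depth)):
    g_p = (g_p * (1 - h_plain[i + 1] ** 2)) @ Ws[i].T
    z_hat, inv_std = caches[i]
    g_b = bn_backward(g_b * (1 - h_bn[i + 1] ** 2), z_hat, inv_std) @ Ws[i].T
    norms_plain.append(np.abs(g_p).mean())
    norms_bn.append(np.abs(g_b).mean())

print(f"plain |grad| at first layer: {norms_plain[-1]:.2e}")
print(f"bn    |grad| at first layer: {norms_bn[-1]:.2e}")
```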
Batch Normalization Formula
y = γ × (x - μ_B) / √(σ²_B + ε) + β
Where: μ_B = batch mean, σ²_B = batch variance, ε = small constant for numerical stability, γ & β = learnable parameters
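One detail the formula leaves implicit: at inference time there may be no batch to compute μ_B and σ²_B from, so the standard recipe keeps running averages of the batch statistics during training and uses those instead. A minimal NumPy sketch (the class name and momentum value are illustrative choices; real frameworks also train γ and β by gradient descent):

```python
import numpy as np

class BatchNorm1D:
    """Minimal sketch of BatchNorm with running statistics for inference."""
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)     # learnable scale
        self.beta = np.zeros(num_features)     # learnable shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training=True):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # exponential moving averages, consumed later at inference
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1D(num_features=3)
x = np.random.default_rng(0).normal(2.0, 3.0, size=(64, 3))
y_train = bn(x, training=True)    # normalized with batch statistics
y_eval = bn(x, training=False)    # normalized with running averages
```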
Prepared by Dr. Gorkem Kar