What is Batch Normalization?
Batch Normalization (BatchNorm) is a technique that normalizes the inputs to each layer, making deep neural networks much easier to train by addressing internal covariate shift.
🚨 The Problem It Solves
During training, the distribution of each layer's inputs changes as the parameters of the previous layers change. This "internal covariate shift" makes training deep networks difficult and slow.
Key Insight:
If we normalize inputs to our network, why not normalize inputs to every layer within the network?
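That input-level normalization is a one-liner worth seeing concretely. Below is a minimal NumPy sketch with made-up toy values (the dataset and shapes are purely illustrative); BatchNorm applies this same idea at every layer, computed per mini-batch instead of once over the whole training set:

```python
import numpy as np

# Toy dataset: 3 samples, 2 features (illustrative values)
X = np.array([[ 2.3, -0.5],
              [ 1.8,  3.2],
              [-1.2,  0.9]])

# Standardize each feature to zero mean and unit variance
mean = X.mean(axis=0)
std = X.std(axis=0)
X_norm = (X - mean) / std   # each column now has mean 0, std 1
```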
Why BatchNorm Works
- 🎯 Stabilizes learning by normalizing layer inputs
- ⚡ Allows higher learning rates
- 🎲 Reduces sensitivity to weight initialization
- 🛡️ Acts as a regularizer, reducing the need for dropout
- 📈 Helps prevent vanishing and exploding gradients
How Batch Normalization Transforms Data
Step-by-Step Math
Step 1: Compute the mini-batch mean
μ_B = (1/m) Σ x_i
Example batch: [2.3, -0.5, 1.8, 3.2, -1.2]
μ_B = (2.3 + (-0.5) + 1.8 + 3.2 + (-1.2)) / 5 = 1.12
Step 2: Compute the mini-batch variance
σ²_B = (1/m) Σ (x_i - μ_B)²
Squared deviations: [1.18², (-1.62)², 0.68², 2.08², (-2.32)²]
σ²_B = (1.39 + 2.62 + 0.46 + 4.33 + 5.38) / 5 = 2.84
Step 3: Normalize
x̂_i = (x_i - μ_B) / √(σ²_B + ε)
where ε is a small constant (e.g. 10⁻⁵) added for numerical stability. Normalized values:
x̂₁ = (2.3 - 1.12) / √2.84 = 0.70
x̂₂ = (-0.5 - 1.12) / √2.84 = -0.96
..., giving a batch with mean ≈ 0 and std ≈ 1.
Step 4: Scale and shift
y_i = γx̂_i + β
With learnable parameters γ (scale) = 1.2 and β (shift) = 0.3:
y₁ = 1.2 × 0.70 + 0.3 = 1.14
The network learns the optimal γ and β during training, so it can even undo the normalization if the raw distribution turns out to be what a layer needs.
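The four steps above can be sketched as a short NumPy function. This is a minimal forward pass over a single 1-D mini-batch (the function name and ε value are my choices, not from the page), run on the worked example's numbers:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """BatchNorm forward pass over a 1-D mini-batch."""
    mu = x.mean()                           # Step 1: mini-batch mean
    var = x.var()                           # Step 2: mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # Step 3: normalize
    return gamma * x_hat + beta             # Step 4: scale and shift

x = np.array([2.3, -0.5, 1.8, 3.2, -1.2])
y = batchnorm_forward(x, gamma=1.2, beta=0.3)
# y[0] ≈ 1.14, matching the worked example
```

A real layer would also keep running estimates of μ and σ² for use at inference time, when no mini-batch statistics are available.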
Gradient Flow
Without BatchNorm: ❌ Vanishing gradients
With BatchNorm: ✅ Healthy gradient flow
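As a rough illustration of why this happens, the sketch below pushes a batch through a stack of tanh layers with a deliberately small weight scale (a hypothetical setup chosen to trigger the failure, not the page's actual demo). Without per-layer normalization the forward signal collapses toward zero, and by symmetry the backward gradients shrink the same way; with normalization the activation spread stays healthy at every depth:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, depth, normalize):
    """Pass a batch through `depth` tanh layers, optionally
    standardizing each layer's pre-activations (BatchNorm-style,
    without the learnable γ and β)."""
    for _ in range(depth):
        # Deliberately small init so the signal shrinks each layer
        W = rng.normal(scale=0.1, size=(x.shape[1], x.shape[1]))
        z = x @ W
        if normalize:
            z = (z - z.mean(axis=0)) / (z.std(axis=0) + 1e-5)
        x = np.tanh(z)
    return x

batch = rng.normal(size=(64, 32))
plain = forward(batch, depth=25, normalize=False)
bn = forward(batch, depth=25, normalize=True)

# Without normalization the activations (and hence gradients)
# collapse toward zero; with it the spread stays near constant.
print("plain std:", plain.std(), " bn std:", bn.std())
```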