Weight decay adds a "simplicity tax" to the loss function. Large weights are expensive, so the model only uses them when absolutely necessary.
Why it works:
• Prevents any single weight from becoming too large
• Encourages the network to use many small weights
• Creates smoother decision boundaries
• Reduces model sensitivity to input noise
🎛️ Interactive Lambda (λ) Control — in the live demo, a slider sets λ (here 0.010); adjust it to see the effect on regularization strength.
The Mathematical Formula
Regularized Loss Function:
L_total = L_original + λ × Σ(w²)
where λ (lambda) controls regularization strength
Breaking it down:
• L_original: Your normal loss (MSE, cross-entropy, etc.)
• λ: Regularization strength (hyperparameter)
• Σ(w²): Sum of all squared weights
• L_total: What we actually minimize
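The formula can be computed directly. Below is a minimal sketch with made-up weights and a made-up base loss (the values are illustrative, not from a real model):

```python
import numpy as np

# Hypothetical weights and base loss, purely for illustration.
weights = np.array([0.8, -1.2, 0.3, 2.1])
l_original = 2.45   # e.g. MSE or cross-entropy on a batch
lam = 0.010         # λ, the regularization strength

weight_penalty = lam * np.sum(weights ** 2)   # λ × Σ(w²)
l_total = l_original + weight_penalty         # what we actually minimize

print(f"penalty = {weight_penalty:.4f}, total loss = {l_total:.4f}")
```

Note that the penalty depends only on weight magnitudes, not signs: squaring treats +1.2 and −1.2 identically.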
🔍 Key Insight
The gradient now has two parts: the original data gradient pushing toward lower loss, plus a penalty gradient of 2λw pushing every weight toward zero!
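The two-part gradient can be seen in a single update step. In this sketch the data gradients are made-up numbers; the point is that even a weight whose data gradient is zero still shrinks, because the penalty term 2λw never vanishes for nonzero weights:

```python
import numpy as np

lam = 0.010
lr = 0.1
w = np.array([0.8, -1.2, 2.1])
grad_original = np.array([0.5, -0.1, 0.0])  # hypothetical data-loss gradients

# Total gradient = data gradient + penalty gradient (derivative of λ·w² is 2λw)
grad_total = grad_original + 2 * lam * w

# One gradient-descent step: the last weight has zero data gradient,
# yet it still moves toward zero.
w_new = w - lr * grad_total
print(w_new)
```

This is why weight decay is often implemented as a direct multiplicative shrink, w ← (1 − 2·lr·λ)·w, applied alongside the ordinary gradient step.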
Live Weight Decay Visualization (interactive; bar height represents weight magnitude). Loss components shown: original loss 2.45 + weight penalty 0.23 = total loss 2.68.
🎯 Effect of Current λ = 0.010
Moderate regularization: balances minimizing the original loss against keeping weights small.
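The effect of λ is easiest to see by sweeping it. A small sketch, reusing the same hypothetical weights and base loss as above: λ = 0 recovers the unregularized loss, and larger λ makes large weights increasingly expensive.

```python
import numpy as np

weights = np.array([0.8, -1.2, 0.3, 2.1])
l_original = 2.45   # hypothetical base loss

totals = {}
for lam in [0.0, 0.001, 0.01, 0.1]:
    penalty = lam * np.sum(weights ** 2)   # λ × Σ(w²)
    totals[lam] = l_original + penalty
    print(f"λ = {lam:<6} penalty = {penalty:.4f}  total = {totals[lam]:.4f}")
```

In practice λ is tuned on a validation set: too small and the penalty does nothing, too large and it dominates the data loss, driving all weights toward zero and underfitting.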