Where:
• n = number of examples
• y_predicted = what our network predicts
• y_actual = the true value
• We square the difference and average over all examples
Why square the errors?
• Makes all errors positive (no cancellation)
• Penalizes large errors more heavily
• Keeps the loss differentiable everywhere, unlike absolute error at zero (important for learning!)
• Creates a smooth error surface
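The definition above translates directly into a few lines of code. This is a minimal sketch (the function name `mse` and the sample values are illustrative, not from the text): square each difference, then average over all examples.

```python
import numpy as np

def mse(y_actual, y_predicted):
    """Mean squared error: average of squared prediction errors."""
    errors = np.asarray(y_actual, dtype=float) - np.asarray(y_predicted, dtype=float)
    return float(np.mean(errors ** 2))

# Perfect predictions give a loss of exactly 0
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # → 0.0

# Errors of 1 and 2 give (1 + 4) / 2 = 2.5
print(mse([1.0, 2.0], [2.0, 4.0]))            # → 2.5
```

Note that squaring removes the sign of each error, so over- and under-predictions cannot cancel each other out in the average.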
Key Properties of MSE
📈 Always Non-negative
MSE ≥ 0, and equals 0 only when predictions are perfect
🎯 Differentiable Everywhere
Smooth gradient allows efficient optimization
⚖️ Sensitive to Outliers
Large errors contribute disproportionately to the loss
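The outlier sensitivity is easy to see numerically. In this sketch (the arrays are made-up illustration data), two sets of predictions differ only in their last value, yet squaring lets that single large error dominate the loss:

```python
import numpy as np

y_actual = np.array([1.0, 2.0, 3.0, 4.0])
y_good    = np.array([1.1, 2.1, 2.9, 4.1])  # small errors everywhere
y_outlier = np.array([1.1, 2.1, 2.9, 9.0])  # one error of 5.0

mse_good = np.mean((y_actual - y_good) ** 2)        # 0.01
mse_outlier = np.mean((y_actual - y_outlier) ** 2)  # ≈ 6.26

# A single outlier error of 5.0 inflates the loss by a factor of ~600,
# because 5.0 squared (25.0) swamps the other squared errors (0.01 each).
print(mse_good, mse_outlier)
```

Whether this sensitivity is a feature or a drawback depends on the data: it drives the network to fix its worst mistakes, but a few noisy labels can dominate training.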