CS5720 - Week 10

Handling Imbalanced Datasets

The Imbalance Problem

Class imbalance occurs when one class significantly outnumbers others in your dataset, leading to biased models that ignore minority classes.
Why it's a problem:

• Models learn to predict the majority class
• High accuracy can be misleading
• Minority classes get ignored
• Real-world consequences can be severe
🚨 Real Example
Medical diagnosis: 95% healthy patients, 5% with disease. A model that always predicts "healthy" reaches 95% accuracy but misses every sick patient (0% recall on the disease class)!
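The arithmetic behind this example can be checked in a few lines. The sketch below assumes a hypothetical cohort of 1,000 patients with the 95% / 5% split described above; the label encoding (0 = healthy, 1 = disease) and the helper names are illustrative choices, not part of the original slide.

```python
# Hypothetical labels matching the 95% / 5% split: 0 = healthy, 1 = disease.
y_true = [0] * 950 + [1] * 50
# The "always healthy" baseline predicts class 0 for every patient.
y_pred = [0] * len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model finds."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)
print(f"accuracy={baseline_accuracy:.2f}, disease recall={baseline_recall:.2f}")
# prints: accuracy=0.95, disease recall=0.00
```

The baseline never learns anything about the disease class, yet accuracy alone would rank it as a strong model.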

Solution Strategies

  • ⚖️ Sampling Techniques
    Oversample the minority class or undersample the majority class
  • 🔬 Data Synthesis
    Generate synthetic minority examples with SMOTE or GANs
  • 🎯 Loss Modification
    Weighted loss, focal loss, cost-sensitive learning
  • 📊 Better Metrics
    Use precision, recall, F1-score, and AUC instead of accuracy
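As one illustration of loss modification, the sketch below derives inverse-frequency class weights and plugs them into a weighted binary log loss. The weight formula n_samples / (n_classes * class_count) is the same heuristic scikit-learn uses for class_weight="balanced"; the function names, the 50 / 950 label counts, and the constant prediction of 0.05 are assumptions made only for this example.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def weighted_log_loss(y_true, p_pos, weights, eps=1e-12):
    """Binary cross-entropy where each sample is scaled by its class weight."""
    total, weight_sum = 0.0, 0.0
    for t, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        w = weights[t]
        total += -w * (math.log(p) if t == 1 else math.log(1 - p))
        weight_sum += w
    return total / weight_sum

# Hypothetical labels: 1 = minority class (50 samples), 0 = majority (950).
labels = [1] * 50 + [0] * 950
weights = balanced_class_weights(labels)

# A lazy model that outputs the class prior (5%) for every sample.
p_pos = [0.05] * len(labels)
unweighted = weighted_log_loss(labels, p_pos, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, p_pos, weights)
print(weights, unweighted, weighted)
```

With these weights each class contributes equally to the total loss, so the lazy model is penalised much more heavily than under the unweighted loss.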

Before vs After Balancing

Dataset      Class A          Class B           Ratio
Imbalanced   50 (minority)    950 (majority)    5% vs 95%
Balanced     500              500               50% vs 50%
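One way to get from the 50 / 950 split to 500 / 500 is to combine random oversampling of the minority class with random undersampling of the majority class. The sketch below is a minimal version of that idea using only the standard library; the helper name resample_to_size, the placeholder records, and the fixed seed are assumptions for illustration, and this is not necessarily how the original demo produced its balanced set.

```python
import random

def resample_to_size(samples, target_size, rng):
    """Shrink or grow a class to target_size by random resampling."""
    if len(samples) >= target_size:
        # Undersample: draw target_size items without replacement.
        return rng.sample(samples, target_size)
    # Oversample: keep every original and add random duplicates.
    extra = rng.choices(samples, k=target_size - len(samples))
    return list(samples) + extra

rng = random.Random(0)  # fixed seed so the run is repeatable

# Placeholder records standing in for real feature rows.
class_a = [("A", i) for i in range(50)]   # minority
class_b = [("B", i) for i in range(950)]  # majority

balanced_a = resample_to_size(class_a, 500, rng)
balanced_b = resample_to_size(class_b, 500, rng)

balanced = balanced_a + balanced_b
rng.shuffle(balanced)
print(len(balanced_a), len(balanced_b), len(balanced))
```

Resampling is normally applied to the training split only; duplicating minority rows before splitting would leak copies of the same example into the test set.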
Metric       Value    Note
Accuracy     95%      Can be misleading!
Precision    12%      Shows true performance
Recall       5%       Captures minority class
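For reference, these metrics all come from the four cells of a confusion matrix. The sketch below computes them from raw counts; the counts themselves (5 true positives, 5 false positives, 45 false negatives, 945 true negatives) are hypothetical and were chosen only to show high accuracy alongside weak minority-class performance. They are not the counts behind the percentages above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 for the positive (minority) class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical confusion-matrix counts for 50 positives and 950 negatives.
metrics = classification_metrics(tp=5, fp=5, fn=45, tn=945)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case accuracy is 95% while recall is only 10%, which is why the minority-class metrics deserve more weight than accuracy on imbalanced data.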
Prepared by Dr. Gorkem Kar