CS5720 - Week 10
Slide 197 of 200
Handling Imbalanced Datasets
The Imbalance Problem
Class imbalance occurs when one class significantly outnumbers others in your dataset, leading to biased models that ignore minority classes.
Why it's a problem:
• Models learn to predict the majority class
• High accuracy can be misleading
• Minority classes get ignored
• Real-world consequences can be severe
🚨 Real Example
Medical diagnosis: 95% healthy patients, 5% with disease. A model predicting "always healthy" gets 95% accuracy but misses all sick patients!
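The arithmetic behind this example can be checked with a short sketch. The labels below are hypothetical data built to match the 95% / 5% split (100 patients, 1 = disease, 0 = healthy), and the "model" is simply a constant prediction.

```python
# Hypothetical labels matching the example: 5 sick patients, 95 healthy.
y_true = [1] * 5 + [0] * 95

# A "model" that always predicts healthy (class 0).
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the disease class: detected sick patients / all sick patients.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.95
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

Accuracy rewards the 95 correct "healthy" predictions, while recall on the disease class exposes that none of the 5 sick patients were found.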
Solution Strategies
⚖️ Sampling Techniques: Oversample minority classes or undersample majority classes
🔬 Data Synthesis: Generate synthetic examples using SMOTE or GANs
🎯 Loss Modification: Weighted loss, focal loss, cost-sensitive learning
📊 Better Metrics: Use precision, recall, F1-score, and AUC instead of accuracy
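As an illustration of the loss-modification strategy, here is a minimal sketch of a class-weighted cross-entropy. It assumes inverse-frequency weights, weight_c = n_samples / (n_classes * count_c), which is one common heuristic and not necessarily the one intended on this slide. The function names, the 50/950 label counts, and the constant predicted probabilities are all hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed heuristic: weight_c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


def weighted_cross_entropy(labels, probs, weights):
    """Sum of weight[y] * -log(p[y]), normalized by the total weight."""
    total = sum(weights[y] * -math.log(p[y]) for y, p in zip(labels, probs))
    return total / sum(weights[y] for y in labels)


# Hypothetical data: 50 minority ("A") and 950 majority ("B") samples.
labels = ["A"] * 50 + ["B"] * 950

# A hypothetical model that assigns 95% probability to "B" for every sample.
probs = [{"A": 0.05, "B": 0.95}] * len(labels)

weights = inverse_frequency_weights(labels)
uniform = {"A": 1.0, "B": 1.0}

unweighted_loss = weighted_cross_entropy(labels, probs, uniform)
weighted_loss = weighted_cross_entropy(labels, probs, weights)

print(weights["A"])  # 10.0
print(weighted_loss > unweighted_loss)  # True
```

With these weights each minority sample counts as much as roughly 19 majority samples, so a model that neglects class A is penalized far more heavily than under the unweighted loss.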
Interactive: Before vs After Balancing
Imbalanced Dataset: Class A (Minority) 50, Class B (Majority) 950. Ratio: 5% vs 95%
Balanced Dataset: Class A (Balanced) 500, Class B (Balanced) 500. Ratio: 50% vs 50%
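One way to get from 50 / 950 to 500 / 500 is to combine random oversampling of the minority class with random undersampling of the majority class. The slide does not say how its balanced dataset was produced, so the sketch below is an assumption; `resample_to` is a hypothetical helper and the samples are placeholder tuples.

```python
import random


def resample_to(samples, target, rng):
    """Return exactly `target` items from `samples`.

    Too many items: keep a random subset (undersampling, no replacement).
    Too few items: keep all originals and add random duplicates
    (oversampling, with replacement).
    """
    if len(samples) >= target:
        return rng.sample(samples, target)
    extra = rng.choices(samples, k=target - len(samples))
    return samples + extra


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Placeholder samples: (class label, sample id).
class_a = [("A", i) for i in range(50)]   # minority
class_b = [("B", i) for i in range(950)]  # majority

balanced_a = resample_to(class_a, 500, rng)
balanced_b = resample_to(class_b, 500, rng)

print(len(balanced_a), len(balanced_b))  # 500 500
```

Oversampling only duplicates existing minority samples and undersampling discards majority data, so resampling is normally applied to the training split only, leaving evaluation data at its original distribution.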
Accuracy: 95% (Can be misleading!)
Precision: 12% (Shows true performance)
Recall: 5% (Captures minority class)
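The metrics above are all derived from the same confusion-matrix counts, with the minority class treated as the positive class. The sketch below uses hypothetical counts chosen for illustration; they are not the figures shown on the slide, and `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion counts.

    Precision, recall and F1 fall back to 0.0 when their denominator is zero.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return accuracy, precision, recall, f1


# Hypothetical counts for 1000 samples, 50 of them in the minority class.
accuracy, precision, recall, f1 = classification_metrics(tp=30, fp=10, fn=20, tn=940)

print(f"accuracy  = {accuracy:.2f}")   # accuracy  = 0.97
print(f"precision = {precision:.2f}")  # precision = 0.75
print(f"recall    = {recall:.2f}")     # recall    = 0.60
print(f"f1        = {f1:.2f}")         # f1        = 0.67
```

Even in this hypothetical case accuracy looks strong at 0.97, while recall shows that 20 of the 50 minority samples were missed.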
Prepared by Dr. Gorkem Kar