CS5720 - Week 8
Slide 156 of 160

Model Compression Techniques

Why Model Compression?

Model Compression reduces the size, memory usage, and computational requirements of neural networks while maintaining acceptable performance for deployment in resource-constrained environments.
Compression Goals:

Size Reduction: Smaller file sizes for storage/transfer
Speed: Faster inference for real-time applications
Memory: Lower RAM requirements for mobile devices
Energy: Reduced power consumption for edge computing
💡 The Mobile Reality
Modern smartphones have roughly 6 GB of RAM, yet a 175-billion-parameter language model needs over 175 GB just to store its weights (even at one byte per parameter). Compression makes AI accessible everywhere.

Compression Techniques

  • 🎓
    Knowledge Distillation
    Large teacher model trains smaller student model
  • ✂️
    Network Pruning
    Remove unnecessary weights and connections
  • 📏
    Quantization
    Reduce precision of weights and activations
  • 🔢
    Low-Rank Factorization
    Decompose weight matrices into smaller factors
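Of the four techniques above, quantization is the easiest to see in a few lines. The sketch below is a minimal illustration (not a production recipe) of symmetric per-tensor int8 quantization using numpy; the function names and the random weight matrix are made up for the example.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of float32 weights to int8."""
    scale = np.abs(w).max() / 127.0  # map the largest |weight| to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for computation."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)  # toy weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(f"storage: {w.nbytes} -> {q.nbytes} bytes (4x smaller)")
print(f"max round-trip error: {np.abs(w - w_hat).max():.4f}")
```

Storing one byte instead of four per weight gives an immediate 4x size reduction; the price is a small rounding error in every weight, bounded by half the scale factor.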

Model Compression Impact

Original Model
Size: 500 MB
Params: 100M
Inference: 100ms
Accuracy: 95%
Compressed Model
Size: 50 MB (10x smaller)
Params: 10M (10x fewer)
Inference: 20ms (5x faster)
Accuracy: 93% (minimal loss)
Result: a 10x smaller model with only a 2-percentage-point accuracy drop - perfect for mobile deployment!
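Keeping the accuracy loss this small usually takes more than shrinking the network; knowledge distillation is a common way to close the gap. Below is a minimal numpy sketch of the softened-softmax distillation loss: the temperature value and the toy logits are illustrative choices, not values from the slide.

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T gives softer probabilities."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between softened teacher and student distributions.

    T > 1 smooths the teacher's output so the student also learns the
    relative ordering of the wrong classes, not just the top label.
    """
    p = softmax(teacher_logits, T)  # soft targets from the teacher
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([8.0, 2.0, 1.0])  # confident large model (toy logits)
student = np.array([5.0, 2.5, 1.5])  # smaller model still in training
print(f"distillation loss: {distillation_loss(student, teacher):.4f}")
```

Minimizing this loss (typically mixed with the ordinary cross-entropy on true labels) pulls the small student's output distribution toward the large teacher's, which is how compressed models recover most of the original accuracy.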
Prepared by Dr. Gorkem Kar