CS5720 - Week 8

Quantization for Efficiency

What is Quantization?

Quantization is the process of reducing the precision of weights and activations in neural networks from 32-bit floating point to lower bit-width representations (16-bit, 8-bit, or even 4-bit).
Core Benefits:

  • 4x smaller models - FP32 → INT8 cuts storage to a quarter
  • Faster inference - integer ops are cheaper than floating-point ops on most hardware
  • Lower memory usage - less RAM and memory bandwidth required
  • Energy efficient - crucial for mobile/edge deployment
📊 Quantization Formula

q = clamp(round(x / scale) + zero_point, q_min, q_max)
x̂ = (q - zero_point) × scale

where q_min and q_max are the limits of the target integer type (e.g. -128 and 127 for INT8).
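
A minimal NumPy sketch of these two equations; the function names and the min/max calibration choice are illustrative, not prescribed by the slide:

```python
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Affine quantization: map float values to INT8 codes."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Recover approximate float values from INT8 codes."""
    return (q.astype(np.float32) - zero_point) * scale

# Derive scale and zero_point from the observed range of x
# (min/max calibration, one common choice among several)
x = np.array([-1.5, -0.3, 0.0, 0.8, 2.1], dtype=np.float32)
scale = (x.max() - x.min()) / 255.0        # 255 INT8 quantization levels
zero_point = int(round(-128 - x.min() / scale))

q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)   # close to x, up to rounding error
```

Round-tripping through quantize/dequantize loses at most about scale/2 per value; that rounding error is the source of the accuracy degradation discussed below.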

Types of Quantization

  • Post-Training Quantization (PTQ)
    Quantize after training is complete, with no retraining
  • 🎯 Quantization-Aware Training (QAT)
    Simulate quantization during training so the model learns to compensate
  • 🔄 Dynamic Quantization
    Quantize weights ahead of time; activation scales are computed on the fly at inference (see the sketch below)
  • 📌 Static Quantization
    Pre-calibrate activation ranges on representative data for optimal performance
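
As a concrete instance of the dynamic scheme, PyTorch's torch.quantization.quantize_dynamic converts a model's Linear layer weights to INT8 in one call; the toy model below is illustrative, not from the slide:

```python
import torch
import torch.nn as nn

# A toy FP32 model; dynamic quantization targets its Linear layers
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Weights are converted to INT8 ahead of time; activation scales are
# computed per batch at runtime, matching the "dynamic" scheme above
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 784)
print(quantized(x).shape)  # torch.Size([1, 10])
```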

Bit-Width Comparison

Format  Bits  Representable range  Size    Speed (vs FP32)
FP32    32    ±3.4 × 10³⁸          100%    1.0x
FP16    16    ±6.5 × 10⁴           50%     1.7x
INT8     8    -128 to 127          25%     2.8x
INT4     4    -8 to 7              12.5%   4.2x
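
The Size column is just parameter count × bits per parameter. A quick back-of-envelope check, using a hypothetical 7-billion-parameter model (the count is made up for illustration):

```python
params = 7_000_000_000  # hypothetical parameter count, for illustration

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = params * bits / 8 / 2**30          # bits -> bytes -> GiB
    print(f"{name}: {gib:5.1f} GiB ({bits / 32:.1%} of FP32)")

# FP32:  26.1 GiB (100.0% of FP32)
# FP16:  13.0 GiB (50.0% of FP32)
# INT8:   6.5 GiB (25.0% of FP32)
# INT4:   3.3 GiB (12.5% of FP32)
```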
Performance Impact Analysis (typical INT8 vs FP32)

  • Model size: 4x reduction
  • Inference speed: 2.8x faster
  • Accuracy loss: ~1.2% typical degradation
Prepared by Dr. Gorkem Kar