CS5720 - Week 8

Quantization for Efficiency

What is Quantization?

Quantization is the process of reducing the precision of weights and activations in neural networks from 32-bit floating point to lower bit-width representations (16-bit, 8-bit, or even 4-bit).
Core Benefits:

  • 4x smaller models - FP32 → INT8 cuts storage to a quarter
  • Faster inference - integer ops are cheaper than floating-point ops on most hardware
  • Lower memory usage - less RAM and memory bandwidth required
  • Energy efficient - crucial for mobile/edge deployment
📊 Quantization Formula

q = clamp(round(x / scale) + zero_point, q_min, q_max)
x̂ = (q - zero_point) × scale

where q_min and q_max are the limits of the target integer type (e.g. -128 and 127 for INT8).
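
A minimal NumPy sketch of these two equations; the function names and the min/max calibration choice are illustrative, not prescribed by the slide:

```python
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Affine quantization: map float values to INT8 codes."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Recover approximate float values from INT8 codes."""
    return (q.astype(np.float32) - zero_point) * scale

# Derive scale and zero_point from the observed range of x
# (min/max calibration, one common choice among several)
x = np.array([-1.5, -0.3, 0.0, 0.8, 2.1], dtype=np.float32)
scale = (x.max() - x.min()) / 255.0        # 255 INT8 quantization levels
zero_point = int(round(-128 - x.min() / scale))

q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)   # close to x, up to rounding error
```

Round-tripping through quantize/dequantize loses at most about scale/2 per value; that rounding error is the source of the accuracy degradation discussed below.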

Types of Quantization

  • Post-Training Quantization (PTQ)
    Quantize after training is complete, with no retraining
  • 🎯 Quantization-Aware Training (QAT)
    Simulate quantization during training so the model learns to compensate
  • 🔄 Dynamic Quantization
    Quantize weights ahead of time; activation scales are computed on the fly at inference (see the sketch below)
  • 📌 Static Quantization
    Pre-calibrate activation ranges on representative data for optimal performance
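
As a concrete instance of the dynamic scheme, PyTorch's torch.quantization.quantize_dynamic converts a model's Linear layer weights to INT8 in one call; the toy model below is illustrative, not from the slide:

```python
import torch
import torch.nn as nn

# A toy FP32 model; dynamic quantization targets its Linear layers
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Weights are converted to INT8 ahead of time; activation scales are
# computed per batch at runtime, matching the "dynamic" scheme above
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 784)
print(quantized(x).shape)  # torch.Size([1, 10])
```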

Bit-Width Comparison

Format  Bits  Representable range  Size    Speed (vs FP32)
FP32    32    ±3.4 × 10³⁸          100%    1.0x
FP16    16    ±6.5 × 10⁴           50%     1.7x
INT8     8    -128 to 127          25%     2.8x
INT4     4    -8 to 7              12.5%   4.2x
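
The Size column is just parameter count × bits per parameter. A quick back-of-envelope check, using a hypothetical 7-billion-parameter model (the count is made up for illustration):

```python
params = 7_000_000_000  # hypothetical parameter count, for illustration

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = params * bits / 8 / 2**30          # bits -> bytes -> GiB
    print(f"{name}: {gib:5.1f} GiB ({bits / 32:.1%} of FP32)")

# FP32:  26.1 GiB (100.0% of FP32)
# FP16:  13.0 GiB (50.0% of FP32)
# INT8:   6.5 GiB (25.0% of FP32)
# INT4:   3.3 GiB (12.5% of FP32)
```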
Performance Impact Analysis (typical INT8 vs FP32)

  • Model size: 4x reduction
  • Inference speed: 2.8x faster
  • Accuracy loss: ~1.2% typical degradation
Prepared by Dr. Gorkem Kar