YOLO's Big Idea: Treat object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities.
Key Innovations:
• Single Network Pass: No region proposals
• Global Context: Sees entire image
• Real-time Speed: 45+ FPS capability
• End-to-end Training: Unified optimization
Pipeline: divide the image into an S×S grid → each cell predicts B boxes + class probabilities → on PASCAL VOC: 7×7 grid, 2 boxes per cell, 20 classes → output: a 7×7×30 tensor, produced in a single forward pass at real-time speed.
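The output depth in the pipeline above follows directly from the per-cell prediction layout; a quick check of the arithmetic, using the S, B, C values from the slides:

```python
# YOLO v1 output shape: S x S x (B*5 + C).
# Each of the B boxes carries (x, y, w, h, confidence) = 5 numbers,
# plus C class probabilities shared by the whole cell.
S, B, C = 7, 2, 20  # PASCAL VOC settings
depth = B * 5 + C
print((S, S, depth))  # (7, 7, 30)
```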
⚡ Extremely Fast: real-time processing at 45+ FPS
🌍 Global Context: fewer background false positives
🎯 Simple Architecture: easy to understand and implement
How YOLO Works
Three-Step Process
1. Divide the image into an S×S grid
2. Each cell predicts B bounding boxes with confidence scores, plus class probabilities
3. Apply non-maximum suppression (NMS) to produce the final detections
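Step 3 above can be sketched as a greedy suppression pass. This is a minimal NumPy version; the 0.5 IoU threshold is an illustrative default, not taken from the slides:

```python
import numpy as np

def iou(box, boxes):
    # box: (x1, y1, x2, y2); boxes: (N, 4) in the same corner format.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    # Greedy NMS: keep the highest-scoring box, discard any remaining
    # box that overlaps it by more than `thresh` IoU, repeat.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= thresh]
    return keep
```

For example, two heavily overlapping boxes collapse to the higher-scoring one, while a distant box survives.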
📐 Architecture Details
• 24 convolutional layers + 2 fully connected
• Inspired by GoogLeNet architecture
• 1×1 reduction layers followed by 3×3 convolutions
• Final output: 7×7×30 tensor
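The final 7×7×30 tensor is typically sliced per cell into boxes and class scores. A sketch of that decoding, assuming the paper's box-then-class channel ordering (variable names here are illustrative):

```python
import numpy as np

S, B, C = 7, 2, 20
out = np.random.rand(S, S, B * 5 + C)  # stand-in for a real network output

# B boxes of (x, y, w, h, conf) come first, then C class scores per cell.
boxes = out[..., :B * 5].reshape(S, S, B, 5)   # (7, 7, 2, 5)
classes = out[..., B * 5:]                     # (7, 7, 20)

# Class-specific confidence per box: box confidence x class probability.
conf = boxes[..., 4]                               # (7, 7, 2)
scores = conf[..., None] * classes[:, :, None, :]  # (7, 7, 2, 20)
```

These per-box class scores are what the NMS step then filters down to the final detections.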
⚠️ YOLO Limitations
• Struggles with small objects that appear in groups (e.g., flocks of birds)
• Each grid cell predicts only one class, so at most one object per cell
• Lower localization accuracy than two-stage detectors
• Difficulty with unusual aspect ratios
Perfect For:
Real-time applications where speed is more important than perfect accuracy!
YOLO Evolution Timeline
YOLO v1
2015
• Pioneered unified detection
• 7×7 grid, 2 boxes per cell
• 45 FPS on Titan X GPU
mAP: 63.4%
YOLO v2
2016
• Batch normalization
• Anchor boxes
• Multi-scale training