CS5720 - Week 10

R-CNN Family: Region-Based Detection

Two-Stage Detection

R-CNN Approach: First generate object proposals (regions of interest), then classify each region and refine its location. This two-stage process achieves high accuracy but at the cost of speed.
Core Philosophy:

Region Proposals: Find potential object locations
Feature Extraction: CNN features for each region
Classification: What object is in each region?
Bounding Box Regression: Refine location
Key Advantage:
Typically higher accuracy than the original YOLO, especially on small objects and cluttered scenes.
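The bounding box regression step listed above can be sketched in a few lines. This is a minimal NumPy illustration, assuming boxes in (x1, y1, x2, y2) corner format and the center/size offset parameterization described in the R-CNN papers; the function names `encode_box` and `decode_box` are hypothetical and not taken from any library.

```python
import numpy as np

def _center_size(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    w, h = box[2] - box[0], box[3] - box[1]
    return box[0] + 0.5 * w, box[1] + 0.5 * h, w, h

def encode_box(proposal, gt):
    """Regression targets (tx, ty, tw, th) that map a proposal onto a ground-truth box."""
    px, py, pw, ph = _center_size(proposal)
    gx, gy, gw, gh = _center_size(gt)
    # Center shifts are normalized by proposal size; size changes are in log space.
    return np.array([(gx - px) / pw, (gy - py) / ph, np.log(gw / pw), np.log(gh / ph)])

def decode_box(proposal, deltas):
    """Apply predicted offsets to a proposal to obtain the refined box."""
    px, py, pw, ph = _center_size(proposal)
    cx, cy = px + deltas[0] * pw, py + deltas[1] * ph
    w, h = pw * np.exp(deltas[2]), ph * np.exp(deltas[3])
    return np.array([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h])

# Illustrative values only: a proposal that is slightly off the true box.
proposal = np.array([50.0, 50.0, 150.0, 150.0])
gt = np.array([60.0, 40.0, 180.0, 160.0])
deltas = encode_box(proposal, gt)       # what the regressor is trained to predict
refined = decode_box(proposal, deltas)  # recovers the ground-truth box
```

During training the network learns to predict `deltas` from the region's features; at test time only `decode_box` is applied to each proposal.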

R-CNN Evolution

  • 2014
    R-CNN
    Regions with CNN features: the original two-stage detector
  • 2015
    Fast R-CNN
    RoI pooling makes training and testing faster
  • 2015
    Faster R-CNN
    Region Proposal Network (RPN) generates proposals inside the detector, trained end-to-end
  • 2017
    Mask R-CNN
    Adds instance segmentation capabilities
Trade-off:
Better accuracy but slower inference: typically 5-10 FPS on a GPU, vs. 45+ FPS for YOLO.

R-CNN Family Architectures

R-CNN
• Selective Search proposals
• CNN feature extraction
• SVM classification
• Linear regression for boxes
Speed: ~0.02 FPS
mAP: 66.0% (PASCAL VOC 2007)
Fast R-CNN
• RoI pooling layer
• Single-stage training of the detector (proposals still come from Selective Search)
• Multi-task loss
• Shared CNN features
Speed: ~0.5 FPS
mAP: 70.0% (PASCAL VOC 2007)
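The RoI pooling idea can be sketched as follows: the CNN feature map is computed once per image, and each region is then max-pooled from it into a fixed-size grid. This is a simplified NumPy illustration, assuming the RoI is already given in integer feature-map cells (inclusive corners); the function name `roi_max_pool` and the toy feature map are hypothetical.

```python
import numpy as np

def roi_max_pool(features, roi, output_size=2):
    """Max-pool one RoI from a shared feature map into a fixed grid.

    features: array of shape (C, H, W), computed once per image.
    roi: (x1, y1, x2, y2) in feature-map cells, inclusive.
    Returns an array of shape (C, output_size, output_size).
    """
    x1, y1, x2, y2 = roi
    region = features[:, y1:y2 + 1, x1:x2 + 1]
    _, h, w = region.shape
    # Split the region into output_size x output_size bins of roughly equal size.
    ys = np.linspace(0, h, output_size + 1)
    xs = np.linspace(0, w, output_size + 1)
    out = np.empty((features.shape[0], output_size, output_size), dtype=features.dtype)
    for i in range(output_size):
        y_lo = int(np.floor(ys[i]))
        y_hi = max(int(np.ceil(ys[i + 1])), y_lo + 1)  # every bin covers at least one cell
        for j in range(output_size):
            x_lo = int(np.floor(xs[j]))
            x_hi = max(int(np.ceil(xs[j + 1])), x_lo + 1)
            out[:, i, j] = region[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return out

# One shared feature map (1 channel, 6x6), reused for two different regions.
features = np.arange(36, dtype=float).reshape(1, 6, 6)
pooled_a = roi_max_pool(features, (0, 0, 3, 3))  # [[7, 9], [19, 21]]
pooled_b = roi_max_pool(features, (2, 2, 5, 5))  # [[21, 23], [33, 35]]
```

Both regions produce a 2x2 output regardless of their size, which is what lets a single set of fully connected layers classify every proposal.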
Faster R-CNN
• Region Proposal Network
• Anchor boxes
• Fully convolutional RPN
• Proposals computed on the GPU from shared conv features
Speed: ~7 FPS
mAP: 73.2% (PASCAL VOC 2007)
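The anchor boxes used by the Region Proposal Network can be illustrated with a short sketch. It assumes a feature stride of 16 pixels, three scales (128, 256, 512) and three aspect ratios (1:2, 1:1, 2:1), giving 9 anchors per feature-map location; these defaults and the function name `make_anchors` are assumptions for illustration, and real implementations differ in details such as center offsets.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors of shape (feat_h * feat_w * A, 4) as (x1, y1, x2, y2) in image pixels.

    A = len(scales) * len(ratios). Each anchor has area scale**2 and height/width == ratio.
    """
    shapes = np.array([(s / np.sqrt(r), s * np.sqrt(r)) for s in scales for r in ratios])  # (A, 2): w, h
    # Anchor centers sit at the middle of each feature-map cell, mapped back to image pixels.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)  # (feat_h * feat_w, 2)
    half = shapes / 2.0
    top_left = centers[:, None, :] - half[None, :, :]
    bottom_right = centers[:, None, :] + half[None, :, :]
    return np.concatenate([top_left, bottom_right], axis=2).reshape(-1, 4)

anchors = make_anchors(feat_h=2, feat_w=3)
widths = anchors[:, 2] - anchors[:, 0]
heights = anchors[:, 3] - anchors[:, 1]
```

For each anchor the RPN predicts an objectness score and four box offsets, so the number of candidate boxes grows with the feature-map size rather than with an external proposal algorithm.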
Mask R-CNN
• Instance segmentation
• FCN mask head
• RoIAlign (vs RoI pooling)
• Parallel mask prediction
Speed: ~5 FPS
Mask AP: 37.1% (COCO, a different benchmark from the box mAP figures above)
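The difference between RoIAlign and RoI pooling is that RoIAlign keeps fractional coordinates and reads the feature map with bilinear interpolation instead of rounding to whole cells. The sketch below is a simplified single-channel NumPy version that takes one sample at each bin center (practical implementations usually average several samples per bin); the names `bilinear_sample` and `roi_align` are hypothetical, and feature values are assumed to sit at integer coordinates.

```python
import numpy as np

def bilinear_sample(feature, y, x):
    """Read a 2-D feature map at a fractional (y, x) location."""
    h, w = feature.shape
    y = min(max(y, 0.0), h - 1.0)  # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature[y0, x0] + dx * feature[y0, x1]
    bottom = (1 - dx) * feature[y1, x0] + dx * feature[y1, x1]
    return (1 - dy) * top + dy * bottom

def roi_align(feature, roi, output_size=2):
    """Pool a fractional RoI (x1, y1, x2, y2) into an output_size x output_size grid."""
    x1, y1, x2, y2 = roi
    bin_h = (y2 - y1) / output_size
    bin_w = (x2 - x1) / output_size
    out = np.empty((output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            # No rounding: sample exactly at the center of each bin.
            out[i, j] = bilinear_sample(feature, y1 + (i + 0.5) * bin_h, x1 + (j + 0.5) * bin_w)
    return out

# Toy feature map whose value at (y, x) is 4*y + x, so interpolated values are easy to check.
feature = np.arange(16, dtype=float).reshape(4, 4)
aligned = roi_align(feature, (0.25, 0.25, 2.25, 2.25))  # [[3.75, 4.75], [7.75, 8.75]]
```

Because the RoI corners are never snapped to the grid, small shifts in the box produce small changes in the pooled features, which matters for pixel-accurate masks.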
Prepared by Dr. Gorkem Kar