CS5720 - R-CNN Family: Region-Based Detection

Two-Stage Detection

R-CNN Approach: First generate object proposals (regions of interest), then classify each region and refine its location. This two-stage process achieves high accuracy but at the cost of speed.

Core Philosophy:

• Region Proposals: Find potential object locations
• Feature Extraction: CNN features for each region
• Classification: What object is in each region?
• Bounding Box Regression: Refine location
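
The four steps above can be sketched as a single loop. This is only an illustrative skeleton: `detect_two_stage` and its four callables (`propose`, `extract`, `classify`, `refine`) are hypothetical stand-ins, not part of any library, and the `"background"` label and 0.5 score threshold are assumptions.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def detect_two_stage(
    image,
    propose: Callable[[object], List[Box]],
    extract: Callable[[object, Box], List[float]],
    classify: Callable[[List[float]], Tuple[str, float]],
    refine: Callable[[List[float], Box], Box],
    score_threshold: float = 0.5,  # assumed cut-off, not from the slides
) -> List[Tuple[str, float, Box]]:
    """Run the four steps in order; every callable is a hypothetical stand-in."""
    detections = []
    for box in propose(image):             # 1. region proposals
        features = extract(image, box)     # 2. CNN features for this region
        label, score = classify(features)  # 3. what object is in the region?
        if label != "background" and score >= score_threshold:
            # 4. bounding box regression refines the proposal
            detections.append((label, score, refine(features, box)))
    return detections
```

The loop makes the speed cost visible: stage two runs once per proposal, so the work grows with the number of candidate regions.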

Key Advantage:

Typically higher accuracy than early single-stage detectors such as YOLO, especially on small objects and cluttered scenes.

R-CNN Evolution

2014

R-CNN

Regions with CNN features - The original two-stage detector

2015

Fast R-CNN

RoI pooling reuses one shared CNN feature map for all regions, making training and testing faster

2016

Faster R-CNN

Region Proposal Network (RPN) generates proposals inside the detector, enabling end-to-end training

2017

Mask R-CNN

Adds instance segmentation capabilities

Trade-off:

Better accuracy but slower inference: the later models in the family (Faster R-CNN, Mask R-CNN) typically run at 5-10 FPS, versus 45+ FPS for YOLO.

R-CNN Family Architectures

🔍

R-CNN

• Selective Search proposals
• CNN feature extraction
• SVM classification
• Linear regression for boxes

Speed: ~0.02 FPS
mAP: 66.0% (PASCAL VOC 2007)
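
The "linear regression for boxes" step predicts offsets relative to each proposal rather than absolute coordinates. Below is a minimal sketch of the commonly used R-CNN parameterization, assuming boxes are given as (center x, center y, width, height); the function names `encode` and `decode` are illustrative, not from the slides.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h), an assumed layout


def encode(proposal: Box, target: Box) -> Box:
    """Regression targets: center shift scaled by proposal size, log size ratio."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = target
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))


def decode(proposal: Box, deltas: Box) -> Box:
    """Apply predicted deltas to a proposal to obtain the refined box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = deltas
    return (px + pw * tx, py + ph * ty,
            pw * math.exp(tw), ph * math.exp(th))
```

Scaling the shifts by the proposal size and using log ratios for width and height keeps the targets in a similar numeric range for small and large proposals.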

⚡

Fast R-CNN

• RoI pooling layer
• Single-stage training (proposals still come from Selective Search)
• Multi-task loss
• Shared CNN features

Speed: ~0.5 FPS
mAP: 70.0% (PASCAL VOC 2007)
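
RoI pooling turns a region of any size on the shared feature map into a fixed-size grid by taking the maximum in each bin. The following is a single-channel sketch in plain Python, assuming the RoI is already in feature-map coordinates and half-open (rows y1..y2-1, columns x1..x2-1); `roi_max_pool` is an illustrative name, and real implementations work on multi-channel tensors.

```python
import math
from typing import List, Tuple


def roi_max_pool(
    fmap: List[List[float]],
    roi: Tuple[int, int, int, int],  # (x1, y1, x2, y2), half-open, assumed layout
    out_h: int = 2,
    out_w: int = 2,
) -> List[List[float]]:
    """Max-pool one RoI of a single-channel feature map to out_h x out_w."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Integer bin edges: floor for the start, ceil for the end,
        # so every bin covers at least one row.
        r0 = y1 + math.floor(i * h / out_h)
        r1 = y1 + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + math.floor(j * w / out_w)
            c1 = x1 + math.ceil((j + 1) * w / out_w)
            row.append(max(fmap[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled
```

Because the output size is fixed, regions of different shapes can all feed the same fully connected layers while the expensive convolutional pass is done once per image.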

🚀

Faster R-CNN

• Region Proposal Network
• Anchor boxes
• Fully convolutional RPN
• Proposals computed on the GPU from shared backbone features

Speed: ~7 FPS
mAP: 73.2% (PASCAL VOC 2007)
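
Anchor boxes are a fixed set of reference boxes placed at every feature-map location; the RPN scores and adjusts each one. The sketch below assumes the commonly cited defaults of 3 scales and 3 aspect ratios with a stride of 16 pixels, and it centers anchors in the middle of each cell, which is one of several conventions. `make_anchors` is an illustrative name.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels


def make_anchors(
    feat_h: int,
    feat_w: int,
    stride: int = 16,                                # assumed backbone stride
    scales: Sequence[float] = (128, 256, 512),       # anchor side for ratio 1
    ratios: Sequence[float] = (0.5, 1.0, 2.0),       # height / width
) -> List[Box]:
    """Return len(scales) * len(ratios) anchors per feature-map cell."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx = (col + 0.5) * stride  # cell center in image coordinates
            cy = (row + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s / math.sqrt(r)  # keeps area close to s * s
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors
```

With these defaults each location contributes 9 anchors, so even a small feature map yields thousands of candidates for the RPN to score.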

🎨

Mask R-CNN

• Instance segmentation
• FCN mask head
• RoIAlign (vs RoI pooling)
• Parallel mask prediction

Speed: ~5 FPS
mAP: 37.1% (mask AP, COCO; not directly comparable to the VOC box mAP figures above)
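
RoIAlign differs from RoI pooling by avoiding rounding: it samples the feature map at fractional positions with bilinear interpolation. The sketch below is a simplified single-channel version that takes one sample at the center of each bin (an assumption made for brevity; implementations commonly average several samples per bin) and treats `fmap[r][c]` as the value at point (r, c). The names `bilinear_sample` and `roi_align` are illustrative.

```python
import math
from typing import List, Tuple


def bilinear_sample(fmap: List[List[float]], y: float, x: float) -> float:
    """Interpolate fmap at a fractional (y, x), clamped to the map borders."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(
    fmap: List[List[float]],
    roi: Tuple[float, float, float, float],  # (x1, y1, x2, y2), may be fractional
    out_h: int = 2,
    out_w: int = 2,
) -> List[List[float]]:
    """Sample the center of each bin without rounding the RoI or bin edges."""
    x1, y1, x2, y2 = roi
    bin_h = (y2 - y1) / out_h
    bin_w = (x2 - x1) / out_w
    return [[bilinear_sample(fmap,
                             y1 + (i + 0.5) * bin_h,
                             x1 + (j + 0.5) * bin_w)
             for j in range(out_w)]
            for i in range(out_h)]
```

Keeping fractional coordinates preserves pixel-level alignment between the RoI and the features, which matters more for masks than for boxes.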
