CS5720 - Week 6
Slide 112 of 120
One-to-Many RNN: Image Captioning
One-to-Many RNN Architecture
🖼️ Image (single input: CNN features) → RNN₁ ("A") → RNN₂ ("cat") → RNN₃ ("sits") → … → RNNₙ
Multiple sequential outputs from a single image input
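A minimal PyTorch sketch of this one-to-many pattern, assuming a pretrained CNN has already produced a single feature vector per image. Names such as CaptionDecoder, feature_dim, and vocab_size are illustrative assumptions, not taken from the slides:

```python
# One-to-many decoding sketch: one image feature vector in, a word sequence out.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, feature_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.init_h = nn.Linear(feature_dim, hidden_dim)  # image features -> initial hidden state
        self.init_c = nn.Linear(feature_dim, hidden_dim)  # image features -> initial cell state
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word index -> embedding
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)    # one RNN step per generated word
        self.out = nn.Linear(hidden_dim, vocab_size)      # hidden state -> word scores

    def forward(self, features, max_len=20, start_token=1):
        """Greedily generate a caption from a single image feature vector."""
        h = torch.tanh(self.init_h(features))             # (batch, hidden_dim)
        c = torch.tanh(self.init_c(features))
        word = torch.full((features.size(0),), start_token, dtype=torch.long)
        caption = []
        for _ in range(max_len):                          # one output word per RNN step
            h, c = self.lstm(self.embed(word), (h, c))
            word = self.out(h).argmax(dim=-1)             # greedy choice of the next word
            caption.append(word)
        return torch.stack(caption, dim=1)                # (batch, max_len) word indices

# Single input (CNN features) -> multiple sequential outputs (word indices).
decoder = CaptionDecoder()
fake_features = torch.randn(1, 2048)                      # stand-in for real CNN output
print(decoder(fake_features).shape)                       # torch.Size([1, 20])
```

A trained system would stop at an end-of-sentence token and map the indices back to words; the loop structure is what makes this "one-to-many".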
Key Concepts
🖼️ Image Encoding
CNN extracts visual features as initial context
📝 Sequential Generation
RNN generates words one at a time
🎯 Visual Conditioning
Text generation conditioned on image content
👁️ Attention Mechanism
Focus on different image regions while generating each word (see the sketch below)
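A minimal sketch of soft additive attention over spatial CNN features, in the spirit of "Show, Attend and Tell". The dimensions (a 7×7 grid of 2048-dimensional regions) and names are illustrative assumptions:

```python
# At each decoding step, attention weights pick out the image regions
# most relevant to the next word; their weighted sum is the context vector.
import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    def __init__(self, feature_dim=2048, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feature_dim, attn_dim)  # project each image region
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)    # project current decoder state
        self.score = nn.Linear(attn_dim, 1)                # scalar relevance per region

    def forward(self, regions, hidden):
        # regions: (batch, num_regions, feature_dim), e.g. a 7x7 grid flattened to 49 regions
        # hidden:  (batch, hidden_dim), decoder state before generating the next word
        scores = self.score(torch.tanh(self.feat_proj(regions) +
                                       self.hid_proj(hidden).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)             # (batch, num_regions, 1)
        context = (weights * regions).sum(dim=1)           # (batch, feature_dim)
        return context, weights.squeeze(-1)                # context feeds the next RNN step

attn = VisualAttention()
regions = torch.randn(1, 49, 2048)                          # stand-in for a 7x7 CNN feature map
hidden = torch.randn(1, 512)
context, weights = attn(regions, hidden)
print(context.shape, weights.shape)                         # (1, 2048) and (1, 49)
```

The returned weights can be reshaped back to the 7×7 grid and overlaid on the image to visualize where the model "looks" for each generated word.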
Real-World Applications
♿ Accessibility
Automatic alt-text for visually impaired users
📁 Content Management
Automatic tagging and description of media
📱 Social Media
Suggested captions for user posts
🏥 Medical Imaging
Automated report generation from scans
Image Captioning Examples
🏖️
"A beautiful beach with clear blue water and white sand under a sunny sky"
🏙️
"A busy city street with tall buildings and people walking on the sidewalk"
🐕
"A golden retriever playing with a ball in a green park"
🍝
"A plate of spaghetti with tomato sauce and fresh basil leaves"
Interactive Image Captioning Demo
Explore how one-to-many RNNs generate captions from visual input
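As a rough stand-in for the demo's first step, the sketch below extracts both pooled and spatial features from a torchvision ResNet-50 backbone; in a full pipeline these would feed the caption decoder and attention modules sketched above. The random image tensor and weights=None are placeholders, a real system would load ImageNet-pretrained weights and a preprocessed photo:

```python
# Image encoding sketch: turn an image into region features and a pooled vector.
import torch
from torchvision.models import resnet50

cnn = resnet50(weights=None)                                 # in practice, load pretrained weights
backbone = torch.nn.Sequential(*list(cnn.children())[:-2])   # drop avgpool and fc -> spatial map
backbone.eval()

image = torch.randn(1, 3, 224, 224)                          # stand-in for a preprocessed photo
with torch.no_grad():
    fmap = backbone(image)                                   # (1, 2048, 7, 7) spatial features
regions = fmap.flatten(2).transpose(1, 2)                    # (1, 49, 2048) one row per region
features = regions.mean(dim=1)                               # (1, 2048) vector for the initial RNN state
print(regions.shape, features.shape)
```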
Prepared by Dr. Gorkem Kar