CS5720 - Week 6

One-to-Many RNN: Image Captioning

One-to-Many RNN Architecture

[Diagram] A single image 🖼️, encoded as CNN features, is the lone input; it initializes a chain of RNN steps (RNN₁ → RNN₂ → RNN₃ → … → RNNₙ) that each emit one word: "A", "cat", "sits", …
Multiple sequential outputs from a single image input
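A minimal sketch of the one-to-many pattern in the diagram, assuming PyTorch and greedy decoding; the layer sizes, vocabulary size, and token ids are illustrative placeholders, not values from the slide.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """One-to-many decoder: one image feature vector in, a word sequence out."""
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)     # CNN features -> initial hidden state
        self.embed = nn.Embedding(vocab_size, embed_dim)  # previous word -> embedding
        self.rnn = nn.RNNCell(embed_dim, hidden_dim)      # one recurrent step per output word
        self.out = nn.Linear(hidden_dim, vocab_size)      # hidden state -> word scores

    def generate(self, image_features, start_id=1, end_id=2, max_len=20):
        h = torch.tanh(self.init_h(image_features))       # single input: the image features
        word = torch.tensor([start_id])
        caption = []
        for _ in range(max_len):                          # multiple sequential outputs
            h = self.rnn(self.embed(word), h)
            word = self.out(h).argmax(dim=-1)             # greedy choice of the next word
            if word.item() == end_id:
                break
            caption.append(word.item())
        return caption

# Hypothetical usage: in practice the features would come from a pretrained CNN
# (e.g. a ResNet pooling layer), not random noise.
# feats = torch.randn(1, 2048)
# token_ids = CaptionDecoder().generate(feats)
```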

Key Concepts

  • 🖼️ Image Encoding
    CNN extracts visual features as initial context
  • 📝 Sequential Generation
    RNN generates words one at a time
  • 🎯 Visual Conditioning
    Text generation conditioned on image content
  • 👁️ Attention Mechanism
    Focus on different image regions while generating each word (see the sketch after this list)
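A minimal sketch of the visual-conditioning and attention idea, assuming PyTorch and additive (Bahdanau-style) attention over a grid of CNN region features; the module name and dimensions are illustrative, not from the slide.

```python
import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    """Score each image region against the decoder state and return a focused context."""
    def __init__(self, region_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.proj_regions = nn.Linear(region_dim, attn_dim)
        self.proj_hidden = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden):
        # regions: (batch, num_regions, region_dim), e.g. a 7x7 CNN map flattened to 49 regions
        # hidden:  (batch, hidden_dim), the decoder state before predicting the next word
        energy = self.score(torch.tanh(self.proj_regions(regions)
                                       + self.proj_hidden(hidden).unsqueeze(1)))
        weights = torch.softmax(energy, dim=1)        # where to "look" for this word
        context = (weights * regions).sum(dim=1)      # weighted sum of region features
        return context, weights.squeeze(-1)

# Hypothetical usage inside the decoding loop: the context vector is concatenated with
# the word embedding before the RNN step, so each generated word attends to different regions.
```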

Real-World Applications

  • ♿ Accessibility
    Automatic alt-text for visually impaired users
  • 📁 Content Management
    Automatic tagging and description of media
  • 📱 Social Media
    Suggested captions for user posts
  • 🏥 Medical Imaging
    Automated report generation from scans

Image Captioning Examples

🏖️
"A beautiful beach with clear blue water and white sand under a sunny sky"
🏙️
"A busy city street with tall buildings and people walking on the sidewalk"
🐕
"A golden retriever playing with a ball in a green park"
🍝
"A plate of spaghetti with tomato sauce and fresh basil leaves"

Interactive Image Captioning Demo

Explore how one-to-many RNNs generate captions from visual input

Prepared by Dr. Gorkem Kar