CS5720 - Week 11
Slide 203 of 220

Tokenization and Vocabulary Building

What is Tokenization?

Tokenization is the process of breaking down text into smaller, meaningful units called tokens that machines can understand and process.
Why is it important?

• Computers don't understand words naturally
• We need to convert text into numbers
• Tokens are the basic units for NLP models
• Foundation for all text processing
• "Hello" → Word token
• "un" + "happy" → Subword tokens
• "H" "e" "l" "l" "o" → Character tokens
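The three granularities above can be sketched in a few lines of Python. The subword split of "unhappy" into "un" + "happy" is hand-picked here for illustration; real subword tokenizers (e.g. BPE) learn their splits from data:

```python
text = "unhappy"

# Word level: split on whitespace (the whole string is one word here)
word_tokens = text.split()            # ['unhappy']

# Subword level: illustrative split; learned tokenizers would derive this
subword_tokens = ["un", "happy"]

# Character level: every character is its own token
char_tokens = list(text)              # ['u', 'n', 'h', 'a', 'p', 'p', 'y']

print(word_tokens, subword_tokens, char_tokens)
```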

Tokenization Process

  1. Text Preprocessing: Clean and normalize the input text
  2. Text Splitting: Break text into candidate tokens
  3. Vocabulary Building: Create a dictionary of unique tokens
  4. Token Encoding: Convert tokens to numerical IDs
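The four steps above can be sketched as a minimal pipeline. The regex-based splitter and lowercase normalization are assumptions for illustration; production tokenizers use more sophisticated rules:

```python
import re

def preprocess(text):
    """Step 1: clean and normalize (lowercase, collapse whitespace)."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def split_tokens(text):
    """Step 2: break text into candidate tokens (words and punctuation)."""
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(tokens):
    """Step 3: map each unique token to an integer ID, in first-seen order."""
    return {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}

def encode(tokens, vocab):
    """Step 4: convert tokens to numerical IDs."""
    return [vocab[t] for t in tokens]

tokens = split_tokens(preprocess("Hello, hello world!"))
vocab = build_vocab(tokens)
print(tokens)                 # ['hello', ',', 'hello', 'world', '!']
print(encode(tokens, vocab))  # [0, 1, 0, 2, 3]
```

Note that the repeated "hello" maps to the same ID both times it occurs: the vocabulary stores each unique token once.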
Key Insight:
Good tokenization is crucial for model performance. Different strategies work better for different languages and tasks!

Interactive Tokenization Demo

(In the live slide, click "Tokenize Text" to tokenize the input and display the resulting tokens along with vocabulary statistics: total tokens, unique tokens, and vocabulary size.)
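The statistics the demo reports can be reproduced with a short sketch. The whitespace tokenizer here is an assumption (the demo may split differently), and vocabulary size equals the unique-token count because the vocabulary is built from this one input alone:

```python
from collections import Counter

def vocab_stats(text):
    # Assumed tokenization: lowercase + whitespace split
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {
        "total_tokens": len(tokens),       # all tokens, with repeats
        "unique_tokens": len(counts),      # distinct token types
        "vocabulary_size": len(counts),    # vocab built from this text only
    }

print(vocab_stats("the cat sat on the mat"))
# {'total_tokens': 6, 'unique_tokens': 5, 'vocabulary_size': 5}
```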
Prepared by Dr. Gorkem Kar