CS5720 - Week 11

Text Representation Methods

How do we convert text into numbers that machines can understand? Text representation is the bridge between human language and mathematical computation.

From simple counting methods to sophisticated neural embeddings, let's explore the evolution of text representation!

🎯 One-Hot Encoding

Simple binary vectors where each dimension represents one word in the vocabulary

Sparse, high-dimensional, no semantic meaning
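
As a concrete illustration, here is a minimal one-hot encoder in plain Python; the six-word vocabulary is a made-up toy example:

```python
# Minimal one-hot encoding sketch (toy vocabulary for illustration).
vocab = ["cat", "dog", "sat", "on", "the", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a binary vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("dog"))  # [0, 1, 0, 0, 0, 0] -- one dimension per vocabulary word
```

Note that the vector length grows with the vocabulary, and every pair of distinct words is equally far apart: no semantics are captured.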
🛍️ Bag of Words (BoW)

Count-based representation that captures word frequency but ignores order

Word frequency vectors, document classification
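
A bag-of-words sketch using only the Python standard library; the two-document corpus is invented for illustration:

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]  # toy corpus for illustration
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc):
    """Count each vocabulary word in the document; word order is discarded."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

for d in docs:
    print(bow_vector(d))
# vocab: ['cat', 'dog', 'mat', 'on', 'sat', 'the']
# -> [1, 0, 1, 1, 1, 2] and [0, 1, 0, 0, 1, 1]
```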
⚖️ TF-IDF

Weighted representation that scales term frequency by inverse document frequency, down-weighting terms that appear in many documents

Search engines, document similarity, information retrieval
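
A sketch using scikit-learn's TfidfVectorizer, assuming scikit-learn is installed; the core weighting idea is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is how many documents contain term t:

```python
# TF-IDF sketch with scikit-learn (assumes scikit-learn is installed).
# Core idea: tfidf(t, d) = tf(t, d) * log(N / df(t)), so terms appearing
# in many documents are down-weighted. (scikit-learn applies a smoothed
# variant of this formula.)
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat", "the dog chased the cat"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # sparse (n_docs, n_terms) matrix

print(vectorizer.get_feature_names_out())   # learned vocabulary
print(X.toarray().round(2))                 # one weighted row per document
```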
🧠 Word Embeddings

Dense vector representations that capture semantic relationships between words

Word2Vec, GloVe, FastText - semantic similarity
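
A Word2Vec sketch with gensim, assuming gensim is installed; real embeddings need a large training corpus, so the two toy sentences only show the API shape:

```python
# Word2Vec sketch with gensim (assumes gensim is installed; a useful model
# needs a large corpus -- this toy corpus is only for illustration).
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"][:5])                # dense 50-dim vector (first 5 values)
print(model.wv.similarity("cat", "dog"))  # cosine similarity between embeddings
```

Unlike one-hot vectors, these dense vectors place related words near each other, so similarity can be measured geometrically.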
🔄 Contextual Embeddings

Dynamic representations that change based on surrounding context

BERT, ELMo, GPT - context-aware meanings
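
A contextual-embedding sketch with Hugging Face transformers, assuming the transformers and torch packages are installed (the first run downloads bert-base-uncased); the two example sentences show the same word receiving different vectors:

```python
# Contextual embeddings sketch with Hugging Face transformers (assumes the
# transformers and torch packages are installed; downloads BERT weights).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The same word gets a different vector in each context.
for text in ["I deposited cash at the bank", "We picnicked on the river bank"]:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    idx = inputs.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
    print(text, "->", hidden[0, idx, :3])  # first 3 dims of the 'bank' vector
```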
🔤 Subword Representations

Character- or subword-level embeddings for handling out-of-vocabulary words

BPE, SentencePiece, character-level models
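
A subword sketch using the pretrained GPT-2 BPE tokenizer from transformers (assumed installed); it shows a rare word splitting into known pieces:

```python
# Subword tokenization sketch (assumes transformers is installed).
# GPT-2 uses byte-pair encoding (BPE): rare or unseen words are split into
# known subword pieces instead of becoming a single unknown token.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
for word in ["cat", "uncharacteristically"]:
    print(word, "->", tokenizer.tokenize(word))
# A long, rare word decomposes into several BPE pieces, so the model can
# represent words it never saw whole during training.
```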

Method Comparison Overview

Sparse Vectors
[0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
High-dimensional, mostly zeros

Dense Vectors
[0.2, -0.1, 0.8, 0.3, -0.5]
Lower-dimensional, meaningful values
Method          | Pros                            | Cons                          | Best Use Cases
One-Hot         | Simple, interpretable           | Sparse, no semantics          | Small vocabularies, categorical features
Bag of Words    | Fast, interpretable, baseline   | No word order, sparse         | Document classification, sentiment analysis
TF-IDF          | Handles common words well       | Still sparse, no semantics    | Information retrieval, search engines
Word Embeddings | Dense, semantic, efficient      | Fixed context, training needed | Most NLP tasks, similarity computation
Contextual      | Context-aware, SOTA performance | Computationally expensive     | Complex NLP, ambiguous words
Prepared by Dr. Gorkem Kar