CS5720 - Week 11

Text Representation Methods

How do we convert text into numbers that machines can understand? Text representation is the bridge between human language and mathematical computation.

From simple counting methods to sophisticated neural embeddings, let's explore the evolution of text representation!

🎯 One-Hot Encoding

Simple binary vectors where each dimension represents one word in the vocabulary

Sparse, high-dimensional, no semantic meaning
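
As a concrete illustration, here is a minimal one-hot encoder in plain Python; the six-word vocabulary is a made-up toy example:

```python
# Minimal one-hot encoding sketch (toy vocabulary for illustration).
vocab = ["cat", "dog", "sat", "on", "the", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a binary vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("dog"))  # [0, 1, 0, 0, 0, 0] -- one dimension per vocabulary word
```

Note that the vector length grows with the vocabulary, and every pair of distinct words is equally far apart: no semantics are captured.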
🛍️ Bag of Words (BoW)

Count-based representation that captures word frequency but ignores order

Word frequency vectors, document classification
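
A bag-of-words sketch using only the Python standard library; the two-document corpus is invented for illustration:

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]  # toy corpus for illustration
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc):
    """Count each vocabulary word in the document; word order is discarded."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

for d in docs:
    print(bow_vector(d))
# vocab: ['cat', 'dog', 'mat', 'on', 'sat', 'the']
# -> [1, 0, 1, 1, 1, 2] and [0, 1, 0, 0, 1, 1]
```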
⚖️ TF-IDF

Weighted representation that scales term frequency by inverse document frequency, down-weighting terms that appear in many documents

Search engines, document similarity, information retrieval
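
A sketch using scikit-learn's TfidfVectorizer, assuming scikit-learn is installed; the core weighting idea is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is how many documents contain term t:

```python
# TF-IDF sketch with scikit-learn (assumes scikit-learn is installed).
# Core idea: tfidf(t, d) = tf(t, d) * log(N / df(t)), so terms appearing
# in many documents are down-weighted. (scikit-learn applies a smoothed
# variant of this formula.)
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat", "the dog chased the cat"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # sparse (n_docs, n_terms) matrix

print(vectorizer.get_feature_names_out())   # learned vocabulary
print(X.toarray().round(2))                 # one weighted row per document
```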
🧠 Word Embeddings

Dense vector representations that capture semantic relationships between words

Word2Vec, GloVe, FastText - semantic similarity
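
A Word2Vec sketch with gensim, assuming gensim is installed; real embeddings need a large training corpus, so the two toy sentences only show the API shape:

```python
# Word2Vec sketch with gensim (assumes gensim is installed; a useful model
# needs a large corpus -- this toy corpus is only for illustration).
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"][:5])                # dense 50-dim vector (first 5 values)
print(model.wv.similarity("cat", "dog"))  # cosine similarity between embeddings
```

Unlike one-hot vectors, these dense vectors place related words near each other, so similarity can be measured geometrically.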
🔄 Contextual Embeddings

Dynamic representations that change based on surrounding context

BERT, ELMo, GPT - context-aware meanings
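
A contextual-embedding sketch with Hugging Face transformers, assuming the transformers and torch packages are installed (the first run downloads bert-base-uncased); the two example sentences show the same word receiving different vectors:

```python
# Contextual embeddings sketch with Hugging Face transformers (assumes the
# transformers and torch packages are installed; downloads BERT weights).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The same word gets a different vector in each context.
for text in ["I deposited cash at the bank", "We picnicked on the river bank"]:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    idx = inputs.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
    print(text, "->", hidden[0, idx, :3])  # first 3 dims of the 'bank' vector
```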
🔤 Subword Representations

Character- or subword-level embeddings for handling out-of-vocabulary words

BPE, SentencePiece, character-level models
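
A subword sketch using the pretrained GPT-2 BPE tokenizer from transformers (assumed installed); it shows a rare word splitting into known pieces:

```python
# Subword tokenization sketch (assumes transformers is installed).
# GPT-2 uses byte-pair encoding (BPE): rare or unseen words are split into
# known subword pieces instead of becoming a single unknown token.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
for word in ["cat", "uncharacteristically"]:
    print(word, "->", tokenizer.tokenize(word))
# A long, rare word decomposes into several BPE pieces, so the model can
# represent words it never saw whole during training.
```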

Method Comparison Overview

Sparse Vectors
[0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
High-dimensional, mostly zeros

Dense Vectors
[0.2, -0.1, 0.8, 0.3, -0.5]
Lower-dimensional, meaningful values
Method          | Pros                            | Cons                          | Best Use Cases
One-Hot         | Simple, interpretable           | Sparse, no semantics          | Small vocabularies, categorical features
Bag of Words    | Fast, interpretable, baseline   | No word order, sparse         | Document classification, sentiment analysis
TF-IDF          | Handles common words well       | Still sparse, no semantics    | Information retrieval, search engines
Word Embeddings | Dense, semantic, efficient      | Fixed context, training needed | Most NLP tasks, similarity computation
Contextual      | Context-aware, SOTA performance | Computationally expensive     | Complex NLP, ambiguous words
Prepared by Dr. Gorkem Kar