MachinaLearning

Introduction

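TF-IDF (term frequency-inverse document frequency) is a numerical statistic that reflects how important a word is to a document within a collection of documents (a corpus). The score rises with the number of times the word appears in the document and is offset by the number of documents in the corpus that contain the word, so words that are common everywhere score low.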

The TF-IDF Formula

TF-IDF(t,d,D) = TF(t,d) × IDF(t,D)

Term Frequency (TF)

How frequently a term appears in a document.

Raw Count: f(t,d)

Log Normalization: 1 + log(f(t,d)) if f(t,d) > 0, otherwise 0

Normalized: f(t,d) / |d|
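
The three TF variants above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `term_frequency`, the whitespace tokenization, and the use of the natural logarithm are all assumptions, since the formulas do not fix a log base.

```python
import math
from collections import Counter

def term_frequency(term, tokens, variant="raw"):
    """Return TF for `term` in a tokenized document under one of three variants."""
    count = Counter(tokens)[term]  # f(t,d): raw occurrences of the term
    if variant == "raw":
        return float(count)
    if variant == "log":
        # 1 + log(f(t,d)) is only defined for f(t,d) > 0; absent terms score 0
        return 1.0 + math.log(count) if count > 0 else 0.0
    if variant == "normalized":
        # divide by document length |d| (number of tokens)
        return count / len(tokens) if tokens else 0.0
    raise ValueError(f"unknown TF variant: {variant}")

doc = "the cat sat on the mat".split()
print(term_frequency("the", doc, "raw"))         # 2.0
print(term_frequency("the", doc, "log"))         # 1 + ln(2)
print(term_frequency("the", doc, "normalized"))  # 2/6
```

Log normalization dampens the effect of repeated occurrences, while the normalized variant makes scores comparable across documents of different lengths.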

Inverse Document Frequency (IDF)

How rare or common a term is across all documents.

Standard: log(N / df(t))

Smooth: log((N+1) / (df(t)+1)) + 1

Max: log(1 + max(df) / df(t))

Where:

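t = term, d = document, D = collection of documents
f(t,d) = number of times term t occurs in document d
|d| = total number of terms in document d
N = total number of documents in D
df(t) = number of documents containing term t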
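
To make the definitions concrete, here is a small sketch that computes the three IDF variants and a full TF-IDF score on a toy corpus. Several things are assumptions for illustration: the three example documents, the helper names `idf` and `tf_idf`, the natural logarithm, reading max(df) as the largest document frequency of any term in the corpus, and returning 0 for terms that never occur (which avoids a division by zero in the standard and max variants).

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

N = len(docs)
# df(t): number of documents that contain term t at least once
df = Counter(term for doc in docs for term in set(doc))
# assumption: max(df) is taken over every term in the corpus
max_df = max(df.values())

def idf(term, variant="standard"):
    if variant not in ("standard", "smooth", "max"):
        raise ValueError(f"unknown IDF variant: {variant}")
    d = df[term]  # Counter returns 0 for unseen terms
    if variant == "smooth":
        return math.log((N + 1) / (d + 1)) + 1
    if d == 0:
        return 0.0  # assumption: unseen terms get zero weight
    if variant == "standard":
        return math.log(N / d)
    return math.log(1 + max_df / d)

def tf_idf(term, doc, variant="standard"):
    tf = doc.count(term) / len(doc)  # normalized TF: f(t,d) / |d|
    return tf * idf(term, variant)

for term in ("the", "cat"):
    print(term, round(tf_idf(term, docs[0]), 4))
```

In the first document "the" occurs twice and "cat" once, yet "cat" receives the higher score because it appears in only one of the three documents.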

Applications of TF-IDF

Information Retrieval

Search engines use TF-IDF to rank documents by relevance to search queries.

• Document ranking
• Query expansion
• Relevance scoring
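
One common way to turn TF-IDF into a relevance score is to represent the query and each document as TF-IDF vectors and rank by cosine similarity. The sketch below assumes raw-count TF with smooth IDF; the helper names (`build_idf`, `vectorize`, `cosine`, `rank`) and the toy corpus are hypothetical, and production search engines typically use more elaborate scoring.

```python
import math
from collections import Counter

def build_idf(docs):
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    # smooth IDF so terms that occur in every document keep a nonzero weight
    return {t: math.log((n + 1) / (d + 1)) + 1 for t, d in df.items()}

def vectorize(tokens, idf):
    counts = Counter(tokens)
    # terms missing from the corpus vocabulary are dropped
    return {t: c * idf[t] for t, c in counts.items() if t in idf}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    idf = build_idf(docs)
    q = vectorize(query, idf)
    scores = [(cosine(q, vectorize(d, idf)), i) for i, d in enumerate(docs)]
    return sorted(scores, reverse=True)

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock prices fell sharply today".split(),
]
ranking = rank("cat on mat".split(), corpus)
for score, index in ranking:
    print(index, round(score, 3))
```

The first document shares three query terms and ranks highest; the third shares none and scores zero.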

Text Mining

Extract key terms and themes from large document collections.

• Keyword extraction
• Document summarization
• Topic modeling preprocessing
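
Keyword extraction with TF-IDF can be as simple as scoring every term in a document and keeping the top few. The following is a rough sketch using normalized TF and standard IDF; the function `top_keywords`, the example sentences, and the alphabetical tie-break are assumptions made for illustration.

```python
import math
from collections import Counter

def top_keywords(docs, doc_index, k=3):
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    doc = docs[doc_index]
    counts = Counter(doc)
    scores = {
        t: (c / len(doc)) * math.log(n / df[t])  # normalized TF x standard IDF
        for t, c in counts.items()
    }
    # sort by score descending, then alphabetically so ties are deterministic
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

docs = [
    "the rocket launch put the satellite in orbit".split(),
    "the chef put the sauce in the pan".split(),
    "the satellite sent the image to the station".split(),
]
print(top_keywords(docs, 0))
```

Because "the" occurs in every document, its standard IDF is log(1) = 0 and it drops to the bottom of the ranking without any stop-word list.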

Machine Learning

Feature engineering for text classification and clustering tasks.

• Text classification
• Document clustering
• Content recommendation
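
As a feature-engineering illustration, the sketch below builds length-normalized TF-IDF vectors and classifies a new text by its most similar training example (a 1-nearest-neighbor rule). The tiny labeled dataset and the names `features` and `predict` are invented for this example; real projects would usually rely on an established library and a proper train/test split.

```python
import math
from collections import Counter

# hypothetical toy training set: (tokens, label)
train = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches now".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("lunch meeting on monday".split(), "ham"),
]

n = len(train)
df = Counter(t for tokens, _ in train for t in set(tokens))
idf = {t: math.log((n + 1) / (d + 1)) + 1 for t, d in df.items()}  # smooth IDF

def features(tokens):
    """TF-IDF vector scaled to unit length; unknown terms are ignored."""
    counts = Counter(tokens)
    vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def predict(tokens):
    x = features(tokens)

    def similarity(item):
        y = features(item[0])
        # dot product of unit vectors equals cosine similarity
        return sum(w * y.get(t, 0.0) for t, w in x.items())

    # label of the most similar training example
    return max(train, key=similarity)[1]

print(predict("buy pills".split()))
print(predict("monday meeting notes".split()))
```

The same feature vectors could feed a clustering algorithm or a linear classifier instead of the nearest-neighbor rule used here.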

Advantages and Limitations

✓ Advantages

• Balances local and global word importance
• Reduces impact of common words automatically
• Simple to implement and interpret
• Works well for keyword extraction
• Language-agnostic approach

✗ Limitations

• Still ignores word order and context
• No semantic understanding
• Assumes term independence
• Biased toward longer documents when raw counts are used without length normalization
• Sparse, high-dimensional vectors
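
Two of these limitations are easy to see in a short experiment. The sketch below (helper names `raw_tf` and `normalized_tf` and the sample phrases are assumptions) shows that sentences with opposite meanings get identical term counts, and that raw counts grow with document length while normalized TF does not.

```python
from collections import Counter

def raw_tf(tokens):
    return dict(Counter(tokens))

def normalized_tf(tokens):
    return {t: c / len(tokens) for t, c in Counter(tokens).items()}

# Word order is ignored: both sentences map to the same counts,
# so any TF-IDF weighting derived from them is identical as well.
a = "dog bites man".split()
b = "man bites dog".split()
same_counts = raw_tf(a) == raw_tf(b)

# Length bias: repeating a document five times multiplies raw counts by five,
# while length-normalized TF stays the same.
short_doc = "solar panel".split()
long_doc = short_doc * 5
raw_ratio = raw_tf(long_doc)["solar"] / raw_tf(short_doc)["solar"]
norm_equal = normalized_tf(long_doc) == normalized_tf(short_doc)

print(same_counts, raw_ratio, norm_equal)
```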

Key Takeaways

• TF-IDF weighs a term's frequency within a document against how many documents contain it
• A high TF-IDF score indicates a term that is frequent in one document but rare across the collection
• Widely used in search engines and information retrieval systems
• Forms the basis for many text mining and NLP applications
• Simple yet effective for identifying important terms in documents
• Can be combined with other techniques for better performance

TF-IDF (Term Frequency-Inverse Document Frequency)