TF-IDF (Term Frequency-Inverse Document Frequency)
Introduction
TF-IDF is a numerical statistic that reflects how important a word is to a document within a collection of documents (a corpus). The score increases with the number of times the word appears in the document and is offset by the number of documents in the corpus that contain the word, so words that occur almost everywhere receive low scores.
The TF-IDF Formula
TF-IDF(t,d,D) = TF(t,d) × IDF(t,D)
Term Frequency (TF)
How frequently a term appears in a document.
Raw Count: f(t,d)
Log Normalization: 1 + log(f(t,d)) if f(t,d) > 0, otherwise 0
Normalized: f(t,d) / |d|
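The three TF variants above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and the natural logarithm; the function name `term_frequencies` and the sample sentence are made up for the example.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """Return raw, log-normalized, and length-normalized TF for each term."""
    counts = Counter(tokens)
    total = len(tokens)  # |d|: number of tokens in the document
    return {
        term: {
            "raw": count,                  # f(t,d)
            "log": 1 + math.log(count),    # count >= 1 for every term present
            "normalized": count / total,   # f(t,d) / |d|
        }
        for term, count in counts.items()
    }

doc = "the cat sat on the mat".split()
tf = term_frequencies(doc)
print(tf["the"])  # raw count 2, log value 1 + ln(2), normalized 2/6
print(tf["cat"])  # raw count 1, log value 1.0, normalized 1/6
```

Terms that do not occur in the document are simply absent from the result, which corresponds to a TF of 0 in all three variants.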
Inverse Document Frequency (IDF)
How rare or common a term is across all documents.
Standard: log(N / df(t))
Smooth: log((N+1) / (df(t)+1)) + 1
Max: log(1 + max(df) / df(t))
Where:
- t = term, d = document, D = collection of documents
- N = total number of documents
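- df(t) = number of documents containing term t
- f(t,d) = number of times term t occurs in document d
- |d| = total number of terms in document d
- max(df) = largest document frequency of any term in D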
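To make the definitions concrete, the following sketch computes df(t), the three IDF variants, and a TF-IDF score on a toy corpus. It assumes whitespace tokenization, the natural logarithm, and length-normalized TF combined with standard IDF; the helper names and the three sample documents are illustrative only.

```python
import math

def document_frequencies(docs):
    """Count, for each term, how many documents contain it (df)."""
    df = {}
    for tokens in docs:
        for term in set(tokens):  # set(): each document counts once per term
            df[term] = df.get(term, 0) + 1
    return df

def idf_variants(term, df, n_docs):
    """Return the three IDF variants for a term with df(t) >= 1."""
    d = df[term]
    return {
        "standard": math.log(n_docs / d),
        "smooth": math.log((n_docs + 1) / (d + 1)) + 1,
        "max": math.log(1 + max(df.values()) / d),
    }

def tf_idf(term, tokens, df, n_docs):
    """TF-IDF with length-normalized TF and standard IDF."""
    tf = tokens.count(term) / len(tokens)
    return tf * math.log(n_docs / df[term])

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
df = document_frequencies(docs)
N = len(docs)

print(idf_variants("the", df, N))     # "the" is in every document
print(tf_idf("the", docs[0], df, N))  # 0.0, because log(3/3) = 0
print(tf_idf("mat", docs[0], df, N))  # (1/6) * log(3)
```

With standard IDF, a term that appears in every document scores exactly zero no matter how often it occurs. The smooth variant avoids both that hard zero and a division by zero for terms with df(t) = 0.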
[Figure: TF-IDF matrix heatmap. Brighter colors indicate higher scores.]
Applications of TF-IDF
Information Retrieval
Search engines use TF-IDF to rank documents by relevance to search queries.
- Document ranking
- Query expansion
- Relevance scoring
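One common way to turn TF-IDF into a relevance score is to represent the query and each document as a TF-IDF vector and rank documents by cosine similarity. The sketch below assumes that approach, lowercased whitespace tokenization, and the smooth IDF variant; production search engines typically use more elaborate scoring, and all names here are hypothetical.

```python
import math
from collections import Counter

def build_idf(docs):
    """Smooth IDF for every term seen in the corpus."""
    n = len(docs)
    df = Counter(term for tokens in docs for term in set(tokens))
    return {t: math.log((n + 1) / (d + 1)) + 1 for t, d in df.items()}

def vectorize(tokens, idf):
    """Sparse TF-IDF vector (dict) using length-normalized TF.
    Terms unknown to the corpus are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    return {t: (c / total) * idf[t] for t, c in counts.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def search(query, corpus):
    """Return (score, document index) pairs, best match first."""
    docs = [text.lower().split() for text in corpus]
    idf = build_idf(docs)
    q = vectorize(query.lower().split(), idf)
    scored = [(cosine(q, vectorize(d, idf)), i) for i, d in enumerate(docs)]
    return sorted(scored, reverse=True)

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang a song",
]
results = search("dog cat", corpus)
for score, index in results:
    print(f"{score:.3f}  {corpus[index]}")
```

The document containing both query terms ranks first, the one containing only "cat" ranks second, and the document sharing no terms with the query scores 0.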
Text Mining
Extract key terms and themes from large document collections.
- Keyword extraction
- Document summarization
- Topic modeling preprocessing
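Keyword extraction, for instance, can be approximated by taking the highest-scoring terms of each document. The sketch below assumes length-normalized TF with standard IDF and breaks ties alphabetically; `top_keywords` and the toy documents are invented for illustration.

```python
import math
from collections import Counter

def top_keywords(docs, k=2):
    """Return the k highest-scoring TF-IDF terms for each tokenized document."""
    n = len(docs)
    df = Counter(term for tokens in docs for term in set(tokens))
    results = []
    for tokens in docs:
        counts = Counter(tokens)
        scores = {
            t: (c / len(tokens)) * math.log(n / df[t])
            for t, c in counts.items()
        }
        # Sort by descending score, then alphabetically for stable ties.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:k])
    return results

docs = [
    "python python code tutorial the".split(),
    "the football match football score".split(),
    "the recipe pasta pasta sauce".split(),
]
print(top_keywords(docs, k=2))
# [['python', 'code'], ['football', 'match'], ['pasta', 'recipe']]
```

The word "the" occurs in every document, so its standard IDF is 0 and it never surfaces as a keyword, even without a stop-word list.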
Machine Learning
Feature engineering for text classification and clustering tasks.
- Text classification
- Document clustering
- Content recommendation
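For these tasks, TF-IDF usually serves as the feature matrix: one row per document, one column per vocabulary term. The following sketch builds such a matrix in plain Python, assuming raw-count TF, smooth IDF, and L2-normalized rows. Libraries such as scikit-learn provide a ready-made `TfidfVectorizer` for this purpose; the hand-rolled `tfidf_matrix` below is only an illustration and is not claimed to match any library's output exactly.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] is the L2-normalized
    TF-IDF weight of vocab[j] in document i."""
    vocab = sorted({t for tokens in docs for t in tokens})
    n = len(docs)
    df = Counter(t for tokens in docs for t in set(tokens))
    idf = {t: math.log((n + 1) / (df[t] + 1)) + 1 for t in vocab}
    matrix = []
    for tokens in docs:
        counts = Counter(tokens)
        row = [counts[t] * idf[t] for t in vocab]  # 0 for absent terms
        norm = math.sqrt(sum(x * x for x in row))
        matrix.append([x / norm if norm else 0.0 for x in row])
    return vocab, matrix

docs = [
    "cheap pills buy now".split(),
    "meeting agenda for monday".split(),
    "buy cheap watches now".split(),
]
vocab, X = tfidf_matrix(docs)
print(vocab)
for row in X:
    print([round(x, 2) for x in row])
```

Each row can be passed to a classifier or clustering algorithm as a feature vector. Most entries are zero, because each document uses only a small part of the vocabulary.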
Advantages and Limitations
✓ Advantages
- Balances local and global word importance
- Reduces impact of common words automatically
- Simple to implement and interpret
- Works well for keyword extraction
- Language-agnostic approach
✗ Limitations
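- Ignores word order and context, like any bag-of-words representation
- No semantic understanding (synonyms are treated as unrelated terms)
- Assumes term independence
- Biased toward longer documents when raw term counts are used without length normalization
- Sparse, high-dimensional vectors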
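Two of these limitations are easy to see in code. The sketch below, using made-up example sentences, shows that reordering words leaves the term counts unchanged, and that raw counts grow with document length while length-normalized TF does not.

```python
from collections import Counter

# Word order is ignored: both sentences yield identical term counts,
# so any TF-IDF vector built from them is identical too.
a = "dog bites man".split()
b = "man bites dog".split()
same_counts = Counter(a) == Counter(b)
print(same_counts)  # True

# Length bias: repeating a document triples the raw count of every term,
# while the length-normalized TF stays the same.
short = "tf idf ranks terms".split()
long = short * 3
raw_short = Counter(short)["idf"]
raw_long = Counter(long)["idf"]
norm_short = raw_short / len(short)
norm_long = raw_long / len(long)
print(raw_short, raw_long)    # 1 3
print(norm_short, norm_long)  # 0.25 0.25
```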
Key Takeaways
- TF-IDF balances term frequency with document frequency
- High TF-IDF indicates a term is frequent in a document but rare overall
- Widely used in search engines and information retrieval systems
- Forms the basis for many text mining and NLP applications
- Simple yet effective for identifying important terms in documents
- Can be combined with other techniques for better performance