TF-IDF vs. BM25
1. Introduction
In search engines, efficient querying and ranking of documents are crucial. This lesson explores two popular ranking methods: TF-IDF (Term Frequency-Inverse Document Frequency) and BM25 (Best Matching 25). Understanding their differences, strengths, and use cases is essential for building effective search algorithms.
2. TF-IDF
TF-IDF is a statistical measure that evaluates the importance of a word in a document relative to a collection (corpus) of documents. It is calculated from two components:
- Term Frequency (TF): Measures how frequently a term appears in a document:
  TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)
- Inverse Document Frequency (IDF): Measures the importance of the term across all documents; rare terms score higher:
  IDF(t) = log_e(Total number of documents / Number of documents containing term t)
- Finally, TF-IDF is the product of the two:
  TF-IDF(t, d) = TF(t, d) * IDF(t)
TF-IDF helps prioritize documents that contain the search terms more frequently while penalizing common words.
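To make these formulas concrete, here is a minimal from-scratch sketch in Python; the function names and the two-document toy corpus are illustrative, not a standard API:

import math

def tf(term, doc_tokens):
    # Term frequency: occurrences of the term divided by total tokens in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # Inverse document frequency: natural log of (corpus size / documents containing the term)
    # Assumes the term appears in at least one document, otherwise this divides by zero
    containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / containing)

corpus = [doc.split() for doc in ["the cat in the hat", "the hat"]]
print(tf("cat", corpus[0]) * idf("cat", corpus))  # TF-IDF of "cat" in the first document

Here "cat" has TF 1/5 in the first document and appears in one of two documents, so IDF is log_e(2) and the score is about 0.139.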
3. BM25
BM25 is an extension of TF-IDF and incorporates several improvements for ranking documents:
- Document Length Normalization: It adjusts for the length of documents, ensuring that longer documents do not have an unfair advantage.
- Term Saturation: It includes parameters to control how much additional occurrences of a term contribute to the score.
- The BM25 scoring formula for a term t and document d is:
  BM25(t, d) = IDF(t) * (TF(t, d) * (k1 + 1)) / (TF(t, d) + k1 * (1 - b + b * (len(d) / avglen)))
- Where:
  TF(t, d) = raw count of term t in document d
  k1 = 1.2 (a tuning parameter controlling term saturation)
  b = 0.75 (controls the influence of document length)
  len(d) = length of document d
  avglen = average document length in the collection
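The following is a minimal sketch of the per-term score in plain Python (the function name and the numeric inputs are illustrative); the loop demonstrates term saturation, with each additional occurrence adding less to the score:

def bm25_term_score(tf, idf, doc_len, avg_len, k1=1.2, b=0.75):
    # Saturating TF weight, scaled by document-length normalization
    norm = 1 - b + b * (doc_len / avg_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)

# Diminishing returns: doubling the term count does not double the score
for tf in (1, 2, 4, 8):
    print(tf, round(bm25_term_score(tf, idf=1.5, doc_len=100, avg_len=120), 3))

With these inputs the scores grow roughly 1.61, 2.16, 2.61, 2.92: the weight approaches a ceiling rather than rising linearly with the term count.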
4. Comparison
- TF-IDF is simpler and easier to implement.
- BM25 handles varying document lengths better.
- BM25 usually produces better rankings in practice, because term saturation prevents heavily repeated terms from dominating the score.
5. Implementation
Below is a Python code example demonstrating TF-IDF and BM25 ranking using the scikit-learn and rank_bm25 libraries:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from rank_bm25 import BM25Okapi

# Sample documents
documents = [
    "the cat in the hat",
    "the cat",
    "the hat",
    "a cat sat on the mat",
]

# TF-IDF: vectorize the corpus
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# BM25: tokenize the corpus and build the index
tokenized_corpus = [doc.split(" ") for doc in documents]
bm25 = BM25Okapi(tokenized_corpus)

# Query
query = "cat hat"
tokenized_query = query.split(" ")

# Score every document against the query with each method
tfidf_scores = cosine_similarity(tfidf_vectorizer.transform([query]), tfidf_matrix)[0]
bm25_scores = bm25.get_scores(tokenized_query)

print("TF-IDF scores:", tfidf_scores)
print("BM25 scores:", bm25_scores)
6. FAQ
What is the main advantage of BM25 over TF-IDF?
BM25 accounts for document length and term saturation, providing a more accurate ranking of documents in a search context.
Can TF-IDF still be useful in modern applications?
Yes, TF-IDF is still widely used for its simplicity and effectiveness in certain contexts, especially in smaller datasets.
Which algorithm should I use for my search engine?
For most applications, BM25 is the better default: its length normalization and term saturation make rankings more robust across corpora with varied document lengths.