TF-IDF vs. BM25
1. Introduction
In search engines, efficient querying and ranking of documents are crucial. This lesson explores two popular ranking methods: TF-IDF (Term Frequency-Inverse Document Frequency) and BM25 (Best Matching 25). Understanding their differences, strengths, and use cases is essential for building effective search algorithms.
2. TF-IDF
TF-IDF is a statistical measure that evaluates the importance of a word in a document relative to a collection (corpus) of documents. It is calculated from two components:
- Term Frequency (TF): Measures how frequently a term appears in a document:
  TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)
- Inverse Document Frequency (IDF): Measures the importance of the term across all documents; rare terms score higher:
  IDF(t) = log_e(Total number of documents / Number of documents containing term t)
- Finally, TF-IDF is the product of the two:
  TF-IDF(t, d) = TF(t, d) * IDF(t)
TF-IDF helps prioritize documents that contain the search terms more frequently while penalizing common words.
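To make these formulas concrete, here is a minimal from-scratch sketch in Python; the function names and the two-document toy corpus are illustrative, not a standard API:

import math

def tf(term, doc_tokens):
    # Term frequency: occurrences of the term divided by total tokens in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # Inverse document frequency: natural log of (corpus size / documents containing the term)
    # Assumes the term appears in at least one document, otherwise this divides by zero
    containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / containing)

corpus = [doc.split() for doc in ["the cat in the hat", "the hat"]]
print(tf("cat", corpus[0]) * idf("cat", corpus))  # TF-IDF of "cat" in the first document

Here "cat" has TF 1/5 in the first document and appears in one of two documents, so IDF is log_e(2) and the score is about 0.139.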
3. BM25
BM25 is an extension of TF-IDF and incorporates several improvements for ranking documents:
- Document Length Normalization: It adjusts for the length of documents, ensuring that longer documents do not have an unfair advantage.
- Term Saturation: It includes parameters to control how much additional occurrences of a term contribute to the score.
- The BM25 scoring formula for a term t and document d is:
  BM25(t, d) = IDF(t) * (TF(t, d) * (k1 + 1)) / (TF(t, d) + k1 * (1 - b + b * (len(d) / avglen)))
- Where:
  TF(t, d) = raw count of term t in document d
  k1 = 1.2 (a tuning parameter controlling term saturation)
  b = 0.75 (controls the influence of document length)
  len(d) = length of document d
  avglen = average document length in the collection
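The following is a minimal sketch of the per-term score in plain Python (the function name and the numeric inputs are illustrative); the loop demonstrates term saturation, with each additional occurrence adding less to the score:

def bm25_term_score(tf, idf, doc_len, avg_len, k1=1.2, b=0.75):
    # Saturating TF weight, scaled by document-length normalization
    norm = 1 - b + b * (doc_len / avg_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)

# Diminishing returns: doubling the term count does not double the score
for tf in (1, 2, 4, 8):
    print(tf, round(bm25_term_score(tf, idf=1.5, doc_len=100, avg_len=120), 3))

With these inputs the scores grow roughly 1.61, 2.16, 2.61, 2.92: the weight approaches a ceiling rather than rising linearly with the term count.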
4. Comparison
- TF-IDF is simpler and easier to implement.
- BM25 handles varying document lengths better.
- BM25 usually produces better rankings in practice, because term saturation prevents heavily repeated terms from dominating the score.
5. Implementation
Below is a Python code example demonstrating TF-IDF and BM25 ranking using the scikit-learn and rank_bm25 libraries:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from rank_bm25 import BM25Okapi

# Sample documents
documents = [
    "the cat in the hat",
    "the cat",
    "the hat",
    "a cat sat on the mat",
]

# TF-IDF: vectorize the corpus
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# BM25: tokenize the corpus and build the index
tokenized_corpus = [doc.split(" ") for doc in documents]
bm25 = BM25Okapi(tokenized_corpus)

# Query
query = "cat hat"
tokenized_query = query.split(" ")

# Score every document against the query with each method
tfidf_scores = cosine_similarity(tfidf_vectorizer.transform([query]), tfidf_matrix)[0]
bm25_scores = bm25.get_scores(tokenized_query)

print("TF-IDF scores:", tfidf_scores)
print("BM25 scores:", bm25_scores)
6. FAQ
What is the main advantage of BM25 over TF-IDF?
BM25 accounts for document length and term saturation, providing a more accurate ranking of documents in a search context.
Can TF-IDF still be useful in modern applications?
Yes, TF-IDF is still widely used for its simplicity and effectiveness in certain contexts, especially in smaller datasets.
Which algorithm should I use for my search engine?
For most applications, BM25 is the better default: its length normalization and term saturation make rankings more robust across corpora with varied document lengths.