
BM25 Relevance Scoring

BM25 is a ranking function used by popular search engines such as Elasticsearch. It’s an evolution of the TF-IDF model for relevance scoring, and is defined as follows.

Given a query \(Q\), containing keywords \(q_1, q_2, …, q_n\), the BM25 score of a document \(D\) is:

$$ \text{score}(D,Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i,D) \cdot (k_1+1)}{f(q_i,D) + k_1 \cdot \left(1-b+b \cdot \frac{|D|}{\text{avgdl}}\right)} $$

where \(f(q_i, D)\) is the number of times that the keyword \(q_i\) occurs in the document \(D\), \(|D|\) is the length of the document \(D\) in words, and \(\text{avgdl}\) is the average document length in the text collection from which documents are drawn.
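
Before walking through the pieces, here is a minimal Python sketch of the formula exactly as written above. The function and argument names are illustrative (not taken from any particular search library), and the default \(k_1 = 1.5\) is simply a common choice within the usual 1.2–2.0 range.

import math

def idf(N, n_qi):
    # Inverse document frequency with the usual 0.5 smoothing (covered in detail below).
    return math.log((N - n_qi + 0.5) / (n_qi + 0.5) + 1)

def bm25_score(query_terms, doc_terms, doc_freqs, N, avgdl, k1=1.5, b=0.75):
    # query_terms: normalized query terms q_i
    # doc_terms:   normalized terms of document D
    # doc_freqs:   dict mapping term -> n(q_i), the number of docs containing it
    # N, avgdl:    collection size and average document length
    doc_len = len(doc_terms)                         # |D|
    score = 0.0
    for term in query_terms:
        f = doc_terms.count(term)                    # f(q_i, D)
        if f == 0:
            continue                                 # absent term contributes nothing
        norm = 1 - b + b * (doc_len / avgdl)         # length normalization
        tf = (f * (k1 + 1)) / (f + k1 * norm)        # saturated term frequency
        score += idf(N, doc_freqs.get(term, 0)) * tf
    return score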

To help understand the scoring function, we can visualize the algorithm as a series of steps:

    graph TD
    Query["Query Q<br/>(q₁, q₂, ..., qₙ)"] --> Terms["Query Terms<br/>qᵢ"]

    Collection["Document Collection<br/>N total documents"] --> GlobalStats["Global Statistics<br/>• n(qᵢ): docs containing qᵢ<br/>• avgdl: average doc length"]

    GlobalStats --> IDF["IDF(qᵢ)<br/>ln((N-n(qᵢ)+0.5)/<br/>(n(qᵢ)+0.5)+1)"]

    Terms --> IDF

    Documents["For each Document D"] --> DocStats["Document Statistics<br/>• f(qᵢ,D): term frequency<br/>• |D|: document length"]

    DocStats --> TF["TF(qᵢ)<br/>f(qᵢ,D)·(k₁+1) /<br/>(f(qᵢ,D) + k₁·LN)"]

    GlobalStats --> LengthNorm["Length Normalization (LN)<br/>1 - b + b·(|D|/avgdl)"]

    Params["Parameters<br/>k₁ ∈ [1.2, 2.0]<br/>b = 0.75"] --> LengthNorm
    LengthNorm --> TF

    IDF --> TermScore["Term Score<br/>IDF(qᵢ) × TF(qᵢ,D)"]
    TF --> TermScore

    TermScore --> Sum["Sum for all query terms<br/>Σᵢ₌₁ⁿ Term Scores"]

    Sum --> DocScore["BM25 Score(D,Q)<br/>for each document"]

    DocScore --> Rank["Sort Documents<br/>by BM25 Score"]

    Rank --> Output["Ranked Results<br/>Doc₁ (highest score)<br/>Doc₂<br/>Doc₃<br/>..."]

Query \(Q\)

The process begins with the user’s query \(Q\) containing the keywords \((q_1, q_2, \ldots, q_n)\) that they type into the search engine. The query is parsed and tokenized into individual search terms that will be matched against documents in the collection.

Example:

"running shoes for marathoners"

Query Terms \(q_i\)

After parsing, the query is broken up into individual keywords known as the query terms \(q_i\). Each term is analyzed separately to determine how relevant it is to the documents in the collection.

At this point, some common preprocessing may occur, such as stemming or the removal of stop words. This normalizes the query terms to match the form stored in the search index.

Example:

["running", "shoes", "marathoner"] → ["run", "shoe", "marathon"]

Notice that the stop word “for” has been removed, and each term has been stemmed to its root form by stripping its suffix.
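
A toy sketch of this normalization step; the stop-word list and stemming table below are hand-rolled stand-ins for a real analyzer, not the behavior of any particular library.

STOP_WORDS = {"for", "the", "a", "and"}          # tiny illustrative list
STEMS = {"running": "run", "shoes": "shoe",      # toy lookup table standing in
         "marathoners": "marathon"}              # for a real stemmer

def normalize_query(query):
    # Lowercase, tokenize, drop stop words, then stem each remaining term.
    tokens = query.lower().split()
    terms = [t for t in tokens if t not in STOP_WORDS]
    return [STEMS.get(t, t) for t in terms]

print(normalize_query("running shoes for marathoners"))
# ['run', 'shoe', 'marathon']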

Document Collection

This represents all the content available to the search engine (the entire corpus of data). It could be web pages, articles, or any text documents. The collection size \(N\) is crucial for calculating how important a query term is.

Global Statistics

These are metrics over the entire collection of documents, calculated once and used for scoring all search queries. They include:

  • \(n(q_i)\) - the number of documents containing each query term \(q_i\)
  • \(\text{avgdl}\) - the average document length in the corpus

They’re used to help normalize scores across documents of different lengths and term rarities.

In theory these values are recalculated whenever a document is added, removed, or updated. In practice this causes undue load on the cluster, so implementations commonly batch the recalculation at fixed time intervals or after a threshold percentage of documents has changed.

The trade-off is that global statistics may become stale in a corpus that’s constantly being updated. However, this has a negligible impact in larger collections with millions of documents.
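
A minimal sketch of how these statistics might be computed over an already-tokenized corpus; the corpus layout and function names here are assumptions for illustration.

from collections import Counter

def global_stats(corpus):
    # corpus: dict mapping doc_id -> list of normalized terms.
    # Returns (N, avgdl, doc_freqs), where doc_freqs[t] = n(t),
    # the number of documents containing term t.
    N = len(corpus)
    avgdl = sum(len(terms) for terms in corpus.values()) / N
    doc_freqs = Counter()
    for terms in corpus.values():
        doc_freqs.update(set(terms))   # count each document at most once per term
    return N, avgdl, doc_freqs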

IDF (Inverse Document Frequency)

This measures how rare and important a term is across the collection. It’s often computed with the formula:

$$ \text{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right) $$

Where \(N\) is the total number of documents in the collection, and \(n(q_i)\) is the number of documents containing term \(q_i\).

The formula includes smoothing factors (0.5 is the common default) to avoid division by zero and extreme values when, for example, a term doesn’t appear in any documents at all.

In general, IDF dictates that terms appearing in fewer documents are considered more valuable for determining relevance. As a result, common words like "the" have low IDF scores, while rarer, more specific terms score higher.

Assuming a collection of a million documents, IDF scores for our example (with “for” included to show stop word impact) might be:

{
  "run": 4.1,
  "shoe": 8.5,
  "for": 0.051,
  "marathon": 13.3
}
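
The figures above are illustrative. The sketch below plugs hypothetical document counts into the IDF formula, chosen only so the outputs land near those figures ("marathon" is taken to appear in a single document, which gives 13.41 rather than exactly 13.3).

import math

def idf(N, n_qi):
    return math.log((N - n_qi + 0.5) / (n_qi + 0.5) + 1)

N = 1_000_000
# Hypothetical document counts, not measured from any real corpus.
doc_counts = {"run": 16_500, "shoe": 200, "for": 950_000, "marathon": 1}

for term, n in doc_counts.items():
    print(term, round(idf(N, n), 2))
# run 4.1 / shoe 8.51 / for 0.05 / marathon 13.41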

For each Document \(D\)

A set of candidate documents is selected from the collection based on the query terms, using the inverted index. Each candidate document \(D\) then receives a BM25 score indicating its relevance to the query.

This is where the algorithm shifts from collection-level to document-level analysis.
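
A minimal sketch of this candidate selection, assuming the inverted index is available as a plain dictionary of posting sets; the names are illustrative, not any particular library’s API.

def candidate_docs(query_terms, inverted_index):
    # inverted_index: dict mapping term -> set of doc_ids containing that term.
    candidates = set()
    for term in query_terms:
        candidates |= inverted_index.get(term, set())   # union of posting lists
    return candidates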

Document Statistics

Document-level metrics extracted for scoring:

  • \(f(q_i, D)\) - The number of times each term appears in this specific document.
  • \(|D|\) - The total word count of the document.

These raw statistics will be normalized and weighted in subsequent steps. For example, a blog post about “Marathon Training” might have:

{
  "f('run', D)": 12,
  "f('shoe', D)": 3,
  "f('marathon', D)": 8,
  "document_length": 850
}
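
A sketch of how these per-document statistics could be gathered from an already-normalized term list; the toy document and helper name are assumptions for illustration.

from collections import Counter

def document_stats(doc_terms, query_terms):
    # doc_terms: list of normalized terms in document D.
    counts = Counter(doc_terms)
    freqs = {q: counts[q] for q in query_terms}   # f(q_i, D) per query term
    return freqs, len(doc_terms)                  # (term frequencies, |D|)

# Toy document; a real blog post would of course be much longer.
doc = "marathon training starts with a good run and the right shoe".split()
print(document_stats(doc, ["run", "shoe", "marathon"]))
# ({'run': 1, 'shoe': 1, 'marathon': 1}, 11)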

Parameters

These are tunable constants that control BM25’s behavior:

  • \(k_1\) - Controls term frequency saturation: how quickly additional term occurrences diminish in value; typically 1.2–2.0.
  • \(b\) - Controls document length normalization strength, from 0 (none) to 1 (full normalization), typically 0.75.

The default values work well across most collections, though they can be optimized for specific domains through experimentation with held-out test queries.

Length Normalization \(LN\)

Here the scores are adjusted to account for document length bias. Longer documents naturally contain more term occurrences, but aren’t necessarily more relevant. This component penalizes very long documents and boosts very short ones relative to the average length.

The parameter \(b\) controls how strongly length affects the final score.
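
For the 850-word example document above, and assuming a hypothetical average document length of 500 words with \(b = 0.75\):

$$ \text{LN} = 1 - b + b \cdot \frac{|D|}{\text{avgdl}} = 1 - 0.75 + 0.75 \cdot \frac{850}{500} = 0.25 + 1.275 = 1.525 $$

Because this document is longer than average, \(\text{LN} > 1\), which inflates the denominator of the term frequency component and pulls the score down.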

TF (Term Frequency)

This measures how often a query term appears in a specific document, with built-in saturation to ensure diminishing returns.

Unlike classic TF-IDF, BM25’s term frequency component makes the first occurrence of a term more valuable than subsequent ones. It ensures that additional occurrences still help but with decreasing impact, preventing keyword stuffing from overly inflating scores.
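
A quick sketch of the saturation effect, using \(k_1 = 1.5\) and a document of average length (so \(\text{LN} = 1\)); the helper name is purely illustrative.

def tf_component(f, k1=1.5, ln=1.0):
    # Saturated term frequency for a document of average length (LN = 1).
    return (f * (k1 + 1)) / (f + k1 * ln)

for f in (1, 2, 5, 10, 50):
    print(f, round(tf_component(f), 2))
# 1 1.0 / 2 1.43 / 5 1.92 / 10 2.17 / 50 2.43

The component can never exceed \(k_1 + 1\), so the fiftieth occurrence of a term barely moves the score compared with the first.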

Term Score

The relevance score of a single query term to a single document. It combines how important the term is globally (IDF) with its presence in the specific document (TF).

For our example, if “marathon” has IDF = 13.3 and a TF component of 0.8 for a document, then the term score is 13.3 × 0.8 = 10.64.

Each query term generates one term score per document, which represents how well that document satisfies that particular aspect of the query.

Sum for all query terms

The individual term scores are summed into a single document score. Documents matching multiple query terms score higher. This additive approach assumes query terms are independent and equally important.
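
Continuing the example, suppose (purely hypothetically) the term scores for this document were 3.5 for “run”, 7.2 for “shoe”, and 10.64 for “marathon”; the document score is simply their sum:

$$ \text{score}(D,Q) = 3.5 + 7.2 + 10.64 = 21.34 $$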

BM25 Score \((D,Q)\)

The final relevance score for document \(D\) given query \(Q\). Higher scores indicate greater predicted relevance.

This single number encapsulates term frequency, term rarity, and document length considerations. BM25 scores are comparable across documents, enabling ranking.

Sort Documents by BM25 Score

All the scored documents are ordered from highest to lowest BM25 score. This sorting step transforms individual scores into a ranked list. Documents with identical scores may be further ordered by secondary criteria like recency or popularity.
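
A minimal sketch of this final step, with hypothetical scores and recency used as the tie-breaker.

# Hypothetical (doc_id, bm25_score, recency) tuples.
scored = [("doc_a", 38.7, 2022), ("doc_b", 42.3, 2023), ("doc_c", 38.7, 2024)]

# Sort by BM25 score descending, breaking ties by recency.
ranked = sorted(scored, key=lambda d: (d[1], d[2]), reverse=True)
print([doc_id for doc_id, _, _ in ranked])
# ['doc_b', 'doc_c', 'doc_a']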

Ranked Results

Finally we get an ordered list of documents which are most likely to satisfy the user’s information need. The top-ranked documents are those that best balance having the query terms (especially rare ones), containing them multiple times (with diminishing returns), and being of appropriate length.

For our “running shoes for marathoners” query, top results might be:

  1. “Best Marathon Running Shoes 2023” (Score: 42.3)
  2. “Marathon Training: Choosing the Right Footwear” (Score: 38.7)
  3. “Runner’s Guide to Marathon Shoes” (Score: 35.1)

This ranked list is what users see as search results.