BM25 Relevance Scoring

July 2, 2023

BM25 is the ranking algorithm used by most modern search engines, including Elasticsearch. It’s an evolution of the TF/IDF model for relevance scoring, and is defined as:

Given a query $Q$ , containing keywords $q_1, q_2, ..., q_n$ , the BM25 score of a document $D$ is:
$\text{score}(D,Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i,D) \cdot (k_1+1)}{f(q_i,D) + k_1 \cdot \left(1-b+b \cdot \frac{|D|}{\text{avgdl}}\right)}$
where $f(q_i, D)$ is the number of times that the keyword $q_i$ occurs in the document $D$ , $|D|$ is the length of the document $D$ in words, and $\text{avgdl}$ is the average document length in the text collection from which documents are drawn.

It helps to walk through the algorithm as a series of steps:

  graph TD
Query["<strong>Query Q</strong><br/><span class='mermaid-detail'>(q₁, q₂, ..., qₙ)</span>"] --> Terms["<strong>Query Terms</strong><br/><span class='mermaid-detail'>qᵢ</span>"]

    Collection["<strong>Document Collection</strong><br/><span class='mermaid-detail'>N total documents</span>"] --> GlobalStats["<strong>Global Statistics</strong><br/><span class='mermaid-detail'>n(qᵢ): docs containing qᵢ<br/>avgdl: average doc length</span>"]

    GlobalStats --> IDF["<strong>IDF(qᵢ)</strong><br/><span class='mermaid-detail'>ln((N-n(qᵢ)+0.5)/<br/>(n(qᵢ)+0.5)+1)</span>"]

    Terms --> IDF

    Documents["<strong>For each Document D</strong>"] --> DocStats["<strong>Document Statistics</strong><br/><span class='mermaid-detail'>f(qᵢ,D): term frequency<br/>|D|: document length</span>"]

    DocStats --> TF["<strong>TF(qᵢ)</strong><br/><span class='mermaid-detail'>f(qᵢ,D)·(k₁+1) /<br/>(f(qᵢ,D) + k₁·LN)</span>"]

    GlobalStats --> LengthNorm["<strong>Length Normalization</strong><br/><span class='mermaid-detail'>1 - b + b·(|D|/avgdl)</span>"]

    Params["<strong>Parameters</strong><br/><span class='mermaid-detail'>k₁ ∈ [1.2, 2.0]<br/>b = 0.75</span>"] --> LengthNorm
    LengthNorm --> TF

    IDF --> TermScore["<strong>Term Score</strong><br/><span class='mermaid-detail'>IDF(qᵢ) × TF(qᵢ,D)</span>"]
    TF --> TermScore

    TermScore --> Sum["<strong>Sum query terms</strong><br/><span class='mermaid-detail'>Σᵢ₌₁ⁿ term scores</span>"]

    Sum --> DocScore["<strong>BM25 Score(D,Q)</strong><br/><span class='mermaid-detail'>for each document</span>"]

    DocScore --> Rank["<strong>Sort Documents</strong><br/><span class='mermaid-detail'>by BM25 score</span>"]

    Rank --> Output["<strong>Ranked Results</strong><br/><span class='mermaid-detail'>Doc₁ highest score<br/>Doc₂<br/>Doc₃</span>"]

Query $Q$

The user’s query $Q$ is the starting input. It contains the keywords $(q_1, q_2, ..., q_n)$ that get parsed and tokenized into individual terms before they’re matched against documents in the collection.

Example:

"running shoes for marathoners"

Query terms $qᵢ$

Parsing breaks the query into individual keywords called the query terms $q_i$ . Each one is analyzed separately against the documents. Typical preprocessing happens at this stage too: stemming and removal of stop words so the query terms line up with what’s actually stored in the index.

Example:

["running", "shoes", "marathoner"] → ["run", "shoe", "marathon"]

The stop word for has been removed, and each remaining term has been stemmed to its root form.

Document collection

The collection is everything the search engine has indexed: web pages, articles, product catalogs, or whatever the domain is. Its size $N$ feeds directly into how rare and how important each query term is judged to be.

Global statistics

Two collection-wide metrics get computed once and reused across every query: $n(q_i)$ , the number of documents containing each query term, and $\text{avgdl}$ , the average document length in the corpus. They’re what lets BM25 normalize scores across documents of varying length and across terms of varying rarity.

In theory these values are recalculated whenever a document is added or removed. In practice that puts too much load on the cluster, so implementations batch the calculation on a schedule or after a percentage of documents has changed.

The trade-off is that the statistics drift behind reality on a corpus that’s constantly being updated. For collections in the millions, that drift has a negligible effect on scoring.

IDF (inverse document frequency)

IDF measures how rare a term is across the collection. It’s often computed with the formula:

\text{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)

Where $N$ is the total number of documents in the collection, and $n(qᵢ)$ is the number of documents containing term $qᵢ$ .

The smoothing factor (0.5 is the common default) keeps the formula well-behaved when a term appears in no documents at all or in every document.

The result is that rarer terms get higher IDF scores. Common words like "the" barely contribute, and specific terms drive most of the ranking signal.

Assuming a collection of a million documents, IDF scores for our example (with "for" included to show stop word impact) might be:

{
  "run": 4.1,
  "shoe": 8.5,
  "for": 0.051,
  "marathon": 13.3
}

For each document $D$

The inverted index picks candidate documents that contain any of the query terms. Each candidate then gets a BM25 score. This is where the algorithm shifts from collection-level to document-level analysis.

Document statistics

For each candidate document, BM25 needs two more numbers: $f(q_i, D)$ , the number of times a term appears in this document, and $|D|$ , the document’s total word count.

For example, a blog post about “Marathon Training” might have:

{
  "f('run', D)": 12,
  "f('shoe', D)": 3,
  "f('marathon', D)": 8,
  "document_length": 850
}

Parameters

BM25 has two tunable parameters. $k_1$ , typically between 1.2 and 2.0, controls how quickly extra occurrences of a term in a document stop adding to the score. $b$ , typically 0.75, controls how strongly document length affects the score: at 0 length has no effect, at 1 it fully normalizes.

The defaults work for most collections. Tuning them past the defaults usually means a held-out test set and an offline evaluation loop.

Length normalization $LN$

Longer documents naturally contain more occurrences of any given term, but aren’t necessarily more relevant for it. Length normalization adjusts for this bias, penalizing documents above the average length and boosting those below it.

TF (term frequency)

TF measures how often a query term appears in a specific document, with saturation built in so the first few occurrences matter much more than the next few.

Unlike classic TF-IDF, BM25’s TF curve flattens out as occurrences pile up. Extra instances of the same term still contribute, but with diminishing returns, which makes it harder to game the score by stuffing keywords.

Term score

The term score combines global rarity (IDF) with how often the term appears in this document (TF). For the example above, if “marathon” has IDF = 13.3 and TF = 0.8 in some document, the term score for “marathon” on that document is 13.3 × 0.8 = 10.64.

Every query term produces one term score per document. Each is a measurement of how well that document matches that part of the query.

Sum for all query terms

Term scores are summed into a single document score. Documents that match more query terms score higher. The additive shape of the sum assumes query terms are independent and equally weighted.

BM25 score $(D,Q)$

That sum is the final BM25 score for document $D$ given query $Q$ . Higher means more relevant. The score folds term rarity and frequency together with document-length normalization, in one number that ranks consistently across the collection.

Sort documents by BM25 score

Sort the scored documents from highest to lowest. Ties can be broken with secondary criteria like recency or popularity, depending on what the catalog values.

Ranked results

The output is an ordered list of documents likely to match the user’s intent. The top-ranked ones tend to contain the rarer query terms more than once, and to fall near the average length of the collection.

For our “running shoes for marathoners” query, top results might be:

“Best Marathon Running Shoes 2023” (Score: 42.3)
“Marathon Training: Choosing the Right Footwear” (Score: 38.7)
“Runner’s Guide to Marathon Shoes” (Score: 35.1)