BM25 Relevance Scoring
BM25 is the ranking algorithm used by most modern search engines, including Elasticsearch. It’s an evolution of the TF/IDF model for relevance scoring, and is defined as:
Given a query , containing keywords , the BM25 score of a document is:
where is the number of times that the keyword occurs in the document , is the length of the document in words, and is the average document length in the text collection from which documents are drawn.
It helps to walk through the algorithm as a series of steps:
graph TD
Query["Query Q<br/>(q₁, q₂, ..., qₙ)"] --> Terms["Query Terms<br/>qᵢ"]
Collection["Document Collection<br/>N total documents"] --> GlobalStats["Global Statistics<br/>• n(qᵢ): docs containing qᵢ<br/>• avgdl: average doc length"]
GlobalStats --> IDF["IDF(qᵢ)<br/>ln((N-n(qᵢ)+0.5)/<br/>(n(qᵢ)+0.5)+1)"]
Terms --> IDF
Documents["For each Document D"] --> DocStats["Document Statistics<br/>• f(qᵢ,D): term frequency<br/>• |D|: document length"]
DocStats --> TF["TF(qᵢ)<br/>f(qᵢ,D)·(k₁+1) /<br/>(f(qᵢ,D) + k₁·LN)"]
GlobalStats --> LengthNorm["Length Normalization (LN)<br/>1 - b + b·(|D|/avgdl)"]
Params["Parameters<br/>k₁ ∈ [1.2, 2.0]<br/>b = 0.75"] --> LengthNorm
LengthNorm --> TF
IDF --> TermScore["Term Score<br/>IDF(qᵢ) × TF(qᵢ,D)"]
TF --> TermScore
TermScore --> Sum["Sum for all query terms<br/>Σᵢ₌₁ⁿ Term Scores"]
Sum --> DocScore["BM25 Score(D,Q)<br/>for each document"]
DocScore --> Rank["Sort Documents<br/>by BM25 Score"]
Rank --> Output["Ranked Results<br/>Doc₁ (highest score)<br/>Doc₂<br/>Doc₃<br/>..."]
Query
The user’s query is the starting input. It contains the keywords that get parsed and tokenized into individual terms before they’re matched against documents in the collection.
Example:
"running shoes for marathoners"
Query terms
Parsing breaks the query into individual keywords called the query terms . Each one is analyzed separately against the documents. Typical preprocessing happens at this stage too: stemming and removal of stop words so the query terms line up with what’s actually stored in the index.
Example:
["running", "shoes", "marathoner"] → ["run", "shoe", "marathon"]
The stop word for has been removed, and each remaining term has been
stemmed to its root form.
Document collection
The collection is everything the search engine has indexed: web pages, articles, product catalogs, or whatever the domain is. Its size feeds directly into how rare and how important each query term is judged to be.
Global statistics
Two collection-wide metrics get computed once and reused across every query: , the number of documents containing each query term, and , the average document length in the corpus. They’re what lets BM25 normalize scores across documents of varying length and across terms of varying rarity.
In theory these values are recalculated whenever a document is added or removed. In practice that puts too much load on the cluster, so implementations batch the calculation on a schedule or after a percentage of documents has changed.
The trade-off is that the statistics drift behind reality on a corpus that’s constantly being updated. For collections in the millions, that drift has a negligible effect on scoring.
IDF (inverse document frequency)
IDF measures how rare a term is across the collection. It’s often computed with the formula:
Where is the total number of documents in the collection, and is the number of documents containing term .
The smoothing factor (0.5 is the common default) keeps the formula
well-behaved when a term appears in no documents at all or in every
document.
The result is that rarer terms get higher IDF scores. Common words like
"the" barely contribute, and specific terms drive most of the ranking
signal.
Assuming a collection of a million documents, IDF scores for our example
(with "for" included to show stop word impact) might be:
{
"run": 4.1,
"shoe": 8.5,
"for": 0.051,
"marathon": 13.3
}
For each document
The inverted index picks candidate documents that contain any of the query terms. Each candidate then gets a BM25 score. This is where the algorithm shifts from collection-level to document-level analysis.
Document statistics
For each candidate document, BM25 needs two more numbers: , the number of times a term appears in this document, and , the document’s total word count.
For example, a blog post about “Marathon Training” might have:
{
"f('run', D)": 12,
"f('shoe', D)": 3,
"f('marathon', D)": 8,
"document_length": 850
}
Parameters
BM25 has two tunable parameters. , typically between 1.2 and
2.0, controls how quickly extra occurrences of a term in a document stop
adding to the score. , typically 0.75, controls how strongly
document length affects the score: at 0 length has no effect, at 1 it
fully normalizes.
The defaults work for most collections. Tuning them past the defaults usually means a held-out test set and an offline evaluation loop.
Length normalization
Longer documents naturally contain more occurrences of any given term, but aren’t necessarily more relevant for it. Length normalization adjusts for this bias, penalizing documents above the average length and boosting those below it.
TF (term frequency)
TF measures how often a query term appears in a specific document, with saturation built in so the first few occurrences matter much more than the next few.
Unlike classic TF-IDF, BM25’s TF curve flattens out as occurrences pile up. Extra instances of the same term still contribute, but with diminishing returns, which makes it harder to game the score by stuffing keywords.
Term score
The term score combines global rarity (IDF) with how often the term
appears in this document (TF). For the example above, if “marathon” has
IDF = 13.3 and TF = 0.8 in some document, the term score for
“marathon” on that document is 13.3 × 0.8 = 10.64.
Every query term produces one term score per document. Each is a measurement of how well that document matches that part of the query.
Sum for all query terms
Term scores are summed into a single document score. Documents that match more query terms score higher. The additive shape of the sum assumes query terms are independent and equally weighted.
BM25 score
That sum is the final BM25 score for document given query . Higher means more relevant. The score folds term rarity and frequency together with document-length normalization, in one number that ranks consistently across the collection.
Sort documents by BM25 score
Sort the scored documents from highest to lowest. Ties can be broken with secondary criteria like recency or popularity, depending on what the catalog values.
Ranked results
The output is an ordered list of documents likely to match the user’s intent. The top-ranked ones tend to contain the rarer query terms more than once, and to fall near the average length of the collection.
For our “running shoes for marathoners” query, top results might be:
- “Best Marathon Running Shoes 2023” (Score:
42.3) - “Marathon Training: Choosing the Right Footwear” (Score:
38.7) - “Runner’s Guide to Marathon Shoes” (Score:
35.1)