Offline Ranking Metrics

In Layered Search Ranking, I built a custom ranking function for a furniture catalog. Tuning a ranker against a handful of sample queries is fine to start, but beyond that you want a number that moves when the ranker gets better.

Offline evaluation is how you get one. Given a set of judged queries, you score the ranker against the judgments and watch the score change as you tune. The metrics in this post are the standard ways to turn “these results look right” into a number you can iterate against.

Judgments

Every offline metric below works from the same input: a set of queries paired with documents that have been labeled for relevance.

The simplest version is binary. For the query walnut record cabinet, a small set of product IDs gets marked relevant and everything else is implicitly irrelevant. A more useful version uses graded relevance, where each judgment carries an integer grade for how good a match the document is. A common scale is 0 (irrelevant) to 3 (obvious match). Graded judgments unlock the gain-based metrics later in the post.

Where the judgments come from depends on what you have. Editorial review is the cleanest but doesn’t scale. Click logs and downstream conversions can be used as weak labels at much higher volume. Most catalogs end up with some blend of the two.
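
As a rough sketch of the weak-label idea, here is one way click and conversion counts could be collapsed onto the 0–3 scale above. The grade_from_logs name, the traffic cutoff, and the CTR thresholds are all invented for illustration; a real pipeline would tune them and correct for position bias before trusting the labels.

def grade_from_logs(impressions: int, clicks: int, conversions: int) -> int | None:
    # Collapse raw engagement counts into a 0-3 grade, or None when there
    # isn't enough traffic to label the pair either way. Cutoffs are placeholders.
    if impressions < 20:
        return None
    ctr = clicks / impressions
    if conversions > 0 and ctr > 0.10:
        return 3  # clicked often and converted: obvious match
    if ctr > 0.10:
        return 2  # clicked often, never converted
    if ctr > 0.02:
        return 1  # occasional clicks
    return 0      # shown plenty, essentially ignored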

For the rest of this post I’ll assume judgments look like this:

judgments = {
    "walnut record cabinet": {
        "vinyl_record_cabinet_v3": 3,
        "walnut_storage_console": 2,
        "oak_record_stand": 1,
    },
    # ...more queries
}

Precision

Precision is the share of retrieved documents that are relevant.

\text{precision} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{retrieved}\}|}

For ranking work, the unbounded version isn’t very useful by itself. Search systems return long lists, and precision over the whole list drops as the list grows. The version that matters in practice is Precision at k, which restricts the calculation to the top k results:

\text{P@k} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}_{1..k}\}|}{k}

P@5 and P@10 are the most common cuts. They map roughly to “the first page” of most product listings.

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

P@k is easy to read but throws away the order of the relevant items inside the top k. A run that puts the best match at rank 1 scores the same as a run that puts it at rank 5.
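
A quick check with precision_at_k makes the point. The two hypothetical result lists below (the letter IDs are placeholders) get the same P@5 even though the first run is clearly the better one:

relevant = {"vinyl_record_cabinet_v3"}

best_first = ["vinyl_record_cabinet_v3", "a", "b", "c", "d"]
best_last = ["a", "b", "c", "d", "vinyl_record_cabinet_v3"]

print(precision_at_k(best_first, relevant, k=5))  # 0.2
print(precision_at_k(best_last, relevant, k=5))   # 0.2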

Recall

Recall is the share of all relevant documents the system actually returned.

\text{recall} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{relevant}\}|}

Recall and precision pull against each other. Returning more documents tends to lift recall and drag precision down, since the relevant matches are still in there but they’re now diluted by extra noise.

def recall(retrieved: list[str], relevant: set[str]) -> float:
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(relevant)

For a catalog search where the user only sees ten results, recall over the full corpus isn’t very actionable. Recall at k follows the same pattern as P@k.
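
A recall-at-k helper is the same one-line change, truncating the list before counting; a quick sketch:

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Same as recall(), but only hits in the top k count.
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)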

F-measure

The F-measure (specifically F1) is the harmonic mean of precision and recall:

F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}

The harmonic mean penalizes the lower of the two numbers, so a system can’t win at F1 by being great at precision and terrible at recall. The general form lets you weight one over the other:

F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}

β > 1 weights recall higher and β < 1 weights precision higher. The default F1 is fine for most search use cases.

def f1(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
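
A sketch of the general form, with beta as the only extra knob (beta=1 reduces to f1 above):

def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    # beta > 1 favors recall, beta < 1 favors precision.
    denom = beta**2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denom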

F1 shows up more in classification than in search ranking. For ranking, MAP and nDCG are stronger choices because both account for position.

Mean Average Precision

Average Precision (AP) for a single query is the average of the precision values at each rank where a relevant document was found:

\text{AP} = \frac{\sum_{k=1}^{n} P(k) \cdot \text{rel}(k)}{|\{\text{relevant}\}|}

where P(k) is the precision at cutoff k, and rel(k) is 1 when the document at rank k is relevant and 0 otherwise. Mean Average Precision averages AP across a query set:

\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q)

MAP rewards rankers that put relevant documents earlier. Because AP only counts ranks where a relevant document landed, lifting a relevant item from rank 9 to rank 1 moves the score noticeably.

def average_precision(retrieved: list[str], relevant: set[str]) -> float:
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    for k, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            score += hits / k
    return score / len(relevant)


def mean_average_precision(per_query: list[tuple[list[str], set[str]]]) -> float:
    return sum(average_precision(r, j) for r, j in per_query) / len(per_query)
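
To see how much position moves the number, compare two made-up runs that both retrieve the single relevant item, once at rank 1 and once at rank 9 (the other_* IDs are placeholders):

relevant = {"vinyl_record_cabinet_v3"}
filler = [f"other_{i}" for i in range(8)]

at_rank_1 = ["vinyl_record_cabinet_v3"] + filler
at_rank_9 = filler + ["vinyl_record_cabinet_v3"]

print(average_precision(at_rank_1, relevant))  # 1.0
print(average_precision(at_rank_9, relevant))  # ~0.11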

MAP works on binary judgments. If your judgments have grades, nDCG is the better fit.

Normalized Discounted Cumulative Gain

DCG sums the relevance grades of retrieved documents, discounted by their rank:

\text{DCG}_p = \sum_{i=1}^{p} \frac{\text{rel}_i}{\log_2(i + 1)}

The logarithm in the denominator is the discount. A relevant document at rank 1 contributes its full grade; the same document at rank 5 contributes a bit more than a third as much, since the discount there is 1/log2(6) ≈ 0.39. The shape of the discount mirrors how user attention drops down the page.

Raw DCG isn’t comparable across queries because the maximum possible value depends on how many relevant documents exist for that query. Normalizing by the ideal ranking fixes that:

\text{nDCG}_p = \frac{\text{DCG}_p}{\text{IDCG}_p}

where IDCG_p is the DCG of the best possible ordering of the judged documents. nDCG always falls between 0 and 1, which makes it easy to compare runs and aggregate across queries.

from math import log2


def dcg(retrieved: list[str], judgments: dict[str, int], k: int) -> float:
    return sum(
        judgments.get(doc, 0) / log2(i + 1)
        for i, doc in enumerate(retrieved[:k], start=1)
    )


def ndcg(retrieved: list[str], judgments: dict[str, int], k: int) -> float:
    ideal = sorted(judgments.values(), reverse=True)[:k]
    idcg = sum(rel / log2(i + 1) for i, rel in enumerate(ideal, start=1))
    return dcg(retrieved, judgments, k) / idcg if idcg else 0.0
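
Tying it back to the judgments from the top of the post, here is what a hypothetical run that buries the best match scores:

judged = {
    "vinyl_record_cabinet_v3": 3,
    "walnut_storage_console": 2,
    "oak_record_stand": 1,
}

# Weakest judged match first, best match last.
retrieved = ["oak_record_stand", "walnut_storage_console", "vinyl_record_cabinet_v3"]

print(ndcg(retrieved, judged, k=10))  # ~0.79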

nDCG is the standard offline metric for graded relevance and the one I reach for first when judgments have grades.

Mean Reciprocal Rank

MRR averages the reciprocal of the rank at which the first relevant document appears:

\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q}

A first relevant result at rank 1 contributes 1.0. Buried at rank 5 it contributes 0.2. The drop is sharp, which is the whole point. MRR captures whether the user’s first useful answer was right at the top.

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for k, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / k
    return 0.0


def mean_reciprocal_rank(per_query: list[tuple[list[str], set[str]]]) -> float:
    return sum(reciprocal_rank(r, j) for r, j in per_query) / len(per_query)

MRR is most useful for known-item search and Q&A, where there’s typically one right answer per query. For broad catalog search, MAP and nDCG capture more signal.

The rank evaluation API

OpenSearch ships an API that runs all of this without leaving the cluster. Given queries and judgments, the _rank_eval endpoint executes the queries and computes whichever metric you ask for.

POST /products/_rank_eval
{
  "requests": [
    {
      "id": "walnut_record_cabinet",
      "request": {
        "query": {
          "multi_match": {
            "query": "walnut record cabinet",
            "fields": ["title^4", "description"]
          }
        }
      },
      "ratings": [
        { "_index": "products", "_id": "vinyl_record_cabinet_v3", "rating": 3 },
        { "_index": "products", "_id": "walnut_storage_console",  "rating": 2 },
        { "_index": "products", "_id": "oak_record_stand",        "rating": 1 }
      ]
    }
  ],
  "metric": {
    "dcg": {
      "k": 10,
      "normalize": true
    }
  }
}

The metric block is the only thing that changes between runs. Swap dcg for precision, recall, mean_reciprocal_rank, or expected_reciprocal_rank and the rest of the body stays the same. Each metric has its own knobs (k, relevant_rating_threshold, maximum_relevance) that the docs cover.
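
For example, scoring the same requests as P@10 while treating grades of 2 and above as relevant just means swapping the metric block, something like:

"metric": {
  "precision": {
    "k": 10,
    "relevant_rating_threshold": 2
  }
}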

The response gives you an aggregate metric_score and a per-query details map. That’s enough to plot scores over time or compare two ranker versions side by side without leaving the cluster.
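
As a sketch of consuming that response, assuming it has already been parsed into a Python dict with the metric_score and details fields described above (summarize_rank_eval is just an illustrative helper name):

def summarize_rank_eval(response: dict) -> None:
    # Print the aggregate score, then the five weakest queries.
    print(f"overall: {response['metric_score']:.3f}")
    per_query = {
        query_id: detail["metric_score"]
        for query_id, detail in response["details"].items()
    }
    for query_id, score in sorted(per_query.items(), key=lambda kv: kv[1])[:5]:
        print(f"{score:.3f}  {query_id}")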

Picking one

The metric you pick should match what you’re trying to improve. For a product catalog with graded judgments and a long tail of acceptable matches, nDCG@10 is what I reach for first. MRR fits better when each query has a single correct answer, as in known-item search or Q&A. P@k is easier to explain to non-engineers and works well alongside one of those as a dashboard supplement, while F1 is really a classification metric and rarely earns a primary slot in ranking work.

Offline metrics give you a fast iteration loop and a way to compare ranker versions before any of them touch real traffic. They don’t tell you whether the shipped change is helping shoppers in the way you predicted. That question is what Online Ranking Metrics covers.