Online Ranking Metrics
In Offline Ranking Metrics, I covered the standard metrics for scoring a layered ranker against judged data. Those scores are great for picking between candidate rankers before any of them touch production. Once a version is live, you need a different kind of measurement.
That’s online evaluation. The signal that maps to revenue and retention is what users do after seeing the ranked page.
The event model
Every metric below rolls up from two IDs. A session_id identifies a browser or signed-in session. A search_id identifies one submitted query and the result page it produced. Even when a metric uses the word “session”, I calculate it at the search_id level first. A browser session can contain several searches, and each one needs to be measured on its own.
The backend should always log the search response. That gives you a durable record of what the ranker returned even when the browser closes before sending interaction events. The frontend should log what the user saw and did: rendered impressions and clicks at minimum, plus dwell time and downstream conversions if you can capture them. Impressions should count rendered result cards, not every hit OpenSearch returned. That difference matters as soon as you try to reason about position bias.
Once those events are flowing, the metrics are mostly aggregation work.
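As a rough sketch of that aggregation, here is how backend responses and frontend events might roll up into the per-search records the later functions consume. The field names (search_id, result_count, viewed, clicked) are illustrative assumptions, not a fixed schema:

def rollup_searches(responses: list[dict], events: list[dict]) -> list[dict]:
    # One record per logged search response; frontend events are folded in
    # by search_id. Dwell and conversion fields would be merged the same way.
    searches = {
        r["search_id"]: {
            "search_id": r["search_id"],
            "session_id": r["session_id"],
            "result_count": r["result_count"],
            "viewed": False,
            "clicked": False,
        }
        for r in responses
    }
    for e in events:
        s = searches.get(e["search_id"])
        if s is None:
            continue  # frontend event without a logged backend response
        if e["type"] == "impression":
            s["viewed"] = True
        elif e["type"] == "click":
            s["clicked"] = True
    return list(searches.values())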
Session abandonment rate
Session abandonment rate is the share of search sessions that don’t produce a click.
abandonment_rate = searches_with_no_click / searches_viewed
For the catalog example, a shopper searches walnut record cabinet, sees the result page, and leaves without clicking any product. That search counts as abandoned.
This metric catches broad dissatisfaction without telling you what kind. The top of the page might be full of stock-outs, or the query might be matching too generically to feel useful. Abandonment can’t pin down which it is, only that the page isn’t landing.
I usually track it twice: once across every rendered search page, and again excluding zero-result searches. The second view separates ranking problems from recall problems. For the layered ranker, abandonment is one of the first metrics I’d watch after touching negative_boost or the popularity factor. A rise in abandonment while zero-result rate stays flat usually means the first page got worse.
Given a list of per-search records rolled up from the event log, the math is small:
def abandonment_rate(searches: list[dict]) -> float:
    viewed = [s for s in searches if s["viewed"]]
    abandoned = [s for s in viewed if not s["clicked"]]
    return len(abandoned) / len(viewed) if viewed else 0.0
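The second view is the same calculation restricted to searches that returned hits, assuming result_count is carried on the per-search record:

def abandonment_rate_with_results(searches: list[dict]) -> float:
    # Excludes zero-result searches so recall gaps don't inflate the number.
    return abandonment_rate([s for s in searches if s["result_count"] > 0])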
Click-through rate
Click-through rate is the ratio of clicks to impressions.
ctr = result_clicks / result_impressions
For search, CTR is most useful broken down by rank and ranker version. A global CTR hides position bias. Rank one usually gets clicked more than rank ten because it’s first, even when rank ten is the better result.
CTR is good at catching changes in result attractiveness. Title and thumbnail can move it. So can the price display or any rendering tweak that has nothing to do with ranking. That overlap also makes CTR easy to misread, so I never look at it without a satisfaction metric beside it.
For ranking work, I care most about the first page and especially the top five results. A change that improves rank 8 but hurts rank 1 usually reads as a regression, because most attention never makes it past the top of the page.
Computed straight from impression and click events:
from collections import Counter

def ctr_by_rank(events: list[dict]) -> dict[int, float]:
    impressions, clicks = Counter(), Counter()
    for e in events:
        if e["type"] == "impression":
            impressions[e["rank"]] += 1
        elif e["type"] == "click":
            clicks[e["rank"]] += 1
    return {rank: clicks[rank] / n for rank, n in impressions.items()}
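A per-version breakdown is the same aggregation keyed one level higher. This sketch assumes each event also carries a ranker_version field, which is an assumption about the logging schema rather than anything shown above:

from collections import defaultdict

def ctr_by_rank_and_version(events: list[dict]) -> dict[str, dict[int, float]]:
    # Group events by the assumed ranker_version field, then reuse ctr_by_rank.
    by_version: dict[str, list[dict]] = defaultdict(list)
    for e in events:
        by_version[e["ranker_version"]].append(e)
    return {version: ctr_by_rank(evts) for version, evts in by_version.items()}

Slicing either output down to ranks one through five gives the top-of-page view that matters most for ranking changes.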
Session success rate
Session success rate is the share of search sessions that lead to a successful outcome.
session_success_rate = successful_searches / searches_viewed
Success is product-specific. For this furniture catalog, a practical first rule is: the search succeeds if the user clicks a result and then either spends at least 30 seconds on the product page or adds the item to cart.
That rule isn’t perfect, but it’s a much better signal than clicks alone. CTR can climb on its own when the result cards get more clickable without becoming better matches. Success rate going up despite a flat CTR usually means the ranker is showing fewer but better choices, which is what you want.
For the layered ranker, success rate is where you catch a popularity boost that’s drifted too far. If shoppers searching walnut record cabinet keep bouncing from popular oak cabinets, either the popularity factor is too strong or the material signal is too weak.
Always store the raw components alongside the rolled-up boolean. A success flag is convenient for dashboards, but fields like dwell_ms and added_to_cart give you room to adjust the rule later without a backfill.
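As one way to derive that flag, here is the 30-second / add-to-cart rule applied to the raw components; the dwell_ms and added_to_cart field names are assumptions about what the per-search record carries:

def is_successful(search: dict) -> bool:
    if not search["clicked"]:
        return False
    # 30 seconds on the product page, or an add-to-cart, counts as success.
    return search.get("dwell_ms", 0) >= 30_000 or search.get("added_to_cart", False)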
Once successful is on each per-search record, the rate looks identical to abandonment:
def session_success_rate(searches: list[dict]) -> float:
    viewed = [s for s in searches if s["viewed"]]
    successful = [s for s in viewed if s["successful"]]
    return len(successful) / len(viewed) if viewed else 0.0
Zero result rate
Zero result rate is the share of search responses that returned nothing.
zero_result_rate = zero_result_searches / search_responses
This is the cleanest recall metric in the set. If a shopper searches walnut record cabinet and OpenSearch returns no hits, no ranking function can save the session. Either the catalog doesn’t have matching products or the query isn’t being analyzed well enough to find them.
ZRR is one of the easier metrics to act on. High-volume zero-result queries are usually pointing at missing synonyms or analyzer gaps you can fix directly. In a furniture catalog, record console, vinyl storage, and LP cabinet may all represent the same intent.
I separate unfiltered ZRR from filtered ZRR. A zero-result search with no filters applied is almost always a recall or coverage issue worth fixing. A filtered one often isn't a problem at all, since a narrow filter combination can produce an empty set legitimately.
The top zero-result queries are usually more useful than the rate itself. A ZRR of 2% may be fine if the misses are nonsense queries. If the misses are vinyl cabinet or walnut console, you have indexing work to do.
Because result count comes from the backend, ZRR runs against the logged search responses directly:
def zero_result_rate(responses: list[dict]) -> float:
    if not responses:
        return 0.0
    zero = sum(1 for r in responses if r["result_count"] == 0)
    return zero / len(responses)
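The two diagnostic views described above fall out of the same log. This sketch assumes the logged response carries a has_filters flag and the raw query string, both of which are assumptions about the schema:

from collections import Counter

def zero_result_breakdown(responses: list[dict], top_n: int = 20) -> dict:
    # Split ZRR by whether filters were applied, and surface the
    # highest-volume zero-result queries for indexing work.
    unfiltered = [r for r in responses if not r.get("has_filters")]
    filtered = [r for r in responses if r.get("has_filters")]
    top_misses = Counter(
        r["query"] for r in responses if r["result_count"] == 0
    ).most_common(top_n)
    return {
        "zrr_unfiltered": zero_result_rate(unfiltered),
        "zrr_filtered": zero_result_rate(filtered),
        "top_zero_result_queries": top_misses,
    }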
Using them together
None of these metrics should drive ranking decisions on their own. Each one catches a different failure mode, and looking at any of them in isolation usually means missing context the others would have given you.
After a ranking change I check them in priority order. ZRR comes first, because if recall is broken, no amount of ranking can save the page. Abandonment is next, but only across searches that did return results. CTR and success rate sit on top of all that, and they should both move in the right direction at the top of the page.
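Stitched together from the functions above, that check can be as small as a scorecard. This sketch assumes the per-search records already carry the successful flag, and leaves CTR by rank as its own table since it doesn't reduce to a single number cleanly:

def ranking_scorecard(searches: list[dict], responses: list[dict]) -> dict[str, float]:
    # Abandonment and success are reported over searches that returned hits,
    # so recall problems show up only in the zero-result rate.
    with_results = [s for s in searches if s["result_count"] > 0]
    return {
        "zero_result_rate": zero_result_rate(responses),
        "abandonment_rate": abandonment_rate(with_results),
        "session_success_rate": session_success_rate(with_results),
    }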
The interactions matter more than the absolute numbers. The most common one I look for is CTR climbing while success rate falls, which usually means the ranker is producing more attractive cards but worse matches. The other is ZRR moving when query logic hasn’t changed; that sends me to analyzers and filters first, since function_score won’t fix what’s missing from the candidate set.
Online evaluation is most useful when the metrics line up with how shoppers actually move through the page. A shopper types walnut record cabinet, sees a ranked page, clicks a product, and either finds what they need or doesn’t. If the instrumentation preserves that path cleanly, ranking changes turn into something you can measure instead of something you have to argue about.