Hybrid Search
Layered Search Ranking started with lexical search. Semantic Search added embeddings to the same kind of corpus. Each one handles a different failure mode.
BM25 is good when the query uses the same
terms as the document. For a query like walnut record cabinet, it rewards
documents that contain the exact words walnut, record, and cabinet. Embeddings cover the
vocabulary gap. A shopper can type vinyl storage console and still be looking
for the record cabinet, even if the product title never uses those words.
Hybrid search runs both retrieval paths and combines the ranked lists. The hard part is deciding how to combine them without letting one scoring system drown out the other.
The shape looks like this:
flowchart TD
Q["<strong>Query</strong><br/><span class='mermaid-detail'>vinyl storage console</span>"]
L["<strong>BM25</strong>"]
V["<strong>Vector</strong>"]
LR["<strong>BM25 ranks</strong><br/><span class='mermaid-detail'>1. vinyl_record_cabinet<br/>2. oak_record_stand</span>"]
VR["<strong>Vector ranks</strong><br/><span class='mermaid-detail'>1. walnut_media_console<br/>2. vinyl_record_cabinet</span>"]
RRF["<strong>RRF</strong>"]
F["<strong>Fused ranks</strong><br/><span class='mermaid-detail'>1. vinyl_record_cabinet<br/>2. walnut_media_console</span>"]
Q --> L
Q --> V
L --> LR
V --> VR
LR --> RRF
VR --> RRF
RRF --> F
Mismatched scores
A BM25 score comes from term frequency, inverse document frequency, field length normalization, field boosts, and whatever query structure you put around the text match. A vector score comes from an embedding model, a distance function, and the shape of the vector index. Both come back as floating point numbers, but they don’t share a scale.
You can normalize the scores and combine them. OpenSearch has supported that through its normalization processor for a while. The problem is that normalization has to make a judgment about the score distribution for each query. A query with one very strong title match behaves differently from a query where BM25 returns a flat page of decent matches. Vector scores have their own version of the same problem.
Reciprocal rank fusion sidesteps this. Instead of asking whether a BM25 score
of 14.2 is worth more than a cosine score of 0.78, RRF asks a simpler
question: where did the document rank in each list?
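To make the distribution problem concrete, here is a minimal Python sketch of min-max scaling, one common way to normalize scores. The BM25 scores are made up for illustration, and the min_max helper is my own name, not an OpenSearch API.

```python
def min_max(scores):
    """Scale scores to [0, 1]. A completely flat list maps to all 1.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Illustrative BM25 scores for two different queries (not real measurements).
one_strong_match = [14.2, 6.1, 5.9, 5.8]
flat_page = [7.4, 7.2, 7.1, 6.9]

print([round(s, 2) for s in min_max(one_strong_match)])  # [1.0, 0.04, 0.01, 0.0]
print([round(s, 2) for s in min_max(flat_page)])         # [1.0, 0.6, 0.4, 0.0]
```

The second hit lands at 0.04 for the first query and 0.6 for the second, even though its raw score only moved from 6.1 to 7.2. Whatever you combine those values with inherits that per-query swing.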
Reciprocal rank fusion
Reciprocal rank fusion gives every document a small contribution from each ranked list where it appears:
rrf_score(document) =
1 / (rank_constant + bm25_rank(document)) +
1 / (rank_constant + vector_rank(document))
If a document doesn’t appear in one list, it gets no contribution from that
list. Ranks are 1-based. The default rank_constant in OpenSearch is 60. A
smaller value gives more weight to the very top of each list. A larger value
flattens the curve and makes rank positions feel more similar.
With rank_constant = 60, a small example looks like this:
document              bm25 rank  vector rank  rrf score
vinyl_record_cabinet  1          3            0.0323
walnut_media_console  4          2            0.0318
oak_record_stand      2          -            0.0161
The first two documents are close because both branches liked them. The third document ranked well lexically, but it only got credit from one branch.
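The arithmetic behind those scores fits in a few lines of Python. This is a sketch of the formula above, not the OpenSearch implementation. The rrf_score helper and the use of None for "not in this list" are my own conventions.

```python
def rrf_score(ranks, rank_constant=60):
    """Sum 1 / (rank_constant + rank) over the lists where the document appears.

    ranks holds one 1-based rank per branch, or None if the branch
    did not return the document.
    """
    return sum(1.0 / (rank_constant + rank) for rank in ranks if rank is not None)


# (bm25_rank, vector_rank) for the three documents in the example.
example = {
    "vinyl_record_cabinet": (1, 3),
    "walnut_media_console": (4, 2),
    "oak_record_stand": (2, None),
}

for doc_id, ranks in example.items():
    print(doc_id, round(rrf_score(ranks), 4))
# vinyl_record_cabinet 0.0323
# walnut_media_console 0.0318
# oak_record_stand 0.0161
```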
OpenSearch recently published a writeup introducing RRF for hybrid search.
The implementation uses the score-ranker-processor, which runs between the
query and fetch phases and fuses the query clause rankings before the final
hits are returned.
Create the RRF pipeline
Create a search pipeline with a score-ranker-processor and set the
combination technique to rrf:
PUT /_search/pipeline/products-rrf
{
  "description": "Rank fusion for product hybrid search",
  "phase_results_processors": [
    {
      "score-ranker-processor": {
        "combination": {
          "technique": "rrf",
          "rank_constant": 60
        }
      }
    }
  ]
}
rank_constant is optional because 60 is the default, but I prefer to set it
explicitly. It makes the pipeline easier to review later.
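If you would rather create the pipeline from application code than from a console, the same request can be built with Python's standard library. This is only a sketch: the base URL is an assumption for a local, unauthenticated cluster, build_pipeline_request is a hypothetical helper, and the send step is left commented out.

```python
import json
import urllib.request


def build_pipeline_request(base_url, pipeline_name, rank_constant=60):
    """Build (but do not send) the PUT request that creates the RRF pipeline."""
    body = {
        "description": "Rank fusion for product hybrid search",
        "phase_results_processors": [
            {
                "score-ranker-processor": {
                    "combination": {
                        "technique": "rrf",
                        "rank_constant": rank_constant,
                    }
                }
            }
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/_search/pipeline/{pipeline_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Assumed local endpoint; swap in your own host and authentication.
request = build_pipeline_request("http://localhost:9200", "products-rrf")
# urllib.request.urlopen(request)  # uncomment to send against a real cluster
```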
Query both paths
Once the pipeline exists, send a hybrid query through it. OpenSearch runs the
query clauses independently, then the pipeline combines the lists with RRF.
For the product index from the previous posts, the lexical branch can stay close to the BM25 query from the layered ranking post. The vector branch uses the embedding field from the semantic search post:
POST /products_semantic/_search?search_pipeline=products-rrf
{
  "size": 10,
  "_source": {
    "excludes": ["search_embedding"]
  },
  "query": {
    "hybrid": {
      "queries": [
        {
          "bool": {
            "must": [
              {
                "multi_match": {
                  "query": "vinyl storage console",
                  "fields": ["title^4", "description"],
                  "type": "best_fields"
                }
              }
            ],
            "filter": [
              {
                "range": {
                  "stock_available": {
                    "gt": 0
                  }
                }
              }
            ]
          }
        },
        {
          "knn": {
            "search_embedding": {
              "vector": [0.012, -0.034, 0.008],
              "k": 100,
              "filter": {
                "range": {
                  "stock_available": {
                    "gt": 0
                  }
                }
              }
            }
          }
        }
      ]
    }
  }
}
The vector is abbreviated. In application code, generate it with the same model and query role you used when building the embedding index.
The hybrid query supports up to five clauses, but two is the right starting
point: one lexical branch and one vector branch, enough to see whether hybrid
retrieval is helping.
Notice that the stock filter appears in both branches. Keep hard filters hard. If the lexical branch can return out-of-stock products but the vector branch can’t, RRF is still allowed to give those lexical-only documents credit.
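One way to keep the two branches from drifting apart is to build the request body in code from a single filter definition. The sketch below mirrors the query above. build_hybrid_query and IN_STOCK are hypothetical names, and the three-element vector is the same abbreviated placeholder, not a real embedding.

```python
import copy

# Single source of truth for the hard filter used by both branches.
IN_STOCK = {"range": {"stock_available": {"gt": 0}}}


def build_hybrid_query(query_text, query_vector, hard_filter, k=100, size=10):
    """Return a hybrid search body whose branches share the same hard filter."""
    lexical = {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": query_text,
                        "fields": ["title^4", "description"],
                        "type": "best_fields",
                    }
                }
            ],
            "filter": [copy.deepcopy(hard_filter)],
        }
    }
    vector = {
        "knn": {
            "search_embedding": {
                "vector": list(query_vector),
                "k": k,
                "filter": copy.deepcopy(hard_filter),
            }
        }
    }
    return {
        "size": size,
        "_source": {"excludes": ["search_embedding"]},
        "query": {"hybrid": {"queries": [lexical, vector]}},
    }


body = build_hybrid_query("vinyl storage console", [0.012, -0.034, 0.008], IN_STOCK)
```

Each branch gets its own copy of the filter, so a later tweak to one branch's dictionary cannot silently change the other.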
What to tune
RRF gives you fewer knobs than score normalization, which is part of the appeal. The few that remain still matter.
Start with candidate depth. RRF can only fuse documents each branch returns. If
the vector branch uses k = 10 and the final page is also size = 10, there
isn’t much room for fusion. Pull a deeper vector candidate set, measure latency,
and decide how much recall you can afford.
Then tune rank_constant. Smaller values make rank 1 matter much more than
rank 20. Larger values smooth the difference. The default 60 is a good
baseline because it keeps lower-ranked documents from vanishing immediately,
but it still rewards documents that appear near the top of both lists.
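A quick way to see the curve is to compare what rank 1 and rank 20 contribute at a few values of rank_constant. The values 1, 10, and 100 are arbitrary points I picked for contrast with the default of 60.

```python
def contribution(rank, rank_constant):
    """RRF contribution of a single list position."""
    return 1.0 / (rank_constant + rank)


for c in (1, 10, 60, 100):
    ratio = contribution(1, c) / contribution(20, c)
    print(f"rank_constant={c:>3}: rank 1 is worth {ratio:.2f}x rank 20")
# rank_constant=  1: rank 1 is worth 10.50x rank 20
# rank_constant= 10: rank 1 is worth 2.73x rank 20
# rank_constant= 60: rank 1 is worth 1.31x rank 20
# rank_constant=100: rank 1 is worth 1.19x rank 20
```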
Most of the remaining work is branch quality. The RRF processor doesn’t give you
per-branch weights. If the lexical branch is too weak, tune the multi_match
query, field boosts, analyzers, and exact fields. If the vector branch is too
noisy, tune the embedding text, model, k, and filters. Don’t use
rank_constant as a pretend BM25-versus-vector weight. It changes the rank
curve for every branch.
Failure modes
RRF doesn’t rescue bad inputs. If BM25 returns broad matches and the vector index returns loose semantic neighbors, fusion can still produce a mediocre page.
The most common failure involves exact-match queries. Catalog part numbers and filesystem paths often belong to the lexical branch only. That’s fine if the lexical branch ranks them highly enough, but don’t assume the vector branch will help. For known-item queries, I would either keep exact clauses very strong or route the query away from hybrid search when the intent is clearly exact.
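Routing can start as a simple pattern check in front of the search call. The sketch below is a rough heuristic, not a classifier. Both regular expressions and the route names are assumptions made up for illustration, so replace them with the identifier formats your own catalog uses.

```python
import re

# Hypothetical patterns: a SKU-like token such as "WRC-20418" and a Unix-style path.
PART_NUMBER = re.compile(r"[A-Za-z]{2,5}-?\d{3,}[A-Za-z0-9-]*")
PATH_LIKE = re.compile(r"(?:~|\.{1,2})?/\S+")


def looks_exact(query):
    """Return True when the whole query looks like a part number or a path."""
    text = query.strip()
    return bool(PART_NUMBER.fullmatch(text) or PATH_LIKE.fullmatch(text))


def choose_route(query):
    """Send known-item lookups to the lexical path, everything else to hybrid."""
    return "lexical_only" if looks_exact(query) else "hybrid_rrf"


print(choose_route("WRC-20418"))              # lexical_only
print(choose_route("vinyl storage console"))  # hybrid_rrf
```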
The other failure is false agreement. Two branches can like the same bad document for different reasons. A product with generic marketing copy may match the query text and also sit near the query vector. RRF will reward that overlap, so inspect misses by query class instead of only looking at aggregate metrics.
Evaluation
This is the same judged-query loop from Offline Ranking Metrics, so the only new part is what you compare:
bm25
vector
hybrid_rrf
Split the report by query class. Exact product names, broad category searches, vocabulary-mismatch queries, and known-item queries won’t all move in the same direction, and watching them separately is what shows where fusion actually helps.
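Splitting the report is mostly a grouping step. Here is a minimal sketch, assuming you already have one metric value per judged query and per system. The row layout (query_class, system, value) and the report_by_class name are my own choices, not something from the earlier posts.

```python
from collections import defaultdict
from statistics import mean


def report_by_class(rows):
    """Average a per-query metric for each (query_class, system) pair.

    rows is an iterable of (query_class, system, value) tuples, where system
    is one of "bm25", "vector", or "hybrid_rrf" and value is whatever metric
    the judged-query loop produces for a single query.
    """
    buckets = defaultdict(list)
    for query_class, system, value in rows:
        buckets[(query_class, system)].append(value)
    return {key: mean(values) for key, values in buckets.items()}
```

Feeding it the per-query results from all three systems gives one number per class and system, which is enough to see whether hybrid_rrf gains on vocabulary-mismatch queries while giving ground on exact names.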
OpenSearch’s own benchmark in the RRF launch post showed a small relevance tradeoff against score-normalized hybrid search and a small latency win. That matches how I think about it: RRF is a strong first hybrid baseline because it avoids score calibration. A carefully tuned normalization pipeline can still win.
Once it reaches production, watch the online ranking metrics. The wins look like the vector index’s: fewer empty pages and fewer quick exits on vocabulary-gap queries. The new risk is fusion surfacing results that look plausible but send the user to the wrong item, so weight success rate over CTR.
RRF is a practical way to get BM25 and embeddings onto one page in OpenSearch. It runs each retrieval path on its own terms and fuses them by rank, so neither score scale has to win. Start there, measure it against judged queries, and only reach for heavier normalization or learned ranking once you have evidence that simple rank fusion has stopped moving the metrics.