./ahmedhashim

Semantic Search

In Layered Search Ranking, the examples were still anchored in words from the query matching words in the document. A query like walnut record cabinet should match a product with those words in the title or description.

Search gets messier when the words drift but the intent stays the same. A shopper searching vinyl storage console may still want the Vinyl Record Cabinet from the earlier post. A support user searching reset my login may need the document titled Changing your password. You can get there with synonyms and careful analyzers, but those rules become another system to maintain.

Semantic search starts with a different artifact: an embedding. Generate embeddings for documents and queries, store the document vectors in OpenSearch, then retrieve the documents closest to the query vector. The rest of this post covers how to pick a model, what the vector field actually looks like, and how to query it when meaning survives a vocabulary change.

Embeddings

An embedding is a fixed-length list of numbers produced by a model. Texts with similar meaning should land near each other in vector space.

For a product catalog, this text:

Vinyl Record Cabinet
Mid-century inspired storage cabinet for vinyl records with sliding doors.
Materials: walnut, iron

should land close to queries like:

record storage
vinyl cabinet
walnut media console for LPs

The value stored in the index is just an array of floats. Shortened for readability, the product text might come back like this:

text = """Vinyl Record Cabinet
Mid-century inspired storage cabinet for vinyl records with sliding doors.
Materials: walnut, iron
"""

embedding = [
    0.0124,
    -0.0341,
    0.0087,
    0.0219,
    -0.0146,
    0.0032,
    # ...many more dimensions
]

The model doesn’t know your catalog inventory. It only turns text into coordinates. OpenSearch still has to store those vectors, search nearby documents, apply filters, and return useful neighbors.

Similarity is usually cosine similarity, dot product, or Euclidean distance. The right choice depends on the model and index configuration. In OpenSearch, that choice lives in the space_type for the knn_vector field. Hosted model docs usually tell you which distance function to use.
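To make the three measures concrete, here is a small pure-Python sketch. The helper names and the toy three-dimensional vectors are mine, not part of OpenSearch or any embedding library.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


# Made-up vectors standing in for a query and two documents.
query = normalize([0.2, 0.9, 0.1])
near = normalize([0.25, 0.85, 0.05])
far = normalize([0.9, 0.1, 0.3])

for name, doc in (("near", near), ("far", far)):
    print(name, cosine_similarity(query, doc), dot(query, doc), euclidean_distance(query, doc))
```

For unit-length vectors, cosine similarity equals the dot product, and squared Euclidean distance is 2 minus twice the dot product, so all three rank neighbors the same way. The choice matters most when the model doesn’t return normalized vectors.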

A score only means something inside one model and one index. A cosine score of 0.82 from one model isn’t comparable to 0.82 from another, and the numbers shift again when you change the text template you embed.

Before any of that, the model tokenizes the text. Tokenizers are model-specific. They split text into words, subwords, punctuation, and whitespace-shaped pieces:

text = "Vinyl Record Cabinet"

# Illustrative only. Each model family has its own tokenizer, and
# `tokenizer` here stands for whichever one ships with your model.
tokens = ["Vinyl", " Record", " Cabinet"]
token_ids = tokenizer.encode(text)

Context length is measured in tokens, not characters or words, so a chunk that fits one model can be too long for another. The output stays a fixed length regardless: a 1024-dimensional model returns 1024 numbers for a short title and 1024 numbers for a long description. When the input runs past the context length, the overflow doesn’t get embedded: many libraries silently truncate it, and some hosted APIs reject the request instead. Chunk the text before embedding if you need all of it.
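A minimal chunking sketch, assuming a whitespace split as a stand-in for the model’s tokenizer. Real token counts will differ, so swap in the tokenizer that ships with your model. The function name and the overlap parameter are hypothetical.

```python
def chunk_by_tokens(text: str, max_tokens: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most max_tokens tokens.

    Whitespace splitting is only a placeholder for the model's tokenizer;
    use the real one so the counts match what the model actually sees.
    """
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")

    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


description = "Mid-century inspired storage cabinet for vinyl records with sliding doors."
chunks = chunk_by_tokens(description, max_tokens=4, overlap=1)
```

A small overlap keeps a phrase that straddles a boundary visible in both chunks. Each chunk then gets its own embedding.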

Choosing a model

Start with the data. Benchmarks like MTEB are good for screening, but the model has to work on your corpus. A model that does well on general sentence similarity can still struggle with furniture materials, internal abbreviations, legal text, code snippets, or multilingual product names.

Decide early whether the search is symmetric or asymmetric. In symmetric search, the query and documents look roughly alike, such as finding duplicate questions. In asymmetric search, a short query retrieves a longer document, like a shopper’s phrase matching a product description or a user question matching a knowledge base article. Sentence Transformers has a good explanation of the distinction.

For asymmetric search, use query and document roles when the model supports them. Cohere Embed models use search_query and search_document. Voyage models use query and document. Sentence Transformers exposes encode_query and encode_document for models that treat queries and documents differently. Use those methods when the model expects them; embedding both sides with a generic path can quietly hurt retrieval quality.

General text embeddings are a fine baseline for catalogs, blogs, support docs, and many internal search systems. If the corpus is mostly code, finance, legal text, biomedical text, or another specialized domain, test a domain model against the same queries. Don’t assume the bigger general model wins.

Model families differ in dimension, context length, and tokenization. OpenAI’s small and large embedding models return 1536 and 3072 dimensions by default, and can shorten output vectors through a dimensions parameter. Cohere’s Embed v3 models return 1024-dimensional vectors, or 384 for the light variants. Voyage general models default around 1024 dimensions and can emit smaller or larger vectors. Open models vary more: MiniLM is commonly 384 dimensions, BGE small, base, and large use 384, 768, and 1024, E5 large is 1024, and BGE-M3 is 1024 with a much longer token window.

Those numbers change the index you store. A 384-dimensional model is cheap and fast, but it may miss nuance in longer or domain-heavy text. A 3072-dimensional model gives the retriever more room, but every query touches more memory. Some hosted models expose Matryoshka-style embeddings: the earlier dimensions still form a useful vector when you truncate the output. That lets you test smaller stored dimensions without changing model families. Try it when the shorter vector stays close on your judged queries.
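If you truncate on your side rather than asking the API for a shorter vector, a sketch like the one below is roughly what it takes. It assumes the model was trained for Matryoshka-style truncation; on a model that wasn’t, the leading dimensions carry no such guarantee. The function name is hypothetical.

```python
import math


def truncate_embedding(vector: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` values and rescale to unit length."""
    if not 0 < dimensions <= len(vector):
        raise ValueError("dimensions must be between 1 and len(vector)")
    head = vector[:dimensions]
    length = math.sqrt(sum(x * x for x in head))
    if length == 0.0:
        return head  # nothing to rescale
    return [x / length for x in head]


full = [0.6, 0.0, 0.8, 0.0]  # toy unit-length vector
short = truncate_embedding(full, 2)
```

Renormalizing matters because a truncated vector is no longer unit length, which skews dot-product scores against vectors that are.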

Storage tradeoffs

The storage math is easy to underestimate:

1,000,000 documents * 1,536 dimensions * 4 bytes = 6.1 GB
1,000,000 documents * 3,072 dimensions * 4 bytes = 12.3 GB

That’s FP32: 32 bits, or 4 bytes, per dimension. The vector graph, _source, replicas, and segment overhead add more.

Precision changes that multiplier:

FP32       32 bits  4 bytes per dimension
FP16       16 bits  2 bytes per dimension
bfloat16   16 bits  2 bytes per dimension
FP8         8 bits  1 byte per dimension
int8        8 bits  1 byte per dimension
binary      1 bit   1/8 byte per dimension

FP16 and bfloat16 both cut raw vector storage in half, but they spend those 16 bits differently. FP16 keeps more mantissa precision, so it resolves nearby numbers in finer detail. bfloat16 trades that detail for a wider exponent range, so it covers a larger span of very large and very small numbers. For normalized embeddings, where values are usually bounded and nearest-neighbor quality depends on small differences between dimensions, FP16 is often enough.

bfloat16 is more common in the model serving path than the index storage path. Model internals can have a much wider dynamic range than the final normalized embedding, especially around matmuls, attention, and activations. bfloat16 keeps a larger exponent range than FP16, so serving stacks can cut memory bandwidth while reducing overflow and underflow risk. The tradeoff is less mantissa precision, which is usually acceptable for inference. That doesn’t mean the vector index can store bfloat16 directly. Many indexes expose FP32, byte, int8, or binary vector storage instead.

FP8 is more aggressive. It can cut storage and memory bandwidth, but it needs a model and index path that tolerate that loss of precision. In practice, vector databases more often expose int8, uint8, or binary quantization than raw FP8 storage. Cohere and Voyage both expose compressed embedding formats, and Elasticsearch-style engines also support byte or binary vector paths. When you quantize, pull extra candidates and recompute vector similarity at higher precision before trusting the top results.
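Here is a toy version of that oversample-and-rescore loop in pure Python. It assumes normalized vectors with components in [-1, 1] and a fixed-scale int8 mapping, which is my simplification; real engines choose quantization ranges differently. Both function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quantize_int8(vector: list[float]) -> list[int]:
    """Map floats in [-1, 1] to integers in [-127, 127]."""
    return [max(-127, min(127, round(x * 127))) for x in vector]


def search_with_rescore(query, docs, k, oversample_factor=3):
    """docs maps doc id -> full-precision vector."""
    quantized = {doc_id: quantize_int8(vec) for doc_id, vec in docs.items()}
    q_query = quantize_int8(query)

    # Pass 1: cheap scoring on quantized vectors, keeping extra candidates.
    candidates = sorted(
        quantized,
        key=lambda doc_id: dot(q_query, quantized[doc_id]),
        reverse=True,
    )[: k * oversample_factor]

    # Pass 2: recompute similarity at full precision on the candidates only.
    return sorted(
        candidates,
        key=lambda doc_id: dot(query, docs[doc_id]),
        reverse=True,
    )[:k]


docs = {"a": [0.70, 0.71], "b": [0.71, 0.70], "c": [-0.70, 0.71]}
top = search_with_rescore([0.70, 0.71], docs, k=1)
```

The first pass only has to get the right documents into the candidate pool. The second pass decides their order.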

For the same 1M documents, the raw storage shape looks like this:

1,536 dimensions * FP32 = 6.1 GB
1,536 dimensions * FP16 = 3.1 GB
1,536 dimensions * int8 = 1.5 GB
1,536 dimensions * binary = 192 MB
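The same arithmetic as a small helper, in case you want to plug in your own document count and dimension. The function name is mine, and it counts only the raw vector payload in decimal gigabytes, matching the figures above.

```python
BITS_PER_DIMENSION = {
    "fp32": 32,
    "fp16": 16,
    "bfloat16": 16,
    "fp8": 8,
    "int8": 8,
    "binary": 1,
}


def raw_vector_bytes(num_docs: int, dimensions: int, precision: str) -> int:
    """Raw vector payload only: no graph, _source, replicas, or segment overhead."""
    total_bits = num_docs * dimensions * BITS_PER_DIMENSION[precision]
    return total_bits // 8


for precision in ("fp32", "fp16", "int8", "binary"):
    size_gb = raw_vector_bytes(1_000_000, 1536, precision) / 1e9
    print(f"{precision:>8}: {size_gb:.3f} GB")
```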

Smaller isn’t automatically better. Lower precision can flatten distances between close neighbors. Lower dimensions can remove signal before the index even sees the vector. If two configurations are close on nDCG@10 for your judged queries, the smaller or cheaper one is usually the better starting point.

The operating model matters too. Hosted APIs are simple to start with and push model serving out of your system. Local models give you lower marginal cost and more control over privacy, batching, and failure modes. Either way, store the model ID, tokenizer, dimension, precision, distance metric, and input template with every embedding generation. Changing any of those means a new vector field or index, not a score update in place.
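One way to keep that metadata together is a small record with a fingerprint, sketched below. The class, its field names, and the example values are assumptions for illustration. In particular, check the tokenizer name against your model’s documentation.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EmbeddingGeneration:
    model_id: str
    tokenizer: str
    dimension: int
    precision: str
    distance_metric: str
    input_template: str

    def fingerprint(self) -> str:
        """Stable short hash that changes whenever any setting changes."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


v1 = EmbeddingGeneration(
    model_id="text-embedding-3-small",
    tokenizer="cl100k_base",  # assumed value; confirm for your model
    dimension=1536,
    precision="fp32",
    distance_metric="cosinesimil",
    input_template="Title: {title}\nMaterials: {materials}\nDescription: {description}",
)
# Same model with a shorter stored dimension is a different generation.
v2 = EmbeddingGeneration(**{**asdict(v1), "dimension": 512})
```

Storing the fingerprint on each document, or in the index name, makes it cheap to confirm that every vector in a field came from the same generation.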

Indexing embeddings

The main indexing choice is what text you embed. For a product, I usually build a compact search document from the fields that explain the item:

def product_embedding_text(product: dict) -> str:
    return "\n".join(
        [
            f"Title: {product['title']}",
            f"Materials: {product.get('materials', '')}",
            f"Description: {product.get('description', '')}",
        ]
    )

Don’t embed every field just because it exists. Prices, inventory counts, and category IDs are better as structured fields for filters, sorting, or downstream application logic. Embedding text should describe meaning. Index fields should still carry facts.

For OpenSearch, create a vector field beside the normal document fields:

PUT /products_semantic_v1
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "description": { "type": "text" },
      "materials": { "type": "keyword" },
      "stock_available": { "type": "integer" },
      "amount_sold": { "type": "integer" },
      "embedding_model": { "type": "keyword" },
      "embedding_text_hash": { "type": "keyword" },
      "search_embedding": {
        "type": "knn_vector",
        "dimension": 1536,
        "method": {
          "name": "hnsw",
          "engine": "lucene",
          "space_type": "cosinesimil",
          "parameters": {
            "m": 16,
            "ef_construction": 100
          }
        }
      }
    }
  }
}

OpenSearch supports vector search through the k-NN and Neural Search plugins. The knn_vector field stores embeddings. HNSW is the usual approximate nearest neighbor method, and the Lucene engine is a practical default when vectors need to live close to normal OpenSearch filtering.

Point a stable alias at the index so queries never hard-code a version:

POST /_aliases
{
  "actions": [
    { "add": { "index": "products_semantic_v1", "alias": "products_semantic" } }
  ]
}

Then generate embeddings in batches and index them with the document:

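from collections.abc import Iterator
from hashlib import sha256

from openai import OpenAI
from opensearchpy import OpenSearch, helpers


client = OpenAI()  # reads OPENAI_API_KEY from the environment
search = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

EMBEDDING_MODEL = "text-embedding-3-small"
INDEX_NAME = "products_semantic_v1"
BATCH_SIZE = 100  # assumed value; tune to your rate limits and input sizes


def embed_batch(texts: list[str]) -> list[list[float]]:
    # One API call per batch; embeddings come back in input order.
    response = client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=texts,
    )
    return [item.embedding for item in response.data]


def batched(items: list[dict], size: int) -> Iterator[list[dict]]:
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_actions(products: list[dict]) -> Iterator[dict]:
    for batch in batched(products, BATCH_SIZE):
        texts = [product_embedding_text(product) for product in batch]
        embeddings = embed_batch(texts)

        for product, text, embedding in zip(batch, texts, embeddings):
            yield {
                "_op_type": "index",
                "_index": INDEX_NAME,
                "_id": product["id"],
                "_source": {
                    **product,
                    "embedding_model": EMBEDDING_MODEL,
                    # Hash of the exact text that was embedded.
                    "embedding_text_hash": sha256(text.encode()).hexdigest(),
                    "search_embedding": embedding,
                },
            }


# `products` is your list of product dicts, loaded elsewhere.
helpers.bulk(search, bulk_actions(products))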

For new documents, this runs in the normal indexing path. For an existing corpus, treat it like any other reindex: stand up products_semantic_v2, backfill embeddings, compare quality against v1, then point the products_semantic alias at the new index.
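The alias swap itself is one request. The sketch below only builds the body for POST /_aliases; the helper name is hypothetical, and the index names follow the v1/v2 convention used above.

```python
def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Request body for POST /_aliases that moves the alias in one call."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = alias_swap_actions(
    alias="products_semantic",
    old_index="products_semantic_v1",
    new_index="products_semantic_v2",
)
# With opensearch-py this body could be sent as:
#     search.indices.update_aliases(body=body)
```

Putting both actions in the same request means queries against products_semantic never see a moment with no index, or with both indexes, behind the alias.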

Two failure modes are worth catching early. Don’t mix embedding generations in the same vector field. If half the documents came from one model and half came from another, nearest-neighbor search no longer measures one coherent space. Also watch long inputs. If your product descriptions or documents exceed the model’s context length, chunk the content or build a shorter embedding template. Otherwise the input gets truncated or rejected. Truncation usually cuts from the end, so whatever sits last in the template is what you lose. Keep the title and other high-signal fields first; losing them to make room for marketing copy is a bad trade.
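A cheap guard against mixed generations is to check every vector before it reaches the bulk request. This is a sketch with hypothetical names; the expected values match the mapping and model used earlier in this post.

```python
EXPECTED_MODEL = "text-embedding-3-small"
EXPECTED_DIMENSION = 1536


def validate_embedding(embedding: list[float], model: str) -> None:
    """Raise before indexing if a vector does not belong to this field's generation."""
    if model != EXPECTED_MODEL:
        raise ValueError(f"expected model {EXPECTED_MODEL!r}, got {model!r}")
    if len(embedding) != EXPECTED_DIMENSION:
        raise ValueError(
            f"expected {EXPECTED_DIMENSION} dimensions, got {len(embedding)}"
        )
```

The dimension check alone won’t catch two models that happen to share a dimension, which is why the model ID is compared too.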

Vector-only retrieval

At query time, embed the user’s query and use that vector to retrieve candidates. The choices here are mostly about recall, latency, and whether structured filters run inside the vector query.

A basic k-NN query against the embedding field looks like this:

GET /products_semantic/_search
{
  "size": 10,
  "_source": {
    "excludes": ["search_embedding"]
  },
  "query": {
    "knn": {
      "search_embedding": {
        "vector": [0.012, -0.034, 0.008, ...],
        "k": 100,
        "filter": {
          "range": {
            "stock_available": {
              "gt": 0
            }
          }
        }
      }
    }
  }
}

The query vector is shortened here for readability. In production it has the same dimension as the stored field.
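In application code, the same request body can be assembled from the embedded query. This is a sketch: the function name and its defaults are mine, and the field names come from the mapping above.

```python
def build_knn_query(
    query_vector: list[float],
    k: int = 100,
    size: int = 10,
    in_stock_only: bool = True,
    expected_dimension: int = 1536,
) -> dict:
    """Build the k-NN search body for the search_embedding field."""
    if len(query_vector) != expected_dimension:
        raise ValueError(
            f"query vector has {len(query_vector)} dimensions, "
            f"expected {expected_dimension}"
        )

    knn_clause: dict = {"vector": query_vector, "k": k}
    if in_stock_only:
        # Structured filter applied inside the vector query.
        knn_clause["filter"] = {"range": {"stock_available": {"gt": 0}}}

    return {
        "size": size,
        "_source": {"excludes": ["search_embedding"]},
        "query": {"knn": {"search_embedding": knn_clause}},
    }


# With opensearch-py the body could be sent as:
#     search.search(index="products_semantic", body=build_knn_query(vector))
```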

Vector-only retrieval works well for similar-item pages, recommendations, deduplication, and natural-language search over narrative documents. It also helps when the user’s vocabulary is hard to predict.

Its weakness is exactness. A model may decide that oak media console is close to walnut record cabinet, even though the material mismatch matters. It can also handle SKUs and file paths poorly because those strings don’t behave like ordinary language.

The tuning knobs are small but sharp:

  • k: how many nearest neighbors to retrieve before returning size results
  • ef_search: how wide HNSW searches at query time for engines that expose it
  • nprobes: how many buckets to search in an IVF index
  • oversample_factor: how many extra candidates to pull before recomputing vector similarity

Higher values improve recall, but they spend more CPU and add latency. Tune them against a judged query set, using the same discipline from offline ranking metrics, not against one satisfying query.
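One way to measure what a knob buys you is recall against exact search: for each judged query, compare the approximate top k with the exact top k. A minimal sketch with hypothetical function names and made-up document IDs:

```python
def recall_at_k(approximate_ids: list[str], exact_ids: list[str], k: int) -> float:
    """Fraction of the exact top-k that the approximate search also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    found = truth & set(approximate_ids[:k])
    return len(found) / len(truth)


def mean_recall(runs: list[tuple[list[str], list[str]]], k: int) -> float:
    """Average recall@k over (approximate, exact) result pairs, one per query."""
    return sum(recall_at_k(approx, exact, k) for approx, exact in runs) / len(runs)


runs = [
    (["a", "b", "c"], ["a", "b", "d"]),  # approximate search missed "d"
    (["x", "y", "z"], ["x", "y", "z"]),  # approximate search matched exactly
]
score = mean_recall(runs, k=3)
```

Run it once per setting and plot recall against latency. The point where recall stops improving is usually where the knob should stay.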

Threshold and filtered retrieval

Top k retrieval always returns k neighbors if enough documents exist. For “similar products” or “related articles”, a weak neighbor may be worse than showing fewer items. OpenSearch supports radial-style vector queries with min_score or max_distance, which lets you ask for documents inside a similarity threshold instead of a fixed count.

Thresholds fit recommendation surfaces and deduplication workflows. They’re harder for primary search because users expect a page of results. A threshold that works for one query can be too strict for another, especially when query lengths vary.

Filters have their own job. If the user selects walnut or in_stock=true, that filter should stay exact. Let the vector query handle meaning, and let structured filters enforce facts.

When filters are narrow, test the execution path carefully. Some vector engines handle pre-filtering more efficiently than others. If a filter leaves only a small candidate set, exact vector comparison over those documents can beat an approximate graph search that keeps stepping around excluded results.
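For the narrow-filter case, OpenSearch can score the filtered documents exactly with the k-NN scoring script instead of walking the graph. The sketch below builds that request body; the function name is hypothetical, and the walnut filter is only an example.

```python
def build_exact_filtered_query(
    query_vector: list[float],
    filter_clause: dict,
    size: int = 10,
) -> dict:
    """Filter first, then score every remaining document exactly."""
    return {
        "size": size,
        "_source": {"excludes": ["search_embedding"]},
        "query": {
            "script_score": {
                # Only documents passing the filter are scored.
                "query": {"bool": {"filter": [filter_clause]}},
                "script": {
                    "source": "knn_score",
                    "lang": "knn",
                    "params": {
                        "field": "search_embedding",
                        "query_value": query_vector,
                        "space_type": "cosinesimil",
                    },
                },
            }
        },
    }


body = build_exact_filtered_query([0.1, 0.2], {"term": {"materials": "walnut"}})
```

Exact scoring costs time proportional to the number of filtered documents, so it only wins while the filter keeps that set small. Measure both paths on your own filters.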

Evaluation

The common mistake is trusting the embedding index because a few demo queries look impressive. Demo queries are usually the ones embeddings handle best.

Use the same loop from the metrics posts, but keep the experiment about the vector index. Build a judged query set, run each embedding configuration, then compare nDCG@10 and MRR by query class. Keep an eye on recall too. You’re trying to find which embedding index returns the right neighbors for your corpus, not which one wins on a single happy-path query.
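For reference, here is a compact sketch of the two metrics and the per-class rollup. It uses linear gain for nDCG, which is one common variant; the function names, grades, and document IDs are illustrative.

```python
import math
from collections import defaultdict


def dcg_at_k(gains: list[float], k: int) -> float:
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids: list[str], judgments: dict[str, float], k: int = 10) -> float:
    """judgments maps doc id -> graded relevance; unjudged docs count as 0."""
    gains = [judgments.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(judgments.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def reciprocal_rank(ranked_ids: list[str], judgments: dict[str, float], min_grade: float = 1.0) -> float:
    """1 / rank of the first relevant result; MRR is the mean of this over queries."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if judgments.get(doc_id, 0.0) >= min_grade:
            return 1.0 / rank
    return 0.0


def mean_by_class(rows: list[tuple[str, float]]) -> dict[str, float]:
    """rows are (query_class, metric_value) pairs."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for query_class, value in rows:
        grouped[query_class].append(value)
    return {name: sum(values) / len(values) for name, values in grouped.items()}


judgments = {"cabinet": 3.0, "console": 1.0}
score = ndcg_at_k(["console", "cabinet", "lamp"], judgments)
```

Splitting by query class keeps a strong average from hiding a class, such as SKU lookups, where the embedding index does badly.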

Before swapping the alias, inspect the misses. If walnut record cabinet pulls oak consoles too often, the problem may be the text template, model choice, or precision setting. If SKUs or model numbers matter to the surface, decide whether they belong in the embedding text or in exact fields outside the vector query.

Once the vector index reaches production, watch the online metrics. Zero-result rate should fall for vocabulary mismatch queries. Abandonment should fall on searches that used to return plausible but wrong neighbors. CTR can rise for the wrong reason, so pair it with success rate before calling the change good.

Most of the durable work is in the index plumbing. Pick the model, tokenization path, dimension, precision, and text template on purpose, and store enough metadata to rebuild the field from scratch. Once that’s in place you have an embedding-backed OpenSearch index you can measure and change without guessing.