Elasticsearch doesn’t just store documents and filter them by criteria; it can also find the documents most similar to a query.
Let’s say you have a collection of product descriptions and you want to find products that are conceptually similar to "a durable, waterproof backpack for hiking" even if the exact words aren’t present. This is where k-Nearest Neighbors (k-NN) vector search in Elasticsearch shines. It moves beyond keyword matching to semantic similarity.
Here’s how it works, using a concrete example:
Imagine we’re indexing book data. Each book will have a title, author, and a vector representation of its content or summary. This vector is a list of numbers, typically generated by a machine learning model (like a transformer model) that captures the semantic meaning of the text.
First, we need to set up an Elasticsearch index that can store these vectors. We use the dense_vector field type.
PUT /book_index
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "author": { "type": "keyword" },
      "description_vector": {
        "type": "dense_vector",
        "dims": 768, // This MUST match the dimensionality of your vectors
        "index": true,
        "similarity": "cosine" // Or "l2_norm", "dot_product"
      }
    }
  }
}
dense_vector: This is the crucial field type for k-NN.
dims: This number must precisely match the number of dimensions in the vectors you generate. If your ML model outputs 768-dimensional vectors, this must be 768. Mismatches here are a common point of failure.
index: true: This tells Elasticsearch to build an index for these vectors, enabling fast k-NN queries. Without it, the field cannot be used with the knn search option, and you would have to fall back to slow brute-force scoring of every vector.
similarity: This defines how "closeness" between vectors is measured. cosine is very common for text similarity, l2_norm (Euclidean distance) is good for general-purpose vectors, and dot_product is an optimized form of cosine similarity that requires vectors normalized to unit length.
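To make the similarity options above concrete, here is a minimal Python sketch of the three measures on toy 3-dimensional vectors. The function names and the sample vectors are illustrative assumptions, and this shows only the raw math, not how Elasticsearch converts a measure into a `_score`.

```python
import math


def dot(a, b):
    """Dot product; raises if the vectors have different lengths."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} != {len(b)}")
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance (smaller means closer)."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} != {len(b)}")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (larger means closer)."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / (norm_a * norm_b)


# Made-up toy vectors, not output from a real embedding model.
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # same direction, twice the length
orthogonal = [0.0, 3.0, 0.0]      # shares no direction with the query

print(cosine_similarity(query, same_direction))  # close to 1: length is ignored
print(cosine_similarity(query, orthogonal))      # 0: unrelated directions
print(l2_distance(query, same_direction))        # non-zero: length matters here
print(dot(query, same_direction))                # grows with vector length
```

Note how cosine treats `same_direction` as a perfect match while the Euclidean distance does not. That difference is the main thing to weigh when choosing a similarity for your own vectors.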
Now, let’s index a few books. We’ll use placeholder vectors for demonstration. In a real scenario, these would come from an ML model.
POST /book_index/_doc/1
{
  "title": "The Hitchhiker's Guide to the Galaxy",
  "author": "Douglas Adams",
  "description_vector": [0.1, 0.2, ..., 0.9] // 768 dimensions
}

POST /book_index/_doc/2
{
  "title": "Pride and Prejudice",
  "author": "Jane Austen",
  "description_vector": [-0.5, 0.1, ..., 0.2] // 768 dimensions
}

POST /book_index/_doc/3
{
  "title": "Foundation",
  "author": "Isaac Asimov",
  "description_vector": [0.3, 0.4, ..., 0.7] // 768 dimensions
}
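With more than a handful of documents, you would normally send them in one `_bulk` request rather than one request per book. The sketch below only builds the newline-delimited body for such a request and checks each vector's length before anything is sent. The helper name `build_bulk_body`, the 4-dimensional vectors, and their values are assumptions made for readability; a real index would use vectors matching the `dims` in your mapping.

```python
import json

DIMS = 4  # toy value for readability; must equal "dims" in your own mapping


def build_bulk_body(index_name, docs, dims):
    """Build an NDJSON body for the _bulk API, validating vector lengths first."""
    lines = []
    for doc_id, doc in docs.items():
        vector = doc["description_vector"]
        if len(vector) != dims:
            raise ValueError(
                f"doc {doc_id}: vector has {len(vector)} dims, mapping expects {dims}"
            )
        # Each document needs an action line followed by its source line.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical placeholder vectors, not real embeddings.
books = {
    "1": {
        "title": "The Hitchhiker's Guide to the Galaxy",
        "author": "Douglas Adams",
        "description_vector": [0.1, 0.2, 0.3, 0.9],
    },
    "2": {
        "title": "Pride and Prejudice",
        "author": "Jane Austen",
        "description_vector": [-0.5, 0.1, 0.4, 0.2],
    },
}

print(build_bulk_body("book_index", books, DIMS))
```

Catching a length mismatch on the client side gives a clearer error than waiting for Elasticsearch to reject the document.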
To perform a k-NN search, you query the description_vector field. You provide a query vector (representing your search query) and specify how many nearest neighbors (k) you want.
Let’s say our query is "a funny science fiction adventure". We’d generate a vector for this query using the same ML model used for indexing.
GET /book_index/_search
{
  "knn": {
    "field": "description_vector",
    "query_vector": [0.15, 0.25, ..., 0.85], // Vector for "a funny science fiction adventure"
    "k": 2, // We want the 2 most similar books
    "num_candidates": 100 // How many potential candidates to examine
  },
  "_source": ["title", "author"]
}
knn: This block signifies a k-NN search.
field: The dense_vector field to search within.
query_vector: The vector representing your search query. It must have the same number of dimensions as the indexed vectors.
k: The number of nearest neighbors to return.
num_candidates: This is a performance tuning parameter. A higher number means Elasticsearch will explore more potential matches on each shard before settling on the k nearest, potentially increasing accuracy but also latency. A good starting point is often 10 * k.
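If you assemble this request from application code, a small helper can apply the 10 * k starting point and reject a `num_candidates` smaller than `k`. This is a sketch that only builds the request body as a Python dict; `build_knn_search` is a hypothetical helper, not part of any Elasticsearch client, and the 3-dimensional query vector is a made-up stand-in.

```python
import json


def build_knn_search(field, query_vector, k, num_candidates=None, source_fields=None):
    """Return a search body that uses the top-level knn option."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if num_candidates is None:
        num_candidates = 10 * k  # starting point suggested in the text above
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    body = {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        }
    }
    if source_fields is not None:
        body["_source"] = list(source_fields)
    return body


toy_query_vector = [0.15, 0.25, 0.85]  # made-up stand-in for a real embedding
request_body = build_knn_search(
    "description_vector",
    toy_query_vector,
    k=2,
    source_fields=["title", "author"],
)
print(json.dumps(request_body, indent=2))
```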
Elasticsearch will use an approximate nearest neighbor (ANN) algorithm (like HNSW) to efficiently find the k vectors in the index that are closest to your query_vector based on the chosen similarity metric. It returns the documents corresponding to those vectors, sorted by similarity.
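For contrast with the approximate search, here is what an exact k-NN lookup computes: score every vector against the query and keep the top k. This is a brute-force reference sketch, not what Elasticsearch runs internally, and the 3-dimensional book vectors are invented for illustration.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_knn(query_vector, vectors_by_id, k):
    """Compare the query against every vector and return the k most similar."""
    scored = (
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in vectors_by_id.items()
    )
    # nlargest returns the best scores first, matching "sorted by similarity".
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


# Hypothetical toy vectors standing in for real 768-dimensional embeddings.
book_vectors = {
    "hitchhiker": [0.9, 0.8, 0.1],
    "pride": [0.1, 0.2, 0.9],
    "foundation": [0.8, 0.3, 0.2],
}
query_vector = [0.9, 0.7, 0.1]

for doc_id, score in exact_knn(query_vector, book_vectors, k=2):
    print(doc_id, round(score, 3))
```

The cost of this approach grows linearly with the number of vectors, which is the cost an ANN index is built to avoid.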
The mental model to build here is that Elasticsearch is transforming your text into points in a high-dimensional space. k-NN search is simply finding the closest points in that space. The power comes from how you generate those points (your ML model) and how you configure Elasticsearch to efficiently search them.
The dense_vector field type uses a specialized index structure, often Hierarchical Navigable Small Worlds (HNSW), to achieve fast approximate nearest neighbor searches. Unlike traditional inverted indexes for keywords, HNSW builds a graph where nodes are vector points and edges connect "nearby" points. When you query, it traverses this graph to find the closest neighbors efficiently, rather than comparing your query vector against every single vector in the index. This is why index: true is critical for performance and why num_candidates acts as a knob to balance search speed and recall.
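The graph traversal can be illustrated with a deliberately simplified, single-layer sketch. Real HNSW stacks several layers and builds its graph incrementally, so treat this as a toy model of the idea rather than the implementation Elasticsearch uses. The names `build_neighbor_graph`, `graph_search`, `m`, and `ef` are assumptions for illustration; `ef` is the size of the candidate list and plays a role loosely analogous to num_candidates.

```python
import heapq
import math


def build_neighbor_graph(points, m):
    """Link every point to its m nearest points (brute force, toy sizes only)."""
    graph = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in others[:m]:
            graph[i].add(j)
            graph[j].add(i)  # bidirectional edges keep the graph navigable
    return graph


def graph_search(points, graph, query, k, ef, entry=0):
    """Best-first walk that keeps at most ef results; returns the k closest ids found."""
    d_entry = math.dist(query, points[entry])
    visited = {entry}
    candidates = [(d_entry, entry)]  # min-heap: closest unexplored node first
    results = [(-d_entry, entry)]    # max-heap via negated distance
    while candidates:
        d_c, c = heapq.heappop(candidates)
        if d_c > -results[0][0]:
            break  # nothing left that can beat the current worst result
        for n in graph[c]:
            if n in visited:
                continue
            visited.add(n)
            d_n = math.dist(query, points[n])
            if len(results) < ef or d_n < -results[0][0]:
                heapq.heappush(candidates, (d_n, n))
                heapq.heappush(results, (-d_n, n))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result
    ranked = sorted((-neg_d, n) for neg_d, n in results)
    return [n for _, n in ranked[:k]]


# Ten toy points on a line; the query sits between points 7 and 8.
points = [(float(i), 0.0) for i in range(10)]
graph = build_neighbor_graph(points, m=2)
query = (7.2, 0.0)

found = graph_search(points, graph, query, k=2, ef=4)
exact = sorted(range(len(points)), key=lambda j: math.dist(query, points[j]))[:2]
print("graph search:", found)
print("brute force: ", exact)
```

The search starts at point 0 and hops along edges toward the query instead of measuring the distance to every point up front. On larger, less regular data a small `ef` can stop early and miss a true neighbor, which is the speed versus recall trade-off described above.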
One thing most people don’t know is how sensitive the num_candidates parameter is to the underlying ANN index structure and data distribution. If your vectors are clustered very tightly, a smaller num_candidates might suffice. If they are spread out or you have many dimensions, you might need a much larger value to ensure you find the true nearest neighbors. It’s not uncommon to see values range from k * 10 to k * 100 or even more, depending on the specific use case and the desired trade-off between accuracy and speed.
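One way to choose a value empirically is to measure recall: run the approximate search at several num_candidates settings and compare each result with the exact top k for the same query. The sketch below assumes you supply the search function yourself. `recall_at_k` and `sweep_num_candidates` are hypothetical helpers, and `canned_search` returns made-up results purely to exercise them, not measurements from a real cluster.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbors that the approximate search returned."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)


def sweep_num_candidates(search_fn, exact_ids, k, multipliers=(1, 10, 100)):
    """Run search_fn at k * multiplier candidates and report recall for each setting."""
    report = {}
    for multiplier in multipliers:
        num_candidates = k * multiplier
        approx_ids = search_fn(k=k, num_candidates=num_candidates)
        report[num_candidates] = recall_at_k(approx_ids, exact_ids)
    return report


# Stand-in for a real search call: canned, invented ids keyed by num_candidates.
CANNED_RESULTS = {
    3: ["a", "x", "y"],
    30: ["a", "b", "y"],
    300: ["a", "b", "c"],
}


def canned_search(k, num_candidates):
    return CANNED_RESULTS[num_candidates][:k]


true_neighbors = ["a", "b", "c"]  # would come from an exact (brute-force) search
for num_candidates, recall in sweep_num_candidates(canned_search, true_neighbors, k=3).items():
    print(num_candidates, round(recall, 2))
```

In practice you would average recall over a representative set of queries and record latency alongside it before settling on a value.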
The next hurdle you’ll likely encounter is tuning the ANN index parameters for optimal performance and accuracy, or exploring hybrid search where you combine k-NN with traditional keyword (BM25) search.