Elasticsearch’s indexing speed isn’t just about disk speed; it’s about how efficiently you can batch up operations and avoid overwhelming the cluster.
Let’s see Elasticsearch in action. Imagine you have a stream of incoming events, maybe from logs or user activity. You need to get these into Elasticsearch for searching.
// Incoming event 1
{
  "timestamp": "2023-10-27T10:00:00Z",
  "message": "User logged in",
  "user_id": "alice"
}
// Incoming event 2
{
  "timestamp": "2023-10-27T10:00:01Z",
  "message": "Page viewed",
  "user_id": "alice",
  "page": "/dashboard"
}
If you index these one by one with PUT /my-index/_doc/1 and PUT /my-index/_doc/2, every document pays the full cost of a request: an HTTP round trip, routing to the primary shard, parsing and analysis, a write to the transaction log (which by default is fsynced before the request is acknowledged), and replication to the replica shards. At high volume this per-request overhead dominates, and indexing is slow.
The _bulk API is the key. Instead of sending individual requests, you send a single request containing many index, create, update, or delete operations.
POST /_bulk
{ "index" : { "_index" : "my-index", "_id" : "1" } }
{ "timestamp": "2023-10-27T10:00:00Z", "message": "User logged in", "user_id": "alice" }
{ "index" : { "_index" : "my-index", "_id" : "2" } }
{ "timestamp": "2023-10-27T10:00:01Z", "message": "Page viewed", "user_id": "alice", "page": "/dashboard" }
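This reduces network overhead and lets Elasticsearch process the batch more efficiently. The request body is newline-delimited JSON: each action line is followed by its document on the next line (delete actions take no document line), and the last line must end with a newline. The _bulk API returns a JSON object that reports the outcome of each operation, in request order. A bulk request can return HTTP 200 even when individual operations failed, so always check the top-level errors flag and the per-item status.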
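As a minimal sketch of the request and response shapes described above, the following builds a newline-delimited bulk body and pulls failed items out of a response. The helper names build_bulk_body and failed_items are hypothetical, and sample_response is hand-written to mimic the documented response shape; it is not real server output.

```python
import json


def build_bulk_body(index, docs_by_id):
    """Serialize documents into a newline-delimited _bulk body (index actions only)."""
    lines = []
    for doc_id, doc in docs_by_id.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"


def failed_items(response):
    """Return (operation, _id, error) for every item that did not succeed."""
    failures = []
    if not response.get("errors"):
        return failures
    for item in response["items"]:
        # Each item is a single-key dict keyed by the operation type.
        for op_type, result in item.items():
            if result.get("status", 0) >= 300:
                failures.append((op_type, result.get("_id"), result.get("error")))
    return failures


events = {
    "1": {"timestamp": "2023-10-27T10:00:00Z", "message": "User logged in", "user_id": "alice"},
    "2": {"timestamp": "2023-10-27T10:00:01Z", "message": "Page viewed", "user_id": "alice", "page": "/dashboard"},
}
body = build_bulk_body("my-index", events)

# Hand-written sample for illustration only.
sample_response = {
    "took": 30,
    "errors": True,
    "items": [
        {"index": {"_index": "my-index", "_id": "1", "status": 201}},
        {"index": {"_index": "my-index", "_id": "2", "status": 400,
                   "error": {"type": "mapper_parsing_exception", "reason": "sample reason"}}},
    ],
}
print(failed_items(sample_response))
```

In practice a client library builds the body for you, but checking per-item results remains your responsibility.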
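The _bulk API has no built-in batch size: whatever you put in the request body is the batch. Sizes of 1,000 to 5,000 documents, or around 5MB per request, are often cited as reasonable starting points. However, the optimal size is highly dependent on your hardware, document size and complexity, and cluster load, so you’ll need to experiment. Start with a modest batch, measure throughput in documents per second, and increase the size gradually while monitoring cluster health. Stop when throughput plateaus or you start seeing 429 (rejected execution) responses. For instance, if 5,000-document requests time out or trigger rejections, try 2,000. If they complete quickly and the cluster has headroom, try 10,000.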
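Because both document count and payload size matter, one way to cap batches is on whichever limit is hit first. The sketch below is an illustration only: chunk_actions and its default limits are hypothetical names and values, not part of any Elasticsearch client.

```python
import json


def chunk_actions(docs, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield lists of documents capped by count and by approximate serialized size."""
    batch, batch_bytes = [], 0
    for doc in docs:
        # +1 accounts for the newline that terminates each line in a bulk body.
        size = len(json.dumps(doc).encode("utf-8")) + 1
        if batch and (len(batch) >= max_docs or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch


docs = [{"counter": i, "message": f"event {i}"} for i in range(2500)]
sizes = [len(batch) for batch in chunk_actions(docs, max_docs=1000)]
print(sizes)  # [1000, 1000, 500]
```

The size estimate ignores the action line that precedes each document, so treat max_bytes as an approximate ceiling.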
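Beyond _bulk, you can take bulk submission off your application’s critical path, so that the application keeps processing other work while batches are in flight. Client libraries help with both the batching and the concurrency. For example, the Python Elasticsearch client’s helpers.bulk function accepts an iterator of actions and splits it into _bulk requests for you:
from datetime import datetime, timedelta, timezone
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
# Assumes a local development node without security enabled.
es = Elasticsearch("http://localhost:9200")
start = datetime(2023, 10, 27, 10, 0, 0, tzinfo=timezone.utc)
def generate_actions():
    # Yield one action per document; the helper groups them into _bulk requests.
    for i in range(10000):
        ts = start + timedelta(seconds=i)
        yield {
            "_index": "my-async-index",
            "_source": {
                "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
                "message": f"Async event {i}",
                "counter": i
            }
        }
# raise_on_error=False returns per-document failures instead of raising BulkIndexError.
success_count, errors = bulk(es, generate_actions(), chunk_size=5000, request_timeout=60, raise_on_error=False)
print(f"Indexed {success_count} documents.")
if errors:
    print("Errors encountered:", errors)
Here, bulk(es, generate_actions(), ...) consumes the generator, groups the actions into chunks of chunk_size documents (also capped by max_chunk_bytes, 100MB by default), and sends one _bulk request per chunk. Note that helpers.bulk itself is a blocking call: it returns only after every chunk has been sent. To keep your application from waiting, run it in a background thread or task, use helpers.parallel_bulk to send chunks from a thread pool, or use helpers.async_bulk with the AsyncElasticsearch client.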
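To make the non-blocking pattern concrete, here is a minimal asyncio sketch that submits batches in the background with a cap on in-flight requests. It does not talk to Elasticsearch: submit_batches and fake_send_bulk are hypothetical, and fake_send_bulk merely stands in for a real coroutine (for example, one that wraps helpers.async_bulk).

```python
import asyncio


async def submit_batches(batches, send_bulk, max_in_flight=4):
    """Send every batch with send_bulk, allowing at most max_in_flight at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def send_one(batch):
        async with semaphore:
            return await send_bulk(batch)

    tasks = [asyncio.create_task(send_one(batch)) for batch in batches]
    return await asyncio.gather(*tasks)


async def fake_send_bulk(batch):
    """Hypothetical stand-in for a real bulk call; reports how many docs it 'indexed'."""
    await asyncio.sleep(0.01)
    return len(batch)


async def main():
    batches = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
    # Start submission in the background; it runs whenever this coroutine awaits.
    submission = asyncio.create_task(submit_batches(batches, fake_send_bulk))
    other_work = sum(range(10))  # the application continues with unrelated work
    indexed_counts = await submission
    return other_work, sum(indexed_counts)


result = asyncio.run(main())
print(result)  # (45, 1000)
```

Capping in-flight requests matters because unbounded concurrency simply moves the queue onto the cluster, where it shows up as rejected requests.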
The true power of _bulk and async indexing lies in minimizing the overhead per document. Every request carries fixed costs, such as the HTTP round trip, request parsing, routing to shards, and the transaction-log fsync, while every document carries its own costs for parsing, mapping, and analysis. By sending many documents in one go, you amortize the fixed costs over the whole batch. Segment count, by contrast, is driven mainly by how often the index is refreshed: each refresh can create a new small segment, and background merges later combine small segments into larger ones to improve search performance and reduce resource usage. Fewer, larger segments mean faster searches and less background I/O.
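The amortization argument can be written as a one-line cost model: cost per document equals the per-document cost plus the per-request cost divided by the batch size. The numbers below are made-up placeholders chosen only to show the shape of the curve, not measurements of any cluster.

```python
def cost_per_doc_ms(batch_size, per_request_ms=5.0, per_doc_ms=0.05):
    """Illustrative model: fixed request overhead is shared by every doc in the batch."""
    return per_doc_ms + per_request_ms / batch_size


for batch_size in (1, 10, 100, 1000, 5000):
    print(batch_size, round(cost_per_doc_ms(batch_size), 4))
```

The curve flattens quickly: once the shared overhead is small relative to the per-document cost, larger batches buy little. The model also ignores memory pressure and request rejections, which is why very large batches can make things worse.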
One of the most impactful tuning parameters, and one that is often overlooked, is the refresh_interval setting on your index. By default, it’s set to 1s, meaning that newly indexed documents become visible to search about one second after they are written. For high-volume indexing, especially if your data doesn’t need to be immediately searchable, increasing this interval significantly boosts write throughput. For example, setting index.refresh_interval to "30s" or even "60s" means Elasticsearch performs the expensive refresh operation far less often. A refresh writes the in-memory indexing buffer out as a new segment and opens it for search. It is distinct from a flush, which commits segments to disk and trims the transaction log. Refreshing less often produces fewer, larger segments, which directly reduces I/O, CPU load, and later merge work, allowing more resources to be dedicated to indexing. You can change this dynamically: PUT /my-index/_settings { "index" : { "refresh_interval" : "30s" } }. For a one-off bulk load you can even set it to "-1" to disable refreshes, then restore it when the load finishes.
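A common pattern is to relax the refresh interval for the duration of a load and restore it afterwards, even if the load fails. The sketch below shows that pattern with a context manager. relaxed_refresh is a hypothetical helper, and put_settings is an injected callable so the example runs without a cluster; with the Python client it would wrap es.indices.put_settings, whose exact keyword arguments vary between client versions.

```python
from contextlib import contextmanager


@contextmanager
def relaxed_refresh(put_settings, index, during="30s", after="1s"):
    """Raise refresh_interval while the block runs, then restore it."""
    put_settings(index, {"index": {"refresh_interval": during}})
    try:
        yield
    finally:
        # Restore even if the bulk load raised an exception.
        put_settings(index, {"index": {"refresh_interval": after}})


calls = []


def record_settings(index, settings):
    """Hypothetical stand-in that records what would be sent to the cluster."""
    calls.append((index, settings["index"]["refresh_interval"]))


with relaxed_refresh(record_settings, "my-index"):
    calls.append(("my-index", "bulk load runs here"))

print(calls)
```

Restoring in a finally block avoids leaving an index with a long refresh interval after a failed load.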
Once you’ve mastered bulk and async indexing, the next challenge is often managing shard allocation so that the write load is spread across all nodes and no single node becomes a bottleneck.