Elasticsearch Data Streams, designed for time-series data, are actually a thin abstraction layer over standard Elasticsearch indices, but they automate a critical, error-prone manual process.
Let’s see this in action. Imagine we’re ingesting logs from a web server.
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}
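These cluster settings define the disk-usage thresholds for Elasticsearch’s disk-based shard allocation (the values shown are also the defaults). When a node’s disk usage exceeds low (85%), Elasticsearch stops allocating new shards to that node. When it exceeds high (90%), Elasticsearch tries to relocate shards away from the node to free up space. If it hits flood_stage (95%), Elasticsearch places a read-only block (index.blocks.read_only_allow_delete) on every index that has a shard on the node, so writes are rejected until space is freed. The watermarks are a safety net, not something data streams replace. What data streams automate is the manual routine that keeps time-series indices from growing without bound in the first place: creating a new index on a schedule, redirecting writes to it, and deleting old indices.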
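To make the three thresholds concrete, here is a minimal Python sketch that classifies a node’s disk usage against them. The function name watermark_stage is hypothetical, and the strict "greater than" comparison is a simplifying assumption that mirrors the word "exceeds" above; it is an illustration, not Elasticsearch’s allocation logic.

```python
def watermark_stage(used_pct, low=85.0, high=90.0, flood_stage=95.0):
    """Return the highest disk watermark a node has crossed, or "ok".

    Illustrative only: thresholds default to the values in the cluster
    settings request, and "crossed" is assumed to mean strictly greater.
    """
    if not 0.0 <= used_pct <= 100.0:
        raise ValueError("used_pct must be between 0 and 100")
    if used_pct > flood_stage:
        return "flood_stage"  # indices on this node get a read-only block
    if used_pct > high:
        return "high"         # shards are relocated away from the node
    if used_pct > low:
        return "low"          # no new shards are allocated to the node
    return "ok"


if __name__ == "__main__":
    for pct in (50.0, 87.0, 92.0, 97.0):
        print(pct, watermark_stage(pct))
```

Checking the most severe threshold first keeps each branch a single comparison.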
Here’s how you create a data stream for web logs:
PUT /_data_stream/my-web-logs-datastream
The request takes no body. It does, however, require a matching index template that contains a "data_stream": {} object; without one, Elasticsearch rejects the request. Once the stream exists, my-web-logs-datastream is the single name you write to and search, while Elasticsearch manages the time-based backing indices behind it.
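When the data stream is created, either explicitly or implicitly by indexing the first document into a name that matches a data stream template, Elasticsearch creates the first backing index. It is a hidden index with a generated name such as .ds-my-web-logs-datastream-2023.10.27-000001 (data stream name, creation date, six-digit generation). Every document indexed into a data stream must contain an @timestamp field.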
POST /my-web-logs-datastream/_doc
{
  "@timestamp": "2023-10-27T10:00:00Z",
  "message": "192.168.1.10 - - [27/Oct/2023:10:00:00 +0000] \"GET /index.html HTTP/1.1\" 200 1024"
}
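As an illustration of how backing index names are put together, the following Python sketch builds one from its parts. It assumes the naming scheme used by recent Elasticsearch versions (.ds-, the data stream name, the creation date as yyyy.MM.dd, and a zero-padded six-digit generation); the helper name backing_index_name is hypothetical.

```python
from datetime import date


def backing_index_name(data_stream, created, generation):
    """Build a backing index name in the assumed
    .ds-<data-stream>-<yyyy.MM.dd>-<generation> format."""
    if generation < 1:
        raise ValueError("generation starts at 1")
    # date objects accept strftime directives inside a format spec
    return f".ds-{data_stream}-{created:%Y.%m.%d}-{generation:06d}"


if __name__ == "__main__":
    print(backing_index_name("my-web-logs-datastream", date(2023, 10, 27), 1))
```

In real code it is safer to treat these names as opaque and ask the cluster for them (for example with GET /_data_stream/my-web-logs-datastream) than to reconstruct them.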
The real magic happens with rollover: Elasticsearch creates a new backing index once the current write index meets certain conditions. You define those conditions in an index lifecycle management (ILM) policy and attach the policy to the data stream through its index template.
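First, let’s create an index template that matches the data stream’s name. Because a data stream cannot be created without a matching template, in practice you create the template (and the policy it references) before the data stream itself:
PUT /_index_template/my-web-logs-template
{
  "index_patterns": ["my-web-logs-datastream*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.lifecycle.name": "my-web-logs-data-lifecycle"
    }
  }
}
The index_patterns entry matches the data stream name, not the names of the backing indices. The empty data_stream object is what tells Elasticsearch to create a data stream rather than a plain index for matching names. Every backing index created for the stream inherits the template’s settings, so each one is managed by the my-web-logs-data-lifecycle policy. No index.lifecycle.rollover_alias setting is needed: that setting belongs to alias-based rollover, whereas a data stream tracks its own write index. The data stream name is the stable endpoint you always write to.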
Now, define the lifecycle policy:
PUT /_ilm/policy/my-web-logs-data-lifecycle
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_age": "7d",
            "max_docs": 1000000,
            "max_size": "50gb"
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
This policy defines two phases: hot and delete. In the hot phase, rollover is triggered when the write index reaches 7 days of age, contains 1 million documents, or reaches 50GB in size, whichever comes first. ILM checks these conditions periodically rather than on every write, so an index can overshoot a threshold slightly before it rolls over. On rollover, Elasticsearch creates a new backing index with the next generation number (a name ending in -000002), applies the matching index template to it, and makes it the data stream’s write index. The previous backing index stays searchable through the data stream and through its concrete name. The delete phase removes a backing index 30 days after it rolled over, because ILM measures min_age from the rollover time for indices that have rolled over.
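The "whichever comes first" rule is a logical OR over the configured conditions. The Python sketch below models that check with the thresholds from the policy above. IndexStats and should_roll_over are hypothetical names, the greater-or-equal comparisons are an assumption about boundary behavior, and gb is taken to mean 1024^3 bytes.

```python
from dataclasses import dataclass
from datetime import timedelta

GB = 1024 ** 3  # assumption: "gb" is a binary unit (1024^3 bytes)


@dataclass
class IndexStats:
    """Hypothetical snapshot of the current write index."""
    age: timedelta
    docs: int
    size_bytes: int


def should_roll_over(stats,
                     max_age=timedelta(days=7),
                     max_docs=1_000_000,
                     max_size_bytes=50 * GB):
    """Return True as soon as any one rollover condition is met."""
    return (
        stats.age >= max_age
        or stats.docs >= max_docs
        or stats.size_bytes >= max_size_bytes
    )


if __name__ == "__main__":
    young_and_small = IndexStats(timedelta(days=1), 10_000, 2 * GB)
    busy = IndexStats(timedelta(days=1), 1_500_000, 2 * GB)
    print(should_roll_over(young_and_small), should_roll_over(busy))
```

A quiet index rolls over on age alone, while a busy one usually hits the document or size limit first, which keeps backing indices within a predictable size range.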
The data stream is the single name you use for both writes and searches. A search against my-web-logs-datastream covers all of its backing indices, while a write such as POST /my-web-logs-datastream/_doc is always routed to the current write index, which is the most recently created backing index. Rollover changes which backing index that is; your indexing requests never have to.
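Because data streams are append-only, bulk requests addressed to the stream must use the create action rather than index. As a rough sketch, the Python helper below builds a newline-delimited body that could be sent to POST /my-web-logs-datastream/_bulk. The function name bulk_create_body is hypothetical, and the code only builds the payload; sending it over HTTP is left out.

```python
import json


def bulk_create_body(docs):
    """Build an NDJSON bulk body that uses the `create` action for each doc.

    Each document must carry an @timestamp field, which data streams require.
    """
    lines = []
    for doc in docs:
        if "@timestamp" not in doc:
            raise ValueError("every data stream document needs an @timestamp field")
        lines.append(json.dumps({"create": {}}))  # action line
        lines.append(json.dumps(doc))             # source line
    # The bulk format requires every line, including the last, to end in \n
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    print(bulk_create_body([
        {"@timestamp": "2023-10-27T10:00:00Z", "message": "GET /index.html 200"},
        {"@timestamp": "2023-10-27T10:00:01Z", "message": "GET /about.html 404"},
    ]), end="")
```

The target is given once in the request path, so each action line can stay an empty create object.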
The most surprising thing about data streams is that they don’t introduce a new indexing mechanism. They orchestrate the creation and management of standard indices through index templates and lifecycle management, hiding the work of manual index creation, rollover, and deletion. This automation prevents common errors such as writing to an old index that should no longer receive data, or letting a single index grow unchecked because nobody created the next one.
With classic alias-based rollover, the ILM rollover action moves the write alias named in index.lifecycle.rollover_alias to the newly created index. With a data stream there is no alias to move: rollover creates the next backing index, increments the stream’s generation, and marks the new index as the write index. In both cases subsequent writes go to the fresh index, and the previous index remains accessible by its concrete name but no longer receives new documents.
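The bookkeeping described here can be pictured with a small in-memory model. The sketch below is a toy, not Elasticsearch code: ToyDataStream and its methods are hypothetical, and the index naming scheme is assumed from recent Elasticsearch versions.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ToyDataStream:
    """Toy model: a name, a generation counter, and ordered backing indices."""
    name: str
    generation: int = 0
    backing_indices: list = field(default_factory=list)

    def rollover(self, today):
        """Create the next backing index and make it the write index."""
        self.generation += 1
        index = f".ds-{self.name}-{today:%Y.%m.%d}-{self.generation:06d}"
        self.backing_indices.append(index)
        return index

    @property
    def write_index(self):
        """Writes always go to the most recently created backing index."""
        if not self.backing_indices:
            raise LookupError("data stream has no backing index yet")
        return self.backing_indices[-1]

    def search_targets(self):
        """Searches fan out to every backing index, old and new."""
        return list(self.backing_indices)


if __name__ == "__main__":
    stream = ToyDataStream("my-web-logs-datastream")
    stream.rollover(date(2023, 10, 27))  # initial backing index
    stream.rollover(date(2023, 11, 3))   # first real rollover
    print(stream.write_index)
    print(stream.search_targets())
```

The model shows why no alias is involved: the stream itself records which backing index is newest, and that index receives all writes.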
Once you’ve mastered data streams and ILM for basic time-series data, your next step is exploring advanced ILM phases like warm and cold for optimizing storage and query performance on older data.