Elasticsearch’s disk usage has exceeded a threshold, causing indices to be automatically set to read-only so that the affected nodes don’t run out of disk space entirely.

This is almost always caused by the flood_stage watermark being hit. Elasticsearch has three watermarks to prevent disk space issues:

  • low (default 85%): Once a node’s disk usage exceeds this percentage, Elasticsearch stops allocating new shards to that node (primaries of newly created indices are exempt).
  • high (default 90%): Disk usage is above this percentage, and Elasticsearch will start relocating shards away from nodes that exceed this watermark.
  • flood_stage (default 95%): Disk usage is above this percentage, and Elasticsearch places a read-only block (index.blocks.read_only_allow_delete) on every index that has a shard on the affected node.
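
As a rough illustration of how the thresholds stack up, the sketch below classifies a node’s disk usage percentage against the three watermarks. The function name classify_disk_usage is hypothetical, the defaults assume the stock values (85%, 90%, 95%), and it treats reaching a threshold as exceeding it, so read it as an approximation rather than Elasticsearch’s exact logic.

```shell
#!/bin/sh
# Hypothetical helper: classify an integer disk usage percentage.
# Arguments: USED [LOW HIGH FLOOD]; thresholds default to 85/90/95 (stock values).
# Assumption: watermarks are percentages, not absolute byte values.
classify_disk_usage() {
  used=$1
  low=${2:-85}
  high=${3:-90}
  flood=${4:-95}
  if [ "$used" -ge "$flood" ]; then
    echo "flood_stage"   # indices with a shard on this node become read-only
  elif [ "$used" -ge "$high" ]; then
    echo "high"          # shards are relocated away from this node
  elif [ "$used" -ge "$low" ]; then
    echo "low"           # no new shards are allocated to this node
  else
    echo "ok"
  fi
}
```

For example, classify_disk_usage 92 falls into the high band with the default thresholds, while classify_disk_usage 92 80 85 90 reports flood_stage for a cluster configured with lower watermarks.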

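The flood_stage watermark is the one that triggers the read-only state. When this happens, you’ll see messages in your Elasticsearch logs along these lines (exact wording varies by version):

  flood stage disk watermark [95%] exceeded on [...], all indices on this node will be marked read-only
  high disk watermark exceeded on [...]
  all shards failed for [...]
  index [...] is now read-only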
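
To pull such messages out of a busy log, a simple filter is usually enough. This is a minimal sketch: the helper name watermark_log_lines is hypothetical, and the log path in the usage comment is an assumption that depends on how Elasticsearch was installed.

```shell
#!/bin/sh
# Hypothetical helper: print only the disk-watermark lines from log text on stdin.
# Matches low, high, and flood stage messages, case-insensitively.
watermark_log_lines() {
  grep -i 'disk watermark'
}

# Usage (path is an assumption; adjust it to your installation):
#   watermark_log_lines < /var/log/elasticsearch/my-cluster.log
```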

Here are the most common causes and how to fix them:

  1. Actual Disk Fullness on Nodes: This is the most straightforward cause. One or more of your Elasticsearch nodes have run out of disk space.

    • Diagnosis: Check the disk usage on each node.
      curl -X GET "localhost:9200/_cat/allocation?v"
      
      Look for nodes where disk.percent is higher than your configured flood_stage watermark (default is 95%).
    • Fix: Free up disk space on the affected node(s). This could involve deleting old indices, moving data off the node, or adding more disk capacity.
      # Example: Delete an old index
      curl -X DELETE "localhost:9200/my-old-index-2023.01.01"
      
    • Why it works: Once disk usage drops back below the high watermark, Elasticsearch 7.4 and later automatically removes the read-only block. On older versions you have to remove it yourself.
  2. Large or Numerous Unreferenced Indices: Sometimes, indices that are no longer actively used or referenced by applications can accumulate, consuming significant disk space.

    • Diagnosis: Use the _cat/indices API to list indices and their sizes.
      curl -X GET "localhost:9200/_cat/indices?v&s=store.size:desc"
      
      Identify large indices that haven’t been written to recently.
    • Fix: Delete any indices that are no longer needed.
      # Example: Delete every index matching a pattern. Wildcard deletes are rejected
      # when action.destructive_requires_name is true (the default since 8.0).
      curl -X DELETE "localhost:9200/old-log-archive-2022.*"
      
    • Why it works: Removing these indices directly reduces the total disk space occupied by Elasticsearch data.
  3. Too Many Shards: While not directly a disk fullness issue, having an excessive number of shards, especially small ones, can lead to higher disk usage due to overhead (e.g., Lucene index files, transaction logs). If many nodes are nearing their watermark, a high shard count can push them over the edge.

    • Diagnosis: Check the number of shards per index and per node.
      curl -X GET "localhost:9200/_cat/shards?v"
      curl -X GET "localhost:9200/_cat/indices?v"
      
      Look for indices with a very high number of shards, or many indices with a small number of shards.
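    • Fix: Consolidate data into fewer, larger shards: shrink or reindex over-sharded indices, or adjust your index lifecycle management (ILM) policies to create fewer, larger indices.
      # Example: Shrink an index into a new index with fewer primary shards.
      # Prerequisites: the source must be write-blocked (index.blocks.write: true),
      # a copy of every shard must sit on one node, and the target shard count
      # must be a factor of the source shard count.
      POST /my-old-index/_shrink/my-old-index-shrunk
      {
        "settings": {
          "index.number_of_shards": 1,
          "index.number_of_replicas": 1
        }
      }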
      
    • Why it works: Consolidating shards reduces the overall overhead and disk footprint associated with managing them.
  4. Retention Policies Not Being Applied: If you have Index Lifecycle Management (ILM) configured but it’s not executing correctly, old indices might not be rolled over, deleted, or moved to warmer/colder storage, leading to unchecked growth.

    • Diagnosis: Check your ILM policies and their application to indices.
      curl -X GET "localhost:9200/_ilm/policy/my_ilm_policy"
      curl -X GET "localhost:9200/_cat/indices?v&h=index,health,status,uuid,docs.count,store.size,pri.shards,rep.shards,creation.date.string" | grep my-data-index
      
      Verify if indices are following the expected ILM phases.
    • Fix: Ensure your ILM policies are correctly defined and applied to relevant indices. Troubleshoot any ILM execution errors. You might need to manually force a rollover or deletion of old indices if ILM failed.
      # Example: Manually trigger a rollover (the target must be a write alias or data stream, not a concrete index)
      POST /my-data-alias/_rollover
      
    • Why it works: Properly functioning ILM automates the management of index size and retention, preventing them from growing indefinitely.
  5. Large Lucene Segments: Over time, Elasticsearch creates many small Lucene segments. During merges, these can temporarily consume more disk space. If a merge is happening on a nearly full disk, it could trigger the watermark.

    • Diagnosis: This is hard to confirm directly, but frequent merge activity combined with high disk I/O is a good indicator. You can list segments per shard:
      curl -X GET "localhost:9200/_cat/segments?v&h=index,shard,prirep,segment,docs.count,size"
      
      (Note: This command lists all segments; look for a large number of small ones on a specific shard or index.)
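    • Fix: Force merge older indices that are no longer being written to, which reduces their segment count. This is an I/O-intensive operation that temporarily needs extra disk space, so run it during off-peak hours and only where there is headroom.
      # max_num_segments and only_expunge_deletes are mutually exclusive; use one or the other
      curl -X POST "localhost:9200/my-old-index/_forcemerge?max_num_segments=1"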
      
    • Why it works: Force merging consolidates segments into fewer, larger ones, reducing overhead and freeing up space from deleted documents.
  6. Temporary Disk Spikes During Operations: Operations like shard rebalancing, snapshots, or large reindexing jobs can cause temporary spikes in disk usage. If your disk is already close to the flood_stage watermark, these spikes can trigger the read-only state.

    • Diagnosis: Monitor disk usage during these operations.
      watch -n 5 'curl -s "localhost:9200/_cat/allocation?v"'
      
    • Fix: Temporarily increase the cluster.routing.allocation.disk.watermark.flood_stage setting to a higher value, perform the operation, and then revert it.
      # Temporarily raise flood stage above the 95% default
      curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
      {
        "persistent": {
          "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
        }
      }
      '
      # After the operation, reset to the default (null removes the override)
      curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
      {
        "persistent": {
          "cluster.routing.allocation.disk.watermark.flood_stage": null
        }
      }
      '
      
    • Why it works: This gives Elasticsearch more headroom during intensive operations, preventing premature read-only states. Remember to reset it afterward.
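
The diagnosis step that recurs across the causes above, comparing each node’s disk usage with the flood_stage watermark, can be scripted. The sketch below is one way to do it: the helper name flag_full_nodes is hypothetical, it expects _cat/allocation output restricted to the disk.percent and node columns on stdin, and it assumes node names contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: print the names of nodes at or above a usage threshold.
# Input (stdin): lines of "<disk.percent> <node>", as produced by
#   curl -s "localhost:9200/_cat/allocation?h=disk.percent,node"
# Argument: threshold percentage (defaults to 95, the stock flood_stage value).
flag_full_nodes() {
  awk -v limit="${1:-95}" '
    # Skip rows without a numeric percentage (for example the UNASSIGNED row).
    $1 ~ /^[0-9]+$/ && $1 + 0 >= limit + 0 { print $2 }
  '
}

# Usage:
#   curl -s "localhost:9200/_cat/allocation?h=disk.percent,node" | flag_full_nodes 95
```

Passing 90 instead of 95 lists nodes that have already crossed the default high watermark, which gives some warning before indices are blocked.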

After applying fixes, you might need to remove the read-only block manually. Elasticsearch 7.4 and later releases the block automatically once disk usage falls below the high watermark; on older versions, or if you don’t want to wait, reset the index.blocks.read_only_allow_delete index setting. Note that cluster.routing.allocation.enable only controls shard allocation and does not lift the block.

# Example: Remove the read-only block from all indices (null resets the setting)
curl -X PUT "localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d'
{
  "index.blocks.read_only_allow_delete": null
}
'

The next error you’ll hit is likely related to cluster.routing.allocation.enable being set to none if you manually disabled it to prevent shard movement, or a cluster_block_exception if you try to write to an index that is still marked read-only and haven’t explicitly unblocked it.
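
One way to confirm that writes are flowing again is to send a small test write and inspect the response for that exception. This is a minimal sketch: the helper name is_blocked_response is hypothetical, and the index name and probe document in the usage comment are placeholders, so point it at an index where a throwaway document is acceptable.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if a response body reports a
# cluster_block_exception, fail (exit 1) otherwise.
is_blocked_response() {
  case "$1" in
    *cluster_block_exception*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (index name and document are placeholders):
#   resp=$(curl -s -X POST "localhost:9200/my-index/_doc" \
#     -H 'Content-Type: application/json' -d '{"probe": true}')
#   if is_blocked_response "$resp"; then echo "my-index is still read-only"; fi
```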
