Your Elasticsearch cluster is refusing to write new data because it’s running out of disk space.

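Here’s what’s actually happening: when a node’s disk usage crosses the flood-stage watermark (95% by default), Elasticsearch marks every index that has a shard on that node as read-only by setting index.blocks.read_only_allow_delete. Writes to those indices then fail with a ClusterBlockException. This is a safety mechanism, not a failure of Elasticsearch itself, and a clear indicator that your storage is critically low.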

Common Causes and Fixes:

  1. Full Disk on Master Node:

    • Diagnosis: Check disk usage on your master nodes.
      ssh your_master_node_ip 'df -h /'
      
    • Cause: Even if data nodes have space, if a master node’s disk is full, it can prevent cluster operations, including writes, due to its role in cluster state management.
    • Fix: Free up space on the master node by deleting old logs, temporary files, or unused Docker images. If the node is dedicated to Elasticsearch and consistently fills up, you likely need to increase its disk size or move Elasticsearch data to a different disk.
      # Example: Remove old log files
      ssh your_master_node_ip 'sudo find /var/log/elasticsearch/ -type f -mtime +30 -delete'
      
    • Why it works: Master nodes need stable disk space to operate. Clearing unnecessary files allows the master to function correctly and remove the cluster block.
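To make the disk check for this cause repeatable, here is a minimal sketch that reads the usage percentage from `df -P` and warns past a limit. The function names and the 80% default are illustrative assumptions, not Elasticsearch settings.

```shell
# usage_pct PATH
# Print the used-space percentage (integer, no % sign) of the
# filesystem that contains PATH, using POSIX `df -P` output.
usage_pct() {
  df -P "$1" | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# warn_if_full PATH [LIMIT]
# Print a warning and return 1 when usage is at or above LIMIT (default 80).
warn_if_full() {
  pct=$(usage_pct "$1")
  limit=${2:-80}
  if [ "$pct" -ge "$limit" ]; then
    echo "WARNING: $1 is ${pct}% full (limit ${limit}%)"
    return 1
  fi
  echo "OK: $1 is ${pct}% full"
}
```

Run it on the node itself, for example `warn_if_full / 80`, or point it at the Elasticsearch data path instead of `/`.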
  2. Full Disk on Data Nodes:

    • Diagnosis: Check disk usage on all data nodes.
      ssh your_data_node_ip 'df -h /path/to/elasticsearch/data'
      
      Or ask Elasticsearch how much disk each data node is using:
      curl -X GET "localhost:9200/_cat/allocation?v"
      
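    • Cause: Data nodes store the actual shard data. As a disk fills, Elasticsearch reacts in stages: at the low watermark (default 85%) it stops allocating new shards to the node, at the high watermark (default 90%) it starts relocating shards away, and at the flood-stage watermark (default 95%) it marks every index with a shard on that node as read-only, which blocks writes.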
    • Fix:
      • Add more disk space: The most straightforward solution.
      • Add more data nodes: Distribute data across more nodes.
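      • Delete old indices: If you have indices that are no longer needed, delete them. Wildcard deletes are rejected when action.destructive_requires_name is true (the default in 8.x), so you may need to name each index explicitly.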
        curl -X DELETE "localhost:9200/old_index_name*"
        
      • Configure disk watermarks: You can adjust the thresholds at which Elasticsearch takes action. Caution: This is a temporary workaround and doesn’t solve the underlying space issue.
        PUT _cluster/settings
        {
          "persistent": {
            "cluster.routing.allocation.disk.watermark.low": "85%",
            "cluster.routing.allocation.disk.watermark.high": "90%",
            "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
          }
        }
        
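        Cluster settings can still be updated while indices are blocked, because the block applies to the indices rather than to the settings API. From version 7.4 onward the block is released automatically once disk usage falls below the high watermark; on older versions you must reset index.blocks.read_only_allow_delete yourself after freeing space.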
    • Why it works: By providing more disk capacity or removing data, you bring the disk usage below the configured watermarks, allowing Elasticsearch to resume normal operations.
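On versions that don't release the read-only block automatically, or when you don't want to wait, the block can be inspected and reset through the index settings API. The sketch below assumes an unsecured cluster at localhost:9200, like the other examples here; `ES_HOST`, `list_blocked_indices`, and `clear_readonly_block` are hypothetical names chosen for illustration.

```shell
ES_HOST="${ES_HOST:-localhost:9200}"

# Show the read_only_allow_delete setting for every index that has it set.
list_blocked_indices() {
  curl -s -X GET "$ES_HOST/_all/_settings/index.blocks.read_only_allow_delete?flat_settings=true&pretty"
}

# Reset the block on all indices; null restores the default (no block).
clear_readonly_block() {
  curl -s -X PUT "$ES_HOST/_all/_settings" \
    -H 'Content-Type: application/json' \
    -d '{"index.blocks.read_only_allow_delete": null}'
}
```

Free disk space first. If usage is still above the flood-stage watermark, Elasticsearch will reapply the block shortly after you clear it.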
  3. Large Number of Unassigned Shards:

    • Diagnosis: Check for unassigned shards.
      curl -X GET "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason&s=state" | grep UNASSIGNED
      
    • Cause: If you’ve recently lost nodes, or if the cluster has been unstable, shards might be unassigned. Unassigned shards are usually a symptom of disk pressure rather than its cause: a node above the low watermark won’t accept new shards, so Elasticsearch may find no suitable node to allocate them to. Writes to an index whose primary shard is unassigned will fail.
    • Fix: Ensure your data nodes have sufficient disk space and that no indices are still read-only. Once the disk issue is resolved, ask Elasticsearch to retry allocations that previously failed with POST _cluster/reroute?retry_failed=true. If shards still won’t allocate, restarting Elasticsearch on the affected node can re-initiate allocation as a last resort.
      # Example: Restarting Elasticsearch on a node
      sudo systemctl restart elasticsearch
      
    • Why it works: Resolving the underlying cause of unassigned shards (usually disk space or network issues) allows Elasticsearch to properly allocate them, reducing cluster stress.
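When many shards are unassigned, grouping them by reason is quicker than reading the raw list. This is a minimal sketch: `summarize_unassigned` is a hypothetical helper that expects the same columns as the diagnosis command for this cause (index, shard, prirep, state, unassigned.reason) on stdin.

```shell
# summarize_unassigned
# Read `_cat/shards` rows (index shard prirep state unassigned.reason)
# on stdin and print "REASON COUNT" for each unassigned reason.
summarize_unassigned() {
  awk '$4 == "UNASSIGNED" { count[$5]++ }
       END { for (r in count) print r, count[r] }' | sort
}

# Usage against a live cluster (assumes localhost:9200 without auth):
# curl -s "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | summarize_unassigned
```

For the detailed reasoning behind a single shard, the cluster allocation explain API (GET _cluster/allocation/explain) reports which allocation deciders rejected each node, including the disk threshold decider.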
  4. Too Many Small Indices:

    • Diagnosis: Count your indices.
      curl -s -X GET "localhost:9200/_cat/indices?h=index" | wc -l
      
      Check index sizes.
      curl -X GET "localhost:9200/_cat/indices?h=index,health,status,uuid,store.size&s=store.size:desc"
      
    • Cause: Each index has overhead (metadata, segment files). A very large number of indices, especially if many are small and active, can consume significant file handles and disk space for metadata, even if the raw data size isn’t massive. This can indirectly trigger disk watermarks or impact performance to the point of blocking.
    • Fix: Implement an Index Lifecycle Management (ILM) policy that rolls over to a new index by age or size and deletes old indices automatically. Create the policy first, then attach it through a composable index template. Rollover also requires either a data stream or a write alias configured through index.lifecycle.rollover_alias.
      PUT _ilm/policy/my_log_ilmpolicy
      {
        "policy": {
          "phases": {
            "hot": {
              "min_age": "0ms",
              "actions": {
                "rollover": {
                  "max_age": "7d",
                  "max_primary_shard_size": "50gb"
                }
              }
            },
            "delete": {
              "min_age": "30d",
              "actions": {
                "delete": {}
              }
            }
          }
        }
      }
      PUT _index_template/my_index_template
      {
        "index_patterns": ["my-logs-*"],
        "template": {
          "settings": {
            "index.lifecycle.name": "my_log_ilmpolicy"
          }
        }
      }
      
    • Why it works: Rolling over by age and size keeps the number of indices in check, and the delete phase removes old data automatically, reducing overall disk footprint and metadata overhead.
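To see how many small indices you actually have before designing a policy, you can filter the `_cat/indices` output by size. A minimal sketch, assuming sizes are requested in bytes with `bytes=b`; `small_indices` is a hypothetical helper, and the 1 GiB limit in the usage line is only an example value.

```shell
# small_indices LIMIT_BYTES
# Read "index store.size" rows on stdin (sizes in bytes) and print
# the names of indices whose store size is below LIMIT_BYTES.
# Rows without a size column (for example closed indices) are skipped.
small_indices() {
  awk -v limit="$1" 'NF >= 2 && $2 + 0 < limit + 0 { print $1 }'
}

# Usage against a live cluster (assumes localhost:9200 without auth):
# curl -s "localhost:9200/_cat/indices?h=index,store.size&bytes=b" | small_indices 1073741824
```

Piping the result through `wc -l` gives a quick count of indices under the limit.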
  5. Corrupted Index Metadata or Segments:

    • Diagnosis: Check Elasticsearch logs for specific shard corruption errors.
      tail -n 100 /var/log/elasticsearch/your_cluster_name.log
      
    • Cause: While rare, disk errors or unexpected shutdowns can sometimes lead to corrupted index files or metadata. Corruption doesn’t trigger the disk-full block by itself, but it can leave shards unassigned or unreadable, which complicates recovery on a cluster that is already short on disk.
    • Fix: This is the trickiest. The cluster.routing.allocation.enable setting does not remove a write block; it controls whether shards may be allocated. If allocation was restricted earlier (for example, set to primaries or none during maintenance), set it back to all, the default, so Elasticsearch can recover the shard from a healthy copy. In extreme cases, delete the problematic index after ensuring you have backups or can afford data loss for that index.
      # Re-enable allocation for all shard types (this is the default value)
      PUT _cluster/settings
      {
        "persistent": {
          "cluster.routing.allocation.enable": "all"
        }
      }
      
      Then, try to delete the problematic index:
      curl -X DELETE "localhost:9200/corrupted_index_name"
      
    • Why it works: By re-enabling allocation or removing the corrupt data, you allow the cluster to regain a stable state.
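The last 100 log lines may not include the corruption error, so searching the whole file can help. This is a rough sketch: `corruption_hits` is a hypothetical helper, and it assumes corruption is reported with the word "corrupt" somewhere in the line (as in Lucene's CorruptIndexException), which may not cover every case.

```shell
# corruption_hits LOGFILE
# Print lines mentioning "corrupt" (case-insensitive), which also
# matches CorruptIndexException, prefixed with their line numbers.
corruption_hits() {
  grep -n -i 'corrupt' "$1"
}

# Usage (the log path follows the diagnosis command for this cause):
# corruption_hits /var/log/elasticsearch/your_cluster_name.log
```

`grep` exits with status 1 when nothing matches, so the function can also be used as a quick yes/no check in a script.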
  6. cluster.routing.allocation.disk.watermark.flood_stage is too low:

    • Diagnosis: Check current cluster settings.
      curl -X GET "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true"
      
    • Cause: The flood_stage watermark is the absolute last line of defense before Elasticsearch stops all writes. If this is set too aggressively (e.g., below 90%), even if you have some free space, the cluster might block writes.
    • Fix: Increase the flood_stage watermark. This is a temporary band-aid if your disks are genuinely full, but if the value is just set too low, this will resolve it.
      PUT _cluster/settings
      {
        "persistent": {
          "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
        }
      }
      
    • Why it works: This setting tells Elasticsearch at what disk usage percentage it should absolutely stop writes. Increasing it allows writes to continue up to a higher usage threshold, giving you more time to address actual disk space issues.
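The three watermarks can be read as a simple ladder. The sketch below maps a used-disk percentage to the stage it would fall in, assuming integer percentages and the default 85/90/95 thresholds unless others are passed. `watermark_stage` is a hypothetical helper for reasoning about your settings; Elasticsearch's own comparison may differ at the exact boundary.

```shell
# watermark_stage USED_PCT [LOW] [HIGH] [FLOOD]
# Print which disk watermark stage a usage percentage falls into:
#   ok, low, high, or flood_stage.
watermark_stage() {
  used=$1
  low=${2:-85}
  high=${3:-90}
  flood=${4:-95}
  if [ "$used" -ge "$flood" ]; then
    echo "flood_stage"   # indices on the node become read-only
  elif [ "$used" -ge "$high" ]; then
    echo "high"          # shards are relocated away from the node
  elif [ "$used" -ge "$low" ]; then
    echo "low"           # no new shards are allocated to the node
  else
    echo "ok"
  fi
}
```

For example, `watermark_stage 92` falls in the high stage with the defaults, while `watermark_stage 92 80 85 90` shows that the same usage would already block writes if flood_stage were set to 90%.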

After resolving the disk space issue, you might temporarily see unassigned shards and a yellow or red cluster health while Elasticsearch reallocates and rebalances shards.
