etcd, the distributed key-value store, is the beating heart of Kubernetes, and understanding its command-line tools is crucial for any operator.

Here’s etcd in action, demonstrating a simple member list operation:

ETCDCTL_API=3 etcdctl member list

Run against a reachable etcd cluster, this command prints one line per member: its ID, status, name, peer URLs, and client URLs (etcd 3.4 and later append an is-learner column). It does not report health; the endpoint health command covered below does that. For instance:

3b2a81a7d2835e24, started, xxxx-xxxx-xxxx-xxxx, https://10.0.0.1:2380, https://10.0.0.1:2379
f7b4a1b7d2835e25, started, xxxx-xxxx-xxxx-xxxx, https://10.0.0.2:2380, https://10.0.0.2:2379
a1c2b3d4e5f6a7b8, started, xxxx-xxxx-xxxx-xxxx, https://10.0.0.3:2380, https://10.0.0.3:2379

The core problem etcd solves is providing a reliable, consistent, and highly available source of truth for distributed systems like Kubernetes. It ensures that all nodes in the cluster agree on the state of the system, even in the face of network partitions or node failures, as long as a majority of members can still reach each other. Internally, etcd uses the Raft consensus algorithm to achieve this consistency. Each etcd member maintains a replicated log of all state changes. When a client makes a request to write data, that request is forwarded to the leader of the Raft group. The leader proposes the change to its followers, and once a majority of members (counting the leader) have persisted the entry, it’s committed and applied to the state machine. Reads can be sent to any member. By default they are linearizable: the member confirms with the leader that it is up to date before answering, so the result reflects the latest committed write. Serializable reads (etcdctl get --consistency=s) skip that check and are answered from the member’s local state, which is faster but may return stale data.
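
The majority rule described above is plain arithmetic, and it explains why etcd clusters are usually sized at 3 or 5 members. The following is a minimal sketch; the function names quorum and fault_tolerance are made up for illustration and are not part of etcd or etcdctl.

```shell
#!/bin/sh
# Hypothetical helpers (not etcd commands) illustrating Raft majority math.

# quorum N: smallest majority of an N-member cluster.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# fault_tolerance N: members that can fail while a majority still remains.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

Going from 3 to 4 members raises the quorum from 2 to 3 without tolerating any additional failure, which is why even-sized clusters are rarely worth the extra member.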

The etcdctl utility is your primary interface for interacting with an etcd cluster. It allows you to perform administrative tasks, inspect cluster health, and even directly manipulate data. Understanding these commands empowers you to diagnose issues, manage cluster membership, and ensure the overall stability of your Kubernetes control plane.

Key etcdctl Commands:

  • member list: As shown above, this command lists all members of the etcd cluster with their ID, status (started or unstarted), name, and network endpoints. This is your first stop for checking cluster membership and identifying which nodes are participating. It does not tell you whether a member is currently responding.

  • endpoint health: This command checks the health of each etcd endpoint. It’s a more granular check than member list and will tell you if specific members are responding.

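    ETCDCTL_API=3 etcdctl endpoint health --endpoints=$(ETCDCTL_API=3 etcdctl member list | awk -F', ' '{print $5}' | paste -sd, -)
    

    This command constructs a comma-separated list of client endpoints from the member list output (the fifth field; the fourth holds the peer URLs) and then checks their health. Recent etcdctl releases can skip the parsing: endpoint health --cluster queries every member’s client URLs directly.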

  • endpoint status: This provides more detailed information about each endpoint: its etcd version, database size, whether it is the current leader, and its Raft term and index. This is invaluable for diagnosing Raft-related issues.

    ETCDCTL_API=3 etcdctl endpoint status --endpoints=$(ETCDCTL_API=3 etcdctl member list | awk -F', ' '{print $5}' | paste -sd, -) --write-out=table
    
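  • alarm list: etcd raises an alarm when a member hits a condition it cannot recover from on its own. The most common is NOSPACE, raised when the database grows past its backend quota (--quota-backend-bytes); while it is active the cluster rejects writes. This command lists any active alarms.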

    ETCDCTL_API=3 etcdctl alarm list
    
  • alarm disarm: Alarms do not clear themselves. After resolving the underlying issue (for NOSPACE, compacting old revisions and defragmenting), disarm the alarm so the cluster accepts writes again.

    ETCDCTL_API=3 etcdctl alarm disarm
    
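  • defrag: Compaction and deletes leave free pages inside etcd’s database file that are not returned to the filesystem. This command rewrites the file to release that space; it does not compact history itself. It acts only on the endpoints you pass with --endpoints, and a member blocks reads and writes while it is being defragmented, so run it on one member at a time as periodic maintenance.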

    ETCDCTL_API=3 etcdctl defrag
    
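  • move-leader: In rare cases, such as draining a node before maintenance, you might want to hand leadership to a specific member. This command asks the current leader to transfer leadership to the member whose ID you give, so the request must be sent to the current leader’s endpoint. Use with caution.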

    ETCDCTL_API=3 etcdctl move-leader <new-leader-member-id>
    
  • snapshot save: Essential for backups. This command saves a snapshot of the keyspace to a file. Point it at exactly one endpoint; the snapshot is taken from that member.

    ETCDCTL_API=3 etcdctl snapshot save snapshot.db
    
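  • snapshot restore: Used to restore etcd from a snapshot. This is critical for disaster recovery. The command writes a fresh data directory from the snapshot file and does not touch a running cluster; you then start etcd against the new directory. From etcd 3.5 onward this subcommand is deprecated in etcdctl in favor of etcdutl snapshot restore.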

    ETCDCTL_API=3 etcdctl snapshot restore snapshot.db --data-dir /var/lib/etcd-restore
    
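  • Testing alarms: etcdctl has no subcommand that raises an alarm by hand; alarm supports only list and disarm. To exercise your alarm handling, start a disposable test member with a deliberately small backend quota and write data until NOSPACE fires.

    etcd --quota-backend-bytes=$((16*1024*1024))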
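
The alarm list, defrag, and alarm disarm commands above are typically used together when a member hits its space quota. The sketch below strings them into one sequence, under several assumptions: the function names latest_revision and recover_nospace are hypothetical, the JSON sample is a trimmed, made-up fragment of endpoint status output, TLS flags (--cacert, --cert, --key) are omitted, and recover_nospace is only defined, never called, because it changes cluster state.

```shell
#!/bin/sh
# latest_revision: read `etcdctl endpoint status --write-out=json` on stdin
# and print the first "revision" number found in it.
latest_revision() {
  grep -o '"revision":[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$'
}

# recover_nospace ENDPOINT: compact, defragment, then clear the alarm.
# Defined only; call it yourself against a cluster you intend to change.
recover_nospace() {
  endpoint=$1
  rev=$(etcdctl --endpoints="$endpoint" endpoint status --write-out=json | latest_revision) &&
  etcdctl --endpoints="$endpoint" compaction "$rev" &&  # discard history older than $rev
  etcdctl --endpoints="$endpoint" defrag &&             # release the space compaction freed
  etcdctl --endpoints="$endpoint" alarm disarm &&       # clear the alarm
  etcdctl --endpoints="$endpoint" alarm list            # list whatever alarms remain
}

# Illustrative input only: a trimmed, invented fragment of the JSON output.
sample='[{"Endpoint":"https://10.0.0.1:2379","Status":{"header":{"revision":1516,"raft_term":7}}}]'
printf '%s\n' "$sample" | latest_revision
```

The order matters: compaction marks old revisions as free, defrag returns that space to the filesystem, and only then does disarming the alarm stick. Because defrag works per endpoint, a real run would repeat that step for each member.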
    

One of the most subtle aspects of etcd operations is understanding the interaction between Raft terms and Raft indexes. A Raft term is a period of leadership: every election starts a new term, so the term increments whenever leadership changes. The Raft index is the position of an entry in the replicated log; it increases monotonically for the life of the cluster and does not reset when the term changes. When reading endpoint status, a member whose index trails the others is lagging behind the leader, and a term that keeps climbing points to repeated leader elections.
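
As a rough illustration of that comparison, the sketch below reports how far each member trails the most advanced one. It assumes you have already reduced endpoint status output to three whitespace-separated columns (endpoint, term, index); the column positions in etcdctl’s own output differ between releases, so that extraction step is left out. The function name raft_lag and the sample numbers are invented for the example.

```shell
#!/bin/sh
# raft_lag: read "endpoint term index" lines on stdin and print, for each
# endpoint, its term and how many log entries it trails the highest index.
raft_lag() {
  awk '
    {
      ep[NR] = $1; term[NR] = $2 + 0; idx[NR] = $3 + 0
      if (idx[NR] > max) max = idx[NR]
    }
    END {
      for (i = 1; i <= NR; i++)
        printf "%s term=%d lag=%d\n", ep[i], term[i], max - idx[i]
    }'
}

# Made-up sample: the third member is 22 entries behind the other two.
printf '%s\n' \
  'https://10.0.0.1:2379 7 1520' \
  'https://10.0.0.2:2379 7 1520' \
  'https://10.0.0.3:2379 7 1498' | raft_lag
```

A small, shrinking lag is normal under write load. A lag that keeps growing, or members reporting different terms across repeated samples, is the pattern worth investigating.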

The next critical concept to grasp is etcd’s leasing mechanism, which provides a way to set TTLs on keys, automatically expiring them after a set duration.
