CAP Theorem Trade-offs: Consistency vs. Availability

The CAP theorem doesn’t say you must sacrifice one of these properties; it says you must choose which one to sacrifice when a network partition occurs.

Let’s watch a distributed database like etcd handle a scenario where it’s split into two networks. Imagine we have three etcd nodes, etcd-1, etcd-2, and etcd-3, all in a cluster. We’re going to simulate a network partition between etcd-1 and the other two.

Here’s our etcd configuration, simplified:

name: etcd-1
listen-client-urls: http://0.0.0.0:2379
advertise-client-urls: http://<ETCD_1_IP>:2379
listen-peer-urls: http://0.0.0.0:2380
initial-advertise-peer-urls: http://<ETCD_1_IP>:2380
initial-cluster: etcd-1=http://<ETCD_1_IP>:2380,etcd-2=http://<ETCD_2_IP>:2380,etcd-3=http://<ETCD_3_IP>:2380
initial-cluster-state: new

(We’d have similar configs for etcd-2 and etcd-3, with their respective IPs.)

Now, let’s simulate the partition. We’ll use iptables on etcd-1 to block traffic to etcd-2 and etcd-3 on port 2380 (the peer port).

# On etcd-1
iptables -A OUTPUT -d <ETCD_2_IP> -p tcp --dport 2380 -j DROP
iptables -A OUTPUT -d <ETCD_3_IP> -p tcp --dport 2380 -j DROP

Before the partition, if we curl http://<ETCD_1_IP>:2379/v3/kv/put?key=d29yZCBhbmQgbW9yZSBzZWFyY2g=&value=Y29tcGxleCBjb25jZXB0cw== (which is base64 for "word and more search=" and "complex concepts="), any node in the cluster could accept the write. etcd, using Raft, would elect a leader, and that leader would replicate the write to a majority of nodes. If etcd-1 was the leader, it would write it locally and send it to etcd-2 and etcd-3. If etcd-2 was the leader, it would write it locally and send it to etcd-1 and etcd-3. The system is consistent (all nodes agree on the order of operations) and available (any client can get a response).

Once the partition happens, etcd-1 can no longer talk to etcd-2 and etcd-3 (and vice-versa). etcd nodes are constantly heartbeating to maintain their view of the cluster. etcd-1 will eventually time out its connections to etcd-2 and etcd-3. From etcd-1’s perspective, it’s still part of a cluster, but it can’t reach a majority of its peers.

If we try to write to etcd-1 now: curl http://<ETCD_1_IP>:2379/v3/kv/put?key=bGlnaHQ=&value=ZGFyayA= (base64 for "light=" and "dark="). etcd-1 knows it cannot get a quorum (a majority of nodes) to agree on this write because it’s isolated. To maintain consistency, it will reject the write. The client gets an error: {"error":"etcdserver: no leader"} or {"error":"etcdserver: leader changed"} if it briefly thought it was the leader and then realized it couldn’t confirm. This is the Consistency and Partition Tolerance (CP) choice.

Meanwhile, etcd-2 and etcd-3 can still talk to each other. They form a majority of the original cluster. If etcd-2 is the leader among them, and we try to write to etcd-2: curl http://<ETCD_2_IP>:2379/v3/kv/put?key=YmxhY2s=&value=d2hpdGU= (base64 for "black=" and "white="). This write will succeed because etcd-2 and etcd-3 can form a quorum. The client gets a {"header":{"cluster_id":...},"index":...} response. This is the Availability and Partition Tolerance (AP) choice.

The etcd documentation explicitly states it’s a CP system. It prioritizes consistency. During a network partition, it will sacrifice availability for the minority partition. The isolated node (etcd-1 in our example) becomes unavailable for writes. It will still serve reads from its local state, but these reads might be stale if the majority partition has accepted newer writes.

The mental model here is about distributed consensus. Protocols like Raft (which etcd uses) are designed to ensure that all non-faulty nodes agree on the order of operations. This agreement requires communication. When that communication breaks (a partition), the protocol has to decide: do we allow operations to proceed without guaranteed agreement (availability), or do we halt operations to ensure agreement is never violated (consistency)?

The surprising thing about the CAP theorem is that it’s often presented as a fundamental trade-off that’s always in effect. In reality, the "P" (Partition Tolerance) is a given in any distributed system. Network failures will happen. So, the real choice is between C and A when P occurs. Most systems are designed to be CP or AP. Databases like Cassandra or DynamoDB are often AP, allowing writes to succeed in a minority partition but potentially leading to divergent states that need reconciliation later. Systems like etcd or ZooKeeper are CP, ensuring that if a partition occurs, only the partition with a majority of nodes can make progress, guaranteeing consistency but making the minority partition unavailable.

The critical insight is that the choice between C and A is not constant; it’s a choice made during a network partition event. The system’s configuration and underlying consensus protocol dictate which path it will take when faced with this inevitable scenario.

The next problem you’ll run into is how to handle the data divergence in AP systems after a partition heals.