The most surprising truth about distributed systems is that they are fundamentally designed to fail, and the best ones embrace this.

Let’s watch a simple distributed system in action: a basic key-value store. Imagine two nodes, node-1 and node-2.

node-1

{
  "key": "mykey",
  "value": "myvalue"
}

If a client wants to write {"key": "mykey", "value": "newvalue"} to this system, it might send the request to node-1. node-1 then replicates this write to node-2.

node-2

{
  "key": "mykey",
  "value": "newvalue"
}

Now, if node-1 crashes, node-2 still holds the latest data. A client can query node-2 directly and get {"key": "mykey", "value": "newvalue"}. The system continued to operate despite a failure.

This is the core problem distributed systems solve: availability in the face of failure. Traditional single-server systems are a single point of failure. If that server goes down, the entire application is unavailable. Distributed systems mitigate this by spreading data and computation across multiple, independent machines.

The mental model involves several key components:

  1. Replication: As seen above, data is copied across multiple nodes. This ensures that if one node fails, others have a copy of the data. Replication can be synchronous (all nodes acknowledge a write before it’s considered complete) or asynchronous (writes are acknowledged by a primary node and then propagated). Synchronous replication offers stronger consistency but higher latency. Asynchronous replication offers lower latency but a risk of data loss if a primary fails before replication completes.

  2. Partitioning (Sharding): For very large datasets, replicating everything to every node becomes impractical. Partitioning divides the data into smaller chunks (shards) and distributes these shards across different nodes. A request for a specific key is routed to the node(s) holding that key’s shard. This scales the system’s capacity both for data storage and for request throughput.

  3. Consistency Models: How do we ensure that all nodes eventually agree on the state of the data, especially when writes are happening concurrently and nodes might fail? This is where consistency models come in.

    • Strong Consistency: Every read returns the most recent write. This is simple to reason about but can be slow and complex to implement in a distributed setting.
    • Eventual Consistency: If no new writes are made to a given data item, eventually all reads will return the last written value. This is more common in highly available systems, as it tolerates temporary network partitions and node failures more gracefully.
    • Causal Consistency: Preserves the order of causally related operations. If operation A happens before operation B, then any node that sees B must also see A.
  4. Consensus Algorithms (e.g., Raft, Paxos): When strong consistency is required, especially for critical operations like leader election or managing replicated state machines, nodes need a way to agree on a single decision. Consensus algorithms provide a robust mechanism for a group of nodes to reach agreement, even if some nodes fail or network messages are lost or delayed. They are the backbone of many distributed coordination services.

Consider a leader election process. In a cluster of nodes, one node needs to be designated as the "leader" to coordinate writes or manage a shard. If the current leader fails, the remaining nodes must elect a new leader. A consensus algorithm like Raft allows the nodes to propose themselves as leaders, vote, and eventually agree on a single, new leader, ensuring continuity of service.

The trickiest part of distributed systems is often not the failure itself, but the detection of failure and the subsequent recovery. A node might be slow, or it might be truly dead. Distinguishing between these two states is hard, especially across a network. This is why many systems err on the side of assuming a node is available unless there’s strong evidence otherwise, which can lead to "split-brain" scenarios where partitions of the network believe different nodes are the leader.

The next concept you’ll grapple with is how to manage stateful services in a distributed environment, particularly when those services need to be scalable and resilient.

Want structured learning?

Take the full Distributed Systems course →