A surprising number of distributed system outages stem from a single, fundamental misunderstanding: treating distributed components as if they were monolithic.
Imagine a simple web service with a frontend, an API, and a database. In a monolith, if the database is slow, the whole application grinds to a halt. In a distributed system, the frontend might still be responding and the API might be happily accepting requests, yet something is broken. That "something" is usually a failure to gracefully handle the latency and partial failures inherent in distributed communication.
Here’s a common scenario: a service relies on a downstream dependency, say a user profile service. The primary service makes a synchronous request. If the user profile service is slow or unavailable, the primary service’s thread gets blocked. If this happens enough, the primary service’s connection pool saturates, or its own request queue fills up, leading to cascading failures.
Let’s look at the common culprits and how to fix them.
1. The "Synchronous Everything" Antipattern
What broke: The primary service is blocking indefinitely on a downstream service, leading to resource exhaustion and unresponsiveness.
Common Causes & Fixes:
-
Cause: Synchronous HTTP calls to downstream services without timeouts.
- Diagnosis: Observe thread dumps during high load. You'll see many threads stuck in java.net.SocketInputStream.socketRead0 or similar network I/O calls. Monitor network latency and error rates to the downstream service.
- Fix: Implement aggressive timeouts on all outbound HTTP requests. For example, in Java with Apache HttpClient, set setConnectTimeout(2000) and setSocketTimeout(3000) (2 seconds to connect, 3 seconds to read).
- Why it works: This prevents threads from waiting forever, allowing them to return errors or retry.
-
Cause: Blocking database calls without query timeouts.
- Diagnosis: Database monitoring showing long-running queries. Application logs showing slow query durations.
- Fix: Set a query timeout on your database connection or within your ORM. For PostgreSQL, use SET statement_timeout TO 5000; (the value is in milliseconds, so 5 seconds). For JDBC, call statement.setQueryTimeout(5) (the value is in seconds) on the Statement you are about to execute.
- Why it works: Ensures that a single slow query doesn't tie up a database connection and application thread indefinitely.
-
Cause: Relational database locks held for too long due to complex transactions.
- Diagnosis: Database monitoring tools showing lock wait times and deadlocks.
- Fix: Break down large transactions into smaller, independent units. Use optimistic locking (version numbers) instead of pessimistic locking where possible.
- Why it works: Reduces the window where resources are held exclusively, minimizing contention.
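The outbound HTTP timeout fix from the first cause above can be sketched without third-party dependencies. This sketch swaps in the JDK's built-in java.net.http.HttpClient (Java 11+) instead of Apache HttpClient, and the class name ProfileClient and the fetchProfile helper are hypothetical names chosen for illustration. Note that the JDK's per-request timeout bounds the wait for the response rather than each individual socket read, so it is an approximation of the connect/read pair described above, not an exact equivalent.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;

// Hypothetical wrapper around a downstream "user profile" call.
final class ProfileClient {

    // Fail fast if the TCP connection cannot be established within 2 seconds.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
    }

    // Give up on the request if no response arrives within 3 seconds.
    static HttpRequest newRequest(URI uri) {
        return HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(3))
                .GET()
                .build();
    }

    // Returns the body on HTTP 200, or empty on any failure so the caller
    // can degrade gracefully instead of blocking a thread indefinitely.
    static Optional<String> fetchProfile(HttpClient client, URI uri) throws InterruptedException {
        try {
            HttpResponse<String> response =
                    client.send(newRequest(uri), HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200
                    ? Optional.of(response.body())
                    : Optional.empty();
        } catch (IOException e) {
            // Covers HttpTimeoutException (a subclass of IOException) as well as
            // connection resets and refused connections.
            return Optional.empty();
        }
    }
}
```

Returning an empty Optional is one possible policy; a caller could equally surface an error or schedule a bounded retry.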
2. The "No Circuit Breaker" Antipattern
What broke: When a downstream service fails, the primary service keeps hammering it, overwhelming the failing service and preventing it from recovering, while also consuming its own resources.
Common Causes & Fixes:
- Cause: Lack of a circuit breaker pattern.
- Diagnosis: When the downstream service experiences errors, the primary service continues to send requests, and its own error rate spikes, often with connection refused or timeout errors.
- Fix: Implement a circuit breaker library (e.g., Resilience4j in Java, Polly in .NET, Hystrix in older Java projects). Configure it to open the circuit once the failure rate crosses a threshold (e.g., 50%) and to stay open for a short wait (e.g., 30 seconds) before moving to the half-open state, where a few trial requests decide whether the circuit closes again.
- Why it works: Temporarily stops calls to a failing service, allowing it time to recover and preventing the caller from wasting resources.
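To make the closed, open, and half-open states concrete, here is a minimal hand-rolled sketch. It is not the Resilience4j, Polly, or Hystrix API: the class SimpleCircuitBreaker and its parameters are hypothetical, and for brevity it trips on a number of consecutive failures rather than on a failure percentage. The clock is injected so the behavior can be exercised without sleeping.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Illustrative circuit breaker; production code should prefer a maintained library.
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before opening
    private final long openMillis;        // how long to stay open before a trial call
    private final LongSupplier clock;     // e.g. System::currentTimeMillis
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    // Reports the current state, moving OPEN -> HALF_OPEN once the wait has elapsed.
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    // Runs the action unless the circuit is open; uses the fallback otherwise.
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get();          // short-circuit: do not touch the dependency
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        // A failed trial call in HALF_OPEN re-opens the circuit immediately.
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

This sketch lets every caller through while half-open; a library implementation typically limits the number of trial calls.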
3. The "Single Point of Failure" Antipattern
What broke: A critical component, even if distributed, is deployed in a single instance or zone, and its failure brings down the entire system.
Common Causes & Fixes:
-
Cause: Single instance of a critical service or database.
- Diagnosis: Observe the system during a single instance failure (e.g., by stopping a VM or killing a container). The entire application becomes unavailable.
- Fix: Deploy critical services and databases with redundancy. For services, ensure at least two instances are running in different availability zones. For databases, implement replication and failover mechanisms.
- Why it works: Provides a warm standby or active-active setup, so if one instance fails, another can take over with little or no interruption.
-
Cause: Reliance on a single external dependency (e.g., DNS, certificate authority).
- Diagnosis: All services fail to resolve hostnames or establish secure connections when the external dependency is unavailable.
- Fix: Use multiple DNS providers. Have fallback mechanisms for critical external services, or cache critical information (like resolved IPs) locally with a reasonable TTL.
- Why it works: Decouples the system from single points of failure in the external world.
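Redundancy only helps if callers can actually reach the surviving instance. As a rough sketch of client-side failover, the helper below tries each redundant replica in order and returns the first successful result. The class Failover and the method firstAvailable are hypothetical names; in many deployments a load balancer or service mesh plays this role instead of application code.

```java
import java.util.List;
import java.util.function.Supplier;

// Illustrative client-side failover across redundant replicas.
final class Failover {

    // Calls each replica in order and returns the first result that does not throw.
    static <T> T firstAvailable(List<Supplier<T>> replicas) {
        RuntimeException lastFailure = null;
        for (Supplier<T> replica : replicas) {
            try {
                return replica.get();
            } catch (RuntimeException e) {
                lastFailure = e;            // remember the failure and try the next replica
            }
        }
        throw new IllegalStateException("all replicas failed", lastFailure);
    }
}
```

Each replica call should itself carry a timeout, otherwise a hung first replica delays the failover indefinitely.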
4. The "Unbounded Request Queues" Antipattern
What broke: Incoming requests are accepted indefinitely, overwhelming internal processing capacity and leading to memory exhaustion or dropped requests.
Common Causes & Fixes:
-
Cause: Web server/application server accepting connections and queuing requests without limit.
- Diagnosis: High memory usage on application servers. Application logs showing "out of memory" errors or dropped requests. Network monitoring showing many open connections.
- Fix: Configure limits on the number of concurrent connections and the size of request queues for your web server (e.g., Nginx worker_connections and the backlog parameter of listen) and application server (e.g., Tomcat maxConnections and acceptCount).
- Why it works: Prevents the system from accepting more work than it can handle, forcing clients to wait or retry gracefully.
-
Cause: Asynchronous processing queues (e.g., Kafka, RabbitMQ) that are not monitored or scaled.
- Diagnosis: Consumer lag increases over time. Message queues grow excessively large.
- Fix: Implement monitoring for consumer lag. Scale up the number of consumers based on lag. Configure dead-letter queues for messages that cannot be processed.
- Why it works: Ensures that messages are processed at a rate that keeps pace with ingestion, preventing unbounded growth.
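The same bounding principle applies inside the application. A minimal sketch, assuming work is handed to a java.util.concurrent.ThreadPoolExecutor: give the pool a fixed-capacity queue and reject new work when it is full, rather than letting an unbounded queue grow until memory runs out. The class BoundedWorkQueue and its helper names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative bounded executor that sheds load instead of queueing without limit.
final class BoundedWorkQueue {

    // Fixed number of workers plus a queue with a hard capacity.
    static ThreadPoolExecutor newBoundedPool(int workers, int queueCapacity) {
        return new ThreadPoolExecutor(
                workers, workers,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy()); // reject when workers and queue are full
    }

    // Returns false when the pool is saturated so the caller can respond
    // with a "try again later" signal (for example HTTP 503).
    static boolean trySubmit(ThreadPoolExecutor pool, Runnable task) {
        try {
            pool.execute(task);
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }
}
```

With one worker and a queue capacity of one, a third task submitted while the first is still running is rejected immediately, which is the backpressure signal the caller needs.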
5. The "Chatty Service" Antipattern
What broke: A service makes an excessive number of small, sequential calls to downstream services, creating a high latency tail and consuming significant network and processing resources.
Common Causes & Fixes:
-
Cause: Fetching data one item at a time instead of in batches.
- Diagnosis: Network monitoring shows a high volume of small requests between services. Performance profiling shows most time spent in network I/O.
- Fix: Redesign the API to support batch operations. For example, instead of GET /users/{id} called 100 times, use GET /users?ids={id1},{id2},....
- Why it works: Reduces the overhead of individual network round trips and context switching, significantly improving throughput and latency.
-
Cause: N+1 query problems in data retrieval.
- Diagnosis: Similar to the one-item-at-a-time diagnosis above, but specifically observable in ORM logs or application code where a loop fetches related entities one by one.
- Fix: Use eager loading or join fetch in your ORM to retrieve related data in a single query.
- Why it works: Combines multiple small queries into one efficient query, drastically reducing database load and network traffic.
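On the client side, the batch fix can be sketched as grouping IDs into a small number of requests. The helper below builds request paths in the GET /users?ids=... shape used above; the class BatchRequests, the method batchPaths, and the maxBatchSize cap are assumptions added for illustration (a cap keeps URLs and response sizes bounded).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative helper that turns many single-item lookups into a few batch requests.
final class BatchRequests {

    // Splits ids into chunks of at most maxBatchSize and builds one path per chunk.
    static List<String> batchPaths(List<Long> ids, int maxBatchSize) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        List<String> paths = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += maxBatchSize) {
            int end = Math.min(start + maxBatchSize, ids.size());
            String joined = ids.subList(start, end).stream()
                    .map(String::valueOf)
                    .collect(Collectors.joining(","));
            paths.add("/users?ids=" + joined);
        }
        return paths;
    }
}
```

With 100 IDs and a batch size of 50, this yields 2 requests instead of 100.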
6. The "No Idempotency" Antipattern
What broke: Retries due to transient network issues or service failures cause duplicate operations (e.g., charging a customer twice, creating duplicate orders).
Common Causes & Fixes:
- Cause: Non-idempotent requests (typically POST, or PUT handlers that are not actually idempotent) being retried.
- Diagnosis: Users report duplicate transactions or data. Logs show the same operation being initiated multiple times within a short period.
- Fix: Implement idempotency keys. The client generates a unique key for each operation and sends it with the request. The server stores this key and returns the same response if the key is seen again, ensuring the operation is performed only once. For example, use an Idempotency-Key: abc-123 header.
- Why it works: The server can detect and ignore duplicate requests, ensuring that even if a request is sent multiple times, the side effects occur only once.
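A minimal server-side sketch of the idempotency key check follows. It keeps responses in an in-memory map purely for illustration; the class IdempotentHandler, the charge method, and the response format are hypothetical, and a real service would more likely persist keys in a durable store with an expiry so they survive restarts and are shared across instances.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative handler that performs a side effect at most once per idempotency key.
final class IdempotentHandler {
    private final ConcurrentMap<String, String> responsesByKey = new ConcurrentHashMap<>();
    private final AtomicInteger charges = new AtomicInteger();

    // Returns the stored response when the key was seen before; otherwise performs
    // the charge once and remembers its response. ConcurrentHashMap.computeIfAbsent
    // applies the function at most once per key, even under concurrent retries.
    String charge(String idempotencyKey, long amountCents) {
        return responsesByKey.computeIfAbsent(idempotencyKey, key -> {
            int chargeId = charges.incrementAndGet();   // stands in for the real side effect
            return "charge-" + chargeId + " amount=" + amountCents;
        });
    }

    // Number of charges actually performed.
    int chargeCount() {
        return charges.get();
    }
}
```

A retried request carrying the same key receives the original response, and the side effect runs only once.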
Fixing these antipatterns requires a shift in mindset from building robust components to building robust interactions between components.
The next logical problem you’ll encounter after fixing these is ensuring your system can gracefully handle expected load increases and changes in traffic patterns, which leads into understanding rate limiting and traffic shaping.