Run Distributed Systems in Production: Observability, Failure, and Recovery (2026)

Distributed systems are like a city’s power grid: a failure in one neighborhood can cascade and take down the whole metropolis if you don’t have the right monitoring and backup plans.

Let’s say you’re running a microservices architecture. A user request comes in, hitting a load balancer. It’s routed to a frontend service, which then calls a user service, then an order service, and finally a payment service. Each of these services is running on multiple instances, potentially across different availability zones.

Here’s a simplified trace of a successful request:

User Request: GET /orders/123
Load Balancer (e.g., ALB): Receives request, forwards to frontend-service-instance-a.
Frontend Service: Receives request, needs user and order details. Calls user-service-instance-b for user 456.
User Service: Receives request, looks up user 456 in its database. Returns user data.
Frontend Service: Receives user data. Calls order-service-instance-c for order 123.
Order Service: Receives request, looks up order 123 in its database. Returns order data.
Frontend Service: Receives order data. Combines, returns response to user.
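
The call chain above can be sketched in a few lines of Python. This is a toy, in-process stand-in rather than a real deployment: the function names, the in-memory USERS and ORDERS tables, and the request_id field are all assumptions made for illustration. What it shows is the shape of the fan-out, and the habit of carrying one request ID through every hop so that logs from different services can be stitched back into a single trace.

```python
import uuid

# Hypothetical in-memory stand-ins for each service's database.
USERS = {456: {"id": 456, "name": "example-user"}}
ORDERS = {123: {"id": 123, "user_id": 456, "total": 42.0}}

# Collected (request_id, service, message) tuples, standing in for real logs.
TRACE_LOG = []


def log(request_id, service, message):
    TRACE_LOG.append((request_id, service, message))


def user_service(request_id, user_id):
    log(request_id, "user-service", f"lookup user {user_id}")
    return USERS[user_id]


def order_service(request_id, order_id):
    log(request_id, "order-service", f"lookup order {order_id}")
    return ORDERS[order_id]


def frontend_service(order_id, user_id, request_id=None):
    # Generate the ID once at the edge, then pass it to every downstream call.
    request_id = request_id or uuid.uuid4().hex
    log(request_id, "frontend-service", f"GET /orders/{order_id}")
    user = user_service(request_id, user_id)
    order = order_service(request_id, order_id)
    return {"request_id": request_id, "user": user, "order": order}
```

Calling frontend_service(123, 456) returns the combined response, and filtering TRACE_LOG by the returned request_id gives the hops in the order they happened.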

Now, what if something goes wrong?

Fix: Service Unavailable Error (503) in Load Balancer

The 503 Service Unavailable error from your load balancer means it tried to send a request to a backend service instance, but that instance either didn’t respond, responded with an error, or was deemed unhealthy by the load balancer’s health checks. The load balancer itself is usually healthy, but it can’t reach its targets.

Here are the most common reasons:

Target Instances are Unhealthy: The most frequent culprit. Your backend service instances are genuinely not working correctly, failing their health checks.
- Diagnosis: Check the load balancer’s target group health status. In AWS, this is done via the EC2 console -> Target Groups -> select your target group -> Targets tab. Look for instances marked as unhealthy or initial (if they never became healthy).
- Fix: SSH into one of the unhealthy instances. Examine application logs (/var/log/app.log or similar). Common issues include:
  - Application Crashed: A segmentation fault, unhandled exception, or out-of-memory error. Restart the application process. For a systemd service: sudo systemctl restart myapp.service.
  - Port Not Listening: The application started but isn’t binding to the expected port (e.g., 8080). Check with sudo netstat -tulnp | grep 8080 (or sudo ss -tulnp | grep 8080 on systems that no longer ship netstat). If nothing is listening, investigate why the app failed to bind (e.g., port already in use by another process, or a configuration error).
  - External Dependency Failure: The app can’t reach its database or another critical upstream service. Check database connectivity from the instance: psql -h my-db.rds.amazonaws.com -U user -d mydb.
- Why it works: Health checks are designed to fail if the application isn’t running or responding correctly. Fixing the underlying application issue allows it to pass health checks and be registered as healthy by the load balancer.
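
The "Port Not Listening" check can also be scripted, which is handy when you need to test many instances. The following is a minimal sketch using only the standard library; the function name port_is_listening is an assumption for illustration. It only confirms that something accepts TCP connections on the port, not that the application behind it is healthy.

```python
import socket


def port_is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

Run on the instance itself, port_is_listening("127.0.0.1", 8080) answers the same question as the netstat check. Run from another host in the same subnet, it also exercises the network path.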
Health Check Misconfiguration: The load balancer is configured with incorrect health check parameters, causing it to wrongly deem healthy instances as unhealthy.
- Diagnosis: In your load balancer’s target group configuration, review the health check settings. Specifically, check the Path (e.g., /health), Port (should be traffic-port or a specific port like 8080), Healthy threshold (e.g., 3), Unhealthy threshold (e.g., 2), Timeout (e.g., 5 seconds), and Interval (e.g., 30 seconds).
- Fix: If the health check path is /health, ensure there’s an endpoint at that path in your application that returns a 200 OK status code quickly. If your app runs on port 8080 but the health check is configured for port 80, change the health check port to 8080. If the Timeout is too short (e.g., 1 second) and your application’s health check endpoint takes 2 seconds to respond, increase the timeout to 3 seconds.
- Why it works: The health check is the load balancer’s way of asking "are you alive and well?". If the question is wrong (wrong path, wrong port) or the criteria for "well" are too strict (a timeout that is too short, or an unhealthy threshold so low that one or two slow responses take an instance out of rotation), it will incorrectly mark instances as unhealthy. Correcting these settings ensures the load balancer accurately assesses instance health.
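
For reference, a health endpoint that satisfies these settings can be very small. The sketch below uses Python’s built-in http.server; the HealthHandler and make_server names are assumptions for illustration, and a real service would normally expose /health through its own web framework instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # Answer immediately with 200 and a tiny body.
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        # Keep frequent health probes out of the access log.
        pass


def make_server(port=8080):
    # The port must match the health check port configured on the target group.
    return HTTPServer(("0.0.0.0", port), HealthHandler)
```

Starting it with make_server().serve_forever() gives the load balancer a path that returns 200 OK on port 8080. Keeping the handler free of database calls is a deliberate choice here: it keeps the response time well inside a short timeout.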
Network ACLs or Security Groups Blocking Health Check Traffic: Firewalls are preventing the load balancer’s health check probes from reaching your instances.
- Diagnosis: Examine the Security Group attached to your backend instances and the Network ACLs associated with the subnet(s) where your instances reside.
- Fix:
  - Security Group: Add an inbound rule allowing TCP traffic on the health check port (e.g., 8080) from the load balancer’s security group (preferred, referenced by its security group ID) or, failing that, from the CIDR ranges of the subnets the load balancer lives in.
  - Network ACL: Ensure the Network ACL for the subnet allows inbound TCP traffic on the health check port (e.g., 8080) from the load balancer’s IP range, and importantly, outbound traffic on ephemeral ports (e.g., 1024-65535) back to the load balancer.
- Why it works: Health checks are network requests. If network security rules block these requests (either the request going in or the response coming out), the load balancer will never receive a successful response and will mark the instance as unhealthy.
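
The difference between the two layers is easier to see in code. The sketch below is a deliberately simplified model, not how AWS evaluates rules: it assumes allow-only rules written as (cidr, from_port, to_port) tuples and ignores deny rules and rule ordering. All names are hypothetical. It captures one point: a security group is stateful, so allowing the inbound probe is enough, while a network ACL is stateless, so the reply needs its own outbound rule.

```python
import ipaddress


def rule_allows(rules, ip, port):
    """Return True if any (cidr, from_port, to_port) rule matches ip and port."""
    addr = ipaddress.ip_address(ip)
    return any(
        addr in ipaddress.ip_network(cidr) and lo <= port <= hi
        for cidr, lo, hi in rules
    )


def health_check_can_pass(sg_inbound, nacl_inbound, nacl_outbound,
                          lb_ip, check_port, lb_source_port):
    # Security groups are stateful: the reply to an allowed probe is allowed too.
    if not rule_allows(sg_inbound, lb_ip, check_port):
        return False, "security group blocks inbound probe"
    # Network ACLs are stateless: probe and reply are evaluated separately.
    if not rule_allows(nacl_inbound, lb_ip, check_port):
        return False, "network ACL blocks inbound probe"
    # The reply goes back to the ephemeral port the load balancer connected from.
    if not rule_allows(nacl_outbound, lb_ip, lb_source_port):
        return False, "network ACL blocks outbound reply"
    return True, "ok"
```

With an outbound ACL rule of ("10.0.0.0/16", 1024, 65535) the check passes for a load balancer node at 10.0.1.15. Narrow that rule to port 443 only and the function reports the blocked reply, which is the failure mode that is easiest to miss in practice.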
Load Balancer Resource Exhaustion: The load balancer itself is overwhelmed and unable to process new connections or health check requests.
- Diagnosis: Check the load balancer’s CloudWatch metrics. Look for HTTPCode_ELB_5XX_Count (5xx responses generated by the load balancer itself, as opposed to HTTPCode_Target_5XX_Count, which counts 5xxs returned by targets), UnHealthyHostCount, ActiveConnectionCount, and RejectedConnectionCount (for Application Load Balancers; Classic Load Balancers report SpilloverCount and SurgeQueueLength instead). High connection counts together with rejected connections or spillover indicate it’s struggling.
- Fix: AWS load balancers scale their own capacity automatically, so there is no node count you can raise directly. If you expect a sudden traffic spike, you can ask AWS Support to pre-warm the load balancer ahead of time. More commonly, you’d optimize your application to respond faster, reduce the number of concurrent connections, or scale out your backend instances so they handle the load efficiently and the LB doesn’t get saturated.
- Why it works: If the load balancer is swamped, it can’t even perform its basic functions like routing traffic or checking if targets are alive, leading to 503s and unhealthy targets.
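
When reading these metrics, the first question is who produced the 5xx. The sketch below shows one way to triage a window of metric sums. It assumes you have already fetched the values from CloudWatch into a plain dict keyed by metric name. The metric names are the Application Load Balancer ones (HTTPCode_ELB_5XX_Count for errors the load balancer generated, HTTPCode_Target_5XX_Count for errors returned by targets, RejectedConnectionCount for connections refused at capacity); the function name and the wording of the findings are assumptions for illustration.

```python
def classify_5xx(metrics):
    """Return a list of findings for one window of ALB metric sums."""
    elb_5xx = metrics.get("HTTPCode_ELB_5XX_Count", 0)
    target_5xx = metrics.get("HTTPCode_Target_5XX_Count", 0)
    rejected = metrics.get("RejectedConnectionCount", 0)

    findings = []
    if rejected > 0:
        findings.append("load balancer rejected connections (capacity limit reached)")
    if elb_5xx > 0:
        findings.append("load balancer generated 5xx responses itself")
    if target_5xx > 0:
        findings.append("targets returned 5xx responses")
    return findings or ["no 5xx signals in this window"]
```

A window with only target 5xxs points back at the application, while load-balancer-generated 5xxs alongside rejected connections point at the load balancer or at an empty pool of usable targets.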
DNS Resolution Issues: Your instances, or the load balancer trying to reach them, cannot resolve hostnames correctly. This is less common for direct IP-based health checks but can happen if CNAMEs are involved or if internal DNS is used.
- Diagnosis: On an instance, try dig or nslookup for external services it depends on. For the load balancer, check if it’s using IP addresses or hostnames for its targets.
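- Fix: Ensure your VPC’s DNS settings are correct (DNS resolution and DNS hostnames enabled) and that instances can reach your DNS resolver. The Amazon-provided resolver sits at the VPC’s base address plus two, for example 10.0.0.2 in a 10.0.0.0/16 VPC. If using hostnames, verify the DNS records are accurate and propagated.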
- Why it works: If an instance cannot resolve the hostname of a database it needs to connect to, its application will fail to start or function, leading to unhealthiness. If the load balancer cannot resolve the IP of its target instances (less common for standard setups), it also can’t reach them.
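
The same lookup that dig or nslookup performs can be done from inside the application’s own runtime, which is useful because it goes through the resolver configuration the application actually uses. A minimal sketch with the standard library follows; the function name resolve is an assumption for illustration.

```python
import socket


def resolve(hostname, port=None):
    """Return the sorted unique IP addresses for hostname, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Name not found, resolver unreachable, or a temporary lookup failure.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

An empty list for a dependency such as my-db.rds.amazonaws.com means the application cannot even find the host, so connection errors in its logs are a symptom rather than the cause.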
Instance Health Check Timeout: The health check response from the instance is taking too long, exceeding the load balancer’s configured timeout.
- Diagnosis: As with the health check misconfiguration case above, check the Timeout setting for the health check in the target group configuration. If it’s set to 2 seconds, and your application’s /health endpoint consistently takes 3 seconds to respond, every probe will fail.
- Fix: Optimize the application’s health check endpoint to respond faster. This might involve reducing the number of checks it performs, caching certain data, or ensuring its dependencies are responsive. Alternatively, increase the health check Timeout in the target group configuration to a value greater than your endpoint’s typical response time (e.g., 5 seconds).
- Why it works: The load balancer has a limited time to receive a valid health check response. If the application is slow to respond, it misses this window, and the load balancer assumes the instance is unavailable.
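
To find out how long the endpoint really takes, you can time a probe yourself and compare the result with the configured Timeout. The sketch below is an approximation of what the load balancer does, not a reimplementation of it: the function name probe is hypothetical, and the socket timeout used here applies to each blocking network operation rather than to the request as a whole.

```python
import http.client
import time


def probe(host, port, path="/health", timeout=5.0):
    """Return (passed, elapsed_seconds) for one HTTP health probe."""
    start = time.monotonic()
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        passed = conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        # Timeout, refused connection, or a reply that is not valid HTTP.
        passed = False
    finally:
        conn.close()
    return passed, time.monotonic() - start
```

Running probe("127.0.0.1", 8080, timeout=2.0) on the instance a few times shows whether the endpoint fits inside a 2-second window with room to spare, or whether it only passes when the system is idle.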

After fixing these issues, the next error you’ll likely encounter is a 502 Bad Gateway. It points to a problem one step later in the chain: the load balancer successfully routed the request, but the backend service mishandled an error from one of its own upstream calls, or communication between services broke down before a valid response came back.