Observing distributed systems is more about understanding the emergent behavior of many independent services than about inspecting each one in isolation.

Let’s see how metrics, traces, and logs work together to paint a coherent picture of a request flowing through a hypothetical e-commerce system. Imagine a user browsing products, adding an item to their cart, and then checking out. This one user journey triggers a cascade of calls across multiple services: a ProductCatalog service to fetch details, an Inventory service to check stock, a Cart service to manage the user’s cart, and finally, a Payment service to process the transaction.

Here’s a simplified view of what that might look like in code, using a common pattern where each service emits its own telemetry.

import time

# `metrics`, `log`, and the trace/span ID helpers stand in for whatever
# telemetry client each service uses; they are not a specific library's API.

# In ProductCatalogService:
def get_product_details(product_id):
    start_time = time.perf_counter()  # monotonic clock, safe for measuring durations
    # ... fetch from database ...
    metrics.increment("product_catalog.requests_total")
    metrics.histogram("product_catalog.request_duration_seconds", time.perf_counter() - start_time)
    trace_id = generate_trace_id()  # First service in the flow: start a trace; assume this ID is propagated downstream
    span_id = generate_span_id()    # Each operation gets its own span ID
    log.info(f"Fetching details for {product_id}", extra={"trace_id": trace_id, "span_id": span_id})
    return {"name": "Awesome Gadget", "price": 99.99}

# In InventoryService:
def check_stock(product_id, quantity):
    start_time = time.perf_counter()
    # ... check inventory system ...
    metrics.increment("inventory.requests_total")
    metrics.histogram("inventory.request_duration_seconds", time.perf_counter() - start_time)
    trace_id = get_current_trace_id()  # Propagated from previous service
    span_id = generate_span_id()
    log.info(f"Checking stock for {product_id}, quantity {quantity}", extra={"trace_id": trace_id, "span_id": span_id})
    return {"available": True}

# In PaymentService:
def process_payment(user_id, amount):
    start_time = time.perf_counter()
    # ... interact with payment gateway ...
    metrics.increment("payment.requests_total")
    metrics.histogram("payment.request_duration_seconds", time.perf_counter() - start_time)
    trace_id = get_current_trace_id()  # Same trace ID as the upstream services
    span_id = generate_span_id()
    log.info(f"Processing payment for user {user_id}, amount {amount}", extra={"trace_id": trace_id, "span_id": span_id})
    return {"status": "success"}

Metrics give you the high-level pulse of your system. They are aggregated, numerical values that tell you what is happening across all requests. Think of them as the dashboard of a car: product_catalog.request_duration_seconds records how long requests spend in the ProductCatalog service (as a histogram, it gives you the distribution, from which averages and percentiles are derived), inventory.requests_total counts how many times the Inventory service was called, and payment.errors_total (if we had it) would show how often payments failed. These are invaluable for spotting trends, identifying performance bottlenecks across the board, and setting alerts. If payment.request_duration_seconds suddenly spikes to 5 seconds on average, you know something is wrong with payments, even though the metric alone cannot tell you which individual payments were affected.
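To make the "average versus distribution" point concrete, here is a minimal sketch using only the Python standard library. The sample durations are invented for illustration, and `summarize` is a hypothetical helper, not part of any metrics client.

```python
import statistics

def summarize(durations_s):
    """Return the mean, median (p50), and 99th percentile of durations in seconds."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(durations_s, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(durations_s),
        "p50": cuts[49],
        "p99": cuts[98],
    }

# Hypothetical sample: 98 fast payments and 2 that stalled for 5 seconds.
durations = [0.1] * 98 + [5.0] * 2
summary = summarize(durations)
print(summary)
```

With this sample the mean stays near 0.2 seconds while the 99th percentile sits at 5 seconds, which is why alerting on a high percentile often surfaces a problem that an average smooths over.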

Traces are like a single, detailed flight recorder for a specific request as it journeys through your distributed system. A trace is composed of "spans," where each span represents an operation within a service (like a database query or an external API call) or a call between services. Crucially, spans within the same trace are linked by a common trace_id. This allows you to reconstruct the entire path a request took, including the time spent in each service and the dependencies between them. When a user reports that their checkout is slow, you can use the trace_id from their session to see that the PaymentService took 3 seconds, while ProductCatalog was a snappy 50ms. This pinpointing capability is where tracing truly shines.
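As a rough illustration of how a shared trace_id lets you reassemble a request, the sketch below groups spans by trace and picks out the slowest child span. The `Span` fields, the "Checkout" root span, and the timings are assumptions made for this example; real tracing backends store more detail.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of a trace
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

def group_by_trace(spans):
    """Bucket spans by their shared trace_id."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    return traces

def slowest_child(trace_spans):
    """Return the longest non-root span; the root covers the whole request."""
    children = [s for s in trace_spans if s.parent_id is not None]
    return max(children, key=lambda s: s.duration_ms)

# Hypothetical spans collected from several services.
spans = [
    Span("t1", "a", None, "Checkout", 0, 3100),
    Span("t1", "b", "a", "ProductCatalog", 5, 55),
    Span("t1", "c", "a", "PaymentService", 60, 3060),
    Span("t2", "d", None, "Checkout", 0, 120),
]

worst = slowest_child(group_by_trace(spans)["t1"])
print(worst.service, worst.duration_ms)
```

For trace "t1" this singles out the PaymentService span at 3000 ms, next to a 50 ms ProductCatalog span, mirroring the slow-checkout scenario described above.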

Logs are the detailed, human-readable (or machine-readable, with structured logging) narratives. Each log entry typically includes a timestamp, severity level, and a message. In a distributed system, the magic happens when you enrich logs with context from the request. By including the trace_id and span_id in every log message generated during the processing of a specific request, you can filter logs to see only those related to a particular slow checkout, or a specific payment failure. This turns a firehose of log data into a focused investigation tool. If a trace shows PaymentService was slow, you can then filter logs for that specific trace_id and span_id to find the exact error message or detailed steps that led to the delay.
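Here is one way this could look with Python's standard logging module: a small JSON formatter that carries trace_id and span_id, plus a filter over the resulting lines. `JsonFormatter`, `logs_for_trace`, the logger name, and the IDs are all hypothetical; a production setup would normally ship these lines to a log backend and query them there.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, including trace context."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # Values passed via `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        })

stream = io.StringIO()  # stands in for a log file or log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payment_demo")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False  # keep the demo output out of the root logger

log.info("Processing payment", extra={"trace_id": "t1", "span_id": "c"})
log.error("Timeout connecting to upstream payment gateway",
          extra={"trace_id": "t1", "span_id": "c"})
log.info("Processing payment", extra={"trace_id": "t2", "span_id": "f"})

def logs_for_trace(raw, trace_id):
    """Keep only the log entries that belong to one trace."""
    entries = [json.loads(line) for line in raw.splitlines() if line]
    return [e for e in entries if e["trace_id"] == trace_id]

matching = logs_for_trace(stream.getvalue(), "t1")
for entry in matching:
    print(entry["level"], entry["message"])
```

Filtering on "t1" leaves just the two entries for that request, including the timeout error, and drops the unrelated payment.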

The real power emerges when you correlate these three. A sudden increase in payment.request_duration_seconds (metric) might prompt you to look at recent traces. You find a few traces showing high latency in PaymentService. You then pick one of those slow traces and examine the associated logs (using the trace_id and span_id from the trace) to find a specific error message like "Timeout connecting to upstream payment gateway" or "Invalid credit card number format received." This combined approach allows you to go from a vague system-wide symptom to a precise root cause.

The most surprising aspect of distributed tracing is how it reveals unexpected dependencies or fan-out patterns. You might trace a simple "get user profile" request and discover it’s silently calling half a dozen other internal services you weren’t aware of, each adding a small but cumulative latency. This emergent complexity is often invisible without tracing.

When you’re debugging a request that seems to hang indefinitely, and you’ve confirmed no explicit errors in logs or metrics, it’s often because a service is waiting for a response from another service that has itself hung or crashed. How this shows up depends on your tracing backend. Some will show the span for the call to the unresponsive service as never completing, with its duration reported as "unknown" or a very large number; others only export a span once it has finished, so the stuck span may not appear at all. Either way, no subsequent spans will appear. This lack of a follow-on span is as informative as an error message.
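A small sketch of how you might look for that pattern, assuming your tracing data records when a span starts even if it never ends (not every backend does). The `SpanRecord` type and its fields are hypothetical. Because every caller up the chain is also still waiting, the useful signal is the deepest open span, the one with no open children of its own.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpanRecord:
    span_id: str
    parent_id: Optional[str]
    service: str
    start_s: float
    end_s: Optional[float] = None  # None means the span never finished

def deepest_open_spans(spans):
    """Return unfinished spans that have no unfinished child span."""
    open_spans = [s for s in spans if s.end_s is None]
    parents_of_open = {s.parent_id for s in open_spans}
    return [s for s in open_spans if s.span_id not in parents_of_open]

# Hypothetical trace: the checkout is still waiting on the payment call.
trace = [
    SpanRecord("a", None, "Checkout", 0.0),             # still open (waiting)
    SpanRecord("b", "a", "Cart", 0.1, 0.2),             # finished normally
    SpanRecord("c", "a", "PaymentService", 0.3),        # never finished
]

stalled = deepest_open_spans(trace)
print([s.service for s in stalled])
```

Here the root Checkout span is open only because it is waiting, so the function reports PaymentService as the place where the request actually stopped making progress.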

The next step is often to implement distributed tracing for your asynchronous message queues, where the request context (trace IDs) needs careful propagation across producers and consumers.
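To give a feel for what that propagation involves, here is a minimal in-process sketch that uses `queue.Queue` as a stand-in for a real message broker. The message envelope with a "headers" field, and the `publish`, `consume`, and `handle` helpers, are assumptions for illustration rather than any broker's actual API.

```python
import contextvars
import queue
import uuid

# Holds the trace ID for whatever request is currently being processed.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)

def publish(q, payload):
    """Producer side: copy the active trace ID into the message headers."""
    q.put({"headers": {"trace_id": current_trace_id.get()}, "payload": payload})

def handle(payload):
    """Business logic; it sees the trace ID restored by the consumer."""
    return {"trace_id": current_trace_id.get(), "payload": payload}

def consume(q):
    """Consumer side: restore the trace ID from the headers before handling."""
    message = q.get()
    token = current_trace_id.set(message["headers"]["trace_id"])
    try:
        return handle(message["payload"])
    finally:
        current_trace_id.reset(token)  # do not leak context into the next message

orders = queue.Queue()

# Producer: a request with an active trace publishes a message.
current_trace_id.set(uuid.uuid4().hex)
producer_trace = current_trace_id.get()
publish(orders, {"order_id": 42})

# Consumer: simulate a separate worker that starts with no trace context.
current_trace_id.set(None)
result = consume(orders)
print(result["trace_id"] == producer_trace)
```

The key design choice is that the trace ID travels inside the message itself, since the consumer usually runs in a different process and at a later time than the producer.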
