Event-driven systems are more like a city with thousands of interconnected roads than a single highway.

Let’s watch a hypothetical OrderPlaced event flow through a microservices architecture. Our system has an API Gateway, an Order Service, a Payment Service, and a Notification Service.

Here’s a simplified view of the event flow and how we’d observe it:

  1. API Gateway receives an order:

    • Trace: A trace starts; let’s call it trace-12345. The gateway records its first span:
      {
        "traceId": "trace-12345",
        "spanId": "span-gateway-ingress",
        "timestamp": "2023-10-27T10:00:00Z",
        "message": "Received POST /orders",
        "http.method": "POST",
        "http.url": "/orders",
        "request.body.size": 512
      }
      
    • Metric: A counter increments, bringing api_gateway_requests_total{method="POST", path="/orders"} to 1. A histogram, api_gateway_request_duration_seconds, records a latency of 0.05s.
    • Log: The gateway logs the event for auditing: INFO: API Gateway received new order request. Trace ID: trace-12345
  2. API Gateway publishes OrderPlaced to a message queue (e.g., Kafka):

    • Trace: A new span, span-gateway-publish-order, is created, linked to trace-12345.
      {
        "traceId": "trace-12345",
        "spanId": "span-gateway-publish-order",
        "parentId": "span-gateway-ingress",
        "timestamp": "2023-10-27T10:00:00.100Z",
        "message": "Published OrderPlaced event to kafka topic 'orders'",
        "messaging.system": "kafka",
        "messaging.destination": "orders",
        "messaging.messageId": "msg-order-abc"
      }
      
    • Metric: A counter for messages produced increments, bringing kafka_messages_produced_total{topic="orders"} to 1. If the broker exposes it, a gauge can also track queue depth.
    • Log: INFO: Published OrderPlaced event. Trace ID: trace-12345, Message ID: msg-order-abc
  3. Order Service consumes OrderPlaced:

    • Trace: The Order Service starts a new span, span-order-service-consume, linked to trace-12345 and span-gateway-publish-order.
      {
        "traceId": "trace-12345",
        "spanId": "span-order-service-consume",
        "parentId": "span-gateway-publish-order",
        "timestamp": "2023-10-27T10:00:01.200Z",
        "message": "Consumed OrderPlaced event",
        "messaging.system": "kafka",
        "messaging.destination": "orders",
        "messaging.messageId": "msg-order-abc"
      }
      
    • Metric: A counter for consumed messages increments, bringing kafka_messages_consumed_total{topic="orders"} to 1. A histogram for processing time, order_service_event_processing_duration_seconds, records 0.3s.
    • Log: INFO: OrderService processing OrderPlaced. Trace ID: trace-12345, Order ID: 123
  4. Order Service publishes PaymentRequested:

    • Trace: A new span, span-order-service-publish-payment, linked to trace-12345 and span-order-service-consume.
      {
        "traceId": "trace-12345",
        "spanId": "span-order-service-publish-payment",
        "parentId": "span-order-service-consume",
        "timestamp": "2023-10-27T10:00:01.500Z",
        "message": "Published PaymentRequested event",
        "messaging.system": "kafka",
        "messaging.destination": "payments",
        "messaging.messageId": "msg-payment-def"
      }
      
    • Metric: kafka_messages_produced_total{topic="payments"} to 1.
    • Log: INFO: OrderService published PaymentRequested. Trace ID: trace-12345, Order ID: 123, Payment ID: 789
  5. Payment Service consumes PaymentRequested:

    • Trace: Span span-payment-service-consume, linked to trace-12345 and span-order-service-publish-payment.
      {
        "traceId": "trace-12345",
        "spanId": "span-payment-service-consume",
        "parentId": "span-order-service-publish-payment",
        "timestamp": "2023-10-27T10:00:02.800Z",
        "message": "Consumed PaymentRequested event",
        "messaging.system": "kafka",
        "messaging.destination": "payments",
        "messaging.messageId": "msg-payment-def"
      }
      
    • Metric: kafka_messages_consumed_total{topic="payments"} to 1. The histogram payment_service_event_processing_duration_seconds records 1.2s.
    • Log: INFO: PaymentService processing PaymentRequested. Trace ID: trace-12345, Payment ID: 789
  6. Payment Service publishes PaymentCompleted:

    • Trace: Span span-payment-service-publish-completed, linked to trace-12345 and span-payment-service-consume.
      {
        "traceId": "trace-12345",
        "spanId": "span-payment-service-publish-completed",
        "parentId": "span-payment-service-consume",
        "timestamp": "2023-10-27T10:00:03.000Z",
        "message": "Published PaymentCompleted event",
        "messaging.system": "kafka",
        "messaging.destination": "payments",
        "messaging.messageId": "msg-payment-ghi"
      }
      
    • Metric: kafka_messages_produced_total{topic="payments"} to 2.
    • Log: INFO: PaymentService published PaymentCompleted. Trace ID: trace-12345, Payment ID: 789
  7. Order Service consumes PaymentCompleted:

    • Trace: Span span-order-service-consume-payment-completed, linked to trace-12345 and span-payment-service-publish-completed.
      {
        "traceId": "trace-12345",
        "spanId": "span-order-service-consume-payment-completed",
        "parentId": "span-payment-service-publish-completed",
        "timestamp": "2023-10-27T10:00:04.100Z",
        "message": "Consumed PaymentCompleted event",
        "messaging.system": "kafka",
        "messaging.destination": "payments",
        "messaging.messageId": "msg-payment-ghi"
      }
      
    • Metric: kafka_messages_consumed_total{topic="payments"} to 2. The histogram order_service_event_processing_duration_seconds records 1.0s (this is a different processing path from the initial OrderPlaced).
    • Log: INFO: OrderService processing PaymentCompleted. Trace ID: trace-12345, Order ID: 123
  8. Order Service publishes OrderShipped:

    • Trace: Span span-order-service-publish-shipped, linked to trace-12345 and span-order-service-consume-payment-completed.
      {
        "traceId": "trace-12345",
        "spanId": "span-order-service-publish-shipped",
        "parentId": "span-order-service-consume-payment-completed",
        "timestamp": "2023-10-27T10:00:04.300Z",
        "message": "Published OrderShipped event",
        "messaging.system": "kafka",
        "messaging.destination": "notifications",
        "messaging.messageId": "msg-order-jkl"
      }
      
    • Metric: kafka_messages_produced_total{topic="notifications"} to 1.
    • Log: INFO: OrderService published OrderShipped. Trace ID: trace-12345, Order ID: 123
  9. Notification Service consumes OrderShipped:

    • Trace: Span span-notification-service-consume, linked to trace-12345 and span-order-service-publish-shipped.
      {
        "traceId": "trace-12345",
        "spanId": "span-notification-service-consume",
        "parentId": "span-order-service-publish-shipped",
        "timestamp": "2023-10-27T10:00:05.500Z",
        "message": "Consumed OrderShipped event",
        "messaging.system": "kafka",
        "messaging.destination": "notifications",
        "messaging.messageId": "msg-order-jkl"
      }
      
    • Metric: kafka_messages_consumed_total{topic="notifications"} to 1. The histogram notification_service_event_processing_duration_seconds records 1.1s.
    • Log: INFO: NotificationService sending email for Order ID: 123. Trace ID: trace-12345
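
Every consumer span in the walkthrough knows its traceId and parentId only because the producer sent them along with the message. Here’s a minimal sketch of that hand-off, assuming an in-memory deque in place of a Kafka topic and a plain list in place of a tracing backend. The helper names (start_span, publish, consume) and the header keys (traceId, parentSpanId) are hypothetical; real instrumentation libraries commonly carry this context in the W3C Trace Context traceparent header instead.

```python
import uuid
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    message: str


@dataclass
class Message:
    payload: dict
    headers: dict = field(default_factory=dict)


recorded_spans = []  # hypothetical stand-in for a tracing backend
topic = deque()      # hypothetical stand-in for a Kafka topic


def start_span(message, trace_id=None, parent_id=None):
    """Create a span; a missing trace_id means this span starts a new trace."""
    span = Span(
        trace_id=trace_id or uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent_id,
        message=message,
    )
    recorded_spans.append(span)
    return span


def publish(event_type, payload, parent):
    """Producer side: open a publish span and copy its context into headers."""
    span = start_span(f"Published {event_type}", parent.trace_id, parent.span_id)
    headers = {"traceId": span.trace_id, "parentSpanId": span.span_id}
    topic.append(Message(payload={"type": event_type, **payload}, headers=headers))
    return span


def consume():
    """Consumer side: read the context from headers and continue the trace."""
    msg = topic.popleft()
    span = start_span(
        f"Consumed {msg.payload['type']}",
        trace_id=msg.headers["traceId"],
        parent_id=msg.headers["parentSpanId"],
    )
    return msg, span


# Steps 1 to 3 of the walkthrough: ingress, publish, consume.
ingress = start_span("Received POST /orders")
publish_span = publish("OrderPlaced", {"orderId": 123}, parent=ingress)
_, consume_span = consume()

for s in recorded_spans:
    print(s.trace_id, s.span_id, s.parent_id, s.message)
```

All three printed spans share one trace ID, and each parent ID points at the span before it. If the consumer ignored the headers, it would start a fresh trace and the chain would break at the queue.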

Tracing (with tools such as OpenTelemetry, Jaeger, or Zipkin) lets you see the end-to-end journey of a single request or event across multiple services. You can reconstruct the entire causal chain, see where time is spent, and identify bottlenecks. For instance, if the hop from PaymentRequested to PaymentCompleted took 5 seconds, the gap between those two spans in the trace would show it directly.
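
To make "see where time is spent" concrete, here’s a rough sketch that takes the span IDs, parent IDs, and timestamps from the walkthrough and measures the gap between each span and its parent. The function names are made up for illustration. Because the walkthrough records a single timestamp per span, this measures the time between span starts, not true span durations, which a real tracing backend would also report.

```python
from datetime import datetime

# (spanId, parentId, timestamp) copied from the walkthrough spans.
SPANS = [
    ("span-gateway-ingress", None, "2023-10-27T10:00:00Z"),
    ("span-gateway-publish-order", "span-gateway-ingress", "2023-10-27T10:00:00.100Z"),
    ("span-order-service-consume", "span-gateway-publish-order", "2023-10-27T10:00:01.200Z"),
    ("span-order-service-publish-payment", "span-order-service-consume", "2023-10-27T10:00:01.500Z"),
    ("span-payment-service-consume", "span-order-service-publish-payment", "2023-10-27T10:00:02.800Z"),
    ("span-payment-service-publish-completed", "span-payment-service-consume", "2023-10-27T10:00:03.000Z"),
    ("span-order-service-consume-payment-completed", "span-payment-service-publish-completed", "2023-10-27T10:00:04.100Z"),
    ("span-order-service-publish-shipped", "span-order-service-consume-payment-completed", "2023-10-27T10:00:04.300Z"),
    ("span-notification-service-consume", "span-order-service-publish-shipped", "2023-10-27T10:00:05.500Z"),
]


def parse(ts):
    """Parse an ISO 8601 timestamp; 'Z' is rewritten for older Python versions."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def start_times(spans):
    """Map each span ID to its parsed timestamp."""
    return {span_id: parse(ts) for span_id, _, ts in spans}


def hop_latencies(spans):
    """Seconds between each span and its parent; the root span is skipped."""
    started = start_times(spans)
    return {
        span_id: (started[span_id] - started[parent_id]).total_seconds()
        for span_id, parent_id, _ in spans
        if parent_id is not None
    }


latencies = hop_latencies(SPANS)
slowest = max(latencies, key=latencies.get)
started = start_times(SPANS)
end_to_end = (max(started.values()) - min(started.values())).total_seconds()

for span_id, seconds in latencies.items():
    print(f"{seconds:5.1f}s  waiting for {span_id}")
print(f"slowest hop: {slowest} ({latencies[slowest]:.1f}s), end to end: {end_to_end:.1f}s")
```

With the walkthrough’s timestamps, the largest gaps all fall in front of the consume spans, which suggests that most of the 5.5 seconds is spent in transit or waiting on the queue, not inside any one service.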

Metrics (collected with a tool such as Prometheus and typically charted in Grafana) provide aggregated, numerical data about the system’s health and performance over time. They tell you how many requests are happening, how long they take on average, and how many errors occurred. They are essential for dashboards, alerting, and understanding system-wide trends. You might see a spike in order_service_event_processing_duration_seconds or a sudden increase in a backlog metric such as kafka_messages_unprocessed_total.
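
The counters and histograms in the walkthrough boil down to very little state. Here’s a toy sketch of both, written from scratch so it runs without any dependencies; the Counter and Histogram classes below are illustrative stand-ins and not the API of any real metrics client, and the bucket boundaries are arbitrary.

```python
from bisect import bisect_left
from itertools import accumulate


class Counter:
    """A monotonically increasing count, kept separately per label set."""

    def __init__(self, name):
        self.name = name
        self.values = {}

    def inc(self, **labels):
        key = tuple(sorted(labels.items()))
        self.values[key] = self.values.get(key, 0) + 1

    def get(self, **labels):
        return self.values.get(tuple(sorted(labels.items())), 0)


class Histogram:
    """Counts observations per bucket and tracks their running sum."""

    def __init__(self, name, buckets):
        self.name = name
        self.buckets = sorted(buckets)
        # One slot per upper bound, plus a final overflow slot.
        self.bucket_counts = [0] * (len(self.buckets) + 1)
        self.count = 0
        self.sum = 0.0

    def observe(self, seconds):
        # bisect_left finds the first bucket whose upper bound is >= seconds.
        self.bucket_counts[bisect_left(self.buckets, seconds)] += 1
        self.count += 1
        self.sum += seconds

    def cumulative(self):
        """Counts of observations at or below each bound, overflow slot last."""
        return list(accumulate(self.bucket_counts))

    def mean(self):
        return self.sum / self.count if self.count else 0.0


consumed = Counter("kafka_messages_consumed_total")
consumed.inc(topic="orders")    # step 3
consumed.inc(topic="payments")  # step 5
consumed.inc(topic="payments")  # step 7

processing = Histogram(
    "order_service_event_processing_duration_seconds",
    buckets=[0.1, 0.5, 1.0, 2.5],
)
processing.observe(0.3)  # OrderPlaced path, step 3
processing.observe(1.0)  # PaymentCompleted path, step 7

print(consumed.get(topic="payments"), processing.cumulative(), processing.mean())
```

Note what is lost here: the histogram knows that one observation was slow, but not which order it belonged to. That is the gap traces and logs fill.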

Logging (with tools such as the ELK stack or Splunk) provides detailed, human-readable records of events happening within a single service. It’s crucial for debugging specific errors and understanding what a service was doing at a particular moment. When a trace shows an error in the Payment Service, logs from that service, filtered by traceId, will usually reveal the exact error message.
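
Filtering by traceId only works if every log line carries it as a structured field. Here’s one way that might look with Python’s standard logging module, assuming JSON-formatted lines and an in-memory buffer in place of a log aggregation backend. The helper names, the second trace ID (trace-99999), and its error message are invented for the example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so it can be filtered by field."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "traceId": getattr(record, "traceId", None),
            "message": record.getMessage(),
        })


stream = io.StringIO()  # hypothetical stand-in for a log aggregation backend
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())


def service_logger(service, trace_id):
    """Return a logger that stamps every line with the given trace ID."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if handler not in logger.handlers:
        logger.addHandler(handler)
    # LoggerAdapter attaches the dict as attributes on each log record.
    return logging.LoggerAdapter(logger, {"traceId": trace_id})


def logs_for_trace(raw, trace_id):
    """Keep only the log entries that belong to one trace."""
    entries = (json.loads(line) for line in raw.splitlines())
    return [entry for entry in entries if entry["traceId"] == trace_id]


service_logger("PaymentService", "trace-12345").info(
    "processing PaymentRequested. Payment ID: 789")
service_logger("PaymentService", "trace-99999").error(
    "card declined. Payment ID: 790")  # unrelated, hypothetical trace
service_logger("OrderService", "trace-12345").info(
    "processing PaymentCompleted. Order ID: 123")

matching = logs_for_trace(stream.getvalue(), "trace-12345")
for entry in matching:
    print(entry["service"], entry["level"], entry["message"])
```

The filter returns lines from both services in the order they were written and drops the unrelated trace, which is roughly what a traceId query in a log search tool gives you.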

The most surprising thing about tracing in event-driven systems is how it redefines "request" to include the asynchronous hops between services. A "request" isn’t just a direct HTTP call; it’s the entire lifecycle of an event as it’s published, consumed, processed, and potentially triggers further events across different components, all tied together by a shared traceId.

When you’re debugging, you often start with metrics to identify a problem area (e.g., high latency in the Order Service). Then you use logs, filtered by traceId and timestamps, to pinpoint the exact error within that service. Finally, you use tracing to see how that error affected downstream or upstream services, or to find out why processing took so long by examining the latency of each span within the overall trace. Each of the three answers a question the other two can’t, so their interplay is key.

The next concept you’ll encounter is how to manage distributed context propagation, especially when dealing with different messaging systems or complex branching event flows.
