Event-driven systems are more like a city with thousands of interconnected roads than a single highway.
Let’s watch a hypothetical OrderPlaced event flow through a microservices architecture. Our system has an API Gateway, an Order Service, a Payment Service, and a Notification Service.
Here’s a simplified view of the event flow and how we’d observe it:
- API Gateway receives an order:
  - Trace: A trace starts, let’s call it `trace-12345`. The gateway logs an entry like:
    `{ "traceId": "trace-12345", "spanId": "span-gateway-ingress", "timestamp": "2023-10-27T10:00:00Z", "message": "Received POST /orders", "http.method": "POST", "http.url": "/orders", "request.body.size": 512 }`
  - Metric: A counter increments: `api_gateway_requests_total{method="POST", path="/orders"}` to 1. A histogram records latency: `api_gateway_request_duration_seconds` with a value of 0.05s.
  - Log: The gateway logs the event for auditing: `INFO: API Gateway received new order request. Trace ID: trace-12345`
- API Gateway publishes `OrderPlaced` to a message queue (e.g., Kafka):
  - Trace: A new span, `span-gateway-publish-order`, is created, linked to `trace-12345`.
    `{ "traceId": "trace-12345", "spanId": "span-gateway-publish-order", "parentId": "span-gateway-ingress", "timestamp": "2023-10-27T10:00:00.100Z", "message": "Published OrderPlaced event to kafka topic 'orders'", "messaging.system": "kafka", "messaging.destination": "orders", "messaging.messageId": "msg-order-abc" }`
  - Metric: A gauge might track queue depth if available, or a counter for messages produced: `kafka_messages_produced_total{topic="orders"}` to 1.
  - Log: `INFO: Published OrderPlaced event. Trace ID: trace-12345, Message ID: msg-order-abc`
- Order Service consumes `OrderPlaced`:
  - Trace: The Order Service starts a new span, `span-order-service-consume`, linked to `trace-12345` and `span-gateway-publish-order`.
    `{ "traceId": "trace-12345", "spanId": "span-order-service-consume", "parentId": "span-gateway-publish-order", "timestamp": "2023-10-27T10:00:01.200Z", "message": "Consumed OrderPlaced event", "messaging.system": "kafka", "messaging.destination": "orders", "messaging.messageId": "msg-order-abc" }`
  - Metric: A counter for consumed messages: `kafka_messages_consumed_total{topic="orders"}` to 1. A histogram for processing time: `order_service_event_processing_duration_seconds` with a value of 0.3s.
  - Log: `INFO: OrderService processing OrderPlaced. Trace ID: trace-12345, Order ID: 123`
- Order Service publishes `PaymentRequested`:
  - Trace: A new span, `span-order-service-publish-payment`, linked to `trace-12345` and `span-order-service-consume`.
    `{ "traceId": "trace-12345", "spanId": "span-order-service-publish-payment", "parentId": "span-order-service-consume", "timestamp": "2023-10-27T10:00:01.500Z", "message": "Published PaymentRequested event", "messaging.system": "kafka", "messaging.destination": "payments", "messaging.messageId": "msg-payment-def" }`
  - Metric: `kafka_messages_produced_total{topic="payments"}` to 1.
  - Log: `INFO: OrderService published PaymentRequested. Trace ID: trace-12345, Order ID: 123, Payment ID: 789`
- Payment Service consumes `PaymentRequested`:
  - Trace: Span `span-payment-service-consume`, linked to `trace-12345` and `span-order-service-publish-payment`.
    `{ "traceId": "trace-12345", "spanId": "span-payment-service-consume", "parentId": "span-order-service-publish-payment", "timestamp": "2023-10-27T10:00:02.800Z", "message": "Consumed PaymentRequested event", "messaging.system": "kafka", "messaging.destination": "payments", "messaging.messageId": "msg-payment-def" }`
  - Metric: `kafka_messages_consumed_total{topic="payments"}` to 1. `payment_service_event_processing_duration_seconds` with a value of 1.2s.
  - Log: `INFO: PaymentService processing PaymentRequested. Trace ID: trace-12345, Payment ID: 789`
- Payment Service publishes `PaymentCompleted`:
  - Trace: Span `span-payment-service-publish-completed`, linked to `trace-12345` and `span-payment-service-consume`.
    `{ "traceId": "trace-12345", "spanId": "span-payment-service-publish-completed", "parentId": "span-payment-service-consume", "timestamp": "2023-10-27T10:00:03.000Z", "message": "Published PaymentCompleted event", "messaging.system": "kafka", "messaging.destination": "payments", "messaging.messageId": "msg-payment-ghi" }`
  - Metric: `kafka_messages_produced_total{topic="payments"}` to 2.
  - Log: `INFO: PaymentService published PaymentCompleted. Trace ID: trace-12345, Payment ID: 789`
- Order Service consumes `PaymentCompleted`:
  - Trace: Span `span-order-service-consume-payment-completed`, linked to `trace-12345` and `span-payment-service-publish-completed`.
    `{ "traceId": "trace-12345", "spanId": "span-order-service-consume-payment-completed", "parentId": "span-payment-service-publish-completed", "timestamp": "2023-10-27T10:00:04.100Z", "message": "Consumed PaymentCompleted event", "messaging.system": "kafka", "messaging.destination": "payments", "messaging.messageId": "msg-payment-ghi" }`
  - Metric: `kafka_messages_consumed_total{topic="payments"}` to 2. `order_service_event_processing_duration_seconds` with a value of 1.0s (this is a different processing path than the initial `OrderPlaced`).
  - Log: `INFO: OrderService processing PaymentCompleted. Trace ID: trace-12345, Order ID: 123`
- Order Service publishes `OrderShipped`:
  - Trace: Span `span-order-service-publish-shipped`, linked to `trace-12345` and `span-order-service-consume-payment-completed`.
    `{ "traceId": "trace-12345", "spanId": "span-order-service-publish-shipped", "parentId": "span-order-service-consume-payment-completed", "timestamp": "2023-10-27T10:00:04.300Z", "message": "Published OrderShipped event", "messaging.system": "kafka", "messaging.destination": "notifications", "messaging.messageId": "msg-order-jkl" }`
  - Metric: `kafka_messages_produced_total{topic="notifications"}` to 1.
  - Log: `INFO: OrderService published OrderShipped. Trace ID: trace-12345, Order ID: 123`
- Notification Service consumes `OrderShipped`:
  - Trace: Span `span-notification-service-consume`, linked to `trace-12345` and `span-order-service-publish-shipped`.
    `{ "traceId": "trace-12345", "spanId": "span-notification-service-consume", "parentId": "span-order-service-publish-shipped", "timestamp": "2023-10-27T10:00:05.500Z", "message": "Consumed OrderShipped event", "messaging.system": "kafka", "messaging.destination": "notifications", "messaging.messageId": "msg-order-jkl" }`
  - Metric: `kafka_messages_consumed_total{topic="notifications"}` to 1. `notification_service_event_processing_duration_seconds` with a value of 1.1s.
  - Log: `INFO: NotificationService sending email for Order ID: 123. Trace ID: trace-12345`
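Each step in the walkthrough emits the same three signals: a span record, a metric update, and a log line. As a rough sketch of what that could look like inside a consumer, the snippet below handles the "Order Service consumes OrderPlaced" step using only the Python standard library. The function name `consume_order_placed` and the in-memory `exported_spans` and `counters` containers are assumptions made for illustration; a real service would hand these to a tracing SDK and a metrics client instead.

```python
import logging
from collections import Counter

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
log = logging.getLogger("order-service")

# Hypothetical in-memory stand-ins for a span exporter and a metrics registry.
exported_spans = []
counters = Counter()


def consume_order_placed(trace_id, parent_span_id, message_id, order_id, timestamp):
    """Emit a span, a counter increment, and a log line for one consumed event."""
    # Trace: the consumer span reuses the trace ID and points at the publish span.
    span = {
        "traceId": trace_id,
        "spanId": "span-order-service-consume",
        "parentId": parent_span_id,
        "timestamp": timestamp,
        "message": "Consumed OrderPlaced event",
        "messaging.system": "kafka",
        "messaging.destination": "orders",
        "messaging.messageId": message_id,
    }
    exported_spans.append(span)

    # Metric: one more message consumed from the "orders" topic.
    counters['kafka_messages_consumed_total{topic="orders"}'] += 1

    # Log: human-readable line that carries the trace ID for later filtering.
    log.info(
        "OrderService processing OrderPlaced. Trace ID: %s, Order ID: %s",
        trace_id,
        order_id,
    )
    return span


span = consume_order_placed(
    trace_id="trace-12345",
    parent_span_id="span-gateway-publish-order",
    message_id="msg-order-abc",
    order_id=123,
    timestamp="2023-10-27T10:00:01.200Z",
)
```

The point of the sketch is that all three signals share the same trace ID, which is what makes it possible to join them later.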
Tracing (with tools such as OpenTelemetry, Jaeger, or Zipkin) lets you see the end-to-end journey of a single request or event across multiple services. You can reconstruct the entire causal chain, see where time is spent, and identify bottlenecks. For instance, if the hop from `PaymentRequested` to `PaymentCompleted` took 5 seconds, you’d see that directly in the trace spans.
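To make "reconstruct the causal chain and see where time is spent" concrete, here is a minimal sketch that takes the span IDs, parent IDs, and timestamps from the walkthrough and computes the gap between each span and its parent. The spans in the walkthrough only carry a single timestamp, so the gap between a parent and its child is used here as an approximation of the time spent on that hop; the helper names `parse_ts` and `hop_gaps` are hypothetical.

```python
from datetime import datetime

# (spanId, parentId, timestamp) copied from the walkthrough spans.
SPANS = [
    ("span-gateway-ingress", None, "2023-10-27T10:00:00Z"),
    ("span-gateway-publish-order", "span-gateway-ingress", "2023-10-27T10:00:00.100Z"),
    ("span-order-service-consume", "span-gateway-publish-order", "2023-10-27T10:00:01.200Z"),
    ("span-order-service-publish-payment", "span-order-service-consume", "2023-10-27T10:00:01.500Z"),
    ("span-payment-service-consume", "span-order-service-publish-payment", "2023-10-27T10:00:02.800Z"),
    ("span-payment-service-publish-completed", "span-payment-service-consume", "2023-10-27T10:00:03.000Z"),
    ("span-order-service-consume-payment-completed", "span-payment-service-publish-completed", "2023-10-27T10:00:04.100Z"),
    ("span-order-service-publish-shipped", "span-order-service-consume-payment-completed", "2023-10-27T10:00:04.300Z"),
    ("span-notification-service-consume", "span-order-service-publish-shipped", "2023-10-27T10:00:05.500Z"),
]


def parse_ts(ts):
    """Parse an ISO-8601 UTC timestamp that ends in 'Z'."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def hop_gaps(spans):
    """Return (parentId, spanId, seconds) for every span that has a parent."""
    started = {span_id: parse_ts(ts) for span_id, _, ts in spans}
    gaps = []
    for span_id, parent_id, _ in spans:
        if parent_id is not None:
            seconds = (started[span_id] - started[parent_id]).total_seconds()
            gaps.append((parent_id, span_id, seconds))
    return gaps


gaps = hop_gaps(SPANS)
slowest = max(gaps, key=lambda gap: gap[2])
for parent_id, span_id, seconds in gaps:
    print(f"{parent_id} -> {span_id}: {seconds:.1f}s")
print("slowest hop:", slowest[0], "->", slowest[1])
```

With the walkthrough data, the largest gaps are the publish-to-consume hops, which is where time spent waiting in the queue shows up.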
Metrics (typically collected with a system such as Prometheus and visualized in a tool such as Grafana) provide aggregated, numerical data about the system’s health and performance over time. They tell you how many requests are happening, how long they are taking (on average, or as a distribution when recorded in a histogram), and how many errors occurred. They are essential for dashboards, alerting, and understanding system-wide trends. You might see a spike in `order_service_event_processing_duration_seconds` or a sudden increase in a backlog metric such as `kafka_messages_unprocessed_total`.
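The sketch below illustrates how a histogram turns individual observations into aggregates. It is a plain-Python stand-in, not a real metrics client: the `Histogram` class and the bucket bounds are assumptions, while the two observations (0.3s and 1.0s) are the `order_service_event_processing_duration_seconds` values from the walkthrough. Buckets are cumulative, in the style Prometheus uses for its `le` ("less than or equal") labels.

```python
import bisect


class Histogram:
    """Tiny stand-in for a metrics-client histogram with cumulative buckets."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One slot per upper bound, plus a final slot for +Inf.
        self.bucket_counts = [0] * (len(self.bounds) + 1)
        self.count = 0
        self.total = 0.0

    def observe(self, value):
        # bisect_left finds the first bound that is >= value (value <= bound).
        index = bisect.bisect_left(self.bounds, value)
        self.bucket_counts[index] += 1
        self.count += 1
        self.total += value

    def cumulative(self):
        """Return {upper_bound: observations <= upper_bound}."""
        result, running = {}, 0
        for bound, n in zip(self.bounds + [float("inf")], self.bucket_counts):
            running += n
            result[bound] = running
        return result

    def mean(self):
        return self.total / self.count if self.count else 0.0


# Bucket bounds are hypothetical; the observations come from the walkthrough.
processing_seconds = Histogram(bounds=[0.1, 0.5, 1.0, 2.5])
processing_seconds.observe(0.3)  # handling OrderPlaced
processing_seconds.observe(1.0)  # handling PaymentCompleted

buckets = processing_seconds.cumulative()
average = processing_seconds.mean()
```

Because only counts and sums are kept, the individual events are gone; that is the trade-off that makes metrics cheap enough to alert on, and why traces and logs are still needed for the detail.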
Logging (with tools such as the ELK stack or Splunk) provides detailed, human-readable records of events happening within a single service. It’s crucial for debugging specific errors and understanding the context of what a service was doing at a particular moment. When a trace shows an error in the Payment Service, logs from that service, filtered by `traceId`, will reveal the exact error message.
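As a small illustration of filtering logs by trace ID, the sketch below scans plain-text log lines in the format used in the walkthrough. The `ERROR` line and the `trace-99999` line are hypothetical additions so the filter has something to exclude and something to find, and `logs_for_trace` is an assumed helper name; a log platform would normally do this with a query rather than a script.

```python
import re

LOG_LINES = [
    "INFO: API Gateway received new order request. Trace ID: trace-12345",
    "INFO: PaymentService processing PaymentRequested. Trace ID: trace-12345, Payment ID: 789",
    # Hypothetical lines, not part of the walkthrough:
    "INFO: PaymentService processing PaymentRequested. Trace ID: trace-99999, Payment ID: 790",
    "ERROR: PaymentService card declined. Trace ID: trace-12345, Payment ID: 789",
    "INFO: NotificationService sending email for Order ID: 123. Trace ID: trace-12345",
]

TRACE_ID = re.compile(r"Trace ID: ([\w-]+)")


def logs_for_trace(lines, trace_id, service=None):
    """Return lines that carry trace_id, optionally limited to one service name."""
    matches = []
    for line in lines:
        found = TRACE_ID.search(line)
        if found is None or found.group(1) != trace_id:
            continue
        if service is not None and service not in line:
            continue
        matches.append(line)
    return matches


payment_logs = logs_for_trace(LOG_LINES, "trace-12345", service="PaymentService")
for line in payment_logs:
    print(line)
```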
The most surprising thing about tracing in event-driven systems is how it redefines a "request" to include the asynchronous hops between services. A "request" isn’t just a direct HTTP call; it’s the entire lifecycle of an event as it’s published, consumed, processed, and potentially triggers further events across different components, all tied together by a shared `traceId`.
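For the shared trace ID to survive an asynchronous hop, the producer has to attach it to the message and the consumer has to read it back. The following is a minimal sketch of that idea, assuming the trace context travels in a `headers` dictionary on each message; the header names `traceId` and `parentSpanId` and the functions `publish` and `consume` are hypothetical, and a real system would more likely rely on a tracing library and a standard header format.

```python
def publish(event_type, span_id, current_span):
    """Create a publish span and a message that carries the trace context."""
    publish_span = {
        "traceId": current_span["traceId"],   # same trace as the active span
        "spanId": span_id,
        "parentId": current_span["spanId"],
    }
    message = {
        "type": event_type,
        # The trace context rides along with the event itself.
        "headers": {
            "traceId": publish_span["traceId"],
            "parentSpanId": publish_span["spanId"],
        },
    }
    return publish_span, message


def consume(message, span_id):
    """Start a consumer span that continues the trace found in the headers."""
    headers = message["headers"]
    return {
        "traceId": headers["traceId"],
        "spanId": span_id,
        "parentId": headers["parentSpanId"],
    }


# Two asynchronous hops, using the span names from the walkthrough.
ingress = {"traceId": "trace-12345", "spanId": "span-gateway-ingress", "parentId": None}
gateway_publish, order_placed = publish("OrderPlaced", "span-gateway-publish-order", ingress)
order_consume = consume(order_placed, "span-order-service-consume")
order_publish, payment_requested = publish(
    "PaymentRequested", "span-order-service-publish-payment", order_consume
)
payment_consume = consume(payment_requested, "span-payment-service-consume")
```

No service in this sketch calls another directly, yet every span ends up with the same `traceId` and a parent that points one hop back, which is what lets a tracing backend stitch the hops into one "request".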
When you’re debugging, you often start with metrics to identify a problem area (e.g., high latency in the Order Service). Then you use logs, filtered by traceId and timestamps, to pinpoint the exact error within that service. Finally, you use tracing to see how that error affected downstream or upstream services, or why the processing took so long by examining the latency of each individual span within the overall trace. The interplay between these three is key.
The next concept you’ll encounter is how to manage distributed context propagation, especially when dealing with different messaging systems or complex branching event flows.