Tracing requests across event-driven services is less about tracking a single request and more about reconstructing a story from fragmented conversations.

Let’s see this in action. Imagine a user places an order. This triggers a cascade:

  1. User Service: Publishes UserOrderPlaced event to Kafka topic user-orders.
  2. Order Service: Consumes UserOrderPlaced, creates an order record, publishes OrderCreated event to orders topic.
  3. Inventory Service: Consumes OrderCreated, checks stock, publishes InventoryReserved event to inventory topic.
  4. Payment Service: Consumes OrderCreated, processes payment, publishes PaymentProcessed event to payments topic.
  5. Notification Service: Consumes PaymentProcessed (and potentially others), sends an email.

Without tracing, if the user’s order never gets fulfilled, you’re staring at a black box. Was the OrderCreated event never published? Did the Inventory Service fail to consume it? Did the Payment Service error out?

The mental model hinges on correlation IDs. Every event carries a traceId (unique for the entire user journey) and a spanId (unique for a specific operation within that journey). Only the first service in the journey generates a new traceId. Every service after that copies the traceId from the incoming event and generates only a new spanId for its own work.

Here’s a simplified Kafka message payload demonstrating this:

{
  "event_type": "OrderCreated",
  "traceId": "a1b2c3d4e5f67890",
  "spanId": "f0e9d8c7b6a54321",
  "timestamp": "2023-10-27T10:30:00Z",
  "payload": {
    "orderId": "ORD-12345",
    "userId": "USR-67890",
    "amount": 99.99
  }
}

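When the Order Service receives an event (say, UserOrderPlaced), it extracts the incoming traceId and spanId. For the OrderCreated event it publishes, it reuses the same traceId but generates a new spanId, and it records the incoming spanId as the parent of the new span when reporting to the tracing backend. Downstream consumers repeat the same steps with the event they receive.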

The core problem this solves is distributed debugging. When a request fails or gets stuck, you can query your tracing backend (like Jaeger, Zipkin, or Datadog APM) using the traceId. This reconstructs the entire path the request took, showing which services participated, the order of operations, and importantly, where it stalled or errored out.
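As a rough illustration of what such a query does, the sketch below filters a pile of collected span records down to one traceId, orders them by time, and finds the first one flagged as an error. The SpanRecord type, its fields, and the TraceTimeline class are hypothetical names invented for this example. A real backend does this over indexed storage, not an in-memory list.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical span record as a tracing backend might store it. */
record SpanRecord(String traceId, String service, String eventType,
                  Instant timestamp, boolean error) {}

final class TraceTimeline {

    /** Returns every span that belongs to the given trace, oldest first. */
    static List<SpanRecord> forTrace(List<SpanRecord> all, String traceId) {
        return all.stream()
                .filter(s -> s.traceId().equals(traceId))
                .sorted(Comparator.comparing(SpanRecord::timestamp))
                .toList();
    }

    /** Returns the earliest span in the timeline that was flagged as an error, if any. */
    static Optional<SpanRecord> firstError(List<SpanRecord> timeline) {
        return timeline.stream().filter(SpanRecord::error).findFirst();
    }
}
```

If the timeline simply ends without an error, the last entry tells you which service was the final one to touch the request, which is usually where to start looking for a stall.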

You control this by ensuring every service that publishes or consumes an event is instrumented to:

  1. Extract Correlation IDs: When receiving a message, parse out traceId and spanId.
  2. Generate New Span ID: For any new operation initiated by receiving that message, create a fresh spanId.
  3. Propagate Trace ID: When publishing a new message or making an outgoing call, include the original traceId and the newly generated spanId.
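The three steps above can be sketched without any tracing library, using a plain Map as a stand-in for message headers. The class name, the header keys, and the parentSpanId header are assumptions made for this example; in practice an instrumentation library does this work for you.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

final class CorrelationSteps {
    static final String TRACE_ID = "traceId";
    static final String SPAN_ID = "spanId";
    static final String PARENT_SPAN_ID = "parentSpanId"; // assumed header name

    /** Builds headers for an outgoing message from the headers of the incoming one. */
    static Map<String, String> outgoingHeaders(Map<String, String> incoming) {
        // 1. Extract correlation IDs (both absent on the very first event of a journey).
        String traceId = incoming.get(TRACE_ID);
        String parentSpanId = incoming.get(SPAN_ID);

        // 2. Generate a new span ID for the work this service is doing.
        String spanId = newId();

        // 3. Propagate: keep the original traceId, or start a new trace if there is none.
        Map<String, String> outgoing = new HashMap<>();
        outgoing.put(TRACE_ID, traceId != null ? traceId : newId());
        outgoing.put(SPAN_ID, spanId);
        if (parentSpanId != null) {
            outgoing.put(PARENT_SPAN_ID, parentSpanId);
        }
        return outgoing;
    }

    /** Random 16-character lowercase hex ID, the same shape as the sample payload. */
    static String newId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }
}
```

Carrying the incoming spanId forward as a parent reference is what later lets a backend rebuild the order of operations without relying only on timestamps.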

The actual implementation involves libraries specific to your languages and messaging systems. For Kafka in Java, you might use kafka-clients with OpenTelemetry instrumentation. A common pattern is to inject trace context into message headers on the producer side and extract it on the consumer side, which keeps it out of the event payload. For example, in Spring Kafka you could handle extraction in a RecordInterceptor; with plain kafka-clients, a ProducerInterceptor and ConsumerInterceptor pair can do the same job.

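// Example of injecting trace context into Kafka headers (OpenTelemetry API)
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapSetter;
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Headers;

ProducerRecord<String, MyEvent> record = new ProducerRecord<>(topic, key, event);
// Assuming the OpenTelemetry SDK is configured and a span is active.
// The setter replaces any existing header with the same key before adding the new value.
TextMapSetter<Headers> setter = (headers, k, v) ->
    headers.remove(k).add(k, v.getBytes(StandardCharsets.UTF_8));
// Writes the current trace context into the record's headers.
GlobalOpenTelemetry.getPropagators().getTextMapPropagator()
    .inject(Context.current(), record.headers(), setter);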
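On the consuming side you would normally let the propagator extract the context for you. To show what actually travels in the header, here is a minimal sketch that parses a W3C Trace Context traceparent value by hand, assuming the W3C propagator is the one configured. The TraceParent class and its Ids record are hypothetical names for this example. The sketch only accepts the version 00 layout and skips other validity rules from the specification, such as rejecting all-zero IDs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TraceParent {
    // Layout: version-traceId-parentSpanId-flags, all lowercase hex.
    private static final Pattern FORMAT =
            Pattern.compile("^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    /** The parts of a traceparent header that a consumer typically needs. */
    record Ids(String traceId, String parentSpanId, boolean sampled) {}

    /** Parses raw header bytes; returns empty if the value is missing or malformed. */
    static Optional<Ids> parse(byte[] headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        Matcher m = FORMAT.matcher(new String(headerValue, StandardCharsets.UTF_8));
        if (!m.matches()) {
            return Optional.empty();
        }
        // The lowest bit of the flags byte is the "sampled" flag.
        boolean sampled = (Integer.parseInt(m.group(4), 16) & 0x01) != 0;
        return Optional.of(new Ids(m.group(2), m.group(3), sampled));
    }
}
```

Note that a W3C trace ID is 32 hex characters, twice as long as the 16-character traceId in the simplified payload shown earlier.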

A subtle but critical aspect is how you handle "fan-out" scenarios. If a single event triggers multiple independent downstream processing chains, each chain will share the same traceId but will diverge with unique spanIds for their respective operations. A tracing system that correctly visualizes this branching is crucial. You don’t just see a line; you see a tree.
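To make the tree concrete, the sketch below groups the spans of one trace by their parent span ID and renders them with indentation. SpanNode, TraceTree, and the span names are hypothetical; the point is only that a parent reference on each span is enough to recover the branching.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical span of a single trace; parentSpanId is null for the root. */
record SpanNode(String spanId, String parentSpanId, String name) {}

final class TraceTree {

    /** Renders the spans as an indented tree, children listed under their parent. */
    static String render(List<SpanNode> spans) {
        // HashMap permits a null key, which is used here for root spans.
        Map<String, List<SpanNode>> childrenByParent = new HashMap<>();
        for (SpanNode s : spans) {
            childrenByParent.computeIfAbsent(s.parentSpanId(), k -> new ArrayList<>()).add(s);
        }
        StringBuilder out = new StringBuilder();
        append(childrenByParent, null, 0, out);
        return out.toString();
    }

    private static void append(Map<String, List<SpanNode>> byParent, String parentId,
                               int depth, StringBuilder out) {
        for (SpanNode s : byParent.getOrDefault(parentId, List.of())) {
            out.append("  ".repeat(depth)).append(s.name()).append('\n');
            append(byParent, s.spanId(), depth + 1, out);
        }
    }

    public static void main(String[] args) {
        // Fan-out: Inventory and Payment both branch off the OrderCreated span.
        System.out.print(render(List.of(
                new SpanNode("s1", null, "Order Service: publish OrderCreated"),
                new SpanNode("s2", "s1", "Inventory Service: reserve stock"),
                new SpanNode("s3", "s1", "Payment Service: process payment"),
                new SpanNode("s4", "s3", "Notification Service: send email"))));
    }
}
```

Here the Inventory and Payment spans appear as siblings under the Order Service span, and the Notification span hangs off the Payment branch only.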

The next conceptual hurdle is understanding how to correlate these traces with metrics and logs for a complete observability picture.
