CQRS itself doesn’t inherently make tracing commands and queries across microservices any harder, but it does reveal the complexities that were always there.
Let’s look at a typical scenario: a user updates their profile.
// Command sent to the User Service
{
"commandType": "UpdateUserProfileCommand",
"userId": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
"updates": {
"displayName": "Jane Doe Updated",
"email": "jane.doe.new@example.com"
}
}
This command might trigger events that are then consumed by other services, like an Email Notification Service or a Search Indexing Service.
// Event published by the User Service
{
"eventType": "UserProfileUpdatedEvent",
"userId": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
"version": 3,
"timestamp": "2023-10-27T10:30:00Z"
}
The Search Indexing Service might then perform a query to update its index.
// Query to the Search Service (simplified, actual might be REST/gRPC)
{
"queryType": "UpdateSearchDocumentQuery",
"documentId": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
"index": "users",
"payload": {
"userId": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
"displayName": "Jane Doe Updated",
"email": "jane.doe.new@example.com"
}
}
This chain of command-event-query is where tracing becomes critical. Without it, if the profile update appears inconsistent in search results, you’re left guessing which step failed or introduced the discrepancy.
The core problem CQRS highlights is that the write path (commands) and read path (queries) are decoupled, often residing in different services or even different data stores. When a command is issued, it might initiate a cascade of asynchronous operations. An event is published, other services react to it by potentially performing their own commands or queries. To understand the end-to-end flow, you need to follow a single logical operation through these independent steps.
This is precisely why distributed tracing is essential. A tracing system assigns a unique traceId to the initial command. As this command is processed and generates events, and as other services consume those events and perform queries, this traceId is propagated. Each service that participates in the operation generates spans – discrete units of work, like processing a command, publishing an event, or executing a query. These spans are all linked back to the original traceId, forming a complete, visual representation of the request’s journey.
Consider the User Service receiving the UpdateUserProfileCommand. It creates a span for this command processing. If it publishes a UserProfileUpdatedEvent, the event publishing itself is another span, or part of the command processing span. The Search Indexing Service then consumes this event. When it starts processing, it creates a new span, crucially associating it with the incoming event’s trace context, which includes the original traceId. This span might encompass the UpdateSearchDocumentQuery.
Here’s how you’d typically implement this propagation. When a service sends a message (like an event) or makes an RPC call, it includes the tracing context. For example, in HTTP headers:
Trace-Id: a1b2c3d4-e5f6-7890-1234-567890abcdef
Span-Id: 1234567890abcdef
Parent-Span-Id: fedcba0987654321
Sampled: true
When a downstream service receives this, it extracts these headers and uses them to create its own spans, linking them to the parent span. This creates the chain. If you’re using message queues like Kafka or RabbitMQ, the trace context is often included as message headers.
For example, using OpenTelemetry with Kafka:
// Sending a message with trace context
func publishEvent(ctx context.Context, topic string, key string, value []byte) error {
// Extract trace information from the context
span := trace.SpanFromContext(ctx)
spanCtx := span.SpanContext()
// Create new headers including trace context
headers := kafka.NewHeaders()
headers.Add("traceparent", fmt.Sprintf("00-%s-%s-01", spanCtx.TraceID().String(), spanCtx.SpanID().String()))
msg := &kafka.Message{
TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
Key: []byte(key),
Value: value,
Headers: headers,
}
// ... send message ...
return nil
}
And on the receiving end:
// Consuming a message and starting a new span
func consumeEvent(msg *kafka.Message) error {
traceHeader := ""
for _, h := range msg.Headers {
if string(h.Key) == "traceparent" {
traceHeader = string(h.Value)
break
}
}
// Extract trace context from header
// ... parse traceHeader ...
traceID, spanID, _ := parseTraceParent(traceHeader) // Assume parseTraceParent exists
// Create a new span, linking it to the incoming trace context
tp := trace.NewSpan(traceID, spanID, trace.WithSpanKind(trace.SpanKindConsumer))
ctx := trace.ContextWithSpan(context.Background(), tp)
// ... process message using ctx ...
return nil
}
The most surprising true thing about distributed tracing in a CQRS system is that the complexity isn’t in the tracing mechanism itself, but in correctly configuring and propagating the trace context across all communication boundaries – HTTP, gRPC, message queues, and even direct method calls if they cross service boundaries (though this is less common in microservices). If a single hop fails to propagate the traceId, the entire chain for that request breaks, leaving you with incomplete visibility.
The next concept you’ll likely grapple with is how to correlate trace data with business-level events or errors, especially when dealing with eventual consistency and complex state transitions.