OpenTelemetry doesn’t actually require you to instrument every single line of your code to see the full picture.
Let’s say you’ve got a user request that hits your frontend, then bounces to a users-service, which then calls a posts-service. The users-service might record its part of that request as something like this:
{
  "traceId": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
  "parentId": "0123456789abcdef",
  "id": "fedcba9876543210",
  "name": "GET /users/{id}",
  "kind": 2, // SPAN_KIND_SERVER
  "startTimeUnixNano": 1678886400000000000,
  "endTimeUnixNano": 1678886400100000000,
  "attributes": {
    "http.method": "GET",
    "http.url": "http://users-service/users/123",
    "http.status_code": 200
  },
  "status": {
    "code": 0 // STATUS_CODE_UNSET (the default; STATUS_CODE_OK is 1)
  }
}
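This JSON represents a single "span", a unit of work within a distributed trace. The traceId links all spans belonging to the same request. The id identifies this span, and the parentId names the span that initiated it. (The field names here are simplified; OTLP calls them spanId and parentSpanId.)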
When a request reaches your users-service, the OpenTelemetry SDK and its instrumentation intercept it (e.g., via an HTTP middleware). If the request carries trace context from an upstream service, the SDK reuses that traceId and records the caller’s span ID as the new span’s parentId. If it doesn’t (the very first service in the chain), the SDK generates a fresh traceId and the span has no parent. Either way, it generates a new span ID and starts a span for the work happening within the users-service.
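That decision can be sketched in plain Python, without the SDK. This is an illustration only, not how the OpenTelemetry SDK is implemented: `start_server_span` and the dict layout are hypothetical, chosen to mirror the JSON span shown earlier.

```python
import secrets
import time
from typing import Optional, Tuple


def start_server_span(name: str, upstream: Optional[Tuple[str, str]] = None) -> dict:
    """Start a span for an incoming request (hypothetical helper).

    `upstream` is the (trace_id, span_id) pair extracted from the caller's
    request, or None when this service is the first hop.
    """
    if upstream is None:
        # First service in the chain: start a new trace with no parent.
        trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
        parent_id = None
    else:
        # Continue the caller's trace; the caller's span becomes our parent.
        trace_id, parent_id = upstream

    return {
        "traceId": trace_id,
        "parentId": parent_id,
        "id": secrets.token_hex(8),  # 8 random bytes -> 16 hex characters, always new
        "name": name,
        "startTimeUnixNano": time.time_ns(),
    }


# The frontend starts the trace; users-service continues it.
root = start_server_span("GET /")
child = start_server_span("GET /users/{id}", upstream=(root["traceId"], root["id"]))
```

Here `child` shares `root`’s traceId and points back at `root` through its parentId, which is all a backend needs to link the two.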
# Example using the Python OpenTelemetry API
from opentelemetry import trace

# Returns a no-op tracer until a TracerProvider is configured
tracer = trace.get_tracer(__name__)

def process_user_request(request):
    # `request.get_param` and `make_posts_service_call` are placeholders for
    # your framework's request object and your own call to posts-service.
    with tracer.start_as_current_span("process_user_request") as span:
        # ... your user processing logic ...
        user_id = request.get_param("id")
        span.set_attribute("user.id", user_id)
        # ... potentially make a call to posts-service ...
        response = make_posts_service_call(user_id)
        return response
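The key is that instrumented clients automatically propagate the trace context, meaning the traceId and the ID of the currently active span, to downstream services. That span ID becomes the parentId of the next service’s span. For HTTP, this means injecting the context into request headers. For gRPC, it travels in metadata.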
So, if users-service calls posts-service using an HTTP client instrumented by OpenTelemetry, the instrumentation in users-service adds a W3C Trace Context header like traceparent: 00-a1b2c3d4e5f67890a1b2c3d4e5f67890-fedcba9876543210-01. The four fields are the version, the 32-hex-character trace ID, the 16-hex-character ID of the calling span, and the trace flags (01 means the trace was sampled). The posts-service SDK sees this header, extracts the traceId and the parent span ID, and starts its own span, correctly linked to the users-service span.
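To make the header layout concrete, here is a minimal sketch of building and parsing a traceparent value with only the standard library. It assumes the version-00 layout from the W3C Trace Context specification (a 32-hex trace ID, a 16-hex parent span ID, and 2-hex trace flags). `SpanContext`, `format_traceparent`, and `parse_traceparent` are hypothetical names for this example; in a real service the OpenTelemetry propagator does this for you.

```python
import re
from typing import NamedTuple, Optional

# version "00" - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT_V00 = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def format_traceparent(ctx: SpanContext) -> str:
    """Build the header value an instrumented client would send downstream."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the caller's context, or None if the header is malformed."""
    match = _TRACEPARENT_V00.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are reserved as invalid.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return SpanContext(trace_id, span_id, sampled=bool(int(flags, 16) & 0x01))


outgoing = format_traceparent(
    SpanContext("a1b2c3d4e5f67890a1b2c3d4e5f67890", "fedcba9876543210", True)
)
incoming = parse_traceparent(outgoing)
```

A receiver that gets `None` back would simply start a new trace, which is why a malformed header shows up in the backend as a broken trace rather than as an error.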
The "system in action" isn’t just about seeing individual service logs; it’s about the automatic linkage. In each service you configure an exporter (typically OTLP) that sends finished spans to an OpenTelemetry Collector or to a backend like Jaeger or Datadog. The backend then reconstructs the entire request flow by grouping spans with the same traceId and ordering them by their parent-child links.
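The grouping step can be sketched in a few lines of Python. This is a toy model of what a backend does, not any backend’s actual code: `group_by_trace` and `render_tree` are hypothetical helpers, and the IDs and timestamps are shortened for readability.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_trace(spans: List[dict]) -> Dict[str, List[dict]]:
    """Bucket spans from all services by their shared traceId."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["traceId"]].append(span)
    return dict(traces)


def render_tree(spans: List[dict]) -> List[str]:
    """Return indented span names for one trace, parents before children."""
    known_ids = {s["id"] for s in spans}
    children = defaultdict(list)
    for span in sorted(spans, key=lambda s: s["startTimeUnixNano"]):
        parent = span.get("parentId")
        # A span whose parent never arrived is treated as a root.
        children[parent if parent in known_ids else None].append(span)

    lines: List[str] = []

    def walk(parent_id, depth):
        for span in children[parent_id]:
            lines.append("  " * depth + span["name"])
            walk(span["id"], depth + 1)

    walk(None, 0)
    return lines


# Spans arrive out of order and interleaved with other traces.
spans = [
    {"traceId": "t1", "id": "c", "parentId": "b", "name": "posts-service: GET /posts", "startTimeUnixNano": 30},
    {"traceId": "t1", "id": "a", "parentId": None, "name": "frontend: GET /profile", "startTimeUnixNano": 10},
    {"traceId": "t2", "id": "x", "parentId": None, "name": "frontend: GET /health", "startTimeUnixNano": 15},
    {"traceId": "t1", "id": "b", "parentId": "a", "name": "users-service: GET /users/{id}", "startTimeUnixNano": 20},
]
traces = group_by_trace(spans)
tree = render_tree(traces["t1"])
```

Printing `"\n".join(tree)` shows the frontend span at the top, with the users-service and posts-service spans nested beneath it in call order.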
Here’s how you’d configure a basic exporter in Python:
# Example Python configuration
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Configure the tracer provider (do this once, at application startup)
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Configure the OTLP exporter
# For local Jaeger/OTEL Collector, it's often localhost:4317 (OTLP over gRPC)
otlp_exporter = OTLPSpanExporter(endpoint="localhost:4317", insecure=True)

# Add the exporter to a span processor, which batches finished spans before sending
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)
The mental model is: each service acts as a "node" in the trace graph. OpenTelemetry’s context propagation is the "edge" that connects these nodes. You don’t need to manually pass trace IDs around in your application code if your HTTP clients and servers are instrumented. The instrumentation injects and extracts the context for you.
The real power comes from understanding that the startTimeUnixNano and endTimeUnixNano of spans allow the backend to calculate durations within a service, and the parent-child relationships show the sequence and dependencies between services. You can visualize this as a waterfall or a dependency graph, revealing bottlenecks or errors.
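As a worked example of that arithmetic, the sketch below computes a span’s duration and its "self time" (the time not spent waiting on children). `duration_ms` and `self_time_ms` are hypothetical helpers, the posts-service timestamps are made up for illustration, and the self-time calculation assumes child spans run one after another without overlapping.

```python
from typing import List


def duration_ms(span: dict) -> float:
    """Wall-clock duration of a span in milliseconds."""
    return (span["endTimeUnixNano"] - span["startTimeUnixNano"]) / 1_000_000


def self_time_ms(span: dict, children: List[dict]) -> float:
    """Time spent in the span itself, assuming children do not overlap."""
    return duration_ms(span) - sum(duration_ms(child) for child in children)


users_span = {
    "name": "GET /users/{id}",
    "startTimeUnixNano": 1678886400000000000,
    "endTimeUnixNano": 1678886400100000000,
}
# Illustrative child span: the call to posts-service starts 20 ms in and ends at 80 ms.
posts_span = {
    "name": "GET /posts",
    "startTimeUnixNano": 1678886400020000000,
    "endTimeUnixNano": 1678886400080000000,
}

total = duration_ms(users_span)                      # 100.0
downstream = duration_ms(posts_span)                 # 60.0
own_work = self_time_ms(users_span, [posts_span])    # 40.0
```

Read this way, 60 of the 100 ms belong to posts-service, so that is where to look first if the request is slow.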
What most people don’t realize is that span names are often the most useful field for quick debugging. A span named GET /users/{id} is good, but child spans named fetch_user_from_db or validate_post_permissions give you granular insight into where within that service the time is being spent. The instrumentation libraries for popular frameworks (like Flask, Express, Spring) automatically generate meaningful span names for common operations such as incoming requests; finer-grained spans like these are ones you typically create yourself, as in process_user_request above.
Once you’ve got traces flowing, the next step is to start correlating them with metrics and logs.