Observability & Performance Deep Dive · 3 of 7

Distributed Tracing — Following One Request Across Many Services

In a microservice mesh, "the API is slow" is a useless statement. Tracing follows a single request as it threads through twenty services — gateway, auth, business logic, three databases, two external APIs — and shows you exactly which span ate the budget. Jaeger and Zipkin built the visualizations; OpenTelemetry is now the standard for the instrumentation underneath.

Spans · Trace IDs · OpenTelemetry · W3C Trace Context · Sampling
Anatomy

The Vocabulary

  • Trace — the full lifecycle of one request, end to end.
  • Span — one unit of work within a trace (an HTTP call, a DB query); it has a start time, a duration, and attributes.
  • Trace context — IDs propagated across services via headers (the traceparent header, per the W3C Trace Context spec).
  • Parent / child spans — form a tree showing causality: each child records which span caused it.
  • Attributes / events — structured key-value pairs and timestamped log lines attached to spans.
  • Sampling — head-based (decide upfront, at the trace root) or tail-based (decide after seeing the whole trace).
Players

OTel, Jaeger, Zipkin

  • OpenTelemetry (OTel) — CNCF standard for SDKs, collector, and wire protocol (OTLP). Merger of OpenCensus and OpenTracing; the de facto answer for new instrumentation.
  • Jaeger — tracing backend and UI (CNCF graduated). Born at Uber; works well with the OTel collector.
  • Zipkin — the original open-source distributed-tracing system. Twitter's contribution; still widely used and simpler than Jaeger.
  • Tempo — Grafana's tracing backend. Backed by cheap object storage; pairs with Loki and Mimir.
  • Commercial — Datadog APM, Honeycomb, Lightstep, New Relic. Strong on tail-sampling, querying, and high-cardinality analysis.
What You Find

Where Traces Pay Off

  • The N+1 query. 200 little DB calls inside one request, instantly visible.
  • Hidden synchronous fan-out. A page that calls 12 services in series instead of in parallel.
  • The retry storm. Service A times out → retries 3× → upstream sees 4× load.
  • The wrong slow step. "Database is slow" is often "DNS resolution within the gateway is slow."
  • Cross-team handoffs — whose service actually owns the latency this quarter.
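The N+1 pattern in the first bullet is mechanical enough to detect straight from span data. A sketch under stated assumptions — spans are plain dicts with span_id, parent_id, and name fields (illustrative names, not any backend's schema): flag any parent whose children repeat the same operation name past a threshold.

```python
from collections import Counter

def find_n_plus_one(spans: list[dict], threshold: int = 10) -> list[tuple[str, str, int]]:
    """Return (parent span_id, repeated child name, count) for suspicious fan-outs."""
    children_by_parent: dict[str, Counter] = {}
    for s in spans:
        if s.get("parent_id"):  # root spans have no parent
            children_by_parent.setdefault(s["parent_id"], Counter())[s["name"]] += 1
    return [
        (parent, name, count)
        for parent, names in children_by_parent.items()
        for name, count in names.items()
        if count >= threshold
    ]
```

Two hundred identical SELECT spans under one HTTP handler span surface as a single tuple — the "200 little DB calls" case above, reduced to a query over the span tree.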
Tradeoffs

What to Watch Out For

  • Volume is huge. Sampling 100% of traces is rarely affordable, but 1% head sampling misses rare slow paths. Tail-based sampling is the usual answer at scale.
  • Context propagation must be everywhere. A single service that drops the headers blinds the rest of the trace.
  • Async work (queues, batch jobs) needs explicit span links; auto-instrumentation doesn't always create them.
  • OTel is a moving target. Traces are stable everywhere; metrics and logs SDK maturity still varies by language.
  • Don't double-pay. With OTel, a vendor SDK, and a service mesh all emitting spans, you can pay to ingest the same trace three times.
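The sampling tradeoff in the first bullet can be sketched concretely. A tail-based sampler decides after the trace is complete: keep everything with an error or high latency, and only a small deterministic fraction of the boring rest. A minimal sketch — the keep-rules and the hash-based baseline rate are illustrative policy, not any collector's defaults:

```python
import hashlib

def keep_trace(trace_id: str, spans: list[dict],
               latency_budget_ms: float = 500.0, baseline_rate: float = 0.01) -> bool:
    """Tail-sampling decision, called once the whole trace has been seen."""
    if any(s.get("error") for s in spans):
        return True  # always keep failures
    total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if total_ms > latency_budget_ms:
        return True  # always keep the slow outliers head sampling would miss
    # Hash the trace ID so every collector node makes the same call for a given trace.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < baseline_rate * 10_000
```

Hashing the trace ID (rather than rolling a die per span) is what keeps the decision consistent across a horizontally scaled collector tier — every node keeps or drops the same traces.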