Observability & Performance Deep Dive · 3 of 7

Distributed Tracing — Following One Request Across Many Services

In a microservice mesh, "the API is slow" is a useless statement. Tracing follows a single request as it threads through twenty services — gateway, auth, business logic, three databases, two external APIs — and shows you exactly which span ate the budget. Jaeger and Zipkin built the visualizations; OpenTelemetry is now the standard for the instrumentation underneath.

Spans · Trace IDs · OpenTelemetry · W3C Trace Context · Sampling
Anatomy

The Vocabulary

  • Trace — the full lifecycle of one request, end to end.
  • Span — one unit of work within a trace (an HTTP call, a DB query); it has a start time, a duration, and attributes.
  • Trace context — IDs propagated across services via headers (the traceparent header, per the W3C Trace Context spec).
  • Parent / child spans — form a tree showing causality: each child records which span caused it.
  • Attributes / events — structured key-value pairs and timestamped log lines attached to spans.
  • Sampling — head-based (decide upfront, at the trace root) or tail-based (decide after seeing the whole trace).
Players

OTel, Jaeger, Zipkin

  • OpenTelemetry (OTel) — CNCF standard for SDKs, collector, and wire protocol (OTLP). Merger of OpenCensus and OpenTracing; the de facto answer for new instrumentation.
  • Jaeger — tracing backend and UI (CNCF graduated). Born at Uber; works well with the OTel collector.
  • Zipkin — the original open-source distributed-tracing system. Twitter's contribution; still widely used and simpler than Jaeger.
  • Tempo — Grafana's tracing backend. Backed by cheap object storage; pairs with Loki and Mimir.
  • Commercial — Datadog APM, Honeycomb, Lightstep, New Relic. Strong on tail-sampling, querying, and high-cardinality analysis.
What You Find

Where Traces Pay Off

  • The N+1 query. 200 little DB calls inside one request, instantly visible.
  • Hidden synchronous fan-out. A page that calls 12 services in series instead of in parallel.
  • The retry storm. Service A times out → retries 3× → upstream sees 4× load.
  • The wrong slow step. "Database is slow" is often "DNS resolution within the gateway is slow."
  • Cross-team handoffs — whose service actually owns the latency this quarter.
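The N+1 pattern in the first bullet is mechanical enough to detect straight from span data. A sketch under stated assumptions — spans are plain dicts with span_id, parent_id, and name fields (illustrative names, not any backend's schema): flag any parent whose children repeat the same operation name past a threshold.

```python
from collections import Counter

def find_n_plus_one(spans: list[dict], threshold: int = 10) -> list[tuple[str, str, int]]:
    """Return (parent span_id, repeated child name, count) for suspicious fan-outs."""
    children_by_parent: dict[str, Counter] = {}
    for s in spans:
        if s.get("parent_id"):  # root spans have no parent
            children_by_parent.setdefault(s["parent_id"], Counter())[s["name"]] += 1
    return [
        (parent, name, count)
        for parent, names in children_by_parent.items()
        for name, count in names.items()
        if count >= threshold
    ]
```

Two hundred identical SELECT spans under one HTTP handler span surface as a single tuple — the "200 little DB calls" case above, reduced to a query over the span tree.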
Tradeoffs

What to Watch Out For

  • Volume is huge. Sampling 100% of traces is rarely affordable, but 1% head sampling misses rare slow paths. Tail-based sampling is the usual answer at scale.
  • Context propagation must be everywhere. A single service that drops the headers blinds the rest of the trace.
  • Async work (queues, batch jobs) needs explicit span links; auto-instrumentation doesn't always create them.
  • OTel is a moving target. Traces are stable everywhere; metrics and logs SDK maturity still varies by language.
  • Don't double-pay. With OTel, a vendor SDK, and a service mesh all emitting spans, you can pay to ingest the same trace three times.
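The sampling tradeoff in the first bullet can be sketched concretely. A tail-based sampler decides after the trace is complete: keep everything with an error or high latency, and only a small deterministic fraction of the boring rest. A minimal sketch — the keep-rules and the hash-based baseline rate are illustrative policy, not any collector's defaults:

```python
import hashlib

def keep_trace(trace_id: str, spans: list[dict],
               latency_budget_ms: float = 500.0, baseline_rate: float = 0.01) -> bool:
    """Tail-sampling decision, called once the whole trace has been seen."""
    if any(s.get("error") for s in spans):
        return True  # always keep failures
    total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if total_ms > latency_budget_ms:
        return True  # always keep the slow outliers head sampling would miss
    # Hash the trace ID so every collector node makes the same call for a given trace.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < baseline_rate * 10_000
```

Hashing the trace ID (rather than rolling a die per span) is what keeps the decision consistent across a horizontally scaled collector tier — every node keeps or drops the same traces.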