Monitoring & Alerting · DevOps Deep Dive

Metrics

Numeric measurements sampled over time, such as request latency, error rate, queue depth, CPU usage, and disk usage. They can be stored as aggregates (for example the p50, p95, and p99 percentiles) or as individual data points.

System metrics: CPU, memory, disk, network.
Application metrics: requests/sec, error rate, latency.
Business metrics: transactions completed, revenue, user signups.
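
To make the percentile aggregates above concrete, here is a minimal sketch that computes p50, p95, and p99 from a list of latency samples using the nearest-rank method. The sample values and the function names (percentile, summarize) are made up for illustration; real metrics systems usually estimate percentiles from histograms or sketches rather than sorting raw samples.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the three aggregates named in the text for one batch of samples."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


# Hypothetical request latencies in milliseconds (illustrative data only)
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 18, 900]
summary = summarize(latencies_ms)
print(summary)
```

With this sample the median stays low while p95 and p99 are dominated by the two slow requests, which is why tail percentiles are typically tracked alongside the median.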

Logs

Detailed text output from services, either unstructured or structured. Logs record what happened and when, which makes them essential for debugging specific incidents.

Application logs: what your code is doing.
System logs: OS-level events (kernel panics, package updates).
Access logs: HTTP requests (who, when, what, response code).
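
The difference between unstructured and structured logs is easier to see in code. The sketch below uses Python's standard logging module to emit an access-log entry as one JSON object per line. The JsonFormatter class, the logger name, and the field names (method, path, status) are assumptions chosen for this example, not a standard schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (one line per event)."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed through `extra=` become attributes on the record.
        for key in ("method", "path", "status"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


# Write to an in-memory buffer here; a real service would log to stdout or a file.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("access-example")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.handlers = [handler]

logger.info("request handled", extra={"method": "GET", "path": "/health", "status": 200})

entry = json.loads(buffer.getvalue())
print(entry["path"], entry["status"])
```

Because every entry is machine-parseable, a log pipeline can filter on fields such as status without resorting to regular expressions over free text.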

Traces

Follow a single request through the entire system. Which services did it touch? How long did it spend in each? Where did it slow down?

Use cases: debugging latency issues, understanding critical paths, visualizing service dependencies.
Tools: Jaeger, Zipkin, Datadog APM.
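
The questions a trace answers can be illustrated with a toy data model. The sketch below represents one trace as a list of spans and computes how much time each service spent on its own work. The Span class, its fields, and the sample timings are hypothetical and are not the data model of Jaeger, Zipkin, or Datadog APM; the calculation also assumes that child spans run one after another without overlapping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """One unit of work inside a trace (hypothetical, simplified model)."""
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_time_by_service(spans):
    """Time each service spent itself, excluding time spent waiting on child spans."""
    child_time = {}
    for span in spans:
        if span.parent_id is not None:
            child_time[span.parent_id] = child_time.get(span.parent_id, 0) + span.duration_ms

    totals = {}
    for span in spans:
        own = span.duration_ms - child_time.get(span.span_id, 0)
        totals[span.service] = totals.get(span.service, 0) + own
    return totals


# Illustrative trace: gateway -> orders -> (db, then cache)
trace = [
    Span("a", None, "gateway", 0, 120),
    Span("b", "a", "orders", 5, 110),
    Span("c", "b", "db", 10, 95),
    Span("d", "b", "cache", 96, 100),
]

breakdown = self_time_by_service(trace)
slowest = max(breakdown, key=breakdown.get)
print(breakdown, slowest)
```

In this made-up trace most of the 120 ms request is spent in the db span, which is the kind of answer a tracing UI surfaces visually as a waterfall.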

Monitoring & Alerting — Visibility into Production

Observability Fundamentals