Observability & Performance Deep Dive · 2 of 7

Prometheus & Grafana — The Open-Source Default

Prometheus came out of SoundCloud in 2012, joined the CNCF in 2016, and is now the gravitational center of OSS metrics. It scrapes HTTP endpoints on an interval, stores time series in a local TSDB, and exposes them for querying via PromQL. Grafana is the dashboard UI everyone bolts on top.

PromQL · Pull-based · Exporters · Alertmanager · Thanos / Mimir / Cortex
Anatomy

The Building Blocks

Basic Concepts

  • Time series: one metric name plus a set of labels, sampled over time (e.g., http_requests_total{method="GET",status="200"}).
  • Scrape: Prometheus pulls /metrics from each target over HTTP on an interval (typically 15–60s; the built-in default is 1m).
  • Exporter: an adapter that exposes third-party stats as Prometheus-format metrics (node_exporter, postgres_exporter, blackbox_exporter, etc.).
  • PromQL: the query language, built around functions and aggregations such as rate(), histogram_quantile(), and sum by().
  • Alertmanager: a separate component that routes Prometheus alerts to Slack / PagerDuty / email.
  • Long-term storage: Thanos, Mimir, Cortex, or VictoriaMetrics for high availability and multi-year retention.
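
To make the time-series and scrape concepts concrete, here is a minimal sketch of what a target returns when Prometheus scrapes /metrics. It renders the text exposition format by hand using only the Python standard library. The function names (render_metric, escape_label_value) and the sample values are hypothetical, chosen for illustration; a real service would normally use an official client library instead.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs; each pair is one
    time series (one unique label combination). Hypothetical helper,
    not part of any Prometheus library. HELP text is not escaped here.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        pairs = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        selector = f"{{{pairs}}}" if pairs else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up sample values for two series of the same counter.
    print(
        render_metric(
            "http_requests_total",
            "Total HTTP requests.",
            "counter",
            [
                ({"method": "GET", "status": "200"}, 1027),
                ({"method": "POST", "status": "500"}, 3),
            ],
        ),
        end="",
    )
```

Each rendered line after the HELP and TYPE comments is one series; Prometheus attaches the scrape timestamp itself, which is why the sketch emits none.
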
The Four Metric Types

Pick the Right One

Type      | Use for                         | Example
Counter   | Monotonic counts (only goes up) | http_requests_total
Gauge     | Things that go up and down      | queue_depth, memory_bytes
Histogram | Distributions (esp. latency)    | http_request_duration_seconds
Summary   | Pre-computed quantiles (rare)   | p50/p99 calculated client-side

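Prefer histograms for latency: bucket counts can be summed across instances and turned into quantiles at query time, whereas a summary's client-side quantiles cannot be meaningfully combined.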
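
The aggregation point can be sketched in a few lines. The code below sums cumulative bucket counts from two instances and then estimates a quantile by linear interpolation inside the matching bucket, which is the general idea behind histogram_quantile(). It is a simplified illustration, not Prometheus's implementation; the helper names and the bucket counts are made up.

```python
import math

# Upper bound ("le") -> cumulative count of observations <= that bound.
Buckets = dict[float, float]


def merge_buckets(*instances: Buckets) -> Buckets:
    """Sum cumulative bucket counts across instances (same bounds assumed)."""
    merged: Buckets = {}
    for buckets in instances:
        for le, count in buckets.items():
            merged[le] = merged.get(le, 0) + count
    return merged


def estimate_quantile(q: float, buckets: Buckets) -> float:
    """Estimate the q-quantile by linear interpolation within a bucket."""
    bounds = sorted(buckets)
    if not bounds or bounds[-1] != math.inf:
        raise ValueError("histogram needs a +Inf bucket")
    total = buckets[math.inf]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for le in bounds:
        count = buckets[le]
        if count >= rank:
            if le == math.inf or count == prev_count:
                # Cannot interpolate into +Inf (or an empty bucket):
                # fall back to the last finite boundary seen.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (le - prev_bound) * fraction
        prev_bound, prev_count = le, count
    return prev_bound


# Hypothetical latency histograms (seconds) from two instances.
instance_a = {0.1: 50, 0.5: 90, 1.0: 100, math.inf: 100}
instance_b = {0.1: 10, 0.5: 60, 1.0: 95, math.inf: 100}
fleet = merge_buckets(instance_a, instance_b)
p50 = estimate_quantile(0.5, fleet)
p99 = estimate_quantile(0.99, fleet)
```

Nothing comparable exists for summaries: two instances each reporting a p99 give no way to recover the fleet-wide p99, because the underlying distribution is already gone.
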

RED & USE

What to Actually Measure

  • RED for request-driven services: Rate (req/s), Errors (% failing), Duration (latency p50/p99).
  • USE for resources: Utilization, Saturation, Errors.
  • The Four Golden Signals (Google SRE) — latency, traffic, errors, saturation.
  • SLI → SLO → alerts. Don't alert on raw CPU; alert on what users experience.
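
As a rough illustration of the Rate and Errors parts of RED, the sketch below derives requests per second and an error ratio from raw counter samples. It treats any drop in a counter as a reset to zero, which is the same basic idea PromQL's rate() uses, but it skips the window-boundary extrapolation the real function performs. The function names and sample values are hypothetical.

```python
def increase(samples):
    """Total counter increase over (timestamp, value) samples.

    A value lower than its predecessor is treated as a counter reset
    (for example, a process restart), so the new value counts in full.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def rate(samples):
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes every 15s; the process restarts before t=30.
all_requests = [(0, 1000), (15, 1150), (30, 40), (45, 190)]
failed_requests = [(0, 10), (15, 13), (30, 1), (45, 4)]

request_rate = rate(all_requests)  # the "R" in RED, in requests/second
error_ratio = increase(failed_requests) / increase(all_requests)  # the "E"
```

An error ratio like this is the kind of user-facing signal an SLO alert would be built on, in contrast to a raw CPU threshold.
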
Tradeoffs

What to Watch Out For

  • Cardinality explosion. Each unique label combination = a new series. user_id as a label = millions of series = OOM.
  • Single-node by default. One Prometheus server is a single point of failure; run HA pairs or remote-write to Thanos/Mimir.
  • Pull model needs network reachability to every target. Short-lived jobs that exit before they can be scraped need the Pushgateway (use sparingly).
  • Dashboard sprawl. 200 dashboards nobody owns. Curate ruthlessly.
  • Alert noise. If you ignore alerts, you have too many. Tune to actionable only.
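
The cardinality warning is easy to check with arithmetic before a label ships. This sketch multiplies the number of distinct values per label to get a worst-case series count for a single metric. The label names and value counts are invented for illustration, and the real count is usually lower because not every combination occurs.

```python
import math


def worst_case_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of per-label value counts."""
    return math.prod(label_value_counts.values())


# Hypothetical label sets for a single request counter.
bounded = {"method": 5, "status": 10, "instance": 20}
with_user_id = {**bounded, "user_id": 1_000_000}

bounded_series = worst_case_series(bounded)
exploded_series = worst_case_series(with_user_id)
print(f"bounded labels: {bounded_series:,} series at most")
print(f"with user_id:   {exploded_series:,} series at most")
```

A quick estimate like this is often enough to decide that an unbounded identifier belongs in logs or traces rather than in a metric label.
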