Prometheus + Grafana · Observability & Performance Deep Dive

Anatomy

The Building Blocks

Time series — one metric + a set of labels over time (e.g., http_requests_total{method="GET",status="200"}).
Scrape — Prometheus pulls /metrics from each target on a fixed interval (scrape_interval defaults to 1m; 15–60s is typical).
Exporter — adapter that exposes third-party stats as Prom-format metrics (node_exporter, postgres_exporter, blackbox_exporter, etc.).
PromQL — query language: rate(), histogram_quantile(), sum by().
Alertmanager — separate component that routes Prometheus alerts to Slack / PagerDuty / email.
Long-term storage — Thanos, Mimir, Cortex, or VictoriaMetrics for HA + multi-year retention.
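
As a rough illustration of what rate() does with a counter, the sketch below computes a per-second rate from raw samples and treats any drop in value as a counter reset. It is a simplification: PromQL's rate() also extrapolates to the edges of the query window, which is skipped here. The helper name `simple_rate` and the sample data are hypothetical.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> Optional[float]:
    """Approximate per-second increase of a counter over the given samples.

    Hypothetical helper, not Prometheus's implementation: it omits the
    window-boundary extrapolation that PromQL's rate() performs.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two points
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the process restarted and the counter began again at 0,
        # so the whole new value counts as increase.
        increase += value if value < previous else value - previous
        previous = value
    return increase / elapsed


# Made-up samples scraped every 15s; the counter resets between t=30 and t=45.
scraped = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(scraped))
```

Because resets are absorbed this way, a counter restart shows up as a brief gap in precision rather than a large negative spike, which is why counters are queried through rate() instead of being graphed raw.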

The Four Metric Types

Type	Use for	Example
Counter	Monotonic counts (only goes up)	`http_requests_total`
Gauge	Things that go up and down	`queue_depth`, `memory_bytes`
Histogram	Distributions (esp. latency)	`http_request_duration_seconds`
Summary	Pre-computed quantiles (rare)	p50/p99 calculated client-side

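Prefer histograms for latency: bucket counts can be summed across instances and a quantile computed afterward with histogram_quantile(). Summary quantiles are computed inside each process and cannot be meaningfully aggregated.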
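
The aggregation argument can be sketched as follows: sum the cumulative bucket counts from two instances, then estimate a quantile by linear interpolation inside the bucket that contains the target rank, which is the same idea histogram_quantile() uses for classic bucketed histograms. The function names and bucket data are hypothetical, and the sketch assumes both instances use identical bucket bounds ending in +Inf.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def merge(a: List[Bucket], b: List[Bucket]) -> List[Bucket]:
    """Sum cumulative counts bucket by bucket (bounds must match)."""
    assert [le for le, _ in a] == [le for le, _ in b], "bucket bounds differ"
    return [(le, ca + cb) for (le, ca), (_, cb) in zip(a, b)]


def bucket_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative buckets sorted by upper bound."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    i = next(idx for idx, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[i]
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


inf = math.inf
instance_a = [(0.1, 50), (0.5, 90), (1.0, 100), (inf, 100)]
instance_b = [(0.1, 10), (0.5, 50), (1.0, 90), (inf, 100)]
merged = merge(instance_a, instance_b)

p90_merged = bucket_quantile(0.9, merged)
p90_averaged = (bucket_quantile(0.9, instance_a) + bucket_quantile(0.9, instance_b)) / 2
print(p90_merged, p90_averaged)
```

With this made-up data the merged estimate and the average of the two per-instance estimates disagree, which is the practical reason averaging client-side summary quantiles gives misleading fleet-wide numbers.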

RED & USE

RED for request-driven services: Rate (req/s), Errors (% failing), Duration (latency p50/p99).
USE for resources: Utilization, Saturation, Errors.
The Four Golden Signals (Google SRE) — latency, traffic, errors, saturation.
SLI → SLO → alerts. Don't alert on raw CPU; alert on what users experience.
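
To make the "alert on what users experience" point concrete, this sketch derives an availability SLI from request counts over a window and compares it with an SLO target. The 99.9% target, the counts, and the function names are illustrative assumptions, not values from any real system.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that succeeded in the window."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return 1.0 - failed_requests / total_requests


def violates_slo(sli: float, slo_target: float) -> bool:
    """True when the measured SLI has fallen below the objective."""
    return sli < slo_target


# Hypothetical window: 120,000 requests, 180 of them failed.
window_total = 120_000
window_failed = 180
sli = availability_sli(window_total, window_failed)
status = "ALERT" if violates_slo(sli, 0.999) else "ok"
print(f"SLI={sli:.4%} {status}")
```

The alert condition here depends only on the share of failed user requests, so a busy but healthy CPU never pages anyone.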

Tradeoffs

Cardinality explosion. Each unique label combination = a new series. user_id as a label = millions of series = OOM.
Single-node by default. One Prom = single point of failure. Run HA pairs (two identical servers scraping the same targets) or remote-write to Thanos/Mimir.
Pull model needs network reachability: Prometheus must be able to connect to every target. Short-lived jobs that exit before a scrape need the Pushgateway (use sparingly).
Dashboard sprawl. 200 dashboards nobody owns. Curate ruthlessly.
Alert noise. If you ignore alerts, you have too many. Tune to actionable only.
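
The cardinality point in the list above can be estimated with simple arithmetic: the worst-case number of series for one metric is the product of the distinct values each label can take. The label names and counts below are made up for illustration, and `max_series` is a hypothetical helper.

```python
from math import prod
from typing import Dict


def max_series(distinct_values_per_label: Dict[str, int]) -> int:
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(distinct_values_per_label.values())


# Hypothetical label set for http_requests_total.
labels = {"method": 5, "status": 10, "instance": 20}
print(max_series(labels))  # 1000

# Adding a per-user label multiplies the bound by the number of users.
with_user = {**labels, "user_id": 1_000_000}
print(max_series(with_user))  # 1000000000
```

Not every combination occurs in practice, so this is an upper bound, but it shows why one unbounded label dominates everything else.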
