Querying with PromQL

PromQL is how you ask Prometheus questions. Select and filter series by label, turn counters into per-second rates, and aggregate across instances into the numbers you actually want.

Ad 728×90

Select and filter by label

Why: the simplest query is a metric name, which returns every series for it. You narrow with label matchers in {braces}: = exact, != not, =~ regex match. This is how you focus on one job, instance, or status code out of thousands of series.

# Every series for this metric
node_cpu_seconds_total

# Only the "idle" CPU mode
node_cpu_seconds_total{mode="idle"}

# Regex: any 5xx status
http_requests_total{status=~"5.."}

rate(): make counters useful

Why: a counter total is meaningless on its own — what you want is how fast it is growing. rate() computes the per-second average increase over a time window (written in [brackets]). This is THE most important PromQL function: requests/sec, errors/sec, bytes/sec all come from rate() over a counter.

# Requests per second, averaged over the last 5 minutes
rate(http_requests_total[5m])

# Error rate per second
rate(http_requests_total{status=~"5.."}[5m])

Aggregate across series

Why: rate() gives you one number per series (per instance, per status…). Aggregation operators collapse them: sum totals across series, and "by (label)" keeps one result per value of that label. This turns hundreds of per-instance rates into the few lines you actually chart.

# Total requests/sec across all instances
sum(rate(http_requests_total[5m]))

# Requests/sec broken down by status code
sum by (status) (rate(http_requests_total[5m]))

# Average CPU busy fraction per instance
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

Percentiles from histograms

Why: averages hide pain — a 50ms average can still mean 1% of users wait 5s. Histograms let you compute percentiles: histogram_quantile over the bucketed _bucket series gives the 95th-percentile latency. This is the standard way to track "how slow is it for the unlucky requests".

# 95th-percentile request latency over 5 minutes
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

The RED method

Note: a reliable starting point for any service is the RED metrics — Rate (requests/sec), Errors (failed/sec), and Duration (latency percentiles). Three PromQL queries cover the health of most services; build your first dashboard and alerts from these before going deeper.

Rate     → sum(rate(http_requests_total[5m]))
  Errors   → sum(rate(http_requests_total{status=~"5.."}[5m]))
  Duration → histogram_quantile(0.95, sum by (le) (rate(..._bucket[5m])))