PromQL is how you ask Prometheus questions. Select and filter series by label, turn counters into per-second rates, and aggregate across instances into the numbers you actually want.
Why: the simplest query is a metric name, which returns every series for it. You narrow with label matchers in {braces}: = exact match, != not equal, =~ regex match, !~ regex non-match. Regexes are fully anchored, so "5.." matches 500 but not 1500. This is how you focus on one job, instance, or status code out of thousands of series.
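To make the matcher semantics concrete, here is a minimal Python sketch. It is not Prometheus code: the `series` list and the `matches` and `select` helpers are hypothetical names invented for illustration, and Python's `re` module stands in for the RE2 regex syntax Prometheus uses, which differs in some details.

```python
import re

# Hypothetical in-memory series; each one is just a dict of labels.
series = [
    {"__name__": "http_requests_total", "job": "api", "status": "200"},
    {"__name__": "http_requests_total", "job": "api", "status": "500"},
    {"__name__": "http_requests_total", "job": "api", "status": "503"},
    {"__name__": "http_requests_total", "job": "web", "status": "1500"},
]


def matches(labels, name, op, value):
    """Apply one label matcher to one series."""
    actual = labels.get(name, "")  # a missing label behaves like an empty value
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        # fullmatch models the anchoring: the regex must cover the whole value
        return re.fullmatch(value, actual) is not None
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown matcher operator: {op}")


def select(all_series, matchers):
    """Keep the series that satisfy every matcher."""
    return [s for s in all_series if all(matches(s, *m) for m in matchers)]


five_xx = select(series, [("status", "=~", "5..")])
print([s["status"] for s in five_xx])  # the made-up "1500" value is excluded
```

All matchers in one selector must hold at once, which is why `select` uses `all(...)`.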
# Every series for this metric
node_cpu_seconds_total
# Only the "idle" CPU mode
node_cpu_seconds_total{mode="idle"}
# Regex: any 5xx status
http_requests_total{status=~"5.."}
Why: a counter total is meaningless on its own; what you want is how fast it is growing. rate() computes the per-second average increase over a time window (written in [brackets]). This is THE most important PromQL function: requests/sec, errors/sec, bytes/sec all come from rate() over a counter.
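The arithmetic behind rate() can be sketched in a few lines of Python. This is a deliberately simplified model, not the real implementation: `simple_rate` is a hypothetical helper, the sample values are made up, and the actual rate() also extrapolates toward the edges of the window, so its results can differ slightly from this sketch.

```python
def simple_rate(samples):
    """Per-second average increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when there are fewer than two samples in the window.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            # A drop means the counter reset to zero (e.g. a process restart),
            # so the whole new value counts as growth since the reset.
            increase += value
        else:
            increase += value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Made-up scrape data: one sample per minute, with a reset before t=180.
window = [(0, 100), (60, 160), (120, 220), (180, 10), (240, 70)]
print(simple_rate(window))  # 190 requests over 240 seconds
```

The reset handling is the reason rate() belongs on counters only: a gauge that legitimately goes down would be misread as a restart.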
# Requests per second, averaged over the last 5 minutes
rate(http_requests_total[5m])
# Error rate per second
rate(http_requests_total{status=~"5.."}[5m])
Why: rate() gives you one number per series (per instance, per status…). Aggregation operators collapse them: sum totals across series, and "by (label)" keeps one result per value of that label. This turns hundreds of per-instance rates into the few lines you actually chart.
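As a rough illustration of what "sum" and "sum by (label)" do, the Python sketch below groups a handful of per-series rates. The `rates` data and the `sum_by` helper are assumptions made up for this example; they only model the grouping, not the query engine.

```python
from collections import defaultdict

# Made-up output of rate(): one (labels, requests_per_second) pair per series.
rates = [
    ({"instance": "a", "status": "200"}, 40.0),
    ({"instance": "b", "status": "200"}, 35.0),
    ({"instance": "a", "status": "500"}, 1.5),
    ({"instance": "b", "status": "500"}, 0.5),
]


def sum_by(samples, label):
    """Collapse series into one total per distinct value of `label`."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)


# Plain sum(...): every label is dropped and one number remains.
total = sum(value for _, value in rates)
print(total)                    # 77.0

# sum by (status): the instance label is dropped, status is kept.
print(sum_by(rates, "status"))  # {'200': 75.0, '500': 2.0}
```

Note the order of operations: the rate is taken per series first and the results are summed afterwards, matching the sum(rate(...)) nesting.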
# Total requests/sec across all instances
sum(rate(http_requests_total[5m]))
# Requests/sec broken down by status code
sum by (status) (rate(http_requests_total[5m]))
# Average CPU busy fraction per instance
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
Why: averages hide pain. A 50ms average can still mean 1% of users wait 5s. Histograms let you estimate percentiles: histogram_quantile applied to the per-second rate of the _bucket series, summed by the le (upper bound) label, estimates the 95th-percentile latency. This is the standard way to track "how slow is it for the unlucky requests".
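The estimate comes from linear interpolation inside the bucket that contains the requested rank. The Python sketch below models that idea under stated assumptions: `quantile_from_buckets` is a hypothetical helper, the bucket counts are invented, q is strictly between 0 and 1, and the list holds at least one finite bucket followed by the +Inf bucket. The real function handles more edge cases.

```python
import math


def quantile_from_buckets(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with the last entry being the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The rank falls past the last finite bound; report that bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Made-up latency buckets in seconds: 100 requests in total.
latency = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(quantile_from_buckets(0.95, latency))  # about 0.8125, inside the 0.5-1.0 bucket
```

Because the result is interpolated, its precision is limited by the bucket boundaries: a wide bucket around the percentile of interest gives a coarse answer.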
# 95th-percentile request latency over 5 minutes
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
Note: a reliable starting point for any service is the RED metrics: Rate (requests/sec), Errors (failed/sec), and Duration (latency percentiles). Three PromQL queries cover the health of most services; build your first dashboard and alerts from these before going deeper.
Rate → sum(rate(http_requests_total[5m]))
Errors → sum(rate(http_requests_total{status=~"5.."}[5m]))
Duration → histogram_quantile(0.95, sum by (le) (rate(..._bucket[5m])))