CloudWatch — Monitoring, Logs & Events

See what your infrastructure is doing: read built-in metrics, ship and search application logs, raise alarms when something goes wrong, and trigger automated actions on a schedule or on events with EventBridge.

What CloudWatch covers

CloudWatch is the eyes and ears of AWS: metrics (numbers over time, like CPU), logs (text your apps emit), alarms (alerts when a metric crosses a line), and events (reacting to things that happen). Why: without it you are flying blind — you cannot fix what you cannot see.

List the metric namespaces AWS is already collecting for you

aws cloudwatch list-metrics --query 'Metrics[].Namespace' \
  --output text | tr '\t' '\n' | sort -u | head

Metrics — numbers over time

A metric is a time series — CPU %, request count, queue depth. AWS publishes many automatically (EC2, RDS, Lambda, …). Why read them: they tell you whether a server is overloaded, a queue is backing up, or errors are spiking. You can also publish your own custom metrics.

Read average CPU of an instance over an hour, in 5-minute buckets

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2024-05-01T00:00:00Z --end-time 2024-05-01T01:00:00Z \
  --period 300 --statistics Average

Publish your own custom metric (e.g. items in a processing queue)

aws cloudwatch put-metric-data --namespace MyApp \
  --metric-name QueueDepth --value 42
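
Custom metrics usually become more useful with a dimension and a unit, so you can graph one queue separately from another. The sketch below wraps the same call in a small helper; the helper name and the QueueName dimension are assumptions made for illustration, not part of any AWS convention.

```shell
# Hypothetical helper: publish one QueueDepth data point for a named queue.
# "MyApp" and "QueueDepth" come from the example above; the QueueName
# dimension is an assumption made for this sketch.
publish_queue_depth() {
  queue_name=$1
  depth=$2
  aws cloudwatch put-metric-data \
    --namespace MyApp \
    --metric-name QueueDepth \
    --dimensions "QueueName=${queue_name}" \
    --unit Count \
    --value "${depth}"
}

# Usage (needs AWS credentials):
#   publish_queue_depth orders 42
```

Each distinct combination of dimension values is stored as its own metric, so keep the set of queue names small.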

Logs — collect and search text

CloudWatch Logs stores text your apps and AWS services write, grouped into "log groups" and "log streams." Why: it is where you go to read what actually happened — Lambda output, server logs, error traces — all searchable in one place.

Create a log group for your app

aws logs create-log-group --log-group-name /myapp/web

Tail it live, like "tail -f", filtering for errors (requires AWS CLI v2)

aws logs tail /myapp/web --follow --filter-pattern "ERROR"

Run a Logs Insights query to count errors per hour over the last day (GNU date syntax; the command returns a queryId, not the results)

aws logs start-query --log-group-name /myapp/web \
  --start-time $(date -d '24 hours ago' +%s) --end-time $(date +%s) \
  --query-string 'fields @timestamp | filter @message like /ERROR/ | stats count() as errors by bin(1h)'
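
Logs Insights queries run asynchronously: start-query hands back only a queryId, and the rows are fetched separately with get-query-results. A minimal polling sketch follows; the helper name wait_for_query is hypothetical, and the one-second poll interval is an arbitrary choice.

```shell
# Hypothetical helper: wait for a Logs Insights query, then print its rows.
wait_for_query() {
  query_id=$1
  while :; do
    status=$(aws logs get-query-results --query-id "$query_id" \
      --query 'status' --output text)
    case "$status" in
      Complete) break ;;
      Scheduled|Running) sleep 1 ;;
      *) echo "query ended with status: $status" >&2; return 1 ;;
    esac
  done
  # Print the result rows as JSON once the query has completed.
  aws logs get-query-results --query-id "$query_id" \
    --query 'results' --output json
}

# Usage (needs AWS credentials): capture the ID from start-query by adding
#   --query 'queryId' --output text
# to that command, then run:
#   wait_for_query "$query_id"
```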

Alarms — get told when something is wrong

An alarm watches a metric and moves between three states: OK, ALARM, and INSUFFICIENT_DATA (not enough data points to decide). It enters ALARM when the metric crosses a threshold for a set number of periods, then notifies you (commonly via an SNS topic that emails or pages you). Why: you find out about high CPU or a flood of 5xx errors before your users complain.

Alarm when average CPU stays above 80% for two 5-minute periods

aws cloudwatch put-metric-alarm \
  --alarm-name high-cpu \
  --namespace AWS/EC2 --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average --period 300 --evaluation-periods 2 \
  --threshold 80 --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:111122223333:ops-alerts
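
Once an alarm exists, it helps to confirm that it evaluates and that its notification actually reaches you. The sketch below reads the alarm's state and forces a test transition; both helper names are hypothetical, and the state reason text is arbitrary.

```shell
# Print the current state of an alarm: OK, ALARM, or INSUFFICIENT_DATA.
alarm_state() {
  aws cloudwatch describe-alarms --alarm-names "$1" \
    --query 'MetricAlarms[0].StateValue' --output text
}

# Temporarily force an alarm into ALARM so its actions run.
# CloudWatch returns it to the real state at the next evaluation.
fire_test_alarm() {
  aws cloudwatch set-alarm-state --alarm-name "$1" \
    --state-value ALARM --state-reason "Manual notification test"
}

# Usage (needs AWS credentials):
#   alarm_state high-cpu
#   fire_test_alarm high-cpu
```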

Events & EventBridge — react automatically

EventBridge (the modern home of "CloudWatch Events") runs actions on a schedule or when something happens in AWS. Why: it is the cron and the glue of AWS — "every night at 2am, run this Lambda" or "when an EC2 instance stops, notify me." No servers required.
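
The "when an EC2 instance stops" case uses an event pattern instead of a schedule. A minimal sketch, assuming a rule named ec2-stopped and a local pattern file (both names are made up for this example):

```shell
# Event pattern: match EC2 state-change events where the new state is "stopped".
cat > ec2-stopped-pattern.json <<'EOF'
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": { "state": ["stopped"] }
}
EOF

# Hypothetical helper: create the rule from the pattern file above.
create_stop_rule() {
  aws events put-rule --name ec2-stopped \
    --event-pattern file://ec2-stopped-pattern.json
}

# Usage (needs AWS credentials):
#   create_stop_rule
```

The rule does nothing until it has a target, attached with put-targets in the same way as for a scheduled rule.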

A scheduled rule that fires every day at 02:00 UTC

aws events put-rule --name nightly-job \
  --schedule-expression 'cron(0 2 * * ? *)'

Point the rule at a Lambda function (built in the Lambda lesson)

aws events put-targets --rule nightly-job \
  --targets 'Id=1,Arn=arn:aws:lambda:us-east-1:111122223333:function:nightly'
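
Attaching the target is not enough on its own: the function's resource policy must also allow EventBridge to invoke it, otherwise the rule fires and the invocation is denied. A sketch of that grant, assuming the function and rule names used above; the helper name and the statement ID are hypothetical.

```shell
# Hypothetical helper: let the nightly-job rule invoke the "nightly" function.
# The statement ID is an arbitrary label and must be unique within the policy.
allow_rule_to_invoke() {
  aws lambda add-permission \
    --function-name nightly \
    --statement-id nightly-job-invoke \
    --action lambda:InvokeFunction \
    --principal events.amazonaws.com \
    --source-arn arn:aws:events:us-east-1:111122223333:rule/nightly-job
}

# Usage (needs AWS credentials):
#   allow_rule_to_invoke
```

Scoping the grant with --source-arn limits it to this one rule, not every rule in the account.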