See what your infrastructure is doing: read built-in metrics, ship and search application logs, raise alarms when something goes wrong, and trigger automated actions on a schedule or on events with EventBridge.
CloudWatch is the eyes and ears of AWS: metrics (numbers over time, like CPU), logs (text your apps emit), alarms (alerts when a metric crosses a line), and events (reacting to things that happen). Why: without it you are flying blind — you cannot fix what you cannot see.
List the metric namespaces AWS is already collecting for you
aws cloudwatch list-metrics --query 'Metrics[].Namespace' \
--output text | tr '\t' '\n' | sort -u | head
A metric is a time series: CPU %, request count, queue depth. AWS publishes many automatically (EC2, RDS, Lambda, …). Why read them: they tell you whether a server is overloaded, a queue is backing up, or errors are spiking. You can also publish your own custom metrics.
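To read a metric you pick a time window and a bucket size (the period), and CloudWatch returns at most one datapoint per bucket. Here is a minimal sketch of that arithmetic, assuming GNU date (Linux). The helper names iso_utc_minutes_ago and bucket_count are made up for illustration and are not part of the AWS CLI.

```shell
# Hypothetical helpers, not AWS CLI commands. Assumes GNU date (Linux).

# Print an ISO-8601 UTC timestamp N minutes in the past,
# in the form accepted by --start-time and --end-time.
iso_utc_minutes_ago() {
  date -u -d "$1 minutes ago" +%Y-%m-%dT%H:%M:%SZ
}

# Number of buckets in a window: window length in minutes over period in seconds.
bucket_count() {
  echo $(( $1 * 60 / $2 ))
}

bucket_count 60 300   # one hour in 5-minute buckets prints 12
```

With these, passing --start-time "$(iso_utc_minutes_ago 60)" and --end-time "$(iso_utc_minutes_ago 0)" would ask for the last hour rather than a fixed date.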
Read average CPU of an instance over an hour, in 5-minute buckets
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 --metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
--start-time 2024-05-01T00:00:00Z --end-time 2024-05-01T01:00:00Z \
--period 300 --statistics Average
Publish your own custom metric (e.g. items in a processing queue)
aws cloudwatch put-metric-data --namespace MyApp \
--metric-name QueueDepth --value 42
CloudWatch Logs stores text your apps and AWS services write, grouped into "log groups" and "log streams." Why: it is where you go to read what actually happened (Lambda output, server logs, error traces), all searchable in one place.
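Searching with Logs Insights is asynchronous: aws logs start-query returns a queryId straight away, and the rows are fetched afterwards with aws logs get-query-results. Below is a rough sketch of the waiting step. The name wait_for_query is a hypothetical helper, and the one-second poll interval and the default of 30 attempts are arbitrary assumptions.

```shell
# Hypothetical helper: poll a Logs Insights query until it finishes,
# then print its result rows as JSON.
# Usage: wait_for_query QUERY_ID [MAX_TRIES]
wait_for_query() {
  query_id=$1
  tries=${2:-30}   # assumption: give up after 30 polls by default
  while [ "$tries" -gt 0 ]; do
    status=$(aws logs get-query-results --query-id "$query_id" \
      --query status --output text)
    case $status in
      Complete)
        aws logs get-query-results --query-id "$query_id" \
          --query results --output json
        return 0 ;;
      Failed|Cancelled|Timeout)
        echo "query $query_id ended with status $status" >&2
        return 1 ;;
    esac
    sleep 1        # still Scheduled or Running: wait and ask again
    tries=$((tries - 1))
  done
  echo "gave up waiting for query $query_id" >&2
  return 1
}
```

A typical use would capture the ID by adding --query queryId --output text to the start-query call, then run wait_for_query with that ID.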
Create a log group for your app
aws logs create-log-group --log-group-name /myapp/web
Tail it live, like "tail -f", filtering for errors
aws logs tail /myapp/web --follow --filter-pattern "ERROR"
Run a Logs Insights query to count errors per hour
aws logs start-query --log-group-name /myapp/web \
--start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s) \
--query-string 'fields @timestamp | filter @message like /ERROR/ | stats count() by bin(1h)'
An alarm watches a metric and changes state (OK ↔ ALARM) when it crosses a threshold for a set time, then notifies you (commonly via an SNS topic that emails or pages you). Why: you find out about high CPU or a flood of 5xx errors before your users complain.
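"For a set time" means a number of consecutive periods. In the simplest configuration the alarm goes to ALARM only when every one of the last N datapoints is above the threshold, where N is the --evaluation-periods value. The sketch below imitates that rule locally. The function alarm_state is hypothetical, and it deliberately ignores missing-data handling and the --datapoints-to-alarm option, so treat it as a mental model and not as CloudWatch's actual evaluator.

```shell
# Hypothetical local model of a GreaterThanThreshold alarm.
# Usage: alarm_state THRESHOLD PERIODS DATAPOINT...   (oldest datapoint first)
alarm_state() {
  threshold=$1
  periods=$2
  shift 2
  # Keep only the most recent PERIODS datapoints.
  while [ "$#" -gt "$periods" ]; do shift; done
  if [ "$#" -lt "$periods" ]; then
    echo INSUFFICIENT_DATA   # not enough datapoints to decide yet
    return 0
  fi
  for value in "$@"; do
    # awk does the comparison so decimal values work too.
    if ! awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v > t) }'; then
      echo OK                # one datapoint at or below the line is enough
      return 0
    fi
  done
  echo ALARM
}

alarm_state 80 2 50 85 91   # last two datapoints both exceed 80: prints ALARM
```

With a 300-second period and two evaluation periods, CPU has to stay above the threshold for roughly ten minutes before this rule reports ALARM.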
Alarm when average CPU stays above 80% for two 5-minute periods
aws cloudwatch put-metric-alarm \
--alarm-name high-cpu \
--namespace AWS/EC2 --metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
--statistic Average --period 300 --evaluation-periods 2 \
--threshold 80 --comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:111122223333:ops-alerts
EventBridge (the modern home of "CloudWatch Events") runs actions on a schedule or when something happens in AWS. Why: it is the cron and the glue of AWS, for example "every night at 2am, run this Lambda" or "when an EC2 instance stops, notify me." No servers required.
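Schedules are written as cron expressions with six fields (minutes, hours, day-of-month, month, day-of-week, year), evaluated in UTC, and one of the two day fields must be a question mark. Here is a small sketch of building the "every day at HH:MM" form. The helper daily_cron is a made-up name for illustration only.

```shell
# Hypothetical helper: build an EventBridge cron expression that fires
# once a day at the given UTC hour and minute.
# Usage: daily_cron HOUR [MINUTE]
daily_cron() {
  hour=$1
  minute=${2:-0}
  if [ "$hour" -lt 0 ] || [ "$hour" -gt 23 ] ||
     [ "$minute" -lt 0 ] || [ "$minute" -gt 59 ]; then
    echo "hour must be 0-23 and minute must be 0-59" >&2
    return 1
  fi
  # Field order: minutes hours day-of-month month day-of-week year.
  # '?' in day-of-week means "no specific value".
  printf 'cron(%s %s * * ? *)\n' "$minute" "$hour"
}

daily_cron 2 0   # prints cron(0 2 * * ? *)
```

The output can be passed straight to --schedule-expression. For simple intervals EventBridge also accepts rate expressions such as rate(5 minutes).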
A scheduled rule that fires every day at 02:00 UTC
aws events put-rule --name nightly-job \
--schedule-expression 'cron(0 2 * * ? *)'
Point the rule at a Lambda function (built in the Lambda lesson)
aws events put-targets --rule nightly-job \
--targets 'Id=1,Arn=arn:aws:lambda:us-east-1:111122223333:function:nightly'
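When a rule is wired up from the CLI, the Lambda function also needs a resource-based permission that lets EventBridge invoke it; otherwise the rule fires but the function never runs. The sketch below shows one way to grant it. The helper names rule_arn and allow_rule_to_invoke and the statement ID format are assumptions for illustration, and the ARN layout assumes a rule on the default event bus.

```shell
# Hypothetical helpers for granting EventBridge permission to invoke a function.

# Build the ARN of a rule on the default event bus.
# Usage: rule_arn REGION ACCOUNT_ID RULE_NAME
rule_arn() {
  printf 'arn:aws:events:%s:%s:rule/%s\n' "$1" "$2" "$3"
}

# Allow one specific rule to invoke one function.
# Usage: allow_rule_to_invoke FUNCTION_NAME REGION ACCOUNT_ID RULE_NAME
allow_rule_to_invoke() {
  function_name=$1
  region=$2
  account=$3
  rule=$4
  # The statement ID is just a label that must be unique within the
  # function's policy; "eventbridge-<rule>" is an arbitrary choice.
  aws lambda add-permission \
    --function-name "$function_name" \
    --statement-id "eventbridge-$rule" \
    --action lambda:InvokeFunction \
    --principal events.amazonaws.com \
    --source-arn "$(rule_arn "$region" "$account" "$rule")"
}

rule_arn us-east-1 111122223333 nightly-job
# prints arn:aws:events:us-east-1:111122223333:rule/nightly-job
```

For the rule and function above, the call would be allow_rule_to_invoke nightly us-east-1 111122223333 nightly-job. Restricting the permission with --source-arn means that only this rule can trigger the function.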