When to use
- A pager just went off and severity is unclear.
- Production is degraded and the on-call needs a structured loop.
- A deploy triggered an SLO breach and you need to decide roll-back vs. roll-forward.
- After mitigation, to draft the postmortem while memory is fresh.
Do not use for general debugging of a feature bug in dev — use the engineering:debug skill. Do not use for known-good maintenance work.
Inputs
alert— alert title, dashboard link, and first-firing timestamp (UTC).service— service name impacted (orunknownif still triaging).environment—prod|staging| multi-region (us-east-1,eu-west-1).symptoms— observed customer impact: errors, latency, outage, partial.recent_changes— last 3 deploys and last 3 infra changes (get from CI history).log_sources— which ofkubectl, CloudWatch, Loki, Datadog, Splunk are available.severity_guess— initial guess (SEV1 total outage, SEV2 major degradation, SEV3 minor).
Outputs
- Triage notes captured in the incident channel (timestamped lines).
- Confirmed severity and blast radius.
- Mitigation action taken with evidence that it worked.
- Slack status updates (initial, update every 30 min during active, resolution).
- Draft postmortem using
references/postmortem-template.mdwithin 24 hours of resolution.
Tool dependencies
kubectlwith context set to the affected cluster.helmfor rollback.- AWS CLI /
az/gcloudfor cloud-side evidence. logcli(Loki) or cloud log CLIs (aws logs tail,az monitor).promtool/curlto hit Prometheus.ghCLI to pull recent deploys.- Slack MCP or webhook for status posts.
Procedure
0. Declare
Open (or acknowledge) an incident in PagerDuty/Opsgenie. Create a dedicated Slack channel: #inc-<yyyymmdd>-<short-desc>. Assign roles:
- Incident Commander (IC) — decisions, communication.
- Ops lead — runs the procedure below.
- Scribe — timestamped notes into the channel (UTC).
- Comms — external status page + customer comms.
For a solo pager, one person covers IC + Ops + Scribe; still run the loop.
1. Initial Slack update (within 5 minutes)
:rotating_light: Incident declared
Severity: SEV2 (suspected)
Service: checkout-api
Impact: ~10% of checkout requests returning 5xx since 14:32 UTC
IC: @alice Ops: @bob
Channel: #inc-20260419-checkout-5xx
Status page: updating now
Next update: 15:05 UTC
2. Gather evidence
Run in parallel; capture command + output into the channel.
# Recent deploys
gh run list -R acme/checkout-api --workflow deploy.yml --limit 5
# Helm history
helm -n prod history checkout-api | head
# Pod status
kubectl -n prod get pods -l app.kubernetes.io/name=checkout-api -o wide
kubectl -n prod top pods -l app.kubernetes.io/name=checkout-api
# Recent events
kubectl -n prod get events --sort-by=.lastTimestamp | tail -30
# Application logs (last 15 min, ERROR or higher)
kubectl -n prod logs -l app.kubernetes.io/name=checkout-api \
--since=15m --tail=2000 --prefix --timestamps \
| grep -iE 'error|panic|fatal|5[0-9]{2}' | head -100
# CloudWatch (if logs ship there)
aws logs tail /aws/containerinsights/prod/application \
--since 15m --filter-pattern '{ $.service = "checkout-api" && $.level = "ERROR" }' --format short
# Loki
logcli query --since=15m --limit=500 \
'{namespace="prod", app="checkout-api"} |= "ERROR"'
# Prometheus — error rate and p95 latency
curl -sG "${PROM}/api/v1/query" \
--data-urlencode 'query=sum(rate(http_requests_total{service="checkout-api",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="checkout-api"}[5m]))'
curl -sG "${PROM}/api/v1/query" \
--data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{service="checkout-api"}[5m])))'
Also check:
- Dependency dashboards (DB CPU, queue lag, cache hit rate).
- Recent infra changes (
terraform show, Azure Activity Log, CloudTrail). - Whether the alert is isolated to one AZ/node/region.
3. Identify the failing component
Use a decision tree:
- Errors only on pods from the new ReplicaSet? -> bad deploy.
- Errors tied to DB timeouts? -> DB side (check CPU, slow query log, locks).
- Errors from upstream caller only? -> ingress/LB or client-side TLS/DNS.
- Errors in one AZ only? -> suspect AZ failure, cordon nodes.
- Queue depth rising, no consumer errors? -> consumer throughput/capacity.
Document the single hypothesis you are acting on. If wrong, rewind.
4. Confirm blast radius
- Percentage of requests affected (from Prometheus).
- Customer segments affected (authenticated vs anonymous, region, plan tier).
- Whether data is being written incorrectly (worse than errors — check).
- Whether downstream services are also degrading.
Re-evaluate severity with the real number. Upgrade/downgrade as needed.
5. Communicate
Slack update every 30 minutes while active, plus on any major status change:
Update 15:05 UTC
Impact: still ~10% of checkout requests failing
Hypothesis: last deploy (sha-7f3a9c1) introduced regression; rolling back now
ETA mitigation: 10 min
Next update: 15:20 UTC or sooner
Status page update matches the Slack message in tone but stays customer-facing (no internal service names).
6. Mitigate
Try the safest action that restores service, even if it doesn’t fix the root cause.
- Bad deploy:
helm -n prod rollback checkout-api <previous-revision> --wait --timeout 10m. - Bad config: revert ConfigMap via
kubectl rollout restart deploy/checkout-apiafterkubectl applyof the previous ConfigMap. - Bad node:
kubectl cordon+kubectl drainone at a time, verifying impact each step. - Traffic spike: scale manually
kubectl -n prod scale deploy/checkout-api --replicas=20, raise HPA max. - Bad dependency: flip a feature flag to fallback path, or shed traffic via rate limiter.
Confirm mitigation:
- Error rate drops below SLO threshold for 10 continuous minutes.
- p95 returns to baseline.
- No pod is
CrashLoopBackOff.
7. Resolve
Slack:
:white_check_mark: Incident resolved 15:42 UTC
Root cause (preliminary): regression in checkout-api v1.4.2 rolled back to v1.4.1
Customer impact: ~10% of checkouts between 14:32-15:28 UTC (56 minutes)
Postmortem: scheduled, owner @alice, due 2026-04-22
Status page marked resolved. Page oncall off.
8. Postmortem
Within 24 hours, draft the postmortem using references/postmortem-template.md. Send for review within 72 hours. Action items land in Jira/Linear with named owners and due dates.
Examples
Example 1 — 5xx spike tied to a deploy
Alert: checkout-api error rate > 2% for 5m. Timeline shows a deploy 8 minutes earlier. Evidence: new pods log NullPointerException at 14:33 UTC; 30 pods restart in 5 minutes. Mitigation: helm rollback to previous revision. Error rate returns to baseline at 15:28 UTC. Postmortem identifies missing null check on new optional header and a staging test gap.
Example 2 — Latency spike with no recent deploy
Alert: p95 latency > 500ms. No deploy in the last 24 hours. Evidence: RDS CPU at 98%, pg_stat_activity shows a long-running ANALYZE blocking queries. Mitigation: SELECT pg_cancel_backend(pid) for the offending session. Latency returns to baseline. Postmortem identifies a manual DBA task run during peak traffic and a missing guardrail.
Constraints
- Never speculate publicly. Say “investigating” until you have evidence.
- Never push a forward fix during an active SEV1/SEV2 without first attempting rollback; the clock matters more than the fix.
- Never
kubectl delete podblindly to “clear” an issue — you lose evidence. - Never write to production data as part of triage; read-only. Any write is a decision the IC must approve in channel.
- Never resolve the incident before metrics show sustained recovery (10 min continuous).
- Blameless means blameless. Postmortems describe systems, not individuals.
- UTC timestamps only. Local time causes reconstruction errors.
Quality checks
- Channel has timestamped (UTC) evidence for every decision.
- Severity assignment matches impact observed (not the initial guess).
- Mitigation is confirmed by a metric, not a vibe.
- Stakeholders were updated at least every 30 min while active.
- Postmortem drafted using the template within 24 hours; action items have owners and due dates.
- All temporary silences/overrides applied during the incident are removed after resolution.
kubectl get eventsand relevant logs captured before any pod is recreated — evidence preserved.