incident-response

Use when an alert fires, a customer reports an outage, or a deploy goes sideways. Guides triage across logs (kubectl, CloudWatch, Loki), identifies the failing component, confirms blast radius, drives Slack status updates, and produces a blameless postmortem.

Department

DevOps

Safety

writes-shared: writes shared state

Supported stacks

Stack-agnostic — no detection required.

When to use

Do not use for general debugging of a feature bug in dev — use the engineering:debug skill. Do not use for known-good maintenance work.

Inputs

Outputs

Tool dependencies

Procedure

0. Declare

Open (or acknowledge) an incident in PagerDuty/Opsgenie. Create a dedicated Slack channel: #inc-<yyyymmdd>-<short-desc>. Assign roles:

For a solo pager, one person covers IC + Ops + Scribe; still run the loop.
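The channel naming convention above can be generated rather than typed, which keeps names consistent under pressure. This is a minimal sketch: the function name `inc_channel` is hypothetical, and lowercasing and hyphenating the description is an assumption about how `<short-desc>` is normalised.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a channel name of the form #inc-<yyyymmdd>-<short-desc>.
# Assumption: the description is lowercased and every run of non-alphanumeric
# characters collapses to a single hyphen.
inc_channel() {
  local day="$1" desc="$2" slug
  slug=$(printf '%s' "$desc" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//')
  printf '#inc-%s-%s\n' "$day" "$slug"
}

# Usage: today's date in UTC plus a short description.
inc_channel "$(date -u +%Y%m%d)" "Checkout 5xx"
```

Passing the date as an argument, instead of reading the clock inside the function, keeps the helper easy to check against a known incident date.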

1. Initial Slack update (within 5 minutes)

:rotating_light: Incident declared
Severity: SEV2 (suspected)
Service: checkout-api
Impact: ~10% of checkout requests returning 5xx since 14:32 UTC
IC: @alice   Ops: @bob
Channel: #inc-20260419-checkout-5xx
Status page: updating now
Next update: 15:05 UTC

2. Gather evidence

Run in parallel; capture command + output into the channel.

# Recent deploys
gh run list -R acme/checkout-api --workflow deploy.yml --limit 5

# Helm history
helm -n prod history checkout-api | head

# Pod status
kubectl -n prod get pods -l app.kubernetes.io/name=checkout-api -o wide
kubectl -n prod top pods -l app.kubernetes.io/name=checkout-api

# Recent events
kubectl -n prod get events --sort-by=.lastTimestamp | tail -30

# Application logs (last 15 min; error keywords or a standalone 5xx status code)
kubectl -n prod logs -l app.kubernetes.io/name=checkout-api \
  --since=15m --tail=2000 --prefix --timestamps \
  | grep -iE 'error|panic|fatal|(^|[^0-9])5[0-9]{2}([^0-9]|$)' | head -100

# CloudWatch (if logs ship there)
aws logs tail /aws/containerinsights/prod/application \
  --since 15m --filter-pattern '{ $.service = "checkout-api" && $.level = "ERROR" }' --format short

# Loki
logcli query --since=15m --limit=500 \
  '{namespace="prod", app="checkout-api"} |= "ERROR"'

# Prometheus — error rate and p95 latency
curl -sG "${PROM}/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{service="checkout-api",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="checkout-api"}[5m]))'

curl -sG "${PROM}/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{service="checkout-api"}[5m])))'
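The Prometheus API returns each sample value as a string, so comparing the error ratio to an alert threshold needs a floating-point comparison that plain shell arithmetic cannot do. The sketch below assumes the ratio has already been extracted from the response; the function name `ratio_exceeds` is hypothetical, and the 2% threshold is taken from the alert in Example 1.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exit 0 when the ratio ($1) is strictly above the threshold ($2).
# awk performs the floating-point comparison.
ratio_exceeds() {
  awk -v r="$1" -v t="$2" 'BEGIN { exit !(r + 0 > t + 0) }'
}

# Usage: 0.10 stands in for the ~10% ratio from the sample update;
# 0.02 is the 2% alert threshold.
if ratio_exceeds 0.10 0.02; then
  echo "error ratio above threshold"
fi
```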

Also check:

3. Identify the failing component

Use a decision tree:

Document the single hypothesis you are acting on. If it proves wrong, go back to the evidence and form a new one.

4. Confirm blast radius

Re-evaluate severity with the real number. Upgrade/downgrade as needed.
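The "real number" is usually a failed-over-total percentage for the incident window. A small sketch of that arithmetic follows; the function name `impact_pct` and the request counts are hypothetical, and how a percentage maps to a severity level is left to your own severity definitions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage of failed requests, to one decimal place.
# Prints "n/a" when the total is zero or negative, to avoid dividing by zero.
impact_pct() {
  awk -v f="$1" -v t="$2" 'BEGIN {
    if (t + 0 <= 0) { print "n/a"; exit }
    printf "%.1f\n", 100 * f / t
  }'
}

# Usage with made-up counts: 1023 failed requests out of 10000.
impact_pct 1023 10000
```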

5. Communicate

Post a Slack update at least every 30 minutes while the incident is active, plus on any major status change:

Update 15:05 UTC
Impact: still ~10% of checkout requests failing
Hypothesis: last deploy (sha-7f3a9c1) introduced regression; rolling back now
ETA mitigation: 10 min
Next update: 15:20 UTC or sooner

Status page update matches the Slack message in tone but stays customer-facing (no internal service names).
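Each update commits to a "Next update" time, and computing it avoids clock-arithmetic slips mid-incident. A minimal sketch, assuming times are given as HH:MM in UTC; the function name `next_update` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add a number of minutes ($2) to an HH:MM time ($1),
# wrapping past midnight. awk reads "08" as decimal 8, which sidesteps the
# octal parsing that shell arithmetic applies to leading zeros.
next_update() {
  awk -v t="$1" -v d="$2" 'BEGIN {
    split(t, p, ":")
    m = (p[1] * 60 + p[2] + d) % 1440
    printf "%02d:%02d\n", int(m / 60), m % 60
  }'
}

# Usage: the sample update at 15:05 with a 15-minute cadence.
echo "Next update: $(next_update 15:05 15) UTC or sooner"
```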

6. Mitigate

Try the safest action that restores service, even if it doesn’t fix the root cause.
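Example 1 uses a Helm rollback as its mitigation. One cautious way to script that is a wrapper that only prints the commands unless told otherwise. This is a sketch: the function name `rollback_release` and the `DRY_RUN` variable are hypothetical, and it assumes the Helm release and its Deployment share the same name.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: roll a Helm release back, then wait for the rollout.
# Assumptions: release name == Deployment name; DRY_RUN defaults to 1,
# in which case the commands are printed and not executed.
rollback_release() {
  local ns="$1" release="$2" run=""
  if [ "${DRY_RUN:-1}" = "1" ]; then
    run="echo"
  fi
  # With no revision argument, helm rolls back to the previous release.
  $run helm -n "$ns" rollback "$release"
  $run kubectl -n "$ns" rollout status "deployment/$release" --timeout=5m
}

# Usage: prints the two commands; set DRY_RUN=0 to execute them.
rollback_release prod checkout-api
```

Defaulting to a dry run means a mistyped namespace or release name shows up in the printed commands before anything changes.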

Confirm mitigation:

7. Resolve

Slack:

:white_check_mark: Incident resolved 15:42 UTC
Root cause (preliminary): regression in checkout-api v1.4.2; rolled back to v1.4.1
Customer impact: ~10% of checkouts between 14:32 and 15:28 UTC (56 minutes)
Postmortem: scheduled, owner @alice, due 2026-04-22

Mark the status page resolved and stand down the on-call page.
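The impact duration in the resolution message is simple arithmetic, but it is worth computing so it matches the timeline exactly. A minimal sketch, assuming start and end are HH:MM times in UTC less than 24 hours apart; the function name `impact_minutes` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: minutes between two HH:MM times ($1 start, $2 end).
# Adding 1440 before the modulo handles an incident that crosses midnight.
impact_minutes() {
  awk -v s="$1" -v e="$2" 'BEGIN {
    split(s, a, ":"); split(e, b, ":")
    print ((b[1] * 60 + b[2]) - (a[1] * 60 + a[2]) + 1440) % 1440
  }'
}

# Usage: the impact window from the sample message.
echo "Impact window: $(impact_minutes 14:32 15:28) minutes"
```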

8. Postmortem

Within 24 hours, draft the postmortem using references/postmortem-template.md. Send for review within 72 hours. Action items land in Jira/Linear with named owners and due dates.
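The two deadlines can be derived from the resolution time so they appear in the resolution message with concrete timestamps. The sketch below assumes both windows are measured from the moment of resolution and that GNU `date` is available (BSD/macOS `date` uses different flags); the function name `deadline_after` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the UTC time a number of hours ($2) after a start time ($1).
# Assumption: GNU date, which accepts -d "<timestamp>" and -d "@<epoch>".
deadline_after() {
  local start_epoch
  start_epoch=$(date -u -d "$1" +%s) || return 1
  date -u -d "@$(( start_epoch + $2 * 3600 ))" +'%Y-%m-%d %H:%M UTC'
}

# Usage: resolution time from the sample incident.
echo "Draft due:  $(deadline_after '2026-04-19 15:42' 24)"
echo "Review due: $(deadline_after '2026-04-19 15:42' 72)"
```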

Examples

Example 1 — 5xx spike tied to a deploy

Alert: checkout-api error rate > 2% for 5m. Timeline shows a deploy 8 minutes earlier. Evidence: new pods log NullPointerException at 14:33 UTC; 30 pods restart in 5 minutes. Mitigation: helm rollback to previous revision. Error rate returns to baseline at 15:28 UTC. Postmortem identifies missing null check on new optional header and a staging test gap.

Example 2 — Latency spike with no recent deploy

Alert: p95 latency > 500ms. No deploy in the last 24 hours. Evidence: RDS CPU at 98%, pg_stat_activity shows a long-running ANALYZE blocking queries. Mitigation: SELECT pg_cancel_backend(pid) for the offending session. Latency returns to baseline. Postmortem identifies a manual DBA task run during peak traffic and a missing guardrail.

Constraints

Quality checks
