When to use
Trigger this skill when:
- An incident (any severity) has been declared resolved and the team needs a written postmortem.
- A customer or leadership stakeholder has asked for a writeup of what happened.
- A recurring issue needs a formal record to spot patterns across incidents.
Do not use this skill for:
- Mid-incident communication (use engineering:incident-response).
- Security vulnerability disclosure to external parties (follow the formal security process).
- Routine bug triage notes (use a ticket, not a postmortem).
Inputs
Required:
- Incident ID (e.g., INC-2026-03-14-01) or the Slack channel name (e.g., #incident-api-5xx-2026-03-14).
- Start and end timestamps of the incident (UTC preferred).
- At least one source of truth: Slack channel transcript, PagerDuty incident link, or log excerpts.
Optional:
- Links to dashboards, commit SHAs of the change that introduced the issue, and the revert PR.
- Names and roles of the responders (IC, comms lead, scribe).
- Prior related incidents.
Outputs
A single Markdown postmortem following the structure in references/incident-template.md:
- Header with incident ID, severity, status, duration, impact summary
- One-paragraph summary (2–3 sentences)
- Timeline in UTC (who did what, when)
- Detection (how did we find out, how long until detection)
- Response (what we did, what worked, what did not)
- Root cause analysis with 5 whys
- Contributing factors
- What went well / what did not
- Action items table: Item | Owner | Due | Ticket | Priority
- Lessons learned
- Related incidents
Tool dependencies
- Slack MCP: slack_read_channel and slack_search_public, to pull the incident channel transcript in order.
- PagerDuty MCP (if available): for incident start/end timestamps, severity, and responder list.
- GitHub MCP: to find the offending PR (via git bisect or by searching for the revert commit) and the resolving PR.
- Bash: git log --before --after to scope commits around the incident.
- Grep / Read: to pull log excerpts from local files if logs were exported.
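As a rough illustration of the Bash step, the sketch below builds a git log command that scopes commits to a padded incident window. The function name git_log_window, the 30-minute padding, and the --oneline output format are assumptions for illustration and not part of the skill; --after and --before are standard git log date filters.

```python
import shlex
from datetime import datetime, timedelta, timezone

GIT_DATE_FORMAT = "%Y-%m-%d %H:%M:%S +0000"


def git_log_window(start: datetime, end: datetime, pad_minutes: int = 30) -> list[str]:
    """Build a `git log` argv covering the incident window plus padding.

    The padding is an assumed default: the triggering deploy usually lands
    some minutes before the first symptom.
    """
    pad = timedelta(minutes=pad_minutes)
    after = (start - pad).astimezone(timezone.utc).strftime(GIT_DATE_FORMAT)
    before = (end + pad).astimezone(timezone.utc).strftime(GIT_DATE_FORMAT)
    return ["git", "log", f"--after={after}", f"--before={before}", "--oneline"]


# Window taken from Example 1 (14:12 to 14:59 UTC on 2026-03-14).
example_cmd = git_log_window(
    datetime(2026, 3, 14, 14, 12, tzinfo=timezone.utc),
    datetime(2026, 3, 14, 14, 59, tzinfo=timezone.utc),
)
print(shlex.join(example_cmd))
```

Returning an argv list rather than a shell string lets the same value be passed to subprocess.run without quoting problems.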
Procedure
- Gather inputs. Confirm the incident ID, severity, channel, and timestamp window. If any are missing, ask before proceeding.
- Pull the Slack timeline. Read the incident channel from declaration to resolution. Extract every message with a timestamp, author, and a factual claim. Discard reactions, jokes, and “ack” acknowledgments unless they mark a handoff.
- Normalize to UTC. Convert all timestamps to UTC. Use ISO-8601 format, e.g., 2026-03-14T14:23:00Z.
- Build the timeline. Each entry: HH:MM UTC — <actor> <action>. Merge adjacent messages from the same actor describing the same action.
- Compute impact. Duration (end minus start), affected users (from metrics, not estimates), dollar impact if computable (RPM × outage minutes × affected %), affected features, customer reports count.
- Detection analysis. First alert time, first human acknowledgment time, time to declare incident. Compute time-to-detect = first-alert minus actual-start (use metrics to find actual start; this is often earlier than the alert).
- Root cause — 5 whys. Start with the observable symptom. Ask “why” five times. Each “why” should point at a condition, not a person. Example:
  - Symptom: API returned 5xx on 3% of /checkout calls.
  - Why 1: The payments service was returning 500s.
  - Why 2: Its connection pool was exhausted.
  - Why 3: A new retry loop in the order service held connections open too long.
  - Why 4: The retry loop had no timeout.
  - Why 5: The code review checklist does not require timeout verification on new outbound calls.
- Contributing factors. Things that made it worse but are not the root cause: stale alert threshold, on-call paged the wrong person first, runbook was outdated, etc.
- What went well / what did not. Two balanced lists. “What went well” is not optional: its purpose is to reinforce the behaviors that worked.
- Action items. Every action item has an owner (a named person), a real due date, a ticket (created or linked), and a priority (P0/P1/P2). If you cannot assign an owner, the action item is not real; flag it for the incident review meeting.
- Blamelessness pass. Re-read the draft. Rewrite any sentence that names a person in a causal role (“X deployed the bug”) into a systemic one (“the change was deployed without a staging soak period because the deploy pipeline does not enforce one for hotfixes”).
- Return the draft for human review. Recommend a postmortem review meeting before publication.
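The normalization and timeline-building steps in the procedure above can be sketched as follows. This is a minimal sketch under stated assumptions: messages are already parsed into a hypothetical Message record with timezone-aware timestamps, and "the same action" is approximated as identical action text from the same actor. Real transcripts would need fuzzier matching.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

SEP = "\u2014"  # the dash used by the timeline entry format in the procedure


@dataclass
class Message:
    ts: datetime  # timezone-aware timestamp from the transcript
    actor: str
    action: str


def to_utc_iso(ts: datetime) -> str:
    """Convert an aware datetime to ISO-8601 UTC, e.g. 2026-03-14T14:23:00Z."""
    if ts.tzinfo is None:
        raise ValueError("naive timestamp: the source timezone must be known")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_timeline(messages: list[Message]) -> list[str]:
    """Sort by time, merge adjacent duplicates, and format each entry."""
    merged: list[Message] = []
    for msg in sorted(messages, key=lambda m: m.ts):
        prev = merged[-1] if merged else None
        # Assumption: same actor + identical text counts as "the same action".
        if prev and prev.actor == msg.actor and prev.action == msg.action:
            continue  # keep only the earliest occurrence
        merged.append(msg)
    return [
        f"{m.ts.astimezone(timezone.utc):%H:%M} UTC {SEP} {m.actor} {m.action}"
        for m in merged
    ]


# Hypothetical transcript excerpt recorded in a UTC-7 local timezone.
local = timezone(timedelta(hours=-7))
sample = [
    Message(datetime(2026, 3, 14, 7, 17, tzinfo=local), "@bob", "joins as IC"),
    Message(datetime(2026, 3, 14, 7, 13, tzinfo=local), "@alice", "acks page"),
    Message(datetime(2026, 3, 14, 7, 14, tzinfo=local), "@alice", "acks page"),
]
for entry in build_timeline(sample):
    print(entry)
```

Rejecting naive timestamps outright is a deliberate choice here: guessing a timezone silently is how timelines end up off by hours.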
Examples
Example 1: SEV2 API outage, Slack + GitHub available
Input: “Write the postmortem for INC-2026-03-14-01. Channel is #incident-api-5xx-2026-03-14. It started at 14:12 UTC and was resolved at 14:59 UTC.”
Output (excerpt):
# Incident INC-2026-03-14-01
| Field | Value |
|---|---|
| Severity | SEV2 |
| Status | Resolved |
| Duration | 47 minutes (2026-03-14 14:12 – 14:59 UTC) |
| Impact | ~3% of API traffic returned 5xx on /checkout; estimated $4,200 in delayed or failed revenue; 17 customer reports |
## Summary
Starting at 14:12 UTC, a retry loop newly deployed to the order service (PR #4812) exhausted the payments service connection pool, returning 5xx on ~3% of /checkout calls for 47 minutes. The fix was a revert of PR #4812, deployed at 14:48 UTC; pool utilization recovered at 14:54 UTC and traffic normalized by 14:59 UTC.
## Timeline (UTC)
- 14:08 — order-service v3.22.0 deployed (PR #4812).
- 14:12 — Datadog alert: payments-service pool_exhausted > 50/min.
- 14:13 — @alice acks page, declares SEV2, opens #incident-api-5xx-2026-03-14.
- 14:17 — @bob joins as IC; @carol as comms lead.
- 14:22 — Graph confirms pool exhaustion correlates with order-service deploy at 14:08.
- 14:38 — Revert PR #4821 opened.
- 14:48 — Revert merged and deployed.
- 14:54 — Pool utilization drops below threshold.
- 14:59 — Error rate back to baseline; incident resolved.
## Detection
- Actual start: 14:10 UTC (first 5xx in metrics).
- First alert: 14:12 UTC (2 min to detect).
- Incident declared: 14:13 UTC (1 min from alert to declaration).
## Root cause (5 whys)
1. Why did /checkout return 5xx? The payments service returned 500s.
2. Why? Its connection pool was exhausted (50+ pool_exhausted events/min).
3. Why? A new retry loop in order-service held connections open for up to 120s per attempt.
4. Why? The retry loop had no per-attempt timeout.
5. Why? Our code review checklist does not require timeout verification on new outbound calls to internal services.
## Action items
| Item | Owner | Due | Ticket | Priority |
|---|---|---|---|---|
| Add per-attempt timeout to the retry helper | @dana | 2026-03-21 | PLAT-501 | P0 |
| Add "outbound call timeout" to code review checklist | @erin | 2026-03-28 | ENG-OPS-88 | P1 |
| Add pool_exhausted alert at 20/min (earlier) | @bob | 2026-03-21 | SRE-204 | P1 |
| Post-deploy soak of 15 min for order-service | @dana | 2026-04-04 | PLAT-502 | P2 |
Example 2: SEV3 data correctness issue, no PagerDuty
Input: user pastes a Slack transcript and says “this was a data correctness issue, no paging involved; please write it up anyway.”
Output: same structure, but the “Detection” section notes that no alert existed (detection was a customer report via support), and the action items include “add a detection mechanism” as P0.
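As a cross-check, the duration and detection figures in Example 1 can be recomputed from its timestamps. The sketch below does that and also applies the dollar-impact formula from the procedure (RPM × outage minutes × affected %). The function names are hypothetical, RPM is read here as revenue per minute (an assumption), and the RPM value in the last line is a made-up placeholder, not a figure from the incident.

```python
from datetime import datetime, timezone


def minutes_between(earlier: datetime, later: datetime) -> int:
    """Whole minutes elapsed from `earlier` to `later`."""
    return int((later - earlier).total_seconds() // 60)


def dollar_impact(rpm: float, outage_minutes: float, affected_fraction: float) -> float:
    """RPM x outage minutes x affected fraction (0.03 means 3%)."""
    return rpm * outage_minutes * affected_fraction


def at(hour: int, minute: int) -> datetime:
    """Helper: a UTC timestamp on the Example 1 incident date."""
    return datetime(2026, 3, 14, hour, minute, tzinfo=timezone.utc)


# Timestamps as stated in Example 1.
actual_start, first_alert, declared = at(14, 10), at(14, 12), at(14, 13)
incident_start, incident_end = at(14, 12), at(14, 59)

print("duration (min):", minutes_between(incident_start, incident_end))
print("time-to-detect (min):", minutes_between(actual_start, first_alert))
print("time-to-declare (min):", minutes_between(first_alert, declared))

# Placeholder RPM for illustration only.
print("impact ($):", round(dollar_impact(rpm=1000.0, outage_minutes=47, affected_fraction=0.03), 2))
```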
Constraints
- Blameless language only. No sentence may name a person as the cause. People are responders, not causes. Conditions, processes, and systems are causes.
- All timestamps in UTC, ISO-8601 format.
- Impact must be quantified where possible. “Some users affected” is not acceptable; give a percent, a count, or an explicit “unknown, see action item to add measurement.”
- Every action item has owner + due + ticket + priority. No exceptions. If you cannot assign an owner, flag the item explicitly for the incident review meeting.
- 5 whys must bottom out in a system/process, not a person. If why-5 is “Alice forgot,” keep asking.
- Do not publish without human review. The skill returns a draft. A human must run the postmortem review before distribution.
- Related incidents must be real. If you cite a prior incident, verify the ID exists.
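The action-item constraint lends itself to a mechanical check before the draft goes out. A minimal sketch, assuming action items are held as dictionaries keyed by the table's column names; the validate_action_item helper, the treatment of "TBD" as missing, and the YYYY-MM-DD due-date format are assumptions for illustration.

```python
from datetime import date

REQUIRED = ("Item", "Owner", "Due", "Ticket", "Priority")
PRIORITIES = {"P0", "P1", "P2"}


def validate_action_item(row: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the row is complete."""
    values = {field: row.get(field, "").strip() for field in REQUIRED}
    # Assumption: an empty cell or a "TBD" placeholder both count as missing.
    missing = [f for f, v in values.items() if not v or v.upper() == "TBD"]
    problems = [f"missing {f}" for f in missing]

    if "Priority" not in missing and values["Priority"] not in PRIORITIES:
        problems.append("priority must be P0, P1, or P2")

    if "Due" not in missing:
        try:
            date.fromisoformat(values["Due"])
        except ValueError:
            problems.append("due date is not a YYYY-MM-DD date")

    return problems


# First row is taken from Example 1; the second is a hypothetical bad row.
good = {"Item": "Add per-attempt timeout to the retry helper", "Owner": "@dana",
        "Due": "2026-03-21", "Ticket": "PLAT-501", "Priority": "P0"}
bad = {"Item": "Update runbook", "Owner": "", "Due": "next week",
       "Ticket": "TBD", "Priority": "P5"}
print(validate_action_item(good))
print(validate_action_item(bad))
```

Rows that come back with problems are the ones to flag for the incident review meeting rather than silently drop.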
Quality checks
Before returning, verify:
- Incident ID, severity, duration, and impact are in the header.
- Timeline is in UTC, ISO-8601 format, chronological.
- No sentence names a person in a causal role.
- Root cause reaches a systemic factor by why-5.
- “What went well” is not empty.
- Every action item has owner, due date, ticket, priority.
- Impact is quantified or explicitly marked “unknown + action item.”
- Detection section includes time-to-detect and time-to-declare.
- Any cited related incidents have verifiable IDs.
See references/incident-template.md for the full template.
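The blamelessness check is ultimately a human judgment, but a crude first pass can surface candidate sentences for review. The sketch below is only a heuristic: it flags lines where an @-mention is directly followed by one of a small, hypothetical list of causal verbs. It will miss many phrasings and will also flag legitimate responder actions (for example, a responder who deployed the revert), so flagged lines need to be read, not auto-rewritten.

```python
import re

# Hypothetical, deliberately small verb list; a real check would need tuning.
CAUSAL_VERBS = ("deployed", "broke", "caused", "forgot", "introduced", "missed")
CAUSAL_MENTION = re.compile(
    r"@\w+\s+(?:" + "|".join(CAUSAL_VERBS) + r")\b",
    re.IGNORECASE,
)


def flag_causal_mentions(text: str) -> list[str]:
    """Return lines where an @-mention is directly followed by a causal verb."""
    return [line.strip() for line in text.splitlines() if CAUSAL_MENTION.search(line)]


# Hypothetical draft excerpt.
draft = """14:13 @alice acks page, declares SEV2.
@frank deployed the bug without review.
The change was deployed without a staging soak period."""
for line in flag_causal_mentions(draft):
    print("review:", line)
```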