Departments / infrastructure / infra-triage

infra-triage orchestrator

Use when a platform incident is reported ("something's wrong with cluster X", alerts firing, user-facing degradation of unknown origin). Runs structured first-response triage — cluster health, then network or TLS branches as evidence demands, then incident comms if impact is user-facing.

Department

Infrastructure

Safety

writes-shared
Writes shared state

Supported stacks

Stack-agnostic — no detection required.

Produces

infrastructure/reports/triage-<timestamp>.md

Consumes

  • infrastructure/findings/cluster-health.json
  • infrastructure/findings/network.json
  • infrastructure/findings/certs.json

When to use

Do not use for application-level debugging (use the owning service’s logs), for cost/FinOps reviews, or for a known-cause incident where direct remediation is obvious — call incident-response or the specific skill directly.

Chained skills

Executed as a decision tree, not a fixed sequence:

  1. cluster-health — always runs first. Produces infrastructure/findings/cluster-health.json with node conditions, pod lifecycle counts, and a severity-ranked findings ladder.
  2. network-diagnostics — runs only if cluster-health findings include any of: DNS failure, service unreachable, NetworkUnavailable node condition, CoreDNS CrashLoopBackOff, kube-proxy not ready on a node, or ingress controller 5xx surge.
  3. ssl-certificate-manager — runs only if cluster-health or alerting surfaces: cert-manager CertificateRequest failures, x509: certificate has expired errors in pods, Ingress TLS secret missing, or a Prometheus CertManagerCertExpirySoon alert.
  4. incident-response — runs only if user-facing impact is confirmed (5xx rate > SLO on a public service, customer tickets tied to the symptom, or status-page decision pending). Drives comms and draft postmortem.

Inputs

Outputs

Tool dependencies

Procedure

1. Establish baseline (chain: cluster-health)

kubectl config use-context "$CLUSTER"
kubectl version --short
kubectl get nodes -o wide
kubectl get --raw='/readyz?verbose' | head -40
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
kubectl get events -A --sort-by=.lastTimestamp | tail -50

Persist normalized findings:

# cluster-health emits this shape:
jq -n '{
  cluster: "'$CLUSTER'",
  captured_at: "'$(date -u +%FT%TZ)'",
  node_conditions: [/* ... */],
  pod_states: { Running: 0, Pending: 0, CrashLoopBackOff: 0, ImagePullBackOff: 0 },
  findings: [
    { severity: "high", reason: "MemoryPressure", resources: ["node/ip-10-0-1-23"], hint: "kubectl describe node ip-10-0-1-23" }
  ]
}' > infrastructure/findings/cluster-health.json

2. Decide branches

NETWORK=$(jq -r '[.findings[].reason] | map(select(
  . == "DNSFailure" or . == "NetworkUnavailable"
  or . == "CoreDNSCrashLoop" or . == "KubeProxyNotReady"
  or . == "IngressSurge5xx"
)) | length > 0' infrastructure/findings/cluster-health.json)

TLS=$(jq -r '[.findings[].reason] | map(select(
  . == "CertExpiring" or . == "CertManagerRequestFailed"
  or . == "TLSSecretMissing" or . == "X509Expired"
)) | length > 0' infrastructure/findings/cluster-health.json)

3. Network branch (chain: network-diagnostics)

Only if NETWORK=true.

kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=200
kubectl run -n default --rm -it --image=nicolaka/netshoot netshoot-$$ \
  --restart=Never --command -- \
  sh -c 'dig +time=2 +tries=1 kubernetes.default.svc.cluster.local; \
         dig +time=2 +tries=1 example.com; \
         curl -sSI https://kubernetes.default.svc.cluster.local'

kubectl get endpoints -A | awk 'NR==1 || $3==""'   # services with no endpoints

Persist:

jq -n '{
  captured_at: "'$(date -u +%FT%TZ)'",
  dns: { intra_cluster_ms: 2, external_ms: 14, errors: [] },
  endpoints_without_backends: [],
  coredns_pod_states: { Running: 2, CrashLoopBackOff: 0 }
}' > infrastructure/findings/network.json

4. TLS branch (chain: ssl-certificate-manager)

Only if TLS=true.

cmctl status -n cert-manager
kubectl get certificates -A -o wide
kubectl get certificaterequests -A --sort-by=.metadata.creationTimestamp \
  | tail -20

# For each Ingress TLS secret:
for s in $(kubectl get secret -A -o json \
  | jq -r '.items[] | select(.type=="kubernetes.io/tls") | "\(.metadata.namespace)/\(.metadata.name)"'); do
  NS=${s%/*}; NAME=${s#*/}
  kubectl -n "$NS" get secret "$NAME" -o jsonpath='{.data.tls\.crt}' \
    | base64 -d \
    | openssl x509 -noout -subject -enddate
done

Persist:

jq -n '{
  captured_at: "'$(date -u +%FT%TZ)'",
  cert_manager_ready: true,
  expiring_within_days: [
    { ns: "ingress", name: "api-tls", not_after: "2026-04-22T00:00:00Z", days: 3 }
  ],
  failed_requests: []
}' > infrastructure/findings/certs.json

5. Correlate and hypothesize root cause

Cross-reference the three JSON files:

Pick the single most likely root cause and record an ordered remediation plan.

6. Impact decision (chain: incident-response)

if [ "$IMPACT_SCOPE" = "user-facing" ]; then
  # incident-response reads the three findings JSONs, drafts:
  #   - status-page update
  #   - #platform-oncall summary
  #   - postmortem skeleton with timeline pre-populated from findings timestamps
  :
fi

Skip entirely when impact_scope=internal — triage report still written, no incident record created.

7. Remediation

In remediation_mode=suggest (default): the report lists exact commands. In remediation_mode=act: the orchestrator may execute low-risk commands:

# Examples of act-mode remediations; each is opt-in and logged in the report.
kubectl -n "$NS" rollout restart deploy/"$DEPLOY"
kubectl -n cert-manager delete pod -l app=cert-manager
cmctl renew -n ingress api-tls

Destructive actions (kubectl delete node, secret rotation, helm rollback) require an explicit flag beyond act and are never taken by default.

8. Write the triage report

infrastructure/reports/triage-<timestamp>.md (UTC, e.g. triage-20260419T142300Z.md) always includes:

Examples

Example 1 — node pressure

Alert: KubeNodeMemoryPressure on 2 of 6 nodes in prod-use1.

Flow:

Report infrastructure/reports/triage-20260419T091200Z.md contains kubectl evidence, kubectl top pod -n payments --sort-by=memory, and the exact HPA manifest to apply. impact_scope=internal so incident-response is not called.

Example 2 — cert expiry cascade

Alert: Prometheus CertManagerCertExpirySooningress/api-tls expires in 3 days.

Flow:

Report contains the failing CertificateRequest YAML, the dig TXT _acme-challenge.api.example.com output before and after, and the full chain of skill calls.

Constraints

Quality checks

Customise for your organisation

infra-triage

The LLM will rewrite this skill for your environment. Your API key and form inputs stay in your browser — only the skill and your environment go to OpenRouter.

One line. Be specific — cloud, language, framework, orchestrator.

Free text that steers the rewrite. Leave blank if nothing specific.

cost estimate: