cluster-health

When to use

User inherits a cluster and wants a baseline report.
Alerts are firing but the root cause is unclear — need a structured triage before drilling into a single service.
Before running any changes (Helm upgrade, CNI swap, node pool recycle), confirm the cluster is green.
Post-incident: validate that the cluster is back to baseline.
Recurring “some pods are weird” complaints that need systematic triage.

Do not use this skill for application-level debugging (use the owning service’s logs), for cloud-provider outages (check provider status page), or for cost reviews (separate skill).

Inputs

Kubeconfig context (single cluster per run; run again per context for multi-cluster estates).
Optional: list of critical namespaces to give extra weight (e.g. payments, ingress, kube-system). These are the Tier 1 namespaces in the severity rules below; all others count as Tier 2/3.
Optional: an incident time window to focus event triage on.

Outputs

A report with these sections:

Cluster summary: version, node count, region, CNI, ingress controller.
Node table: node | role | version | conditions | CPU util | mem util | disk util | PID util | age.
Control-plane check: API server reachability, scheduler, controller-manager, etcd (or managed control plane indicators), admission webhooks.
Critical DaemonSets: CNI, CSI, kube-proxy, node-exporter, log shipper, cert-manager, ingress. Report ready / desired for each DaemonSet.
Pod lifecycle: counts of Running, Pending, CrashLoopBackOff, ImagePullBackOff, Error, Terminating.
Event triage: last 1 h of Warning-level events grouped by reason.
Findings ladder: blocker / high / medium / low / info, each with a concrete command or manifest fix.

Tool dependencies

kubectl (≥ 1.27), jq, yq.
kubectl top (requires metrics-server).
kubectl-neat, stern optional for prettier output.
Kubernetes MCP for batched list/describe.
Optional: kubeval, kube-score, popeye for deeper static checks.

Procedure

Detect the stack. Before running any diagnostic commands, confirm the target is a Kubernetes cluster this skill can read:
- Run kubectl config current-context. It must return a context. Empty output → stop; this skill requires Kubernetes.
- Run kubectl auth can-i get nodes. It must return yes. If no, the current kubeconfig lacks the permissions required for a cluster-wide health check; stop and ask the user for a kubeconfig with at least read access across the cluster.
- Run kubectl version 2>&1 | head -3 and record client and server versions (the --short flag was removed in kubectl 1.28, where the short form became the default output). Warn when the server version is older than 1.27.
If the user wanted to health-check a Nomad cluster, an ECS cluster, a VM fleet (Ansible-managed), or serverless functions, this skill does not apply — report that and recommend a dedicated skill for the platform they’re on.
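
The detection checks can be chained into a single gate that runs before anything else. The following is a minimal sketch rather than a prescribed implementation: the function name preflight and its messages are hypothetical, and it relies only on the kubectl subcommands already named in this step.

```shell
# preflight: hypothetical helper that stops early when the target is not a
# Kubernetes cluster this skill can read. Returns 0 when it is safe to go on.
preflight() {
  ctx=$(kubectl config current-context 2>/dev/null) || ctx=""
  if [ -z "$ctx" ]; then
    echo "STOP: no kubeconfig context; this skill requires Kubernetes" >&2
    return 1
  fi
  # 'kubectl auth can-i' prints yes or no on stdout.
  answer=$(kubectl auth can-i get nodes 2>/dev/null) || answer="no"
  if [ "$answer" != "yes" ]; then
    echo "STOP: context '$ctx' cannot read nodes; ask for cluster-wide read access" >&2
    return 1
  fi
  echo "OK: context '$ctx' is readable"
}
```

Calling preflight || exit 1 at the top of a run keeps every later command from executing against the wrong target.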

Capture cluster basics:

kubectl cluster-info
kubectl get nodes -o wide
kubectl get --raw='/readyz?verbose' | head -40

Note any deprecation warnings.

Node conditions:

kubectl get nodes -o json | jq -r '
  .items[] | [
    .metadata.name,
    (.status.conditions[] | select(.type=="Ready") | .status),
    (.status.conditions[] | select(.type=="MemoryPressure") | .status),
    (.status.conditions[] | select(.type=="DiskPressure") | .status),
    (.status.conditions[] | select(.type=="PIDPressure") | .status),
    # Parenthesised so a node with no NetworkUnavailable condition still yields a "False" column
    ((.status.conditions[] | select(.type=="NetworkUnavailable") | .status) // "False")
  ] | @tsv'

Any MemoryPressure=True, DiskPressure=True, PIDPressure=True, or Ready=False for more than a few minutes is a high finding.
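
The TSV from the jq query can be reduced to just the nodes that need a finding. This sketch assumes the column order used above (name, Ready, MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable); the function name flag_nodes is hypothetical. It looks only at the current status, so how long a condition has held still has to be read from lastTransitionTime.

```shell
# flag_nodes: read node-condition TSV on stdin and print one line per node
# that is not Ready or that reports any pressure condition.
flag_nodes() {
  awk -F '\t' '
    {
      bad = ""
      if ($2 != "True") bad = bad " Ready=" $2
      if ($3 == "True") bad = bad " MemoryPressure"
      if ($4 == "True") bad = bad " DiskPressure"
      if ($5 == "True") bad = bad " PIDPressure"
      if ($6 == "True") bad = bad " NetworkUnavailable"
      if (bad != "") print $1 ":" bad
    }'
}
```

Piping the jq output into flag_nodes leaves an empty result when every node is clean, which is easy to assert on in a script.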

Resource utilisation:

kubectl top nodes
kubectl top pods -A --sort-by=cpu | head -30
kubectl top pods -A --sort-by=memory | head -30
# Allocatable vs requests to spot scheduling pressure
kubectl get nodes -o json | jq -r '
  .items[] | .metadata.name + " alloc-cpu=" + .status.allocatable.cpu +
             " alloc-mem=" + .status.allocatable.memory'
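kubectl get pods -A -o json | jq -r '
  # Sum CPU requests in millicores: "250m" stays 250; whole or fractional cores ("2", "0.5") are multiplied by 1000
  [.items[].spec.containers[].resources.requests.cpu // "0"
   | if endswith("m") then rtrimstr("m") | tonumber else tonumber * 1000 end]
  | add'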

Cluster-wide CPU requests > 85% of allocatable is a high finding: pods from the next deploy are likely to stay Pending.
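
Comparing requests with allocatable only works when both sides use the same unit, and Kubernetes reports CPU either in cores ("2", "0.5") or millicores ("250m"). The sketch below shows one way to normalise and compare them. Both function names are hypothetical, and other quantity suffixes are deliberately not handled.

```shell
# to_millicores: convert a CPU quantity such as "250m", "2" or "0.5" to millicores.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v v="$1" 'BEGIN { printf "%d\n", v * 1000 + 0.5 }' ;;
  esac
}

# cpu_request_pct REQUESTED_M ALLOCATABLE_M: integer percentage, rounded down.
cpu_request_pct() {
  echo $(( $1 * 100 / $2 ))
}
```

With illustrative numbers, cpu_request_pct 41280 48000 prints 86, which crosses the 85% threshold.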

Control-plane checks (managed cluster: skip etcd, but keep the health endpoints):

kubectl get --raw='/healthz'
kubectl get --raw='/livez?verbose' | tail
kubectl -n kube-system get pods -l tier=control-plane
kubectl get componentstatuses        # deprecated but still informative on self-managed
kubectl get apiservice | grep -v True   # non-Available aggregated APIs
kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration

A non-Available APIService or a webhook configuration pointing at a down Service can break kubectl apply for the affected resources while nodes and pods still look healthy. When it does, treat it as a blocker.
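
For the report it helps to keep the reason next to each unavailable APIService rather than only hiding the healthy rows. A small sketch, assuming the default table layout of kubectl get apiservice (NAME, SERVICE, AVAILABLE, AGE); the function name is hypothetical.

```shell
# unavailable_apiservices: read `kubectl get apiservice` table output on stdin
# and print name, backing service and availability (with its reason) for every
# row whose AVAILABLE column does not start with "True".
unavailable_apiservices() {
  awk 'NR > 1 && $3 !~ /^True/ { print $1 "\t" $2 "\t" $3 " " $4 }'
}
```

Usage: kubectl get apiservice | unavailable_apiservices. An empty result means every aggregated API is Available.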

Critical DaemonSets:

kubectl get ds -A -o json | jq -r '
  .items[] | select(.status.numberReady < .status.desiredNumberScheduled) |
  [.metadata.namespace, .metadata.name,
   (.status.numberReady|tostring) + "/" + (.status.desiredNumberScheduled|tostring)] | @tsv'

Any DS below desired is at minimum high; for CNI/CSI/kube-proxy it is a blocker.

Pod lifecycle:
```
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded -o wide
kubectl get pods -A -o json | jq -r '
  .items[] |
  select(
    (.status.containerStatuses // []) |
    map(.state.waiting.reason? // "" ) |
    any(. == "CrashLoopBackOff" or . == "ImagePullBackOff" or . == "ErrImagePull" or . == "CreateContainerConfigError")
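  ) | [.metadata.namespace, .metadata.name,
       # Join every waiting reason so a failing sidecar is not hidden behind a healthy first container
       ([(.status.containerStatuses // [])[] | .state.waiting.reason? // empty] | join(","))] | @tsv'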
```
Classify:
- CrashLoopBackOff: high in a Tier 1 namespace, medium elsewhere. Pull logs (kubectl logs --previous) and the last 20 events for that pod.
- ImagePullBackOff / ErrImagePull: high — verify image tag, registry creds (kubectl get secret ... -o json), and imagePullSecrets on the pod spec.
- Pending for > 5 min: medium unless the namespace is Tier 1 — then high. Root causes: no node fits resources, tainted nodes with no tolerations, PVC unbound, PodSecurity admission rejection.
- CreateContainerConfigError: high — usually a missing Secret or ConfigMap.
- Terminating for > 10 min: medium — finalizer stuck; inspect metadata.finalizers.
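
The classification rules can be written down as a small lookup so the same pod state always gets the same severity. This is a sketch under stated assumptions: the function name pod_severity is hypothetical, tier 1 means a critical namespace, CrashLoopBackOff follows the report ladder (high in Tier 1, medium elsewhere), and anything under a time threshold is reported as info.

```shell
# pod_severity REASON TIER AGE_MINUTES
# REASON is the waiting reason or pod state; AGE_MINUTES is how long it has lasted.
pod_severity() {
  reason=$1 tier=$2 age=${3:-0}
  case "$reason" in
    CrashLoopBackOff)
      if [ "$tier" = "1" ]; then echo high; else echo medium; fi ;;
    ImagePullBackOff|ErrImagePull|CreateContainerConfigError)
      echo high ;;
    Pending)
      if [ "$age" -le 5 ]; then echo info
      elif [ "$tier" = "1" ]; then echo high
      else echo medium; fi ;;
    Terminating)
      if [ "$age" -gt 10 ]; then echo medium; else echo info; fi ;;
    *)
      echo info ;;
  esac
}
```

For example, pod_severity Pending 1 12 prints high, while the same pod in a non-critical namespace yields medium.
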
PersistentVolume check:
```
kubectl get pv -o wide
kubectl get pvc -A | grep -v Bound
```
Any PVC Pending for more than 10 min is high if it blocks a Tier 1 pod.

Events in the last hour:

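# Last hour of events. The default table shows only a relative LAST SEEN age, so filter on the JSON timestamps.
cutoff="$(date -u -d '1 hour ago' +%FT%TZ 2>/dev/null || date -u -v-1H +%FT%TZ)"
kubectl get events -A -o json | jq -r --arg cutoff "$cutoff" '
  .items[] | (.lastTimestamp // .eventTime // .metadata.creationTimestamp) as $t
  | select($t >= $cutoff)
  | [$t, .metadata.namespace, .type, .reason, .involvedObject.kind + "/" + .involvedObject.name, .message] | @tsv' \
  | sort | tail -80
# Warning events grouped by reason (with -A, REASON is column 4; NR > 1 skips the header)
kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp \
  | awk 'NR > 1 {print $4}' | sort | uniq -c | sort -rn | head -20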

The uniq -c output shows the dominant warning reasons (FailedScheduling, BackOff, Unhealthy, FailedMount). Each deserves a finding.

Version / deprecation:

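kubectl get --raw /metrics | grep '^apiserver_requested_deprecated_apis' | grep -v ' 0$' | head

Any series listed here means a client has requested a deprecated API since the API server last started, and that caller will break once an upgrade removes the API (see the removed_release label). medium.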
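
The raw metric lines are hard to read in a report. The sketch below condenses each series into group/version, resource, and the release in which the API is removed. It assumes the standard Prometheus text format with the labels group, version, resource and removed_release; the function name is hypothetical.

```shell
# deprecated_api_summary: read API server metrics on stdin and print one line
# per deprecated API that has been requested (series value other than 0).
deprecated_api_summary() {
  awk '
    # label(line, key): value of key="..." inside the label set, or "".
    function label(s, k,    re) {
      re = "[{,]" k "=\"[^\"]*\""
      if (match(s, re))
        return substr(s, RSTART + length(k) + 3, RLENGTH - length(k) - 4)
      return ""
    }
    /^apiserver_requested_deprecated_apis[{]/ && $NF != 0 {
      g = label($0, "group"); if (g == "") g = "core"
      printf "%s/%s %s removed_in=%s\n", g, label($0, "version"),
             label($0, "resource"), label($0, "removed_release")
    }'
}
```

Usage: kubectl get --raw /metrics | deprecated_api_summary. Each output line can go straight into a medium finding.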

Produce the report using the ladder:
- blocker → fix before any other change. API server non-ready, etcd unhealthy, >1 control-plane node down, CNI DS failing.
- high → fix today. Node under DiskPressure, DS below desired in any namespace, CrashLoopBackOff in Tier 1 namespace, > 85% CPU requests cluster-wide.
- medium → fix this week. CrashLoopBackOff in Tier 2/3, deprecated APIs, Pending pods in non-Tier 1.
- low → backlog. Suboptimal resource requests, missing readiness probes.
- info → observation; no action.
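
The overall verdict can be derived mechanically from the severities so that it never contradicts the findings. The mapping below is an assumption, not something this skill defines: any blocker gives unhealthy, any high gives degraded, everything else gives healthy. The function name verdict is hypothetical.

```shell
# verdict: read one severity per line on stdin and print the overall verdict.
verdict() {
  awk '
    $1 == "blocker" { blocker++ }
    $1 == "high"    { high++ }
    END {
      if (blocker)   print "unhealthy"
      else if (high) print "degraded"
      else           print "healthy"
    }'
}
```

Feeding it the first column of the findings table makes it impossible to report healthy while a blocker or high finding is open.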

Examples

Happy path: inherited 12-node EKS, first health check

Report excerpt:

Cluster:  EKS 1.29, 12 nodes (3 system m6i.large, 9 workload m6i.xlarge), eu-west-1, VPC CNI, nginx-ingress 1.10.
Nodes:    12/12 Ready. All pressure conditions False.
Top util: ip-10-0-3-21 CPU 78%, mem 66% (hot node: payments-api).
Control:  /readyz ok. 1 APIService non-Available: v1beta1.metrics.k8s.io.
DS:       22 DaemonSets, 21/22 fully ready; node-exporter 11/12 (one node failed CSI mount).
Pods:     412 Running, 0 Pending, 2 CrashLoopBackOff in staging, 0 ImagePullBackOff.
Events:   Top warnings: BackOff x18 (staging/checkout), FailedScheduling x3 (batch namespace).

Findings
| severity | scope      | finding                                       | fix |
| high     | monitoring | APIService metrics.k8s.io non-Available       | kubectl -n kube-system rollout restart deploy metrics-server |
| high     | payments   | Hot node at 78% CPU; single-AZ single replica | Scale payments-api to 3 replicas with topologySpreadConstraints across AZs |
| high     | monitoring | node-exporter DS 11/12                        | Investigate csi-node on ip-10-0-7-84; likely kubelet cert rotation |
| medium   | staging    | checkout in CrashLoopBackOff (OOMKilled)      | Bump memory limit from 256Mi to 512Mi; investigate leak |
| medium   | batch      | FailedScheduling (no node with 16 GiB free)   | Add c6i.2xlarge node group or reduce job memory request |
| info     | -          | kube-proxy 1.28 on 1.29 cluster               | Plan upgrade to 1.29 minor to match |

Edge case: cluster looks healthy but deploys are silently rejected

Every query returns green. Nodes Ready, pods Running, events quiet. But the user reports kubectl apply -f deploy.yaml “says configured but nothing changes”.

Diagnosis path:

Check mutating/validating webhooks:
```
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations
```
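Finding: validatingwebhookconfiguration/old-policy-controller points at a Service (policy-webhook.svc) whose Pod was deleted last week. With failurePolicy: Fail, every matching apply errors loudly (better). With failurePolicy: Ignore, the API server skips the unreachable webhook and admits the request as if no policy existed, so nothing flags the stale configuration. Here it is Ignore. A validating webhook can only accept or reject a request; if spec changes are being rewritten after apply, the cause is a mutating webhook, so inspect the mutating configurations from the same listing as well.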
Fix: either restore the policy-webhook service or delete the stale webhook configuration.

This is a blocker finding: the cluster is unmanageable until resolved. It is also invisible on every dashboard.
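
Stale webhook backends like this one can be found without touching the cluster by checking whether each referenced Service still has ready endpoints. The sketch below is read-only and makes several assumptions: the function name stale_webhooks is hypothetical, the input is one row per webhook in the form config|failurePolicy|service-namespace|service-name, and readiness is judged from the core Endpoints object.

```shell
# stale_webhooks: read "config|failurePolicy|svc-namespace|svc-name" rows on
# stdin and print the webhooks whose backing Service has no ready endpoints.
#
# One way to produce the rows (assumes jq):
#   kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations -o json \
#     | jq -r '.items[] | .metadata.name as $c | .webhooks[]?
#              | [$c, .failurePolicy, (.clientConfig.service.namespace // ""),
#                 (.clientConfig.service.name // "")] | join("|")' \
#     | stale_webhooks
stale_webhooks() {
  while IFS='|' read -r config policy ns svc; do
    [ -n "$svc" ] || continue   # URL-based webhook: no Service to check
    ips=$(kubectl -n "$ns" get endpoints "$svc" \
            -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null </dev/null) || ips=""
    if [ -z "$ips" ]; then
      printf '%s\t%s\t%s/%s\tno ready endpoints\n' "$config" "$policy" "$ns" "$svc"
    fi
  done
}
```

Rows with failurePolicy Ignore deserve the closest look, because nothing else in the cluster will report them.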

Constraints

Do not produce output for a stack outside supported-stacks. If detection (step 1) shows no Kubernetes context, insufficient RBAC, or a non-K8s platform (Nomad, ECS, VM fleet, serverless), STOP and report. Guessing at a health check on the wrong platform yields confident-sounding but worthless output.
Never mutate the cluster during a health check without user confirmation. The output is read-only advice.
Never run kubectl drain or kubectl delete node as part of diagnosis.
Never conclude “healthy” while any blocker or high finding is open.
Never over-index on a single snapshot; if pods are mid-rollout, say so and recheck.
Never cite a finding without the command or kubectl output that evidenced it.
Never blindly run kubectl get pods -A -o yaml on large clusters; it can OOM your shell. Use field selectors and jq.

Quality checks

Every node’s conditions are reported (not just “all Ready”).
Control-plane /readyz and aggregated APIServices are checked.
DaemonSet ready / desired is reported for every DS in kube-system and any namespace hosting CNI/CSI/ingress/observability.
Pod findings include kubectl logs --previous output (or a note that logs are unavailable) for every CrashLoopBackOff.
Warning-event reasons are grouped and counted, not listed raw.
Deprecated-API counter is checked before the next upgrade window.
Every finding has a severity, a namespace/node scope, and a concrete fix command or manifest.
The verdict (healthy / degraded / unhealthy) is consistent with the severities present.

Department

Safety

Supported stacks