When to use
- User inherits a cluster and wants a baseline report.
- Alerts are firing but the root cause is unclear — need a structured triage before drilling into a single service.
- Before running any changes (Helm upgrade, CNI swap, node pool recycle), confirm the cluster is green.
- Post-incident: validate that the cluster is back to baseline.
- Recurring “some pods are weird” complaints that need systematic triage.
Do not use this skill for application-level debugging (use the owning service’s logs), for cloud-provider outages (check provider status page), or for cost reviews (separate skill).
Inputs
- Kubeconfig context (single cluster per run; run again per context for multi-cluster estates).
- Optional: list of critical namespaces to give extra weight (e.g. payments, ingress, kube-system).
- Optional: an incident time window to focus event triage on.
Outputs
A report with these sections:
- Cluster summary: version, node count, region, CNI, ingress controller.
- Node table: node | role | version | conditions | CPU util | mem util | disk util | PID util | age.
- Control-plane check: API server reachability, scheduler, controller-manager, etcd (or managed control plane indicators), admission webhooks.
- Critical DaemonSets: CNI, CSI, kube-proxy, node-exporter, log shipper, cert-manager, ingress. ready / desired per DaemonSet.
- Pod lifecycle: counts of Running, Pending, CrashLoopBackOff, ImagePullBackOff, Error, Terminating.
- Event triage: last 1 h of Warning-level events grouped by reason.
- Findings ladder: blocker / high / medium / low / info, each with a concrete command or manifest fix.
Tool dependencies
- kubectl (≥ 1.27), jq, yq.
- kubectl top (requires metrics-server).
- kubectl-neat, stern optional for prettier output.
- Kubernetes MCP for batched list/describe.
- Optional: kubeval, kube-score, popeye for deeper static checks.
Procedure
- Detect the stack. Before running any diagnostic commands, confirm the target is a Kubernetes cluster this skill can read:
  - kubectl config current-context: must return a context. Empty output → stop; this skill requires Kubernetes.
  - kubectl auth can-i get nodes: must return yes. If no, the current kubeconfig lacks the permissions required for a cluster-wide health check; stop and ask the user for a kubeconfig with at least read access across the cluster.
  - kubectl version 2>&1 | head -3: record client and server versions; warn on server versions older than 1.27. On kubectl 1.27, add --short; the flag was removed in 1.28, where the short form is the default.

  If the user wanted to health-check a Nomad cluster, an ECS cluster, a VM fleet (Ansible-managed), or serverless functions, this skill does not apply. Report that and recommend a dedicated skill for the platform they are on.
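The first two detection checks can be chained so the run stops at the first failure. A minimal sketch, assuming a POSIX shell; the function name preflight and its messages are illustrative and not part of kubectl:

```shell
# preflight: hypothetical helper that stops the health check early.
# Prints one status line and returns non-zero when the run should stop.
preflight() {
  # An unset context makes kubectl exit non-zero with empty stdout.
  ctx=$(kubectl config current-context 2>/dev/null) || true
  if [ -z "$ctx" ]; then
    echo "stop: no kubeconfig context"
    return 1
  fi
  # "kubectl auth can-i" prints yes or no (and exits non-zero on no).
  can_read=$(kubectl auth can-i get nodes 2>/dev/null) || true
  if [ "$can_read" != "yes" ]; then
    echo "stop: context=$ctx cannot read nodes"
    return 1
  fi
  echo "ok: context=$ctx"
}
```

Calling preflight || exit 1 at the top of a script keeps every later command from running against the wrong target.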
- Capture cluster basics:

      kubectl cluster-info
      kubectl get nodes -o wide
      kubectl get --raw='/readyz?verbose' | head -40

  Note any deprecation warnings.
- Node conditions:

      kubectl get nodes -o json | jq -r '
        # First status for a condition type; d is used when the condition is absent
        def cond(t; d): ([.status.conditions[] | select(.type == t) | .status][0] // d);
        .items[]
        | [ .metadata.name,
            cond("Ready"; "Unknown"),
            cond("MemoryPressure"; "False"),
            cond("DiskPressure"; "False"),
            cond("PIDPressure"; "False"),
            cond("NetworkUnavailable"; "False") ]
        | @tsv'

  Any MemoryPressure=True, DiskPressure=True, PIDPressure=True, or Ready=False for more than a few minutes is a high finding.
- Resource utilisation:

      kubectl top nodes
      kubectl top pods -A --sort-by=cpu | head -30
      kubectl top pods -A --sort-by=memory | head -30
      # Allocatable vs requests to spot scheduling pressure
      kubectl get nodes -o json | jq -r '
        .items[] | .metadata.name
          + " alloc-cpu=" + .status.allocatable.cpu
          + " alloc-mem=" + .status.allocatable.memory'
      # Total CPU requests in millicores ("250m" -> 250, "2" -> 2000)
      kubectl get pods -A -o json | jq -r '
        [.items[].spec.containers[].resources.requests.cpu // "0"]
        | map(if endswith("m") then (rtrimstr("m") | tonumber) else (tonumber * 1000) end)
        | add'

  Cluster-wide CPU requests > 85% of allocatable is a high finding: pods from the next deploy are likely to stay Pending.
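CPU quantities come back in mixed units (250m, 2, 0.5), so the 85% comparison only makes sense once both sides are in millicores. A sketch in plain awk, assuming only the bare and m-suffixed forms appear; sum_millicores and requests_pct are hypothetical helper names:

```shell
# sum_millicores: reads one CPU quantity per line, prints the total in millicores.
sum_millicores() {
  awk '
    /m$/ { sub(/m$/, ""); total += $1; next }   # "250m" -> 250
    NF   { total += $1 * 1000 }                 # "2" or "0.5" cores -> millicores
    END  { printf "%d\n", total }
  '
}

# requests_pct REQUESTED ALLOCATABLE: integer percentage, rounded down.
requests_pct() {
  [ "$2" -gt 0 ] || return 1
  echo $(( 100 * $1 / $2 ))
}

# Possible wiring (not executed here):
#   req=$(kubectl get pods -A --field-selector=status.phase=Running \
#     -o jsonpath='{range .items[*].spec.containers[*]}{.resources.requests.cpu}{"\n"}{end}' | sum_millicores)
#   alloc=$(kubectl get nodes \
#     -o jsonpath='{range .items[*]}{.status.allocatable.cpu}{"\n"}{end}' | sum_millicores)
#   [ "$(requests_pct "$req" "$alloc")" -gt 85 ] && echo "high: CPU requests above 85% of allocatable"
```

Containers without a CPU request produce empty lines, which the NF guard skips.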
- Control-plane checks (managed cluster: skip etcd, but keep the health endpoints):

      kubectl get --raw='/healthz'
      kubectl get --raw='/livez?verbose' | tail
      kubectl -n kube-system get pods -l tier=control-plane
      kubectl get componentstatuses            # deprecated but still informative on self-managed
      kubectl get apiservice | grep -v True    # non-Available aggregated APIs
      kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration

  A non-Available APIService or a ValidatingWebhook pointing at a down service will break kubectl apply silently. That is a blocker.
- Critical DaemonSets:

      kubectl get ds -A -o json | jq -r '
        .items[]
        | select(.status.numberReady < .status.desiredNumberScheduled)
        | [.metadata.namespace, .metadata.name,
           (.status.numberReady|tostring) + "/" + (.status.desiredNumberScheduled|tostring)]
        | @tsv'

  Any DS below desired is at minimum high; for CNI/CSI/kube-proxy it is a blocker.
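The high-versus-blocker split for DaemonSets can be applied mechanically to the rows that query prints. A sketch, assuming rows of namespace, name and ready/desired; ds_severity is a hypothetical helper, and the CRITICAL_DS pattern is only an example list of CNI/CSI/kube-proxy names that has to be adapted per cluster:

```shell
# CRITICAL_DS: example pattern only (assumption); DaemonSet names vary per cluster.
CRITICAL_DS='^(aws-node|calico-node|cilium|kube-proxy|.*csi.*)$'

# ds_severity: reads "NAMESPACE NAME READY/DESIRED" rows for DaemonSets that are
# below desired and prefixes each row with blocker or high.
ds_severity() {
  awk -v crit="$CRITICAL_DS" '{
    sev = ($2 ~ crit) ? "blocker" : "high"
    print sev "\t" $0
  }'
}
```

Piping the DaemonSet query into ds_severity gives rows that can go straight into the findings table.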
- Pod lifecycle:

      kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded -o wide
      kubectl get pods -A -o json | jq -r '
        ["CrashLoopBackOff","ImagePullBackOff","ErrImagePull","CreateContainerConfigError"] as $bad
        | .items[]
        # Waiting reasons of every container, not only the first one
        | [(.status.containerStatuses // [])[] | .state.waiting.reason // empty] as $reasons
        | select(any($reasons[]; IN($bad[])))
        | [.metadata.namespace, .metadata.name, ($reasons | join(","))]
        | @tsv'

  Classify:
  - CrashLoopBackOff: high in a Tier 1 namespace, medium otherwise. Pull logs (kubectl logs --previous) and the last 20 events for that pod.
  - ImagePullBackOff / ErrImagePull: high. Verify image tag, registry creds (kubectl get secret ... -o json), and imagePullSecrets on the pod spec.
  - Pending for > 5 min: medium unless the namespace is Tier 1, then high. Root causes: no node fits resources, tainted nodes with no tolerations, PVC unbound, PodSecurity admission rejection.
  - CreateContainerConfigError: high. Usually a missing Secret or ConfigMap.
  - Terminating for > 10 min: medium. A finalizer is stuck; inspect metadata.finalizers.
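The classification rules reduce to a small lookup. A sketch, assuming the tier is passed as an integer (1 for Tier 1) and the age in whole minutes; classify_pod is a hypothetical helper, the info fallback for pods still under the thresholds is an assumption, and CrashLoopBackOff follows the tiered rule from the report ladder (high in Tier 1, medium otherwise):

```shell
# classify_pod REASON AGE_MIN TIER: prints a severity from the findings ladder.
classify_pod() {
  reason=$1 age_min=$2 tier=$3
  case "$reason" in
    CrashLoopBackOff)
      if [ "$tier" -eq 1 ]; then echo high; else echo medium; fi ;;
    ImagePullBackOff|ErrImagePull|CreateContainerConfigError)
      echo high ;;
    Pending)
      if [ "$age_min" -le 5 ]; then echo info      # still within the grace period
      elif [ "$tier" -eq 1 ]; then echo high
      else echo medium; fi ;;
    Terminating)
      if [ "$age_min" -gt 10 ]; then echo medium; else echo info; fi ;;
    *)
      echo info ;;
  esac
}
```

For example, classify_pod Pending 7 1 prints high, while the same pod in a Tier 2 namespace is medium.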
- PersistentVolume check:

      kubectl get pv -o wide
      kubectl get pvc -A | grep -v Bound

  Any PVC Pending for more than 10 min is high if it blocks a Tier 1 pod.
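- Events in the last hour:

      # Cutoff as an RFC 3339 UTC timestamp (GNU date, with a BSD date fallback)
      cutoff="$(date -u -d '1 hour ago' +%FT%TZ 2>/dev/null || date -u -v-1H +%FT%TZ)"
      kubectl get events -A -o json \
        | jq -r --arg cutoff "$cutoff" '.items[]
            | (.lastTimestamp // .eventTime // .metadata.creationTimestamp) as $ts
            | select($ts >= $cutoff)
            | [$ts, .metadata.namespace, .type, .reason, .involvedObject.name] | @tsv' \
        | sort | head -80
      # Count Warning reasons from JSON rather than by table column position
      kubectl get events -A --field-selector type=Warning -o json \
        | jq -r '.items[].reason' | sort | uniq -c | sort -rn | head -20

  The uniq -c output shows the dominant warning reasons (FailedScheduling, BackOff, Unhealthy, FailedMount). Each deserves a finding.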
- Version / deprecation:

      kubectl get --raw /metrics | grep '^apiserver_requested_deprecated_apis' | grep -v ' 0$' | head

  apiserver_requested_deprecated_apis is a gauge, so any series listed here means something has called a deprecated API since the API server started and will break on upgrade. medium.
- Produce the report using the ladder:
  - blocker → fix before any other change. API server non-ready, etcd unhealthy, >1 control-plane node down, CNI DS failing.
  - high → fix today. Node under DiskPressure, DS below desired in any namespace, CrashLoopBackOff in Tier 1 namespace, > 85% CPU requests cluster-wide.
  - medium → fix this week. CrashLoopBackOff in Tier 2/3, deprecated APIs, Pending pods in non-Tier 1.
  - low → backlog. Suboptimal resource requests, missing readiness probes.
  - info → observation; no action.
Examples
Happy path: inherited 12-node EKS, first health check
Report excerpt:
Cluster: EKS 1.29, 12 nodes (3 system m6i.large, 9 workload m6i.xlarge), eu-west-1, VPC CNI, nginx-ingress 1.10.
Nodes: 12/12 Ready. All pressure conditions False.
Top util: ip-10-0-3-21 CPU 78%, mem 66% (hot node: payments-api).
Control: /readyz ok. 1 APIService non-Available: v1beta1.metrics.k8s.io.
DS: 22 DaemonSets, 21/22 fully ready; node-exporter 11/12 (one node failed CSI mount).
Pods: 412 Running, 0 Pending, 2 CrashLoopBackOff in staging, 0 ImagePullBackOff.
Events: Top warnings: BackOff x18 (staging/checkout), FailedScheduling x3 (batch namespace).
Findings
| severity | scope | finding | fix |
| high | monitoring | APIService metrics.k8s.io non-Available | kubectl -n kube-system rollout restart deploy metrics-server |
| high | payments | Hot node at 78% CPU; single-AZ single replica | Scale payments-api to 3 replicas with topologySpreadConstraints across AZs |
| medium | staging | checkout in CrashLoopBackOff (OOMKilled) | Bump memory limit from 256Mi to 512Mi; investigate leak |
| medium | batch | FailedScheduling (no node with 16 GiB free) | Add c6i.2xlarge node group or reduce job memory request |
| low | monitoring | node-exporter DS 11/12 | Investigate csi-node on ip-10-0-7-84; likely kubelet cert rotation |
| info | - | kube-proxy 1.28 on 1.29 cluster | Plan a kube-proxy upgrade to 1.29 to match the control plane |
Edge case: cluster looks healthy but deploys are silently rejected
Every query returns green. Nodes Ready, pods Running, events quiet. But the user reports kubectl apply -f deploy.yaml “says configured but nothing changes”.
Diagnosis path:
- Check mutating/validating webhooks:

      kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations

  Finding: validatingwebhookconfiguration/old-policy-controller points at a Service (policy-webhook.svc) whose Pod was deleted last week. With failurePolicy: Ignore, kubectl apply returns success but the object is silently mutated or dropped upstream; with failurePolicy: Fail, apply errors loudly (better). Here it is Ignore and the webhook mutates away the spec changes.
- Fix: either restore the policy-webhook service or delete the stale webhook configuration.
This is a blocker finding: the cluster is unmanageable until resolved. It is also invisible on every dashboard.
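Because no dashboard surfaces this, it helps to check every webhook backend for ready endpoints. A sketch, assuming Service-backed webhooks and read access to Endpoints; stale_webhooks is a hypothetical helper and the row format is chosen here for illustration:

```shell
# stale_webhooks: reads "WEBHOOK FAILURE_POLICY NAMESPACE SERVICE" rows on stdin
# and prints the webhooks whose Service has no ready endpoint addresses.
stale_webhooks() {
  while read -r name policy ns svc; do
    [ -n "$svc" ] || continue   # URL-based webhooks have no Service to check
    ips=$(kubectl -n "$ns" get endpoints "$svc" \
            -o jsonpath='{.subsets[*].addresses[*].ip}' </dev/null 2>/dev/null) || true
    [ -n "$ips" ] || echo "$name failurePolicy=$policy backend=$ns/$svc has no ready endpoints"
  done
}

# Possible wiring (not executed here); repeat for mutatingwebhookconfigurations:
#   kubectl get validatingwebhookconfigurations -o jsonpath='{range .items[*].webhooks[*]}{.name}{" "}{.failurePolicy}{" "}{.clientConfig.service.namespace}{" "}{.clientConfig.service.name}{"\n"}{end}' \
#     | stale_webhooks
```

Any row it prints with failurePolicy=Ignore is a candidate for the silent failure described in this example.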
Constraints
- Do not produce output for a stack outside supported-stacks. If detection (step 1) shows no Kubernetes context, insufficient RBAC, or a non-K8s platform (Nomad, ECS, VM fleet, serverless), STOP and report. Guessing at a health check on the wrong platform yields confident-sounding but worthless output.
- Never mutate the cluster during a health check without user confirmation. The output is read-only advice.
- Never run kubectl drain or kubectl delete node as part of diagnosis.
- Never conclude “healthy” while any blocker or high finding is open.
- Never over-index on a single snapshot; if pods are mid-rollout, say so and recheck.
- Never cite a finding without the command or kubectl output that evidenced it.
- Never blindly run kubectl get pods -A -o yaml on large clusters; it can OOM your shell. Use field selectors and jq.
Quality checks
- Every node’s conditions are reported (not just “all Ready”).
- Control-plane /readyz and aggregated APIServices are checked.
- DaemonSet ready / desired is reported for every DS in kube-system and any namespace hosting CNI/CSI/ingress/observability.
- Pod findings include kubectl logs --previous output (or a note that logs are unavailable) for every CrashLoopBackOff.
- Warning-event reasons are grouped and counted, not listed raw.
- Deprecated-API counter is checked before the next upgrade window.
- Every finding has a severity, a namespace/node scope, and a concrete fix command or manifest.
- The verdict (healthy / degraded / unhealthy) is consistent with the severities present.
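The consistency check on the verdict can be automated from the list of severities. A sketch under one assumed mapping, which the text does not spell out: any blocker gives unhealthy, any high gives degraded, and everything else is healthy. The function name verdict is hypothetical:

```shell
# verdict: reads one severity per line (blocker/high/medium/low/info)
# and prints the overall verdict under the assumed mapping above.
verdict() {
  awk '
    $1 == "blocker" { blocker = 1 }
    $1 == "high"    { high = 1 }
    END {
      if (blocker)   print "unhealthy"
      else if (high) print "degraded"
      else           print "healthy"
    }
  '
}
```

This keeps the constraint that a report never says healthy while a blocker or high finding is open; comparing its output with the verdict written in the report is a cheap final check.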