Infrastructure

Monitoring, log aggregation, SSL certs, network diagnostics, backups, cluster health.

7 skills · 1 orchestrator

skillskit install infrastructure installs just this department into ~/.claude/skills/. If you don't have the skillskit CLI yet, install it first.

Task skills

backup-strategy

Use when a user wants to design or audit a backup posture across databases, object storage, and Kubernetes state. Targets 3-2-1 (3 copies, 2 media, 1 offsite), sets RPO/RTO per workload tier, schedules Velero for K8s state and logical + physical backups for databases, and enforces monthly restore tests. Produces a backup inventory, schedules, and a restore-test calendar.

writes-shared
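To make the 3-2-1 target concrete, here is a minimal sketch of how a backup inventory could be checked against it. It is not the skill's own implementation: the BackupCopy type, its fields, and check_3_2_1 are hypothetical names, and the sketch assumes the production copy counts as one of the three copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of a dataset (hypothetical shape, for illustration only)."""
    location: str  # free-form label, e.g. "primary" or "bucket-eu"
    medium: str    # storage medium, e.g. "block-disk" or "object-storage"
    offsite: bool  # True if the copy lives outside the primary site


def check_3_2_1(copies):
    """Return a list of 3-2-1 violations; an empty list means the rule is met."""
    problems = []
    # 3: at least three copies (assumption: the production copy is included)
    if len(copies) < 3:
        problems.append(f"only {len(copies)} copies, need 3")
    # 2: at least two distinct storage media
    media = {c.medium for c in copies}
    if len(media) < 2:
        problems.append(f"only {len(media)} media type(s), need 2")
    # 1: at least one copy held offsite
    if not any(c.offsite for c in copies):
        problems.append("no offsite copy")
    return problems
```

Running this per workload over the backup inventory gives a quick pass/fail view before RPO/RTO and restore-test scheduling are considered.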

cluster-health

Use when a user wants a Kubernetes cluster health check, says "is the cluster healthy", "something is off with the cluster", inherits an unfamiliar cluster, or is triaging an ongoing incident. Walks node conditions, control-plane components, resource pressure, critical DaemonSets, pod lifecycle states, and recent events, then produces a severity-ranked issue list.

safe kubernetes

log-aggregation

Use when a user wants to centralise Kubernetes logs, install Loki + Promtail or ELK (Elasticsearch + Logstash/Fluent Bit + Kibana), configure retention, wire log shipping from pods, or tune label/index hygiene. Picks the lightweight (Loki) or heavyweight (ELK) stack based on scale and budget, installs, validates ingestion, and produces a LogQL or KQL query cheat sheet.

writes-shared loki+promtail+k8s elk+k8s

monitoring-setup

Use when a user wants to provision Kubernetes observability, install Prometheus/Grafana/Alertmanager, wire ServiceMonitors, import Golden Signal dashboards, or configure alert routing to Slack/PagerDuty. Installs kube-prometheus-stack via Helm, applies ServiceMonitors, loads dashboards for latency/traffic/errors/saturation, and commits Alertmanager routes.

writes-shared prometheus+grafana+k8s

network-diagnostics

Use when a user reports connectivity failures, "can't reach X", DNS issues, TLS handshake errors, timeouts, or suspected firewall/NetworkPolicy problems. Walks a layered flow from DNS to TCP to TLS to application, audits K8s NetworkPolicy, cloud firewall / NSG rules, and MTU, then emits a structured diagnosis naming the exact failing layer and the fix.

safe
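The layered flow described for network-diagnostics can be sketched with the Python standard library: run the checks in order and stop at the first layer that fails. This is an illustrative sketch, not the skill's actual procedure. The function names and the LAYERS table are assumptions, and the application layer, NetworkPolicy, firewall, and MTU audits are left out.

```python
import socket
import ssl


def check_dns(host, port, timeout):
    # Resolves the name; raises socket.gaierror if resolution fails.
    socket.getaddrinfo(host, port)


def check_tcp(host, port, timeout):
    # Opens and immediately closes a TCP connection.
    with socket.create_connection((host, port), timeout=timeout):
        pass


def check_tls(host, port, timeout):
    # Completes a TLS handshake with certificate and hostname verification.
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host):
            pass


# Ordered from the lowest layer up (hypothetical table for this sketch).
LAYERS = [("dns", check_dns), ("tcp", check_tcp), ("tls", check_tls)]


def first_failing_layer(host, port=443, timeout=3.0, layers=LAYERS):
    """Return (layer, error) for the first failing check, or None if all pass."""
    for name, check in layers:
        try:
            check(host, port, timeout)
        except OSError as exc:
            # socket.gaierror, timeouts, and ssl.SSLError all subclass OSError.
            return name, str(exc)
    return None
```

Stopping at the first failure keeps the diagnosis tied to a single layer: a TLS error is only meaningful once DNS and TCP are known to work.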

ssl-certificate-manager

Use when a user wants to audit TLS certificates across a Kubernetes estate, migrate to cert-manager with Let's Encrypt (HTTP-01 or DNS-01), set up expiry alerts (≤30d warning / ≤7d critical), or rotate certs without downtime. Runs a cert inventory, issues / renews via cert-manager, and validates the ingress still serves the new chain.

writes-shared cert-manager+k8s
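The expiry thresholds for ssl-certificate-manager (≤30d warning, ≤7d critical) can be expressed as a small classifier. This is a sketch under assumptions: expiry_severity and the constant names are hypothetical, and an already-expired certificate is treated as critical, which the description does not state explicitly.

```python
from datetime import datetime, timedelta, timezone

WARNING_WINDOW = timedelta(days=30)   # ≤30d remaining -> warning
CRITICAL_WINDOW = timedelta(days=7)   # ≤7d remaining -> critical


def expiry_severity(not_after, now=None):
    """Classify a certificate by the time left until its notAfter timestamp.

    Both datetimes are expected to be timezone-aware.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    remaining = not_after - now
    # Check the tighter window first; expired certs (negative remaining)
    # also land here, by assumption.
    if remaining <= CRITICAL_WINDOW:
        return "critical"
    if remaining <= WARNING_WINDOW:
        return "warning"
    return "ok"
```

Comparing timedelta values directly, rather than whole-day counts, avoids rounding a certificate with 7 days and a few hours left down into the critical band.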

Workflow orchestrators

Orchestrators chain the task skills above into an end-to-end flow. Invoke them the same way as any other skill. They declare chains: in frontmatter, which lets tooling pass artifacts between steps automatically.

infra-triage orchestrator

Use when a platform incident is reported ("something's wrong with cluster X", alerts firing, user-facing degradation of unknown origin). Runs structured first-response triage: cluster health first, then network or TLS branches as the evidence demands, then incident comms if the impact is user-facing.

writes-shared
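The branching described for infra-triage can be pictured as a small planning function. This is only a sketch of the ordering, not the orchestrator's real logic. The function name and boolean inputs are hypothetical, mapping the TLS branch to ssl-certificate-manager is an assumption, and "incident-comms" is a placeholder for a step this department does not list.

```python
def plan_triage(network_symptoms, tls_symptoms, user_facing):
    """Return the ordered list of steps a triage run would take (illustrative)."""
    # Cluster health always runs first.
    steps = ["cluster-health"]
    # Branch only where the evidence points.
    if network_symptoms:
        steps.append("network-diagnostics")
    if tls_symptoms:
        steps.append("ssl-certificate-manager")  # assumed owner of the TLS branch
    # Comms come last, and only when users are affected.
    if user_facing:
        steps.append("incident-comms")  # hypothetical step name
    return steps
```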