When to use
- User says “set up monitoring”, “install Prometheus”, “add Grafana dashboards”, “wire Alertmanager to Slack”, or “we need alerts for CPU/memory/error rate”.
- A new cluster has no metrics backend and services are emitting /metrics endpoints that nothing is scraping.
- Existing Prometheus is present but ServiceMonitors are missing for a specific workload.
- User wants Golden Signal coverage (latency, traffic, errors, saturation) on an existing service.
Do not use this skill for log aggregation (use log-aggregation), for tracing (out of scope; install Tempo/Jaeger separately), or for synthetic monitoring (use blackbox-exporter, which this skill can point Prometheus at).
Inputs
- Kubeconfig context for the target cluster (kubectl config current-context).
- Target namespace (default monitoring).
- Helm chart version of kube-prometheus-stack (default: pin to a known-good release, e.g. 55.5.0+).
- Notification targets: Slack webhook URL (per severity channel) and/or PagerDuty integration key.
- List of services to cover with ServiceMonitors (namespace, label selector, port name, metrics path).
- Storage class for Prometheus PVC and retention window (default 30d).
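Each entry in the service list maps onto one ServiceMonitor. The following is a minimal sketch of that mapping, not a definitive generator. It assumes the Helm release is named kps, the monitoring namespace is monitoring, and the target Service carries an app.kubernetes.io/name label. The helper name render_servicemonitor is hypothetical.

```shell
#!/usr/bin/env bash
# render_servicemonitor APP APP_NAMESPACE PORT_NAME [METRICS_PATH]
# Prints a ServiceMonitor manifest for one service on stdout.
# Assumptions: release label "kps", namespace "monitoring", and an
# app.kubernetes.io/name selector. Adjust all three to your cluster.
render_servicemonitor() {
  local app app_ns port path
  app=$1
  app_ns=$2
  port=$3
  path=${4:-/metrics}   # metrics path defaults to /metrics
  cat <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ${app}-sm
  namespace: monitoring
  labels:
    release: kps
spec:
  namespaceSelector:
    matchNames: ["${app_ns}"]
  selector:
    matchLabels:
      app.kubernetes.io/name: ${app}
  endpoints:
    - port: ${port}
      path: ${path}
      interval: 30s
      scrapeTimeout: 10s
EOF
}
```

One possible use is to pipe the output straight into the cluster, for example render_servicemonitor api default http-metrics | kubectl apply -f -.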
Outputs
- Running kube-prometheus-stack release in the target namespace with Prometheus, Grafana, Alertmanager, node-exporter, kube-state-metrics, Prometheus Operator.
- ServiceMonitor resources for every listed service.
- PrometheusRule with the baseline alerts from references/alertmanager-rules-template.yaml.
- Grafana dashboards imported from references/grafana-dashboard-templates.json, plus the community dashboards for kubelet (7249), node-exporter (1860), and kube-state-metrics (13332).
- Alertmanager configuration with per-severity routing (critical → PagerDuty + Slack, warning → Slack only, info → swallowed).
- A verification report: prom-rules.yaml lint result, list of up targets, a test alert firing end-to-end.
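The per-severity routing listed above can be read as a small lookup table. The sketch below is illustrative only. It hard-codes the receiver names this skill uses in its Alertmanager values (pagerduty-critical, slack-critical, slack-warnings, null), and it assumes the PagerDuty route for critical alerts sets continue: true so that the Slack route is evaluated as well. The function name receivers_for is hypothetical.

```shell
#!/usr/bin/env bash
# receivers_for SEVERITY
# Prints the receivers an alert with that severity label is expected to reach.
# This mirrors the intended routing; it does not query Alertmanager.
receivers_for() {
  case "$1" in
    critical) echo "pagerduty-critical slack-critical" ;;  # both, assuming continue: true
    warning)  echo "slack-warnings" ;;
    info)     echo "null" ;;                                # swallowed, nobody is notified
    *)        echo "slack-warnings" ;;                      # unmatched alerts fall back to the root receiver
  esac
}
```

A table like this is handy as an expectation to compare against when the synthetic alert is fired at the end of the procedure.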
Tool dependencies
- kubectl (≥ 1.27), helm (≥ 3.12), jq, yq, curl.
- promtool for check rules.
- Kubernetes MCP for list/apply operations if available.
- amtool for Alertmanager config validation.
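The tool list can be checked before any cluster work starts. The following is a rough sketch, assuming version strings are plain dotted numbers with an optional leading v and an optional build suffix after - or +. The helper names missing_tools and version_ge are hypothetical, and how each tool reports its version is left to the caller.

```shell
#!/usr/bin/env bash
# missing_tools TOOL...: print each named CLI that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# version_ge HAVE WANT: succeed when dotted version HAVE >= WANT.
# A leading "v" and any "-suffix" or "+suffix" are ignored (assumption:
# the remaining fields are purely numeric).
version_ge() {
  local a b x y
  a=${1#v}; a=${a%%[-+]*}
  b=${2#v}; b=${b%%[-+]*}
  while [ -n "$a" ] || [ -n "$b" ]; do
    x=${a%%.*}; y=${b%%.*}                      # first remaining field of each
    case "$a" in *.*) a=${a#*.} ;; *) a= ;; esac
    case "$b" in *.*) b=${b#*.} ;; *) b= ;; esac
    x=${x:-0}; y=${y:-0}                        # missing fields count as 0
    [ "$x" -gt "$y" ] && return 0
    [ "$x" -lt "$y" ] && return 1
  done
  return 0                                      # all fields equal
}
```

For example, missing_tools kubectl helm jq yq curl promtool amtool prints nothing when everything is installed, and version_ge v1.28.3 1.27 succeeds while version_ge 3.11.2 3.12 fails.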
Procedure
1. Detect the stack. Run these read-only commands and record findings:
   - Is a Kubernetes cluster addressable? No cluster → stop; this skill requires kubernetes.
         kubectl config current-context
   - Is the Prometheus Operator already present?
         kubectl get crd prometheuses.monitoring.coreos.com -o name 2>/dev/null
   - Is it managed by Helm?
         helm list -A 2>/dev/null | grep -Ei 'kube-prometheus-stack|prometheus-operator'
   - Is a competing APM already instrumented?
         grep -l 'datadog\|newrelic\|dd-trace\|opentelemetry' package.json requirements.txt go.mod Cargo.toml pom.xml 2>/dev/null | head
   - Are there existing dashboards/rules in the repo?
         ls monitoring/ observability/ .github/monitoring/ 2>/dev/null

   Conclude which stack applies. This skill supports only prometheus+grafana+k8s. If detection shows Datadog, New Relic, CloudWatch, Honeycomb, or any non-Prometheus backend as the primary, STOP and report the detected stack to the user; recommend a dedicated skill for that stack instead of forcing Prometheus config onto a mismatched environment.

2. Preflight. Run kubectl get nodes and kubectl version (the --short flag is no longer accepted by recent kubectl releases), and confirm the cluster has at least 4 vCPU and 8 GiB of free capacity. If another Prometheus Operator is already installed (detected in step 1), either reuse it or uninstall the old operator first. Do not install a second one.

3. Add the Helm repo and create the namespace:

       helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
       helm repo update
       kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -

4. Render a values file kps-values.yaml that pins retention, storage, scrape interval, and routing. Key stanzas:

       prometheus:
         prometheusSpec:
           retention: 30d
           scrapeInterval: 30s
           evaluationInterval: 30s
           storageSpec:
             volumeClaimTemplate:
               spec:
                 storageClassName: gp3
                 resources:
                   requests:
                     storage: 100Gi
           serviceMonitorSelectorNilUsesHelmValues: false
           ruleSelectorNilUsesHelmValues: false
       grafana:
         adminPassword: "__replace_me__"
         defaultDashboardsEnabled: true
         persistence:
           enabled: true
           size: 10Gi
       alertmanager:
         config:
           route:
             receiver: slack-warnings
             group_by: ["alertname", "namespace", "severity"]
             group_wait: 30s
             group_interval: 5m
             repeat_interval: 4h
             routes:
               # continue: true lets critical alerts also reach the Slack route below
               - receiver: pagerduty-critical
                 matchers: ['severity="critical"']
                 continue: true
               - receiver: slack-critical
                 matchers: ['severity="critical"']
               - receiver: slack-warnings
                 matchers: ['severity="warning"']
               - receiver: "null"
                 matchers: ['severity="info"']
           receivers:
             - name: "null"
             - name: slack-warnings
               slack_configs:
                 - api_url: "__SLACK_WARN_WEBHOOK__"
                   channel: "#alerts-warn"
                   send_resolved: true
                   title: '{{ template "slack.default.title" . }}'
                   text: '{{ template "slack.default.text" . }}'
             - name: slack-critical
               slack_configs:
                 - api_url: "__SLACK_CRIT_WEBHOOK__"
                   channel: "#alerts-critical"
                   send_resolved: true
             - name: pagerduty-critical
               pagerduty_configs:
                 - routing_key: "__PAGERDUTY_KEY__"
                   severity: critical
                   send_resolved: true

   Replace the three notification placeholders (__SLACK_WARN_WEBHOOK__, __SLACK_CRIT_WEBHOOK__, __PAGERDUTY_KEY__) with real secrets from a sealed-secret, External Secrets Operator, or kubectl create secret. Replace the Grafana __replace_me__ password as well before any production install.

5. Install or upgrade:

       helm upgrade --install kps prometheus-community/kube-prometheus-stack \
         --namespace monitoring --version 55.5.0 -f kps-values.yaml --wait --timeout 15m

6. Validate the install:

       kubectl -n monitoring get pods
       kubectl -n monitoring get servicemonitors
       kubectl -n monitoring port-forward svc/kps-kube-prometheus-stack-prometheus 9090:9090 &
       sleep 3   # give the port-forward a moment to bind before querying
       curl -s localhost:9090/api/v1/targets | jq '.data.activeTargets | length'

7. Apply baseline alerts. Copy references/alertmanager-rules-template.yaml, substitute the namespace label if required, validate with promtool check rules (after extracting the spec.groups with yq), then kubectl apply -f.

8. Write ServiceMonitors for each input service. Template:

       apiVersion: monitoring.coreos.com/v1
       kind: ServiceMonitor
       metadata:
         name: <app>-sm
         namespace: monitoring
         labels:
           release: kps
       spec:
         namespaceSelector:
           matchNames: ["<app-ns>"]
         selector:
           matchLabels:
             app.kubernetes.io/name: <app>
         endpoints:
           - port: <metrics-port-name>
             path: /metrics
             interval: 30s
             scrapeTimeout: 10s

   Confirm up{namespace="<app-ns>", job="<service-name>"} == 1 in Prometheus. Unless jobLabel is set on the ServiceMonitor, the operator uses the name of the Kubernetes Service as the job label, not the ServiceMonitor name.

9. Import dashboards. From Grafana UI or via ConfigMap with label grafana_dashboard: "1", apply references/grafana-dashboard-templates.json. Also import community dashboards: 1860 (node-exporter), 13332 (kube-state-metrics), 7249 (kubelet), 15661 (Kubernetes overview).

10. Fire a synthetic alert end-to-end:

        kubectl -n monitoring run stress --image=polinux/stress --restart=Never -- \
          stress --cpu 4 --timeout 600s

    Watch #alerts-warn for the HighCPU notification. Remove the pod afterward (kubectl -n monitoring delete pod stress).

11. Document. Emit a short report listing the Helm release name, chart version, active targets count, firing alerts count, dashboards imported, and the Alertmanager receivers configured. Store the values file and secrets source in the ops repo.
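The baseline-alerts step validates a PrometheusRule manifest with promtool, which expects a plain rule file with a top-level groups key rather than the Kubernetes wrapper. One way to do the extraction is sketched below. It assumes mikefarah yq v4 (the Python yq wrapper uses a different syntax) and promtool on PATH. The function name check_prometheus_rule and the file name in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# check_prometheus_rule MANIFEST
# Lifts spec.groups out of a PrometheusRule manifest into a temporary rule
# file and lints it with promtool. Succeeds only when both steps succeed.
# Assumption: `yq` is mikefarah yq v4.
check_prometheus_rule() {
  local manifest tmp status
  manifest=$1
  tmp=$(mktemp) || return 1
  if yq eval '{"groups": .spec.groups}' "$manifest" > "$tmp" &&
     promtool check rules "$tmp"; then
    status=0
  else
    status=1
  fi
  rm -f "$tmp"          # always clean up the temporary rule file
  return "$status"
}
```

A typical call might be check_prometheus_rule baseline-alerts.yaml && kubectl apply -f baseline-alerts.yaml, so the manifest is applied only after the lint passes.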
Examples
Happy path: greenfield cluster, 2 services, Slack + PagerDuty
Inputs: cluster prod-eu, namespace monitoring, services api in namespace default (port http-metrics), worker in jobs (port metrics), Slack #alerts-warn + #alerts-crit, PagerDuty key for the SRE service.
Steps executed:
1. helm install kps ... --version 55.5.0.
2. Two ServiceMonitors applied, both show up == 1.
3. PrometheusRule baseline-alerts applied from the template.
4. Grafana dashboard Golden Signals imported; loads with data within 2 minutes.
5. Stress pod triggers HighCPU → Slack #alerts-warn at severity warning; HighErrorRate simulated via fault injection triggers PagerDuty incident. Resolve paths confirmed.
Report delivered to the user:
kps 55.5.0 installed in monitoring
Active scrape targets: 42
Firing alerts: 0
Dashboards imported: 4 (Golden Signals, node-exporter, kube-state-metrics, kubelet)
Alertmanager receivers: slack-warnings, slack-critical, pagerduty-critical, null
Edge case: existing Prometheus Operator, no persistent storage
The cluster has a prior prometheus-operator install from 2019 and no default StorageClass. Approach:
- Detect with kubectl get crd prometheuses.monitoring.coreos.com -o yaml | yq '.metadata.labels'. Old install has no app.kubernetes.io/managed-by: Helm.
- Refuse to overwrite. Offer two paths: (a) adopt the existing CRDs by installing chart with crds.enabled=false and matching selector labels; (b) uninstall the legacy operator after exporting its alert rules with kubectl get prometheusrules -A -o yaml > legacy-rules.yaml.
- For storage: if no StorageClass, set prometheus.prometheusSpec.storageSpec: {} for emptyDir (with an explicit warning that data is lost on pod restart), or create a StorageClass first. Never silently accept data loss.
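The branching in this edge case can be written down as a small decision helper. The sketch below only turns three yes/no findings into a plan for the user to confirm; it runs nothing against the cluster. The function name plan_install and the wording of the plan lines are hypothetical.

```shell
#!/usr/bin/env bash
# plan_install CRD_PRESENT HELM_MANAGED HAS_STORAGE_CLASS
# Each argument is "yes" or "no", taken from the detection commands.
# Prints one line for the operator decision and one for storage.
plan_install() {
  local crd helm_managed storage
  crd=$1
  helm_managed=$2
  storage=$3
  if [ "$crd" = "yes" ] && [ "$helm_managed" = "no" ]; then
    # Legacy operator: never overwrite, let the user pick a path.
    echo "operator: legacy install found; ask the user to adopt CRDs or uninstall first"
  elif [ "$crd" = "yes" ]; then
    echo "operator: Helm-managed install found; reuse it"
  else
    echo "operator: none found; fresh install"
  fi
  if [ "$storage" = "yes" ]; then
    echo "storage: use a PVC"
  else
    echo "storage: warn about data loss, then emptyDir or create a StorageClass"
  fi
}
```

For the cluster described above, plan_install yes no no would report the legacy operator and the emptyDir warning together, which keeps the data-loss risk visible in the same place as the operator decision.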
Constraints
- Do not produce output for a stack outside supported-stacks. If step 1 detection shows the primary metrics backend is Datadog, New Relic, CloudWatch, Honeycomb, or any non-Prometheus system, STOP and report the detected stack to the user. Recommend a dedicated skill for that stack rather than producing Prometheus config that will not integrate.
- Never commit real Slack webhooks or PagerDuty keys to git. Use sealed-secrets, External Secrets, or kubectl create secret referenced from the Helm values.
- Never set Grafana admin password to a default in production. Generate with openssl rand -base64 24.
- Never set scrapeInterval below 15s without a documented reason; halving the interval roughly doubles sample volume and storage cost.
- Never revert ruleSelectorNilUsesHelmValues: false once set; other teams' PrometheusRule resources will stop being discovered.
- Never route severity=info to a human channel; it desensitises responders.
- Always pin the Helm chart version. latest breaks reproducibility.
- Always set retention and storage explicitly. Defaults are rarely right for production.
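Two of these constraints lend themselves to a mechanical check before the values file is used: unreplaced placeholder tokens and a scrape interval below 15s. The following is a rough sketch under stated assumptions. Placeholders are taken to be any __TOKEN__ style string, and only the s, m, and h duration suffixes are understood. All three function names are hypothetical.

```shell
#!/usr/bin/env bash
# find_placeholders FILE: print each unreplaced __TOKEN__ placeholder once.
find_placeholders() {
  grep -oE '__[A-Za-z0-9_]+__' "$1" | sort -u
}

# interval_seconds VALUE: convert a duration such as 30s or 1m to seconds.
# Only the s, m, and h suffixes are handled; anything else fails.
interval_seconds() {
  local n
  n=${1%[smh]}
  case "$n" in
    ''|*[!0-9]*) return 1 ;;   # not a plain number plus one known suffix
  esac
  case "$1" in
    *s) echo "$n" ;;
    *m) echo $((n * 60)) ;;
    *h) echo $((n * 3600)) ;;
    *)  return 1 ;;
  esac
}

# scrape_interval_ok VALUE: succeed when the interval is at least 15s.
scrape_interval_ok() {
  local secs
  secs=$(interval_seconds "$1") || return 1
  [ "$secs" -ge 15 ]
}
```

Running find_placeholders kps-values.yaml before helm upgrade should print nothing; any output means a secret has not been wired in yet.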
Quality checks
- kubectl -n monitoring get pods shows every pod Running and Ready.
- curl -s localhost:9090/api/v1/targets | jq '[.data.activeTargets[] | select(.health!="up")] | length' returns 0.
- promtool check rules baseline-alerts.yaml exits 0.
- amtool check-config alertmanager.yaml exits 0.
- At least one synthetic alert has fired end-to-end (Prometheus → Alertmanager → Slack/PagerDuty) and been marked resolved.
- Grafana dashboard Golden Signals renders with non-empty panels for the target service within 5 minutes of first scrape.
- Alertmanager routing tree has an explicit severity=info → null branch.
- All four Golden Signals (latency, traffic, errors, saturation) are present on the primary dashboard.
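The command-based checks above can be collected into one pass/fail run. The sketch below is one possible shape, not a prescribed harness. It assumes the Prometheus port-forward on localhost:9090 is still running and that jq, promtool, and amtool are on PATH. The names run_check, all_targets_up, and failures are hypothetical.

```shell
#!/usr/bin/env bash
failures=0

# run_check NAME COMMAND [ARGS...]: run one check quietly and record the result.
run_check() {
  local name
  name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
    failures=$((failures + 1))
  fi
}

# Wraps the piped target-health check so it can be handed to run_check.
# Assumption: Prometheus is reachable on localhost:9090.
all_targets_up() {
  local down
  down=$(curl -s localhost:9090/api/v1/targets |
    jq '[.data.activeTargets[] | select(.health!="up")] | length') || return 1
  [ "$down" = "0" ]
}

# Example wiring (left commented out so nothing runs against a cluster here):
#   run_check "targets up"  all_targets_up
#   run_check "rules lint"  promtool check rules baseline-alerts.yaml
#   run_check "am config"   amtool check-config alertmanager.yaml
#   [ "$failures" -eq 0 ] || exit 1
```

The dashboard and end-to-end alert checks still need a human or a browser test; this runner only covers the checks that reduce to an exit code.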