monitoring-setup

Use when a user wants to provision Kubernetes observability, install Prometheus/Grafana/Alertmanager, wire ServiceMonitors, import Golden Signal dashboards, or configure alert routing to Slack/PagerDuty. Installs kube-prometheus-stack via Helm, applies ServiceMonitors, loads dashboards for latency/traffic/errors/saturation, and commits Alertmanager routes.

Department

Infrastructure

Safety

writes-shared
Writes shared state

Supported stacks

prometheus+grafana+k8s

When to use

Do not use this skill for log aggregation (use log-aggregation), for tracing (out of scope; install Tempo or Jaeger separately), or for synthetic monitoring (use blackbox-exporter, which this skill can point Prometheus at).

Inputs

Outputs

Tool dependencies

Procedure

  1. Detect the stack. Run these read-only commands and record findings:

    • kubectl config current-context — confirm a Kubernetes cluster is addressable. No cluster → stop; this skill requires kubernetes.
    • kubectl get crd prometheuses.monitoring.coreos.com -o name 2>/dev/null — Prometheus Operator already present?
    • helm list -A 2>/dev/null | grep -Ei 'kube-prometheus-stack|prometheus-operator' — managed by Helm?
    • grep -l 'datadog\|newrelic\|dd-trace\|opentelemetry' package.json requirements.txt go.mod Cargo.toml pom.xml 2>/dev/null | head — a competing APM already instrumented?
    • ls monitoring/ observability/ .github/monitoring/ 2>/dev/null — existing dashboards/rules in the repo?

    Conclude which stack applies. This skill supports only prometheus+grafana+k8s. If detection shows Datadog, New Relic, CloudWatch, Honeycomb, or any non-Prometheus backend as the primary, STOP and report the detected stack to the user; recommend a dedicated skill for that stack instead of forcing Prometheus config onto a mismatched environment.
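The decision at the end of this step can be sketched as a small shell function. This is a minimal sketch, not part of the skill: the function name `classify_stack`, its two arguments, and the three result labels are hypothetical. A grep hit only shows that an APM library is referenced, so the sketch asks for a human review rather than concluding that the other backend is primary.

```shell
# classify_stack CRD_OUTPUT APM_MATCHES   (hypothetical helper)
# CRD_OUTPUT  : stdout of the "kubectl get crd prometheuses..." check
# APM_MATCHES : stdout of the grep over dependency manifests
# Prints one of: review-apm, operator-present, greenfield.
classify_stack() {
  crd_output=$1
  apm_matches=$2
  if [ -n "$apm_matches" ]; then
    # A competing APM dependency is referenced: confirm with the user
    # whether it is the primary backend before going any further.
    echo "review-apm"
  elif [ -n "$crd_output" ]; then
    # The Prometheus Operator CRD exists: reuse or remove it in step 2.
    echo "operator-present"
  else
    echo "greenfield"
  fi
}

# Usage (needs kubectl and a reachable cluster):
#   crd=$(kubectl get crd prometheuses.monitoring.coreos.com -o name 2>/dev/null)
#   apm=$(grep -l 'datadog\|newrelic\|dd-trace\|opentelemetry' \
#         package.json requirements.txt go.mod Cargo.toml pom.xml 2>/dev/null | head)
#   classify_stack "$crd" "$apm"
```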

  2. Preflight. Run kubectl get nodes and kubectl version (the --short flag was removed in kubectl 1.28), and confirm the cluster has at least 4 vCPU and 8 GiB of memory free. If another Prometheus Operator is already installed (detected in step 1), either reuse it or uninstall the old operator first; do not install a second one.
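One rough way to approach the capacity check is to total what the nodes report as allocatable. The sketch below is an assumption-laden approximation: the helper name `sum_allocatable` is hypothetical, and allocatable capacity is an upper bound on free capacity because it ignores what running pods already request.

```shell
# sum_allocatable (hypothetical helper): reads one "CPU MEMORY" pair per
# node on stdin and prints the cluster-wide total in vCPU and GiB.
sum_allocatable() {
  awk '
    # CPU is reported either in whole cores ("4") or millicores ("3920m").
    function cpu(v) { return (v ~ /m$/) ? substr(v, 1, length(v) - 1) / 1000 : v + 0 }
    # Memory is reported with a Ki, Mi or Gi suffix, or as plain bytes.
    function gib(v) {
      if (v ~ /Ki$/) return substr(v, 1, length(v) - 2) / 1048576
      if (v ~ /Mi$/) return substr(v, 1, length(v) - 2) / 1024
      if (v ~ /Gi$/) return substr(v, 1, length(v) - 2) + 0
      return v / 1073741824
    }
    { c += cpu($1); m += gib($2) }
    END { printf "%.1f vCPU %.1f GiB\n", c, m }
  '
}

# Usage (needs kubectl and a reachable cluster):
#   kubectl get nodes --no-headers \
#     -o custom-columns=CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory |
#     sum_allocatable
```

If the total is close to the 4 vCPU / 8 GiB threshold, check existing requests per node (for example with kubectl describe node) before proceeding.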

  3. Add the Helm repo and create the namespace:

    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
    kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -

  4. Render a values file kps-values.yaml that pins retention, storage, scrape interval, and routing. Key stanzas:

    prometheus:
      prometheusSpec:
        retention: 30d
        scrapeInterval: 30s
        evaluationInterval: 30s
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: gp3
              resources:
                requests:
                  storage: 100Gi
        serviceMonitorSelectorNilUsesHelmValues: false
        ruleSelectorNilUsesHelmValues: false
    grafana:
      adminPassword: "__replace_me__"
      defaultDashboardsEnabled: true
      persistence:
        enabled: true
        size: 10Gi
    alertmanager:
      config:
        route:
          receiver: slack-warnings
          group_by: ["alertname", "namespace", "severity"]
          group_wait: 30s
          group_interval: 5m
          repeat_interval: 4h
          routes:
            - receiver: pagerduty-critical
              matchers: ['severity="critical"']
              continue: true
            - receiver: slack-critical
              matchers: ['severity="critical"']
            - receiver: slack-warnings
              matchers: ['severity="warning"']
            - receiver: "null"
              matchers: ['severity="info"']
        receivers:
          - name: "null"
          - name: slack-warnings
            slack_configs:
              - api_url: "__SLACK_WARN_WEBHOOK__"
                channel: "#alerts-warn"
                send_resolved: true
                title: '{{ template "slack.default.title" . }}'
                text: '{{ template "slack.default.text" . }}'
          - name: slack-critical
            slack_configs:
              - api_url: "__SLACK_CRIT_WEBHOOK__"
                channel: "#alerts-critical"
                send_resolved: true
          - name: pagerduty-critical
            pagerduty_configs:
              - routing_key: "__PAGERDUTY_KEY__"
                severity: critical
                send_resolved: true

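    Replace the three Alertmanager placeholder tokens (__SLACK_WARN_WEBHOOK__, __SLACK_CRIT_WEBHOOK__, __PAGERDUTY_KEY__) and the Grafana adminPassword placeholder with real secrets sourced from a sealed-secret, External Secrets Operator, or kubectl create secret.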
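A guard like the following could catch a values file that still contains placeholder tokens before it reaches Helm. It is a sketch only: the helper name `check_placeholders` is hypothetical, and it assumes every placeholder follows the double-underscore pattern used in the stanzas above.

```shell
# check_placeholders FILE (hypothetical helper)
# Fails, listing the offending lines, if FILE still contains a token
# shaped like __NAME__ (for example __SLACK_WARN_WEBHOOK__).
check_placeholders() {
  if grep -n '__[A-Za-z][A-Za-z_]*__' "$1"; then
    echo "unreplaced placeholders in $1" >&2
    return 1
  fi
}

# Usage:
#   check_placeholders kps-values.yaml && echo "values file is ready"
```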

  5. Install or upgrade:

    helm upgrade --install kps prometheus-community/kube-prometheus-stack \
      --namespace monitoring --version 55.5.0 -f kps-values.yaml --wait --timeout 15m

  6. Validate the install:

    kubectl -n monitoring get pods
    kubectl -n monitoring get servicemonitors
    kubectl -n monitoring port-forward svc/kps-kube-prometheus-stack-prometheus 9090:9090 &
    sleep 3  # give the background port-forward time to bind before querying
    curl -s localhost:9090/api/v1/targets | jq '.data.activeTargets | length'

  7. Apply baseline alerts. Copy references/alertmanager-rules-template.yaml, substitute the namespace label if required, validate with promtool check rules (after extracting the spec.groups with yq), then kubectl apply -f.
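The extract-then-validate part of this step might look like the sketch below. It assumes the Go-based yq v4 expression syntax and a single PrometheusRule per file; the helper name `validate_rules` is hypothetical. promtool expects a plain rules file with a top-level groups key, which is why the PrometheusRule wrapper is stripped first.

```shell
# validate_rules MANIFEST (hypothetical helper)
# Extracts spec.groups from a PrometheusRule manifest into a plain
# Prometheus rules file and runs promtool against it.
validate_rules() {
  manifest=$1
  groups_file=$(mktemp)
  # Assumption: yq v4 syntax; other yq implementations differ.
  yq '{"groups": .spec.groups}' "$manifest" > "$groups_file" &&
    promtool check rules "$groups_file"
  status=$?
  rm -f "$groups_file"
  return "$status"
}

# Usage (the local file name is illustrative):
#   validate_rules baseline-alerts.yaml && kubectl apply -f baseline-alerts.yaml
```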

  8. Write ServiceMonitors for each input service. Template:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: <app>-sm
      namespace: monitoring
      labels:
        release: kps
    spec:
      namespaceSelector:
        matchNames: ["<app-ns>"]
      selector:
        matchLabels:
          app.kubernetes.io/name: <app>
      endpoints:
        - port: <metrics-port-name>
          path: /metrics
          interval: 30s
          scrapeTimeout: 10s

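    Confirm up{job="<service-name>"} == 1 in Prometheus. Unless the ServiceMonitor sets jobLabel, the job label is the name of the scraped Service, not the name of the ServiceMonitor.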
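For several services, the template can be stamped out and the result checked with two small helpers. Both names (`render_servicemonitor`, `target_is_up`) are hypothetical, the check assumes the port-forward from step 6 is still listening on localhost:9090, and matching JSON with grep is a deliberately crude stand-in for a proper JSON parser.

```shell
# render_servicemonitor APP APP_NAMESPACE PORT_NAME (hypothetical helper)
# Prints the ServiceMonitor template from this step with values filled in.
render_servicemonitor() {
  cat <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: $1-sm
  namespace: monitoring
  labels:
    release: kps
spec:
  namespaceSelector:
    matchNames: ["$2"]
  selector:
    matchLabels:
      app.kubernetes.io/name: $1
  endpoints:
    - port: $3
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
EOF
}

# target_is_up JOB (hypothetical helper)
# Succeeds when Prometheus returns at least one up{job="JOB"} sample equal to 1.
target_is_up() {
  curl -s -G 'http://localhost:9090/api/v1/query' \
    --data-urlencode "query=up{job=\"$1\"} == 1" |
    grep -q '"value":\[[0-9.]*,"1"\]'
}

# Usage:
#   render_servicemonitor api default http-metrics | kubectl apply -f -
#   target_is_up api && echo "api is being scraped"
```

Allow at least one scrape interval (30s here) between applying the ServiceMonitor and running the check.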

  9. Import dashboards. From Grafana UI or via ConfigMap with label grafana_dashboard: "1", apply references/grafana-dashboard-templates.json. Also import community dashboards: 1860 (node-exporter), 13332 (kube-state-metrics), 7249 (kubelet), 15661 (Kubernetes overview).
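For the ConfigMap route, a dashboard JSON file can be wrapped in a labelled manifest along these lines. This is a sketch: the helper name `dashboard_configmap` is hypothetical, and it assumes the Grafana sidecar watches the monitoring namespace for the grafana_dashboard label.

```shell
# dashboard_configmap NAME JSON_FILE (hypothetical helper)
# Prints a ConfigMap that embeds JSON_FILE and carries the label the
# Grafana dashboard sidecar looks for.
dashboard_configmap() {
  name=$1
  json=$2
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: $name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  $(basename "$json"): |
$(sed 's/^/    /' "$json")
EOF
}

# Usage:
#   dashboard_configmap golden-signals references/grafana-dashboard-templates.json |
#     kubectl apply -f -
```

Very large dashboards can exceed the size limit for a single ConfigMap, in which case importing through the Grafana UI is the simpler path.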

  10. Fire a synthetic alert end-to-end:

    kubectl -n monitoring run stress --image=polinux/stress --restart=Never -- \
      stress --cpu 4 --timeout 600s

    Watch #alerts-warn for the HighCPU notification. Remove the pod afterward.

  11. Document. Emit a short report listing the Helm release name, chart version, active targets count, firing alerts count, dashboards imported, and the Alertmanager receivers configured. Store the values file and secrets source in the ops repo.
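The report itself is a handful of lines and could be produced by something as simple as the sketch below. The helper name `print_report` and its argument order are hypothetical; the values are expected to come from the earlier steps, such as the target count from step 6.

```shell
# print_report RELEASE VERSION TARGETS FIRING DASHBOARDS RECEIVERS
# (hypothetical helper) Prints the summary handed back to the user.
print_report() {
  cat <<EOF
$1 $2 installed in monitoring
Active scrape targets: $3
Firing alerts: $4
Dashboards imported: $5
Alertmanager receivers: $6
EOF
}

# Usage:
#   print_report kps 55.5.0 42 0 4 "slack-warnings, slack-critical, pagerduty-critical, null"
```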

Examples

Happy path: greenfield cluster, 2 services, Slack + PagerDuty

Inputs: cluster prod-eu, namespace monitoring, services api in namespace default (port http-metrics) and worker in namespace jobs (port metrics), Slack #alerts-warn + #alerts-critical, PagerDuty key for the SRE service.

Steps executed:

  1. helm upgrade --install kps ... --version 55.5.0.
  2. Two ServiceMonitors applied, both show up == 1.
  3. PrometheusRule baseline-alerts applied from the template.
  4. Grafana dashboard Golden Signals imported; loads with data within 2 minutes.
  5. Stress pod triggers HighCPU → Slack #alerts-warn at severity warning; HighErrorRate simulated via fault injection triggers PagerDuty incident. Resolve paths confirmed.

Report delivered to the user:

kps 55.5.0 installed in monitoring
Active scrape targets: 42
Firing alerts: 0
Dashboards imported: 4 (Golden Signals, node-exporter, kube-state-metrics, kubelet)
Alertmanager receivers: slack-warnings, slack-critical, pagerduty-critical, null

Edge case: existing Prometheus Operator, no persistent storage

The cluster has a prior prometheus-operator install from 2019 and no default StorageClass. Approach:

  1. Detect with kubectl get crd prometheuses.monitoring.coreos.com -o yaml | yq '.metadata.labels'. Old install has no app.kubernetes.io/managed-by: Helm.
  2. Refuse to overwrite. Offer two paths: (a) adopt the existing CRDs by installing the chart with crds.enabled=false and matching selector labels; (b) uninstall the legacy operator after exporting its alert rules with kubectl get prometheusrules -A -o yaml > legacy-rules.yaml.
  3. For storage: if no StorageClass, set prometheus.prometheusSpec.storageSpec: {} for emptyDir (with an explicit warning that data is lost on pod restart), or create a StorageClass first. Never silently accept data loss.
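The storage decision can be made explicit with a check like the one below, so the emptyDir fallback is never taken quietly. It is a sketch: the helper names `has_default_storageclass` and `storage_mode` are hypothetical, and it relies on kubectl marking the default class with "(default)" in its table output.

```shell
# has_default_storageclass (hypothetical helper)
# Succeeds when kubectl lists a StorageClass marked "(default)".
has_default_storageclass() {
  kubectl get storageclass --no-headers 2>/dev/null | grep -q '(default)'
}

# storage_mode (hypothetical helper)
# Prints "persistent" or "emptydir"; the fallback always warns on stderr.
storage_mode() {
  if has_default_storageclass; then
    echo "persistent"
  else
    echo "WARNING: no default StorageClass; emptyDir loses all metrics on pod restart" >&2
    echo "emptydir"
  fi
}

# Usage:
#   mode=$(storage_mode)
#   [ "$mode" = "emptydir" ] && echo "confirm data loss with the user before installing"
```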

Constraints

Quality checks
