
log-aggregation

Use when a user wants to centralise Kubernetes logs, install Loki + Promtail or ELK (Elasticsearch + Logstash/Fluent Bit + Kibana), configure retention, wire log shipping from pods, or tune label/index hygiene. Picks the lightweight (Loki) or heavyweight (ELK) stack based on scale and budget, installs, validates ingestion, and produces a LogQL or KQL query cheat sheet.

Department

Infrastructure

Safety

writes-shared
Writes shared state

Supported stacks

loki+promtail+k8s
elk+k8s

When to use

Do not use this skill for metrics (see monitoring-setup), distributed tracing (install Tempo/Jaeger separately), or on-host syslog collection outside Kubernetes.

Inputs

Outputs

Tool dependencies

Procedure

Detect the stack

Run these read-only commands before anything else and record findings:

Conclude which of the two supported stacks applies, if either. This skill supports only loki+promtail+k8s and elk+k8s. If Datadog Logs, Splunk, Sumo Logic, Logz.io, or an unsupported shipper (Vector writing directly to S3, Fluentd to a custom sink) is the primary pipeline, STOP and report the detected stack. Do not layer a second log pipeline on top of an existing one: the cost and complexity are rarely justified without an explicit cut-over plan.
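The classification step can be sketched as a small shell helper. This is a minimal sketch, not the skill's actual detection logic: the function name detect_log_stack and the name patterns it matches are assumptions, and matching on workload names is only a heuristic (ECK-managed StatefulSets, for instance, may not contain "elasticsearch" in their names).

```shell
# Hypothetical helper: classify the logging stack from workload names on stdin.
# The name patterns below are illustrative assumptions, not an exhaustive list.
detect_log_stack() {
  dls_names=$(cat)
  if printf '%s\n' "$dls_names" | grep -Eiq 'datadog|splunk|sumologic|logzio'; then
    echo "unsupported"          # STOP and report; do not add a second pipeline
  elif printf '%s\n' "$dls_names" | grep -Eiq 'loki|promtail'; then
    echo "loki+promtail+k8s"
  elif printf '%s\n' "$dls_names" | grep -Eiq 'elasticsearch|filebeat|kibana'; then
    echo "elk+k8s"
  else
    echo "none"
  fi
}

# Usage (read-only; requires cluster access):
#   kubectl get daemonsets,deployments,statefulsets -A -o name | detect_log_stack
```

Treat the result as a starting point and confirm it by inspecting the shipper's configuration before installing anything.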

Choosing the stack

Pick Loki when:

Pick ELK when:

Loki + Promtail install

  1. Create namespace and secrets:
    kubectl create namespace logging --dry-run=client -o yaml | kubectl apply -f -
    kubectl -n logging create secret generic loki-s3 \
      --from-literal=AWS_ACCESS_KEY_ID=... \
      --from-literal=AWS_SECRET_ACCESS_KEY=...
  2. Add Helm repo:
    helm repo add grafana https://grafana.github.io/helm-charts
    helm repo update
  3. Write loki-values.yaml pinning retention and object store:
    loki:
      auth_enabled: false
      schemaConfig:
        configs:
          - from: "2024-01-01"
            store: tsdb
            object_store: s3
            schema: v13
            index:
              prefix: index_
              period: 24h
      storage:
        type: s3
        bucketNames:
          chunks: loki-chunks
          ruler: loki-ruler
          admin: loki-admin
        s3:
          region: eu-west-1
      limits_config:
        retention_period: 336h   # 14d
        max_global_streams_per_user: 5000
        ingestion_rate_mb: 8
        ingestion_burst_size_mb: 16
      compactor:
        retention_enabled: true
        delete_request_store: s3
    deploymentMode: SimpleScalable
    backend:
      replicas: 2
    read:
      replicas: 2
    write:
      replicas: 3
  4. Install Loki:
    helm upgrade --install loki grafana/loki --namespace logging -f loki-values.yaml --wait
  5. Install Promtail (or Grafana Alloy) as a DaemonSet:
    helm upgrade --install promtail grafana/promtail --namespace logging \
      --set "config.clients[0].url=http://loki-gateway/loki/api/v1/push" --wait
  6. Validate:
    kubectl -n logging port-forward svc/loki-gateway 3100:80 &
    logcli --addr=http://localhost:3100 query --limit 5 '{namespace="default"}'
    kubectl -n logging logs ds/promtail --tail=50 | grep -i "error"
  7. Register Loki as a Grafana datasource (URL: http://loki-gateway.logging.svc:80). Test a query from Grafana Explore: {namespace="kube-system"} |= "error".
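The retention_period in step 3 is written in hours (336h for 14 days). A minimal sketch of the conversion, assuming whole-day retention windows; retention_hours is a hypothetical helper name:

```shell
# Hypothetical helper: convert a retention window in whole days to the
# hour form used for retention_period in loki-values.yaml.
retention_hours() {
  echo "$(( $1 * 24 ))h"
}

retention_hours 14   # prints 336h
```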

ELK install (Elastic Operator + Filebeat)

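  1. Create the logging namespace if it does not exist, then install the ECK operator:
    kubectl create namespace logging --dry-run=client -o yaml | kubectl apply -f -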
    kubectl create -f https://download.elastic.co/downloads/eck/2.14.0/crds.yaml
    kubectl apply -f https://download.elastic.co/downloads/eck/2.14.0/operator.yaml
  2. Create the Elasticsearch cluster:
    apiVersion: elasticsearch.k8s.elastic.co/v1
    kind: Elasticsearch
    metadata:
      name: logs
      namespace: logging
    spec:
      version: 8.13.4
      nodeSets:
        - name: master
          count: 3
          config:
            node.roles: ["master"]
          volumeClaimTemplates:
            - metadata: { name: elasticsearch-data }
              spec:
                accessModes: [ReadWriteOnce]
                storageClassName: gp3
                resources: { requests: { storage: 50Gi } }
        - name: data-hot
          count: 3
          config:
            node.roles: ["data_hot", "data_content", "ingest"]
          volumeClaimTemplates:
            - metadata: { name: elasticsearch-data }
              spec:
                accessModes: [ReadWriteOnce]
                storageClassName: gp3
                resources: { requests: { storage: 500Gi } }
  3. Kibana:
    apiVersion: kibana.k8s.elastic.co/v1
    kind: Kibana
    metadata:
      name: logs
      namespace: logging
    spec:
      version: 8.13.4
      count: 2
      elasticsearchRef: { name: logs }
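  4. Filebeat DaemonSet: use the Elastic-provided manifest from https://raw.githubusercontent.com/elastic/beats/8.13/deploy/kubernetes/filebeat-kubernetes.yaml. Patch the ELASTICSEARCH_HOST and ELASTICSEARCH_PORT environment variables to point at the logs-es-http service in the logging namespace (port 9200), and supply credentials through ELASTICSEARCH_USERNAME and ELASTICSEARCH_PASSWORD.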
  5. Apply an ILM policy for 30d hot + 90d warm + delete:
    PUT _ilm/policy/k8s-logs
    {
      "policy": {
        "phases": {
          "hot":   { "actions": { "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" } } },
          "warm":  { "min_age": "30d", "actions": { "shrink": { "number_of_shards": 1 }, "forcemerge": { "max_num_segments": 1 } } },
          "delete":{ "min_age": "120d", "actions": { "delete": {} } }
        }
      }
    }
  6. Register the index template pointing at the k8s-logs ILM policy and the data stream name logs-k8s-*.
  7. Validate:
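    PASS=$(kubectl -n logging get secret logs-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')
    kubectl -n logging port-forward svc/logs-es-http 9200:9200 &
    curl -sk -u "elastic:$PASS" 'https://localhost:9200/_cat/indices?v' | head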
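Step 6 can be sketched as follows. This is an illustration under stated assumptions, not the skill's own template: the helper name index_template_body, the template name k8s-logs, and the priority of 200 (chosen to outrank the built-in logs-*-* template, which overlaps this pattern) are all assumptions.

```shell
# Hypothetical helper: print an index template body that binds a data stream
# pattern ($1) to an ILM policy ($2).
index_template_body() {
  cat <<EOF
{
  "index_patterns": ["$1"],
  "data_stream": {},
  "priority": 200,
  "template": {
    "settings": { "index.lifecycle.name": "$2" }
  }
}
EOF
}

# Usage, assuming the port-forward from the validation step is running and
# $PASS holds the elastic user's password:
#   index_template_body 'logs-k8s-*' k8s-logs |
#     curl -sk -u "elastic:$PASS" -X PUT \
#       'https://localhost:9200/_index_template/k8s-logs' \
#       -H 'Content-Type: application/json' --data-binary @-
```

Generating the body separately makes it easy to review or diff the template before it is sent to the cluster.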

Label and index hygiene

Examples

Happy path: Loki on EKS, 80 GiB/day

Cluster: EKS 1.29, S3 bucket acme-loki-chunks pre-created. 12 namespaces.

Output report:

Stack: Loki 2.9, Promtail 2.9, deploymentMode: SimpleScalable
Retention: 14d (336h)
Replicas: write=3, read=2, backend=2
Promtail pods: 6 (one per node)
Active streams: 412
Grafana datasource: http://loki-gateway.logging.svc:80
Cheat sheet:
  {namespace="payments", app="api"} |= "ERROR" | json | line_format "{{.msg}}"
  rate({namespace="payments"}[5m])
  {namespace=~".+"} |~ "trace_id=abc123"

Edge case: ELK replacing a legacy Fluentd + ES stack

A 2020-era Fluentd DaemonSet is shipping to a self-managed ES 6.8. Approach:

  1. Stand up new ECK cluster (8.13) alongside the old one.
  2. Dual-write for 7 days: leave Fluentd shipping to old ES, add Filebeat shipping to new ES. Compare volumes daily.
  3. Cut Kibana UI to the new cluster; keep the old cluster read-only for 30 days.
  4. Decommission Fluentd and old ES once the grace window expires and a restore test has succeeded against a frozen snapshot.

Never hot-migrate indices across major ES versions; always dual-write and cut.
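The daily volume comparison in step 2 can be reduced to a single drift figure. A minimal sketch, assuming you already have one day's document count from each cluster (for example from each cluster's _count API); volume_drift_pct is a hypothetical helper, and what counts as acceptable drift is left to you:

```shell
# Hypothetical helper: absolute difference between two daily document counts,
# as a whole-number percentage of the old cluster's count.
# $1 = count from the old cluster, $2 = count from the new cluster.
volume_drift_pct() {
  vd_old=$1
  vd_new=$2
  if [ "$vd_old" -eq 0 ]; then
    echo "n/a"                  # avoid dividing by zero
    return 0
  fi
  vd_diff=$(( vd_old - vd_new ))
  if [ "$vd_diff" -lt 0 ]; then
    vd_diff=$(( -vd_diff ))
  fi
  echo $(( vd_diff * 100 / vd_old ))
}

volume_drift_pct 1000000 950000   # prints 5
```

Integer division truncates, so small drifts below 1% report as 0; compare raw counts as well when volumes are low.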

Constraints

Quality checks
