network-diagnostics

Use when a user reports connectivity failures ("can't reach X"), DNS issues, TLS handshake errors, timeouts, or suspected firewall/NetworkPolicy problems. Walks a layered flow from DNS to TCP to TLS to application; audits K8s NetworkPolicy, cloud firewall / security group / NSG rules, and MTU; and emits a structured diagnosis naming the exact failing layer and the fix.

Department

Infrastructure

Safety

Safe · read-only

Supported stacks

Stack-agnostic — no detection required.

When to use

Do not use this skill for application-level 4xx/5xx without a connectivity symptom (route that to the owning service’s logs and monitoring-setup).

Inputs

Outputs

Tool dependencies

Procedure

Walk these layers in order. Stop at the first failure and fix before continuing; downstream layers cannot pass if upstream layers fail.
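The stop-at-first-failure rule can be sketched as a small runner. This is a minimal illustration, not part of the skill itself: `run_layers` and the `check_*` names in the comments are hypothetical, and each check is assumed to be a function or executable that takes no arguments and returns non-zero on failure.

```shell
# Run named checks in order. After the first failure, later layers are
# reported as skipped instead of being run, since their results would be
# meaningless. Each argument has the form "LAYER=command".
# Returns 0 only if every layer passed.
run_layers() {
  local spec layer cmd failed=""
  for spec in "$@"; do
    layer=${spec%%=*}   # text before the first "="
    cmd=${spec#*=}      # text after the first "="
    if [ -n "$failed" ]; then
      echo "$layer skipped (blocked by $failed)"
    elif "$cmd"; then
      echo "$layer ok"
    else
      echo "$layer FAIL"
      failed=$layer
    fi
  done
  [ -z "$failed" ]
}

# Example wiring (hypothetical check functions, not run here):
#   check_dns() { getent hosts api.example.com >/dev/null; }
#   check_tcp() { nc -z -w 3 api.example.com 443; }
#   run_layers DNS=check_dns TCP=check_tcp
```

Passing the checks as names keeps the runner independent of any one tool, so the same loop works from a laptop, a node, or a debug pod.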

1. DNS resolution

From the affected source:

dig +short A api.example.com
dig +short AAAA api.example.com
dig +trace api.example.com            # authoritative path
nslookup api.example.com 8.8.8.8      # bypass the local resolver
getent hosts api.example.com          # what libc sees (includes /etc/hosts, NSS)
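To turn the local-versus-public comparison into a single result, something like the following may help. It is a sketch under assumptions: `resolve_via` and `compare_answers` are hypothetical helper names, only IPv4 A records are compared, and `dig` is assumed to be installed.

```shell
# Print the A records a resolver returns for a name, sorted, one per line.
# Usage: resolve_via <name> [resolver-ip]   (no resolver = system default)
resolve_via() {
  if [ -n "${2:-}" ]; then
    dig +short A "$1" "@$2"
  else
    dig +short A "$1"
  fi | grep -E '^[0-9]+(\.[0-9]+){3}$' | sort -u   # drop CNAME lines, keep IPs
}

# Compare two newline-separated answer sets, ignoring order.
# Prints "empty-local" if the first set is empty, else "match" or "mismatch".
compare_answers() {
  local a b
  a=$(printf '%s\n' "$1" | sort -u)
  b=$(printf '%s\n' "$2" | sort -u)
  if [ -z "$1" ]; then
    echo "empty-local"
  elif [ "$a" = "$b" ]; then
    echo "match"
  else
    echo "mismatch"
  fi
}

# Example (not run here):
#   compare_answers "$(resolve_via api.example.com)" \
#                   "$(resolve_via api.example.com 8.8.8.8)"
```

A "mismatch" is only a lead, not a verdict: split-horizon DNS, geo-aware answers, and round-robin pools can all return different addresses legitimately.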

Inside Kubernetes:

kubectl exec -n <ns> <pod> -- nslookup api.example.com
kubectl exec -n <ns> <pod> -- cat /etc/resolv.conf
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=200 | grep -i error

Common failures:

2. TCP reachability

nc -vz api.example.com 443         # from source
mtr -rwc 50 api.example.com        # hop-by-hop loss / latency
traceroute -T -p 443 api.example.com
ss -tn state established           # existing connections on the host
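The way a TCP probe fails usually narrows the cause, so it can be worth classifying the error text rather than just recording "failed". The helper below is a hypothetical sketch; the matched phrases are assumptions about common `nc` / `ncat` wording and may need adjusting for the variant installed.

```shell
# Map the text printed by `nc -vz` to a coarse outcome keyword.
#   refused     -> host reachable, but nothing listening or an active reject
#   timeout     -> packets silently dropped (firewall, SG, NetworkPolicy, routing)
#   unreachable -> no route, or an ICMP unreachable came back
#   dns         -> the name never resolved; go back to the DNS layer
#   open        -> the TCP handshake completed
classify_tcp_error() {
  local msg
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *refused*)                            echo "refused" ;;
    *"timed out"*|*timeout*)              echo "timeout" ;;
    *"no route"*|*unreachable*)           echo "unreachable" ;;
    *"not known"*|*"could not resolve"*)  echo "dns" ;;
    *succeeded*|*connected*)              echo "open" ;;
    *)                                    echo "unknown" ;;
  esac
}

# Example (not run here):
#   classify_tcp_error "$(nc -vz -w 5 api.example.com 443 2>&1)"
```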

Expected outcomes:

3. TLS handshake

openssl s_client -connect api.example.com:443 -servername api.example.com -showcerts </dev/null
openssl s_client -connect api.example.com:443 -servername api.example.com -tls1_2 </dev/null
curl -v --resolve api.example.com:443:<ip> https://api.example.com/healthz
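Certificate expiry is one property that is easy to check mechanically from the handshake output. The sketch below assumes GNU `date` (for `-d`) and the `notAfter=...` line format printed by `openssl x509 -noout -enddate`; `cert_days_left` is a hypothetical helper name.

```shell
# Whole days between "now" and a certificate's notAfter timestamp.
#   $1: line such as "notAfter=Jan 11 00:00:00 2030 GMT"
#   $2: optional "now" in epoch seconds (defaults to the current time)
# A result of 0 or below means the certificate has expired or is within
# a day of expiring.
cert_days_left() {
  local end_epoch now
  end_epoch=$(date -u -d "${1#notAfter=}" +%s) || return 1   # GNU date only
  now=${2:-$(date -u +%s)}
  echo $(( (end_epoch - now) / 86400 ))
}

# Example (not run here):
#   enddate=$(openssl s_client -connect api.example.com:443 \
#               -servername api.example.com </dev/null 2>/dev/null \
#             | openssl x509 -noout -enddate)
#   cert_days_left "$enddate"
```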

Inspect for:

4. K8s NetworkPolicy audit

kubectl -n <dst-ns> get networkpolicies
kubectl -n <dst-ns> describe networkpolicy <name>
kubectl get endpoints -n <dst-ns> <svc>

NetworkPolicies are additive allow-lists. Once any policy with an Ingress policy type selects the target pod, all ingress to that pod that no policy explicitly allows is denied. If NetworkPolicies exist in <dst-ns>, confirm:

Debug from inside with netshoot:

kubectl run netshoot -n <src-ns> --rm -it --image=nicolaka/netshoot --labels="app=probe" --restart=Never -- \
  curl -v --max-time 5 http://<svc>.<dst-ns>.svc.cluster.local/

If it fails, temporarily apply a permissive NetworkPolicy to the destination and retry. If it now passes, write the minimal allow policy and remove the permissive one.
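A temporary permissive policy could look like the one generated below. This is a sketch: the helper name `permissive_policy` and the policy name `tmp-allow-all-ingress` are made up for illustration, and the policy selects destination pods by a single label. The single empty rule under `ingress` allows traffic from any source on any port.

```shell
# Print a temporary allow-all-ingress NetworkPolicy for pods with one label.
# Usage: permissive_policy <namespace> <label-key> <label-value>
permissive_policy() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tmp-allow-all-ingress
  namespace: $1
spec:
  podSelector:
    matchLabels:
      $2: "$3"
  policyTypes: [Ingress]
  ingress:
    - {}
EOF
}

# Example (not run here):
#   permissive_policy <dst-ns> app redis | kubectl apply -f -
#   ...retry the netshoot probe...
#   kubectl -n <dst-ns> delete networkpolicy tmp-allow-all-ingress
```

Because policies are additive, this overrides a default-deny for the selected pods only while it exists, so delete it as soon as the test is done.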

5. Cloud firewall / Security Group / NSG

For a pod-to-RDS case, the SG chain is: node SG → RDS SG. The RDS SG must allow inbound tcp/5432 (the PostgreSQL default port) from the node SG, referenced by SG ID rather than by CIDR.

6. MTU

Symptom: small requests (curl http://svc/health) work, larger requests (uploads, streaming) hang.

# verify path MTU
ping -M do -s 1472 api.example.com     # 1472 + 28 = 1500
ping -M do -s 1420 api.example.com     # 1420 + 28 = 1448, fits a 1450-byte VXLAN MTU (50 bytes of overhead)

# show the current pod interface MTU; if it exceeds the path MTU, lower it via CNI config or an init script
ip link show eth0 | awk '/mtu/ {print $5}'

Typical needed MTUs:

If path MTU is lower than the interface MTU and PMTUD is broken (ICMP Frag-Needed being dropped), TCP will hang on large writes. Fix: either lower the pod/VPN MTU, or ensure ICMP type 3 code 4 is allowed end-to-end.
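Rather than guessing payload sizes, the largest working payload can be found by binary search. The sketch below is illustrative only: `icmp_payload_for_mtu`, `probe_ping`, and `max_payload` are hypothetical names, the 28-byte figure assumes IPv4 without options (20-byte IP header plus 8-byte ICMP header), and `probe_ping` assumes Linux iputils `ping` flags.

```shell
# ICMP echo payload size that exactly fills a given IPv4 MTU.
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# One don't-fragment ping of the given payload size.
# Usage: probe_ping <host> <payload>
probe_ping() {
  ping -M do -c 1 -W 2 -s "$2" "$1" >/dev/null 2>&1
}

# Binary-search the largest payload in [lo, hi] for which the probe succeeds.
# The probe command is called with the payload appended as its last argument.
# Prints the payload, or returns 1 if even <lo> fails.
# Usage: max_payload <lo> <hi> <probe command...>
max_payload() {
  local lo=$1 hi=$2 mid
  shift 2
  "$@" "$lo" || return 1
  while [ "$lo" -lt "$hi" ]; do
    mid=$(( (lo + hi + 1) / 2 ))   # round up so the loop always makes progress
    if "$@" "$mid"; then
      lo=$mid
    else
      hi=$(( mid - 1 ))
    fi
  done
  echo "$lo"
}

# Example (not run here):
#   p=$(max_payload 1200 1472 probe_ping api.example.com)
#   echo "path MTU is about $(( p + 28 ))"
```

Taking the probe as a command keeps the search logic testable without a network, and lets a different probe be swapped in where ICMP echo is blocked.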

7. Final summary

Emit the report:

Layer              | Test                               | Result     | Evidence
DNS                | dig +short api.example.com         | ok         | 10.1.2.3
TCP                | nc -vz 10.1.2.3 443                | timeout    | 3x retry, no SYN-ACK
Firewall           | SG sg-abc outbound 443 to sg-xyz   | MISSING    | aws ec2 describe-security-groups ...
TLS                | skipped                            | -          | blocked by TCP
App                | skipped                            | -          | blocked by TCP

Verdict: Egress security group sg-abc is missing a rule allowing tcp/443 to sg-xyz (RDS proxy SG).
Fix: aws ec2 authorize-security-group-egress --group-id sg-abc --ip-permissions IpProtocol=tcp,FromPort=443,ToPort=443,UserIdGroupPairs='[{GroupId=sg-xyz}]'
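Rows in this layout can be produced with a single `printf`. The helper below is a hypothetical sketch; its column widths (18, 34, and 10 characters) are read off the sample table above and are not a fixed requirement.

```shell
# Print one report row: layer, test, result, and free-form evidence.
# Usage: report_row <layer> <test> <result> [evidence]
report_row() {
  printf '%-18s | %-34s | %-10s | %s\n' "$1" "$2" "$3" "${4:-}"
}

# Example (not run here):
#   report_row Layer Test Result Evidence
#   report_row DNS "dig +short api.example.com" ok 10.1.2.3
#   report_row TLS skipped - "blocked by TCP"
```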

Examples

Happy path: pod cannot reach an in-cluster service after a NetworkPolicy rollout

Symptom: payments namespace pods can no longer reach redis.cache.svc.cluster.local:6379. Recent change: platform team rolled out default-deny policies to cache namespace.

Diagnosis:

DNS     | nslookup redis.cache.svc      | ok     | 10.96.12.34
TCP     | nc -vz 10.96.12.34 6379       | timeout
Policy  | kubectl -n cache get netpol   | found  | default-deny-ingress + allow-from-payments missing "app=api"

Fix:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-payments-api-to-redis
  namespace: cache
spec:
  podSelector: { matchLabels: { app: redis } }
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector: { matchLabels: { kubernetes.io/metadata.name: payments } }
          podSelector: { matchLabels: { app: api } }
      ports:
        - protocol: TCP
          port: 6379

Edge case: intermittent 502s only on file uploads from EU region

Small requests 200, uploads >1 MB fail with 502 after ~30 s. Only EU region, recently migrated to a new VPN tunnel.

Diagnosis: ping -M do -s 1472 fails with Message too long; succeeds at -s 1408 (1408 + 28 = 1436). The VPN tunnel MTU is 1436 but the pod interface is 1500, and the cloud firewall is dropping ICMP Frag-Needed. PMTUD broken → TCP hangs on large writes → proxy times out → 502.

Fix options:

Constraints

Quality checks
