When to use
- User says “we need backups”, “audit our backups”, “set up Velero”, “our RDS snapshots aren’t tested”, “what’s our RTO”, or describes a ransomware / accidental-delete scare.
- A new data-bearing workload is being introduced (new Postgres, new bucket).
- SOC 2 / ISO 27001 evidence required for the backup control.
- A restore has never been tested in anger.
- A workload’s RPO/RTO is unknown or out of date.
Do not use this skill for log retention (see log-aggregation), for metrics retention (see monitoring-setup), or for source-code backups (that is what your Git host and a periodic repo mirror cover).
Inputs
- Inventory of data-bearing systems: databases (engine, version, size), buckets (S3/GCS/Azure, size), K8s clusters (namespaces with persistent state), managed services (ElastiCache, MSK, etc.).
- Business tiering: which workloads are Tier 1 (customer-facing, payments), Tier 2 (internal critical), Tier 3 (nice-to-have).
- Target RPO / RTO per tier (defaults below).
- Regulatory constraints: data residency, retention minimums, encryption-at-rest requirements.
- Offsite destination (cross-region bucket, different cloud, or physical tape vault).
Default tiering (override with user input):
| Tier | Example workloads | RPO | RTO | Frequency | Retention |
|---|---|---|---|---|---|
| 1 | prod DB, payment bucket | 5 min | 1 h | Continuous WAL + 15-min snapshots | 30d + 7y PIT |
| 2 | prod K8s state, internal DBs | 1 h | 4 h | Hourly snapshots | 30d |
| 3 | dev, stage, analytics | 24 h | 24 h | Daily | 14d |
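The defaults in the table can be encoded once so that scripts and inventory tooling agree on them. The sketch below is a minimal illustration; the function name `tier_defaults` and its output format are hypothetical, and the values are simply the RPO/RTO columns above converted to minutes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the default RPO and RTO (in minutes) for a tier.
# Values mirror the default tiering table; override them per user input.
tier_defaults() {
  case "$1" in
    1) echo "rpo_min=5 rto_min=60" ;;        # Tier 1: 5 min / 1 h
    2) echo "rpo_min=60 rto_min=240" ;;      # Tier 2: 1 h / 4 h
    3) echo "rpo_min=1440 rto_min=1440" ;;   # Tier 3: 24 h / 24 h
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}
```

Usage: `tier_defaults 2` prints the Tier 2 defaults; an unknown tier returns a non-zero status so callers can stop early.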
Outputs
- Backup inventory table with columns:
  workload | tier | RPO | RTO | method | destination | encryption | schedule | retention | last restore test.
- Scheduled jobs / resources:
  - Velero `Schedule` resources for each K8s namespace group.
  - RDS / Aurora automated backups and cross-region snapshot copy.
  - Postgres `pgBackRest` or `wal-g` config for self-managed instances (logical `pg_dump` nightly + physical continuous WAL).
  - Bucket replication rules (S3 Cross-Region Replication, GCS dual-region, or Azure GRS).
- Encryption posture per backup: KMS key ID, algorithm, rotation cadence.
- Access policy: who can read backups, who can delete, object-lock / MFA-delete setup.
- A restore-test calendar: monthly sample restore, quarterly full restore per tier.
- The checklist from references/backup-checklist.md, filled in.
Tool dependencies
- `kubectl`, `helm`.
- `velero` CLI (≥ 1.14) for K8s state.
- `aws` / `gcloud` / `az` CLIs for managed services.
- `pgbackrest`, `wal-g`, `pg_dump`, `mysqldump`, `mongodump` for self-managed databases.
- `restic` or `kopia` for filesystem backups outside K8s.
- Kubernetes MCP, filesystem MCP, cloud provider MCPs for scheduling and audit.
Procedure
- Inventory and classify. List every data-bearing system with owner, size, tier, current backup state (or “none”). Use a spreadsheet-like table; this is the single source of truth.
- Apply the 3-2-1 rule per workload.
  - 3 copies: the live system + 2 backup copies.
  - 2 media: on-site (same region) + off-site (different region or cloud).
  - 1 offsite: immutable, ideally in a different trust domain (separate AWS account, different cloud).
  - Immutability: S3 Object Lock (Compliance mode), GCS Bucket Lock, Azure Immutable Blob Storage. A ransomware event must not be able to delete or overwrite backups.
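The 3-2-1 rule can be checked mechanically from the inventory table. The following is a minimal sketch, assuming the inventory already yields three numbers per workload: total copies (including the live system), distinct media or locations, and immutable offsite copies. The function name `check_321` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: validate the 3-2-1 rule for one workload.
# $1 = total copies (live system included)
# $2 = distinct media / locations
# $3 = immutable offsite copies
check_321() {
  local copies=$1 media=$2 offsite_immutable=$3
  [ "$copies" -ge 3 ] || { echo "FAIL: $copies copies, need 3"; return 1; }
  [ "$media" -ge 2 ] || { echo "FAIL: $media media, need 2"; return 1; }
  [ "$offsite_immutable" -ge 1 ] || { echo "FAIL: no immutable offsite copy"; return 1; }
  echo "OK"
}
```

Usage: `check_321 3 2 1` prints `OK`; a workload with only a same-region snapshot would be reported as `check_321 2 1 0` and fail on the first rule.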
- Databases: logical + physical.

  Managed (RDS, Aurora, Cloud SQL):

      # Keep 35 days of automated backups (the RDS maximum) and pin the backup window.
      aws rds modify-db-instance --db-instance-identifier prod-pg \
        --backup-retention-period 35 \
        --preferred-backup-window "03:00-04:00" --apply-immediately

      # Export a snapshot to S3, encrypted with the backup KMS key.
      aws rds start-export-task --export-task-identifier "weekly-prod-pg-$(date +%F)" \
        --source-arn arn:aws:rds:eu-west-1:...:snapshot:rds:prod-pg-YYYY-MM-DD \
        --s3-bucket-name acme-db-exports --iam-role-arn arn:aws:iam::...:role/rds-export \
        --kms-key-id alias/db-backups

  Configure cross-region automated backups (Aurora Global, RDS cross-region copy) and confirm target-region snapshots exist.

  Self-managed Postgres: `pgBackRest` is the default.

      # /etc/pgbackrest/pgbackrest.conf
      [global]
      repo1-type=s3
      repo1-path=/var/lib/pgbackrest
      repo1-s3-bucket=acme-pg-backups
      repo1-s3-endpoint=s3.eu-west-1.amazonaws.com
      repo1-s3-region=eu-west-1
      repo1-cipher-type=aes-256-cbc
      # repo1-cipher-pass is required once a cipher type is set; supply it and the
      # S3 credentials (or repo1-s3-key-type=auto on an instance role) out of band.
      repo1-retention-full=4
      repo1-retention-diff=7
      start-fast=y
      archive-async=y

      [prod]
      pg1-path=/var/lib/postgresql/15/main

  Schedule: full weekly Sunday 02:00, differential daily, incremental every 15 min. Enable WAL archiving for PITR.

  Add a logical nightly `pg_dump -Fc` to the same bucket for easy single-table restore.
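The pgBackRest schedule described above (weekly full on Sunday 02:00, daily differential, incremental every 15 minutes) could be expressed as a cron.d file. This is a sketch only: the function name and output path are hypothetical, the stanza name `prod` is taken from the config above, and running as the `postgres` user is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a cron.d-style pgBackRest schedule to $1
# (default: ./pgbackrest.cron). Review before copying it to /etc/cron.d/.
write_pgbackrest_cron() {
  local out="${1:-./pgbackrest.cron}"
  cat > "$out" <<'EOF'
# m   h  dom mon dow user     command
# Full backup: Sunday 02:00
0     2  *   *   0   postgres pgbackrest --stanza=prod --type=full backup
# Differential backup: Monday to Saturday 02:00
0     2  *   *   1-6 postgres pgbackrest --stanza=prod --type=diff backup
# Incremental backup: every 15 minutes
*/15  *  *   *   *   postgres pgbackrest --stanza=prod --type=incr backup
EOF
}
```

Note that the 02:00 incremental overlaps the full or differential run. pgBackRest should refuse to start a second backup for the same stanza while one holds the lock, so expect that incremental to fail; adjust the minute field if the noise is unwanted.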
- Object storage: replication + versioning + lifecycle.

      aws s3api put-bucket-versioning --bucket acme-prod-data \
        --versioning-configuration Status=Enabled
      aws s3api put-object-lock-configuration --bucket acme-prod-backups \
        --object-lock-configuration '{ "ObjectLockEnabled": "Enabled", "Rule": { "DefaultRetention": { "Mode": "COMPLIANCE", "Days": 35 }} }'
      aws s3api put-bucket-replication --bucket acme-prod-data \
        --replication-configuration file://repl.json
      aws s3api put-bucket-lifecycle-configuration --bucket acme-prod-data \
        --lifecycle-configuration file://lifecycle.json

  Lifecycle: current version 90d → Glacier IR, noncurrent 30d → Glacier, 7y → delete (or per regulatory).
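The commands above reference `repl.json` and `lifecycle.json` without showing them. The sketch below generates one plausible pair. Everything not in the original is an assumption: the destination bucket `acme-prod-data-replica`, the placeholder role ARN (left unfilled on purpose), the function name, and the reading of "7y → delete" as 2555 days applied to both current and noncurrent versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write repl.json and lifecycle.json into directory $1.
write_s3_configs() {
  local dir="${1:-.}"

  # Replication: copy every object to a bucket in another region/account.
  # Versioning must already be enabled on both source and destination.
  cat > "$dir/repl.json" <<'EOF'
{
  "Role": "arn:aws:iam::<ACCOUNT_ID>:role/<REPLICATION_ROLE>",
  "Rules": [
    {
      "ID": "crr-all-objects",
      "Priority": 1,
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": { "Bucket": "arn:aws:s3:::acme-prod-data-replica" }
    }
  ]
}
EOF

  # Lifecycle: current 90d -> Glacier IR, noncurrent 30d -> Glacier,
  # delete after roughly 7 years (2555 days; adjust per regulation).
  cat > "$dir/lifecycle.json" <<'EOF'
{
  "Rules": [
    {
      "ID": "tier-and-expire",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 90, "StorageClass": "GLACIER_IR" }
      ],
      "NoncurrentVersionTransitions": [
        { "NoncurrentDays": 30, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 2555 },
      "NoncurrentVersionExpiration": { "NoncurrentDays": 2555 }
    }
  ]
}
EOF
}
```

Delete-marker replication is disabled here so that an accidental or malicious delete on the source does not hide objects in the replica; this is a design choice, not something the original plan specifies.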
- Kubernetes state: Velero.

      velero install \
        --provider aws \
        --plugins velero/velero-plugin-for-aws:v1.10.0 \
        --bucket acme-velero --prefix prod-cluster \
        --backup-location-config region=eu-west-1 \
        --snapshot-location-config region=eu-west-1 \
        --secret-file ./credentials-velero \
        --use-node-agent --uploader-type=kopia --default-volumes-to-fs-backup

  Schedules:

      apiVersion: velero.io/v1
      kind: Schedule
      metadata:
        name: tier1-hourly
        namespace: velero
      spec:
        schedule: "0 * * * *"
        template:
          includedNamespaces: [payments, checkout]
          ttl: 720h0m0s
          storageLocation: default
          defaultVolumesToFsBackup: true
      ---
      apiVersion: velero.io/v1
      kind: Schedule
      metadata:
        name: cluster-daily
        namespace: velero
      spec:
        schedule: "0 2 * * *"
        template:
          includedNamespaces: ["*"]
          excludedNamespaces: [kube-system, velero]
          ttl: 336h0m0s

  Configure a second BackupStorageLocation in a different region (read-only primary for DR, writable for the mirror).
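A second BackupStorageLocation can be registered with the `velero` CLI. The sketch below prints the command by default and only executes it when `DRY_RUN=0`. The location name `dr-mirror`, the bucket `acme-velero-dr`, the region `eu-central-1`, and the helper names are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Print the command unless DRY_RUN=0 is set, so the sketch is safe to run anywhere.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Hypothetical helper: register a writable mirror location in a second region.
add_dr_location() {
  run velero backup-location create dr-mirror \
    --provider aws \
    --bucket acme-velero-dr \
    --prefix prod-cluster \
    --config region=eu-central-1 \
    --access-mode ReadWrite
}
```

A Schedule can then target the mirror through `storageLocation: dr-mirror` in its template. On a DR cluster that only restores from the primary bucket, the same command with `--access-mode ReadOnly` keeps that cluster from writing or pruning backups there.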
- Encryption and access.
  - All backups encrypted at rest with a KMS key scoped to the backup account.
  - Key rotation annual, with an audit log of rotations.
  - IAM: write-only role for backup producers; read + restore role reserved for break-glass; object-lock prevents delete by anyone.
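The write-only producer role described above could carry a policy along these lines. This is a sketch under assumptions: the bucket name is reused from the Object Lock example, the KMS key ARN is left as a placeholder, and the function name is hypothetical. It grants no read, list, or delete actions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a write-only IAM policy for backup producers to $1.
# s3:PutObject alone allows uploads; kms:GenerateDataKey is needed for SSE-KMS
# uploads (multipart uploads may additionally need kms:Decrypt).
write_producer_policy() {
  local out="${1:-./backup-producer-policy.json}"
  cat > "$out" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "WriteBackupsOnly",
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::acme-prod-backups/*"
    },
    {
      "Sid": "EncryptWithBackupKey",
      "Effect": "Allow",
      "Action": ["kms:GenerateDataKey"],
      "Resource": "<BACKUP_KMS_KEY_ARN>"
    }
  ]
}
EOF
}
```

The break-glass restore role would be a separate policy with `s3:GetObject` and `kms:Decrypt`, attached only when a restore is authorised.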
- Restore-test calendar.
  - Monthly: pick one random Tier 1 workload; restore a single table / object / namespace to a scratch target; verify row count / object hash / pod Ready. 60-minute time-box.
  - Quarterly: full restore of one Tier 1 workload end-to-end to a parallel environment; measure RTO; compare to target.
  - Annually: regional-failure tabletop. Delete access to the primary region (simulated) and rehearse the DR runbook.
  - Record every test in the inventory `last restore test` column and in a ticket with screenshots and timings.
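The time-box and the RTO measurement can be enforced by wrapping the restore command. A minimal sketch, assuming GNU coreutils `timeout` is available and that the restore is an executable or script (not a shell function, which `timeout` cannot run). The function name `timed_restore` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a restore command under a time limit and report
# the elapsed time for the ticket.
# $1 = limit in seconds; remaining arguments = the restore command.
timed_restore() {
  local limit_s=$1; shift
  local start=$SECONDS
  if ! timeout "$limit_s" "$@"; then
    echo "FAIL: restore failed or exceeded ${limit_s}s"
    return 1
  fi
  local elapsed=$(( SECONDS - start ))
  echo "OK: restored in ${elapsed}s (limit ${limit_s}s)"
}
```

Usage for the monthly sample, with a hypothetical script name: `timed_restore 3600 ./restore-orders-table.sh`. Paste the printed line into the `last restore test` column and the ticket.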
- Populate the checklist at references/backup-checklist.md for every workload and attach to the ticket.
- Alert on backup failures. Export Velero metrics (`velero_backup_failure_total`) and RDS/Aurora snapshot events to the Alertmanager / Prometheus stack from `monitoring-setup`. Missed schedule (`velero_backup_last_successful_timestamp` older than `2 * schedule interval`) is `critical`.
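The missed-schedule rule is plain arithmetic: a schedule is stale when the time since its last success exceeds twice its interval. The sketch below shows that comparison outside Prometheus, for example in an ad hoc audit script. The function name is hypothetical, and all arguments are Unix timestamps or seconds.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) when a backup schedule is stale.
# $1 = last successful backup (epoch seconds)
# $2 = schedule interval in seconds
# $3 = current time (epoch seconds, defaults to now)
backup_is_stale() {
  local last_success=$1 interval_s=$2 now=${3:-$(date +%s)}
  [ $(( now - last_success )) -gt $(( 2 * interval_s )) ]
}
```

Usage for an hourly schedule: `backup_is_stale "$last_ts" 3600 && echo critical`. A gap of exactly two intervals is still treated as healthy, matching "older than".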
Examples
Happy path: mid-size SaaS, EKS + Aurora + S3
Inventory: 1 Aurora Postgres (120 GiB), 2 S3 buckets (2 TiB user uploads), EKS with 12 namespaces, 6 of them stateful.
Plan produced:
Aurora: Tier 1, RPO 5m, RTO 1h. Automated backups 35d, PITR, cross-region replica.
Uploads: Tier 1, RPO 5m, RTO 1h. Versioning + CRR to eu-central-1 + Object Lock 35d.
Internal: Tier 2, RPO 1h, RTO 4h. Velero hourly + daily cluster backup, 30d retention.
Dev: Tier 3, RPO 24h, RTO 24h. Velero daily, 14d retention, no CRR.
Restore test calendar: monthly sample (1st Tue), quarterly full (first month of Q).
Next test: 2026-05-05 — restore `orders` table from last night's Aurora export to a scratch instance; target <15 min.
Edge case: on-prem MySQL with no ops team
A single on-prem MySQL primary, no DBA, no offsite backup. Owner wants “something that works”.
Plan:
- Schedule `mysqldump` nightly + `xtrabackup` weekly full.
- Encrypt with `age`, upload to S3 in a separate AWS account with Object Lock (Compliance mode, 35d).
- Classify as Tier 2 initially (RPO 24h) with a roadmap to Tier 1 once binlog streaming is added.
- Monthly restore to a container on any spare host; target <2 hours.
- Document the runbook; the owner rehearses the restore once.
Never leave a production DB with only local backups; a disk failure is a total loss.
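The nightly leg of this plan could be a single streaming pipeline, so no unencrypted dump touches local disk. This is a sketch under assumptions: MySQL credentials come from `~/.my.cnf`, the `age` recipient is a public key supplied by the owner, and the key layout `mysql/<host>/<date>.sql.gz.age` and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the object key for one night's dump.
backup_key() {
  printf 'mysql/%s/%s.sql.gz.age\n' "$1" "$2"
}

# Hypothetical helper: dump, compress, encrypt, and upload in one stream.
# $1 = host label, $2 = destination bucket, $3 = age recipient (public key)
nightly_dump() {
  local host=$1 bucket=$2 recipient=$3
  local key
  key=$(backup_key "$host" "$(date +%F)")
  (
    # Fail the whole pipeline if any stage fails, not just the upload.
    set -o pipefail
    mysqldump --single-transaction --routines --all-databases \
      | gzip \
      | age -r "$recipient" \
      | aws s3 cp - "s3://${bucket}/${key}"
  )
}
```

`--single-transaction` gives a consistent snapshot for InnoDB tables only; tables on other engines would need a different approach. A successful upload still counts as unverified until the monthly restore has decrypted and loaded it.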
Constraints
- Never store backups in the same cloud account / project as the source. A compromised root credential must not be able to delete them.
- Never consider a backup valid until it has been restored successfully. “Backups succeeded” is not “backups work”.
- Never skip encryption at rest. Every backup object has a KMS-encrypted envelope.
- Never put secrets (DB passwords, TLS private keys) in a backup intended for a less-trusted destination without re-wrapping.
- Never let Object Lock retention exceed the regulatory maximum; over-retention is a privacy risk for personal data.
- Never use “continuous” replication (RDS read replica, S3 CRR alone) as the only backup; it propagates logical errors. Point-in-time, immutable backups must exist separately.
- Always test restore before decommissioning the old backup system.
Quality checks
- Every Tier 1 workload has a successful restore recorded in the last 30 days.
- Every Tier 2 workload has a successful restore recorded in the last 90 days.
- 3-2-1 is satisfied for every Tier 1 and Tier 2 workload (verified from the inventory table).
- Object Lock / WORM is enabled on the offsite copy for Tier 1.
- `velero backup get` shows `Completed` for the last run of every Schedule.
- `velero_backup_last_successful_timestamp` gap is under 2x the schedule interval for every Schedule.
- RDS automated backup retention ≥ 7 days for prod.
- The restore-test calendar is populated 12 months ahead, with owners assigned.
- The checklist in `references/backup-checklist.md` is filled for every workload.
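The first two quality checks can be run against the inventory if it is exported as CSV. A minimal sketch, assuming a hypothetical three-column export (`workload,tier,last_restore_test`) with ISO dates or the literal `never`; the function name is also hypothetical. ISO dates compare correctly as strings, so no date parsing is needed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list workloads of a tier whose last restore test is
# missing or older than the cutoff date.
# $1 = inventory CSV, $2 = tier, $3 = cutoff date (YYYY-MM-DD)
stale_restores() {
  local inventory=$1 tier=$2 cutoff=$3
  awk -F',' -v tier="$tier" -v cutoff="$cutoff" '
    # Skip the header; force string comparison on the date column.
    NR > 1 && $2 == tier && ($3 == "never" || ($3 "") < (cutoff "")) { print $1 }
  ' "$inventory"
}
```

Usage with GNU `date`: `stale_restores inventory.csv 1 "$(date -d '30 days ago' +%F)"` for Tier 1, and the same with `2` and `90 days ago` for Tier 2. Empty output means the check passes.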