The checklist

  • Separate roles per service; no shared superuser accounts
  • Connection pooling (PgBouncer or app-level); at scale, apps should never open unpooled connections directly to the server
  • Slow query log enabled with log_min_duration_statement = 200ms
  • Backups going off-host, tested by restoring at least once a quarter
  • track_io_timing = on so that pg_stat_statements records how long each statement spends on block reads and writes
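
Two of the items above are plain server settings, so they are easy to verify automatically. The following is a minimal sketch, not a definitive tool: it assumes you have already loaded the output of SELECT name, setting FROM pg_settings into a dict, where log_min_duration_statement is reported in milliseconds and -1 means disabled. The function name check_settings and the max_slow_ms parameter are made up for illustration.

```python
def check_settings(settings, max_slow_ms=200):
    """Return a list of problems; an empty list means both settings look fine.

    `settings` is assumed to map setting name -> value string, as read from
    pg_settings (hypothetical input; fetching it is left to the caller).
    """
    problems = []

    # pg_settings reports this value in milliseconds; -1 disables the log.
    slow_ms = int(settings.get("log_min_duration_statement", "-1"))
    if slow_ms < 0:
        problems.append("slow query log is disabled")
    elif slow_ms > max_slow_ms:
        problems.append(
            f"slow query threshold is {slow_ms}ms, expected <= {max_slow_ms}ms"
        )

    # Boolean settings are reported as the strings "on" / "off".
    if settings.get("track_io_timing", "off") != "on":
        problems.append("track_io_timing is off")

    return problems
```

Running a check like this from CI or a cron job would catch a setting that drifted after a config change or a restore onto a fresh host.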

Backups that you have actually tested

The single biggest mistake we see is teams that have a backup pipeline but have never restored from it. A backup is a hypothesis until you prove it works.
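
One way to turn that hypothesis into a test is a scripted restore drill. The sketch below assumes the backup is a pg_dump custom-format archive and that the standard client tools (dropdb, createdb, pg_restore, psql) are on the PATH with connection details supplied through the usual environment variables. Your pipeline may differ; the function names, the scratch database and the sanity query are placeholders.

```python
import subprocess


def restore_test_commands(dump_path, scratch_db, check_sql):
    """Build the commands for one restore drill, in execution order."""
    return [
        # Start from a clean scratch database every time.
        ["dropdb", "--if-exists", scratch_db],
        ["createdb", scratch_db],
        # Assumes a pg_dump custom-format archive; stop at the first error.
        ["pg_restore", "--exit-on-error", "--no-owner",
         "--dbname", scratch_db, dump_path],
        # Run a sanity query against the restored data.
        ["psql", "--dbname", scratch_db, "--set", "ON_ERROR_STOP=1",
         "--tuples-only", "--no-align", "--command", check_sql],
    ]


def run_restore_test(dump_path, scratch_db, check_sql):
    """Run the drill and return the sanity query's output.

    Raises subprocess.CalledProcessError as soon as any step fails.
    """
    output = ""
    for argv in restore_test_commands(dump_path, scratch_db, check_sql):
        result = subprocess.run(argv, check=True, capture_output=True, text=True)
        output = result.stdout
    return output.strip()
```

A sanity query that counts rows in a table you know should be non-empty is a reasonable starting point. The drill only proves something if it reads the same off-host copy you would reach for in a real incident.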