The checklist
- Separate roles per service; no shared superuser accounts
- Connection pooling (PgBouncer or app-level) — never let apps open raw connections at scale
- Slow query log enabled with `log_min_duration_statement = 200ms`
- Backups going off-host, tested by restoring at least once a quarter
- `track_io_timing = on` so you can see disk waits in `pg_stat_statements`
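
As a rough sketch, the two settings named in the checklist might sit in `postgresql.conf` like this. The `shared_preload_libraries` line is an assumption added here: `pg_stat_statements` has to be preloaded (which needs a server restart) and then enabled per database with `CREATE EXTENSION pg_stat_statements`. If your server already preloads other libraries, append to that list rather than replacing it.

```conf
# postgresql.conf (fragment)

# Log every statement that runs for 200 ms or longer.
log_min_duration_statement = 200ms

# Record time spent on block reads and writes.
track_io_timing = on

# Assumption, not from the checklist: required for pg_stat_statements.
# Changing this value needs a restart, not just a reload.
shared_preload_libraries = 'pg_stat_statements'
```

The first two settings take effect on a configuration reload. The names of the I/O timing columns in `pg_stat_statements` differ between PostgreSQL versions, so check the documentation for the version you run.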
Backups that you have actually tested
The single biggest mistake we see is teams that have a backup pipeline but have never restored from it. A backup is a hypothesis until you prove it works.
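
One way to turn that hypothesis into a test is a scheduled restore drill. The sketch below is a minimal illustration, not a complete tool: it assumes the backup is a custom-format `pg_dump` archive (the kind `pg_restore` reads), that the PostgreSQL client programs are on `PATH`, and that connection details come from the usual `PG*` environment variables. The function names, the scratch database name, and the check query are all hypothetical.

```python
import subprocess
from typing import List


def restore_drill_commands(dump_path: str, scratch_db: str, check_sql: str) -> List[List[str]]:
    """Build the commands for one restore drill, in execution order."""
    return [
        # Start from a clean slate: drop any scratch database left by a previous drill.
        ["dropdb", "--if-exists", scratch_db],
        ["createdb", scratch_db],
        # Restore the archive into the scratch database and stop at the first error.
        ["pg_restore", "--exit-on-error", "--no-owner", f"--dbname={scratch_db}", dump_path],
        # Run a sanity query against the restored data (hypothetical; pick one that
        # would fail or look wrong if the restore were incomplete).
        ["psql", "-v", "ON_ERROR_STOP=1", "--tuples-only", "--no-align",
         f"--dbname={scratch_db}", "-c", check_sql],
    ]


def run_restore_drill(dump_path: str, scratch_db: str, check_sql: str) -> None:
    """Run each step; subprocess raises CalledProcessError if any step fails."""
    for argv in restore_drill_commands(dump_path, scratch_db, check_sql):
        subprocess.run(argv, check=True)
```

Calling something like `run_restore_drill("/backups/app.dump", "restore_drill", "SELECT count(*) FROM orders")` from a scheduled job, and alerting when it raises, would cover the quarterly restore in the checklist. The path and table name there are placeholders. Run it on a host other than the primary so the drill also proves the off-host copy is usable.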