Overview
Run Postgres on a managed provider until you have a concrete reason not to. The provider handles backups, patching, replication, and pager duty; you handle schema and queries. This page covers production decisions downstream of postgres: hosting, backups, PITR, failover, pooling, replicas, secrets, and observability.
Default to a managed provider
Pick from Neon, Supabase, AWS RDS or Aurora, Google Cloud SQL, or Crunchy Bridge. Each ships automated backups, minor-version patching, monitoring, and a one-click read replica.
- Neon: branchable Postgres with separated storage and compute. Good for preview environments.
- Supabase: Postgres plus auth, storage, edge functions. Good when the app needs more than the database.
- RDS or Cloud SQL: boring, durable, expensive at the high end. Pick when you already live in AWS or GCP.
Self-host only when the managed bill is provably worse at scale, residency laws pin you to a region no provider serves, or you need an extension the provider blocks. See hostinger-vps for the hardening baseline.
Backups are useless until you have restored one
A backup you have not restored is a hope. Run a full restore drill quarterly and after any backup config change.
# Nightly logical dump, parallel, directory format (required for --jobs)
pg_dump --jobs=4 --format=directory --file=/backups/app-$(date +%F) app_prod
# Restore drill: into a scratch instance, end-to-end timed
pg_restore --jobs=4 --dbname=app_restore /backups/app-2026-05-14
Push dumps off-box with restic or the provider’s snapshot tooling; see hostinger-vps. Keep seven daily, four weekly, and twelve monthly snapshots. The runbook records wall-clock time to first query after restore.
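A minimal off-box push with restic, assuming the repository and RESTIC_PASSWORD are already configured; the retention flags mirror the schedule above, the paths are illustrative.
# Push the dump directory to the restic repository, then apply retention
restic backup /backups/app-$(date +%F)
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --prune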
Layer WAL archiving for point-in-time recovery
Logical dumps give you a daily RPO. Pair them with continuous WAL archiving when you need minutes.
- Managed providers ship PITR by default. Confirm the retention window (Neon: 7 to 30 days; RDS: up to 35) and the latest restorable timestamp.
- Self-hosted: configure archive_command to ship WAL segments to S3. pgBackRest or wal-g wrap this with retention, encryption, and parallel restore; sketch below.
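A minimal self-hosted sketch with wal-g, assuming wal-g is installed and its storage settings (for example WALG_S3_PREFIX and S3 credentials) are already configured; the data directory and service name are illustrative.
# Ship every completed WAL segment through wal-g
psql -c "ALTER SYSTEM SET archive_mode = on;"
psql -c "ALTER SYSTEM SET archive_command = 'wal-g wal-push %p';"
# archive_mode needs a restart, not just a reload
sudo systemctl restart postgresql
# Take a fresh base backup so replay has a recent starting point
wal-g backup-push /var/lib/postgresql/16/main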
RPO is the WAL archive interval, typically under a minute. RTO is dominated by replay since the last base backup; size that against your incident budget.
Drill the failover before the incident
A failover playbook exists when someone has timed it end to end. Run the drill quarterly.
- Managed: trigger the provider’s failover button. Time endpoint propagation, pool reconvergence, app reconnect.
- Self-hosted: promote a replica with pg_ctl promote or Patroni; update the connection string at the pool. Sketch below.
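A sketch of the self-hosted promotion step; the data directory and Patroni config path are assumptions, so substitute your own layout.
# Plain streaming replica: promote it to primary
pg_ctl promote -D /var/lib/postgresql/16/main
# Patroni cluster: run an interactive planned switchover instead
patronictl -c /etc/patroni.yml switchover
# Then repoint the pool at the new primary and time the first successful query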
Write the steps, the expected duration, and the rollback path. The playbook lives next to the runbook.
Pool connections, do not multiply them
Postgres is one process per connection. A hundred app workers each holding ten connections starve the server. Use PgBouncer in transaction mode for any service with more than a few workers; Supabase and Neon ship their own pooler. See pooling rules in postgres, Prisma flags in prisma, and the rollout discipline in migrations.
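A minimal transaction-mode PgBouncer config, as a sketch; the database name, auth file, and pool sizes are illustrative and need tuning against core count and workload.
# Write /etc/pgbouncer/pgbouncer.ini; the app connects to port 6432 instead of 5432
cat > /etc/pgbouncer/pgbouncer.ini <<'EOF'
[databases]
app_prod = host=127.0.0.1 port=5432 dbname=app_prod
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 20
max_client_conn = 500
EOF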
Read replicas are for read scale, not safety
A replica that lags by ten seconds is not a backup. Use replicas to offload read-heavy queries (reporting, search indexing) and to keep a failover candidate warm. Do not point the nightly backup job at a replica; back up from primary or a dedicated backup-only node. See postgres-replication for the streaming-vs-logical trade-offs.
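A quick lag check to run on the replica before trusting it for reads or as the failover candidate, as a sketch:
# On the replica: how far behind the primary is replay right now?
psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;"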
Rotate secrets out of plain env vars
Never store the production database URL in a .env committed to history. Use a secret manager (AWS Secrets Manager, GCP Secret Manager, Doppler, 1Password) and rotate on a schedule. Prefer short-lived IAM credentials where supported (RDS IAM auth, Cloud SQL IAM). fastapi and other clients read the URL from the store at startup, not from disk.
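A sketch of resolving the URL at process start from AWS Secrets Manager; the secret name and the entrypoint hand-off are illustrative.
# Entrypoint wrapper: fetch the connection string at startup, never write it to disk
export DATABASE_URL="$(aws secretsmanager get-secret-value \
  --secret-id prod/app/database-url \
  --query SecretString --output text)"
exec "$@"   # hand off to the real app server command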
Wire observability before you need it
Three signals catch most production pain.
- pg_stat_statements: enable in shared_preload_libraries; review top time-consumers weekly (query after this list). Feed it into postgres-explain for the plan-level read, and into postgres-indexes for the index decisions.
- Slow query log: log statements over 250 ms. Ship to the same log pipeline as the app.
- Autovacuum: alert on n_dead_tup ratios and last_autovacuum age for hot tables (query after this list). Bloat is the slowest, most preventable production decay. See postgres-vacuum for the per-table tuning.
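Two starting queries for the weekly review and the autovacuum alert, as a sketch; the limits are illustrative and the column names assume PostgreSQL 13 or newer.
# Top time-consumers since the last stats reset (needs the pg_stat_statements extension)
psql app_prod -c "SELECT round(total_exec_time) AS total_ms, calls, left(query, 60) AS query
                  FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"
# Dead-tuple counts and last autovacuum per table; feed these into the alert rule
psql app_prod -c "SELECT relname, n_dead_tup, n_live_tup, last_autovacuum
                  FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;"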
Pair with host metrics from hostinger-vps (CPU, memory, disk, IO wait) and app-side traces and SLOs from observability. Alert on disk above 80 percent and replication lag above the RPO target.