Production-grade PostgreSQL: real-database benchmarks (with measured 3.5× wins), slow-query observability via pg_stat_statements + auto_explain, point-in-time recovery with verified backups, a custom AST-based migration linter, and range partitioning with materialized views for reporting. Redis underpins caching, rate limiting, HTTP idempotency, JWT revocation, and time-windowed analytics across the Go stack and Spring @Cacheable in the Java stack. MongoDB and Qdrant are also in use elsewhere — dedicated tabs for each are coming.
Functional ORM-style code is a starting point, not the finish line. The Go services were re-benchmarked against a real PostgreSQL 16 container and rewritten where the data showed it.
testcontainers-go made the benchmarks runnable on any Docker-equipped machine; results were captured to go/benchdata/ for interview-ready evidence.
| Optimization | Before | After | Speedup |
|---|---|---|---|
| Order creation (20 items) | 4.5 ms | 1.3 ms | 3.5× |
| Product search | 1.0 ms | 0.55 ms | 1.9× |
| Order creation (5 items) | 1.5 ms | 0.8 ms | 1.8× |
| Category filter | 430 µs | 327 µs | 1.3× |
testcontainers-go against PostgreSQL 16; results in go/benchdata/baseline-results.txt and optimized-results.txtCOUNT(*) OVER() window function replaces COUNT-then-data double queryWITH updated AS (UPDATE … RETURNING) SELECT EXISTS(...))QueryExecModeCacheDescribe)idx_orders_saga_step, partial idx_products_low_stock WHERE stock < 10, composite idx_cart_items_user_reservederrors.Is(err, pgx.ErrNoRows), errors.As(err, &pgconn.PgError) for code 23505Slow queries don't fix themselves; they have to be found first. Postgres ships two extensions that do exactly this — pg_stat_statements aggregates per-query latency, call counts, and IO; auto_explain captures full execution plans for any query that crosses a duration threshold.
Both are wired into the portfolio so the slow query a hiring manager would normally have to take on faith is instead visible in Grafana.
shared_preload_libraries='pg_stat_statements,auto_explain' set on the Postgres deployment — server-wide enablement; restart picked up by the existing Recreate strategy.CREATE EXTENSION IF NOT EXISTS pg_stat_statements for all 7 prod databases, bootstrapped by an idempotent K8s Job (postgres-extensions-bootstrap).auto_explain.log_min_duration=500ms, log_analyze=true, log_format=json — every query over 500ms writes a JSON plan to Postgres logs, which Promtail ships to Loki keyed by query_id.postgres_exporter queries surface the top-50 by mean latency and the top-50 by IO, with query_text truncated to 200 chars to bound label cardinality.auto_explain-stalled canary that fires when no plan log lines arrive in 24h.grafana_reader role (pg_monitorpredefined role) lets a Grafana PostgreSQL data source render live “top slow queries” tables without leaking write access.Five Go services × pod replicas × pgx pool size adds up fast. Postgres tops out around 100 connections by default, and every reconnect during a deploy is a small stampede. PgBouncer sits between every Go service and Postgres in transaction-pool mode, letting hundreds of client-side connections share a small fixed pool of server-side ones.
Migrations bypass the pooler — golang-migrate uses session-level features (advisory locks, multi-statement transactions) that transaction-pool mode strips, so each service ConfigMap exposes a second DSN that points at Postgres directly.
java-tasks with pool_mode=transaction, max_client_conn=1000, default_pool_size=20, server_reset_query=DISCARD ALL.auth_query credential delegation: PgBouncer runs as a low-privilege pgbouncer_auth role and looks up application credentials from a dedicated pgbouncer.userlist view at connect time — no synced password files, no plaintext secrets in the pooler image.DATABASE_URL routes through pgbouncer.java-tasks.svc.cluster.local:6432 (apps); DATABASE_URL_DIRECT hits postgres.java-tasks.svc.cluster.local:5432 (migrations and any session-level work).kustomize overlay applies the same split in QA pointing at the prod-namespace pooler via ExternalName — shared infra, single source of truth.PodDisruptionBudget on the pooler with maxUnavailable: 1so node drains don't take all client traffic with them.pgbouncer_auth role and pgbouncer.userlist on first install; idempotent re-runs.Reporting reads (sales trends, inventory turnover, top customers) shouldn't compete with checkout writes for primary connections. An async streaming replica takes the load: order-service opens a second pgxpool against postgres-replica.java-tasks, and the /reporting/* endpoints route there.
Each pool sets a distinct application_name runtime parameter, so primary vs reporting traffic is trivially distinguishable in pg_stat_activity— “did the reporting reads actually move off the primary?” becomes a single query instead of guesswork.
primary_conninfo + standby signal); fed by the same WAL stream the archive uses.order-service wires two pgxpool instances — Primary for OLTP and Reporting for read-only reporting queries — with identical pool tuning so failover behavior is uniform.application_name is set per pool (order-service vs order-service-reporting) so traffic attribution is built in, not guessed.ExternalName pointed at a not-yet-deployed prod replica, the previous fail-fast behavior crashed every new pod, and the rollout timed out. Best-effort connect at the boundary is the production-grade answer.ExternalName (postgres-replica.java-tasks-qa → postgres-replica.java-tasks) — same shared-infra pattern as the rest of the QA stack.Production-grade SQL isn't only about queries. Postgres needs scheduled backups, continuous WAL archiving, monitored health, and a written runbook for the day someone has to restore.
The portfolio's Postgres deployment ships all four — and verifies the backups are actually restorable, because a backup that hasn't been restored is a hope, not a guarantee.
pg_dump --format=custom per database (7 prod DBs), 7-day retention; backups land on a hostPath PV (/backups/postgres) separate from the Postgres data PVC so PVC corruption doesn't affect backups.Recreate strategy + terminationGracePeriodSeconds: 90 + preStop: pg_ctl stop -m fast — the combination that prevents the WAL-corruption incident the data-integrity ADR documents.PodDisruptionBudget with maxUnavailable: 0on the single-replica DB so node drains don't take it out involuntarily.archive_mode=on + custom archive_command wrapper script (pg-archive-wal.sh, atomic via temp + rename) ships every WAL segment to a 10Gi wal-archive PV.archive_timeout=300 forces a WAL switch every 5 min during idle periods, so RPO drops from ≤ 24h to ≤ 5m.pg_basebackup CronJob (Sundays 03:00 UTC) writes --format=tar --gzip --wal-method=fetch tarballs; retains 4 weeklies + WAL back to the second-newest base backup. Uses a dedicated replicator role with only REPLICATION LOGIN (not taskuser).pg_stat_archiver-based alerts: archive command failing, WAL archive stale, base backup stale.postgres-backup-verifyCronJob restores yesterday's dump into a throwaway database, runs pg_restore --list | wc -l and a row-count smoke check, pushes success/failure to Pushgateway as a Prometheus metric.pg_dump-stale and basebackup-stale panels — three operational signals on one screen.postgres_exporter sidecar feeding Prometheus; Grafana dashboard surfaces connection counts, replication lag, table sizes, and slow queries.docs/runbooks/postgres-recovery.md): fresh PVC reset, full restore from pg_dump, partial restore (single database), point-in-time recovery to a specific timestamp.golang-migratecatches syntactic errors when the migration runs against Docker; it doesn't catch operationally unsafe DDL that's syntactically valid. A custom Go linter (migration-lint) walks each .up.sql AST via libpg_query and flags eight common foot-guns at lint time, before any container starts.
Each rule pairs with a recipe in a checked-in safe-migration runbook.
pganalyze/pg_query_go (CGO wrapper around libpg_query, the upstream PG parser)-- migration-lint: ignore=MIGNNN reason="..." with mandatory reason="…"make preflight-go-migrations and the CI matrix as a hard prerequisite to the runtime migration pipelineCREATE INDEX CONCURRENTLY in its own migration file (go/product-service/migrations/005_add_product_search_index.up.sql)Reporting workloads on a monotonically growing orders table forced a schema-design pass. Range partitioning by created_at prunes scan scope; three materialized views give constant-time reads for dashboard queries; CTE + window functions express the rolling-average business logic without application-side aggregation.
orders.created_at (monthly), 18 months pre-provisioned with a default catch-all partitionCREATE TABLE IF NOT EXISTSmv_daily_revenue, mv_product_performance, mv_customer_summary) refreshed CONCURRENTLY on a 15-min cadenceREFRESH CONCURRENTLYSUM(...) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) for rolling 7/30-day averagesDENSE_RANK() for tie-aware top-N (turnover, top customers)(id, created_at) removes single-column FK target — referential integrity moves to the saga)