main·5006ce4·1m ago

Database Engineering

Production-grade PostgreSQL: real-database benchmarks (with measured 3.5× wins), slow-query observability via pg_stat_statements + auto_explain, point-in-time recovery with verified backups, a custom AST-based migration linter, and range partitioning with materialized views for reporting. Redis underpins caching, rate limiting, HTTP idempotency, JWT revocation, and time-windowed analytics across the Go stack and Spring @Cacheable in the Java stack. MongoDB and Qdrant are also in use elsewhere — dedicated tabs for each are coming.

Database Tracks

Query Optimization & Benchmarking

Functional ORM-style code is a starting point, not the finish line. The Go services were re-benchmarked against a real PostgreSQL 16 container and rewritten where the data showed it.

testcontainers-go made the benchmarks runnable on any Docker-equipped machine; results were captured to go/benchdata/ for interview-ready evidence.

OptimizationBeforeAfterSpeedup
Order creation (20 items)4.5 ms1.3 ms3.5×
Product search1.0 ms0.55 ms1.9×
Order creation (5 items)1.5 ms0.8 ms1.8×
Category filter430 µs327 µs1.3×
  • Real-DB benchmarks via testcontainers-go against PostgreSQL 16; results in go/benchdata/baseline-results.txt and optimized-results.txt
  • Batch INSERT for order items: 3.5× speedup on 20-item orders (4.5ms → 1.3ms), single round trip instead of N
  • COUNT(*) OVER() window function replaces COUNT-then-data double query
  • CTE-based atomic conflict resolution in cart updates (WITH updated AS (UPDATE … RETURNING) SELECT EXISTS(...))
  • pgx prepared-statement cache enabled (QueryExecModeCacheDescribe)
  • Targeted indexes: idx_orders_saga_step, partial idx_products_low_stock WHERE stock < 10, composite idx_cart_items_user_reserved
  • Typed pgx error checks: errors.Is(err, pgx.ErrNoRows), errors.As(err, &pgconn.PgError) for code 23505

Query Observability — pg_stat_statements + auto_explain

Slow queries don't fix themselves; they have to be found first. Postgres ships two extensions that do exactly this — pg_stat_statements aggregates per-query latency, call counts, and IO; auto_explain captures full execution plans for any query that crosses a duration threshold.

Both are wired into the portfolio so the slow query a hiring manager would normally have to take on faith is instead visible in Grafana.

  • shared_preload_libraries='pg_stat_statements,auto_explain' set on the Postgres deployment — server-wide enablement; restart picked up by the existing Recreate strategy.
  • Per-database CREATE EXTENSION IF NOT EXISTS pg_stat_statements for all 7 prod databases, bootstrapped by an idempotent K8s Job (postgres-extensions-bootstrap).
  • auto_explain.log_min_duration=500ms, log_analyze=true, log_format=json — every query over 500ms writes a JSON plan to Postgres logs, which Promtail ships to Loki keyed by query_id.
  • Custom postgres_exporter queries surface the top-50 by mean latency and the top-50 by IO, with query_text truncated to 200 chars to bound label cardinality.
  • Three Prometheus alerts: hard ceiling on per-query mean (> 1s for 10m), regression detection (mean > 2× the 7-day baseline for 15m), and an auto_explain-stalled canary that fires when no plan log lines arrive in 24h.
  • Read-only grafana_reader role (pg_monitorpredefined role) lets a Grafana PostgreSQL data source render live “top slow queries” tables without leaking write access.

Connection Pooling — PgBouncer

Five Go services × pod replicas × pgx pool size adds up fast. Postgres tops out around 100 connections by default, and every reconnect during a deploy is a small stampede. PgBouncer sits between every Go service and Postgres in transaction-pool mode, letting hundreds of client-side connections share a small fixed pool of server-side ones.

Migrations bypass the pooler — golang-migrate uses session-level features (advisory locks, multi-statement transactions) that transaction-pool mode strips, so each service ConfigMap exposes a second DSN that points at Postgres directly.

  • PgBouncer Deployment in java-tasks with pool_mode=transaction, max_client_conn=1000, default_pool_size=20, server_reset_query=DISCARD ALL.
  • auth_query credential delegation: PgBouncer runs as a low-privilege pgbouncer_auth role and looks up application credentials from a dedicated pgbouncer.userlist view at connect time — no synced password files, no plaintext secrets in the pooler image.
  • Per-service ConfigMap split: DATABASE_URL routes through pgbouncer.java-tasks.svc.cluster.local:6432 (apps); DATABASE_URL_DIRECT hits postgres.java-tasks.svc.cluster.local:5432 (migrations and any session-level work).
  • CI/CD kustomize overlay applies the same split in QA pointing at the prod-namespace pooler via ExternalName — shared infra, single source of truth.
  • PodDisruptionBudget on the pooler with maxUnavailable: 1so node drains don't take all client traffic with them.
  • Bootstrap Job seeds the pgbouncer_auth role and pgbouncer.userlist on first install; idempotent re-runs.

Read Replica & Reporting Pool

Reporting reads (sales trends, inventory turnover, top customers) shouldn't compete with checkout writes for primary connections. An async streaming replica takes the load: order-service opens a second pgxpool against postgres-replica.java-tasks, and the /reporting/* endpoints route there.

Each pool sets a distinct application_name runtime parameter, so primary vs reporting traffic is trivially distinguishable in pg_stat_activity— “did the reporting reads actually move off the primary?” becomes a single query instead of guesswork.

  • Async streaming replica via Postgres physical replication (primary_conninfo + standby signal); fed by the same WAL stream the archive uses.
  • order-service wires two pgxpool instances — Primary for OLTP and Reporting for read-only reporting queries — with identical pool tuning so failover behavior is uniform.
  • application_name is set per pool (order-service vs order-service-reporting) so traffic attribution is built in, not guessed.
  • Graceful fallback: if the replica is unreachable at startup (5-second connect timeout), the reporting pool is aliased to primary and a warning is logged. The service stays up and reporting reads transparently hit the primary until a future restart re-resolves the replica.
  • The fallback was driven by a real cross-namespace deployment-ordering incident: a QA ExternalName pointed at a not-yet-deployed prod replica, the previous fail-fast behavior crashed every new pod, and the rollout timed out. Best-effort connect at the boundary is the production-grade answer.
  • QA shares the prod replica via ExternalName (postgres-replica.java-tasks-qa postgres-replica.java-tasks) — same shared-infra pattern as the rest of the QA stack.

Reliability & Backups

Production-grade SQL isn't only about queries. Postgres needs scheduled backups, continuous WAL archiving, monitored health, and a written runbook for the day someone has to restore.

The portfolio's Postgres deployment ships all four — and verifies the backups are actually restorable, because a backup that hasn't been restored is a hope, not a guarantee.

  • Daily pg_dump --format=custom per database (7 prod DBs), 7-day retention; backups land on a hostPath PV (/backups/postgres) separate from the Postgres data PVC so PVC corruption doesn't affect backups.
  • Postgres deployment uses Recreate strategy + terminationGracePeriodSeconds: 90 + preStop: pg_ctl stop -m fast — the combination that prevents the WAL-corruption incident the data-integrity ADR documents.
  • PodDisruptionBudget with maxUnavailable: 0on the single-replica DB so node drains don't take it out involuntarily.
  • archive_mode=on + custom archive_command wrapper script (pg-archive-wal.sh, atomic via temp + rename) ships every WAL segment to a 10Gi wal-archive PV.
  • archive_timeout=300 forces a WAL switch every 5 min during idle periods, so RPO drops from ≤ 24h to ≤ 5m.
  • Weekly pg_basebackup CronJob (Sundays 03:00 UTC) writes --format=tar --gzip --wal-method=fetch tarballs; retains 4 weeklies + WAL back to the second-newest base backup. Uses a dedicated replicator role with only REPLICATION LOGIN (not taskuser).
  • Three pg_stat_archiver-based alerts: archive command failing, WAL archive stale, base backup stale.
  • Daily postgres-backup-verifyCronJob restores yesterday's dump into a throwaway database, runs pg_restore --list | wc -l and a row-count smoke check, pushes success/failure to Pushgateway as a Prometheus metric.
  • Two verification alerts: verification failed (immediate, severity critical) and verification stale (no successful verify in 26h, severity warning).
  • The verification metric is on the existing PostgreSQL dashboard alongside thepg_dump-stale and basebackup-stale panels — three operational signals on one screen.
  • postgres_exporter sidecar feeding Prometheus; Grafana dashboard surfaces connection counts, replication lag, table sizes, and slow queries.
  • Alert rules: backup-job failure, replication-lag-too-high, disk-full, long-running-transaction.
  • Four-scenario runbook (docs/runbooks/postgres-recovery.md): fresh PVC reset, full restore from pg_dump, partial restore (single database), point-in-time recovery to a specific timestamp.

Migration Safety — migration-lint

golang-migratecatches syntactic errors when the migration runs against Docker; it doesn't catch operationally unsafe DDL that's syntactically valid. A custom Go linter (migration-lint) walks each .up.sql AST via libpg_query and flags eight common foot-guns at lint time, before any container starts.

Each rule pairs with a recipe in a checked-in safe-migration runbook.

  • Custom Go CLI built on pganalyze/pg_query_go (CGO wrapper around libpg_query, the upstream PG parser)
  • Eight rules: CREATE INDEX without CONCURRENTLY (MIG001), NOT NULL ADD COLUMN with volatile default (MIG002), table-rewrite ALTER COLUMN TYPE (MIG003), CHECK without NOT VALID (MIG004), DROP COLUMN (MIG005), RENAME COLUMN (MIG006), CONCURRENTLY mixed with other DDL (MIG007), LOCK TABLE (MIG008)
  • Per-statement opt-out: -- migration-lint: ignore=MIGNNN reason="..." with mandatory reason="…"
  • Wired into make preflight-go-migrations and the CI matrix as a hard prerequisite to the runtime migration pipeline
  • Companion 8-recipe runbook
  • Worked example: CREATE INDEX CONCURRENTLY in its own migration file (go/product-service/migrations/005_add_product_search_index.up.sql)

Schema Design — Partitioning & Materialized Views

Reporting workloads on a monotonically growing orders table forced a schema-design pass. Range partitioning by created_at prunes scan scope; three materialized views give constant-time reads for dashboard queries; CTE + window functions express the rolling-average business logic without application-side aggregation.

  • Range partitioning on orders.created_at (monthly), 18 months pre-provisioned with a default catch-all partition
  • Background goroutine creates partitions 3 months ahead daily; idempotent CREATE TABLE IF NOT EXISTS
  • Three materialized views (mv_daily_revenue, mv_product_performance, mv_customer_summary) refreshed CONCURRENTLY on a 15-min cadence
  • Unique indexes per MV to support REFRESH CONCURRENTLY
  • CTE-driven reporting with SUM(...) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) for rolling 7/30-day averages
  • DENSE_RANK() for tie-aware top-N (turnover, top customers)
  • Composite primary key trade-off documented ((id, created_at) removes single-column FK target — referential integrity moves to the saga)