Observability

Three pillars. Sixteen alert rules. And four production incidents that taught us what was missing.

Every service in this portfolio is instrumented for metrics, logs, and traces. The journey from “monitoring deployed” to “production-debuggable” took real incidents — gRPC handshakes that hung silently, webhook 500s lost in middleware, agent loops that were black boxes, and Postgres connections we couldn’t attribute. Here’s what broke, what we shipped, and what we still want to close.

The Journey

Four incidents in three days. Each exposed a gap, drove a concrete change, and shaped the system you see below.

  1. Apr 22, 2026 · #1

    The mTLS Handshake

    go-ecommerce

    "Order failed. Please try again." — and Loki had nothing useful for 45 minutes.

    Before

    A gRPC mTLS handshake to payment-service hung silently for 30 seconds. The saga blocked, no metrics fired, and the only signal was a stuck order status. Discovering it required kubectl exec, openssl, and reading git diff HEAD~1 to realise CI had never rebuilt the fix.

    After

    A shared grpcmetrics client interceptor records grpc_client_requests_total and grpc_client_request_duration_seconds per target, and emits slog.ErrorContext on every non-OK result. All gRPC calls now have 30s context deadlines. Saga steps are timed via saga_step_duration_seconds. Build SHA is logged at startup so {app=...} | json | gitSHA=... answers 'is my fix deployed?' from Loki. CI image-change detection moved from HEAD~1 to HEAD~5.

    gRPC interceptor · saga step duration · build SHA in logs · CI HEAD~5
  2. Apr 23, 2026 · #2

    The Silent Webhook

    go-ecommerce

    Customer completes Stripe checkout. Cart still full. No order confirmation. Loki shows zero ERROR logs for 24 hours.

    Before

    The apperror.ErrorHandler() middleware silently converted AppError instances to JSON responses without logging — a webhook 500 vanished. QA and production also shared a RabbitMQ instance with identical queue names, so a QA clear.cart command was being consumed by the production cart-service.

    After

    Middleware now logs every 5xx AppError via slog.Error with code, message, status, and request ID before responding — silent server errors are no longer possible. QA runs on a dedicated RabbitMQ /qa vhost, fully isolating saga flow. A saga-order-stalled Grafana alert fires when saga_steps_total{step="PAYMENT_CREATED"} increases but neither COMPLETED nor COMPENSATION_COMPLETE does within 30 minutes.

    5xx middleware logging · RabbitMQ /qa vhost · saga-order-stalled alert · webhook event-type panel
  3. Apr 23, 2026 · #3

    The Black-Box Agent Loop

    ai-services

    Loki was deployed and ai-service was emitting JSON logs — but only 4 `slog` calls existed in the entire codebase. A failed agent request gave you 'turn started' and 'turn ended' with nothing in between.

    Before

    The agent loop made 3-8 LLM roundtrips per request, each potentially triggering 1-N tool calls. When something went wrong the question was always 'which step failed, and why?' — and the only answer was 'add print statements and redeploy.' The OpenAI and Anthropic clients didn't emit OTel spans, so provider comparison wasn't possible in Jaeger.

    After

    Six-layer structured logging covers HTTP handler, agent loop, LLM clients, cache, guardrails, and tools. All agent-loop logs use slog.InfoContext(ctx, ...) so tracing.NewLogHandler() injects the OTel traceID into every record. OpenAI and Anthropic clients now emit openai.chat / anthropic.chat spans with token attributes — provider comparison is visible in Jaeger. A single Loki query ({app="ai-service"} | json | traceID="...") shows the complete request lifecycle.

    6-layer slog · OTel span parity · traceID-in-logs · truncation discipline
  4. Apr 24, 2026 · #4

    The Postgres WAL Incident

    all

    During a Postgres WAL corruption incident, three observability gaps made diagnosis harder than necessary. We couldn't tell which deploy preceded the metric change. K8s Warning events expired before we looked. And `pg_stat_activity` showed every connection as the same `taskuser` — no way to tell which service owned them.

    Before

    Deploy timestamps had to be reconstructed from kubectl get events. K8s Warning events (OOM kills, probe failures, evictions) lived 1 hour and weren't queryable from Grafana. All six Go services shared the same Postgres credentials — a connection leak in one was indistinguishable from normal load across all.

    After

    Every CI rollout posts a Grafana annotation tagged with namespace + short SHA via /api/annotations, with anonymous-Viewer auth preserved for public dashboard viewing. kubernetes-event-exporter (resmoio fork) ships Warning-only events into Loki under {job="kube-event-exporter"} with namespace/reason/kind/name labels. Every Go service's DATABASE_URL includes application_name=<service-name>, so a "Connections by Service" dashboard panel attributes every Postgres connection in seconds.

    CI deploy annotations · K8s events → Loki · application_name in DSN · connection attribution panel
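The fix for incident #1 combines three ideas: bound every outbound call with a deadline, time it, and log every failure. A minimal sketch of that pattern follows, using only the standard library. It is not the grpcmetrics interceptor itself: `callWithDeadline`, `hangingHandshake`, and `hangIsBounded` are hypothetical names, and the real interceptor would also record the Prometheus metrics named above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	"time"
)

// callWithDeadline is a hypothetical helper, not the grpcmetrics interceptor.
// It bounds call with a timeout, times it, and logs every failure.
func callWithDeadline(ctx context.Context, target string, timeout time.Duration,
	call func(context.Context) error) (time.Duration, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	start := time.Now()
	err := call(ctx) // a context-aware client returns once the deadline passes
	elapsed := time.Since(start)

	if err != nil {
		slog.ErrorContext(ctx, "outbound call failed",
			"target", target, "duration", elapsed.String(), "error", err.Error())
	}
	return elapsed, err
}

// hangingHandshake simulates a peer that never answers: it returns only
// when the context is cancelled or its deadline expires.
func hangingHandshake(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// hangIsBounded reports whether a hanging call is cut off by the deadline.
func hangIsBounded(timeoutMs int) bool {
	timeout := time.Duration(timeoutMs) * time.Millisecond
	_, err := callWithDeadline(context.Background(), "payment-service", timeout, hangingHandshake)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	// 50 ms keeps the demo fast; the services described here use 30 s.
	fmt.Println("deadline enforced:", hangIsBounded(50))
}
```

With the deadline in place, a silent hang surfaces as an ERROR log line with a target and a duration instead of a stuck order status.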

Architecture

Three language stacks feed three collection pipelines. Everything converges in Grafana for unified dashboards and alerting.

Metrics — Prometheus

Prometheus scrapes every pod annotated with prometheus.io/scrape: "true" on a 15-second interval. Infrastructure exporters provide cluster and hardware visibility: kube-state-metrics for pod status and deployment health, node-exporter for CPU, memory, and disk, and a GPU exporter for NVIDIA utilization and temperature.

http_requests_total · http_request_duration_seconds · kafka_consumer_lag · container_memory_working_set_bytes · go_goroutines · nvidia_smi_temperature_gpu
Lessons from production (ADR 07, ADR 10)
After ADR 07, every outbound gRPC call is metered: grpc_client_request_duration_seconds{target=...} and grpc_client_requests_total with target/method/code labels. After ADR 10, every PostgreSQL connection is attributed via application_name= in the DATABASE_URL, surfaced in a Grafana “Connections by Service” panel.
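Connection attribution relies on the standard `application_name` connection parameter, which Postgres reports in `pg_stat_activity`. One way a service could add it to its DATABASE_URL at startup is sketched below. The helper name `withApplicationName`, the host, the password, and the database name are illustrative assumptions, not values from the actual services.

```go
package main

import (
	"fmt"
	"net/url"
)

// withApplicationName returns dsn with the application_name parameter set
// to service, replacing any existing value. Hypothetical helper.
func withApplicationName(dsn, service string) (string, error) {
	u, err := url.Parse(dsn)
	if err != nil {
		return "", fmt.Errorf("parse DATABASE_URL: %w", err)
	}
	q := u.Query()
	q.Set("application_name", service)
	u.RawQuery = q.Encode() // Encode sorts parameters by key
	return u.String(), nil
}

func main() {
	// Host, password, and database name are made up for the example.
	dsn, err := withApplicationName(
		"postgres://taskuser:secret@postgres:5432/orders?sslmode=disable",
		"order-service",
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(dsn)
}
```

Setting the parameter in code rather than by hand keeps the attribution consistent even when all services share the same credentials.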

Logs — Loki + Promtail

Promtail runs as a DaemonSet on every node, tailing container logs from /var/log/pods/. Go services emit structured JSON via slog with a custom handler that injects the OpenTelemetry traceID into every log line. Java services use logstash-logback-encoder for the same JSON output. Loki indexes only labels — namespace, pod, level — keeping storage efficient on a single-node cluster.

{namespace="go-ecommerce"} | json | level="error"
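The custom slog handler mentioned above can be approximated with a thin wrapper around any `slog.Handler`. The sketch below is an assumption about its shape, not the real handler: it reads the trace ID from a hypothetical context key (`traceIDKey`) instead of the OpenTelemetry span context, so that it runs with the standard library alone.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
	"strings"
)

// traceIDKey is a stand-in for the OTel span context carried in ctx.
type traceIDKey struct{}

func withTraceID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, traceIDKey{}, id)
}

// traceHandler wraps another handler and adds a traceID attribute
// to every record whose context carries one.
type traceHandler struct{ slog.Handler }

func (h traceHandler) Handle(ctx context.Context, r slog.Record) error {
	if id, ok := ctx.Value(traceIDKey{}).(string); ok && id != "" {
		r = r.Clone() // avoid mutating state shared with other copies
		r.AddAttrs(slog.String("traceID", id))
	}
	return h.Handler.Handle(ctx, r)
}

// WithAttrs and WithGroup re-wrap so derived loggers keep injecting the ID.
func (h traceHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return traceHandler{Handler: h.Handler.WithAttrs(attrs)}
}

func (h traceHandler) WithGroup(name string) slog.Handler {
	return traceHandler{Handler: h.Handler.WithGroup(name)}
}

// logLine renders one JSON record logged under the given trace ID.
func logLine(traceID string) string {
	var buf bytes.Buffer
	logger := slog.New(traceHandler{Handler: slog.NewJSONHandler(&buf, nil)})
	logger.InfoContext(withTraceID(context.Background(), traceID), "turn started")
	return buf.String()
}

// injected reports whether the rendered record carries the trace ID.
func injected(traceID string) bool {
	return strings.Contains(logLine(traceID), `"traceID":"`+traceID+`"`)
}

func main() {
	fmt.Print(logLine("4bf92f3577b34da6a3ce929d0e0e4736"))
}
```

The injection only happens for the context-aware logging calls (`InfoContext`, `ErrorContext`), which is why those variants matter in the agent loop.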
Lessons from production (ADR 08, ADR 09)
After ADR 08, the apperror middleware logs every 5xx AppError — silent server errors are no longer possible. After ADR 09, the AI agent loop emits structured logs at six layers (HTTP, agent, LLM client, cache, guardrails, tools), all with the OTel traceID injected, so {app="ai-service"} | json | traceID="..." returns the full request lifecycle.
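The 5xx logging rule can be illustrated with plain `net/http` middleware. This is a simplified stand-in for the apperror middleware, which is built on a different framework and logs the AppError code and message as well. The names `logServerErrors` and `statusRecorder`, the `X-Request-ID` header, and the webhook path are assumptions made for the sketch.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
	"net/http"
	"net/http/httptest"
	"strings"
)

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// logServerErrors logs every response with a 5xx status.
func logServerErrors(logger *slog.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)
		if rec.status >= 500 {
			logger.ErrorContext(req.Context(), "server error",
				"status", rec.status,
				"path", req.URL.Path,
				"requestID", req.Header.Get("X-Request-ID"))
		}
	})
}

// capturedLog runs one request through the middleware and returns the log output.
func capturedLog(status int) string {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	h := logServerErrors(logger, http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
	}))
	req := httptest.NewRequest(http.MethodPost, "/webhooks/stripe", nil)
	req.Header.Set("X-Request-ID", "req-123")
	h.ServeHTTP(httptest.NewRecorder(), req)
	return buf.String()
}

// logsError reports whether a response with this status produced an ERROR record.
func logsError(status int) bool {
	return strings.Contains(capturedLog(status), `"level":"ERROR"`)
}

func main() {
	fmt.Println("500 logged:", logsError(500))
	fmt.Println("404 logged:", logsError(404))
}
```

Logging at the middleware layer means a handler cannot return a server error without leaving a record, regardless of how the error was constructed.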

Traces — Jaeger + OpenTelemetry

The Go services are instrumented with the OpenTelemetry SDK. otelgin middleware auto-instruments HTTP handlers, and otelhttp propagates W3C traceparent headers on outbound calls. Trace context also flows through Kafka message headers, so a single request can be traced from the HTTP gateway through ecommerce processing, across an async Kafka boundary, to the analytics consumer.
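To make the Kafka hop concrete, the sketch below writes and reads a W3C `traceparent` value (version `00`, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) on a list of message headers. In the real services this job belongs to the OpenTelemetry propagator; the `header` type and the `inject` / `extract` functions are hypothetical stand-ins, and the sketch skips parts of the specification such as rejecting all-zero IDs.

```go
package main

import (
	"fmt"
	"regexp"
)

// header mirrors the key/value shape Kafka clients use for record headers.
type header struct {
	Key   string
	Value []byte
}

// traceparentRE matches version 00 of the W3C traceparent format.
var traceparentRE = regexp.MustCompile(`^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$`)

// inject appends a traceparent header for the producer side.
func inject(headers []header, traceID, spanID string, sampled bool) []header {
	flags := "00"
	if sampled {
		flags = "01"
	}
	value := fmt.Sprintf("00-%s-%s-%s", traceID, spanID, flags)
	return append(headers, header{Key: "traceparent", Value: []byte(value)})
}

// extract reads the trace and parent span IDs on the consumer side.
func extract(headers []header) (traceID, spanID string, ok bool) {
	for _, h := range headers {
		if h.Key != "traceparent" {
			continue
		}
		m := traceparentRE.FindStringSubmatch(string(h.Value))
		if m == nil {
			return "", "", false
		}
		return m[1], m[2], true
	}
	return "", "", false
}

func main() {
	hs := inject(nil, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true)
	traceID, spanID, ok := extract(hs)
	fmt.Println(traceID, spanID, ok)
}
```

Because the consumer starts its span with the extracted trace ID, the async boundary does not split the request into two unrelated traces.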

Lessons from production (ADR 09)
After ADR 09, OpenAI and Anthropic clients emit openai.chat / anthropic.chat spans with otelhttp.NewTransport-based propagation, matching the existing Ollama instrumentation. Provider comparison is now possible in Jaeger as child spans of the agent turn.

Event Sourcing & CQRS — Order Projector

The same ecommerce.orders Kafka topic that drives saga state also feeds an order-projector consumer, which writes a denormalized read model into projectordb. Reads against the projection are independent of the OLTP write path — order-service owns the write schema, the projector owns the read schema, and the two evolve on different timelines.

The pay-off is operational: dashboard and reporting reads on the projection don't compete with checkout writes for primary-pool connections, and the projection's shape can be tuned to query patterns instead of transactional invariants. Trace context arrives on Kafka headers, so a single order's lifecycle (HTTP → order-service → Kafka → projector) renders as one trace in Jaeger. The projector reuses the same metric vocabulary as the analytics consumer (kafka_consumer_lag, aggregation latency), so a single dashboard panel covers both consumers.

Alerting — 16 Rules → Telegram

Five alert groups cover infrastructure through application layers. Symptom-based SLO alerts catch user-facing degradation before anything crashes. All alerts route to Telegram via Grafana unified alerting.

Infrastructure

4 rules

GPU exporter health, AI service readiness, GPU temperature and VRAM usage

Kubernetes Health

6 rules

OOM kills, pod restart storms, container memory pressure, node disk pressure, stuck deployments

Application SLOs

6 rules

HTTP error rate and p95 latency targets for Go AI, Go ecommerce, and Java gateway services

Streaming Analytics

1 rule

Kafka consumer lag monitoring across order, cart, and product view event topics

PostgreSQL

8 rules

Connection saturation, replication lag, deadlocks, and backup freshness, plus query-level latency ceiling, regression, slow-query rate, and auto_explain stalled rules

Lessons from production (ADR 08, ADR 10)
After ADR 08, a saga-order-stalled rule fires when saga_steps_total{step="PAYMENT_CREATED"} increases without a matching COMPLETED within 30 minutes. After ADR 10, every CI rollout posts a Grafana annotation tagged with namespace and short SHA, so dashboards mark the exact deploy preceding any metric change.

Database Query Performance — pg_stat_statements + auto_explain

System-level Postgres metrics (connections, cache hit, deadlocks, backup freshness) tell you the database is alive. They don’t tell you which queries are slow, drifting, or eating CPU. The shared Postgres 17 instance now preloads pg_stat_statements and auto_explain, exposing per-query latency, call rate, IO behavior, and full execution plans for anything over 500 ms.

Three independent paths feed Grafana. The postgres_exporter sidecar runs custom queries that export the top-50 statements as time-series metrics — pg_stat_statements_mean_exec_time, pg_stat_statements_calls_total, pg_stat_statements_shared_blks_read. A read-only grafana_reader role with the pg_monitor predefined role powers per-database PostgreSQL data sources for live SQL inspection. And auto_explain.log_format = json writes plans to Postgres logs — Promtail extracts them into Loki so plans render inline in a Grafana logs panel filtered by queryid.

{namespace="java-tasks", app="postgres"} |= "auto_explain" | json | query_id=~"$queryid"

A new PostgreSQL Query Performance dashboard ties it together: top-N tables by mean and total exec time, p95 latency per queryid, slow-query call rate, cache hit ratio, and a plan-viewer panel driven by a queryid template variable. Four new alerts cover the realistic failure modes: a hard >1 s ceiling, a regression rule that fires when the current mean is >2× its 7-day baseline, a slow-query rate spike rule, and an auto_explain stalled rule that catches misconfigurations before they hide a regression.

Lessons from production (Query Observability ADR)
Hard latency thresholds miss the realistic failure mode: a query that quietly drifts from 50 ms to 200 ms after a planner change. The 7-day baseline regression alert catches that; the hard ceiling catches genuinely terrible queries. Together they give the database a working measurement layer for the rest of the db-roadmap (replication, retention, vacuum tuning, partitioning) to build on.
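The interaction between the two rules is easy to check with arithmetic. The sketch below encodes the thresholds stated above (a 1 s ceiling and a 2× baseline factor) in a hypothetical `evaluate` function. The real rules are Grafana alert expressions over the exported metrics, not Go code.

```go
package main

import "fmt"

const (
	hardCeilingMs    = 1000.0 // hard ceiling: mean exec time above 1 s
	regressionFactor = 2.0    // regression: current mean above 2x the 7-day baseline
)

// evaluate reports which of the two rules would fire for a query.
func evaluate(currentMs, baselineMs float64) (ceiling, regression bool) {
	ceiling = currentMs > hardCeilingMs
	regression = baselineMs > 0 && currentMs > regressionFactor*baselineMs
	return ceiling, regression
}

func main() {
	c, r := evaluate(200, 50)
	fmt.Printf("drift 50ms -> 200ms: ceiling=%v regression=%v\n", c, r)

	c, r = evaluate(1500, 1400)
	fmt.Printf("steady 1.5s query:   ceiling=%v regression=%v\n", c, r)
}
```

The 50 ms to 200 ms drift is 4× its baseline, so only the regression rule fires. A query that has always taken about 1.5 s trips only the ceiling.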

Correlation — Connecting the Pillars

The real value of observability is connecting the pillars. Structured logging injects the OpenTelemetry traceID into every log line. Grafana’s derived fields on the Loki datasource turn those traceIDs into clickable Jaeger links. When an alert fires, the investigation path is: metric spike → filtered logs → distributed trace → root cause.
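A derived field boils down to a regular expression applied to each log line plus a URL template. The sketch below mimics that step in Go to show the mechanics. The regex, the `jaegerLink` name, and the base URL are assumptions; the actual configuration lives in the Grafana Loki datasource settings rather than in application code.

```go
package main

import (
	"fmt"
	"regexp"
)

// traceIDRE extracts the traceID value from a JSON log line.
var traceIDRE = regexp.MustCompile(`"traceID":"([0-9a-f]+)"`)

// jaegerLink builds a trace URL from a log line, if the line carries a trace ID.
func jaegerLink(baseURL, logLine string) (string, bool) {
	m := traceIDRE.FindStringSubmatch(logLine)
	if m == nil {
		return "", false
	}
	return baseURL + "/trace/" + m[1], true
}

func main() {
	line := `{"level":"ERROR","msg":"server error","traceID":"4bf92f3577b34da6a3ce929d0e0e4736"}`
	// The base URL is a placeholder for wherever the Jaeger UI is served.
	if link, ok := jaegerLink("http://jaeger.example", line); ok {
		fmt.Println(link)
	}
}
```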

What’s Next

Production maturity is a continuous process. Here’s what’s on the roadmap, pulled from the “Remaining gaps” sections of the recent ADRs.

In-app PostgreSQL query tracing

ADR 07

Manual OpenTelemetry spans around slow queries, now partly displaced by the pg_stat_statements + auto_explain layer described above. They remain useful for tying a slow-query span to the surrounding business operation in Jaeger, but are deferred until the database-side data exposes a query that warrants per-call attribution.

RabbitMQ queue depth metrics

ADR 07

Requires scraping the RabbitMQ management API. Saga DLQ depth alerts depend on it being shipped first.

Grafana dashboard for AI agent

ADR 09

A dedicated panel set for agent-loop debugging. Deferred until QA logs accumulate enough volume to know which queries are most useful.

Promtail trace_id field for Python

ADR 07

Pipeline change deployed but not yet verified end-to-end. Once verified, Python ingestion / chat / debug services join the unified traceID-in-logs experience.
