Three pillars. Sixteen alert rules. And four production incidents that taught us what was missing.
Every service in this portfolio is instrumented for metrics, logs, and traces. The journey from “monitoring deployed” to “production-debuggable” took real incidents — gRPC handshakes that hung silently, webhook 500s lost in middleware, agent loops that were black boxes, and Postgres connections we couldn’t attribute. Here’s what broke, what we shipped, and what we still want to close.
Four incidents in three days. Each exposed a gap, drove a concrete change, and shaped the system you see below.
"Order failed. Please try again." — and Loki had nothing useful for 45 minutes.
A gRPC mTLS handshake to payment-service hung silently for 30 seconds. The saga blocked, no metrics fired, and the only signal was a stuck order status. Discovering it required kubectl exec, openssl, and reading git diff HEAD~1 to realise CI had never rebuilt the fix.
A shared grpcmetrics client interceptor records grpc_client_requests_total and grpc_client_request_duration_seconds per target, and emits slog.ErrorContext on every non-OK result. All gRPC calls now have 30s context deadlines. Saga steps are timed via saga_step_duration_seconds. Build SHA is logged at startup so {app=...} | json | gitSHA=... answers 'is my fix deployed?' from Loki. CI image-change detection moved from HEAD~1 to HEAD~5.
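As a rough illustration of the pattern (not the shipped grpcmetrics code), the sketch below wraps a call with a deadline, records its outcome per target, and logs every failure through slog.ErrorContext. It uses only the standard library: callStats is an in-memory stand-in for the Prometheus counter and histogram, and the method name Charge, the code strings, and the 50 ms timeout are assumptions chosen so the demo finishes quickly. A real gRPC interceptor would apply the same steps inside the client call path.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	"sync"
	"time"
)

// callStats is an in-memory stand-in for the Prometheus counter and histogram.
type callStats struct {
	mu       sync.Mutex
	requests map[string]int           // key: target/method/code
	duration map[string]time.Duration // key: target, summed latency
}

func newCallStats() *callStats {
	return &callStats{requests: map[string]int{}, duration: map[string]time.Duration{}}
}

// instrumentedCall runs fn under a deadline, records the outcome, and logs failures.
func (s *callStats) instrumentedCall(ctx context.Context, target, method string,
	timeout time.Duration, fn func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	start := time.Now()
	err := fn(ctx)
	elapsed := time.Since(start)

	// Simplified status codes (assumption); real code would use gRPC status codes.
	code := "OK"
	switch {
	case errors.Is(err, context.DeadlineExceeded):
		code = "DEADLINE_EXCEEDED"
	case err != nil:
		code = "ERROR"
	}

	s.mu.Lock()
	s.requests[target+"/"+method+"/"+code]++
	s.duration[target] += elapsed
	s.mu.Unlock()

	if err != nil {
		slog.ErrorContext(ctx, "grpc client call failed",
			"target", target, "method", method, "code", code,
			"duration", elapsed, "error", err)
	}
	return err
}

// hangingHandshake simulates a peer that never answers; it returns only when ctx ends.
func hangingHandshake(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// simulateHang replays the incident with a short deadline instead of 30s.
func simulateHang() (*callStats, error) {
	stats := newCallStats()
	err := stats.instrumentedCall(context.Background(),
		"payment-service", "Charge", 50*time.Millisecond, hangingHandshake)
	return stats, err
}

func main() {
	stats, err := simulateHang()
	fmt.Println("error:", err)
	fmt.Println("counts:", stats.requests)
}
```

With the deadline in place, a hung handshake becomes a counted, logged failure rather than a silently blocked saga step.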
gRPC interceptor · saga step duration · build SHA in logs · CI HEAD~5
Customer completes Stripe checkout. Cart still full. No order confirmation. Loki shows zero ERROR logs for 24 hours.
The apperror.ErrorHandler() middleware silently converted AppError instances to JSON responses without logging — a webhook 500 vanished. QA and production also shared a RabbitMQ instance with identical queue names, so a QA clear.cart command was being consumed by the production cart-service.
Middleware now logs every 5xx AppError via slog.Error with code, message, status, and request ID before responding — silent server errors are no longer possible. QA runs on a dedicated RabbitMQ /qa vhost, fully isolating saga flow. A saga-order-stalled Grafana alert fires when saga_steps_total{step="PAYMENT_CREATED"} increases but neither COMPLETED nor COMPENSATION_COMPLETE does within 30 minutes.
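The middleware change can be sketched with the standard library alone. This is an illustration under assumptions, not the real apperror package: the AppError fields, the handlerFunc signature, the X-Request-ID header, and the /webhooks/stripe path are all hypothetical, and the production services use gin rather than plain net/http. What it shows is the ordering: the 5xx is logged before the JSON response is written.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"log/slog"
	"net/http"
	"net/http/httptest"
)

// AppError is a hypothetical stand-in for the service's own error type.
type AppError struct {
	Code    string
	Message string
	Status  int
}

func (e *AppError) Error() string { return e.Code + ": " + e.Message }

// handlerFunc is an HTTP handler that returns an error instead of writing it.
type handlerFunc func(w http.ResponseWriter, r *http.Request) error

// errorHandler converts errors to JSON responses and logs every 5xx first.
func errorHandler(logger *slog.Logger, next handlerFunc) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		err := next(w, r)
		if err == nil {
			return
		}
		var appErr *AppError
		if !errors.As(err, &appErr) {
			// Unknown errors become a generic 500; the cause is still logged below.
			appErr = &AppError{Code: "INTERNAL", Message: "internal server error",
				Status: http.StatusInternalServerError}
		}
		if appErr.Status >= 500 {
			logger.ErrorContext(r.Context(), "server error",
				"code", appErr.Code, "message", appErr.Message,
				"status", appErr.Status,
				"request_id", r.Header.Get("X-Request-ID"),
				"error", err.Error())
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(appErr.Status)
		_ = json.NewEncoder(w).Encode(map[string]string{
			"code": appErr.Code, "message": appErr.Message,
		})
	})
}

// simulateWebhook sends one failing request through the middleware and
// returns the HTTP status plus whatever was written to the log.
func simulateWebhook(status int) (int, string) {
	var logs bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&logs, nil))
	h := errorHandler(logger, func(w http.ResponseWriter, r *http.Request) error {
		return &AppError{Code: "WEBHOOK_FAILED", Message: "could not process event", Status: status}
	})
	req := httptest.NewRequest(http.MethodPost, "/webhooks/stripe", nil)
	req.Header.Set("X-Request-ID", "req-123")
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, logs.String()
}

func main() {
	status, logs := simulateWebhook(http.StatusInternalServerError)
	fmt.Println("status:", status)
	fmt.Print(logs)
}
```

In this sketch only 5xx responses are logged at error level; client errors such as a 404 pass through without a log line.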
5xx middleware logging · RabbitMQ /qa vhost · saga-order-stalled alert · webhook event-type panel
Loki was deployed and ai-service was emitting JSON logs, but only 4 `slog` calls existed in the entire codebase. A failed agent request gave you 'turn started' and 'turn ended' with nothing in between.
The agent loop made 3-8 LLM roundtrips per request, each potentially triggering 1-N tool calls. When something went wrong the question was always 'which step failed, and why?' — and the only answer was 'add print statements and redeploy.' The OpenAI and Anthropic clients didn't emit OTel spans, so provider comparison wasn't possible in Jaeger.
Six-layer structured logging covers HTTP handler, agent loop, LLM clients, cache, guardrails, and tools. All agent-loop logs use slog.InfoContext(ctx, ...) so tracing.NewLogHandler() injects the OTel traceID into every record. OpenAI and Anthropic clients now emit openai.chat / anthropic.chat spans with token attributes — provider comparison is visible in Jaeger. A single Loki query ({app="ai-service"} | json | traceID="...") shows the complete request lifecycle.
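One way to picture the traceID injection is a slog.Handler wrapper that reads the ID from the context and adds it to each record. The sketch below is not tracing.NewLogHandler() itself: it uses only the standard library, so the trace ID comes from a hypothetical context key (traceIDKey) instead of the OTel span context, and the log message and tool name are made up for the demo.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
)

// traceIDKey is a hypothetical context key; real code would read the ID
// from the OpenTelemetry span context instead.
type traceIDKey struct{}

func withTraceID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, traceIDKey{}, id)
}

// traceHandler wraps another slog.Handler and adds traceID to each record.
type traceHandler struct{ slog.Handler }

func (h traceHandler) Handle(ctx context.Context, r slog.Record) error {
	if id, ok := ctx.Value(traceIDKey{}).(string); ok && id != "" {
		r = r.Clone() // do not mutate state shared with the caller's record
		r.AddAttrs(slog.String("traceID", id))
	}
	return h.Handler.Handle(ctx, r)
}

// WithAttrs and WithGroup keep the wrapper in place after logger.With(...).
func (h traceHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return traceHandler{Handler: h.Handler.WithAttrs(attrs)}
}

func (h traceHandler) WithGroup(name string) slog.Handler {
	return traceHandler{Handler: h.Handler.WithGroup(name)}
}

// logTurn emits one agent-loop log line and returns it decoded from JSON.
func logTurn(traceID string) map[string]any {
	var buf bytes.Buffer
	logger := slog.New(traceHandler{Handler: slog.NewJSONHandler(&buf, nil)})

	ctx := context.Background()
	if traceID != "" {
		ctx = withTraceID(ctx, traceID)
	}
	logger.InfoContext(ctx, "tool call finished", "layer", "tools", "tool", "search_products")

	rec := map[string]any{}
	if err := json.Unmarshal(buf.Bytes(), &rec); err != nil {
		panic(err)
	}
	return rec
}

func main() {
	rec := logTurn("4bf92f3577b34da6a3ce929d0e0e4736")
	fmt.Println("traceID:", rec["traceID"], "msg:", rec["msg"])
}
```

Because the ID is attached in the handler, call sites only need to pass ctx; a log call made without a trace in the context simply omits the field.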
6-layer slog · OTel span parity · traceID-in-logs · truncation discipline
During a Postgres WAL corruption incident, three observability gaps made diagnosis harder than necessary. We couldn't tell which deploy preceded the metric change. K8s Warning events expired before we looked. And `pg_stat_activity` showed every connection as the same `taskuser`, with no way to tell which service owned them.
Deploy timestamps had to be reconstructed from kubectl get events. K8s Warning events (OOM kills, probe failures, evictions) lived 1 hour and weren't queryable from Grafana. All six Go services shared the same Postgres credentials — a connection leak in one was indistinguishable from normal load across all.
Every CI rollout posts a Grafana annotation tagged with namespace + short SHA via /api/annotations, with anonymous-Viewer auth preserved for public dashboard viewing. kubernetes-event-exporter (resmoio fork) ships Warning-only events into Loki under {job="kube-event-exporter"} with namespace/reason/kind/name labels. Every Go service's DATABASE_URL includes application_name=<service-name>, so a "Connections by Service" dashboard panel attributes every Postgres connection in seconds.
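The application_name change is small enough to sketch in full. The helper below is an assumption about how a service might set the parameter at startup (the services may simply hard-code it in their manifests); the host, password, and database name in the example DSN are placeholders.

```go
package main

import (
	"fmt"
	"net/url"
)

// withApplicationName returns dsn with application_name set to service,
// preserving any other query parameters already present.
func withApplicationName(dsn, service string) (string, error) {
	u, err := url.Parse(dsn)
	if err != nil {
		return "", fmt.Errorf("parse DATABASE_URL: %w", err)
	}
	q := u.Query()
	q.Set("application_name", service)
	u.RawQuery = q.Encode() // Encode sorts parameters by key
	return u.String(), nil
}

func main() {
	// Placeholder DSN: host, password, and database are illustrative only.
	dsn, err := withApplicationName(
		"postgres://taskuser:secret@postgres:5432/orders?sslmode=disable",
		"order-service",
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(dsn)
}
```

Once every service connects this way, grouping pg_stat_activity by its application_name column separates connections that previously all appeared as taskuser.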
CI deploy annotations · K8s events → Loki · application_name in DSN · connection attribution panel
Three language stacks feed three collection pipelines. Everything converges in Grafana for unified dashboards and alerting.
Prometheus scrapes every pod annotated with prometheus.io/scrape: "true" on a 15-second interval. Infrastructure exporters provide cluster and hardware visibility: kube-state-metrics for pod status and deployment health, node-exporter for CPU, memory, and disk, and a GPU exporter for NVIDIA utilization and temperature.
http_requests_total · http_request_duration_seconds · kafka_consumer_lag · container_memory_working_set_bytes · go_goroutines · nvidia_smi_temperature_gpu
grpc_client_request_duration_seconds{target=...} and grpc_client_requests_total with target/method/code labels. After ADR 10, every PostgreSQL connection is attributed via application_name= in the DATABASE_URL, surfaced in a Grafana “Connections by Service” panel.
Promtail runs as a DaemonSet on every node, tailing container logs from /var/log/pods/. Go services emit structured JSON via slog with a custom handler that injects the OpenTelemetry traceID into every log line. Java services use logstash-logback-encoder for the same JSON output. Loki indexes only labels (namespace, pod, level), keeping storage efficient on a single-node cluster.
{namespace="go-ecommerce"} | json | level="error"
apperror middleware logs every 5xx AppError, so silent server errors are no longer possible. After ADR 09, the AI agent loop emits structured logs at six layers (HTTP, agent, LLM client, cache, guardrails, tools), all with the OTel traceID injected, so {app="ai-service"} | json | traceID="..." returns the full request lifecycle.
The Go services are instrumented with the OpenTelemetry SDK. otelgin middleware auto-instruments HTTP handlers, and otelhttp propagates W3C traceparent headers on outbound calls. Trace context also flows through Kafka message headers, so a single request can be traced from the HTTP gateway through ecommerce processing, across an async Kafka boundary, to the analytics consumer.
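To make the Kafka hop concrete, here is a standard-library sketch of carrying a W3C traceparent value in message headers. In the real services the OpenTelemetry propagator does this work; the header struct and both function names are hypothetical, and validation is limited to the field lengths of version 00 of the format.

```go
package main

import (
	"fmt"
	"strings"
)

// header mirrors the usual shape of a Kafka record header: a key plus raw bytes.
type header struct {
	Key   string
	Value []byte
}

// injectTraceparent appends a version-00 traceparent header:
// 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>.
func injectTraceparent(headers []header, traceID, spanID string, sampled bool) []header {
	flags := "00"
	if sampled {
		flags = "01"
	}
	value := "00-" + traceID + "-" + spanID + "-" + flags
	return append(headers, header{Key: "traceparent", Value: []byte(value)})
}

// extractTraceparent finds the traceparent header and returns its IDs.
// It checks field lengths only; a full parser would also validate hex digits.
func extractTraceparent(headers []header) (traceID, spanID string, ok bool) {
	for _, h := range headers {
		if h.Key != "traceparent" {
			continue
		}
		parts := strings.Split(string(h.Value), "-")
		if len(parts) != 4 || parts[0] != "00" ||
			len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
			return "", "", false
		}
		return parts[1], parts[2], true
	}
	return "", "", false
}

func main() {
	// Producer side: attach the current span's identity to the message.
	headers := injectTraceparent(nil,
		"4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true)

	// Consumer side: recover the trace ID so the new span joins the same trace.
	traceID, parentSpan, ok := extractTraceparent(headers)
	fmt.Println(traceID, parentSpan, ok)
}
```

The consumer starts its span as a child of the extracted parent, which is what lets one trace cross the asynchronous boundary.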
openai.chat / anthropic.chat spans with otelhttp.NewTransport-based propagation, matching the existing Ollama instrumentation. Provider comparison is now possible in Jaeger as child spans of the agent turn.
The same ecommerce.orders Kafka topic that drives saga state also feeds an order-projector consumer, which writes a denormalized read model into projectordb. Reads against the projection are independent of the OLTP write path: order-service owns the write schema, the projector owns the read schema, and the two evolve on different timelines.
The pay-off is operational: dashboard and reporting reads on the projection don't compete with checkout writes for primary-pool connections, and the projection's shape can be tuned to query patterns instead of transactional invariants. Trace context arrives on Kafka headers, so a single order's lifecycle (HTTP → order-service → Kafka → projector) renders as one trace in Jaeger. The projector reuses the same metric vocabulary as the analytics consumer (kafka_consumer_lag, aggregation latency), so a single dashboard panel covers both consumers.
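The projector's core loop can be pictured as folding events into one denormalized row per order. The sketch below is illustrative only: the OrderEvent and OrderView shapes are assumptions, an in-memory map stands in for projectordb, and the offset check assumes all events for one order arrive on the same partition.

```go
package main

import "fmt"

// OrderEvent is a hypothetical shape for messages on ecommerce.orders.
type OrderEvent struct {
	OrderID    string
	Type       string // e.g. "ORDER_CREATED", "PAYMENT_CREATED", "COMPLETED"
	TotalCents int64  // zero when the event carries no total
	Offset     int64  // Kafka offset, used to skip replays
}

// OrderView is one denormalized row in the read model.
type OrderView struct {
	OrderID    string
	Status     string
	TotalCents int64
	LastOffset int64
}

// Projection is an in-memory stand-in for the projectordb table.
type Projection struct {
	rows map[string]OrderView
}

func newProjection() *Projection {
	return &Projection{rows: map[string]OrderView{}}
}

// Apply folds one event into the read model. Events at or below the last
// applied offset are ignored, so redelivery after a restart is harmless.
func (p *Projection) Apply(ev OrderEvent) {
	row, seen := p.rows[ev.OrderID]
	if seen && ev.Offset <= row.LastOffset {
		return
	}
	row.OrderID = ev.OrderID
	row.Status = ev.Type
	if ev.TotalCents != 0 {
		row.TotalCents = ev.TotalCents
	}
	row.LastOffset = ev.Offset
	p.rows[ev.OrderID] = row
}

// Get serves a read without touching the write-side schema.
func (p *Projection) Get(orderID string) (OrderView, bool) {
	row, ok := p.rows[orderID]
	return row, ok
}

func main() {
	p := newProjection()
	p.Apply(OrderEvent{OrderID: "o-1", Type: "ORDER_CREATED", TotalCents: 4200, Offset: 10})
	p.Apply(OrderEvent{OrderID: "o-1", Type: "PAYMENT_CREATED", Offset: 11})
	p.Apply(OrderEvent{OrderID: "o-1", Type: "ORDER_CREATED", TotalCents: 4200, Offset: 10}) // replay

	row, _ := p.Get("o-1")
	fmt.Printf("%+v\n", row)
}
```

Keeping Apply idempotent is what allows the read model to be rebuilt by replaying the topic.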
Four alert groups cover infrastructure through application layers. Symptom-based SLO alerts catch user-facing degradation before anything crashes. All alerts route to Telegram via Grafana unified alerting.
GPU exporter health, AI service readiness, GPU temperature and VRAM usage
OOM kills, pod restart storms, container memory pressure, node disk pressure, stuck deployments
HTTP error rate and p95 latency targets for Go AI, Go ecommerce, and Java gateway services
Kafka consumer lag monitoring across order, cart, and product view event topics
Connection saturation, replication lag, deadlocks, and backup freshness, plus query-level signals: latency ceiling, regression, slow-query rate, and stalled auto_explain
saga-order-stalled rule fires when saga_steps_total{step="PAYMENT_CREATED"} increases without a matching COMPLETED within 30 minutes. After ADR 10, every CI rollout posts a Grafana annotation tagged with namespace and short SHA, so dashboards mark the exact deploy preceding any metric change.
System-level Postgres metrics (connections, cache hit, deadlocks, backup freshness) tell you the database is alive. They don’t tell you which queries are slow, drifting, or eating CPU. The shared Postgres 17 instance now preloads pg_stat_statements and auto_explain, exposing per-query latency, call rate, IO behavior, and full execution plans for anything over 500 ms.
Three independent paths feed Grafana. The postgres_exporter sidecar runs custom queries that export the top-50 statements as time-series metrics — pg_stat_statements_mean_exec_time, pg_stat_statements_calls_total, pg_stat_statements_shared_blks_read. A read-only grafana_reader role with the pg_monitor predefined role powers per-database PostgreSQL data sources for live SQL inspection. And auto_explain.log_format = json writes plans to Postgres logs — Promtail extracts them into Loki so plans render inline in a Grafana logs panel filtered by queryid.
{namespace="java-tasks", app="postgres"} |= "auto_explain" | json | query_id=~"$queryid"
A new PostgreSQL Query Performance dashboard ties it together: top-N tables by mean and total exec time, p95 latency per queryid, slow-query call rate, cache hit ratio, and a plan-viewer panel driven by a queryid template variable. Four new alerts cover the realistic failure modes: a hard >1 s ceiling, a regression rule that fires when current mean is >2× its 7-day baseline, a slow-query rate spike rule, and an auto_explain stalled rule that catches misconfigurations before they hide a regression.
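The arithmetic behind the first two rules is simple enough to spell out. The real rules are Grafana alert definitions over the exported metrics; the Go sketch below only mirrors their thresholds, and the struct, field names, and rule labels are hypothetical. Times are in milliseconds, the unit pg_stat_statements uses for mean_exec_time.

```go
package main

import "fmt"

// queryStat holds the two inputs the rules compare, in milliseconds.
type queryStat struct {
	QueryID        string
	CurrentMeanMs  float64 // recent mean execution time
	BaselineMeanMs float64 // mean over the trailing 7 days
}

// evaluate returns the names of the rules a statement would trip.
func evaluate(s queryStat) []string {
	var fired []string
	// Hard ceiling: any statement averaging more than 1 s.
	if s.CurrentMeanMs > 1000 {
		fired = append(fired, "hard-ceiling")
	}
	// Regression: more than 2x the 7-day baseline. A zero baseline means
	// there is no history yet, so the ratio is skipped.
	if s.BaselineMeanMs > 0 && s.CurrentMeanMs > 2*s.BaselineMeanMs {
		fired = append(fired, "regression")
	}
	return fired
}

func main() {
	stats := []queryStat{
		{QueryID: "q1", CurrentMeanMs: 1500, BaselineMeanMs: 400}, // slow and regressed
		{QueryID: "q2", CurrentMeanMs: 300, BaselineMeanMs: 100},  // fast but regressed
		{QueryID: "q3", CurrentMeanMs: 150, BaselineMeanMs: 100},  // within bounds
	}
	for _, s := range stats {
		fmt.Println(s.QueryID, evaluate(s))
	}
}
```

The second case is the one the regression rule exists for: a query that tripled in cost while staying well under the absolute ceiling.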
db-roadmap (replication, retention, vacuum tuning, partitioning) to build on.
The real value of observability is connecting the pillars. Structured logging injects the OpenTelemetry traceID into every log line. Grafana’s derived fields on the Loki datasource turn those traceIDs into clickable Jaeger links. When an alert fires, the investigation path is: metric spike → filtered logs → distributed trace → root cause.
Production maturity is a continuous process. Here’s what’s on the roadmap, pulled from the “Remaining gaps” sections of the recent ADRs.
Manual OpenTelemetry spans around slow queries — partly displaced by the pg_stat_statements + auto_explain layer below. Still useful for tying a slow query span to the surrounding business operation in Jaeger; deferred until the database-side data exposes a query that warrants per-call attribution.
Requires scraping the RabbitMQ management API. Saga DLQ depth alerts depend on it being shipped first.
A dedicated panel set for agent-loop debugging. Deferred until QA logs accumulate enough volume to know which queries are most useful.
Pipeline change deployed but not yet verified end-to-end. Once verified, Python ingestion / chat / debug services join the unified traceID-in-logs experience.