Building intelligent systems with retrieval-augmented generation, agentic architectures, and Model Context Protocol (MCP) servers that any AI client can call. This section demonstrates an MCP server fronting twelve tools, RAG pipelines with evaluation, vector search, and tool-using agents — built with FastAPI, Qdrant, Ollama, and Go, deployed on Kubernetes.
Prometheus scrapes every AI service and streams metrics to a live Grafana dashboard.
MCP Server
The Go ai-service exposes twelve tools to any Model Context Protocol client over HTTPS. Built on the official modelcontextprotocol/go-sdk, it fronts both the ecommerce backend (REST + gRPC) and a Python RAG pipeline (HTTP, circuit breaker, OTel trace propagation). Authentication is optional: catalog and knowledge-base tools work anonymously; cart, order, and return tools require a Bearer JWT.
Architecture
External MCP clients connect over the Streamable HTTP transport, served over HTTPS. The Go server checks the Bearer JWT where a tool requires one, then routes each tool call to either the ecommerce backend or the Python RAG bridge. The bridge wraps its HTTP calls in a circuit breaker with a 30-second timeout and propagates OTel trace context across the language boundary.
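The circuit-breaker behavior of the bridge can be sketched as follows. This is a minimal illustration in Python, not the Go service's code: the class name, the failure threshold, and the cool-down period are all assumptions, and the 30-second timeout from the text would live in the wrapped HTTP call rather than in the breaker itself.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    # failure_threshold and reset_after are illustrative defaults,
    # not values taken from the real service.
    def __init__(self, failure_threshold=3, reset_after=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast instead of hitting the backend.
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In use, the wrapped function would be the HTTP request to the RAG pipeline, made with a 30-second client timeout so that a hung request counts as a failure.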
Tool Catalog
The MCP server exposes twelve tools across four domains. Catalog and knowledge-base tools are public; order, cart, and return tools require a Bearer JWT. Knowledge-base tools call the Python RAG pipeline through a circuit-breaker HTTP bridge with a 30-second timeout. Checkout (place_order) is deliberately excluded — the agent can advise but not transact.
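The public versus auth-scoped split can be pictured as a small registry check. The sketch below is hypothetical throughout: apart from `list_collections` and the excluded `place_order`, the tool names are invented for illustration, and the check only tests that a token is present, whereas a real server would verify the JWT signature and claims.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    domain: str
    requires_auth: bool


class ToolAccessError(Exception):
    """Raised when a tool is unknown or called without required credentials."""


# Illustrative subset of a registry. Only list_collections is named in the text;
# search_products and view_cart are hypothetical placeholders.
REGISTRY = {
    tool.name: tool
    for tool in [
        Tool("search_products", "catalog", requires_auth=False),
        Tool("list_collections", "knowledge_base", requires_auth=False),
        Tool("view_cart", "cart", requires_auth=True),
    ]
}
# place_order is intentionally absent, so it can never be dispatched.


def authorize(tool_name, bearer_token=None):
    """Return the tool if the caller may invoke it, else raise ToolAccessError."""
    tool = REGISTRY.get(tool_name)
    if tool is None:
        raise ToolAccessError(f"unknown tool: {tool_name}")
    # Presence check only; signature and expiry validation are out of scope here.
    if tool.requires_auth and not bearer_token:
        raise ToolAccessError(f"{tool_name} requires a Bearer JWT")
    return tool
```

Leaving checkout out of the registry, rather than registering it behind a flag, means there is no code path by which the agent can reach it.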
Internal MCPs for Engineering Operations
These MCPs are Codex-facing control surfaces over larger engineering systems. The route stays focused on what the MCP layer adds: bounded tool access, evidence gathering, repeatable workflows, and links into the deeper platforms behind each service.
Read-only production evidence
Observability MCP
A local MCP endpoint over the metrics, logs, traces, dashboard, and runbook layers. It turns operational questions into bounded evidence bundles for system health, checkout incidents, AI pipeline failures, eval runs, streaming analytics, service logs, and trace lookup without mutating the cluster.
An operator MCP for repeatable RAG evaluation work. It coordinates evaluator datasets, run history, study records, conclusions, collection metadata from ingestion, retrieval settings, ranking comparisons, worst-case review, and the eval-run evidence handoff into the observability control plane.
A compact internal MCP for structured practice sessions, expected answer feedback, weak-topic tracking, review attempts, and scoring. It keeps interview preparation in the same tool-call workflow without competing with the production observability and RAG eval control planes.
Example tool-call trace
Why did checkout stall in QA?
observability.investigate_checkout
observability.get_trace
Payment was created, but the order saga never observed completion. The trace points to a RabbitMQ reply timeout.
Why this matters
Turns production questions into bounded evidence requests.
Keeps practice feedback and weak-topic tracking in the same workflow.
Makes RAG changes measurable before they are treated as improvements.
Try it interactively
The same tool registry powers an in-browser agent loop in the Go section. The agent runs Qwen 2.5 14B locally and streams tool calls and results live.
The MCP server is publicly reachable at https://api.kylebradshaw.dev/ai-api/mcp. Public tools (catalog search, RAG search, list_collections) work without auth. Auth-scoped tools require a Bearer JWT — register at /go/register, log in, and copy the access token from the Authorization header in DevTools.
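For clients without a built-in MCP integration, a tool call is a JSON-RPC 2.0 request with the method `tools/call`. The sketch below only builds such a request with the standard library and does not send it. It assumes `list_collections` accepts an empty argument object, and it skips the `initialize` handshake (and any session header) that an MCP session normally begins with, so an MCP SDK client is the more realistic choice in practice.

```python
import json
import urllib.request

MCP_URL = "https://api.kylebradshaw.dev/ai-api/mcp"


def build_tool_call(name, arguments, request_id=1, token=None):
    """Build (but do not send) a JSON-RPC tools/call request for the MCP endpoint."""
    body = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    headers = {
        "Content-Type": "application/json",
        # Streamable HTTP servers may answer with plain JSON or an SSE stream.
        "Accept": "application/json, text/event-stream",
    }
    if token:
        # Only needed for the auth-scoped tools.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        MCP_URL,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Public tool: no token required. The empty arguments object is an assumption.
request = build_tool_call("list_collections", {})
```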
Claude Desktop
Add to ~/Library/Application Support/Claude/claude_desktop_config.json:
A measurement tool for systematically tracking RAG pipeline quality. Create golden datasets with expected answers, run first-party RAG quality evaluations against the live pipeline, and view scorecards with per-query breakdowns — faithfulness, answer relevancy, context precision, and context recall.
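To make the retrieval metrics concrete, here is a deliberately simplified sketch of context precision and context recall with a per-query scorecard. It swaps the LLM judge for labeled chunk IDs, so it illustrates the shape of the calculation only; the function names and row layout are assumptions, and faithfulness and answer relevancy are not covered because they need a judge model.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are labeled relevant."""
    if not retrieved:
        return 0.0
    relevant = set(relevant)
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved, relevant):
    """Fraction of labeled-relevant chunks that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)


def scorecard(rows):
    """Score each golden-dataset row and average the metrics across queries."""
    per_query = []
    for row in rows:
        per_query.append({
            "query": row["query"],
            "context_precision": context_precision(row["retrieved"], row["relevant"]),
            "context_recall": context_recall(row["retrieved"], row["relevant"]),
        })
    count = len(per_query) or 1  # avoid dividing by zero on an empty dataset
    summary = {
        metric: sum(entry[metric] for entry in per_query) / count
        for metric in ("context_precision", "context_recall")
    }
    return {"per_query": per_query, "summary": summary}
```

Keeping the per-query rows alongside the averages is what makes worst-case review possible: a healthy mean can hide a handful of queries with zero recall.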
What It Demonstrates
First-party RAG quality evaluator with LLM-judged metrics
A full-stack Retrieval-Augmented Generation (RAG) application that lets users upload PDF documents and ask questions about their content. The system parses, chunks, and embeds documents into a vector database, then retrieves relevant context to generate accurate, grounded answers using a local LLM.
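The chunk, embed, and retrieve steps can be sketched without any external services. The example below is a toy under stated assumptions: word-count vectors stand in for the real embedding model, an in-memory list stands in for the vector database, and the chunk size and overlap are arbitrary values rather than the application's settings.

```python
import math
from collections import Counter


def chunk_words(text, size=50, overlap=10):
    """Split text into word windows of `size` that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text):
    """Stand-in embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, toy_embed(c)), reverse=True)
    return ranked[:k]
```

The retrieved chunks would then be placed into the prompt as context, which is what keeps the generated answer grounded in the uploaded document.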
Tech Stack
FastAPI microservices (ingestion + chat)
Qdrant vector database
Ollama with Mistral 7B (chat) and nomic-embed-text (embeddings)
An agentic debugging tool that indexes a Python codebase into a vector store and uses a ReAct-style agent loop to search the code, retrieve relevant context, and stream a grounded diagnosis of the described bug.
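A ReAct-style loop alternates between a model step (thought plus either an action or a final answer) and a tool observation fed back into the context. The sketch below shows that control flow only. The step dictionary format, the `run_agent` name, and the idea of passing the model in as a plain callable are assumptions made so the example runs without a local LLM.

```python
def run_agent(question, llm, tools, max_steps=5):
    """Run a think/act/observe loop until the model gives a final answer.

    `llm` is any callable that takes the transcript so far and returns either
    {"thought": ..., "action": ..., "input": ...} or {"thought": ..., "final": ...}.
    """
    transcript = [("question", question)]
    for _ in range(max_steps):
        step = llm(transcript)
        transcript.append(("thought", step["thought"]))
        if "final" in step:
            transcript.append(("final", step["final"]))
            return step["final"], transcript
        transcript.append(("action", step["action"]))
        tool = tools.get(step["action"])
        if tool is None:
            # Report the mistake back to the model instead of crashing the loop.
            observation = f"unknown tool: {step['action']}"
        else:
            observation = tool(step["input"])
        transcript.append(("observation", observation))
    # Step budget exhausted without a final answer.
    return None, transcript
```

The `max_steps` cap matters with a local model: without it, a model that never emits a final answer would loop on tool calls indefinitely.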
Tech Stack
FastAPI debug service (index + agent endpoints)
Qdrant vector database (per-project collections)
Ollama with Qwen 2.5 14B (agent reasoning) and nomic-embed-text (embeddings)
LangChain Python-aware text splitter
SSE streaming for real-time agent event output
Minikube Kubernetes deployment (production)
What It Demonstrates
Agentic tool-use loop (ReAct pattern) with a local LLM
Language-aware code chunking for higher-quality retrieval
Named SSE events for streaming structured agent state
Per-request Qdrant collections for isolated debug sessions
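Named SSE events let the browser route each kind of agent update to its own handler. A minimal sketch of the wire format follows; the event names (`tool_call`, `observation`, `diagnosis`) and payload fields are hypothetical, not the service's actual schema.

```python
import json


def sse_event(event, data, event_id=None):
    """Format one named Server-Sent Event with a JSON payload."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # json.dumps emits a single line by default, so one data field is enough.
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def stream_agent_events(steps):
    """Yield SSE frames for a sequence of (event_name, payload) pairs."""
    for index, (event, payload) in enumerate(steps):
        yield sse_event(event, payload, event_id=index)
```

On the client, `EventSource.addEventListener("tool_call", handler)` would receive only the frames with that event name. On the server, a generator like this could be returned through a streaming response with the `text/event-stream` media type.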