AI Engineering Copilot Infrastructure
Production-grade AI assistant for software engineers. 6-step deterministic workflow: classify intent → detect libraries (MCP) → fetch real docs (Context7 via MCP) → build grounded prompt → GPT-4o-mini → validate output.
Role
AI Infrastructure Engineer
Team
Solo
Company/Organization
Personal Project
The Problem
LLMs answering technical questions hallucinate API signatures, deprecated configuration options, and non-existent methods because answers derive from static training data rather than current library documentation.
No deterministic quality gate on LLM output — empty responses, answers that are too short, or responses containing hallucination signals all reach the user unchecked.
Every identical technical question triggers a fresh OpenAI API call — no caching, so repeated questions waste tokens and add unnecessary latency.
Tool execution (library detection, documentation fetching) was tightly coupled to the main API process, making it impossible to scale or replace tools independently.
No per-request observability — without a trace_id, correlating logs across the classify → detect → fetch → generate → validate pipeline was impossible.
The Solution
Built a cleanly layered AI infrastructure with a deterministic pipeline engine, a separate MCP tool server, and a grounded prompting strategy.
6-Step Workflow Pipeline (ai_copilot_infra/workflows/)
base.py — `WorkflowStep` abstract base class with `execute(state: WorkflowState) -> WorkflowState` interface. `StepPipeline` runs steps sequentially, passing the shared state through each one.
state.py — `WorkflowState` dataclass: query, trace_id, intent, detected_libraries[], fetched_docs{}, grounded_prompt, llm_response,...
copilot_workflow.py — Instantiates and chains 6 steps:
1. `IntentClassificationStep` — Classifies query as debug/how-to/config/concept. Sets state.intent.
2. `LibraryDetectionStep` — Calls MCP server `library_detection_tool` with the raw query. Returns the list of detected libraries (e.g., ['Redis', ...]). Sets state.detected_libraries.
3. `DocumentationFetchStep` — For each detected library, calls MCP server `documentation_fetch_tool` → Context7 API. Aggregates real docs per library into state.fetched_docs.
4. `PromptBuildingStep` — Constructs grounded prompt: system message + retrieved docs as context + user query. Ensures the answer is grounded in actual documentation rather than training data.
5. `LLMGenerationStep` — Calls llm_service.py (async OpenAI wrapper, GPT-4o-mini, temperature 0.2 for factual accuracy). Sets state.llm_response.
6. `ValidationStep` — Runs validation.py checks: empty response check, minimum length check (< 50 chars = invalid), hallucination signal detection (common hedging phrases).
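A minimal sketch of the engine described above. The `WorkflowState` fields and step names mirror the writeup; the classification heuristic and keyword table here are illustrative stand-ins, not the project's actual logic.

```python
# Sketch of the WorkflowStep / StepPipeline engine. The intent heuristic and
# library keyword table are illustrative, not the project's real implementations.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class WorkflowState:
    query: str
    trace_id: str = ""
    intent: str = ""
    detected_libraries: list[str] = field(default_factory=list)
    fetched_docs: dict[str, str] = field(default_factory=dict)
    grounded_prompt: str = ""
    llm_response: str = ""


class WorkflowStep(ABC):
    @abstractmethod
    def execute(self, state: WorkflowState) -> WorkflowState: ...


class IntentClassificationStep(WorkflowStep):
    def execute(self, state: WorkflowState) -> WorkflowState:
        # Toy heuristic standing in for the real classifier.
        state.intent = "debug" if "error" in state.query.lower() else "how-to"
        return state


class LibraryDetectionStep(WorkflowStep):
    KNOWN = {"redis": "Redis", "celery": "Celery", "docker": "Docker",
             "fastapi": "FastAPI", "sqlalchemy": "SQLAlchemy"}

    def execute(self, state: WorkflowState) -> WorkflowState:
        q = state.query.lower()
        state.detected_libraries = [name for key, name in self.KNOWN.items() if key in q]
        return state


class StepPipeline:
    def __init__(self, steps: list[WorkflowStep]):
        self.steps = steps

    def run(self, state: WorkflowState) -> WorkflowState:
        for step in self.steps:  # each step reads and writes the shared state
            state = step.execute(state)
        return state
```

Because every step shares the same `execute(state) -> state` contract, reordering or inserting steps is just editing the list passed to `StepPipeline`.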
MCP Tool Server (ai_copilot_infra/mcp_server/)
Separate FastAPI application running on :8100 — independently deployable and scalable.
base.py — `BaseTool` ABC with `name`, `description`, `execute(input: dict) -> dict` interface.
registry.py — `ToolRegistry` maps tool names to `BaseTool` instances. `POST /tools/{tool_name}` dispatches to registered tool.
library_detection_tool.py — Parses query for known library/framework keywords (Redis, Celery, Docker, FastAPI, SQLAlchemy, etc.) using pattern matching.
documentation_fetch_tool.py — Takes library name, calls Context7 API (context7_client.py async HTTP client) to fetch current documentation for that library.
tools.py — Registers default tools into ToolRegistry on startup.
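The tool-server pattern can be sketched as follows. The keyword table is illustrative, and the HTTP dispatch route is shown only as a comment; the real server wires this through FastAPI.

```python
# Sketch of the BaseTool ABC + ToolRegistry pattern. The keyword table is
# illustrative; in the real server, POST /tools/{tool_name} calls dispatch().
from abc import ABC, abstractmethod


class BaseTool(ABC):
    name: str
    description: str

    @abstractmethod
    def execute(self, input: dict) -> dict: ...


class LibraryDetectionTool(BaseTool):
    name = "library_detection_tool"
    description = "Detect known libraries mentioned in a query."
    KNOWN = {"redis": "Redis", "celery": "Celery", "docker": "Docker",
             "fastapi": "FastAPI", "sqlalchemy": "SQLAlchemy"}

    def execute(self, input: dict) -> dict:
        query = input["query"].lower()
        return {"libraries": [name for key, name in self.KNOWN.items() if key in query]}


class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, BaseTool] = {}

    def register(self, tool: BaseTool) -> None:
        self._tools[tool.name] = tool

    def dispatch(self, tool_name: str, input: dict) -> dict:
        # The FastAPI route resolves {tool_name} from the URL path to this lookup.
        if tool_name not in self._tools:
            raise KeyError(f"unknown tool: {tool_name}")
        return self._tools[tool_name].execute(input)
```

Adding a tool is a new `BaseTool` subclass plus one `register()` call at startup; the API process never changes.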
Core Infrastructure (ai_copilot_infra/core/)
llm_service.py — Async OpenAI client wrapper. `generate(prompt: str) -> str`. Temperature 0.2, GPT-4o-mini model, max_tokens 1024. Reads OPENAI_API_KEY from config.
mcp_client.py — Async HTTP client for MCP server. `call_tool(tool_name: str, input: dict) -> dict`. Uses httpx with connection pooling.
redis_service.py — Async Redis operations: `get(key)`, `set(key, value, ttl)`, `rate_limit_check(ip, limit=20, window=60)`. Cache key = SHA-256 of the query text.
validation.py — `OutputValidator.validate(response: str) -> ValidationResult`. Checks: not empty, length >= 50 chars, no hallucination phrases.
config.py — Pydantic settings: OPENAI_API_KEY, REDIS_URL, MCP_BASE_URL, CONTEXT7_BASE_URL, CONTEXT7_API_KEY, LOG_FORMAT.
dependencies.py — FastAPI DI providers: `get_redis()`, `get_mcp_client()`, `get_llm_service()`.
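The caching and rate-limiting logic in redis_service.py can be sketched with an in-memory dict standing in for Redis, so the behavior is visible without a running server. The `rate_limit_check` name and defaults follow the writeup; the storage backend here is a stand-in.

```python
# Sketch of the redis_service cache-key derivation and sliding-window rate
# limiter. A plain dict replaces Redis here purely for illustration.
import hashlib
import time


def cache_key(query: str) -> str:
    # Identical queries always map to the same key: SHA-256 of the query text.
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


class InMemoryRateLimiter:
    """Sliding window: allow at most `limit` requests per `window` seconds per IP."""

    def __init__(self, limit: int = 20, window: int = 60):
        self.limit, self.window = limit, window
        self._hits: dict[str, list[float]] = {}

    def rate_limit_check(self, ip: str) -> bool:
        now = time.monotonic()
        # Drop timestamps that have slid out of the window, then count the rest.
        hits = [t for t in self._hits.get(ip, []) if now - t < self.window]
        if len(hits) >= self.limit:
            self._hits[ip] = hits
            return False  # over the limit -> the route returns HTTP 429
        hits.append(now)
        self._hits[ip] = hits
        return True
```

With real Redis the same shape is typically achieved with a sorted set (ZADD + ZREMRANGEBYSCORE + ZCARD) or an INCR/EXPIRE counter per window.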
API Layer (ai_copilot_infra/api/)
routes/copilot.py — `POST /api/v1/copilot/query`:
1. Check Redis cache (cache hit → return immediately, cached: true)
2. Rate limit check via redis_service (20 req/min per IP → 429 if exceeded)
3. Generate trace_id (UUID4)
4. Run 6-step workflow pipeline
5. If validation passed: cache result, return answer + libraries_used + validation_passed + cached + trace_id
6. If validation failed: return 422 with validation reason
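The six route steps above can be sketched as a single handler with stubbed collaborators. The real route wires Redis, the pipeline, and the validator through FastAPI dependency injection; everything here besides the control flow (the `handle_query` name, the stub signatures) is an assumption for illustration.

```python
# Sketch of the /api/v1/copilot/query control flow. The cache, limiter,
# pipeline, and validator are stubs; only the ordering of steps is the point.
import uuid


def handle_query(query: str, ip: str, cache: dict, limiter, pipeline, validator) -> dict:
    key = query  # real code: SHA-256 of the query text
    if key in cache:                                  # 1. cache hit -> return immediately
        return {**cache[key], "cached": True}
    if not limiter.rate_limit_check(ip):              # 2. 20 req/min per IP
        return {"status": 429, "detail": "rate limit exceeded"}
    trace_id = str(uuid.uuid4())                      # 3. per-request trace id
    result = pipeline(query, trace_id)                # 4. run the 6-step workflow
    if not validator(result["answer"]):               # 6. validation failed -> 422
        return {"status": 422, "detail": "validation failed", "trace_id": trace_id}
    response = {"answer": result["answer"],           # 5. validation passed
                "libraries_used": result["libraries"],
                "validation_passed": True,
                "cached": False,
                "trace_id": trace_id}
    cache[key] = {k: v for k, v in response.items() if k != "cached"}
    return response
```

Note that the cache is only written after validation passes, so a bad answer is never served twice.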
middleware/logging.py — Structured request/response logging middleware: logs method, path, status_code, duration_ms, trace_id on every request.
Frontend (copilot-ui/ — React TypeScript)
App.tsx — Single-page UI with query input, submit button, loading state, answer display (markdown rendering), libraries_used badges, and a cached indicator.
Dark theme CSS. Minimal dependencies — no UI library.
Docker Compose Stack
`api` — FastAPI on :8000
`mcp` — MCP tool server on :8100
`redis` — Redis 7 on :6379 (internal only)
`context7` — Context7 documentation API (or mock)
Frontend served separately (Node 18)
GitHub Actions CI
Ruff lint + format check
pytest with Redis service container
Docker image build verification
Frontend npm ci + build
Design Decisions
Deterministic 6-step pipeline over a single LLM call — each step is independently testable, observable, and replaceable. Adding a new step requires only a new `WorkflowStep` subclass slotted into the chain.
Separate MCP tool server (FastAPI :8100) over embedding tools in the API — tools can be scaled, deployed, and versioned independently. New tools are added by implementing `BaseTool` and registering them, with no changes to the API.
Context7 for documentation fetching over static knowledge base — retrieves current docs at query time, so answers are grounded in the latest library documentation.
Output validation step before returning to client — catches empty, too-short, and hallucinated responses programmatically. Users never receive a known-bad answer.
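A minimal sketch of that validation gate. The hedging-phrase list is illustrative, not the project's actual signal list.

```python
# Sketch of the OutputValidator gate: empty check, minimum-length check, and
# hallucination-signal detection. The phrase list is an illustrative assumption.
from dataclasses import dataclass

HEDGING_PHRASES = ("i don't have access", "as an ai", "i cannot browse",
                   "my training data")


@dataclass
class ValidationResult:
    passed: bool
    reason: str = ""


class OutputValidator:
    MIN_LENGTH = 50

    def validate(self, response: str) -> ValidationResult:
        text = response.strip()
        if not text:
            return ValidationResult(False, "empty response")
        if len(text) < self.MIN_LENGTH:
            return ValidationResult(False, "response shorter than 50 chars")
        lowered = text.lower()
        for phrase in HEDGING_PHRASES:
            if phrase in lowered:
                return ValidationResult(False, f"hallucination signal: {phrase!r}")
        return ValidationResult(True)
```

The `reason` field is what the route surfaces in the 422 response body.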
Redis cache keyed by SHA-256 of query text — identical questions always hit cache, avoiding redundant OpenAI calls and latency. 1-hour TTL balances answer freshness against cache effectiveness.
Rate limiting via Redis (20 req/min per IP) — prevents API abuse without a separate rate-limiting service. Redis sliding window counter is simple and adds no extra infrastructure.
WorkflowState dataclass as shared context through pipeline — all steps read and write to the same state object. No hidden side effects; the full pipeline state is inspectable at any step.
GPT-4o-mini at temperature 0.2 for factual accuracy — low temperature reduces creative hallucinations for technical Q&A. GPT-4o-mini is fast and inexpensive enough for interactive use.
Loguru with JSON format in production — structured logs are parseable by log aggregators (Datadog, Grafana Loki). trace_id in every log line enables end-to-end request tracing.
BaseTool + ToolRegistry pattern in MCP server — new tools (e.g., GitHub search, Stack Overflow fetch) can be added by implementing BaseTool and registering them, with no pipeline changes.
Tradeoffs & Constraints
Synchronous 6-step pipeline — steps execute sequentially; total latency is the sum of all steps. Steps 2+3 (library detection + doc fetch) could run concurrently to cut latency.
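The concurrency improvement mentioned here could look like this; `fetch_docs` is a stub standing in for the Context7 HTTP call.

```python
# Sketch of fetching docs for several libraries concurrently with
# asyncio.gather. fetch_docs is a stub in place of the Context7 client.
import asyncio


async def fetch_docs(library: str) -> tuple[str, str]:
    await asyncio.sleep(0.01)  # simulated network latency
    return library, f"docs for {library}"


async def fetch_all(libraries: list[str]) -> dict[str, str]:
    # All fetches run concurrently; total latency tracks the slowest call,
    # not the sum of all calls.
    results = await asyncio.gather(*(fetch_docs(lib) for lib in libraries))
    return dict(results)
```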
Context7 availability dependency — if Context7 API is down or rate-limited, doc fetch fails and answers fall back to training data. Would improve resilience with a local documentation cache refreshed periodically.
Hallucination detection via phrase matching — catches common LLM hedging phrases but won't catch factually wrong answers stated confidently. Would require checking answers against the fetched documentation to catch those.
Rate limiting per IP via Redis — effective for most abuse cases but bypassable with IP rotation. For public deployment, would add API key authentication.
No streaming responses — full answer returned after complete pipeline execution. Streaming (SSE) would improve perceived performance but requires rethinking the validation gate, which currently needs the full answer.
Would improve: Parallel doc fetching for multiple libraries, local documentation cache with periodic refresh, streaming responses, API key auth,...
Outcome & Impact
Grounded technical answers sourced from real library documentation retrieved at query time — not LLM training data. Libraries are detected automatically from the query text.
Deterministic 6-step pipeline with WorkflowStep ABCs — every step is independently testable and observable. trace_id in every response enables full log correlation across the pipeline.
Output validation gate — empty, too-short, and hallucination-signalling responses return 422 with reason before reaching the client. Users only ever receive validated answers.
Redis cache eliminates redundant OpenAI calls for repeated questions — cache hit returns immediately (cached: true in response) at sub-5ms latency.
Rate limiting at 20 requests/minute per IP via Redis sliding window — prevents API abuse without an external rate-limiting service.
Separate MCP tool server (FastAPI :8100) with ToolRegistry — tools are independently deployable, scalable, and extensible via BaseTool ABC without touching the API service.
GitHub Actions CI: Ruff lint, pytest with Redis service container, Docker build verification, and frontend build on every push and PR.
Tech Stack
Backend: Python 3.11, FastAPI, Uvicorn, Poetry (dependency management)
LLM: OpenAI GPT-4o-mini (temperature 0.2 for factual technical answers)
Pipeline: Custom deterministic workflow engine (WorkflowStep + StepPipeline ABCs, WorkflowState)
MCP Server: Separate FastAPI service (:8100), BaseTool ABC, ToolRegistry, library_detection_tool, documentation_fetch_tool
Documentation: Context7 API (real-time library documentation retrieval via async HTTP client)
Caching + Rate Limiting: Redis 7 (SHA-256 query cache 1h TTL, 20 req/min per IP sliding window)
Observability: Loguru (structured JSON logs), trace_id per request
Frontend: React 18, TypeScript, dark theme
CI/CD: GitHub Actions (Ruff lint, pytest + Redis service, Docker build, frontend build)
Orchestration: Docker Compose (api + mcp + redis + context7)