AI Engineering Copilot Infrastructure
Production-grade AI assistant for software engineers. 6-step deterministic workflow: classify intent → detect libraries (MCP) → fetch real docs (Context7 via MCP) → build grounded prompt → GPT-4o-mini → validate output.
Role
AI Infrastructure Engineer
Team
Solo
Company/Organization
Personal Project
The Problem
LLMs answering technical questions hallucinate API signatures, deprecated configuration options, and non-existent methods because answers derive from static training data rather than current library documentation.
No deterministic quality gate on LLM output — empty responses, answers that are too short, or responses containing hallucination signals all reach the user unchecked.
Every identical technical question triggers a fresh OpenAI API call — no caching, so repeated questions waste tokens and add unnecessary latency.
Tool execution (library detection, documentation fetching) was tightly coupled to the main API process, making it impossible to scale or replace tools independently.
No per-request observability — without a trace_id, correlating logs across the classify → detect → fetch → generate → validate pipeline was impossible.
The Solution
Built a cleanly layered AI infrastructure with a deterministic pipeline engine, a separate MCP tool server, and a grounded prompting strategy.
6-Step Workflow Pipeline (ai_copilot_infra/workflows/)
base.py — `WorkflowStep` abstract base class with `execute(state: WorkflowState) -> WorkflowState` interface. `StepPipeline` runs steps sequentially, passing the shared state through each one.
state.py — `WorkflowState` dataclass: query, trace_id, intent, detected_libraries[], fetched_docs{}, grounded_prompt, llm_response,...
copilot_workflow.py — Instantiates and chains 6 steps:
1. `IntentClassificationStep` — Classifies query as debug/how-to/config/concept. Sets state.intent.
2. `LibraryDetectionStep` — Calls MCP server `library_detection_tool` with the raw query. Returns the list of detected libraries (e.g., ['Redis', ...]). Sets state.detected_libraries.
3. `DocumentationFetchStep` — For each detected library, calls MCP server `documentation_fetch_tool` → Context7 API. Aggregates real docs per library into state.fetched_docs.
4. `PromptBuildingStep` — Constructs grounded prompt: system message + retrieved docs as context + user query. Ensures the answer is grounded in actual documentation rather than training data.
5. `LLMGenerationStep` — Calls llm_service.py (async OpenAI wrapper, GPT-4o-mini, temperature 0.2 for factual accuracy). Sets state.llm_response.
6. `ValidationStep` — Runs validation.py checks: empty response check, minimum length check (< 50 chars = invalid), hallucination signal detection (common hedging phrases).
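A minimal sketch of the engine described above. The `WorkflowState` fields and step names mirror the writeup; the classification heuristic and keyword table here are illustrative stand-ins, not the project's actual logic.

```python
# Sketch of the WorkflowStep / StepPipeline engine. The intent heuristic and
# library keyword table are illustrative, not the project's real implementations.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class WorkflowState:
    query: str
    trace_id: str = ""
    intent: str = ""
    detected_libraries: list[str] = field(default_factory=list)
    fetched_docs: dict[str, str] = field(default_factory=dict)
    grounded_prompt: str = ""
    llm_response: str = ""


class WorkflowStep(ABC):
    @abstractmethod
    def execute(self, state: WorkflowState) -> WorkflowState: ...


class IntentClassificationStep(WorkflowStep):
    def execute(self, state: WorkflowState) -> WorkflowState:
        # Toy heuristic standing in for the real classifier.
        state.intent = "debug" if "error" in state.query.lower() else "how-to"
        return state


class LibraryDetectionStep(WorkflowStep):
    KNOWN = {"redis": "Redis", "celery": "Celery", "docker": "Docker",
             "fastapi": "FastAPI", "sqlalchemy": "SQLAlchemy"}

    def execute(self, state: WorkflowState) -> WorkflowState:
        q = state.query.lower()
        state.detected_libraries = [name for key, name in self.KNOWN.items() if key in q]
        return state


class StepPipeline:
    def __init__(self, steps: list[WorkflowStep]):
        self.steps = steps

    def run(self, state: WorkflowState) -> WorkflowState:
        for step in self.steps:  # each step reads and writes the shared state
            state = step.execute(state)
        return state
```

Because every step shares the same `execute(state) -> state` contract, reordering or inserting steps is just editing the list passed to `StepPipeline`.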
MCP Tool Server (ai_copilot_infra/mcp_server/)
Separate FastAPI application running on :8100 — independently deployable and scalable.
base.py — `BaseTool` ABC with `name`, `description`, `execute(input: dict) -> dict` interface.
registry.py — `ToolRegistry` maps tool names to `BaseTool` instances. `POST /tools/{tool_name}` dispatches to registered tool.
library_detection_tool.py — Parses query for known library/framework keywords (Redis, Celery, Docker, FastAPI, SQLAlchemy, etc.) using pattern matching.
documentation_fetch_tool.py — Takes library name, calls Context7 API (context7_client.py async HTTP client) to fetch current documentation for that library.
tools.py — Registers default tools into ToolRegistry on startup.
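The tool-server pattern can be sketched as follows. The keyword table is illustrative, and the HTTP dispatch route is shown only as a comment; the real server wires this through FastAPI.

```python
# Sketch of the BaseTool ABC + ToolRegistry pattern. The keyword table is
# illustrative; in the real server, POST /tools/{tool_name} calls dispatch().
from abc import ABC, abstractmethod


class BaseTool(ABC):
    name: str
    description: str

    @abstractmethod
    def execute(self, input: dict) -> dict: ...


class LibraryDetectionTool(BaseTool):
    name = "library_detection_tool"
    description = "Detect known libraries mentioned in a query."
    KNOWN = {"redis": "Redis", "celery": "Celery", "docker": "Docker",
             "fastapi": "FastAPI", "sqlalchemy": "SQLAlchemy"}

    def execute(self, input: dict) -> dict:
        query = input["query"].lower()
        return {"libraries": [name for key, name in self.KNOWN.items() if key in query]}


class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, BaseTool] = {}

    def register(self, tool: BaseTool) -> None:
        self._tools[tool.name] = tool

    def dispatch(self, tool_name: str, input: dict) -> dict:
        # The FastAPI route resolves {tool_name} from the URL path to this lookup.
        if tool_name not in self._tools:
            raise KeyError(f"unknown tool: {tool_name}")
        return self._tools[tool_name].execute(input)
```

Adding a tool is a new `BaseTool` subclass plus one `register()` call at startup; the API process never changes.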
Core Infrastructure (ai_copilot_infra/core/)
llm_service.py — Async OpenAI client wrapper. `generate(prompt: str) -> str`. Temperature 0.2, GPT-4o-mini model, max_tokens 1024. Reads OPENAI_API_KEY from config.
mcp_client.py — Async HTTP client for MCP server. `call_tool(tool_name: str, input: dict) -> dict`. Uses httpx with connection pooling.
redis_service.py — Async Redis operations: `get(key)`, `set(key, value, ttl)`, `rate_limit_check(ip, limit=20, window=60)`. Cache key = SHA-256 of the query text.
validation.py — `OutputValidator.validate(response: str) -> ValidationResult`. Checks: not empty, length >= 50 chars, no hallucination phrases.
config.py — Pydantic settings: OPENAI_API_KEY, REDIS_URL, MCP_BASE_URL, CONTEXT7_BASE_URL, CONTEXT7_API_KEY, LOG_FORMAT.
dependencies.py — FastAPI DI providers: `get_redis()`, `get_mcp_client()`, `get_llm_service()`.
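The caching and rate-limiting logic in redis_service.py can be sketched with an in-memory dict standing in for Redis, so the behavior is visible without a running server. The `rate_limit_check` name and defaults follow the writeup; the storage backend here is a stand-in.

```python
# Sketch of the redis_service cache-key derivation and sliding-window rate
# limiter. A plain dict replaces Redis here purely for illustration.
import hashlib
import time


def cache_key(query: str) -> str:
    # Identical queries always map to the same key: SHA-256 of the query text.
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


class InMemoryRateLimiter:
    """Sliding window: allow at most `limit` requests per `window` seconds per IP."""

    def __init__(self, limit: int = 20, window: int = 60):
        self.limit, self.window = limit, window
        self._hits: dict[str, list[float]] = {}

    def rate_limit_check(self, ip: str) -> bool:
        now = time.monotonic()
        # Drop timestamps that have slid out of the window, then count the rest.
        hits = [t for t in self._hits.get(ip, []) if now - t < self.window]
        if len(hits) >= self.limit:
            self._hits[ip] = hits
            return False  # over the limit -> the route returns HTTP 429
        hits.append(now)
        self._hits[ip] = hits
        return True
```

With real Redis the same shape is typically achieved with a sorted set (ZADD + ZREMRANGEBYSCORE + ZCARD) or an INCR/EXPIRE counter per window.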
API Layer (ai_copilot_infra/api/)
routes/copilot.py — `POST /api/v1/copilot/query`:
1. Check Redis cache (cache hit → return immediately, cached: true)
2. Rate limit check via redis_service (20 req/min per IP → 429 if exceeded)
3. Generate trace_id (UUID4)
4. Run 6-step workflow pipeline
5. If validation passed: cache result, return answer + libraries_used + validation_passed + cached + trace_id
6. If validation failed: return 422 with validation reason
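The six route steps above can be sketched as a single handler with stubbed collaborators. The real route wires Redis, the pipeline, and the validator through FastAPI dependency injection; everything here besides the control flow (the `handle_query` name, the stub signatures) is an assumption for illustration.

```python
# Sketch of the /api/v1/copilot/query control flow. The cache, limiter,
# pipeline, and validator are stubs; only the ordering of steps is the point.
import uuid


def handle_query(query: str, ip: str, cache: dict, limiter, pipeline, validator) -> dict:
    key = query  # real code: SHA-256 of the query text
    if key in cache:                                  # 1. cache hit -> return immediately
        return {**cache[key], "cached": True}
    if not limiter.rate_limit_check(ip):              # 2. 20 req/min per IP
        return {"status": 429, "detail": "rate limit exceeded"}
    trace_id = str(uuid.uuid4())                      # 3. per-request trace id
    result = pipeline(query, trace_id)                # 4. run the 6-step workflow
    if not validator(result["answer"]):               # 6. validation failed -> 422
        return {"status": 422, "detail": "validation failed", "trace_id": trace_id}
    response = {"answer": result["answer"],           # 5. validation passed
                "libraries_used": result["libraries"],
                "validation_passed": True,
                "cached": False,
                "trace_id": trace_id}
    cache[key] = {k: v for k, v in response.items() if k != "cached"}
    return response
```

Note that the cache is only written after validation passes, so a bad answer is never served twice.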
middleware/logging.py — Structured request/response logging middleware: logs method, path, status_code, duration_ms, trace_id on every request.
Frontend (copilot-ui/ — React TypeScript)
App.tsx — Single-page UI with query input, submit button, loading state, answer display (markdown rendering), libraries_used badges, and a cached indicator.
Dark theme CSS. Minimal dependencies — no UI library.
Docker Compose Stack
`api` — FastAPI on :8000
`mcp` — MCP tool server on :8100
`redis` — Redis 7 on :6379 (internal only)
`context7` — Context7 documentation API (or mock)
Frontend served separately (Node 18)
GitHub Actions CI
Ruff lint + format check
pytest with Redis service container
Docker image build verification
Frontend npm ci + build
Design Decisions
Deterministic 6-step pipeline over a single LLM call — each step is independently testable, observable, and replaceable. Adding a new step requires only a new `WorkflowStep` subclass slotted into the chain.
Separate MCP tool server (FastAPI :8100) over embedding tools in the API — tools can be scaled, deployed, and versioned independently. New tools are added by implementing `BaseTool` and registering them, with no changes to the API.
Context7 for documentation fetching over static knowledge base — retrieves current docs at query time, so answers are grounded in the latest library documentation.
Output validation step before returning to client — catches empty, too-short, and hallucinated responses programmatically. Users never receive a known-bad answer.
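A minimal sketch of that validation gate. The hedging-phrase list is illustrative, not the project's actual signal list.

```python
# Sketch of the OutputValidator gate: empty check, minimum-length check, and
# hallucination-signal detection. The phrase list is an illustrative assumption.
from dataclasses import dataclass

HEDGING_PHRASES = ("i don't have access", "as an ai", "i cannot browse",
                   "my training data")


@dataclass
class ValidationResult:
    passed: bool
    reason: str = ""


class OutputValidator:
    MIN_LENGTH = 50

    def validate(self, response: str) -> ValidationResult:
        text = response.strip()
        if not text:
            return ValidationResult(False, "empty response")
        if len(text) < self.MIN_LENGTH:
            return ValidationResult(False, "response shorter than 50 chars")
        lowered = text.lower()
        for phrase in HEDGING_PHRASES:
            if phrase in lowered:
                return ValidationResult(False, f"hallucination signal: {phrase!r}")
        return ValidationResult(True)
```

The `reason` field is what the route surfaces in the 422 response body.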
Redis cache keyed by SHA-256 of query text — identical questions always hit cache, avoiding redundant OpenAI calls and latency. 1-hour TTL balances answer freshness against cache effectiveness.
Rate limiting via Redis (20 req/min per IP) — prevents API abuse without a separate rate-limiting service. Redis sliding window counter is simple and adds no extra infrastructure.
WorkflowState dataclass as shared context through pipeline — all steps read and write to the same state object. No hidden side effects; the full pipeline state is inspectable at any step.
GPT-4o-mini at temperature 0.2 for factual accuracy — low temperature reduces creative hallucinations for technical Q&A. GPT-4o-mini is fast and inexpensive enough for interactive use.
Loguru with JSON format in production — structured logs are parseable by log aggregators (Datadog, Grafana Loki). trace_id in every log line enables end-to-end request tracing.
BaseTool + ToolRegistry pattern in MCP server — new tools (e.g., GitHub search, Stack Overflow fetch) can be added by implementing BaseTool and registering them, with no pipeline changes.
Tradeoffs & Constraints
Synchronous 6-step pipeline — steps execute sequentially; total latency is the sum of all steps. Steps 2+3 (library detection + doc fetch) could run concurrently to cut latency.
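The concurrency improvement mentioned here could look like this; `fetch_docs` is a stub standing in for the Context7 HTTP call.

```python
# Sketch of fetching docs for several libraries concurrently with
# asyncio.gather. fetch_docs is a stub in place of the Context7 client.
import asyncio


async def fetch_docs(library: str) -> tuple[str, str]:
    await asyncio.sleep(0.01)  # simulated network latency
    return library, f"docs for {library}"


async def fetch_all(libraries: list[str]) -> dict[str, str]:
    # All fetches run concurrently; total latency tracks the slowest call,
    # not the sum of all calls.
    results = await asyncio.gather(*(fetch_docs(lib) for lib in libraries))
    return dict(results)
```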
Context7 availability dependency — if Context7 API is down or rate-limited, doc fetch fails and answers fall back to training data. Would improve resilience with a local documentation cache refreshed periodically.
Hallucination detection via phrase matching — catches common LLM hedging phrases but won't catch factually wrong answers stated confidently. Would require checking answers against the fetched documentation to catch those.
Rate limiting per IP via Redis — effective for most abuse cases but bypassable with IP rotation. For public deployment, would add API key authentication.
No streaming responses — full answer returned after complete pipeline execution. Streaming (SSE) would improve perceived performance but requires rethinking the validation gate, which currently needs the full answer.
Would improve: Parallel doc fetching for multiple libraries, local documentation cache with periodic refresh, streaming responses, API key auth,...
Outcome & Impact
Grounded technical answers sourced from real library documentation retrieved at query time — not LLM training data. Libraries are detected automatically from the query text.
Deterministic 6-step pipeline with WorkflowStep ABCs — every step is independently testable and observable. trace_id in every response enables full log correlation across the pipeline.
Output validation gate — empty, too-short, and hallucination-signalling responses return 422 with reason before reaching the client. Users only ever receive validated answers.
Redis cache eliminates redundant OpenAI calls for repeated questions — cache hit returns immediately (cached: true in response) at sub-5ms latency.
Rate limiting at 20 requests/minute per IP via Redis sliding window — prevents API abuse without an external rate-limiting service.
Separate MCP tool server (FastAPI :8100) with ToolRegistry — tools are independently deployable, scalable, and extensible via BaseTool ABC without touching the API service.
GitHub Actions CI: Ruff lint, pytest with Redis service container, Docker build verification, and frontend build on every push and PR.
Tech Stack
Backend: Python 3.11, FastAPI, Uvicorn, Poetry (dependency management)
LLM: OpenAI GPT-4o-mini (temperature 0.2 for factual technical answers)
Pipeline: Custom deterministic workflow engine (WorkflowStep + StepPipeline ABCs, WorkflowState)
MCP Server: Separate FastAPI service (:8100), BaseTool ABC, ToolRegistry, library_detection_tool, documentation_fetch_tool
Documentation: Context7 API (real-time library documentation retrieval via async HTTP client)
Caching + Rate Limiting: Redis 7 (SHA-256 query cache 1h TTL, 20 req/min per IP sliding window)
Observability: Loguru (structured JSON logs), trace_id per request
Frontend: React 18, TypeScript, dark theme
CI/CD: GitHub Actions (Ruff lint, pytest + Redis service, Docker build, frontend build)
Orchestration: Docker Compose (api + mcp + redis + context7)