Back to Projects

AI Engineering Copilot Infrastructure

Production-grade AI assistant for software engineers. 6-step deterministic workflow: classify intent → detect libraries (MCP) → fetch real docs (Context7 via MCP) → build grounded prompt → GPT-4o-mini → validate output.

Python 3.11 · FastAPI · OpenAI GPT-4o-mini · Redis · Loguru · Poetry · React 18 · TypeScript · MCP Server · Context7 · Docker · Docker Compose · GitHub Actions · Ruff

Role

AI Infrastructure Engineer

Team

Solo

Company/Organization

Personal Project

The Problem

LLMs answering technical questions hallucinate API signatures, deprecated configuration options, and non-existent methods because answers derive from stale training data instead of current documentation.

No deterministic quality gate on LLM output: empty responses, answers that are too short, and responses containing hallucination signals all reach the user unchecked.

Every identical technical question triggers a fresh OpenAI API call: no caching, so repeated questions waste tokens and add unnecessary latency.

Tool execution (library detection, documentation fetching) was tightly coupled to the main API process, making it impossible to scale or replace tools independently.

No per-request observability: without a trace_id, correlating logs across the classify → detect → fetch → generate → validate pipeline was impractical.

The Solution

Built a cleanly layered AI infrastructure with a deterministic pipeline engine, a separate MCP tool server, and a grounded prompting strategy.

6-Step Workflow Pipeline (ai_copilot_infra/workflows/)

base.py — `WorkflowStep` abstract base class with `execute(state: WorkflowState) -> WorkflowState` interface. `StepPipeline` runs steps sequentially, passing the evolving state from one step to the next.

state.py — `WorkflowState` dataclass: query, trace_id, intent, detected_libraries[], fetched_docs{}, grounded_prompt, llm_response,...

copilot_workflow.py — Instantiates and chains 6 steps:

1. `IntentClassificationStep`: classifies the query as debug, how-to, config, or concept. Sets `state.intent`.

2. `LibraryDetectionStep`: calls the MCP server `library_detection_tool` with the raw query. Returns the list of detected libraries (e.g., `['Redis', ...]`).

3. `DocumentationFetchStep`: for each detected library, calls the MCP server `documentation_fetch_tool`, which hits the Context7 API. Aggregates real docs per library into `state.fetched_docs`.

4. `PromptBuildingStep`: constructs the grounded prompt: system message + retrieved docs as context + user query. Ensures the answer is grounded in actual documentation rather than training data.

5. `LLMGenerationStep`: calls llm_service.py (async OpenAI wrapper, GPT-4o-mini, temperature 0.2 for factual accuracy). Sets `state.llm_response`.

6. `ValidationStep`: runs validation.py checks: empty-response check, minimum-length check (< 50 chars = invalid), and hallucination signal detection (known hedging phrases).
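A minimal sketch of this step/pipeline contract, assuming the sync `execute` signature given above. The keyword heuristic inside the example step is illustrative, not the project's actual classifier:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class WorkflowState:
    # Shared context passed through every step (fields per state.py).
    query: str
    trace_id: str = ""
    intent: str = ""
    detected_libraries: list = field(default_factory=list)
    fetched_docs: dict = field(default_factory=dict)
    grounded_prompt: str = ""
    llm_response: str = ""


class WorkflowStep(ABC):
    @abstractmethod
    def execute(self, state: WorkflowState) -> WorkflowState: ...


class StepPipeline:
    def __init__(self, steps: list):
        self.steps = steps

    def run(self, state: WorkflowState) -> WorkflowState:
        # Deterministic sequential execution: each step reads and writes
        # the same state object, then hands it to the next step.
        for step in self.steps:
            state = step.execute(state)
        return state


class IntentClassificationStep(WorkflowStep):
    # Toy heuristic standing in for the real intent classifier.
    def execute(self, state: WorkflowState) -> WorkflowState:
        state.intent = "debug" if "error" in state.query.lower() else "how-to"
        return state
```

Because every step shares one interface, adding a step is a matter of implementing `WorkflowStep` and inserting it into the list handed to `StepPipeline`.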

MCP Tool Server (ai_copilot_infra/mcp_server/)

Separate FastAPI application running on :8100, independently deployable and scalable.

base.py — `BaseTool` ABC with `name`, `description`, `execute(input: dict) -> dict` interface.

registry.py — `ToolRegistry` maps tool names to `BaseTool` instances. `POST /tools/{tool_name}` dispatches to registered tool.
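The registry pattern can be sketched roughly as follows; the sample tool's keyword list and token stripping are illustrative stand-ins for the real library_detection_tool:

```python
from abc import ABC, abstractmethod


class BaseTool(ABC):
    name: str
    description: str

    @abstractmethod
    def execute(self, input: dict) -> dict: ...


class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, tool: BaseTool) -> None:
        self._tools[tool.name] = tool

    def dispatch(self, tool_name: str, input: dict) -> dict:
        # In the real server, POST /tools/{tool_name} resolves here.
        if tool_name not in self._tools:
            raise KeyError(f"unknown tool: {tool_name}")
        return self._tools[tool_name].execute(input)


class LibraryDetectionTool(BaseTool):
    name = "library_detection_tool"
    description = "Detect known libraries mentioned in a query."
    KNOWN = {"redis", "celery", "docker", "fastapi", "sqlalchemy"}

    def execute(self, input: dict) -> dict:
        # Keyword match against a known-library list (illustrative).
        tokens = {w.strip(".,?!") for w in input["query"].lower().split()}
        return {"libraries": sorted(tokens & self.KNOWN)}
```

New tools only need a `name`, a `description`, and an `execute` implementation; the dispatch route never changes.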

library_detection_tool.py — Parses the query for known library/framework keywords (Redis, Celery, Docker, FastAPI, SQLAlchemy, etc.) using pattern matching.

documentation_fetch_tool.py — Takes a library name and calls the Context7 API (context7_client.py, async HTTP client) to fetch current documentation at query time.

tools.py — Registers default tools into ToolRegistry on startup.

Core Infrastructure (ai_copilot_infra/core/)

llm_service.py — Async OpenAI client wrapper. `generate(prompt: str) -> str`. Temperature 0.2, GPT-4o-mini model, max_tokens 1024. Reads OPENAI_API_KEY from Pydantic settings.

mcp_client.py — Async HTTP client for MCP server. `call_tool(tool_name: str, input: dict) -> dict`. Uses httpx with connection pooling.

redis_service.py — Async Redis operations: `get(key)`, `set(key, value, ttl)`, `rate_limit_check(ip, limit=20, window=60)`. Cache key = SHA-256 of the query text.
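A rough sketch of the cache-key derivation and the sliding-window check. The real service runs against Redis; here an in-memory dict of hit timestamps stands in so the logic is visible and testable:

```python
import hashlib
import time


def cache_key(query: str) -> str:
    # Cache key = SHA-256 of the query text (prefix is illustrative).
    return "copilot:" + hashlib.sha256(query.encode()).hexdigest()


class SlidingWindowLimiter:
    """In-memory stand-in for the Redis sliding-window counter."""

    def __init__(self, limit: int = 20, window: int = 60):
        self.limit, self.window = limit, window
        self._hits = {}  # ip -> list of hit timestamps

    def rate_limit_check(self, ip: str, now: float = None) -> bool:
        now = time.time() if now is None else now
        # Keep only hits inside the current window, then count them.
        hits = [t for t in self._hits.get(ip, []) if now - t < self.window]
        if len(hits) >= self.limit:
            self._hits[ip] = hits
            return False  # over limit -> caller returns 429
        hits.append(now)
        self._hits[ip] = hits
        return True
```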

validation.py — `OutputValidator.validate(response: str) -> ValidationResult`. Checks: not empty, length >= 50 chars, no hallucination phrases. Returns a pass/fail result with a reason.
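The validation gate might look something like this; the exact hallucination-signal phrases below are assumptions, not the project's actual list:

```python
from dataclasses import dataclass

# Hedging phrases treated as hallucination signals (illustrative list).
HALLUCINATION_SIGNALS = (
    "i don't have access to",
    "as an ai language model",
    "i cannot browse",
)


@dataclass
class ValidationResult:
    passed: bool
    reason: str = ""


class OutputValidator:
    MIN_LENGTH = 50

    def validate(self, response: str) -> ValidationResult:
        text = response.strip()
        if not text:
            return ValidationResult(False, "empty response")
        if len(text) < self.MIN_LENGTH:
            return ValidationResult(False, "response shorter than 50 characters")
        lowered = text.lower()
        for phrase in HALLUCINATION_SIGNALS:
            if phrase in lowered:
                return ValidationResult(False, f"hallucination signal: {phrase!r}")
        return ValidationResult(True)
```

Because the checks are pure string functions, the gate is fully deterministic and unit-testable, unlike the LLM call it guards.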

config.py — Pydantic settings: OPENAI_API_KEY, REDIS_URL, MCP_BASE_URL, CONTEXT7_BASE_URL, CONTEXT7_API_KEY, LOG_FORMAT.

dependencies.py — FastAPI DI providers: `get_redis()`, `get_mcp_client()`, `get_llm_service()`.

API Layer (ai_copilot_infra/api/)

routes/copilot.py — `POST /api/v1/copilot/query`:

1. Check Redis cache (cache hit → return immediately, cached: true)

2. Rate limit check via redis_service (20 req/min per IP → 429 if exceeded)

3. Generate trace_id (UUID4)

4. Run 6-step workflow pipeline

5. If validation passed: cache result, return answer + libraries_used + validation_passed + cached + trace_id

6. If validation failed: return 422 with validation reason
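The six steps above can be sketched as a framework-free handler, with the cache, limiter, pipeline, and validator injected (FastAPI's DI providers supply the real ones). The collaborator shapes here are simplified stand-ins:

```python
import hashlib
import uuid


def handle_query(query: str, ip: str, cache, limiter, pipeline, validator) -> dict:
    # Cache check first: a hit returns immediately with cached: true.
    key = "copilot:" + hashlib.sha256(query.encode()).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return {**cached, "cached": True}
    # Rate limit: 20 req/min per IP -> 429 if exceeded.
    if not limiter.rate_limit_check(ip):
        return {"status": 429, "detail": "rate limit exceeded"}
    # Per-request trace id, then run the 6-step workflow.
    trace_id = str(uuid.uuid4())
    state = pipeline.run(query, trace_id)
    # Validation failure -> 422 with the reason.
    result = validator.validate(state["llm_response"])
    if not result["passed"]:
        return {"status": 422, "detail": result["reason"]}
    # Validation passed: cache the payload (1h TTL) and return it.
    payload = {
        "answer": state["llm_response"],
        "libraries_used": state["detected_libraries"],
        "validation_passed": True,
        "trace_id": trace_id,
    }
    cache.set(key, payload, ttl=3600)
    return {**payload, "cached": False}
```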

middleware/logging.py — Structured request/response logging middleware: logs method, path, status_code, duration_ms, trace_id on every request.

Frontend (copilot-ui/ — React TypeScript)

App.tsx — Single-page UI with query input, submit button, loading state, answer display (markdown rendering), libraries_used badges, and a cached indicator.

Dark theme CSS. Minimal dependencies: no UI library.

Docker Compose Stack

`api`: FastAPI on :8000

`mcp`: MCP tool server on :8100

`redis`: Redis 7 on :6379 (internal only)

`context7`: Context7 documentation API (or mock)

Frontend served separately (Node 18)

GitHub Actions CI

Ruff lint + format check

pytest with Redis service container

Docker image build verification

Frontend npm ci + build

Design Decisions

Deterministic 6-step pipeline over a single LLM call: each step is independently testable, observable, and replaceable. Adding a new step (e.g., …) is just another `WorkflowStep` implementation.

Separate MCP tool server (FastAPI :8100) over embedding tools in the API: tools can be scaled, deployed, and versioned independently. New tools are added by implementing `BaseTool` and registering them.

Context7 for documentation fetching over a static knowledge base: retrieves current docs at query time, so answers are grounded in the latest library documentation.

Output validation step before returning to the client: catches empty, too-short, and hallucinated responses programmatically. Users never receive an unvalidated answer.

Redis cache keyed by SHA-256 of the query text: identical questions always hit the cache, avoiding redundant OpenAI calls and latency. The 1-hour TTL balances freshness against cost.

Rate limiting via Redis (20 req/min per IP): prevents API abuse without a separate rate-limiting service. The sliding-window counter reuses infrastructure already in the stack.

WorkflowState dataclass as shared context through the pipeline: all steps read and write the same state object. No hidden side effects; the full pipeline state is inspectable at any point.

GPT-4o-mini at temperature 0.2 for factual accuracy: low temperature reduces creative hallucinations in technical Q&A. GPT-4o-mini is fast and inexpensive relative to larger models.

Loguru with JSON format in production: structured logs are parseable by log aggregators (Datadog, Grafana Loki). A trace_id in every log line enables end-to-end request correlation.

BaseTool + ToolRegistry pattern in the MCP server: new tools (e.g., GitHub search, Stack Overflow fetch) can be added by implementing BaseTool and registering them, with no changes to the API service.

Tradeoffs & Constraints

Synchronous 6-step pipeline: steps execute sequentially, so total latency is the sum of all steps. Steps 2+3 (library detection + doc fetch) could run in parallel.
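The parallel fan-out hinted at here could be sketched with asyncio.gather; `fetch_docs` below is a hypothetical stand-in for the per-library Context7 call:

```python
import asyncio


async def fetch_docs(library: str) -> tuple:
    # Stand-in for the per-library Context7 fetch in documentation_fetch_tool.
    await asyncio.sleep(0.01)  # simulated network latency
    return library, f"docs for {library}"


async def fetch_all_docs(libraries: list) -> dict:
    # Fan out one coroutine per detected library; total latency is
    # bounded by the slowest fetch instead of the sum of all fetches.
    results = await asyncio.gather(*(fetch_docs(lib) for lib in libraries))
    return dict(results)
```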

Context7 availability dependency: if the Context7 API is down or rate-limited, doc fetch fails and answers fall back to training data. Would improve with a local documentation cache and periodic refresh.

Hallucination detection via phrase matching: catches common LLM hedging phrases but won't catch factually wrong answers stated confidently. Would require verifying claims against the fetched documentation to catch those.

Rate limiting per IP via Redis: effective for most abuse cases but bypassable with IP rotation. For public deployment, would add API key authentication.

No streaming responses: the full answer is returned after complete pipeline execution. Streaming (SSE) would improve perceived performance but requires rethinking the validation step, which currently runs on the complete response.

Would improve: Parallel doc fetching for multiple libraries, local documentation cache with periodic refresh, streaming responses, API key auth,...

Outcome & Impact

Grounded technical answers sourced from real library documentation retrieved at query time, not from LLM training data. Libraries are detected automatically per query.

Deterministic 6-step pipeline with WorkflowStep ABCs: every step is independently testable and observable. A trace_id in every response enables full end-to-end request tracing.

Output validation gate: empty, too-short, and hallucination-signalling responses return 422 with a reason before reaching the client. Users only ever see validated answers.

Redis cache eliminates redundant OpenAI calls for repeated questions: a cache hit returns immediately (cached: true in the response) at sub-5ms latency.

Rate limiting at 20 requests/minute per IP via a Redis sliding window: prevents API abuse without an external rate-limiting service.

Separate MCP tool server (FastAPI :8100) with ToolRegistry: tools are independently deployable, scalable, and extensible via the BaseTool ABC, without any changes to the API service.

GitHub Actions CI: Ruff lint, pytest with Redis service container, Docker build verification, and frontend build on every push and PR.

Tech Stack

Backend: Python 3.11, FastAPI, Uvicorn, Poetry (dependency management)

LLM: OpenAI GPT-4o-mini (temperature 0.2 for factual technical answers)

Pipeline: Custom deterministic workflow engine (WorkflowStep + StepPipeline ABCs, WorkflowState)

MCP Server: Separate FastAPI service (:8100), BaseTool ABC, ToolRegistry, library_detection_tool, documentation_fetch_tool

Documentation: Context7 API (real-time library documentation retrieval via async HTTP client)

Caching + Rate Limiting: Redis 7 (SHA-256 query cache 1h TTL, 20 req/min per IP sliding window)

Observability: Loguru (structured JSON logs), trace_id per request

Frontend: React 18, TypeScript, dark theme

CI/CD: GitHub Actions (Ruff lint, pytest + Redis service, Docker build, frontend build)

Orchestration: Docker Compose (api + mcp + redis + context7)
