AI Document Parser
Production-ready AI microservice accepting PDF, TXT, and EML files — extracts text, caches by SHA-256 hash in Redis 7 (1-hour TTL, AOF persistence), and returns structured GPT-4o-mini analysis (summary, document type, key points, important entities). FastAPI backend with Gunicorn + UvicornWorker, layered service architecture (document_service.py + llm_service.py), Langfuse LLM observability (prompt/response/tokens/cost per call), Loguru structured JSON logs.
Role
AI Engineer & Full-stack Developer
Team
Solo
Company/Organization
Personal Project
The Problem
Extracting structured insights from PDF, TXT, and EML documents required manual effort or separate tools — no single REST endpoint accepted all three formats and returned consistent, structured output.
Repeated analysis of identical documents (e.g., the same invoice re-uploaded) wasted LLM tokens and added unnecessary latency — there was no deduplication mechanism.
Zero visibility into LLM cost, token usage, or trace data per request made it impossible to debug expensive or slow AI calls in production.
The absence of structured logging made it difficult to correlate request metadata (filename, text length, cache status) with application errors.
Deploying the service alongside a Redis cache and a React frontend required manual coordination — no containerised full-stack setup existed for consistent environments.
The Solution
Built a FastAPI microservice with a clean layered architecture separating extraction, caching, and LLM concerns.
Backend Architecture
app/services/document_service.py — Text extraction + cache orchestration:
PDF: pypdf PdfReader, concatenates all page text
EML: stdlib email.message_from_bytes, walks MIME parts, extracts text/plain
TXT: UTF-8 decode with error fallback
Computes SHA-256 hash of extracted text, checks Redis async GET, returns cached result or continues to LLM
Unsupported MIME type raises HTTP 415; corrupt/unreadable file raises HTTP 422
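The extraction-plus-cache-key flow above can be sketched roughly as follows. Function and variable names are illustrative, not the project's actual identifiers, and the PDF branch is shown only as a comment to keep the sketch stdlib-only:

```python
import email
import hashlib
from email.message import Message


def extract_text(filename: str, data: bytes) -> str:
    """Extract plain text from a TXT or EML upload.

    The PDF branch is omitted here; it would use pypdf's PdfReader and
    concatenate page.extract_text() across all pages.
    """
    if filename.endswith(".txt"):
        # UTF-8 decode with replacement fallback for undecodable bytes
        return data.decode("utf-8", errors="replace")
    if filename.endswith(".eml"):
        msg: Message = email.message_from_bytes(data)
        parts = []
        # Walk MIME parts and keep only text/plain bodies
        for part in msg.walk():
            if part.get_content_type() == "text/plain":
                payload = part.get_payload(decode=True)
                if payload:
                    parts.append(payload.decode("utf-8", errors="replace"))
        return "\n".join(parts)
    raise ValueError("unsupported type")  # mapped to HTTP 415 in the API layer


def cache_key(text: str) -> str:
    # SHA-256 of the extracted text: identical content maps to the
    # same key regardless of the uploaded filename
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

The key is derived from content, not filename, which is what makes the dedup-by-hash caching described below work.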
app/services/llm_service.py — OpenAI call + Langfuse trace:
Sends extracted text to GPT-4o-mini with structured prompt requesting JSON: summary, document_type, key_points[], important_entities[]
Wraps call in Langfuse generation (start/end) recording model, prompt, response, usage tokens, and cost
Parses and validates JSON response; raises HTTP 502 on OpenAI failure or malformed JSON
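The validation step can be sketched like this — the OpenAI call itself is elided (it would be a `chat.completions` request whose message content arrives as `raw`), and the exception class name is illustrative:

```python
import json

REQUIRED_KEYS = {"summary", "document_type", "key_points", "important_entities"}


class UpstreamError(Exception):
    """Raised on OpenAI failure or malformed JSON; the API layer maps it to HTTP 502."""


def parse_analysis(raw: str) -> dict:
    # `raw` stands in for the assistant message content returned by the model
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise UpstreamError(f"model returned invalid JSON: {exc}") from exc
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise UpstreamError(f"model response missing keys: {sorted(missing)}")
    if not isinstance(data["key_points"], list) or not isinstance(
        data["important_entities"], list
    ):
        raise UpstreamError("key_points and important_entities must be lists")
    return data
```

Validating before caching matters: a malformed response raises instead of being stored and served for the next hour.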
app/core/redis_client.py — Async Redis wrapper:
GET returns cached DocumentResponse or None
SET serialises to JSON with 3600s TTL
Redis 7 with AOF persistence ensures cache survives container restarts
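A minimal sketch of the wrapper, assuming a redis.asyncio-style client with `await get(key)` / `await set(key, value, ex=ttl)`; it is exercised here against an in-memory stand-in rather than a real Redis:

```python
import asyncio
import json

CACHE_TTL_SECONDS = 3600  # 1-hour TTL


class DocumentCache:
    """Thin async wrapper; `client` is anything exposing redis.asyncio-style
    get/set coroutines (class and method names are illustrative)."""

    def __init__(self, client):
        self.client = client

    async def get(self, key: str):
        raw = await self.client.get(key)
        # Cached responses are stored as JSON strings
        return json.loads(raw) if raw is not None else None

    async def set(self, key: str, value: dict) -> None:
        await self.client.set(key, json.dumps(value), ex=CACHE_TTL_SECONDS)


class FakeRedis:
    """In-memory stand-in for tests; ignores the TTL argument."""

    def __init__(self):
        self.store = {}

    async def get(self, key):
        return self.store.get(key)

    async def set(self, key, value, ex=None):
        self.store[key] = value


async def demo():
    cache = DocumentCache(FakeRedis())
    assert await cache.get("k") is None  # miss before anything is stored
    await cache.set("k", {"cached": True})
    return await cache.get("k")
```

Duck-typing the client keeps the wrapper testable without a Redis server and makes swapping the backing store trivial.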
app/core/langfuse_client.py — Langfuse singleton initialised from env vars (LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_BASE_URL); no-ops when credentials are missing, so the service runs without Langfuse configured
app/core/logger.py — Loguru sink configured for structured JSON output; every request logs filename, text_length, cached status, and processing time
app/api/v1/documents.py — POST /api/v1/documents/upload:
Accepts multipart file upload, validates extension (.pdf/.txt/.eml)
Calls document_service → on cache hit, return cached result; on miss → llm_service → SET cache → return result
Returns DocumentResponse: filename, text_length, cached (bool), llm_analysis
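The cache-or-analyse flow of the route can be modelled framework-free. The FastAPI wiring is omitted; `cache` (a plain dict standing in for Redis) and `analyze` (standing in for the llm_service call) are illustrative test doubles:

```python
import hashlib


def handle_upload(filename: str, text: str, cache: dict, analyze) -> dict:
    """Cache hit → return cached analysis; miss → analyze, store, return."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return {"filename": filename, "text_length": len(text),
                "cached": True, "llm_analysis": hit}
    analysis = analyze(text)
    cache[key] = analysis  # real code SETs Redis with a 3600s TTL
    return {"filename": filename, "text_length": len(text),
            "cached": False, "llm_analysis": analysis}
```

Uploading the same content under two different filenames calls the LLM only once — the second request is served from cache.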
app/schemas/document.py — Pydantic models: LLMAnalysis (summary, document_type, key_points, important_entities), DocumentResponse
Frontend (React 19 + TypeScript + Vite 7)
src/api/documentApi.ts — fetch wrapper using VITE_API_URL env var; POST multipart form, returns typed DocumentResponse
src/components/FileUpload.tsx — Drag-and-drop or click-to-browse, validates file type client-side, calls API on submit
src/components/ResultCard.tsx — Renders LLM analysis with summary, document type badge, key points list, entities list, cache hit indicator
src/components/Loader.tsx — Spinner shown during API call
src/types/document.ts — TypeScript interfaces mirroring Pydantic schemas
Nginx production image (Node build → Nginx static serve + /api reverse proxy to backend)
Caching Strategy
SHA-256 hash of extracted text content as cache key — identical documents (regardless of filename) always hit cache
1-hour TTL balances freshness with LLM cost savings
Redis 7 AOF persistence ensures cache survives restarts, avoiding cold-start cost spikes
`cached: true` field in response gives clients visibility into cache hit status
Observability
Langfuse traces every LLM call: prompt sent, response received, model used, token counts (prompt + completion), estimated cost
Langfuse dashboard provides cost analytics, latency histograms, and error rates across all document processing calls
Loguru structured JSON logs on every request for correlation with application monitoring
Docker + Deployment
Backend: multi-stage Dockerfile (Python 3.11-slim → poetry install → Gunicorn + UvicornWorker, runs as non-root UID 1001)
Frontend: multi-stage Dockerfile (Node 20 build → Nginx production image with SPA fallback + /api reverse proxy)
docker-compose.yml orchestrates api + frontend + redis with internal network (Redis not exposed externally)
Full stack: `make docker-up`; development: `make dev` (concurrent backend + frontend + Redis)
CI/CD (GitHub Actions)
backend job: Ruff lint, Ruff format check, pytest
frontend job: ESLint, TypeScript type-check (tsc --noEmit), Vite production build
docker job: build backend image, build frontend image
Runs on every push and pull request
Design Decisions
Chose SHA-256 content hash as cache key (not filename) — identical documents with different names always hit cache, and renamed files don't pollute the cache with duplicate entries.
Separated document_service.py (extraction + cache) from llm_service.py (OpenAI + Langfuse) — keeps LLM logic isolated, making it easy to swap models or add providers.
Used Redis 7 with AOF persistence over in-memory caching — survives container restarts, is shareable across multiple API replicas, and TTL management is built in.
Langfuse for LLM observability over custom logging — provides out-of-the-box cost tracking, latency histograms, and prompt/response replay. Optional: the service runs without it when credentials are unset.
Gunicorn + UvicornWorker over plain Uvicorn in production — Gunicorn manages worker lifecycle (crashes, memory limits), UvicornWorker provides async ASGI request handling.
GPT-4o-mini over GPT-4o — cost-effective for structured extraction tasks (~10x cheaper), sufficient quality for summary/classification/entity extraction.
Nginx reverse proxy in frontend Docker image — SPA fallback (all routes serve index.html), /api proxy to backend, static asset caching headers.
Production Docker image runs as non-root user (UID 1001) — security best practice for containerised services, prevents privilege escalation if the container is compromised.
Poetry for backend dependency management — lockfile ensures reproducible installs across dev, CI, and production. Separation of main and dev dependency groups keeps production installs lean.
Pydantic settings (app/core/config.py) reads all config from environment — single source of truth for configuration, type-validated, works with .env files for local development.
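pydantic-settings pulls these values from the environment (and .env) automatically; the same pattern can be sketched with the stdlib — field and env-var names here (other than OPENAI_API_KEY) are illustrative, not the project's actual config:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Stdlib sketch of env-driven config; the real service uses
    pydantic-settings, which adds type validation and .env loading."""
    openai_api_key: str
    redis_url: str
    cache_ttl_seconds: int

    @classmethod
    def from_env(cls) -> "Settings":
        return cls(
            # Required: fail fast at startup if the key is missing
            openai_api_key=os.environ["OPENAI_API_KEY"],
            # Defaults assume the Docker Compose service names
            redis_url=os.environ.get("REDIS_URL", "redis://redis:6379/0"),
            cache_ttl_seconds=int(os.environ.get("CACHE_TTL_SECONDS", "3600")),
        )
```

Centralising config this way means the same image runs in dev, CI, and production with only environment changes.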
Tradeoffs & Constraints
SHA-256 cache key means any text change (even whitespace) produces a cache miss — intentional for correctness, but means slightly reformatted identical documents are re-analysed.
Synchronous PDF extraction with pypdf — blocking for very large PDFs. For production scale with 100+ page documents, would move extraction to a background job queue.
Single LLM call per document — no chunking for very long documents. GPT-4o-mini context window (~128K tokens) handles most business documents; chunking would be needed beyond that.
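The chunking mentioned as future work could look roughly like this: split the text into overlapping character-budget chunks, summarise each, then merge. Token counting and the merge call are elided, and the sizes are illustrative:

```python
def chunk_text(text: str, max_chars: int = 12000, overlap: int = 500) -> list[str]:
    """Greedy character-budget chunking with overlap — a rough proxy for
    token-based splitting; only needed past the model's context window."""
    if len(text) <= max_chars:
        return [text]
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Overlap so a sentence cut at a chunk boundary appears whole
        # in the next chunk
        start = end - overlap
    return chunks
```

Each chunk would get its own LLM call, with a final call merging the per-chunk summaries into one analysis.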
Langfuse tracing adds ~20-50ms latency per LLM call (async flush) — acceptable overhead for the observability value; can be disabled via env vars in latency-sensitive deployments.
Redis TTL of 1 hour — documents with rapidly changing content (live reports) would return stale analysis. TTL is configurable via env var but not per-document.
No authentication or rate limiting on upload endpoint — suitable for internal or demo use; would require API key middleware or OAuth for public-facing deployment.
Would improve: add chunked processing for large documents, per-user rate limiting, webhook support for async processing of large files, and support for additional file formats.
Outcome & Impact
Structured document analysis API returning consistent JSON (summary, document_type, key_points[], important_entities[]) for PDF, TXT, and EML uploads.
Sub-100ms cache hits for repeated documents via Redis SHA-256 keyed cache — the LLM is bypassed entirely, with the `cached: true` field in the response confirming the hit.
Full LLM observability via Langfuse: every GPT-4o-mini call traced with prompt, response, token usage (prompt + completion), and estimated cost.
Structured JSON logs on every request via Loguru: filename, text_length, cached status, processing time — correlatable with application monitoring and alerting.
GitHub Actions CI enforces quality on every commit: Ruff lint + format check, pytest, ESLint, TypeScript type-check, and Docker image builds for both services.
One-command local development stack (`make dev`) and one-command Docker deployment (`make docker-up`) with Redis, backend, and frontend orchestrated by Docker Compose.
Security: Redis not exposed outside Docker network, production image runs as non-root UID 1001, CORS restricted to configured origins, secrets supplied via environment variables.
Tech Stack
Backend: Python 3.11, FastAPI (web framework), Gunicorn + UvicornWorker (production ASGI server)
LLM: OpenAI GPT-4o-mini (document analysis — summary, type, key points, entities)
Caching: Redis 7 (AOF persistence, 1-hour TTL, SHA-256 content-keyed)
Observability: Langfuse (LLM cost + trace per call), Loguru (structured JSON request logs)
Text Extraction: pypdf (PDF), stdlib email (EML), UTF-8 decode (TXT)
Frontend: React 19, TypeScript, Vite 7 (build tool + dev server)
Containerisation: Docker (multi-stage builds — non-root production images), Docker Compose (full stack)
Web Server: Nginx (SPA fallback + /api reverse proxy in frontend image)
CI/CD: GitHub Actions (Ruff lint, pytest, ESLint, TypeScript check, Docker build on every push)
Dependency Management: Poetry (backend lockfile), npm (frontend)