
AI Document Parser

Production-ready AI microservice that accepts PDF, TXT, and EML files, extracts their text, caches results by SHA-256 content hash in Redis 7 (1-hour TTL, AOF persistence), and returns structured GPT-4o-mini analysis (summary, document type, key points, important entities). FastAPI backend with Gunicorn + UvicornWorker, a layered service architecture (document_service.py + llm_service.py), Langfuse LLM observability (prompt/response/tokens/cost per call), and Loguru structured JSON logs.

Python 3.11 · FastAPI · Gunicorn + UvicornWorker · OpenAI GPT-4o-mini · Redis 7 · Langfuse · Loguru · pypdf · React 19 · TypeScript · Vite 7 · Docker · Docker Compose · GitHub Actions · Poetry · Nginx

Role

AI Engineer & Full-stack Developer

Team

Solo

Company/Organization

Personal Project

The Problem

Extracting structured insights from PDF, TXT, and EML documents required manual effort or separate tools, with no unified REST API: no single endpoint handled all three formats.

Repeated analysis of identical documents (e.g., the same invoice re-uploaded) wasted LLM tokens and added unnecessary latency, with no deduplication in place.

Zero visibility into LLM cost, token usage, or trace data per request made it impossible to debug expensive or slow AI calls in production.

The absence of structured logging made it difficult to correlate request metadata (filename, text length, cache status) with application errors.

Deploying the service alongside a Redis cache and a React frontend required manual coordination; there was no containerised full-stack setup for consistent environments.

The Solution

Built a FastAPI microservice with a clean layered architecture separating extraction, caching, and LLM concerns.

Backend Architecture

app/services/document_service.py — Text extraction + cache orchestration (sketched after this list):

PDF: pypdf PdfReader, concatenates all page text

EML: stdlib email.message_from_bytes, walks MIME parts, extracts text/plain

TXT: UTF-8 decode with error fallback

Computes SHA-256 hash of extracted text, checks Redis async GET, returns cached result or continues to LLM

Unsupported MIME type raises HTTP 415; corrupt/unreadable file raises HTTP 422
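
A minimal sketch of that extraction and hashing flow, using hypothetical function names (`extract_text`, `cache_key`) rather than the project's actual internals:

```python
# Illustrative sketch of document_service.py (hypothetical names).
import hashlib
from email import message_from_bytes
from io import BytesIO

from fastapi import HTTPException
from pypdf import PdfReader


def extract_text(filename: str, data: bytes) -> str:
    """Extract plain text from a PDF, EML, or TXT upload."""
    suffix = filename.lower().rsplit(".", 1)[-1]
    if suffix not in {"pdf", "eml", "txt"}:
        raise HTTPException(status_code=415, detail="Unsupported file type")
    try:
        if suffix == "pdf":
            reader = PdfReader(BytesIO(data))
            return "\n".join(page.extract_text() or "" for page in reader.pages)
        if suffix == "eml":
            msg = message_from_bytes(data)
            parts = [
                (part.get_payload(decode=True) or b"").decode("utf-8", errors="replace")
                for part in msg.walk()
                if part.get_content_type() == "text/plain"
            ]
            return "\n".join(parts)
        return data.decode("utf-8", errors="replace")  # .txt with error fallback
    except Exception as exc:  # corrupt or unreadable file
        raise HTTPException(status_code=422, detail="Could not extract text") from exc


def cache_key(text: str) -> str:
    """SHA-256 of the extracted text, so identical content shares one cache entry."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```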

app/services/llm_service.py — OpenAI call + Langfuse trace (sketched after this list):

Sends extracted text to GPT-4o-mini with structured prompt requesting JSON: summary, document_type, key_points[], important_entities[]

Wraps call in Langfuse generation (start/end) recording model, prompt, response, usage tokens, and cost

Parses and validates JSON response; raises HTTP 502 on OpenAI failure or malformed JSON
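
A condensed sketch of that call path. The names are illustrative, and the Langfuse calls follow the v2-style Python SDK, so exact signatures may differ from the project's code:

```python
# Illustrative sketch of llm_service.py. Names are hypothetical; Langfuse
# calls follow the v2-style Python SDK and may differ in newer versions.
import json

from fastapi import HTTPException
from langfuse import Langfuse
from openai import OpenAI, OpenAIError

client = OpenAI()      # reads OPENAI_API_KEY from the environment
langfuse = Langfuse()  # reads LANGFUSE_* keys from the environment

PROMPT = (
    "Analyse the document and reply with JSON containing: summary, "
    "document_type, key_points (list of strings), important_entities (list of strings)."
)

REQUIRED_KEYS = {"summary", "document_type", "key_points", "important_entities"}


def analyse(text: str) -> dict:
    trace = langfuse.trace(name="document-analysis")
    generation = trace.generation(name="analysis", model="gpt-4o-mini", input=text)
    try:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": PROMPT},
                {"role": "user", "content": text},
            ],
            response_format={"type": "json_object"},
        )
        analysis = json.loads(resp.choices[0].message.content)
        if not isinstance(analysis, dict) or not REQUIRED_KEYS <= analysis.keys():
            raise ValueError("missing keys in LLM response")
    except (OpenAIError, ValueError) as exc:
        generation.end(output=None)
        raise HTTPException(status_code=502, detail="LLM analysis failed") from exc
    # Record response and token usage so Langfuse can attribute cost per call.
    generation.end(
        output=analysis,
        usage={"input": resp.usage.prompt_tokens, "output": resp.usage.completion_tokens},
    )
    return analysis
```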

app/core/redis_client.py — Async Redis wrapper (sketched after this list):

GET returns cached DocumentResponse or None

SET serialises to JSON with 3600s TTL

Redis 7 with AOF persistence ensures cache survives container restarts
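
A minimal sketch of such a wrapper using `redis.asyncio`, with hypothetical helper names; the schema import path is assumed from app/schemas/document.py described below:

```python
# Illustrative sketch of the async Redis cache wrapper (hypothetical names).
import json

import redis.asyncio as redis

from app.schemas.document import DocumentResponse  # path as described below

pool = redis.from_url("redis://redis:6379/0", decode_responses=True)

CACHE_TTL_SECONDS = 3600  # 1-hour TTL


async def get_cached(key: str) -> DocumentResponse | None:
    """Return the cached DocumentResponse for a content hash, or None on a miss."""
    raw = await pool.get(key)
    return DocumentResponse(**json.loads(raw)) if raw else None


async def set_cached(key: str, response: DocumentResponse) -> None:
    """Serialise the response to JSON and store it with the 3600s TTL."""
    await pool.set(key, response.model_dump_json(), ex=CACHE_TTL_SECONDS)
```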

app/core/langfuse_client.py — Langfuse singleton initialised from env vars (LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_BASE_URL); no-ops when the keys are not set

app/core/logger.py — Loguru sink configured for structured JSON output; every request logs filename, text_length, cached status, and processing time
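
Configuring Loguru for JSON output amounts to enabling `serialize=True` on a sink; a sketch with illustrative field values (not taken from the project's logs):

```python
# Illustrative Loguru setup for structured JSON logs; field values are examples.
import sys

from loguru import logger

logger.remove()                          # drop the default human-readable sink
logger.add(sys.stdout, serialize=True)   # emit one JSON object per log record

# Per-request fields land in the record's "extra" dict:
logger.bind(
    filename="invoice.pdf",
    text_length=1423,
    cached=False,
    processing_time_ms=843,
).info("document processed")
```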

app/api/v1/documents.py — POST /api/v1/documents/upload (sketched below, together with the schema models):

Accepts multipart file upload, validates extension (.pdf/.txt/.eml)

Calls document_service → on cache hit, return cached result; otherwise → llm_service → SET cache → return result

Returns DocumentResponse: filename, text_length, cached (bool), llm_analysis

app/schemas/document.py — Pydantic models: LLMAnalysis (summary, document_type, key_points, important_entities), DocumentResponse
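
Putting the schemas and route together, a simplified sketch; the service helpers reuse the hypothetical names from the sketches above, and their imports are omitted:

```python
# Illustrative sketch of the schemas and upload route. Helper functions
# (extract_text, cache_key, analyse, get_cached, set_cached) are the
# hypothetical ones from the earlier sketches.
from fastapi import APIRouter, HTTPException, UploadFile
from pydantic import BaseModel


class LLMAnalysis(BaseModel):
    summary: str
    document_type: str
    key_points: list[str]
    important_entities: list[str]


class DocumentResponse(BaseModel):
    filename: str
    text_length: int
    cached: bool
    llm_analysis: LLMAnalysis


router = APIRouter(prefix="/api/v1/documents")

ALLOWED_EXTENSIONS = (".pdf", ".txt", ".eml")


@router.post("/upload", response_model=DocumentResponse)
async def upload(file: UploadFile) -> DocumentResponse:
    if not file.filename or not file.filename.lower().endswith(ALLOWED_EXTENSIONS):
        raise HTTPException(status_code=415, detail="Unsupported file type")

    data = await file.read()
    text = extract_text(file.filename, data)   # document_service
    key = cache_key(text)

    cached = await get_cached(key)             # Redis lookup by content hash
    if cached is not None:
        return cached.model_copy(update={"cached": True})

    analysis = analyse(text)                   # llm_service, GPT-4o-mini
    response = DocumentResponse(
        filename=file.filename,
        text_length=len(text),
        cached=False,
        llm_analysis=LLMAnalysis(**analysis),
    )
    await set_cached(key, response)
    return response
```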

Frontend (React 19 + TypeScript + Vite 7)

src/api/documentApi.ts — fetch wrapper using VITE_API_URL env var; POST multipart form, returns typed DocumentResponse

src/components/FileUpload.tsx — Drag-and-drop or click-to-browse, validates file type client-side, calls API on submit

src/components/ResultCard.tsx — Renders LLM analysis with summary, document type badge, key points list, entities list, cache hit indicator

src/components/Loader.tsx — Spinner shown during API call

src/types/document.ts — TypeScript interfaces mirroring Pydantic schemas

Nginx production image (Node build → Nginx static serve + /api reverse proxy to backend)

Caching Strategy

SHA-256 hash of extracted text content as cache key: identical documents (regardless of filename) always hit the cache

1-hour TTL balances freshness with LLM cost savings

Redis 7 AOF persistence ensures cache survives restarts, avoiding cold-start cost spikes

`cached: true` field in response gives clients visibility into cache hit status

Observability

Langfuse traces every LLM call: prompt sent, response received, model used, token counts (prompt + completion), estimated cost

Langfuse dashboard provides cost analytics, latency histograms, and error rates across all document processing calls

Loguru structured JSON logs on every request for correlation with application monitoring

Docker + Deployment

Backend: multi-stage Dockerfile (Python 3.11-slim → poetry install → Gunicorn + UvicornWorker, runs as non-root UID 1001)

Frontend: multi-stage Dockerfile (Node 20 build → Nginx production image with SPA fallback + /api reverse proxy)

docker-compose.yml orchestrates api + frontend + redis with internal network (Redis not exposed externally)

Full stack: `make docker-up`; development: `make dev` (concurrent backend + frontend + Redis)

CI/CD (GitHub Actions)

backend job: Ruff lint, Ruff format check, pytest

frontend job: ESLint, TypeScript type-check (tsc --noEmit), Vite production build

docker job: build backend image, build frontend image

Runs on every push and pull request

Design Decisions

Chose SHA-256 content hash as cache key (not filename): identical documents with different names always hit the cache, and renamed files don't pollute the cache.

Separated document_service.py (extraction + cache) from llm_service.py (OpenAI + Langfuse): keeps LLM logic isolated, easy to swap models or add new...

Used Redis 7 with AOF persistence over in-memory caching: survives container restarts, shareable across multiple API replicas, and TTL management is handled natively by Redis.

Langfuse for LLM observability over custom logging: provides out-of-the-box cost tracking, latency histograms, and prompt/response replay. Optional: tracing no-ops when the Langfuse keys are not set.

Gunicorn + UvicornWorker over plain Uvicorn in production: Gunicorn manages worker lifecycle (crashes, memory limits), UvicornWorker provides async ASGI request handling.

GPT-4o-mini over GPT-4o: cost-effective for structured extraction tasks (~10x cheaper), sufficient quality for summary/classification/entity extraction.

Nginx reverse proxy in frontend Docker image: SPA fallback (all routes serve index.html), /api proxy to backend, static asset caching headers.

Production Docker image runs as non-root user (UID 1001): security best practice for containerised services, prevents privilege escalation if the container is compromised.

Poetry for backend dependency management: lockfile ensures reproducible installs across dev, CI, and production. Separation of main and dev dependency groups.

Pydantic settings (app/core/config.py) reads all config from the environment: single source of truth for configuration, type-validated, works with .env files.
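
A sketch of what such a settings module might look like with `pydantic-settings`; the field names and defaults are assumptions based on the configuration described above, not the project's actual values:

```python
# Illustrative sketch of app/core/config.py with pydantic-settings;
# field names and defaults are assumptions.
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    openai_api_key: str
    redis_url: str = "redis://redis:6379/0"
    cache_ttl_seconds: int = 3600

    # Langfuse is optional: tracing no-ops when the keys are unset.
    langfuse_public_key: str | None = None
    langfuse_secret_key: str | None = None
    langfuse_base_url: str | None = None

    cors_origins: list[str] = ["http://localhost:5173"]


settings = Settings()  # values come from env vars or the .env file
```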

Tradeoffs & Constraints

SHA-256 cache key means any text change (even whitespace) produces a cache miss: intentional for correctness, but means slightly reformatted identical documents are re-analysed.

Synchronous PDF extraction with pypdf: blocking for very large PDFs. For production scale with 100+ page documents, would move to a background job queue.

Single LLM call per document: no chunking for very long documents. GPT-4o-mini's context window (~128K tokens) handles most business documents; chunking would be needed for larger inputs.

Langfuse tracing adds ~20-50ms latency per LLM call (async flush): acceptable overhead for the observability value; can be disabled via env vars in latency-sensitive deployments.

Redis TTL of 1 hour: documents with rapidly changing content (live reports) would return stale analysis. TTL is configurable via env var but not...

No authentication or rate limiting on the upload endpoint: suitable for internal or demo use; would require API key middleware or OAuth for public-facing deployment.

Would improve: Add chunked processing for large documents, per-user rate limiting, webhook support for async processing of large files, support for...

Outcome & Impact

Structured document analysis API returning consistent JSON (summary, document_type, key_points[], important_entities[]) for PDF, TXT, and EML uploads...

Sub-100ms cache hits for repeated documents via the Redis SHA-256-keyed cache: the LLM is bypassed entirely, with the `cached: true` field in the response confirming the hit.

Full LLM observability via Langfuse: every GPT-4o-mini call traced with prompt, response, token usage (prompt + completion), and estimated cost...

Structured JSON logs on every request via Loguru: filename, text_length, cached status, processing time, correlatable with application monitoring and error tracking.

GitHub Actions CI enforces quality on every commit: Ruff lint + format check, pytest, ESLint, TypeScript type-check, and Docker image builds for both backend and frontend.

One-command local development stack (`make dev`) and one-command Docker deployment (`make docker-up`) with Redis, backend, and frontend orchestrated by Docker Compose.

Security: Redis not exposed outside the Docker network, production image runs as non-root UID 1001, CORS restricted to configured origins, secrets supplied via environment variables.

Tech Stack

Backend: Python 3.11, FastAPI (web framework), Gunicorn + UvicornWorker (production ASGI server)

LLM: OpenAI GPT-4o-mini (document analysis: summary, type, key points, entities)

Caching: Redis 7 (AOF persistence, 1-hour TTL, SHA-256 content-keyed)

Observability: Langfuse (LLM cost + trace per call), Loguru (structured JSON request logs)

Text Extraction: pypdf (PDF), stdlib email (EML), UTF-8 decode (TXT)

Frontend: React 19, TypeScript, Vite 7 (build tool + dev server)

Containerisation: Docker (multi-stage builds, non-root production images), Docker Compose (full stack)

Web Server: Nginx (SPA fallback + /api reverse proxy in frontend image)

CI/CD: GitHub Actions (Ruff lint, pytest, ESLint, TypeScript check, Docker build on every push)

Dependency Management: Poetry (backend lockfile), npm (frontend)
