Intelligent Document Compliance Agent
AI-powered document compliance pipeline built as a LangGraph state machine. Ingests PDF, DOCX, and image documents, runs prompt injection detection (guardrail halts pipeline on detection), PII redaction (emails, phone, CC, SSN → [REDACTED_*] tokens), and regulatory compliance checking (5 required legal clauses, banned keywords).
Role
AI Engineer & Full-stack Developer
Team
Solo
Company/Organization
Personal Project
The Problem
Legal and compliance teams manually review documents for PII, regulatory violations, and adversarial content — time-consuming and error-prone at...
Documents submitted to AI pipelines can contain prompt injection patterns (jailbreak attempts, role overrides, instruction hijacking) that compromise...
PII (emails, phone numbers, SSNs, credit card numbers) must be detected and redacted before any LLM processing to prevent accidental data exposure...
Compliance teams need to verify presence of required legal clauses (Governing Law, Confidentiality, Termination, Dispute Resolution, Limitation of...
Existing tools address these concerns in isolation — no open-source pipeline combined guardrails, PII redaction, and compliance checking in a single,...
The Solution
Built a LangGraph state machine pipeline with conditional edges and a security-first architecture.
LangGraph Pipeline (4 Nodes)
Parse Doc — document_processing/parser.py extracts text from PDF (pypdf), DOCX (python-docx), and images (pytesseract OCR). Auto-detects format...
Guardrails — guardrails/prompt_injection.py runs regex-based detection of 15+ adversarial patterns (jailbreak, role override, instruction...
PII Detect & Redact — compliance/pii_redactor.py detects emails, phone numbers, credit card numbers, and SSNs using regex patterns. Reports what...
Compliance Check — compliance/rules_engine.py checks for required legal clauses (Governing Law, Confidentiality, Termination, Dispute Resolution,...
Backend (Python + FastAPI)
app.py — FastAPI entry point with CORS, two endpoints:
POST /analyze-document — accepts file upload, runs full pipeline, returns structured JSON with guardrail_result, pii_result, compliance_result
GET /health — liveness probe
agent/graph.py — LangGraph pipeline definition with nodes and conditional edges
agent/nodes.py — Pipeline step functions (parse, guardrail, pii, compliance)
agent/state.py — DocumentState TypedDict shared across all nodes
Testing (54 pytest tests)
conftest.py auto-generates test fixtures (PDF, DOCX, image files) — no manual setup
test_guardrails.py — injection detection for 15+ adversarial patterns
test_parser.py — multi-format parsing (PDF, DOCX, image)
test_pii_redactor.py — PII detection and redaction accuracy
test_pipeline.py — end-to-end pipeline integration tests
test_rules_engine.py — compliance clause and keyword checks
Frontend (React + TypeScript + Vite)
DocumentAnalyzer.tsx — file upload component with drag-and-drop, pipeline results display (guardrail status, PII findings with masked previews,...
DocumentAnalyzer.module.css — scoped styles
CI/CD (GitHub Actions)
Backend Tests: Python 3.13 + tesseract → pip install → pytest (54 tests)
Frontend Build: Node 20 → npm ci → tsc --noEmit → vite build
Runs on every push
Design Decisions
Chose LangGraph state machine over a simple sequential function chain — conditional edges enable fail-fast guardrails that halt the entire pipeline...
Guardrail node runs before any LLM or PII processing — adversarial documents are rejected immediately at the first node, preventing prompt injection...
Regex-based PII detection over LLM-based — deterministic, auditable, no API costs, works offline. Covers the most common PII patterns (emails,...
Declarative compliance rules engine — rule definitions separated from engine logic. Adding a new required clause or banned keyword requires only a...
Two guardrail modes (fail-fast vs sanitize) — fail-fast rejects the document entirely for strict compliance use cases; sanitize strips injection...
Auto-generated test fixtures in conftest.py — pytest generates PDF, DOCX, and image test files at runtime using reportlab and python-docx. No binary...
DocumentState TypedDict as shared pipeline context — all nodes read from and write to a single state object passed through the LangGraph graph. Clean...
FastAPI for backend — async-native, automatic OpenAPI docs, easy file upload handling (UploadFile), Pydantic validation, and simple CORS...
Tradeoffs & Constraints
Regex-based PII detection — high precision for common patterns (emails, phones, SSNs, credit cards) but misses novel or obfuscated PII formats....
Regex-based prompt injection detection — effective for known adversarial patterns (15+ covered) but can be bypassed by novel attack vectors not in...
Synchronous pipeline — documents are processed sequentially through all nodes. For high-volume batch processing, an async queue (Celery, RQ) with...
No LLM integration yet — OpenAI API is listed as planned integration. Current compliance checking is rule-based only; LLM-assisted clause extraction...
tesseract OCR for image parsing — works for clean printed documents but accuracy degrades on handwritten text, poor scans, or complex layouts. A...
No persistent storage — documents are processed in-memory and not stored. Production use would require database storage for audit trails, compliance...
Would improve: Add LLM-based compliance checking for nuanced clause detection, implement async batch processing for high-volume workflows, add...
Outcome & Impact
Production-ready document compliance pipeline processing PDF, DOCX, and image documents through a 4-node LangGraph state machine with fail-fast...
54 pytest tests covering all pipeline modules — guardrails (15+ injection patterns), parser (PDF/DOCX/image), PII redactor (detection + redaction...
LangGraph conditional edge architecture — guardrail node halts pipeline and returns early if injection detected, preventing adversarial content from...
PII redaction reports masked previews (jan*, 555*, 123***) before replacing with [REDACTED_*] tokens, enabling audit of what was found without...
Declarative compliance rules engine — 5 required legal clauses (Governing Law, Confidentiality, Termination, Dispute Resolution, Limitation of...
Multi-format document parsing handles PDF (pypdf), DOCX (python-docx), and images (pytesseract OCR) with auto-detection by file extension.
GitHub Actions CI/CD runs on every push — backend tests (Python 3.13 + tesseract + pytest) and frontend build (Node 20 + tsc --noEmit + vite build).
Makefile automation with 8 commands: make install, make dev, make test, make test-backend, make test-frontend, make stop, make clean, make help.
Tech Stack
Backend: Python, FastAPI (web framework with async file upload), Uvicorn (ASGI server)
Pipeline: LangGraph (state machine with conditional edges), DocumentState TypedDict
Document Parsing: pypdf (PDF text extraction), python-docx (DOCX parsing), pytesseract (OCR for images)
Compliance: regex-based PII detection (emails, phones, CC, SSN), declarative rules engine
Guardrails: regex-based prompt injection detection (15+ adversarial patterns)
Frontend: React, TypeScript, Vite, CSS Modules
Testing: pytest (54 tests), conftest.py auto-generated fixtures, reportlab (test PDF generation)
CI/CD: GitHub Actions (backend tests + frontend build on every push)
Automation: Makefile (install, dev, test, stop, clean)