Intelligent Document Compliance Agent

AI-powered document compliance pipeline built as a LangGraph state machine. It ingests PDF, DOCX, and image documents, then runs three checks in order: prompt injection detection (a guardrail that halts the pipeline on detection), PII redaction (emails, phone numbers, credit card numbers, and SSNs replaced with [REDACTED_*] tokens), and regulatory compliance checking (5 required legal clauses plus banned keywords).

Python, FastAPI, LangGraph, React, TypeScript, Vite, pypdf, python-docx, pytesseract, pytest, GitHub Actions, Makefile

Role

AI Engineer & Full-stack Developer

Team

Solo

Company/Organization

Personal Project

The Problem

•

Legal and compliance teams manually review documents for PII, regulatory violations, and adversarial content — time-consuming and error-prone at...

•

Documents submitted to AI pipelines can contain prompt injection patterns (jailbreak attempts, role overrides, instruction hijacking) that compromise...

•

PII (emails, phone numbers, SSNs, credit card numbers) must be detected and redacted before any LLM processing to prevent accidental data exposure...

•

Compliance teams need to verify presence of required legal clauses (Governing Law, Confidentiality, Termination, Dispute Resolution, Limitation of...

•

Existing tools address these concerns in isolation — no open-source pipeline combined guardrails, PII redaction, and compliance checking in a single,...

The Solution

•

Built a LangGraph state machine pipeline with conditional edges and a security-first architecture.

LangGraph Pipeline (4 Nodes)

•

Parse Doc — document_processing/parser.py extracts text from PDF (pypdf), DOCX (python-docx), and images (pytesseract OCR). Auto-detects format...

•

Guardrails — guardrails/prompt_injection.py runs regex-based detection of 15+ adversarial patterns (jailbreak, role override, instruction...

•

PII Detect & Redact — compliance/pii_redactor.py detects emails, phone numbers, credit card numbers, and SSNs using regex patterns. Reports what...

•

Compliance Check — compliance/rules_engine.py checks for required legal clauses (Governing Law, Confidentiality, Termination, Dispute Resolution,...
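
The four-node flow above, with its early exit after the guardrail step, can be sketched in plain Python. This is only an illustration of the routing idea, not the project's code: it uses a hand-rolled runner instead of LangGraph itself (where the same branch would typically be declared with `add_conditional_edges` on a `StateGraph`), and every function name, state key, and the single hard-coded injection phrase are assumptions.

```python
from typing import Callable, Dict, Optional, TypedDict


class State(TypedDict, total=False):
    # Hypothetical state keys, chosen for this sketch only.
    text: str
    injection_detected: bool
    pii_found: list
    compliance_ok: bool
    visited: list


def parse_node(state: State) -> State:
    # Stand-in for real parsing: the text is passed in directly.
    return {**state, "text": state["text"].strip()}


def guardrail_node(state: State) -> State:
    # One hard-coded phrase stands in for the real pattern list.
    flagged = "ignore previous instructions" in state["text"].lower()
    return {**state, "injection_detected": flagged}


def pii_node(state: State) -> State:
    return {**state, "pii_found": []}  # placeholder result


def compliance_node(state: State) -> State:
    return {**state, "compliance_ok": True}  # placeholder result


def route_after_guardrail(state: State) -> Optional[str]:
    # Conditional edge: returning None ends the run early (fail-fast).
    return None if state["injection_detected"] else "pii"


NODES: Dict[str, Callable[[State], State]] = {
    "parse": parse_node,
    "guardrail": guardrail_node,
    "pii": pii_node,
    "compliance": compliance_node,
}

# Each entry maps a node to a function that picks the next node.
EDGES: Dict[str, Callable[[State], Optional[str]]] = {
    "parse": lambda s: "guardrail",
    "guardrail": route_after_guardrail,
    "pii": lambda s: "compliance",
    "compliance": lambda s: None,
}


def run(text: str) -> State:
    state: State = {"text": text}
    visited = []
    current: Optional[str] = "parse"
    while current is not None:
        visited.append(current)
        state = NODES[current](state)
        current = EDGES[current](state)
    state["visited"] = visited
    return state
```

With this runner, a clean document visits all four nodes, while a flagged one stops after `guardrail` and never reaches the PII or compliance steps.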

Backend (Python + FastAPI)

•

app.py — FastAPI entry point with CORS, two endpoints:

•

POST /analyze-document — accepts file upload, runs full pipeline, returns structured JSON with guardrail_result, pii_result, compliance_result

•

GET /health — liveness probe

•

agent/graph.py — LangGraph pipeline definition with nodes and conditional edges

•

agent/nodes.py — Pipeline step functions (parse, guardrail, pii, compliance)

•

agent/state.py — DocumentState TypedDict shared across all nodes
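
A shared `DocumentState` TypedDict might look roughly like the sketch below. The `guardrail_result`, `pii_result`, and `compliance_result` keys mirror the response fields listed for POST /analyze-document; the remaining keys and the `compliance_step` function are assumptions made for illustration. The node returns only the keys it changes and the caller merges them, which is the usual pattern for graph nodes that share one state object.

```python
from typing import Optional, TypedDict


class DocumentState(TypedDict, total=False):
    # Keys named after the API response fields in the text above.
    guardrail_result: dict
    pii_result: dict
    compliance_result: dict
    # Hypothetical keys, assumed for this sketch.
    file_path: str
    raw_text: str
    redacted_text: str
    error: Optional[str]


def compliance_step(state: DocumentState) -> DocumentState:
    """Read what earlier nodes wrote; return only the keys this node updates."""
    text = state.get("redacted_text", "")
    found = "governing law" in text.lower()
    return {"compliance_result": {"governing_law_present": found}}


# Merge the partial update back into the shared state.
state: DocumentState = {"redacted_text": "Governing Law: Province of Ontario."}
state = {**state, **compliance_step(state)}
```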

Testing (54 pytest tests)

•

conftest.py auto-generates test fixtures (PDF, DOCX, image files) — no manual setup

•

test_guardrails.py — injection detection for 15+ adversarial patterns

•

test_parser.py — multi-format parsing (PDF, DOCX, image)

•

test_pii_redactor.py — PII detection and redaction accuracy

•

test_pipeline.py — end-to-end pipeline integration tests

•

test_rules_engine.py — compliance clause and keyword checks

Frontend (React + TypeScript + Vite)

•

DocumentAnalyzer.tsx — file upload component with drag-and-drop, pipeline results display (guardrail status, PII findings with masked previews,...

•

DocumentAnalyzer.module.css — scoped styles

CI/CD (GitHub Actions)

•

Backend Tests: Python 3.13 + tesseract → pip install → pytest (54 tests)

•

Frontend Build: Node 20 → npm ci → tsc --noEmit → vite build

•

Runs on every push

Design Decisions

•

Chose LangGraph state machine over a simple sequential function chain — conditional edges enable fail-fast guardrails that halt the entire pipeline...

•

Guardrail node runs before any LLM or PII processing — adversarial documents are rejected immediately at the first node, preventing prompt injection...

•

Regex-based PII detection over LLM-based — deterministic, auditable, no API costs, works offline. Covers the most common PII patterns (emails,...

•

Declarative compliance rules engine — rule definitions separated from engine logic. Adding a new required clause or banned keyword requires only a...

•

Two guardrail modes (fail-fast vs sanitize) — fail-fast rejects the document entirely for strict compliance use cases; sanitize strips injection...

•

Auto-generated test fixtures in conftest.py — pytest generates PDF, DOCX, and image test files at runtime using reportlab and python-docx. No binary...

•

DocumentState TypedDict as shared pipeline context — all nodes read from and write to a single state object passed through the LangGraph graph. Clean...

•

FastAPI for backend — async-native, automatic OpenAPI docs, easy file upload handling (UploadFile), Pydantic validation, and simple CORS...
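
The declarative rules idea from the list above can be sketched as data plus a small generic checker. This is a minimal illustration under stated assumptions: the regex patterns, the banned keyword, and the result keys are hypothetical, and only a subset of the clause names mentioned above is included.

```python
import re
from typing import Dict, List

# Rule definitions live in data, separate from the engine logic.
# Patterns are illustrative guesses, not the project's actual rules.
REQUIRED_CLAUSES: Dict[str, str] = {
    "Governing Law": r"governing\s+law",
    "Confidentiality": r"confidential",
    "Termination": r"terminat",
    "Dispute Resolution": r"dispute\s+resolution|arbitration",
}

# Hypothetical banned keyword, for demonstration only.
BANNED_KEYWORDS: List[str] = ["guaranteed returns"]


def check_compliance(
    text: str,
    required: Dict[str, str] = REQUIRED_CLAUSES,
    banned: List[str] = BANNED_KEYWORDS,
) -> dict:
    # A clause counts as present if its pattern appears anywhere in the text.
    missing = [
        name
        for name, pattern in required.items()
        if not re.search(pattern, text, re.IGNORECASE)
    ]
    # Banned keywords are matched as whole phrases, case-insensitively.
    hits = [
        kw
        for kw in banned
        if re.search(r"\b" + re.escape(kw) + r"\b", text, re.IGNORECASE)
    ]
    return {
        "missing_clauses": missing,
        "banned_keywords_found": hits,
        "compliant": not missing and not hits,
    }
```

Under this layout, adding a clause or a banned keyword means editing `REQUIRED_CLAUSES` or `BANNED_KEYWORDS`; `check_compliance` stays unchanged.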

Tradeoffs & Constraints

•

Regex-based PII detection — high precision for common patterns (emails, phones, SSNs, credit cards) but misses novel or obfuscated PII formats....

•

Regex-based prompt injection detection — effective for known adversarial patterns (15+ covered) but can be bypassed by novel attack vectors not in...

•

Synchronous pipeline — documents are processed sequentially through all nodes. For high-volume batch processing, an async queue (Celery, RQ) with...

•

No LLM integration yet: the OpenAI API is listed as a planned integration. Current compliance checking is rule-based only; LLM-assisted clause extraction...

•

tesseract OCR for image parsing — works for clean printed documents but accuracy degrades on handwritten text, poor scans, or complex layouts. A...

•

No persistent storage — documents are processed in-memory and not stored. Production use would require database storage for audit trails, compliance...

•

Would improve: Add LLM-based compliance checking for nuanced clause detection, implement async batch processing for high-volume workflows, add...
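
The limits of regex-based injection detection noted above, and the fail-fast versus sanitize modes from the design decisions, can be seen in a small sketch. The two patterns, the mode names, and the result keys are assumptions for illustration; the project's actual pattern list is not reproduced here.

```python
import re
from typing import List

# Two illustrative patterns; a real list would be much longer.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.IGNORECASE),
    re.compile(r"you\s+are\s+now\s+(a|an)\s+\w+", re.IGNORECASE),
]


def scan(text: str) -> List[str]:
    # Collect every matched phrase across all patterns.
    return [m.group(0) for p in INJECTION_PATTERNS for m in p.finditer(text)]


def apply_guardrail(text: str, mode: str = "fail_fast") -> dict:
    matches = scan(text)
    if not matches:
        return {"blocked": False, "text": text, "matches": []}
    if mode == "fail_fast":
        # Reject the whole document; nothing is passed downstream.
        return {"blocked": True, "text": None, "matches": matches}
    # Sanitize mode: strip the matched phrases and let the rest through.
    cleaned = text
    for pattern in INJECTION_PATTERNS:
        cleaned = pattern.sub("", cleaned)
    return {"blocked": False, "text": cleaned, "matches": matches}
```

A known phrasing such as "Ignore all previous instructions" is caught in either mode, but a paraphrase like "Disregard everything you were told earlier" matches neither pattern and passes through untouched, which is the bypass risk this tradeoff describes.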

Outcome & Impact

•

Production-ready document compliance pipeline processing PDF, DOCX, and image documents through a 4-node LangGraph state machine with fail-fast...

•

54 pytest tests covering all pipeline modules — guardrails (15+ injection patterns), parser (PDF/DOCX/image), PII redactor (detection + redaction...

•

LangGraph conditional edge architecture — guardrail node halts pipeline and returns early if injection detected, preventing adversarial content from...

•

PII redaction reports masked previews (jan*, 555*, 123***) before replacing with [REDACTED_*] tokens, enabling audit of what was found without...

•

Declarative compliance rules engine — 5 required legal clauses (Governing Law, Confidentiality, Termination, Dispute Resolution, Limitation of...

•

Multi-format document parsing handles PDF (pypdf), DOCX (python-docx), and images (pytesseract OCR) with auto-detection by file extension.

•

GitHub Actions CI/CD runs on every push — backend tests (Python 3.13 + tesseract + pytest) and frontend build (Node 20 + tsc --noEmit + vite build).

•

Makefile automation with 8 commands: make install, make dev, make test, make test-backend, make test-frontend, make stop, make clean, make help.
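
The masked-preview step described in the list above can be sketched as follows. This is a simplified illustration, not the project's redactor: the regex patterns, the three-character preview format, and the token suffixes (`EMAIL`, `SSN`, `PHONE`, `CC`) are assumptions, and the patterns cover only common US-style formats.

```python
import re
from typing import Dict, List, Tuple

# Illustrative patterns only; real-world PII formats vary widely.
PII_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "CC": re.compile(r"\b(?:\d{4}[-\s]?){3}\d{4}\b"),
}


def mask(value: str, keep: int = 3) -> str:
    # Keep a short prefix so auditors can see what was found
    # without the full value being exposed.
    return value[:keep] + "***"


def redact(text: str) -> Tuple[str, List[dict]]:
    findings: List[dict] = []
    for label, pattern in PII_PATTERNS.items():
        # Record a masked preview first, then replace the value with a token.
        for match in pattern.finditer(text):
            findings.append({"type": label, "preview": mask(match.group(0))})
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text, findings


redacted, found = redact("Contact jane@example.com or 555-123-4567. SSN 123-45-6789.")
```

Recording the preview before substitution keeps the audit record and the redacted text in sync, since both come from the same match.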

Tech Stack

•

Backend: Python, FastAPI (web framework with async file upload), Uvicorn (ASGI server)

•

Pipeline: LangGraph (state machine with conditional edges), DocumentState TypedDict

•

Document Parsing: pypdf (PDF text extraction), python-docx (DOCX parsing), pytesseract (OCR for images)

•

Compliance: regex-based PII detection (emails, phones, CC, SSN), declarative rules engine

•

Guardrails: regex-based prompt injection detection (15+ adversarial patterns)

•

Frontend: React, TypeScript, Vite, CSS Modules

•

Testing: pytest (54 tests), conftest.py auto-generated fixtures, reportlab (test PDF generation)

•

CI/CD: GitHub Actions (backend tests + frontend build on every push)

•

Automation: Makefile (install, dev, test, stop, clean)