
Intelligent Document Compliance Agent

AI-powered document compliance pipeline built as a LangGraph state machine. Ingests PDF, DOCX, and image documents, runs prompt injection detection (guardrail halts pipeline on detection), PII redaction (emails, phone, CC, SSN → [REDACTED_*] tokens), and regulatory compliance checking (5 required legal clauses, banned keywords).

Python, FastAPI, LangGraph, React, TypeScript, Vite, pypdf, python-docx, pytesseract, pytest, GitHub Actions, Makefile

Role

AI Engineer & Full-stack Developer

Team

Solo

Company/Organization

Personal Project

The Problem

Legal and compliance teams manually review documents for PII, regulatory violations, and adversarial content, which is time-consuming and error-prone at...

Documents submitted to AI pipelines can contain prompt injection patterns (jailbreak attempts, role overrides, instruction hijacking) that compromise...

PII (emails, phone numbers, SSNs, credit card numbers) must be detected and redacted before any LLM processing to prevent accidental data exposure...

Compliance teams need to verify presence of required legal clauses (Governing Law, Confidentiality, Termination, Dispute Resolution, Limitation of...

Existing tools address these concerns in isolation; no open-source pipeline combined guardrails, PII redaction, and compliance checking in a single,...

The Solution

Built a LangGraph state machine pipeline with conditional edges and a security-first architecture.

LangGraph Pipeline (4 Nodes)

Parse Doc — document_processing/parser.py extracts text from PDF (pypdf), DOCX (python-docx), and images (pytesseract OCR). Auto-detects format...

Guardrails — guardrails/prompt_injection.py runs regex-based detection of 15+ adversarial patterns (jailbreak, role override, instruction...

PII Detect & Redact — compliance/pii_redactor.py detects emails, phone numbers, credit card numbers, and SSNs using regex patterns. Reports what...

Compliance Check — compliance/rules_engine.py checks for required legal clauses (Governing Law, Confidentiality, Termination, Dispute Resolution,...
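The control flow of the four nodes above can be sketched in plain Python. This is not LangGraph code and not the project's implementation; it only mimics the routing idea, where a conditional step after the guardrail decides whether the pipeline continues or halts. All function names, state keys, and the single injection phrase are hypothetical.

```python
from typing import Any, Callable, Dict, Optional

State = Dict[str, Any]


def parse_doc(state: State) -> State:
    # Stand-in for text extraction; the text is already in the state here.
    state["steps"].append("parse_doc")
    return state


def guardrails(state: State) -> State:
    # Hypothetical single phrase; a real guardrail would check many patterns.
    state["steps"].append("guardrails")
    state["injection_detected"] = (
        "ignore previous instructions" in state["text"].lower()
    )
    return state


def pii_redact(state: State) -> State:
    state["steps"].append("pii_redact")
    return state


def compliance_check(state: State) -> State:
    state["steps"].append("compliance_check")
    return state


ORDER = ["parse_doc", "guardrails", "pii_redact", "compliance_check"]
NODES: Dict[str, Callable[[State], State]] = {
    "parse_doc": parse_doc,
    "guardrails": guardrails,
    "pii_redact": pii_redact,
    "compliance_check": compliance_check,
}


def next_node(current: str, state: State) -> Optional[str]:
    """Conditional routing: stop after guardrails if an injection was found."""
    if current == "guardrails" and state["injection_detected"]:
        return None
    position = ORDER.index(current)
    return ORDER[position + 1] if position + 1 < len(ORDER) else None


def run_pipeline(text: str) -> State:
    state: State = {"text": text, "steps": [], "injection_detected": False}
    node: Optional[str] = "parse_doc"
    while node is not None:
        state = NODES[node](state)
        node = next_node(node, state)
    return state
```

Calling `run_pipeline` on clean text visits all four nodes, while text containing the hypothetical injection phrase stops after `guardrails`, so the PII and compliance steps never see it.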

Backend (Python + FastAPI)

app.py — FastAPI entry point with CORS, two endpoints:

POST /analyze-document: accepts a file upload, runs the full pipeline, and returns structured JSON with guardrail_result, pii_result, compliance_result

GET /health: liveness probe

agent/graph.py — LangGraph pipeline definition with nodes and conditional edges

agent/nodes.py — Pipeline step functions (parse, guardrail, pii, compliance)

agent/state.py — DocumentState TypedDict shared across all nodes
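As a rough illustration of the shared-state idea, a `DocumentState` TypedDict might look like the sketch below. Only `guardrail_result`, `pii_result`, and `compliance_result` are named in the text above; `file_path`, `raw_text`, the shape of each result, and the `compliance_node` function are assumptions made for the example.

```python
from typing import Any, Dict, TypedDict


class DocumentState(TypedDict, total=False):
    # total=False lets each node fill in its own keys as the pipeline advances.
    file_path: str                      # hypothetical field
    raw_text: str                       # hypothetical field
    guardrail_result: Dict[str, Any]    # key name taken from the API response
    pii_result: Dict[str, Any]          # key name taken from the API response
    compliance_result: Dict[str, Any]   # key name taken from the API response


def compliance_node(state: DocumentState) -> DocumentState:
    """Hypothetical node: reads raw_text, writes compliance_result."""
    text = state.get("raw_text", "").lower()
    updated = state.copy()  # leave the caller's state untouched
    updated["compliance_result"] = {"has_governing_law": "governing law" in text}
    return updated
```

Each node takes the state, adds its own key, and returns it, so later nodes and the API response can read every earlier result from one object.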

Testing (54 pytest tests)

conftest.py: auto-generates test fixtures (PDF, DOCX, image files), no manual setup

test_guardrails.py: injection detection for 15+ adversarial patterns

test_parser.py: multi-format parsing (PDF, DOCX, image)

test_pii_redactor.py: PII detection and redaction accuracy

test_pipeline.py: end-to-end pipeline integration tests

test_rules_engine.py: compliance clause and keyword checks

Frontend (React + TypeScript + Vite)

DocumentAnalyzer.tsx: file upload component with drag-and-drop, pipeline results display (guardrail status, PII findings with masked previews,...

DocumentAnalyzer.module.css: scoped styles

CI/CD (GitHub Actions)

Backend Tests: Python 3.13 + tesseract → pip install → pytest (54 tests)

Frontend Build: Node 20 → npm ci → tsc --noEmit → vite build

Runs on every push

Design Decisions

Chose a LangGraph state machine over a simple sequential function chain: conditional edges enable fail-fast guardrails that halt the entire pipeline...

Guardrail node runs before any LLM or PII processing: adversarial documents are rejected immediately after parsing, before any downstream node runs, preventing prompt injection...

Regex-based PII detection over LLM-based: deterministic, auditable, no API costs, works offline. Covers the most common PII patterns (emails,...

Declarative compliance rules engine: rule definitions are separated from engine logic. Adding a new required clause or banned keyword requires only a...

Two guardrail modes (fail-fast vs sanitize): fail-fast rejects the document entirely for strict compliance use cases; sanitize strips injection...

Auto-generated test fixtures in conftest.py: pytest generates PDF, DOCX, and image test files at runtime using reportlab and python-docx. No binary...

DocumentState TypedDict as shared pipeline context: all nodes read from and write to a single state object passed through the LangGraph graph. Clean...

FastAPI for backend: async-native, automatic OpenAPI docs, easy file upload handling (UploadFile), Pydantic validation, and simple CORS...
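To make the declarative rules engine decision concrete, here is a minimal sketch in which rules are plain data and the engine is one small function. The regex patterns, the banned keyword, and the names `ClauseRule` and `check_compliance` are assumptions for illustration, not the project's rule set. Only the four clause names that appear in full in this write-up are listed; the fifth is truncated in the text and is left out rather than guessed.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class ClauseRule:
    name: str      # human-readable clause name
    pattern: str   # hypothetical regex that signals the clause is present


# Rule definitions live apart from the engine; adding a rule is a data change.
REQUIRED_CLAUSES: List[ClauseRule] = [
    ClauseRule("Governing Law", r"governing\s+law"),
    ClauseRule("Confidentiality", r"confidential"),
    ClauseRule("Termination", r"terminat"),
    ClauseRule("Dispute Resolution", r"dispute\s+resolution"),
]
BANNED_KEYWORDS: List[str] = ["guaranteed returns"]  # hypothetical example


def check_compliance(text: str) -> Dict[str, Any]:
    """Report missing required clauses and any banned keywords found."""
    missing = [
        rule.name
        for rule in REQUIRED_CLAUSES
        if not re.search(rule.pattern, text, re.IGNORECASE)
    ]
    banned = [kw for kw in BANNED_KEYWORDS if kw.lower() in text.lower()]
    return {
        "missing_clauses": missing,
        "banned_keywords": banned,
        "compliant": not missing and not banned,
    }
```

With this layout, supporting another clause means appending one `ClauseRule` entry; `check_compliance` itself does not change.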

Tradeoffs & Constraints

Regex-based PII detection: high precision for common patterns (emails, phones, SSNs, credit cards) but misses novel or obfuscated PII formats....

Regex-based prompt injection detection: effective for known adversarial patterns (15+ covered) but can be bypassed by novel attack vectors not in...

Synchronous pipeline: documents are processed sequentially through all nodes. For high-volume batch processing, an async queue (Celery, RQ) with...

No LLM integration yet: OpenAI API is listed as a planned integration. Current compliance checking is rule-based only; LLM-assisted clause extraction...

tesseract OCR for image parsing: works for clean printed documents but accuracy degrades on handwritten text, poor scans, or complex layouts. A...

No persistent storage: documents are processed in-memory and not stored. Production use would require database storage for audit trails, compliance...

Would improve: Add LLM-based compliance checking for nuanced clause detection, implement async batch processing for high-volume workflows, add...
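The regex guardrail tradeoff, together with the fail-fast and sanitize modes, can be illustrated with a small sketch. The two patterns, the mode strings, and the result shape are hypothetical; the project's actual pattern list is not reproduced here.

```python
import re
from typing import Any, Dict, List

# Two hypothetical patterns standing in for a larger list.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(?:all\s+)?(?:previous|prior)\s+instructions", re.IGNORECASE),
    re.compile(r"you\s+are\s+now\s+\w+", re.IGNORECASE),
]


def run_guardrail(text: str, mode: str = "fail_fast") -> Dict[str, Any]:
    """Detect known injection phrases, then either reject or strip them."""
    if mode not in ("fail_fast", "sanitize"):
        raise ValueError(f"unknown mode: {mode}")

    matches: List[str] = [
        m.group(0) for pattern in INJECTION_PATTERNS for m in pattern.finditer(text)
    ]
    if not matches:
        return {"safe": True, "matches": [], "text": text}

    if mode == "fail_fast":
        # Reject the document outright; no text is passed downstream.
        return {"safe": False, "matches": matches, "text": None}

    # sanitize: remove the matched phrases and let the rest continue.
    cleaned = text
    for pattern in INJECTION_PATTERNS:
        cleaned = pattern.sub("", cleaned)
    return {"safe": True, "matches": matches, "text": cleaned}
```

A rephrased attack such as "Disregard what you were told earlier" matches neither pattern and passes as safe, which is the bypass risk described above for any fixed pattern list.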

Outcome & Impact

Production-ready document compliance pipeline processing PDF, DOCX, and image documents through a 4-node LangGraph state machine with fail-fast...

54 pytest tests covering all pipeline modules: guardrails (15+ injection patterns), parser (PDF/DOCX/image), PII redactor (detection + redaction...

LangGraph conditional edge architecture: guardrail node halts pipeline and returns early if injection detected, preventing adversarial content from...

PII redaction reports masked previews (jan*, 555*, 123***) before replacing with [REDACTED_*] tokens, enabling audit of what was found without...

Declarative compliance rules engine: 5 required legal clauses (Governing Law, Confidentiality, Termination, Dispute Resolution, Limitation of...

Multi-format document parsing handles PDF (pypdf), DOCX (python-docx), and images (pytesseract OCR) with auto-detection by file extension.

GitHub Actions CI/CD runs on every push: backend tests (Python 3.13 + tesseract + pytest) and frontend build (Node 20 + tsc --noEmit + vite build).

Makefile automation with 8 commands: make install, make dev, make test, make test-backend, make test-frontend, make stop, make clean, make help.
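The regex-based redaction with masked previews mentioned in this section could look roughly like the sketch below. The patterns are simplified, the token suffixes (`EMAIL`, `CC`, `SSN`, `PHONE`) are assumed expansions of `[REDACTED_*]`, and the preview rule (first three characters plus `***`) is a guess at the masking format, not the project's confirmed behavior.

```python
import re
from typing import Dict, List, Tuple

# Simplified, hypothetical patterns; real-world PII formats vary far more.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "CC": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def redact_pii(text: str) -> Tuple[str, List[Dict[str, str]]]:
    """Replace PII with [REDACTED_*] tokens and report masked previews."""
    findings: List[Dict[str, str]] = []

    def make_replacer(kind: str):
        def replace(match: "re.Match[str]") -> str:
            # Keep only a short masked preview so the report never holds raw PII.
            findings.append({"type": kind, "preview": match.group(0)[:3] + "***"})
            return f"[REDACTED_{kind}]"
        return replace

    for kind, pattern in PII_PATTERNS.items():
        text = pattern.sub(make_replacer(kind), text)
    return text, findings
```

Because the matching is purely pattern-based, an obfuscated address such as "jane at example dot com" passes through unchanged, which is the limitation noted under Tradeoffs & Constraints.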

Tech Stack

Backend: Python, FastAPI (web framework with async file upload), Uvicorn (ASGI server)

Pipeline: LangGraph (state machine with conditional edges), DocumentState TypedDict

Document Parsing: pypdf (PDF text extraction), python-docx (DOCX parsing), pytesseract (OCR for images)

Compliance: regex-based PII detection (emails, phones, CC, SSN), declarative rules engine

Guardrails: regex-based prompt injection detection (15+ adversarial patterns)

Frontend: React, TypeScript, Vite, CSS Modules

Testing: pytest (54 tests), conftest.py auto-generated fixtures, reportlab (test PDF generation)

CI/CD: GitHub Actions (backend tests + frontend build on every push)

Automation: Makefile (install, dev, test, stop, clean)
