RAG PDF Chatbot

Retrieval-Augmented Generation (RAG) chatbot for intelligent question-answering over technical PDF documents. Combines FAISS (CPU) vector search, intent-based retrieval (6 query types: figure, table, page, section, general, comparison), and Google Gemini (gemini-embedding-001 + gemini-2.5-flash) to produce accurate, source-cited answers with confidence scoring (High/Medium/Low, derived from source quality, chunk count, semantic match, and verbatim presence).

Python 3.10+, FastAPI, Uvicorn, FAISS, NumPy, Google Gemini, Pillow, React 19, Vite 7, jsPDF, GitHub Actions, Docker, Makefile, pre-commit hooks

Role

AI Engineer & Full-stack Developer

Team

Solo

Company/Organization

YNM Safety

The Problem

Finding specific information in large technical PDF documents (hundreds of pages with figures, tables, specifications) required manual searching — no...

Traditional keyword search (Ctrl+F) only matched exact text, missing semantic meaning and context. Searching 'crash barrier specifications' wouldn't...

Difficult to locate specific figures, tables, or page references without knowing exact locations. Questions like 'What does Fig 3.3 show?' or...

No confidence scoring to help users assess answer reliability and accuracy. Users couldn't gauge if retrieved information was definitive (high...

Existing RAG solutions lacked intent-aware retrieval — couldn't distinguish between requests for specific figures/tables (exact match needed) vs...

Commercial document Q&A services (DocuSign Insight, Adobe Liquid Mode) were expensive ($500+/month) or required proprietary APIs with vendor lock-in.

No modern UI for technical PDF Q&A — needed dark/light theme for different reading environments, multi-chat support for organizing questions by topic,...

The Solution

Built a comprehensive RAG chatbot with a 5-stage pipeline and a security-first architecture.

RAG Pipeline (5 Stages)

Intent Classification — intent_classifier.py uses pattern matching and keyword detection to classify queries into 6 types:

- FIGURE_QUERY: 'What does Fig 3.3 show?' → Extract figure number, exact match in metadata

- TABLE_QUERY: 'Table 6.2 guidelines' → Extract table number, exact match

- PAGE_QUERY: 'What is on page 27?' → Extract page number, retrieve chunks from that page

- SECTION_QUERY: 'What is in Section 4?' → Extract section identifier, match section headers

- GENERAL_QUERY: 'Size of STOP sign?' → FAISS semantic search with k=5 nearest neighbors

- COMPARISON_QUERY: 'Compare Fig 3.1 and 3.2' → Extract both references, retrieve both, compare
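The routing above can be sketched with simple regex patterns (a minimal sketch; the patterns, the comparison heuristic, and the return shape are illustrative assumptions, not the actual intent_classifier.py):

```python
# Minimal sketch of pattern-based intent classification.
# Patterns and the comparison heuristic are illustrative assumptions.
import re

PATTERNS = [
    ("FIGURE_QUERY", re.compile(r"\bfig(?:ure)?\.?\s*(\d+(?:\.\d+)*)", re.I)),
    ("TABLE_QUERY", re.compile(r"\btable\s*(\d+(?:\.\d+)*)", re.I)),
    ("PAGE_QUERY", re.compile(r"\bpage\s*(\d+)", re.I)),
    ("SECTION_QUERY", re.compile(r"\bsection\s*(\d+(?:\.\d+)*)", re.I)),
]

def classify(question: str) -> tuple[str, list[str]]:
    """Return (intent, extracted references); default to GENERAL_QUERY."""
    for intent, pattern in PATTERNS:
        refs = pattern.findall(question)
        if refs:
            # A comparison cue plus a figure/table reference wins
            if re.search(r"\b(compare|versus|vs)\b", question, re.I):
                return "COMPARISON_QUERY", refs
            return intent, refs
    return "GENERAL_QUERY", []
```

For example, `classify("What does Fig 3.3 show?")` returns `("FIGURE_QUERY", ["3.3"])`, while a question with no figure/table/page/section reference falls through to `GENERAL_QUERY`.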

Retrieval — Based on intent, use appropriate strategy:

- Exact match (figures/tables/pages): Query metadata.json for exact reference, return specific chunk

- FAISS semantic search (general/section): Embed query with gemini-embedding-001, query FAISS index (CPU-based approximate nearest neighbor), return...

- Vision captions: If page has images, load from vision_captions.json (pre-generated with Gemini Vision API on page images)
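A sketch of that routing, with the FAISS IndexFlatL2 search emulated in NumPy for illustration (the metadata field names and layout are assumptions):

```python
# Sketch of intent-routed retrieval. Exact-match intents hit chunk
# metadata directly; GENERAL_QUERY does an exact L2 k-NN search,
# equivalent to FAISS IndexFlatL2, emulated here with NumPy.
# Field names ("figure", "table", "page") are assumptions.
import numpy as np

EXACT_KEYS = {"FIGURE_QUERY": "figure", "TABLE_QUERY": "table",
              "PAGE_QUERY": "page"}

def retrieve(intent, refs, query_vec, metadata, embeddings, k=5):
    if intent in EXACT_KEYS:
        key = EXACT_KEYS[intent]
        return [c for c in metadata if str(c.get(key)) in refs]
    dists = np.linalg.norm(embeddings - query_vec, axis=1)  # L2 distances
    return [metadata[i] for i in np.argsort(dists)[:k]]    # smallest first
```

In production the `np.linalg.norm`/`np.argsort` pair would be a `faiss.IndexFlatL2.search` call over the prebuilt index.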

Context Building — Assemble retrieved chunks into context:

- Sort by relevance score (FAISS distance or exact match priority)

- Include chunk text, page numbers, figure/table identifiers, section headers

- Add surrounding chunks if /expand-context called (shows context before/after matched chunk)

- Limit total context to ~4000 tokens to fit Gemini context window
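The assembly step might look like this (a sketch; the score and label fields, the header format, and the words-to-tokens ratio are assumptions):

```python
# Sketch of context assembly: best chunks first, capped near the
# ~4000-token budget. Token cost is approximated from word count
# (roughly 0.75 words per token), an approximation, not a tokenizer.
def build_context(chunks: list[dict], max_tokens: int = 4000) -> str:
    parts, used = [], 0
    for c in sorted(chunks, key=lambda c: c["score"]):  # lower distance = better
        label = f", {c['label']}" if c.get("label") else ""
        cost = int(len(c["text"].split()) / 0.75) + 4   # + header overhead
        if used + cost > max_tokens:
            break
        parts.append(f"[p.{c['page']}{label}] {c['text']}")
        used += cost
    return "\n\n".join(parts)
```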

LLM Generation — Send context to Google Gemini gemini-2.5-flash:

- Structured prompt: "Based on the following PDF excerpts, answer the question. Provide structured JSON with 'answer' (paragraphs array or lists),...

- Temperature 0.3 for factual accuracy (low creativity, high consistency)

- Max tokens 1024 for detailed answers
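The request might be shaped roughly as follows (the prompt wording and JSON field names are illustrative; the generation settings match those stated above):

```python
# Sketch of the structured prompt and generation settings sent to
# gemini-2.5-flash. The exact wording and schema are illustrative.
GENERATION_CONFIG = {"temperature": 0.3, "max_output_tokens": 1024}

def build_prompt(context: str, question: str) -> str:
    return (
        "Based on the following PDF excerpts, answer the question.\n"
        'Respond with JSON only, e.g. {"answer": [...], "sources": [...]},\n'
        "where answer is an array of paragraphs or lists and sources lists\n"
        "the page/figure/table references used.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}"
    )
```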

Response — Parse Gemini JSON response, calculate confidence score, return to frontend:

- Multi-factor confidence: source quality (primary source = High, secondary = Medium), chunk count (1 chunk = Medium, 3+ = High), semantic match...

- Final confidence: average of factors mapped to High/Medium/Low
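One plausible reading of that averaging (the thresholds and per-factor mappings beyond those stated above are assumptions; the source specifies the factors, not the exact formula):

```python
# Sketch of multi-factor High/Medium/Low confidence scoring.
# The factors (source quality, chunk count, semantic match, verbatim
# presence) come from the pipeline description; thresholds are assumed.
LEVELS = {"High": 3, "Medium": 2, "Low": 1}

def confidence(source_quality: str, chunk_count: int,
               semantic_match: float, verbatim: bool) -> str:
    factors = [
        LEVELS[source_quality],                # primary vs secondary source
        3 if chunk_count >= 3 else 2,          # 3+ chunks = High, else Medium
        3 if semantic_match >= 0.8 else 2 if semantic_match >= 0.5 else 1,
        3 if verbatim else 1,                  # verbatim presence in source
    ]
    avg = sum(factors) / len(factors)
    return "High" if avg >= 2.5 else "Medium" if avg >= 1.5 else "Low"
```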

Backend (Python 3.10+ + FastAPI)

app.py — Main FastAPI application with 5 endpoints:

`GET /health` — Health check with environment variable verification (returns 'GEMINI_API_KEY: set' or 'missing', never exposes actual key)

`POST /ask` — Main RAG Q&A endpoint (accepts question, returns answer with confidence and sources)

`POST /classify-intent` — Intent classification only (returns intent type and extracted entities like figure numbers)

`POST /expand-context` — Show surrounding chunks for a given chunk ID (no additional LLM/FAISS calls, just metadata lookup)

`POST /generate-chat-title` — Generate short title from question (uses Gemini to create 3-5 word summary)
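The /expand-context lookup reduces to index arithmetic over the metadata list (a sketch; the "id" field name and default window size are assumptions):

```python
# Sketch of the /expand-context handler's core: return the chunks
# surrounding a given chunk id, straight from metadata.json, with
# no LLM or FAISS call. The "id" field name is an assumption.
def expand_context(metadata: list[dict], chunk_id: str,
                   window: int = 1) -> list[dict]:
    idx = next((i for i, c in enumerate(metadata) if c["id"] == chunk_id), None)
    if idx is None:
        return []  # unknown chunk id
    return metadata[max(0, idx - window): idx + window + 1]
```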

intent_classifier.py — Pattern-based intent classification with regex for figure numbers, table numbers, page numbers, section identifiers. Falls...

rebuild_index.py — Script to rebuild FAISS index from PDF:

Extract text chunks (paragraph-level, preserving structure)

Embed each chunk with gemini-embedding-001

Build FAISS index with IndexFlatL2 (exact search, CPU-friendly)

Save faiss.index, metadata.json (chunk text, page numbers, figure/table IDs), vision_captions.json (pre-generated page image captions)

Gitignored (faiss.index can be large, regenerated locally)
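The chunk-extraction step can be sketched as a paragraph splitter (the blank-line split rule and the id scheme are assumptions about rebuild_index.py):

```python
# Sketch of rebuild_index.py's chunking step: split each page's
# extracted text into paragraph-level chunks before embedding.
# The blank-line split rule and the id scheme are assumptions.
def chunk_text(pages: list[str]) -> list[dict]:
    chunks = []
    for pageno, text in enumerate(pages, start=1):
        for para in (p.strip() for p in text.split("\n\n")):
            if para:  # skip empty paragraphs
                chunks.append({"id": f"p{pageno}-{len(chunks)}",
                               "page": pageno, "text": para})
    return chunks
```

Each chunk would then be embedded with gemini-embedding-001 and added to the FAISS index, with the dicts serialized to metadata.json.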

Frontend (React 19 + Vite 7)

App.jsx — Main component with chat interface:

Multi-chat support: create/switch/delete conversations, stored in localStorage

Message list with user questions and bot answers (structured paragraphs/lists, confidence badge, source citations)

Input field with submit button and loading indicator

Dark/light theme toggle (persists to localStorage)

api.js — API client for FastAPI backend:

Fetch wrapper with error handling

Endpoints for /ask, /classify-intent, /expand-context, /generate-chat-title

Environment variable VITE_API_URL for backend URL (localhost:8000 in dev, production URL in deploy)

PDF Export — jsPDF export button captures conversation:

Generates PDF with conversation title, timestamp, all Q&A pairs, confidence scores, sources

Downloads as `chat-export-{timestamp}.pdf`

Makefile Automation (15 commands)

`make install` — Install Python dependencies (requirements.txt) + Node dependencies (frontend/package.json), optionally in venv

`make setup-env` — Copy .env.example → .env for local configuration

`make check-env` — Verify GEMINI_API_KEY is set in environment

`make dev` — Run backend (uvicorn on port 8000) + frontend (Vite on port 5173) concurrently

`make dev-backend` — Backend only (uvicorn app:app --reload --port 8000)

`make dev-frontend` — Frontend only (cd frontend && npm run dev)

`make health` — Curl http://localhost:8000/health to verify the backend is responding

`make status` — Show running processes (backend/frontend)

`make build` — Production frontend build (frontend/dist/)

`make lint` — Lint frontend (ESLint)

`make lint-backend` — Lint Python (requires black/flake8, optional)

`make kill` — Kill dev server processes (backend/frontend)

`make rebuild-index` — Run rebuild_index.py to regenerate FAISS index from PDFs

`make clean` — Remove build artifacts (__pycache__, frontend/dist/)

`make clean-all` — Deep clean (remove node_modules, venv, faiss.index, metadata.json)

`make verify-deploy` — Run scripts/verify-deployment.sh for pre-deployment security check

GitHub Actions CI/CD (.github/workflows/ci.yml)

Runs on every push and pull request:

Lint Frontend — Install npm deps, run ESLint on frontend/src/

Build Frontend — Production build (npm run build), verify frontend/dist/ created

Lint Backend (optional) — Install black/flake8, lint Python files

Security Scan — Check for committed secrets (.env files, API keys) using grep/ack, fail if found

Comprehensive Documentation

SETUP.md — Detailed setup guide: prerequisites (Python 3.10+, Node 18+, Gemini API key), installation steps (clone, venv setup, install deps,...

DEPLOYMENT.md — Deployment guides for 5 platforms:

Vercel: Connect GitHub repo, configure build settings (root: ., build command: cd frontend && npm run build, output: frontend/dist), set...

GCP Cloud Run: Build Docker image, push to Container Registry, deploy with Cloud Run, configure secrets (GEMINI_API_KEY in Secret Manager)

Railway: Connect repo, configure start command (uvicorn app:app --host 0.0.0.0 --port $PORT), set environment variables

Render: Connect repo, configure build/start commands, set environment variables

Docker: Multi-stage Dockerfile (build frontend → copy to backend → serve with FastAPI), docker build/run commands

CONTRIBUTING.md — Contribution guidelines: fork/clone/branch workflow, code style (black for Python, ESLint for JS), commit message conventions,...

SECURITY.md — Security policy: responsible disclosure process, supported versions, known issues, security best practices (API keys in env only,...

CHANGELOG.md — Version history: v1.0.0 initial release, v1.1.0 added intent classification, v1.2.0 added confidence scoring, v2.0.0 React 19...

Security-First Approach

API key server-only: GEMINI_API_KEY loaded from environment on backend, never sent to frontend. Health endpoint returns 'set' or 'missing' status...

.env gitignored: .gitignore blocks .env, .env.local, .env.*, ensuring secrets never committed. .env.example template with placeholder values safe to...

Pre-commit hooks: .pre-commit-config.yaml runs secret detection (detect-private-key, check-added-large-files) before each commit.

Secret scanning in CI: GitHub Actions workflow fails build if .env files or API key patterns detected in committed files.

verify-deploy check: scripts/verify-deployment.sh runs automated security checks (no .env files, no API keys in code, faiss.index gitignored) before...

Deployment Options

Vercel (Easiest) — Full-stack serverless, automatic deployments on git push, environment variables via dashboard

GCP Cloud Run (Medium) — Scalable containerized deployment, Cloud Build integration, Secret Manager for API keys

Railway (Easy) — Git-push deploy, automatic HTTPS, environment variables via dashboard

Render (Easy) — Git-push deploy, free tier available, environment variables via dashboard

Docker (Medium) — Any container platform (AWS ECS, Azure Container Instances, DigitalOcean), multi-stage Dockerfile provided

Design Decisions

Chose FAISS over Pinecone/Weaviate for vector search — FAISS is CPU-friendly (no GPU required), free (no API costs), works offline, and sufficient for...

Implemented intent classification with 6 query types (figure, table, page, section, general, comparison) to route between exact match...

Used Google Gemini gemini-embedding-001 for embeddings and gemini-2.5-flash for generation — cost-effective (~$0.0001/1K tokens embedding, ~$0.002/1K...

Structured JSON output from LLM — prompt explicitly requests JSON format with 'answer' (paragraphs/lists), 'sources' (pages/figures/tables),...

Multi-factor confidence scoring — combines source quality (primary source document = High, secondary mentions = Medium), chunk count (single chunk =...

Context expansion endpoint (/expand-context) shows surrounding chunks without additional LLM or FAISS calls — just metadata lookup by chunk ID. Useful...

API key server-only with health endpoint — GEMINI_API_KEY loaded from environment on backend, never sent to frontend. /health returns 'GEMINI_API_KEY:...

React 19 + Vite 7 frontend — modern React with concurrent features, Vite provides instant HMR and fast production builds, simpler than Next.js for SPA...

Makefile for project automation — 15 commands (install, setup-env, dev, build, clean, rebuild-index, verify-deploy) simplify development workflow,...

GitHub Actions CI/CD with security scanning — lint/build checks catch errors before merge, secret scanning (grep/ack for .env files, API key patterns)...

Comprehensive documentation — SETUP.md (detailed setup), DEPLOYMENT.md (5 platforms), CONTRIBUTING.md (contribution guidelines), SECURITY.md (security...

Pre-commit hooks with .pre-commit-config.yaml — runs detect-private-key, check-added-large-files before each commit. Prevents secrets and large files...

Gitignored data files (faiss.index, metadata.json, vision_captions.json) — these are generated locally from PDFs via rebuild_index.py. Keeps repo size...

Dark/light theme with localStorage persistence — accommodates different reading preferences (dark mode for low-light environments, light for daylight),...

Multi-chat support with localStorage — enables organizing questions by topic (e.g., separate chats for different PDF documents or question categories),...

Tradeoffs & Constraints

FAISS IndexFlatL2 exact search — provides highest accuracy (no approximation errors) but doesn't scale to millions of vectors. For larger datasets...

faiss.index must be pre-built from PDFs via rebuild_index.py — not generated at runtime (would be too slow for large documents). Requires running...

Google Gemini API costs — embedding ~$0.0001/1K tokens, generation ~$0.002/1K tokens. Controlled via vision caption caching (pre-generate page image...

Single-document focus — optimized for querying one PDF at a time. Multi-document support would require index management (separate FAISS indexes per...

No real-time indexing — PDFs must be indexed offline via rebuild_index.py. Real-time indexing (index new PDFs on upload) would require background job...

Structured JSON responses from LLM — rely on Gemini following prompt instructions. Occasionally LLM returns malformed JSON or ignores structure....

React SPA frontend — great for interactive chat UI but misses SEO benefits of SSR (Next.js). Acceptable for internal tools but would need SSR for...

CPU-only FAISS — good for moderate-sized indexes (10K-100K vectors) but GPU-accelerated FAISS would be 10-100x faster for large-scale deployments....

No streaming responses — answers return all at once after LLM finishes. Streaming (token-by-token display) would improve perceived performance but...
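A defensive parse for the malformed-JSON case mentioned above might look like this (a sketch; the fallback response shape is an assumption):

```python
# Sketch of defensive parsing for the model's JSON reply: extract the
# first {...} span (models sometimes wrap JSON in markdown fences) and
# fall back to a single-paragraph answer if parsing fails.
import json
import re

def parse_answer(raw: str) -> dict:
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    try:
        return json.loads(match.group(0) if match else raw)
    except json.JSONDecodeError:
        return {"answer": [raw.strip()], "sources": []}  # assumed fallback shape
```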

Would improve: Add streaming responses for long answers, implement multi-document support with document selector UI, add real-time PDF indexing on...

Outcome & Impact

Production-ready RAG chatbot for technical PDF question-answering with intelligent retrieval and confidence scoring enabling users to quickly find...

Intent-aware retrieval with 6 query types routing to optimal strategy: FIGURE_QUERY ('What does Fig 3.3 show?') → exact figure reference matching,...

Multi-factor confidence scoring provides High/Medium/Low assessment from source quality (primary source = High, secondary mentions = Medium), chunk...

Structured JSON answers with paragraphs/lists format and source citations (page numbers, figure IDs, table IDs) enable clear, scannable responses...

5 FastAPI endpoints serving complete RAG workflow: GET /health (health check with GEMINI_API_KEY verification, never exposes actual key), POST /ask...

React 19 frontend with modern UX: dark/light theme toggle (persists to localStorage), multi-chat support (create/switch/delete conversations,...

Makefile automation with 15 commands simplifies development workflow: install (Python + Node deps), setup-env (copy .env.example → .env), check-env...

GitHub Actions CI/CD catches errors before merge: lint frontend (ESLint on frontend/src/), build frontend (production build, verify dist/ created),...

Comprehensive documentation enables self-service: SETUP.md (detailed setup: prerequisites, installation steps, FAISS index generation, local running,...

Security-first architecture protects sensitive credentials: API key server-only (GEMINI_API_KEY on backend only, never sent to frontend, health...

Flexible deployment options accommodate different hosting preferences: Vercel easiest (full-stack serverless, git-push deploy, environment variables...

FAISS vector search with Google Gemini embeddings provides semantic understanding — queries like 'crash barrier specifications' retrieve semantically...

Gitignored data files (faiss.index, metadata.json, vision_captions.json) keep repository size small — files generated locally via rebuild_index.py...

Pre-deployment checklist (`make verify-deploy`) automates security validation: checks no .env files committed, no API keys in source code,...

MIT license enables open research use — academic researchers, students, and developers can use, modify, and distribute the codebase for educational and...

Tech Stack

Backend: Python 3.10+, FastAPI (web framework), Uvicorn (ASGI server)

Vector Search: FAISS (CPU-based IndexFlatL2 exact nearest-neighbor), NumPy (numerical operations)

Embeddings / LLM: Google Gemini (gemini-embedding-001 for embeddings, gemini-2.5-flash for generation)

Vision: Pillow (page image processing), Gemini Vision API (image caption generation)

Frontend: React 19 (UI library with concurrent features), Vite 7 (build tool, dev server with instant HMR)

PDF Export: jsPDF (PDF generation from conversation data)

CI/CD: GitHub Actions (automated lint/build/security checks on push and PR)

Containerization: Docker (multi-stage Dockerfile for production deployment)

Automation: Makefile (15 commands for dev/build/deploy workflows)

Security: pre-commit hooks (detect-private-key, check-added-large-files), secret scanning in CI
