RAG PDF Chatbot
Retrieval-Augmented Generation (RAG) chatbot for intelligent question-answering over technical PDF documents. It combines FAISS (CPU) vector search, intent-based retrieval across 6 query types (figure, table, page, section, general, comparison), and Google Gemini (gemini-embedding-001 for embeddings, gemini-2.5-flash for generation) to produce accurate, source-cited answers with High/Medium/Low confidence scoring derived from source quality, chunk count, semantic match, and verbatim presence.
Role
AI Engineer & Full-stack Developer
Team
Solo
Company/Organization
YNM Safety
The Problem
Finding specific information in large technical PDF documents (hundreds of pages with figures, tables, specifications) required manual searching — no...
Traditional keyword search (Ctrl+F) only matched exact text, missing semantic meaning and context. Searching 'crash barrier specifications' wouldn't...
Difficult to locate specific figures, tables, or page references without knowing exact locations. Questions like 'What does Fig 3.3 show?' or...
No confidence scoring to help users assess answer reliability and accuracy. Users couldn't gauge if retrieved information was definitive (high...
Existing RAG solutions lacked intent-aware retrieval — couldn't distinguish between requests for specific figures/tables (exact match needed) vs...
Commercial document Q&A services (DocuSign Insight, Adobe Liquid Mode) were expensive ($500+/month) or required proprietary APIs with vendor lock-in.
No modern UI for technical PDF Q&A — needed dark/light theme for different reading environments, multi-chat support for organizing questions by topic,...
The Solution
Built a comprehensive RAG chatbot with a 5-stage pipeline and a security-first architecture.
RAG Pipeline (5 Stages)
Intent Classification — intent_classifier.py uses pattern matching and keyword detection to classify queries into 6 types:
- FIGURE_QUERY: 'What does Fig 3.3 show?' → Extract figure number, exact match in metadata
- TABLE_QUERY: 'Table 6.2 guidelines' → Extract table number, exact match
- PAGE_QUERY: 'What is on page 27?' → Extract page number, retrieve chunks from that page
- SECTION_QUERY: 'What is in Section 4?' → Extract section identifier, match section headers
- GENERAL_QUERY: 'Size of STOP sign?' → FAISS semantic search with k=5 nearest neighbors
- COMPARISON_QUERY: 'Compare Fig 3.1 and 3.2' → Extract both references, retrieve both, compare
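The pattern-matching approach above can be sketched in a few lines of standard-library Python. This is a hypothetical reconstruction, not the actual intent_classifier.py: the regexes, ordering, and return shape are illustrative assumptions.

```python
import re

# Illustrative patterns; the real intent_classifier.py may differ.
# Comparison is checked first so "Compare Fig 3.1 and 3.2" is not
# misclassified as a plain figure query.
PATTERNS = [
    ("COMPARISON_QUERY", re.compile(r"\bcompare\b", re.I)),
    ("FIGURE_QUERY", re.compile(r"\bfig(?:ure)?\.?\s*(\d+(?:\.\d+)*)", re.I)),
    ("TABLE_QUERY", re.compile(r"\btable\s*(\d+(?:\.\d+)*)", re.I)),
    ("PAGE_QUERY", re.compile(r"\bpage\s*(\d+)", re.I)),
    ("SECTION_QUERY", re.compile(r"\bsection\s*(\d+(?:\.\d+)*)", re.I)),
]

def classify_intent(question: str):
    """Return (intent, extracted entity) for a user question.

    Anything that matches no pattern falls back to GENERAL_QUERY,
    which routes to FAISS semantic search. For comparisons the real
    system extracts both references; this sketch only flags the intent.
    """
    for intent, pattern in PATTERNS:
        m = pattern.search(question)
        if m:
            entity = m.group(1) if m.groups() else None
            return intent, entity
    return "GENERAL_QUERY", None
```

Ordering the patterns from most to least specific is what lets a cheap regex pass stand in for a learned classifier on this kind of constrained query set.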
Retrieval — Based on the classified intent, apply the appropriate strategy:
- Exact match (figures/tables/pages): Query metadata.json for exact reference, return specific chunk
- FAISS semantic search (general/section): Embed query with gemini-embedding-001, query FAISS index (CPU-based approximate nearest neighbor), return...
- Vision captions: If page has images, load from vision_captions.json (pre-generated with Gemini Vision API on page images)
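What FAISS IndexFlatL2 computes for the semantic path is exact (brute-force) squared-L2 nearest neighbours. The numpy sketch below illustrates that computation only; the real code calls faiss.IndexFlatL2.search on gemini-embedding-001 vectors.

```python
import numpy as np

def exact_l2_search(index_vectors, query_vector, k=5):
    """Brute-force nearest-neighbour search equivalent to FAISS
    IndexFlatL2: squared L2 distance from the query to every stored
    embedding, then the k smallest distances."""
    diffs = index_vectors - query_vector          # shape (n, d)
    dists = np.einsum("nd,nd->n", diffs, diffs)   # squared L2 per row
    order = np.argsort(dists)[:k]
    return order, dists[order]

# Tiny demo: 4 stored "embeddings", 2-D for readability
vecs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
ids, dists = exact_l2_search(vecs, np.array([0.9, 0.1]), k=2)
```

Because every vector is scanned, there is no approximation error, which is the accuracy/scale tradeoff discussed later under Tradeoffs & Constraints.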
Context Building — Assemble retrieved chunks into context:
- Sort by relevance score (FAISS distance or exact match priority)
- Include chunk text, page numbers, figure/table identifiers, section headers
- Add surrounding chunks if /expand-context called (shows context before/after matched chunk)
- Limit total context to ~4000 tokens to fit Gemini context window
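The context-assembly step above can be sketched as follows. The chunk keys ('text', 'page', 'score', 'figure_id') and the words-to-tokens ratio are assumptions for illustration; the real pipeline uses its own metadata schema and token accounting.

```python
def build_context(chunks, token_budget=4000):
    """Assemble retrieved chunks into a prompt context: sort by
    relevance (lower FAISS distance = better), prefix each chunk with
    its page/figure identifiers, and stop before exceeding the budget.
    Token count is approximated as words * 4/3, a common rule of thumb,
    since the real tokenizer is not part of this sketch."""
    used = 0
    parts = []
    for chunk in sorted(chunks, key=lambda c: c["score"]):
        header = f"[page {chunk['page']}"
        if chunk.get("figure_id"):
            header += f", {chunk['figure_id']}"
        header += "]"
        block = f"{header}\n{chunk['text']}"
        approx_tokens = int(len(block.split()) * 4 / 3)
        if used + approx_tokens > token_budget:
            break
        parts.append(block)
        used += approx_tokens
    return "\n\n".join(parts)
```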
LLM Generation — Send context to Google Gemini gemini-2.5-flash:
- Structured prompt: "Based on the following PDF excerpts, answer the question. Provide structured JSON with 'answer' (paragraphs array or lists),...
- Temperature 0.3 for factual accuracy (low creativity, high consistency)
- Max tokens 1024 for detailed answers
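A minimal sketch of the prompt and generation settings described above. The exact prompt wording in app.py is not reproduced here (it is truncated in this writeup), so this version is an assumption; the settings would be passed to the google-generativeai client as a generation config, shown as a plain dict to keep the sketch self-contained.

```python
def build_prompt(question: str, context: str) -> str:
    """Hypothetical structured prompt in the spirit of the one
    described above; the real wording in app.py may differ."""
    return (
        "Based on the following PDF excerpts, answer the question.\n"
        "Respond with JSON containing 'answer' (array of paragraphs or "
        "lists) and 'sources' (page numbers, figure/table IDs).\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}"
    )

# Settings mirroring the text: low temperature for factual consistency,
# capped output length for detailed but bounded answers.
GENERATION_CONFIG = {"temperature": 0.3, "max_output_tokens": 1024}
```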
Response — Parse Gemini JSON response, calculate confidence score, return to frontend:
- Multi-factor confidence: source quality (primary source = High, secondary = Medium), chunk count (1 chunk = Medium, 3+ = High), semantic match...
- Final confidence: average of factors mapped to High/Medium/Low
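The multi-factor averaging above can be sketched like this. The individual factor thresholds (what counts as "primary", the 0.8/0.5 semantic-match cutoffs) are assumptions, not the exact rules in the real code; the structure (map each factor to a level, average, map back) follows the description.

```python
LEVELS = {"High": 3, "Medium": 2, "Low": 1}

def confidence(source_quality, chunk_count, semantic_match, verbatim):
    """Hypothetical multi-factor confidence scorer: each factor maps to
    High/Medium/Low, and the final label is the average level."""
    factors = [
        "High" if source_quality == "primary" else "Medium",
        "High" if chunk_count >= 3 else "Medium",
        "High" if semantic_match >= 0.8
        else ("Medium" if semantic_match >= 0.5 else "Low"),
        "High" if verbatim else "Medium",
    ]
    avg = sum(LEVELS[f] for f in factors) / len(factors)
    if avg >= 2.5:
        return "High"
    if avg >= 1.5:
        return "Medium"
    return "Low"
```

Averaging (rather than taking the minimum) means one weak factor cannot sink an otherwise well-supported answer, which matches the "average of factors" rule stated above.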
Backend (Python 3.10+ + FastAPI)
app.py — Main FastAPI application with 5 endpoints:
`GET /health` — Health check with environment variable verification (returns 'GEMINI_API_KEY: set' or 'missing', never exposes actual key)
`POST /ask` — Main RAG Q&A endpoint (accepts question, returns answer with confidence and sources)
`POST /classify-intent` — Intent classification only (returns intent type and extracted entities like figure numbers)
`POST /expand-context` — Show surrounding chunks for a given chunk ID (no additional LLM/FAISS calls, just metadata lookup)
`POST /generate-chat-title` — Generate short title from question (uses Gemini to create 3-5 word summary)
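The /expand-context endpoint is the simplest of the five: a pure list lookup over the parsed metadata, no LLM or FAISS calls. A sketch, assuming metadata.json parses to an ordered list of {'id', 'text'} dicts (the key names are illustrative):

```python
def expand_context(metadata, chunk_id, window=1):
    """Return the matched chunk plus `window` chunks on each side,
    clamped to the document boundaries. Mirrors the /expand-context
    behaviour described above: metadata lookup only."""
    idx = next(i for i, c in enumerate(metadata) if c["id"] == chunk_id)
    lo = max(0, idx - window)
    hi = min(len(metadata), idx + window + 1)
    return metadata[lo:hi]
```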
intent_classifier.py — Pattern-based intent classification with regex for figure numbers, table numbers, page numbers, section identifiers. Falls...
rebuild_index.py — Script to rebuild FAISS index from PDF:
Extract text chunks (paragraph-level, preserving structure)
Embed each chunk with gemini-embedding-001
Build FAISS index with IndexFlatL2 (exact search, CPU-friendly)
Save faiss.index, metadata.json (chunk text, page numbers, figure/table IDs), vision_captions.json (pre-generated page image captions)
Gitignored (faiss.index can be large, regenerated locally)
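Step 1 of the rebuild script (paragraph-level chunk extraction) can be sketched as below. The blank-line split, the minimum-length filter, and the metadata keys are illustrative assumptions; the real rebuild_index.py preserves more structure (figure/table IDs, section headers).

```python
import re

def chunk_paragraphs(page_text, page_number, min_chars=40):
    """Split one page's text into paragraph-level chunks: split on
    blank lines, collapse internal whitespace, drop fragments shorter
    than `min_chars`, and tag each chunk with its page number so
    PAGE_QUERY retrieval can match on it later."""
    chunks = []
    for para in re.split(r"\n\s*\n", page_text):
        para = " ".join(para.split())
        if len(para) >= min_chars:
            chunks.append({"text": para, "page": page_number})
    return chunks
```

Each chunk produced here would then be embedded with gemini-embedding-001 and added to the IndexFlatL2 index, with the dict saved into metadata.json at the matching row position.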
Frontend (React 19 + Vite 7)
App.jsx — Main component with chat interface:
Multi-chat support: create/switch/delete conversations, stored in localStorage
Message list with user questions and bot answers (structured paragraphs/lists, confidence badge, source citations)
Input field with submit button and loading indicator
Dark/light theme toggle (persists to localStorage)
api.js — API client for FastAPI backend:
Fetch wrapper with error handling
Endpoints for /ask, /classify-intent, /expand-context, /generate-chat-title
Environment variable VITE_API_URL for backend URL (localhost:8000 in dev, production URL in deploy)
PDF Export — jsPDF export button captures conversation:
Generates PDF with conversation title, timestamp, all Q&A pairs, confidence scores, sources
Downloads as `chat-export-{timestamp}.pdf`
Makefile Automation (15 commands)
`make install` — Install Python dependencies (requirements.txt) + Node dependencies (frontend/package.json), optionally in venv
`make setup-env` — Copy .env.example → .env for local configuration
`make check-env` — Verify GEMINI_API_KEY is set in environment
`make dev` — Run backend (uvicorn on port 8000) + frontend (Vite on port 5173) concurrently
`make dev-backend` — Backend only (uvicorn app:app --reload --port 8000)
`make dev-frontend` — Frontend only (cd frontend && npm run dev)
`make health` — Curl http://localhost:8000/health to verify backend responding
`make status` — Show running processes (backend/frontend)
`make build` — Production frontend build (frontend/dist/)
`make lint` — Lint frontend (ESLint)
`make lint-backend` — Lint Python (requires black/flake8, optional)
`make kill` — Kill dev server processes (backend/frontend)
`make rebuild-index` — Run rebuild_index.py to regenerate FAISS index from PDFs
`make clean` — Remove build artifacts (__pycache__, frontend/dist/)
`make clean-all` — Deep clean (remove node_modules, venv, faiss.index, metadata.json)
`make verify-deploy` — Run scripts/verify-deployment.sh for pre-deployment security check
GitHub Actions CI/CD (.github/workflows/ci.yml)
Runs on every push and pull request:
Lint Frontend — Install npm deps, run ESLint on frontend/src/
Build Frontend — Production build (npm run build), verify frontend/dist/ created
Lint Backend (optional) — Install black/flake8, lint Python files
Security Scan — Check for committed secrets (.env files, API keys) using grep/ack, fail if found
Comprehensive Documentation
SETUP.md — Detailed setup guide: prerequisites (Python 3.10+, Node 18+, Gemini API key), installation steps (clone, venv setup, install deps,...
DEPLOYMENT.md — Deployment guides for 5 platforms:
Vercel: Connect GitHub repo, configure build settings (root: ., build command: cd frontend && npm run build, output: frontend/dist), set...
GCP Cloud Run: Build Docker image, push to Container Registry, deploy with Cloud Run, configure secrets (GEMINI_API_KEY in Secret Manager)
Railway: Connect repo, configure start command (uvicorn app:app --host 0.0.0.0 --port $PORT), set environment variables
Render: Connect repo, configure build/start commands, set environment variables
Docker: Multi-stage Dockerfile (build frontend → copy to backend → serve with FastAPI), docker build/run commands
CONTRIBUTING.md — Contribution guidelines: fork/clone/branch workflow, code style (black for Python, ESLint for JS), commit message conventions,...
SECURITY.md — Security policy: responsible disclosure process, supported versions, known issues, security best practices (API keys in env only,...
CHANGELOG.md — Version history: v1.0.0 initial release, v1.1.0 added intent classification, v1.2.0 added confidence scoring, v2.0.0 React 19...
Security-First Approach
API key server-only: GEMINI_API_KEY loaded from environment on backend, never sent to frontend. Health endpoint returns 'set' or 'missing' status...
.env gitignored: .gitignore blocks .env, .env.local, .env.*, ensuring secrets never committed. .env.example template with placeholder values safe to...
Pre-commit hooks: .pre-commit-config.yaml runs secret detection (detect-private-key, check-added-large-files) before each commit.
Secret scanning in CI: GitHub Actions workflow fails build if .env files or API key patterns detected in committed files.
verify-deploy check: scripts/verify-deployment.sh runs automated security checks (no .env files, no API keys in code, faiss.index gitignored) before...
Deployment Options
Vercel (Easiest) — Full-stack serverless, automatic deployments on git push, environment variables via dashboard
GCP Cloud Run (Medium) — Scalable containerized deployment, Cloud Build integration, Secret Manager for API keys
Railway (Easy) — Git-push deploy, automatic HTTPS, environment variables via dashboard
Render (Easy) — Git-push deploy, free tier available, environment variables via dashboard
Docker (Medium) — Any container platform (AWS ECS, Azure Container Instances, DigitalOcean), multi-stage Dockerfile provided
Design Decisions
Chose FAISS over Pinecone/Weaviate for vector search — FAISS is CPU-friendly (no GPU required), free (no API costs), works offline, and sufficient for...
Implemented intent classification with 6 query types (figure, table, page, section, general, comparison) to route between exact match...
Used Google Gemini gemini-embedding-001 for embeddings and gemini-2.5-flash for generation — cost-effective (~$0.0001/1K tokens embedding, ~$0.002/1K...
Structured JSON output from LLM — prompt explicitly requests JSON format with 'answer' (paragraphs/lists), 'sources' (pages/figures/tables),...
Multi-factor confidence scoring — combines source quality (primary source document = High, secondary mentions = Medium), chunk count (single chunk =...
Context expansion endpoint (/expand-context) shows surrounding chunks without additional LLM or FAISS calls — just metadata lookup by chunk ID. Useful...
API key server-only with health endpoint — GEMINI_API_KEY loaded from environment on backend, never sent to frontend. /health returns 'GEMINI_API_KEY:...
React 19 + Vite 7 frontend — modern React with concurrent features, Vite provides instant HMR and fast production builds, simpler than Next.js for SPA...
Makefile for project automation — 15 commands (install, setup-env, dev, build, clean, rebuild-index, verify-deploy) simplify development workflow,...
GitHub Actions CI/CD with security scanning — lint/build checks catch errors before merge, secret scanning (grep/ack for .env files, API key patterns)...
Comprehensive documentation — SETUP.md (detailed setup), DEPLOYMENT.md (5 platforms), CONTRIBUTING.md (contribution guidelines), SECURITY.md (security...
Pre-commit hooks with .pre-commit-config.yaml — runs detect-private-key, check-added-large-files before each commit. Prevents secrets and large files...
Gitignored data files (faiss.index, metadata.json, vision_captions.json) — these are generated locally from PDFs via rebuild_index.py. Keeps repo size...
Dark/light theme with localStorage persistence — accommodates different reading preferences (dark mode for low-light environments, light for daylight),...
Multi-chat support with localStorage — enables organizing questions by topic (e.g., separate chats for different PDF documents or question categories),...
Tradeoffs & Constraints
FAISS IndexFlatL2 exact search — provides highest accuracy (no approximation errors) but doesn't scale to millions of vectors. For larger datasets...
faiss.index must be pre-built from PDFs via rebuild_index.py — not generated at runtime (would be too slow for large documents). Requires running...
Google Gemini API costs — embedding ~$0.0001/1K tokens, generation ~$0.002/1K tokens. Controlled via vision caption caching (pre-generate page image...
Single-document focus — optimized for querying one PDF at a time. Multi-document support would require index management (separate FAISS indexes per...
No real-time indexing — PDFs must be indexed offline via rebuild_index.py. Real-time indexing (index new PDFs on upload) would require background job...
Structured JSON responses from LLM — rely on Gemini following prompt instructions. Occasionally LLM returns malformed JSON or ignores structure....
React SPA frontend — great for interactive chat UI but misses SEO benefits of SSR (Next.js). Acceptable for internal tools but would need SSR for...
CPU-only FAISS — good for moderate-sized indexes (10K-100K vectors) but GPU-accelerated FAISS would be 10-100x faster for large-scale deployments....
No streaming responses — answers return all at once after LLM finishes. Streaming (token-by-token display) would improve perceived performance but...
Would improve: Add streaming responses for long answers, implement multi-document support with document selector UI, add real-time PDF indexing on...
Outcome & Impact
Production-ready RAG chatbot for technical PDF question-answering with intelligent retrieval and confidence scoring enabling users to quickly find...
Intent-aware retrieval with 6 query types routing to optimal strategy: FIGURE_QUERY ('What does Fig 3.3 show?') → exact figure reference matching,...
Multi-factor confidence scoring provides High/Medium/Low assessment from source quality (primary source = High, secondary mentions = Medium), chunk...
Structured JSON answers with paragraphs/lists format and source citations (page numbers, figure IDs, table IDs) enable clear, scannable responses...
5 FastAPI endpoints serving complete RAG workflow: GET /health (health check with GEMINI_API_KEY verification, never exposes actual key), POST /ask...
React 19 frontend with modern UX: dark/light theme toggle (persists to localStorage), multi-chat support (create/switch/delete conversations,...
Makefile automation with 15 commands simplifies development workflow: install (Python + Node deps), setup-env (copy .env.example → .env), check-env...
GitHub Actions CI/CD catches errors before merge: lint frontend (ESLint on frontend/src/), build frontend (production build, verify dist/ created),...
Comprehensive documentation enables self-service: SETUP.md (detailed setup: prerequisites, installation steps, FAISS index generation, local running,...
Security-first architecture protects sensitive credentials: API key server-only (GEMINI_API_KEY on backend only, never sent to frontend, health...
Flexible deployment options accommodate different hosting preferences: Vercel easiest (full-stack serverless, git-push deploy, environment variables...
FAISS vector search with Google Gemini embeddings provides semantic understanding — queries like 'crash barrier specifications' retrieve semantically...
Gitignored data files (faiss.index, metadata.json, vision_captions.json) keep repository size small — files generated locally via rebuild_index.py...
Pre-deployment checklist (`make verify-deploy`) automates security validation: checks no .env files committed, no API keys in source code,...
MIT license enables open research use — academic researchers, students, and developers can use, modify, and distribute the codebase for educational and...
Tech Stack
Backend: Python 3.10+, FastAPI (web framework), Uvicorn (ASGI server)
Vector Search: FAISS (CPU-based IndexFlatL2 exact nearest-neighbor), NumPy (numerical operations)
Embeddings / LLM: Google Gemini (gemini-embedding-001 for embeddings, gemini-2.5-flash for generation)
Vision: Pillow (page image processing), Gemini Vision API (image caption generation)
Frontend: React 19 (UI library with concurrent features), Vite 7 (build tool, dev server with instant HMR)
PDF Export: jsPDF (PDF generation from conversation data)
CI/CD: GitHub Actions (automated lint/build/security checks on push and PR)
Containerization: Docker (multi-stage Dockerfile for production deployment)
Automation: Makefile (15 commands for dev/build/deploy workflows)
Security: pre-commit hooks (detect-private-key, check-added-large-files), secret scanning in CI