RAGLab — RAG Experimentation Platform
Interactive RAG (Retrieval-Augmented Generation) experimentation platform for comparing retrieval strategies and RAG modes in real time. The Python 3.13 + FastAPI backend features four RAG modes (Basic, Self-RAG with iterative sufficiency checking, Agentic RAG with plan-retrieve-reflect, and Graph RAG with knowledge graph traversal) and four retrieval strategies (vector/FAISS, hybrid BM25 + RRF fusion, multi-query expansion, and cross-encoder reranking). It also provides automatic knowledge graph extraction (NetworkX, with background pre-building and disk caching), contextual compression (embedding-based sentence filtering), and smart PDF parsing (structure-aware, with domain-specific chunking).
Role
AI Engineer & Full-stack Developer
Team
Solo
Company/Organization
Personal Project (Research & Experimentation)
The Problem
Researchers and engineers experimenting with RAG systems need to compare how different retrieval strategies (vector, hybrid, multi-query, reranked)...
No unified platform exists for experimenting with multiple RAG modes — Basic RAG (simple retrieve-generate), Self-RAG (iterative sufficiency checking),...
Understanding the impact of different chunking strategies (recursive, fixed, semantic, structure-based) and embedding models (OpenAI, local) requires...
Real-time feedback during RAG experimentation is essential for iterative improvement, but most implementations return only final answers without...
Knowledge graphs for Graph RAG require manual entity/relationship extraction and are not automatically generated from documents with caching for...
Contextual compression (filtering irrelevant chunks) and smart PDF parsing (domain-specific chunking for legal, research, financial documents) are...
The Solution
Built a comprehensive RAG experimentation platform with two main components: a Python FastAPI backend and a Next.js React frontend.
Backend Architecture (Python 3.13 + FastAPI + LangChain)
Implemented 4 RAG modes for different use cases:
Basic RAG — Standard retrieve-and-generate: retrieves relevant chunks with chosen strategy and generates answer with Google Gemini 2.0 Flash.
Self-RAG — Iterative sufficiency loop: generates initial answer, checks if retrieved context is sufficient, retrieves additional chunks if...
Agentic RAG — Plan-retrieve-reflect workflow: creates retrieval plan based on question, retrieves context for each sub-question, reflects on...
Graph RAG — Knowledge graph traversal: extracts entities/relationships from documents into NetworkX graph, traverses graph to find relevant...
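The Self-RAG loop above can be sketched in a few lines. This is a minimal illustration, not the platform's actual implementation: `retrieve`, `generate`, and `is_sufficient` are hypothetical stand-ins for the retriever, the Gemini call, and the LLM sufficiency check.

```python
# Sketch of a Self-RAG-style iterative sufficiency loop (hypothetical names).
def self_rag(question, retrieve, generate, is_sufficient, max_iterations=3):
    """Retrieve, answer, and re-retrieve until the context is judged sufficient."""
    context = retrieve(question)
    answer = generate(question, context)
    for _ in range(max_iterations - 1):
        if is_sufficient(question, context, answer):
            break
        # Context judged insufficient: fetch more chunks and regenerate.
        context = context + retrieve(question + " " + answer)
        answer = generate(question, context)
    return answer
```

The cap on iterations keeps cost bounded: each extra round costs one sufficiency check, one retrieval, and one generation call.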
Implemented 4 retrieval strategies:
Vector Search (FAISS) — Dense retrieval using OpenAI text-embedding-3-small or local SentenceTransformer embeddings, approximate nearest-neighbor...
Hybrid (Vector + BM25) — Combines dense (FAISS) and sparse (BM25) retrieval with Reciprocal Rank Fusion (RRF) for score merging, balances...
Multi-query Expansion — Generates multiple query variations from original question, retrieves chunks for each variation, and merges results with...
Cross-encoder Reranking — Initial retrieval via vector search, reranks top-k results using ms-marco-MiniLM cross-encoder for relevance scoring,...
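The RRF merge used by the hybrid strategy can be sketched as follows: each document's fused score is the sum of 1 / (k + rank) over the ranked lists it appears in, so documents ranked well by both dense and sparse retrieval rise to the top. The constant k = 60 follows the original RRF formulation; the document ids are illustrative.

```python
# Minimal Reciprocal Rank Fusion over dense (FAISS) and sparse (BM25) rankings.
def rrf_fuse(rankings, k=60):
    """rankings: list of ranked doc-id lists; returns ids sorted by fused score."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d2"]    # illustrative FAISS order
sparse = ["d1", "d4", "d3"]   # illustrative BM25 order
fused = rrf_fuse([dense, sparse])  # d1 and d3 appear in both lists and win
```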
Knowledge Graph System
Automatic entity/relationship extraction from documents using LangChain LLM chains
NetworkX graph construction with entities as nodes and relationships as edges
Background pre-building: graph generation runs asynchronously during document loading
Disk caching: graphs saved to `cache/` directory for instant reuse across sessions
Graph retrieval: traverses graph to find relevant entity paths for Graph RAG mode
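The graph retrieval step above can be sketched with NetworkX directly. The entities, relations, and the one-hop traversal are illustrative; in the platform the graph is produced by the LLM extraction chain rather than built by hand.

```python
# Sketch of Graph RAG retrieval: entities as nodes, labeled relationships as
# edges, context gathered from the neighborhood of entities in the question.
import networkx as nx

G = nx.Graph()
G.add_edge("FAISS", "vector search", relation="implements")
G.add_edge("vector search", "RAG", relation="used_by")
G.add_edge("BM25", "RAG", relation="used_by")

def graph_context(graph, entities, hops=1):
    """Collect (node, relation, neighbor) triples within `hops` of the query entities."""
    triples = []
    for entity in entities:
        if entity not in graph:
            continue
        for node in nx.ego_graph(graph, entity, radius=hops):
            for neighbor in graph[node]:
                triples.append((node, graph[node][neighbor]["relation"], neighbor))
    return triples
```

The triples are then serialized into the prompt as structured context alongside (or instead of) raw chunks.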
Document Processing
Smart PDF parsing via `unstructured` library with layout-aware structure extraction
Domain-specific chunking strategies: general (recursive), legal (section-based), research (paragraph-based), financial (table-aware)
Chunking methods: recursive (character-based with overlap), fixed (equal-sized), semantic (embedding-based similarity), structure-based (document...
Document store with in-memory caching and disk persistence
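The fixed chunking method listed above is the simplest of the four and can be sketched directly: equal-sized character windows with an overlap, so sentences that straddle a boundary appear in both neighboring chunks. The default sizes here are illustrative, not the platform's configured values.

```python
# Sketch of fixed-size chunking with overlap (illustrative defaults).
def chunk_fixed(text, chunk_size=500, overlap=50):
    """Split text into chunk_size-character windows, stepping by chunk_size - overlap."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    if not text:
        return []
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Recursive chunking refines this by preferring paragraph and sentence boundaries over raw character offsets; semantic and structure-based chunking replace the fixed window with embedding-similarity and document-layout boundaries respectively.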
Advanced Features
Contextual Compression — Embedding-based sentence filtering to remove irrelevant content from retrieved chunks, reduces LLM context noise and...
Streaming Responses — Server-Sent Events (SSE) for token-by-token answer display via `/ask-stream` endpoint, real-time feedback during generation
Performance Optimizations — Parallel retrieval for multiple queries, embedder caching (reuses embeddings across requests), hybrid retriever...
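The contextual-compression feature above can be sketched as a per-sentence relevance filter: embed the question and each sentence of a retrieved chunk, then keep only sentences whose cosine similarity clears a threshold. Here `embed` is a stand-in for the platform's OpenAI / SentenceTransformer embedder, and the threshold is illustrative.

```python
# Sketch of embedding-based sentence filtering for contextual compression.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def compress_chunk(question, chunk, embed, threshold=0.3):
    """Drop sentences that are only weakly related to the question."""
    q_vec = embed(question)
    kept = [s for s in chunk.split(". ") if cosine(embed(s), q_vec) >= threshold]
    return ". ".join(kept)
```

The trade-off is one extra embedding call per sentence in exchange for a shorter, less noisy prompt.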
API Endpoints (6 total)
`GET /health` — Health check endpoint
`GET /documents` — List available PDF documents with metadata
`GET /documents/{id}/pdf` — Serve document PDF for preview
`POST /load-document` — Load and index document with chosen chunk_strategy, embedding_model, smart_parse settings
`POST /ask` — Ask question and return full response (non-streaming)
`POST /ask-stream` — Ask question with SSE streaming (token-by-token)
Request parameters: `rag_mode` (basic | self_rag | agentic_rag | graph_rag), `search_method` (vector | hybrid | multi_query | reranked),...
Frontend (Next.js 16 + React 19 + TypeScript)
Next.js 16 App Router with React 19 for modern concurrent features
Guided mode: step-by-step UI walkthrough for selecting documents, chunking strategies, embedding models, retrieval methods, and RAG modes
Streaming UI: token-by-token answer display with Server-Sent Events client
Configuration comparison: side-by-side comparison of different retrieval strategies and RAG modes on the same question
Tailwind CSS for responsive utility-first styling
TypeScript for type safety across API client and components
Project Automation (Makefile with 12 commands)
`make setup` — First-time setup: install Python and Node dependencies, create `.env` files
`make dev` — Run backend (port 8000) and frontend (port 3000) concurrently
`make dev-backend` / `make dev-frontend` — Run services independently
`make test` — Quick system check (imports, caches, config without LLM calls)
`make validate` — Full backend validation including LLM calls
`make build` — Production build of frontend
`make clean` — Remove `__pycache__`, `.pyc`, `.next` build cache
`make clean-cache` — Remove all indexed document and knowledge graph caches
`make stop` — Kill all running RAGLab processes
CI/CD (GitHub Actions)
Backend checks: install dependencies, verify all Python imports, syntax validation
Frontend checks: install dependencies, ESLint, production build verification
Security scan: detect committed `.env` files and hardcoded API keys
Runs on every push and PR to main/master branches
Deployment
Frontend: Vercel deployment with `NEXT_PUBLIC_API_URL` environment variable
Backend: GCP Cloud Run / App Engine or any cloud with `OPENAI_API_KEY` and `GEMINI_API_KEY` environment variables
Security: `.env` files git-ignored, platform secret managers for production credentials
Design Decisions
Chose 4 RAG modes (Basic, Self-RAG, Agentic, Graph) to cover different use cases: Basic for simple Q&A, Self-RAG for iterative refinement, Agentic...
Implemented 4 retrieval strategies to demonstrate trade-offs: Vector (fast, semantic), Hybrid (balances semantic+keyword), Multi-query (handles...
Used FAISS for vector search — industry-standard library with efficient approximate nearest-neighbor search, CPU-friendly, and no GPU required for...
Built knowledge graph with NetworkX instead of Neo4j — simpler setup (no separate database), disk caching for persistence, sufficient for...
Chose Google Gemini 2.0 Flash for generation — cost-effective (~$0.002/call), fast response times, good quality for experimentation. OpenAI...
Implemented Server-Sent Events (SSE) for streaming over WebSockets — simpler protocol, HTTP-based (better firewall compatibility), one-way...
Added contextual compression as optional feature — reduces LLM context noise by filtering irrelevant sentences, trades processing time for answer...
Built smart PDF parsing with domain-specific chunking — legal documents need section awareness, financial documents need table handling, research...
Implemented multiple embedding options (OpenAI, local SentenceTransformer) — OpenAI for quality, local for cost control and offline use.
Used Makefile for project automation — single command (`make dev`) to run full stack, `make test` for quick validation, `make clean-cache` to reset...
Chose monorepo structure with separate backend/frontend directories — easier to run both services, shared Git history, simpler deployment configuration.
Implemented background knowledge graph pre-building — graph generation runs asynchronously during document loading, doesn't block user, caches to disk...
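The background pre-building pattern above can be sketched with asyncio plus a disk cache: document loading schedules graph extraction as a task instead of awaiting it, and a later Graph RAG query either finds the cached file or awaits the in-flight task. `extract_graph` is a hypothetical stand-in for the LLM extraction chain.

```python
# Sketch of non-blocking knowledge-graph pre-building with disk caching.
import asyncio
import json
import pathlib

CACHE_DIR = pathlib.Path("cache")

async def build_graph(doc_id: str, text: str, extract_graph, cache_dir=CACHE_DIR):
    """Extract a knowledge graph once, then serve it from the disk cache."""
    cache_file = cache_dir / f"{doc_id}.graph.json"
    if cache_file.exists():                        # instant reuse across sessions
        return json.loads(cache_file.read_text())
    graph = await extract_graph(text)              # expensive LLM extraction
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(graph))
    return graph

def start_prebuild(doc_id, text, extract_graph, tasks):
    # Called during document loading: schedule extraction without blocking
    # the load response; `tasks` maps doc_id -> in-flight asyncio.Task.
    tasks[doc_id] = asyncio.create_task(build_graph(doc_id, text, extract_graph))
```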
Tradeoffs & Constraints
Chose FAISS over Pinecone/Weaviate: No API costs and works offline, but lacks distributed scaling and real-time updates. Suitable for...
NetworkX for knowledge graphs: Simple Python library with disk caching, but limited to single-machine scale. Would need Neo4j/Amazon Neptune for...
Server-Sent Events for streaming: Simpler than WebSockets but one-way only. Can't send client updates during streaming (e.g., stop generation...
BM25 for sparse retrieval: Classic algorithm with good keyword matching, but can't handle synonyms or semantic similarity. Hybrid mode combines both...
OpenAI embeddings: Best quality but API costs (~$0.0001/1K tokens) and requires internet. Local SentenceTransformer is free but lower quality.
In-memory document store: Fast access but requires re-indexing on server restart. Cache directory provides persistence but needs manual cleanup.
Google Gemini 2.0 Flash: Cost-effective and fast, but less capable than GPT-4 for complex reasoning. Trade cost/speed for quality.
Cross-encoder reranking: Highest precision but slow (processes each doc-query pair). Only practical for top-k results (e.g., rerank top 20 from 100).
Graph RAG entity extraction: LLM-based extraction is accurate but expensive. Pre-building and caching amortizes cost across multiple queries.
Would improve: Add streaming stop capability (WebSocket upgrade), implement distributed vector store (Pinecone/Weaviate), add more LLM options...
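The rerank-top-k trade-off noted above (cross-encoder precision versus per-pair cost) comes down to a simple two-stage shape: cheap vector search narrows the pool, then the expensive scorer only touches the survivors. `score_pair` here is a stand-in for the ms-marco-MiniLM cross-encoder's pairwise prediction.

```python
# Sketch of two-stage reranking: score only the top candidates pairwise.
def rerank(query, candidates, score_pair, top_k=5):
    """Order candidate documents by cross-encoder relevance, keep top_k."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```

With 100 initial candidates cut to 20 before reranking, the cross-encoder runs 20 forward passes instead of 100, which is what makes it practical as a final precision stage.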
Outcome & Impact
Production-ready RAG experimentation platform enabling side-by-side comparison of 4 RAG modes (Basic, Self-RAG, Agentic, Graph) and 4 retrieval...
Comprehensive backend with Python 3.13 + FastAPI + LangChain providing 6 API endpoints (/health, /documents, /documents/{id}/pdf, /load-document,...
4 RAG modes implemented: Basic (retrieve-generate), Self-RAG (iterative sufficiency loop up to 3 iterations), Agentic (plan-retrieve-reflect), Graph...
4 retrieval strategies: Vector search (FAISS with OpenAI/local embeddings), Hybrid (BM25+FAISS with RRF fusion), Multi-query expansion (generates...
Automatic knowledge graph extraction with NetworkX: entities/relationships from documents, background pre-building during document loading, disk...
Streaming responses via Server-Sent Events: token-by-token answer display through /ask-stream endpoint, real-time feedback during LLM generation.
Contextual compression: embedding-based sentence filtering to remove irrelevant content from retrieved chunks, reduces LLM context noise and improves...
Smart PDF parsing: structure-aware parsing via unstructured library, domain-specific chunking strategies (general, legal, research, financial),...
Multiple chunking strategies: recursive (character-based with overlap), fixed (equal-sized), semantic (embedding-based similarity), structure-based...
Performance optimizations: parallel retrieval for multi-query expansion, embedder caching (reuses embeddings across requests), hybrid retriever...
Interactive Next.js 16 + React 19 frontend: guided mode with step-by-step document/strategy/mode selection, streaming UI with token-by-token display,...
Comprehensive Makefile automation: 12 commands including setup, dev (backend+frontend), test (quick validation), validate (full with LLM calls),...
GitHub Actions CI/CD: backend checks (dependencies, imports, syntax), frontend checks (ESLint, build), security scan (detect .env files, hardcoded...
Flexible deployment: Vercel for frontend (NEXT_PUBLIC_API_URL env var), GCP/any cloud for backend (OPENAI_API_KEY, GEMINI_API_KEY env vars), .env...
Clean project structure: monorepo with raglab-backend/ (FastAPI, chunking, embeddings, graph, ingestion, retrieval, services) and raglab-frontend/...
MIT licensed for open experimentation and research use.
Tech Stack
Backend: Python 3.13, FastAPI (REST + SSE), Uvicorn
RAG Framework: LangChain for orchestration and chains
Vector Store: FAISS for dense retrieval with approximate NN search
Knowledge Graph: NetworkX for entity/relationship graphs
Sparse Retrieval: BM25 for keyword-based search with RRF fusion
Reranking: Cross-encoder (ms-marco-MiniLM) for relevance scoring
PDF Parsing: unstructured library for layout-aware extraction
Embeddings: OpenAI text-embedding-3-small, SentenceTransformer (local)
LLM: Google Gemini 2.0 Flash for answer generation
Frontend: Next.js 16 (App Router), React 19, TypeScript
Styling: Tailwind CSS for utility-first responsive design
Streaming: Server-Sent Events (SSE) for token-by-token display
Automation: Makefile for project commands and workflows
CI/CD: GitHub Actions (backend/frontend/security checks)