
RAGLab — RAG Experimentation Platform

Interactive RAG (Retrieval-Augmented Generation) experimentation platform for comparing retrieval strategies and RAG modes in real time. Built with a Python 3.13 + FastAPI backend featuring 4 RAG modes (Basic, Self-RAG with iterative sufficiency checking, Agentic RAG with plan-retrieve-reflect, Graph RAG with knowledge graph traversal) and 4 retrieval strategies (Vector/FAISS, Hybrid BM25+RRF fusion, Multi-query expansion, Cross-encoder reranking). Also includes automatic knowledge graph extraction (NetworkX with background pre-building and disk caching), contextual compression (embedding-based sentence filtering), and smart PDF parsing (structure-aware with domain-specific chunking).

Python 3.13 · FastAPI · LangChain · FAISS · NetworkX · Next.js 16 · React 19 · TypeScript · Tailwind CSS · Google Gemini 2.0 Flash · OpenAI text-embedding-3-small · BM25 · Cross-encoder (ms-marco-MiniLM) · unstructured (PDF parsing) · Server-Sent Events · GitHub Actions

Role

AI Engineer & Full-stack Developer

Team

Solo

Company/Organization

Personal Project (Research & Experimentation)

The Problem

Researchers and engineers experimenting with RAG systems need to compare how different retrieval strategies (vector, hybrid, multi-query, reranked)...

No unified platform exists for experimenting with multiple RAG modes — Basic RAG (simple retrieve-generate), Self-RAG (iterative sufficiency checking),...

Understanding the impact of different chunking strategies (recursive, fixed, semantic, structure-based) and embedding models (OpenAI, local) requires...

Real-time feedback during RAG experimentation is essential for iterative improvement, but most implementations return only final answers without...

Knowledge graphs for Graph RAG require manual entity/relationship extraction and are not automatically generated from documents with caching for...

Contextual compression (filtering irrelevant chunks) and smart PDF parsing (domain-specific chunking for legal, research, financial documents) are...

The Solution

Built a comprehensive RAG experimentation platform with two main components: Python FastAPI backend and Next.js React frontend.

Backend Architecture (Python 3.13 + FastAPI + LangChain)

Implemented 4 RAG modes for different use cases:

Basic RAG — Standard retrieve-and-generate: retrieves relevant chunks with chosen strategy and generates answer with Google Gemini 2.0 Flash.

Self-RAG — Iterative sufficiency loop: generates initial answer, checks if retrieved context is sufficient, retrieves additional chunks if...

Agentic RAG — Plan-retrieve-reflect workflow: creates retrieval plan based on question, retrieves context for each sub-question, reflects on...

Graph RAG — Knowledge graph traversal: extracts entities/relationships from documents into NetworkX graph, traverses graph to find relevant...
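The Self-RAG loop above can be sketched in a few lines. This is a minimal sketch, not the actual implementation: `retrieve`, `generate`, and `check_sufficiency` are hypothetical stand-ins for the platform's LLM-backed components, passed in as callables so the control flow stays visible.

```python
# Sketch of the Self-RAG iterative sufficiency loop.
# retrieve / generate / check_sufficiency are hypothetical stand-ins
# for the platform's actual LLM-backed components.
from typing import Callable, List

def self_rag(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str, List[str]], str],
    check_sufficiency: Callable[[str, List[str]], bool],
    max_iterations: int = 3,
) -> str:
    """Retrieve, generate, and re-retrieve until context is judged sufficient."""
    context: List[str] = []
    for _ in range(max_iterations):
        context += retrieve(question, 4)          # fetch more chunks each round
        if check_sufficiency(question, context):  # LLM judges the context
            break
    return generate(question, context)
```

The iteration cap keeps cost bounded: a question whose context is never judged sufficient still terminates after a fixed number of retrieval rounds.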

Implemented 4 retrieval strategies:

Vector Search (FAISS) — Dense retrieval using OpenAI text-embedding-3-small or local SentenceTransformer embeddings, approximate nearest-neighbor...

Hybrid (Vector + BM25) — Combines dense (FAISS) and sparse (BM25) retrieval with Reciprocal Rank Fusion (RRF) for score merging, balances...

Multi-query Expansion — Generates multiple query variations from original question, retrieves chunks for each variation, and merges results with...

Cross-encoder Reranking — Initial retrieval via vector search, reranks top-k results using ms-marco-MiniLM cross-encoder for relevance scoring,...
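The Reciprocal Rank Fusion step used by the hybrid strategy is compact enough to sketch directly. The document ids below are illustrative; `k=60` is the conventional RRF constant, not necessarily the value RAGLab uses.

```python
# Minimal sketch of Reciprocal Rank Fusion (RRF): two ranked lists
# (dense and sparse) are merged by summing 1 / (k + rank) per document.
from collections import defaultdict
from typing import Dict, List

def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked doc-id lists into one fused ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d1", "d2", "d3"]    # e.g. FAISS results
sparse = ["d3", "d1", "d4"]   # e.g. BM25 results
fused = rrf_fuse([dense, sparse])
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that FAISS similarities and BM25 scores live on incompatible scales.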

Knowledge Graph System

Automatic entity/relationship extraction from documents using LangChain LLM chains

NetworkX graph construction with entities as nodes and relationships as edges

Background pre-building: graph generation runs asynchronously during document loading

Disk caching: graphs saved to `cache/` directory for instant reuse across sessions

Graph retrieval: traverses graph to find relevant entity paths for Graph RAG mode
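The graph construction and traversal described above can be sketched with NetworkX directly. The triples here are a fixed toy list standing in for RAGLab's LLM-based entity extraction, and `graph_context` is a hypothetical helper name.

```python
# Sketch of the Graph RAG idea: entities become nodes, relationships
# become labeled edges, and retrieval walks the neighborhood of an
# entity mentioned in the question. Entity extraction (LLM-based in
# RAGLab) is stubbed out as a fixed triple list.
import networkx as nx

triples = [  # (subject, relation, object), as an extractor might emit
    ("FAISS", "used_for", "vector search"),
    ("BM25", "used_for", "sparse retrieval"),
    ("RRF", "merges", "vector search"),
    ("RRF", "merges", "sparse retrieval"),
]

G = nx.DiGraph()
for subj, rel, obj in triples:
    G.add_edge(subj, obj, relation=rel)

def graph_context(entity: str, hops: int = 1) -> list:
    """Collect 'u relation v' facts within `hops` of an entity."""
    nodes = nx.single_source_shortest_path_length(
        G.to_undirected(), entity, cutoff=hops
    )
    return [
        f"{u} {data['relation']} {v}"
        for u, v, data in G.edges(data=True)
        if u in nodes or v in nodes
    ]
```

The collected facts are then handed to the LLM as context, the same way retrieved chunks are in the other modes.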

Document Processing

Smart PDF parsing via `unstructured` library with layout-aware structure extraction

Domain-specific chunking strategies: general (recursive), legal (section-based), research (paragraph-based), financial (table-aware)

Chunking methods: recursive (character-based with overlap), fixed (equal-sized), semantic (embedding-based similarity), structure-based (document...

Document store with in-memory caching and disk persistence
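The simplest of the chunking methods above (fixed, equal-sized with overlap) can be sketched as a sliding character window; the recursive variant works similarly but first tries separators (paragraphs, sentences) before falling back to character windows. Chunk sizes here are illustrative.

```python
# Sketch of fixed-size chunking with character overlap: slide a
# `size`-char window over the text, stepping size - overlap chars,
# so consecutive chunks share `overlap` characters of context.
from typing import List

def fixed_chunks(text: str, size: int = 500, overlap: int = 50) -> List[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = fixed_chunks("abcdefghij", size=4, overlap=2)
```

The overlap matters for retrieval quality: a sentence split across a chunk boundary still appears whole in at least one chunk.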

Advanced Features

Contextual Compression — Embedding-based sentence filtering to remove irrelevant content from retrieved chunks, reduces LLM context noise and...

Streaming Responses — Server-Sent Events (SSE) for token-by-token answer display via `/ask-stream` endpoint, real-time feedback during generation

Performance Optimizations — Parallel retrieval for multiple queries, embedder caching (reuses embeddings across requests), hybrid retriever...
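The contextual compression step can be sketched as a per-sentence similarity filter. RAGLab scores sentences with real embedding vectors; a bag-of-words cosine stands in here so the sketch stays dependency-free, and the threshold value is illustrative.

```python
# Sketch of contextual compression: score each sentence of a retrieved
# chunk against the question and keep only sufficiently similar ones.
# A bag-of-words cosine stands in for real embedding similarity.
import math
import re
from collections import Counter
from typing import List

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def compress(question: str, chunk: str, threshold: float = 0.2) -> str:
    q_vec = Counter(question.lower().split())
    kept: List[str] = []
    for sentence in re.split(r"(?<=[.!?])\s+", chunk):
        if cosine(q_vec, Counter(sentence.lower().split())) >= threshold:
            kept.append(sentence)
    return " ".join(kept)
```

Dropping off-topic sentences before generation shrinks the prompt, which both cuts token cost and reduces the chance the LLM anchors on irrelevant text.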

API Endpoints (6 total)

`GET /health` — Health check endpoint

`GET /documents` — List available PDF documents with metadata

`GET /documents/{id}/pdf` — Serve document PDF for preview

`POST /load-document` — Load and index a document with chosen chunk_strategy, embedding_model, smart_parse settings

`POST /ask` — Ask a question and return the full response (non-streaming)

`POST /ask-stream` — Ask a question with SSE streaming (token-by-token)

Request parameters: `rag_mode` (basic | self_rag | agentic_rag | graph_rag), `search_method` (vector | hybrid | multi_query | reranked),...
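A request body for `POST /ask` or `POST /ask-stream` would then look like the sketch below. This is a hypothetical client-side helper: the field names mirror the parameters listed above, but the exact payload schema and any additional fields are assumptions.

```python
# Hypothetical helper showing the request shape for POST /ask; the
# allowed-value sets mirror the parameters documented above.
import json

RAG_MODES = {"basic", "self_rag", "agentic_rag", "graph_rag"}
SEARCH_METHODS = {"vector", "hybrid", "multi_query", "reranked"}

def build_ask_payload(question: str, rag_mode: str = "basic",
                      search_method: str = "vector") -> str:
    """Validate parameters and serialize the request body as JSON."""
    if rag_mode not in RAG_MODES:
        raise ValueError(f"unknown rag_mode: {rag_mode}")
    if search_method not in SEARCH_METHODS:
        raise ValueError(f"unknown search_method: {search_method}")
    return json.dumps({
        "question": question,
        "rag_mode": rag_mode,
        "search_method": search_method,
    })
```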

Frontend (Next.js 16 + React 19 + TypeScript)

Next.js 16 App Router with React 19 for modern concurrent features

Guided mode: step-by-step UI walkthrough for selecting documents, chunking strategies, embedding models, retrieval methods, and RAG modes

Streaming UI: token-by-token answer display with Server-Sent Events client

Configuration comparison: side-by-side comparison of different retrieval strategies and RAG modes on the same question

Tailwind CSS for responsive utility-first styling

TypeScript for type safety across API client and components

Project Automation (Makefile with 12 commands)

`make setup` — First-time setup: install Python and Node dependencies, create `.env` files

`make dev` — Run backend (port 8000) and frontend (port 3000) concurrently

`make dev-backend` / `make dev-frontend` — Run services independently

`make test` — Quick system check (imports, caches, config without LLM calls)

`make validate` — Full backend validation including LLM calls

`make build` — Production build of frontend

`make clean` — Remove `__pycache__`, `.pyc`, `.next` build cache

`make clean-cache` — Remove all indexed document and knowledge graph caches

`make stop` — Kill all running RAGLab processes

CI/CD (GitHub Actions)

Backend checks: install dependencies, verify all Python imports, syntax validation

Frontend checks: install dependencies, ESLint, production build verification

Security scan: detect committed `.env` files and hardcoded API keys

Runs on every push and PR to main/master branches

Deployment

Frontend: Vercel deployment with `NEXT_PUBLIC_API_URL` environment variable

Backend: GCP Cloud Run / App Engine or any cloud with `OPENAI_API_KEY` and `GEMINI_API_KEY` environment variables

Security: `.env` files git-ignored, platform secret managers for production credentials

Design Decisions

Chose 4 RAG modes (Basic, Self-RAG, Agentic, Graph) to cover different use cases: Basic for simple Q&A, Self-RAG for iterative refinement, Agentic...

Implemented 4 retrieval strategies to demonstrate trade-offs: Vector (fast, semantic), Hybrid (balances semantic+keyword), Multi-query (handles...

Used FAISS for vector search — industry-standard library with efficient approximate nearest-neighbor search, CPU-friendly, and no GPU required for...

Built knowledge graph with NetworkX instead of Neo4j — simpler setup (no separate database), disk caching for persistence, sufficient for...

Chose Google Gemini 2.0 Flash for generation — cost-effective (~$0.002/call), fast response times, good quality for experimentation. OpenAI...

Implemented Server-Sent Events (SSE) for streaming over WebSockets — simpler protocol, HTTP-based (better firewall compatibility), one-way...

Added contextual compression as an optional feature — reduces LLM context noise by filtering irrelevant sentences, trades processing time for answer...

Built smart PDF parsing with domain-specific chunking — legal documents need section awareness, financial documents need table handling, research...

Implemented multiple embedding options (OpenAI, local SentenceTransformer) — OpenAI for quality, local for cost control and offline use.

Used Makefile for project automation — single command (`make dev`) to run the full stack, `make test` for quick validation, `make clean-cache` to reset...

Chose a monorepo structure with separate backend/frontend directories — easier to run both services, shared Git history, simpler deployment configuration.

Implemented background knowledge graph pre-building — graph generation runs asynchronously during document loading, doesn't block the user, caches to disk...

Tradeoffs & Constraints

Chose FAISS over Pinecone/Weaviate: No API costs and works offline, but lacks distributed scaling and real-time updates. Suitable for...

NetworkX for knowledge graphs: Simple Python library with disk caching, but limited to single-machine scale. Would need Neo4j/Amazon Neptune for...

Server-Sent Events for streaming: Simpler than WebSockets but one-way only. Can't send client updates during streaming (e.g., stop generation...

BM25 for sparse retrieval: Classic algorithm with good keyword matching, but can't handle synonyms or semantic similarity. Hybrid mode combines both...

OpenAI embeddings: Best quality but API costs (~$0.0001/1K tokens) and requires internet. Local SentenceTransformer is free but lower quality.

In-memory document store: Fast access but requires re-indexing on server restart. Cache directory provides persistence but needs manual cleanup.

Google Gemini 2.0 Flash: Cost-effective and fast, but less capable than GPT-4 for complex reasoning; trades some answer quality for cost and speed.

Cross-encoder reranking: Highest precision but slow (processes each doc-query pair). Only practical for top-k results (e.g., rerank top 20 from 100).

Graph RAG entity extraction: LLM-based extraction is accurate but expensive. Pre-building and caching amortizes cost across multiple queries.

Would improve: Add streaming stop capability (WebSocket upgrade), implement distributed vector store (Pinecone/Weaviate), add more LLM options...

Outcome & Impact

Production-ready RAG experimentation platform enabling side-by-side comparison of 4 RAG modes (Basic, Self-RAG, Agentic, Graph) and 4 retrieval...

Comprehensive backend with Python 3.13 + FastAPI + LangChain providing 6 API endpoints (/health, /documents, /documents/{id}/pdf, /load-document,...

4 RAG modes implemented: Basic (retrieve-generate), Self-RAG (iterative sufficiency loop up to 3 iterations), Agentic (plan-retrieve-reflect), Graph...

4 retrieval strategies: Vector search (FAISS with OpenAI/local embeddings), Hybrid (BM25+FAISS with RRF fusion), Multi-query expansion (generates...

Automatic knowledge graph extraction with NetworkX: entities/relationships from documents, background pre-building during document loading, disk...

Streaming responses via Server-Sent Events: token-by-token answer display through /ask-stream endpoint, real-time feedback during LLM generation.

Contextual compression: embedding-based sentence filtering to remove irrelevant content from retrieved chunks, reduces LLM context noise and improves...

Smart PDF parsing: structure-aware parsing via unstructured library, domain-specific chunking strategies (general, legal, research, financial),...

Multiple chunking strategies: recursive (character-based with overlap), fixed (equal-sized), semantic (embedding-based similarity), structure-based...

Performance optimizations: parallel retrieval for multi-query expansion, embedder caching (reuses embeddings across requests), hybrid retriever...

Interactive Next.js 16 + React 19 frontend: guided mode with step-by-step document/strategy/mode selection, streaming UI with token-by-token display,...

Comprehensive Makefile automation: 12 commands including setup, dev (backend+frontend), test (quick validation), validate (full with LLM calls),...

GitHub Actions CI/CD: backend checks (dependencies, imports, syntax), frontend checks (ESLint, build), security scan (detect .env files, hardcoded...

Flexible deployment: Vercel for frontend (NEXT_PUBLIC_API_URL env var), GCP/any cloud for backend (OPENAI_API_KEY, GEMINI_API_KEY env vars), .env...

Clean project structure: monorepo with raglab-backend/ (FastAPI, chunking, embeddings, graph, ingestion, retrieval, services) and raglab-frontend/...

MIT licensed for open experimentation and research use.

Tech Stack

Backend: Python 3.13, FastAPI (REST + SSE), Uvicorn

RAG Framework: LangChain for orchestration and chains

Vector Store: FAISS for dense retrieval with approximate NN search

Knowledge Graph: NetworkX for entity/relationship graphs

Sparse Retrieval: BM25 for keyword-based search with RRF fusion

Reranking: Cross-encoder (ms-marco-MiniLM) for relevance scoring

PDF Parsing: unstructured library for layout-aware extraction

Embeddings: OpenAI text-embedding-3-small, SentenceTransformer (local)

LLM: Google Gemini 2.0 Flash for answer generation

Frontend: Next.js 16 (App Router), React 19, TypeScript

Styling: Tailwind CSS for utility-first responsive design

Streaming: Server-Sent Events (SSE) for token-by-token display

Automation: Makefile for project commands and workflows

CI/CD: GitHub Actions (backend/frontend/security checks)
