RAGLab — RAG Experimentation Platform
Interactive RAG (Retrieval-Augmented Generation) experimentation platform for comparing retrieval strategies and RAG modes in real time. The Python 3.13 + FastAPI backend features four RAG modes (Basic, Self-RAG with iterative sufficiency checking, Agentic RAG with plan-retrieve-reflect, and Graph RAG with knowledge graph traversal) and four retrieval strategies (vector/FAISS, hybrid BM25 + RRF fusion, multi-query expansion, and cross-encoder reranking). It also provides automatic knowledge graph extraction (NetworkX, with background pre-building and disk caching), contextual compression (embedding-based sentence filtering), and smart PDF parsing (structure-aware, with domain-specific chunking).
Role
AI Engineer & Full-stack Developer
Team
Solo
Company/Organization
Personal Project (Research & Experimentation)
The Problem
Researchers and engineers experimenting with RAG systems need to compare how different retrieval strategies (vector, hybrid, multi-query, reranked)...
No unified platform exists for experimenting with multiple RAG modes — Basic RAG (simple retrieve-generate), Self-RAG (iterative sufficiency checking),...
Understanding the impact of different chunking strategies (recursive, fixed, semantic, structure-based) and embedding models (OpenAI, local) requires...
Real-time feedback during RAG experimentation is essential for iterative improvement, but most implementations return only final answers without...
Knowledge graphs for Graph RAG require manual entity/relationship extraction and are not automatically generated from documents with caching for...
Contextual compression (filtering irrelevant chunks) and smart PDF parsing (domain-specific chunking for legal, research, financial documents) are...
The Solution
Built a comprehensive RAG experimentation platform with two main components: a Python FastAPI backend and a Next.js React frontend.
Backend Architecture (Python 3.13 + FastAPI + LangChain)
Implemented 4 RAG modes for different use cases:
Basic RAG — Standard retrieve-and-generate: retrieves relevant chunks with chosen strategy and generates answer with Google Gemini 2.0 Flash.
Self-RAG — Iterative sufficiency loop: generates initial answer, checks if retrieved context is sufficient, retrieves additional chunks if...
Agentic RAG — Plan-retrieve-reflect workflow: creates retrieval plan based on question, retrieves context for each sub-question, reflects on...
Graph RAG — Knowledge graph traversal: extracts entities/relationships from documents into NetworkX graph, traverses graph to find relevant...
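The Self-RAG loop above can be sketched in a few lines. This is a minimal illustration, not the platform's actual implementation: `retrieve`, `generate`, and `is_sufficient` are hypothetical stand-ins for the retriever, the Gemini call, and the LLM sufficiency check.

```python
# Sketch of a Self-RAG-style iterative sufficiency loop (hypothetical names).
def self_rag(question, retrieve, generate, is_sufficient, max_iterations=3):
    """Retrieve, answer, and re-retrieve until the context is judged sufficient."""
    context = retrieve(question)
    answer = generate(question, context)
    for _ in range(max_iterations - 1):
        if is_sufficient(question, context, answer):
            break
        # Context judged insufficient: fetch more chunks and regenerate.
        context = context + retrieve(question + " " + answer)
        answer = generate(question, context)
    return answer
```

The cap on iterations keeps cost bounded: each extra round costs one sufficiency check, one retrieval, and one generation call.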
Implemented 4 retrieval strategies:
Vector Search (FAISS) — Dense retrieval using OpenAI text-embedding-3-small or local SentenceTransformer embeddings, approximate nearest-neighbor...
Hybrid (Vector + BM25) — Combines dense (FAISS) and sparse (BM25) retrieval with Reciprocal Rank Fusion (RRF) for score merging, balances...
Multi-query Expansion — Generates multiple query variations from original question, retrieves chunks for each variation, and merges results with...
Cross-encoder Reranking — Initial retrieval via vector search, reranks top-k results using ms-marco-MiniLM cross-encoder for relevance scoring,...
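The RRF merge used by the hybrid strategy can be sketched as follows: each document's fused score is the sum of 1 / (k + rank) over the ranked lists it appears in, so documents ranked well by both dense and sparse retrieval rise to the top. The constant k = 60 follows the original RRF formulation; the document ids are illustrative.

```python
# Minimal Reciprocal Rank Fusion over dense (FAISS) and sparse (BM25) rankings.
def rrf_fuse(rankings, k=60):
    """rankings: list of ranked doc-id lists; returns ids sorted by fused score."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d2"]    # illustrative FAISS order
sparse = ["d1", "d4", "d3"]   # illustrative BM25 order
fused = rrf_fuse([dense, sparse])  # d1 and d3 appear in both lists and win
```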
Knowledge Graph System
Automatic entity/relationship extraction from documents using LangChain LLM chains
NetworkX graph construction with entities as nodes and relationships as edges
Background pre-building: graph generation runs asynchronously during document loading
Disk caching: graphs saved to `cache/` directory for instant reuse across sessions
Graph retrieval: traverses graph to find relevant entity paths for Graph RAG mode
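The graph retrieval step above can be sketched with NetworkX directly. The entities, relations, and the one-hop traversal are illustrative; in the platform the graph is produced by the LLM extraction chain rather than built by hand.

```python
# Sketch of Graph RAG retrieval: entities as nodes, labeled relationships as
# edges, context gathered from the neighborhood of entities in the question.
import networkx as nx

G = nx.Graph()
G.add_edge("FAISS", "vector search", relation="implements")
G.add_edge("vector search", "RAG", relation="used_by")
G.add_edge("BM25", "RAG", relation="used_by")

def graph_context(graph, entities, hops=1):
    """Collect (node, relation, neighbor) triples within `hops` of the query entities."""
    triples = []
    for entity in entities:
        if entity not in graph:
            continue
        for node in nx.ego_graph(graph, entity, radius=hops):
            for neighbor in graph[node]:
                triples.append((node, graph[node][neighbor]["relation"], neighbor))
    return triples
```

The triples are then serialized into the prompt as structured context alongside (or instead of) raw chunks.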
Document Processing
Smart PDF parsing via `unstructured` library with layout-aware structure extraction
Domain-specific chunking strategies: general (recursive), legal (section-based), research (paragraph-based), financial (table-aware)
Chunking methods: recursive (character-based with overlap), fixed (equal-sized), semantic (embedding-based similarity), structure-based (document...
Document store with in-memory caching and disk persistence
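The fixed chunking method listed above is the simplest of the four and can be sketched directly: equal-sized character windows with an overlap, so sentences that straddle a boundary appear in both neighboring chunks. The default sizes here are illustrative, not the platform's configured values.

```python
# Sketch of fixed-size chunking with overlap (illustrative defaults).
def chunk_fixed(text, chunk_size=500, overlap=50):
    """Split text into chunk_size-character windows, stepping by chunk_size - overlap."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    if not text:
        return []
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Recursive chunking refines this by preferring paragraph and sentence boundaries over raw character offsets; semantic and structure-based chunking replace the fixed window with embedding-similarity and document-layout boundaries respectively.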
Advanced Features
Contextual Compression — Embedding-based sentence filtering to remove irrelevant content from retrieved chunks, reduces LLM context noise and...
Streaming Responses — Server-Sent Events (SSE) for token-by-token answer display via `/ask-stream` endpoint, real-time feedback during generation
Performance Optimizations — Parallel retrieval for multiple queries, embedder caching (reuses embeddings across requests), hybrid retriever...
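The contextual-compression feature above can be sketched as a per-sentence relevance filter: embed the question and each sentence of a retrieved chunk, then keep only sentences whose cosine similarity clears a threshold. Here `embed` is a stand-in for the platform's OpenAI / SentenceTransformer embedder, and the threshold is illustrative.

```python
# Sketch of embedding-based sentence filtering for contextual compression.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def compress_chunk(question, chunk, embed, threshold=0.3):
    """Drop sentences that are only weakly related to the question."""
    q_vec = embed(question)
    kept = [s for s in chunk.split(". ") if cosine(embed(s), q_vec) >= threshold]
    return ". ".join(kept)
```

The trade-off is one extra embedding call per sentence in exchange for a shorter, less noisy prompt.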
API Endpoints (6 total)
`GET /health` — Health check endpoint
`GET /documents` — List available PDF documents with metadata
`GET /documents/{id}/pdf` — Serve document PDF for preview
`POST /load-document` — Load and index document with chosen chunk_strategy, embedding_model, smart_parse settings
`POST /ask` — Ask question and return full response (non-streaming)
`POST /ask-stream` — Ask question with SSE streaming (token-by-token)
Request parameters: `rag_mode` (basic | self_rag | agentic_rag | graph_rag), `search_method` (vector | hybrid | multi_query | reranked),...
Frontend (Next.js 16 + React 19 + TypeScript)
Next.js 16 App Router with React 19 for modern concurrent features
Guided mode: step-by-step UI walkthrough for selecting documents, chunking strategies, embedding models, retrieval methods, and RAG modes
Streaming UI: token-by-token answer display with Server-Sent Events client
Configuration comparison: side-by-side comparison of different retrieval strategies and RAG modes on the same question
Tailwind CSS for responsive utility-first styling
TypeScript for type safety across API client and components
Project Automation (Makefile with 12 commands)
`make setup` — First-time setup: install Python and Node dependencies, create `.env` files
`make dev` — Run backend (port 8000) and frontend (port 3000) concurrently
`make dev-backend` / `make dev-frontend` — Run services independently
`make test` — Quick system check (imports, caches, config without LLM calls)
`make validate` — Full backend validation including LLM calls
`make build` — Production build of frontend
`make clean` — Remove `__pycache__`, `.pyc`, `.next` build cache
`make clean-cache` — Remove all indexed document and knowledge graph caches
`make stop` — Kill all running RAGLab processes
CI/CD (GitHub Actions)
Backend checks: install dependencies, verify all Python imports, syntax validation
Frontend checks: install dependencies, ESLint, production build verification
Security scan: detect committed `.env` files and hardcoded API keys
Runs on every push and PR to main/master branches
Deployment
Frontend: Vercel deployment with `NEXT_PUBLIC_API_URL` environment variable
Backend: GCP Cloud Run / App Engine or any cloud with `OPENAI_API_KEY` and `GEMINI_API_KEY` environment variables
Security: `.env` files git-ignored, platform secret managers for production credentials
Design Decisions
Chose 4 RAG modes (Basic, Self-RAG, Agentic, Graph) to cover different use cases: Basic for simple Q&A, Self-RAG for iterative refinement, Agentic...
Implemented 4 retrieval strategies to demonstrate trade-offs: Vector (fast, semantic), Hybrid (balances semantic+keyword), Multi-query (handles...
Used FAISS for vector search — industry-standard library with efficient approximate nearest-neighbor search, CPU-friendly, and no GPU required for...
Built knowledge graph with NetworkX instead of Neo4j — simpler setup (no separate database), disk caching for persistence, sufficient for...
Chose Google Gemini 2.0 Flash for generation — cost-effective (~$0.002/call), fast response times, good quality for experimentation. OpenAI...
Implemented Server-Sent Events (SSE) for streaming over WebSockets — simpler protocol, HTTP-based (better firewall compatibility), one-way...
Added contextual compression as optional feature — reduces LLM context noise by filtering irrelevant sentences, trades processing time for answer...
Built smart PDF parsing with domain-specific chunking — legal documents need section awareness, financial documents need table handling, research...
Implemented multiple embedding options (OpenAI, local SentenceTransformer) — OpenAI for quality, local for cost control and offline use.
Used Makefile for project automation — single command (`make dev`) to run full stack, `make test` for quick validation, `make clean-cache` to reset...
Chose monorepo structure with separate backend/frontend directories — easier to run both services, shared Git history, simpler deployment configuration.
Implemented background knowledge graph pre-building — graph generation runs asynchronously during document loading, doesn't block user, caches to disk...
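The background pre-building pattern above can be sketched with asyncio plus a disk cache: document loading schedules graph extraction as a task instead of awaiting it, and a later Graph RAG query either finds the cached file or awaits the in-flight task. `extract_graph` is a hypothetical stand-in for the LLM extraction chain.

```python
# Sketch of non-blocking knowledge-graph pre-building with disk caching.
import asyncio
import json
import pathlib

CACHE_DIR = pathlib.Path("cache")

async def build_graph(doc_id: str, text: str, extract_graph, cache_dir=CACHE_DIR):
    """Extract a knowledge graph once, then serve it from the disk cache."""
    cache_file = cache_dir / f"{doc_id}.graph.json"
    if cache_file.exists():                        # instant reuse across sessions
        return json.loads(cache_file.read_text())
    graph = await extract_graph(text)              # expensive LLM extraction
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(graph))
    return graph

def start_prebuild(doc_id, text, extract_graph, tasks):
    # Called during document loading: schedule extraction without blocking
    # the load response; `tasks` maps doc_id -> in-flight asyncio.Task.
    tasks[doc_id] = asyncio.create_task(build_graph(doc_id, text, extract_graph))
```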
Tradeoffs & Constraints
Chose FAISS over Pinecone/Weaviate: No API costs and works offline, but lacks distributed scaling and real-time updates. Suitable for...
NetworkX for knowledge graphs: Simple Python library with disk caching, but limited to single-machine scale. Would need Neo4j/Amazon Neptune for...
Server-Sent Events for streaming: Simpler than WebSockets but one-way only. Can't send client updates during streaming (e.g., stop generation...
BM25 for sparse retrieval: Classic algorithm with good keyword matching, but can't handle synonyms or semantic similarity. Hybrid mode combines both...
OpenAI embeddings: Best quality but API costs (~$0.0001/1K tokens) and requires internet. Local SentenceTransformer is free but lower quality.
In-memory document store: Fast access but requires re-indexing on server restart. Cache directory provides persistence but needs manual cleanup.
Google Gemini 2.0 Flash: Cost-effective and fast, but less capable than GPT-4 for complex reasoning. Trade cost/speed for quality.
Cross-encoder reranking: Highest precision but slow (processes each doc-query pair). Only practical for top-k results (e.g., rerank top 20 from 100).
Graph RAG entity extraction: LLM-based extraction is accurate but expensive. Pre-building and caching amortizes cost across multiple queries.
Would improve: Add streaming stop capability (WebSocket upgrade), implement distributed vector store (Pinecone/Weaviate), add more LLM options...
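The rerank-top-k trade-off noted above (cross-encoder precision versus per-pair cost) comes down to a simple two-stage shape: cheap vector search narrows the pool, then the expensive scorer only touches the survivors. `score_pair` here is a stand-in for the ms-marco-MiniLM cross-encoder's pairwise prediction.

```python
# Sketch of two-stage reranking: score only the top candidates pairwise.
def rerank(query, candidates, score_pair, top_k=5):
    """Order candidate documents by cross-encoder relevance, keep top_k."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```

With 100 initial candidates cut to 20 before reranking, the cross-encoder runs 20 forward passes instead of 100, which is what makes it practical as a final precision stage.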
Outcome & Impact
Production-ready RAG experimentation platform enabling side-by-side comparison of 4 RAG modes (Basic, Self-RAG, Agentic, Graph) and 4 retrieval...
Comprehensive backend with Python 3.13 + FastAPI + LangChain providing 6 API endpoints (/health, /documents, /documents/{id}/pdf, /load-document,...
4 RAG modes implemented: Basic (retrieve-generate), Self-RAG (iterative sufficiency loop up to 3 iterations), Agentic (plan-retrieve-reflect), Graph...
4 retrieval strategies: Vector search (FAISS with OpenAI/local embeddings), Hybrid (BM25+FAISS with RRF fusion), Multi-query expansion (generates...
Automatic knowledge graph extraction with NetworkX: entities/relationships from documents, background pre-building during document loading, disk...
Streaming responses via Server-Sent Events: token-by-token answer display through /ask-stream endpoint, real-time feedback during LLM generation.
Contextual compression: embedding-based sentence filtering to remove irrelevant content from retrieved chunks, reduces LLM context noise and improves...
Smart PDF parsing: structure-aware parsing via unstructured library, domain-specific chunking strategies (general, legal, research, financial),...
Multiple chunking strategies: recursive (character-based with overlap), fixed (equal-sized), semantic (embedding-based similarity), structure-based...
Performance optimizations: parallel retrieval for multi-query expansion, embedder caching (reuses embeddings across requests), hybrid retriever...
Interactive Next.js 16 + React 19 frontend: guided mode with step-by-step document/strategy/mode selection, streaming UI with token-by-token display,...
Comprehensive Makefile automation: 12 commands including setup, dev (backend+frontend), test (quick validation), validate (full with LLM calls),...
GitHub Actions CI/CD: backend checks (dependencies, imports, syntax), frontend checks (ESLint, build), security scan (detect .env files, hardcoded...
Flexible deployment: Vercel for frontend (NEXT_PUBLIC_API_URL env var), GCP/any cloud for backend (OPENAI_API_KEY, GEMINI_API_KEY env vars), .env...
Clean project structure: monorepo with raglab-backend/ (FastAPI, chunking, embeddings, graph, ingestion, retrieval, services) and raglab-frontend/...
MIT licensed for open experimentation and research use.
Tech Stack
Backend: Python 3.13, FastAPI (REST + SSE), Uvicorn
RAG Framework: LangChain for orchestration and chains
Vector Store: FAISS for dense retrieval with approximate NN search
Knowledge Graph: NetworkX for entity/relationship graphs
Sparse Retrieval: BM25 for keyword-based search with RRF fusion
Reranking: Cross-encoder (ms-marco-MiniLM) for relevance scoring
PDF Parsing: unstructured library for layout-aware extraction
Embeddings: OpenAI text-embedding-3-small, SentenceTransformer (local)
LLM: Google Gemini 2.0 Flash for answer generation
Frontend: Next.js 16 (App Router), React 19, TypeScript
Styling: Tailwind CSS for utility-first responsive design
Streaming: Server-Sent Events (SSE) for token-by-token display
Automation: Makefile for project commands and workflows
CI/CD: GitHub Actions (backend/frontend/security checks)