Back to Projects

YNM Dashboard — Infrastructure Data Pipeline

End-to-end data pipeline and interactive dashboard for Indian infrastructure company and consultant data across Aviation, Road, and Rail sectors. Built with 5-stage Python ETL pipeline (Excel to CSV conversion, data combination, initial cleaning, production-grade validation, auto-repair) using pandas and openpyxl, automated via Makefile.

Python 3.10+pandasopenpyxlReact 19Vite 7Tailwind CSS 4TanStack React TableRechartsPapaParseGitHub ActionsMakefile

Role

Data Engineer & Full-stack Developer

Team

Solo

Company/Organization

YNM Safety

The Problem

YNM Safety had scattered infrastructure company and consultant data across multiple Excel files covering Aviation, Road, and Rail sectors with no...

Excel datasets had inconsistent formattingdifferent column names, mixed data types, multi-contact fields in single cells, and irregular pincode...

Data contained duplicate contact entries, invalid email formats (missing @, .com), invalid phone numbers (wrong lengths, non-numeric characters), and...

No automated pipeline to transform raw Excel files into clean, validated datasets ready for analysis and visualization.

Business team needed interactive dashboard to explore companies by industry, state, contact details with real-time filtering and export capabilities.

Manual data cleaning was time-consuming, error-prone, and not repeatable across new data batches.

The Solution

Built a comprehensive data engineering solution with two major components: Python ETL pipeline and React dashboard.

The 5-stage Python ETL pipeline processes raw Excel files into clean, validated CSV datasets:

xlsx_to_csv.py — Converts all .xlsx files in 'excel files/' directory to CSV format in 'csv_data/' using openpyxl, preserving data integrity and...

combine_csv.py — Merges all individual CSVs into a single master file (ynm_master_data.csv) with unified schema, handling column name variations...

clean_data.py — Initial data cleaning and normalization: standardizes column names, splits multi-contact fields (e.g., 'Contact 1, Contact 2'...

final_clean.py — Production-grade cleaning and validation: validates email formats (checks for @, domain, TLD), validates phone numbers (length,...

repair_review_rows.py — Auto-repairs flagged rows using heuristics (fixes common email typos, standardizes phone formats, fills missing data from...

The entire pipeline is automated via Makefile with commands: 'make convert', 'make combine', 'make clean-data', 'make final-clean', 'make...

The React 19 + Vite 7 dashboard provides interactive data exploration:

Two Data Views — 'YNM Project Today' (main infrastructure companies) and 'Road Consultants' (specialized consultants) with tab switching.

TanStack React Table — Sortable columns, paginated rows (10/25/50/100 per page), column visibility toggle, and responsive layout.

Full-Text Search — Search across all columns with real-time filtering and debouncing for performance.

Sidebar Filters — Multi-select dropdowns for Industry (Aviation, Road, Rail), State (all Indian states), City, and other categorical fields with...

Interactive Charts — Recharts-powered visualizations: industry breakdown (pie chart), state distribution (bar chart), company size distribution,...

Statistics Cards — Live counts of total companies, active projects, states covered, industries served, updated in real-time as filters change.

CSV Export — Export filtered data to CSV with current filter settings applied.

Responsive Layout — Mobile-first design with collapsible sidebar and optimized for all screen sizes.

Data loading uses custom React hooks (useCSVData) with PapaParse for efficient client-side CSV parsing.

Tailwind CSS 4 provides utility-first styling with custom color palette and component classes.

GitHub Actions CI/CD runs on every push and PR to main/master:

Python Checks — Syntax validation, ruff linter for code quality, format checks for consistency.

Dashboard Checks — ESLint for code quality, production build verification to catch build errors early.

Deployment supports Vercel (set root directory to 'ynm-dashboard', framework: Vite, build: 'npm run build', output: 'dist') or GCP Cloud Run/App...

Design Decisions

Chose 5-stage Python ETL pipeline over single script for separation of concernseach stage handles one transformation, making debugging easier and...

Used pandas for data manipulationindustry-standard library with powerful DataFrame operations, vectorized computations, and built-in validation...

Built Makefile automation for repeatable buildsensures consistent execution order with dependency management, enables one-command full pipeline runs.

Implemented auto-repair stage (repair_review_rows.py) instead of manual reviewuses heuristics to fix common issues (email typos, phone formats),...

Chose React 19 + Vite 7 for dashboardmodern React features with fast HMR, optimized production builds, and simple configuration.

Used TanStack React Table over custom tableheadless table library with built-in sorting, pagination, filtering, and column management without...

Implemented client-side CSV parsing (PapaParse) instead of databasesimpler deployment, no backend needed, works with static hosting, suitable for...

Built multi-select sidebar filters with live countsimproves UX by showing available options and result counts before applying filters.

Added GitHub Actions CI/CD with separate Python and dashboard checkscatches issues early, ensures code quality, and validates builds before...

Made dashboard fully responsive with Tailwind CSS 4mobile-first design, collapsible sidebar, optimized touch interactions for field use.

Tradeoffs & Constraints

Chose Python + pandas over database ETL: Simpler setup and maintenance for current scale (<100K rows), but would need migration to Spark/Airflow for...

Client-side CSV parsing: Fast and simple for current dataset size, but would need server-side processing or database for datasets >50MB or real-time...

Auto-repair heuristics: Fixes 80%+ of common issues automatically but may need manual review for edge cases. Trade simplicity for occasional...

Static dashboard deployment: No backend needed (Vercel/GCP static hosting), but lacks real-time data sync or user authentication. Suitable for...

5-stage pipeline: Each stage writes intermediate files (ynm_master_data.csv, cleaned versions) for debugging, but increases disk usage. Trade disk...

Multi-select filters without URL params: Filter state lives in React state, not URL, so filters can't be shared via links. Trade shareability for...

Recharts for visualizations: Good for standard charts but limited for complex custom visualizations. Trade customization flexibility for ease of use.

GitHub Actions for CI: Free tier limits (2000 minutes/month for private repos) sufficient for current usage but may need paid tier or self-hosted...

Would improve: Add database backend (PostgreSQL/Supabase) for real-time updates, implement URL-based filter persistence for sharing, add more...

Outcome & Impact

Production-ready data engineering solution processing multiple Excel files (Aviation, Road, Rail sectors) into clean, validated datasets through...

Data quality improvements: validated email formats (checks @, domain, TLD), standardized phone numbers (length, numeric, country codes), split...

Makefile automation enables one-command execution: 'make start' runs full pipeline and launches dashboard, 'make pipeline' processes all data...

Interactive React 19 dashboard with two data views (YNM Project Today, Road Consultants) providing real-time exploration of infrastructure company...

Advanced data table with TanStack React Table: sortable columns (multi-column sort), paginated rows (10/25/50/100 per page), column visibility...

Full-text search across all columns with real-time filtering, debouncing for performance, and highlighted search results.

Multi-select sidebar filters for Industry, State, City, and categorical fields with live counts showing available options and result counts before...

Interactive visualizations with Recharts: industry breakdown pie chart, state distribution bar chart, company size distribution, sector trends over...

Statistics cards with live counts: total companies, active projects, states covered, industries servedupdates in real-time as filters change.

CSV export functionality exports filtered data with current filter settings applied, preserving column order and formatting.

Responsive mobile-first design with Tailwind CSS 4: collapsible sidebar, optimized touch interactions, works on mobile, tablet, laptop, and desktop.

GitHub Actions CI/CD with Python checks (syntax validation, ruff linter, format checks) and dashboard checks (ESLint, production build verification).

Deployment flexibility: Vercel (set root to 'ynm-dashboard', framework Vite, build 'npm run build', output 'dist') or GCP Cloud Run/App Engine (serve...

Clean codebase organization: Python scripts at root (5 pipeline stages), React dashboard in 'ynm-dashboard/' (components, hooks, lib utilities),...

Comprehensive documentation in README: project structure, prerequisites, quick start, data pipeline commands, dashboard commands, tech stack,...

Private project with all rights reserved license suitable for internal YNM Safety business analytics use.

Tech Stack

ETL Pipeline: Python 3.10+, pandas (data manipulation), openpyxl (Excel reading)

Dashboard: React 19 (UI library), Vite 7 (build tool, dev server)

Styling: Tailwind CSS 4 utility-first framework

Data Table: TanStack React Table (sortable, paginated, filterable)

Charts: Recharts for interactive visualizations

CSV Parsing: PapaParse for client-side CSV loading

Automation: Makefile for build pipeline and task management

CI/CD: GitHub Actions (Python ruff lint, dashboard ESLint + build)

Deployment: Vercel (static hosting), GCP Cloud Run / App Engine

Back to Projects