
GCP AI Log Monitoring Agent

AI-powered SRE agent for real-time GCP log monitoring. ReAct agent loop with GPT-4o and 8 LLM-callable tools: fetch GCP logs → Redis sliding-window deduplication → anomaly detection → RCA generation → SMTP email alerts.

Python, OpenAI GPT-4o, FastAPI, Redis, GCP Cloud Logging, GCP Cloud Functions, Pub/Sub, Cloud Scheduler, React, Vite, SMTP, GitHub Actions, Makefile

Role

AI Engineer & Full-stack Developer

Team

Solo

Company/Organization

Personal Project

The Problem

SRE teams monitoring GCP workloads face alert fatigue: high-volume raw logs with no automated root cause analysis require constant human triage,...

Correlating related log errors across multiple GCP services to identify a root cause is manual, time-consuming, and requires deep infrastructure...

Writing RCA reports after incidents is a slow manual process that delays post-mortem documentation and often gets skipped under pressure, leaving...

Duplicate alerts from the same underlying incident flood on-call channels: no deduplication or sliding-window logic means the same error fires...

No open-source tool combined real-time GCP Cloud Logging queries, LLM-powered anomaly detection, automated RCA generation, Redis deduplication, and...

The Solution

Built a ReAct agent orchestrator with modular services and two GCP Cloud Function deployment modes.

Agent Architecture

agent/monitoring_agent.py — Orchestrator running the full monitoring cycle:

1. Fetch recent logs from GCP Cloud Logging via log_service.py

2. Check Redis cache (sliding window) for duplicate log batches

3. Pass logs to GPT-4o with 8 registered tools and system prompt from reasoning.py

4. ReAct loop: LLM calls tools in sequence (query_logs → analyze_anomaly → generate_rca → send_alert → store_incident)

5. Email RCA report via alert_service.py (Gmail SMTP)

6. Store incident in FastAPI backend for dashboard visibility
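The ReAct loop in step 4 can be sketched as follows. This is a minimal illustration, not the code in agent/monitoring_agent.py: `run_react_loop`, `call_llm`, and `max_steps` are hypothetical names, and the message dictionaries assume the OpenAI Chat Completions tool-calling shape. The model call is injected as a plain callable so the loop can be exercised without network access.

```python
import json
from typing import Any, Callable

# Hypothetical registry type: tool name -> Python callable.
ToolRegistry = dict[str, Callable[..., Any]]


def run_react_loop(
    call_llm: Callable[[list[dict]], dict],
    tools: ToolRegistry,
    messages: list[dict],
    max_steps: int = 8,
) -> str:
    """Alternate model turns and tool executions until the model stops requesting tools."""
    for _ in range(max_steps):
        reply = call_llm(messages)  # assistant message as a dict
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            # No tool requested: treat the content as the final answer.
            return reply.get("content") or ""
        for call in tool_calls:
            name = call["function"]["name"]
            try:
                args = json.loads(call["function"]["arguments"] or "{}")
                result = tools[name](**args)
            except Exception as exc:
                # Report the failure back to the model instead of crashing the cycle.
                result = {"error": f"{type(exc).__name__}: {exc}"}
            messages.append(
                {
                    "role": "tool",
                    "tool_call_id": call["id"],
                    "content": json.dumps(result, default=str),
                }
            )
    raise RuntimeError("ReAct loop exceeded max_steps without a final answer")
```

In a real cycle, `call_llm` would wrap the GPT-4o request with the registered tool schemas; the step cap is an assumed safeguard against a model that never stops calling tools.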

agent/tools.py — 8 LLM-callable tools with OpenAI function calling schemas:

query_logs: fetch GCP logs with filters (severity, time range, service)

get_error_summary: aggregate error counts by service and type

analyze_anomaly: detect anomalous patterns in log batches

generate_rca: produce structured RCA report (root cause, affected services, severity, summary)

send_alert: dispatch email via SMTP

store_incident: POST RCA to FastAPI /incident endpoint

flush_cache: clear Redis cache

health_check: verify all services responding
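For illustration, a function-calling schema for one of these tools might look like the sketch below. The outer structure follows the OpenAI Chat Completions `tools` format; the parameter names (`severity`, `minutes_back`, `service`), the description text, and the `missing_arguments` helper are assumptions, not the definitions in agent/tools.py.

```python
import json

# Hypothetical schema for query_logs; parameter names are illustrative.
QUERY_LOGS_TOOL = {
    "type": "function",
    "function": {
        "name": "query_logs",
        "description": "Fetch GCP logs filtered by severity, time range, and service.",
        "parameters": {
            "type": "object",
            "properties": {
                "severity": {
                    "type": "string",
                    "enum": ["DEBUG", "INFO", "NOTICE", "WARNING",
                             "ERROR", "CRITICAL", "ALERT", "EMERGENCY"],
                },
                "minutes_back": {"type": "integer", "minimum": 1},
                "service": {"type": "string"},
            },
            "required": ["severity"],
        },
    },
}


def missing_arguments(tool: dict, arguments_json: str) -> list[str]:
    """Return required parameter names absent from the model-supplied JSON arguments."""
    supplied = json.loads(arguments_json or "{}")
    required = tool["function"]["parameters"].get("required", [])
    return [name for name in required if name not in supplied]
```

Checking required arguments before dispatch is one way to turn a malformed tool call into a message the model can correct, rather than an exception inside the tool.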

agent/reasoning.py — All prompt templates isolated from business logic (system prompt, RCA format, anomaly detection instructions)

Services

services/log_service.py — GCP Cloud Logging client using google-cloud-logging SDK. Queries by project ID, severity filter, time range, and...

services/redis_cache.py — Redis sliding-window cache for log deduplication. Stores hash of recent log batches with TTL. Prevents duplicate...

services/alert_service.py — Gmail SMTP email dispatch. Formats RCA report as HTML email, sends to configured ALERT_EMAIL_TO address.
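The HTML formatting step in the alert service could look roughly like this. It is a sketch using only the standard library: `build_rca_email` and `send_via_gmail` are hypothetical names, the RCA dictionary uses the four fields listed for the report (root_cause, affected_services, severity, summary), and the Gmail host and port (smtp.gmail.com, 465 over SSL) are the commonly documented values rather than settings confirmed by the project.

```python
import smtplib
from email.message import EmailMessage
from html import escape


def build_rca_email(rca: dict, sender: str, recipient: str) -> EmailMessage:
    """Build a multipart message with a plain-text body and an HTML alternative."""
    services = ", ".join(rca["affected_services"])
    msg = EmailMessage()
    msg["Subject"] = f"[{rca['severity'].upper()}] {rca['root_cause']}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"Root cause: {rca['root_cause']}\n"
        f"Affected services: {services}\n"
        f"Severity: {rca['severity']}\n\n"
        f"{rca['summary']}\n"
    )
    # Escape every field so log text cannot inject markup into the email.
    msg.add_alternative(
        "<html><body>"
        f"<h2>{escape(rca['root_cause'])}</h2>"
        f"<p><b>Affected services:</b> {escape(services)}</p>"
        f"<p><b>Severity:</b> {escape(rca['severity'])}</p>"
        f"<p>{escape(rca['summary'])}</p>"
        "</body></html>",
        subtype="html",
    )
    return msg


def send_via_gmail(msg: EmailMessage, user: str, app_password: str) -> None:
    """Send over Gmail SMTP with an App Password (performs network I/O)."""
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
        smtp.login(user, app_password)
        smtp.send_message(msg)
```

Keeping message construction separate from sending means the formatting can be checked without an SMTP connection.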

API Backend (FastAPI)

api/server.py — 7 endpoints:

GET /health: health check

GET /incidents: list all stored RCA reports

POST /incident: store RCA report (root_cause, affected_services, severity, summary)

POST /analyze: run agent on natural language query

POST /logs/query: query GCP logs with filters

GET /logs/error-summary: aggregated error summary

DELETE /cache: flush Redis cache
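The filtered queries behind POST /logs/query (and the query_logs tool) would eventually need to become a Cloud Logging filter expression. The sketch below shows one way to assemble it; `build_log_filter` and its parameters are hypothetical, and `resource.labels.service_name` is an assumed label that applies to some resource types (such as Cloud Run) but not all.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_log_filter(
    severity: str = "ERROR",
    minutes_back: int = 15,
    service: Optional[str] = None,
    now: Optional[datetime] = None,
) -> str:
    """Compose a Cloud Logging filter for entries at or above a severity in a recent window."""
    # `now` should be timezone-aware; it is injectable to keep the function deterministic in tests.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    since = (now - timedelta(minutes=minutes_back)).strftime("%Y-%m-%dT%H:%M:%SZ")
    clauses = [f"severity>={severity}", f'timestamp>="{since}"']
    if service:
        # Assumed label; the right field depends on the monitored resource type.
        clauses.append(f'resource.labels.service_name="{service}"')
    return " AND ".join(clauses)
```

The resulting string can be passed as the `filter_` argument of `list_entries` in the google-cloud-logging client.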

Cloud Function Deployment (cloud/cloud_function.py)

Two entry points for GCP deployment:

log_monitoring_trigger: triggered by Cloud Scheduler → Pub/Sub on a schedule (every N minutes)

...

Both entry points invoke the same monitoring_agent.py orchestrator.
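A Pub/Sub-triggered entry point usually has to decode the message payload before handing off to the orchestrator. The sketch below assumes the background-function event shape, where `event["data"]` is a base64-encoded string; in CloudEvent-style functions the same field sits under `data["message"]["data"]`. `decode_pubsub_payload`, `handle_pubsub_event`, and `run_cycle` are hypothetical names standing in for the real handlers and the call into monitoring_agent.py.

```python
import base64
import json
from typing import Any, Callable


def decode_pubsub_payload(event: dict) -> dict:
    """Decode the base64 Pub/Sub payload, accepting JSON objects or plain text."""
    raw = event.get("data")
    if not raw:
        return {}  # scheduler ticks may carry no body
    text = base64.b64decode(raw).decode("utf-8")
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return {"message": text}
    return payload if isinstance(payload, dict) else {"message": payload}


def handle_pubsub_event(event: dict, run_cycle: Callable[[dict], Any]) -> Any:
    """Shared body for both triggers: decode, then invoke the orchestrator."""
    return run_cycle(decode_pubsub_payload(event))
```

Routing both triggers through one helper mirrors the statement that both entry points invoke the same orchestrator.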

Frontend (React + Vite)

App.jsx: Dashboard root with polling (setInterval) against /incidents endpoint, state management for incident list

IncidentList.jsx: Incident list container with severity filtering

IncidentCard.jsx: Individual incident card (root cause, affected services, severity badge, timestamp, summary)

CI/CD (GitHub Actions)

Backend job: Python 3.13 → pip install → syntax check (py_compile) → import verification → FastAPI app load test

Frontend job: Node 20 → npm ci → vite production build

Design Decisions

ReAct agent loop with OpenAI function calling over a fixed pipeline: the LLM decides which tools to invoke and in what order based on the log...

8 LLM-callable tools with OpenAI function schemas: each tool is a well-defined operation (query, analyze, alert, store). The LLM selects the right...

Prompt templates isolated in reasoning.py: all system prompts, RCA format instructions, and anomaly detection prompts live in one file, never inside...

Redis sliding-window cache for deduplication: prevents the same error burst from triggering multiple RCA cycles and alert emails. TTL-based expiry...

Two Cloud Function deployment modes (scheduled + reactive): scheduled mode provides guaranteed periodic sweeps; reactive log-sink mode provides...

FastAPI incident store over a database: an in-memory incident list in server.py is sufficient for the dashboard use case. Avoids adding PostgreSQL/MongoDB...

Gmail SMTP for email alerts over SendGrid/Mailgun: zero additional API cost, works with any Gmail account and App Password. Simple SMTP is reliable...

React polling dashboard over WebSocket: simpler architecture for an ops dashboard that refreshes every 30s. WebSocket would add server-side...
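The Redis deduplication decision above can be illustrated with a hash-plus-TTL check. This is a sketch under assumptions: the key prefix, the default TTL, and the function names are hypothetical, and it uses a fixed expiry set once per fingerprint, whereas the real cache may refresh or slide the window differently. The `client` is expected to behave like a redis-py client, whose `set(..., nx=True, ex=...)` returns a truthy value only when the key was newly created.

```python
import hashlib
import json


def batch_fingerprint(entries: list[dict]) -> str:
    """Order-independent SHA-256 fingerprint of a log batch."""
    canonical = sorted(json.dumps(entry, sort_keys=True) for entry in entries)
    return hashlib.sha256(json.dumps(canonical).encode("utf-8")).hexdigest()


def is_duplicate(client, entries: list[dict], ttl_seconds: int = 300) -> bool:
    """Return True if this batch was already seen within the TTL window."""
    key = f"logbatch:{batch_fingerprint(entries)}"  # hypothetical key prefix
    # SET with NX and EX is atomic: only the first caller in the window creates the key.
    created = client.set(key, "1", nx=True, ex=ttl_seconds)
    return not created
```

Because the check and the write happen in one command, two overlapping monitoring cycles cannot both treat the same batch as new.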

Tradeoffs & Constraints

In-memory incident store in FastAPI: incidents are lost on server restart. Production use requires persistent storage (PostgreSQL, Firestore) for...

GPT-4o latency per monitoring cycle (~5-15s for ReAct loop): acceptable for scheduled monitoring but may be too slow for sub-minute reactive...

Redis required as local dependency: adds operational complexity for local development. Cloud Memorystore or Upstash Redis would simplify deployment...

Gmail SMTP with App Password: requires 2FA on the sender account and manual App Password generation. Not suitable for team-shared alert accounts; a...

Google service account JSON key as file path: GOOGLE_APPLICATION_CREDENTIALS points to a local file, which doesn't work in serverless environments....

No alert deduplication beyond Redis TTL: if two monitoring cycles run before TTL expires, the second will be deduplicated even if the incident is...

Would improve: Add persistent incident storage (Firestore), implement smarter alert fingerprinting, support Slack/PagerDuty alerting alongside email,...

Outcome & Impact

Production-deployable GCP SRE agent with ReAct loop, 8 LLM-callable tools, Redis deduplication, SMTP alerting, and two GCP Cloud Function deployment...

ReAct agent orchestrator runs full monitoring cycle autonomously: fetch GCP logs → Redis deduplication check → GPT-4o ReAct loop (tool selection +...

8 LLM-callable tools with OpenAI function schemas enable flexible agent behaviour: the LLM selects tool sequence based on log context rather than fixed...

Two Cloud Function entry points cover both monitoring modes: log_monitoring_trigger (Cloud Scheduler → Pub/Sub, periodic sweeps) and pubsub_trigger...

7 FastAPI endpoints provide full API surface: health check, incident listing and storage, natural language log analysis, filtered log queries, error aggregation,...

Redis sliding-window cache prevents duplicate RCA cycles and alert emails for the same error burst, reducing alert fatigue.

React dashboard polls /incidents endpoint for real-time incident visibility with severity filtering and structured RCA display (root cause, affected...

GitHub Actions CI/CD validates every push: Python syntax check, import verification, FastAPI app load test, Node 20 frontend build.

Tech Stack

Agent: Python, OpenAI GPT-4o (ReAct loop with function calling), 8 LLM-callable tools

GCP: Cloud Logging SDK (google-cloud-logging), Cloud Functions (python313 runtime), Cloud Scheduler, Pub/Sub

Cache: Redis (sliding-window deduplication, TTL-based expiry)

Backend: FastAPI (7 endpoints, in-memory incident store), Uvicorn

Alerting: Gmail SMTP (smtplib, HTML email dispatch)

Frontend: React, Vite, polling dashboard (IncidentList, IncidentCard components)

CI/CD: GitHub Actions (backend syntax/import/load checks + frontend build)

Automation: Makefile (install, dev, backend, frontend, monitor, check, stop, clean)
