GCP AI Log Monitoring Agent
AI-powered SRE agent for real-time GCP log monitoring. ReAct agent loop with GPT-4o and 8 LLM-callable tools: fetch GCP logs → Redis sliding-window deduplication → anomaly detection → RCA generation → SMTP email alerts.
Role
AI Engineer & Full-stack Developer
Team
Solo
Company/Organization
Personal Project
The Problem
SRE teams monitoring GCP workloads face alert fatigue — high-volume raw logs with no automated root cause analysis require constant human triage,...
Correlating related log errors across multiple GCP services to identify a root cause is manual, time-consuming, and requires deep infrastructure...
Writing RCA reports after incidents is a slow manual process that delays post-mortem documentation and often gets skipped under pressure, leaving...
Duplicate alerts from the same underlying incident flood on-call channels — no deduplication or sliding-window logic means the same error fires...
No open-source tool combined real-time GCP Cloud Logging queries, LLM-powered anomaly detection, automated RCA generation, Redis deduplication, and...
The Solution
Built a ReAct agent orchestrator with modular services and two GCP Cloud Function deployment modes.
Agent Architecture
agent/monitoring_agent.py — Orchestrator running the full monitoring cycle:
1. Fetch recent logs from GCP Cloud Logging via log_service.py
2. Check Redis cache (sliding window) for duplicate log batches
3. Pass logs to GPT-4o with 8 registered tools and system prompt from reasoning.py
4. ReAct loop: LLM calls tools in sequence (query_logs → analyze_anomaly → generate_rca → send_alert → store_incident)
5. Email RCA report via alert_service.py (Gmail SMTP)
6. Store incident in FastAPI backend for dashboard visibility
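The ReAct loop in step 4 can be sketched as follows. This is a minimal illustration, not the project's actual code: run_react_loop, the chat callable, and max_steps are hypothetical names, and chat stands in for a wrapper around the GPT-4o call that returns an assistant message as a plain dict. The message shapes follow the OpenAI Chat Completions tool-calling format.

```python
import json
from typing import Callable

# Hypothetical tool registry: tool name -> plain Python callable.
ToolRegistry = dict[str, Callable[..., dict]]


def run_react_loop(chat: Callable[[list[dict]], dict],
                   tools: ToolRegistry,
                   messages: list[dict],
                   max_steps: int = 8) -> str:
    """Alternate model turns and tool executions until the model answers
    without requesting a tool, or the step budget runs out."""
    for _ in range(max_steps):
        reply = chat(messages)              # one model turn (assumed wrapper)
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:                  # no tool requested: final answer
            return reply.get("content") or ""
        for call in tool_calls:
            name = call["function"]["name"]
            args = json.loads(call["function"]["arguments"] or "{}")
            handler = tools.get(name)
            if handler is None:
                result = {"error": f"unknown tool {name}"}
            else:
                result = handler(**args)
            # Feed the observation back so the model can plan its next step.
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result),
            })
    return "step budget exhausted"
```

Capping the number of steps keeps a confused model from looping on tool calls indefinitely, which matters when every turn costs an API round trip.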
agent/tools.py — 8 LLM-callable tools with OpenAI function calling schemas:
query_logs — fetch GCP logs with filters (severity, time range, service)
get_error_summary — aggregate error counts by service and type
analyze_anomaly — detect anomalous patterns in log batches
generate_rca — produce structured RCA report (root cause, affected services, severity, summary)
send_alert — dispatch email via SMTP
store_incident — POST RCA to FastAPI /incident endpoint
flush_cache — clear Redis cache
health_check — verify all services responding
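For illustration, a function calling schema for query_logs might look like the sketch below. The parameter names (severity, minutes_back, service) and the validate_arguments helper are assumptions made for this example; the actual schemas in tools.py are not shown here.

```python
import json

# Assumed schema for the query_logs tool; parameter names are illustrative.
QUERY_LOGS_TOOL = {
    "type": "function",
    "function": {
        "name": "query_logs",
        "description": "Fetch GCP logs filtered by severity, time range, and service.",
        "parameters": {
            "type": "object",
            "properties": {
                "severity": {"type": "string",
                             "enum": ["WARNING", "ERROR", "CRITICAL"]},
                "minutes_back": {"type": "integer", "minimum": 1},
                "service": {"type": "string"},
            },
            "required": ["severity"],
        },
    },
}


def validate_arguments(tool: dict, raw_arguments: str) -> dict:
    """Parse the model's JSON argument string and reject missing or unknown keys."""
    params = tool["function"]["parameters"]
    args = json.loads(raw_arguments)
    missing = [k for k in params.get("required", []) if k not in args]
    unknown = [k for k in args if k not in params["properties"]]
    if missing or unknown:
        raise ValueError(f"missing={missing} unknown={unknown}")
    return args
```

Checking arguments before dispatch is a cheap guard, since the model produces the argument string as free-form JSON.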
agent/reasoning.py — All prompt templates isolated from business logic (system prompt, RCA format, anomaly detection instructions)
Services
services/log_service.py — GCP Cloud Logging client using google-cloud-logging SDK. Queries by project ID, severity filter, time range, and...
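A query like this comes down to a Cloud Logging filter string. The sketch below builds one from a severity floor, a look-back window, and an optional service name. build_log_filter and its parameters are hypothetical, and the resource.labels.service_name label is an assumption that fits Cloud Run style resources; other resource types use different labels.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_log_filter(min_severity: str = "ERROR",
                     minutes_back: int = 10,
                     service: Optional[str] = None,
                     now: Optional[datetime] = None) -> str:
    """Build a Cloud Logging filter for recent entries at or above a severity."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    since = (now - timedelta(minutes=minutes_back)).strftime("%Y-%m-%dT%H:%M:%SZ")
    clauses = [f"severity>={min_severity}", f'timestamp>="{since}"']
    if service:
        # Assumption: the service is identified by a Cloud Run style label.
        clauses.append(f'resource.labels.service_name="{service}"')
    return " AND ".join(clauses)


# The resulting string would be passed to the SDK, for example:
#   google.cloud.logging.Client(project=...).list_entries(filter_=log_filter)
```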
services/redis_cache.py — Redis sliding-window cache for log deduplication. Stores hash of recent log batches with TTL. Prevents duplicate...
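One way to implement hash-with-TTL deduplication is a single SET with NX and EX, sketched below. The function names and key prefix are hypothetical. Note that this variant uses a fixed TTL counted from the first sighting; a sliding variant would refresh the expiry on each repeat. InMemoryClient is only a stand-in so the example runs without a Redis server; a redis-py redis.Redis instance exposes the same set(name, value, nx=..., ex=...) call.

```python
import hashlib
import json
import time


def batch_fingerprint(entries: list[dict]) -> str:
    """Stable SHA-256 over a log batch, independent of entry order."""
    canonical = sorted(json.dumps(e, sort_keys=True) for e in entries)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def is_new_batch(client, entries: list[dict], ttl_seconds: int = 300) -> bool:
    """Return True the first time a batch is seen within the TTL window."""
    key = f"logbatch:{batch_fingerprint(entries)}"   # hypothetical key prefix
    # SET key 1 NX EX ttl succeeds only if the key does not exist yet.
    return bool(client.set(key, "1", nx=True, ex=ttl_seconds))


class InMemoryClient:
    """Tiny stand-in mimicking the subset of redis-py's set() used above."""

    def __init__(self):
        self._expiry = {}

    def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        expires = self._expiry.get(key)
        if nx and expires is not None and expires > now:
            return None                      # key still live: NX fails
        self._expiry[key] = now + (ex if ex is not None else float("inf"))
        return True
```

Doing the existence check and the write in one atomic command avoids a race between two monitoring cycles that see the same batch at the same time.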
services/alert_service.py — Gmail SMTP email dispatch. Formats RCA report as HTML email, sends to configured ALERT_EMAIL_TO address.
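A minimal sketch of the HTML email step using only the standard library is shown below. build_alert_email and send_alert are hypothetical names, and the HTML layout is invented for illustration; the RCA fields match the ones listed for generate_rca. Sending assumes Gmail's SMTP endpoint over SSL (smtp.gmail.com, port 465) with an App Password.

```python
import html
import smtplib
from email.message import EmailMessage


def build_alert_email(rca: dict, sender: str, recipient: str) -> EmailMessage:
    """Format an RCA report as a multipart email (plain text plus HTML)."""
    services = ", ".join(rca["affected_services"])
    msg = EmailMessage()
    msg["Subject"] = f"[{rca['severity'].upper()}] {rca['root_cause']}"
    msg["From"] = sender
    msg["To"] = recipient
    # Plain-text part for clients that do not render HTML.
    msg.set_content(
        f"Root cause: {rca['root_cause']}\n"
        f"Affected services: {services}\n"
        f"Severity: {rca['severity']}\n\n{rca['summary']}\n"
    )
    esc = html.escape  # LLM-generated text is escaped before embedding in HTML
    msg.add_alternative(
        "<html><body>"
        f"<h2>{esc(rca['root_cause'])}</h2>"
        f"<p><b>Severity:</b> {esc(rca['severity'])}</p>"
        f"<p><b>Affected services:</b> {esc(services)}</p>"
        f"<p>{esc(rca['summary'])}</p>"
        "</body></html>",
        subtype="html",
    )
    return msg


def send_alert(msg: EmailMessage, username: str, app_password: str) -> None:
    """Send via Gmail SMTP over SSL; credentials are supplied by the caller."""
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
        smtp.login(username, app_password)
        smtp.send_message(msg)
```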
API Backend (FastAPI)
api/server.py — 7 endpoints:
GET /health — health check
GET /incidents — list all stored RCA reports
POST /incident — store RCA report (root_cause, affected_services, severity, summary)
POST /analyze — run agent on natural language query
POST /logs/query — query GCP logs with filters
GET /logs/error-summary — aggregated error summary
DELETE /cache — flush Redis cache
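The in-memory store behind POST /incident and GET /incidents could be as small as the sketch below. The FastAPI route wiring is left out so the example runs with the standard library alone; Incident, IncidentStore, and the method names are hypothetical, and the timestamp field is assumed to be set server-side.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Incident:
    root_cause: str
    affected_services: list[str]
    severity: str
    summary: str
    # Assumption: the server stamps each incident on arrival.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class IncidentStore:
    """Process-local list of incidents; contents are lost on restart."""

    def __init__(self) -> None:
        self._items: list[Incident] = []

    def add(self, payload: dict) -> dict:
        # Raises TypeError if a required field is missing or an unknown key is sent.
        incident = Incident(**payload)
        self._items.append(incident)
        return asdict(incident)

    def all_incidents(self, severity: Optional[str] = None) -> list[dict]:
        """Return incidents newest first, optionally filtered by severity."""
        items = [i for i in reversed(self._items)
                 if severity is None or i.severity == severity]
        return [asdict(i) for i in items]
```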
Cloud Function Deployment (cloud/cloud_function.py)
Two entry points for GCP deployment:
log_monitoring_trigger: triggered by Cloud Scheduler → Pub/Sub on a schedule (every N minutes)
...
Both entry points invoke the same monitoring_agent.py orchestrator.
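A rough sketch of how two entry points can share one orchestrator is given below. It assumes the background-function signature (event, context) with a base64-encoded Pub/Sub payload under event["data"]; the real cloud_function.py may use a different signature. run_monitoring_cycle is a hypothetical stand-in for the call into monitoring_agent.py, and labelling pubsub_trigger as the reactive mode is an assumption.

```python
import base64
import json


def decode_pubsub_payload(event: dict) -> dict:
    """Decode the base64 'data' field of a Pub/Sub event into a dict."""
    raw = event.get("data")
    if not raw:
        return {}
    text = base64.b64decode(raw).decode("utf-8")
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return {"text": text}
    return parsed if isinstance(parsed, dict) else {"text": text}


def run_monitoring_cycle(trigger: str, payload: dict) -> dict:
    # Hypothetical stand-in for the monitoring_agent.py orchestrator.
    return {"trigger": trigger, "payload": payload}


def log_monitoring_trigger(event: dict, context=None) -> dict:
    """Scheduled mode: Cloud Scheduler publishes to Pub/Sub every N minutes."""
    return run_monitoring_cycle("scheduled", decode_pubsub_payload(event))


def pubsub_trigger(event: dict, context=None) -> dict:
    """Reactive mode (assumed): invoked when a message arrives on the topic."""
    return run_monitoring_cycle("reactive", decode_pubsub_payload(event))
```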
Frontend (React + Vite)
App.jsx — Dashboard root with polling (setInterval) against /incidents endpoint, state management for incident list
IncidentList.jsx — Incident list container with severity filtering
IncidentCard.jsx — Individual incident card (root cause, affected services, severity badge, timestamp, summary)
CI/CD (GitHub Actions)
Backend job: Python 3.13 → pip install → syntax check (py_compile) → import verification → FastAPI app load check
Frontend job: Node 20 → npm ci → vite production build
Design Decisions
ReAct agent loop with OpenAI function calling over a fixed pipeline — the LLM decides which tools to invoke and in what order based on the log...
8 LLM-callable tools with OpenAI function schemas — each tool is a well-defined operation (query, analyze, alert, store). The LLM selects the right...
Prompt templates isolated in reasoning.py — all system prompts, RCA format instructions, and anomaly detection prompts live in one file, never inside...
Redis sliding-window cache for deduplication — prevents the same error burst from triggering multiple RCA cycles and alert emails. TTL-based expiry...
Two Cloud Function deployment modes (scheduled + reactive) — scheduled mode provides guaranteed periodic sweeps; reactive log-sink mode provides...
FastAPI incident store over a database: the in-memory incident list in server.py is sufficient for the dashboard use case. Avoids adding PostgreSQL/MongoDB...
Gmail SMTP for email alerts over SendGrid/Mailgun — zero additional API cost, works with any Gmail account and App Password. Simple SMTP is reliable...
React polling dashboard over WebSocket — simpler architecture for an ops dashboard that refreshes every 30s. WebSocket would add server-side...
Tradeoffs & Constraints
In-memory incident store in FastAPI — incidents are lost on server restart. Production use requires persistent storage (PostgreSQL, Firestore) for...
GPT-4o latency per monitoring cycle (~5-15s for ReAct loop) — acceptable for scheduled monitoring but may be too slow for sub-minute reactive...
Redis required as local dependency — adds operational complexity for local development. Cloud Memorystore or Upstash Redis would simplify deployment...
Gmail SMTP with App Password — requires 2FA on the sender account and manual App Password generation. Not suitable for team-shared alert accounts; a...
Google service account JSON key as file path — GOOGLE_APPLICATION_CREDENTIALS points to a local file, which doesn't work in serverless environments....
No alert deduplication beyond Redis TTL — if two monitoring cycles run before TTL expires, the second will be deduplicated even if the incident is...
Would improve: Add persistent incident storage (Firestore), implement smarter alert fingerprinting, support Slack/PagerDuty alerting alongside email,...
Outcome & Impact
Production-deployable GCP SRE agent with ReAct loop, 8 LLM-callable tools, Redis deduplication, SMTP alerting, and two GCP Cloud Function deployment...
ReAct agent orchestrator runs full monitoring cycle autonomously: fetch GCP logs → Redis deduplication check → GPT-4o ReAct loop (tool selection +...
8 LLM-callable tools with OpenAI function schemas enable flexible agent behaviour — LLM selects tool sequence based on log context rather than fixed...
Two Cloud Function entry points cover both monitoring modes: log_monitoring_trigger (Cloud Scheduler → Pub/Sub, periodic sweeps) and pubsub_trigger...
7 FastAPI endpoints provide the full API surface: health check, incident creation and listing, natural language log analysis, filtered log queries, error aggregation,...
Redis sliding-window cache prevents duplicate RCA cycles and alert emails for the same error burst, reducing alert fatigue.
React dashboard polls the /incidents endpoint for near-real-time incident visibility with severity filtering and structured RCA display (root cause, affected...
GitHub Actions CI/CD validates every push: Python syntax check, import verification, FastAPI app load check, Node 20 frontend build.
Tech Stack
Agent: Python, OpenAI GPT-4o (ReAct loop with function calling), 8 LLM-callable tools
GCP: Cloud Logging SDK (google-cloud-logging), Cloud Functions (python313 runtime), Cloud Scheduler, Pub/Sub
Cache: Redis (sliding-window deduplication, TTL-based expiry)
Backend: FastAPI (7 endpoints, in-memory incident store), Uvicorn
Alerting: Gmail SMTP (smtplib, HTML email dispatch)
Frontend: React, Vite, polling dashboard (IncidentList, IncidentCard components)
CI/CD: GitHub Actions (backend syntax/import/load checks + frontend build)
Automation: Makefile (install, dev, backend, frontend, monitor, check, stop, clean)