GCP AI Log Monitoring Agent

AI-powered SRE agent for real-time GCP log monitoring. ReAct agent loop with GPT-4o and 8 LLM-callable tools: fetch GCP logs → Redis sliding-window deduplication → anomaly detection → RCA generation → SMTP email alerts.

PythonOpenAI GPT-4oFastAPIRedisGCP Cloud LoggingGCP Cloud FunctionsPub/SubCloud SchedulerReactViteSMTPGitHub ActionsMakefile

View Code

Role

AI Engineer & Full-stack Developer

Team

Solo

Company/Organization

Personal Project

The Problem

•

SRE teams monitoring GCP workloads face alert fatigue — high-volume raw logs with no automated root cause analysis require constant human triage,...

•

Correlating related log errors across multiple GCP services to identify a root cause is manual, time-consuming, and requires deep infrastructure...

•

Writing RCA reports after incidents is a slow manual process that delays post-mortem documentation and often gets skipped under pressure, leaving...

•

Duplicate alerts from the same underlying incident flood on-call channels — no deduplication or sliding-window logic means the same error fires...

•

No open-source tool combined real-time GCP Cloud Logging queries, LLM-powered anomaly detection, automated RCA generation, Redis deduplication, and...

The Solution

•

Built a ReAct agent orchestrator with modular services and two GCP Cloud Function deployment modes.

Agent Architecture

•

agent/monitoring_agent.py — Orchestrator running the full monitoring cycle:

•

1. Fetch recent logs from GCP Cloud Logging via log_service.py

•

2. Check Redis cache (sliding window) for duplicate log batches

•

3. Pass logs to GPT-4o with 8 registered tools and system prompt from reasoning.py

•

4. ReAct loop: LLM calls tools in sequence (query_logs → analyze_anomaly → generate_rca → send_alert → store_incident)

•

5. Email RCA report via alert_service.py (Gmail SMTP)

•

6. Store incident in FastAPI backend for dashboard visibility

•

agent/tools.py — 8 LLM-callable tools with OpenAI function calling schemas:

•

query_logs — fetch GCP logs with filters (severity, time range, service)

•

get_error_summary — aggregate error counts by service and type

•

analyze_anomaly — detect anomalous patterns in log batches

•

generate_rca — produce structured RCA report (root cause, affected services, severity, summary)

•

send_alert — dispatch email via SMTP

•

store_incident — POST RCA to FastAPI /incident endpoint

•

flush_cache — clear Redis cache

•

health_check — verify all services responding

•

agent/reasoning.py — All prompt templates isolated from business logic (system prompt, RCA format, anomaly detection instructions)

Services

•

services/log_service.py — GCP Cloud Logging client using google-cloud-logging SDK. Queries by project ID, severity filter, time range, and...

•

services/redis_cache.py — Redis sliding-window cache for log deduplication. Stores hash of recent log batches with TTL. Prevents duplicate...

•

services/alert_service.py — Gmail SMTP email dispatch. Formats RCA report as HTML email, sends to configured ALERT_EMAIL_TO address.

API Backend (FastAPI)

•

api/server.py — 7 endpoints:

•

GET /health — health check

•

GET /incidents — list all stored RCA reports

•

POST /incident — store RCA report (root_cause, affected_services, severity, summary)

•

POST /analyze — run agent on natural language query

•

POST /logs/query — query GCP logs with filters

•

GET /logs/error-summary — aggregated error summary

•

DELETE /cache — flush Redis cache

Cloud Function Deployment (cloud/cloud_function.py)

•

Two entry points for GCP deployment: - log_monitoring_trigger — triggered by Cloud Scheduler → Pub/Sub on a schedule (every N minutes) -...

•

Both entry points invoke the same monitoring_agent.py orchestrator.

Frontend (React + Vite)

•

App.jsx — Dashboard root with polling (setInterval) against /incidents endpoint, state management for incident list

•

IncidentList.jsx — Incident list container with severity filtering

•

IncidentCard.jsx — Individual incident card (root cause, affected services, severity badge, timestamp, summary)

CI/CD (GitHub Actions)

•

Backend job: Python 3.13 → pip install → syntax check (py_compile) → import verification → FastAPI app load test

•

Frontend job: Node 20 → npm ci → vite production build

Design Decisions

•

ReAct agent loop with OpenAI function calling over a fixed pipeline — the LLM decides which tools to invoke and in what order based on the log...

•

8 LLM-callable tools with OpenAI function schemas — each tool is a well-defined operation (query, analyze, alert, store). The LLM selects the right...

•

Prompt templates isolated in reasoning.py — all system prompts, RCA format instructions, and anomaly detection prompts live in one file, never inside...

•

Redis sliding-window cache for deduplication — prevents the same error burst from triggering multiple RCA cycles and alert emails. TTL-based expiry...

•

Two Cloud Function deployment modes (scheduled + reactive) — scheduled mode provides guaranteed periodic sweeps; reactive log-sink mode provides...

•

FastAPI incident store over a database — in-memory incident list in server.py sufficient for dashboard use case. Avoids adding PostgreSQL/MongoDB...

•

Gmail SMTP for email alerts over SendGrid/Mailgun — zero additional API cost, works with any Gmail account and App Password. Simple SMTP is reliable...

•

React polling dashboard over WebSocket — simpler architecture for an ops dashboard that refreshes every 30s. WebSocket would add server-side...

Tradeoffs & Constraints

•

In-memory incident store in FastAPI — incidents are lost on server restart. Production use requires persistent storage (PostgreSQL, Firestore) for...

•

GPT-4o latency per monitoring cycle (~5-15s for ReAct loop) — acceptable for scheduled monitoring but may be too slow for sub-minute reactive...

•

Redis required as local dependency — adds operational complexity for local development. Cloud Memorystore or Upstash Redis would simplify deployment...

•

Gmail SMTP with App Password — requires 2FA on the sender account and manual App Password generation. Not suitable for team-shared alert accounts; a...

•

Google service account JSON key as file path — GOOGLE_APPLICATION_CREDENTIALS points to a local file, which doesn't work in serverless environments....

•

No alert deduplication beyond Redis TTL — if two monitoring cycles run before TTL expires, the second will be deduplicated even if the incident is...

•

Would improve: Add persistent incident storage (Firestore), implement smarter alert fingerprinting, support Slack/PagerDuty alerting alongside email,...

Outcome & Impact

•

Production-deployable GCP SRE agent with ReAct loop, 8 LLM-callable tools, Redis deduplication, SMTP alerting, and two GCP Cloud Function deployment...

•

ReAct agent orchestrator runs full monitoring cycle autonomously: fetch GCP logs → Redis deduplication check → GPT-4o ReAct loop (tool selection +...

•

8 LLM-callable tools with OpenAI function schemas enable flexible agent behaviour — LLM selects tool sequence based on log context rather than fixed...

•

Two Cloud Function entry points cover both monitoring modes: log_monitoring_trigger (Cloud Scheduler → Pub/Sub, periodic sweeps) and pubsub_trigger...

•

7 FastAPI endpoints provide full API surface: health check, incident CRUD, natural language log analysis, filtered log queries, error aggregation,...

•

Redis sliding-window cache prevents duplicate RCA cycles and alert emails for the same error burst, reducing alert fatigue.

•

React dashboard polls /incidents endpoint for real-time incident visibility with severity filtering and structured RCA display (root cause, affected...

•

GitHub Actions CI/CD validates every push: Python syntax check, import verification, FastAPI app load test, Node 20 frontend build.

Tech Stack

•

Agent: Python, OpenAI GPT-4o (ReAct loop with function calling), 8 LLM-callable tools

•

GCP: Cloud Logging SDK (google-cloud-logging), Cloud Functions (python313 runtime), Cloud Scheduler, Pub/Sub

•

Cache: Redis (sliding-window deduplication, TTL-based expiry)

•

Backend: FastAPI (7 endpoints, in-memory incident store), Uvicorn

•

Alerting: Gmail SMTP (smtplib, HTML email dispatch)

•

Frontend: React, Vite, polling dashboard (IncidentList, IncidentCard components)

•

CI/CD: GitHub Actions (backend syntax/import/load checks + frontend build)

•

Automation: Makefile (install, dev, backend, frontend, monitor, check, stop, clean)

Back to Projects