Homelab AI Monitoring Agent
An AI-powered infrastructure monitoring agent that watches over 25+ self-hosted LXC containers, analyses metrics using AWS Bedrock, and delivers intelligent alerts via Telegram — with inline action buttons for semi-autonomous remediation.
What It Does
The agent runs on a dedicated Proxmox LXC container and executes a pipeline every 15 minutes: collect metrics from the Proxmox API → pre-summarise to ~500 tokens → store in SQLite → send to AWS Bedrock for AI analysis → deliver actionable alerts via Telegram.
When thresholds are breached, alerts arrive with inline Telegram buttons — acknowledge, snooze, or request deeper analysis. For critical issues the agent can propose remediation actions (log cleanup, container restart) that require explicit approval before executing — and always takes a Proxmox snapshot first as a safety net.
Architecture
Proxmox API (read-only token)
│
▼
metrics_collector.py ──→ SQLite DB
│
▼
bedrock_analyzer.py ──→ AWS Bedrock (Nova Micro/Lite/Pro)
│
▼
alert_monitor.py ──→ Telegram Bot (inline buttons)
│
▼
action_executor.py ──→ Proxmox API (snapshot → remediate)
│
▼
apprise_notifier.py ──→ ntfy (failsafe blast channel)
Orchestration: Three n8n workflows replace systemd timers — metrics collection every 15 minutes, AI analysis every 6 hours, alert monitoring every 15 minutes. All visible and editable through a web UI.
Technical Highlights
- Tiered AI model routing — Nova Micro for routine 6-hourly checks (~$0.00004), Nova Lite for daily reports, Nova Pro for critical alerts. Cost stays under £1/month.
- Pre-summarisation — raw Proxmox JSON is compressed to ~500 tokens before reaching Bedrock, reducing token costs by ~90% versus sending raw data.
- State-aware alerting — SQLite tracks alert state so each breach fires exactly once, not on every 15-minute cycle.
- Snapshot-first safety — every remediation action triggers a Proxmox snapshot before execution, enabling one-command rollback.
- Circuit breaker — blocks further automated actions if more than 3 have executed in the last hour.
- Multi-channel failsafe — Apprise simultaneously dispatches to Telegram and ntfy, ensuring alerts reach at least one channel even if the primary bot is throttled.
- Per-container overrides — containers that legitimately run hot (media server, download clients) get individual threshold overrides rather than inflating global limits.
Challenges Overcome
- Telegram flood control — creating a new Bot instance per container in a loop triggers Telegram’s flood control (599-second retry). Solved by passing a single shared Bot instance across the loop with a 1-second inter-message delay.
- Bedrock response formatting — Nova models occasionally wrap output in markdown code fences. Added post-processing to strip these before using the response.
- Orphaned alert records — if alert_monitor.py crashes mid-cycle, some containers are partially recorded as new alerts on the next run. Solved by tracking completion state and adding a cleanup utility.
- n8n SSH reliability — the dedicated SSH node proved less reliable than the Execute Command node with SSH credentials for long-running Python scripts.
Tech Stack
- Python 3 (requests, boto3, python-telegram-bot, apprise, pyyaml)
- AWS Bedrock — Amazon Nova Micro / Lite / Pro
- SQLite (metrics, alert state, action history)
- Proxmox VE API (read-only token auth)
- Telegram Bot API (inline keyboards, callback handlers)
- n8n (workflow orchestration, replaces systemd timers)
- Apprise (multi-channel notification library)
Current Status
Live and running in production. Monitoring 25+ containers with alerts active. ~£0.80/month AWS cost. Phase 3 remediation actions (log cleanup, container restart) deployed and tested.
What I Learned
Building this taught me that the hardest part of an AI-powered system isn’t the AI — it’s the plumbing. Getting reliable data collection, sensible state management, and safe action execution right takes more thought than the Bedrock API calls. The pre-summarisation approach was the single best architectural decision: it made the system both cheaper and faster.
Part of an ongoing homelab AI infrastructure project.