An AI-powered infrastructure monitoring agent that watches over 25+ self-hosted LXC containers, analyses metrics using AWS Bedrock, and delivers intelligent alerts via Telegram — with inline action buttons for semi-autonomous remediation.


What It Does

The agent runs on a dedicated Proxmox LXC container and executes a scheduled pipeline: collect metrics from the Proxmox API every 15 minutes → pre-summarise to ~500 tokens → store in SQLite → send to AWS Bedrock for AI analysis every 6 hours → deliver actionable alerts via Telegram.
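As a rough illustration of the pre-summarise step, the sketch below collapses each container's raw metrics into one short line before anything is sent to the model. It is not the project's actual code: the function name `summarise_metrics`, the input field names (`vmid`, `name`, `status`, `cpu`, `mem`, `maxmem`, `disk`, `maxdisk`) and the output format are all assumptions chosen for the example.

```python
def summarise_metrics(resources):
    """Compress raw per-container dicts into one compact line each.

    `resources` is assumed to be a list of dicts with the (hypothetical)
    keys used below; cpu is assumed to be a 0..1 fraction.
    """
    lines = []
    for r in sorted(resources, key=lambda item: item["vmid"]):
        cpu_pct = round(r.get("cpu", 0.0) * 100)
        # Guard against missing or zero maximums to avoid division by zero.
        mem_pct = round(100 * r.get("mem", 0) / r["maxmem"]) if r.get("maxmem") else 0
        disk_pct = round(100 * r.get("disk", 0) / r["maxdisk"]) if r.get("maxdisk") else 0
        lines.append(
            f'{r["vmid"]} {r.get("name", "?")} {r.get("status", "?")} '
            f"cpu={cpu_pct}% mem={mem_pct}% disk={disk_pct}%"
        )
    return "\n".join(lines)
```

Dropping everything the model does not need (timestamps, raw byte counts, unused fields) is what keeps the prompt small; the exact fields worth keeping depend on what the analysis prompt asks about.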

When thresholds are breached, alerts arrive with inline Telegram buttons — acknowledge, snooze, or request deeper analysis. For critical issues the agent can propose remediation actions (log cleanup, container restart) that require explicit approval before executing — and always takes a Proxmox snapshot first as a safety net.
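A minimal sketch of how an alert with inline buttons might be assembled is shown below. It builds a plain dictionary in the shape the Telegram Bot API expects for a `sendMessage` request with an inline keyboard, without using any client library. The function name `build_alert_payload`, the button labels and the `action:container:metric` callback format are assumptions for illustration, not the project's actual scheme.

```python
import json


def build_alert_payload(chat_id, container, metric, value, threshold):
    """Build a sendMessage-style payload with three inline buttons."""
    text = f"{container}: {metric} at {value}% (threshold {threshold}%)"
    buttons = []
    for label, action in (("Acknowledge", "ack"), ("Snooze", "snooze"), ("Analyse", "analyse")):
        data = f"{action}:{container}:{metric}"  # hypothetical callback format
        # Telegram limits callback_data to 64 bytes.
        if len(data.encode("utf-8")) > 64:
            raise ValueError(f"callback_data too long: {data!r}")
        buttons.append({"text": label, "callback_data": data})
    return {
        "chat_id": chat_id,
        "text": text,
        # One row containing all three buttons.
        "reply_markup": {"inline_keyboard": [buttons]},
    }


if __name__ == "__main__":
    print(json.dumps(build_alert_payload(12345, "ct101", "cpu", 93, 85), indent=2))
```

Encoding the action and target in `callback_data` lets the callback handler decide what to do from the button press alone, without looking up extra state.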


Architecture

Proxmox API (read-only token)
        │
        ▼
metrics_collector.py  ──→  SQLite DB
        │
        ▼
bedrock_analyzer.py   ──→  AWS Bedrock (Nova Micro/Lite/Pro)
        │
        ▼
alert_monitor.py      ──→  Telegram Bot (inline buttons)
        │
        ▼
action_executor.py    ──→  Proxmox API (snapshot → remediate)
        │
        ▼
apprise_notifier.py   ──→  ntfy (failsafe blast channel)

Orchestration: Three n8n workflows replace systemd timers — metrics collection every 15 minutes, AI analysis every 6 hours, alert monitoring every 15 minutes. All visible and editable through a web UI.


Technical Highlights

  • Tiered AI model routing — Nova Micro for routine 6-hourly checks (~$0.00004 per check), Nova Lite for daily reports, Nova Pro for critical alerts. Total AWS cost stays under £1/month.
  • Pre-summarisation — raw Proxmox JSON is compressed to ~500 tokens before reaching Bedrock, reducing token costs by ~90% versus sending raw data.
  • State-aware alerting — SQLite tracks alert state so each breach fires exactly once, not on every 15-minute cycle.
  • Snapshot-first safety — every remediation action triggers a Proxmox snapshot before execution, enabling one-command rollback.
  • Circuit breaker — blocks further automated actions if more than 3 have executed in the last hour.
  • Multi-channel failsafe — Apprise simultaneously dispatches to Telegram and ntfy, ensuring alerts reach at least one channel even if the primary bot is throttled.
  • Per-container overrides — containers that legitimately run hot (media server, download clients) get individual threshold overrides rather than inflating global limits.
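The state-aware alerting and circuit-breaker ideas above can be sketched with two small SQLite-backed checks. This is an illustration under assumptions, not the project's schema: the table names `alert_state` and `action_history`, the column names, and the helper functions are all hypothetical.

```python
import sqlite3
import time


def init_db(conn):
    """Create the two (hypothetical) tables used by this sketch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS alert_state ("
        "container TEXT, metric TEXT, active INTEGER, "
        "PRIMARY KEY (container, metric))"
    )
    conn.execute("CREATE TABLE IF NOT EXISTS action_history (executed_at REAL)")
    conn.commit()


def should_fire(conn, container, metric, breached):
    """Return True only on the transition from OK to breached."""
    row = conn.execute(
        "SELECT active FROM alert_state WHERE container = ? AND metric = ?",
        (container, metric),
    ).fetchone()
    was_active = bool(row and row[0])
    # Persist the latest state so the next cycle can compare against it.
    conn.execute(
        "INSERT OR REPLACE INTO alert_state (container, metric, active) VALUES (?, ?, ?)",
        (container, metric, int(breached)),
    )
    conn.commit()
    return bool(breached) and not was_active


def record_action(conn, now=None):
    """Log that an automated action was executed."""
    conn.execute(
        "INSERT INTO action_history (executed_at) VALUES (?)",
        (time.time() if now is None else now,),
    )
    conn.commit()


def circuit_open(conn, now=None, max_actions=3, window_s=3600):
    """Return True if more than max_actions ran within the last window_s seconds."""
    now = time.time() if now is None else now
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM action_history WHERE executed_at > ?",
        (now - window_s,),
    ).fetchone()
    return count > max_actions
```

In this sketch a breach that persists across cycles fires once, clears when the metric recovers, and can fire again on the next breach; an executor would call `circuit_open` before each automated action and refuse to proceed when it returns True.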

Challenges Overcome

  • Telegram flood control — creating a new Bot instance per container in a loop triggers Telegram’s flood control (599-second retry). Solved by passing a single shared Bot instance across the loop with a 1-second inter-message delay.
  • Bedrock response formatting — Nova models occasionally wrap output in markdown code fences. Added post-processing to strip these before using the response.
  • Orphaned alert records — if alert_monitor.py crashes mid-cycle, it leaves partially written alert records, and the affected containers are treated as new alerts on the next run. Solved by tracking completion state and adding a cleanup utility.
  • n8n SSH reliability — the dedicated SSH node proved less reliable than the Execute Command node with SSH credentials for long-running Python scripts.
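The code-fence post-processing mentioned above could look something like the following. This is a sketch rather than the project's implementation: the function name `strip_code_fences` is an assumption, and it only handles the case where the whole response is wrapped in a single fence.

```python
import re

# Three backticks, built programmatically so the marker is easy to spot.
FENCE = "`" * 3

# Matches a response that is entirely wrapped in one fenced block,
# with an optional language tag such as "json" after the opening fence.
_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[\w-]*[ \t]*\n(.*?)\n?[ \t]*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fences(text):
    """Return the fenced body if the text is fully wrapped, else the text itself."""
    match = _FENCE_RE.match(text)
    if match:
        return match.group(1).strip()
    return text.strip()
```

Responses that are not wrapped pass through unchanged apart from whitespace trimming, so the function is safe to apply to every model reply.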

Tech Stack

  • Python 3 (requests, boto3, python-telegram-bot, apprise, pyyaml)
  • AWS Bedrock — Amazon Nova Micro / Lite / Pro
  • SQLite (metrics, alert state, action history)
  • Proxmox VE API (read-only token auth)
  • Telegram Bot API (inline keyboards, callback handlers)
  • n8n (workflow orchestration, replaces systemd timers)
  • Apprise (multi-channel notification library)

Current Status

Live and running in production. Monitoring 25+ containers with alerts active. ~£0.80/month AWS cost. Phase 3 remediation actions (log cleanup, container restart) deployed and tested.


What I Learned

Building this taught me that the hardest part of an AI-powered system isn’t the AI — it’s the plumbing. Getting reliable data collection, sensible state management, and safe action execution right takes more thought than the Bedrock API calls. The pre-summarisation approach was the single best architectural decision: it made the system both cheaper and faster.


Part of an ongoing homelab AI infrastructure project.