The Problem

Running a homelab with 25+ LXC containers means a lot can go wrong quietly. Disks fill up. Services crash overnight. Memory leaks build over days. By the time you notice, something important has been broken for hours.

The standard answer is Grafana dashboards and Prometheus alerts — and I have those. But they tell you what the numbers are, not what they mean. A dashboard at 2am showing a container at 78% disk usage doesn’t tell you whether that’s normal for that container, whether it’s trending toward a problem, or whether you should actually wake up about it.

I wanted something smarter. An agent that could look at the whole picture, apply some judgement, and only bother me when it actually mattered.

The Architecture

The agent runs inside a dedicated Proxmox LXC container (CT 900) and has three jobs:

Collect — every 15 minutes, a Python script queries the Proxmox API using a read-only token, gathering CPU, memory, disk, and network metrics for every running container.

Analyse — every 6 hours, the metrics are pre-summarised into approximately 500 tokens and sent to AWS Bedrock (Amazon Nova Micro) for AI analysis. Nova Micro reads the summary and returns a structured assessment: status, issues identified, and recommendations.

Alert — a separate alert monitor runs every 15 minutes, checks metrics against configurable thresholds, and fires Telegram messages when something is genuinely wrong — but only once per breach, not on every check cycle.

Proxmox API → metrics_collector.py → SQLite
                                         ↓
                              bedrock_analyzer.py → Nova Micro (AWS)
                                         ↓
                              alert_monitor.py → Telegram Bot

The Cost Question

Before building anything AI-powered for personal use, the cost question matters. AWS Bedrock pricing for Nova Micro is $0.035 per million input tokens and $0.14 per million output tokens.

A typical analysis in this agent uses around 540 input tokens and 150 output tokens. That works out to approximately $0.00004 per analysis call — four hundredths of a cent. Running six-hourly, the agent costs roughly $0.0002 per day, or about $0.07 per month.

The key to keeping costs this low is pre-summarisation. Rather than sending raw JSON metrics (which would be thousands of tokens), the collector first compresses everything into a structured plain-text summary. Bedrock only ever sees the summary, not the raw data.

Raw metrics JSON: ~8,000 tokens
Pre-summarised:     ~540 tokens
Cost reduction:      93%

State-Aware Alerting

The most important design decision was making alerts state-aware. Without this, every threshold breach generates an alert on every check cycle — so if a container’s disk is at 78%, you get a Telegram message every 15 minutes until you fix it. That’s alert fatigue, and it trains you to ignore alerts.

The agent instead tracks alert state in SQLite. When a breach is first detected, it fires once and records it. On subsequent checks, it recognises the breach as known and stays silent. When the metric recovers, it marks the alert as resolved. You only hear about something new.

Each alert also arrives with three inline buttons:

  • ✅ Acknowledge — you’ve seen it, stops further escalation
  • 😴 Snooze 7 days — you know about it and it’s not actionable right now
  • 🔍 Analyse Now — triggers an immediate Bedrock analysis so you can understand the context

For containers that legitimately run above global thresholds — my Cloudflare tunnel container lives at ~78% disk by design — per-container overrides in the config file raise the threshold individually without affecting global settings.

What I Learned

Pre-summarisation is everything. The single biggest architectural decision was compressing metrics before sending them to Bedrock. It keeps costs negligible and actually improves the quality of analysis because the AI isn’t distracted by noise.

Read-only first. The Proxmox API token has audit-only permissions. The agent cannot restart, modify, or delete anything. This was a deliberate choice — trust needs to be earned before automation gets write access. Autonomous actions come in a later phase, behind a Telegram approval gate.

Region matters for new AWS accounts. I originally configured the agent against eu-west-1 (Ireland) for data residency reasons. New AWS accounts have significantly lower default Bedrock quotas in EU regions than in us-east-1. After hitting quota limits repeatedly and raising a support ticket, the practical fix was switching to us-east-1 — same model, same pricing, higher default limits.

SQLite is underrated. For a single-node homelab agent collecting metrics every 15 minutes, SQLite is a perfect fit. No server to run, no connection pooling, trivially backed up with a file copy, and fast enough for everything this agent needs. The database stores 30 days of metrics, analyses, container status history, and alert state in a single ~5MB file.

The Telegram Interface

The bot provides a simple command interface alongside the automated alerts:

CommandPurpose
/statusLatest metrics summary — instant, no AI call
/healthAI-analysed health report
/analyseDeeper daily-tier analysis
/costThis month’s Bedrock spend

Having the AI analysis available on demand via /health turns out to be genuinely useful beyond just monitoring. Before making any significant change to the homelab infrastructure, a quick /health check gives a current baseline to compare against afterwards.

What’s Next

The agent is currently read-only. The next phase introduces semi-autonomous actions — the ability to propose and, with Telegram approval, execute remediation actions like container restarts and log cleanup. Every proposed action will trigger a Proxmox snapshot first, so rollback is always available.

Beyond that: tiered model routing (Nova Micro for routine checks, upgrading to Claude Haiku or Sonnet for critical alerts where richer reasoning is worth the extra cost), n8n workflow orchestration to replace the scattered systemd timers with a visual pipeline, and eventually an auto-documentation command that generates and publishes infrastructure state directly to this site.

The full technical reference — file locations, configuration, database schema, commands — is maintained as a living document here .


The complete implementation is available on GitHub .