Homelab AI Monitoring Agent

Overview

A self-hosted AI infrastructure monitoring platform running on a Proxmox homelab, written in Python and backed by AWS Bedrock. The system monitors 25+ containerised services in real time, interprets infrastructure health using large language models, and delivers intelligent alerts with actionable controls.

This project demonstrates end-to-end system design: from infrastructure and API design through to AI integration and security-conscious deployment - built entirely from scratch over a series of focused development sessions.

The Problem

A self-hosted homelab running 25+ LXC containers and a Docker stack generates constant metrics and events. Traditional monitoring tools surface raw numbers - CPU at 78%, disk at 85% - but provide no interpretation. You still have to decide what it means, whether it matters, and what to do about it.

The goal was to replace that process with something genuinely intelligent: a system that observes your infrastructure the way an experienced engineer would, explains what it sees, and puts the right controls in your hands from wherever you are.

Architecture

The system runs across three locations: a home Proxmox server, an AWS account, and a Romanian VPS - connected by WireGuard and Cloudflare tunnels.

Home server (HP Elite Mini 800 G9, Proxmox VE): A dedicated LXC container runs the entire agent stack. It collects metrics from all containers every 15 minutes via the Proxmox API, stores them in SQLite, and exposes a Flask REST API. A separate container runs n8n to orchestrate all scheduled workflows. Another container runs a Docker stack of seven services managed by Sablier for on-demand wake/sleep behaviour.
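
As a rough illustration of the collect-and-store step, the sketch below writes one row per container into SQLite and reads back the latest sample. The table layout, column names, and the store_samples and latest_sample helpers are assumptions made for this example, not the agent's actual schema, and the Proxmox API call is left out.

```python
import sqlite3
import time

# Hypothetical schema: one row per container per collection run.
SCHEMA = """
CREATE TABLE IF NOT EXISTS metrics (
    ts      INTEGER NOT NULL,
    vmid    INTEGER NOT NULL,
    name    TEXT    NOT NULL,
    status  TEXT    NOT NULL,
    cpu_pct REAL,
    mem_pct REAL
)
"""

def store_samples(conn, samples, ts=None):
    """Insert one row per container for a single collection run."""
    ts = int(time.time()) if ts is None else ts
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO metrics (ts, vmid, name, status, cpu_pct, mem_pct) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        [
            (ts, s["vmid"], s["name"], s["status"], s["cpu_pct"], s["mem_pct"])
            for s in samples
        ],
    )
    conn.commit()

def latest_sample(conn, vmid):
    """Return the most recent row for one container, or None if there is none."""
    return conn.execute(
        "SELECT ts, name, status, cpu_pct, mem_pct FROM metrics "
        "WHERE vmid = ? ORDER BY ts DESC LIMIT 1",
        (vmid,),
    ).fetchone()

# Two collection runs 15 minutes (900 seconds) apart, using made-up values.
conn = sqlite3.connect(":memory:")
store_samples(conn, [{"vmid": 101, "name": "example-ct", "status": "running",
                      "cpu_pct": 12.5, "mem_pct": 40.0}], ts=1000)
store_samples(conn, [{"vmid": 101, "name": "example-ct", "status": "running",
                      "cpu_pct": 20.0, "mem_pct": 55.0}], ts=1900)
```

In a real deployment the connection would point at a file on disk rather than ":memory:".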

AWS (Bedrock, us-east-1): Three Nova models handle different analysis tiers. Nova Micro runs every six hours for routine checks at approximately $0.00004 per analysis. Nova Lite handles daily reports. Nova Pro handles critical alerts and monthly deep audits. A pre-summarisation step compresses metrics before they are sent to Bedrock, keeping token usage and costs minimal.
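
One plausible shape for the pre-summarisation step is to collapse the raw samples for each container into a few aggregate numbers before building the model input. The summarise function and its field names are hypothetical, and the real agent may compress metrics differently.

```python
from statistics import mean

def summarise(samples):
    """Collapse raw samples into per-container min/avg/max memory figures."""
    by_name = {}
    for sample in samples:
        by_name.setdefault(sample["name"], []).append(sample)

    summary = {}
    for name, rows in by_name.items():
        mem = [row["mem_pct"] for row in rows]
        summary[name] = {
            "samples": len(rows),
            "mem_min": min(mem),
            "mem_avg": round(mean(mem), 1),
            "mem_max": max(mem),
            # Rows are assumed to be in chronological order.
            "status": rows[-1]["status"],
        }
    return summary

# Made-up readings for a single container.
raw = [
    {"name": "example-ct", "status": "running", "mem_pct": 40.0},
    {"name": "example-ct", "status": "running", "mem_pct": 50.0},
    {"name": "example-ct", "status": "running", "mem_pct": 60.0},
]
summary = summarise(raw)
```

Sending three numbers per container instead of every raw sample is what keeps the token count small in this sketch.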

Romanian VPS: Hosts the public portfolio site and a media stack. The VPS also acts as an external watchdog: if the home agent stops responding, the VPS raises the alert itself, so an outage is reported even when the home agent cannot report it.

What It Does

Intelligent health analysis: Rather than just checking thresholds, the agent sends structured metrics summaries to AWS Bedrock for interpretation. The AI explains what it sees, identifies likely causes, and suggests action in plain language. A known_states injection prevents false positives from expected conditions.
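
The known_states idea can be sketched as a prompt builder that lists expected conditions ahead of the metrics. This assumes the injection happens in the prompt text; the build_prompt function, the wording of the instructions, and the example entries are all illustrative rather than taken from the agent.

```python
import json

def build_prompt(summary, known_states):
    """Build the model input: expected conditions first, then the metrics."""
    lines = [
        "You are reviewing homelab infrastructure metrics.",
        "These conditions are expected and must not be reported as problems:",
    ]
    lines += [f"- {name}: {reason}" for name, reason in sorted(known_states.items())]
    lines.append("Metrics summary (JSON):")
    lines.append(json.dumps(summary, sort_keys=True))
    return "\n".join(lines)

# Hypothetical container names and reasons.
known_states = {"backup-ct": "stopped outside the backup window"}
summary = {"backup-ct": {"status": "stopped"}, "web-ct": {"status": "running"}}
prompt = build_prompt(summary, known_states)
```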

State-aware alerting: Alerts fire on transitions, not conditions. If a container was already stopped when the agent started, it is baselined silently. An alert only fires if a container moves from running to stopped. This eliminates the alert floods that plague naive threshold-based systems.
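
A minimal sketch of transition-based alerting, assuming container state is kept as a simple name-to-status mapping between runs. The detect_transitions name and the alert text are hypothetical.

```python
def detect_transitions(previous, current):
    """Return alerts only for containers that moved from running to stopped.

    previous maps container name to its last known status and is empty on
    the agent's first run, so containers that are already stopped at
    startup are baselined without an alert.
    """
    alerts = []
    for name, status in sorted(current.items()):
        if previous.get(name) == "running" and status == "stopped":
            alerts.append(f"{name} stopped")
    return alerts

# First run: nothing is known yet, so nothing fires.
first_seen = {"web-ct": "running", "old-ct": "stopped"}
startup_alerts = detect_transitions({}, first_seen)

# Next run: web-ct has gone down, old-ct is still stopped.
second_seen = {"web-ct": "stopped", "old-ct": "stopped"}
later_alerts = detect_transitions(first_seen, second_seen)

# A third run with no change produces no repeat alert.
repeat_alerts = detect_transitions(second_seen, second_seen)
```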

Tiered model routing: Routine checks use the cheapest model. Daily summaries use a mid-tier model. Critical alerts escalate to the most capable model. The entire system costs less than $1/month to run at current usage.
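
The routing itself can be as small as a lookup table. In the sketch below the task names and the pick_model helper are hypothetical, and the model identifiers are assumptions that should be checked against the Bedrock console for the account and region in use.

```python
# Assumed Bedrock model identifiers; verify before use.
MODEL_BY_TASK = {
    "routine": "amazon.nova-micro-v1:0",       # six-hourly checks
    "daily": "amazon.nova-lite-v1:0",          # daily reports
    "critical": "amazon.nova-pro-v1:0",        # critical alerts
    "monthly_audit": "amazon.nova-pro-v1:0",   # monthly deep audits
}

def pick_model(task):
    """Map a task type to a model, falling back to the cheapest tier."""
    return MODEL_BY_TASK.get(task, MODEL_BY_TASK["routine"])
```

Falling back to the cheapest tier for unknown task types is a design choice in this sketch, made so that a typo in a workflow cannot silently run up costs.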

Multi-channel notifications: Alerts are delivered via Firebase Cloud Messaging, with ntfy for secondary delivery and Telegram for legacy access. All three channels fire from a single notification module.
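
A single notification module that fans out to several channels might look like the following. The notify_all function is hypothetical and each channel is represented by a plain callable, so the real Firebase, ntfy, and Telegram senders are not shown.

```python
def notify_all(title, body, channels):
    """Send one alert to every channel; one failure must not block the rest."""
    results = {}
    for name, send in channels.items():
        try:
            send(title, body)
            results[name] = "ok"
        except Exception as exc:  # deliberately broad: isolate each channel
            results[name] = f"failed: {exc}"
    return results

# Stand-in senders for demonstration only.
delivered = []

def fake_ok(title, body):
    delivered.append((title, body))

def fake_down(title, body):
    raise RuntimeError("down")

results = notify_all(
    "web-ct stopped",
    "Container moved from running to stopped.",
    {"fcm": fake_ok, "ntfy": fake_down, "telegram": fake_ok},
)
```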

PBS backup monitoring: Live backup status (last run time, which containers were backed up, and which are stale) is read directly from Proxmox Backup Server datastore timestamps over SSH.
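
Once the timestamps have been fetched, deciding which backups are stale is a simple comparison. The stale_backups helper and the 36-hour cut-off below are assumptions for illustration, and the SSH step that reads the datastore is omitted.

```python
from datetime import datetime, timedelta, timezone

def stale_backups(last_backup, now, max_age=timedelta(hours=36)):
    """Return the names of containers whose newest backup is older than max_age."""
    return sorted(name for name, ts in last_backup.items() if now - ts > max_age)

# Made-up timestamps for two hypothetical containers.
now = datetime(2026, 3, 10, 12, 0, tzinfo=timezone.utc)
last_backup = {
    "web-ct": now - timedelta(hours=10),
    "db-ct": now - timedelta(days=3),
}
stale = stale_backups(last_backup, now)
```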

VPS and home agent watchdog: The Romanian VPS independently pings the home agent’s health endpoint every 15 minutes. If the home server goes offline, the VPS fires an alert as soon as a check fails.
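
The watchdog can be sketched as a health probe plus a small decision function. The function names and the choice to alert only on the first failed check are assumptions; the health endpoint URL is not given here, so it is passed in as a parameter.

```python
import urllib.error
import urllib.request

def agent_is_healthy(url, timeout=10):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, DNS failures, timeouts and HTTP errors.
        return False

def should_alert(was_healthy, is_healthy):
    """Alert on the first failed check only, not on every failed check."""
    return was_healthy and not is_healthy

# Simulated sequence of check results: up, down, still down, up again.
checks = [True, False, False, True]
alerts = []
previous = True
for current in checks:
    alerts.append(should_alert(previous, current))
    previous = current
```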

Simulation Engine

Added in March 2026, the simulation engine addresses a limitation common to all monitoring systems: you can only observe what has already happened.

The engine runs the full alert and action pipeline against a hypothetical infrastructure state without touching anything live. It accepts three input types: a structured JSON scenario defining metric overrides, a natural language query converted to a scenario by Nova Micro on Bedrock, or a replay of a historical metrics snapshot from the database.

A natural language query such as “what happens if sabnzbd memory hits 96% while the Docker host is thrashing?” is converted to a structured scenario in a single Nova Micro call costing approximately $0.000002. The scenario is then run through the same threshold evaluation, alert generation, and action prediction logic the live agent uses. The result is a structured report showing which alerts would fire, their severity, what remediation actions would be proposed, which safety gates they would hit, and whether a snapshot would be taken first.
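
For the structured JSON input type, the core of a simulation run can be sketched as applying metric overrides to a copy of the current state and evaluating thresholds on the copy. The scenario shape, the threshold values, and the function names are all assumptions; action prediction and safety gates are not modelled here.

```python
import copy

# Illustrative thresholds, not the agent's real configuration.
THRESHOLDS = {"mem_pct": {"warning": 85.0, "critical": 95.0}}

def apply_scenario(snapshot, scenario):
    """Return a modified copy of the snapshot; the original is left untouched."""
    simulated = copy.deepcopy(snapshot)
    for name, overrides in scenario["overrides"].items():
        simulated.setdefault(name, {}).update(overrides)
    return simulated

def evaluate(state, thresholds=THRESHOLDS):
    """List the alerts that would fire for a given state."""
    alerts = []
    for name, metrics in sorted(state.items()):
        for metric, levels in thresholds.items():
            value = metrics.get(metric)
            if value is None:
                continue
            if value >= levels["critical"]:
                severity = "critical"
            elif value >= levels["warning"]:
                severity = "warning"
            else:
                continue
            alerts.append({"container": name, "metric": metric,
                           "severity": severity, "value": value})
    return alerts

live = {"sabnzbd": {"mem_pct": 40.0}, "web-ct": {"mem_pct": 30.0}}
scenario = {"overrides": {"sabnzbd": {"mem_pct": 96.0}}}
report = evaluate(apply_scenario(live, scenario))
```

Working on a deep copy is what keeps the run read-only in this sketch: the live dictionary still holds the original values afterwards.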

Practical uses include testing threshold changes before applying them, planning maintenance windows by simulating the expected alert state, and regression-testing after incidents - feeding the original metrics back through the current logic to verify the agent would now catch the problem earlier.

The history replay capability is particularly useful for post-incident analysis. The SABnzbd OOM incident in early 2026 can be replayed against current thresholds to confirm the memory critical alert would now fire before the OOM kill rather than after it.
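
A replay check of this kind reduces to asking at which point in a recorded series an alert would first fire under a given threshold. The readings and thresholds below are invented for illustration and are not the data from the incident.

```python
def first_critical_index(readings, critical_threshold):
    """Return the index of the first reading at or above the threshold, or None."""
    for index, value in enumerate(readings):
        if value >= critical_threshold:
            return index
    return None

# Made-up memory readings (percent) leading up to a hypothetical OOM kill.
readings = [70.0, 82.0, 91.0, 97.0, 99.0]

old_fire = first_critical_index(readings, critical_threshold=98.0)
new_fire = first_critical_index(readings, critical_threshold=90.0)
fires_earlier = new_fire is not None and (old_fire is None or new_fire < old_fire)
```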

The simulation endpoint is exposed via the Flask API at POST /simulate, making it queryable from any connected client. The entire pipeline is read-only and makes no changes to infrastructure or the metrics database.
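
From the client side, a call to the endpoint could be prepared as below. Only the POST /simulate route comes from the description above; the host name, port, and request body shape are placeholders.

```python
import json
import urllib.request

def build_simulate_request(base_url, payload):
    """Prepare (but do not send) a JSON POST request to the /simulate route."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/simulate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder address and a hypothetical body with a natural language query.
request = build_simulate_request(
    "http://agent.example.internal:5000/",
    {"query": "what happens if sabnzbd memory hits 96%?"},
)
# To send it: urllib.request.urlopen(request, timeout=30)
```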

Key Technical Decisions

Why SQLite? Single container, no scaling requirement. Fast, zero-dependency, and the entire metrics history is one portable file queryable directly from the command line.

Why n8n for scheduling? Visual workflow editor, execution history, and retry logic - without the friction of editing systemd unit files for every schedule change.

Why AWS Bedrock over a local model? Running a local LLM capable of infrastructure analysis would require significantly more RAM than is available. Nova Micro delivers competent analysis at a fraction of a cent per run with a highly predictable cost model.
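
The routine tier's share of the bill follows directly from the figures quoted earlier. The calculation below covers only the six-hourly Nova Micro checks and assumes a 30-day month; Nova Lite and Nova Pro usage is not included.

```python
COST_PER_ANALYSIS = 0.00004   # USD, approximate figure for Nova Micro
RUNS_PER_DAY = 24 // 6        # one run every six hours
DAYS_PER_MONTH = 30           # assumption

monthly_runs = RUNS_PER_DAY * DAYS_PER_MONTH
monthly_micro_cost = monthly_runs * COST_PER_ANALYSIS
print(f"{monthly_runs} runs, about ${monthly_micro_cost:.4f} per month")
```

That is 120 runs at roughly half a cent in total, which leaves most of the sub-$1 monthly budget for the less frequent Lite and Pro calls.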

Outcomes

Complete AI-powered infrastructure monitoring running in production 24/7
Monthly Bedrock cost under $1 for a 25+ container homelab
Simulation engine enabling hypothetical scenario testing without touching live infrastructure
Clean public GitHub repository with sanitised config template

Technologies

Python - Flask - SQLite - AWS Bedrock (Nova Micro/Lite/Pro) - Firebase Cloud Messaging - Proxmox VE - Docker - n8n - Sablier - Cloudflare Tunnels - WireGuard - Apprise - ntfy - Telegram Bot API