Running an AI Monitoring Agent on AWS Bedrock for £0.02 a Month

When I started building an AI monitoring agent for the homelab, the obvious concern was cost. AWS Bedrock is powerful, but running AI analysis continuously against 25+ containers sounded like a recipe for a surprising monthly bill.

The solution was tiered model routing — and it keeps the entire operation running for approximately £0.02 per month.

The Tiering Logic

This setup uses three models from AWS Bedrock’s Nova family, each at a very different price point. Nova Micro is the cheapest of the three at around $0.035 per million input tokens. Nova Lite sits in the middle. Nova Pro is the most capable of the three and the most expensive.

The agent routes queries to the appropriate tier based on what it needs to do:

Nova Micro — routine 15-minute health checks, threshold monitoring, standard alert generation. This is 90% of all queries.
Nova Lite — mid-complexity analysis when something looks unusual but not critical. Behaviour deviation checks, trend analysis across multiple containers.
Nova Pro — monthly deep audits only. Full infrastructure review, capacity planning analysis, strategic recommendations.
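
As a rough illustration of the routing above, here is a minimal Python sketch. The task labels and the fallback to the cheapest tier are assumptions made for the example, not the agent's actual names, and the tier values are plain labels rather than real Bedrock model IDs.

```python
from enum import Enum


class Tier(Enum):
    # Plain labels for illustration; real Bedrock model IDs are not shown here.
    MICRO = "nova-micro"
    LITE = "nova-lite"
    PRO = "nova-pro"


# Hypothetical task labels mapped to the tier that handles them.
ROUTES = {
    "health_check": Tier.MICRO,
    "threshold_alert": Tier.MICRO,
    "deviation_check": Tier.LITE,
    "trend_analysis": Tier.LITE,
    "monthly_audit": Tier.PRO,
}


def pick_tier(task: str) -> Tier:
    """Return the tier for a task, defaulting to the cheapest tier (an assumption)."""
    return ROUTES.get(task, Tier.MICRO)
```

Keeping the mapping in one table means a task can be moved to a different tier by editing a single line.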

By sending the vast majority of traffic to Nova Micro and reserving Pro for the monthly audit, the cost stays negligible. Summarising metrics before they reach Bedrock keeps token counts tight: the agent sends a compressed summary rather than raw data, so each analysis cycle costs roughly $0.00004.
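
To put those figures in perspective, here is a small back-of-the-envelope sketch. It uses only the two numbers quoted in this post and assumes input tokens on Nova Micro are the only cost, so it ignores output tokens and the Lite and Pro calls. The function names are made up for the example.

```python
# Figures quoted in this post.
MICRO_INPUT_USD_PER_MILLION = 0.035
COST_PER_CYCLE_USD = 0.00004


def input_tokens_for(cost_usd: float, price_per_million: float) -> float:
    """How many input tokens a given spend buys at a per-million-token price."""
    return cost_usd / price_per_million * 1_000_000


def monthly_cost(cycles_per_day: int, cost_per_cycle: float, days: int = 30) -> float:
    """Total spend for a fixed number of analysis cycles per day."""
    return cycles_per_day * days * cost_per_cycle


tokens = input_tokens_for(COST_PER_CYCLE_USD, MICRO_INPUT_USD_PER_MILLION)
analysis_usd = monthly_cost(4, COST_PER_CYCLE_USD)  # one cycle every 6 hours
print(f"{tokens:.0f} input tokens per cycle, ${analysis_usd:.4f} per 30 days")
```

Under those assumptions, $0.00004 corresponds to a summary of a little over 1,100 input tokens, and four cycles a day add up to about half a cent per month.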

What the Agent Actually Does

Every 15 minutes the agent collects metrics from the Proxmox API and Docker socket — CPU, memory, disk, and running state for every container. Those metrics go into a SQLite database. A threshold monitor checks for breaches and fires alerts if something crosses a warning or critical level.
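
The store-then-check step can be sketched as follows. This is a simplified illustration using Python's standard sqlite3 module; the table layout, the threshold values, and the function names are assumptions for the example rather than the agent's real schema.

```python
import sqlite3

# Hypothetical (warning, critical) thresholds in percent.
THRESHOLDS = {
    "cpu": (80.0, 95.0),
    "memory": (80.0, 95.0),
    "disk": (85.0, 95.0),
}


def init_db(conn: sqlite3.Connection) -> None:
    """Create a minimal metrics table (assumed layout)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        "ts TEXT DEFAULT CURRENT_TIMESTAMP, container TEXT, metric TEXT, value REAL)"
    )


def classify(metric: str, value: float):
    """Return 'critical', 'warning', or None for a single reading."""
    warn, crit = THRESHOLDS[metric]
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return None


def record_and_check(conn: sqlite3.Connection, container: str, sample: dict) -> list:
    """Store one sample per metric and return any threshold breaches."""
    alerts = []
    for metric, value in sample.items():
        conn.execute(
            "INSERT INTO metrics (container, metric, value) VALUES (?, ?, ?)",
            (container, metric, value),
        )
        level = classify(metric, value)
        if level is not None:
            alerts.append((container, metric, level))
    conn.commit()
    return alerts
```

A call such as `record_and_check(conn, "sabnzbd", {"cpu": 50.0, "memory": 96.0})` would store both readings and return a single critical memory alert.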

Every 6 hours the Bedrock analyser runs, building a picture of infrastructure health across the collection window. It compares current state against nightly-computed baselines and flags any containers showing unusual behaviour patterns — not just threshold breaches, but deviation from their own established norms.
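
One simple way to express "deviation from established norms" is a z-score against a per-container baseline. The sketch below assumes that approach purely for illustration; the post does not specify which statistic the agent uses, and the limit of 3 standard deviations is an arbitrary example value.

```python
from statistics import mean, stdev


def build_baseline(history: list) -> tuple:
    """Summarise past readings as (mean, standard deviation).

    Needs at least two readings, since stdev is undefined for fewer.
    """
    return mean(history), stdev(history)


def is_unusual(value: float, baseline: tuple, z_limit: float = 3.0) -> bool:
    """Flag a reading that sits more than z_limit deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly flat history: any change at all counts as a deviation.
        return value != mu
    return abs(value - mu) / sigma > z_limit
```

With this approach, a container that normally idles at 10% CPU is flagged at 30% even though 30% is nowhere near a warning threshold.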

Alerts go to an Android push notification via Firebase Cloud Messaging, with Telegram as a secondary channel. Action buttons on the notification let me acknowledge or snooze directly from the lock screen.

Before executing any approved remediation action, such as a container restart, the agent always takes a Proxmox snapshot. The agent observes and suggests; it does not act without approval.
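
The approve-then-snapshot-then-act ordering can be sketched as a small guard function. The snapshot and action callables are hypothetical stand-ins passed in by the caller; no real Proxmox API calls are shown.

```python
from typing import Callable, Optional


def run_remediation(
    approved: bool,
    take_snapshot: Callable[[], str],
    action: Callable[[], None],
) -> Optional[str]:
    """Run an action only if approved, and only after a snapshot succeeds.

    Returns the snapshot identifier, or None if nothing was done.
    """
    if not approved:
        return None
    # If the snapshot raises, the action below never runs.
    snapshot_id = take_snapshot()
    action()
    return snapshot_id
```

Passing the snapshot and action in as callables keeps the ordering rule in one place and makes it easy to test without touching live infrastructure.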

The Simulation Engine

One of the more interesting additions is a simulation engine that can model infrastructure failure scenarios using natural language input via Bedrock. Six pre-built scenarios cover the most realistic failure modes: SABnzbd OOM kills, VPS outages, NAS power loss, cascade stress events, monitoring gaps, and disk pressure.

The simulation lets me test how the agent would respond to conditions that are difficult to reproduce safely on live infrastructure — useful both for validating the monitoring logic and as a portfolio demonstration of AI-assisted infrastructure reasoning.

Why This Matters for Cloud Consulting

The architecture demonstrates several things that are directly relevant to cloud consulting work: cost-conscious design, tiered resource usage, AI integration with practical constraints, and operational discipline around automated actions. The fact that it runs on real infrastructure rather than a demo environment makes it a genuine reference implementation rather than a toy project.

Full technical details are on the project page at anthonyapierre.com.