Self-Healing Homelab MCP with AWS Route 53, CloudWatch and Lambda

There is a particular kind of frustration that comes from a tool dropping mid-session with no warning. My MCP server — the bridge between Claude and my homelab infrastructure — had developed a habit of going quiet at exactly the wrong moment. The fix I built turned out to be a solid piece of AWS architecture. The plot twist was that the problem itself was something else entirely.

The Problem

My MCP server runs on CT104 (ai-lab) inside a Proxmox LXC container on my home network, exposed via a Cloudflare tunnel. Claude connects to it over SSE (Server-Sent Events), which keeps a long-lived HTTP connection open so tool results can stream back in real time. When that connection dropped, the session was dead: no tools, no homelab access, nothing. The only recovery was to SSH in manually and restart the process.

Manual restarts are fine once. They stop being fine the third time it happens in a week.

The Architecture

Figure: Self-healing MCP architecture

The self-healer uses four AWS services connected in a chain:

Route 53 health check — polls the MCP endpoint every 30 seconds. If it fails three consecutive checks, it flips to unhealthy and triggers a CloudWatch alarm.

CloudWatch alarm — watches the Route 53 health check metric. On state change to ALARM, it fires an SNS notification.

SNS topic — receives the notification and fans it out to two subscribers: a Lambda function and an ntfy push notification so I know it fired.

Lambda function — the self-healer. It calls the exec relay on CT140, which runs a restart command against CT104 without needing any SSH keys stored in AWS. The exec relay is a FastAPI endpoint that accepts authenticated POST requests and executes commands on the homelab — Lambda never touches the infrastructure directly. Clean separation, least privilege throughout.
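To make the first two links in the chain concrete, here is a minimal sketch of the settings they imply, written as plain Python dictionaries in the shape boto3's `create_health_check` and `put_metric_alarm` calls accept. The 30-second interval and three-failure threshold come from the description above; the HTTPS probe type, the alarm name, the statistic and period, and the example domain and path are assumptions for illustration, not the values actually deployed.

```python
def health_check_config(domain: str, path: str) -> dict:
    """HealthCheckConfig for route53.create_health_check (assumes an HTTPS probe)."""
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": domain,
        "Port": 443,
        "ResourcePath": path,
        "RequestInterval": 30,  # seconds between checks, as described above
        "FailureThreshold": 3,  # consecutive failures before "unhealthy"
    }


def alarm_params(health_check_id: str, topic_arn: str) -> dict:
    """Keyword arguments for cloudwatch.put_metric_alarm."""
    return {
        "AlarmName": "mcp-endpoint-unhealthy",  # hypothetical name
        "Namespace": "AWS/Route53",
        "MetricName": "HealthCheckStatus",  # 1 = healthy, 0 = unhealthy
        "Dimensions": [{"Name": "HealthCheckId", "Value": health_check_id}],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [topic_arn],  # the SNS topic that fans out
    }


# Placeholder domain and path; with boto3 this would be passed as
# route53.create_health_check(CallerReference=..., HealthCheckConfig=config)
config = health_check_config("mcp.example.invalid", "/health")
seconds_to_unhealthy = config["RequestInterval"] * config["FailureThreshold"]
```

With these numbers, three failed checks at 30-second intervals put the detection window at roughly 90 seconds.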

The Lambda function itself is minimal Python. It receives the SNS event, POSTs to the exec relay with the restart command, and logs the result. Total execution time is under two seconds. The function is deployed as mcp-healer (ARN ae34f9ec), and the SNS topic is mcp-healer-alerts.
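A handler along these lines might look like the sketch below. It is not the deployed code: the relay URL, the restart command, the `X-Api-Key` header and the JSON body shape are all hypothetical, and the relay key is read from an environment variable here only to keep the example standard-library-only (the real function fetches it from Parameter Store). The `send` parameter exists so the network call can be swapped out in a test.

```python
import json
import os
import urllib.request

# Hypothetical defaults; the real endpoint and command are not given in this post.
RELAY_URL = os.environ.get("RELAY_URL", "https://relay.example.invalid/exec")
RESTART_COMMAND = os.environ.get("RESTART_COMMAND", "systemctl restart mcp")


def build_request(command: str, api_key: str) -> urllib.request.Request:
    """Build the authenticated POST sent to the exec relay."""
    body = json.dumps({"command": command}).encode("utf-8")
    return urllib.request.Request(
        RELAY_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "X-Api-Key": api_key},
    )


def alarm_state(event: dict) -> str:
    """Pull NewStateValue out of the CloudWatch alarm message wrapped by SNS."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message.get("NewStateValue", "")


def lambda_handler(event, context, send=urllib.request.urlopen):
    # Only act on transitions into ALARM; ignore anything else defensively.
    if alarm_state(event) != "ALARM":
        return {"restarted": False}
    # Stand-in for the Parameter Store lookup described in the post.
    api_key = os.environ.get("RELAY_API_KEY", "")
    with send(build_request(RESTART_COMMAND, api_key), timeout=10) as resp:
        status = resp.status
    print(json.dumps({"relay_status": status}))  # lands in CloudWatch Logs
    return {"restarted": status == 200, "relay_status": status}
```

Filtering on the alarm state is a defensive choice in this sketch; if the alarm only publishes on the transition to ALARM, the check is redundant but harmless.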

The Exec Relay Trick

This is the part worth highlighting for anyone building similar patterns. Lambda functions running in AWS have no direct path into a home network. The obvious approaches — storing SSH keys in Secrets Manager, setting up a VPN — add complexity and attack surface.

The exec relay solves this cleanly. CT140 runs a FastAPI service that accepts authenticated POST requests over HTTPS via Cloudflare tunnel. Lambda POSTs a signed request with the command to run. The relay executes it locally and returns the result. Lambda never holds credentials for the homelab directly — it only holds the relay API key, which can be rotated without touching the homelab at all.
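The post does not spell out how the request is signed, so the following is only one plausible scheme: an HMAC-SHA256 over a timestamp and the request body, keyed with the relay API key. The function names, the `timestamp.body` message layout and the five-minute freshness window are assumptions made for this sketch. `sign` would run in Lambda, and `verify` inside the relay before it executes anything.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign(body: bytes, timestamp: str, key: bytes) -> str:
    """Caller side: hex HMAC-SHA256 over 'timestamp.body'."""
    message = timestamp.encode("utf-8") + b"." + body
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(body: bytes, timestamp: str, signature: str, key: bytes,
           now: Optional[float] = None, max_age: int = 300) -> bool:
    """Relay side: accept only a fresh timestamp with a matching signature."""
    now = time.time() if now is None else now
    try:
        age = abs(now - float(timestamp))
    except ValueError:
        return False  # malformed timestamp
    if age > max_age:
        return False  # too old (or too far in the future): possible replay
    expected = sign(body, timestamp, key)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

Because the key never travels with the request in this scheme, rotating it means updating the stored key on both sides and nothing else.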

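For IAM, the Lambda execution role has just two permissions: write to CloudWatch Logs and read the relay key from SSM Parameter Store. Receiving from SNS needs nothing on the role itself; SNS is allowed to invoke the function through the function's resource-based policy. Nothing else.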
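As a rough sketch of what that least-privilege setup could look like, the snippet below builds an execution-role policy covering logging and the Parameter Store read, plus the arguments for a Lambda resource-based permission that lets the SNS topic invoke the function (in AWS, that grant lives on the function rather than on the execution role). The region, account ID and parameter name are placeholders, not values from this setup, and the policy assumes the log group already exists.

```python
import json


def execution_role_policy(region: str, account_id: str,
                          function_name: str, parameter_name: str) -> dict:
    """Identity policy for the Lambda execution role: logs plus one parameter."""
    log_group = (f"arn:aws:logs:{region}:{account_id}:"
                 f"log-group:/aws/lambda/{function_name}")
    parameter = parameter_name.lstrip("/")  # the ARN omits the leading slash
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                "Resource": f"{log_group}:*",
            },
            {
                "Effect": "Allow",
                "Action": "ssm:GetParameter",
                "Resource": f"arn:aws:ssm:{region}:{account_id}:parameter/{parameter}",
            },
        ],
    }


def sns_invoke_permission(function_name: str, topic_arn: str) -> dict:
    """Keyword arguments for lambda.add_permission (resource-based policy)."""
    return {
        "FunctionName": function_name,
        "StatementId": "allow-sns-invoke",  # hypothetical statement ID
        "Action": "lambda:InvokeFunction",
        "Principal": "sns.amazonaws.com",
        "SourceArn": topic_arn,
    }


# Placeholder region, account ID and parameter name.
policy = execution_role_policy("us-east-1", "123456789012",
                               "mcp-healer", "/mcp/relay-key")
print(json.dumps(policy, indent=2))
```

Scoping `ssm:GetParameter` to a single parameter ARN keeps the role from reading anything else in Parameter Store.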

Testing It

The test was straightforward: stop the MCP process on CT104 manually, then watch the chain fire. Route 53 flagged the endpoint unhealthy within 90 seconds of the first failed check, consistent with three failures at 30-second intervals. The CloudWatch alarm transitioned to ALARM. SNS delivered to Lambda and ntfy simultaneously. Lambda called the exec relay, which restarted the MCP service. Route 53 flipped back to healthy. Total recovery time from first failed check to confirmed healthy: under four minutes, with no human involvement.

The ntfy notification arrived on my phone before I had finished watching the CloudWatch console.

The Plot Twist

With the self-healer in place and working, I started looking more carefully at when the drops were happening. The pattern was consistent: they hit sessions that had gone quiet for more than a minute or two. CT104 was not crashing. The process was still running. The SSE connection was simply being closed.

The culprit was Cloudflare’s idle connection timeout. Cloudflare terminates connections that carry no traffic for 100 seconds. An SSE stream sitting idle — waiting for a tool call — looks like an idle connection. Cloudflare closes it. Claude loses the session. CT104 never knew anything was wrong.

The fix was three lines in index.js v1.2.0: a keepalive comment sent on the SSE stream every 30 seconds, well inside the 100-second window. Cloudflare now sees traffic and keeps the connection open, and the drops stopped entirely.

setInterval(() => {
  res.write(': keepalive\n\n'); // SSE comment line: ignored by clients, but counts as traffic
}, 30000);
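Why this is safe for the client: in the SSE wire format, any line that starts with a colon is a comment, and a block with no data lines never dispatches an event. The small Python parser below is a simplified illustration of that rule, not the code Claude or any particular client uses; it handles only the `data` and `event` fields.

```python
def parse_sse(stream: str) -> list:
    """Parse a text/event-stream payload into events, skipping comment lines."""
    events = []
    data_lines = []
    event_type = "message"
    for line in stream.split("\n"):
        if line == "":
            # A blank line ends the block; dispatch only if data was seen.
            if data_lines:
                events.append({"event": event_type, "data": "\n".join(data_lines)})
            data_lines, event_type = [], "message"
        elif line.startswith(":"):
            continue  # comment line, e.g. ": keepalive"
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space after the colon is dropped
            if field == "data":
                data_lines.append(value)
            elif field == "event":
                event_type = value
    return events


stream = (
    ": keepalive\n\n"
    'event: tool\ndata: {"ok": true}\n\n'
    ": keepalive\n\n"
)
print(parse_sse(stream))
```

The keepalives produce bytes on the wire, which is all the proxy cares about, while the client sees only the one real event.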

The Route 53 self-healer is still running and still useful — it catches genuine CT104 crashes, which do happen occasionally after Proxmox maintenance. But the day-to-day reliability improvement came from the keepalive, not the healer.

Cost

Route 53 health checks cost $0.50 per month per endpoint. Lambda invocations at this scale are effectively free: a handful of restarts per month stays well within the free tier. A standard CloudWatch alarm is priced at about $0.10 per month, which accounts for most of the remainder, and SNS adds a negligible fraction of a cent. Total: approximately $0.60 per month for a self-healing MCP infrastructure.

SAA-C03 Angle

This architecture maps cleanly onto several SAA-C03 exam domains. Route 53 health checks feeding CloudWatch alarms is a standard high-availability pattern. SNS fan-out to multiple subscribers (Lambda + notification) demonstrates decoupled event-driven design. Lambda with least-privilege IAM — no stored SSH keys, relay key in Parameter Store — covers security best practices. The exec relay pattern itself is worth understanding: it is essentially a lightweight internal API gateway for infrastructure operations, keeping cloud and on-premises concerns separated.

If you are studying for SAA-C03 and want a practical project that touches Route 53, CloudWatch, SNS, Lambda, IAM and SSM in a real operational context, this is a good one to build.

What I Would Do Differently

Route 53 health checks work on public endpoints only. The MCP server is publicly accessible via Cloudflare tunnel, which makes this work — but if the tunnel itself goes down, Route 53 sees a failure that Lambda cannot fix (CT104 is fine, the tunnel is the problem). A more complete solution would add a separate tunnel health check and a distinct recovery path for tunnel failures. That is on the backlog.

The self-healer and the keepalive fix together have eliminated unplanned MCP downtime. The architecture cost less than a coffee to build and costs less than a coffee per month to run.