Whisper Voice Infrastructure
Overview
A self-hosted speech-to-text platform that started as a containerised API on the homelab and grew into a voice-first input layer spanning every application on the desktop, a Telegram bot, and a dedicated Android workflow. The system uses faster-whisper with an OpenAI-compatible REST API, AWS Bedrock Nova Micro for transcript cleanup, and Cloudflare Zero Trust Service Auth for secure remote access.
This project demonstrates the difference between deploying a tool and actually integrating it into how you work — the container was the starting point, not the outcome.
The Problem
Voice input on most platforms is either locked to a single application or depends on third-party services that process audio on external servers. For someone with active security clearance, routing audio through OpenAI’s Whisper API or a Chrome extension is not an acceptable trade-off — regardless of what those providers claim about data retention.
The goal was a fully self-hosted, always-available voice input layer that works in any application, with all audio processed locally on infrastructure under my control.
Architecture
CT120 — Docker container (whisper-web)
The core service runs as a Docker container on CT120 alongside the rest of the homelab Docker stack. It exposes a POST /transcribe endpoint accepting audio files and returning transcribed text. The API key is managed via environment variable and enforced on every request.
The container mounts main.py as a volume from the host, allowing configuration changes without rebuilding the image. AWS Bedrock Nova Micro is wired in as an optional cleanup pass — when cleanup=true is passed, the raw Whisper transcript is post-processed to strip filler words, fix punctuation, and normalise sentence structure before returning.
Access is secured via Cloudflare Zero Trust Service Auth at whisper.sevenbirches.co.uk. The browser-facing microphone endpoint requires HTTPS, which the Cloudflare tunnel provides automatically.
Windows Voice Daemon
The most significant extension was a system-wide voice input daemon running on Windows. Rather than integrating voice into a single application, the daemon listens for a global hotkey and injects transcribed text into whatever window currently has focus — Claude.ai, VS Code, Word, Outlook, a terminal, anything.
How it works:
- Press the hotkey (Scroll Lock — completely unused by any application)
- Recording starts via
sounddeviceat 16kHz - Press again to stop — audio is written to a temporary WAV file
- The file is
POSTed to the local Whisper API withcleanup=true - The transcript is written to the clipboard and pasted into the active window via
pyperclip+pyautogui
The daemon is registered as a Windows Task Scheduler entry with Run with highest privileges (required for global hotkey capture), triggered at logon with a 15-second delay. It survives reboots transparently.
The first iteration used Ctrl+Shift+Space as the hotkey, which introduced a click artefact at the start of every recording. This was resolved by trimming the first 0.3 seconds of audio before transcription. Subsequent iteration moved to Scroll Lock as a single-key trigger — no modifier chord, no conflicts, no artefacts.
The audio posts directly to http://192.168.55.120:9000/transcribe on the LAN — no Cloudflare hop needed from home, which keeps latency under 800ms end-to-end including the Nova Micro cleanup pass.
Telegram Voice Integration
Hermes, the homelab AI agent running on the VPS, receives voice messages natively via Telegram. When a voice message is sent to the bot, the gateway downloads the OGG/Opus file, transcribes it using the configured STT backend, and injects the result as text into the agent conversation.
The STT backend for Hermes is configured to use Groq’s cloud transcription API (free tier), which handles the OGG format that Telegram produces natively and returns results faster than local faster-whisper on the VPS.
For outgoing voice, the agent can respond as a native Telegram voice bubble. Edge TTS handles synthesis with ffmpeg converting output to Opus format for Telegram compatibility.
The combined result: send a voice message from the phone, receive a spoken reply. The entire round-trip — Telegram → VPS → Groq STT → DeepSeek V4-Flash → Edge TTS → Telegram voice bubble — completes in under six seconds for typical queries.
Stack
| Component | Technology |
|---|---|
| Core STT engine | faster-whisper (base model) |
| Transcript cleanup | AWS Bedrock Nova Micro |
| Container runtime | Docker on CT120 (LXC, Proxmox) |
| Remote access | Cloudflare Zero Trust Service Auth |
| Windows daemon | Python — sounddevice, keyboard, requests, pyperclip |
| Startup | Windows Task Scheduler, Run with highest privileges |
| Telegram STT | Groq cloud transcription API |
| Telegram TTS | Edge TTS + ffmpeg (Opus for voice bubbles) |
| Agent platform | Hermes on VPS, DeepSeek V4-Flash |
Cost
The whisper-web container itself has zero ongoing cost — faster-whisper runs on CPU with no GPU required, and the container consumes approximately 7MB RAM at idle. The AWS Bedrock Nova Micro cleanup pass costs fractions of a cent per transcript at typical usage volumes.
Groq’s free tier handles Telegram voice transcription within its rate limits without cost. Total monthly spend attributable to this project: effectively £0.
What This Demonstrates
Self-hosted AI infrastructure is only valuable if it integrates into actual workflows. This project documents the full path from container deployment through to daily use — including the security reasoning behind self-hosting, the engineering decisions around hotkey selection and audio trimming, and the architectural choices that keep latency acceptable without sacrificing data sovereignty.
The voice daemon in particular is a practical example of system integration work: understanding how global hotkeys interact with Windows privilege levels, how audio sampling rates affect transcription quality, and how to make a technical tool feel natural enough to actually use every day.