Whisper Voice Infrastructure

Overview

A self-hosted speech-to-text platform that started as a containerised API on the homelab and grew into a voice-first input layer spanning every application on the desktop, a Telegram bot, and a dedicated Android workflow. The system uses faster-whisper with an OpenAI-compatible REST API, AWS Bedrock Nova Micro for transcript cleanup, and Cloudflare Zero Trust Service Auth for secure remote access.

This project demonstrates the difference between deploying a tool and actually integrating it into how you work — the container was the starting point, not the outcome.

The Problem

Voice input on most platforms is either locked to a single application or depends on third-party services that process audio on external servers. For someone with active security clearance, routing audio through OpenAI’s Whisper API or a Chrome extension is not an acceptable trade-off — regardless of what those providers claim about data retention.

The goal was a fully self-hosted, always-available voice input layer that works in any application, with all audio processed locally on infrastructure under my control.

Architecture

CT120 — Docker container (whisper-web)

The core service runs as a Docker container on CT120 alongside the rest of the homelab Docker stack. It exposes a POST /transcribe endpoint accepting audio files and returning transcribed text. The API key is managed via environment variable and enforced on every request.

The container mounts main.py as a volume from the host, allowing configuration changes without rebuilding the image. AWS Bedrock Nova Micro is wired in as an optional cleanup pass — when cleanup=true is passed, the raw Whisper transcript is post-processed to strip filler words, fix punctuation, and normalise sentence structure before returning.

Access is secured via Cloudflare Zero Trust Service Auth at whisper.sevenbirches.co.uk. The browser-facing microphone endpoint requires HTTPS, which the Cloudflare tunnel provides automatically.

Windows Voice Daemon

The most significant extension was a system-wide voice input daemon running on Windows. Rather than integrating voice into a single application, the daemon listens for a global hotkey and injects transcribed text into whatever window currently has focus — Claude.ai, VS Code, Word, Outlook, a terminal, anything.

How it works:

Press the hotkey (Scroll Lock — completely unused by any application)
Recording starts via sounddevice at 16kHz
Press again to stop — audio is written to a temporary WAV file
The file is POSTed to the local Whisper API with cleanup=true
The transcript is written to the clipboard and pasted into the active window via pyperclip + pyautogui

The daemon is registered as a Windows Task Scheduler entry with Run with highest privileges (required for global hotkey capture), triggered at logon with a 15-second delay. It survives reboots transparently.

The first iteration used Ctrl+Shift+Space as the hotkey, which introduced a click artefact at the start of every recording. This was resolved by trimming the first 0.3 seconds of audio before transcription. Subsequent iteration moved to Scroll Lock as a single-key trigger — no modifier chord, no conflicts, no artefacts.

The audio posts directly to http://192.168.55.120:9000/transcribe on the LAN — no Cloudflare hop needed from home, which keeps latency under 800ms end-to-end including the Nova Micro cleanup pass.

Telegram Voice Integration

Hermes, the homelab AI agent running on the VPS, receives voice messages natively via Telegram. When a voice message is sent to the bot, the gateway downloads the OGG/Opus file, transcribes it using the configured STT backend, and injects the result as text into the agent conversation.

The STT backend for Hermes is configured to use Groq’s cloud transcription API (free tier), which handles the OGG format that Telegram produces natively and returns results faster than local faster-whisper on the VPS.

For outgoing voice, the agent can respond as a native Telegram voice bubble. Edge TTS handles synthesis with ffmpeg converting output to Opus format for Telegram compatibility.

The combined result: send a voice message from the phone, receive a spoken reply. The entire round-trip — Telegram → VPS → Groq STT → DeepSeek V4-Flash → Edge TTS → Telegram voice bubble — completes in under six seconds for typical queries.

Stack

Component	Technology
Core STT engine	faster-whisper (base model)
Transcript cleanup	AWS Bedrock Nova Micro
Container runtime	Docker on CT120 (LXC, Proxmox)
Remote access	Cloudflare Zero Trust Service Auth
Windows daemon	Python — `sounddevice`, `keyboard`, `requests`, `pyperclip`
Startup	Windows Task Scheduler, Run with highest privileges
Telegram STT	Groq cloud transcription API
Telegram TTS	Edge TTS + ffmpeg (Opus for voice bubbles)
Agent platform	Hermes on VPS, DeepSeek V4-Flash

Cost

The whisper-web container itself has zero ongoing cost — faster-whisper runs on CPU with no GPU required, and the container consumes approximately 7MB RAM at idle. The AWS Bedrock Nova Micro cleanup pass costs fractions of a cent per transcript at typical usage volumes.

Groq’s free tier handles Telegram voice transcription within its rate limits without cost. Total monthly spend attributable to this project: effectively £0.

What This Demonstrates

Self-hosted AI infrastructure is only valuable if it integrates into actual workflows. This project documents the full path from container deployment through to daily use — including the security reasoning behind self-hosting, the engineering decisions around hotkey selection and audio trimming, and the architectural choices that keep latency acceptable without sacrificing data sovereignty.

The voice daemon in particular is a practical example of system integration work: understanding how global hotkeys interact with Windows privilege levels, how audio sampling rates affect transcription quality, and how to make a technical tool feel natural enough to actually use every day.