Building a Self-Hosted Voice-to-Text Pipeline with Whisper and AWS Bedrock
There is a small but persistent friction point in my workflow: I think faster than I type. When I am working through a problem or drafting something, the gap between thought and text slows everything down. I wanted to be able to speak, get clean written text back, and carry on without sending audio to a third-party API or paying per minute.
This is how I built it.
The Stack faster-whisper - a CTranslate2-based reimplementation of OpenAI’s Whisper. No PyTorch dependency, significantly smaller image footprint (~500MB vs ~2GB), and genuinely faster on CPU. FastAPI - the backend, handling audio uploads and browser recording. AWS Bedrock Nova Micro - an optional cleanup pass that fixes punctuation, capitalisation, and removes filler words from the raw transcript. Docker on Proxmox LXC - containerised on my existing Docker host alongside Pierre and Sablier. Cloudflare tunnel - HTTPS without a reverse proxy, accessible from desktop and mobile. Why Not the OpenAI Whisper API? The hosted API is fine for occasional use but charges per minute of audio. More importantly, audio stays local with a self-hosted solution. Everything runs on my own infrastructure - the only external call is the optional Bedrock cleanup step, which sends only the raw text transcript, not the audio itself.