Building a Self-Hosted Voice-to-Text Pipeline with Whisper and AWS Bedrock

There is a small but persistent friction point in my workflow: I think faster than I type. When I am working through a problem or drafting something, the gap between thought and text slows everything down. I wanted to be able to speak, get clean written text back, and carry on without sending audio to a third-party API or paying per minute.

This is how I built it.

The Stack

faster-whisper - a CTranslate2-based reimplementation of OpenAI’s Whisper. No PyTorch dependency, significantly smaller image footprint (~500MB vs ~2GB), and genuinely faster on CPU.
FastAPI - the backend, handling audio uploads and browser recording.
AWS Bedrock Nova Micro - an optional cleanup pass that fixes punctuation, capitalisation, and removes filler words from the raw transcript.
Docker on Proxmox LXC - containerised on my existing Docker host alongside Pierre and Sablier.
Cloudflare tunnel - HTTPS without a reverse proxy, accessible from desktop and mobile.

Why Not the OpenAI Whisper API?

The hosted API is fine for occasional use but charges per minute of audio. More importantly, audio stays local with a self-hosted solution. Everything runs on my own infrastructure - the only external call is the optional Bedrock cleanup step, which sends only the raw text transcript, not the audio itself.

The Two-Stage Pipeline

The pipeline is intentionally simple.

Stage one - transcription. The browser either records audio directly via the MediaRecorder API or accepts a file upload (MP3, MP4, WAV, OGG, WebM). The audio is sent to the FastAPI endpoint, written to a temporary file, and passed to faster-whisper using the base model with int8 quantisation. The model runs on CPU - no GPU required.

Whisper natively drops filler sounds (um, uh, er) during transcription. This alone makes the raw output significantly cleaner than a human transcript.

Stage two - cleanup (optional toggle). The raw transcript is sent to AWS Bedrock Nova Micro with a tightly scoped prompt: fix punctuation and capitalisation, remove remaining filler words, make it read naturally as written text. The model returns only the cleaned text - no commentary, no additions.

The toggle matters. For quick voice notes you want the raw transcript immediately. For anything you plan to publish or share, the cleanup pass is worth the half-second wait.

The Interface

A single-page web UI with a large record button, a file upload area with drag-and-drop, and a toggle for the cleanup step. When transcription completes, the text appears in an editable field with a one-click copy button.

The design goal was: open the page, speak, copy, close. No accounts, no settings to configure, no friction.

What Whisper Filters Automatically

Something I did not fully appreciate until testing: Whisper’s training data means it treats certain sounds as non-speech and silently discards them. Filler sounds, false starts, and background noise are largely handled at the transcription layer before Bedrock sees anything. The cleanup step then handles the structural editing - sentence boundaries, punctuation, word choice.

The combination is genuinely good. Speaking naturally and receiving publication-ready text back is the goal, and it is mostly there with the base model.

Practical Use Cases

Dictation for long-form writing. Thinking out loud is faster than typing for drafts. Speak the structure, clean it up, edit from there.

Meeting and call transcription. Record a call to MP3, upload the file, get a transcript. Useful for reviewing interviews or any call you need to reference later.

Voice memos to text. WhatsApp voice messages, phone recordings, any audio file - drop it in, get text back.

n8n integration. The /transcribe endpoint is a plain HTTP POST. Any n8n workflow can call it. A voice memo landing on a network share could trigger automatic transcription and delivery - the building block is there.

Mobile access. Because the service sits behind a Cloudflare tunnel with HTTPS, the browser microphone works on mobile without any configuration. The same interface that works on desktop works on a phone.

Build Notes

A few things worth knowing if you want to replicate this.

Browsers require HTTPS for microphone access. Plain HTTP over a local network IP will be refused. The Cloudflare tunnel handles this automatically.

faster-whisper vs openai-whisper. The original library pulls in PyTorch, which adds roughly 1.5GB to the image. faster-whisper uses CTranslate2 instead and produces the same output. For a CPU-only homelab container the size difference is significant.

Model size tradeoff. The base model (~142MB) is accurate enough for clear speech and runs in a few seconds per recording on CPU. The small model (~466MB) handles accents and background noise noticeably better but is slower. Worth testing both if accuracy matters more than speed for your use case.

Model is baked into the image. The Dockerfile runs the model download during build, not at runtime. Container startup is fast and there is no network dependency after the image is built.

The Repository

The full source - Dockerfile, docker-compose, FastAPI backend, and frontend - is on GitHub at github.com/mrapierre/whisper-web .

The setup is deliberately minimal. Four files, one Docker command to build, one to run.