Backend · Post-mortem · Long read · From production

FastAPI + WebSockets: a speech pipeline post-mortem

Whisper, asyncio, and the art of not dropping a syllable when the network blinks. What we shipped, what we'd undo, and what stays.

Nitin Negi (Ice Bear)

Software Engineer · Sonoka.asia

Apr 24, 2026 · 3 min read · 9.6k reads

A streaming speech-to-text pipeline that doesn't drop syllables sounds like one job. It's actually three: capture, transport, transcribe. Each one breaks in a different way, and the user only sees the final tape.

Streaming voice, the hard way

We needed live captions in a meeting app. Latency budget: under 500 ms from speech to text on screen. We had Whisper, FastAPI, and a WebSocket between them. The first prototype worked in the lab and fell over the second a real network got in the way.

Why FastAPI for audio

FastAPI ships async-first WebSocket support that doesn't fight Python's asyncio runtime. Tornado and Sanic are also fine choices. We picked FastAPI because the rest of the codebase already lived there and "one less stack" is its own feature.

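import asyncio
from fastapi import FastAPI, WebSocket
app = FastAPI()
transcribe_queue: asyncio.Queue = asyncio.Queue()

@app.websocket("/audio")
async def audio_socket(ws: WebSocket):
    await ws.accept()
    async for chunk in audio_stream(ws):  # audio_stream: helper, not shown here
        await transcribe_queue.put(chunk)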

That's the entire transport layer. Backpressure, errors, and reconnects are not in this code block — that's where most of the bug fixes lived.
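
The handler iterates over audio_stream(ws) without showing it. Here is a minimal sketch of what such a helper could look like, not the version we shipped. It assumes the socket exposes an awaitable receive_bytes() (Starlette's WebSocket does) and that a dropped client surfaces as an exception. The disconnect_exc parameter, the ByteSocket protocol, and the FakeSocket stand-in are hypothetical names added for illustration.

```python
import asyncio
from typing import AsyncIterator, Protocol


class ByteSocket(Protocol):
    """Anything with an awaitable receive_bytes(), e.g. a Starlette WebSocket."""

    async def receive_bytes(self) -> bytes: ...


async def audio_stream(
    ws: ByteSocket,
    disconnect_exc: type[BaseException] = ConnectionError,
) -> AsyncIterator[bytes]:
    """Yield binary frames until the peer goes away (hypothetical helper)."""
    try:
        while True:
            yield await ws.receive_bytes()
    except disconnect_exc:
        # A disconnect ends the stream instead of crashing the handler.
        return


class FakeSocket:
    """Stand-in for tests: serves queued frames, then raises ConnectionError."""

    def __init__(self, frames: list[bytes]) -> None:
        self._frames = list(frames)

    async def receive_bytes(self) -> bytes:
        if not self._frames:
            raise ConnectionError("peer closed")
        return self._frames.pop(0)


async def collect(frames: list[bytes]) -> list[bytes]:
    """Drain audio_stream over a FakeSocket and return what it yielded."""
    return [chunk async for chunk in audio_stream(FakeSocket(frames))]
```

Inside a FastAPI handler you would likely pass disconnect_exc=WebSocketDisconnect (importable from fastapi), since that is what receive_bytes() raises when the client leaves.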

Chunking, but for waveforms

Whisper wants 30-second windows. The user wants real-time results. The compromise is overlapping windows — a 5-second decode pass over the most recent 30 seconds of audio, fired every 1.5 seconds. The overlap means we re-decode some audio every tick, which feels wasteful until you measure it. Whisper-small on a T4 happily eats that load.
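
The rolling window can be sketched as a small buffer that forgets audio older than 30 seconds and signals a decode every 1.5 seconds of received audio. This is an illustration under stated assumptions, not our production code: it assumes 16-bit mono PCM at 16 kHz on the wire, counts ticks in audio time rather than wall-clock time, and the RollingWindow name and its methods are hypothetical.

```python
from dataclasses import dataclass, field

SAMPLE_RATE = 16_000   # Hz; the rate Whisper expects
BYTES_PER_SAMPLE = 2   # assumption: 16-bit mono PCM


@dataclass
class RollingWindow:
    window_s: float = 30.0  # from the post: decode the most recent 30 s
    tick_s: float = 1.5     # from the post: fire a decode every 1.5 s
    _buf: bytearray = field(default_factory=bytearray, repr=False)
    _since_tick_s: float = 0.0

    def push(self, pcm: bytes) -> bool:
        """Append audio; return True when a decode pass is due."""
        self._buf.extend(pcm)
        limit = int(self.window_s * SAMPLE_RATE) * BYTES_PER_SAMPLE
        if len(self._buf) > limit:
            # Forget everything older than the window.
            del self._buf[: len(self._buf) - limit]
        self._since_tick_s += len(pcm) / (SAMPLE_RATE * BYTES_PER_SAMPLE)
        if self._since_tick_s >= self.tick_s:
            self._since_tick_s -= self.tick_s
            return True
        return False

    def snapshot(self) -> bytes:
        """Copy of the current window, ready to hand to the decoder."""
        return bytes(self._buf)


def demo() -> tuple[list[bool], int]:
    """Push 35 s of silence in 0.5 s chunks; report ticks and window size."""
    window = RollingWindow()
    half_second = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE // 2)
    ticks = [window.push(half_second) for _ in range(70)]
    return ticks, len(window.snapshot())
```

Whenever push() returns True, the caller would hand snapshot() to the transcriber. Because consecutive snapshots share most of their audio, the same speech gets re-decoded on every tick, which is the overlap described above.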

Backpressure isn't a metaphor

A WebSocket on a wonky network will silently buffer megabytes before it crashes the event loop. We added a hard cap: if the outbound queue exceeds 50 chunks, drop the oldest and emit a backpressure event. The UI shows a yellow dot. Users tolerate yellow; they hate stale captions.
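
A drop-oldest cap is only a few lines on top of asyncio.Queue. The sketch below assumes the cap is enforced with a bounded queue of 50 entries. put_drop_oldest and demo are hypothetical names, and the shape of the backpressure event itself is left to the caller.

```python
import asyncio

MAX_CHUNKS = 50  # from the post: the hard cap on queued chunks


def put_drop_oldest(queue: asyncio.Queue, chunk: bytes) -> bool:
    """Enqueue without blocking; return True if an old chunk was dropped."""
    dropped = False
    if queue.full():
        queue.get_nowait()  # discard the oldest chunk to make room
        dropped = True
    queue.put_nowait(chunk)
    return dropped


async def demo() -> tuple[int, int, bytes]:
    """Push 60 one-byte chunks through a 50-slot queue."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=MAX_CHUNKS)
    drops = sum(put_drop_oldest(queue, bytes([i])) for i in range(60))
    return drops, queue.qsize(), queue.get_nowait()
```

When put_drop_oldest returns True, the sender would also push a backpressure frame to the client so the UI can show its yellow dot. Dropping from the head keeps the newest audio, which matches the preference for fresh captions over complete ones.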

The reconnect dance

Real networks die. The trick is making the reconnect invisible. We assign every session a UUID, send a session_resume frame on reconnect, and replay the last 5 seconds of audio. The transcript stitches itself back together with no visible seam.
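
On the sending side, the reconnect logic comes down to remembering the last few seconds of audio already sent. The sketch below is one way to do that, written in Python for consistency with the server code even though a real client may live in a browser. The post only fixes the session UUID, the session_resume frame, and the 5-second replay. The JSON field names ("type", "session_id") and the ReplayBuffer class are assumptions.

```python
import json
import uuid
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ReplayBuffer:
    replay_s: float = 5.0  # from the post: replay the last 5 s of audio
    session_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    _chunks: deque = field(default_factory=deque, repr=False)
    _held_s: float = 0.0

    def record(self, chunk: bytes, duration_s: float) -> None:
        """Remember a chunk that was just sent, trimming anything too old."""
        self._chunks.append((chunk, duration_s))
        self._held_s += duration_s
        # Drop the oldest chunk only while the rest still covers replay_s.
        while self._held_s - self._chunks[0][1] >= self.replay_s:
            _, old_duration = self._chunks.popleft()
            self._held_s -= old_duration

    def resume_frames(self) -> list:
        """Frames to send after reconnecting: a resume header, then audio."""
        header = json.dumps(
            {"type": "session_resume", "session_id": self.session_id}
        )
        return [header, *(chunk for chunk, _ in self._chunks)]


def demo() -> list:
    """Record 10 s of audio in 0.5 s chunks, then build the resume frames."""
    buf = ReplayBuffer()
    for i in range(20):
        buf.record(bytes([i]), 0.5)
    return buf.resume_frames()
```

The server would use the session_id to find the existing transcript and treat the replayed audio as an overlap with what it already decoded, rather than as new speech.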

What it cost to ship

Whisper-small on a single T4: $0.42 / hour of decoded audio. Storage is negligible. The expensive part was the developer time spent learning that asyncio queues are not magic.

Six lessons, taped to the fridge

Measure the network, not your laptop. The lab lies.
Backpressure is a UI feature. Tell the user when you drop frames.
Reconnect with state. Stateless retries always lose the last syllable.
Whisper-small ships. Don't reach for the big model first.
Overlap your windows. Real-time is a series of small re-decodes.
Pin your runtime. Async + Whisper + cuDNN has more moving parts than you think.
