How to Transcribe Live Audio in Real-Time with Python, WebSockets, and Whisper
Learn how to stream live audio from the browser, filter silence with VAD, and transcribe speech in real-time using FastAPI, WebSockets, and OpenAI Whisper.
Ideation
Hoodie on. AirPods in. Zoned out building my next thing.
The cost: missed exam dates, missed deadlines, getting called out with no idea what was happening.
So I built ClassRec — live lecture transcription that alerts you when your professor mentions anything important.
Building it taught me more than any lecture did. Here’s the technical breakdown.
1. Capture Raw Audio from the Browser
Everything starts with the browser’s MediaStream API. We grab the mic and get raw PCM samples — 32-bit floats at 16kHz.
// Get mic access immediately on button click
// Mobile browsers block getUserMedia() if called after any async work
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioCtx = new AudioContext({ sampleRate: 16000 });
const source = audioCtx.createMediaStreamSource(stream);

// ScriptProcessorNode delivers raw Float32 samples in fixed-size buffers
const processor = audioCtx.createScriptProcessor(4096, 1, 1);
source.connect(processor);
processor.connect(audioCtx.destination); // some browsers won't fire onaudioprocess otherwise

processor.onaudioprocess = (e) => {
  const float32 = e.inputBuffer.getChannelData(0);
  // Whisper expects PCM16 — convert before sending
  const int16 = convertFloat32ToInt16(float32);
  socket.send(int16.buffer);
};
One real lesson: request mic access immediately after the user gesture. Mobile browsers enforce this hard: if any async work runs between the click and getUserMedia(), the browser silently blocks you.
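The convertFloat32ToInt16 helper used above isn't shown. Here's the same clamp-and-scale step sketched in Python (useful if you'd rather convert on the server; the function name is my own, not the project's):

```python
import numpy as np

def float32_to_int16(float32: np.ndarray) -> np.ndarray:
    # Clamp to [-1, 1] first: overdriven mic samples can exceed the nominal range
    clipped = np.clip(float32, -1.0, 1.0)
    # Scale to the signed 16-bit range PCM16 expects
    return (clipped * 32767).astype(np.int16)
```

The clamp matters: without it, a sample of 1.2 would scale past 32767 and wrap around to a large negative value, producing an audible click.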
2. Stream over WebSocket
We open a WebSocket to the FastAPI backend and stream raw PCM bytes as binary frames. No HTTP overhead, no delay — a continuous pipe of audio.
@app.websocket("/ws/transcribe")
async def websocket_transcribe(websocket: WebSocket):
    await websocket.accept()
    buffer = bytearray()
    async for data in websocket.iter_bytes():
        buffer.extend(data)
        if len(buffer) >= CHUNK_SIZE_BYTES:
            chunk = bytes(buffer[:CHUNK_SIZE_BYTES])
            buffer = buffer[CHUNK_SIZE_BYTES:]
            # transcribe_chunk handles silence filtering + Whisper — covered below
            asyncio.create_task(transcribe_chunk(chunk, websocket))
A chunk size of 2–3 seconds is the sweet spot — long enough for Whisper to have context, short enough to feel real-time.
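That sweet spot maps directly onto CHUNK_SIZE_BYTES. At 16 kHz mono PCM16, one second of audio is 16,000 samples at 2 bytes each; assuming a 2.5-second chunk (my midpoint of the 2–3 s range):

```python
SAMPLE_RATE = 16_000    # Hz, matches the AudioContext in section 1
BYTES_PER_SAMPLE = 2    # PCM16: 2 bytes per sample
CHUNK_SECONDS = 2.5     # assumed midpoint of the 2-3 s sweet spot

CHUNK_SIZE_BYTES = int(SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_SECONDS)
print(CHUNK_SIZE_BYTES)  # 80000 bytes per Whisper call
```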
3. Calling Whisper for Live Transcription
Whisper doesn’t accept raw PCM — it needs a proper WAV file. So we convert first, then hit the API.
def call_whisper(wav_file) -> str:
    # A prompt primes Whisper with context — reduces hallucinations on noisy audio
    response = openai_client.audio.transcriptions.create(
        model="whisper-1",
        file=wav_file,
        prompt="Lecture transcription. Listen for exam dates, assignments, and deadlines."
    )
    return response.text
One thing most tutorials skip: passing a prompt to Whisper significantly reduces hallucinations. It primes the model to expect lecture-style speech — without it, Whisper will confidently transcribe background noise as words.
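The PCM-to-WAV conversion itself isn't shown above. A minimal in-memory sketch with the stdlib wave module (pcm_to_wav is a name I'm assuming, not the project's actual helper):

```python
import io
import wave

def pcm_to_wav(pcm_bytes: bytes, sample_rate: int = 16_000) -> io.BytesIO:
    # Wrap raw PCM16 in a WAV header so Whisper can identify the format
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)          # mono
        wf.setsampwidth(2)          # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
    buf.seek(0)
    buf.name = "chunk.wav"  # the OpenAI client infers the file type from the name
    return buf
```

Keeping everything in memory avoids temp-file I/O on every 2–3 second chunk.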
4. The Silence Problem
Sending every chunk to Whisper is expensive. A 90-minute lecture is mostly pauses, writing on the board, shuffling. We don’t want to pay for that.
First attempt: RMS threshold. Measure loudness, skip Whisper if quiet.
def is_silent_rms(pcm_bytes: bytes) -> bool:
    samples = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0
    rms = np.sqrt(np.mean(samples ** 2))
    return rms < SILENCE_THRESHOLD
Worked — until it didn’t.
The problem: Background noise (HVAC, hallway chatter, chairs scraping) easily surpasses the RMS threshold. Whisper sees noisy-but-speechless audio and hallucinates — confidently transcribing words that were never said.
RMS measures loudness, not speech. We needed something smarter.
5. Voice Activity Detection — The Real Fix
The solution: VAD (Voice Activity Detection) — a model that specifically asks: “Is there a human voice in this audio?”
We use Silero VAD for two reasons:
- It’s tiny and fast
- It exports to ONNX — no PyTorch runtime needed on the server
Why ONNX over PyTorch?
- PyTorch: ~500MB+ (full training framework — overkill for inference)
- onnxruntime: ~15MB (lean inference engine — runs .onnx model files only)
On a student budget server, this matters.
6. How Silero VAD Works
Silero VAD uses an LSTM under the hood. It takes a 512-sample audio window (~32ms) and returns a speech probability score between 0 and 1.
Why LSTM? Audio is temporal — whether something is speech depends on what came before it. The rhythm, pitch, phoneme transitions. LSTM carries hidden state (h) and cell state (c) across windows so the model remembers recent context.
def contains_speech(pcm_bytes: bytes) -> bool:
    samples = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0
    # h = hidden state, c = cell state — carry forward across every window
    h = np.zeros((2, 1, 64), dtype=np.float32)
    c = np.zeros((2, 1, 64), dtype=np.float32)
    for i in range(0, len(samples), WINDOW_SIZE):
        window = samples[i : i + WINDOW_SIZE]
        score, h, c = run_vad_window(window, h, c)
        if score > VAD_THRESHOLD:
            return True  # speech confirmed — exit early
    return False
Carry LSTM state forward. h and c from each window feed into the next. Drop them and the model loses temporal context — it can no longer tell the difference between a cough and a word.
Early exit. Once any window crosses the threshold, speech is confirmed. No need to score the rest of the chunk.
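run_vad_window is referenced above but never defined. Here is a sketch using onnxruntime, assuming the pre-v5 silero_vad.onnx whose input tensors are named input, sr, h, and c (verify the names against your model file; I also pass the session explicitly rather than using a module-level global):

```python
import numpy as np
# import onnxruntime as ort
# session = ort.InferenceSession("silero_vad.onnx")  # model path is an assumption

WINDOW_SIZE = 512  # ~32 ms at 16 kHz

def run_vad_window(window: np.ndarray, h: np.ndarray, c: np.ndarray, session):
    # The final window of a chunk may be short — zero-pad it to 512 samples
    if len(window) < WINDOW_SIZE:
        window = np.pad(window, (0, WINDOW_SIZE - len(window)))
    out, hn, cn = session.run(None, {
        "input": window[np.newaxis, :].astype(np.float32),  # shape (1, 512)
        "sr": np.array(16_000, dtype=np.int64),
        "h": h,
        "c": c,
    })
    return float(out[0][0]), hn, cn
```

Returning hn and cn is what lets contains_speech thread the LSTM state through the loop.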
7. Wire It All Together
The Full Picture
References
Live: https://www.classrec.com
Full code: https://github.com/codereyinish/ClassRec