opensourceprojects.dev

A broadsheet for software that doesn't ask for your email

WhisperX: 70x realtime transcription with word-level timestamps and speaker diar...


Project Description


WhisperX: Real-Time Transcription at 70x Speed with Speaker Labels

Ever needed to transcribe a long meeting or podcast and found yourself waiting forever? Or tried to figure out who said what in a transcript and got frustrated because there were no speaker labels?

WhisperX is an open-source tool that tackles both problems at once. It builds on OpenAI’s Whisper model and adds word-level timestamps, speaker diarization (who spoke when), and batched inference that the project reports at up to 70x real-time on a decent GPU.

If you’ve ever wanted to process hours of audio in minutes with clean, speaker-attributed text, this is worth a look.

What It Does

WhisperX takes an audio file (or a batch of files) and produces a full transcript with:

  • Word-level timestamps (every word gets a start and end time)
  • Speaker diarization (labels like [SPEAKER_00], [SPEAKER_01])
  • Support for multiple languages via the Whisper model family
  • Batch processing for speed

Under the hood, it uses Whisper for the initial transcription, then aligns the text to the audio with a separate alignment model (wav2vec2 or similar) to get precise word boundaries. For speaker diarization, it integrates a pretrained speaker embedding model and clustering algorithm. The result is a single pipeline that gives you both high-quality transcription and speaker identity.
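To make the last step of that pipeline concrete, here is a minimal sketch of one way to combine word timestamps with diarization turns: give each word the speaker whose turn overlaps it the most. This is an illustration under assumptions, not WhisperX’s actual implementation; the function name `assign_speakers`, the data layout, and the sample values are all hypothetical.

```python
from collections import defaultdict


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most.

    words: list of dicts with "word", "start", "end" (seconds)
    turns: list of (start, end, speaker) tuples from a diarizer
    Words that overlap no turn get speaker None.
    """
    labeled = []
    for w in words:
        overlap = defaultdict(float)
        for t_start, t_end, speaker in turns:
            # Length of the intersection between the word span and the turn span.
            shared = min(w["end"], t_end) - max(w["start"], t_start)
            if shared > 0:
                overlap[speaker] += shared
        speaker = max(overlap, key=overlap.get) if overlap else None
        labeled.append({**w, "speaker": speaker})
    return labeled


# Hypothetical sample data, not real WhisperX output.
words = [
    {"word": "Hello", "start": 0.10, "end": 0.45},
    {"word": "there", "start": 0.50, "end": 0.80},
    {"word": "Hi", "start": 1.20, "end": 1.40},
]
turns = [(0.0, 1.0, "SPEAKER_00"), (1.0, 2.0, "SPEAKER_01")]

labeled = assign_speakers(words, turns)
for w in labeled:
    print(w["speaker"], w["word"])
```

Picking the speaker by maximum overlap also hints at why overlapping speech is hard: a word spoken while two people talk can only receive one label in this scheme.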

Why It’s Cool

The main appeal is speed. The 70x real-time claim isn’t marketing fluff: it comes from a pipeline that is optimized for GPU inference and batches multiple audio segments. At that rate, the transcription pass over a 10-minute recording takes roughly 9 seconds (600 s / 70), with alignment and diarization adding some time on top. That’s genuinely useful for day-to-day work.
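For a rough sense of scale, the arithmetic behind a real-time factor is just audio duration divided by the speedup. The sketch below assumes the full 70x is reached and covers transcription only; real throughput depends on the GPU, model size, and batch size, and `estimated_seconds` is a hypothetical helper name.

```python
def estimated_seconds(audio_minutes, speedup=70.0):
    """Estimate transcription wall-clock time for a given real-time factor."""
    return audio_minutes * 60.0 / speedup


for minutes in (10, 60, 180):
    secs = estimated_seconds(minutes)
    print(f"{minutes:>3} min of audio: about {secs:.0f} s of transcription")
```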

Another neat detail: the word-level timestamps are really accurate. Whisper’s own timestamps are segment-level by default and can be off by several seconds. Here, every word is pinned to its position in the audio, so a player or editor built on the output can jump from any word in the transcript to that exact moment. That’s gold for editing or review.
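As a sketch of what word-level timestamps enable, the helpers below look up the word spoken at a given playback time and find where a word first occurs. The word list is hypothetical sample data shaped like aligned output (dicts with "word", "start", "end"), and `timed_words`, `word_at`, and `first_occurrence` are illustrative names. Words that could not be aligned, such as some numerals, may lack timings, so they are filtered out first.

```python
from bisect import bisect_right


def timed_words(words):
    """Keep only words that carry both a start and an end time."""
    return [w for w in words if "start" in w and "end" in w]


def word_at(words, t):
    """Return the word being spoken at time t (seconds), or None in a gap."""
    timed = timed_words(words)  # assumed to be sorted by start time
    starts = [w["start"] for w in timed]
    i = bisect_right(starts, t) - 1  # last word starting at or before t
    if i >= 0 and t <= timed[i]["end"]:
        return timed[i]
    return None


def first_occurrence(words, query):
    """Return the start time of the first timed word matching query."""
    q = query.lower()
    for w in timed_words(words):
        if w["word"].strip(".,!?;:").lower() == q:
            return w["start"]
    return None


# Hypothetical sample data, not real WhisperX output.
words = [
    {"word": "The", "start": 0.10, "end": 0.25},
    {"word": "budget", "start": 0.30, "end": 0.70},
    {"word": "for", "start": 0.75, "end": 0.85},
    {"word": "2024"},  # no timing: could not be aligned
    {"word": "passed.", "start": 1.60, "end": 2.00},
]

print(word_at(words, 0.5))
print(first_occurrence(words, "passed"))
```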

Speaker diarization is handled surprisingly well for an open-source tool. It’s not perfect (overlapping speech still confuses it), but for clear recordings with a few speakers, it’s damn good. It relies on pretrained pyannote-audio models, which are gated on Hugging Face, so you need to accept their terms and supply an access token before diarization will run.

Finally, WhisperX supports multiple Whisper model sizes (tiny, base, small, medium, large-v2, large-v3). You trade speed for accuracy depending on your needs. The smaller models can run on CPU if you’re patient.

How to Try It

You can install it via pip and run it in minutes. Here’s the quick start:

pip install whisperx

Then in a Python script:

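import os

import whisperx

device = "cuda"  # on "cpu", also pass compute_type="int8" to load_model

# Transcribe with Whisper (batched inference)
model = whisperx.load_model("base", device=device)
audio = whisperx.load_audio("meeting.wav")
result = model.transcribe(audio)

# Align for word-level timestamps, using the language Whisper detected
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device=device)

# Diarize. The pyannote models are gated, so pass a Hugging Face read token
# (HF_TOKEN is an assumed environment variable name). Depending on the
# installed version, the class may live at whisperx.diarize.DiarizationPipeline.
diarize_model = whisperx.DiarizationPipeline(use_auth_token=os.environ["HF_TOKEN"], device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)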

That’s it. The output is a dictionary with segments, each containing words with timestamps and speaker labels.
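To show what you might do with that dictionary, here is a minimal sketch that turns it into a readable, speaker-attributed transcript by merging consecutive segments from the same speaker. It assumes each segment carries "start", "text", and a "speaker" key; the sample dictionary is hypothetical, and `format_transcript` and `mmss` are illustrative helpers, not part of WhisperX.

```python
def mmss(seconds):
    """Format a time in seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_transcript(result):
    """Merge consecutive same-speaker segments into one line per turn."""
    lines = []
    speaker, start, texts = None, None, []
    for seg in result["segments"]:
        seg_speaker = seg.get("speaker", "UNKNOWN")  # unlabeled segments
        if seg_speaker != speaker:
            if texts:
                lines.append(f"[{mmss(start)}] {speaker}: {' '.join(texts)}")
            speaker, start, texts = seg_speaker, seg["start"], []
        texts.append(seg["text"].strip())
    if texts:  # flush the final turn
        lines.append(f"[{mmss(start)}] {speaker}: {' '.join(texts)}")
    return lines


# Hypothetical sample shaped like the pipeline output described above.
result = {
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " Morning, everyone.", "speaker": "SPEAKER_00"},
        {"start": 2.5, "end": 5.0, "text": " Let's get started.", "speaker": "SPEAKER_00"},
        {"start": 5.2, "end": 7.0, "text": " Sounds good.", "speaker": "SPEAKER_01"},
        {"start": 65.0, "end": 68.0, "text": " First item is the budget.", "speaker": "SPEAKER_00"},
    ]
}

lines = format_transcript(result)
print("\n".join(lines))
```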

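Or if you want to skip all that, you can use the command-line tool (diarization needs your Hugging Face access token, passed with --hf_token):

whisperx meeting.wav --model small --diarize --hf_token YOUR_HF_TOKEN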

The GitHub repo has detailed instructions and a Colab notebook if you want to try it without installing anything.

Final Thoughts

WhisperX is one of those tools that makes you wonder why it didn’t exist sooner. The combination of speed, accuracy, and speaker separation is genuinely useful for developers, journalists, podcasters, or anyone dealing with audio content. It’s not perfect—diarization can still mess up in noisy environments, and you need a decent GPU for the fast speeds—but the open-source nature means the community keeps improving it.

If you’re building anything that involves transcription, meeting notes, or voice interfaces, this project is worth your time. It’s practical, well-documented, and works right now.


Follow us at [@githubprojects] for more developer tools and open-source highlights.

Project ID: 835c04b3-d772-4fd9-b63f-e310893cb8bb
Last updated: June 26, 2026 at 04:12 AM