Building an English Video Subtitle Workflow: From YouTube Transcripts to WhisperX

Lately, I've been looking into workflows for creating English video subtitles. A lot of English short-form videos have amazing subtitle effects—especially in tech news, business interviews, and podcast clips. The captions aren't just accurate; they feature word highlighting, per-word animations, color changes, scaling emphasis, and more, which gives them a highly professional look.

For example: Alex Hormozi Shorts @YouTube

After digging into it, I found out that this subtitle style has become a pretty standard production technique among creators. It's usually called:

Hormozi-style captions
Dynamic captions / kinetic typography
Word-by-word captions / karaoke-style highlighting

I wanted to try replicating these effects. But before diving in, I needed some good test samples. Ideally, I wanted a video that already had subtitles so I could focus on parsing them, handling the timeline, rendering styles, and animating—rather than jumping straight into speech recognition and generating subtitles from scratch.

With that in mind, I turned to YouTube first. Many videos already have subtitles, and YouTube has a built-in transcript feature.

I started downloading YouTube videos using yt-dlp, a tool that can download video files as well as list or download existing YouTube subtitles. Here are some common commands:

GitHub Repo

yt-dlp/yt-dlp

A feature-rich command-line audio/video downloader

# --js-runtimes node tells yt-dlp to use Node.js to run the JavaScript that YouTube extraction depends on
# List subtitles for a video
yt-dlp --js-runtimes node --list-subs "VIDEO_URL"
 
# Download the best quality video + user-uploaded English subtitles
# To avoid downloading 4K, you can limit it to 1080P: -f "bv*[height<=1080]+ba/b[height<=1080]/b"
yt-dlp --js-runtimes node -f "bv*+ba/b" --write-subs --sub-lang en --sub-format srt "VIDEO_URL"
 
# Download the best quality video + automatically generated English subtitles
yt-dlp --js-runtimes node -f "bv*+ba/b" --write-auto-subs --sub-lang en --sub-format srt "VIDEO_URL"
 
# Only download subtitles, not the video
yt-dlp --js-runtimes node --skip-download --write-subs --write-auto-subs --sub-lang en --sub-format srt "VIDEO_URL"

At first, I figured I could use the subtitles already on YouTube, whether user-uploaded or auto-generated, directly as test material for post-production. But after downloading a few, I realized they're more like "viewing aids" and aren't really suited for a video production pipeline.

The Problem with YouTube Transcripts

Inaccurate timing: Some subtitles don't sync tightly with the speech, which becomes super obvious when you try to do word-by-word highlighting or beat-synced animations.
Unnatural line breaks: Especially in news and interviews, subtitles are often split mechanically based on the speech recognition output, rather than by natural phrasing or reading rhythm.
Unusable formatting: Some user-uploaded subtitles are entirely in ALL CAPS. They're fine for reading but terrible as source material for fine-grained subtitle work.
Poor reading experience: Subtitles might appear too late, disappear too early, or chop a sentence up too finely, making layout and animation a nightmare later on.

It turns out that "transcribable" and "production-ready" are two very different things.

YouTube transcripts are great as a reference to quickly grasp a video's content. But if you're aiming for professional subtitle effects—like word highlighting, precise timing, natural breaks, and a better visual rhythm—they're not the ideal starting point.

In the AI era, plenty of ASR (Automatic Speech Recognition) tools can turn speech directly into subtitle files. There are tons of commercial APIs and web apps out there, but as a developer, I prefer an open-source, local, and controllable approach. I want models running on my own hardware, a pipeline I can understand piece by piece, and something I can easily plug into my own video generation workflow later.

After consulting ChatGPT, Gemini, and Claude, and doing some of my own research, all signs pointed to the Whisper family: OpenAI Whisper, Faster Whisper, and WhisperX.

In short:

OpenAI Whisper is the original model—great for understanding baseline capabilities and basic CLI workflows.
Faster Whisper is a reimplementation of Whisper using CTranslate2—it's generally much faster and uses less memory.
WhisperX adds forced alignment on top of Whisper's recognition to provide highly accurate word-level timestamps—perfect for subtitle alignment, word highlighting, and post-production.

Overall, WhisperX seemed like exactly what I needed. But out of habit, I didn't want to just jump straight to the final tool. I wanted to start with OpenAI Whisper, move to Faster Whisper, and finally try WhisperX, so I could understand how they relate, how they differ, and where each one fits in.

So, this article is a walkthrough of my local experiments with OpenAI Whisper, Faster Whisper, and WhisperX, along with my final thoughts and conclusions.

My test setup:

MacBook Air 15-inch
Chip: Apple M5
Memory: 24GB
OS: macOS Tahoe 26.4.1

OpenAI Whisper: From "Transcribing" to "Producing"

GitHub Repo

openai/whisper

Robust Speech Recognition via Large-Scale Weak Supervision

Whisper is an automatic speech recognition (ASR) model that OpenAI open-sourced in 2022. It's had a massive impact on the open-source ASR ecosystem. A lot of the tools that came after it are either directly related to or built on top of Whisper's capabilities, including:

Faster Whisper
WhisperX
whisper.cpp
Various WebUIs, desktop clients, and subtitle toolchains

Even in 2026, Whisper is still one of the most important foundational models in open-source ASR. That's thanks to its open-source nature, stable quality, full multilingual support, excellent English recognition, and a mature community. Of course, plenty of new alternatives have popped up since then—you can check out the Hugging Face leaderboard for more.

Environment Setup

According to the official README, the models were trained and tested with Python 3.9.9 and PyTorch 1.10.1.

But in practice, the supported range is pretty flexible:

Python 3.8 ~ 3.11
Any reasonably recent version of PyTorch

I used conda to isolate my test environment. Conda isn't strictly required, but I highly recommend using a dedicated virtual environment to avoid dependency conflicts, especially when running things locally.

# Optional
conda create -n test_whisper python=3.11
conda activate test_whisper

Install Whisper:

pip install -U openai-whisper

Whisper doesn't decode audio or video on its own; it relies on ffmpeg. If you're on macOS, you can install it like this:

brew install ffmpeg

pip may also need Rust to build tiktoken (Whisper's tokenizer) when no prebuilt wheel matches your platform. I didn't run into that, so I won't cover it here. If you do hit errors with Rust or tiktoken, check the official README.

Models

Whisper's classic model lineup includes:

tiny
base
small
medium
large

Later on, OpenAI and the community added variants like:

large-v2
large-v3
turbo

If you check out Whisper's model list on Hugging Face, you'll see there are 12 models available today. The main sizes compare like this:

Model	Parameters	VRAM	Notes
tiny	39M	~1GB	Very fast, lower accuracy
base	74M	~1GB	Good for initial testing
small	244M	~2GB	Balanced
medium	769M	~5GB	Higher accuracy
large	1550M	~10GB	High quality
turbo	809M	~6GB	Heavily optimized fast variant

The .en suffix indicates English-only models, which tend to be more accurate on English content, especially at the smaller sizes:
- tiny.en
- base.en
- small.en
- medium.en
Models without the .en suffix are multilingual. The large and turbo models only come in multilingual versions.

I went with large-v3-turbo (the turbo row in the table above), the latest speed-optimized variant, which is more than enough for my machine.

The model weights download automatically the first time you run a command. On macOS, they're saved in ~/.cache/whisper.

Testing

whisper "test.mp4" \
--model large-v3-turbo \
--language en \
--task transcribe \
--output_format all \
--fp16 False \
--device cpu \
--output_dir .

test.mp4:
- The input video file. Whisper automatically calls ffmpeg to extract the audio.
- This means mp4, mov, mkv, mp3, and wav all work fine.
--model large-v3-turbo:
- Specifies which model to use (it downloads automatically if you don't have it).
--language en:
- Forces the recognition language to English.
- If you leave this out, Whisper will try to auto-detect the language first.
--task transcribe:
- Tells Whisper to transcribe the audio.
- The other option is translate (which translates audio into English).
--output_format all:
- Generates all output formats: txt, srt, vtt, tsv, and json.
- You can also just specify a single format like srt.
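--fp16 False:
- On CPU, Whisper doesn't support FP16 and falls back to FP32 with a warning, so I turned it off explicitly.
--device cpu:
- Runs inference on the CPU.
--output_dir .:
- Writes the output files to the current directory.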

For context, INT8, FP16, and FP32 refer to the model's compute precision:

Type	Precision	Speed	Memory Footprint
FP32	Highest	Slowest	Largest
FP16	Higher	Faster	Smaller
INT8	Lower	Fastest	Leanest
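To put rough numbers on the memory column, here's a back-of-the-envelope sketch that multiplies the parameter count by the bytes each precision uses per weight. It only counts the weights, so treat it as a lower bound: real memory use is higher because of activations and runtime overhead, and quantized builds may keep some layers at higher precision. The weight_size_gb helper is my own name, not a library function.

```python
# Bytes used to store one weight at each precision
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_size_gb(num_params: int, dtype: str) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# turbo has 809M parameters according to the model table
for dtype in BYTES_PER_PARAM:
    size = weight_size_gb(809_000_000, dtype)
    print(f"turbo in {dtype}: about {size:.2f} GB of weights")
```

For turbo this works out to roughly 3.2 GB in FP32, 1.6 GB in FP16, and 0.8 GB in INT8.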

Conclusion

Here's how it performed on my machine:

A 1-minute 48-second video took about 8 minutes and 44 seconds to process.
The generated subtitles still had some noticeable issues:

Issue	Description
Micro-segments	Some subtitle timelines were only 0.18s or 0.2s long.
Hallucinations	Obvious repeated text or phrases.
Choppy timelines	Sentences were split up too finely, causing timestamp instability.

Honestly, the speed wasn't great. But to be fair, I also had a lot running in the background:

Chrome
Cursor
Fork
Visual Studio Code
Various background services

So take these timings with a grain of salt. Still, it became pretty clear that the original openai-whisper package is really meant to be an "official reference implementation" for research and validating model capabilities, rather than a highly optimized production tool.

And that's exactly why Faster Whisper exists.

Faster Whisper

GitHub Repo

SYSTRAN/faster-whisper

Faster Whisper transcription with CTranslate2

The LLMs strongly recommended this one. Faster Whisper is a reimplementation of OpenAI Whisper on top of CTranslate2, a high-performance inference engine optimized for Transformer models.

Feature	OpenAI Whisper	Faster Whisper
Nature	Official implementation	High-performance reimplementation
Inference Stack	PyTorch	CTranslate2
Primary Focus	Reference & research	Inference optimization
CPU Performance	Moderate	Excellent

Environment Setup

Python 3.9+
No need to install the ffmpeg CLI separately.

This is a nice departure from the original Whisper. Faster Whisper uses PyAV, which provides Python bindings for FFmpeg.

In simple terms:

It uses FFmpeg's capabilities internally.
But it doesn't require you to have ffmpeg installed on your system.

To install it:

pip install faster-whisper

Models

Faster Whisper uses Whisper weights that have been converted to the CTranslate2 format, rather than raw PyTorch weights. You can find them here: SYSTRAN Whisper Models.

They are based on the exact same Whisper models.
Only the weight formats are different (PyTorch vs. CTranslate2).
The default download location is usually ~/.cache/huggingface.
Models will download automatically the first time you run the script.
During the first download, you might see a message suggesting hf auth login. Logging in is optional for these public models, but setting up a Hugging Face token can help with rate limits and download speed.

Testing

Unlike OpenAI Whisper, Faster Whisper is primarily designed to be used via Python. Here's the test script I wrote:

from faster_whisper import WhisperModel

video = "test.mp4"
model_size = "large-v3-turbo"


def fmt(t):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    # Work in whole milliseconds to avoid float truncation (e.g. 1.001 -> ,000)
    total_ms = round(t * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"


# INT8 quantized inference on the CPU
model = WhisperModel(model_size, device="cpu", compute_type="int8")

# segments is a lazy generator: transcription runs as the loop below consumes it
segments, info = model.transcribe(
    video,
    language="en",
    vad_filter=True,
    beam_size=5,
)

with open("test.srt", "w", encoding="utf-8") as f:
    for i, segment in enumerate(segments, start=1):
        f.write(f"{i}\n")
        f.write(f"{fmt(segment.start)} --> {fmt(segment.end)}\n")
        f.write(segment.text.strip() + "\n\n")

compute_type="int8":
- Uses 8-bit quantized inference.
- Greatly reduces memory usage.
- Usually noticeably faster on a CPU.
- It also supports float32 and float16.
vad_filter=True:
- Enables Voice Activity Detection (VAD).
- This tries to filter out silence, background noise, and non-speech segments. The original Whisper doesn't do this by default.
beam_size=5:
- A parameter for beam search. Larger values make the search more thorough, which can improve accuracy but slows down inference.
- beam_size=5 is a solid middle ground.
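To see why a wider beam can change the result, here's a toy beam search over a tiny, hypothetical probability table. This is not Whisper's decoder, and the tokens and probabilities are invented. It only shows the mechanism: with beam_size=1 (greedy) the search commits to the most likely first token and misses the sequence that is more likely overall.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token
NEXT = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.5, "y": 0.5},
    "B": {"x": 0.9, "y": 0.1},
}


def beam_search(beam_size: int, steps: int = 2):
    """Return (best token sequence, its probability) after `steps` tokens."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens so far)
    for _ in range(steps):
        candidates = [
            (score + math.log(p), tokens + [tok])
            for score, tokens in beams
            for tok, p in NEXT[tokens[-1]].items()
        ]
        # Keep only the beam_size highest-scoring partial sequences
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    best_score, best_tokens = beams[0]
    return best_tokens[1:], math.exp(best_score)


print(beam_search(beam_size=1))  # greedy: commits to "A" first
print(beam_search(beam_size=2))  # keeps "B" alive and finds the likelier "B x"
```

The cost is that every extra beam means more candidates to score at each step, which is where the slowdown comes from.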

Conclusion

Video Length	Processing Time
1m 48s	32s
19m 21s	5m 58s

The speed improvement was massive.
However, I still occasionally saw repeated subtitles. For example, this line was duplicated with an incorrect timeline: "President Trump posting about the incident on social media":

18
00:01:20,700 --> 00:01:26,879
President Trump posting about the incident on social media said it goes to show how important it is for presidents and
 
19
00:01:26,879 --> 00:01:26,959
future presidents to be protected.
 
20
00:01:26,959 --> 00:01:27,219
The White House needs to be protected.
 
21
00:01:27,239 --> 00:01:35,900
President Trump posting about the incident on social media said it goes to show how important it is for future presidents to get the most safe and secure space of its kind ever built, referencing his ballroom and what he says will be the security complex underneath.
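Repeats like this are easy to flag automatically. Here's a minimal sketch that compares each cue with the few cues before it and counts how many leading words they share. The function names and the thresholds (6 shared words, a lookback of 3 cues) are my own assumptions, and a prefix comparison won't catch repeats that start mid-sentence.

```python
def shared_prefix_words(a: str, b: str) -> int:
    """Count how many leading words two strings have in common (case-insensitive)."""
    count = 0
    for wa, wb in zip(a.lower().split(), b.lower().split()):
        if wa != wb:
            break
        count += 1
    return count


def find_repeats(cues, min_shared=6, lookback=3):
    """cues: list of (index, text). Returns (earlier index, later index, shared words)."""
    hits = []
    for i, (idx, text) in enumerate(cues):
        for prev_idx, prev_text in cues[max(0, i - lookback):i]:
            shared = shared_prefix_words(prev_text, text)
            if shared >= min_shared:
                hits.append((prev_idx, idx, shared))
    return hits


# The four cues from the output above
cues = [
    (18, "President Trump posting about the incident on social media said it goes "
         "to show how important it is for presidents and"),
    (19, "future presidents to be protected."),
    (20, "The White House needs to be protected."),
    (21, "President Trump posting about the incident on social media said it goes "
         "to show how important it is for future presidents to get the most safe "
         "and secure space of its kind ever built, referencing his ballroom and "
         "what he says will be the security complex underneath."),
]

print(find_repeats(cues))
```

On these cues it reports that cue 21 repeats the first 19 words of cue 18.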

WhisperX

GitHub Repo

m-bain/whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

Both OpenAI Whisper and Faster Whisper do a great job handling:

Speech recognition
Subtitle generation
Local execution
Inference speed

But when you're producing video subtitles, timeline precision is everything—especially if you want to do word highlighting, per-word animations, or fine-grained syncing.

While Faster Whisper does support word-level timestamps, WhisperX takes it a step further by adding wav2vec2 forced alignment to generate highly precise word-level timestamps. That's exactly why the LLMs and my search results kept recommending it.

For context, wav2vec 2.0 is a speech model developed by Meta.

Model	Excels At
Whisper	Speech recognition (transcription)
wav2vec 2.0	Audio-to-text time alignment
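To show why word-level timing matters for this caption style, here's a minimal sketch that groups timed words into short cues of a few words each, starting a new cue after a pause. The (start, end, text) tuple layout, the group_words name, the limits (3 words, 0.6-second gap), and the sample timings are all assumptions for illustration, not output from any of these tools.

```python
def group_words(words, max_words=3, max_gap=0.6):
    """Group timed words into short caption cues.

    words: list of (start, end, text) tuples sorted by time.
    A new cue starts when the current one is full or the speaker pauses.
    """
    cues, current = [], []
    for word in words:
        gap = word[0] - current[-1][1] if current else 0.0
        if current and (len(current) >= max_words or gap > max_gap):
            cues.append((current[0][0], current[-1][1], " ".join(w[2] for w in current)))
            current = []
        current.append(word)
    if current:
        cues.append((current[0][0], current[-1][1], " ".join(w[2] for w in current)))
    return cues


# Made-up timings with a pause before "to"
words = [
    (0.0, 0.3, "The"), (0.3, 0.6, "White"), (0.6, 0.9, "House"),
    (0.9, 1.2, "needs"), (2.5, 2.8, "to"), (2.8, 3.0, "be"),
]
for cue in group_words(words):
    print(cue)
```

Each cue keeps the start of its first word and the end of its last word, so any error in the word timestamps shows up directly on screen.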

Environment Setup

Conda is highly recommended.
Or at least a dedicated virtual environment.

WhisperX's dependencies are noticeably heavier than the previous tools. It requires:

PyTorch
torchaudio
pyannote
transformers
faster-whisper
CTranslate2

Because there are so many dependencies, the underlying library versions can easily conflict with each other, which is why a clean environment is so important.

pip install whisperx

Models

WhisperX actually integrates Faster Whisper under the hood.

This means it also uses the CTranslate2-formatted Whisper models.

If you've already run Faster Whisper, you can reuse many of the model files (usually found in ~/.cache/huggingface).

On top of the Whisper models, WhisperX will also download:

wav2vec 2.0 alignment models (selected by language)
pyannote models (if you enable speaker diarization)

Testing

WhisperX supports both a CLI and a Python API.

whisperx test.mp4 \
  --model large-v3-turbo \
  --device cpu \
  --compute_type int8 \
  --output_dir . \
  --output_format all \
  --language en

For the same 1m 48s video, it took 40.699s.
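For reference, the same run through the Python API might look roughly like the sketch below, which follows the usage pattern in the WhisperX README. I didn't benchmark this version, batch_size=8 is an arbitrary choice, and flatten_words is my own helper rather than part of WhisperX. Some tokens (numbers, for example) can come back without timings, so the helper skips them.

```python
def transcribe_with_alignment(path: str, device: str = "cpu"):
    """Transcribe with WhisperX, then refine timings with forced alignment."""
    import whisperx  # imported lazily so flatten_words works without it installed

    model = whisperx.load_model("large-v3-turbo", device, compute_type="int8")
    audio = whisperx.load_audio(path)
    result = model.transcribe(audio, batch_size=8, language="en")

    # Load the alignment model for English and align the segments
    align_model, metadata = whisperx.load_align_model(language_code="en", device=device)
    return whisperx.align(result["segments"], align_model, metadata, audio, device)


def flatten_words(aligned: dict) -> list[tuple[float, float, str]]:
    """Collect (start, end, word) for every word that received a timing."""
    words = []
    for segment in aligned["segments"]:
        for w in segment.get("words", []):
            if "start" in w and "end" in w:  # unaligned tokens have no timings
                words.append((w["start"], w["end"], w["word"]))
    return words


if __name__ == "__main__":
    aligned = transcribe_with_alignment("test.mp4")
    for start, end, word in flatten_words(aligned):
        print(f"{start:8.3f} {end:8.3f} {word}")
```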

During testing, I also ran into this warning:

UserWarning: 
torchcodec is not installed correctly so built-in audio decoding will fail. Solutions are:
* use audio preloaded in-memory as a {'waveform': (channel, time) torch.Tensor, 'sample_rate': int} dictionary;
* fix torchcodec installation. Error message was:

Could not load libtorchcodec. Likely causes:
          1. FFmpeg is not properly installed in your environment. We support
             versions 4, 5, 6 and 7

The reason? I had FFmpeg 8.x installed, but torchcodec (PyTorch's newer media decoding stack) currently only supports FFmpeg versions 4 through 7. Transcription and alignment still completed in my runs, so I ignored the warning for now.

Conclusion

Video Length	Processing Time
1m 48s	40.70s
19m 21s	6m 28s

It's slightly slower than Faster Whisper (due to the extra alignment step).
On my test samples, I didn't see any obvious text repetition or timeline glitches anymore. Of course, larger-scale testing might still surface some edge cases, but the results were very promising.
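To compare the three tools on one scale, it helps to compute the real-time factor (RTF): processing time divided by audio duration, where anything below 1 is faster than real time. The sketch below just redoes that arithmetic with the timings reported in this article. Keep in mind the runs weren't configured identically (OpenAI Whisper ran with FP16 off, the other two with INT8), so this is a rough comparison rather than a benchmark.

```python
def to_seconds(minutes: int, seconds: float) -> float:
    return minutes * 60 + seconds


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration; below 1 is faster than real time."""
    return processing_s / audio_s


# (tool, processing time, video length), taken from the results in this article
runs = [
    ("OpenAI Whisper", to_seconds(8, 44), to_seconds(1, 48)),
    ("Faster Whisper", to_seconds(0, 32), to_seconds(1, 48)),
    ("Faster Whisper", to_seconds(5, 58), to_seconds(19, 21)),
    ("WhisperX", to_seconds(0, 40.7), to_seconds(1, 48)),
    ("WhisperX", to_seconds(6, 28), to_seconds(19, 21)),
]

for name, processing, audio in runs:
    rtf = real_time_factor(processing, audio)
    print(f"{name:15s} {audio:6.0f}s of audio  RTF = {rtf:.2f}")
```

That puts OpenAI Whisper at roughly 4.9, Faster Whisper at about 0.3, and WhisperX between about 0.33 and 0.38 on my machine.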