Building an English Video Subtitle Workflow: From YouTube Transcripts to WhisperX

From YouTube Transcripts and yt-dlp to OpenAI Whisper, Faster Whisper, and WhisperX—a complete local walkthrough of English video subtitle production and dynamic caption timeline alignment. Covers model differences, speed tests, word-level timestamps, forced alignment, and subtitle workflows.


Lately, I've been looking into workflows for creating English video subtitles. A lot of English short-form videos have amazing subtitle effects—especially in tech news, business interviews, and podcast clips. The captions aren't just accurate; they feature word highlighting, per-word animations, color changes, scaling emphasis, and more, which gives them a highly professional look.

For example: Alex Hormozi Shorts

After digging into it, I found out that this subtitle style has become a pretty standard production technique among creators. It's usually called:

  • Hormozi-style captions
  • Dynamic captions / kinetic typography
  • Word-by-word captions / karaoke-style highlighting

I wanted to try replicating these effects. But before diving in, I needed some good test samples. Ideally, I wanted a video that already had subtitles so I could focus on parsing them, handling the timeline, rendering styles, and animating—rather than jumping straight into speech recognition and generating subtitles from scratch.

With that in mind, I turned to YouTube first. Many videos already have subtitles, and YouTube has a built-in transcript feature.

I started downloading YouTube videos using yt-dlp, a tool that can download video files as well as list or download existing YouTube subtitles. Here are some common commands:

GitHub repo: yt-dlp/yt-dlp (a feature-rich command-line audio/video downloader)

# --js-runtimes node: use Node.js for the JavaScript that YouTube extraction relies on
# List the subtitles available for a video
yt-dlp --js-runtimes node --list-subs "VIDEO_URL"

# Download the best-quality video + user-uploaded English subtitles
# To avoid downloading 4K, cap it at 1080p: -f "bv*[height<=1080]+ba/b[height<=1080]/b"
# --convert-subs srt converts the downloaded track to SRT when no native SRT is offered (needs ffmpeg)
yt-dlp --js-runtimes node -f "bv*+ba/b" --write-subs --sub-langs en --sub-format "srt/best" --convert-subs srt "VIDEO_URL"

# Download the best-quality video + automatically generated English subtitles
yt-dlp --js-runtimes node -f "bv*+ba/b" --write-auto-subs --sub-langs en --sub-format "srt/best" --convert-subs srt "VIDEO_URL"

# Only download subtitles, not the video
yt-dlp --js-runtimes node --skip-download --write-subs --write-auto-subs --sub-langs en --sub-format "srt/best" --convert-subs srt "VIDEO_URL"
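If the downloads need to run from a script rather than the shell, yt-dlp can also be embedded as a Python library. The sketch below mirrors the subtitles-only command above and adds an explicit conversion to SRT. `build_subtitle_opts` and `download_subtitles` are hypothetical helper names of mine; the option keys are the ones documented for yt-dlp's `YoutubeDL` class, so check them against your installed version. JavaScript runtime selection is left at the library default here.

```python
def build_subtitle_opts(lang="en", out_dir="."):
    """Options roughly equivalent to a subtitles-only yt-dlp run.

    Hypothetical helper: it only builds the options dict, nothing is downloaded.
    """
    return {
        "skip_download": True,          # subtitles only, no video
        "writesubtitles": True,         # user-uploaded subtitles
        "writeautomaticsub": True,      # auto-generated subtitles
        "subtitleslangs": [lang],
        "subtitlesformat": "srt/best",  # prefer srt, otherwise take what exists
        "paths": {"home": out_dir},
        # Convert whatever was downloaded to srt; this step needs ffmpeg.
        "postprocessors": [
            {"key": "FFmpegSubtitlesConvertor", "format": "srt", "when": "before_dl"}
        ],
    }


def download_subtitles(url, lang="en", out_dir="."):
    # Imported lazily so the options helper works without yt-dlp installed.
    from yt_dlp import YoutubeDL

    with YoutubeDL(build_subtitle_opts(lang, out_dir)) as ydl:
        return ydl.download([url])
```

Calling `download_subtitles("VIDEO_URL", out_dir="subs")` would then write the English subtitle files into `subs/`, assuming the video has any.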

At first, I figured I could just use the subtitles already on YouTube—whether user-uploaded or auto-generated—straight for post-production testing. But after downloading a few, I realized they're more like "viewing aids" and aren't really suited for a video production pipeline.

The Problem with YouTube Transcripts

  1. Inaccurate timing: Some subtitles don't sync tightly with the speech, which becomes super obvious when you try to do word-by-word highlighting or beat-synced animations.
  2. Unnatural line breaks: Especially in news and interviews, subtitles are often split mechanically based on the speech recognition output, rather than by natural phrasing or reading rhythm.
  3. Unusable formatting: Some user-uploaded subtitles are entirely in ALL CAPS. They're fine for reading but terrible as source material for fine-grained subtitle work.
  4. Poor reading experience: Subtitles might appear too late, disappear too early, or chop a sentence up too finely, making layout and animation a nightmare later on.
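Some of these problems can be spotted automatically before any rendering work starts. Below is a minimal sketch that parses an SRT file and flags all-caps cues, very short cues, and cues that would be too fast to read. The names (`Cue`, `parse_srt`, `lint`) and both thresholds are my own illustrative choices, not values taken from any standard.

```python
import re
from dataclasses import dataclass

TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


@dataclass
class Cue:
    index: int
    start: float  # seconds
    end: float    # seconds
    text: str


def to_seconds(stamp):
    """Convert an SRT timestamp such as 00:01:26,879 to seconds."""
    h, m, s, ms = map(int, TIME_RE.fullmatch(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def parse_srt(srt_text):
    """Parse SRT text into cues; assumes blocks separated by blank lines."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 3:
            continue  # skip empty or malformed blocks
        start, end = (to_seconds(part) for part in lines[1].split("-->"))
        cues.append(Cue(int(lines[0]), start, end, " ".join(lines[2:])))
    return cues


def lint(cues, min_duration=0.5, max_chars_per_sec=20.0):
    """Return (cue index, problem) pairs. Thresholds are illustrative guesses."""
    problems = []
    for cue in cues:
        duration = cue.end - cue.start
        letters = [c for c in cue.text if c.isalpha()]
        if letters and all(c.isupper() for c in letters):
            problems.append((cue.index, "all caps"))
        if duration < min_duration:
            problems.append((cue.index, "too short"))
        elif len(cue.text) / duration > max_chars_per_sec:
            problems.append((cue.index, "too fast to read"))
    return problems
```

Running `lint(parse_srt(open("video.en.srt", encoding="utf-8").read()))` on a downloaded track gives a quick first impression of how much cleanup it would need. Loose timing against the actual audio cannot be detected this way, since that requires the audio itself.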

It turns out that "transcribable" and "production-ready" are two very different things.

YouTube transcripts are great as a reference to quickly grasp a video's content. But if you're aiming for professional subtitle effects—like word highlighting, precise timing, natural breaks, and a better visual rhythm—they're not the ideal starting point.

In the AI era, plenty of ASR (Automatic Speech Recognition) tools can turn speech directly into subtitle files. There are tons of commercial APIs and web apps out there, but as a developer, I prefer an open-source, local, and controllable approach. I want models running on my own hardware, a pipeline I can understand piece by piece, and something I can easily plug into my own video generation workflow later.

After consulting ChatGPT, Gemini, and Claude, and doing some of my own research, all signs pointed to the Whisper family: OpenAI Whisper, Faster Whisper, and WhisperX.

In short:

  • OpenAI Whisper is the original model—great for understanding baseline capabilities and basic CLI workflows.
  • Faster Whisper is a reimplementation of Whisper using CTranslate2—it's generally much faster and uses less memory.
  • WhisperX adds forced alignment on top of Whisper's recognition to provide highly accurate word-level timestamps—perfect for subtitle alignment, word highlighting, and post-production.

Overall, WhisperX seemed like exactly what I needed. But out of habit, I didn't want to just jump straight to the final tool. I wanted to start with OpenAI Whisper, move to Faster Whisper, and finally try WhisperX, so I could understand how they relate, how they differ, and where each one fits in.

So, this article is a walkthrough of my local experiments with OpenAI Whisper, Faster Whisper, and WhisperX, along with my final thoughts and conclusions.

My test setup:

  • MacBook Air 15-inch
  • Chip: Apple M5
  • Memory: 24GB
  • OS: macOS Tahoe 26.4.1

OpenAI Whisper: From "Transcribing" to "Producing"

GitHub repo: openai/whisper (Robust Speech Recognition via Large-Scale Weak Supervision)

Whisper is an automatic speech recognition (ASR) model that OpenAI open-sourced in 2022. It's had a massive impact on the open-source ASR ecosystem. A lot of the tools that came after it are either directly related to or built on top of Whisper's capabilities, including:

  • Faster Whisper
  • WhisperX
  • whisper.cpp
  • Various WebUIs, desktop clients, and subtitle toolchains

Even in 2026, Whisper is still one of the most important foundational models in open-source ASR. That's thanks to its open-source nature, stable quality, full multilingual support, excellent English recognition, and a mature community. Of course, plenty of new alternatives have popped up since then—you can check out the Hugging Face leaderboard for more.

Environment Setup

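According to the official README, the models were trained and tested with Python 3.9.9 and PyTorch 1.10.1. In practice the supported range is pretty flexible: the codebase is expected to work with Python 3.8 to 3.11 and recent PyTorch versions.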

I used conda to isolate my test environment. Conda isn't strictly required, but I highly recommend using a dedicated virtual environment to avoid dependency conflicts, especially when running things locally.

# Optional
conda create -n test_whisper python=3.11
conda activate test_whisper
  • Install Whisper:
pip install -U openai-whisper
  • Whisper doesn't decode audio or video on its own; it relies on ffmpeg. If you're on macOS, you can install it like this:
brew install ffmpeg
  • I didn't run into any issues with Rust or tiktoken, so I won't cover those here. If you run into errors with them, just check the official docs.

Models

Whisper's classic model lineup includes:

  • tiny
  • base
  • small
  • medium
  • large

Later on, OpenAI and the community added variants like:

  • large-v2
  • large-v3
  • turbo

If you check out Whisper's model list on Hugging Face, you'll see there are 12 models available today once you count the .en and versioned variants. The main sizes compare like this:

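| Model  | Parameters | VRAM  | Notes                          |
|--------|------------|-------|--------------------------------|
| tiny   | 39M        | ~1GB  | Very fast, lower accuracy      |
| base   | 74M        | ~1GB  | Good for initial testing       |
| small  | 244M       | ~2GB  | Balanced                       |
| medium | 769M       | ~5GB  | Higher accuracy                |
| large  | 1550M      | ~10GB | High quality                   |
| turbo  | 809M       | ~6GB  | Heavily optimized fast variant |
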
  • The .en suffix indicates English-only models (available for tiny, base, small, and medium). They tend to perform better on English content, with the biggest gain at the smallest sizes. For example:
    • small.en
    • medium.en
  • Models without the .en suffix are multilingual.

I went with large-v3-turbo—the latest speed-optimized variant—which is more than enough for my machine.

The model weights download automatically the first time you run a command. On macOS, they're saved in ~/.cache/whisper.

Testing

whisper "test.mp4" \
--model large-v3-turbo \
--language en \
--task transcribe \
--output_format all \
--fp16 False \
--device cpu \
--output_dir .
  • test.mp4:
    • The input video file. Whisper automatically calls ffmpeg to extract the audio.
    • This means mp4, mov, mkv, mp3, and wav all work fine.
  • --model large-v3-turbo:
    • Specifies which model to use (it downloads automatically if you don't have it).
  • --language en:
    • Forces the recognition language to English.
    • If you leave this out, Whisper will try to auto-detect the language first.
  • --task transcribe:
    • Tells Whisper to transcribe the audio.
    • The other option is translate (which translates audio into English).
  • --output_format all:
    • Generates all output formats: txt, srt, vtt, tsv, and json.
    • You can also just specify a single format like srt.
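  • --fp16 False:
    • FP16 isn't supported on CPU. Whisper prints a warning and falls back to FP32 anyway, so I turned it off explicitly.
  • --device cpu:
    • Runs inference on the CPU.
  • --output_dir .:
    • Writes the output files to the current directory.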
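The same run can also be driven from Python, which is handy when the transcript feeds a larger pipeline. This is a sketch under a few assumptions: `srt_timestamp`, `segments_to_srt`, and `transcribe_to_srt` are helper names I made up, while `whisper.load_model` and `model.transcribe` are the package's documented entry points. The result is a dict whose "segments" entries carry "start", "end", and "text".

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"


def segments_to_srt(segments):
    """Turn Whisper-style segments (dicts with start, end, text) into SRT text."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)


def transcribe_to_srt(path, model_name="large-v3-turbo"):
    # Imported lazily: openai-whisper and its model weights are heavy.
    import whisper

    model = whisper.load_model(model_name, device="cpu")
    result = model.transcribe(path, language="en", task="transcribe", fp16=False)
    return segments_to_srt(result["segments"])
```

Calling `transcribe_to_srt("test.mp4")` should produce roughly what the CLI writes for the srt format, minus the CLI's own line-wrapping options.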

For context, INT8, FP16, and FP32 refer to the model's compute precision:

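| Type | Precision | Speed   | Memory Footprint |
|------|-----------|---------|------------------|
| FP32 | Highest   | Slowest | Largest          |
| FP16 | Higher    | Faster  | Smaller          |
| INT8 | Lower     | Fastest | Leanest          |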
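To put rough numbers on the memory column: each parameter takes 4 bytes in FP32, 2 in FP16, and 1 in INT8. The back-of-the-envelope sketch below applies that to the turbo model's roughly 809M parameters. It counts the weights only, so real memory use will be higher once activations and runtime overhead are added; `weight_size_gb` is just an illustrative helper.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_size_gb(num_params, dtype):
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Parameter count from the model table: turbo has about 809M parameters.
for dtype in BYTES_PER_PARAM:
    print(f"turbo @ {dtype}: ~{weight_size_gb(809e6, dtype):.2f} GB")
```

That works out to about 3.24 GB, 1.62 GB, and 0.81 GB respectively, which is why INT8 is attractive on a laptop.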

Conclusion

Here's how it performed on my machine:

  • A 1-minute 48-second video took about 8 minutes and 44 seconds to process.
  • The generated subtitles still had some noticeable issues:
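| Issue            | Description                                                          |
|------------------|----------------------------------------------------------------------|
| Micro-segments   | Some subtitle timelines were only 0.18s or 0.2s long.                |
| Hallucinations   | Obvious repeated text or phrases.                                    |
| Choppy timelines | Sentences were split up too finely, causing timestamp instability.   |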

Honestly, the speed wasn't great. But to be fair, I also had a lot running in the background:

  • Chrome
  • Cursor
  • Fork
  • Visual Studio Code
  • Various background services

So take these timings with a grain of salt. Still, it became pretty clear that the original openai-whisper package is really meant to be an "official reference implementation" for research and validating model capabilities, rather than a highly optimized production tool.

And that's exactly why Faster Whisper exists.

Faster Whisper

GitHub repo: SYSTRAN/faster-whisper (Faster Whisper transcription with CTranslate2)

The LLMs strongly recommended this one. It's a reimplementation of OpenAI Whisper on top of CTranslate2, a high-performance inference engine optimized for Transformer models.

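| Feature         | OpenAI Whisper          | Faster Whisper                     |
|-----------------|-------------------------|------------------------------------|
| Nature          | Official implementation | High-performance reimplementation  |
| Inference Stack | PyTorch                 | CTranslate2                        |
| Primary Focus   | Reference & research    | Inference optimization             |
| CPU Performance | Moderate                | Excellent                          |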

Environment Setup

  • Python 3.9+
  • No need to install the ffmpeg CLI separately.

This is a nice departure from the original Whisper. Faster Whisper uses PyAV, which provides Python bindings for FFmpeg.

In simple terms:

  • It uses FFmpeg's capabilities internally.
  • But it doesn't require you to have ffmpeg installed on your system.

To install it:

pip install faster-whisper

Models

Faster Whisper uses Whisper weights that have been converted to the CTranslate2 format, rather than raw PyTorch weights. You can find them here: SYSTRAN Whisper Models.

  • They are based on the exact same Whisper models.
  • Only the weight formats are different (PyTorch vs. CTranslate2).
  • The default download location is usually ~/.cache/huggingface.
  • Models will download automatically the first time you run the script.
  • During the first download, it might prompt you for hf auth login. I recommend setting up a Hugging Face token to speed up downloads.

Testing

Unlike OpenAI Whisper, Faster Whisper is primarily designed to be used via Python. Here's the test script I wrote:

from faster_whisper import WhisperModel

VIDEO = "test.mp4"
MODEL_SIZE = "large-v3-turbo"


def fmt(t):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(t * 1000)  # round once to avoid float truncation errors
    h, rest = divmod(total_ms, 3_600_000)
    m, rest = divmod(rest, 60_000)
    s, ms = divmod(rest, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"


# int8 quantization keeps memory low and is usually the fastest option on CPU
model = WhisperModel(MODEL_SIZE, device="cpu", compute_type="int8")

# segments is a lazy generator: transcription runs as the loop below consumes it
segments, info = model.transcribe(
    VIDEO,
    language="en",
    vad_filter=True,
    beam_size=5,
)

with open("test.srt", "w", encoding="utf-8") as f:
    for i, segment in enumerate(segments, start=1):
        f.write(f"{i}\n")
        f.write(f"{fmt(segment.start)} --> {fmt(segment.end)}\n")
        f.write(segment.text.strip() + "\n\n")
  • compute_type="int8":
    • Uses 8-bit quantized inference.
    • Greatly reduces memory usage.
    • Usually noticeably faster on a CPU.
    • It also supports float32 and float16.
  • vad_filter=True:
    • Enables Voice Activity Detection (VAD).
    • This tries to filter out silence, background noise, and non-speech segments. The original Whisper doesn't do this by default.
  • beam_size=5:
    • A parameter for beam search. Larger values make the search more thorough, which can improve accuracy but slows down inference.
    • beam_size=5 is a solid middle ground.

Conclusion

| Video Length | Processing Time |
|--------------|-----------------|
| 1m 48s       | 32s             |
| 19m 21s      | 5m 58s          |
  • The speed improvement was massive.
  • However, I still occasionally saw repeated subtitles. For example, this line was duplicated with an incorrect timeline: "President Trump posting about the incident on social media":
18
00:01:20,700 --> 00:01:26,879
President Trump posting about the incident on social media said it goes to show how important it is for presidents and
 
19
00:01:26,879 --> 00:01:26,959
future presidents to be protected.
 
20
00:01:26,959 --> 00:01:27,219
The White House needs to be protected.
 
21
00:01:27,239 --> 00:01:35,900
President Trump posting about the incident on social media said it goes to show how important it is for future presidents to get the most safe and secure space of its kind ever built, referencing his ballroom and what he says will be the security complex underneath.
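Glitches like this are easy to flag with a small post-processing pass. The sketch below marks cues that are implausibly short and cues that restart the opening words of an earlier cue. `find_suspect_cues` and its thresholds are illustrative assumptions that would need tuning on real data. The sample is the four cues above, with timestamps converted to seconds and the text of cue 21 shortened.

```python
def find_suspect_cues(cues, min_duration=0.5, prefix_words=6):
    """Flag very short cues and cues that restart an earlier cue's opening words.

    `cues` is a list of (index, start_seconds, end_seconds, text) tuples.
    """
    suspects = []
    seen_prefixes = {}
    for index, start, end, text in cues:
        if end - start < min_duration:
            suspects.append((index, "micro-segment"))
        words = text.lower().split()
        if len(words) >= prefix_words:
            prefix = " ".join(words[:prefix_words])
            if prefix in seen_prefixes:
                suspects.append((index, f"repeats start of cue {seen_prefixes[prefix]}"))
            else:
                seen_prefixes[prefix] = index
    return suspects


SAMPLE = [
    (18, 80.700, 86.879,
     "President Trump posting about the incident on social media said it goes "
     "to show how important it is for presidents and"),
    (19, 86.879, 86.959, "future presidents to be protected."),
    (20, 86.959, 87.219, "The White House needs to be protected."),
    (21, 87.239, 95.900,
     "President Trump posting about the incident on social media said it goes "
     "to show how important it is for future presidents to get the most safe"),
]

print(find_suspect_cues(SAMPLE))
```

On this sample it reports cues 19 and 20 as micro-segments and cue 21 as repeating the start of cue 18. Deciding which copy to keep still takes a human or a second pass against the audio.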

WhisperX

GitHub repo: m-bain/whisperX (WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization))

Both OpenAI Whisper and Faster Whisper do a great job handling:

  • Speech recognition
  • Subtitle generation
  • Local execution
  • Inference speed

But when you're producing video subtitles, timeline precision is everything—especially if you want to do word highlighting, per-word animations, or fine-grained syncing.

While Faster Whisper does support word-level timestamps, WhisperX takes it a step further by adding wav2vec2 forced alignment to generate highly precise word-level timestamps. That's exactly why the LLMs and my search results kept recommending it.

For context, wav2vec 2.0 is a speech model developed by Meta.

| Model       | Excels At                          |
|-------------|------------------------------------|
| Whisper     | Speech recognition (transcription) |
| wav2vec 2.0 | Audio-to-text time alignment       |
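For comparison, Faster Whisper's own word-level timestamps mentioned above are switched on with `word_timestamps=True`; each segment then exposes a `words` list whose items have `word`, `start`, and `end`. A minimal sketch, where `word_rows` and `transcribe_words` are hypothetical helper names of mine:

```python
def word_rows(segments):
    """Flatten segments into (word, start, end) rows.

    Works on any objects exposing `.words`, where each word has
    `.word`, `.start` and `.end` (the shape faster-whisper returns).
    """
    rows = []
    for segment in segments:
        for w in segment.words or []:
            rows.append((w.word.strip(), round(w.start, 3), round(w.end, 3)))
    return rows


def transcribe_words(path, model_size="large-v3-turbo"):
    # Imported lazily so word_rows can be used without the model installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(
        path, language="en", vad_filter=True, word_timestamps=True
    )
    return word_rows(segments)
```

These timings are derived inside Whisper itself rather than by a separate alignment model, which is the part WhisperX replaces.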

Environment Setup

  • Conda is highly recommended.
  • Or at least a dedicated virtual environment.

WhisperX's dependencies are noticeably heavier than the previous tools. It requires:

  • PyTorch
  • torchaudio
  • pyannote
  • transformers
  • faster-whisper
  • CTranslate2

Because there are so many dependencies, the underlying library versions can easily conflict with each other, which is why a clean environment is so important.

pip install whisperx

Models

WhisperX actually integrates Faster Whisper under the hood.

This means it also uses the CTranslate2-formatted Whisper models.

If you've already run Faster Whisper, you can reuse many of the model files (usually found in ~/.cache/huggingface).

On top of the Whisper models, WhisperX will also download:

  • Alignment models (wav2vec 2.0, one per language)
  • pyannote models (if you enable speaker diarization)

Testing

WhisperX supports both a CLI and a Python API.

whisperx test.mp4 \
  --model large-v3-turbo \
  --device cpu \
  --compute_type int8 \
  --output_dir . \
  --output_format all \
  --language en

For the same 1m 48s video, it took 40.699s.
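The Python API follows the same two stages as the CLI: transcribe first, then align. The sketch below is based on the usage shown in the WhisperX README, adapted to CPU and int8; `transcribe_and_align` and `aligned_words` are my own helper names, and the batch size is an arbitrary choice. Words the aligner cannot place, such as some numbers, may come back without timestamps, so the helper skips them.

```python
def transcribe_and_align(path, model_name="large-v3-turbo", device="cpu"):
    """Transcribe with WhisperX, then run forced alignment for word timings."""
    # Imported lazily: whisperx pulls in PyTorch and downloads models on first use.
    import whisperx

    model = whisperx.load_model(model_name, device, compute_type="int8", language="en")
    audio = whisperx.load_audio(path)
    result = model.transcribe(audio, batch_size=8)

    align_model, metadata = whisperx.load_align_model(language_code="en", device=device)
    return whisperx.align(result["segments"], align_model, metadata, audio, device)


def aligned_words(result):
    """Collect (word, start, end) rows from an aligned result dict."""
    rows = []
    for segment in result["segments"]:
        for w in segment.get("words", []):
            if "start" in w and "end" in w:  # skip words without timings
                rows.append((w["word"], w["start"], w["end"]))
    return rows
```

`aligned_words(transcribe_and_align("test.mp4"))` then gives a flat word list that a word-by-word highlight renderer can consume directly.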

During testing, I also ran into this warning:

UserWarning: 
torchcodec is not installed correctly so built-in audio decoding will fail. Solutions are:
* use audio preloaded in-memory as a {'waveform': (channel, time) torch.Tensor, 'sample_rate': int} dictionary;
* fix torchcodec installation. Error message was:

Could not load libtorchcodec. Likely causes:
          1. FFmpeg is not properly installed in your environment. We support
             versions 4, 5, 6 and 7

The reason? I had FFmpeg 8.x installed, but torchcodec (PyTorch's newer media decoding stack) currently supports only FFmpeg versions 4 through 7, as the message says. In my runs the warning didn't affect the WhisperX output, so I ignored it for now.

Conclusion

| Video Length | Processing Time |
|--------------|-----------------|
| 1m 48s       | 40.69s          |
| 19m 21s      | 6m 28s          |
  • It's slightly slower than Faster Whisper (due to the extra alignment step).
  • On my test samples, I didn't see any obvious text repetition or timeline glitches anymore. Of course, larger-scale testing might still surface some edge cases, but the results were very promising.
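To compare the three tools on one scale, the timings above can be turned into a real-time factor (RTF): processing time divided by audio length, where anything below 1.0 is faster than real time. The numbers are the ones measured in this article. Keep in mind that the OpenAI Whisper run used FP32 while the other two used INT8, and that other apps were running in the background.

```python
def to_seconds(minutes, seconds):
    return minutes * 60 + seconds


# (tool, audio length in s, processing time in s), taken from the runs above.
RUNS = [
    ("OpenAI Whisper", to_seconds(1, 48), to_seconds(8, 44)),
    ("Faster Whisper", to_seconds(1, 48), 32),
    ("Faster Whisper", to_seconds(19, 21), to_seconds(5, 58)),
    ("WhisperX", to_seconds(1, 48), 40.69),
    ("WhisperX", to_seconds(19, 21), to_seconds(6, 28)),
]


def real_time_factor(audio_s, processing_s):
    """Processing time divided by audio length; below 1.0 beats real time."""
    return processing_s / audio_s


for tool, audio_s, proc_s in RUNS:
    print(f"{tool:15} {audio_s:5}s audio  RTF {real_time_factor(audio_s, proc_s):.2f}")
```

This gives an RTF of about 4.85 for OpenAI Whisper, 0.30 and 0.31 for Faster Whisper, and 0.38 and 0.33 for WhisperX. On the longer video, the alignment step cost WhisperX only a few percentage points of real time.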
