Lately, I've been looking into workflows for creating English video subtitles. A lot of English short-form videos have amazing subtitle effects—especially in tech news, business interviews, and podcast clips. The captions aren't just accurate; they feature word highlighting, per-word animations, color changes, scaling emphasis, and more, which gives them a highly professional look.
For example: Alex Hormozi Shorts
After digging into it, I found out that this subtitle style has become a pretty standard production technique among creators. It's usually called:
- Hormozi-style captions
- Dynamic captions / kinetic typography
- Word-by-word captions / karaoke-style highlighting
I wanted to try replicating these effects. But before diving in, I needed some good test samples. Ideally, I wanted a video that already had subtitles so I could focus on parsing them, handling the timeline, rendering styles, and animating—rather than jumping straight into speech recognition and generating subtitles from scratch.
With that in mind, I turned to YouTube first. Many videos already have subtitles, and YouTube has a built-in transcript feature.
I started downloading YouTube videos using yt-dlp, a tool that can download video files as well as list or download existing YouTube subtitles. Here are some common commands:
GitHub Repo
yt-dlp/yt-dlp
A feature-rich command-line audio/video downloader
# --js-runtimes node lets yt-dlp use Node.js to run the JavaScript that YouTube extraction depends on
# List subtitles for a video
yt-dlp --js-runtimes node --list-subs "VIDEO_URL"
# Download the best quality video + user-uploaded English subtitles
# To avoid downloading 4K, you can limit it to 1080P: -f "bv*[height<=1080]+ba/b[height<=1080]/b"
yt-dlp --js-runtimes node -f "bv*+ba/b" --write-subs --sub-lang en --sub-format srt "VIDEO_URL"
# Download the best quality video + automatically generated English subtitles
yt-dlp --js-runtimes node -f "bv*+ba/b" --write-auto-subs --sub-lang en --sub-format srt "VIDEO_URL"
# Only download subtitles, not the video
yt-dlp --js-runtimes node --skip-download --write-subs --write-auto-subs --sub-lang en --sub-format srt "VIDEO_URL"
At first, I figured I could just use the subtitles already on YouTube, whether user-uploaded or auto-generated, straight for post-production testing. But after downloading a few, I realized they're more like "viewing aids" and aren't really suited for a video production pipeline.
The Problem with YouTube Transcripts
- Inaccurate timing: Some subtitles don't sync tightly with the speech, which becomes super obvious when you try to do word-by-word highlighting or beat-synced animations.
- Unnatural line breaks: Especially in news and interviews, subtitles are often split mechanically based on the speech recognition output, rather than by natural phrasing or reading rhythm.
- Unusable formatting: Some user-uploaded subtitles are entirely in ALL CAPS. They're fine for reading but terrible as source material for fine-grained subtitle work.
- Poor reading experience: Subtitles might appear too late, disappear too early, or chop a sentence up too finely, making layout and animation a nightmare later on.
It turns out that "transcribable" and "production-ready" are two very different things.
YouTube transcripts are great as a reference to quickly grasp a video's content. But if you're aiming for professional subtitle effects—like word highlighting, precise timing, natural breaks, and a better visual rhythm—they're not the ideal starting point.
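A quick way to see these problems in a downloaded file is to parse the SRT and flag suspicious cues. The following is a minimal sketch, not part of my original workflow: the helper names (parse_srt, audit) and the 0.5-second threshold are assumptions chosen for illustration, and the sample text is made up.

```python
import re

# Matches an SRT timing line such as "00:01:20,700 --> 00:01:26,879".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_srt(text):
    """Return a list of (start, end, text) tuples from SRT content."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                g = match.groups()
                body = " ".join(part.strip() for part in lines[i + 1:])
                cues.append((to_seconds(*g[:4]), to_seconds(*g[4:]), body))
                break
    return cues

def audit(cues, min_duration=0.5):
    """Flag cues that are ALL CAPS or shorter than min_duration (assumed threshold)."""
    problems = []
    for start, end, body in cues:
        letters = [c for c in body if c.isalpha()]
        if letters and all(c.isupper() for c in letters):
            problems.append((start, "all caps"))
        if end - start < min_duration:
            problems.append((start, "too short"))
    return problems

# Made-up sample, shaped like a downloaded subtitle file.
sample = """1
00:00:01,000 --> 00:00:03,500
WELCOME BACK TO THE SHOW

2
00:00:03,500 --> 00:00:03,700
Today

3
00:00:03,700 --> 00:00:06,000
we look at subtitles.
"""

for start, issue in audit(parse_srt(sample)):
    print(f"{start:.2f}s: {issue}")
```

Checks like these only catch mechanical problems. Loose sync and awkward line breaks still need a human eye or a proper alignment step.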
In the AI era, plenty of ASR (Automatic Speech Recognition) tools can turn speech directly into subtitle files. There are tons of commercial APIs and web apps out there, but as a developer, I prefer an open-source, local, and controllable approach. I want models running on my own hardware, a pipeline I can understand piece by piece, and something I can easily plug into my own video generation workflow later.
After consulting ChatGPT, Gemini, and Claude, and doing some of my own research, all signs pointed to the Whisper family: OpenAI Whisper, Faster Whisper, and WhisperX.
In short:
- OpenAI Whisper is the original model—great for understanding baseline capabilities and basic CLI workflows.
- Faster Whisper is a reimplementation of Whisper using CTranslate2—it's generally much faster and uses less memory.
- WhisperX adds forced alignment on top of Whisper's recognition to provide highly accurate word-level timestamps—perfect for subtitle alignment, word highlighting, and post-production.
Overall, WhisperX seemed like exactly what I needed. But out of habit, I didn't want to just jump straight to the final tool. I wanted to start with OpenAI Whisper, move to Faster Whisper, and finally try WhisperX, so I could understand how they relate, how they differ, and where each one fits in.
So, this article is a walkthrough of my local experiments with OpenAI Whisper, Faster Whisper, and WhisperX, along with my final thoughts and conclusions.
My test setup:
- MacBook Air 15-inch
- Chip: Apple M5
- Memory: 24GB
- OS: macOS Tahoe 26.4.1
OpenAI Whisper: From "Transcribing" to "Producing"
GitHub Repo
openai/whisper
Robust Speech Recognition via Large-Scale Weak Supervision
Whisper is an automatic speech recognition (ASR) model that OpenAI open-sourced in 2022. It's had a massive impact on the open-source ASR ecosystem. A lot of the tools that came after it are either directly related to or built on top of Whisper's capabilities, including:
- Faster Whisper
- WhisperX
- whisper.cpp
- Various WebUIs, desktop clients, and subtitle toolchains
Even in 2026, Whisper is still one of the most important foundational models in open-source ASR. That's thanks to its open-source nature, stable quality, full multilingual support, excellent English recognition, and a mature community. Of course, plenty of new alternatives have popped up since then—you can check out the Hugging Face leaderboard for more.
Environment Setup
The official requirements are Python 3.9 and PyTorch 1.10.1.
But in practice, the supported range is pretty flexible:
- Python 3.8 to 3.11
- Any reasonably recent version of PyTorch
I used conda to isolate my test environment. Conda isn't strictly required, but I highly recommend using a dedicated virtual environment to avoid dependency conflicts, especially when running things locally.
# Optional
conda create -n test_whisper python=3.11
conda activate test_whisper
- Install Whisper:
pip install -U openai-whisper
- Whisper doesn't decode audio or video on its own; it relies on ffmpeg. If you're on macOS, you can install it like this:
brew install ffmpeg
- I didn't run into any issues with Rust or tiktoken, so I won't cover those here. If you run into errors with them, just check the official docs.
Models
Whisper's classic model lineup includes:
- tiny
- base
- small
- medium
- large
Later on, OpenAI and the community added variants like:
- large-v2
- large-v3
- turbo
If you check out Whisper's model list on Hugging Face, you'll see there are 12 models available today:
| Model | Parameters | VRAM | Notes |
|---|---|---|---|
| tiny | 39M | ~1GB | Very fast, lower accuracy |
| base | 74M | ~1GB | Good for initial testing |
| small | 244M | ~2GB | Balanced |
| medium | 769M | ~5GB | Higher accuracy |
| large | 1550M | ~10GB | High quality |
| turbo | 809M | ~6GB | Heavily optimized fast variant |
- The .en suffix indicates English-only models (such as small.en and medium.en), which are usually a bit faster and more accurate for English content.
- Models without the .en suffix are multilingual.
I went with large-v3-turbo—the latest speed-optimized variant—which is more than enough for my machine.
The model weights download automatically the first time you run a command. On macOS, they're saved in ~/.cache/whisper.
Testing
whisper "test.mp4" \
--model large-v3-turbo \
--language en \
--task transcribe \
--output_format all \
--fp16 False \
--device cpu \
--output_dir .
- test.mp4:
  - The input video file. Whisper automatically calls ffmpeg to extract the audio.
  - This means mp4, mov, mkv, mp3, and wav all work fine.
- --model large-v3-turbo:
  - Specifies which model to use (it downloads automatically if you don't have it).
- --language en:
  - Forces the recognition language to English.
  - If you leave this out, Whisper will try to auto-detect the language first.
- --task transcribe:
  - Tells Whisper to transcribe the audio.
  - The other option is translate (which translates audio into English).
- --output_format all:
  - Generates all output formats: txt, srt, vtt, tsv, and json.
  - You can also just specify a single format like srt.
- --fp16 False:
  - On a macOS CPU, using FP16 doesn't necessarily speed things up, so I turned it off.
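The same options can be passed from Python instead of the CLI. Here is a rough sketch, assuming the whisper.load_model and model.transcribe calls shown in the project's README. The helper names (transcribe, to_rows) and the stand-in result are mine, added only for illustration.

```python
def to_rows(result):
    """Flatten a Whisper result dict into (start, end, text) tuples."""
    return [
        (round(seg["start"], 2), round(seg["end"], 2), seg["text"].strip())
        for seg in result["segments"]
    ]

def transcribe(media_path, model_name="large-v3-turbo"):
    # Imported lazily so to_rows() can be used without the package installed.
    import whisper

    model = whisper.load_model(model_name, device="cpu")
    # Keyword arguments mirror --language, --task, and --fp16 from the CLI call.
    return model.transcribe(media_path, language="en", task="transcribe", fp16=False)

# Stand-in for real output, shaped like the "segments" list Whisper returns.
fake_result = {"segments": [{"start": 0.0, "end": 2.48, "text": " Hello there."}]}
print(to_rows(fake_result))
```

With the package installed, to_rows(transcribe("test.mp4")) should give the real segments in the same shape.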
For context, INT8, FP16, and FP32 refer to the model's compute precision:
| Type | Precision | Speed | Memory Footprint |
|---|---|---|---|
| FP32 | Highest | Slowest | Largest |
| FP16 | Higher | Faster | Smaller |
| INT8 | Lower | Fastest | Leanest |
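To make the memory column concrete, here is a back-of-the-envelope calculation. It assumes 4, 2, and 1 bytes per parameter for FP32, FP16, and INT8, and counts the raw weights only. Real memory use is higher because of activations and runtime overhead, so treat the numbers as a lower bound.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}

def weight_size_gb(num_params, precision):
    """Approximate size of the raw weights in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

turbo_params = 809_000_000  # turbo parameter count from the model table above

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_size_gb(turbo_params, precision):.2f} GB")
```

For the turbo model this works out to a bit over 3 GB of weights in FP32, about half that in FP16, and about a quarter in INT8.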
Conclusion
Here's how it performed on my machine:
- A 1-minute 48-second video took about 8 minutes and 44 seconds to process.
- The generated subtitles still had some noticeable issues:
| Issue | Description |
|---|---|
| Micro-segments | Some subtitle timelines were only 0.18s or 0.2s long. |
| Hallucinations | Obvious repeated text or phrases. |
| Choppy timelines | Sentences were split up too finely, causing timestamp instability. |
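The first two issues in the table can be spotted automatically. The sketch below is one possible check, not something built into Whisper: the function names, the 0.3-second cutoff, and the similarity settings are all assumptions, and the sample segments are made up. It expects segments as dicts with start, end, and text keys.

```python
from difflib import SequenceMatcher

def find_micro_segments(segments, min_duration=0.3):
    """Return indexes of segments shorter than min_duration seconds."""
    return [i for i, s in enumerate(segments) if s["end"] - s["start"] < min_duration]

def find_repeats(segments, window=3, threshold=0.8, prefix_len=40):
    """Return (earlier, later) index pairs whose opening text is nearly identical."""
    pairs = []
    for i, seg in enumerate(segments):
        head = seg["text"].strip().lower()[:prefix_len]
        for j in range(max(0, i - window), i):
            other = segments[j]["text"].strip().lower()[:prefix_len]
            # Skip very short lines, which match each other too easily.
            if min(len(head), len(other)) < 20:
                continue
            if SequenceMatcher(None, head, other).ratio() >= threshold:
                pairs.append((j, i))
    return pairs

# Made-up segments that imitate a micro-segment and a repeated sentence opening.
segments = [
    {"start": 0.0, "end": 4.2, "text": "The mayor spoke to reporters outside city hall on Monday."},
    {"start": 4.2, "end": 4.38, "text": "Thank you."},
    {"start": 4.4, "end": 9.0, "text": "The mayor spoke to reporters outside city hall about the budget."},
]

print("micro-segments:", find_micro_segments(segments))
print("repeats:", find_repeats(segments))
```

Comparing only the opening characters is a deliberate simplification. In my experience the repeated text usually restarts a sentence from its beginning, so the prefix is where the overlap shows up.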
Honestly, the speed wasn't great. But to be fair, I also had a lot running in the background:
- Chrome
- Cursor
- Fork
- Visual Studio Code
- Various background services
So take these timings with a grain of salt. Still, it became pretty clear that the original openai-whisper package is really meant to be an "official reference implementation" for research and validating model capabilities, rather than a highly optimized production tool.
And that's exactly why Faster Whisper exists.
Faster Whisper
GitHub Repo
SYSTRAN/faster-whisper
Faster Whisper transcription with CTranslate2
The LLMs strongly recommended this one. It is a reimplementation of OpenAI Whisper on top of CTranslate2, a high-performance inference engine optimized for Transformer models.
| Feature | OpenAI Whisper | Faster Whisper |
|---|---|---|
| Nature | Official implementation | High-performance reimplementation |
| Inference Stack | PyTorch | CTranslate2 |
| Primary Focus | Reference & research | Inference optimization |
| CPU Performance | Moderate | Excellent |
Environment Setup
- Python 3.9+
- No need to install the ffmpeg CLI separately.
This is a nice departure from the original Whisper. Faster Whisper uses PyAV, which provides Python bindings for FFmpeg.
In simple terms:
- It uses FFmpeg's capabilities internally.
- But it doesn't require you to have ffmpeg installed on your system.
To install it:
pip install faster-whisper
Models
Faster Whisper uses Whisper weights that have been converted to the CTranslate2 format, rather than raw PyTorch weights. You can find them here: SYSTRAN Whisper Models.
- They are based on the exact same Whisper models.
- Only the weight formats are different (PyTorch vs. CTranslate2).
- The default download location is usually ~/.cache/huggingface.
- Models will download automatically the first time you run the script.
- During the first download, it might prompt you for hf auth login. I recommend setting up a Hugging Face token to speed up downloads.
Testing
Unlike OpenAI Whisper, Faster Whisper is primarily designed to be used via Python. Here's the test script I wrote:
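from faster_whisper import WhisperModel

video = "test.mp4"
model_size = "large-v3-turbo"

# Load the CTranslate2 model on the CPU with 8-bit quantization.
model = WhisperModel(
    model_size,
    device="cpu",
    compute_type="int8",
)

# transcribe() returns a lazy generator; decoding runs as the segments are iterated.
segments, info = model.transcribe(
    video,
    language="en",
    vad_filter=True,
    beam_size=5,
)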
def fmt(t):
    # Convert seconds to the SRT timestamp form HH:MM:SS,mmm.
    # Rounding to whole milliseconds first avoids float truncation (86.88 s becoming ",879").
    total_ms = round(t * 1000)
    h, rest = divmod(total_ms, 3_600_000)
    m, rest = divmod(rest, 60_000)
    s, ms = divmod(rest, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

with open("test.srt", "w", encoding="utf-8") as f:
    for i, segment in enumerate(segments, start=1):
        f.write(f"{i}\n")
        f.write(f"{fmt(segment.start)} --> {fmt(segment.end)}\n")
        f.write(segment.text.strip() + "\n\n")
- compute_type="int8":
  - Uses 8-bit quantized inference.
  - Greatly reduces memory usage.
  - Usually noticeably faster on a CPU.
  - It also supports float32 and float16.
- vad_filter=True:
  - Enables Voice Activity Detection (VAD).
  - This tries to filter out silence, background noise, and non-speech segments. The original Whisper doesn't do this by default.
- beam_size=5:
  - A parameter for beam search. Larger values make the search more thorough, which can improve accuracy but slows down inference.
  - beam_size=5 is a solid middle ground.
Conclusion
| Video Length | Processing Time |
|---|---|
| 1m 48s | 32s |
| 19m 21s | 5m 58s |
- The speed improvement was massive.
- However, I still occasionally saw repeated subtitles. For example, this line was duplicated with an incorrect timeline: "President Trump posting about the incident on social media":
18
00:01:20,700 --> 00:01:26,879
President Trump posting about the incident on social media said it goes to show how important it is for presidents and
19
00:01:26,879 --> 00:01:26,959
future presidents to be protected.
20
00:01:26,959 --> 00:01:27,219
The White House needs to be protected.
21
00:01:27,239 --> 00:01:35,900
President Trump posting about the incident on social media said it goes to show how important it is for future presidents to get the most safe and secure space of its kind ever built, referencing his ballroom and what he says will be the security complex underneath.
WhisperX
GitHub Repo
m-bain/whisperX
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
Both OpenAI Whisper and Faster Whisper do a great job handling:
- Speech recognition
- Subtitle generation
- Local execution
- Inference speed
But when you're producing video subtitles, timeline precision is everything—especially if you want to do word highlighting, per-word animations, or fine-grained syncing.
While Faster Whisper does support word-level timestamps, WhisperX takes it a step further by adding wav2vec2 forced alignment to generate highly precise word-level timestamps. That's exactly why the LLMs and my search results kept recommending it.
For context, wav2vec 2.0 is a speech model developed by Meta.
| Model | Excels At |
|---|---|
| Whisper | Speech recognition (transcription) |
| wav2vec 2.0 | Audio-to-text time alignment |
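To show why word-level timing matters for highlighting, here is a small sketch that turns a list of timed words into one cue per word, with the active word marked. The input shape (dicts with word, start, and end keys) is an assumption modeled on typical word-level output, the asterisk marker is a placeholder for whatever styling a renderer would apply, and the sample words are made up.

```python
def highlight_cues(words, marker="*"):
    """Return (start, end, text) cues, one per word, marking the active word."""
    cues = []
    for i, word in enumerate(words):
        parts = [
            f"{marker}{w['word']}{marker}" if j == i else w["word"]
            for j, w in enumerate(words)
        ]
        # Hold each cue until the next word starts so the line never flickers off.
        end = words[i + 1]["start"] if i + 1 < len(words) else word["end"]
        cues.append((word["start"], end, " ".join(parts)))
    return cues

# Made-up word timings for a three-word line.
words = [
    {"word": "Subtitles", "start": 0.00, "end": 0.52},
    {"word": "that", "start": 0.60, "end": 0.74},
    {"word": "pop", "start": 0.80, "end": 1.10},
]

for start, end, text in highlight_cues(words):
    print(f"{start:.2f}-{end:.2f}  {text}")
```

If a word's start time is off by even a few hundred milliseconds, the highlight lands on the wrong word, which is why alignment quality matters more here than for plain sentence-level subtitles.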
Environment Setup
- Conda is highly recommended.
- Or at least a dedicated virtual environment.
WhisperX's dependencies are noticeably heavier than the previous tools. It requires:
- PyTorch
- torchaudio
- pyannote
- transformers
- faster-whisper
- CTranslate2
Because there are so many dependencies, the underlying library versions can easily conflict with each other, which is why a clean environment is so important.
pip install whisperx
Models
WhisperX actually integrates Faster Whisper under the hood.
This means it also uses the CTranslate2-formatted Whisper models.
If you've already run Faster Whisper, you can reuse many of the model files (usually found in ~/.cache/huggingface).
On top of the Whisper models, WhisperX will also download:
- wav2vec 2.0 alignment models
- pyannote models (if you enable speaker diarization)
Testing
WhisperX supports both a CLI and a Python API.
whisperx test.mp4 \
--model large-v3-turbo \
--device cpu \
--compute_type int8 \
--output_dir . \
--output_format all \
--language en
For the same 1m 48s video, it took 40.699s.
During testing, I also ran into this warning:
UserWarning:
torchcodec is not installed correctly so built-in audio decoding will fail. Solutions are:
* use audio preloaded in-memory as a {'waveform': (channel, time) torch.Tensor, 'sample_rate': int} dictionary;
* fix torchcodec installation. Error message was:
Could not load libtorchcodec. Likely causes:
1. FFmpeg is not properly installed in your environment. We support
versions 4, 5, 6 and 7
The reason? I had FFmpeg 8.x installed, but torchcodec (PyTorch's newer media decoding stack) currently only supports FFmpeg versions 4 through 7. It doesn't break anything, so you can safely ignore it for now.
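For reference, the CLI run above has a rough equivalent in the Python API. This sketch follows the usage pattern in the WhisperX README (load_audio, load_model, transcribe, load_align_model, align) as I understand it, so check the signatures against your installed version. The helper names transcribe_and_align and collect_words are mine, and the fake result at the bottom is only a stand-in for real output.

```python
def collect_words(aligned_result):
    """Flatten aligned segments into a list of word dicts that carry timestamps."""
    words = []
    for segment in aligned_result["segments"]:
        for word in segment.get("words", []):
            # Tokens the aligner could not place (some numerals, for example) lack times.
            if "start" in word and "end" in word:
                words.append(word)
    return words

def transcribe_and_align(media_path, device="cpu"):
    # Imported lazily so collect_words() works without WhisperX installed.
    import whisperx

    audio = whisperx.load_audio(media_path)
    model = whisperx.load_model("large-v3-turbo", device, compute_type="int8")
    result = model.transcribe(audio, language="en")
    # Second pass: align the recognized text to the audio for word-level times.
    align_model, metadata = whisperx.load_align_model(language_code="en", device=device)
    return whisperx.align(result["segments"], align_model, metadata, audio, device)

# Stand-in for aligned output; the second word has no timing on purpose.
fake_aligned = {
    "segments": [
        {"words": [{"word": "Hello", "start": 0.1, "end": 0.4}, {"word": "2014"}]}
    ]
}
print(collect_words(fake_aligned))
```

With WhisperX installed, collect_words(transcribe_and_align("test.mp4")) should return the timed words for the test video.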
Conclusion
| Video Length | Processing Time |
|---|---|
| 1m 48s | 40.69s |
| 19m 21s | 6m 28s |
- It's slightly slower than Faster Whisper (due to the extra alignment step).
- On my test samples, I didn't see any obvious text repetition or timeline glitches anymore. Of course, larger-scale testing might still surface some edge cases, but the results were very promising.
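To put the three timings for the 1m 48s clip side by side, it helps to express them as a multiple of real time (audio seconds processed per second of wall-clock time). The sketch below just redoes the arithmetic on the numbers reported above. The function names are mine.

```python
def to_seconds(minutes, seconds):
    return minutes * 60 + seconds

def realtime_speed(audio_seconds, processing_seconds):
    """Seconds of audio processed per second of wall-clock time."""
    return audio_seconds / processing_seconds

clip = to_seconds(1, 48)  # the 1m 48s test video

# Processing times measured in the sections above.
runs = {
    "OpenAI Whisper": to_seconds(8, 44),
    "Faster Whisper": 32,
    "WhisperX": 40.69,
}

for name, elapsed in runs.items():
    print(f"{name}: {realtime_speed(clip, elapsed):.2f}x realtime")
```

That puts OpenAI Whisper at roughly 0.2x real time, Faster Whisper at about 3.4x, and WhisperX at about 2.7x. Keep in mind the runs are not strictly comparable: the OpenAI Whisper test ran in FP32 while the other two used INT8, and my machine had other apps open.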