Speech Transcription on M4 Pro: From CPU Torture to GPU Bliss
The Problem
I recently recorded a brainstorming session (15MB m4a, ~20 minutes) and wanted to transcribe it locally. OpenAI’s Whisper seemed like the obvious choice — open source, free, runs locally, no privacy concerns.
My machine: MacBook Pro M4 Pro with 24GB unified memory.
First Attempt: CPU Whisper
I used the Homebrew-installed whisper CLI with the turbo model:
```shell
whisper recording.m4a \
  --model turbo \
  --language en \
  --initial_prompt "Context about the recording topic." \
  --output_dir ./docs \
  --output_format txt
```
Then I waited… for 2.5 hours. The CPU temperature spiked, the fans went full blast, and top showed over 1000% CPU usage.
The output directory was empty. Not a single word.
The issue: OpenAI’s reference Whisper implementation runs on PyTorch, which falls back to the CPU on macOS because its Metal (MPS) support is incomplete. The M4 Pro’s GPU sits completely idle.
The Solution: mlx-whisper
Apple Silicon Macs have a dedicated ML framework called MLX, built by Apple specifically for its chips (think PyTorch, but for Apple Silicon). The community built mlx-whisper on top of it, leveraging Metal GPU acceleration.
Installation
```shell
# macOS Homebrew Python doesn't allow direct pip install (PEP 668
# "externally managed environment"), so use pipx
pipx install mlx-whisper
```
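If the command isn’t found after installation, the usual culprit is that pipx’s bin directory isn’t on your PATH yet. A minimal check, sketched as a hypothetical helper function (not part of pipx itself):

```shell
# hedged sketch: check whether pipx put a CLI on PATH
# (pipx installs binaries into ~/.local/bin by default)
check_cli() {
  if command -v "$1" >/dev/null 2>&1; then echo ok; else echo missing; fi
}
check_cli mlx_whisper   # "missing" means: run 'pipx ensurepath' and restart your shell
```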
Usage
```shell
mlx_whisper \
  --model mlx-community/whisper-turbo \
  --language en \
  --initial-prompt "Context about the recording topic." \
  --output-dir ./docs \
  --output-format txt \
  recording.m4a
```
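Since the model stays cached after the first run, batching several recordings is cheap. A minimal sketch (the `transcribe_all` function name and directory layout are my own; it assumes mlx_whisper is on PATH and reuses the flags from the single-file call above):

```shell
# hedged sketch: batch-transcribe every .m4a in a directory
transcribe_all() {
  local src="$1" out="$2"
  for f in "$src"/*.m4a; do
    [ -e "$f" ] || continue   # skip when the glob matched nothing
    mlx_whisper --model mlx-community/whisper-turbo \
      --language en --output-dir "$out" --output-format txt "$f"
  done
}
# usage: transcribe_all ./recordings ./docs
```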
Result: Transcription completed in a few minutes with good quality, handling mixed English-Mandarin segments well.
Performance Comparison
| Method | Time | CPU Usage | Result |
|---|---|---|---|
| whisper (CPU) | 2.5+ hours | >1000% | No output |
| mlx-whisper (GPU) | ~3 minutes | Normal | Full transcript |
That’s at least a 50x difference, and the CPU run never even finished. If you’re on Apple Silicon, mlx-whisper is non-negotiable.
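The 50x figure is simple arithmetic, generously treating the CPU run as if it had finished right at the 2.5-hour mark:

```shell
# back-of-envelope speedup: 2.5 h (150 min) on CPU vs ~3 min on GPU
cpu_min=150
gpu_min=3
echo "$(( cpu_min / gpu_min ))x"   # prints "50x"
```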
About the Turbo Model
Whisper turbo’s full name is large-v3-turbo — turbo is just the shorthand. It was created by pruning large-v3’s 32 decoder layers down to 4, then fine-tuning:
- Parameters: 809M (between medium at 769M and large at 1550M)
- Speed: ~8x faster than large
- Accuracy: Only ~1-2% WER increase
- Multilingual only, no English-only variant
For most use cases, turbo offers the best bang for your buck.
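The ~8x speed claim tracks the pruning ratio, since the decoder dominates Whisper’s inference time. This is a rough correspondence, not an exact runtime model:

```shell
# decoder layers: large-v3 has 32, turbo keeps 4
echo "$(( 32 / 4 ))x fewer decoder layers"   # prints "8x fewer decoder layers"
```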
Open-Source STT Landscape in 2026
I did some research and found that Whisper hasn’t been updated in a while:
- 2022-09: Whisper initial release
- 2023-11: large-v3 released
- 2024-10: large-v3-turbo released (last update)
OpenAI has shifted focus to closed-source API models (gpt-4o-transcribe). The open-source Whisper is essentially in maintenance mode.
Meanwhile, other open-source models have surpassed Whisper:
| Model | WER | Highlights | M4 Pro Compatible? |
|---|---|---|---|
| NVIDIA Canary Qwen 2.5B | 5.63 | Current open-source best | ⚠️ Requires NeMo, poor Mac support |
| IBM Granite Speech 8B | 5.85 | Enterprise-grade | ⚠️ 8B model too large, 24GB barely enough |
| Whisper Large V3 | 7.4 | Most mature ecosystem | ✅ via mlx-whisper |
| Whisper Turbo | 7.75 | Fast | ✅ via mlx-whisper |
| NVIDIA Parakeet TDT 1.1B | ~8.0 | Ultra-low latency | ✅ MLX version available |
For Apple Silicon users, mlx-whisper (turbo) remains the most practical choice — mature ecosystem, easy setup. If you need lower latency, Parakeet TDT also has an MLX version worth trying.
Cache Cleanup Reminder
Whisper model caches eat up significant disk space. I found 6 different model versions cached on my machine, totaling 11.3GB. Remember to clean up after use:
```shell
# Check cache sizes
du -sh ~/.cache/huggingface/hub/models--*whisper*
du -sh ~/.cache/whisper/

# Remove unused models
rm -rf ~/.cache/huggingface/hub/models--unused-model
rm -rf ~/.cache/whisper/
```
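If you’d rather not point `rm -rf` at a glob, a safer pattern is a small helper that deletes one named model only if it exists. A sketch (the `clean_model` name is mine; `models--<org>--<name>` is the Hugging Face hub’s on-disk naming scheme):

```shell
# hedged sketch: remove one model from the Hugging Face cache, if present
clean_model() {
  local dir="$HOME/.cache/huggingface/hub/models--$1"
  if [ -d "$dir" ]; then
    rm -rf "$dir" && echo "removed $1"
  else
    echo "not cached: $1"
  fi
}
# usage: clean_model mlx-community--whisper-turbo
```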
Takeaways
- On Apple Silicon Macs, always use mlx-whisper — vanilla Whisper on CPU is unusably slow
- turbo (aka large-v3-turbo) is the most cost-effective model available
- While Whisper’s open-source development has stalled, the MLX ecosystem keeps it the most convenient option on Mac
- Clean up model caches after use — easily saves 10GB+ of disk space
If you found this helpful, consider buying me a coffee to support more content like this.