Whisper

openai/whisper

Whisper is OpenAI’s model and Python package for speech recognition, speech translation into English, and language identification in audio files.

Forks 12,490
Author openai
Language Python
License MIT
Synced 2026-06-09

What it is

Whisper is OpenAI’s open-source speech recognition project. It was trained on a large and diverse audio dataset and works as a single multitask model covering multilingual speech recognition, speech translation into English, language identification, and voice activity detection.

Technically it is a Transformer sequence-to-sequence model. The key design point is a shared token format for all tasks: special tokens at the start of the output specify the language and the task, and the decoder then predicts the result as one token sequence. One model therefore replaces several stages of a traditional speech-processing pipeline. That makes Whisper useful for CLI tools, Python scripts, subtitles, media archives, and voice interfaces.
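The shared token format can be pictured with a small sketch. It only builds the textual form of the special-token prefix that selects language and task; it does not call the whisper package, and the helper name build_task_prefix is hypothetical, chosen for illustration.

```python
# Illustrative sketch: the special-token prefix that tells the decoder
# which language and task to handle. build_task_prefix is a hypothetical
# helper, not part of the whisper package.

def build_task_prefix(language: str, task: str = "transcribe",
                      timestamps: bool = True) -> list:
    """Return the special tokens that precede the predicted text."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask for plain text without per-segment timestamp tokens.
        prefix.append("<|notimestamps|>")
    return prefix


# Same model, different task: only the prefix changes.
print("".join(build_task_prefix("en", "transcribe", timestamps=False)))
print("".join(build_task_prefix("fr", "translate")))
```

Because the task is selected by tokens rather than by a separate component, switching from transcription to translation changes only this prefix, not the model.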

What is inside and how people use it

The repository contains the Python package, the CLI, a model card, examples, and a table of model sizes. The sizes (tiny, base, small, medium, large, and turbo) trade off speed, memory, and accuracy: smaller models run faster and need less memory, while larger ones are generally more accurate. In practice, model choice depends on hardware, language, and task.
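As a rough illustration of choosing a size by hardware, the sketch below picks the most demanding model that fits into a given amount of GPU memory. The memory figures are approximations of the upstream size table and should be checked against the repository before relying on them; the function name pick_model and the ordering from least to most demanding are assumptions of this sketch.

```python
from typing import Optional

# Approximate GPU memory needs in GB, listed from least to most demanding.
# These numbers are assumptions based on the upstream size table; verify
# them against the repository for the version you install.
APPROX_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "turbo": 6,
    "large": 10,
}


def pick_model(available_gb: float) -> Optional[str]:
    """Return the most demanding model that fits, or None if none fits."""
    fitting = [name for name, need in APPROX_VRAM_GB.items()
               if need <= available_gb]
    return fitting[-1] if fitting else None


print(pick_model(4))   # a 4 GB card
print(pick_model(16))  # a 16 GB card
```

Memory is only one input: language and task matter as well, so a helper like this is a starting point rather than a rule.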

CLI and Python API

This example shows two ways to use Whisper: a ready-made shell command for a single file, and a programmatic call from Python.

Language: Python
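# CLI equivalent (run in a shell, not in Python):
#   whisper audio.mp3 --model turbo

import whisper

# Load the model; the weights are downloaded on first use.
model = whisper.load_model("turbo")

# Transcribe the file. ffmpeg must be installed to decode the audio.
result = model.transcribe("audio.mp3")
print(result["text"])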

Common scenarios include transcribing recordings, interviews, podcasts, lectures, videos, or calls. Another is embedding recognition into an application through the Python API and passing the resulting text on to search, editing, summarization, or subtitle generation.
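For the subtitle scenario, here is a minimal sketch that turns timed segments into SRT text. It assumes each segment is a dict with "start", "end" (seconds), and "text" keys, which is the shape of the entries in result["segments"] returned by transcribe. The helper names are hypothetical, and the sample segments are made-up data so the block runs without a model.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from dicts with 'start', 'end', and 'text' keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


# Made-up sample data standing in for result["segments"].
sample = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 5.0, "text": " Let's get started."},
]
print(segments_to_srt(sample))
```

With a real transcription, the same function would be called as segments_to_srt(result["segments"]).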

Strengths and limitations

Whisper’s strengths are simple installation and broad language coverage. It is useful when you need text from audio without building a speech model yourself.

The limitations are practical: quality depends on noise, diction, language, and model choice; ffmpeg must be installed separately; larger models need substantial memory and capable hardware. The turbo model is faster, but it is not meant for speech translation into English, so that task calls for one of the other multilingual models.
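Two of these limitations can be checked before any audio is processed. The sketch below looks for ffmpeg on the PATH and swaps turbo for another model when the task is translation. The function names and the choice of "medium" as the fallback are assumptions of this sketch, not part of the whisper package.

```python
import shutil

# Assumption: "medium" is a reasonable multilingual fallback for translation.
TRANSLATION_FALLBACK = "medium"


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable is found on the PATH."""
    return shutil.which("ffmpeg") is not None


def model_for_task(preferred: str, task: str) -> str:
    """Avoid turbo for translation; otherwise keep the preferred model."""
    if task == "translate" and preferred == "turbo":
        return TRANSLATION_FALLBACK
    return preferred


if not ffmpeg_available():
    print("ffmpeg not found: install it before transcribing audio files")
print(model_for_task("turbo", "translate"))
print(model_for_task("turbo", "transcribe"))
```

The returned name can then be passed to whisper.load_model, with task="translate" given to transcribe when English output is wanted.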