
VibeVoice

microsoft/VibeVoice

VibeVoice is Microsoft’s open model family for speech recognition, text-to-speech, and streaming voice generation.

Forks 5,466
Author microsoft
Language Python
License MIT
Synced 2026-06-10

What it is

VibeVoice is a Microsoft repository of open voice models and demonstration code. It covers three areas: long-form speech recognition, multi-speaker text-to-speech, and streaming voice generation. Its value lies less in a single installable library than in the collection of models, reports, examples, and instructions it gathers for building voice AI systems.

The repository appeared in 2025 and drew attention for its practical focus. It goes beyond short one-sentence demos to cover long recordings, multilingual scenarios, multiple speakers, and modes where response latency matters. Voice products need those distinctions because assistants, dictation, narration, and transcription each make different tradeoffs.

What is inside the repository

The repository keeps separate materials for VibeVoice-ASR, VibeVoice-TTS, and the streaming version. The documentation links to the models on Hugging Face, technical reports, Colab demos, and scripts for running the models. The ASR materials highlight support for many languages, while the speech-generation side focuses on long-form and multi-speaker use cases.

A typical starting point for experiments

The commands below follow a safe exploration order: create an isolated environment first, then install dependencies. Run one of the repository's demos next, and only after that move on to your own audio and model parameters.

Language: Bash
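# Get the code
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
# Create and activate an isolated virtual environment
python -m venv .venv
source .venv/bin/activate
# Install dependencies. The fallback to an editable install is an assumption
# for checkouts without requirements.txt; the README has the recommended command.
if [ -f requirements.txt ]; then pip install -r requirements.txt; else pip install -e .; fi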
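
Before running a demo, it can help to confirm that the environment is what you expect. The following is a minimal sketch and not part of the repository: the `preflight` helper, the `torch` entry in `required`, and the Python 3.9 floor are assumptions made for illustration, so substitute whatever the repository's install instructions actually list.

```python
import importlib.util
import sys


def preflight(required=("torch",), min_python=(3, 9)):
    """Return a dict describing whether the environment looks ready.

    `required` and `min_python` are illustrative defaults (assumptions),
    not values taken from the VibeVoice documentation.
    """
    return {
        # True when the interpreter runs inside a virtual environment (.venv)
        "in_venv": sys.prefix != sys.base_prefix,
        # True when the interpreter is at least the assumed minimum version
        "python_ok": sys.version_info >= min_python,
        # Top-level packages from `required` that are not importable
        "missing": [
            name for name in required
            if importlib.util.find_spec(name) is None
        ],
    }


if __name__ == "__main__":
    for key, value in preflight().items():
        print(f"{key}: {value}")
```

If `missing` is not empty or `in_venv` is false, the install step probably ran against a different interpreter than the one now active.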

Where it is useful

VibeVoice is useful for teams building speech interfaces, narration systems, meeting transcription tools, or research prototypes around speech models. The repository makes it easier to compare adjacent tasks inside one family: recognition, generation, and streaming output are documented side by side.

Limitations

Voice models carry both quality and misuse risks. Output quality depends on language, voice, background noise, recording length, and hardware. The repository calls out risks and limitations explicitly, so it should be treated as a serious base for research and applied work rather than a one-click source of perfect speech. Production use still needs quality checks, voice-rights review, safety decisions, and inference-cost planning.
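
Two of these checks reduce to simple arithmetic. The sketch below is an illustration and not code from the repository. It computes word error rate (WER), a common measure of recognition quality, and real-time factor (RTF), a common input to inference-cost planning. The function names and the whitespace-and-lowercase normalization are assumptions; real evaluations usually apply language-specific text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words, and `real_time_factor(30.0, 120.0)` describes a run that processed two minutes of audio in thirty seconds.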