Voicebox is a free, MIT-licensed, self-hostable AI voice studio that clones a voice from a few seconds of audio, generates speech in 23 languages, and handles voice dictation across your entire computer. For many users it can displace ElevenLabs subscriptions that run from $11 to $99 per month and remove the need for a separate dictation tool like WisprFlow. It has crossed 29,000 GitHub stars since launching in January 2026, and every model, voice sample, and audio capture stays on your machine.

What you are currently paying for

ElevenLabs is the dominant paid voice AI product for content teams, marketers, and developers who need high-quality speech generation. Its Creator tier runs $11 per month after a first-month discount, covers 100,000 characters of generated speech, and gives you access to voice cloning. The Pro tier at $99 per month extends the character limit, adds more simultaneous voices, and unlocks higher-quality models. Scale and Business tiers push into the hundreds and near a thousand dollars per month for teams generating audio at volume.

For teams that generate audio heavily - podcast introductions, training narrations, customer-facing explainers, product demos, or agent voice output - those character limits disappear quickly. A ten-minute narration at a comfortable speaking pace runs roughly 15,000 characters. A team producing one such narration a week uses about 65,000 characters a month, roughly two-thirds of the Creator allowance; a team producing two a week exceeds it.
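The budget arithmetic can be checked with a short sketch. The 15,000-character narration and the 100,000-character allowance are the figures quoted above; the function name and the 52/12 weeks-per-month convention are assumptions made for illustration.

```python
# Rough character-budget check for a capped text-to-speech plan.
# The two figures below come from the article; everything else is illustrative.

WEEKS_PER_MONTH = 52 / 12    # average weeks in a month (assumption)
CREATOR_CAP = 100_000        # Creator tier allowance quoted above
NARRATION_CHARS = 15_000     # ten-minute narration estimate quoted above


def monthly_characters(chars_per_item: int, items_per_week: float) -> float:
    """Estimate how many characters are generated in an average month."""
    return chars_per_item * items_per_week * WEEKS_PER_MONTH


for per_week in (1, 2, 3):
    used = monthly_characters(NARRATION_CHARS, per_week)
    share = used / CREATOR_CAP
    print(f"{per_week} per week: {used:,.0f} characters ({share:.0%} of the cap)")
```

Swapping in your own script lengths and publishing cadence shows quickly whether a capped tier fits your volume.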

WisprFlow, a separate product, covers the input side: it is a voice dictation tool that transcribes your speech into any text field on your computer. That is a separate subscription on top of ElevenLabs. Between the two, a solo creator or small team is easily spending $30 to $150 per month on voice tooling before factoring in any API usage.

What Voicebox actually does

Voicebox is a native desktop application built on Tauri, which pairs a Rust backend with the operating system's own webview rather than bundling a full browser engine, the memory-heavier approach Electron apps take. It runs on macOS, Windows, and Linux, and can also be deployed in Docker.

The voice generation side works through seven different AI engines you can switch between per generation: Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox Multilingual, Chatterbox Turbo, HumeAI TADA, and Kokoro. That variety matters because different engines have different strengths - Chatterbox Multilingual covers the broadest language set, Kokoro runs on a tiny 82-million-parameter model fast enough for CPU inference, and Qwen3-TTS handles delivery instructions like "speak slowly" or "whisper" as natural language.

Voice cloning is zero-shot: you give it a short audio reference and it generates speech in that voice. There are also 50-plus preset voices included, so you do not need to supply reference audio to get started. The Stories editor adds a multi-track timeline for building podcast-style conversations with different voices on separate tracks.

The dictation side works through a global hotkey that activates anywhere on your computer - any text field, any app. It uses Whisper for transcription and pastes the result directly into whatever you are typing in. This is the feature that eliminates the WisprFlow subscription for most users.

The self-hosting economics

The MIT license covers everything. There are no cloud usage fees, no character caps, and no tiered feature gates. The meaningful cost is the hardware.

The higher-quality models need a GPU to run at useful speeds. Teams that already have modern hardware pay nothing extra: Apple Silicon Macs use MLX with Metal acceleration, Windows machines with NVIDIA GPUs use CUDA, and generation runs locally on the hardware you have. The LuxTTS engine runs at 150 times realtime on CPU, so even machines without a GPU can generate standard speech.
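To make "150 times realtime" concrete: a realtime factor is the ratio of audio duration to the time taken to generate it. The sketch below applies that definition to the figure quoted above; the function name is hypothetical, and real throughput will vary with hardware and text.

```python
# Convert a realtime factor into expected generation time.
# 150x is the LuxTTS CPU figure quoted above; the helper is illustrative.

def generation_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Seconds needed to generate `audio_seconds` of speech."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


ten_minutes = 10 * 60
print(generation_seconds(ten_minutes, 150))  # 4.0
```

At that rate, a ten-minute narration would take about four seconds to generate.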

If you are running this as a server rather than a local app, you need a cloud instance with a GPU. A basic GPU instance on AWS or GCP runs $0.35 to $0.90 per hour on-demand, or around $50 to $150 per month for a shared instance running intermittently. For high-volume generation, that compares well against the $99 ElevenLabs Pro tier. For light use, it does not - the free ElevenLabs tier or the $11 Creator tier will cost less if you are only generating a few thousand characters per month.
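A rough break-even check makes the comparison concrete. The sketch below uses only the hourly rates and subscription prices quoted above; the function and variable names are assumptions for illustration, and it ignores storage, data transfer, and setup time.

```python
# Break-even sketch: how many GPU hours per month a subscription fee would buy.
# Rates and prices are the article's figures; names are illustrative.

def breakeven_hours(subscription_per_month: float, gpu_rate_per_hour: float) -> float:
    """GPU hours per month that cost the same as the subscription."""
    if gpu_rate_per_hour <= 0:
        raise ValueError("gpu_rate_per_hour must be positive")
    return subscription_per_month / gpu_rate_per_hour


PLANS = {"Creator": 11.0, "Pro": 99.0}   # USD per month
GPU_RATES = (0.35, 0.90)                 # USD per hour, on-demand

for name, price in PLANS.items():
    for rate in GPU_RATES:
        hours = breakeven_hours(price, rate)
        print(f"{name} (${price:.0f}/mo) at ${rate:.2f}/h: {hours:.0f} GPU hours")
```

If your instance runs fewer hours per month than the break-even figure, self-hosting is cheaper on compute alone; above it, the subscription costs less.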

What to know before switching

Setup requires downloading model files, which run from under 1 gigabyte for Kokoro to multi-gigabyte downloads for the higher-quality engines. First-run model downloads will take meaningful time depending on your internet connection. That is a one-time friction, but it is real friction.

The latest stable release is v0.5.0, published in late April 2026. The project has been shipping updates consistently since January, which is a positive signal, but it is still relatively young software. Pre-built binaries are available for macOS and Windows; Linux users need to build from source, which involves Node and Rust toolchains and is not a beginner process.

One important feature gap relative to ElevenLabs: there is no hosted cloud API you can call from your own applications. Voicebox does expose a local REST API and an MCP server, but both require the desktop app or a self-hosted server to be running. ElevenLabs is trivially embeddable in any application via its API. Voicebox is primarily a desktop tool that has API capabilities, not the other way around.

The voice cloning quality is competitive for most use cases but not identical to ElevenLabs' proprietary models. Open-source models have narrowed the gap significantly in 2025 and 2026, but if the nuance of a specific voice matters for your brand or production, test before canceling the subscription.

Where this fits in a business context

The realistic displacement case is a content team or creator that generates audio regularly, has at least one technically capable person who can run a download and click install, and is already on modern hardware. In that scenario, the $11 to $99 per month in ElevenLabs fees and the WisprFlow subscription both go to zero.

For teams that need a straightforward API integration, are not technical enough to manage a local install, or generate audio only occasionally, the hosted ElevenLabs product is the right call.

The gap between "cloud product you pay for" and "open-source tool you install" used to be enormous in voice AI. A 29,000-star MIT project now ships with seven TTS engines, voice cloning, a multi-track story editor, and global dictation in a single native app. In mid-2026, that gap has mostly closed on the feature side. What remains is a question of convenience versus cost.

The most surprising thing about Voicebox is not that it exists. It is how little noise it has made relative to what it does.