What it is
ElevenLabs ships the highest-fidelity text-to-speech and voice-cloning models on the market, plus a full realtime conversational stack. The official Python SDK (elevenlabs-python) is MIT licensed and currently at v2.50.0 (May 26, 2026). The same surface is mirrored in @elevenlabs/elevenlabs-js for Node, with Go, Java and .NET SDKs in the wider org.
The product surface has four pillars: Text-to-Speech (synth from any string), Instant Voice Cloning (IVC) (build a custom voice from ~1 minute of audio), Speech-to-Speech (re-perform a recording in a different voice), and ElevenAgents (realtime conversational AI with ASR + LLM + TTS in one WebSocket pipeline). Output formats span MP3, PCM, μ-law and Opus across multiple sample rates.
Architecture
Every model is a transformer trained on multilingual paired audio/text. The SDK is a thin client over a REST + WebSocket API at api.elevenlabs.io. Auth is a single header, xi-api-key. Four models cover four points on the latency–quality curve:
- Eleven v3 — the most expressive model, dramatic delivery, supports [laughs], [whispers], [sighs] audio-tag directives. 70+ languages. Best for narration, audiobooks, video VO.
- Eleven Multilingual v2 — stable accent reproduction, 29 languages. Best for branded product voices.
- Eleven Flash v2.5 — ~75ms model latency, 50% cheaper. Best for interactive agents.
- Eleven Turbo v2.5 — speed-optimised, lower compute. Best for high-throughput apps where you generate thousands of clips.
Output formats you’ll actually use: mp3_44100_128 (default), mp3_44100_192 (higher quality), pcm_16000 / pcm_22050 / pcm_44100 (for downstream DSP), ulaw_8000 (Twilio phone lines).
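Since the SDK is a thin client, it can help to see roughly what one call looks like on the wire. The following is a minimal sketch using only the standard library. It assumes the text-to-speech route is POST /v1/text-to-speech/{voice_id} with output_format as a query parameter and a JSON body. The helper name build_tts_request and the placeholder key "sk_example" are made up for illustration, and the request is built but never sent.

```python
import json
import os
import urllib.parse
import urllib.request

API_BASE = "https://api.elevenlabs.io"


def build_tts_request(text, voice_id, model_id="eleven_flash_v2_5",
                      output_format="mp3_44100_128", api_key=None):
    """Build (but do not send) a text-to-speech HTTP request."""
    # Fall back to the same env var the SDK client reads.
    api_key = api_key or os.environ.get("ELEVENLABS_API_KEY", "")
    query = urllib.parse.urlencode({"output_format": output_format})
    url = f"{API_BASE}/v1/text-to-speech/{voice_id}?{query}"
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        # Auth is the single xi-api-key header.
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
    )


# "sk_example" is a placeholder, not a real key.
req = build_tts_request("Hello.", "JBFqnCBsd6RMkjVDRZzb", api_key="sk_example")
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req).read()
```

In practice the SDK is the better route because it adds typing, retries and streaming helpers, but the raw shape is useful when debugging from curl or a language without an official client.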
Install
# Python
pip install elevenlabs
# Optional: mpv or ffmpeg for the play() helper
brew install mpv
# Node / TypeScript
npm install @elevenlabs/elevenlabs-js
Set the API key as an env var (the client auto-reads ELEVENLABS_API_KEY):
export ELEVENLABS_API_KEY=sk_…
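A missing key tends to surface as an authentication error deep inside the first request, so a fail-fast check at startup can save some debugging. This is a small sketch, not part of the SDK: require_api_key and redact are hypothetical helper names.

```python
import os


def require_api_key(env=os.environ, name="ELEVENLABS_API_KEY"):
    """Return the key from the environment or fail with a clear message."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating the client")
    return key


def redact(key, visible=4):
    """Shorten a key for logs so the full secret is never printed."""
    return key[:visible] + "..." if len(key) > visible else "..."
```

Calling require_api_key() once at process start and logging only redact(key) keeps the secret out of deploy logs.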
Configuration
Two scopes of configuration matter here: the client (auth + base URL) and the per-request voice settings. The client is constructed once and reused; it reads ELEVENLABS_API_KEY from the environment by default, or you can pass an explicit api_key argument for multi-tenant servers where the key changes per request. Voice settings are passed per request and control how the model interprets the text. The four knobs:
- stability (0.0–1.0) — lower = more emotional range and inflection variance; higher = monotone, predictable.
- similarity_boost (0.0–1.0) — how strictly the output adheres to the reference voice. Pushing too high can amplify reference artifacts.
- style (0.0–1.0) — style exaggeration. v3 / Multilingual v2 only.
- use_speaker_boost (bool) — sharpens speaker resemblance, at a small cost in compute (and so latency).
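Because three of the four knobs share the same 0.0 to 1.0 range, it is easy to validate them before a request leaves your server. The sketch below is a local helper, not the SDK's own settings type: the class name VoiceKnobs and its default values are assumptions chosen for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VoiceKnobs:
    # Defaults here are illustrative, not the API's documented defaults.
    stability: float = 0.5
    similarity_boost: float = 0.75
    style: float = 0.0
    use_speaker_boost: bool = True

    def __post_init__(self):
        # Reject out-of-range values locally instead of on the API round trip.
        for name in ("stability", "similarity_boost", "style"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be within 0.0-1.0, got {value}")

    def as_request_dict(self):
        """Plain dict in the shape passed as voice_settings."""
        return asdict(self)


expressive = VoiceKnobs(stability=0.4, style=0.2)
print(expressive.as_request_dict())
```

The resulting dict has the same keys as the voice_settings argument used in the streaming example further down.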
Voice IDs come from the public voice library (elevenlabs.voices.search()) or from your own IVC clones. The example voice JBFqnCBsd6RMkjVDRZzb is George, one of the default English voices.
Code examples
Basic TTS — one call, MP3 bytes back, playback via mpv:
from elevenlabs.client import ElevenLabs
from elevenlabs.play import play

elevenlabs = ElevenLabs()  # reads ELEVENLABS_API_KEY

audio = elevenlabs.text_to_speech.convert(
    text="The first move is what sets everything in motion.",
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_v3",
    output_format="mp3_44100_128",
)
play(audio)
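Save to disk (run this instead of play(), or call convert() again first: the returned audio is a one-shot iterator of byte chunks, so it is empty once play() has consumed it):
with open("output.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)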
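A slightly more defensive version of the save step, as a sketch: it works on any iterable of byte chunks and refuses to leave a zero-byte file behind, which is what you get when the chunk iterator has already been drained. The helper name save_audio is hypothetical.

```python
from pathlib import Path
from typing import Iterable


def save_audio(chunks: Iterable[bytes], path: str) -> int:
    """Write audio chunks to path and return the number of bytes written."""
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
                total += len(chunk)
    if total == 0:
        # Remove the empty file so a failed save is not mistaken for a clip.
        Path(path).unlink(missing_ok=True)
        raise ValueError("no audio bytes received; was the iterator already consumed?")
    return total
```

Usage would be save_audio(audio, "output.mp3") with the audio value returned by convert().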
Realtime streaming — bytes start arriving while the model is still synthesising the tail. Critical for low-latency agents.
from elevenlabs import stream
from elevenlabs.client import ElevenLabs

elevenlabs = ElevenLabs()

audio_stream = elevenlabs.text_to_speech.stream(
    text="This is a test of realtime streaming.",
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_flash_v2_5",  # ~75ms latency
    # Raw PCM suits downstream DSP; for local playback through stream()
    # (which hands the bytes to mpv) an MP3 format is the safer choice.
    output_format="pcm_22050",
    voice_settings={
        "stability": 0.4,
        "similarity_boost": 0.75,
        "style": 0.2,
        "use_speaker_boost": True,
    },
)
stream(audio_stream)
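Raw PCM output has no container, so most players and audio tools will not recognise it until it is wrapped. Here is a sketch that wraps streamed PCM chunks in a WAV header with the standard library. It assumes the pcm_* formats are mono, 16-bit signed little-endian samples, which is worth checking against the API docs for your account. The function name is hypothetical.

```python
import io
import wave
from typing import Iterable


def pcm_chunks_to_wav(chunks: Iterable[bytes], sample_rate: int = 22050) -> bytes:
    """Wrap raw PCM chunks (assumed mono, 16-bit) in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # assumption: mono
        wav.setsampwidth(2)            # assumption: 16-bit samples
        wav.setframerate(sample_rate)  # must match the pcm_<rate> you requested
        for chunk in chunks:
            wav.writeframes(chunk)
    return buf.getvalue()


# Offline stand-in for a stream: 150 samples of silence-like data.
wav_bytes = pcm_chunks_to_wav([b"\x00\x00" * 100, b"\x01\x00" * 50])
print(len(wav_bytes))
```

With a real stream you would pass the iterator from text_to_speech.stream(...) requested with output_format="pcm_22050" and keep sample_rate in sync with it.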
Async streaming — for integration into FastAPI / aiohttp pipelines:
import asyncio
from elevenlabs.client import AsyncElevenLabs


async def main():
    elevenlabs = AsyncElevenLabs()
    audio = elevenlabs.text_to_speech.stream(
        text="Hello from async land.",
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        model_id="eleven_flash_v2_5",
    )
    async for chunk in audio:
        # ship chunk to client over websocket
        ...


asyncio.run(main())
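The placeholder in the loop above is where chunks get forwarded to the caller. One way to fill it in, sketched with a fake stream so it runs offline: forward each chunk as it arrives and record time-to-first-chunk, which is the number that matters for perceived latency. forward_chunks, fake_tts_stream and the send callback are hypothetical names, and in a real service send would be something like a websocket's send method.

```python
import asyncio
import time
from typing import AsyncIterator, Awaitable, Callable


async def forward_chunks(chunks: AsyncIterator[bytes],
                         send: Callable[[bytes], Awaitable[None]]) -> dict:
    """Forward chunks as they arrive; report total bytes and first-chunk delay."""
    started = time.perf_counter()
    first_chunk_at = None
    total = 0
    async for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - started
        await send(chunk)  # never buffer the whole clip
        total += len(chunk)
    return {"bytes": total, "time_to_first_chunk_s": first_chunk_at}


async def fake_tts_stream():
    """Stand-in for the SDK's async stream so the sketch runs without a key."""
    for piece in (b"abc", b"de", b"f"):
        await asyncio.sleep(0)
        yield piece


async def demo():
    received = []

    async def send(chunk: bytes) -> None:
        received.append(chunk)

    stats = await forward_chunks(fake_tts_stream(), send)
    return received, stats


received, stats = asyncio.run(demo())
print(stats["bytes"])
```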
Instant Voice Cloning — upload 1–5 minutes of clean audio, get a custom voice id:
voice = elevenlabs.voices.ivc.create(
    name="Alex",
    files=["./sample_0.mp3", "./sample_1.mp3", "./sample_2.mp3"],
    description="Calm narrator voice for product onboarding.",
    labels={"accent": "british", "use_case": "narration"},
)
print(voice.voice_id)  # use this as voice_id in future TTS calls
Speech-to-Speech — re-perform an existing recording in a different voice:
with open("./input.wav", "rb") as f:
    audio = elevenlabs.speech_to_speech.convert(
        voice_id=voice.voice_id,
        audio=f,
        model_id="eleven_multilingual_sts_v2",
        output_format="mp3_44100_192",
    )
ElevenAgents — realtime conversational AI. The platform packages ASR (their Scribe model) + an LLM of your choice + TTS into a single duplex WebSocket. You configure the agent in the dashboard (system prompt, knowledge base, allowed tools, voice) and connect from any client:
from elevenlabs.client import ElevenLabs
from elevenlabs.conversational_ai.conversation import Conversation

elevenlabs = ElevenLabs()

conversation = Conversation(
    client=elevenlabs,
    agent_id="agent_01jc…",  # from the ElevenAgents dashboard
    requires_auth=True,
    audio_interface=None,  # provide a mic/speaker interface for live use
)
conversation.start_session()
# user speaks, agent replies, all in <500ms RTT
conversation.end_session()
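For a live session the audio_interface has to be a real microphone/speaker bridge. The sketch below assumes the SDK's DefaultAudioInterface (which, as far as I know, needs the optional PyAudio dependency) and the callback_agent_response / callback_user_transcript hooks. Treat the exact import paths and keyword names as assumptions to verify against your installed SDK version. The function name run_agent_session is hypothetical.

```python
def run_agent_session(agent_id: str):
    """Run one live agent conversation and return its conversation id."""
    # Imported lazily so this module still loads without the SDK or PyAudio.
    from elevenlabs.client import ElevenLabs
    from elevenlabs.conversational_ai.conversation import Conversation
    from elevenlabs.conversational_ai.default_audio_interface import (
        DefaultAudioInterface,
    )

    conversation = Conversation(
        client=ElevenLabs(),  # reads ELEVENLABS_API_KEY
        agent_id=agent_id,
        requires_auth=True,
        audio_interface=DefaultAudioInterface(),  # default mic + speakers
        callback_agent_response=lambda text: print(f"Agent: {text}"),
        callback_user_transcript=lambda text: print(f"User: {text}"),
    )
    conversation.start_session()
    try:
        # Blocks until the agent or the user ends the call.
        return conversation.wait_for_session_end()
    except KeyboardInterrupt:
        # Ctrl+C: close the socket cleanly, then collect the id.
        conversation.end_session()
        return conversation.wait_for_session_end()
```

Call it with the agent id copied from the dashboard. Nothing runs until the function is invoked, so importing the module is safe on machines without audio hardware.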
Agent config (dashboard YAML excerpt — values match what the API expects):
agent:
  name: "Booking Assistant"
  language: "en"
  llm:
    provider: "anthropic"
    model: "claude-sonnet-4-5"
    system_prompt: |
      You take restaurant reservations. Always confirm date, party size and name.
  asr:
    model: "scribe_v1"
  tts:
    model: "eleven_flash_v2_5"
    voice_id: "JBFqnCBsd6RMkjVDRZzb"
    stability: 0.4
    similarity_boost: 0.75
  tools:
    - name: "create_reservation"
      webhook_url: "https://api.myapp.com/reservations"
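On the receiving end of that webhook_url, the handler's main job is to validate what the agent collected and return something the agent can read back. This is a framework-free sketch of that logic. The payload shape (date, party_size, name as top-level keys) and the response shape are assumptions for illustration, not the documented tool-call format.

```python
from datetime import date

# Mirrors the system prompt: date, party size and name must be confirmed.
REQUIRED_FIELDS = ("date", "party_size", "name")


def handle_create_reservation(payload: dict) -> dict:
    """Validate a (hypothetical) create_reservation tool call."""
    missing = [f for f in REQUIRED_FIELDS if payload.get(f) in (None, "")]
    if missing:
        return {"ok": False, "error": f"missing fields: {', '.join(missing)}"}
    try:
        day = date.fromisoformat(str(payload["date"]))
        party_size = int(payload["party_size"])
    except (TypeError, ValueError):
        return {"ok": False,
                "error": "date must be YYYY-MM-DD and party_size an integer"}
    if party_size < 1:
        return {"ok": False, "error": "party_size must be at least 1"}
    return {
        "ok": True,
        "reservation": {
            "date": day.isoformat(),
            "party_size": party_size,
            "name": str(payload["name"]).strip(),
        },
    }
```

Returning a short, speakable error string gives the agent a sensible line to say when the caller left something out.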
Picking a model
The model choice is the single biggest lever on cost, latency and perceived quality. Rough heuristic from my own usage:
- Narration / audiobook / video VO — Eleven v3. The expressivity and audio-tag directives are worth the extra latency.
- Realtime voice agents — Flash v2.5. Anything over ~150ms model time and the conversation feels sluggish; 75ms keeps it natural.
- Branded product voice across 29 languages — Multilingual v2. Accents are the most stable here.
- Bulk batch generation (thousands of clips) — Turbo v2.5. Lower compute, still acceptable quality, the only model that survives a tight cost budget at scale.
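The heuristic above can be encoded as a tiny lookup so the choice lives in one place instead of being scattered across call sites. The use-case keys are made-up labels. The model ids are the ones used elsewhere in this post plus eleven_multilingual_v2 and eleven_turbo_v2_5, which are assumed to be the current ids for those models.

```python
MODEL_BY_USE_CASE = {
    "narration": "eleven_v3",                    # expressive, audio tags
    "realtime_agent": "eleven_flash_v2_5",       # lowest latency
    "branded_multilingual": "eleven_multilingual_v2",  # stable accents
    "bulk_batch": "eleven_turbo_v2_5",           # throughput on a budget
}


def pick_model(use_case: str) -> str:
    """Map a use-case label to a model id, failing loudly on typos."""
    try:
        return MODEL_BY_USE_CASE[use_case]
    except KeyError:
        options = ", ".join(sorted(MODEL_BY_USE_CASE))
        raise ValueError(
            f"unknown use case {use_case!r}; expected one of: {options}"
        ) from None


print(pick_model("realtime_agent"))
```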
What’s new / version
- Eleven v3 — audio-tag directives ([laughs], [whispers]) let you direct delivery inline.
- Flash v2.5 — the latency leader, ~75ms model time. ElevenAgents defaults to this.
- Scribe v1 — their own ASR, accurate enough to replace Whisper in the realtime pipeline.
- ElevenAgents tools — the agent can call out to your webhook mid-conversation (function-calling, but voice-native).
- SDK v2.50.0 — full async coverage across every endpoint, typed Pydantic models for every response.
- 27 output formats — the full grid covers MP3, PCM, WAV, Opus and telephony (alaw/μ-law) variants across every common sample rate, so you rarely need a transcoding pass after generation.
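The format identifiers follow a regular codec_samplerate[_bitrate] pattern (mp3_44100_128, pcm_16000, ulaw_8000), so downstream code can derive the sample rate from the string rather than hard-coding it twice. This is a sketch under that assumption. AudioFormat and parse_output_format are hypothetical names, and the parser only checks the shape of the string, not whether the API offers that combination.

```python
from typing import NamedTuple, Optional


class AudioFormat(NamedTuple):
    codec: str
    sample_rate_hz: int
    bitrate_kbps: Optional[int]  # None for formats without a bitrate part


def parse_output_format(value: str) -> AudioFormat:
    """Split an identifier like 'mp3_44100_128' into its parts."""
    parts = value.split("_")
    if len(parts) not in (2, 3) or not all(p.isdigit() for p in parts[1:]):
        raise ValueError(f"unrecognised output format: {value!r}")
    bitrate = int(parts[2]) if len(parts) == 3 else None
    return AudioFormat(parts[0], int(parts[1]), bitrate)


print(parse_output_format("mp3_44100_128"))
```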
Why it matters / where I use it
I’ve been using ElevenLabs for narration on short demo videos, for AE Moments’s automated photo-booth voice greetings, and as the voice layer behind a couple of experimental agent prototypes inside the Agents Hub in Bot-UI. The quality gap vs the open-source alternatives is still real — v3 sounds like a person, the rest sound like models — and the streaming latency is good enough that you stop noticing it. For phone-bot prototypes, the ulaw_8000 output format drops straight into Twilio’s Media Streams without a transcoding pass, which removes one of the most annoying glue layers I’d otherwise have to build.
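To make the Twilio point concrete: on a bidirectional Media Stream, outbound audio goes over the WebSocket as JSON "media" messages carrying base64-encoded μ-law at 8 kHz, so ulaw_8000 chunks only need wrapping, not transcoding. This is a sketch of that wrapping step. The function name and the "MZ123" stream id are placeholders, and the WebSocket plumbing is left out.

```python
import base64
import json


def twilio_media_message(stream_sid: str, ulaw_chunk: bytes) -> str:
    """Wrap one ulaw_8000 chunk as a Twilio Media Streams 'media' message."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,  # taken from Twilio's 'start' event
        "media": {"payload": base64.b64encode(ulaw_chunk).decode("ascii")},
    })


# "MZ123" is a placeholder stream id; the bytes stand in for real audio.
message = twilio_media_message("MZ123", b"\xff\x7f\xff\x7f")
print(message)
```

Each chunk from a text_to_speech.stream(...) call requested with output_format="ulaw_8000" would be passed through this function and sent on the socket as it arrives.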
Where I’m careful: voice cloning is the most legally and ethically loaded primitive in this whole stack. ElevenLabs requires consent verification for IVC, and you should never clone a voice you don’t own. Treat the API key the same way you’d treat an OpenAI key — in env vars, never in repo, and rotate it if a Vercel deploy log ever leaks it. For server-side use, prefer the async client and stream into your response body rather than buffering the whole MP3 in memory; the latency win compounds when you’re also waiting on an LLM upstream of the TTS call.