Text-to-Speech
Convert text into natural-sounding speech using various TTS models from different providers. Supports multiple languages, voices, and customization options including speed control, voice instructions, and audio format selection.
Overview
Looking for synchronous, low-latency TTS that returns audio bytes directly? See Speech (POST /v1/audio/speech).
Want to clone a custom voice from a reference audio clip? See Voice Cloning.
Supported Models
- Kokoro-82m: 44 multilingual voices ($0.001/1k chars)
- Elevenlabs-Turbo-V2.5: Premium quality with style controls ($0.06/1k chars)
- tts-1: OpenAI standard quality ($0.015/1k chars)
- tts-1-hd: OpenAI high definition ($0.030/1k chars)
- gpt-4o-mini-tts: Ultra-low cost ($0.0006/1k chars)
- MiniMax Speech models: Supports cloned voices via custom voice IDs (see Voice Cloning)
- Qwen-3-TTS-1.7B: Supports cloned voices via speaker embeddings (see Voice Cloning)
Basic Usage
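A minimal Python sketch of a submit call. The base URL, the Bearer authorization scheme, and the JSON field names (`text`, `model`, `voice`, `speed`) are assumptions made for illustration; this page does not confirm them, so check each against the Authorizations and Body sections before relying on it.

```python
import json
import urllib.request

# Hypothetical placeholder; substitute the real API host.
BASE_URL = "https://api.example.com"


def build_tts_request(text, model="Kokoro-82m", voice="af_bella",
                      speed=1.0, api_key="YOUR_API_KEY"):
    """Build (but do not send) a POST /api/tts request.

    Field names and the Bearer header are assumptions, not confirmed
    by the reference text.
    """
    payload = {"text": text, "model": model, "voice": voice, "speed": speed}
    return urllib.request.Request(
        f"{BASE_URL}/api/tts",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_tts_request("Hello! This is a test of the text-to-speech API.")
# Sending is left to the caller, e.g. urllib.request.urlopen(req).
```

Building the request separately from sending it keeps the payload easy to inspect and log before any billable call is made.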
Async Status and Result Retrieval
Some TTS models run asynchronously. When a request is queued, the API returns HTTP 202 with a ticket containing a runId and model. Use the TTS Status endpoint to poll until the job is complete. Synchronous models return audio immediately and do not require status polling.
Endpoints
- Submit TTS: POST /api/tts
- Check TTS Status (async only): GET /api/tts/status?runId=...&model=...
When you see status: “pending”
If your initial POST /api/tts returns HTTP 202 with a pending ticket, poll GET /api/tts/status with the ticket's runId and model. If present, include cost, paymentSource, and isApiRequest from the ticket when polling to help with automatic refunds if the upstream provider later rejects content.
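One way to turn a ticket into a status URL, sketched in Python. The host is a placeholder, and passing `cost`, `paymentSource`, and `isApiRequest` as query parameters (with booleans serialized as lowercase strings) is an assumption; the page only says to include them when polling.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; substitute the real API host.
BASE_URL = "https://api.example.com"
OPTIONAL_TICKET_FIELDS = ("cost", "paymentSource", "isApiRequest")


def build_status_url(ticket: dict) -> str:
    """Build the GET /api/tts/status URL from a 202 ticket."""
    # runId and model are always required.
    params = {"runId": ticket["runId"], "model": ticket["model"]}
    for key in OPTIONAL_TICKET_FIELDS:
        value = ticket.get(key)
        if value is None:
            continue
        # Assumption: booleans travel as "true"/"false" in the query string.
        params[key] = str(value).lower() if isinstance(value, bool) else value
    return f"{BASE_URL}/api/tts/status?{urlencode(params)}"
```

For example, `build_status_url({"runId": "abc123", "model": "Elevenlabs-Turbo-V2.5"})` yields a URL whose query string carries only the two required parameters.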
cURL — Submit, then Poll
Synchronous vs. Asynchronous Models
- Synchronous models (examples: tts-1, tts-1-hd, gpt-4o-mini-tts, Kokoro-82m) return immediately from POST /api/tts with either binary audio or JSON containing { audioUrl, contentType }, depending on the provider.
- Asynchronous models (examples: Elevenlabs-Turbo-V2.5, Elevenlabs-V3, Elevenlabs-Music-V1) return HTTP 202 with a polling ticket. Use GET /api/tts/status until completed.
POST /api/v1/audio/speech, see Music Generation.
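Because a single POST /api/tts call can come back in three shapes (a 202 ticket, JSON with an audio URL, or raw audio bytes), a client may want one place that tells them apart. The sketch below assumes that successful synchronous responses use HTTP 200 and that JSON bodies are labeled `application/json`; neither detail is stated on this page.

```python
import json


def classify_tts_response(status_code: int, content_type: str, body: bytes):
    """Return a (kind, value) pair describing a POST /api/tts response.

    kind is "pending" (value: ticket dict), "audio_url" (value: URL string),
    or "audio_bytes" (value: raw bytes).
    """
    # Ignore parameters such as "; charset=utf-8" when comparing media types.
    media_type = content_type.split(";")[0].strip().lower()
    if status_code == 202:
        return ("pending", json.loads(body))
    if status_code == 200 and media_type == "application/json":
        return ("audio_url", json.loads(body)["audioUrl"])
    if status_code == 200:
        return ("audio_bytes", body)
    raise ValueError(f"unexpected TTS response: HTTP {status_code}")
```

Returning a tagged pair keeps the caller's branching explicit: only the "pending" case needs to enter the status-polling path.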
Best Practices
- Poll every 2–3 seconds; stop after 2–3 minutes and show a timeout error.
- Always include runId and model. If available, include cost, paymentSource, and isApiRequest from the ticket for better error handling and refund automation.
- On completed, prefer using the audioUrl directly (streaming or download). Cache URLs client-side if you plan to replay.
- If you receive CONTENT_POLICY_VIOLATION, do not retry the same content; surface a clear message to the user.
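These practices can be sketched as a small polling loop. The status payload shape is an assumption: the code expects a `status` field equal to "completed" when done and an `error` field carrying CONTENT_POLICY_VIOLATION, neither of which is spelled out on this page. `fetch_status` is a hypothetical caller-supplied function that performs one GET /api/tts/status call and returns the parsed JSON.

```python
import time


class TTSPollTimeout(Exception):
    """Raised when the job has not completed within the time budget."""


class TTSContentPolicyError(Exception):
    """Raised when the provider rejects the content; do not retry."""


def poll_tts(fetch_status, interval_s=2.5, timeout_s=150.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll until completed, a policy rejection, or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        result = fetch_status()
        if result.get("status") == "completed":
            return result
        # Assumed field name for the error code.
        if result.get("error") == "CONTENT_POLICY_VIOLATION":
            raise TTSContentPolicyError("content rejected; do not resubmit")
        if clock() >= deadline:
            raise TTSPollTimeout(f"TTS job not finished after {timeout_s}s")
        sleep(interval_s)
```

The defaults (2.5 seconds between polls, 150 seconds overall) sit inside the recommended ranges. `sleep` and `clock` are injectable so the loop can be exercised in tests without real waiting.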
FAQ
- Why did I get 202/pending? The selected model runs asynchronously; your request was queued and billed after a successful queue submission.
- Can I cancel a pending TTS? Not currently. Let it complete or time out client‑side.
- Do all TTS models require polling? No. Only async models. Synchronous models return immediately.
Model-Specific Examples
Kokoro-82m - Multilingual Voices
44 voices across 13 language groups.
Elevenlabs-Turbo-V2.5 - Advanced Voice Controls
Premium quality with style adjustments.
OpenAI Models - Multiple Formats & Instructions
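A hedged sketch of building a request body for the OpenAI models that enforces the constraints listed in the Body section: speed between 0.1 and 5 and unsupported on gpt-4o-mini-tts, six audio formats, and instructions limited to gpt-4o-mini-tts and tts-1-hd. The JSON keys `text`, `response_format`, and `instructions` are hypothetical names chosen for illustration; this page does not give the actual field names.

```python
OPENAI_MODELS = {"tts-1", "tts-1-hd", "gpt-4o-mini-tts"}
INSTRUCTION_MODELS = {"gpt-4o-mini-tts", "tts-1-hd"}
AUDIO_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}


def build_openai_payload(text, model, voice="alloy", speed=None,
                         audio_format=None, instructions=None):
    """Validate options client-side and return a request body dict."""
    if model not in OPENAI_MODELS:
        raise ValueError(f"not an OpenAI TTS model: {model}")
    payload = {"text": text, "model": model, "voice": voice}
    if speed is not None:
        if model == "gpt-4o-mini-tts":
            raise ValueError("speed is not supported for gpt-4o-mini-tts")
        if not 0.1 <= speed <= 5:
            raise ValueError("speed must be between 0.1 and 5")
        payload["speed"] = speed
    if audio_format is not None:
        if audio_format not in AUDIO_FORMATS:
            raise ValueError(f"unsupported audio format: {audio_format}")
        payload["response_format"] = audio_format  # hypothetical key name
    if instructions is not None:
        if model not in INSTRUCTION_MODELS:
            raise ValueError(f"instructions are not supported for {model}")
        payload["instructions"] = instructions  # hypothetical key name
    return payload
```

Checking these rules before sending avoids a round trip that would likely end in a 400 response.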
Response Examples
JSON Response (Most Models)
Binary Response (OpenAI Models)
OpenAI models return audio data directly as binary with appropriate headers.
Voice Options
Kokoro-82m Voices
- American Female: af_bella, af_nova, af_aoede, af_jessica, af_sarah
- American Male: am_adam, am_onyx, am_eric, am_liam
- British: bf_alice, bf_emma, bm_daniel, bm_george
- Asian Languages: jf_alpha (Japanese), zf_xiaoxiao (Chinese)
- European: ff_siwis (French), im_nicola (Italian)
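For client-side checks, the voices listed above can be kept in a small lookup table. This sketch covers only the voice IDs named in this section, not the full set of 44, so an unknown ID here does not necessarily mean the API would reject it.

```python
# Only the Kokoro-82m voices named in this section; the full catalog is larger.
KOKORO_VOICES = {
    "American Female": ["af_bella", "af_nova", "af_aoede", "af_jessica", "af_sarah"],
    "American Male": ["am_adam", "am_onyx", "am_eric", "am_liam"],
    "British": ["bf_alice", "bf_emma", "bm_daniel", "bm_george"],
    "Asian Languages": ["jf_alpha", "zf_xiaoxiao"],
    "European": ["ff_siwis", "im_nicola"],
}


def kokoro_group(voice_id: str):
    """Return the group a listed voice belongs to, or None if not listed."""
    for group, voices in KOKORO_VOICES.items():
        if voice_id in voices:
            return group
    return None
```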
Elevenlabs-Turbo-V2.5 Voices
Rachel, Adam, Bella, Brian, Sarah, Michael, Emily, James, Nicole, and 37 more
OpenAI Voices
alloy, echo, fable, onyx, nova, shimmer, ash, ballad, coral, sage, verse
Error Handling
- 400: Invalid parameters or missing text
- 401: Invalid or missing API key
- 413: Text exceeds model character limit
- 429: Rate limit exceeded
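The status codes above can be mapped to user-facing messages in one place. Treating only 429 as retryable is an assumption based on common HTTP practice; this page does not say which errors are safe to retry.

```python
TTS_ERROR_MESSAGES = {
    400: "Invalid parameters or missing text",
    401: "Invalid or missing API key",
    413: "Text exceeds model character limit",
    429: "Rate limit exceeded",
}

# Assumption: only rate-limit errors are worth retrying after a delay.
RETRYABLE_STATUS_CODES = {429}


def describe_tts_error(status_code: int):
    """Return (message, retryable) for a failed TTS request."""
    message = TTS_ERROR_MESSAGES.get(
        status_code, f"Unexpected error (HTTP {status_code})"
    )
    return message, status_code in RETRYABLE_STATUS_CODES
```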
Authorizations
Body
Text-to-speech generation parameters
The text to convert to speech
"Hello! This is a test of the text-to-speech API."
The TTS model to use for generation
Kokoro-82m, Elevenlabs-Turbo-V2.5, tts-1, tts-1-hd, gpt-4o-mini-tts, Minimax-Speech-02-HD, Minimax-Speech-2.6-HD, Minimax-Speech-2.6-Turbo, Minimax-Speech-2.8-HD, Minimax-Speech-2.8-Turbo, Qwen-3-TTS-1.7B
The voice to use for synthesis (available voices depend on selected model)
"af_bella"
Speaker embedding file URL for Qwen TTS voice cloning (Qwen-3-TTS-1.7B only)
Optional transcript of the reference clip (Qwen TTS)
Language hint (Qwen TTS). Example values: Auto, English, Chinese, Japanese
Optional style prompt (Qwen TTS)
Speech speed multiplier (0.1-5, not supported for gpt-4o-mini-tts)
0.1 <= x <= 5
Audio output format (OpenAI models only)
mp3, opus, aac, flac, wav, pcm
Voice instructions for fine-tuning (gpt-4o-mini-tts and tts-1-hd only)
"speak with enthusiasm"
Voice stability (Elevenlabs-Turbo-V2.5 only, 0-1)
0 <= x <= 1
Voice similarity boost (Elevenlabs-Turbo-V2.5 only, 0-1)
0 <= x <= 1
Style exaggeration (Elevenlabs-Turbo-V2.5 only, 0-1)
0 <= x <= 1
Response
Text-to-speech response. Returns either JSON with audio URL or binary audio data depending on the model.
URL to the generated audio file
"https://storage.url/audio-file.wav"
MIME type of the audio file
"audio/wav"
Model used for generation
The input text that was synthesized
Voice used for synthesis
Speed multiplier used
Duration of the generated audio in seconds
Cost of the generation
Currency of the cost