Text-to-Speech - NanoGPT API Documentation

POST /api/tts

cURL

curl --request POST \
  --url https://nano-gpt.com/api/tts \
  --header 'Content-Type: application/json' \
  --header 'x-api-key: <api-key>' \
  --data '
{
  "text": "Hello! This is a test of the text-to-speech API.",
  "model": "Kokoro-82m",
  "voice": "af_bella",
  "speaker_voice_embedding_file_url": "<string>",
  "reference_text": "<string>",
  "language": "<string>",
  "prompt": "<string>",
  "speed": 1,
  "response_format": "mp3",
  "instructions": "speak with enthusiasm",
  "stability": 0.5,
  "similarity_boost": 0.75,
  "style": 0
}
'

{
  "audioUrl": "https://storage.url/audio-file.wav",
  "contentType": "audio/wav",
  "model": "<string>",
  "text": "<string>",
  "voice": "<string>",
  "speed": 123,
  "duration": 123,
  "cost": 123,
  "currency": "<string>"
}

Overview

Convert text into natural-sounding speech using various TTS models. Supports multiple languages, voices, and customization options including speed control and voice instructions.

Looking for synchronous, low-latency TTS that returns audio bytes directly? See Speech (POST /v1/audio/speech).

Want to clone a custom voice from a reference audio clip? See Voice Cloning.

Supported Models

Kokoro-82m: 44 multilingual voices ($0.001/1k chars)
Elevenlabs-Turbo-V2.5: Premium quality with style controls ($0.06/1k chars)
tts-1: OpenAI standard quality ($0.015/1k chars)
tts-1-hd: OpenAI high definition ($0.030/1k chars)
gpt-4o-mini-tts: Ultra-low cost ($0.0006/1k chars)
MiniMax Speech models: Supports cloned voices via custom voice IDs (see Voice Cloning)
Qwen-3-TTS-1.7B: Supports cloned voices via speaker embeddings (see Voice Cloning)
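
The per-1k-character prices above can be turned into a rough estimate before submitting a request. The following is a minimal sketch, assuming pricing is linear in character count; rounding rules and minimum charges are not documented here, so the cost field returned by the API remains authoritative. The names PRICE_PER_1K_CHARS and estimate_cost are hypothetical helpers, not part of the API.

```python
# Prices copied from the Supported Models list (USD per 1,000 characters).
PRICE_PER_1K_CHARS = {
    "Kokoro-82m": 0.001,
    "Elevenlabs-Turbo-V2.5": 0.06,
    "tts-1": 0.015,
    "tts-1-hd": 0.030,
    "gpt-4o-mini-tts": 0.0006,
}


def estimate_cost(text: str, model: str) -> float:
    """Estimate the USD cost of synthesizing `text`, assuming linear pricing."""
    if model not in PRICE_PER_1K_CHARS:
        raise KeyError(f"No listed price for model: {model}")
    return len(text) / 1000 * PRICE_PER_1K_CHARS[model]
```

Models without a listed price (the MiniMax and Qwen entries) raise KeyError rather than guessing a rate.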

Basic Usage

import requests

def text_to_speech(text, model="Kokoro-82m", voice=None, **kwargs):
    headers = {
        "x-api-key": "YOUR_API_KEY",
        "Content-Type": "application/json"
    }

    payload = {
        "text": text,
        "model": model
    }

    if voice:
        payload["voice"] = voice

    # Extra parameters (speed, response_format, stability, ...) pass through unchanged
    payload.update(kwargs)

    response = requests.post(
        "https://nano-gpt.com/api/tts",
        headers=headers,
        json=payload
    )

    if response.status_code == 202:
        # Async model: the body is a polling ticket (runId, model), not audio
        return response

    if response.status_code == 200:
        content_type = response.headers.get('content-type', '')

        if 'application/json' in content_type:
            # JSON response with audio URL
            data = response.json()
            audio_response = requests.get(data['audioUrl'])
            audio_response.raise_for_status()
            with open('output.wav', 'wb') as f:
                f.write(audio_response.content)
        else:
            # Binary audio data (OpenAI models); extension follows response_format
            extension = kwargs.get('response_format', 'mp3')
            with open(f'output.{extension}', 'wb') as f:
                f.write(response.content)

        return response
    else:
        raise Exception(f"Error: {response.status_code}")

# Basic usage
text_to_speech(
    "Hello! Welcome to our service.",
    model="Kokoro-82m",
    voice="af_bella"
)

Async Status and Result Retrieval

Some TTS models run asynchronously. When queued, the API returns HTTP 202 with a ticket containing a runId and model. Use the TTS Status endpoint to poll until the job is complete. Synchronous models return audio immediately and do not require status polling.

Endpoints

Submit TTS: POST /api/tts
Check TTS Status (async only): GET /api/tts/status?runId=...&model=...

When you see status: “pending”

If your initial POST /api/tts returns HTTP 202 with a body like:

{
  "status": "pending",
  "runId": "98b0d593-fe8d-49b8-89c9-233022232297",
  "model": "Elevenlabs-Turbo-V2.5",
  "charged": true,
  "cost": 0.0050388,
  "paymentSource": "USD",
  "isApiRequest": true
}

…the request is queued. Poll the Status endpoint using the runId and model. If present, include cost, paymentSource, and isApiRequest from the ticket when polling to help with automatic refunds if the upstream provider later rejects content.
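
The polling URL can be assembled from the ticket fields described above. This is a minimal sketch, assuming the optional fields are passed as query parameters exactly as named in the ticket (which matches the cURL example in this section); build_status_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

STATUS_URL = "https://nano-gpt.com/api/tts/status"


def build_status_url(ticket: dict) -> str:
    """Build the polling URL from a 202 ticket; runId and model are required."""
    params = {"runId": ticket["runId"], "model": ticket["model"]}
    # Optional fields are forwarded only when the ticket contains them.
    for key in ("cost", "paymentSource", "isApiRequest"):
        if key in ticket:
            value = ticket[key]
            # Booleans are sent as lowercase true/false rather than Python's True/False.
            params[key] = str(value).lower() if isinstance(value, bool) else value
    return f"{STATUS_URL}?{urlencode(params)}"
```

A ticket missing runId or model raises KeyError, since the status endpoint needs both.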

cURL — Submit, then Poll

# 1) Submit TTS
curl -X POST https://nano-gpt.com/api/tts \
  -H 'x-api-key: YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "Hello there!",
    "model": "Elevenlabs-Turbo-V2.5",
    "voice": "Rachel",
    "speed": 1.0
  }'

# 2) If response is 202/pending, poll using returned values
curl "https://nano-gpt.com/api/tts/status?runId=98b0d593-fe8d-49b8-89c9-233022232297&model=Elevenlabs-Turbo-V2.5&cost=0.0050388&paymentSource=USD&isApiRequest=true" \
  -H 'x-api-key: YOUR_API_KEY'

# 3) On completion, you'll receive an audioUrl
# {
#   "status": "completed",
#   "audioUrl": "https://.../file.mp3",
#   "contentType": "audio/mpeg",
#   "model": "Elevenlabs-Turbo-V2.5"
# }

Synchronous vs. Asynchronous Models

Synchronous models (examples: tts-1, tts-1-hd, gpt-4o-mini-tts, Kokoro-82m) return immediately from POST /api/tts with either binary audio or JSON containing { audioUrl, contentType } depending on the provider.
Asynchronous models (examples: Elevenlabs-Turbo-V2.5, Elevenlabs-V3, Elevenlabs-Music-V1) return HTTP 202 with a polling ticket. Use GET /api/tts/status until completed.
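
Because one endpoint can answer in three different shapes, client code usually branches on the status code and Content-Type before touching the body. A minimal sketch of that branching follows; classify_tts_response and its return labels are hypothetical names chosen for illustration.

```python
def classify_tts_response(status_code: int, content_type: str) -> str:
    """Return 'pending', 'json_url', or 'binary_audio' for a POST /api/tts response."""
    if status_code == 202:
        # Asynchronous model: the body is a ticket, so poll GET /api/tts/status.
        return "pending"
    if status_code == 200:
        if "application/json" in content_type.lower():
            # Body is JSON containing audioUrl and contentType.
            return "json_url"
        # Body is the audio itself.
        return "binary_audio"
    raise ValueError(f"Unexpected status code: {status_code}")
```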

For OpenAI-compatible music generation via POST /api/v1/audio/speech, see Music Generation.

Best Practices

Poll every 2–3 seconds; stop after 2–3 minutes and show a timeout error.
Always include runId and model. If available, include cost, paymentSource, and isApiRequest from the ticket for better error handling and refund automation.
Once the status is completed, use the audioUrl directly (streaming or download). Cache URLs client-side if you plan to replay.
If you receive CONTENT_POLICY_VIOLATION, do not retry the same content; surface a clear message to the user.
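
The polling guidance above can be sketched as a small loop. This is an illustration under stated assumptions: fetch_status is a caller-supplied function that performs the GET to /api/tts/status and returns the parsed JSON body, and any status other than pending or completed is treated as terminal because the shape of error bodies is not shown in this document. The defaults (2.5 seconds, 180 seconds) are simply values inside the recommended ranges.

```python
import time
from typing import Callable


def poll_until_complete(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.5,
    timeout_s: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch_status() until the job completes, fails, or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        body = fetch_status()
        status = body.get("status")
        if status == "completed":
            return body
        if status != "pending":
            # Assumed terminal (for example a content policy rejection): do not retry.
            raise RuntimeError(f"TTS job failed: {body}")
        if clock() >= deadline:
            raise TimeoutError("TTS job did not complete in time")
        sleep(interval_s)
```

The sleep and clock parameters exist only so the loop can be exercised in tests without waiting.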

FAQ

Why did I get 202/pending? The selected model runs asynchronously. Your request was queued, and it is billed once the queue submission succeeds.
Can I cancel a pending TTS? Not currently. Let it complete or time out client‑side.
Do all TTS models require polling? No. Only async models. Synchronous models return immediately.

Model-Specific Examples

Kokoro-82m - Multilingual Voices

44 voices across 13 language groups:

# Popular voice examples by category
voices = {
    "american_female": ["af_bella", "af_nova", "af_aoede"],
    "american_male": ["am_adam", "am_onyx", "am_eric"],
    "british_female": ["bf_alice", "bf_emma"],
    "british_male": ["bm_daniel", "bm_george"],
    "japanese_female": ["jf_alpha", "jf_gongitsune"],
    "chinese_female": ["zf_xiaoxiao", "zf_xiaoyi"],
    "french_female": ["ff_siwis"],
    "italian_male": ["im_nicola"]
}

# Generate multilingual samples
samples = [
    {"text": "Hello, welcome!", "voice": "af_bella", "lang": "English"},
    {"text": "Bonjour et bienvenue!", "voice": "ff_siwis", "lang": "French"},
    {"text": "こんにちは！", "voice": "jf_alpha", "lang": "Japanese"},
    {"text": "你好，欢迎！", "voice": "zf_xiaoxiao", "lang": "Chinese"}
]

for sample in samples:
    text_to_speech(
        text=sample["text"],
        model="Kokoro-82m",
        voice=sample["voice"]
    )

Elevenlabs-Turbo-V2.5 - Advanced Voice Controls

Premium quality with style adjustments:

# Stable, consistent voice
text_to_speech(
    text="This is a professional announcement.",
    model="Elevenlabs-Turbo-V2.5",
    voice="Rachel",
    stability=0.9,
    similarity_boost=0.8,
    style=0
)

# Expressive, dynamic voice  
text_to_speech(
    text="This is so exciting!",
    model="Elevenlabs-Turbo-V2.5",
    voice="Rachel",
    stability=0.3,
    similarity_boost=0.7,
    style=0.8,
    speed=1.2
)

# Available voices: Rachel, Adam, Bella, Brian, etc.

OpenAI Models - Multiple Formats & Instructions

# High-definition with voice instructions
text_to_speech(
    text="Welcome to customer service.",
    model="tts-1-hd",
    voice="nova",
    instructions="Speak warmly and professionally like a customer service representative",
    response_format="flac"
)

# Ultra-low cost option
text_to_speech(
    text="This is a cost-effective option.",
    model="gpt-4o-mini-tts",
    voice="alloy",
    instructions="Speak clearly and cheerfully",
    response_format="mp3"
)

# Different format examples
formats = ["mp3", "wav", "opus", "flac", "aac"]
for fmt in formats:
    text_to_speech(
        text=f"This is {fmt.upper()} format.",
        model="tts-1",
        voice="echo",
        response_format=fmt
    )

Response Examples

JSON Response (Most Models)

{
  "audioUrl": "https://storage.url/audio-file.wav",
  "contentType": "audio/wav",
  "model": "Kokoro-82m",
  "text": "Hello world",
  "voice": "af_bella",
  "speed": 1,
  "duration": 2.3,
  "cost": 0.001,
  "currency": "USD"
}

Binary Response (OpenAI Models)

OpenAI models return audio data directly as binary with appropriate headers:

Content-Type: audio/mp3
Content-Length: 123456
[Binary audio data]
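
When saving a binary response, the Content-Type header is a reasonable source for the file extension. The mapping below is a sketch: only audio/mp3, audio/mpeg, and audio/wav appear in this document, and the remaining entries are assumptions based on common MIME types. The names EXTENSION_BY_CONTENT_TYPE and extension_for are hypothetical.

```python
# Content types seen in this document plus a few common ones (assumed).
EXTENSION_BY_CONTENT_TYPE = {
    "audio/mpeg": ".mp3",
    "audio/mp3": ".mp3",
    "audio/wav": ".wav",
    "audio/x-wav": ".wav",
    "audio/flac": ".flac",
    "audio/aac": ".aac",
    "audio/opus": ".opus",
    "audio/ogg": ".ogg",
}


def extension_for(content_type: str, default: str = ".bin") -> str:
    """Map a Content-Type header value to a file extension, ignoring parameters."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return EXTENSION_BY_CONTENT_TYPE.get(media_type, default)
```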

Voice Options

Kokoro-82m Voices

American Female: af_bella, af_nova, af_aoede, af_jessica, af_sarah
American Male: am_adam, am_onyx, am_eric, am_liam
British: bf_alice, bf_emma, bm_daniel, bm_george
Asian Languages: jf_alpha (Japanese), zf_xiaoxiao (Chinese)
European: ff_siwis (French), im_nicola (Italian)
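
The voice IDs listed above appear to follow a two-letter prefix convention: the first letter for the language or accent, the second for gender. The sketch below decodes that prefix. The convention is inferred from the voices shown in this document rather than stated by it, and only the six prefixes visible here are covered; describe_kokoro_voice is a hypothetical helper.

```python
# Prefix meanings inferred from the voices listed in this document (an assumption).
LANGUAGE_BY_LETTER = {
    "a": "American English",
    "b": "British English",
    "j": "Japanese",
    "z": "Chinese",
    "f": "French",
    "i": "Italian",
}
GENDER_BY_LETTER = {"f": "female", "m": "male"}


def describe_kokoro_voice(voice_id: str) -> tuple:
    """Split an ID such as 'af_bella' into (language, gender, name)."""
    prefix, separator, name = voice_id.partition("_")
    if not separator or len(prefix) != 2 or not name:
        raise ValueError(f"Unrecognized voice ID format: {voice_id}")
    language = LANGUAGE_BY_LETTER.get(prefix[0])
    gender = GENDER_BY_LETTER.get(prefix[1])
    if language is None or gender is None:
        raise ValueError(f"Unknown voice prefix: {prefix}")
    return language, gender, name
```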

Elevenlabs-Turbo-V2.5 Voices

Rachel, Adam, Bella, Brian, Sarah, Michael, Emily, James, Nicole, and 37 more

OpenAI Voices

alloy, echo, fable, onyx, nova, shimmer, ash, ballad, coral, sage, verse

Error Handling

try:
    result = text_to_speech("Hello world!", model="Kokoro-82m")
    print("Success!")
except Exception as e:
    # text_to_speech raises Exception("Error: <status code>") on failure
    if "400" in str(e):
        print("Bad request - check parameters")
    elif "401" in str(e):
        print("Unauthorized - check API key")
    elif "413" in str(e):
        print("Text too long for model")
    elif "429" in str(e):
        print("Rate limit exceeded - wait before retrying")
    else:
        print(f"Error: {e}")

Common errors:

400: Invalid parameters or missing text
401: Invalid or missing API key
413: Text exceeds model character limit
429: Rate limit exceeded
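
Of the errors listed, only 429 is worth retrying automatically; 400, 401, and 413 will fail again until the request itself changes. The sketch below retries on 429 with exponential backoff. The backoff schedule is a common convention and not something this document specifies, and send is a caller-supplied function (hypothetical) that performs the request and returns the HTTP status code.

```python
import time
from typing import Callable


def call_with_retry(
    send: Callable[[], int],
    max_attempts: int = 4,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() and retry only while it returns 429; return the final status."""
    status = 429
    for attempt in range(max_attempts):
        status = send()
        if status != 429:
            # Success or a non-retryable error: hand the status back to the caller.
            return status
        if attempt < max_attempts - 1:
            # Delays grow as base, 2*base, 4*base, ...
            sleep(base_delay_s * 2 ** attempt)
    return status
```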

Authorizations

x-api-key (string, header, required)

Body

application/json

Text-to-speech generation parameters

text (string, required)
The text to convert to speech.
Example: "Hello! This is a test of the text-to-speech API."

model (enum<string>, default: Kokoro-82m)
The TTS model to use for generation.
Available options: Kokoro-82m, Elevenlabs-Turbo-V2.5, tts-1, tts-1-hd, gpt-4o-mini-tts, Minimax-Speech-02-HD, Minimax-Speech-2.6-HD, Minimax-Speech-2.6-Turbo, Minimax-Speech-2.8-HD, Minimax-Speech-2.8-Turbo, Qwen-3-TTS-1.7B

voice (string)
The voice to use for synthesis (available voices depend on the selected model).
Example: "af_bella"

speaker_voice_embedding_file_url (string)
Speaker embedding file URL for Qwen TTS voice cloning (Qwen-3-TTS-1.7B only).

reference_text (string)
Optional transcript of the reference clip (Qwen TTS).

language (string)
Language hint (Qwen TTS). Example values: Auto, English, Chinese, Japanese

prompt (string)
Optional style prompt (Qwen TTS).

speed (number, default: 1)
Speech speed multiplier (not supported for gpt-4o-mini-tts).
Required range: 0.1 <= x <= 5

response_format (enum<string>, default: mp3)
Audio output format (OpenAI models only).
Available options: mp3, opus, aac, flac, wav, pcm

instructions (string)
Voice instructions for fine-tuning (gpt-4o-mini-tts and tts-1-hd only).
Example: "speak with enthusiasm"

stability (number, default: 0.5)
Voice stability (Elevenlabs-Turbo-V2.5 only).
Required range: 0 <= x <= 1

similarity_boost (number, default: 0.75)
Voice similarity boost (Elevenlabs-Turbo-V2.5 only).
Required range: 0 <= x <= 1

style (number, default: 0)
Style exaggeration (Elevenlabs-Turbo-V2.5 only).
Required range: 0 <= x <= 1
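
The documented ranges and enum values can be checked client-side before sending a request, which avoids a round trip that would end in a 400. This is a minimal sketch covering only the constraints stated above; validate_tts_params is a hypothetical helper, and the server may enforce additional rules (for example per-model voice lists) that are not checked here.

```python
# Numeric ranges and formats taken from the Body parameter descriptions.
NUMERIC_RANGES = {
    "speed": (0.1, 5),
    "stability": (0, 1),
    "similarity_boost": (0, 1),
    "style": (0, 1),
}
RESPONSE_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}


def validate_tts_params(params: dict) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if not params.get("text"):
        problems.append("text is required")
    for name, (low, high) in NUMERIC_RANGES.items():
        if name in params and not (low <= params[name] <= high):
            problems.append(f"{name} must be between {low} and {high}")
    if "response_format" in params and params["response_format"] not in RESPONSE_FORMATS:
        problems.append("response_format is not a supported value")
    return problems
```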

Response

Text-to-speech response. Returns either JSON with audio URL or binary audio data depending on the model.

audioUrl (string<uri>)
URL to the generated audio file.
Example: "https://storage.url/audio-file.wav"

contentType (string)
MIME type of the audio file.
Example: "audio/wav"

model (string)
Model used for generation.

text (string)
The input text that was synthesized.

voice (string)
Voice used for synthesis.

speed (number)
Speed multiplier used.

duration (number)
Duration of the generated audio in seconds.

cost (number)
Cost of the generation.

currency (string)
Currency of the cost.
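
For typed client code, the JSON response fields above map naturally onto a small data class. In this sketch audioUrl and contentType are treated as required and every other field as optional; that split is an assumption, since the schema does not mark any response field as required. TTSResult and parse_tts_result are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TTSResult:
    audio_url: str
    content_type: str
    model: Optional[str] = None
    text: Optional[str] = None
    voice: Optional[str] = None
    speed: Optional[float] = None
    duration: Optional[float] = None  # seconds
    cost: Optional[float] = None
    currency: Optional[str] = None


def parse_tts_result(body: dict) -> TTSResult:
    """Convert a JSON response body into a TTSResult; raises KeyError if audioUrl is missing."""
    return TTSResult(
        audio_url=body["audioUrl"],
        content_type=body["contentType"],
        model=body.get("model"),
        text=body.get("text"),
        voice=body.get("voice"),
        speed=body.get("speed"),
        duration=body.get("duration"),
        cost=body.get("cost"),
        currency=body.get("currency"),
    )
```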
