Chat Completion - NanoGPT API Documentation

If you are on a NanoGPT subscription and want to keep requests limited to subscription-included models (or you have no prepaid balance), use the subscription base URL: https://nano-gpt.com/api/subscription/v1/chat/completions (swap /api/v1 for /api/subscription/v1).

Provider selection is available for supported open-source models. X-Provider explicitly selects a provider for the request and is always billed pay-as-you-go at the selected provider’s price, including provider-selection markup. Provider-selection-capable models also support routing preference suffixes such as :fast and :cheap. For subscription users, sending X-Provider bypasses subscription coverage for that request; X-Billing-Mode: paygo is only needed when forcing pay-as-you-go without an explicit provider or when saved provider preferences should apply to subscription-included traffic. See Provider Selection, Model Suffixes, and Pay-As-You-Go Billing Override.

X-402 Micropayments: To enable anonymous pay-per-request access with cryptocurrency when you have insufficient balance, include the X-X402: true header. See X-402 Micropayments for details.
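
As a rough illustration of the three notes above, the sketch below picks a base URL and assembles the request headers. The helper name and its arguments are hypothetical, and the provider value passed to X-Provider is a placeholder, not a documented provider ID.

```python
DEFAULT_BASE = "https://nano-gpt.com/api/v1"
SUBSCRIPTION_BASE = "https://nano-gpt.com/api/subscription/v1"

def build_request_target(api_key, subscription_only=False, provider=None,
                         force_paygo=False, allow_x402=False):
    """Return (url, headers) for a chat completion request (hypothetical helper)."""
    base = SUBSCRIPTION_BASE if subscription_only else DEFAULT_BASE
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if provider:
        # An explicit provider is always billed pay-as-you-go,
        # so X-Billing-Mode is not added alongside it.
        headers["X-Provider"] = provider
    elif force_paygo:
        headers["X-Billing-Mode"] = "paygo"
    if allow_x402:
        headers["X-X402"] = "true"
    return f"{base}/chat/completions", headers

url, headers = build_request_target("YOUR_API_KEY", subscription_only=True)
print(url)
```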

Page map

Use the jump list below to navigate the long-form reference quickly.

Basics

Tool calling
Overview
Provider Routing Suffixes

Sampling & decoding

Sampling & Decoding Controls
Temperature & Nucleus
Length & Stopping
Penalties & Repetition Guards
Logit Shaping & Determinism
Sampling example request

Structured outputs

Structured Outputs (response_format)
Supported Formats
JSON Object Mode
JSON Schema Mode
Schema Requirements
Example Request
Example Response
Vercel AI SDK

Web search

Web Search
Option A: model suffixes
Option B: request body configuration
Provider-specific options
Examples
Pricing by provider
Bring your own key (BYOK)

Images & caching

Image Input
Supported Forms
Message Shape
cURL - Image URL
cURL - Base64 Data URL
cURL - Streaming SSE
Caching (Implicit and Explicit Controls)
Cache-Capable Provider Routing
Cache Consistency
Troubleshooting

Memory & reasoning

Context Memory
Custom Context Size Override
Reasoning Streams
Endpoint variants
Streaming payload format
Showing or hiding reasoning
Reasoning Effort
Model suffix: :reasoning-exclude
Legacy delta field compatibility

Other

Service tiers (flex and priority)
YouTube Transcripts
Performance Benchmarks
Important Notes

Tool calling

The /api/v1/chat/completions endpoint supports OpenAI-compatible function calling. You can describe callable functions in the tools array, control when the model may invoke them, and continue the conversation by echoing tool role messages that reference the assistant’s chosen call.

Request parameters

tools (optional array): Each entry must be { "type": "function", "function": { "name": string, "description"?: string, "parameters"?: JSON-Schema object } }. Only function tools are accepted. The serialized tools payload is limited to 200 KB (overrides via TOOL_SPEC_MAX_BYTES); violating the shape or size yields a 400 with tool_spec_too_large, invalid_tool_spec, or invalid_tool_spec_parse.
tool_choice (optional string or object): Defaults to auto. Set "none" to guarantee no tool calls (the server also drops the tools payload upstream), "required" to force the next response to be a tool call, or { "type": "function", "function": { "name": "your_function" } } to pin the exact function.
parallel_tool_calls (optional boolean): When true the flag is forwarded to providers that support issuing multiple tool calls in a single turn. Models that ignore the flag fall back to sequential calls.
messages[].tool_calls (assistant role): Persist the tool call metadata returned by the model so future turns can see which functions were invoked. Each item uses the OpenAI shape { id, type: "function", function: { name, arguments } }.
messages[] with role: "tool": Respond to the model by sending { "role": "tool", "tool_call_id": "<assistant tool_calls id>", "content": "<JSON or text payload>" }. The server drops any tool response that references an unknown tool_call_id, so keep the IDs in sync.
Validation behavior: If you send tool_choice: "none" with a tools array the request is accepted but the tools are omitted before hitting the model; invalid schemas or oversize payloads return the error codes above.

Example request

POST /api/v1/chat/completions
{
  "model": "google/gemini-3-flash-preview",
  "messages": [
    { "role": "user", "content": "What's the temperature in San Francisco right now?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "lookup_weather",
        "description": "Fetch the current weather for a city.",
        "parameters": {
          "type": "object",
          "properties": {
            "city": { "type": "string" },
            "unit": { "type": "string", "enum": ["c", "f"] }
          },
          "required": ["city"]
        }
      }
    }
  ],
  "tool_choice": "auto",
  "parallel_tool_calls": true
}

Example assistant/tool turn

{
  "role": "assistant",
  "content": null,
  "tool_calls": [
    {
      "id": "call_abc123",
      "type": "function",
      "function": {
        "name": "lookup_weather",
        "arguments": "{\"city\":\"San Francisco\",\"unit\":\"f\"}"
      }
    }
  ]
}

{
  "role": "tool",
  "tool_call_id": "call_abc123",
  "content": "{\"city\":\"San Francisco\",\"temperatureF\":58,\"conditions\":\"foggy\"}"
}
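
The assistant/tool turn above could be produced client-side with a small dispatch helper. This is a minimal sketch: the registry, the helper name, and the canned weather result are all hypothetical, and a real implementation would call an actual weather service.

```python
import json

def lookup_weather(city, unit="f"):
    # Hypothetical stand-in for a real weather lookup.
    return {"city": city, "temperatureF": 58, "conditions": "foggy"}

TOOL_REGISTRY = {"lookup_weather": lookup_weather}

def build_tool_messages(assistant_message):
    """Run each requested tool call and return the matching role "tool" messages."""
    replies = []
    for call in assistant_message.get("tool_calls") or []:
        fn = TOOL_REGISTRY[call["function"]["name"]]
        # Arguments arrive as a JSON-encoded string, not as an object.
        args = json.loads(call["function"]["arguments"])
        replies.append({
            "role": "tool",
            "tool_call_id": call["id"],  # must echo the assistant's call id
            "content": json.dumps(fn(**args)),
        })
    return replies

assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
            "name": "lookup_weather",
            "arguments": '{"city":"San Francisco","unit":"f"}',
        },
    }],
}
tool_messages = build_tool_messages(assistant_message)
print(tool_messages)
```

Append the assistant message and then the returned tool messages to messages before sending the next request.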

Streaming responses emit delta events that mirror OpenAI’s tool_calls schema, so consumers can reuse their existing parsing logic without changes.
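
For clients without existing parsing logic, the sketch below shows one way to merge streamed tool-call fragments. It assumes the OpenAI streaming shape, where each fragment carries an index and the arguments string arrives in pieces. The sample chunks are invented for illustration.

```python
import json

def accumulate_tool_calls(chunks):
    """Merge streamed tool_call fragments into complete calls, keyed by index."""
    calls = {}
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta") or {}
        for frag in delta.get("tool_calls") or []:
            entry = calls.setdefault(frag["index"], {
                "id": None,
                "type": "function",
                "function": {"name": "", "arguments": ""},
            })
            if frag.get("id"):
                entry["id"] = frag["id"]
            fn = frag.get("function") or {}
            if fn.get("name"):
                entry["function"]["name"] = fn["name"]
            # Argument text is concatenated in arrival order.
            entry["function"]["arguments"] += fn.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]

chunks = [
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "id": "call_abc123", "type": "function",
         "function": {"name": "lookup_weather", "arguments": ""}}]}}]},
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": '{"city":'}}]}}]},
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": '"San Francisco"}'}}]}}]},
]
calls = accumulate_tool_calls(chunks)
print(json.loads(calls[0]["function"]["arguments"]))
```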

Overview

The Chat Completion endpoint provides OpenAI-compatible chat completions.

Provider Routing Suffixes

Provider-selection-capable models support routing preference suffixes such as :fast and :cheap. See Provider Selection > Per-Request Routing Preference for the full list and billing rules, or Model Suffixes for all suffix composition rules.

Sampling & Decoding Controls

The /api/v1/chat/completions endpoint accepts a full set of sampling and decoding knobs. All fields are optional; omit any you want to leave at provider defaults.

Temperature & Nucleus

Parameter	Range/Default	Description
`temperature`	0–2 (provider default)	Classic randomness control; higher values explore more. If omitted, NanoGPT does not force a value and the routed provider/model default applies.
`top_p`	0–1 (default 1)	Nucleus sampling that trims to the smallest set above `top_p` cumulative probability.
`top_k`	1+	Sample only from the top-k tokens each step.
`top_a`	provider default	Blends temperature and nucleus behavior; set only if a model calls for it.
`min_p`	0–1	Require each candidate token to exceed a probability floor.
`tfs`	0–1	Tail free sampling; 1 disables.
`eta_cutoff` / `epsilon_cutoff`	provider default	Drop tokens once they fall below the tail thresholds.
`typical_p`	0–1	Entropy-based nucleus sampling; keeps tokens whose surprise matches expected entropy.
`mirostat_mode`	0/1/2	Enable Mirostat sampling; set tau/eta when active.
`mirostat_tau` / `mirostat_eta`	provider default	Target entropy and learning rate for Mirostat.

Length & Stopping

Parameter	Range/Default	Description
`max_tokens`	1+ (provider default)	Upper bound on generated tokens. If omitted, NanoGPT does not enforce an explicit default and the routed provider/model default applies.
`min_tokens`	0+ (default 0)	Minimum completion length when provider supports it.
`stop`	string or string[]	Stop sequences passed upstream.
`stop_token_ids`	int[]	Stop generation on specific token IDs (limited provider support).
`include_stop_str_in_output`	boolean (default false)	Keep the stop sequence in the final text where supported.
`ignore_eos`	boolean (default false)	Continue even if the model predicts EOS internally.

Penalties & Repetition Guards

Parameter	Range/Default	Description
`frequency_penalty`	-2 – 2 (default 0)	Penalize tokens proportional to prior frequency.
`presence_penalty`	-2 – 2 (default 0)	Penalize tokens based on whether they appeared at all.
`repetition_penalty`	-2 – 2	Provider-agnostic repetition modifier; >1 discourages repeats.
`no_repeat_ngram_size`	0+	Forbid repeating n-grams of the given size (limited support).
`custom_token_bans`	int[]	Fully block listed token IDs.

Logit Shaping & Determinism

Parameter	Range/Default	Description
`logit_bias`	object	Map token IDs to additive logits (OpenAI-compatible).
`logprobs`	boolean or int	Return token-level logprobs where supported.
`prompt_logprobs`	boolean	Request logprobs on the prompt when available.
`seed`	integer	Make completions repeatable where the provider allows it.

Usage notes

Parameters can be combined (e.g., temperature + top_p + top_k), but overly narrow settings may lead to early stops.
Invalid ranges yield a 400 before reaching the provider.
Provider defaults apply to any omitted field.
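
Because out-of-range values are rejected with a 400, a client may want to check the documented ranges before sending. The sketch below covers only the parameters whose ranges appear in the tables above. The helper name is hypothetical and the server's own validation may differ.

```python
# Inclusive ranges taken from the parameter tables above.
RANGES = {
    "temperature": (0, 2),
    "top_p": (0, 1),
    "min_p": (0, 1),
    "tfs": (0, 1),
    "typical_p": (0, 1),
    "frequency_penalty": (-2, 2),
    "presence_penalty": (-2, 2),
}
# Parameters documented only with a lower bound.
MINIMUMS = {"top_k": 1, "max_tokens": 1, "min_tokens": 0}

def out_of_range(payload):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for name, (low, high) in RANGES.items():
        value = payload.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name}={value} is outside {low}..{high}")
    for name, low in MINIMUMS.items():
        value = payload.get(name)
        if value is not None and value < low:
            problems.append(f"{name}={value} is below {low}")
    return problems

print(out_of_range({"temperature": 2.5, "top_k": 0, "top_p": 0.9}))
```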

Example request

curl -X POST https://nano-gpt.com/api/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-3-flash-preview",
    "messages": [
      {"role": "user", "content": "Write a creative story about space exploration"}
    ],
    "temperature": 0.8,
    "top_p": 0.9,
    "top_k": 40,
    "tfs": 0.8,
    "typical_p": 0.95,
    "mirostat_mode": 2,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "max_tokens": 500,
    "frequency_penalty": 0.3,
    "presence_penalty": 0.1,
    "repetition_penalty": 1.1,
    "stop": ["###"],
    "seed": 42
  }'

Structured Outputs (response_format)

The /api/v1/chat/completions endpoint supports OpenAI-compatible structured outputs via the response_format parameter. This ensures the model returns valid JSON matching your specified schema.

Supported Formats

Type	Description
`json_object`	Forces the model to return valid JSON
`json_schema`	Forces the model to return JSON matching a specific schema
`text`	Default text output (no constraint)

JSON Object Mode

Request valid JSON output without a specific schema:

{
  "model": "openai/gpt-5.1",
  "messages": [{"role": "user", "content": "List 3 colors as JSON"}],
  "response_format": {"type": "json_object"}
}

JSON Schema Mode (Structured Outputs)

Request JSON that conforms to a specific schema:

{
  "model": "openai/gpt-5.1",
  "messages": [{"role": "user", "content": "What is 2+2?"}],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "math_answer",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "answer": {"type": "number"},
          "explanation": {"type": "string"}
        },
        "required": ["answer", "explanation"],
        "additionalProperties": false
      }
    }
  }
}

Schema Requirements

When using strict: true:

All properties must be listed in required
Set additionalProperties: false
NanoGPT automatically transforms optional properties to be nullable for OpenAI compatibility
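
To make these rules concrete, the sketch below applies them to a flat object schema on the client side. It is a guess at the general shape of such a transformation, not NanoGPT's actual implementation. It handles only top-level properties with a type keyword, and the helper name is hypothetical.

```python
import copy

def to_strict_schema(schema):
    """Return a copy of a flat object schema that follows the strict-mode rules."""
    strict = copy.deepcopy(schema)
    props = strict.get("properties", {})
    originally_required = set(strict.get("required", []))
    for name, prop in props.items():
        if name in originally_required:
            continue
        # Optional property: allow null so the model can still "omit" it.
        declared = prop.get("type")
        if isinstance(declared, str):
            prop["type"] = [declared, "null"]
        elif isinstance(declared, list) and "null" not in declared:
            prop["type"] = declared + ["null"]
    strict["required"] = list(props)        # every property must be listed
    strict["additionalProperties"] = False  # no undeclared keys
    return strict

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "nickname": {"type": "string"},
    },
    "required": ["name"],
}
print(to_strict_schema(schema))
```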

Supported Models

JSON schema mode works with most models including:

OpenAI models (GPT-5.1, GPT-5.2, etc.)
Anthropic Claude models
Google Gemini models
Many open-source models

Example Request

curl -X POST https://nano-gpt.com/api/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-5.1",
    "messages": [
      {"role": "user", "content": "Generate a person profile"}
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "person",
        "strict": true,
        "schema": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "age": {"type": "number"},
            "skills": {
              "type": "array",
              "items": {"type": "string"}
            }
          },
          "required": ["name", "age", "skills"],
          "additionalProperties": false
        }
      }
    },
    "stream": false
  }'

Example Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1769278225,
  "model": "openai/gpt-5.1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{\"name\":\"Alice Chen\",\"age\":28,\"skills\":[\"Python\",\"Machine Learning\",\"Data Analysis\"]}"
      },
      "finish_reason": "stop"
    }
  ]
}

Usage with Vercel AI SDK

The response_format parameter is compatible with Vercel AI SDK’s generateObject:

import { generateObject } from 'ai';
import { createOpenAI } from '@ai-sdk/openai';
import { z } from 'zod';

const nanogpt = createOpenAI({
  baseURL: 'https://nano-gpt.com/api/v1',
  apiKey: 'YOUR_API_KEY',
});

const { object } = await generateObject({
  model: nanogpt('openai/gpt-5.1'),
  schema: z.object({
    name: z.string(),
    age: z.number(),
    skills: z.array(z.string()),
  }),
  prompt: 'Generate a person profile',
});

console.log(object);
// { name: "Alice Chen", age: 28, skills: ["Python", "Machine Learning", "Data Analysis"] }

Usage Notes

Works with both streaming and non-streaming requests
The name field in json_schema is required and should describe the output
Response content is a JSON string; parse it with JSON.parse() in your application
Some provider-specific limitations may apply; if you encounter issues with a specific model, try an alternative
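
In Python, the equivalent of JSON.parse() is json.loads(). The sketch below decodes the content string from the example response shown earlier. The response dictionary is abbreviated to the fields that matter here.

```python
import json

# Abbreviated copy of the example response body.
response_body = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": '{"name":"Alice Chen","age":28,'
                           '"skills":["Python","Machine Learning","Data Analysis"]}',
            },
            "finish_reason": "stop",
        }
    ]
}

# The structured output arrives as a JSON string and needs decoding.
content = response_body["choices"][0]["message"]["content"]
person = json.loads(content)
print(person["name"], person["skills"])
```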

Web Search

Enable web search in two ways: model suffixes or a webSearch object in the request body. The legacy linkup object is still supported as an alias. If webSearch.enabled (or linkup.enabled) is true, it takes precedence over any model suffix.

OpenAI native web search: GPT-5+ / o1 / o3 / o4 models use OpenAI’s built-in web search automatically. No suffix is required; you can still set webSearch.search_context_size and webSearch.user_location. To force a different provider, specify a provider or suffix.

If you need full direct control over the search call itself (query, outputType, date/domain filters, or structured schema output), use Direct Web Search API (POST /api/web).

Use case	Recommended endpoint
Model should answer with web context in one call	`POST /api/v1/chat/completions`
You need raw/structured web payload control	`POST /api/web`

Option A: model suffixes

Append one of these to your model value:

:online (default web search, standard depth)
:online/linkup (Linkup, standard)
:online/linkup-deep (Linkup, deep)
:online/tavily (Tavily, standard)
:online/tavily-deep (Tavily, deep)
:online/brave (Brave, standard)
:online/brave-deep (Brave, deep)
:online/exa-fast (Exa, fast)
:online/exa-auto (Exa, auto)
:online/exa-neural (Exa, neural)
:online/exa-deep (Exa, deep)
:online/exa-instant (Exa, instant)
:online/exa-deep-reasoning (Exa, deep-reasoning)
:online/kagi (Kagi, standard, search)
:online/kagi-web (Kagi, standard, web)
:online/kagi-news (Kagi, standard, news)
:online/kagi-search (Kagi, deep, search)
:online/perplexity (Perplexity, standard)
:online/perplexity-deep (Perplexity, deep)
:online/valyu (Valyu, standard, all sources)
:online/valyu-deep (Valyu, deep, all sources)
:online/valyu-web (Valyu, standard, web only)
:online/valyu-web-deep (Valyu, deep, web only)

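:online without an explicit provider uses the default web search backend (Linkup), except on GPT-5+ / o1 / o3 / o4 models, where it uses OpenAI native web search (see Advanced behavior).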
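
A client can guard against typos in these suffixes by composing them from a fixed set. The sketch below is one way to do that. The helper name is hypothetical and the set simply mirrors the list above.

```python
# Provider/depth variants copied from the suffix list above.
WEB_SEARCH_VARIANTS = {
    "linkup", "linkup-deep", "tavily", "tavily-deep", "brave", "brave-deep",
    "exa-fast", "exa-auto", "exa-neural", "exa-deep", "exa-instant",
    "exa-deep-reasoning", "kagi", "kagi-web", "kagi-news", "kagi-search",
    "perplexity", "perplexity-deep",
    "valyu", "valyu-deep", "valyu-web", "valyu-web-deep",
}

def with_web_search(model, variant=None):
    """Append :online or :online/<variant> to a model ID."""
    if variant is None:
        return f"{model}:online"
    if variant not in WEB_SEARCH_VARIANTS:
        raise ValueError(f"unknown web search variant: {variant}")
    return f"{model}:online/{variant}"

print(with_web_search("openai/gpt-5.2", "tavily-deep"))
```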

Option B: request body configuration (recommended)

Send a webSearch object in the request body. The legacy linkup object is accepted as an alias. This works with or without a model suffix and controls web search across all providers. webSearch fields:

enabled (boolean, required to activate web search)
provider (string): linkup | tavily | brave | exa | kagi | perplexity | valyu
depth (string):
- Linkup/Tavily/Brave/Perplexity/Valyu: standard or deep
- Exa: fast, auto, neural, deep, instant, deep-reasoning (use standard if you want auto)
- Kagi: standard or deep (deep applies to the search source only)
search_context_size or searchContextSize (string, OpenAI native): low | medium | high (default: medium)
user_location or userLocation (object, OpenAI native): { type: "approximate", country, city, region }
searchType (string, Valyu only): all | web
kagiSource or kagi_source (string, Kagi only): web | news | search

Legacy alias example:

{
  "linkup": {
    "enabled": true,
    "provider": "tavily",
    "search_context_size": "medium"
  }
}

Provider-specific options (set inside `webSearch`)

Perplexity

{
  "maxResults": 1-20,
  "maxTokensPerPage": number,
  "maxTokens": 1-1000000,
  "country": "string",
  "searchDomainFilter": ["domain1.com", "domain2.com"],
  "searchLanguageFilter": ["en", "de"]
}

Limits: searchDomainFilter max 20 entries; searchLanguageFilter max 10 entries (ISO 639-1).

Valyu

{
  "searchType": "all" | "web",
  "fastMode": boolean,
  "maxNumResults": 1-50,
  "maxPrice": number,
  "relevanceThreshold": 0-1,
  "responseLength": "short" | "medium" | "large" | "max" | number,
  "countryCode": "US",
  "includedSources": ["source1.com"],
  "excludedSources": ["source2.com"],
  "urlOnly": boolean,
  "category": "string"
}

countryCode uses a 2-letter ISO country code.

Tavily

{
  "maxResults": 0-20,
  "includeAnswer": boolean | "basic" | "advanced",
  "includeRawContent": boolean | "markdown" | "text",
  "includeImages": boolean,
  "includeImageDescriptions": boolean,
  "includeFavicon": boolean,
  "topic": "general" | "news" | "finance",
  "timeRange": "day" | "week" | "month" | "year",
  "startDate": "YYYY-MM-DD",
  "endDate": "YYYY-MM-DD",
  "chunksPerSource": 1-3,
  "country": "string"
}

Exa

{
  "numResults": 1-100,
  "category": "company" | "research paper" | "news" | "pdf" | "github" | "tweet" | "personal site" | "people" | "financial report",
  "userLocation": "US",
  "additionalQueries": ["query2"],
  "startCrawlDate": "ISO 8601",
  "endCrawlDate": "ISO 8601",
  "startPublishedDate": "ISO 8601",
  "endPublishedDate": "ISO 8601",
  "includeText": ["pattern"],
  "excludeText": ["pattern"],
  "livecrawl": "never" | "fallback" | "always" | "preferred",
  "livecrawlTimeout": number,
  "subpages": number,
  "subpageTarget": "string" | ["strings"]
}

OpenAI native (GPT-5.2)

{
  "search_context_size": "low" | "medium" | "high",
  "user_location": {
    "type": "approximate",
    "country": "US",
    "city": "San Francisco",
    "region": "California"
  }
}

Examples

import requests

BASE_URL = "https://nano-gpt.com/api/v1"
API_KEY = "YOUR_API_KEY"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Suffix-based standard web search
data = {
    "model": "openai/gpt-5.2:online",
    "messages": [
        {"role": "user", "content": "What are the latest developments in AI?"}
    ]
}

response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers=headers,
    json=data
)

# Request-body configuration (Exa neural)
data_search = {
    "model": "openai/gpt-5.2",
    "messages": [
        {"role": "user", "content": "Provide a comprehensive analysis of recent AI breakthroughs"}
    ],
    "webSearch": {
        "enabled": True,
        "provider": "exa",
        "depth": "neural",
        "numResults": 10
    }
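}

response_search = requests.post(
    f"{BASE_URL}/chat/completions",
    headers=headers,
    json=data_search
)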

Pricing by provider

Provider	Standard	Deep	Notes
Linkup	$0.006	$0.06	Default provider
Tavily	$0.008	$0.016	Good value, free tier available
Exa	$0.005 base	+ $0.001/page	For contents retrieval
Kagi Web/News	$0.002	N/A	Cheapest for enrichment
Kagi Search	$0.025	N/A	Full search mode
Perplexity	$0.005	N/A	Flat rate
Valyu	~$0.0015/result	Variable	Dynamic pricing
Brave	$0.005	$0.005	Flat rate
OpenAI Native	$0.01 + tokens	N/A	Per-call fee + model token costs

For standard NanoGPT usage, provider credentials are handled automatically.

Bring your own key (BYOK)

BYOK lets you route requests through your own upstream provider credentials.

Configure keys once: https://nano-gpt.com/byok
Opt in per request via x-use-byok: true or byok.enabled: true
Optionally force the provider via x-byok-provider or byok.provider
BYOK usage includes a 5% platform fee (your provider bills you directly for usage)

See: Bring Your Own Key (BYOK)

Web search BYOK

Web search BYOK availability is provider-dependent and can change over time. See the BYOK reference for the current support matrix.

Advanced behavior (optional)

Provider routing: For GPT-5+ / o1 / o3 / o4 models, :online without an explicit provider uses OpenAI native web search. If you set webSearch.provider or use an explicit :online/<provider> suffix, that provider is used instead.
Model suffix normalization: :online (and provider/depth suffixes) are stripped from the model name before routing to the base model; the suffix only controls search behavior.
Query formation (non-OpenAI providers): The search query is derived from your latest user message and may include the previous user message if the latest is short. If you need full control over the query or raw results, use the Web Search endpoint (/api/web).
scraping: true URL handling: When enabled, NanoGPT scans messages for public http(s) URLs, ignores local/private URLs, de-duplicates, and caps at 5. If no eligible URLs are found, scraping is skipped. Inline scraping in chat is billed at $0.0015 per successfully scraped URL. For explicit URL lists and the standalone endpoint price ($0.001 per URL), use /scrape-urls.
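
To estimate which URLs inline scraping would pick up, the URL handling described above can be approximated locally. This is a sketch under stated assumptions: the regular expression, the private-host checks, and the helper names are illustrative, and the server's own eligibility rules may be stricter or looser.

```python
import ipaddress
import re
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")
MAX_URLS = 5            # cap stated above
PRICE_PER_URL = 0.0015  # USD per successfully scraped URL (inline scraping)

def is_public(url):
    host = urlparse(url).hostname
    if not host or host == "localhost" or host.endswith(".local"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # a hostname rather than an IP literal; assumed public
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)

def eligible_urls(messages):
    """Collect public URLs from string message contents, de-duplicated and capped."""
    found = []
    for message in messages:
        content = message.get("content")
        if not isinstance(content, str):
            continue
        for url in URL_PATTERN.findall(content):
            url = url.rstrip(".,;")  # drop trailing sentence punctuation
            if is_public(url) and url not in found:
                found.append(url)
    return found[:MAX_URLS]

messages = [{"role": "user", "content":
             "Compare https://example.com/a with http://192.168.1.10/admin "
             "and https://example.com/a."}]
urls = eligible_urls(messages)
print(urls, "max cost:", len(urls) * PRICE_PER_URL)
```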

Image Input

Send images using the OpenAI‑compatible chat format. Provide image parts alongside text in the messages array.

Supported Forms

Remote URL: {"type":"image_url","image_url":{"url":"https://..."}}
Base64 data URL: {"type":"image_url","image_url":{"url":"data:image/png;base64,...."}}

Notes:

Prefer HTTPS URLs; some upstreams reject non‑HTTPS. If in doubt, use base64 data URLs.
Accepted mime types: image/png, image/jpeg, image/jpg, image/webp.
Inline markdown images in plain text (e.g., ![alt](data:image/...;base64,...)) are auto‑normalized into structured parts server‑side.

Message Shape

{
  "role": "user",
  "content": [
    { "type": "text", "text": "What is in this image?" },
    { "type": "image_url", "image_url": { "url": "https://example.com/image.jpg" } }
  ]
}
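
The same message shape can be built from raw image bytes by encoding them as a base64 data URL. In this sketch the helper names are hypothetical and the sample bytes are a placeholder, not a valid image.

```python
import base64

# MIME types listed under Supported Forms.
ACCEPTED_MIME_TYPES = {"image/png", "image/jpeg", "image/jpg", "image/webp"}

def image_part_from_bytes(data, mime_type):
    """Wrap raw image bytes in an image_url content part using a data URL."""
    if mime_type not in ACCEPTED_MIME_TYPES:
        raise ValueError(f"unsupported mime type: {mime_type}")
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }

def user_message(text, image_part):
    """Combine a text part and an image part into one user message."""
    return {
        "role": "user",
        "content": [{"type": "text", "text": text}, image_part],
    }

# Placeholder bytes; in practice read them with open(path, "rb").read().
part = image_part_from_bytes(b"\x89PNG\r\n\x1a\n", "image/png")
message = user_message("What is in this image?", part)
print(message["content"][1]["image_url"]["url"][:30])
```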

cURL — Image URL (non‑streaming)

curl -sS \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -X POST https://nano-gpt.com/api/v1/chat/completions \
  --data '{
    "model": "google/gemini-3-flash-preview",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in three words."},
          {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/3/3f/Fronalpstock_big.jpg"}}
        ]
      }
    ],
    "stream": false
  }'

cURL — Base64 Data URL (non‑streaming)

Embed your image as a data URL. Replace ...BASE64... with your image bytes.

curl -sS \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -X POST https://nano-gpt.com/api/v1/chat/completions \
  --data '{
    "model": "google/gemini-3-flash-preview",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is shown here?"},
          {"type": "image_url", "image_url": {"url": "data:image/png;base64,...BASE64..."}}
        ]
      }
    ],
    "stream": false
  }'

cURL — Streaming SSE

Caching (Implicit and Explicit Controls)

For the full guide (supported models, thresholds, pricing, and usage fields), see Prompt Caching.

NanoGPT automatically applies implicit caching on providers/models that support it (including OpenAI, Gemini, and many open-source provider/model routes), so most requests do not need caching flags.

Set top-level caching: true when you want NanoGPT to route the request to any available provider that supports prompt/input caching. This is capability-based routing: you do not need to choose a provider. If no cache-capable provider is available for the model, the request fails rather than silently using a non-caching provider.

Use explicit prompt-caching controls (prompt_caching, promptCaching, the body-level cache_control alias, and inline cache_control) when you need Claude-specific cache boundaries, TTL selection, or prompt_caching.stickyProvider consistency control. Top-level caching: true does not add Anthropic-style cache_control markers or configure cache TTLs.

Cache-Capable Provider Routing

Top-level caching: true is provider routing, not prompt-cache annotation. It requires the routed provider to be marked as prompt-caching capable for the requested provider-selection model.

{
  "model": "model-id",
  "caching": true,
  "messages": [
    { "role": "user", "content": "Hello" }
  ]
}

By default, caching: true also enables sticky provider routing. After the first successful matching request, NanoGPT will try to use the same provider for later matching requests from the same API key or session, improving the chance of provider-side cache hits. This does not guarantee that a request will be served from cache. To require a cache-capable provider without sticky routing, set stickyprovider: false:

{
  "model": "model-id",
  "caching": true,
  "stickyprovider": false,
  "messages": [
    { "role": "user", "content": "Hello" }
  ]
}

Top-level fields:

Parameter	Type	Default	Description
`caching`	boolean	`false`	Require a cache-capable provider for this request. If none is usable for the model, the request fails.
`stickyprovider`	boolean	`true` when `caching: true`	Prefer the previously recorded provider for later matching cache-capable requests. Set `false` to restore non-sticky cache-capable routing.
`stickyProvider`	boolean	Alias	CamelCase alias for top-level `stickyprovider`. Use `stickyprovider` in examples.

For caching: true, routing works as follows:

Filter to providers that are available, not excluded by preferences, and marked as prompt-caching capable.
If stickiness is enabled, prefer the previously recorded provider for the same cache-relevant request shape when still usable.
Otherwise choose the cheapest cache-capable provider by base input + output price.
Use cache write/read pricing only as tie-breakers.
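
The routing steps above can be sketched as a small selection function. This is an illustration only: the Provider fields, the helper name, and the order of the two tie-breakers (write price before read price) are assumptions, and the real router may weigh them differently.

```python
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    available: bool
    excluded: bool
    cache_capable: bool
    input_price: float
    output_price: float
    cache_write_price: float
    cache_read_price: float

def pick_provider(providers, sticky_name=None, sticky=True):
    """Pick a cache-capable provider following the documented steps."""
    # Step 1: keep only usable, cache-capable providers.
    candidates = [p for p in providers
                  if p.available and not p.excluded and p.cache_capable]
    if not candidates:
        raise LookupError("no cache-capable provider is usable for this model")
    # Step 2: prefer the previously recorded provider when stickiness is on.
    if sticky and sticky_name:
        for p in candidates:
            if p.name == sticky_name:
                return p
    # Steps 3 and 4: cheapest by input + output price, cache prices as tie-breakers.
    return min(candidates, key=lambda p: (p.input_price + p.output_price,
                                          p.cache_write_price,
                                          p.cache_read_price))

providers = [
    Provider("a", True, False, True, 1.0, 2.0, 1.5, 0.2),
    Provider("b", True, False, True, 1.0, 2.0, 1.25, 0.2),
    Provider("c", True, False, False, 0.5, 1.0, 0.0, 0.0),  # cheaper, but no caching
]
print(pick_provider(providers).name)
```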

The prompt_caching / promptCaching helper accepts these options:

Parameter	Type	Default	Description
`enabled`	boolean	—	Enable prompt caching
`ttl`	string	`"5m"`	Cache time-to-live: `"5m"` or `"1h"`
`cut_after_message_index` / `cutAfterMessageIndex`	integer	—	Zero-based index; cache all messages up to and including this index
`stickyProvider`	boolean	`false`	When `true`, disable automatic failover to preserve explicit prompt-cache consistency. Returns 503 error instead of switching services.

import requests

headers = {
  "Authorization": "Bearer YOUR_API_KEY",
  "Content-Type": "application/json"
}

payload = {
  "model": "anthropic/claude-sonnet-4.6",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "Reference handbook + rules of engagement.",
          "cache_control": {"type": "ephemeral", "ttl": "5m"}
        }
      ]
    },
    {"role": "user", "content": "Live request goes here"}
  ]
}

requests.post("https://nano-gpt.com/api/v1/chat/completions", headers=headers, json=payload)

Each cache_control marker caches the full prefix up to that block. Place them on every static chunk (system messages, tool definitions, large contexts) you plan to reuse.
Explicit TTL controls are 5m and 1h for Claude caching flows. See Prompt Caching.
anthropic-beta: prompt-caching-2024-07-31 is supported for compatibility (and required for Anthropic-native Claude caching flows).
For implicit-caching providers, no explicit cache_control markers are required.

For a simpler experience, send the helper fields and NanoGPT will stamp the first N messages for you before forwarding upstream:

// Assumes `client` is an OpenAI SDK client configured with baseURL https://nano-gpt.com/api/v1.
await client.chat.completions.create(
  {
    model: 'anthropic/claude-opus-4.6',
    messages: [
      { role: 'system', content: 'Static rubric lives here.' },
      { role: 'user', content: 'Additional reusable context.' },
      { role: 'user', content: 'This turn is not cached.' },
    ],
    prompt_caching: {
      enabled: true,
      ttl: '1h',
      cut_after_message_index: 1,
    },
  },
  {
    headers: { 'anthropic-beta': 'prompt-caching-2024-07-31' },
  },
);

cut_after_message_index is zero-based. If omitted, NanoGPT will select a cache boundary automatically; set it explicitly if you need full control. Switch back to explicit cache_control blocks if you need multiple cache breakpoints or mixed TTLs in the same payload.

Explicit Prompt Cache Consistency

NanoGPT automatically fails over to backup services when the primary service is temporarily unavailable. While this ensures high availability, it can break your prompt cache because each backend service maintains its own separate cache. If cache consistency is more important than availability for your use case, you can enable the stickyProvider option:

{
  "model": "anthropic/claude-sonnet-4.6",
  "messages": [...],
  "prompt_caching": {
    "enabled": true,
    "ttl": "5m",
    "stickyProvider": true
  }
}

Behavior:

stickyProvider: false (default) — If the primary service fails, NanoGPT automatically retries with a backup service. Your request succeeds, but the cache may be lost (you’ll pay full price for that request and need to rebuild the cache).
stickyProvider: true — If the primary service fails, NanoGPT returns a 503 error instead of failing over. Your cache remains intact for when the service recovers.

When to use stickyProvider: true:

You have very large cached contexts where cache misses are expensive
You prefer to retry failed requests yourself rather than pay for cache rebuilds
Cost predictability is more important than request success rate

When to use stickyProvider: false (default):

You prefer requests to always succeed when possible
Occasional cache misses are acceptable
You’re using shorter contexts where cache rebuilds are inexpensive

Error response when stickyProvider blocks a failover:

{
  "error": {
    "message": "Service is temporarily unavailable. Fallback disabled to preserve prompt cache consistency. Switching services would invalidate your cached tokens. Remove stickyProvider option or retry later.",
    "status": 503,
    "type": "service_unavailable",
    "code": "fallback_blocked_for_cache_consistency"
  }
}
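
If you enable stickyProvider and prefer to retry yourself, the error code above gives a reliable signal to key on. The sketch below retries with exponential backoff only for that code. The send callable is hypothetical (it is assumed to return a status code and the parsed JSON body), and the attempt count and delays are arbitrary defaults.

```python
import time

BLOCKED_CODE = "fallback_blocked_for_cache_consistency"

def send_with_retry(send, payload, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call send(payload); retry only when failover was blocked for cache consistency."""
    for attempt in range(attempts):
        status, body = send(payload)
        error = body.get("error") if isinstance(body, dict) else None
        code = error.get("code") if isinstance(error, dict) else None
        blocked = status == 503 and code == BLOCKED_CODE
        # Return on success, on any other error, or after the final attempt.
        if not blocked or attempt == attempts - 1:
            return status, body
        sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

A send function built on requests could return (response.status_code, response.json()). Passing a custom sleep makes the helper easy to exercise in tests without real delays.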

Troubleshooting

400 unsupported image: ensure the image is a valid PNG/JPEG/WebP, not a tiny 1×1 pixel, and either HTTPS URL or a base64 data URL.
503 after fallbacks: try a different model, verify API key/session, and prefer base64 data URL for local or protected assets.
Missing usage events: confirm include_usage is true in the payload or that prompt caching is enabled.

Context Memory

Enable unlimited-length conversations with lossless, hierarchical memory.

Append :memory to any model name
Or send header memory: true
Can be combined with web search: :online:memory
Retention: default 30 days; configure via :memory-<days> (1..365) or header memory_expiration_days: <days>; header takes precedence

import requests

BASE_URL = "https://nano-gpt.com/api/v1"
API_KEY = "YOUR_API_KEY"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Suffix-based
payload = {
    "model": "openai/gpt-5.2:memory",
    "messages": [{"role": "user", "content": "Keep our previous discussion in mind and continue."}]
}
requests.post(f"{BASE_URL}/chat/completions", headers=headers, json=payload)
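
The retention options listed above can be expressed either as a model suffix or as headers. The sketch below builds both forms and checks the documented 1..365 range. The helper names are hypothetical.

```python
def _check_days(days):
    if not 1 <= days <= 365:
        raise ValueError("retention must be between 1 and 365 days")

def memory_model(model, days=None):
    """Append :memory or :memory-<days> to a model ID."""
    if days is None:
        return f"{model}:memory"  # default retention applies
    _check_days(days)
    return f"{model}:memory-{days}"

def memory_headers(days=None):
    """Header form; the header takes precedence over the suffix."""
    headers = {"memory": "true"}
    if days is not None:
        _check_days(days)
        headers["memory_expiration_days"] = str(days)
    return headers

print(memory_model("openai/gpt-5.2", 90))
print(memory_headers(7))
```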

Custom Context Size Override

When Context Memory is enabled, you can override the model-derived context size used for the memory compression step with model_context_limit.

Parameter: model_context_limit (number or numeric string)
Default: Derived from the selected model’s context size
Minimum: Values below 10,000 are clamped internally
Scope: Only affects memory compression; does not change the target model’s own window

Examples:

# Enable memory via header; use model default context size
curl -s -X POST \
  -H "Authorization: Bearer $NANOGPT_API_KEY" \
  -H "Content-Type: application/json" \
  -H "memory: true" \
  https://nano-gpt.com/api/v1/chat/completions \
  -d '{
    "model": "google/gemini-3-flash-preview",
    "messages": [{"role":"user","content":"Briefly say hello."}],
    "stream": false
  }'

# Explicit numeric override
curl -s -X POST \
  -H "Authorization: Bearer $NANOGPT_API_KEY" \
  -H "Content-Type: application/json" \
  -H "memory: true" \
  https://nano-gpt.com/api/v1/chat/completions \
  -d '{
    "model": "google/gemini-3-flash-preview",
    "messages": [{"role":"user","content":"Briefly say hello."}],
    "model_context_limit": 20000,
    "stream": false
  }'

# String override (server coerces to number)
curl -s -X POST \
  -H "Authorization: Bearer $NANOGPT_API_KEY" \
  -H "Content-Type: application/json" \
  -H "memory: true" \
  https://nano-gpt.com/api/v1/chat/completions \
  -d '{
    "model": "google/gemini-3-flash-preview",
    "messages": [{"role":"user","content":"Briefly say hello."}],
    "model_context_limit": "30000",
    "stream": false
  }'
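To predict which value the memory compression step will work with, the coercion and clamping rules can be mirrored client-side. This is only a sketch under assumptions: the function name is hypothetical, and it assumes that values below 10,000 are raised to exactly 10,000, which the description above implies but does not spell out.

```python
MIN_CONTEXT_LIMIT = 10_000  # assumed clamp target, based on the documented minimum


def normalize_model_context_limit(value) -> int:
    """Coerce a number or numeric string to int and apply the assumed floor."""
    limit = int(float(value))  # accepts 20000, 20000.0, or "30000"
    return max(limit, MIN_CONTEXT_LIMIT)


print(normalize_model_context_limit(20000))    # 20000
print(normalize_model_context_limit("30000"))  # 30000
print(normalize_model_context_limit(5000))     # 10000
```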

Reasoning Streams

The Chat Completions endpoint separates the model’s visible answer from its internal reasoning. By default, reasoning is included and delivered alongside normal content so that clients can decide whether to display it.

:thinking is model-specific and only works when that exact ID (or a documented alias) exists. -thinking is a legacy alias pattern for some model families only, not universal. Do not assume -thinking works for arbitrary model IDs. Always check GET /api/v1/models for exact valid IDs. See also: Extended Thinking (Reasoning).

Endpoint variants

Choose the base path that matches how your client consumes reasoning streams:

https://nano-gpt.com/api/v1/chat/completions — default endpoint that streams internal thoughts through choices[0].delta.reasoning (and repeats them in message.reasoning on completion). Recommended for apps like SillyTavern that understand the modern response shape.
https://nano-gpt.com/api/v1legacy/chat/completions — legacy contract that swaps the field name to choices[0].delta.reasoning_content / message.reasoning_content for older OpenAI-compatible clients. Use this for LiteLLM’s OpenAI adapter to avoid downstream parsing errors.
https://nano-gpt.com/api/v1thinking/chat/completions — reasoning-aware models write everything into the normal choices[0].delta.content stream so clients that ignore reasoning fields still see the full conversation transcript. This is the preferred base URL for JanitorAI.
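One way to keep this choice explicit in client code is a small lookup keyed by the field in which the client expects reasoning to arrive. The mapping keys and the function name endpoint_for are hypothetical; the three URLs are the ones listed above.

```python
BASE = "https://nano-gpt.com/api"

# Key: the field the client reads reasoning from (hypothetical labels).
ENDPOINTS = {
    "reasoning": f"{BASE}/v1/chat/completions",                 # modern shape
    "reasoning_content": f"{BASE}/v1legacy/chat/completions",   # legacy field name
    "content": f"{BASE}/v1thinking/chat/completions",           # reasoning inlined into content
}


def endpoint_for(reasoning_field: str) -> str:
    """Return the Chat Completions URL that matches how the client reads reasoning."""
    try:
        return ENDPOINTS[reasoning_field]
    except KeyError:
        raise ValueError(f"unknown reasoning field: {reasoning_field!r}") from None


print(endpoint_for("reasoning_content"))
```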

Streaming payload format

Server-Sent Event (SSE) streams emit the answer in choices[0].delta.content and the thought process in choices[0].delta.reasoning (plus optional delta.reasoning_details). Reasoning deltas are dispatched before or alongside regular content, letting you render both panes in real-time.

data: {
  "choices": [{
    "delta": {
      "reasoning": "Assessing possible tool options…"
    }
  }]
}
data: {
  "choices": [{
    "delta": {
      "content": "Let me walk you through the solution."
    }
  }]
}

When streaming completes, the formatter aggregates the collected values and repeats them in the final payload: choices[0].message.content contains the assistant reply and choices[0].message.reasoning (plus reasoning_details when available) contains the full chain-of-thought. Non-streaming requests reuse the same formatter, so the reasoning block is present as a dedicated field.
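A client that wants to render both panes can accumulate the two delta fields separately. The sketch below works on an iterable of already-decoded SSE lines and makes two assumptions that are not stated above: each data: line carries one complete JSON object, and the stream ends with the OpenAI-style data: [DONE] sentinel. The function name accumulate_stream is hypothetical.

```python
import json


def accumulate_stream(lines):
    """Collect answer text and reasoning text from SSE data lines."""
    content_parts, reasoning_parts = [], []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(data)
        choices = chunk.get("choices") or [{}]
        delta = choices[0].get("delta", {})
        if delta.get("reasoning"):
            reasoning_parts.append(delta["reasoning"])
        if delta.get("content"):
            content_parts.append(delta["content"])
    return "".join(content_parts), "".join(reasoning_parts)


sample = [
    'data: {"choices": [{"delta": {"reasoning": "Assessing possible tool options…"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Let me walk you through the solution."}}]}',
    "data: [DONE]",
]
answer, thoughts = accumulate_stream(sample)
print(answer)
print(thoughts)
```

With the requests library, the lines could come from response.iter_lines(decode_unicode=True) on a request sent with stream=True.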

Showing or hiding reasoning

Send reasoning: { "exclude": true } to strip the reasoning payload from both streaming deltas and the final message. With this flag set, delta.reasoning and message.reasoning are omitted entirely.

curl -X POST https://nano-gpt.com/api/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-opus-4.6",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "reasoning": {"exclude": true}
  }'

Without reasoning.exclude:

{
  "choices": [{
    "message": {
      "content": "The answer is 4.",
      "reasoning": "The user is asking for a simple addition. 2+2 equals 4."
    }
  }]
}

With reasoning.exclude:

{
  "choices": [{
    "message": {
      "content": "The answer is 4."
    }
  }]
}

Reasoning Effort

reasoning_effort (or reasoning.effort) controls reasoning depth and also acts as an explicit reasoning-mode signal. Any value other than "none" is treated as a request to enable reasoning/thinking behavior. Use "none" to explicitly disable reasoning behavior.

Parameter: `reasoning_effort`

Value	Description
`none`	Explicitly disables reasoning
`minimal`	Lowest reasoning depth
`low`	Low reasoning depth
`medium`	Medium reasoning depth
`high`	High reasoning depth
`xhigh`	Maximum reasoning depth

Usage

The reasoning_effort parameter can be passed at the top level:

curl -X POST https://nano-gpt.com/api/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-opus-4.6",
    "messages": [
      {"role": "user", "content": "Explain quantum entanglement step by step"}
    ],
    "reasoning_effort": "high",
    "max_tokens": 4096
  }'

Alternatively, pass it as part of the reasoning object:

{
  "model": "anthropic/claude-opus-4.6",
  "messages": [{"role": "user", "content": "Solve this complex math problem..."}],
  "reasoning": {
    "effort": "high"
  }
}

Both formats are accepted. If both are present, top-level reasoning_effort is authoritative for Chat Completions request shaping.
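The precedence rule can be written down as a small resolver. This is an illustrative sketch of the rule as described here, not the service's implementation; the function names are hypothetical.

```python
VALID_EFFORTS = {"none", "minimal", "low", "medium", "high", "xhigh"}


def effective_effort(payload: dict):
    """Return the effort that applies: top-level reasoning_effort wins over reasoning.effort."""
    top_level = payload.get("reasoning_effort")
    nested = (payload.get("reasoning") or {}).get("effort")
    effort = top_level if top_level is not None else nested
    if effort is not None and effort not in VALID_EFFORTS:
        raise ValueError(f"unsupported reasoning effort: {effort!r}")
    return effort


def explicitly_requests_reasoning(payload: dict) -> bool:
    """True when an effort other than "none" is set on the request."""
    effort = effective_effort(payload)
    return effort is not None and effort != "none"


both = {"reasoning_effort": "low", "reasoning": {"effort": "high"}}
print(effective_effort(both))  # low
print(explicitly_requests_reasoning({"reasoning_effort": "none"}))  # False
```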

Combining effort with exclude

reasoning.exclude controls output visibility only. It hides reasoning fields/blocks, but does not inherently disable reasoning compute. If an effort level is set to a non-none value, reasoning can still run while hidden.

{
  "model": "anthropic/claude-opus-4.6",
  "messages": [{"role": "user", "content": "..."}],
  "reasoning": {
    "effort": "high",
    "exclude": true
  }
}

Model suffix: `:reasoning-exclude`

You can toggle the filter without altering your JSON body by appending :reasoning-exclude to the model name.

Equivalent to sending { "reasoning": { "exclude": true } }
Only the :reasoning-exclude suffix is stripped before the request is routed; other suffixes remain active
Works for streaming and non-streaming responses on both Chat Completions and Text Completions

{
  "model": "anthropic/claude-opus-4.6:reasoning-exclude",
  "messages": [{ "role": "user", "content": "What is 2+2?" }]
}

Combine with other suffixes

:reasoning-exclude composes safely with the other routing suffixes you already use:

:thinking (when that exact model ID exists). -thinking variants are legacy aliases for some families only.
:online and :online/linkup-deep
:memory and :memory-<days>

Examples:

anthropic/claude-sonnet-4.6:thinking:8192:reasoning-exclude
openai/gpt-5.2:online:reasoning-exclude
anthropic/claude-opus-4.6:memory-30:online/linkup-deep:reasoning-exclude
zai-org/glm-5:fast:reasoning-exclude
zai-org/glm-5:cheap:reasoning-exclude
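The stripping behaviour described above (only :reasoning-exclude is removed, other suffixes stay) can be illustrated with a few lines of string handling. This is a sketch of the described behaviour, not the router's actual code, and the function name is hypothetical.

```python
def split_reasoning_exclude(model: str):
    """Return (model without :reasoning-exclude, whether the suffix was present)."""
    base, *suffixes = model.split(":")
    exclude = "reasoning-exclude" in suffixes
    kept = [s for s in suffixes if s != "reasoning-exclude"]
    return ":".join([base, *kept]), exclude


print(split_reasoning_exclude("anthropic/claude-sonnet-4.6:thinking:8192:reasoning-exclude"))
# ('anthropic/claude-sonnet-4.6:thinking:8192', True)
print(split_reasoning_exclude("openai/gpt-5.2:online"))
# ('openai/gpt-5.2:online', False)
```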

Legacy delta field compatibility

Older clients that expect the legacy reasoning_content field can opt in per request. Set reasoning.delta_field to "reasoning_content", or use the top-level shorthands reasoning_delta_field / reasoning_content_compat if updating nested objects is difficult. When the toggle is active, every streaming and non-streaming response exposes reasoning_content instead of reasoning, and the modern key is omitted. The compatibility pass is skipped if reasoning.exclude is true, because no reasoning payload is emitted.

If you cannot change the request payload, target https://nano-gpt.com/api/v1legacy/chat/completions instead; the legacy endpoint keeps reasoning_content without extra flags. LiteLLM’s OpenAI adapter should point here to maintain compatibility. For clients that ignore reasoning-specific fields entirely, use https://nano-gpt.com/api/v1thinking/chat/completions so the full text appears in the standard content stream; this is the correct choice for JanitorAI.

{
  "model": "openai/gpt-5.2",
  "messages": [...],
  "reasoning": {
    "delta_field": "reasoning_content"
  }
}
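Client code that has to work against both the modern and the legacy field name can check them in order. A minimal sketch, assuming the message (or delta) has already been parsed into a dict; the helper name extract_reasoning is hypothetical.

```python
def extract_reasoning(message: dict):
    """Return reasoning text from either the modern or the legacy field, else None."""
    for key in ("reasoning", "reasoning_content"):
        value = message.get(key)
        if value:
            return value
    return None  # reasoning excluded or not produced


print(extract_reasoning({"content": "The answer is 4.", "reasoning_content": "2+2 equals 4."}))
print(extract_reasoning({"content": "The answer is 4."}))  # None
```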

Notes and limitations

GPU-TEE models (phala/*) require byte-for-byte SSE passthrough for signature verification. For those models, streaming cannot be filtered; the suffix has no effect on the streaming bytes.
When assistant content is an array (e.g., vision/text parts), only text parts are filtered; images and tool/metadata content are untouched.

Service tiers (flex and priority)

Set service_tier to request a non-default capacity tier on providers that support service tiers:

auto or omitted: use NanoGPT’s normal routing and the provider default.
default: request the provider’s standard tier where the provider accepts an explicit default value.
flex: request lower-cost, variable-capacity processing where supported.
priority: request higher-cost priority processing where supported.

Behavior notes:

When service_tier is "flex" or "priority", NanoGPT prefers routing to providers that support the requested tier.
Service tier availability is model- and provider-specific. Model pages show which tiers are supported.
Not all providers support service tiers, so tiered requests may be routed differently than default requests.
Header provider overrides (like X-Provider) and explicit provider selection are honored for pricing and x402 estimates.
Provider-native web search can force routing; tier pricing follows that routing.
If you explicitly force a provider that does not support service tiers, the requested tier may be ignored by the upstream provider, or routing and pricing may differ from the default route.

Billing note:

Flex requests are billed at flex rates where applicable.
Priority requests are billed at priority rates where applicable.
High-context pricing may also apply for models and providers with separate high-context SKUs, such as es2k pricing for GPT-5.5/GPT-5.4 where available.

Response note:

Responses include a top-level service_tier field when service_tier is provided on the request.

Example: flex tier

{
  "model": "openai/gpt-5.5",
  "messages": [
    { "role": "user", "content": "Give me a concise release note." }
  ],
  "service_tier": "flex"
}

Example: priority tier

{
  "model": "openai/gpt-5.5",
  "messages": [
    { "role": "user", "content": "Give me a concise release note." }
  ],
  "service_tier": "priority"
}
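A request builder can reject unknown tier names before the call is made and read back the echoed field afterwards. This is a sketch with hypothetical helper names; the four accepted values are the ones listed above, and the sample response is made up for illustration.

```python
SERVICE_TIERS = ("auto", "default", "flex", "priority")


def with_service_tier(payload: dict, tier: str) -> dict:
    """Return a copy of the payload with a validated service_tier."""
    if tier not in SERVICE_TIERS:
        raise ValueError(f"service_tier must be one of {SERVICE_TIERS}")
    return {**payload, "service_tier": tier}


def reported_tier(response: dict):
    """Read the top-level service_tier echoed on the response, if any."""
    return response.get("service_tier")


payload = with_service_tier(
    {"model": "openai/gpt-5.5", "messages": [{"role": "user", "content": "Hi"}]},
    "flex",
)
print(payload["service_tier"])  # flex
print(reported_tier({"service_tier": "flex", "choices": []}))  # illustrative response only
```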

YouTube Transcripts

Automatically fetch and prepend YouTube video transcripts when the latest user message contains YouTube links.

Defaults

Parameter: youtube_transcripts (boolean)
Default: false (opt-in)
Opt-in: set youtube_transcripts to true (string "true" is also accepted) to fetch transcripts
Limit: Up to 3 YouTube URLs processed per request
Higher volume: Use the standalone POST /api/youtube-transcribe endpoint for up to 10 URLs per request
Injection: Transcripts are added as a system message before your messages
Billing: $0.01 per transcript fetched

Enable automatic transcripts

By default, YouTube links are ignored. Set youtube_transcripts to true when you want the system to retrieve and bill for transcripts.

curl -X POST https://nano-gpt.com/api/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-3-flash-preview",
    "messages": [
      {"role": "user", "content": "Summarize this: https://youtu.be/dQw4w9WgXcQ"}
    ],
    "youtube_transcripts": true
  }'

Notes

Web scraping is separate. To scrape non‑YouTube URLs, set scraping: true. YouTube transcripts do not require scraping: true.
When not requested, YouTube links are ignored for transcript fetching and are not billed.
If your balance is insufficient when enabled, the request may be blocked with a 402.
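The billing rules above allow a rough upper-bound estimate before sending a request. The sketch below is built on assumptions: the URL pattern is a simplified guess at what counts as a YouTube link, "latest user message" is taken to mean the last message with role user, and every matched link is counted even if the same video appears twice. The actual charge depends on how many transcripts are fetched.

```python
import re
from decimal import Decimal

# Simplified pattern (assumption); the service may recognise more URL shapes.
YOUTUBE_URL = re.compile(r"https?://(?:www\.)?(?:youtube\.com/watch\?v=|youtu\.be/)[\w-]+")
MAX_URLS_PER_REQUEST = 3
PRICE_PER_TRANSCRIPT = Decimal("0.01")


def estimate_transcript_cost(messages, youtube_transcripts=False) -> Decimal:
    """Upper-bound transcript cost in USD for one chat completion request."""
    if not youtube_transcripts:
        return Decimal("0")  # links are ignored and not billed
    user_messages = [m for m in messages if m.get("role") == "user"]
    if not user_messages:
        return Decimal("0")
    text = user_messages[-1].get("content")
    if not isinstance(text, str):
        return Decimal("0")
    urls = YOUTUBE_URL.findall(text)
    return PRICE_PER_TRANSCRIPT * min(len(urls), MAX_URLS_PER_REQUEST)


msgs = [{"role": "user", "content": "Summarize this: https://youtu.be/dQw4w9WgXcQ"}]
print(estimate_transcript_cost(msgs, youtube_transcripts=True))  # 0.01
```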

Performance Benchmarks

LinkUp achieves state-of-the-art performance on OpenAI’s SimpleQA benchmark:

Provider	Score
LinkUp Deep Search	90.10%
Exa	90.04%
Perplexity Sonar Pro	86%
LinkUp Standard Search	85%
Perplexity Sonar	77%
Tavily	73%

Important Notes

Web search increases input token count, which affects total cost
Models gain access to real-time information published less than a minute ago
Internet connectivity can provide up to 10x improvement in factuality
All models support web search - append a suffix or send a webSearch object (linkup is supported as an alias)

Authorizations

Authorization

string

header

required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Headers

X-Provider

string

Optional explicit provider override for supported open-source models (case-insensitive). Explicit provider selection is billed pay-as-you-go at the selected provider's price, including provider-selection markup; for subscription users it bypasses subscription coverage for that request.

X-Billing-Mode

string

Optional billing override (for example, paygo). Forces pay-as-you-go without an explicit provider, or applies saved provider preferences to subscription-included traffic. Header name is case-insensitive.

Body

application/json

Parameters for chat completion

model

string

default:minimax/minimax-m2.7

required

The model to use for completion. The model value may include supported model suffixes, including web search (':online', ':online/'), memory (':memory', ':memory-<days>'), reasoning visibility (':reasoning-exclude'), thinking variants where listed by the model catalog (':thinking'), and provider routing preferences for eligible models (':fast', ':cheap', etc.).

Examples:

"minimax/minimax-m2.7"

"zai-org/glm-5:fast"

"zai-org/glm-5:cheap"

"openai/gpt-5.2:online/exa-instant"

"openai/gpt-5.2:online/exa-deep-reasoning"

messages

object[]

required

Array of message objects with role and content

billing_mode

string

Billing override to force pay-as-you-go without an explicit provider, or to apply saved provider preferences to subscription-included traffic. Accepted values (case-insensitive): paygo, pay-as-you-go, pay_as_you_go, paid, payg.

billingMode

string

Alias for billing_mode.

caching

boolean

default:false

When true, route to an available provider that is marked as prompt/input-caching capable for the requested provider-selection model. If no usable cache-capable provider exists, the request fails instead of falling back to a non-cache-capable provider. This is provider capability routing only; it does not add cache_control markers or configure cache TTLs.

stickyprovider

boolean

Top-level routing control used together with caching: true. When caching is true, stickyprovider defaults to true and NanoGPT prefers the previously recorded provider for later matching requests from the same API key or session, as long as that provider is still usable. Set it to false to require a cache-capable provider without sticky routing.

stickyProvider

boolean

CamelCase alias for top-level stickyprovider. Distinct from prompt_caching.stickyProvider, which controls explicit prompt-cache failover behavior.

stream

boolean

default:false

Whether to stream the response

service_tier

enum<string>

Optional service tier: "auto", "default", "flex", or "priority". Use "flex" for lower-cost variable-capacity processing or "priority" for higher-cost priority processing where supported by the routed model/provider.

Available options:

auto,

default,

flex,

priority

temperature

number

Classic randomness control. Accepts any decimal between 0-2. If omitted, NanoGPT does not force a value and the routed provider/model default applies

Required range: 0 <= x <= 2

max_tokens

integer

Upper bound on generated tokens. If omitted, NanoGPT does not enforce an explicit default and the routed provider/model default applies

Required range: x >= 1

top_p

number

default:1

Nucleus sampling. When set below 1.0, trims candidate tokens to the smallest set whose cumulative probability exceeds top_p. Works well as an alternative to tweaking temperature

Required range: 0 <= x <= 1

frequency_penalty

number

default:0

Penalizes tokens proportionally to how often they appeared previously. Negative values encourage repetition; positive values discourage it

Required range: -2 <= x <= 2

presence_penalty

number

default:0

Penalizes tokens based on whether they appeared at all. Good for keeping the model on topic without outright banning words

Required range: -2 <= x <= 2

repetition_penalty

number

Provider-agnostic repetition modifier (distinct from OpenAI penalties). Values >1 discourage repetition

Required range: -2 <= x <= 2

top_k

integer

Caps sampling to the top-k highest probability tokens per step

top_a

number

Combines top-p and temperature behavior; leave unset unless a model description explicitly calls for it

min_p

number

Ensures each candidate token probability exceeds a floor (0-1). Helpful for stopping models from collapsing into low-entropy loops

Required range: 0 <= x <= 1

tfs

number

Tail free sampling. Values between 0-1 let you shave the long tail of the distribution; 1.0 disables the feature

Required range: 0 <= x <= 1

eta_cutoff

number

Cut probabilities as soon as they fall below the specified tail threshold

epsilon_cutoff

number

Cut probabilities as soon as they fall below the specified tail threshold

typical_p

number

Typical sampling (aka entropy-based nucleus). Works like top_p but preserves tokens whose surprise matches the expected entropy

Required range: 0 <= x <= 1

mirostat_mode

enum<integer>

Enables Mirostat sampling for models that support it. Set to 1 or 2 to activate

Available options:

0,

1,

2

mirostat_tau

number

Mirostat target entropy parameter. Used when mirostat_mode is enabled

mirostat_eta

number

Mirostat learning rate parameter. Used when mirostat_mode is enabled

min_tokens

integer

default:0

For providers that support it, enforces a minimum completion length before stop conditions fire

Required range: x >= 0

stop

Stop sequences. Accepts string or array of strings. Values are passed directly to upstream providers

stop_token_ids

integer[]

Numeric array that lets callers stop generation on specific token IDs. Not supported by many providers

include_stop_str_in_output

boolean

default:false

When true, keeps the stop sequence in the final text. Not supported by many providers

ignore_eos

boolean

default:false

Allows completions to continue even if the model predicts EOS internally. Useful for long creative writing runs

no_repeat_ngram_size

integer

Extension that forbids repeating n-grams of the given size. Not supported by many providers

Required range: x >= 0

custom_token_bans

integer[]

List of token IDs to fully block

logit_bias

object

Object mapping token IDs to additive logits. Works just like OpenAI's version

logprobs

When true or a number, forwards the request to providers that support returning token-level log probabilities

prompt_logprobs

boolean

Requests logprobs on the prompt itself when the upstream API allows it

seed

integer

Numeric seed. Wherever supported, passes the value to make completions repeatable

prompt_caching

object

Helper to tag leading messages for explicit prompt-caching control (primarily Claude flows). Providers with implicit caching support (including OpenAI, Gemini, and many open-source provider routes) do not require this helper. NanoGPT injects cache_control blocks on each message up to the specified index before forwarding upstream. If cut_after_message_index is omitted, NanoGPT selects a cache boundary automatically.

reasoning_effort

enum<string>

Controls reasoning depth and acts as an explicit reasoning-mode signal. Any value other than "none" requests reasoning/thinking behavior. Use "none" to explicitly disable reasoning.

Available options:

none,

minimal,

low,

medium,

high,

xhigh

reasoning

object

Reasoning configuration. exclude controls output visibility (hides reasoning fields/blocks) and does not inherently disable reasoning compute. effort controls depth, and delta_field switches to legacy reasoning_content fields.

reasoning_delta_field

enum<string>

Shorthand for reasoning.delta_field

Available options:

reasoning_content

reasoning_content_compat

boolean

Shorthand to force legacy reasoning_content fields in the response

Response

Chat completion response

id

string

Unique identifier for the completion

object

string

Object type, always 'chat.completion'

created

integer

Unix timestamp of when the completion was created

choices

object[]

Array of completion choices

usage

object

service_tier

string

Service tier used (echoed when provided on the request)
