Creates a chat completion for the provided messages
Documentation Index
Fetch the complete documentation index at: https://docs.nano-gpt.com/llms.txt
Use this file to discover all available pages before exploring further.
https://nano-gpt.com/api/subscription/v1/chat/completions (swap /api/v1 for /api/subscription/v1).X-Provider explicitly selects a provider for the request and is always billed pay-as-you-go at the selected provider’s price, including provider-selection markup. Provider-selection-capable models also support routing preference suffixes such as :fast and :cheap. For subscription users, sending X-Provider bypasses subscription coverage for that request; X-Billing-Mode: paygo is only needed when forcing pay-as-you-go without an explicit provider or when saved provider preferences should apply to subscription-included traffic. See Provider Selection, Model Suffixes, and Pay-As-You-Go Billing Override.X-X402: true header. See X-402 Micropayments for details.Sampling & decoding
Structured outputs
Web search
Images & caching
Memory & reasoning
/api/v1/chat/completions endpoint supports OpenAI-compatible function calling. You can describe callable functions in the tools array, control when the model may invoke them, and continue the conversation by echoing tool role messages that reference the assistant’s chosen call.
tools (optional array): Each entry must be { "type": "function", "function": { "name": string, "description"?: string, "parameters"?: JSON-Schema object } }. Only function tools are accepted. The serialized tools payload is limited to 200 KB (overrides via TOOL_SPEC_MAX_BYTES); violating the shape or size yields a 400 with tool_spec_too_large, invalid_tool_spec, or invalid_tool_spec_parse.tool_choice (optional string or object): Defaults to auto. Set "none" to guarantee no tool calls (the server also drops the tools payload upstream), "required" to force the next response to be a tool call, or { "type": "function", "function": { "name": "your_function" } } to pin the exact function.parallel_tool_calls (optional boolean): When true the flag is forwarded to providers that support issuing multiple tool calls in a single turn. Models that ignore the flag fall back to sequential calls.messages[].tool_calls (assistant role): Persist the tool call metadata returned by the model so future turns can see which functions were invoked. Each item uses the OpenAI shape { id, type: "function", function: { name, arguments } }.messages[] with role: "tool": Respond to the model by sending { "role": "tool", "tool_call_id": "<assistant tool_calls id>", "content": "<JSON or text payload>" }. The server drops any tool response that references an unknown tool_call_id, so keep the IDs in sync.tool_choice: "none" with a tools array the request is accepted but the tools are omitted before hitting the model; invalid schemas or oversize payloads return the error codes above.tool_calls schema, so consumers can reuse their existing parsing logic without changes.
:fast and :cheap. See Provider Selection > Per-Request Routing Preference for the full list and billing rules, or Model Suffixes for all suffix composition rules.
/api/v1/chat/completions endpoint accepts a full set of sampling and decoding knobs. All fields are optional; omit any you want to leave at provider defaults.
| Parameter | Range/Default | Description |
|---|---|---|
temperature | 0–2 (provider default) | Classic randomness control; higher values explore more. If omitted, NanoGPT does not force a value and the routed provider/model default applies. |
top_p | 0–1 (default 1) | Nucleus sampling that trims to the smallest set above top_p cumulative probability. |
top_k | 1+ | Sample only from the top-k tokens each step. |
top_a | provider default | Blends temperature and nucleus behavior; set only if a model calls for it. |
min_p | 0–1 | Require each candidate token to exceed a probability floor. |
tfs | 0–1 | Tail free sampling; 1 disables. |
eta_cutoff / epsilon_cutoff | provider default | Drop tokens once they fall below the tail thresholds. |
typical_p | 0–1 | Entropy-based nucleus sampling; keeps tokens whose surprise matches expected entropy. |
mirostat_mode | 0/1/2 | Enable Mirostat sampling; set tau/eta when active. |
mirostat_tau / mirostat_eta | provider default | Target entropy and learning rate for Mirostat. |
| Parameter | Range/Default | Description |
|---|---|---|
max_tokens | 1+ (provider default) | Upper bound on generated tokens. If omitted, NanoGPT does not enforce an explicit default and the routed provider/model default applies. |
min_tokens | 0+ (default 0) | Minimum completion length when provider supports it. |
stop | string or string[] | Stop sequences passed upstream. |
stop_token_ids | int[] | Stop generation on specific token IDs (limited provider support). |
include_stop_str_in_output | boolean (default false) | Keep the stop sequence in the final text where supported. |
ignore_eos | boolean (default false) | Continue even if the model predicts EOS internally. |
| Parameter | Range/Default | Description |
|---|---|---|
frequency_penalty | -2 – 2 (default 0) | Penalize tokens proportional to prior frequency. |
presence_penalty | -2 – 2 (default 0) | Penalize tokens based on whether they appeared at all. |
repetition_penalty | -2 – 2 | Provider-agnostic repetition modifier; >1 discourages repeats. |
no_repeat_ngram_size | 0+ | Forbid repeating n-grams of the given size (limited support). |
custom_token_bans | int[] | Fully block listed token IDs. |
| Parameter | Range/Default | Description |
|---|---|---|
logit_bias | object | Map token IDs to additive logits (OpenAI-compatible). |
logprobs | boolean or int | Return token-level logprobs where supported. |
prompt_logprobs | boolean | Request logprobs on the prompt when available. |
seed | integer | Make completions repeatable where the provider allows it. |
temperature + top_p + top_k), but overly narrow settings may lead to early stops./api/v1/chat/completions endpoint supports OpenAI-compatible structured outputs via the response_format parameter. This ensures the model returns valid JSON matching your specified schema.
| Type | Description |
|---|---|
json_object | Forces the model to return valid JSON |
json_schema | Forces the model to return JSON matching a specific schema |
text | Default text output (no constraint) |
strict: true:
requiredadditionalProperties: falseresponse_format parameter is compatible with Vercel AI SDK’s generateObject:
name field in json_schema is required and should describe the outputJSON.parse() in your applicationwebSearch object in the request body. The legacy linkup object is still supported as an alias. If webSearch.enabled (or linkup.enabled) is true, it takes precedence over any model suffix.
OpenAI native web search: GPT-5+ / o1 / o3 / o4 models use OpenAI’s built-in web search automatically. No suffix is required; you can still set webSearch.search_context_size and webSearch.user_location. To force a different provider, specify a provider or suffix.
If you need full direct control over the search call itself (query, outputType, date/domain filters, or structured schema output), use Direct Web Search API (POST /api/web).
| Use case | Recommended endpoint |
|---|---|
| Model should answer with web context in one call | POST /api/v1/chat/completions |
| You need raw/structured web payload control | POST /api/web |
model value:
:online (default web search, standard depth):online/linkup (Linkup, standard):online/linkup-deep (Linkup, deep):online/tavily (Tavily, standard):online/tavily-deep (Tavily, deep):online/brave (Brave, standard):online/brave-deep (Brave, deep):online/exa-fast (Exa, fast):online/exa-auto (Exa, auto):online/exa-neural (Exa, neural):online/exa-deep (Exa, deep):online/exa-instant (Exa, instant):online/exa-deep-reasoning (Exa, deep-reasoning):online/kagi (Kagi, standard, search):online/kagi-web (Kagi, standard, web):online/kagi-news (Kagi, standard, news):online/kagi-search (Kagi, deep, search):online/perplexity (Perplexity, standard):online/perplexity-deep (Perplexity, deep):online/valyu (Valyu, standard, all sources):online/valyu-deep (Valyu, deep, all sources):online/valyu-web (Valyu, standard, web only):online/valyu-web-deep (Valyu, deep, web only):online without an explicit provider uses the default web search backend (Linkup).
webSearch object in the request body. The legacy linkup object is accepted as an alias. This works with or without a model suffix and controls web search across all providers.
webSearch fields:
enabled (boolean, required to activate web search)provider (string): linkup | tavily | brave | exa | kagi | perplexity | valyudepth (string):
standard or deepfast, auto, neural, deep, instant, deep-reasoning (use standard if you want auto)standard or deep (search source only)search_context_size or searchContextSize (string, OpenAI native): low | medium | high (default: medium)user_location or userLocation (object, OpenAI native): { type: "approximate", country, city, region }searchType (string, Valyu only): all | webkagiSource or kagi_source (string, Kagi only): web | news | searchwebSearch)searchDomainFilter max 20 entries; searchLanguageFilter max 10 entries (ISO 639-1).
countryCode uses a 2-letter ISO country code.
| Provider | Standard | Deep | Notes |
|---|---|---|---|
| Linkup | $0.006 | $0.06 | Default provider |
| Tavily | $0.008 | $0.016 | Good value, free tier available |
| Exa | $0.005 base | + $0.001/page | For contents retrieval |
| Kagi Web/News | $0.002 | N/A | Cheapest for enrichment |
| Kagi Search | $0.025 | N/A | Full search mode |
| Perplexity | $0.005 | N/A | Flat rate |
| Valyu | ~$0.0015/result | Variable | Dynamic pricing |
| Brave | $0.005 | $0.005 | Flat rate |
| OpenAI Native | $0.01 + tokens | N/A | Per-call fee + model token costs |
x-use-byok: true or byok.enabled: truex-byok-provider or byok.provider:online without an explicit provider uses OpenAI native web search. If you set webSearch.provider or use an explicit :online/<provider> suffix, that provider is used instead.:online (and provider/depth suffixes) are stripped from the model name before routing to the base model; the suffix only controls search behavior./api/web).scraping: true URL handling: When enabled, NanoGPT scans messages for public http(s) URLs, ignores local/private URLs, de-duplicates, and caps at 5. If no eligible URLs are found, scraping is skipped. Inline scraping in chat is billed at $0.0015 per successfully scraped URL. For explicit URL lists and the standalone endpoint price ($0.001 per URL), use /scrape-urls.messages array.
{"type":"image_url","image_url":{"url":"https://..."}}{"type":"image_url","image_url":{"url":"data:image/png;base64,...."}}image/png, image/jpeg, image/jpg, image/webp.) are auto‑normalized into structured parts server‑side....BASE64... with your image bytes.
data: { ... } lines until a final terminator. Usage metrics appear only when requested: set stream_options.include_usage to true for streaming responses, or send "include_usage": true on non-streaming calls.
Note: Prompt-caching helpers implicitly force include_usage, so cached requests still receive usage data without extra flags.
caching: true when you want NanoGPT to route the request to any available provider that supports prompt/input caching. This is capability-based routing: you do not need to choose a provider. If no cache-capable provider is available for the model, the request fails rather than silently using a non-caching provider.
Use explicit prompt-caching controls (prompt_caching, promptCaching, and body-level cache_control alias, plus inline cache_control) when you need Claude-specific cache boundaries, TTL selection, or prompt_caching.stickyProvider consistency control. Top-level caching: true does not add Anthropic-style cache_control markers or configure cache TTLs.
caching: true is provider routing, not prompt-cache annotation. It requires the routed provider to be marked as prompt-caching capable for the requested provider-selection model.
caching: true also enables sticky provider routing. After the first successful matching request, NanoGPT will try to use the same provider for later matching requests from the same API key or session, improving the chance of provider-side cache hits. This does not guarantee that a request will be served from cache.
To require a cache-capable provider without sticky routing, set stickyprovider: false:
| Parameter | Type | Default | Description |
|---|---|---|---|
caching | boolean | false | Require a cache-capable provider for this request. If none is usable for the model, the request fails. |
stickyprovider | boolean | true when caching: true | Prefer the previously recorded provider for later matching cache-capable requests. Set false to restore non-sticky cache-capable routing. |
stickyProvider | boolean | Alias | CamelCase alias for top-level stickyprovider. Use stickyprovider in examples. |
caching: true, routing works as follows:
prompt_caching / promptCaching helper accepts these options:
| Parameter | Type | Default | Description |
|---|---|---|---|
enabled | boolean | — | Enable prompt caching |
ttl | string | "5m" | Cache time-to-live: "5m" or "1h" |
cut_after_message_index / cutAfterMessageIndex | integer | — | Zero-based index; cache all messages up to and including this index |
stickyProvider | boolean | false | When true, disable automatic failover to preserve explicit prompt-cache consistency. Returns 503 error instead of switching services. |
cache_control marker caches the full prefix up to that block. Place them on every static chunk (system messages, tool definitions, large contexts) you plan to reuse.5m and 1h for Claude caching flows. See Prompt Caching.anthropic-beta: prompt-caching-2024-07-31 is supported for compatibility (and required for Anthropic-native Claude caching flows).cache_control markers are required.cut_after_message_index is zero-based. If omitted, NanoGPT will select a cache boundary automatically; set it explicitly if you need full control. Switch back to explicit cache_control blocks if you need multiple cache breakpoints or mixed TTLs in the same payload.
stickyProvider option:
stickyProvider: false (default) — If the primary service fails, NanoGPT automatically retries with a backup service. Your request succeeds, but the cache may be lost (you’ll pay full price for that request and need to rebuild the cache).stickyProvider: true — If the primary service fails, NanoGPT returns a 503 error instead of failing over. Your cache remains intact for when the service recovers.stickyProvider: true:
stickyProvider: false (default):
include_usage is true in the payload or that prompt caching is enabled.:memory to any model namememory: true:online:memory:memory-<days> (1..365) or header memory_expiration_days: <days>; header takes precedencemodel_context_limit.
model_context_limit (number or numeric string):thinking is model-specific and only works when that exact ID (or a documented alias) exists.
-thinking is a legacy alias pattern for some model families only, not universal.
Do not assume -thinking works for arbitrary model IDs. Always check GET /api/v1/models for exact valid IDs.
See also: Extended Thinking (Reasoning).
https://nano-gpt.com/api/v1/chat/completions — default endpoint that streams internal thoughts through choices[0].delta.reasoning (and repeats them in message.reasoning on completion). Recommended for apps like SillyTavern that understand the modern response shape.https://nano-gpt.com/api/v1legacy/chat/completions — legacy contract that swaps the field name to choices[0].delta.reasoning_content / message.reasoning_content for older OpenAI-compatible clients. Use this for LiteLLM’s OpenAI adapter to avoid downstream parsing errors.https://nano-gpt.com/api/v1thinking/chat/completions — reasoning-aware models write everything into the normal choices[0].delta.content stream so clients that ignore reasoning fields still see the full conversation transcript. This is the preferred base URL for JanitorAI.choices[0].delta.content and the thought process in choices[0].delta.reasoning (plus optional delta.reasoning_details). Reasoning deltas are dispatched before or alongside regular content, letting you render both panes in real-time.
choices[0].message.content contains the assistant reply and choices[0].message.reasoning (plus reasoning_details when available) contains the full chain-of-thought. Non-streaming requests reuse the same formatter, so the reasoning block is present as a dedicated field.
reasoning: { "exclude": true } to strip the reasoning payload from both streaming deltas and the final message. With this flag set, delta.reasoning and message.reasoning are omitted entirely.
reasoning_effort (or reasoning.effort) controls reasoning depth and also acts as an explicit reasoning-mode signal.
Any value other than "none" is treated as a request to enable reasoning/thinking behavior.
Use "none" to explicitly disable reasoning behavior.
reasoning_effort| Value | Description |
|---|---|
none | Explicitly disables reasoning |
minimal | Lowest reasoning depth |
low | Low reasoning depth |
medium | Medium reasoning depth |
high | High reasoning depth |
xhigh | Maximum reasoning depth |
reasoning_effort parameter can be passed at the top level:
reasoning object:
reasoning_effort is authoritative for Chat Completions request shaping.
reasoning.exclude controls output visibility only. It hides reasoning fields/blocks, but does not inherently disable reasoning compute.
If an effort level is set to a non-none value, reasoning can still run while hidden.
:reasoning-exclude:reasoning-exclude to the model name.
{ "reasoning": { "exclude": true } }:reasoning-exclude suffix is stripped before the request is routed; other suffixes remain active:reasoning-exclude composes safely with the other routing suffixes you already use:
:thinking (when that exact model ID exists). -thinking variants are legacy aliases for some families only.:online and :online/linkup-deep:memory and :memory-<days>anthropic/claude-sonnet-4.6:thinking:8192:reasoning-excludeopenai/gpt-5.2:online:reasoning-excludeanthropic/claude-opus-4.6:memory-30:online/linkup-deep:reasoning-excludezai-org/glm-5:fast:reasoning-excludezai-org/glm-5:cheap:reasoning-excludereasoning_content field can opt in per request. Set reasoning.delta_field to "reasoning_content", or use the top-level shorthands reasoning_delta_field / reasoning_content_compat if updating nested objects is difficult. When the toggle is active, every streaming and non-streaming response exposes reasoning_content instead of reasoning, and the modern key is omitted. The compatibility pass is skipped if reasoning.exclude is true, because no reasoning payload is emitted. If you cannot change the request payload, target https://nano-gpt.com/api/v1legacy/chat/completions instead—the legacy endpoint keeps reasoning_content without extra flags. LiteLLM’s OpenAI adapter should point here to maintain compatibility. For clients that ignore reasoning-specific fields entirely, use https://nano-gpt.com/api/v1thinking/chat/completions so the full text appears in the standard content stream; this is the correct choice for JanitorAI.
phala/*) require byte-for-byte SSE passthrough for signature verification. For those models, streaming cannot be filtered; the suffix has no effect on the streaming bytes.service_tier to request a non-default capacity tier on providers that support service tiers:
auto or omitted: use NanoGPT’s normal routing and the provider default.default: request the provider’s standard tier where the provider accepts an explicit default value.flex: request lower-cost, variable-capacity processing where supported.priority: request higher-cost priority processing where supported.service_tier is "flex" or "priority", NanoGPT prefers routing to providers that support the requested tier.X-Provider) and explicit provider selection are honored for pricing and x402 estimates.es2k pricing for GPT-5.5/GPT-5.4 where available.service_tier field when it is provided on the request.youtube_transcripts (boolean)false (opt-in)youtube_transcripts to true (string "true" is also accepted) to fetch transcriptsPOST /api/youtube-transcribe endpoint for up to 10 URLs per requestyoutube_transcripts to true when you want the system to retrieve and bill for transcripts.
scraping: true. YouTube transcripts do not require scraping: true.| Provider | Score |
|---|---|
| LinkUp Deep Search | 90.10% |
| Exa | 90.04% |
| Perplexity Sonar Pro | 86% |
| LinkUp Standard Search | 85% |
| Perplexity Sonar | 77% |
| Tavily | 73% |
webSearch object (linkup is supported as an alias)Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
Optional explicit provider override for supported open-source models (case-insensitive). Explicit provider selection is billed pay-as-you-go at the selected provider's price, including provider-selection markup; for subscription users it bypasses subscription coverage for that request.
Optional billing override to force pay-as-you-go without an explicit provider, or to apply saved provider preferences to subscription-included traffic (e.g., paygo). Header name is case-insensitive.
Parameters for chat completion
The model to use for completion. The model value may include supported model suffixes, including web search (':online', ':online/'), memory (':memory', ':memory-'), reasoning visibility (':reasoning-exclude'), thinking variants where listed by the model catalog (':thinking'), and provider routing preferences for eligible models (':fast', ':cheap', etc.).
"minimax/minimax-m2.7"
"zai-org/glm-5:fast"
"zai-org/glm-5:cheap"
"openai/gpt-5.2:online/exa-instant"
"openai/gpt-5.2:online/exa-deep-reasoning"
Array of message objects with role and content
Billing override to force pay-as-you-go without an explicit provider, or to apply saved provider preferences to subscription-included traffic. Accepted values (case-insensitive): paygo, pay-as-you-go, pay_as_you_go, paid, payg.
Alias for billing_mode.
When true, route to an available provider that is marked as prompt/input-caching capable for the requested provider-selection model. If no usable cache-capable provider exists, the request fails instead of falling back to a non-cache-capable provider. This is provider capability routing only; it does not add cache_control markers or configure cache TTLs.
Top-level routing control for caching: true. When caching is true, defaults to true and prefers the previously recorded provider for later matching requests from the same API key or session when still usable. Set false to require a cache-capable provider without sticky routing.
CamelCase alias for top-level stickyprovider. Distinct from prompt_caching.stickyProvider, which controls explicit prompt-cache failover behavior.
Whether to stream the response
Optional service tier: "auto", "default", "flex", or "priority". Use "flex" for lower-cost variable-capacity processing or "priority" for higher-cost priority processing where supported by the routed model/provider.
auto, default, flex, priority Classic randomness control. Accepts any decimal between 0-2. If omitted, NanoGPT does not force a value and the routed provider/model default applies
0 <= x <= 2Upper bound on generated tokens. If omitted, NanoGPT does not enforce an explicit default and the routed provider/model default applies
x >= 1Nucleus sampling. When set below 1.0, trims candidate tokens to the smallest set whose cumulative probability exceeds top_p. Works well as an alternative to tweaking temperature
0 <= x <= 1Penalizes tokens proportionally to how often they appeared previously. Negative values encourage repetition; positive values discourage it
-2 <= x <= 2Penalizes tokens based on whether they appeared at all. Good for keeping the model on topic without outright banning words
-2 <= x <= 2Provider-agnostic repetition modifier (distinct from OpenAI penalties). Values >1 discourage repetition
-2 <= x <= 2Caps sampling to the top-k highest probability tokens per step
Combines top-p and temperature behavior; leave unset unless a model description explicitly calls for it
Ensures each candidate token probability exceeds a floor (0-1). Helpful for stopping models from collapsing into low-entropy loops
0 <= x <= 1Tail free sampling. Values between 0-1 let you shave the long tail of the distribution; 1.0 disables the feature
0 <= x <= 1Cut probabilities as soon as they fall below the specified tail threshold
Cut probabilities as soon as they fall below the specified tail threshold
Typical sampling (aka entropy-based nucleus). Works like top_p but preserves tokens whose surprise matches the expected entropy
0 <= x <= 1Enables Mirostat sampling for models that support it. Set to 1 or 2 to activate
0, 1, 2 Mirostat target entropy parameter. Used when mirostat_mode is enabled
Mirostat learning rate parameter. Used when mirostat_mode is enabled
For providers that support it, enforces a minimum completion length before stop conditions fire
x >= 0Stop sequences. Accepts string or array of strings. Values are passed directly to upstream providers
Numeric array that lets callers stop generation on specific token IDs. Not supported by many providers
When true, keeps the stop sequence in the final text. Not supported by many providers
Allows completions to continue even if the model predicts EOS internally. Useful for long creative writing runs
Extension that forbids repeating n-grams of the given size. Not supported by many providers
x >= 0List of token IDs to fully block
Object mapping token IDs to additive logits. Works just like OpenAI's version
When true or a number, forwards the request to providers that support returning token-level log probabilities
Requests logprobs on the prompt itself when the upstream API allows it
Numeric seed. Wherever supported, passes the value to make completions repeatable
Helper to tag leading messages for explicit prompt-caching control (primarily Claude flows). Providers with implicit caching support (including OpenAI, Gemini, and many open-source provider routes) do not require this helper. NanoGPT injects cache_control blocks on each message up to the specified index before forwarding upstream. If cut_after_message_index is omitted, NanoGPT selects a cache boundary automatically.
Controls reasoning depth and acts as an explicit reasoning-mode signal. Any value other than "none" requests reasoning/thinking behavior. Use "none" to explicitly disable reasoning.
none, minimal, low, medium, high, xhigh Reasoning configuration. exclude controls output visibility (hides reasoning fields/blocks) and does not inherently disable reasoning compute. effort controls depth, and delta_field switches to legacy reasoning_content fields.
Shorthand for reasoning.delta_field
reasoning_content Shorthand to force legacy reasoning_content fields in the response
Chat completion response