Overview
The NanoGPT API offers multiple ways to generate text, including OpenAI-compatible endpoints and our legacy options. This guide covers all available text generation methods.
If you are using a TEE-backed model (e.g., prefixed with TEE/), you can also verify the enclave attestation and signatures for your chat completions. See the TEE Model Verification guide for more details.
Provider Selection
Provider selection is available for supported open-source models. X-Provider explicitly selects a provider for the request and is always billed pay-as-you-go at the selected provider’s price, including provider-selection markup. For subscription users, sending X-Provider bypasses subscription coverage for that request; X-Billing-Mode: paygo is only needed when forcing pay-as-you-go without an explicit provider or when saved provider preferences should apply to subscription-included traffic. See Provider Selection and Pay-As-You-Go Billing Override.
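The header rules above can be sketched as a small helper. This is a minimal illustration, not an official client: the function name `build_headers` and the provider value `"example-provider"` are hypothetical, and actual provider IDs are listed in Provider Selection.

```python
def build_headers(api_key, provider=None, force_paygo=False):
    """Build request headers with optional provider selection or billing override."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if provider:
        # An explicit provider is always billed pay-as-you-go,
        # so X-Billing-Mode is not needed alongside it.
        headers["X-Provider"] = provider
    elif force_paygo:
        # Force pay-as-you-go without naming a provider.
        headers["X-Billing-Mode"] = "paygo"
    return headers


# "example-provider" is a placeholder, not a real provider ID.
print(build_headers("YOUR_API_KEY", provider="example-provider"))
```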
For one-off routing preferences, append a suffix to eligible model IDs:
- :fast / :speed for fastest estimated completion
- :cheap / :price / :floor for cheapest provider
- :throughput for highest TPS
- :latency for lowest TTFT
- :tools for tools-capable routing
See Model Suffixes for the complete suffix list and conflict rules.
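As a rough sketch, a suffix can be appended with plain string formatting. The helper name `with_routing_suffix` is hypothetical, `"model-id"` is a placeholder, and the set below covers only the suffixes listed on this page. It does not check model eligibility or the conflict rules described in Model Suffixes.

```python
# Routing suffixes named on this page (not the complete list).
ROUTING_SUFFIXES = {
    "fast", "speed", "cheap", "price", "floor",
    "throughput", "latency", "tools",
}


def with_routing_suffix(model, suffix):
    """Return the model ID with a one-off routing suffix appended."""
    if suffix not in ROUTING_SUFFIXES:
        raise ValueError(f"Unknown routing suffix: {suffix!r}")
    return f"{model}:{suffix}"


print(with_routing_suffix("model-id", "cheap"))  # model-id:cheap
```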
OpenAI Compatible Endpoints
Chat Completions (v1/chat/completions)
This endpoint mimics OpenAI’s chat completions API:
import requests
import json

BASE_URL = "https://nano-gpt.com/api/v1"
API_KEY = "YOUR_API_KEY"  # Replace with your API key

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
    "Accept": "text/event-stream"  # Required for SSE streaming
}

def stream_chat_completion(messages, model="openai/gpt-5.2"):
    """
    Send a streaming chat completion request using the OpenAI-compatible endpoint.
    """
    data = {
        "model": model,
        "messages": messages,
        "stream": True  # Enable streaming
    }
    response = requests.post(
        f"{BASE_URL}/chat/completions",
        headers=headers,
        json=data,
        stream=True
    )
    if response.status_code != 200:
        raise Exception(f"Error: {response.status_code}")

    for line in response.iter_lines():
        if line:
            line = line.decode('utf-8')
            if line.startswith('data: '):
                line = line[6:]
            if line == '[DONE]':
                break
            try:
                chunk = json.loads(line)
                if chunk['choices'][0]['delta'].get('content'):
                    yield chunk['choices'][0]['delta']['content']
            except json.JSONDecodeError:
                continue

# Example usage
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Please explain the concept of artificial intelligence."}
]

try:
    print("Assistant's Response:")
    for content_chunk in stream_chat_completion(messages):
        print(content_chunk, end='', flush=True)
    print("")
except Exception as e:
    print(f"Error: {str(e)}")
Responses (v1/responses)
Use the OpenAI Responses-compatible endpoint for stateful threading (previous_response_id), background processing, and Responses-style streaming events. See the dedicated docs at /api-reference/endpoint/responses.
import requests

BASE_URL = "https://nano-gpt.com/api/v1"
API_KEY = "YOUR_API_KEY"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

data = {
    "model": "openai/gpt-5.2",
    "input": "Summarize the Responses API in one sentence."
}

response = requests.post(
    f"{BASE_URL}/responses",
    headers=headers,
    json=data
)
response.raise_for_status()
print(response.json())
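Stateful threading with previous_response_id might look like the sketch below. It assumes the response JSON carries a top-level `id` field, as in OpenAI's Responses API. The helper name `build_followup_payload` and the ID `"resp_123"` are hypothetical, so confirm the exact schema at /api-reference/endpoint/responses.

```python
def build_followup_payload(previous_response, user_input, model="openai/gpt-5.2"):
    """Build a follow-up request body that continues an earlier response thread."""
    return {
        "model": model,
        "input": user_input,
        # Assumption: the earlier response exposes its ID as a top-level "id".
        "previous_response_id": previous_response["id"],
    }


# Stand-in for the JSON returned by an earlier POST to /responses.
first = {"id": "resp_123"}
print(build_followup_payload(first, "Now shorten that to five words."))
```

The resulting dictionary can be sent as the `json=` body of another POST to the same /responses endpoint.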
Direct Web Search (api/web)
Use POST /api/web when you need direct search control instead of chat orchestration:
- Explicit query payload control
- Linkup output types: searchResults, sourcedAnswer, structured
- Date and domain filters (fromDate, toDate, includeDomains, excludeDomains)
See Direct Web Search API for request/response schema, pricing, and error codes.
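A request body for this endpoint could be assembled roughly as follows. Only the field names mentioned above (query, fromDate, toDate, includeDomains, excludeDomains) are used. The helper name `build_web_search_payload`, the ISO-style date string, and the example domain are assumptions, and the authoritative schema is in Direct Web Search API.

```python
def build_web_search_payload(query, from_date=None, to_date=None,
                             include_domains=None, exclude_domains=None):
    """Build a JSON body for POST /api/web, omitting filters that are not set."""
    payload = {"query": query}
    optional = {
        "fromDate": from_date,
        "toDate": to_date,
        "includeDomains": include_domains,
        "excludeDomains": exclude_domains,
    }
    # Only include the filters the caller actually provided.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


# The date format shown here is an assumption, not a documented requirement.
print(build_web_search_payload(
    "latest quantum computing results",
    from_date="2025-01-01",
    include_domains=["example.com"],
))
```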
Text Completions (v1/completions)
This endpoint mimics OpenAI’s legacy text completions API:
POST /api/v1/completions is best effort. Performance and compatibility may be less consistent than POST /api/v1/chat/completions because not all providers we route to support the legacy completions API.
import requests
import json

BASE_URL = "https://nano-gpt.com/api/v1"
API_KEY = "YOUR_API_KEY"  # Replace with your API key

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

def get_completion(prompt, model="openai/gpt-5.2"):
    """
    Send a text completion request using the OpenAI-compatible endpoint.
    """
    data = {
        "model": model,
        "prompt": prompt,
        "max_tokens": 1000,  # Optional: maximum number of tokens to generate
        "temperature": 0.7,  # Optional: controls randomness (0-2)
        "top_p": 1,  # Optional: nucleus sampling parameter
        "stream": False  # Set to True for streaming responses
    }
    response = requests.post(
        f"{BASE_URL}/completions",
        headers=headers,
        json=data
    )
    if response.status_code != 200:
        raise Exception(f"Error: {response.status_code}")
    return response.json()

# Example usage
prompt = "Write a short story about a robot learning to paint:"
try:
    response = get_completion(prompt)
    print("Completion:", response['choices'][0]['text'])
except Exception as e:
    print(f"Error: {str(e)}")
Legacy Text Completions
For the older, non-OpenAI-compatible endpoint:
import requests
import json

BASE_URL = "https://nano-gpt.com/api"
API_KEY = "YOUR_API_KEY"

headers = {
    "x-api-key": API_KEY,
    "Content-Type": "application/json"
}

def talk_to_gpt(prompt, model="openai/gpt-5.2", messages=None):
    data = {
        "prompt": prompt,
        "model": model,
        # Avoid a shared mutable default: fall back to a fresh empty list
        "messages": messages or []
    }
    response = requests.post(f"{BASE_URL}/talk-to-gpt", headers=headers, json=data)
    return response.text if response.status_code == 200 else None

# Example usage
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing well, thank you! How can I assist you today?"}
]
prompt = "Please explain the concept of artificial intelligence."
response = talk_to_gpt(prompt, messages=messages)
if response:
    # Split the response to separate the text and NanoGPT info
    parts = response.split('<NanoGPT>')
    text_response = parts[0].strip()
    nano_info = json.loads(parts[1].split('</NanoGPT>')[0])
    print("NanoGPT Response:", text_response)
    print("Cost:", nano_info['cost'])
    print("Input Tokens:", nano_info['inputTokens'])
    print("Output Tokens:", nano_info['outputTokens'])
else:
    print("Failed to get response from GPT")
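The split on the `<NanoGPT>` tag shown in the example can be made more defensive. The sketch below is one possible approach, assuming the body is plain text followed by a single `<NanoGPT>...</NanoGPT>` JSON trailer as in the example. The helper name `split_nanogpt_response` is hypothetical, and the sample values are invented for illustration.

```python
import json


def split_nanogpt_response(raw):
    """Split a legacy response body into (text, info); info is None if missing or invalid."""
    text, open_tag, rest = raw.partition("<NanoGPT>")
    if not open_tag:
        return raw.strip(), None  # no metadata trailer present
    meta, close_tag, _ = rest.partition("</NanoGPT>")
    if not close_tag:
        return text.strip(), None  # trailer was truncated
    try:
        return text.strip(), json.loads(meta)
    except json.JSONDecodeError:
        return text.strip(), None


# Illustrative body with made-up numbers.
sample = 'AI is...\n<NanoGPT>{"cost": 0.001, "inputTokens": 12, "outputTokens": 34}</NanoGPT>'
text, info = split_nanogpt_response(sample)
print(text)
print(info["cost"] if info else "no usage info")
```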
Caching (Implicit and Explicit Controls)
For the full guide (supported models, thresholds, pricing, and usage fields), see Prompt Caching.
NanoGPT automatically applies implicit caching on providers/models that support it (including OpenAI, Gemini, and many open-source provider/model routes), so most requests do not need caching flags.
Set top-level caching: true when you want NanoGPT to route the request to any available provider that supports prompt/input caching. This is capability-based routing: you do not need to choose a provider. If no cache-capable provider is available for the model, the request fails rather than silently using a non-caching provider.
Use explicit prompt-caching controls (the prompt_caching helper, its promptCaching and body-level cache_control aliases, and inline cache_control markers) when you need Claude-specific cache boundaries, TTL selection, or prompt_caching.stickyProvider consistency control. Top-level caching: true does not add Anthropic-style cache_control markers or configure cache TTLs.
Cache-Capable Provider Routing
{
  "model": "model-id",
  "caching": true,
  "messages": [
    { "role": "user", "content": "Hello" }
  ]
}
By default, caching: true enables sticky provider routing. After the first successful matching request, NanoGPT will try to use the same provider for later matching requests from the same API key or session, improving the chance of provider-side cache hits. This does not guarantee that a request will be served from cache.
To require a cache-capable provider without sticky routing, set stickyprovider: false:
{
  "model": "model-id",
  "caching": true,
  "stickyprovider": false,
  "messages": [
    { "role": "user", "content": "Hello" }
  ]
}
Top-level stickyProvider is also accepted as a camelCase alias for stickyprovider.
For caching: true, NanoGPT filters to available, non-excluded, prompt-caching-capable providers; prefers the recorded sticky provider when enabled and still usable; otherwise chooses the cheapest cache-capable provider by base input + output price. Cache write/read pricing is used only as a tie-breaker.
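The selection order described above can be illustrated with a small local model. This is a sketch of the described behavior, not NanoGPT's implementation: the `Provider` fields, the single combined `cache_price` value, and all provider names and prices are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    available: bool
    excluded: bool
    supports_caching: bool
    input_price: float
    output_price: float
    cache_price: float  # hypothetical combined cache write/read price


def pick_cache_provider(providers, sticky_name=None, sticky_enabled=True):
    """Filter to usable cache-capable providers, prefer sticky, else cheapest."""
    candidates = [
        p for p in providers
        if p.available and not p.excluded and p.supports_caching
    ]
    if not candidates:
        # Mirrors the documented behavior: fail rather than use a non-caching provider.
        raise RuntimeError("No cache-capable provider available")
    if sticky_enabled and sticky_name:
        for p in candidates:
            if p.name == sticky_name:
                return p
    # Base input + output price decides; cache pricing only breaks ties.
    return min(candidates, key=lambda p: (p.input_price + p.output_price, p.cache_price))


providers = [
    Provider("a", True, False, True, 1.0, 2.0, 0.5),
    Provider("b", True, False, True, 1.5, 1.5, 0.25),
    Provider("c", True, False, False, 0.1, 0.1, 0.0),  # cheap but cannot cache
]
print(pick_cache_provider(providers).name)                   # b (tie broken by cache price)
print(pick_cache_provider(providers, sticky_name="a").name)  # a (sticky provider still usable)
```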
The prompt_caching / promptCaching helper accepts these options:
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | — | Enable prompt caching |
| ttl | string | "5m" | Cache time-to-live: "5m" or "1h" |
| cut_after_message_index / cutAfterMessageIndex | integer | — | Zero-based index; cache all messages up to and including this index |
| stickyProvider | boolean | false | When true, disable automatic failover to preserve explicit prompt-cache consistency. Returns a 503 error instead of switching services. |
Explicit cache_control markers
def chat_completion_with_prompt_cache(messages, model="anthropic/claude-sonnet-4.5"):
    """
    Attach cache_control directly to the static prompt blocks you want reused.
    """
    headers_with_cache = {**headers}  # reuse Authorization + Content-Type from above
    payload = {
        "model": model,
        "messages": messages,
        "temperature": 0.7,
        "max_tokens": 4000,
        "stream": False
    }
    response = requests.post(
        f"{BASE_URL}/chat/completions",
        headers=headers_with_cache,
        json=payload
    )
    response.raise_for_status()
    return response.json()

messages_with_breakpoint = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a financial watchdog. Answer in JSON with rationale fields.",
                "cache_control": {"type": "ephemeral", "ttl": "5m"}
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": (
                    "Context: <10 kB of policy + rubric that rarely changes>\n"
                    "A separate uncached message will carry the live question."
                ),
                "cache_control": {"type": "ephemeral", "ttl": "5m"}
            }
        ]
    },
    {"role": "user", "content": "What risks should I watch for in today's filing?"}
]

result = chat_completion_with_prompt_cache(messages_with_breakpoint)
print(result["choices"][0]["message"]["content"])
- cache_control belongs to individual content blocks (system, user, tool definitions, etc.). Each marker caches the entire prefix up to and including that block.
- Supported explicit TTLs are 5m and 1h (Claude flows). Omit ttl to use the default 5m window.
- anthropic-beta: prompt-caching-2024-07-31 is supported for compatibility and required for Anthropic-native Claude caching flows.
- For implicit-caching providers, no explicit cache_control markers are required.
- Check usage.prompt_tokens_details.cached_tokens in NanoGPT’s response to confirm what was billed at the discounted rate.
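Reading the cached-token count from a response could look like this. The helper name `cached_token_count` is hypothetical and the sample response is invented. The function returns 0 when the usage details are absent rather than raising.

```python
def cached_token_count(response_json):
    """Return usage.prompt_tokens_details.cached_tokens, or 0 if it is not reported."""
    usage = response_json.get("usage") or {}
    details = usage.get("prompt_tokens_details") or {}
    return details.get("cached_tokens") or 0


# Illustrative response fragment with made-up numbers.
sample = {"usage": {"prompt_tokens_details": {"cached_tokens": 2048}}}
print(cached_token_count(sample))  # 2048
print(cached_token_count({}))      # 0
```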
Using the prompt_caching helper
If you prefer not to duplicate cache_control entries manually, NanoGPT accepts a helper object that tags the leading prefix for you.
payload = {
    "model": "anthropic/claude-opus-4.5",
    "messages": [
        {"role": "system", "content": "Summaries must be under 100 words."},
        {"role": "user", "content": "Cache the playbook for an hour."},
        {"role": "user", "content": "Live question goes here"}
    ],
    "prompt_caching": {
        "enabled": True,
        "ttl": "1h",
        "cut_after_message_index": 1  # cache everything through message index 1
    }
}

requests.post(
    f"{BASE_URL}/chat/completions",
    headers={
        **headers,
        "anthropic-beta": "prompt-caching-2024-07-31"
    },
    json=payload
)
cut_after_message_index is zero-based and points at the last message in the static prefix. NanoGPT will attach a cache_control block with your TTL to each message up to that index before forwarding the request upstream. If you omit cut_after_message_index, NanoGPT will select a cache boundary automatically; set it explicitly if you need full control. If you need different cache durations or non-contiguous breakpoints, fall back to explicit cache_control markers in your messages array.
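To make the tagging behavior concrete, here is a local approximation of what the helper is described as doing. It is an illustration under assumptions, not NanoGPT's server-side code: the function name `tag_prefix` is hypothetical, and placing the marker on the last content block of each message is a guess at the detail.

```python
import copy


def tag_prefix(messages, cut_after_message_index, ttl="5m"):
    """Return a copy of messages with cache_control on each message up to the index."""
    if ttl not in ("5m", "1h"):
        raise ValueError("ttl must be '5m' or '1h'")
    if not 0 <= cut_after_message_index < len(messages):
        raise IndexError("cut_after_message_index is out of range")
    tagged = copy.deepcopy(messages)  # leave the caller's list untouched
    for msg in tagged[: cut_after_message_index + 1]:
        content = msg["content"]
        if isinstance(content, str):
            # Convert plain strings to the block form that can carry cache_control.
            content = [{"type": "text", "text": content}]
            msg["content"] = content
        if content:
            content[-1]["cache_control"] = {"type": "ephemeral", "ttl": ttl}
    return tagged


messages = [
    {"role": "system", "content": "Summaries must be under 100 words."},
    {"role": "user", "content": "Cache the playbook for an hour."},
    {"role": "user", "content": "Live question goes here"},
]
for msg in tag_prefix(messages, cut_after_message_index=1, ttl="1h"):
    print(msg)
```

With index 1, the first two messages gain a marker and the live question stays untagged.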
Explicit Prompt Cache Consistency
NanoGPT automatically fails over to backup services when the primary service is temporarily unavailable. While this ensures high availability, it can break your prompt cache because each backend service maintains its own separate cache.
If cache consistency is more important than availability for your use case, you can enable the stickyProvider option:
{
  "model": "anthropic/claude-sonnet-4.5",
  "messages": [...],
  "prompt_caching": {
    "enabled": true,
    "ttl": "5m",
    "stickyProvider": true
  }
}
Behavior:
- stickyProvider: false (default). If the primary service fails, NanoGPT automatically retries with a backup service. Your request succeeds, but the cache may be lost (you’ll pay full price for that request and need to rebuild the cache).
- stickyProvider: true. If the primary service fails, NanoGPT returns a 503 error instead of failing over. Your cache remains intact for when the service recovers.
When to use stickyProvider: true:
- You have very large cached contexts where cache misses are expensive
- You prefer to retry failed requests yourself rather than pay for cache rebuilds
- Cost predictability is more important than request success rate
When to use stickyProvider: false (default):
- You prefer requests to always succeed when possible
- Occasional cache misses are acceptable
- You’re using shorter contexts where cache rebuilds are inexpensive
Error response when stickyProvider blocks a failover:
{
  "error": {
    "message": "Service is temporarily unavailable. Fallback disabled to preserve prompt cache consistency. Switching services would invalidate your cached tokens. Remove stickyProvider option or retry later.",
    "status": 503,
    "type": "service_unavailable",
    "code": "fallback_blocked_for_cache_consistency"
  }
}
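If you choose to retry these errors yourself, a client-side loop might look like the sketch below. The `send` callable, the attempt count, and the exponential backoff delays are all assumptions for illustration. `send` is expected to perform the request and return a `(status_code, json_body)` tuple.

```python
import time

BLOCKED_CODE = "fallback_blocked_for_cache_consistency"


def is_cache_fallback_block(status_code, body):
    """True when a 503 was returned because stickyProvider blocked a failover."""
    error = (body or {}).get("error") or {}
    return status_code == 503 and error.get("code") == BLOCKED_CODE


def send_with_retry(send, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call send() until it stops returning the blocked-failover error."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if not is_cache_fallback_block(status, body):
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    return status, body  # still blocked after the final attempt


# Simulated sender: blocked once, then succeeds (no network involved).
replies = iter([
    (503, {"error": {"code": BLOCKED_CODE}}),
    (200, {"choices": []}),
])
print(send_with_retry(lambda: next(replies), sleep=lambda s: None))
```

Other errors and successful responses are returned immediately, so only the cache-consistency 503 triggers a wait.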
Chat Completions with Web Search
Enable real-time web access for any model by appending special suffixes:
def chat_completion_with_web_search(messages, model="openai/gpt-5.2", search_depth="standard"):
    """
    Send a chat completion request with web search enabled.

    Args:
        messages: List of message objects
        model: Base model name
        search_depth: "standard" ($0.006) or "deep" ($0.06)
    """
    # Append the appropriate suffix for web search
    if search_depth == "deep":
        model_with_search = f"{model}:online/linkup-deep"
    else:
        model_with_search = f"{model}:online"

    data = {
        "model": model_with_search,
        "messages": messages,
        "stream": True
    }
    response = requests.post(
        f"{BASE_URL}/chat/completions",
        headers=headers,
        json=data,
        stream=True
    )
    if response.status_code != 200:
        raise Exception(f"Error: {response.status_code}")

    for line in response.iter_lines():
        if line:
            line = line.decode('utf-8')
            if line.startswith('data: '):
                line = line[6:]
            if line == '[DONE]':
                break
            try:
                chunk = json.loads(line)
                if chunk['choices'][0]['delta'].get('content'):
                    yield chunk['choices'][0]['delta']['content']
            except json.JSONDecodeError:
                continue

# Example: Get current information
messages = [
    {"role": "user", "content": "What happened in the tech industry this week?"}
]
print("Standard web search:")
for content in chat_completion_with_web_search(messages):
    print(content, end='', flush=True)

# Example: Deep research
research_messages = [
    {"role": "user", "content": "Provide a comprehensive analysis of the latest developments in quantum computing"}
]
print("\n\nDeep web search:")
for content in chat_completion_with_web_search(research_messages, search_depth="deep"):
    print(content, end='', flush=True)
Web Search Options
- :online - Standard search with 10 results ($0.006 per request)
- :online/linkup-deep - Deep iterative search ($0.06 per request)
For the full provider-specific suffix list, including :online/exa-instant, :online/exa-deep-reasoning, :online/brave, and :online/valyu-web-deep, see Model Suffixes.
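For budgeting, the per-request search prices quoted above can be turned into a quick estimate. This sketch covers only the search surcharge for the two options listed here and ignores token costs. The helper name `estimate_search_cost` is hypothetical.

```python
from decimal import Decimal

# Per-request search prices in USD, as quoted on this page.
SEARCH_PRICE_USD = {
    "standard": Decimal("0.006"),  # :online
    "deep": Decimal("0.06"),       # :online/linkup-deep
}


def estimate_search_cost(num_requests, search_depth="standard"):
    """Estimate the web search surcharge for a number of requests (token costs excluded)."""
    if search_depth not in SEARCH_PRICE_USD:
        raise ValueError(f"Unknown search depth: {search_depth!r}")
    return SEARCH_PRICE_USD[search_depth] * num_requests


print(estimate_search_cost(100))         # 0.600
print(estimate_search_cost(100, "deep")) # 6.00
```

Decimal is used instead of float so that small per-request prices add up without rounding drift.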
Web search can substantially improve factuality: Gemini 3 Flash Preview with web access shows a 10x improvement in accuracy, making it twice as accurate as non-web baselines.
For direct /api/web usage with structured output, domain/date filters, and explicit query control, see Direct Web Search API.