Text Generation

Overview

The NanoGPT API offers multiple ways to generate text, including OpenAI-compatible endpoints and our legacy options. This guide covers all available text generation methods. If you are using a TEE-backed model (e.g., prefixed with TEE/), you can also verify the enclave attestation and signatures for your chat completions. See the TEE Model Verification guide for more details.

Provider Selection

Pay-as-you-go requests can select an upstream provider for supported open-source models using the X-Provider header or saved preferences. Subscription endpoints ignore provider selection unless you force paid routing with the pay-as-you-go billing override (billing_mode: "paygo" or X-Billing-Mode: paygo). See Provider Selection and Pay-As-You-Go Billing Override.

OpenAI Compatible Endpoints

Chat Completions (v1/chat/completions)

This endpoint mimics OpenAI’s chat completions API:

import requests
import json

BASE_URL = "https://nano-gpt.com/api/v1"
API_KEY = "YOUR_API_KEY"  # Replace with your API key

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
    "Accept": "text/event-stream"  # Required for SSE streaming
}

def stream_chat_completion(messages, model="openai/gpt-5.2"):
    """
    Send a streaming chat completion request using the OpenAI-compatible endpoint.
    """
    data = {
        "model": model,
        "messages": messages,
        "stream": True  # Enable streaming
    }

    response = requests.post(
        f"{BASE_URL}/chat/completions",
        headers=headers,
        json=data,
        stream=True
    )

    if response.status_code != 200:
        raise Exception(f"Error: {response.status_code}")

    for line in response.iter_lines():
        if line:
            line = line.decode('utf-8')
            if line.startswith('data: '):
                line = line[6:]
            if line == '[DONE]':
                break
            try:
                chunk = json.loads(line)
                if chunk['choices'][0]['delta'].get('content'):
                    yield chunk['choices'][0]['delta']['content']
            except json.JSONDecodeError:
                continue

# Example usage
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Please explain the concept of artificial intelligence."}
]

try:
    print("Assistant's Response:")
    for content_chunk in stream_chat_completion(messages):
        print(content_chunk, end='', flush=True)
    print("")
except Exception as e:
    print(f"Error: {str(e)}")

Responses (v1/responses)

Use the OpenAI Responses-compatible endpoint for stateful threading (previous_response_id), background processing, and Responses-style streaming events. See the dedicated docs at /api-reference/endpoint/responses.

import requests

BASE_URL = "https://nano-gpt.com/api/v1"
API_KEY = "YOUR_API_KEY"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

data = {
    "model": "openai/gpt-5.2",
    "input": "Summarize the Responses API in one sentence."
}

response = requests.post(
    f"{BASE_URL}/responses",
    headers=headers,
    json=data
)

response.raise_for_status()
print(response.json())

Text Completions (v1/completions)

This endpoint mimics OpenAI’s legacy text completions API:

import requests
import json

BASE_URL = "https://nano-gpt.com/api/v1"
API_KEY = "YOUR_API_KEY"  # Replace with your API key

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

def get_completion(prompt, model="openai/gpt-5.2"):
    """
    Send a text completion request using the OpenAI-compatible endpoint.
    """
    data = {
        "model": model,
        "prompt": prompt,
        "max_tokens": 1000,  # Optional: maximum number of tokens to generate
        "temperature": 0.7,  # Optional: controls randomness (0-2)
        "top_p": 1,         # Optional: nucleus sampling parameter
        "stream": False     # Set to True for streaming responses
    }

    response = requests.post(
        f"{BASE_URL}/completions",
        headers=headers,
        json=data
    )

    if response.status_code != 200:
        raise Exception(f"Error: {response.status_code}")

    return response.json()

# Example usage
prompt = "Write a short story about a robot learning to paint:"
try:
    response = get_completion(prompt)
    print("Completion:", response['choices'][0]['text'])
except Exception as e:
    print(f"Error: {str(e)}")

Legacy Text Completions

For the older, non-OpenAI compatible endpoint:

import requests
import json

BASE_URL = "https://nano-gpt.com/api"
API_KEY = "YOUR_API_KEY"

headers = {
    "x-api-key": API_KEY,
    "Content-Type": "application/json"
}

def talk_to_gpt(prompt, model="openai/gpt-5.2", messages=[]):
    data = {
        "prompt": prompt,
        "model": model,
        "messages": messages
    }
    response = requests.post(f"{BASE_URL}/talk-to-gpt", headers=headers, json=data)
    return response.text if response.status_code == 200 else None

# Example usage
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing well, thank you! How can I assist you today?"}
]
prompt = "Please explain the concept of artificial intelligence."
response = talk_to_gpt(prompt, messages=messages)
if response:
    # Split the response to separate the text and NanoGPT info
    parts = response.split('<NanoGPT>')
    text_response = parts[0].strip()
    nano_info = json.loads(parts[1].split('</NanoGPT>')[0])

    print("NanoGPT Response:", text_response)
    print("Cost:", nano_info['cost'])
    print("Input Tokens:", nano_info['inputTokens'])
    print("Output Tokens:", nano_info['outputTokens'])
else:
    print("Failed to get response from GPT")

Prompt Caching (Claude Models)

Claude caching follows Anthropic’s Messages schema: cache_control lives on the message content blocks you want to reuse. NanoGPT simply forwards those markers to Anthropic, so you decide where each cache breakpoint sits. The first invocation costs 1.25× (5 min TTL) or 2× (1 hour TTL); cached replays discount the same tokens by ~90%.

Note: NanoGPT’s automatic failover system ensures high availability but may occasionally cause cache misses. If you’re seeing unexpected cache misses in your usage logs, see the “Cache Consistency with stickyProvider” section below.

The prompt_caching / promptCaching helper accepts these options:

Parameter	Type	Default	Description
`enabled`	boolean	—	Enable prompt caching
`ttl`	string	`"5m"`	Cache time-to-live: `"5m"` or `"1h"`
`cut_after_message_index`	integer	—	Zero-based index; cache all messages up to and including this index
`stickyProvider`	boolean	`false`	New: When `true`, disable automatic failover to preserve cache consistency. Returns 503 error instead of switching services.

Explicit `cache_control` markers

def chat_completion_with_claude_cache(messages, model="anthropic/claude-opus-4.5"):
    """
    Attach cache_control directly to the static prompt blocks you want Claude to reuse.
    """
    headers_with_cache = {
        **headers,  # reuse Authorization + Content-Type from above
        "anthropic-beta": "prompt-caching-2024-07-31"
    }

    payload = {
        "model": model,
        "messages": messages,
        "temperature": 0.7,
        "max_tokens": 4000,
        "stream": False
    }

    response = requests.post(
        f"{BASE_URL}/chat/completions",
        headers=headers_with_cache,
        json=payload
    )
    response.raise_for_status()
    return response.json()

messages_with_breakpoint = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a financial watchdog. Answer in JSON with rationale fields.",
                "cache_control": {"type": "ephemeral", "ttl": "5m"}
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": (
                    "Context: <10 kB of policy + rubric that rarely changes>\n"
                    "A separate uncached message will carry the live question."
                ),
                "cache_control": {"type": "ephemeral", "ttl": "5m"}
            }
        ]
    },
    {"role": "user", "content": "What risks should I watch for in today's filing?"}
]

result = chat_completion_with_claude_cache(messages_with_breakpoint)
print(result["choices"][0]["message"]["content"])

cache_control belongs to the individual content blocks (system, user, tool definitions, etc.). Each marker caches the entire prefix up to and including that block, matching Anthropic’s behavior.
Supported TTLs are 5m and 1h. Omit ttl to use the default 5m window.
Set the anthropic-beta: prompt-caching-2024-07-31 header on any request that contains cache markers; Anthropic rejects cache requests without the beta flag.
Check usage.prompt_tokens_details.cached_tokens in NanoGPT’s response to confirm what was billed at the discounted rate.

Using the `prompt_caching` helper

If you prefer not to duplicate cache_control entries manually, NanoGPT accepts a helper object that tags the leading prefix for you.

payload = {
    "model": "anthropic/claude-opus-4.5",
    "messages": [
        {"role": "system", "content": "Summaries must be under 100 words."},
        {"role": "user", "content": "Cache the playbook for an hour."},
        {"role": "user", "content": "Live question goes here"}
    ],
    "prompt_caching": {
        "enabled": True,
        "ttl": "1h",
        "cut_after_message_index": 1  # cache everything through message index 1
    }
}

requests.post(
    f"{BASE_URL}/chat/completions",
    headers={
        **headers,
        "anthropic-beta": "prompt-caching-2024-07-31"
    },
    json=payload
)

cut_after_message_index is zero-based and points at the last message in the static prefix. NanoGPT will attach a cache_control block with your TTL to each message up to that index before forwarding the request to Anthropic—no additional heuristics are applied. If you need different cache durations or non-contiguous breakpoints, fall back to explicit cache_control markers in your messages array.

Cache Consistency with `stickyProvider`

NanoGPT automatically fails over to backup services when the primary service is temporarily unavailable. While this ensures high availability, it can break your prompt cache because each backend service maintains its own separate cache. If cache consistency is more important than availability for your use case, you can enable the stickyProvider option:

{
  "model": "anthropic/claude-sonnet-4.5",
  "messages": [...],
  "prompt_caching": {
    "enabled": true,
    "ttl": "5m",
    "stickyProvider": true
  }
}

Behavior:

stickyProvider: false (default) — If the primary service fails, NanoGPT automatically retries with a backup service. Your request succeeds, but the cache may be lost (you’ll pay full price for that request and need to rebuild the cache).
stickyProvider: true — If the primary service fails, NanoGPT returns a 503 error instead of failing over. Your cache remains intact for when the service recovers.

When to use stickyProvider: true:

You have very large cached contexts where cache misses are expensive
You prefer to retry failed requests yourself rather than pay for cache rebuilds
Cost predictability is more important than request success rate

When to use stickyProvider: false (default):

You prefer requests to always succeed when possible
Occasional cache misses are acceptable
You’re using shorter contexts where cache rebuilds are inexpensive

Error response when stickyProvider blocks a failover:

{
  "error": {
    "message": "Service is temporarily unavailable. Fallback disabled to preserve prompt cache consistency. Switching services would invalidate your cached tokens. Remove stickyProvider option or retry later.",
    "status": 503,
    "type": "service_unavailable",
    "code": "fallback_blocked_for_cache_consistency"
  }
}

Chat Completions with Web Search

Enable real-time web access for any model by appending special suffixes:

def chat_completion_with_web_search(messages, model="openai/gpt-5.2", search_depth="standard"):
    """
    Send a chat completion request with web search enabled.
    
    Args:
        messages: List of message objects
        model: Base model name
        search_depth: "standard" ($0.006) or "deep" ($0.06)
    """
    # Append the appropriate suffix for web search
    if search_depth == "deep":
        model_with_search = f"{model}:online/linkup-deep"
    else:
        model_with_search = f"{model}:online"
    
    data = {
        "model": model_with_search,
        "messages": messages,
        "stream": True
    }
    
    response = requests.post(
        f"{BASE_URL}/chat/completions",
        headers=headers,
        json=data,
        stream=True
    )
    
    if response.status_code != 200:
        raise Exception(f"Error: {response.status_code}")
    
    for line in response.iter_lines():
        if line:
            line = line.decode('utf-8')
            if line.startswith('data: '):
                line = line[6:]
            if line == '[DONE]':
                break
            try:
                chunk = json.loads(line)
                if chunk['choices'][0]['delta'].get('content'):
                    yield chunk['choices'][0]['delta']['content']
            except json.JSONDecodeError:
                continue

# Example: Get current information
messages = [
    {"role": "user", "content": "What happened in the tech industry this week?"}
]

print("Standard web search:")
for content in chat_completion_with_web_search(messages):
    print(content, end='', flush=True)

# Example: Deep research
research_messages = [
    {"role": "user", "content": "Provide a comprehensive analysis of the latest developments in quantum computing"}
]

print("\n\nDeep web search:")
for content in chat_completion_with_web_search(research_messages, search_depth="deep"):
    print(content, end='', flush=True)

Web Search Options

:online - Standard search with 10 results ($0.006 per request)
:online/linkup-deep - Deep iterative search ($0.06 per request)

Web search dramatically improves factuality - Gemini 3 Flash Preview with web access shows a 10x improvement in accuracy, making it twice as accurate as non-web baselines. For more advanced web search capabilities including structured output, domain filtering, and date filtering, see the Web Search API.

Get Started

Endpoint Examples

API Reference

Miscellaneous

Integrations

Overview

Provider Selection

OpenAI Compatible Endpoints

Chat Completions (v1/chat/completions)

Responses (v1/responses)

Text Completions (v1/completions)

Legacy Text Completions

Prompt Caching (Claude Models)

Explicit `cache_control` markers

Using the `prompt_caching` helper

Cache Consistency with `stickyProvider`

Chat Completions with Web Search

Web Search Options

Get Started

Endpoint Examples

API Reference

Miscellaneous

Integrations

​Overview

​Provider Selection

​OpenAI Compatible Endpoints

​Chat Completions (v1/chat/completions)

​Responses (v1/responses)

​Text Completions (v1/completions)

​Legacy Text Completions

​Prompt Caching (Claude Models)

​Explicit cache_control markers

​Using the prompt_caching helper

​Cache Consistency with stickyProvider

​Chat Completions with Web Search

​Web Search Options

Overview

Provider Selection

OpenAI Compatible Endpoints

Chat Completions (v1/chat/completions)

Responses (v1/responses)

Text Completions (v1/completions)

Legacy Text Completions

Prompt Caching (Claude Models)

Explicit `cache_control` markers

Using the `prompt_caching` helper

Cache Consistency with `stickyProvider`

Chat Completions with Web Search

Web Search Options