Overview

Prompt caching lets you cache large, reusable prompt prefixes (system prompts, reference documents, tool definitions, and long conversation history) so follow-up requests can reuse the cached prefix instead of reprocessing it from scratch. Benefits:
  • Up to ~90% cost reduction on cached input tokens (cache hits)
  • Lower latency on requests with large static prefixes
Caching is opt-in and configured per request.

Supported Models

Prompt caching is available on Gemini and Claude models.

Gemini models

All Gemini models are eligible for prompt caching, including:
  • Gemini 3.1 Pro Preview (all variants)
  • Gemini 3 Flash Preview / 3 Pro Preview
  • Gemini 2.5 Flash / Pro
  • Gemini 2.0 Flash

Claude models

Prompt caching is also available on Claude models, including these families (examples):
Model family | Example model IDs
Claude 3.5 Sonnet v2 | claude-3-5-sonnet-20241022
Claude 3.5 Haiku | claude-3-5-haiku-20241022
Claude 3.7 Sonnet | claude-3-7-sonnet-20250219 (and :thinking variants)
Claude Sonnet 4 | claude-sonnet-4-20250514 (and :thinking variants)
Claude Sonnet 4.5 | claude-sonnet-4-5-20250929 (and :thinking variants)
Claude Haiku 4.5 | claude-haiku-4-5-20251001
Claude Opus 4 | claude-opus-4-20250514 (and :thinking variants)
Claude Opus 4.1 | claude-opus-4-1-20250805 (and :thinking variants)
Claude Opus 4.5 | claude-opus-4-5-20251101 (and :thinking variants)
Claude Opus 4.6 | claude-opus-4-6 (and :thinking variants)
All of the above are also supported via the anthropic/ model prefix (for example anthropic/claude-sonnet-4.5, anthropic/claude-opus-4.6:thinking).

Claude minimum cacheable tokens

If your cached prefix is smaller than the minimum, the request still succeeds but no cache entry is created.
Model | Minimum cacheable tokens
Claude Opus 4.5 | 4,096
Claude Haiku 4.5 | 4,096
Claude Sonnet 4.5, Sonnet 4, Opus 4, Opus 4.1, Opus 4.6, Sonnet 3.7 | 1,024
Claude Haiku 3.5 | 2,048

How To Enable Prompt Caching

Prompt caching works on POST /api/v1/chat/completions. You can enable it in three ways.

Option 1: body-level helper (promptCaching / prompt_caching / cache_control)

Add a top-level helper object:
{
  "model": "google/gemini-3.1-pro-preview",
  "messages": [
    { "role": "system", "content": "Your large static content..." },
    { "role": "user", "content": "Summarize the key points." }
  ],
  "promptCaching": {
    "enabled": true,
    "ttl": "5m",
    "cutAfterMessageIndex": 0
  }
}
Parameters:
Parameter | Type | Default | Description
enabled | boolean | — | Enable prompt caching
ttl | "5m" or "1h" | "5m" | Cache time-to-live
cutAfterMessageIndex / cut_after_message_index | integer | — | Zero-based index; cache all messages up to and including this index
stickyProvider | boolean | false | When true, avoid failover to preserve cache consistency (see stickyProvider below)
explicitCacheControl / explicit_cache_control | boolean | false | When true, only refresh TTLs on existing inline cache_control blocks and do not auto-add cache breakpoints
When explicitCacheControl is true, the system only refreshes TTLs on cache_control blocks you already placed in your request; no additional cache breakpoints are added automatically. This is useful when you use inline cache_control markers (Option 2) but also want body-level settings such as ttl or stickyProvider to apply. Without this flag, the system may add its own cache breakpoints on top of yours.
The body-level helper key accepts these aliases:
  • promptCaching
  • prompt_caching
  • cache_control (not to be confused with the inline cache_control markers of Option 2)
Passing true instead of an object defaults to:
{ "enabled": true, "ttl": "5m" }
If cutAfterMessageIndex is omitted, NanoGPT selects cache boundaries automatically.
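The boolean shorthand can be made concrete with a small sketch. The expansion rule ({"enabled": true, "ttl": "5m"}) comes from this page; the helper name itself is illustrative:

```python
def normalize_prompt_caching(value):
    """Expand the promptCaching helper's boolean shorthand.

    Passing True is documented as equivalent to {"enabled": True, "ttl": "5m"};
    an object is used as-is, with the documented default TTL filled in.
    """
    if value is True:
        return {"enabled": True, "ttl": "5m"}
    if isinstance(value, dict):
        return {"ttl": "5m", **value}  # default TTL unless explicitly set
    raise TypeError("promptCaching must be a boolean or an object")
```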

Option 2: inline cache_control markers

Attach cache_control directly to content blocks you want cached:
{
  "model": "google/gemini-3.1-pro-preview",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "Your long reference document...",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    },
    { "role": "user", "content": "Live question goes here" }
  ]
}
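The structure above can be generated programmatically. Here build_messages is a hypothetical helper that keeps the static reference first (cacheable) and the live question after:

```python
def build_messages(static_reference: str, user_question: str) -> list:
    """Place static, cacheable content first and the dynamic part last,
    marking the static block with an inline cache_control marker."""
    return [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": static_reference,
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": user_question},
    ]
```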

Combining inline markers with body-level settings

{
  "model": "google/gemini-2.5-flash",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a helpful coding assistant with access to a large codebase...",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    },
    {
      "role": "user",
      "content": "Summarize the auth module"
    }
  ],
  "promptCaching": {
    "enabled": true,
    "ttl": "1h",
    "explicitCacheControl": true
  }
}
In this example, the system prompt’s cache_control marker is preserved and its TTL is set to 1h. The user message does not receive an auto-generated cache breakpoint.

Option 3: anthropic-beta header

The Anthropic-compatible header is supported:
anthropic-beta: prompt-caching-2024-07-31
This header also works for Gemini models. For Claude 1-hour TTL requests using Anthropic-native routing, also include:
anthropic-beta: prompt-caching-2024-07-31,extended-cache-ttl-2025-04-11
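The header logic can be sketched as follows. The beta flag names come from this page; the model-prefix check used to decide when to add the extended-TTL flag is an illustrative assumption (the docs only say it applies to Claude requests on Anthropic-native routing):

```python
def caching_headers(model: str, ttl: str = "5m") -> dict:
    """Build the anthropic-beta header for a chat completion request.

    Assumption: Claude models are detected by substring/prefix; adapt this
    to however your client tracks which provider a model routes to.
    """
    betas = ["prompt-caching-2024-07-31"]
    if ttl == "1h" and ("claude" in model or model.startswith("anthropic/")):
        betas.append("extended-cache-ttl-2025-04-11")
    return {"anthropic-beta": ",".join(betas)}
```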

Controlling What Gets Cached

Note: explicitCacheControl and cutAfterMessageIndex serve different purposes. cutAfterMessageIndex tells the system where to auto-place cache breakpoints. explicitCacheControl tells the system not to auto-place any and only refresh what you already marked. If both are set, explicitCacheControl takes precedence.

cutAfterMessageIndex

Override automatic cache breakpoints by setting the last cached message index:
{
  "promptCaching": {
    "enabled": true,
    "cutAfterMessageIndex": 4
  }
}
Messages at indices 0..4 are cached; later messages are not. You can also set this via request header:
x-prompt-caching-cut-after: 4

Cache block limit

A maximum of 4 cache_control breakpoints are allowed per request (across system prompt, tools, and messages). If more are present, the oldest breakpoints are pruned automatically.
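If you would rather control which markers survive than rely on server-side pruning, the rule can be approximated client-side. This sketch scans only message content blocks (the real limit also counts breakpoints on the system prompt and tools):

```python
def prune_cache_breakpoints(messages: list, limit: int = 4) -> list:
    """Keep only the `limit` most recent cache_control markers in message
    content blocks, removing the oldest (mirroring the documented rule)."""
    marked = []
    for msg in messages:
        if isinstance(msg.get("content"), list):
            for block in msg["content"]:
                if "cache_control" in block:
                    marked.append(block)
    for block in marked[:-limit]:  # everything except the newest `limit`
        del block["cache_control"]
    return messages
```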

Forcing a Cache Write

There is no separate “force write” flag. Enable prompt caching and send the request. The first eligible request writes cache automatically (if provider thresholds/availability allow it). Repeated requests with the same cached prefix read from cache.

Usage Fields (How To Verify Cache Hits)

When caching is active, responses can include:
  • cache_creation_input_tokens: tokens written to cache on this request
  • cache_read_input_tokens: tokens read from cache on this request (cache hit when > 0)
  • prompt_tokens_details.cached_tokens: OpenAI-style cached token count
Example:
{
  "usage": {
    "prompt_tokens": 8500,
    "completion_tokens": 200,
    "cache_creation_input_tokens": 8000,
    "cache_read_input_tokens": 0
  }
}
When present, x_nanogpt_pricing includes cache pricing breakdown fields such as cacheCreationInputTokens, cacheReadInputTokens, cacheTTL, and cacheCost. For streaming requests, the final SSE chunk includes the same usage fields when usage is included.
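A response can be classified directly from these fields. This hypothetical helper encodes the rule that cache_read_input_tokens > 0 means a hit:

```python
def cache_status(usage: dict) -> str:
    """Classify a response's cache behavior from its usage fields.

    read > 0 is a hit (a small creation count can still appear on hits, per
    the caveats at the end of this page); creation > 0 with no reads is a
    fresh cache write; otherwise nothing was cached.
    """
    if usage.get("cache_read_input_tokens", 0) > 0:
        return "hit"
    if usage.get("cache_creation_input_tokens", 0) > 0:
        return "write"
    return "uncached"
```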

Pricing

Cache writes and reads are billed differently by provider.

Gemini Pro models

Token type | Rate (per 1M tokens) | Notes
Regular input | $2.00 | —
Cache write surcharge | +$0.375 | Added on top of input cost
Cache read | $0.20 | 90% cheaper than input
Example: writing 10,000 cached tokens costs (10k × $2.00/M) + (10k × $0.375/M) = $0.02375. Reading 10,000 cached tokens costs 10k × $0.20/M = $0.002.
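The worked example generalizes to any token count. This sketch hard-codes the Gemini Pro rates from the table above:

```python
def gemini_pro_cache_costs(tokens: int) -> tuple:
    """Cost in dollars to write and then read `tokens` cached tokens on a
    Gemini Pro model ($2.00/M input, +$0.375/M write surcharge, $0.20/M read)."""
    per_million = tokens / 1_000_000
    write_cost = per_million * (2.00 + 0.375)  # input cost plus write surcharge
    read_cost = per_million * 0.20
    return write_cost, read_cost
```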

Gemini Flash models

Token type | Rate (per 1M tokens) | Notes
Regular input | Varies by model | —
Cache write surcharge | +$0.083 | Added on top of input cost
Cache read | 10% of input rate | 90% cheaper than input
For Gemini 2.0 models, cache reads are 25% of the base input rate (75% cheaper), not 10%.

Claude models

TTL | Creation multiplier on cached input tokens | Read multiplier
5m | 1.25x | 0.1x
1h | 2.0x | 0.1x
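Claude pricing is multiplier-based, so the base input rate must come from the model's own price sheet; it is not listed on this page. A sketch, with the base rate supplied by the caller as an assumption:

```python
def claude_cache_costs(base_rate_per_m: float, cached_tokens: int, ttl: str = "5m") -> tuple:
    """Apply the documented Claude multipliers (1.25x for 5m creation, 2.0x
    for 1h creation, 0.1x for reads) to a caller-supplied base input rate
    expressed in dollars per million tokens."""
    creation_multiplier = {"5m": 1.25, "1h": 2.0}[ttl]
    per_million = cached_tokens / 1_000_000
    creation_cost = per_million * base_rate_per_m * creation_multiplier
    read_cost = per_million * base_rate_per_m * 0.1
    return creation_cost, read_cost
```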

TTL Options

TTL | Duration | Description
"5m" | 5 minutes | Default. Suitable for interactive sessions.
"1h" | 1 hour | Extended. Useful for batch processing or long-running sessions.

Structuring Prompts for Cache Hits

Cache hits require the cached prefix to be byte-identical across requests. Best practices:
  • Put static content first (system prompt, reference docs, tool definitions).
  • Keep cached content identical across requests (no timestamps, request IDs, or dynamic inserts).
  • Put dynamic content after the cache boundary (typically the latest user message).
Behavior:
  • On cache hit, TTL resets.
  • If TTL expires without reuse, cache expires.
  • Any prefix change (even one character) causes a cache miss and new cache creation.
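Because hits require a byte-identical prefix, it can help during development to fingerprint the intended prefix and confirm it does not drift between requests. This is an illustrative client-side check (the serialization here is ours, not the provider's), not an API feature:

```python
import hashlib
import json


def prefix_fingerprint(messages: list, cut_after_index: int) -> str:
    """SHA-256 of a canonical serialization of the would-be cached prefix
    (messages 0..cut_after_index). Identical fingerprints across requests
    are a necessary condition for cache hits."""
    prefix = messages[: cut_after_index + 1]
    canonical = json.dumps(prefix, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```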

Cache Consistency with stickyProvider

Each provider keeps its own cache. If a request fails over to another provider, the previous cache may be unavailable. If cache consistency matters more than availability, set:
{
  "promptCaching": { "enabled": true, "ttl": "5m", "stickyProvider": true }
}
Behavior:
  • stickyProvider: false (default): the request may succeed even if routing changes, but you might rebuild caches.
  • stickyProvider: true: if a fallback would be required, the request returns 503 instead.
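One way to handle this trade-off is to retry a rejected sticky request once without stickiness, explicitly accepting a possible cache rebuild. A hypothetical sketch:

```python
def retry_payload_after_sticky_503(payload: dict, status_code: int):
    """If a stickyProvider request was rejected with 503, return a copy of
    the payload with stickiness disabled for a single retry; otherwise None."""
    pc = payload.get("promptCaching")
    if status_code == 503 and isinstance(pc, dict) and pc.get("stickyProvider"):
        return {**payload, "promptCaching": {**pc, "stickyProvider": False}}
    return None
```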

NanoGPT Web UI

In the NanoGPT web UI, supported models show a prompt caching toggle where you can choose cache duration.

Limitations and Caveats

  • Provider-side minimum token thresholds still apply before a cache entry is created.
  • A maximum of 4 cache breakpoints (cache_control) are supported per request.
  • Some models report aggregate prompt usage differently; use cache_creation_input_tokens and cache_read_input_tokens for authoritative cached token counts.
  • On cache hits, a small non-zero cache_creation_input_tokens can appear due to per-request overhead and does not necessarily indicate a cache miss.