Overview

Prompt caching lets you cache large, reusable prompt prefixes (system prompts, reference documents, tool definitions, and long conversation history) so follow-up requests can reuse the cached prefix instead of reprocessing it from scratch. Benefits:
  • Up to ~90% cost reduction on cached input tokens (cache hits)
  • Lower latency on requests with large static prefixes
Caching is opt-in and configured per request.

Supported Models

Prompt caching is available on Gemini and Claude models.

Gemini models

All Gemini models are eligible for prompt caching, including:
  • Gemini 3.1 Pro Preview (all variants)
  • Gemini 3 Flash Preview / 3 Pro Preview
  • Gemini 2.5 Flash / Pro
  • Gemini 2.0 Flash

Claude models

Prompt caching is also available on Claude models, including these families (examples):
Model family | Example model IDs
Claude 3.5 Sonnet v2 | claude-3-5-sonnet-20241022
Claude 3.5 Haiku | claude-3-5-haiku-20241022
Claude 3.7 Sonnet | claude-3-7-sonnet-20250219 (and :thinking variants)
Claude Sonnet 4 | claude-sonnet-4-20250514 (and :thinking variants)
Claude Sonnet 4.5 | claude-sonnet-4-5-20250929 (and :thinking variants)
Claude Haiku 4.5 | claude-haiku-4-5-20251001
Claude Opus 4 | claude-opus-4-20250514 (and :thinking variants)
Claude Opus 4.1 | claude-opus-4-1-20250805 (and :thinking variants)
Claude Opus 4.5 | claude-opus-4-5-20251101 (and :thinking variants)
Claude Opus 4.6 | claude-opus-4-6 (and :thinking variants)
All of the above are also supported via the anthropic/ model prefix (for example anthropic/claude-sonnet-4.5, anthropic/claude-opus-4.6:thinking).

Claude minimum cacheable tokens

If your cached prefix is smaller than the minimum, the request still succeeds but no cache entry is created.
Model | Minimum cacheable tokens
Claude Opus 4.5 | 4,096
Claude Haiku 4.5 | 4,096
Claude Sonnet 4.5, Sonnet 4, Opus 4, Opus 4.1, Opus 4.6, Sonnet 3.7 | 1,024
Claude Haiku 3.5 | 2,048

How To Enable Prompt Caching

Prompt caching works on POST /api/v1/chat/completions. You can enable it in three ways.

Option 1: body-level helper (promptCaching / prompt_caching / cache_control)

Add a top-level helper object:
{
  "model": "google/gemini-3.1-pro-preview",
  "messages": [
    { "role": "system", "content": "Your large static content..." },
    { "role": "user", "content": "Summarize the key points." }
  ],
  "promptCaching": {
    "enabled": true,
    "ttl": "5m",
    "cutAfterMessageIndex": 0
  }
}
Parameters:
Parameter | Type | Default | Description
enabled | boolean | — | Enable prompt caching
ttl | "5m" or "1h" | "5m" | Cache time-to-live
cutAfterMessageIndex / cut_after_message_index | integer | — | Zero-based index; cache all messages up to and including this index
stickyProvider | boolean | false | When true, avoid failover to preserve cache consistency (see stickyProvider below)
explicitCacheControl / explicit_cache_control | boolean | false | When true, only refresh TTLs on existing inline cache_control blocks and do not auto-add cache breakpoints
When explicitCacheControl is true, the system only refreshes TTLs on cache_control blocks you already placed in your request; no additional cache breakpoints are added automatically. This is useful when you use inline cache_control markers (Option 2) but also want body-level settings such as ttl or stickyProvider to apply. Without this flag, the system may add its own cache breakpoints on top of yours.
The body-level helper key accepts these aliases:
  • promptCaching
  • prompt_caching
  • cache_control (not to be confused with the inline cache_control markers of Option 2)
Passing true instead of an object defaults to:
{ "enabled": true, "ttl": "5m" }
If cutAfterMessageIndex is omitted, NanoGPT selects cache boundaries automatically.
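The boolean shorthand can be made concrete with a small sketch. The expansion rule ({"enabled": true, "ttl": "5m"}) comes from this page; the helper name itself is illustrative:

```python
def normalize_prompt_caching(value):
    """Expand the promptCaching helper's boolean shorthand.

    Passing True is documented as equivalent to {"enabled": True, "ttl": "5m"};
    an object is used as-is, with the documented default TTL filled in.
    """
    if value is True:
        return {"enabled": True, "ttl": "5m"}
    if isinstance(value, dict):
        return {"ttl": "5m", **value}  # default TTL unless explicitly set
    raise TypeError("promptCaching must be a boolean or an object")
```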

Option 2: inline cache_control markers

Attach cache_control directly to content blocks you want cached:
{
  "model": "google/gemini-3.1-pro-preview",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "Your long reference document...",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    },
    { "role": "user", "content": "Live question goes here" }
  ]
}
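The structure above can be generated programmatically. Here build_messages is a hypothetical helper that keeps the static reference first (cacheable) and the live question after:

```python
def build_messages(static_reference: str, user_question: str) -> list:
    """Place static, cacheable content first and the dynamic part last,
    marking the static block with an inline cache_control marker."""
    return [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": static_reference,
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": user_question},
    ]
```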

Combining inline markers with body-level settings

{
  "model": "google/gemini-2.5-flash",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a helpful coding assistant with access to a large codebase...",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    },
    {
      "role": "user",
      "content": "Summarize the auth module"
    }
  ],
  "promptCaching": {
    "enabled": true,
    "ttl": "1h",
    "explicitCacheControl": true
  }
}
In this example, the system prompt’s cache_control marker is preserved and its TTL is set to 1h. The user message does not receive an auto-generated cache breakpoint.

Option 3: anthropic-beta header

The Anthropic-compatible header is supported:
anthropic-beta: prompt-caching-2024-07-31
This header also works for Gemini models. For Claude 1-hour TTL requests using Anthropic-native routing, also include:
anthropic-beta: prompt-caching-2024-07-31,extended-cache-ttl-2025-04-11
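The header logic can be sketched as follows. The beta flag names come from this page; the model-prefix check used to decide when to add the extended-TTL flag is an illustrative assumption (the docs only say it applies to Claude requests on Anthropic-native routing):

```python
def caching_headers(model: str, ttl: str = "5m") -> dict:
    """Build the anthropic-beta header for a chat completion request.

    Assumption: Claude models are detected by substring/prefix; adapt this
    to however your client tracks which provider a model routes to.
    """
    betas = ["prompt-caching-2024-07-31"]
    if ttl == "1h" and ("claude" in model or model.startswith("anthropic/")):
        betas.append("extended-cache-ttl-2025-04-11")
    return {"anthropic-beta": ",".join(betas)}
```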

Controlling What Gets Cached

Note: explicitCacheControl and cutAfterMessageIndex serve different purposes. cutAfterMessageIndex tells the system where to auto-place cache breakpoints. explicitCacheControl tells the system not to auto-place any and only refresh what you already marked. If both are set, explicitCacheControl takes precedence.

cutAfterMessageIndex

Override automatic cache breakpoints by setting the last cached message index:
{
  "promptCaching": {
    "enabled": true,
    "cutAfterMessageIndex": 4
  }
}
Messages at indices 0..4 are cached; later messages are not. You can also set this via request header:
x-prompt-caching-cut-after: 4

Cache block limit

A maximum of 4 cache_control breakpoints are allowed per request (across system prompt, tools, and messages). If more are present, the oldest breakpoints are pruned automatically.
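If you would rather control which markers survive than rely on server-side pruning, the rule can be approximated client-side. This sketch scans only message content blocks (the real limit also counts breakpoints on the system prompt and tools):

```python
def prune_cache_breakpoints(messages: list, limit: int = 4) -> list:
    """Keep only the `limit` most recent cache_control markers in message
    content blocks, removing the oldest (mirroring the documented rule)."""
    marked = []
    for msg in messages:
        if isinstance(msg.get("content"), list):
            for block in msg["content"]:
                if "cache_control" in block:
                    marked.append(block)
    for block in marked[:-limit]:  # everything except the newest `limit`
        del block["cache_control"]
    return messages
```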

Forcing a Cache Write

There is no separate “force write” flag. Enable prompt caching and send the request. The first eligible request writes cache automatically (if provider thresholds/availability allow it). Repeated requests with the same cached prefix read from cache.

Usage Fields (How To Verify Cache Hits)

When caching is active, responses can include:
  • cache_creation_input_tokens: tokens written to cache on this request
  • cache_read_input_tokens: tokens read from cache on this request (cache hit when > 0)
  • prompt_tokens_details.cached_tokens: OpenAI-style cached token count
Example:
{
  "usage": {
    "prompt_tokens": 8500,
    "completion_tokens": 200,
    "cache_creation_input_tokens": 8000,
    "cache_read_input_tokens": 0
  }
}
When present, x_nanogpt_pricing includes cache pricing breakdown fields such as cacheCreationInputTokens, cacheReadInputTokens, cacheTTL, and cacheCost. For streaming requests, the final SSE chunk includes the same usage fields when usage is included.
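A response can be classified directly from these fields. This hypothetical helper encodes the rule that cache_read_input_tokens > 0 means a hit:

```python
def cache_status(usage: dict) -> str:
    """Classify a response's cache behavior from its usage fields.

    read > 0 is a hit (a small creation count can still appear on hits, per
    the caveats at the end of this page); creation > 0 with no reads is a
    fresh cache write; otherwise nothing was cached.
    """
    if usage.get("cache_read_input_tokens", 0) > 0:
        return "hit"
    if usage.get("cache_creation_input_tokens", 0) > 0:
        return "write"
    return "uncached"
```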

Pricing

Cache writes and reads are billed differently by provider.

Gemini Pro models

Token type | Rate (per 1M tokens) | Notes
Regular input | $2.00 | —
Cache write surcharge | +$0.375 | Added on top of input cost
Cache read | $0.20 | 90% cheaper than input
Example: writing 10,000 cached tokens costs (10k × $2.00/M) + (10k × $0.375/M) = $0.02375. Reading 10,000 cached tokens costs 10k × $0.20/M = $0.002.
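The worked example generalizes to any token count. This sketch hard-codes the Gemini Pro rates from the table above:

```python
def gemini_pro_cache_costs(tokens: int) -> tuple:
    """Cost in dollars to write and then read `tokens` cached tokens on a
    Gemini Pro model ($2.00/M input, +$0.375/M write surcharge, $0.20/M read)."""
    per_million = tokens / 1_000_000
    write_cost = per_million * (2.00 + 0.375)  # input cost plus write surcharge
    read_cost = per_million * 0.20
    return write_cost, read_cost
```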

Gemini Flash models

Token type | Rate (per 1M tokens) | Notes
Regular input | Varies by model | —
Cache write surcharge | +$0.083 | Added on top of input cost
Cache read | 10% of input rate | 90% cheaper than input
For Gemini 2.0 models, cache reads are 25% of the base input rate (75% cheaper), not 10%.

Claude models

TTL | Creation multiplier on cached input tokens | Read multiplier
5m | 1.25x | 0.1x
1h | 2.0x | 0.1x
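Claude pricing is multiplier-based, so the base input rate must come from the model's own price sheet; it is not listed on this page. A sketch, with the base rate supplied by the caller as an assumption:

```python
def claude_cache_costs(base_rate_per_m: float, cached_tokens: int, ttl: str = "5m") -> tuple:
    """Apply the documented Claude multipliers (1.25x for 5m creation, 2.0x
    for 1h creation, 0.1x for reads) to a caller-supplied base input rate
    expressed in dollars per million tokens."""
    creation_multiplier = {"5m": 1.25, "1h": 2.0}[ttl]
    per_million = cached_tokens / 1_000_000
    creation_cost = per_million * base_rate_per_m * creation_multiplier
    read_cost = per_million * base_rate_per_m * 0.1
    return creation_cost, read_cost
```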

TTL Options

TTL | Duration | Description
"5m" | 5 minutes | Default. Suitable for interactive sessions.
"1h" | 1 hour | Extended. Useful for batch processing or long-running sessions.

Structuring Prompts for Cache Hits

Cache hits require the cached prefix to be byte-identical across requests. Best practices:
  • Put static content first (system prompt, reference docs, tool definitions).
  • Keep cached content identical across requests (no timestamps, request IDs, or dynamic inserts).
  • Put dynamic content after the cache boundary (typically the latest user message).
Behavior:
  • On cache hit, TTL resets.
  • If TTL expires without reuse, cache expires.
  • Any prefix change (even one character) causes a cache miss and new cache creation.
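Because hits require a byte-identical prefix, it can help during development to fingerprint the intended prefix and confirm it does not drift between requests. This is an illustrative client-side check (the serialization here is ours, not the provider's), not an API feature:

```python
import hashlib
import json


def prefix_fingerprint(messages: list, cut_after_index: int) -> str:
    """SHA-256 of a canonical serialization of the would-be cached prefix
    (messages 0..cut_after_index). Identical fingerprints across requests
    are a necessary condition for cache hits."""
    prefix = messages[: cut_after_index + 1]
    canonical = json.dumps(prefix, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```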

Cache Consistency with stickyProvider

Each provider keeps its own cache. If a request fails over to another provider, the previous cache may be unavailable. If cache consistency matters more than availability, set:
{
  "promptCaching": { "enabled": true, "ttl": "5m", "stickyProvider": true }
}
Behavior:
  • stickyProvider: false (default): the request may succeed even if routing changes, but you might rebuild caches.
  • stickyProvider: true: if a fallback would be required, the request returns 503 instead.
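One way to handle this trade-off is to retry a rejected sticky request once without stickiness, explicitly accepting a possible cache rebuild. A hypothetical sketch:

```python
def retry_payload_after_sticky_503(payload: dict, status_code: int):
    """If a stickyProvider request was rejected with 503, return a copy of
    the payload with stickiness disabled for a single retry; otherwise None."""
    pc = payload.get("promptCaching")
    if status_code == 503 and isinstance(pc, dict) and pc.get("stickyProvider"):
        return {**payload, "promptCaching": {**pc, "stickyProvider": False}}
    return None
```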

NanoGPT Web UI

In the NanoGPT web UI, supported models show a prompt caching toggle where you can choose cache duration.

Limitations and Caveats

  • Provider-side minimum token thresholds still apply before a cache entry is created.
  • A maximum of 4 cache breakpoints (cache_control) are supported per request.
  • Some models report aggregate prompt usage differently; use cache_creation_input_tokens and cache_read_input_tokens for authoritative cached token counts.
  • On cache hits, a small non-zero cache_creation_input_tokens can appear due to per-request overhead and does not necessarily indicate a cache miss.