Overview
Prompt caching lets you cache large, reusable prompt prefixes (system prompts, reference documents, tool definitions, and long conversation history) so follow-up requests can reuse the cached prefix instead of reprocessing it from scratch. Benefits:
- Up to ~90% cost reduction on cached input tokens (cache hits)
- Lower latency on requests with large static prefixes
Supported Models
Prompt caching is available on Gemini and Claude models.
Gemini models
All Gemini models are eligible for prompt caching, including:
- Gemini 3.1 Pro Preview (all variants)
- Gemini 3 Flash Preview / 3 Pro Preview
- Gemini 2.5 Flash / Pro
- Gemini 2.0 Flash
Claude models
Prompt caching is also available on Claude models, including these families (examples):

| Model family | Example model IDs |
|---|---|
| Claude 3.5 Sonnet v2 | claude-3-5-sonnet-20241022 |
| Claude 3.5 Haiku | claude-3-5-haiku-20241022 |
| Claude 3.7 Sonnet | claude-3-7-sonnet-20250219 (and :thinking variants) |
| Claude Sonnet 4 | claude-sonnet-4-20250514 (and :thinking variants) |
| Claude Sonnet 4.5 | claude-sonnet-4-5-20250929 (and :thinking variants) |
| Claude Haiku 4.5 | claude-haiku-4-5-20251001 |
| Claude Opus 4 | claude-opus-4-20250514 (and :thinking variants) |
| Claude Opus 4.1 | claude-opus-4-1-20250805 (and :thinking variants) |
| Claude Opus 4.5 | claude-opus-4-5-20251101 (and :thinking variants) |
| Claude Opus 4.6 | claude-opus-4-6 (and :thinking variants) |
Claude models can also be addressed with the anthropic/ model prefix (for example anthropic/claude-sonnet-4.5, anthropic/claude-opus-4.6:thinking).
Claude minimum cacheable tokens
If your cached prefix is smaller than the minimum, the request still succeeds but no cache entry is created.

| Model | Minimum cacheable tokens |
|---|---|
| Claude Opus 4.5 | 4,096 |
| Claude Haiku 4.5 | 4,096 |
| Claude Sonnet 4.5, Sonnet 4, Opus 4, Opus 4.1, Opus 4.6, Sonnet 3.7 | 1,024 |
| Claude Haiku 3.5 | 2,048 |
How To Enable Prompt Caching
Prompt caching works on POST /api/v1/chat/completions.
You can enable it in three ways.
Option 1: body-level helper (promptCaching / prompt_caching / cache_control)
Add a top-level helper object:
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | — | Enable prompt caching |
| ttl | "5m" or "1h" | "5m" | Cache time-to-live |
| cutAfterMessageIndex / cut_after_message_index | integer | — | Zero-based index; cache all messages up to and including this index |
| stickyProvider | boolean | false | When true, avoid failover to preserve cache consistency (see stickyProvider) |
| explicitCacheControl / explicit_cache_control | boolean | false | When true, only refresh TTLs on existing inline cache_control blocks and do not auto-add cache breakpoints |
explicitCacheControl (boolean, default false)
When true, the system only refreshes TTLs on cache_control blocks you already placed in your request. No additional cache breakpoints are added automatically.
This is useful when you use inline cache_control markers (Option 2) but also want body-level settings like ttl or stickyProvider to apply. Without this flag, the system may add its own cache breakpoints on top of yours.
Also accepts the snake_case alias explicit_cache_control.
Aliases are accepted for the body-level helper key:
- promptCaching
- prompt_caching
- cache_control (body-level helper alias)

Passing true instead of an object enables caching with the default settings (5-minute TTL). When cutAfterMessageIndex is omitted, NanoGPT selects cache boundaries automatically.
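A minimal sketch of a request body using the body-level helper. Field names come from the table above; the model ID and message contents are illustrative placeholders:

```python
# Sketch of a chat completions body using the promptCaching helper.
# The model ID and messages are illustrative placeholders.
payload = {
    "model": "claude-sonnet-4-5-20250929",
    "messages": [
        {"role": "system", "content": "<large static system prompt>"},
        {"role": "user", "content": "Summarize the document."},
    ],
    "promptCaching": {
        "enabled": True,
        "ttl": "1h",                # "5m" (default) or "1h"
        "cutAfterMessageIndex": 0,  # cache messages 0..0 (the system prompt)
        "stickyProvider": False,
    },
}
```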
Option 2: inline cache_control markers
Attach cache_control directly to content blocks you want cached:
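A minimal sketch, assuming the Anthropic-style ephemeral marker shape for content blocks:

```python
# Inline cache_control marker on a system content block. The
# {"type": "ephemeral"} shape follows the Anthropic Messages API.
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "<large static reference document>",
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {"role": "user", "content": "What does section 3 cover?"},
]
```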
Combining inline markers with body-level settings
With explicitCacheControl enabled and a body-level ttl of "1h", each inline cache_control marker is preserved and its TTL is set to 1h. The user message does not receive an auto-generated cache breakpoint.
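One possible shape for this combination, using the Option 1 field names and an Anthropic-style inline marker (the model ID is illustrative):

```python
# Inline markers plus body-level settings. With explicitCacheControl
# set, only the inline marker below gets its TTL refreshed (to 1h);
# no automatic breakpoints are added to the user message.
payload = {
    "model": "claude-opus-4-5-20251101",  # illustrative model ID
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "<static prefix>",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Next question."},
    ],
    "promptCaching": {
        "enabled": True,
        "ttl": "1h",
        "explicitCacheControl": True,
    },
}
```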
Option 3: anthropic-beta header
The Anthropic-compatible header is supported:
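A sketch of the request headers; the beta flag value shown is Anthropic's published prompt-caching flag and is an assumption here:

```python
# Enabling caching via the Anthropic-compatible header. The beta flag
# value is Anthropic's prompt-caching flag; treat it as an assumption.
headers = {
    "Authorization": "Bearer <API_KEY>",
    "Content-Type": "application/json",
    "anthropic-beta": "prompt-caching-2024-07-31",
}
```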
Controlling What Gets Cached
Note: explicitCacheControl and cutAfterMessageIndex serve different purposes. cutAfterMessageIndex tells the system where to auto-place cache breakpoints. explicitCacheControl tells the system not to auto-place any and only refresh what you already marked. If both are set, explicitCacheControl takes precedence.
cutAfterMessageIndex
Override automatic cache breakpoints by setting the last cached message index:
With an index of 4, messages 0..4 are cached; later messages are not.
You can also set this via request header:
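The body-level form can be sketched as follows (the model ID and messages are illustrative):

```python
# Body-level cutAfterMessageIndex: with index 4, messages 0..4 are
# cached and later messages fall outside the cached prefix.
payload = {
    "model": "claude-sonnet-4-5-20250929",  # illustrative
    "messages": [{"role": "user", "content": f"turn {i}"} for i in range(6)],
    "promptCaching": {"enabled": True, "cutAfterMessageIndex": 4},
}
```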
Cache block limit
A maximum of 4 cache_control breakpoints are allowed per request (across system prompt, tools, and messages). If more are present, the oldest breakpoints are pruned automatically.
Forcing a Cache Write
There is no separate “force write” flag. Enable prompt caching and send the request. The first eligible request writes cache automatically (if provider thresholds/availability allow it). Repeated requests with the same cached prefix read from cache.
Usage Fields (How To Verify Cache Hits)
When caching is active, responses can include:
- cache_creation_input_tokens: tokens written to cache on this request
- cache_read_input_tokens: tokens read from cache on this request (cache hit when > 0)
- prompt_tokens_details.cached_tokens: OpenAI-style cached token count
x_nanogpt_pricing includes cache pricing breakdown fields such as cacheCreationInputTokens, cacheReadInputTokens, cacheTTL, and cacheCost.
For streaming requests, the final SSE chunk includes the same usage fields when usage is included.
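Checking for a cache hit from a parsed response body can be sketched as follows (the token counts are illustrative):

```python
# Inspecting usage fields from a parsed response body (values are
# illustrative). A cache hit is indicated by cache_read_input_tokens > 0.
usage = {
    "prompt_tokens": 12_000,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 11_800,
    "prompt_tokens_details": {"cached_tokens": 11_800},
}

cache_hit = usage.get("cache_read_input_tokens", 0) > 0
```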
Pricing
Cache writes and reads are billed differently by provider.
Gemini Pro models
| Token type | Rate (per 1M tokens) | Notes |
|---|---|---|
| Regular input | $2.00 | — |
| Cache write surcharge | +$0.375 | Added on top of input cost |
| Cache read | $0.20 | 90% cheaper than input |
For example, writing 10,000 tokens to cache costs (10k × $2.00/M) + (10k × $0.375/M) = $0.02375. Reading 10,000 cached tokens costs 10k × $0.20/M = $0.002.
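The Gemini Pro arithmetic above can be reproduced directly (rates are per 1M tokens):

```python
# Reproducing the Gemini Pro example above (rates are per 1M tokens).
tokens = 10_000
input_rate = 2.00        # $/1M regular input
write_surcharge = 0.375  # $/1M added on cache writes
read_rate = 0.20         # $/1M cache reads

write_cost = tokens * (input_rate + write_surcharge) / 1_000_000  # $0.02375
read_cost = tokens * read_rate / 1_000_000                        # $0.002
```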
Gemini Flash models
| Token type | Rate (per 1M tokens) | Notes |
|---|---|---|
| Regular input | Varies by model | — |
| Cache write surcharge | +$0.083 | Added on top of input cost |
| Cache read | 10% of input rate | 90% cheaper than input |
Claude models
| TTL | Creation multiplier on cached input tokens | Read multiplier |
|---|---|---|
| 5m | 1.25x | 0.1x |
| 1h | 2.0x | 0.1x |
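Applying the multiplier table to a base input rate, assuming a hypothetical rate of $3.00 per 1M input tokens (real rates vary by model):

```python
# Claude cache pricing from the multiplier table. The base input rate
# is a hypothetical placeholder; real rates vary by model.
input_rate = 3.00  # $/1M input tokens (assumed)

write_5m = input_rate * 1.25  # $/1M cached tokens written with a 5m TTL
write_1h = input_rate * 2.0   # $/1M cached tokens written with a 1h TTL
read = input_rate * 0.1       # $/1M cached tokens read (either TTL)
```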
TTL Options
| TTL | Duration | Description |
|---|---|---|
"5m" | 5 minutes | Default. Suitable for interactive sessions. |
"1h" | 1 hour | Extended. Useful for batch processing or long-running sessions. |
Structuring Prompts for Cache Hits
Cache hits require the cached prefix to be byte-identical across requests. Best practices:
- Put static content first (system prompt, reference docs, tool definitions).
- Keep cached content identical across requests (no timestamps, request IDs, or dynamic inserts).
- Put dynamic content after the cache boundary (typically the latest user message).
- On cache hit, TTL resets.
- If TTL expires without reuse, cache expires.
- Any prefix change (even one character) causes a cache miss and new cache creation.
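The static-first ordering above can be sketched as a message builder; the prompt and document contents are placeholders:

```python
# Static-first message ordering: the bytes before the cache boundary
# are identical across requests; only the final user message varies.
SYSTEM_PROMPT = "You are a support agent."  # static; no timestamps
REFERENCE_DOC = "<product manual text>"     # static

def build_messages(question: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": REFERENCE_DOC},
        {"role": "user", "content": question},  # dynamic, after the boundary
    ]
```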
Cache Consistency with stickyProvider
Each provider keeps its own cache. If a request fails over to another provider, the previous cache may be unavailable.
If cache consistency matters more than availability, set:
- stickyProvider: false (default): the request may succeed even if routing changes, but you might rebuild caches.
- stickyProvider: true: if a fallback would be required, the request returns 503 instead.
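A hypothetical client-side strategy: request sticky routing first, and retry without stickiness (accepting a cache rebuild) when a 503 signals that a fallback would be needed. The `post` parameter stands in for whatever HTTP client issues the request:

```python
# Hypothetical retry strategy around stickyProvider. `post` is any
# callable that sends the payload and returns a parsed response dict.
def send_with_sticky_fallback(payload: dict, post) -> dict:
    resp = post(payload)
    if resp["status"] == 503 and payload["promptCaching"].get("stickyProvider"):
        retry = dict(payload)
        retry["promptCaching"] = {**payload["promptCaching"], "stickyProvider": False}
        return post(retry)  # may rebuild caches on another provider
    return resp
```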
NanoGPT Web UI
In the NanoGPT web UI, supported models show a prompt caching toggle where you can choose cache duration.
Limitations and Caveats
- Provider-side minimum token thresholds still apply before a cache entry is created.
- A maximum of 4 cache breakpoints (cache_control) are supported per request.
- Some models report aggregate prompt usage differently; use cache_creation_input_tokens and cache_read_input_tokens for authoritative cached token counts.
- On cache hits, a small non-zero cache_creation_input_tokens can appear due to per-request overhead and does not necessarily indicate a cache miss.