Prompt Caching, Batches, and Cost Optimization

If the Messages API is the basic engine, then caching and the Batch API are the transmission that makes it viable for production workloads. Without them, a solid prototype can easily turn into a $500/month bill for something that could have cost $50.

The Economics of Prompt Caching: What You Pay For and How Much You Save

Every time the model processes a request, it "reads" all incoming tokens — the system prompt, conversation history, tool definitions. If that data is the same from request to request, you're paying to read the same content over and over again.

Prompt caching solves this with breakpoints. You mark a content block with a cache_control field, and the whole prompt prefix up to and including that block is cached on Anthropic's infrastructure.

Operation	Cost (relative to base input price)
Cache write (TTL 5 min)	1.25×
Cache write (TTL 1 hour)	2.00×
Cache read	0.10×
Regular uncached token	1.00×

Writes are more expensive, so the first call always costs more than usual. But every subsequent read within the TTL costs a tenth of the base price, which means caching already pays for itself on the second request: 1.25 + 0.10 = 1.35 versus 2.00 uncached. For a 2,000-token system prompt with hundreds of requests per day, the math works in your favor very quickly.
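The break-even arithmetic can be sketched in a few lines. This is a minimal illustration that uses only the multipliers from the table; the function names are hypothetical, and it assumes every request arrives before the cache expires.

```python
# Relative prices from the table, as multiples of the base input price.
WRITE_5M = 1.25
WRITE_1H = 2.00
READ = 0.10


def cached_cost(n_requests: int, write_multiplier: float = WRITE_5M) -> float:
    """Relative cost of the cached prefix: one write, then reads only.

    Assumption: the cache never expires between requests.
    """
    if n_requests <= 0:
        return 0.0
    return write_multiplier + READ * (n_requests - 1)


def uncached_cost(n_requests: int) -> float:
    """Relative cost of sending the same prefix with no caching."""
    return float(max(n_requests, 0))


for n in (1, 2, 10, 100):
    saving = 1 - cached_cost(n) / uncached_cost(n)
    print(f"{n:>3} requests: cached {cached_cost(n):.2f} "
          f"vs uncached {uncached_cost(n):.2f} (saving {saving:.0%})")
```

With the 1-hour write price the break-even moves from the second request to the third, which is worth checking before switching TTLs on a low-traffic prompt.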

Syntax: Two Ways to Place Breakpoints

Automatic is the simplest option. Pass cache_control at the top level of the request, and the API applies the breakpoint to the last cacheable block and moves it forward as the conversation grows:

response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},   # top-level field: enables automatic caching
    system="You are a code analysis assistant...",
    messages=[{"role": "user", "content": query}]
)

Explicit (block-level) — when you need precise control over what gets cached. For example, a large static system prompt combined with a changing conversation:

response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": large_static_instructions,   # 3000 tokens
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": user_query}]  # changes each time
)

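You can place up to four explicit breakpoints in a single request. This matters in long conversations: at each breakpoint the system looks back only about 20 content blocks for an existing cache entry, so if more blocks than that have been added since the last cached position, no match is found. Placing a breakpoint every 15–20 messages keeps a cached entry within reach.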
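As an illustration of using more than one explicit breakpoint, the sketch below marks both the tool definitions and the system prompt. The helper name build_request is hypothetical; it only assembles keyword arguments and makes no API call.

```python
from typing import Any

EPHEMERAL = {"type": "ephemeral"}


def build_request(tools: list[dict[str, Any]], system_text: str,
                  messages: list[dict[str, Any]]) -> dict[str, Any]:
    """Build kwargs for client.messages.create with two explicit breakpoints."""
    kwargs: dict[str, Any] = {
        "model": "claude-opus-4-8",
        "max_tokens": 1024,
        # Breakpoint on the static system prompt.
        "system": [{"type": "text", "text": system_text,
                    "cache_control": dict(EPHEMERAL)}],
        "messages": messages,
    }
    if tools:
        # Copy so the caller's tool definitions are not mutated.
        cached_tools = [dict(t) for t in tools]
        # Breakpoint on the last tool covers the whole tools array.
        cached_tools[-1]["cache_control"] = dict(EPHEMERAL)
        kwargs["tools"] = cached_tools
    return kwargs
```

The result would be passed as client.messages.create(**build_request(tools, system_text, messages)). Two breakpoints leave two of the four available for the conversation itself.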

Check Yourself

You placed `cache_control` on the last user message in the conversation. Will the cache be hit? What will `cache_read_input_tokens` show on the next request?

TTL: 5 Minutes or 1 Hour

By default, the cache lives for 5 minutes, and each cache hit refreshes that timer at no extra cost, which is enough for interactive sessions. If prompts are reused less frequently (batch processing, overnight jobs), specify an explicit "ttl": "1h":

"cache_control": {"type": "ephemeral", "ttl": "1h"}

Writing with a one-hour TTL costs 2× the base price instead of 1.25×, but the cache survives longer gaps between requests. For batch tasks (covered below), a one-hour TTL is usually the safer choice, since a batch may take longer than 5 minutes to process.
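One way to encode that choice is a small helper keyed on the expected gap between requests. This is a rule-of-thumb sketch, not a cost-optimal policy: the function name is hypothetical and the cutoffs are simply the two TTLs.

```python
from typing import Optional


def choose_cache_control(expected_gap_seconds: float) -> Optional[dict]:
    """Pick a cache_control value from the expected gap between requests.

    Returns None when the gap exceeds both TTLs, in which case a cache
    write would expire before it could be read.
    """
    if expected_gap_seconds < 5 * 60:
        return {"type": "ephemeral"}               # default 5-minute TTL
    if expected_gap_seconds < 60 * 60:
        return {"type": "ephemeral", "ttl": "1h"}  # pricier write, longer life
    return None
```

For example, a job that fires every 20 minutes would get the one-hour variant, while a nightly job would skip caching altogether.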

How to Verify the Cache Is Working

Inspect the usage object on the response:

print(response.usage.cache_creation_input_tokens)  # written to cache
print(response.usage.cache_read_input_tokens)       # read from cache
print(response.usage.input_tokens)                  # uncached remainder

If cache_read_input_tokens is greater than zero, it's a cache hit. If the cached prefix shows up entirely in cache_creation_input_tokens, this was the first call or the cache had expired. If both are zero with cache_control set, the prompt is most likely below the minimum cacheable length (for claude-opus-4-8, at least 1,024 tokens are required).

A common pitfall: in a workload of independent single-turn requests, a developer places the only breakpoint on the user message, which is different every time. The cache is written on every request and never read, so money is spent purely on writes. The rule: place a breakpoint on the last block that stays the same between requests, typically the end of the system prompt or the tool definitions block.
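The usage checks in this section can be folded into one small classifier for logs. The function name and the returned labels are hypothetical; the inputs are the two cache counters from response.usage.

```python
from typing import Optional


def cache_status(cache_creation: Optional[int], cache_read: Optional[int]) -> str:
    """Classify one response by its cache counters."""
    created = cache_creation or 0  # the counters may be missing or None
    read = cache_read or 0
    if read > 0:
        return "hit"         # at least part of the prefix came from the cache
    if created > 0:
        return "write"       # first call, or the previous entry expired
    return "not cached"      # e.g. prompt below the minimum cacheable length
```

Typical use would be cache_status(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens). A stream of "write" results with no "hit" in between is the signature of the pitfall described above.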

Message Batches API: 50% Off for Async Processing

When an immediate response isn't required — analytics, dataset evaluation, bulk description generation — switch to the Batches API. It processes requests asynchronously and costs exactly half the standard price for both input and output.

Batch limits: up to 100,000 requests or 256 MB, whichever comes first. Most batches complete within an hour; the maximum is 24 hours. Results are stored for 29 days.
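If a workload can exceed those limits, it has to be split before submission. The sketch below is one way to do it, under two assumptions: the request size is approximated by the JSON-serialized length of each request dict, and 256 MB is taken as 256 × 1024 × 1024 bytes. The function name is hypothetical.

```python
import json
from typing import Any, Iterable, Iterator

MAX_REQUESTS = 100_000
MAX_BYTES = 256 * 1024 * 1024  # assumption: MB interpreted as MiB


def split_into_batches(requests: Iterable[dict[str, Any]],
                       max_requests: int = MAX_REQUESTS,
                       max_bytes: int = MAX_BYTES) -> Iterator[list[dict[str, Any]]]:
    """Yield lists of requests that each stay within both limits."""
    chunk: list[dict[str, Any]] = []
    size = 0
    for req in requests:
        # Approximate payload size; the server may count bytes differently.
        req_bytes = len(json.dumps(req).encode("utf-8"))
        if chunk and (len(chunk) >= max_requests or size + req_bytes > max_bytes):
            yield chunk
            chunk, size = [], 0
        chunk.append(req)
        size += req_bytes
    if chunk:
        yield chunk
```

Each yielded list can then be submitted as its own batch. Leaving some headroom below the byte limit is a sensible precaution, since the size estimate here is only approximate.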

import anthropic
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request

client = anthropic.Anthropic()

# Submit a batch
batch = client.messages.batches.create(
    requests=[
        Request(
            custom_id=f"review-{i}",
            params=MessageCreateParamsNonStreaming(
                model="claude-opus-4-8",
                max_tokens=512,
                messages=[{"role": "user", "content": code_snippets[i]}],
            ),
        )
        for i in range(len(code_snippets))
    ]
)
print(f"Batch created: {batch.id}")

Polling Status and Retrieving Results

import time

# Wait for completion
while True:
    batch = client.messages.batches.retrieve(batch.id)
    if batch.processing_status == "ended":
        break
    counts = batch.request_counts
    print(f"Succeeded: {counts.succeeded}, still processing: {counts.processing}")
    time.sleep(60)  # check once per minute

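# Read results (streamed as JSONL; order is not guaranteed, so match on custom_id)
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        text = result.result.message.content[0].text
        print(f"{result.custom_id}: {text[:80]}...")
    elif result.result.type == "errored":
        print(f"{result.custom_id}: error: {result.result.error}")
    else:
        # "canceled" or "expired": there is no message and no error payload
        print(f"{result.custom_id}: {result.result.type}")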

Limitations: streaming (stream: true), fast mode, and max_tokens: 0 are not supported in batches; they are specific to synchronous mode. For batches with a shared system prompt, use "ttl": "1h" in cache_control; otherwise the 5-minute cache may expire before all requests in the batch have had a chance to read it. Requests in a batch are processed concurrently and in no fixed order, so cache hits within a batch are best-effort rather than guaranteed.
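A batch with a shared, cached system prompt might be assembled like this. The helper name is hypothetical, and the requests are plain dicts with the same shape as the Request objects used earlier.

```python
from typing import Any


def build_batch_requests(shared_system: str,
                         prompts: list[str]) -> list[dict[str, Any]]:
    """Build batch requests that all share one cacheable system prompt."""
    system = [{
        "type": "text",
        "text": shared_system,
        # One-hour TTL so the entry can outlive a slow batch.
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    }]
    return [
        {
            "custom_id": f"review-{i}",
            "params": {
                "model": "claude-opus-4-8",
                "max_tokens": 512,
                "system": system,  # identical prefix in every request
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(prompts)
    ]
```

The list would be passed as client.messages.batches.create(requests=build_batch_requests(system_prompt, code_snippets)). Keeping the system block byte-for-byte identical across requests is what makes a shared cache entry possible.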

Check Yourself

You are sending a batch of 500 requests with a shared system prompt of 5,000 tokens. Which `cache_control` TTL would you choose — 5 minutes or 1 hour — and why?

Counting Tokens Before Sending

Billing surprises usually come from not knowing the actual size of a request. The API has a dedicated token-counting endpoint, exposed in the SDK as client.messages.count_tokens, that returns the input token count without generating a response:

# Does not generate a response — only counts tokens
count = client.messages.count_tokens(
    model="claude-opus-4-8",
    system=system_prompt,
    tools=tool_definitions,
    messages=conversation_history,
)
print(f"Input: {count.input_tokens} tokens")

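# Check whether we're approaching the context window limit
# (the 180_000 threshold depends on the model's context window)
if count.input_tokens > 180_000:
    # Time to shorten the history, e.g. by summarizing older turns.
    # truncate_history is an application-defined helper, not part of the SDK.
    truncate_history(conversation_history)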

Useful in three scenarios: estimating cost before an expensive request, preventing conversation history from overflowing the context window, and testing whether tool definitions meet the caching minimum.
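For the first scenario, the token count converts to a cost estimate with simple arithmetic. In this sketch the function name is hypothetical and the price per million tokens is a parameter supplied by the caller, not a quoted rate; the batch flag applies the 50% discount described earlier.

```python
def estimate_input_cost(input_tokens: int, usd_per_million: float,
                        batch: bool = False) -> float:
    """Estimate the input cost of one request in USD.

    usd_per_million is the caller's current base input price per
    million tokens (an assumption the caller must supply).
    """
    cost = input_tokens / 1_000_000 * usd_per_million
    return cost / 2 if batch else cost


# Illustrative price only; look up the real rate for your model.
print(f"${estimate_input_cost(150_000, usd_per_million=5.0):.4f}")
```

Feeding count.input_tokens into this helper before a large request gives a rough upper bound on the input cost, since cached reads would be cheaper.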

Cost Reduction Strategies: A Practical Checklist

Cache all static content first. The order of blocks matters: cacheable content must come before content that changes. The API assembles the cached prefix in a fixed order (tools first, then system, then messages), so the typical layout is: tool definitions → system prompt → few-shot examples → conversation history → current user question. The breakpoint goes after the last static element.

Choose the right model for the task. Haiku 4.5 costs 5× less than Opus 4.8 on output tokens ($5 vs. $25 per million). For classification, structured data extraction, and short responses, Haiku performs excellently. Opus is for complex reasoning, architectural decisions, and code with intricate dependencies.

Use the Batch API for everything async. PR code reviews, log analysis, bulk documentation generation — none of these require an immediate response, and all get an instant 50% discount.

Don't pollute the context. A long conversation history costs money. Use /compact in Claude Code and equivalent summarization in API applications. For more on context management, see Managing the Context Window.

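Pre-warm the cache before peak load. If you know when requests will start, send a dummy request with max_tokens: 0 shortly beforehand so the cache is already warm. The warm-up has to land within the cache TTL of the first real request: with the default 5-minute TTL, an entry written 10 minutes early will already have expired, so either send it in the last few minutes or set "ttl": "1h":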

# Warm up before peak load
client.messages.create(
    model="claude-opus-4-8",
    max_tokens=0,  # does not generate a response
    system=[{"type": "text", "text": big_system_prompt,
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "warmup"}]
)

Monitor your actual cache hit rate. In production, log cache_read_input_tokens / (cache_read_input_tokens + cache_creation_input_tokens + input_tokens). If the hit rate is below 70% even though prompts are largely identical, either the breakpoints are misplaced or the content is changing in a non-obvious way (for example, a timestamp inside the system prompt).
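The hit-rate formula translates directly into code. This is a minimal sketch with hypothetical function names; the three inputs are the counters from response.usage, and the 70% threshold is the one suggested in the checklist.

```python
def cache_hit_rate(cache_read: int, cache_creation: int, uncached: int) -> float:
    """Share of input tokens served from the cache (0.0 to 1.0)."""
    total = cache_read + cache_creation + uncached
    return cache_read / total if total else 0.0


def aggregate_hit_rate(usages: list[tuple[int, int, int]]) -> float:
    """Hit rate over many responses, weighted by token volume.

    Each tuple is (cache_read, cache_creation, uncached) for one response.
    """
    read = sum(u[0] for u in usages)
    created = sum(u[1] for u in usages)
    uncached = sum(u[2] for u in usages)
    return cache_hit_rate(read, created, uncached)


def needs_attention(rate: float, threshold: float = 0.70) -> bool:
    """Flag a hit rate below the checklist threshold."""
    return rate < threshold
```

Aggregating by tokens rather than averaging per-request rates keeps a handful of tiny requests from distorting the figure.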

Check Yourself

You have an agentic loop with tool use: a system prompt (2,000 tokens), 5 tool definitions (1,500 tokens), a conversation history (growing), and the current question. Where should you place the breakpoint to maximize the cache hit rate?