Imagine you are running a SaaS company with an AI-powered customer support chatbot. Your bot needs to know everything about your product: documentation, FAQs, pricing details, and recent updates. That adds up to about 50,000 tokens of context sent with every single user query.
Now, let’s do the math like a CFO:
10,000 daily conversations
Average 5 messages per conversation = 50,000 API calls per day
Each call sends 50,000 tokens of product docs + user message
At Claude's pricing: ~$3 per million input tokens
Cost just for that context: ~$7,500 per day, roughly $225,000 per month
Did you notice the kicker?
Roughly 99% of that context is identical across conversations, yet you keep paying to send the same manual over and over again.
This is where prompt caching saves your budget.
What is Prompt Caching?
Prompt caching is like copy-paste memory for your AI. The model reads your entire product documentation once, instead of re-reading it on every API call, and stores the processed result in fast-access memory for subsequent requests.
It's similar to how your web browser caches images—instead of re-downloading your company logo every time you visit a page, it loads from cache. But instead of images, here we're caching token sequences.
The Technical Reality Behind It
Here's what actually happens under the hood:
Without Caching:
Your API request arrives with 50,000 tokens of docs + user message
Claude's servers process all 50,000 tokens through its neural network layers
Generate response
Discard everything
Next request? Start from scratch
With Caching:
First request processes all 50,000 tokens normally
System creates a "checkpoint" of the processed state after the cached portion
Stores this checkpoint for ~5 minutes (exact TTL varies)
Subsequent requests within that window skip re-processing those tokens
Start directly from the checkpoint + new user input
The cached content is stored as already-computed key-value pairs from the attention mechanism. Instead of recomputing these for identical text, the model loads them directly, saving both time and compute.
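Conceptually, it works like the toy sketch below (this is not any provider's real implementation; process_prefix and generate are hypothetical stand-ins for the model's forward pass): the expensive work on the static prefix happens once per unique prefix and is reused while the cache entry lives.

# Toy illustration only: real providers cache attention key/value states, not strings.
import hashlib

def process_prefix(prefix_text):
    """Hypothetical stand-in for the expensive forward pass over the static prefix."""
    return {"kv_state": f"processed {len(prefix_text)} chars"}

def generate(kv_state, new_input):
    """Hypothetical stand-in for generation that continues from the cached state."""
    return f"answer to {new_input!r} using {kv_state['kv_state']}"

_prefix_cache = {}

def answer(static_docs, user_message):
    key = hashlib.sha256(static_docs.encode()).hexdigest()
    if key not in _prefix_cache:
        _prefix_cache[key] = process_prefix(static_docs)   # cache miss: pay the full cost once
    return generate(_prefix_cache[key], user_message)      # cache hit: skip the prefix work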
Now, let’s run the numbers again.
Before caching:
50,000 API calls/day × 50,000 context tokens = 2.5 billion input tokens/day
Cost: ~$7,500/day
After caching:
A call that misses the cache pays full price (plus a small cache-write premium) for all 50,000 tokens
A call that hits the cache pays the cached rate (90% discount) for those 50,000 tokens, plus full price for the short user message
Cached tokens: ~$0.30 per million
Regular tokens: ~$3 per million
Because every conversation shares the same docs and traffic is continuous, almost every call is a cache hit
New cost: ~$1,000/day
You just saved ~$6,500 a day. That's roughly an 87% reduction in context costs, and the quick check below reproduces the arithmetic.
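A back-of-the-envelope sanity check, using the approximate rates quoted above (the "every call hits" line is the idealized warm-cache case, before cache writes and user messages are added back in):

# Back-of-the-envelope cost check for the support-bot scenario above.
CALLS_PER_DAY    = 50_000                 # 10,000 conversations x 5 messages
CONTEXT_TOKENS   = 50_000                 # product docs sent with every call
PRICE_INPUT      = 3.00 / 1_000_000       # ~$3 per million input tokens
PRICE_CACHE_READ = 0.30 / 1_000_000       # ~90% discount on cached tokens

tokens_per_day = CALLS_PER_DAY * CONTEXT_TOKENS       # 2.5 billion tokens/day
without_cache  = tokens_per_day * PRICE_INPUT         # ~$7,500/day
with_cache     = tokens_per_day * PRICE_CACHE_READ    # ~$750/day if every call hits

print(f"without cache: ${without_cache:,.0f}/day")
print(f"with cache:    ${with_cache:,.0f}/day (plus cache writes and user messages)")
print(f"savings:       ~{1 - with_cache / without_cache:.0%}")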
Speed Benefits
Caching isn't just about money—it's about speed as well:
Without cache: Time-to-first-token (TTFT) ~2-3 seconds for large contexts
With cache: TTFT ~0.5-1 second
For users, this feels dramatically more responsive. In customer support, that difference has a real impact on satisfaction scores.
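If you want to verify this on your own traffic, a rough way to measure TTFT is to stream the response and time the arrival of the first text chunk. A minimal sketch using the anthropic Python SDK's streaming helper (BIG_STATIC_CONTEXT is a placeholder for your docs); run it before and after enabling caching, as shown in the next section, and compare:

import time
import anthropic

client = anthropic.Anthropic()
BIG_STATIC_CONTEXT = "..."  # placeholder: your large block of product docs

def measure_ttft(user_message):
    """Return seconds from sending the request to receiving the first text chunk."""
    start = time.perf_counter()
    with client.messages.stream(
        model="claude-3-5-sonnet-20241022",
        max_tokens=256,
        system=BIG_STATIC_CONTEXT,
        messages=[{"role": "user", "content": user_message}],
    ) as stream:
        for _ in stream.text_stream:
            return time.perf_counter() - start   # stop timing at the first chunk
    return time.perf_counter() - start

print(measure_ttft("What does the premium plan include?"))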
How to Implement It (The "Static-First" Rule)
For caching to work, providers like OpenAI and Anthropic rely on prefix matching. If even one character changes inside the cached prefix, the cache breaks from that point on.
Step 1: Structure Your Prompt
You must organize your prompt so the "Never-Changing" parts are at the top and the "Dynamic" parts are at the bottom.
✅ Correct:
[System Instructions] + [Tool Definitions] + [Fixed Context] + [User Query]
❌ Incorrect:
[User Query] + [System Instructions]
(The query changes every time, so the cache never triggers for the instructions after it. A minimal prompt-builder that enforces the correct ordering is sketched below.)
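Here is that ordering in code. STATIC_INSTRUCTIONS and PRODUCT_DOCS are placeholders for your own content; the provider-specific cache markers come in Step 2:

# Static-first: everything that never changes goes before anything that does.
STATIC_INSTRUCTIONS = "You are a helpful support agent for Acme..."   # placeholder
PRODUCT_DOCS = "...the full 50,000-token documentation..."            # placeholder

def build_prompt(user_query):
    return {
        "system": STATIC_INSTRUCTIONS + "\n\n" + PRODUCT_DOCS,   # stable, cacheable prefix
        "messages": [{"role": "user", "content": user_query}],   # dynamic suffix, never cached
    }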
Step 2: Provider-Specific Implementation
| Provider | Implementation Method | Pricing Benefit |
| --- | --- | --- |
| OpenAI | Automatic: triggers on prompts longer than 1,024 tokens; no code changes needed. | ~50% off cached input tokens. |
| Anthropic | Manual: you add a cache_control breakpoint to the content blocks you want cached. | ~90% off cached input tokens. |
| Gemini | Manual: you create a cached-content object up front and reference it in later requests. | ~75% off cached input tokens. |
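With OpenAI the caching itself is automatic, so the implementation work is just keeping your prefix stable and confirming the cache is kicking in. A minimal sketch, assuming the openai Python SDK; on repeat requests the usage details report how many prompt tokens were served from cache:

from openai import OpenAI

client = OpenAI()
STATIC_DOCS = "..."  # placeholder: >1,024 tokens of stable product docs

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": STATIC_DOCS},            # stable prefix
        {"role": "user", "content": "How do refunds work?"},   # dynamic suffix
    ],
)

details = response.usage.prompt_tokens_details
print(response.usage.prompt_tokens, details.cached_tokens)  # cached_tokens > 0 means a cache hit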
Implementation Example (Anthropic/Claude)
Anthropic gives you the most control by letting you set explicit "breakpoints." Let’s walk through a Jira agent as the example:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert Jira Assistant. Here is the 5,000-word JQL manual...",
            "cache_control": {"type": "ephemeral"}  # <--- THE BREAKPOINT
        }
    ],
    messages=[{"role": "user", "content": "Find all open bugs in Project X"}]
)

Key implementation details:
Cache breakpoints: You can set up to four cache breakpoints, on tool definitions, system prompt blocks, and message content blocks. Put your static content there.
Minimum size: Content must be at least 1,024 tokens to be cacheable. Smaller content won't cache.
Cache TTL: Caches live for ~5 minutes. Each cache hit refreshes the timer.
Prefix matching: Only the exact prefix matches. If your docs are "ABC" and you send "ABD", no cache hit.
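To confirm the cache is actually being hit, check the usage block Anthropic returns with every response: cache_creation_input_tokens counts tokens written to the cache (a miss) and cache_read_input_tokens counts tokens served from it (a hit). Reusing the response from the Jira example above:

usage = response.usage
print("written to cache:", usage.cache_creation_input_tokens)  # > 0 on the first call (cache miss)
print("read from cache: ", usage.cache_read_input_tokens)      # > 0 on later calls (cache hit)
print("uncached input:  ", usage.input_tokens)                 # the non-cached part, e.g. the user query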
Advanced Caching Patterns for Product Managers
Pattern 1: Tiered Caching
Here each level caches independently: when the promotions update, only the later layers are invalidated, and the product-docs prefix is still reused.
# Cache multiple levels
messages = [
    {
        "role": "user",
        "content": [
            # Level 1: Rarely changes (product docs)
            {"type": "text", "text": PRODUCT_DOCS, "cache_control": {"type": "ephemeral"}},
            # Level 2: Changes daily (today's promotions)
            {"type": "text", "text": TODAY_PROMOTIONS, "cache_control": {"type": "ephemeral"}},
            # Level 3: User-specific (conversation context)
            {"type": "text", "text": conversation_context, "cache_control": {"type": "ephemeral"}}
        ]
    }
]

Pattern 2: Code Analysis Tools
Here each PR review reuses the cached codebase context.
# Cache entire codebase for AI code reviews
CODEBASE = load_repository_files()  # 100K tokens

def review_pr(pr_diff):
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,  # max_tokens is required by the Messages API
        system=[
            {"type": "text", "text": CODEBASE, "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": "You are a senior code reviewer..."}
        ],
        messages=[{"role": "user", "content": f"Review this PR:\n{pr_diff}"}]
    )
    return response

Pattern 3: RAG with Caching
Popular queries retrieve the same documents in the same order, so identical contexts turn into cache hits across users.
def rag_with_cache(query):
    # Retrieve relevant docs
    relevant_docs = vector_search(query, top_k=20)

    # Combine into cached context
    context = "\n\n".join(relevant_docs)

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,  # max_tokens is required by the Messages API
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": context, "cache_control": {"type": "ephemeral"}},
                {"type": "text", "text": f"Question: {query}"}
            ]
        }]
    )
    return response

Don't cache if any of the following apply (a quick decision helper follows the list):
Content changes every request (personalized data)
Content is under 1,024 tokens (won't cache anyway)
Requests are spread out >5 minutes (cache expires)
You have effectively infinite variety in context (there is no repeated prefix to match)
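Here is the promised decision helper, as a rough sketch built from those rules; the 1,024-token minimum and ~5-minute TTL are the Anthropic figures quoted earlier, and the four-characters-per-token estimate is only an approximation:

def should_cache(content, content_is_stable, seconds_between_requests):
    """Rough check of whether a cache breakpoint on this content is worth it."""
    approx_tokens = len(content) / 4            # crude estimate: ~4 characters per token
    if not content_is_stable:                   # changes every request: the prefix never repeats
        return False
    if approx_tokens < 1_024:                   # below the minimum cacheable size
        return False
    if seconds_between_requests > 5 * 60:       # cache will expire before the next request
        return False
    return True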
Now let’s look at an example of bad caching: if the user profile is different on every request, caching it buys you nothing, because each new profile is a new prefix and therefore a cache miss.
# DON'T DO THIS - user data changes every time
user_profile = get_user_data(user_id)  # Different for each user

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": user_profile, "cache_control": {"type": "ephemeral"}},  # Won't help!
        {"type": "text", "text": query}
    ]
}]

Measuring Success
Track these metrics:
cache hit rate, cost savings, latency improvement
# After each API call
usage = response.usage

cache_hit_rate = usage.cache_read_input_tokens / (usage.cache_read_input_tokens + usage.input_tokens)
cost_savings = (usage.cache_read_input_tokens * 2.70) / 1_000_000  # ~$2.70 saved per million cached tokens ($3.00 - $0.30)
latency_improvement = time_without_cache / time_with_cache         # ratio > 1 means the cached call was faster

For our customer support scenario, you should monitor the following (a per-conversation rollup is sketched after the list):
Cache hit rate (target: >80% after first message)
Cost per conversation (should drop 70-90%)
P95 response time (should drop 50%+)
Cache miss rate (indicates context changes or cache expiry issues)
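A sketch of how the per-conversation numbers might be rolled up from Anthropic's usage objects. The per-million rates are the approximate figures used earlier in this article; the ~25% cache-write premium over the base input rate is an assumption based on Anthropic's published pricing for the default 5-minute cache, so substitute your model's actual rates:

# Approximate per-million-token rates (Claude Sonnet class) used earlier in this article.
PRICE_INPUT       = 3.00
PRICE_CACHE_WRITE = 3.75   # assumption: ~25% premium over the base input rate for cache writes
PRICE_CACHE_READ  = 0.30

def conversation_cost(usages):
    """Input-side cost of one conversation, summed over each response's usage block."""
    total = 0.0
    for u in usages:
        total += u.input_tokens * PRICE_INPUT
        total += u.cache_creation_input_tokens * PRICE_CACHE_WRITE
        total += u.cache_read_input_tokens * PRICE_CACHE_READ
    return total / 1_000_000

def conversation_cache_hit_rate(usages):
    """Share of all input tokens in the conversation that were served from cache."""
    read = sum(u.cache_read_input_tokens for u in usages)
    fresh = sum(u.input_tokens + u.cache_creation_input_tokens for u in usages)
    return read / (read + fresh) if (read + fresh) else 0.0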
P95 response time means "95th percentile response time" - it's the response time where 95% of requests are faster and only 5% are slower.
For example, imagine you made 100 API calls to your chatbot today and sort all the response times from fastest to slowest. The P95 is the response time of the 95th request in that sorted list.
95 requests responded in under 2 seconds
5 requests took longer (maybe 3, 4, or 5 seconds)
Your P95 = 2 seconds
Why P95 Matters More Than Average
Suppose you're tracking your customer support chatbot's performance:
Scenario A - Using Average:
99 requests: 0.5 seconds each
1 request: 50 seconds (something went wrong)
Average = (99 × 0.5 + 1 × 50) / 100 = 0.995 seconds
You'd report: "Our average response time is under 1 second!"
But one poor customer waited 50 seconds and rage-quit. The average hides this problem.
Scenario B - Using P95:
Same data:
P95 = 0.5 seconds ✓
P99 = 50 seconds ⚠️ (warning sign!)
Now you can see: "Most users get fast responses, but our worst 1% has a terrible experience."
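A few lines of Python make the same point with this data, using the simple sort-and-pick method described earlier (monitoring tools define tail percentiles slightly differently, so values like P99 can vary by tool):

import math

def percentile(values, p):
    """Nearest-rank percentile: sort and take the value covering p% of requests."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

times = [0.5] * 99 + [50.0]   # scenario data: 99 fast requests, one 50-second outlier

print(f"average: {sum(times) / len(times):.3f}s")   # 0.995s: looks fine, hides the outlier
print(f"p95:     {percentile(times, 95):.1f}s")     # 0.5s: most users really are fine
print(f"max:     {max(times):.1f}s")                # 50.0s: the extreme tail exposes the problem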
Why Product Managers Love P95
When you tell your CEO:
❌ "Average response time is 1.2 seconds" → They don't know what this means for user experience
✅ "95% of users get responses in under 2.5 seconds" → Clear, actionable, user-focused
The Strategic Takeaway
Prompt caching is one of those rare optimizations that improves both cost and user experience. For AI product managers, this should be a day-one implementation for any use case with:
Large, static context (docs, code, knowledge bases)
High request volume on that context
Requests clustered in time (conversations, work hours)
