Imagine you are running a SaaS company with an AI-powered customer support chatbot. Your bot needs to know everything about your product: documentation, FAQs, pricing details, and recent updates. That adds up to about 50,000 tokens of context sent with every single user query.

Now, let's do some quick CFO math:

  • 10,000 daily conversations

  • Average 5 messages per conversation = 50,000 API calls per day

  • Each call sends 50,000 tokens of product docs + user message

  • At Claude's pricing: ~$3 per million input tokens

  • Daily cost just for context: ~$7,500, or roughly $225,000 a month (see the quick sketch below)
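
Here is that arithmetic as a quick sanity check in Python; the prices and volumes are simply the assumptions listed above, not measured figures:

calls_per_day = 10_000 * 5                 # 10,000 conversations x 5 messages each
context_tokens_per_call = 50_000           # product docs sent with every call
price_per_million_input = 3.00             # ~$3 per million input tokens

daily_context_tokens = calls_per_day * context_tokens_per_call            # 2,500,000,000
daily_cost = daily_context_tokens / 1_000_000 * price_per_million_input   # ~$7,500/day
print(f"{daily_context_tokens:,} tokens/day costs ~${daily_cost:,.0f}/day")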

Did you notice the kicker?

Yes: 99% of that context is identical across all conversations, which means you are paying to send the same manual over and over again.

This is where prompt caching saves your budget.

What is Prompt Caching?

Prompt caching is like copy-paste memory for your AI. The model reads your entire product documentation once, instead of re-reading it on every API call, and saves the processed result in fast-access memory for subsequent requests.

It's similar to how your web browser caches images—instead of re-downloading your company logo every time you visit a page, it loads from cache. But instead of images, here we're caching token sequences.

The Technical Reality Behind It

Here's what actually happens under the hood:

Without Caching:

  1. Your API request arrives with 50,000 tokens of docs + user message

  2. Claude's servers process all 50,000 tokens through the model's neural network layers

  3. Generate response

  4. Discard everything

  5. Next request? Start from scratch

With Caching:

  1. First request processes all 50,000 tokens normally

  2. System creates a "checkpoint" of the processed state after the cached portion

  3. Stores this checkpoint for ~5 minutes (exact TTL varies)

  4. Subsequent requests within that window skip re-processing those tokens

  5. Start directly from the checkpoint + new user input

The cached content is stored as already-computed key-value (KV) pairs from the attention mechanism. Instead of recomputing these for identical text, the model loads them directly, saving both time and compute cost.
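
Conceptually, you can picture the provider keeping a short-lived dictionary that maps a hash of the static prefix to its already-processed state. The following is only a toy sketch of that idea, not any provider's actual implementation; expensive_forward_pass and generate are stand-ins for the real model computation:

import hashlib
import time

TTL_SECONDS = 300          # ~5-minute cache lifetime
KV_CACHE = {}              # prefix hash -> (processed state, expiry timestamp)

def expensive_forward_pass(prefix: str) -> dict:
    # Stand-in for pushing the 50,000-token prefix through the attention layers.
    return {"kv_state_for_chars": len(prefix)}

def generate(state: dict, user_message: str) -> str:
    # Stand-in for decoding a response from the checkpoint plus the new input.
    return f"answer to {user_message!r} using {state}"

def answer(static_prefix: str, user_message: str) -> str:
    key = hashlib.sha256(static_prefix.encode()).hexdigest()
    cached = KV_CACHE.get(key)
    if cached and cached[1] > time.time():
        state = cached[0]                                   # cache hit: skip re-processing
    else:
        state = expensive_forward_pass(static_prefix)       # cache miss: pay full cost once
    KV_CACHE[key] = (state, time.time() + TTL_SECONDS)      # store/refresh the checkpoint
    return generate(state, user_message)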

Now, let's do the math again.

Before caching:

  • 50,000 API calls/day × 50,000 context tokens = 2.5 billion input tokens/day

  • Cost: ~$7,500/day

After caching:

  • The 50,000-token docs prefix is identical for every conversation, so with ~50,000 calls a day it is almost always warm within the 5-minute cache window

  • The occasional call that finds the cache cold pays full price (plus a small cache-write premium); every other call reads the docs from cache at the ~90% discount, and only the short user message is billed at the regular rate

  • Cached tokens: ~$0.30 per million

  • Regular tokens: ~$3 per million

New daily cost: ~$1,000 (roughly $30,000/month)

You just saved about $6,500 a day. That's an ~87% reduction in context costs.
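
And the after-caching side of the estimate, again as a back-of-the-envelope sketch that assumes the shared docs prefix is essentially always warm (prices as above; the allowance for user messages and cache writes is a rough assumption):

cached_price = 0.30                           # ~$0.30 per million cached input tokens
daily_context_tokens = 50_000 * 50_000        # 2.5 billion tokens/day, same as before

docs_cost = daily_context_tokens / 1_000_000 * cached_price   # ~$750/day for the cached docs
misc_cost = 250                               # rough allowance for user messages + occasional cache writes
daily_cost_with_cache = docs_cost + misc_cost                 # ~$1,000/day
reduction = 1 - daily_cost_with_cache / 7_500                 # ~0.87, an ~87% reduction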

Speed Benefits

Caching isn't just about money—it's about speed as well:

  • Without cache: Time-to-first-token (TTFT) ~2-3 seconds for large contexts

  • With cache: TTFT ~0.5-1 second

For users, this feels dramatically more responsive. In customer support, that difference shows up directly in satisfaction scores.

How to Implement It (The "Static-First" Rule)

For caching to work, providers like OpenAI and Anthropic use prefix matching: if even one character at the beginning of your prompt changes, the cache breaks.

Step 1: Structure Your Prompt

You must organize your prompt so the "Never-Changing" parts are at the top and the "Dynamic" parts are at the bottom.

  • Correct: [System Instructions] + [Tool Definitions] + [Fixed Context] + [User Query]

  • Incorrect: [User Query] + [System Instructions] (The query changes every time, so nothing after it can ever be served from cache; see the sketch below this list.)
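
To make the rule concrete, here is a minimal, provider-agnostic sketch of assembling a request with the static parts first; the names and the placeholder docs string are illustrative, not taken from any SDK:

SYSTEM_INSTRUCTIONS = "You are a customer support assistant for our product."  # never changes
PRODUCT_DOCS = "...the full 50,000-token product manual goes here..."          # changes rarely

def build_messages(user_query: str) -> list[dict]:
    # Static, cacheable prefix first; only the final block varies per request.
    return [
        {"role": "system", "content": SYSTEM_INSTRUCTIONS + "\n\n" + PRODUCT_DOCS},
        {"role": "user", "content": user_query},
    ]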

Step 2: Provider-Specific Implementation

  • OpenAI: Automatic. Caching triggers on prompts >1,024 tokens; no code changes needed. Pricing benefit: ~50% off cached input tokens.

  • Anthropic: Manual. You must add a cache_control block at specific breakpoints. Pricing benefit: ~90% off cached input tokens.

  • Gemini: Manual. You create a CachedContent resource with a specific TTL (Time to Live). Pricing benefit: ~75% off cached input tokens.
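
For OpenAI, there is nothing to annotate; the useful step is verifying that the cache is actually being hit. Here is a minimal sketch using the official openai Python SDK, where LONG_STATIC_DOCS is a placeholder for your >1,024-token static prefix (on models that report it, the cached token count appears under usage.prompt_tokens_details):

from openai import OpenAI

client = OpenAI()
LONG_STATIC_DOCS = "...your static system prompt and product docs (>1,024 tokens)..."

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": LONG_STATIC_DOCS},             # identical on every call
        {"role": "user", "content": "How do I reset my password?"},
    ],
)

# cached_tokens > 0 means part of your prompt prefix was served from cache
print(response.usage.prompt_tokens_details.cached_tokens)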

Implementation Example (Anthropic/Claude)

Anthropic gives you the most control by allowing you to set "breakpoints." Let's implement it with a Jira agent as the example:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text", 
            "text": "You are an expert Jira Assistant. Here is the 5,000-word JQL manual...",
            "cache_control": {"type": "ephemeral"} # <--- THE BREAKPOINT
        }
    ],
    messages=[{"role": "user", "content": "Find all open bugs in Project X"}]
)

Key implementation details:

  1. Cache breakpoints: You can set up to four cache_control breakpoints, placed on tool definitions, system prompt blocks, or message content blocks. Everything up to and including a breakpoint is cached, so put your static content before it.

  2. Minimum size: Content must be at least 1,024 tokens to be cacheable. Smaller content won't cache.

  3. Cache TTL: Caches live for ~5 minutes. Each cache hit refreshes the timer.

  4. Prefix matching: Only the exact prefix matches. If your docs are "ABC" and you send "ABD", no cache hit.

Advanced Caching Patterns for Product Managers

Pattern 1: Tiered Caching

With this pattern, each level caches independently. When the promotions update, only that layer and anything after it is invalidated; the product docs above it stay cached.

# Cache multiple levels
messages = [
    {
        "role": "user",
        "content": [
            # Level 1: Rarely changes (product docs)
            {"type": "text", "text": PRODUCT_DOCS, "cache_control": {"type": "ephemeral"}},
            
            # Level 2: Changes daily (today's promotions)
            {"type": "text", "text": TODAY_PROMOTIONS, "cache_control": {"type": "ephemeral"}},
            
            # Level 3: User-specific (conversation context)
            {"type": "text", "text": conversation_context, "cache_control": {"type": "ephemeral"}}
        ]
    }
]

Pattern 2: Code Analysis Tools

Here, every PR review reuses the cached codebase context instead of re-processing the 100K tokens each time.

# Cache entire codebase for AI code reviews
CODEBASE = load_repository_files()  # 100K tokens

def review_pr(pr_diff):
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        system=[
            {"type": "text", "text": "You are a senior code reviewer..."},
            # Everything up to and including the breakpoint below is cached,
            # so both the static instructions and the codebase get reused.
            {"type": "text", "text": CODEBASE, "cache_control": {"type": "ephemeral"}}
        ],
        messages=[{"role": "user", "content": f"Review this PR:\n{pr_diff}"}]
    )
    return response

Pattern 3: RAG with Caching

Cache hits here require retrieval to return exactly the same documents in the same order (for example, popular queries that surface the same top results), because matching is exact-prefix; sorting retrieved chunks deterministically improves the hit rate across users.

def rag_with_cache(query):
    # Retrieve relevant docs
    relevant_docs = vector_search(query, top_k=20)
    
    # Combine into cached context
    context = "\n\n".join(relevant_docs)
    
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": context, "cache_control": {"type": "ephemeral"}},
                {"type": "text", "text": f"Question: {query}"}
            ]
        }]
    )
    return response

Don't cache if:

  • Content changes every request (personalized data)

  • Content is under 1,024 tokens (won't cache anyway)

  • Requests are spread out >5 minutes (cache expires)

  • You have effectively infinite variety in context (there is no stable prefix to match)

Now let's look at an example of bad caching:

If the content you mark for caching is different on every request (here, a per-user profile that changes with each user), the cache gets written but never read again, so it never pays off; on Anthropic you even pay a small premium for each cache write.

# DON'T DO THIS - user data changes every time
user_profile = get_user_data(user_id)  # Different for each user
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": user_profile, "cache_control": {"type": "ephemeral"}},  # Won't help!
        {"type": "text", "text": query}
    ]
}]

Measuring Success

Track these metrics: cache hit rate, cost savings, and latency improvement.

# After each API call (usage = response.usage from the Anthropic SDK; timings are your own measurements)
cache_hit_rate = usage.cache_read_input_tokens / (
    usage.cache_read_input_tokens + usage.cache_creation_input_tokens + usage.input_tokens)

cost_savings = (usage.cache_read_input_tokens * 2.70) / 1_000_000  # ~$2.70 saved per million cached tokens ($3.00 - $0.30)

latency_speedup = time_without_cache / time_with_cache  # > 1 means cached requests are faster

For our customer support scenario, you should monitor:

  • Cache hit rate (target: >80% after first message)

  • Cost per conversation (should drop 70-90%)

  • P95 response time (should drop 50%+)

  • Cache miss rate (indicates context changes or cache expiry issues)

P95 response time means "95th percentile response time": the response time that 95% of requests beat, with only 5% slower.

For example, imagine you made 100 API calls to your chatbot today. Sort all the response times from fastest to slowest; the P95 is the response time of the 95th request in that sorted list.

  • 95 requests responded in under 2 seconds

  • 5 requests took longer (maybe 3, 4, or 5 seconds)

  • Your P95 = 2 seconds
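
If you want to compute this from your own request logs, a simple nearest-rank percentile matches the "sort and take the 95th value" description above; latencies_ms is a placeholder for your measured response times:

import math

def percentile(values: list[float], pct: float) -> float:
    # Nearest-rank percentile: sort ascending and take the value at ceil(pct * n).
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [480, 510, 530, 650, 700, 2200]   # example response times in milliseconds
print("P95:", percentile(latencies_ms, 0.95))    # 2200 for this tiny sample
print("P50:", percentile(latencies_ms, 0.50))    # 530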

Why P95 Matters More Than Average

Suppose you're tracking your customer support chatbot's performance:

Scenario A - Using Average:

99 requests: 0.5 seconds each
1 request: 50 seconds (something went wrong)

Average = (99 × 0.5 + 1 × 50) / 100 = 0.995 seconds

You'd report: "Our average response time is under 1 second!"

But one poor customer waited 50 seconds and rage-quit. The average hides this problem.

Scenario B - Using P95:

Same data:
P95 = 0.5 seconds ✓
Worst request (P100) = 50 seconds ⚠️ (warning sign!)

Now you can see: "Most users get fast responses, but our worst 1% has a terrible experience."

Why Product Managers Love P95

When you tell your CEO:

  • "Average response time is 1.2 seconds" → They don't know what this means for user experience

  • "95% of users get responses in under 2.5 seconds" → Clear, actionable, user-focused

The Strategic Takeaway

Prompt caching is one of those rare optimizations that improves both cost and user experience. For AI product managers, this should be a day-one implementation for any use case with:

  1. Large, static context (docs, code, knowledge bases)

  2. High request volume on that context

  3. Requests clustered in time (conversations, work hours)
