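Prompt caching, as used in this article, means storing Claude’s responses in your own application and serving them again for identical prompts. Repeated queries bypass the Claude API entirely, which can cut your costs substantially. This is an application-level technique, distinct from the Claude API’s built-in prompt caching feature, which reuses previously processed prompt prefixes on the server side but still generates a new response.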

Let’s see this in action. Imagine we have a simple Python application that asks Claude to summarize a news article.

import os
import anthropic

client = anthropic.Anthropic(
    api_key=os.environ.get("ANTHROPIC_API_KEY"),
)

def summarize_article(article_text):
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1000,
        messages=[
            {"role": "user", "content": f"Summarize this article:\n\n{article_text}"}
        ]
    )
    return message.content[0].text

# First call - will hit the API
article1 = "The quick brown fox jumps over the lazy dog. This is a classic sentence used for testing."
summary1 = summarize_article(article1)
print(f"Summary 1: {summary1}\n")

# Second call with the exact same article - without caching, this also hits the API
summary2 = summarize_article(article1)
print(f"Summary 2: {summary2}\n")

Without prompt caching, both summarize_article calls above would incur a cost. The Claude API processes each request independently. The model doesn’t inherently "remember" previous identical inputs within the same session or across different sessions.

To implement caching, we need a way to store and retrieve responses based on the prompt. A simple in-memory dictionary or a more persistent key-value store like Redis can work. The core idea is to:

  1. Before calling the API: Check if the exact prompt already exists in our cache.
  2. If found: Return the cached response immediately. This saves an API call and its associated cost.
  3. If not found: Call the Claude API, get the response, store the prompt and its response in the cache, and then return the response.

Here’s an example using an in-memory dictionary for caching:

import os
import anthropic

client = anthropic.Anthropic(
    api_key=os.environ.get("ANTHROPIC_API_KEY"),
)

# Simple in-memory cache
prompt_cache = {}

def summarize_article_with_cache(article_text):
    prompt = f"Summarize this article:\n\n{article_text}"

    if prompt in prompt_cache:
        print("--- Cache HIT ---")
        return prompt_cache[prompt]
    else:
        print("--- Cache MISS ---")
        message = client.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=1000,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        response_text = message.content[0].text
        prompt_cache[prompt] = response_text  # Store in cache
        return response_text

# First call - Cache MISS
article1 = "The quick brown fox jumps over the lazy dog. This is a classic sentence used for testing."
summary1 = summarize_article_with_cache(article1)
print(f"Summary 1: {summary1}\n")

# Second call with the exact same article - Cache HIT
summary2 = summarize_article_with_cache(article1)
print(f"Summary 2: {summary2}\n")

# Third call with a different article - Cache MISS
article2 = "The weather today is sunny with a slight breeze. Perfect for a picnic."
summary3 = summarize_article_with_cache(article2)
print(f"Summary 3: {summary3}\n")

When you run this, you’ll see "Cache MISS" for the first and third calls, and "Cache HIT" for the second. The second call for article1 bypasses the API entirely, saving you money. The prompt_cache dictionary stores the prompt string as the key and the Claude response string as the value.
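One caveat with keying on the prompt string alone: two requests that share a prompt but differ in model or max_tokens would map to the same entry. A minimal sketch of a composite key follows, assuming a hypothetical helper named make_cache_key (it is not part of the anthropic SDK) that hashes every request parameter likely to affect the output:

```python
import hashlib
import json

def make_cache_key(model, max_tokens, messages):
    """Hypothetical helper: build a stable key from everything that shapes the response."""
    payload = json.dumps(
        {"model": model, "max_tokens": max_tokens, "messages": messages},
        sort_keys=True,       # stable field order, so equal requests hash equally
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

messages = [{"role": "user", "content": "Summarize this article:\n\nSome text."}]
key_a = make_cache_key("claude-3-5-sonnet-20240620", 1000, messages)
key_b = make_cache_key("claude-3-5-sonnet-20240620", 500, messages)
print(key_a == key_b)  # False: a different max_tokens gives a different key
```

Hashing also keeps keys at a fixed length, which helps if the cache later moves to an external store such as Redis.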

The primary benefit is cost reduction. By avoiding redundant API calls, you directly reduce your Claude API bill. This is especially powerful for applications that ask the same questions or process the same data over and over, because only exact repeats can be served from the cache. Think about:

  • Customer support bots: Answering the same FAQs repeatedly.
  • Content summarization services: Users often ask for summaries of popular articles.
  • Data analysis tools: Running the same queries on datasets.
  • Educational platforms: Explaining core concepts multiple times.
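To get a rough feel for the savings in scenarios like these, the arithmetic can be sketched as below. The flat per-call cost and the hit rate are made-up numbers for illustration, not real Claude API pricing, and both function names are hypothetical:

```python
def estimated_api_calls(total_requests, hit_rate):
    """Requests that still reach the API at a given cache hit rate (0.0 to 1.0)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0.0 and 1.0")
    return round(total_requests * (1.0 - hit_rate))

def estimated_cost(total_requests, hit_rate, cost_per_call):
    """Hypothetical flat cost model: every cache miss costs cost_per_call."""
    return estimated_api_calls(total_requests, hit_rate) * cost_per_call

# Illustrative inputs only: 10,000 requests, 60% hit rate, 0.01 per call.
calls_without_cache = estimated_api_calls(10_000, 0.0)
calls_with_cache = estimated_api_calls(10_000, 0.6)
print(f"API calls without cache: {calls_without_cache}")
print(f"API calls with cache:    {calls_with_cache}")
print(f"Estimated cost with cache: {estimated_cost(10_000, 0.6, 0.01):.2f}")
```

Real costs depend on token counts per request, so treat this as a back-of-the-envelope estimate rather than a billing model.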

The effectiveness of prompt caching depends on exact matches. The cache key is the literal prompt string, so any variation (an added comma, a changed word, reordered information) produces a different key and therefore a cache miss. This means you need to be deliberate about how you construct your prompts. For example, if you’re summarizing user-submitted text, you can normalize it (collapse extra whitespace, convert to lowercase) before building the cache key to increase hit rates. Lowercasing changes the text itself, so it is usually safer to normalize only the key and still send the original text to the model.
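A minimal sketch of that idea: normalize only the cache key, and pass the original text through on a miss. The helper names (normalize_for_key, cached_lookup) are hypothetical, and fake_summarize stands in for the real API call so the example runs offline:

```python
import re

def normalize_for_key(text):
    """Hypothetical helper: collapse whitespace runs and lowercase, for the key only."""
    return re.sub(r"\s+", " ", text).strip().lower()

normalized_cache = {}

def cached_lookup(article_text, compute):
    """Look up by normalized key; call compute(article_text) on a miss."""
    key = normalize_for_key(article_text)
    if key not in normalized_cache:
        # The untouched original text is what gets processed.
        normalized_cache[key] = compute(article_text)
    return normalized_cache[key]

calls = []

def fake_summarize(text):
    """Stand-in for the API call; records each invocation."""
    calls.append(text)
    return f"summary of {len(text)} chars"

first = cached_lookup("The quick  brown fox.", fake_summarize)
second = cached_lookup("  the quick brown FOX.\n", fake_summarize)
print(len(calls))  # 1: the second input normalizes to the same key
```

Whether case-insensitive matching is acceptable depends on the task; for text where capitalization carries meaning, whitespace-only normalization is the more conservative choice.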

Prompt caching isn’t only about saving money; it also makes your application more responsive. For frequently encountered queries, a cache hit returns in roughly the time of a dictionary lookup, with no network round trip to the API and no model processing time. That cuts latency for common tasks and makes your application feel much snappier.
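The latency difference can be illustrated with a small timing sketch. Here slow_model_call is a stand-in that simulates the round trip with a 0.2 second sleep (an arbitrary figure, not a measured API latency), so the example runs without an API key:

```python
import time

def slow_model_call(prompt):
    """Stand-in for a network round trip plus model processing (simulated)."""
    time.sleep(0.2)
    return f"response to: {prompt}"

latency_cache = {}

def cached_call(prompt):
    """Return the cached response, calling the slow function only on a miss."""
    if prompt not in latency_cache:
        latency_cache[prompt] = slow_model_call(prompt)
    return latency_cache[prompt]

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

_, miss_seconds = timed(cached_call, "What is your refund policy?")
_, hit_seconds = timed(cached_call, "What is your refund policy?")
print(f"miss: {miss_seconds:.3f}s, hit: {hit_seconds:.6f}s")
```

The exact numbers vary by machine, but the hit should be orders of magnitude faster than the simulated miss.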

One subtlety when implementing caching: LLM output is stochastic, so the same prompt can produce slightly different phrasing on each call. A summarization prompt, for instance, might be worded differently from one call to the next. A cache stores whichever response happened to come back first, and every later hit returns that exact text until the entry is removed. If your application needs consistent output for identical inputs even across cache misses, consider setting a low temperature (e.g., 0.0) in your API calls; this reduces variation but does not guarantee identical responses. You can also build a more sophisticated caching layer that re-fetches when a cached response is deemed "stale" by some heuristic.
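As a sketch of the temperature suggestion, the function below passes temperature to messages.create. It assumes the client behaves like anthropic.Anthropic(); to keep the example runnable offline, a fake client built from SimpleNamespace stands in for the real one and simply records the arguments it receives:

```python
from types import SimpleNamespace

def summarize_low_temperature(client, article_text, temperature=0.0):
    """Request a summary with reduced sampling randomness.

    `client` is assumed to expose messages.create like anthropic.Anthropic().
    """
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1000,
        temperature=temperature,  # lower values reduce, but do not eliminate, variation
        messages=[
            {"role": "user", "content": f"Summarize this article:\n\n{article_text}"}
        ],
    )
    return message.content[0].text

# Offline stand-in (not the real SDK): records keyword arguments, returns a stub.
recorded = {}

def fake_create(**kwargs):
    recorded.update(kwargs)
    return SimpleNamespace(content=[SimpleNamespace(text="stub summary")])

fake_client = SimpleNamespace(messages=SimpleNamespace(create=fake_create))
result = summarize_low_temperature(fake_client, "Some article text.")
print(result)  # stub summary
```

In a real application you would pass the anthropic client from the earlier examples instead of fake_client.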

The next logical step after implementing basic prompt caching is to explore more advanced caching strategies, such as time-based expiration, cache invalidation, or using distributed caching systems for multi-instance applications.
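As a starting point for time-based expiration and explicit invalidation, here is a minimal in-memory sketch. The TTLCache class is hypothetical (written for this article, not taken from a library), and the injectable clock is an assumption made so that expiry can be demonstrated without sleeping:

```python
import time

class TTLCache:
    """Minimal in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock   # injectable so expiry can be tested without waiting
        self._store = {}     # key -> (expires_at, value)

    def get(self, key):
        """Return the cached value, or None on a miss or an expired entry."""
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl_seconds, value)

    def invalidate(self, key):
        """Explicitly remove an entry, e.g. when the source article changes."""
        self._store.pop(key, None)

# Usage with a fake clock that we advance by hand.
now = [0.0]
ttl_cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
ttl_cache.set("prompt", "cached summary")
fresh = ttl_cache.get("prompt")    # still within the 60-second window
now[0] = 61.0
expired = ttl_cache.get("prompt")  # past the window, so this is a miss
print(fresh, expired)  # cached summary None
```

Because None signals a miss, this sketch cannot cache a None value; a distributed setup would more likely rely on the expiry support of a store such as Redis than on a hand-rolled class.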
