Cut Claude API Costs with the Batch API (2026)

The Claude Batch API lets you send many prompts in a single request, saving you money by reducing overhead.

Let’s see it in action. Imagine you have a list of customer reviews and you want to classify each one as either "positive" or "negative".

import anthropic

client = anthropic.Anthropic(
    api_key="YOUR_ANTHROPIC_API_KEY",
)

reviews = [
    "This product is amazing! I love it.",
    "The quality was terrible, a complete waste of money.",
    "It was okay, nothing special but it worked.",
    "Absolutely fantastic! Highly recommend.",
    "Broke after one use, very disappointed."
]

# Constructing the batch request
batch_items = []
for i, review in enumerate(reviews):
    batch_items.append(
        {
            "prompt": f"\n\nHuman: Classify the following customer review as 'positive' or 'negative'.\nReview: \"{review}\"\n\nAssistant:",
            "max_tokens": 10,
            "metadata": {"review_id": str(i)} # Optional: for tracking
        }
    )

response = client.completions.create(
    model="claude-2.1", # Or another model
    batch=batch_items
)

# Processing the results
for completion in response.completions:
    review_id = completion.metadata.get("review_id")
    classification = completion.completion.strip()
    print(f"Review {review_id}: {classification}")

This code snippet demonstrates how to send multiple classification tasks to Claude in one go. Instead of making an individual API call for each review, we bundle them into a single batch parameter. The completions.create method then handles sending this batch to the API and returns a list of completions, each corresponding to an item in your original batch.

The core problem the Batch API solves is the per-request overhead associated with API calls. Every time you call an API, there’s a cost in terms of network latency, authentication, and processing at the API provider’s end. For tasks where you have many small, independent requests, this overhead can become a significant portion of your total cost. By grouping these small requests into a single larger one, you amortize that overhead across all the individual tasks. This is particularly impactful for high-volume, low-complexity tasks like sentiment analysis, simple Q&A on short documents, or data extraction from many similar items.

Internally, when you send a batch, Claude’s infrastructure processes each item in the batch sequentially or in parallel where feasible, but it presents the results back to you as a single response object containing all individual completions. The batch parameter expects a list of dictionaries, where each dictionary represents a single prompt and its associated parameters like max_tokens, temperature, and stop_sequences. You can also include metadata to help you correlate the results back to your original data.

The most significant lever you control is the model parameter. Different models have different pricing structures and capabilities. For batch processing, you’ll likely want to use a model that offers a good balance of cost and performance for your specific task. If your tasks are simple, a smaller, faster, and cheaper model might suffice. If they require deeper understanding or longer context, you might need a more powerful model, but the cost savings from batching will still apply. The max_tokens parameter for each item in the batch is crucial for controlling cost per item. Setting this too high means you’re paying for more tokens than you need, even if Claude doesn’t generate them. Always aim to set max_tokens to just slightly more than the expected output length.

What most people don’t realize is that the metadata field is not just for tracking; it’s a direct passthrough. Whatever you put in the metadata for a specific item in the batch request will be returned in the corresponding completion object. This means you can embed identifiers, context, or even small pieces of the original data directly into the metadata to ensure you can easily reassemble your results with your source data without needing complex lookups later.

The next evolution of this pattern is exploring streaming batch responses for even lower perceived latency on individual items within a large batch.