Claude’s token counting is a bit like tracking a package: you need to know what’s in the box and how big it is before you ship it; otherwise, you’re just guessing at the postage. The surprise is that Claude counts tokens in both the input prompt and the output completion, and you pay for both. The two are priced at different rates, with output tokens costing more than input tokens.

Let’s see this in action. Imagine a simple prompt to Claude:

import anthropic

client = anthropic.Anthropic(
    api_key="YOUR_ANTHROPIC_API_KEY",
)

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Hello, Claude!"}
    ]
)

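# response.content is a list of content blocks; the first one holds the reply text.
print(response.content[0].text)
# Token counts for this call, as reported by the API.
print(response.usage.input_tokens, response.usage.output_tokens)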

In this snippet, response.usage.input_tokens and response.usage.output_tokens will tell you exactly how many tokens were consumed. Even for a simple "Hello, Claude!", the input might be 8 tokens, and the output, perhaps "Hello there! How can I help you today?", could be another 10. Total: 18 tokens. Opus is priced at $15 per million tokens for input and $75 per million for output, so even this tiny interaction has a minuscule but non-zero cost.
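The arithmetic behind that cost can be sketched as a small helper. The function name estimate_cost_usd and the constants are illustrative, not part of the SDK; the prices are the Opus figures quoted above, and the token counts are the hypothetical 8 and 10 from the example.

```python
# Opus prices quoted in the text, in USD per million tokens.
OPUS_INPUT_PER_MTOK = 15.00
OPUS_OUTPUT_PER_MTOK = 75.00


def estimate_cost_usd(input_tokens, output_tokens,
                      input_per_mtok=OPUS_INPUT_PER_MTOK,
                      output_per_mtok=OPUS_OUTPUT_PER_MTOK):
    """Return the cost of one call in USD; input and output are priced separately."""
    total = input_tokens * input_per_mtok + output_tokens * output_per_mtok
    return total / 1_000_000


cost = estimate_cost_usd(input_tokens=8, output_tokens=10)
print(f"${cost:.6f}")  # $0.000870
```

Note that the 10 output tokens account for most of the total, because output is priced five times higher than input here.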

The core problem Claude solves is natural language understanding and generation at scale. It takes your text, breaks it down into tokens (which are not always words, but can be parts of words, punctuation, or even spaces), processes them through a massive neural network, and then generates new text, also token by token. The max_tokens parameter doesn’t limit the generated text in characters or words; it caps the number of tokens Claude can output. If Claude reaches this cap before finishing its thought, generation simply stops there: the response is cut off, and its stop_reason field is set to "max_tokens".
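One way to detect a cut-off response is to check its stop_reason field, which the Messages API sets to "max_tokens" when the output cap was hit. The sketch below uses a hypothetical helper, was_truncated, and stand-in objects in place of real API responses so that it runs offline.

```python
from types import SimpleNamespace


def was_truncated(response):
    """Return True if generation stopped because it hit the max_tokens cap."""
    return getattr(response, "stop_reason", None) == "max_tokens"


# Stand-in objects shaped like API responses, so the sketch runs without a network call.
cut_off = SimpleNamespace(stop_reason="max_tokens")
finished = SimpleNamespace(stop_reason="end_turn")

print(was_truncated(cut_off))   # True
print(was_truncated(finished))  # False
```

With a real client, the same check would apply to the object returned by client.messages.create, for example to decide whether to retry with a higher cap.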

You control Claude’s cost primarily through two levers: the complexity and length of your input prompt, and the max_tokens setting for the output. For long-running, complex tasks, or when you’re iterating on prompts, costs can add up rapidly if you’re not mindful.

Here’s a breakdown of key considerations for controlling costs:

  • Input Prompt Length: Every word, punctuation mark, and even newline in your prompt consumes tokens. A verbose prompt, while sometimes necessary for clarity, directly increases input costs.
  • max_tokens for Output: This is your primary defense against runaway output costs. Setting it too low can lead to incomplete answers. You’re billed only for the tokens Claude actually generates, not for the cap itself, but an unnecessarily high cap leaves room for much longer (and more expensive) responses than you need.
  • Model Choice: Different Claude models have different pricing. Claude 3 Opus is the most powerful and expensive, while Claude 3 Haiku is the fastest and cheapest. For tasks that don’t require cutting-edge reasoning, opting for a less powerful model can significantly reduce costs.
  • Context Window: Claude models have a context window (e.g., 200K tokens for Claude 3). While you can fit a lot of information, the cost is per token used. If you’re sending a large document, ensure you’re only sending the relevant parts.
  • System Prompts: These are instructions given to the model before the user’s turn. They also consume tokens and contribute to the input cost.
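To make the levers above concrete, here is a rough sketch of a per-call upper bound: input tokens at the input price, plus the full max_tokens at the output price. The helper name worst_case_cost_usd is hypothetical. The Opus prices come from the text; the Haiku prices and model ID are an assumption based on launch list prices and should be checked against current pricing.

```python
# USD per million tokens as (input, output).
# Opus figures are from the text; the Haiku figures are an assumption
# (launch list prices) and may be out of date.
PRICES_PER_MTOK = {
    "claude-3-opus-20240229": (15.00, 75.00),
    "claude-3-haiku-20240307": (0.25, 1.25),
}


def worst_case_cost_usd(model, input_tokens, max_tokens):
    """Upper bound for one call: assumes the output uses the whole max_tokens cap."""
    input_price, output_price = PRICES_PER_MTOK[model]
    return (input_tokens * input_price + max_tokens * output_price) / 1_000_000


for model in PRICES_PER_MTOK:
    bound = worst_case_cost_usd(model, input_tokens=2_000, max_tokens=1_024)
    print(f"{model}: at most ${bound:.5f}")
```

The actual charge is usually below this bound, since billing is based on the tokens really generated; the bound is only useful for budgeting.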

A common pitfall is assuming max_tokens is a hard character limit. It’s not. A single token can represent multiple characters, or just one. For instance, the word "tokenization" might be one token, or it could be broken into "token", "ization". The exact breakdown is model-dependent and not something you directly control, but it’s crucial to remember that the count is token-based.
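Because the breakdown is model-dependent, the reliable way to size a prompt is to ask the API rather than guess from characters. The sketch below assumes a recent version of the anthropic SDK that exposes client.messages.count_tokens; whether a particular model is supported by that endpoint is something to verify in the documentation. The wrapper name count_input_tokens is hypothetical.

```python
def count_input_tokens(client, model, messages, system=None):
    """Ask the API how many input tokens a request would use, without generating text."""
    kwargs = {"model": model, "messages": messages}
    if system is not None:
        kwargs["system"] = system  # system prompts count toward input tokens too
    return client.messages.count_tokens(**kwargs).input_tokens


# Usage with a real client (requires network access and an API key):
#   import anthropic
#   client = anthropic.Anthropic()
#   n = count_input_tokens(
#       client,
#       "claude-3-opus-20240229",
#       [{"role": "user", "content": "Hello, Claude!"}],
#   )
```

Passing the client in as an argument keeps the helper easy to exercise with a stub during development.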

When you’re debugging or experimenting with prompts, it’s easy to forget that every completed API call, even one that returns very little text, incurs a cost based on the input tokens sent. Always inspect the response.usage object after each call, especially when developing.
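A simple way to keep an eye on this during development is to accumulate the usage numbers across calls. The UsageTracker class below is a hypothetical sketch, not part of the SDK; it relies only on the response.usage.input_tokens and response.usage.output_tokens fields mentioned earlier, and uses stand-in objects so that it runs offline.

```python
from types import SimpleNamespace


class UsageTracker:
    """Accumulates token counts from response.usage across calls."""

    def __init__(self):
        self.input_tokens = 0
        self.output_tokens = 0
        self.calls = 0

    def record(self, response):
        """Add one response's token counts to the running totals."""
        self.input_tokens += response.usage.input_tokens
        self.output_tokens += response.usage.output_tokens
        self.calls += 1


tracker = UsageTracker()

# Stand-in responses with made-up token counts; in practice, pass the
# object returned by client.messages.create.
for inp, out in [(8, 10), (120, 45)]:
    fake = SimpleNamespace(usage=SimpleNamespace(input_tokens=inp, output_tokens=out))
    tracker.record(fake)

print(tracker.calls, tracker.input_tokens, tracker.output_tokens)  # 2 128 55
```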

The next hurdle you’ll likely encounter is managing conversational history efficiently to stay within token limits for multi-turn dialogues.
