Claude struggles with complex reasoning tasks because it tries to jump straight to the answer. Chain-of-Thought (CoT) prompting forces it to break down the problem into intermediate steps, significantly improving accuracy.

Let’s see this in action. Imagine we want Claude to solve a simple arithmetic word problem:

Initial prompt (no CoT): "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"

Claude’s likely response (no CoT): "Roger has 11 tennis balls."

This is correct, but it’s a guess. Now, let’s apply CoT:

Chain-of-Thought prompt: "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? Let’s think step by step."

Claude’s response (with CoT): "Okay, let’s break this down step by step:

  1. Roger starts with 5 tennis balls.
  2. He buys 2 cans of tennis balls.
  3. Each can contains 3 tennis balls.
  4. So, the number of new tennis balls he bought is 2 cans * 3 balls/can = 6 tennis balls.
  5. To find the total, we add the initial balls to the new balls: 5 + 6 = 11 tennis balls. Therefore, Roger has 11 tennis balls now."

The final answer is the same, but the process is transparent and verifiable. Claude isn’t just spitting out a number; it’s showing its work.

The Problem Claude Solves: Over-reliance on Direct Mapping

Large language models like Claude are trained to predict the next most likely token. For simple, direct questions, this works brilliantly. But when a problem requires multiple logical steps, or a sequence of calculations, simply predicting the final answer token can lead to errors. The model might "guess" the final answer based on patterns it’s seen, without actually performing the underlying reasoning. This is akin to a student memorizing answers without understanding the math.

How Chain-of-Thought Works Internally

CoT leverages the model’s inherent ability to generate sequential text. By explicitly asking Claude to "think step by step," we’re guiding it to generate a sequence of tokens that represent the intermediate reasoning process, rather than just the final answer token. This sequence acts as a kind of internal scratchpad.

The model, in generating these intermediate steps, is essentially performing the reasoning it needs to do. Each step informs the next, and the final answer is a natural consequence of this structured thought process. This is particularly powerful for tasks involving:

  • Arithmetic: Multi-step calculations, word problems.
  • Logic Puzzles: Deductive reasoning, constraint satisfaction.
  • Symbolic Manipulation: Algebraic simplification, code transformation.
  • Commonsense Reasoning: Inferring outcomes based on a series of events.

The "Let’s think step by step" phrase is a simple trigger, but the underlying mechanism is Claude generating text that describes a reasoning process, which in turn helps it arrive at a more accurate conclusion.

Levers You Control

  1. The Trigger Phrase: While "Let’s think step by step" is common and effective, you can experiment with others like "Break this down for me," "Show your work," or even providing a few example steps yourself in a few-shot prompt. The key is to signal that a multi-step process is expected.
  2. The Complexity of Steps: For very complex problems, you might need to guide Claude on what kind of steps to take. For example, if solving a physics problem, you might prompt: "First, identify the known variables. Second, state the relevant physical laws. Third, rearrange the equations…"
  3. Few-Shot CoT: For even more robust results, you can provide a few examples of questions with their step-by-step solutions before posing your actual question. This demonstrates the desired format and reasoning structure more explicitly.

Example with Code Generation Reasoning:

Consider this prompt: "Write a Python function that takes a list of numbers and returns the sum of the even numbers. Explain your reasoning."

Claude’s CoT Response: "Okay, let’s break this down step by step to create the Python function:

  1. Function Definition: I need to define a Python function that accepts one argument: a list of numbers. I’ll call it sum_even_numbers and the parameter numbers_list.
    def sum_even_numbers(numbers_list):
        # Function body will go here
        pass
    
  2. Initialization: I’ll need a variable to keep track of the sum of even numbers. I’ll initialize it to 0 before I start iterating through the list. Let’s call it even_sum.
    def sum_even_numbers(numbers_list):
        even_sum = 0
        # ... rest of the logic
    
  3. Iteration: I need to go through each number in the input numbers_list. A for loop is appropriate for this.
    def sum_even_numbers(numbers_list):
        even_sum = 0
        for number in numbers_list:
            # Check if the number is even
            pass
    
  4. Even Number Check: Inside the loop, for each number, I need to determine if it’s even. A number is even if it is perfectly divisible by 2, meaning the remainder when divided by 2 is 0. The modulo operator (%) is perfect for this.
    def sum_even_numbers(numbers_list):
        even_sum = 0
        for number in numbers_list:
            if number % 2 == 0:
                # It's even! Add it to our sum.
                pass
    
  5. Accumulation: If a number is even, I add it to my even_sum variable.
    def sum_even_numbers(numbers_list):
        even_sum = 0
        for number in numbers_list:
            if number % 2 == 0:
                even_sum += number
    
  6. Return Value: After the loop has finished checking all numbers, the function should return the final even_sum.
    def sum_even_numbers(numbers_list):
        even_sum = 0
        for number in numbers_list:
            if number % 2 == 0:
                even_sum += number
        return even_sum
    

So, the complete function is:

def sum_even_numbers(numbers_list):
    even_sum = 0
    for number in numbers_list:
        if number % 2 == 0:
            even_sum += number
    return even_sum

This function iterates through the list, checks each number for divisibility by 2, and accumulates the even ones before returning the total."

The most surprising thing about CoT is that simply asking the model to "think step by step" can increase performance on complex tasks by orders of magnitude, even for models that aren’t specifically fine-tuned for this behavior. It’s like giving a calculator a piece of paper to write down its intermediate calculations.

The next frontier is understanding how to automatically detect when CoT is beneficial for a given prompt, rather than having to manually append "Let’s think step by step" to every complex query.

Want structured learning?

Take the full Claude-api course →