Claude’s ability to detect and redact Personally Identifiable Information (PII) from text isn’t just about finding names and addresses; it’s a sophisticated exercise in natural language understanding that can distinguish between a general mention of "the president" and a specific, actionable identifier.

Let’s see it in action. Imagine you have a customer support transcript and you need to scrub it before sharing it externally.

Customer: Hi, my name is Jane Doe and I'm having an issue with my account. My account number is 123456789. I live at 1600 Pennsylvania Ave NW, Washington, DC 20500. I can be reached at jane.doe@example.com or 555-123-4567.

Agent: Hello Jane, I can help with that. Can you confirm the email address you used to sign up?

Customer: Yes, it's jane.doe@example.com. And my current phone number is 555-123-4567.

Agent: Thank you. I see the account. What seems to be the problem?

Now, we’ll use Claude to identify and redact this PII. The core idea is to instruct Claude to act as a PII detection and redaction service. We’ll define what constitutes PII and how it should be handled.

from anthropic import Anthropic

client = Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

text_to_process = """
Customer: Hi, my name is Jane Doe and I'm having an issue with my account. My account number is 123456789. I live at 1600 Pennsylvania Ave NW, Washington, DC 20500. I can be reached at jane.doe@example.com or 555-123-4567.

Agent: Hello Jane, I can help with that. Can you confirm the email address you used to sign up?

Customer: Yes, it's jane.doe@example.com. And my current phone number is 555-123-4567.

Agent: Thank you. I see the account. What seems to be the problem?
"""

response = client.messages.create(
    model="claude-3-opus-20240229", # Or claude-3-sonnet-20240229
    max_tokens=1000,
    messages=[
        {
            "role": "user",
            "content": f"""
Please identify and redact all Personally Identifiable Information (PII) from the following text.
Redact names, email addresses, phone numbers, physical addresses, and account numbers.
Replace each piece of PII with a placeholder like [REDACTED_NAME], [REDACTED_EMAIL], [REDACTED_PHONE], [REDACTED_ADDRESS], or [REDACTED_ACCOUNT_NUMBER].
Return only the redacted text.

Text:
{text_to_process}
"""
        }
    ]
)

print(response.content[0].text)

The output you’d get from this would look like:

Customer: Hi, my name is [REDACTED_NAME] and I'm having an issue with my account. My account number is [REDACTED_ACCOUNT_NUMBER]. I live at [REDACTED_ADDRESS]. I can be reached at [REDACTED_EMAIL] or [REDACTED_PHONE].

Agent: Hello [REDACTED_NAME], I can help with that. Can you confirm the email address you used to sign up?

Customer: Yes, it's [REDACTED_EMAIL]. And my current phone number is [REDACTED_PHONE].

Agent: Thank you. I see the account. What seems to be the problem?

The system works by leveraging Claude’s advanced natural language understanding capabilities. It’s not a simple regex-based search. Claude analyzes the context of words and phrases to determine if they represent PII. For instance, it can differentiate between a generic "doctor" and a specific "Dr. Smith," or a street name like "Main Street" versus a full address. The prompt guides this process by explicitly defining the categories of PII to look for and the desired output format.

The key levers you control are the model (Opus for maximum accuracy, Sonnet for a balance of speed and accuracy), max_tokens to ensure the full response is captured, and crucially, the content of your prompt. You can refine the prompt to include or exclude specific types of PII (e.g., "do not redact dates of birth," or "also redact IP addresses"). You can also change the redaction placeholders to suit your needs, like *** or [PRIVATE_DATA].

What most people don’t realize is that Claude can also infer PII that isn’t explicitly labeled as such. For example, if a user says "My order ID is XYZ789, and I’m calling about the shipment to my home at 123 Oak St.", even if "123 Oak St." isn’t explicitly tagged as a physical address in a training set, Claude’s understanding of sentence structure and common phrasing allows it to correctly identify it as an address and redact it. It’s this contextual awareness that makes it more robust than traditional methods.

The next logical step is to integrate this into a real-time processing pipeline, perhaps using webhooks to automatically redact sensitive data from incoming messages before they are stored or displayed.

Want structured learning?

Take the full Claude-api course →