PII Detection

Identifying & Protecting Personally Identifiable Information in LLM Systems

What is PII Detection?

PII (Personally Identifiable Information) detection is the process of automatically identifying sensitive personal data in text before it's processed by LLMs or stored in logs. It's a critical component for privacy compliance, data protection, and building trustworthy AI systems.

"PII includes any information that can be used to identify, contact, or locate an individual, either alone or combined with other sources. In AI systems, undetected PII can lead to privacy violations, regulatory penalties, and loss of user trust."

— NIST Privacy Framework

Detect: Scan text for PII patterns
Classify: Identify entity types
Anonymize: Redact or mask data
Process: Safe for LLM use

Common PII Entity Types

Category	Entity Types	Examples	Sensitivity
Direct Identifiers	PERSON, SSN, PASSPORT	John Smith, 123-45-6789	High
Contact Info	EMAIL, PHONE, ADDRESS	john@email.com, +1-555-0123	High
Financial	CREDIT_CARD, IBAN, BANK_ACCOUNT	4111-1111-1111-1111	High
Technical	IP_ADDRESS, MAC_ADDRESS, URL	192.168.1.1, AA:BB:CC:DD:EE:FF	Medium
Health (PHI)	MEDICAL_LICENSE, NPI, CONDITION	Medical record numbers, diagnoses	High
Location	LOCATION, GPS_COORDINATES	123 Main St, New York, NY	Medium
Credentials	API_KEY, PASSWORD, TOKEN	sk-abc123..., Bearer eyJ...	Critical

Detection Methods

Pattern Matching (Regex)

Regular expressions to match structured formats like SSNs, credit cards, emails, and phone numbers. Fast and deterministic but limited to known patterns.

Fast · Deterministic · Limited coverage
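As a rough illustration of pattern matching, the sketch below scans text with three regular expressions. The patterns and the `find_pii` helper are illustrative assumptions, not a complete or production-grade rule set; real-world formats have many more variants.

```python
import re

# Illustrative patterns only; they cover the common shapes, not every variant.
PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
}

def find_pii(text):
    """Return (entity_type, start, end, matched_text) tuples sorted by position."""
    hits = []
    for entity_type, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((entity_type, m.start(), m.end(), m.group()))
    return sorted(hits, key=lambda h: h[1])

if __name__ == "__main__":
    sample = "Mail john@email.com, SSN 123-45-6789, card 4111-1111-1111-1111."
    for hit in find_pii(sample):
        print(hit)
```

Because each pattern only describes a shape, any string with the right layout matches, which is why regex results are usually validated or scored afterwards.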

Named Entity Recognition (NER)

Machine learning models trained to identify entities like names, organizations, and locations from context. More flexible than regex but requires ML infrastructure.

Context-aware · Flexible · Requires ML

Checksum Validation

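Verify detected patterns with checksums or structural rules: the Luhn algorithm for credit card numbers, and range rules for SSNs, which carry no check digit (for example, the area number cannot be 000, 666, or 900-999). This reduces false positives from regex matches.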

High precision · Low false positives
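The Luhn check mentioned above is short enough to sketch directly. The function name `luhn_valid` is a choice made for this example; the algorithm itself is the standard one (double every second digit from the right, subtract 9 from results above 9, and require the sum to be divisible by 10).

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Ignore separators such as spaces and hyphens.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

if __name__ == "__main__":
    print(luhn_valid("4111-1111-1111-1111"))
    print(luhn_valid("4111-1111-1111-1112"))
```

A card-shaped number that fails this check can be dropped before it is ever reported as PII.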

Ensemble Approach

Combine multiple methods: regex for structured data, NER for unstructured, and checksum for validation. This is the approach used by tools like Presidio.

Best coverage · Production-ready · Recommended
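One way to picture the ensemble idea is a set of detector functions whose results are merged. The sketch below is an assumption-laden toy, not how Presidio is implemented: the `Span` type, the scores, and `toy_ner` (a stand-in for a real NER model) are all hypothetical.

```python
import re
from typing import Callable, List, NamedTuple

class Span(NamedTuple):
    entity_type: str
    start: int
    end: int
    score: float  # arbitrary confidence values chosen for this example

EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
CARD_RE = re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")

def luhn_valid(number: str) -> bool:
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return len(digits) > 1 and total % 10 == 0

def regex_detector(text: str) -> List[Span]:
    """Regex for structured data, with checksum validation for card numbers."""
    spans = [Span("EMAIL", m.start(), m.end(), 0.9) for m in EMAIL_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        if luhn_valid(m.group()):  # drop card-shaped numbers that fail Luhn
            spans.append(Span("CREDIT_CARD", m.start(), m.end(), 1.0))
    return spans

def toy_ner(text: str) -> List[Span]:
    """Hypothetical stand-in for an NER model; it only knows one name."""
    i = text.find("John Smith")
    return [Span("PERSON", i, i + len("John Smith"), 0.85)] if i >= 0 else []

def detect(text: str, detectors: List[Callable[[str], List[Span]]]) -> List[Span]:
    """Run every detector, keeping the highest-scoring span where results overlap."""
    candidates = [s for d in detectors for s in d(text)]
    kept: List[Span] = []
    for s in sorted(candidates, key=lambda s: s.score, reverse=True):
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)

if __name__ == "__main__":
    sample = "John Smith paid with 4111-1111-1111-1111; mail john@email.com"
    for span in detect(sample, [regex_detector, toy_ner]):
        print(span.entity_type, sample[span.start:span.end])
```

Keeping each method behind the same function signature makes it easy to add or swap detectors without touching the merge step.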

Anonymization Techniques

Redaction

Complete removal or replacement with placeholder.

john@email.com → [EMAIL_REDACTED]

Masking

Partial hiding while preserving format.

4111-1111-1111-1111 → ****-****-****-1111
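Redaction and masking, as described above, can be sketched with the standard library alone. The helper names and the email pattern are assumptions made for this example.

```python
import re

# Illustrative email pattern; not exhaustive.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")

def redact_emails(text: str) -> str:
    """Redaction: replace each email address with a fixed placeholder."""
    return EMAIL_RE.sub("[EMAIL_REDACTED]", text)

def mask_card(card: str, visible: int = 4, mask_char: str = "*") -> str:
    """Masking: hide all but the last `visible` digits, keeping separators in place."""
    total_digits = sum(ch.isdigit() for ch in card)
    seen = 0
    out = []
    for ch in card:
        if ch.isdigit():
            seen += 1
            # Only the final `visible` digits are left readable.
            out.append(ch if seen > total_digits - visible else mask_char)
        else:
            out.append(ch)  # separators are preserved so the format stays recognizable
    return "".join(out)

if __name__ == "__main__":
    print(redact_emails("Contact john@email.com today"))
    print(mask_card("4111-1111-1111-1111"))
```

Masking keeps enough of the value for a user to recognize it, while redaction removes it entirely; which one fits depends on who reads the output.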

Pseudonymization

Replace with consistent fake values (reversible with key).

John Smith → Person_A7X9

Hashing

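One-way transformation that cannot be reversed directly. Guessable values such as emails or phone numbers can still be recovered by hashing candidate inputs, so prefer a salted or keyed hash.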

john@email.com → a3f2b8c9...
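The difference between pseudonymization and hashing can be sketched as follows. This is a minimal illustration under stated assumptions: the `Pseudonymizer` class uses an in-memory lookup table as its "key", and `keyed_hash` uses HMAC-SHA256 as one reasonable choice of one-way function. Neither name comes from the tools discussed here.

```python
import hashlib
import hmac
import secrets

class Pseudonymizer:
    """Consistent, reversible pseudonyms backed by a lookup table."""

    def __init__(self):
        self._forward = {}  # original value -> pseudonym
        self._reverse = {}  # pseudonym -> original value

    def pseudonymize(self, value: str, prefix: str = "Person") -> str:
        if value not in self._forward:
            token = f"{prefix}_{secrets.token_hex(4).upper()}"
            while token in self._reverse:  # regenerate on the rare collision
                token = f"{prefix}_{secrets.token_hex(4).upper()}"
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def restore(self, token: str) -> str:
        """Only a holder of the lookup table can map a pseudonym back."""
        return self._reverse[token]

def keyed_hash(value: str, key: bytes) -> str:
    """One-way keyed hash; without the key, guessed inputs cannot be checked."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

if __name__ == "__main__":
    p = Pseudonymizer()
    alias = p.pseudonymize("John Smith")
    print(alias, "->", p.restore(alias))
    print(keyed_hash("john@email.com", key=b"example-secret-key"))
```

The same input always yields the same pseudonym or hash, so records can still be joined, but only the pseudonym can be mapped back to the original value.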

Implementation with Presidio

# Install Presidio and the spaCy model its default NLP engine loads
pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg

# Basic PII Detection & Anonymization
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# Initialize engines
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# Sample text with PII
text = """
Hello, my name is John Smith. You can reach me at john.smith@email.com 
or call me at 555-123-4567. My SSN is 123-45-6789 and my credit card 
is 4111-1111-1111-1111.
"""

# Detect PII entities
results = analyzer.analyze(
    text=text,
    entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", 
              "US_SSN", "CREDIT_CARD"],
    language="en"
)

# Print detected entities
for result in results:
    print(f"{result.entity_type}: {text[result.start:result.end]}")

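# Anonymize with different strategies
# The "mask" operator expects all three parameters: masking_char, chars_to_mask, from_end.
# Entities without an entry (PHONE_NUMBER here) fall back to the default replacement, <PHONE_NUMBER>.
operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[NAME]"}),
    "EMAIL_ADDRESS": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 10, "from_end": False}),
    # Mask from the start so the last four digits stay visible (15 of the 19 characters, hyphens included)
    "CREDIT_CARD": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 15, "from_end": False}),
    "US_SSN": OperatorConfig("redact"),
}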

anonymized = anonymizer.anonymize(text=text, analyzer_results=results, operators=operators)
print(anonymized.text)

See full documentation: Microsoft Presidio

LLM Integration Pattern

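# PII-safe LLM wrapper
from typing import Dict, Tuple

class PIISafeLLM:
    def __init__(self, llm_client, analyzer, anonymizer):
        self.llm = llm_client          # assumed to expose generate(prompt) -> str
        self.analyzer = analyzer
        self.anonymizer = anonymizer   # not used below; kept for operator-based anonymization

    def sanitize_input(self, text: str) -> Tuple[str, Dict[str, str]]:
        """Replace each PII span with a unique placeholder; return the text and the mapping"""
        results = self.analyzer.analyze(text=text, language="en")

        # Where matches overlap, keep the highest-scoring span
        chosen = []
        for r in sorted(results, key=lambda r: r.score, reverse=True):
            if all(r.end <= c.start or r.start >= c.end for c in chosen):
                chosen.append(r)

        # The anonymizer's default operator writes <ENTITY_TYPE> for every match,
        # which cannot be mapped back to individual values. Substitute numbered
        # placeholders directly, right to left so earlier offsets stay valid.
        entity_map = {}
        safe_text = text
        for i, r in enumerate(sorted(chosen, key=lambda r: r.start, reverse=True)):
            placeholder = f"<{r.entity_type}_{i}>"
            entity_map[placeholder] = text[r.start:r.end]
            safe_text = safe_text[:r.start] + placeholder + safe_text[r.end:]

        return safe_text, entity_map

    def generate(self, prompt: str) -> str:
        # Sanitize input; the mapping stays local so concurrent calls cannot mix entities
        safe_prompt, entity_map = self.sanitize_input(prompt)

        # Call LLM with sanitized input
        response = self.llm.generate(safe_prompt)

        # Optionally restore PII in response (if needed)
        # for placeholder, original in entity_map.items():
        #     response = response.replace(placeholder, original)

        return response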

PII Detection Tools

Microsoft Presidio

Open Source

Free & open-source
20+ built-in recognizers
Extensible with custom entities
Multiple anonymization operators


Google Cloud DLP

Managed Service

150+ info types globally
Image & structured data support
Risk analysis included
Pay-per-use pricing


AWS Comprehend