PII Detection

Identifying & Protecting Personally Identifiable Information in LLM Systems

What is PII Detection?

PII (Personally Identifiable Information) detection is the process of automatically identifying sensitive personal data in text before it's processed by LLMs or stored in logs. It's a critical component for privacy compliance, data protection, and building trustworthy AI systems.

"PII includes any information that can be used to identify, contact, or locate an individual, either alone or combined with other sources. In AI systems, undetected PII can lead to privacy violations, regulatory penalties, and loss of user trust."

Detect: Scan text for PII patterns
Classify: Identify entity types
Anonymize: Redact or mask data
Process: Safe for LLM use

Common PII Entity Types

Category | Entity Types | Examples | Sensitivity
Direct Identifiers | PERSON, SSN, PASSPORT | John Smith, 123-45-6789 | High
Contact Info | EMAIL, PHONE, ADDRESS | john@email.com, +1-555-0123 | High
Financial | CREDIT_CARD, IBAN, BANK_ACCOUNT | 4111-1111-1111-1111 | High
Technical | IP_ADDRESS, MAC_ADDRESS, URL | 192.168.1.1, AA:BB:CC:DD:EE:FF | Medium
Health (PHI) | MEDICAL_LICENSE, NPI, CONDITION | Medical record numbers, diagnoses | High
Location | LOCATION, GPS_COORDINATES | 123 Main St, New York, NY | Medium
Credentials | API_KEY, PASSWORD, TOKEN | sk-abc123..., Bearer eyJ... | Critical

Detection Methods

Pattern Matching (Regex)

Regular expressions to match structured formats like SSNs, credit cards, emails, and phone numbers. Fast and deterministic but limited to known patterns.

Fast, deterministic, limited coverage
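A minimal sketch of regex-based detection using only the standard library. The patterns are deliberately simplified and US-centric, and the `PATTERNS` table and `find_pii` helper are illustrative names, not part of any library. Production patterns need to be considerably stricter.

```python
import re

# Simplified, illustrative patterns; real-world formats vary far more than this.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b\d{4}(?:[- ]?\d{4}){3}\b"),
}


def find_pii(text):
    """Return (entity_type, start, end, matched_text) tuples sorted by position."""
    hits = []
    for entity_type, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((entity_type, m.start(), m.end(), m.group()))
    return sorted(hits, key=lambda h: h[1])


sample = "Reach me at john.smith@email.com or 555-123-4567. SSN: 123-45-6789."
for entity_type, start, end, value in find_pii(sample):
    print(entity_type, start, end, value)
```

Anything that does not fit one of these fixed shapes, such as a name or a street address, passes through undetected, which is the coverage limit noted above.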

Named Entity Recognition (NER)

Machine learning models trained to identify entities like names, organizations, and locations from context. More flexible than regex but requires ML infrastructure.

Context-aware, flexible, requires ML

Checksum Validation

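Verify detected patterns with checksums or structural rules: the Luhn algorithm for credit card numbers, and range rules for SSNs, which carry no check digit (for example, the area number cannot be 000, 666, or 900-999). This filters out false positives from regex matches.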

High precision, low false positives
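The Luhn check mentioned above can be sketched in a few lines. The `luhn_valid` name is an illustrative choice. The function ignores separators and validates only the checksum, not issuer prefixes or card length.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("4111-1111-1111-1111"))  # True
print(luhn_valid("4111-1111-1111-1112"))  # False
```

Running this on every regex candidate discards 16-digit strings such as order numbers that merely look like card numbers.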

Ensemble Approach

Combine multiple methods: regex for structured data, NER for unstructured, and checksum for validation. This is the approach used by tools like Presidio.

Best coverage, production-ready, recommended
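A rough sketch of how an ensemble might be wired together, under stated assumptions: each detector is a plain function returning scored spans, and overlaps are resolved by keeping the highest score. The `Span` type, the scores, and `name_detector` (a hard-coded stand-in for a real NER model) are hypothetical. This does not show how Presidio is implemented internally.

```python
import re
from typing import Callable, List, NamedTuple


class Span(NamedTuple):
    entity_type: str
    start: int
    end: int
    score: float


def luhn_valid(number: str) -> bool:
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return len(digits) > 1 and total % 10 == 0


def card_detector(text: str) -> List[Span]:
    """Regex candidates, kept only if they pass the Luhn checksum."""
    pattern = re.compile(r"\b\d{4}(?:[- ]?\d{4}){3}\b")
    return [Span("CREDIT_CARD", m.start(), m.end(), 0.95)
            for m in pattern.finditer(text) if luhn_valid(m.group())]


def name_detector(text: str) -> List[Span]:
    """Stand-in for an NER model: matches one hard-coded name."""
    return [Span("PERSON", m.start(), m.end(), 0.85)
            for m in re.finditer(r"\bJohn Smith\b", text)]


def detect(text: str, detectors: List[Callable[[str], List[Span]]]) -> List[Span]:
    """Run all detectors, then keep the highest-scoring span where spans overlap."""
    candidates = [s for d in detectors for s in d(text)]
    kept: List[Span] = []
    for s in sorted(candidates, key=lambda s: -s.score):
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)


text = "John Smith paid with 4111-1111-1111-1111, not 1234-5678-9012-3456."
for span in detect(text, [card_detector, name_detector]):
    print(span.entity_type, text[span.start:span.end])
```

The second card-like number fails the checksum and is dropped, while the name is caught by the separate detector. The value of the ensemble is that each method covers the others' blind spots.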

Anonymization Techniques

Redaction

Complete removal, or replacement with a placeholder.

john@email.com → [EMAIL_REDACTED]
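A minimal sketch of placeholder redaction for one entity type, assuming a simplified email pattern. `EMAIL_RE` and `redact_emails` are illustrative names.

```python
import re

# Simplified email pattern, for illustration only.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def redact_emails(text: str) -> str:
    """Replace every email-like match with a fixed placeholder."""
    return EMAIL_RE.sub("[EMAIL_REDACTED]", text)


print(redact_emails("Contact john@email.com for details."))
# Contact [EMAIL_REDACTED] for details.
```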

Masking

Partial hiding while preserving format.

4111-1111-1111-1111 → ****-****-****-1111
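One way to produce the masked form shown above is to mask digits only and leave separators in place. The `mask_keep_last` helper and its parameters are hypothetical names chosen for this sketch.

```python
def mask_keep_last(value: str, keep: int = 4, mask_char: str = "*") -> str:
    """Mask all but the last `keep` digits; separators are left untouched."""
    total_digits = sum(c.isdigit() for c in value)
    to_mask = max(total_digits - keep, 0)
    out = []
    for c in value:
        if c.isdigit() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(c)
    return "".join(out)


print(mask_keep_last("4111-1111-1111-1111"))  # ****-****-****-1111
```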

Pseudonymization

Replace with consistent substitute values (reversible by whoever holds the key or mapping table).

John Smith → Person_A7X9
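A minimal in-memory sketch of pseudonymization, assuming the mapping table itself plays the role of the key. The `Pseudonymizer` class is hypothetical. A real deployment would keep the table in protected storage, separate from the pseudonymized data.

```python
import secrets


class Pseudonymizer:
    """Maps each original value to a stable random token and back."""

    def __init__(self):
        self._forward = {}  # original -> token
        self._reverse = {}  # token -> original

    def pseudonymize(self, value: str, prefix: str = "Person") -> str:
        if value not in self._forward:
            token = f"{prefix}_{secrets.token_hex(4).upper()}"
            while token in self._reverse:  # regenerate on the rare collision
                token = f"{prefix}_{secrets.token_hex(4).upper()}"
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def restore(self, token: str) -> str:
        return self._reverse[token]


p = Pseudonymizer()
alias = p.pseudonymize("John Smith")
print(alias == p.pseudonymize("John Smith"))  # True: same input, same token
print(p.restore(alias))                       # John Smith
```

Because the same input always yields the same token, records about one person remain linkable after pseudonymization. That is useful for analytics, and it also means the output is still personal data for anyone holding the table.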

Hashing

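One-way transformation that cannot be reversed directly. Use a salted or keyed hash: plain hashes of guessable values such as email addresses can be recovered by brute force.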

john@email.com → a3f2b8c9...
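A minimal sketch of keyed hashing with HMAC-SHA256 from the standard library. The `hash_pii` name, the lowercase normalization (reasonable for emails, not for every entity type), and the hard-coded key are assumptions for illustration. A real key belongs in a secret store.

```python
import hashlib
import hmac


def hash_pii(value: str, key: bytes) -> str:
    """Return a keyed SHA-256 digest of the normalized value."""
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"replace-with-a-secret-from-your-key-store"  # illustrative only
digest = hash_pii("john@email.com", key)
print(digest[:8] + "...")
```

The digest is stable for a given key, so it can still serve as a join or deduplication key without exposing the original value.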

Implementation with Presidio

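# Install Presidio and the spaCy model its default NLP engine loads (run in a shell)
# pip install presidio-analyzer presidio-anonymizer
# python -m spacy download en_core_web_lg

# Basic PII Detection & Anonymization
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# Initialize engines
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# Sample text with PII
text = """
Hello, my name is John Smith. You can reach me at john.smith@email.com
or call me at 555-123-4567. My SSN is 123-45-6789 and my credit card
is 4111-1111-1111-1111.
"""

# Detect PII entities
results = analyzer.analyze(
    text=text,
    entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER",
              "US_SSN", "CREDIT_CARD"],
    language="en",
)

# Print detected entities with their confidence scores.
# Some recognizers reject well-known placeholder values, so the sample SSN
# may not be reported; results depend on the Presidio version.
for result in results:
    print(f"{result.entity_type} ({result.score:.2f}): {text[result.start:result.end]}")

# Anonymize with different strategies.
# The "mask" operator needs all three parameters: masking_char, chars_to_mask, from_end.
operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[NAME]"}),
    # Masks the first 10 characters of the address
    "EMAIL_ADDRESS": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 10, "from_end": False}),
    # Masks the last 12 characters; to keep only the last four digits instead,
    # use chars_to_mask=15 with from_end=False
    "CREDIT_CARD": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 12, "from_end": True}),
    # Removes the value entirely
    "US_SSN": OperatorConfig("redact"),
    # PHONE_NUMBER has no entry, so it falls back to the default replacement: <PHONE_NUMBER>
}

anonymized = anonymizer.anonymize(text=text, analyzer_results=results, operators=operators)
print(anonymized.text)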

See full documentation: Microsoft Presidio

LLM Integration Pattern

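# PII-safe LLM wrapper
from typing import Dict, Tuple

class PIISafeLLM:
    def __init__(self, llm_client, analyzer):
        self.llm = llm_client      # any client exposing generate(prompt) -> str
        self.analyzer = analyzer   # e.g. presidio_analyzer.AnalyzerEngine

    def sanitize_input(self, text: str) -> Tuple[str, Dict[str, str]]:
        """Replace each detected PII span with a unique placeholder; return text and mapping"""
        results = self.analyzer.analyze(text=text, language="en")

        # Analyzer results can overlap; keep the highest-scoring span in each overlap
        kept = []
        for r in sorted(results, key=lambda r: r.score, reverse=True):
            if all(r.end <= k.start or r.start >= k.end for k in kept):
                kept.append(r)

        # Create reversible mapping.
        # Placeholders are substituted here rather than by the anonymizer, whose default
        # replacement is <ENTITY_TYPE> without an index and so would not match the map keys.
        # Working right to left keeps the remaining offsets valid.
        entity_map = {}
        for i, r in enumerate(sorted(kept, key=lambda r: r.start, reverse=True)):
            placeholder = f"<{r.entity_type}_{i}>"
            entity_map[placeholder] = text[r.start:r.end]
            text = text[:r.start] + placeholder + text[r.end:]
        return text, entity_map

    def generate(self, prompt: str, restore: bool = False) -> str:
        # Sanitize input
        safe_prompt, entity_map = self.sanitize_input(prompt)

        # Call LLM with sanitized input
        response = self.llm.generate(safe_prompt)

        # Optionally restore PII in the response (only if the caller is allowed to see it)
        if restore:
            for placeholder, original in entity_map.items():
                response = response.replace(placeholder, original)

        return response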

PII Detection Tools

Microsoft Presidio

Open Source
  • Free & open-source
  • 20+ built-in recognizers
  • Extensible with custom entities
  • Multiple anonymization operators

Google Cloud DLP (now part of Sensitive Data Protection)

Managed Service
  • 150+ info types globally
  • Image & structured data support
  • Risk analysis included
  • Pay-per-use pricing

Amazon Comprehend

Managed Service
  • NER + PII detection combined
  • AWS ecosystem integration
  • Batch processing support
  • Pay-per-use pricing

spaCy NER

Open Source
  • Fast & production-ready
  • Trainable custom models
  • Multi-language support
  • Requires custom PII training

Best Practices

Do This

  • Scan both inputs AND outputs for PII
  • Use ensemble methods for better coverage
  • Tune confidence thresholds for your use case
  • Add custom recognizers for domain-specific PII
  • Log detection events (without PII!) for monitoring
  • Regularly test with adversarial examples
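Two of the practices above, threshold tuning and PII-free logging, can be sketched together. The detection dicts, the `log_detection_event` helper, and the 0.5 default threshold are assumptions for illustration. The point is that the log event carries entity types and counts but never matched text or offsets.

```python
import json
import logging
from collections import Counter

logger = logging.getLogger("pii_audit")


def log_detection_event(detections, request_id: str, min_score: float = 0.5) -> dict:
    """Log entity types and counts only; matched text is never recorded."""
    kept = [d for d in detections if d["score"] >= min_score]
    event = {
        "request_id": request_id,
        "entity_counts": dict(Counter(d["entity_type"] for d in kept)),
        "dropped_below_threshold": len(detections) - len(kept),
    }
    logger.info(json.dumps(event, sort_keys=True))
    return event


detections = [
    {"entity_type": "EMAIL_ADDRESS", "score": 0.99, "text": "john@email.com"},
    {"entity_type": "PERSON", "score": 0.85, "text": "John Smith"},
    {"entity_type": "PERSON", "score": 0.30, "text": "Main"},
]
event = log_detection_event(detections, request_id="req-001")
print(event)
```

Tracking how many detections fall below the threshold gives a signal for tuning it: a sudden rise may mean the threshold is too strict for a new kind of input.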

Avoid This

  • Relying only on regex patterns
  • Using over-broad detection that hurts usability
  • Logging full text including detected PII
  • Ignoring context (names in quotes, examples)
  • Assuming 100% detection accuracy
  • Skipping international PII formats
