PII Detection
Identifying & Protecting Personally Identifiable Information in LLM Systems
What is PII Detection?
PII (Personally Identifiable Information) detection is the process of automatically identifying sensitive personal data in text before it's processed by LLMs or stored in logs. It's a critical component for privacy compliance, data protection, and building trustworthy AI systems.
"PII includes any information that can be used to identify, contact, or locate an individual, either alone or combined with other sources. In AI systems, undetected PII can lead to privacy violations, regulatory penalties, and loss of user trust."
- Detect: Scan text for PII patterns
- Classify: Identify entity types
- Anonymize: Redact or mask data
- Process: Safe for LLM use
Common PII Entity Types
| Category | Entity Types | Examples | Sensitivity |
|---|---|---|---|
| Direct Identifiers | PERSON, SSN, PASSPORT | John Smith, 123-45-6789 | High |
| Contact Info | EMAIL, PHONE, ADDRESS | john@email.com, +1-555-0123 | High |
| Financial | CREDIT_CARD, IBAN, BANK_ACCOUNT | 4111-1111-1111-1111 | High |
| Technical | IP_ADDRESS, MAC_ADDRESS, URL | 192.168.1.1, AA:BB:CC:DD:EE:FF | Medium |
| Health (PHI) | MEDICAL_LICENSE, NPI, CONDITION | Medical record numbers, diagnoses | High |
| Location | LOCATION, GPS_COORDINATES | 123 Main St, New York, NY | Medium |
| Credentials | API_KEY, PASSWORD, TOKEN | sk-abc123..., Bearer eyJ... | Critical |
Detection Methods
Pattern Matching (Regex)
Regular expressions to match structured formats like SSNs, credit cards, emails, and phone numbers. Fast and deterministic but limited to known patterns.
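As a rough illustration of pattern matching, the sketch below scans text with a handful of regular expressions. The pattern set, the entity names, and the `find_pii` helper are assumptions made for this example; they cover only simple US-style formats and are not a complete recognizer.

```python
import re

# Illustrative patterns only; real recognizers need broader coverage
# (international phone formats, unusual email forms, and so on).
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "US_PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
}


def find_pii(text):
    """Return (entity_type, start, end, matched_text) tuples ordered by position."""
    hits = []
    for entity_type, pattern in PII_PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((entity_type, m.start(), m.end(), m.group()))
    return sorted(hits, key=lambda h: h[1])


sample = "Reach me at john.smith@email.com or 555-123-4567. SSN: 123-45-6789."
for entity_type, start, end, value in find_pii(sample):
    print(f"{entity_type} [{start}:{end}] {value}")
```

Anything that does not fit one of these shapes (a name, a spelled-out phone number, a non-US format) passes through undetected, which is the "limited to known patterns" trade-off.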
Named Entity Recognition (NER)
Machine learning models trained to identify entities like names, organizations, and locations from context. More flexible than regex but requires ML infrastructure.
Checksum Validation
Verify candidate matches with checksums or structural rules (the Luhn algorithm for credit cards, area/group/serial number rules for SSNs). Reduces false positives from regex matches.
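A minimal sketch of the Luhn check for card numbers follows. The function name `luhn_valid` and the 13-digit minimum length are assumptions for this example, not part of any particular library.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:  # assumption: card numbers are typically 13-19 digits
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("4111-1111-1111-1111"))  # True
print(luhn_valid("4111-1111-1111-1112"))  # False
```

A 16-digit order number that happens to match the card regex will usually fail this check, so it can be dropped or given a lower confidence score.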
Ensemble Approach
Combine multiple methods: regex for structured data, NER for unstructured, and checksum for validation. This is the approach used by tools like Presidio.
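One way to picture the ensemble idea is a set of independent detectors whose results are filtered by score and merged. The sketch below is only that: the `Detection` class, the scores, and `toy_name_detector` (a stand-in for a real NER model) are all hypothetical, and it does not reproduce how Presidio is implemented internally.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    entity_type: str
    start: int
    end: int
    score: float


CARD_RE = re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def luhn_valid(number: str) -> bool:
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return bool(digits) and total % 10 == 0


def regex_detector(text):
    for m in EMAIL_RE.finditer(text):
        yield Detection("EMAIL", m.start(), m.end(), 0.9)
    for m in CARD_RE.finditer(text):
        # The checksum decides how much the regex hit is trusted.
        score = 0.95 if luhn_valid(m.group()) else 0.2
        yield Detection("CREDIT_CARD", m.start(), m.end(), score)


def toy_name_detector(text):
    # Placeholder for a real NER model: looks up a fixed, made-up name list.
    for name in ("John Smith",):
        for m in re.finditer(re.escape(name), text):
            yield Detection("PERSON", m.start(), m.end(), 0.85)


def detect(text, detectors, threshold=0.5):
    """Run every detector, drop low scores, and resolve overlaps by score."""
    candidates = [d for det in detectors for d in det(text) if d.score >= threshold]
    kept = []
    for d in sorted(candidates, key=lambda d: d.score, reverse=True):
        if all(d.end <= k.start or d.start >= k.end for k in kept):
            kept.append(d)
    return sorted(kept, key=lambda d: d.start)


text = "John Smith paid with 4111-1111-1111-1111 and 1234-5678-9012-3456."
for d in detect(text, [regex_detector, toy_name_detector]):
    print(d.entity_type, text[d.start:d.end], d.score)
```

The second card-shaped number fails the checksum, falls below the threshold, and is not reported.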
Anonymization Techniques
Redaction
Complete removal or replacement with placeholder.
Masking
Partial hiding while preserving format.
Pseudonymization
Replace with consistent fake values (reversible with key).
Hashing
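One-way transformation for irreversible anonymization. Use a keyed or salted hash: plain hashes of low-entropy values such as SSNs or phone numbers can be reversed by brute force.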
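The four techniques can be sketched with the standard library alone. Everything named here (`redact`, `mask`, `Pseudonymizer`, `keyed_hash`, the placeholder strings, the demo key) is a hypothetical helper for illustration. In this sketch the pseudonymization "key" is the stored lookup table, and the hash uses HMAC-SHA256 as one possible keyed construction.

```python
import hashlib
import hmac


def redact(value: str, placeholder: str = "[REDACTED]") -> str:
    """Redaction: replace the whole value with a fixed placeholder."""
    return placeholder


def mask(value: str, visible: int = 4, masking_char: str = "*") -> str:
    """Masking: hide all but the last `visible` alphanumerics, keeping separators."""
    to_hide = max(sum(c.isalnum() for c in value) - visible, 0)
    out = []
    for c in value:
        if c.isalnum() and to_hide > 0:
            out.append(masking_char)
            to_hide -= 1
        else:
            out.append(c)
    return "".join(out)


class Pseudonymizer:
    """Pseudonymization: one consistent surrogate per value, reversible via the table."""

    def __init__(self):
        self._forward = {}
        self._reverse = {}

    def pseudonymize(self, value: str, entity_type: str) -> str:
        if value not in self._forward:
            token = f"{entity_type}_{len(self._forward) + 1}"
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def restore(self, token: str) -> str:
        return self._reverse[token]


def keyed_hash(value: str, key: bytes) -> str:
    """Hashing: one-way HMAC-SHA256 digest of the value under a secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


print(redact("123-45-6789"))                    # [REDACTED]
print(mask("4111-1111-1111-1111"))              # ****-****-****-1111
p = Pseudonymizer()
print(p.pseudonymize("John Smith", "PERSON"))   # PERSON_1
print(keyed_hash("john@email.com", b"demo-key")[:16])
```

Which technique fits depends on what downstream code needs: masking keeps the format readable, pseudonymization keeps references consistent across a conversation, and hashing supports equality checks without exposing the value.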
Implementation with Presidio
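# Install Presidio
pip install presidio-analyzer presidio-anonymizer
# The default analyzer relies on a spaCy English model; download it if it is not already installed
python -m spacy download en_core_web_lg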
# Basic PII Detection & Anonymization
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
# Initialize engines
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
# Sample text with PII
text = """
Hello, my name is John Smith. You can reach me at john.smith@email.com
or call me at 555-123-4567. My SSN is 123-45-6789 and my credit card
is 4111-1111-1111-1111.
"""
# Detect PII entities
results = analyzer.analyze(
    text=text,
    entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER",
              "US_SSN", "CREDIT_CARD"],
    language="en"
)
# Print detected entities
for result in results:
    print(f"{result.entity_type}: {text[result.start:result.end]}")
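# Anonymize with different strategies
# Entity types without an entry (here PHONE_NUMBER) fall back to the
# default replace operator, which emits <ENTITY_TYPE>
operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[NAME]"}),
    # The mask operator expects chars_to_mask, masking_char, and from_end
    "EMAIL_ADDRESS": OperatorConfig("mask", {"chars_to_mask": 10, "masking_char": "*", "from_end": False}),
    # from_end=True masks the last 12 characters of the match
    "CREDIT_CARD": OperatorConfig("mask", {"chars_to_mask": 12, "masking_char": "*", "from_end": True}),
    "US_SSN": OperatorConfig("redact"),
}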
anonymized = anonymizer.anonymize(text=text, analyzer_results=results, operators=operators)
print(anonymized.text)
See full documentation: Microsoft Presidio
LLM Integration Pattern
# PII-safe LLM wrapper
from typing import Dict, Tuple


class PIISafeLLM:
    def __init__(self, llm_client, analyzer, anonymizer):
        self.llm = llm_client
        self.analyzer = analyzer
        self.anonymizer = anonymizer  # available for non-reversible operators (mask, redact)

    def sanitize_input(self, text: str) -> Tuple[str, Dict[str, str]]:
        """Detect PII, swap each span for a unique placeholder, and return the mapping."""
        results = self.analyzer.analyze(text=text, language="en")

        # The anonymizer's default operator emits "<ENTITY_TYPE>" with no index,
        # so numbered placeholders are substituted here directly to keep the
        # mapping consistent with the sanitized text. Spans are replaced from
        # the end of the text backwards so earlier offsets stay valid.
        entity_map: Dict[str, str] = {}
        sanitized = text
        last_start = len(text)
        for i, r in enumerate(sorted(results, key=lambda r: r.start, reverse=True)):
            if r.end > last_start:
                continue  # skip spans that overlap one already replaced
            placeholder = f"<{r.entity_type}_{i}>"
            entity_map[placeholder] = text[r.start:r.end]
            sanitized = sanitized[:r.start] + placeholder + sanitized[r.end:]
            last_start = r.start
        return sanitized, entity_map

    def generate(self, prompt: str) -> str:
        # Sanitize input
        safe_prompt, entity_map = self.sanitize_input(prompt)

        # Call LLM with sanitized input
        response = self.llm.generate(safe_prompt)

        # Optionally restore PII in response (if needed)
        # for placeholder, original in entity_map.items():
        #     response = response.replace(placeholder, original)
        return response
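The optional restore step can be sketched as a standalone helper. This assumes the sanitized prompt contained the same placeholder strings that appear as keys in the mapping; `restore_placeholders` and the sample mapping are hypothetical names used only for illustration.

```python
from typing import Dict, List, Tuple


def restore_placeholders(response: str, entity_map: Dict[str, str]) -> Tuple[str, List[str]]:
    """Swap placeholders back to original values; also report placeholders not found."""
    restored = response
    missing = []
    for placeholder, original in entity_map.items():
        if placeholder in restored:
            restored = restored.replace(placeholder, original)
        else:
            missing.append(placeholder)
    return restored, missing


entity_map = {
    "<PERSON_0>": "John Smith",
    "<EMAIL_ADDRESS_1>": "john.smith@email.com",
}
response = "Dear <PERSON_0>, we received your request."
restored, missing = restore_placeholders(response, entity_map)
print(restored)  # Dear John Smith, we received your request.
print(missing)   # ['<EMAIL_ADDRESS_1>']
```

Returning the placeholders that were not found makes it easier to notice when the model dropped or rewrote a placeholder instead of echoing it verbatim.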
PII Detection Tools
Microsoft Presidio
Open Source
- Free & open-source
- 20+ built-in recognizers
- Extensible with custom entities
- Multiple anonymization operators
Google Cloud DLP
Managed Service
- 150+ info types globally
- Image & structured data support
- Risk analysis included
- Pay-per-use pricing
Amazon Comprehend
Managed Service
- NER + PII detection combined
- AWS ecosystem integration
- Batch processing support
- Pay-per-use pricing
spaCy NER
Open Source
- Fast & production-ready
- Trainable custom models
- Multi-language support
- Requires custom PII training
Best Practices
Do This
- Scan both inputs AND outputs for PII
- Use ensemble methods for better coverage
- Tune confidence thresholds for your use case
- Add custom recognizers for domain-specific PII
- Log detection events (without PII!) for monitoring
- Regularly test with adversarial examples
Avoid This
- Relying only on regex patterns
- Using over-broad detection that hurts usability
- Logging full text including detected PII
- Ignoring context (names in quotes, examples)
- Assuming 100% detection accuracy
- Skipping international PII formats
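Two of these practices, tuning confidence thresholds and logging detection events without the PII itself, can be sketched together. The `Detection` class, the threshold values, the logger name, and the log format are all assumptions chosen for illustration.

```python
import logging
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    entity_type: str
    start: int
    end: int
    score: float


# Hypothetical per-entity thresholds; tune them against your own labeled data.
THRESHOLDS = {"CREDIT_CARD": 0.5, "PERSON": 0.7}
DEFAULT_THRESHOLD = 0.6

logger = logging.getLogger("pii_audit")


def filter_and_log(detections, request_id):
    """Keep detections at or above their threshold and log metadata only."""
    kept = [d for d in detections
            if d.score >= THRESHOLDS.get(d.entity_type, DEFAULT_THRESHOLD)]
    for d in kept:
        # Log type, span, and score; never the matched text itself.
        logger.info("pii_detected request=%s type=%s span=%d-%d score=%.2f",
                    request_id, d.entity_type, d.start, d.end, d.score)
    return kept


logging.basicConfig(level=logging.INFO)
detections = [
    Detection("PERSON", 18, 28, 0.65),        # below the PERSON threshold
    Detection("CREDIT_CARD", 40, 59, 0.95),
]
kept = filter_and_log(detections, request_id="req-001")
```

Because the log lines carry only entity type, offsets, and score, they can be aggregated for monitoring without becoming a second store of personal data.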