Content Filtering
Detecting & Blocking Harmful, Toxic, or Inappropriate Content in LLM Applications
What is Content Filtering?
Content filtering in LLM applications involves detecting, classifying, and blocking harmful, toxic, or inappropriate content in both user inputs and AI-generated outputs. It's essential for creating safe, trustworthy AI experiences while meeting regulatory and brand safety requirements.
"Content moderation is essential for AI systems interacting with users. Without proper filtering, LLMs can generate harmful content, be manipulated to produce policy-violating outputs, or amplify existing biases."
- Toxicity: hate, harassment, threats
- NSFW: adult, explicit content
- Illegal: illegal activities, violence
- Misinformation: false claims, propaganda
Harmful Content Categories
| Category | Description | Examples | Action |
|---|---|---|---|
| Hate Speech | Attacks based on protected characteristics | Racism, sexism, religious discrimination | Block |
| Violence | Threats, graphic violence, self-harm | Threats, instructions for harm | Block |
| Sexual Content | Explicit or suggestive material | Adult content, solicitation | Block |
| Harassment | Bullying, intimidation, stalking | Personal attacks, doxxing | Block |
| Dangerous Activities | Instructions for harm or illegal acts | Weapons, drugs, hacking | Block |
| Profanity | Strong language, obscenities | Curse words, vulgar language | Warn/Filter |
| Spam/Scams | Fraudulent or deceptive content | Phishing, get-rich-quick schemes | Block |
Filtering Approaches
Keyword/Pattern Blocklists
Simple lists of banned words, phrases, or regex patterns. Fast but easily bypassed with misspellings, leetspeak, or creative wording.
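As a concrete illustration, a minimal regex blocklist might look like the sketch below. The patterns are invented examples; a production list would be far larger and maintained per policy category.

```python
import re

# Illustrative patterns only; real blocklists are much larger and
# maintained per policy category.
BLOCKED_PATTERNS = [
    re.compile(r"\bi\s+will\s+hurt\s+you\b", re.IGNORECASE),
    re.compile(r"\bsend\s+me\s+your\s+password\b", re.IGNORECASE),
]

def matches_blocklist(text: str) -> bool:
    """Return True if any banned pattern appears in the text."""
    return any(p.search(text) for p in BLOCKED_PATTERNS)
```

Note how a trivial misspelling ("hurrt") already slips past these patterns, which is why blocklists work best as one cheap layer rather than the whole defense.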
ML Classification Models
Trained classifiers (transformers, BERT-based) that understand context and semantics. More robust than keywords but require ML infrastructure and training data.
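A sketch of this approach using the Hugging Face transformers pipeline; unitary/toxic-bert is one publicly available toxicity classifier, named here only as an example, and the label scheme and threshold depend on whichever model you choose.

```python
from transformers import pipeline

# One publicly available toxicity classifier; swap in whatever model
# fits your policy, languages, and latency budget.
classifier = pipeline("text-classification", model="unitary/toxic-bert")

def is_toxic(text: str, threshold: float = 0.5) -> bool:
    """Flag text when the top predicted label is toxic with high confidence."""
    result = classifier(text)[0]  # e.g. {"label": "toxic", "score": 0.98}
    # Label names follow the chosen model's own scheme
    return result["label"] == "toxic" and result["score"] >= threshold
```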
LLM-as-a-Judge
Using another LLM (like GPT-4, Claude, or Llama Guard) to evaluate content safety. Highly flexible and can follow nuanced policies but adds latency and cost.
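A minimal sketch of the pattern with a generic chat model. The policy wording, model name (gpt-4o-mini here), and one-word output format are all assumptions to adapt; purpose-built judges like Llama Guard use their own prompt formats (see below).

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a content-safety reviewer. Classify the text below as
SAFE or UNSAFE under this policy: no hate speech, threats, sexual content, or
instructions for illegal activity. Answer with exactly one word.

Text: {text}"""

def judge_flags_content(text: str) -> bool:
    """Ask a general-purpose LLM to apply a written safety policy."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(text=text)}],
        temperature=0,
    )
    return "UNSAFE" in response.choices[0].message.content.upper()
```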
Multimodal Filtering
Specialized models for images, audio, and video content. Essential for applications that generate or process media beyond text.
Content Filtering Pipeline
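A typical pipeline applies moderation at two points: the user's input is checked before any tokens are generated, and the model's output is checked before it reaches the user. The sketch below assumes the check_content_safety helper defined in the implementation example that follows; call_llm is a hypothetical placeholder for your actual model call.

```python
REFUSAL = "Sorry, I can't help with that request."

def safe_chat(user_input: str) -> str:
    """Wrap a single LLM call with input and output moderation."""
    # 1. Check the user's input before spending tokens on generation
    if check_content_safety(user_input)["flagged"]:
        return REFUSAL
    # 2. Generate a response (call_llm is a placeholder, not a real API)
    response = call_llm(user_input)
    # 3. Check the model's output before showing it to the user
    if check_content_safety(response)["flagged"]:
        return REFUSAL
    return response
```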
Implementation Example
```python
# Using OpenAI Moderation API
from openai import OpenAI

client = OpenAI()

def check_content_safety(text: str) -> dict:
    """Check content using OpenAI's Moderation API."""
    response = client.moderations.create(input=text)
    result = response.results[0]
    return {
        "flagged": result.flagged,
        "categories": {
            k: v for k, v in result.categories.model_dump().items() if v
        },
        "scores": result.category_scores.model_dump(),
    }

# Example usage
text = "I will hurt you if you don't do what I say"
result = check_content_safety(text)

if result["flagged"]:
    print(f"⚠️ Content flagged: {result['categories']}")
    # Block or handle appropriately
else:
    print("✅ Content is safe")
    # Proceed with the LLM call
```
Using Llama Guard
```python
# Llama Guard for content moderation
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the Llama Guard model (gated on Hugging Face; requires access approval)
model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def moderate_with_llama_guard(user_input: str, assistant_response: str = None) -> bool:
    """Classify content safety using Llama Guard; True means unsafe."""
    # Format the conversation for Llama Guard
    conversation = f"""[INST] Task: Check if there is unsafe content.

<BEGIN CONVERSATION>

User: {user_input}
"""
    if assistant_response:
        conversation += f"Assistant: {assistant_response}\n"
    conversation += "<END CONVERSATION> [/INST]"

    inputs = tokenizer(conversation, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=100)
    # Decode only the newly generated tokens; the prompt itself contains the
    # word "unsafe", so decoding the full sequence would always match.
    generated = output[0][inputs["input_ids"].shape[-1]:]
    result = tokenizer.decode(generated, skip_special_tokens=True)
    # Llama Guard answers "safe" or "unsafe" followed by the violated category
    return "unsafe" in result.lower()
```
Content Moderation Tools
OpenAI Moderation (free API)
- Free to use
- 11 content categories
- Fast response time
- English-focused

Perspective API by Google/Jigsaw
- Multi-language support
- Toxicity scoring
- Used by major platforms
- Rate limits apply

Llama Guard by Meta
- Self-hostable
- Customizable policies
- Input + output filtering
- Requires GPU resources

Azure AI Content Safety by Microsoft
- Text + image moderation
- Severity levels (0-6)
- Azure integration
- Pay-per-use
Best Practices
Do This
- Filter both inputs AND outputs
- Use multiple filtering methods (layered defenses)
- Tune thresholds for your use case (see the sketch after this list)
- Provide helpful error messages when content is blocked
- Log filtering events for review
- Include human review for edge cases
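On threshold tuning: the Moderation API's category scores are floats between 0 and 1, so you can apply stricter or looser per-category cutoffs than the default flagged boolean. The values below are placeholders to calibrate on labeled samples from your own traffic.

```python
# Placeholder thresholds; calibrate on labeled samples from your traffic
THRESHOLDS = {"harassment": 0.4, "violence": 0.3, "sexual": 0.2}

def exceeds_thresholds(scores: dict[str, float]) -> list[str]:
    """Return the categories whose moderation scores cross our limits."""
    return [cat for cat, limit in THRESHOLDS.items()
            if scores.get(cat, 0.0) >= limit]

# Usage with the helper defined earlier:
# violated = exceeds_thresholds(check_content_safety(text)["scores"])
```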
Avoid This
- Relying only on keyword blocklists
- Over-filtering that hurts user experience
- Ignoring cultural/language differences
- Silent failures without feedback
- Assuming filters are 100% effective
- Exposing filter logic to users