Content Filtering
Detecting & Blocking Harmful, Toxic, or Inappropriate Content in LLM Applications
What is Content Filtering?
Content filtering in LLM applications involves detecting, classifying, and blocking harmful, toxic, or inappropriate content in both user inputs and AI-generated outputs. It's essential for creating safe, trustworthy AI experiences while meeting regulatory and brand safety requirements.
"Content moderation is essential for AI systems interacting with users. Without proper filtering, LLMs can generate harmful content, be manipulated to produce policy-violating outputs, or amplify existing biases."
- Toxicity: hate, harassment, threats
- NSFW: adult, explicit content
- Illegal: illegal activities, violence
- Misinformation: false claims, propaganda
Harmful Content Categories
| Category | Description | Examples | Action |
|---|---|---|---|
| Hate Speech | Attacks based on protected characteristics | Racism, sexism, religious discrimination | Block |
| Violence | Threats, graphic violence, self-harm | Threats, instructions for harm | Block |
| Sexual Content | Explicit or suggestive material | Adult content, solicitation | Block |
| Harassment | Bullying, intimidation, stalking | Personal attacks, doxxing | Block |
| Dangerous Activities | Instructions for harm or illegal acts | Weapons, drugs, hacking | Block |
| Profanity | Strong language, obscenities | Curse words, vulgar language | Warn/Filter |
| Spam/Scams | Fraudulent or deceptive content | Phishing, get-rich-quick schemes | Block |
Filtering Approaches
Keyword/Pattern Blocklists
Simple lists of banned words, phrases, or regex patterns. Fast but easily bypassed with misspellings, leetspeak, or creative wording.
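As a concrete illustration, a minimal regex blocklist might look like the sketch below. The patterns are invented examples; a production list would be far larger and maintained per policy category.

```python
import re

# Illustrative patterns only; real blocklists are much larger and
# maintained per policy category.
BLOCKED_PATTERNS = [
    re.compile(r"\bi\s+will\s+hurt\s+you\b", re.IGNORECASE),
    re.compile(r"\bsend\s+me\s+your\s+password\b", re.IGNORECASE),
]

def matches_blocklist(text: str) -> bool:
    """Return True if any banned pattern appears in the text."""
    return any(p.search(text) for p in BLOCKED_PATTERNS)
```

Note how a trivial misspelling ("hurrt") already slips past these patterns, which is why blocklists work best as one cheap layer rather than the whole defense.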
ML Classification Models
Trained classifiers (transformers, BERT-based) that understand context and semantics. More robust than keywords but require ML infrastructure and training data.
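A sketch of this approach using the Hugging Face transformers pipeline; unitary/toxic-bert is one publicly available toxicity classifier, named here only as an example, and the label scheme and threshold depend on whichever model you choose.

```python
from transformers import pipeline

# One publicly available toxicity classifier; swap in whatever model
# fits your policy, languages, and latency budget.
classifier = pipeline("text-classification", model="unitary/toxic-bert")

def is_toxic(text: str, threshold: float = 0.5) -> bool:
    """Flag text when the top predicted label is toxic with high confidence."""
    result = classifier(text)[0]  # e.g. {"label": "toxic", "score": 0.98}
    # Label names follow the chosen model's own scheme
    return result["label"] == "toxic" and result["score"] >= threshold
```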
LLM-as-a-Judge
Using another LLM (like GPT-4, Claude, or Llama Guard) to evaluate content safety. Highly flexible and can follow nuanced policies but adds latency and cost.
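A minimal sketch of the pattern with a generic chat model. The policy wording, model name (gpt-4o-mini here), and one-word output format are all assumptions to adapt; purpose-built judges like Llama Guard use their own prompt formats (see below).

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a content-safety reviewer. Classify the text below as
SAFE or UNSAFE under this policy: no hate speech, threats, sexual content, or
instructions for illegal activity. Answer with exactly one word.

Text: {text}"""

def judge_flags_content(text: str) -> bool:
    """Ask a general-purpose LLM to apply a written safety policy."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(text=text)}],
        temperature=0,
    )
    return "UNSAFE" in response.choices[0].message.content.upper()
```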
Multimodal Filtering
Specialized models for images, audio, and video content. Essential for applications that generate or process media beyond text.
Content Filtering Pipeline
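A typical pipeline applies moderation at two points: the user's input is checked before any tokens are generated, and the model's output is checked before it reaches the user. The sketch below assumes the check_content_safety helper defined in the implementation example that follows; call_llm is a hypothetical placeholder for your actual model call.

```python
REFUSAL = "Sorry, I can't help with that request."

def safe_chat(user_input: str) -> str:
    """Wrap a single LLM call with input and output moderation."""
    # 1. Check the user's input before spending tokens on generation
    if check_content_safety(user_input)["flagged"]:
        return REFUSAL
    # 2. Generate a response (call_llm is a placeholder, not a real API)
    response = call_llm(user_input)
    # 3. Check the model's output before showing it to the user
    if check_content_safety(response)["flagged"]:
        return REFUSAL
    return response
```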
Implementation Example
```python
# Using OpenAI Moderation API
from openai import OpenAI

client = OpenAI()

def check_content_safety(text: str) -> dict:
    """Check content using OpenAI's Moderation API."""
    response = client.moderations.create(input=text)
    result = response.results[0]
    return {
        "flagged": result.flagged,
        "categories": {
            k: v for k, v in result.categories.model_dump().items() if v
        },
        "scores": result.category_scores.model_dump(),
    }

# Example usage
text = "I will hurt you if you don't do what I say"
result = check_content_safety(text)

if result["flagged"]:
    print(f"⚠️ Content flagged: {result['categories']}")
    # Block or handle appropriately
else:
    print("✅ Content is safe")
    # Proceed with the LLM call
```
Using Llama Guard
```python
# Llama Guard for content moderation
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the Llama Guard model (gated on Hugging Face; requires access approval)
model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def moderate_with_llama_guard(user_input: str, assistant_response: str = None) -> bool:
    """Classify content safety using Llama Guard; True means unsafe."""
    # Format the conversation for Llama Guard
    conversation = f"""[INST] Task: Check if there is unsafe content.

<BEGIN CONVERSATION>

User: {user_input}
"""
    if assistant_response:
        conversation += f"Assistant: {assistant_response}\n"
    conversation += "<END CONVERSATION> [/INST]"

    inputs = tokenizer(conversation, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=100)
    # Decode only the newly generated tokens; the prompt itself contains the
    # word "unsafe", so decoding the full sequence would always match.
    generated = output[0][inputs["input_ids"].shape[-1]:]
    result = tokenizer.decode(generated, skip_special_tokens=True)
    # Llama Guard answers "safe" or "unsafe" followed by the violated category
    return "unsafe" in result.lower()
```
Content Moderation Tools
OpenAI Moderation (free API)
- Free to use
- 11 content categories
- Fast response time
- English-focused

Perspective API by Google/Jigsaw
- Multi-language support
- Toxicity scoring
- Used by major platforms
- Rate limits apply

Llama Guard by Meta
- Self-hostable
- Customizable policies
- Input + output filtering
- Requires GPU resources

Azure AI Content Safety by Microsoft
- Text + image moderation
- Severity levels (0-6)
- Azure integration
- Pay-per-use
Best Practices
Do This
- Filter both inputs AND outputs
- Use multiple filtering methods (layered defenses)
- Tune thresholds for your use case (see the sketch after this list)
- Provide helpful error messages when content is blocked
- Log filtering events for review
- Include human review for edge cases
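On threshold tuning: the Moderation API's category scores are floats between 0 and 1, so you can apply stricter or looser per-category cutoffs than the default flagged boolean. The values below are placeholders to calibrate on labeled samples from your own traffic.

```python
# Placeholder thresholds; calibrate on labeled samples from your traffic
THRESHOLDS = {"harassment": 0.4, "violence": 0.3, "sexual": 0.2}

def exceeds_thresholds(scores: dict[str, float]) -> list[str]:
    """Return the categories whose moderation scores cross our limits."""
    return [cat for cat, limit in THRESHOLDS.items()
            if scores.get(cat, 0.0) >= limit]

# Usage with the helper defined earlier:
# violated = exceeds_thresholds(check_content_safety(text)["scores"])
```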
Avoid This
- Relying only on keyword blocklists
- Over-filtering that hurts user experience
- Ignoring cultural/language differences
- Silent failures without feedback
- Assuming filters are 100% effective
- Exposing filter logic to users