
Content Filtering

Detecting & Blocking Harmful, Toxic, or Inappropriate Content in LLM Applications

What is Content Filtering?

Content filtering in LLM applications involves detecting, classifying, and blocking harmful, toxic, or inappropriate content in both user inputs and AI-generated outputs. It's essential for creating safe, trustworthy AI experiences while meeting regulatory and brand safety requirements.

"Content moderation is essential for AI systems interacting with users. Without proper filtering, LLMs can generate harmful content, be manipulated to produce policy-violating outputs, or amplify existing biases."

  • Toxicity: hate, harassment, threats
  • NSFW: adult, explicit content
  • Illegal: illegal activities, violence
  • Misinformation: false claims, propaganda

Harmful Content Categories

Category | Description | Examples | Action
Hate Speech | Attacks based on protected characteristics | Racism, sexism, religious discrimination | Block
Violence | Threats, graphic violence, self-harm | Threats, instructions for harm | Block
Sexual Content | Explicit or suggestive material | Adult content, solicitation | Block
Harassment | Bullying, intimidation, stalking | Personal attacks, doxxing | Block
Dangerous Activities | Instructions for harm or illegal acts | Weapons, drugs, hacking | Block
Profanity | Strong language, obscenities | Curse words, vulgar language | Warn/Filter
Spam/Scams | Fraudulent or deceptive content | Phishing, get-rich-quick schemes | Block
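
In code, this kind of policy usually boils down to a mapping from category to action. A minimal sketch (the category keys and the enforce_policy helper below are illustrative, not tied to any particular moderation API):

# Illustrative policy map: category -> action; adapt the keys to your own taxonomy
POLICY = {
    "hate_speech": "block",
    "violence": "block",
    "sexual_content": "block",
    "harassment": "block",
    "dangerous_activities": "block",
    "profanity": "warn",
    "spam_scams": "block",
}

def enforce_policy(detected_categories: list[str]) -> str:
    """Return the strictest action required by the detected categories."""
    actions = {POLICY.get(category, "allow") for category in detected_categories}
    if "block" in actions:
        return "block"
    if "warn" in actions:
        return "warn"
    return "allow"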

Filtering Approaches

Keyword/Pattern Blocklists

Simple lists of banned words, phrases, or regex patterns. Fast but easily bypassed with misspellings, leetspeak, or creative wording.

Pros: fast, simple. Cons: easily bypassed, high false positives.
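
A minimal sketch of the blocklist approach; the patterns below are placeholders, not a production list:

import re

# Tiny illustrative blocklist; real lists are much larger and maintained per locale
BLOCKED_PATTERNS = [
    re.compile(r"\bkill\s+you\b", re.IGNORECASE),
    re.compile(r"\b(buy|sell)\s+drugs\b", re.IGNORECASE),
]

def blocklist_flag(text: str) -> bool:
    """Return True if any banned pattern matches the text."""
    return any(pattern.search(text) for pattern in BLOCKED_PATTERNS)

As noted above, trivial obfuscation ("k1ll y0u") slips past such patterns, which is why blocklists usually serve only as a cheap first layer.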

ML Classification Models

Trained classifiers (transformers, BERT-based) that understand context and semantics. More robust than keywords but require ML infrastructure and training data.

Pros: context-aware, nuanced detection. Cons: requires training data and ML infrastructure. (Recommended)
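
A sketch using the Hugging Face transformers pipeline; unitary/toxic-bert is one publicly available toxicity classifier, and the 0.5 cutoff is an assumption to tune on your own data:

from transformers import pipeline

# unitary/toxic-bert is one example of an open toxicity classifier
classifier = pipeline("text-classification", model="unitary/toxic-bert", top_k=None)

def ml_toxicity_flag(text: str, threshold: float = 0.5) -> bool:
    """Flag text if any toxicity label scores above the threshold."""
    results = classifier(text)
    # Depending on the transformers version, results may be nested one level deep
    if results and isinstance(results[0], list):
        results = results[0]
    return any(item["score"] >= threshold for item in results)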

LLM-as-a-Judge

Using another LLM (like GPT-4, Claude, or Llama Guard) to evaluate content safety. Highly flexible and can follow nuanced policies but adds latency and cost.

Pros: highly flexible, policy-aware. Cons: higher latency, additional cost.
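
A sketch of the idea with a chat-completion call as the judge; the policy wording, the gpt-4o-mini model name, and the single-word SAFE/UNSAFE reply format are assumptions you would adapt to your own policy:

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a content-safety reviewer. Policy: no hate speech, threats,
sexual content, harassment, or instructions for violence or illegal activity.
Reply with exactly one word: SAFE or UNSAFE.

Content to review:
{content}"""

def llm_judge_flag(text: str, model: str = "gpt-4o-mini") -> bool:
    """Ask a judge model to classify the text; True means judged unsafe."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(content=text)}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("UNSAFE")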

Multimodal Filtering

Specialized models for images, audio, and video content. Essential for applications that generate or process media beyond text.

Pros: multi-format coverage. Cons: requires specialized models for each modality.

Content Filtering Pipeline

User Input → Pre-Filter → LLM Generate → Post-Filter → Safe Response
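
A minimal sketch of how the stages fit together; is_safe and generate_reply are placeholders for whatever moderation check and LLM call you use (for example, the Moderation API helper below):

def filtered_chat(user_input: str, is_safe, generate_reply) -> str:
    """Wrap an LLM call with pre- and post-filters.

    is_safe: callable(text) -> bool, e.g. a moderation API or classifier
    generate_reply: callable(text) -> str, the underlying LLM call
    """
    # Pre-filter: reject harmful input before it reaches the model
    if not is_safe(user_input):
        return "Sorry, I can't help with that request."

    reply = generate_reply(user_input)

    # Post-filter: never return a flagged model output to the user
    if not is_safe(reply):
        return "Sorry, I can't share that response."

    return reply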

Implementation Example

# Using OpenAI Moderation API
from openai import OpenAI

client = OpenAI()

def check_content_safety(text: str) -> dict:
    """Check content using OpenAI's Moderation API"""
    response = client.moderations.create(input=text)
    result = response.results[0]
    
    return {
        "flagged": result.flagged,
        "categories": {
            k: v for k, v in result.categories.model_dump().items() 
            if v
        },
        "scores": result.category_scores.model_dump()
    }

# Example usage
text = "I will hurt you if you don't do what I say"
result = check_content_safety(text)

if result["flagged"]:
    print(f"⚠️ Content flagged: {result['categories']}")
    # Block or handle appropriately
else:
    print("✅ Content is safe")
    # Proceed with LLM call
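
If the binary flagged verdict is too coarse, the per-category scores can be compared against your own cutoffs. The thresholds below are illustrative, and the exact key names depend on the SDK's serialization, so inspect result["scores"] for your version:

# Illustrative per-category cutoffs; tune these on your own traffic
CUSTOM_THRESHOLDS = {"hate": 0.5, "violence": 0.4, "sexual": 0.7}

def over_threshold(scores: dict) -> list:
    """Return the categories whose score exceeds our own cutoff."""
    return [category for category, cutoff in CUSTOM_THRESHOLDS.items()
            if scores.get(category, 0.0) >= cutoff]

violations = over_threshold(result["scores"])
if violations:
    print(f"⚠️ Over custom threshold: {violations}")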

Using Llama Guard

# Llama Guard for content moderation
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load Llama Guard model
model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def moderate_with_llama_guard(user_input: str, assistant_response: str = None):
    """Classify content safety using Llama Guard"""
    
    # Simplified Llama Guard prompt; the official template also enumerates
    # the unsafe-content categories the model should check against
    conversation = f"""[INST] Task: Check if there is unsafe content.

<BEGIN CONVERSATION>
User: {user_input}
"""
    if assistant_response:
        conversation += f"Assistant: {assistant_response}\n"
    conversation += "<END CONVERSATION> [/INST]"
    
    inputs = tokenizer(conversation, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=100)

    # Decode only the newly generated tokens; the prompt itself contains the
    # word "unsafe", so decoding the full sequence would always trip the check
    prompt_length = inputs["input_ids"].shape[-1]
    result = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)

    # Llama Guard answers "safe" or "unsafe" (plus the violated category)
    return "unsafe" in result.lower()

See: Llama Guard on HuggingFace

Content Moderation Tools

OpenAI Moderation

Free API
  • Free to use
  • 11 content categories
  • Fast response time
  • English-focused
View API Docs

Perspective API

by Google/Jigsaw
  • Multi-language support
  • Toxicity scoring
  • Used by major platforms
  • Rate limits apply
View Perspective

Llama Guard

by Meta
  • Self-hostable
  • Customizable policies
  • Input + output filtering
  • Requires GPU resources
View on HuggingFace

Azure AI Content Safety

by Microsoft
  • Text + image moderation
  • Severity levels (0-6)
  • Azure integration
  • Pay-per-use
View Azure Content Safety

Best Practices

Do This

  • Filter both inputs AND outputs
  • Use multiple filtering methods (layered)
  • Tune thresholds for your use case
  • Provide helpful error messages
  • Log filtering events for review
  • Include human review for edge cases

Avoid This

  • Relying only on keyword blocklists
  • Over-filtering that hurts user experience
  • Ignoring cultural/language differences
  • Silent failures without feedback
  • Assuming filters are 100% effective
  • Exposing filter logic to users
