GenAI Cost Management
Controlling and Optimizing AI Infrastructure Investments
Understanding LLM Costs
LLM providers typically charge based on token usage: the number of tokens processed for both input (prompt) and output (completion). Understanding this model is essential for accurate budgeting and cost optimization.
Input Tokens
System prompt + user message + context. Usually cheaper than output.
Output Tokens
Generated response. Typically 4-5x more expensive than input.
Compute Time
Some providers charge for GPU/inference time separately.
Model Pricing Comparison (per 1M tokens)
| Model | Input Price | Output Price | Best For |
|---|---|---|---|
| GPT-4o Mini | $0.15 | $0.60 | High-volume, simple tasks |
| GPT-4o | $2.50 | $10.00 | Complex reasoning, vision |
| Claude 3.5 Haiku | $0.80 | $4.00 | Fast, lightweight tasks |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Balanced performance |
| Claude 3 Opus | $15.00 | $75.00 | Most complex analysis |
| Gemini 1.5 Flash | $0.075 | $0.30 | Fastest, cost-effective |
| Gemini 1.5 Pro | $1.25 | $5.00 | Long context (1M+ tokens) |
* Prices as of late 2024. Check official pricing pages for current rates.
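To make these rates concrete, below is a minimal cost-estimation sketch in Python. The price dictionary uses the late-2024 rates from the table above, and the traffic profile (1M requests/month at 500 input + 200 output tokens) is an illustrative assumption, not a live quote.

```python
# Illustrative cost estimator using the late-2024 rates from the table above.
# Prices are in USD per 1M tokens; confirm against official pricing pages.
PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "claude-3-5-sonnet": (3.00, 15.00),
    "gemini-1.5-flash": (0.075, 0.30),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single request."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example traffic profile: 1M requests/month, 500 input + 200 output tokens each.
per_request = request_cost("gpt-4o-mini", 500, 200)
print(f"Per request: ${per_request:.6f}")              # $0.000195
print(f"Per month:   ${per_request * 1_000_000:,.0f}") # ~$195
```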
Cost Optimization Strategies
Model Selection & Routing
Route simple queries to cheaper models (GPT-4o Mini, Haiku) and complex ones to powerful models (GPT-4o, Sonnet), as in the sketch below. Routing this way can reduce costs by 60-80%.
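A minimal illustration of the idea: the length threshold and COMPLEX_HINTS keywords are made-up placeholders, and production routers typically use a small classifier or an LLM judge instead of keyword matching.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical heuristic: long prompts or reasoning-style keywords
# go to the flagship model; everything else goes to the cheap one.
COMPLEX_HINTS = ("analyze", "compare", "step by step", "prove", "debug")

def pick_model(prompt: str) -> str:
    is_complex = len(prompt) > 2000 or any(h in prompt.lower() for h in COMPLEX_HINTS)
    return "gpt-4o" if is_complex else "gpt-4o-mini"

def route(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=pick_model(prompt),
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(route("Summarize this in five words: cats sleep most of the day."))
```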
Prompt Optimization
Remove unnecessary context, use concise system prompts, and request shorter outputs. Every token saved is money saved, especially at scale.
Semantic Caching
Cache responses for identical or similar queries. Tools like Redis + embeddings can match semantically similar questions to cached answers.
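A minimal in-process sketch of the idea, assuming OpenAI's embeddings endpoint and a plain Python list standing in for Redis; the 0.9 similarity threshold is an illustrative knob to tune against false cache hits.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
_cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached answer)

def _embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cached_answer(query: str, threshold: float = 0.9) -> str | None:
    """Return a cached answer if a semantically similar query was seen before."""
    q = _embed(query)
    for emb, answer in _cache:
        cos = float(np.dot(q, emb) / (np.linalg.norm(q) * np.linalg.norm(emb)))
        if cos >= threshold:
            return answer  # cache hit: skip the (expensive) completion call
    return None

def store(query: str, answer: str) -> None:
    _cache.append((_embed(query), answer))
```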
Batch Processing
Use OpenAI's Batch API for non-urgent requests: get a 50% discount on token costs with 24-hour turnaround.
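Sketched with OpenAI's Python SDK, the flow is: write requests to a JSONL file, upload it, then create a batch with a 24-hour completion window. The file name, prompts, and custom IDs below are illustrative.

```python
import json
from openai import OpenAI

client = OpenAI()

# 1. Write requests as JSONL, one chat completion per line.
with open("requests.jsonl", "w") as f:
    for i, prompt in enumerate(["Classify: great product!", "Classify: arrived broken"]):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o-mini",
                     "messages": [{"role": "user", "content": prompt}]},
        }) + "\n")

# 2. Upload the file and create the batch; results arrive within 24h at ~50% off.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```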
Fine-Tuning & Distillation
Train smaller models on outputs from larger ones. A fine-tuned 7B model can match GPT-4 quality for specific tasks at 10-100x lower cost.
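One common distillation workflow, sketched under illustrative assumptions (the ticket-classification task and system prompt are made up): have the flagship model label examples, then store them in OpenAI's chat fine-tuning JSONL format to train the smaller model.

```python
import json
from openai import OpenAI

client = OpenAI()
SYSTEM = "Classify the support ticket as: billing, bug, or feature_request."

examples = ["I was charged twice this month.", "The app crashes on startup."]

with open("distill_train.jsonl", "w") as f:
    for text in examples:
        # Teacher: the flagship model produces the label.
        label = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": text}],
        ).choices[0].message.content
        # Student training row in OpenAI's fine-tuning chat format.
        f.write(json.dumps({"messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": text},
            {"role": "assistant", "content": label},
        ]}) + "\n")
```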
Self-Hosted Open-Source Models
Run LLaMA, Mistral, or Gemma locally for predictable costs. The upfront investment is higher, but at scale it can be 10-100x cheaper.
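Because servers such as vLLM and Ollama expose OpenAI-compatible endpoints, application code often needs only a base-URL change. The sketch below assumes a vLLM server already running on localhost:8000 and serving LLaMA 3.1 8B; both are illustrative choices.

```python
from openai import OpenAI

# Point the standard client at a self-hosted, OpenAI-compatible server
# (e.g. started with: vllm serve meta-llama/Llama-3.1-8B-Instruct).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Give me three cost-saving tips."}],
)
print(resp.choices[0].message.content)
```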
SLM vs Open-Source: Cost-Quality-Latency Trade-offs
Choosing between API-based Small Language Models (SLMs), cloud-hosted inference, and self-hosted open-source models involves trade-offs in cost, quality, and latency. Here's a comprehensive comparison:
| Approach | Monthly Cost* | Quality | Latency | Best For |
|---|---|---|---|---|
| GPT-4o (API) | $500-5,000+ | ★★★★★ | 200-500ms | Complex reasoning, multi-step |
| GPT-4o Mini / Haiku (SLM) | $50-500 | ★★★★ | 100-300ms | High volume, simpler tasks |
| AWS Bedrock / Azure (Managed) | $200-2,000 | ★★★★ | 150-400ms | Enterprise compliance needs |
| Self-Hosted LLaMA 3.1 70B | $2,000-8,000 | ★★★★ | 300-800ms | High volume, data privacy |
| Self-Hosted Mistral 7B | $300-1,000 | ★★★ | 50-150ms | Simple tasks, ultra-low latency |
| Groq (LPU Inference) | $100-800 | ★★★★ | 10-50ms | Real-time applications |
* Based on ~1M requests/month with average 500 input + 200 output tokens
Self-Hosted Infrastructure Costs
🖥️ GPU Infrastructure
- 7B Model: 1x A10G (~$1.00/hr) = ~$730/mo
- 13B Model: 1x A100 40GB (~$3.50/hr) = ~$2,500/mo
- 70B Model: 2-4x A100 80GB (~$12/hr) = ~$8,760/mo
- +20-30% for redundancy & scaling
⚙️ Operational Overhead
- DevOps: $500-2,000/mo (time/staffing)
- Monitoring: $100-500/mo
- Networking/Storage: $200-1,000/mo
- Model updates: Ongoing effort
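A back-of-the-envelope break-even sketch using these figures; every number below is an illustrative midpoint from the cards above, not a quote.

```python
# Rough break-even between API and self-hosted, using this section's figures.
# Per-request API cost: 500 input + 200 output tokens on GPT-4o ($2.50/$10 per 1M).
api_per_req = (500 * 2.50 + 200 * 10.00) / 1_000_000   # $0.00325

# Self-hosted 70B fixed costs: ~$8,760/mo GPUs + ~$1,500/mo ops (midpoint guess).
self_hosted_fixed = 8_760 + 1_500

break_even = self_hosted_fixed / api_per_req
print(f"Break-even: ~{break_even:,.0f} requests/month")  # ~3.2M requests/month
```

Consistent with the decision framework below: self-hosting only pays off at very high, sustained volume.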
Quality Considerations
- SLMs (GPT-4o Mini, Haiku): 85-95% quality of flagship models for most tasks. Best for classification, extraction, simple generation.
- Open-Source 70B (LLaMA 3.1): Comparable to GPT-4 for most tasks. May struggle with complex reasoning or niche domains.
- Open-Source 7B (Mistral, Phi-3): 60-80% quality. Excellent for specific fine-tuned tasks, limited for general use.
- Fine-tuned SLMs: Can exceed flagship quality for narrow domains at fraction of cost.
Decision Framework
✅ Use API (OpenAI/Anthropic)
- Low-medium volume (<100K req/mo)
- Need latest capabilities
- Fast time-to-market
- No GPU expertise in-house
✅ Use Managed (Bedrock/Vertex)
- Enterprise compliance required
- Multi-model flexibility
- Existing cloud commitment
- Single billing preferred
✅ Use Self-Hosted Open-Source
- Very high volume (>1M req/mo)
- Strict data residency/privacy
- Predictable, fixed costs needed
- Have ML/GPU expertise
Cost Monitoring & Governance Tools
Langfuse
Open-source LLM observability with cost tracking, token usage, and analytics.
Helicone
One-line integration for LLM cost monitoring, caching, and rate limiting.
Portkey
AI gateway with budget controls, caching, fallbacks, and cost optimization.
TrueFoundry
ML platform with GPU optimization, cost allocation, and infra management.
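As an example of how lightweight these integrations can be, Helicone's documented pattern proxies OpenAI traffic through its gateway via a base-URL change plus an auth header; treat the exact URL and header name below as subject to their current docs.

```python
import os
from openai import OpenAI

# Per Helicone's documented proxy pattern: same OpenAI SDK, different base URL,
# plus a Helicone auth header; requests are then logged for cost analytics.
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```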
Enterprise Budget Framework
| Cost Category | % of Total | Key Considerations |
|---|---|---|
| API & Inference | 40-60% | Token usage, model selection, volume discounts |
| Infrastructure | 15-25% | Vector DB, compute, storage, networking |
| Development | 15-25% | Engineering time, tooling, testing |
| Observability | 5-10% | Monitoring, logging, analytics tools |
| Security & Compliance | 5-10% | Audits, guardrails, governance |
FinOps Best Practices
- Set budget alerts: Configure spend limits per project, team, and environment (see the sketch after this list)
- Track by use case: Tag requests to understand cost per feature/product
- Review weekly: Catch runaway costs early, optimize top spenders
- Benchmark models: Regularly test if cheaper models suffice for your tasks
- Negotiate contracts: Volume discounts, committed use, enterprise agreements
- Plan for scale: Costs grow non-linearly; model routing is essential
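A minimal in-house sketch of the first two practices: tag each request with a use case, accumulate spend, and alert past a threshold. The budgets and the alert hook are placeholders; real deployments would wire this into the gateway or observability tool in use.

```python
from collections import defaultdict

BUDGETS = {"chatbot": 500.00, "summarizer": 200.00}  # monthly USD, illustrative
spend = defaultdict(float)

def record(use_case: str, cost_usd: float) -> None:
    """Accumulate spend per tagged use case and alert at 80% of budget."""
    spend[use_case] += cost_usd
    budget = BUDGETS.get(use_case)
    if budget and spend[use_case] >= 0.8 * budget:
        alert(f"{use_case} at ${spend[use_case]:.2f} of ${budget:.2f} budget")

def alert(message: str) -> None:
    # Placeholder: wire to Slack, PagerDuty, email, etc.
    print("BUDGET ALERT:", message)

record("chatbot", 410.00)  # triggers the alert (>= $400, 80% of $500)
```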