GenAI Cost Management
Controlling and Optimizing AI Infrastructure Investments
Understanding LLM Costs
LLM providers typically charge based on token usage: the number of tokens processed for both input (prompt) and output (completion). Understanding this model is essential for accurate budgeting and cost optimization.
Input Tokens
System prompt + user message + context. Usually cheaper than output.
Output Tokens
Generated response. Typically 4-5x more expensive than input.
Compute Time
Some providers charge for GPU/inference time separately.
Model Pricing Comparison (per 1M tokens)
| Model | Input Price | Output Price | Best For |
|---|---|---|---|
| GPT-4o Mini | $0.15 | $0.60 | High-volume, simple tasks |
| GPT-4o | $2.50 | $10.00 | Complex reasoning, vision |
| Claude 3.5 Haiku | $0.80 | $4.00 | Fast, lightweight tasks |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Balanced performance |
| Claude 3 Opus | $15.00 | $75.00 | Most complex analysis |
| Gemini 1.5 Flash | $0.075 | $0.30 | Fastest, cost-effective |
| Gemini 1.5 Pro | $1.25 | $5.00 | Long context (1M+ tokens) |
* Prices as of late 2024. Check official pricing pages for current rates.
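To make these rates concrete, below is a minimal cost-estimation sketch in Python. The price dictionary uses the late-2024 rates from the table above, and the traffic profile (1M requests/month at 500 input + 200 output tokens) is an illustrative assumption, not a live quote.

```python
# Illustrative cost estimator using the late-2024 rates from the table above.
# Prices are in USD per 1M tokens; confirm against official pricing pages.
PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "claude-3-5-sonnet": (3.00, 15.00),
    "gemini-1.5-flash": (0.075, 0.30),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single request."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example traffic profile: 1M requests/month, 500 input + 200 output tokens each.
per_request = request_cost("gpt-4o-mini", 500, 200)
print(f"Per request: ${per_request:.6f}")              # $0.000195
print(f"Per month:   ${per_request * 1_000_000:,.0f}") # ~$195
```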
Cost Optimization Strategies
Model Selection & Routing
Route simple queries to cheaper models (GPT-4o Mini, Haiku) and complex ones to powerful models (GPT-4o, Sonnet), as in the sketch below. Routing this way can reduce costs by 60-80%.
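A minimal illustration of the idea: the length threshold and COMPLEX_HINTS keywords are made-up placeholders, and production routers typically use a small classifier or an LLM judge instead of keyword matching.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical heuristic: long prompts or reasoning-style keywords
# go to the flagship model; everything else goes to the cheap one.
COMPLEX_HINTS = ("analyze", "compare", "step by step", "prove", "debug")

def pick_model(prompt: str) -> str:
    is_complex = len(prompt) > 2000 or any(h in prompt.lower() for h in COMPLEX_HINTS)
    return "gpt-4o" if is_complex else "gpt-4o-mini"

def route(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=pick_model(prompt),
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(route("Summarize this in five words: cats sleep most of the day."))
```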
Prompt Optimization
Remove unnecessary context, use concise system prompts, and request shorter outputs. Every token saved is money saved, especially at scale.
Semantic Caching
Cache responses for identical or similar queries. Tools like Redis + embeddings can match semantically similar questions to cached answers.
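A minimal in-process sketch of the idea, assuming OpenAI's embeddings endpoint and a plain Python list standing in for Redis; the 0.9 similarity threshold is an illustrative knob to tune against false cache hits.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
_cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached answer)

def _embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cached_answer(query: str, threshold: float = 0.9) -> str | None:
    """Return a cached answer if a semantically similar query was seen before."""
    q = _embed(query)
    for emb, answer in _cache:
        cos = float(np.dot(q, emb) / (np.linalg.norm(q) * np.linalg.norm(emb)))
        if cos >= threshold:
            return answer  # cache hit: skip the (expensive) completion call
    return None

def store(query: str, answer: str) -> None:
    _cache.append((_embed(query), answer))
```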
Batch Processing
Use OpenAI's Batch API for non-urgent requests: get a 50% discount on token costs with 24-hour turnaround.
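Sketched with OpenAI's Python SDK, the flow is: write requests to a JSONL file, upload it, then create a batch with a 24-hour completion window. The file name, prompts, and custom IDs below are illustrative.

```python
import json
from openai import OpenAI

client = OpenAI()

# 1. Write requests as JSONL, one chat completion per line.
with open("requests.jsonl", "w") as f:
    for i, prompt in enumerate(["Classify: great product!", "Classify: arrived broken"]):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o-mini",
                     "messages": [{"role": "user", "content": prompt}]},
        }) + "\n")

# 2. Upload the file and create the batch; results arrive within 24h at ~50% off.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```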
Fine-Tuning & Distillation
Train smaller models on outputs from larger ones. A fine-tuned 7B model can match GPT-4 quality for specific tasks at 10-100x lower cost.
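One common distillation workflow, sketched under illustrative assumptions (the ticket-classification task and system prompt are made up): have the flagship model label examples, then store them in OpenAI's chat fine-tuning JSONL format to train the smaller model.

```python
import json
from openai import OpenAI

client = OpenAI()
SYSTEM = "Classify the support ticket as: billing, bug, or feature_request."

examples = ["I was charged twice this month.", "The app crashes on startup."]

with open("distill_train.jsonl", "w") as f:
    for text in examples:
        # Teacher: the flagship model produces the label.
        label = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": text}],
        ).choices[0].message.content
        # Student training row in OpenAI's fine-tuning chat format.
        f.write(json.dumps({"messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": text},
            {"role": "assistant", "content": label},
        ]}) + "\n")
```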
Self-Hosted Open-Source Models
Run LLaMA, Mistral, or Gemma locally for predictable costs. The upfront investment is higher, but at scale it can be 10-100x cheaper.
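Because servers such as vLLM and Ollama expose OpenAI-compatible endpoints, application code often needs only a base-URL change. The sketch below assumes a vLLM server already running on localhost:8000 and serving LLaMA 3.1 8B; both are illustrative choices.

```python
from openai import OpenAI

# Point the standard client at a self-hosted, OpenAI-compatible server
# (e.g. started with: vllm serve meta-llama/Llama-3.1-8B-Instruct).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Give me three cost-saving tips."}],
)
print(resp.choices[0].message.content)
```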
SLM vs Open-Source: Cost-Quality-Latency Trade-offs
Choosing between API-based Small Language Models (SLMs), cloud-hosted inference, and self-hosted open-source models involves trade-offs in cost, quality, and latency. Here's a comprehensive comparison:
| Approach | Monthly Cost* | Quality | Latency | Best For |
|---|---|---|---|---|
| GPT-4o (API) | $500-5,000+ | ★★★★★ | 200-500ms | Complex reasoning, multi-step |
| GPT-4o Mini / Haiku (SLM) | $50-500 | ★★★★ | 100-300ms | High volume, simpler tasks |
| AWS Bedrock / Azure (Managed) | $200-2,000 | ★★★★ | 150-400ms | Enterprise compliance needs |
| Self-Hosted LLaMA 3.1 70B | $2,000-8,000 | ★★★★ | 300-800ms | High volume, data privacy |
| Self-Hosted Mistral 7B | $300-1,000 | ★★★ | 50-150ms | Simple tasks, ultra-low latency |
| Groq (LPU Inference) | $100-800 | ★★★★ | 10-50ms | Real-time applications |
* Based on ~1M requests/month with average 500 input + 200 output tokens
Self-Hosted Infrastructure Costs
🖥️ GPU Infrastructure
- 7B Model: 1x A10G (~$1.00/hr) = ~$730/mo
- 13B Model: 1x A100 40GB (~$3.50/hr) = ~$2,500/mo
- 70B Model: 2-4x A100 80GB (~$12/hr) = ~$8,760/mo
- +20-30% for redundancy & scaling
⚙️ Operational Overhead
- DevOps: $500-2,000/mo (time/staffing)
- Monitoring: $100-500/mo
- Networking/Storage: $200-1,000/mo
- Model updates: Ongoing effort
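A back-of-the-envelope break-even sketch using these figures; every number below is an illustrative midpoint from the cards above, not a quote.

```python
# Rough break-even between API and self-hosted, using this section's figures.
# Per-request API cost: 500 input + 200 output tokens on GPT-4o ($2.50/$10 per 1M).
api_per_req = (500 * 2.50 + 200 * 10.00) / 1_000_000   # $0.00325

# Self-hosted 70B fixed costs: ~$8,760/mo GPUs + ~$1,500/mo ops (midpoint guess).
self_hosted_fixed = 8_760 + 1_500

break_even = self_hosted_fixed / api_per_req
print(f"Break-even: ~{break_even:,.0f} requests/month")  # ~3.2M requests/month
```

Consistent with the decision framework below: self-hosting only pays off at very high, sustained volume.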
Quality Considerations
- SLMs (GPT-4o Mini, Haiku): 85-95% quality of flagship models for most tasks. Best for classification, extraction, simple generation.
- Open-Source 70B (LLaMA 3.1): Comparable to GPT-4 for most tasks. May struggle with complex reasoning or niche domains.
- Open-Source 7B (Mistral, Phi-3): 60-80% quality. Excellent for specific fine-tuned tasks, limited for general use.
- Fine-tuned SLMs: Can exceed flagship quality for narrow domains at fraction of cost.
Decision Framework
✅ Use API (OpenAI/Anthropic)
- Low-medium volume (<100K req/mo)
- Need latest capabilities
- Fast time-to-market
- No GPU expertise in-house
✅ Use Managed (Bedrock/Vertex)
- Enterprise compliance required
- Multi-model flexibility
- Existing cloud commitment
- Single billing preferred
✅ Use Self-Hosted Open-Source
- Very high volume (>1M req/mo)
- Strict data residency/privacy
- Predictable, fixed costs needed
- Have ML/GPU expertise
Cost Monitoring & Governance Tools
Langfuse
Open-source LLM observability with cost tracking, token usage, and analytics.
Helicone
One-line integration for LLM cost monitoring, caching, and rate limiting.
Portkey
AI gateway with budget controls, caching, fallbacks, and cost optimization.
TrueFoundry
ML platform with GPU optimization, cost allocation, and infra management.
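As an example of how lightweight these integrations can be, Helicone's documented pattern proxies OpenAI traffic through its gateway via a base-URL change plus an auth header; treat the exact URL and header name below as subject to their current docs.

```python
import os
from openai import OpenAI

# Per Helicone's documented proxy pattern: same OpenAI SDK, different base URL,
# plus a Helicone auth header; requests are then logged for cost analytics.
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```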
Enterprise Budget Framework
| Cost Category | % of Total | Key Considerations |
|---|---|---|
| API & Inference | 40-60% | Token usage, model selection, volume discounts |
| Infrastructure | 15-25% | Vector DB, compute, storage, networking |
| Development | 15-25% | Engineering time, tooling, testing |
| Observability | 5-10% | Monitoring, logging, analytics tools |
| Security & Compliance | 5-10% | Audits, guardrails, governance |
FinOps Best Practices
- Set budget alerts: Configure spend limits per project, team, and environment (see the sketch after this list)
- Track by use case: Tag requests to understand cost per feature/product
- Review weekly: Catch runaway costs early, optimize top spenders
- Benchmark models: Regularly test if cheaper models suffice for your tasks
- Negotiate contracts: Volume discounts, committed use, enterprise agreements
- Plan for scale: Costs grow non-linearly; model routing is essential
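A minimal in-house sketch of the first two practices: tag each request with a use case, accumulate spend, and alert past a threshold. The budgets and the alert hook are placeholders; real deployments would wire this into the gateway or observability tool in use.

```python
from collections import defaultdict

BUDGETS = {"chatbot": 500.00, "summarizer": 200.00}  # monthly USD, illustrative
spend = defaultdict(float)

def record(use_case: str, cost_usd: float) -> None:
    """Accumulate spend per tagged use case and alert at 80% of budget."""
    spend[use_case] += cost_usd
    budget = BUDGETS.get(use_case)
    if budget and spend[use_case] >= 0.8 * budget:
        alert(f"{use_case} at ${spend[use_case]:.2f} of ${budget:.2f} budget")

def alert(message: str) -> None:
    # Placeholder: wire to Slack, PagerDuty, email, etc.
    print("BUDGET ALERT:", message)

record("chatbot", 410.00)  # triggers the alert (>= $400, 80% of $500)
```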