# Cost Optimization for LLM Applications
## Cost Breakdown

Typical costs:
- Input tokens: $0.15-$15 per 1M tokens
- Output tokens: $0.60-$75 per 1M tokens
- Embeddings: $0.02-$0.13 per 1M tokens
- Fine-tuning: $8-$25 per 1M tokens
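As a rough sanity check, these per-token rates translate into a bill like this (the helper below is a sketch, not tied to any provider SDK; the example prices are the GPT-4o rates used later in this article):

```python
def estimate_monthly_cost(requests_per_day, input_tokens, output_tokens,
                          price_in_per_1m, price_out_per_1m, days=30):
    """Back-of-the-envelope monthly spend for a single model."""
    per_request = (input_tokens * price_in_per_1m +
                   output_tokens * price_out_per_1m) / 1_000_000
    return requests_per_day * per_request * days

# 10,000 requests/day, 1,000 input + 500 output tokens, at $2.50/$10 per 1M
print(estimate_monthly_cost(10_000, 1_000, 500, 2.50, 10.00))  # 2250.0
```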
## Optimization Strategies

### 1. Prompt Compression
Before (500 tokens):

```
You are a helpful customer service agent. Please analyze this customer inquiry carefully and provide a detailed, empathetic response that addresses all their concerns. Make sure to be professional and courteous...

Customer: Where is my order?
```

After (50 tokens):

```
Customer service: respond professionally.

Customer: Where is my order?
```

Savings: 90% fewer input tokens
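To measure savings like this on your own prompts, count tokens locally before sending anything; the sketch below assumes OpenAI's `tiktoken` library and the `o200k_base` encoding used by GPT-4o-class models:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

verbose = ("You are a helpful customer service agent. Please analyze this customer "
           "inquiry carefully and provide a detailed, empathetic response...")
concise = "Customer service: respond professionally."

before, after = len(enc.encode(verbose)), len(enc.encode(concise))
print(f"{before} -> {after} tokens ({1 - after / before:.0%} saved)")
```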
### 2. Response Caching
```python
import redis
import hashlib

redis_client = redis.Redis()

def get_cached_response(prompt, ttl=3600):
    cache_key = hashlib.md5(prompt.encode()).hexdigest()

    # Check cache
    cached = redis_client.get(cache_key)
    if cached:
        return cached.decode()

    # Call LLM
    response = call_llm(prompt)

    # Cache result
    redis_client.setex(cache_key, ttl, response)
    return response
```
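Assuming `call_llm` is wired to your provider, repeated identical prompts within the TTL then cost a single API call:

```python
# First request misses the cache and pays for the API call;
# the second identical request is served straight from Redis.
get_cached_response("What is your refund policy?")
get_cached_response("What is your refund policy?")  # cache hit, ~zero marginal cost
```

Note that the key is an MD5 hash of the raw prompt, so even a whitespace difference is a miss; normalizing prompts before hashing (and including the model name in the key) usually improves hit rates.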
### 3. Model Tiering

```python
def route_to_model(query, complexity_threshold=0.5):
    complexity = assess_complexity(query)

    if complexity < complexity_threshold:
        return "gpt-4o-mini"  # $0.15/$0.60 per 1M tokens
    else:
        return "gpt-4o"       # $2.50/$10 per 1M tokens
```
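`assess_complexity` is left undefined above; a minimal stand-in might score queries by length and the presence of reasoning keywords (purely illustrative; a small classifier or an embedding-based router is usually more reliable):

```python
def assess_complexity(query: str) -> float:
    """Crude 0-1 score: longer queries and reasoning-style keywords score higher."""
    reasoning_words = {"why", "explain", "compare", "analyze", "plan", "design"}
    length_score = min(len(query.split()) / 100, 1.0)
    keyword_score = 1.0 if any(w in query.lower().split() for w in reasoning_words) else 0.0
    return 0.5 * length_score + 0.5 * keyword_score
```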
### 4. Streaming with Early Stop

```python
def stream_with_stop(prompt, max_cost=0.01):
    tokens_used = 0
    max_tokens = calculate_max_tokens(max_cost)  # translate the $ budget into a token cap

    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=max_tokens
    )

    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")
            tokens_used += 1
            if tokens_used >= max_tokens:  # stop reading once the client-side budget is spent
                break
```
### 5. Batch Processing

```python
# Instead of 100 individual calls
for item in items:
    result = call_llm(item)  # 100 API calls

# Batch them
batch_prompt = "\n".join([f"{i}. {item}" for i, item in enumerate(items)])
results = call_llm(f"Process these:\n{batch_prompt}")  # 1 API call
```
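The trade-off is that the single response has to be split back into per-item answers on the client side. A minimal sketch, assuming the model is instructed to keep the same numbering as the input:

```python
import re

def split_numbered_answers(text: str, n_items: int) -> list[str]:
    """Split a response like '0. ...\n1. ...' back into one answer per input item."""
    # re.split with a capturing group yields [prefix, number, body, number, body, ...]
    parts = re.split(r"^\s*(\d+)\.\s*", text, flags=re.MULTILINE)
    answers = {int(num): body.strip() for num, body in zip(parts[1::2], parts[2::2])}
    return [answers.get(i, "") for i in range(n_items)]
```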
### 6. Prompt Optimization

```python
# Reduce verbosity
prompts = [
    "Original: The customer is asking about...",  # Verbose
    "Optimized: Customer asks: delivery status"   # Concise
]

# Use abbreviations
# Specify exact length needed
```
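One cheap way to act on "specify exact length" is to ask for the length you want and back it with `max_tokens` so the model cannot overrun (reusing the `client` from the streaming example; `ticket_text` stands in for your own input):

```python
ticket_text = "..."  # your customer message here

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": f"Summarize the ticket in at most 2 sentences:\n\n{ticket_text}"}],
    max_tokens=80,  # hard cap roughly matching the requested length
)
```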
## Cost Monitoring

```python
import logging
from datetime import datetime

class CostTracker:
    def __init__(self):
        self.costs = []

    def track_call(self, model, input_tokens, output_tokens):
        cost = calculate_cost(model, input_tokens, output_tokens)
        self.costs.append({
            'timestamp': datetime.now(),
            'model': model,
            'input_tokens': input_tokens,
            'output_tokens': output_tokens,
            'cost': cost
        })
        logging.info(f"API call cost: ${cost:.4f}")

    def daily_report(self):
        total = sum(c['cost'] for c in self.costs)
        return f"Daily cost: ${total:.2f}"
```
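`calculate_cost` is assumed above; a minimal version is just a lookup against a per-model price table (the rates below mirror the ones used in this article; check your provider's current pricing):

```python
# $ per 1M tokens: (input, output)
PRICES = {
    "gpt-4o":      (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def calculate_cost(model, input_tokens, output_tokens):
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000
```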
## Model Selection Matrix

| Use Case | Model | Input Cost | Best For |
|---|---|---|---|
| Simple Q&A | GPT-4o-mini | $0.15/1M | High volume |
| Complex reasoning | GPT-4o | $2.50/1M | Quality critical |
| Long documents | Claude Sonnet | $3/1M | Large context |
| Budget critical | Llama 3 (self-host) | Infrastructure | Cost-sensitive |
## Cost Reduction Checklist

- ✅ Implement caching (30-50% savings)
- ✅ Compress prompts (20-40% savings)
- ✅ Use appropriate model tier (40-60% savings)
- ✅ Batch requests (10-30% savings)
- ✅ Set max_tokens (prevent overruns)
- ✅ Monitor and alert on costs
- ✅ Use streaming with early stop
- ✅ Cache embeddings
- ✅ Implement rate limiting (see the sketch below)
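The last item deserves a sketch: a small client-side limiter that caps how many LLM calls go out per minute (illustrative; production systems usually lean on a shared store such as Redis so the cap holds across workers):

```python
import time
from collections import deque

class RateLimiter:
    """Allow at most max_calls LLM requests per window seconds."""

    def __init__(self, max_calls: int, window: float = 60.0):
        self.max_calls = max_calls
        self.window = window
        self.calls = deque()

    def wait(self):
        now = time.monotonic()
        # Drop timestamps that have fallen outside the window
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call in the window expires
            time.sleep(max(self.window - (now - self.calls[0]), 0))
            self.calls.popleft()
        self.calls.append(time.monotonic())

limiter = RateLimiter(max_calls=100, window=60)
# limiter.wait()  # call before each LLM request to cap per-minute spend
```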
## Real-World Example

Before optimization:
- 10,000 requests/day
- GPT-4o for all
- Avg 1,000 input, 500 output tokens
- Cost: $125/day = $3,750/month (at GPT-4o's original $5/$15 per 1M pricing; at the $2.50/$10 rates quoted above this works out to about $75/day)
After optimization:
- Cache hit rate: 40%
- Prompt compression: 30%
- Model tiering: 60% to mini
- Batching: 20% reduction
New cost:
- $37/day = $1,110/month
- Savings: 70% ($2,640/month)