Production Systems
Lesson 14 of 14 · 18 min
Evaluation and Cost Optimization
Measure quality and optimize costs in production LLM systems
Learning goals
- Learn to evaluate LLM output quality
- Understand cost optimization strategies
- Implement monitoring and observability
Evaluation Methods
Automated Metrics
- Exact match: Response matches expected output
- BLEU/ROUGE: Text similarity scores
- Custom validators: Schema compliance, keyword presence
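The automated metrics above can be sketched in a few lines. This is a minimal illustration, not a library: `exact_match` and `keyword_validator` are hypothetical helper names, and real systems would add schema validation and proper similarity scoring (e.g. via a BLEU/ROUGE package).

```python
def exact_match(response: str, expected: str) -> bool:
    # Normalize whitespace and case before comparing
    return response.strip().lower() == expected.strip().lower()

def keyword_validator(response: str, required_keywords: list[str]) -> float:
    # Returns the fraction of required keywords present (case-insensitive)
    lowered = response.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in lowered)
    return hits / len(required_keywords)
```

Validators like these are cheap to run on every request, which makes them a good first line of quality monitoring before more expensive LLM-as-judge checks.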
LLM-as-Judge
Use a (different) LLM to evaluate responses:
Rate this response on accuracy (1-5):
Question: {question}
Expected themes: {themes}
Response: {response}

Human Evaluation
- A/B testing with users
- Expert review for high-stakes applications
- Periodic quality audits
Cost Optimization
Model Selection
- Use smaller models for simple tasks
- Mini and Nano (e.g. GPT-4.1 Mini/Nano) for light tasks: classification, formatting, short answers
- Thinking/reasoning models (e.g. o1, GPT-5.4) for code, complex analysis, and deep reasoning
- Fine-tuned small models can outperform large general models
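Model selection is often implemented as a simple router that maps task types to tiers. A minimal sketch, assuming hypothetical task labels and using model names only as illustrative examples:

```python
# Task labels and model names here are illustrative assumptions,
# not an official catalog of tasks or models.
LIGHT_TASKS = {"classification", "formatting", "short_answer"}

def pick_model(task_type: str) -> str:
    if task_type in LIGHT_TASKS:
        return "gpt-4.1-mini"  # small, cheap model for light tasks
    return "o1"                # reasoning model for code and complex analysis
```

Even a coarse two-tier router like this can cut spend substantially when the bulk of traffic is simple classification or formatting.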
Prompt Optimization
- Shorter prompts = lower costs
- Remove redundant instructions
- Compress examples
Caching
- Cache identical queries
- Semantic caching for similar queries
- TTL based on content freshness needs
Monitoring in Production
Track these metrics:
Performance
- Latency (p50, p95, p99)
- Time to first token
- Token throughput
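Latency percentiles such as p50/p95/p99 can be computed with a simple nearest-rank method. A minimal sketch (production code would more likely use `numpy.percentile` or a streaming estimator over a metrics pipeline):

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    # Nearest-rank percentile: the value at rank ceil(pct/100 * n)
    ordered = sorted(samples)
    k = max(math.ceil(pct / 100 * len(ordered)) - 1, 0)
    return ordered[k]
```

Tail percentiles (p95, p99) matter more than the average: a handful of slow requests dominates perceived responsiveness even when the mean looks healthy.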
Quality
- Error rates
- User feedback signals
- Automated quality scores
Cost
- Token usage per request
- Cost per user/feature
- Daily/weekly spend trends
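Cost per request follows directly from token counts and per-token prices. A minimal sketch; the prices passed in are placeholders, not real rates (providers typically bill input and output tokens at different per-million-token prices):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    # Prices are expressed per million tokens; output is usually priced higher
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000
```

Aggregating this per user or per feature reveals where spend concentrates and which code paths benefit most from caching or a cheaper model.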
Common mistakes
- Not measuring baseline performance before optimizing
- Optimizing for cost without tracking the impact on quality
- Using expensive models for simple tasks
- Ignoring caching opportunities for repeated queries
Key takeaways
- Combine automated metrics, LLM-as-judge, and human evaluation for comprehensive assessment
- Match model capability to task complexity—don't use a frontier or reasoning model (e.g. GPT-5.4) for simple classification
- Implement caching, prompt optimization, and batching to reduce costs
- Monitor latency, error rates, token usage, and quality scores in production