Production Systems
Lesson 14 of 14 · 18 min
Evaluation and Cost Optimization
Measure quality and optimize costs in production LLM systems
Learning goals
- Learn to evaluate LLM output quality
- Understand cost optimization strategies
- Implement monitoring and observability
Evaluation Methods
Automated Metrics
- **Exact match**: Response matches the expected output
- **BLEU/ROUGE**: Text similarity scores against references
- **Custom validators**: Schema compliance, keyword presence
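A minimal sketch of such validators is shown below; the function names and the JSON-field check are illustrative assumptions, not part of any specific library.

```python
# Sketch of automated checks: exact match, keyword presence, and JSON validation.
import json

def exact_match(response: str, expected: str) -> bool:
    """Strict string comparison after trimming whitespace."""
    return response.strip() == expected.strip()

def keyword_coverage(response: str, keywords: list[str]) -> float:
    """Fraction of required keywords present (case-insensitive)."""
    text = response.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords) if keywords else 1.0

def valid_json_with_fields(response: str, required_fields: list[str]) -> bool:
    """Check the response parses as JSON and contains the required fields."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return all(field in data for field in required_fields)
```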
LLM-as-Judge

Use a (different) LLM to evaluate responses:

```
Rate this response on accuracy (1-5):
Question: {question}
Expected themes: {themes}
Response: {response}
```
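A minimal judge sketch, assuming the OpenAI Python SDK; the judge model name and the integer-parsing step are assumptions for illustration.

```python
# Sketch of LLM-as-judge scoring with the OpenAI Python SDK.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """Rate this response on accuracy (1-5):
Question: {question}
Expected themes: {themes}
Response: {response}

Reply with a single integer from 1 to 5."""

def judge_accuracy(question: str, themes: str, response: str) -> int | None:
    """Ask a separate model to score a response; return the parsed score."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; use one different from the model under test
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, themes=themes, response=response)}],
        temperature=0,
    )
    match = re.search(r"[1-5]", completion.choices[0].message.content)
    return int(match.group()) if match else None
```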
Human Evaluation
- A/B testing with users
- Expert review for high-stakes applications
- Periodic quality audits
Cost Optimization
Model Selection
- Use smaller models for simple tasks
- GPT-3.5 for classification, GPT-4 for complex reasoning
- Fine-tuned small models can outperform large general models
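One simple way to apply this is a routing table keyed by task type; the model names and task categories below are illustrative assumptions.

```python
# Sketch of routing requests to cheaper or stronger models by task type.
MODEL_BY_TASK = {
    "classification": "gpt-3.5-turbo",  # simple, high-volume tasks
    "extraction": "gpt-3.5-turbo",
    "reasoning": "gpt-4",               # complex, multi-step tasks
    "code_review": "gpt-4",
}

def pick_model(task_type: str) -> str:
    """Fall back to the cheaper model when the task type is unknown."""
    return MODEL_BY_TASK.get(task_type, "gpt-3.5-turbo")
```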
Prompt Optimization
- Shorter prompts = lower costs
- Remove redundant instructions
- Compress examples
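To quantify the savings, count tokens before and after trimming a prompt, for example with tiktoken; the example prompts, request volume, and per-token price below are assumptions.

```python
# Sketch of estimating token savings from a trimmed prompt, using tiktoken.
import tiktoken

def token_count(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Count tokens the way the target model's tokenizer would."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

verbose_prompt = (
    "You are a helpful, friendly, knowledgeable assistant. Please make sure "
    "to always answer concisely and accurately. Classify the sentiment of "
    "the following review as positive, negative, or neutral."
)
trimmed_prompt = "Classify the review's sentiment: positive, negative, or neutral."

saved_per_request = token_count(verbose_prompt) - token_count(trimmed_prompt)
monthly_requests = 1_000_000
price_per_1k_input_tokens = 0.0005  # assumed rate; check your model's actual pricing

monthly_savings = saved_per_request * monthly_requests * price_per_1k_input_tokens / 1000
print(f"Tokens saved per request: {saved_per_request}")
print(f"Approximate monthly savings: ${monthly_savings:.2f}")
```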
Caching
- Cache identical queries
- Semantic caching for similar queries
- TTL based on content freshness needs
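A minimal exact-match cache with a TTL might look like the sketch below; the hashing scheme and default TTL are assumptions, and a semantic cache would replace the key lookup with an embedding-similarity search.

```python
# Sketch of an in-memory exact-match response cache with a TTL.
import hashlib
import time

class ResponseCache:
    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    def _key(self, model: str, prompt: str) -> str:
        """Hash model + prompt so identical queries share a cache entry."""
        return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

    def get(self, model: str, prompt: str) -> str | None:
        entry = self._store.get(self._key(model, prompt))
        if entry is None:
            return None
        stored_at, response = entry
        if time.time() - stored_at > self.ttl:  # expired: treat as a miss
            return None
        return response

    def set(self, model: str, prompt: str, response: str) -> None:
        self._store[self._key(model, prompt)] = (time.time(), response)
```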
Monitoring in Production
Track these metrics:
Performance
- Latency (p50, p95, p99)
- Time to first token
- Token throughput

Quality
- Error rates
- User feedback signals
- Automated quality scores

Cost
- Token usage per request
- Cost per user/feature
- Daily/weekly spend trends
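A simple in-process metrics recorder, sketched below, can cover latency, error rates, and token usage; the field names and percentile computation are illustrative rather than a specific monitoring stack.

```python
# Sketch of per-request metric logging with percentile reporting.
import statistics

class RequestMetrics:
    def __init__(self):
        self.latencies_ms: list[float] = []
        self.prompt_tokens = 0
        self.completion_tokens = 0
        self.errors = 0

    def record(self, latency_ms: float, prompt_tokens: int,
               completion_tokens: int, error: bool = False) -> None:
        """Call once per LLM request."""
        self.latencies_ms.append(latency_ms)
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens
        self.errors += int(error)

    def report(self) -> dict:
        """Summarize latency percentiles, token usage, and error rate."""
        # quantiles() needs at least two recorded requests
        q = statistics.quantiles(self.latencies_ms, n=100)
        return {
            "latency_p50_ms": q[49],
            "latency_p95_ms": q[94],
            "latency_p99_ms": q[98],
            "total_tokens": self.prompt_tokens + self.completion_tokens,
            "error_rate": self.errors / len(self.latencies_ms),
        }
```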
Common mistakes
- Not measuring baseline performance before optimizing
- Optimizing for cost without tracking quality impact
- Using expensive models for simple tasks
- Ignoring caching opportunities for repeated queries
Key takeaways
- Combine automated metrics, LLM-as-judge, and human evaluation for comprehensive assessment
- Match model capability to task complexity; don't use GPT-4 for simple classification
- Implement caching, prompt optimization, and batching to reduce costs
- Monitor latency, error rates, token usage, and quality scores in production