Production Systems
Lesson 14 of 14 · 18 min
Evaluation and Cost Optimization
Measure quality and optimize costs in production LLM systems
Learning goals
- Learn to evaluate LLM output quality
- Understand cost optimization strategies
- Implement monitoring and observability
Evaluation Methods
Automated Metrics
- Exact match: Response matches expected output
- BLEU/ROUGE: Text similarity scores
- Custom validators: Schema compliance, keyword presence
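The automated metrics above can be sketched in a few lines. This is a minimal illustration, not a library: `exact_match` and `keyword_validator` are hypothetical helper names, and real systems would add schema validation and proper similarity scoring (e.g. via a BLEU/ROUGE package).

```python
def exact_match(response: str, expected: str) -> bool:
    # Normalize whitespace and case before comparing
    return response.strip().lower() == expected.strip().lower()

def keyword_validator(response: str, required_keywords: list[str]) -> float:
    # Returns the fraction of required keywords present (case-insensitive)
    lowered = response.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in lowered)
    return hits / len(required_keywords)
```

Validators like these are cheap to run on every request, which makes them a good first line of quality monitoring before more expensive LLM-as-judge checks.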
LLM-as-Judge
Use a (different) LLM to evaluate responses:
Rate this response on accuracy (1-5):
Question: {question}
Expected themes: {themes}
Response: {response}

Human Evaluation
- A/B testing with users
- Expert review for high-stakes applications
- Periodic quality audits
Cost Optimization
Model Selection
- Use smaller models for simple tasks
- Mini and Nano (e.g. GPT-4.1 Mini/Nano) for light tasks: classification, formatting, short answers
- Thinking/reasoning models (e.g. o1, GPT-5.4) for code, complex analysis, and deep reasoning
- Fine-tuned small models can outperform large general models
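Model selection is often implemented as a simple router that maps task types to tiers. A minimal sketch, assuming hypothetical task labels and using model names only as illustrative examples:

```python
# Task labels and model names here are illustrative assumptions,
# not an official catalog of tasks or models.
LIGHT_TASKS = {"classification", "formatting", "short_answer"}

def pick_model(task_type: str) -> str:
    if task_type in LIGHT_TASKS:
        return "gpt-4.1-mini"  # small, cheap model for light tasks
    return "o1"                # reasoning model for code and complex analysis
```

Even a coarse two-tier router like this can cut spend substantially when the bulk of traffic is simple classification or formatting.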
Prompt Optimization
- Shorter prompts = lower costs
- Remove redundant instructions
- Compress examples
Caching
- Cache identical queries
- Semantic caching for similar queries
- TTL based on content freshness needs
Monitoring in Production
Track these metrics:
Performance
- Latency (p50, p95, p99)
- Time to first token
- Token throughput
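Latency percentiles such as p50/p95/p99 can be computed with a simple nearest-rank method. A minimal sketch (production code would more likely use `numpy.percentile` or a streaming estimator over a metrics pipeline):

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    # Nearest-rank percentile: the value at rank ceil(pct/100 * n)
    ordered = sorted(samples)
    k = max(math.ceil(pct / 100 * len(ordered)) - 1, 0)
    return ordered[k]
```

Tail percentiles (p95, p99) matter more than the average: a handful of slow requests dominates perceived responsiveness even when the mean looks healthy.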
Quality
- Error rates
- User feedback signals
- Automated quality scores
Cost
- Token usage per request
- Cost per user/feature
- Daily/weekly spend trends
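Cost per request follows directly from token counts and per-token prices. A minimal sketch; the prices passed in are placeholders, not real rates (providers typically bill input and output tokens at different per-million-token prices):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    # Prices are expressed per million tokens; output is usually priced higher
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000
```

Aggregating this per user or per feature reveals where spend concentrates and which code paths benefit most from caching or a cheaper model.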
Common mistakes
- Not measuring baseline performance before optimizing
- Optimizing for cost without tracking the impact on quality
- Using expensive models for simple tasks
- Ignoring caching opportunities for repeated queries
Key takeaways
- Combine automated metrics, LLM-as-judge, and human evaluation for comprehensive assessment
- Match model capability to task complexity—don't use a frontier or reasoning model (e.g. GPT-5.4) for simple classification
- Implement caching, prompt optimization, and batching to reduce costs
- Monitor latency, error rates, token usage, and quality scores in production