Main takeaways
Traditional ML Metrics:
- Accuracy is not enough: Under class imbalance, a high score can hide total failure on the minority class
- Precision/Recall trade-off: Choose the operating point based on the relative cost of false positives vs. false negatives
- Use cross-validation and keep a truly unseen test set
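The accuracy-vs-imbalance point above can be made concrete with a minimal sketch (made-up numbers, plain Python, no ML library): a degenerate classifier that always predicts the majority class scores 95% accuracy while catching zero positives.

```python
# Hypothetical imbalanced dataset: 1000 samples, only 50 positives.
y_true = [1] * 50 + [0] * 950
y_pred = [0] * 1000  # degenerate "always negative" classifier

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

print(accuracy(y_true, y_pred))          # 0.95 — looks great
print(precision_recall(y_true, y_pred))  # (0.0, 0.0) — finds no positives
```

This is why precision and recall, not accuracy, should drive the choice of threshold when one class is rare.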
LLM Evaluation:
- Perplexity ≠ truth: Fluent text can still be wrong
- LLM-as-a-Judge and G-Eval: Scalable evaluation with rubrics
- RAG (retrieval-augmented generation): Ground answers in retrieved documents to reduce hallucinations
- Red teaming: Proactively find vulnerabilities before users do
- Benchmarks have limits: Optimizing for the score degrades it as a measure (Goodhart’s Law, “benchmaxxing”)
- Subgroup analysis: Always look beyond averages
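The subgroup-analysis point can be sketched in a few lines (hypothetical eval results, plain Python): an overall score that looks acceptable can conceal a subgroup where the model fails most of the time.

```python
from collections import defaultdict

# Hypothetical (group, correct) pairs from an eval run:
# 100 English examples, 20 German examples.
results = [("en", 1)] * 90 + [("en", 0)] * 10 + [("de", 1)] * 5 + [("de", 0)] * 15

by_group = defaultdict(list)
for group, correct in results:
    by_group[group].append(correct)

overall = sum(c for _, c in results) / len(results)
print(f"overall: {overall:.2f}")  # 0.79 — looks acceptable
for group, scores in sorted(by_group.items()):
    print(f"{group}: {sum(scores) / len(scores):.2f}")  # de: 0.25, en: 0.90
```

The average is dominated by the large subgroup; reporting per-group scores alongside the aggregate surfaces the failure.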