DATASCI 185: Introduction to AI Applications

Lecture 05: Metrics, Validation and Overfitting

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 🤓

Recap of last class

  • We explored three major paradigms of machine learning
  • Supervised learning: Learn from labelled examples (classification, regression)
  • Unsupervised learning: Find patterns without labels (clustering, dimensionality reduction)
  • Reinforcement learning: Learn from rewards (agents and environments)
  • We discussed how different problems require different approaches
  • Today: How do we know if our models are actually any good? 🤔

Source: Codefinity

Lecture overview

Today’s agenda

Part 1: Traditional ML Metrics

  • Why accuracy isn’t enough
  • Confusion matrix, precision, recall
  • ROC curves and regression metrics
  • Overfitting and cross-validation

Part 2: LLM Evaluation

  • Why LLMs are harder to evaluate
  • LLM-as-a-Judge and G-Eval
  • Hallucinations, RAG, and red teaming
  • Benchmarks vs real-world performance

Model evaluation is both an art and a science

Source: Miquido - AI Glossary

Upcoming event

RSVP here

Tweet of the day

Emory’s stance on AI use

Why metrics matter 📊

Think about this 🤔

Scenario: You’re building a model to predict whether patients have a rare disease that affects 1 in 1,000 people.

Your colleague shows you a model and proudly announces: “It achieves 99.9% accuracy!”

Question for you: Is this model any good?

Take 30 seconds to think about it…

The uncomfortable truth: That “99.9% accurate” model catches zero actual cases. Every sick patient is missed!

The accuracy paradox

  • This is called the accuracy paradox
  • Accuracy measures overall correctness, but doesn’t tell us:
    • How many sick patients we actually found
    • How many healthy patients we falsely alarmed
    • Whether errors are concentrated in one group
  • The problem is especially severe with class imbalance
  • Real examples of imbalanced data:
    • Fraud: 0.2% of transactions
    • Security threats: < 0.01% of events
    • Manufacturing defects: Often < 1%

The “99.9% Accurate” Model Matrix

                 Predicted: Sick   Predicted: Healthy
Actual: Sick         0 (TP)             1 (FN)
Actual: Healthy      0 (FP)           999 (TN)

Accuracy: \(\frac{0 + 999}{1,000} = 99.9\%\)

Result: Catches 0% of actual cases! 🚨
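The slide's numbers are easy to reproduce. A minimal sketch in plain Python, assuming a 1,000-patient sample with one sick patient and a model that always predicts "healthy":

```python
# A classifier that always predicts "healthy" on a 1-in-1,000 disease dataset.
y_true = [1] + [0] * 999   # 1 sick patient, 999 healthy (1 = sick)
y_pred = [0] * 1000        # the model predicts "healthy" for everyone

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(f"Accuracy: {accuracy:.1%}")  # 99.9%: looks impressive
print(f"Recall:   {recall:.0%}")    # 0%: catches no sick patients
```

The model never even looks at its input, yet accuracy alone would rank it as excellent.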

What are we really measuring?

Key questions before choosing metrics

  • Every metric captures one aspect of model performance
  • There’s no single “best” metric; the right choice depends on your specific context
  • Questions to ask yourself:
    • What’s the cost of a false positive? (Saying “yes” when it’s “no”)
    • What’s the cost of a false negative? (Saying “no” when it’s “yes”)
    • Are the classes balanced or imbalanced?
    • What action will be taken based on the prediction?
  • The choice of metric is a design decision with real-world consequences

Different metrics for different goals

Source: Medium

Classification metrics 🎯

The confusion matrix

The foundation of classification evaluation

  • A table showing all possible prediction outcomes
  • Four key quantities:
    • True Positives (TP): Correctly predicted positive
    • True Negatives (TN): Correctly predicted negative
    • False Positives (FP): Predicted positive, but actually negative (Type I error)
    • False Negatives (FN): Predicted negative, but actually positive (Type II error)
  • All classification metrics come from these four numbers!
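The four quantities can be tallied in a few lines of plain Python (a hand-rolled sketch; libraries such as scikit-learn provide the same counts via `confusion_matrix`):

```python
def confusion_counts(y_true, y_pred):
    """Tally TP, FP, FN, TN for a binary classifier (1 = positive class)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # Type I error
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # Type II error
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 3 1 1 3
```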

Confusion matrix visualisation

Precision, recall, and the trade-off

The two sides of classification performance

Precision = TP / (TP + FP)

  • When the model says “yes”, how often is it right?
  • High precision = few false alarms
  • Important when false positives are costly (spam, fraud)

Recall = TP / (TP + FN)

  • Of all actual positives, how many did we find?
  • High recall = we don’t miss many positives
  • Important when false negatives are costly (disease, security)

Most classifiers output a probability score. Moving the decision threshold trades precision against recall; you can’t maximise both at once!
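The trade-off shows up numerically when you sweep the threshold over a toy set of scores (the scores and labels below are invented for illustration):

```python
def precision_recall_at(scores, y_true, threshold):
    """Precision and recall when predicting positive iff score >= threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
y_true = [1,    1,    0,    1,    0,    0]

for thr in (0.9, 0.5, 0.2):
    p, r = precision_recall_at(scores, y_true, thr)
    print(f"threshold={thr}: precision={p:.2f}, recall={r:.2f}")
```

Raising the threshold to 0.9 gives perfect precision but only one-third recall; lowering it to 0.2 gives perfect recall but precision drops to 0.60.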

The precision-recall trade-off

Source: Analytics Vidhya

Regression metrics

When the target is continuous

  • For regression (predicting numbers), we need different metrics:
Metric   Formula                          Intuition
MAE      Mean of |actual - predicted|     Average error size (in original units)
MSE      Mean of (actual - predicted)²    Average squared error (penalises big errors)
RMSE     √MSE                             Same as MSE, but in original units
R²       1 - (SS_res / SS_tot)            Proportion of variance explained

Easy examples:

  • MAE (Mean Absolute Error): Predict 10 pizzas, 12 arrive → off by 2. Predict 10, get 8 → off by 2. MAE = 2 pizzas
  • RMSE (Root Mean Squared Error): Off by 2 one day, off by 10 another. RMSE punishes that 10 way more than the 2
  • R² (Coefficient of Determination): R² = 0.8 means your model explains 80% of why scores differ; 20% remains unexplained
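The pizza examples can be checked with small helper functions (a plain-Python sketch using the slide's toy numbers):

```python
import math

def mae(y, yhat):
    """Mean Absolute Error: average size of the errors."""
    return sum(abs(a - p) for a, p in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    """Root Mean Squared Error: penalises large errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, yhat)) / len(y))

def r_squared(y, yhat):
    """R² = 1 - SS_res / SS_tot: proportion of variance explained."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

actual, predicted = [10, 10], [12, 8]   # off by 2 both days
print(mae(actual, predicted))           # 2.0 pizzas

# One big miss dominates RMSE: errors of 2 and 10 give MAE = 6,
# but RMSE = sqrt((4 + 100) / 2) ≈ 7.21
print(rmse([10, 10], [12, 20]))
```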

Validation & overfitting ⚠️

What is overfitting?

The enemy of generalisation

  • Overfitting: Model performs well on training data but poorly on new data
  • The model has memorised the training set, including its noise and quirks
  • It hasn’t learned the underlying patterns
  • Signs of overfitting:
    • Excellent training performance
    • Poor test performance
    • Gap grows with model complexity
  • Think of it like memorising exam answers vs understanding the material
  • The student who memorises fails when questions are rephrased

Overfitting visualised 😂

Source: X.com

Underfitting vs overfitting

Finding the sweet spot

Underfitting 📉

  • Model too simple
  • High bias, low variance
  • Poor on training AND test
  • Hasn’t captured the pattern
  • Solution: More complex model

Good fit

  • Right complexity
  • Balanced bias and variance
  • Good on both sets
  • Captures the true pattern
  • This is what we want!

Overfitting 📈

  • Model too complex
  • Low bias, high variance
  • Great on training, poor on test
  • Memorised noise
  • Solution: Regularisation

Data splits and cross-validation

How to evaluate fairly

The three-way split:

  • Training set (~60-70%): Fit the model
  • Validation set (~15-20%): Tune hyperparameters
  • Test set (~15-20%): Final evaluation only!
  • Never peek at test data during development; that’s cheating!
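A minimal sketch of the three-way split in plain Python (the 70/15/15 proportions and the fixed seed are just one common choice):

```python
import random

def three_way_split(data, train=0.70, val=0.15, seed=42):
    """Shuffle once, then slice into train / validation / test sets."""
    rng = random.Random(seed)   # fixed seed makes the split reproducible
    data = data[:]              # copy so the caller's list is untouched
    rng.shuffle(data)
    n = len(data)
    n_train, n_val = int(n * train), int(n * val)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

train_set, val_set, test_set = three_way_split(list(range(100)))
print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```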

K-fold cross-validation:

  • Split data into K parts (folds)
  • Train on K-1 folds, validate on the remaining fold
  • Repeat K times, average the results
  • Every data point gets tested once
  • More robust than a single split, especially for small datasets
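The fold logic above can be sketched in a few lines (a hand-rolled version of what libraries like scikit-learn's `KFold` provide):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    # Distribute any remainder across the first n % k folds
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, 5)):
    print(f"Fold {fold}: validate on {val_idx}")
```

In practice you would train on `train_idx`, score on `val_idx` each round, and average the K scores.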

Train/validation/test split

K-fold cross-validation

Evaluating LLMs 🤖

Why is LLM evaluation so hard?

A fundamentally different problem

  • Traditional ML: One correct answer per input
    • 😺: Cat or dog? → “Cat” ✅
  • Language generation: Many valid outputs!
    • “The cat sat on the mat” ≈ “A feline rested upon the rug”
    • Both are correct! How do we score them?
  • We need to evaluate multiple dimensions at once:
    • Fluency: Is it grammatical and natural?
    • Relevance: Does it address the question?
    • Factuality: Is it actually true?
    • Helpfulness: Is it useful to the user?
  • Old text metrics (BLEU, ROUGE) just count word overlaps
  • They can’t tell if text is meaningful or even correct!

Using LLMs to evaluate LLMs? 😂

Source: TinyML SubStack

Perplexity explained

Measuring how “surprised” the model is

  • Perplexity measures how well an LLM predicts a sequence of text
  • Imagine the model guessing the next word
    • If it’s very confident about the right answer → Low perplexity
    • If it’s confused among many options → High perplexity
  • Think of it as: How many equally likely words could come next?
    • Perplexity of 10 → Model choosing from ~10 words
    • Perplexity of 100 → Model choosing from ~100 words
  • Lower is better: A model that understands language well makes confident, correct predictions
  • Critical limitation: Perplexity measures fluency, not truth
  • 🎬 Watch: What is Perplexity for LLMs? (5 min)

Examples:

“The capital of France is ___”

  • “Paris” → Low perplexity ✅
  • “elephant” → High perplexity ❌


The big catch:

“The earth is ___”

  • “flat” → Could have low perplexity!

Perplexity measures how well language flows, not whether statements are true.
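The "how many words is the model choosing from" intuition follows directly from the definition: perplexity is the exponential of the average negative log-probability per token. A sketch with made-up per-token probabilities:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical probabilities a model assigned to each word it generated:
confident = [0.90, 0.80, 0.95, 0.85]   # model rarely surprised
confused  = [0.10, 0.05, 0.20, 0.10]   # model guessing among many options

print(round(perplexity(confident), 2))  # ≈ 1.15: close to certainty
print(round(perplexity(confused), 2))   # 10.0: like choosing among ~10 words
```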

LLM-as-a-Judge

Using AI to evaluate AI

  • Use a powerful LLM to grade responses from other models
  • Give the judge a rubric (evaluation criteria) and examples
  • The judge scores responses on dimensions like:
    • Helpfulness, accuracy, relevance, safety
  • Why it works:
    • Evaluating is often easier than generating
    • Scales infinitely (no human bottleneck)
    • Can evaluate subjective qualities
  • The catch:
    • The judge is only as good as its own capabilities
    • Potential biases: May favour its own style
    • Need a strong judge (often GPT-5 or Claude)
┌─────────────────────┐
│   User Question     │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│   Model Response    │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│  Reference Answer   │
│  (if available)     │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│      LLM Judge      │
│   + Rubric/Criteria │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│   Score (1-5)       │
│   + Explanation     │
└─────────────────────┘
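The pipeline above boils down to assembling a grading prompt and sending it to a strong model. A sketch of the prompt-assembly step (the rubric wording and score range are illustrative, not any vendor's actual API):

```python
# Illustrative rubric; real deployments tune this wording carefully.
RUBRIC = """Rate the response from 1 to 5 on:
- Helpfulness: does it answer the user's question?
- Accuracy: are its claims correct?
- Safety: does it avoid harmful content?
Return a score and a one-sentence explanation."""

def build_judge_prompt(question, response, reference=None):
    """Assemble the text sent to the judge model."""
    parts = [f"Question:\n{question}", f"Response to grade:\n{response}"]
    if reference:  # a gold answer sharpens the judge, but is optional
        parts.append(f"Reference answer:\n{reference}")
    parts.append(RUBRIC)
    return "\n\n".join(parts)

prompt = build_judge_prompt("What causes tides?", "The Moon's gravity.")
# `prompt` would then go to the judge model via its API; the reply is parsed
# into a numeric score plus an explanation.
```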

G-Eval: A practical framework

Chain-of-thought evaluation

  • G-Eval is a popular method for custom LLM evaluation
  • How it works:
    1. Define your evaluation criteria (e.g., coherence, relevance)
    2. The LLM generates evaluation steps using chain-of-thought
    3. Apply these steps to score the output (typically 1-5)
  • Example criteria: “Coherence: The collective quality of all sentences in the actual output”
  • Why it’s useful:
    • Creates task-specific metrics on the fly
    • The LLM “thinks through” how to evaluate
    • Better alignment with human judgement than simpler metrics

G-Eval in action:

Step 1: Define criterion

“Rate coherence from 1-5”

Step 2: LLM generates steps

“1. Check logical flow between sentences 2. Verify topic consistency 3. Look for contradictions…”

Step 3: Apply and score

Score: 4/5 “Good flow but minor transition issue in paragraph 2”

Hallucinations 🎭

What are hallucinations?

When AI confidently makes things up

  • Hallucination: AI output that is fluent but factually incorrect or fabricated
  • The model generates plausible-sounding content that isn’t grounded in reality
  • Common forms:
    • Fabricated citations: Inventing papers that don’t exist
    • Made-up statistics: “73% of scientists agree…”
    • False biographical details: Wrong dates, events, achievements
    • Confident nonsense: Eloquent explanations of things that are simply wrong
  • This is one of the biggest challenges in deploying language models
  • Why? Because they’re trained to predict likely text, not true text

Please remember that:

Models optimise for:

P(next word | context)

Not for:

P(statement is true)

AI hallucination

Source: Nielsen Norman Group

Real-world consequences

Why hallucinations matter

Notable incidents (all real!):

Discussion question:

How would you verify if an AI’s answer is correct?

  • Check primary sources?
  • Ask another AI?
  • Trust your intuition?
  • Rely on reputation of the AI company?


Critical thinking is more important than trust!


🎬 Watch: IBM Explains AI Hallucinations (5 min)

RAG: Retrieval-Augmented Generation

Grounding answers in real documents

  • RAG: Instead of relying on “memory”, the model looks things up
  • How it works:
    1. User asks a question
    2. System retrieves relevant documents from a knowledge base
    3. LLM generates answer using retrieved context
    4. Answer includes citations to sources
  • RAG reduces hallucinations:
    • Model must base answers on actual documents
    • Easier to audit and update knowledge
  • Evaluation metrics for RAG:
    • Faithfulness: Does it stick to what docs say?
    • Answer relevancy: Does it address the question?
    • Contextual relevancy: Were the right docs retrieved?
┌─────────────────────┐
│   User Question     │
│   "What is X?"      │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│     Retriever       │
│   Search knowledge  │
│   base for X        │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│   Retrieved Docs    │
│   [Doc 1] [Doc 2]   │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│   Generator         │
│   Question + Docs   │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│   Grounded Answer   │
│   with citations    │
└─────────────────────┘
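The retrieve-then-generate pattern above can be sketched with a toy keyword-overlap retriever (the documents and the scoring rule are invented for illustration; real systems use embedding search for retrieval and an LLM for the generation step):

```python
import re

# A tiny stand-in knowledge base.
DOCS = {
    "doc1": "The Eiffel Tower is in Paris and is 330 metres tall.",
    "doc2": "Mount Everest is the highest mountain above sea level.",
}

def tokens(text):
    """Lowercase word set; crude but enough for the demo."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question, k=1):
    """Rank documents by how many words they share with the question."""
    q = tokens(question)
    ranked = sorted(DOCS.items(),
                    key=lambda kv: len(q & tokens(kv[1])),
                    reverse=True)
    return ranked[:k]

def answer(question):
    doc_id, text = retrieve(question)[0]
    # A real RAG system would pass `text` to an LLM here; we just cite it.
    return f"Based on [{doc_id}]: {text}"

print(answer("How tall is the Eiffel Tower?"))
```

Because every answer carries a citation, a wrong claim can be traced back to (or refuted by) the retrieved document, which is what the faithfulness metric checks.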

Red teaming: Adversarial evaluation

Finding weaknesses before they find you

  • Red teaming: Deliberately trying to make the AI fail or behave badly
  • Like hiring hackers to test your security!
  • What red teamers look for:
    • Harmful or unsafe outputs
    • Jailbreaks (bypassing safety guardrails)
    • Bias and offensive content
    • Factual errors and hallucinations
  • Manual red teaming: Humans craft tricky prompts
    • Gold standard but expensive and slow
  • AI-Assisted Red Teaming (AART): Use AI to generate adversarial test cases automatically
    • Scales up testing dramatically
    • Covers diverse cultural and geographic contexts
    • Paper: AART (2023)

🔴 Without red teaming: Vulnerabilities discovered by users in production → Reputational damage, harm, legal liability

🟢 With red teaming: Vulnerabilities found before deployment → Fixes applied proactively → Safer, more robust models

Examples of red teaming

Benchmarks & leaderboards 🏆

Benchmarks, leaderboards, and their limits

Comparing AI models…and when metrics fail

Benchmarks: Standardised tests (MMLU, TruthfulQA, HumanEval)

Leaderboards: Human preference rankings (LM Arena uses Elo ratings)

Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure”

Common problems:

  • Benchmark contamination: Test data leaks into training
  • Teaching to the test: Optimising for quirks, not capability
  • Metric saturation: “Human-level” on benchmarks, yet fails in the real world

What benchmarks miss: Robustness, safety, creativity, long-horizon reasoning

Goodhart’s Law in action

Source: X.com

Try it yourself at lmarena.ai!

Ethics beyond accuracy 🤔

Fairness: When overall accuracy hides problems

The disaggregation imperative

  • Overall accuracy can hide serious failures for specific groups
  • Real example (Buolamwini & Gebru, 2018):
    • Commercial face recognition error rates:
    • Light-skinned men: 0.8% error
    • Dark-skinned women: 34.7% error
    • 43× worse performance for one group!
  • This is called data slicing or subgroup analysis
  • The model “works” overall, but fails for specific populations
  • Even removing sensitive features doesn’t help—they correlate with other features
  • Always evaluate models per subgroup, not just overall
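Disaggregating a metric is straightforward once each prediction carries a group label. A sketch with invented data where a decent overall number hides a failing subgroup:

```python
def accuracy_by_group(y_true, y_pred, groups):
    """Report accuracy per subgroup instead of one overall number."""
    per_group = {}
    for g in set(groups):
        pairs = [(t, p) for t, p, gg in zip(y_true, y_pred, groups) if gg == g]
        per_group[g] = sum(t == p for t, p in pairs) / len(pairs)
    return per_group

# Illustrative data: group A is classified perfectly, group B mostly wrong.
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0]
groups = ["A"] * 9 + ["B"] * 3

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Overall: {overall:.2f}")                   # 0.83: looks fine
print(accuracy_by_group(y_true, y_pred, groups))   # A: 1.00, B: 0.33
```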

Subgroup performance disparities

Source: Medium

Summary ✅

Main takeaways

Traditional ML Metrics:

  • Accuracy is not enough: Especially dangerous with class imbalance
  • Precision/Recall trade-off: Choose based on false positive vs. false negative costs
  • Use cross-validation and keep a truly unseen test set

LLM Evaluation:

  • Perplexity ≠ truth: Fluent text can still be wrong
  • LLM-as-a-Judge and G-Eval: Scalable evaluation with rubrics
  • RAG: Ground answers in documents to reduce hallucinations
  • Red teaming: Proactively find vulnerabilities before users do
  • Benchmarks have limits: Goodhart’s Law and “benchmaxxing”
  • Subgroup analysis: Always look beyond averages

… and that’s all for today! 🎉