DATASCI 185: Introduction to AI Applications

Lecture 05: Metrics, Validation and Overfitting

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 🤓

Recap of last class

  • We explored three major paradigms of machine learning
  • Supervised learning: Learn from labelled examples (classification, regression)
  • Unsupervised learning: Find patterns without labels (clustering, dimensionality reduction)
  • Reinforcement learning: Learn from rewards (agents and environments)
  • We discussed how different problems require different approaches
  • Today: How do we know if our models are actually any good? 🤔

Source: Codefinity

Lecture overview

Today’s agenda

Part 1: Traditional ML Metrics

  • Why accuracy isn’t enough
  • Confusion matrix, precision, recall
  • ROC curves and regression metrics
  • Overfitting and cross-validation

Part 2: LLM Evaluation

  • Why LLMs are harder to evaluate
  • LLM-as-a-Judge and G-Eval
  • Hallucinations, RAG, and red teaming
  • Benchmarks vs real-world performance

Model evaluation is both an art and a science

Source: Miquido - AI Glossary

Upcoming event

RSVP here

Tweet of the day

Emory’s stance on AI use

Why metrics matter 📊

Think about this 🤔

Scenario: You’re building a model to predict whether patients have a rare disease that affects 1 in 1,000 people.

Your colleague shows you a model and proudly announces: “It achieves 99.9% accuracy!”

Question for you: Is this model any good?

Take 30 seconds to think about it…

The uncomfortable truth: That “99.9% accurate” model catches zero actual cases. Every sick patient is missed!

The accuracy paradox

  • This is called the accuracy paradox
  • Accuracy measures overall correctness, but doesn’t tell us:
    • How many sick patients we actually found
    • How many healthy patients we falsely alarmed
    • Whether errors are concentrated in one group
  • The problem is especially severe with class imbalance
  • Real examples of imbalanced data:
    • Fraud: 0.2% of transactions
    • Security threats: < 0.01% of events
    • Manufacturing defects: Often < 1%

The “99.9% Accurate” Model Matrix

                 Predicted: Sick   Predicted: Healthy
Actual: Sick         0 (TP)             1 (FN)
Actual: Healthy      0 (FP)           999 (TN)

Accuracy: \(\frac{0 + 999}{1,000} = 99.9\%\)

Result: Catches 0% of actual cases! 🚨
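The slide's numbers are easy to reproduce. A minimal sketch in plain Python, assuming a 1,000-patient sample with one sick patient and a model that always predicts "healthy":

```python
# A classifier that always predicts "healthy" on a 1-in-1,000 disease dataset.
y_true = [1] + [0] * 999   # 1 sick patient, 999 healthy (1 = sick)
y_pred = [0] * 1000        # the model predicts "healthy" for everyone

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(f"Accuracy: {accuracy:.1%}")  # 99.9%: looks impressive
print(f"Recall:   {recall:.0%}")    # 0%: catches no sick patients
```

The model never even looks at its input, yet accuracy alone would rank it as excellent.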

What are we really measuring?

Key questions before choosing metrics

  • Every metric captures one aspect of model performance
  • There’s no single “best” metric; the right choice depends on your specific context
  • Questions to ask yourself:
    • What’s the cost of a false positive? (Saying “yes” when it’s “no”)
    • What’s the cost of a false negative? (Saying “no” when it’s “yes”)
    • Are the classes balanced or imbalanced?
    • What action will be taken based on the prediction?
  • The choice of metric is a design decision with real-world consequences

Different metrics for different goals

Source: Medium

Classification metrics 🎯

The confusion matrix

The foundation of classification evaluation

  • A table showing all possible prediction outcomes
  • Four key quantities:
    • True Positives (TP): Correctly predicted positive
    • True Negatives (TN): Correctly predicted negative
    • False Positives (FP): Predicted positive, but actually negative (Type I error)
    • False Negatives (FN): Predicted negative, but actually positive (Type II error)
  • All classification metrics come from these four numbers!
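The four quantities can be tallied in a few lines of plain Python (a hand-rolled sketch; libraries such as scikit-learn provide the same counts via `confusion_matrix`):

```python
def confusion_counts(y_true, y_pred):
    """Tally TP, FP, FN, TN for a binary classifier (1 = positive class)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # Type I error
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # Type II error
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 3 1 1 3
```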

Confusion matrix visualisation

Precision, recall, and the trade-off

The two sides of classification performance

Precision = TP / (TP + FP)

  • When the model says “yes”, how often is it right?
  • High precision = few false alarms
  • Important when false positives are costly (spam, fraud)

Recall = TP / (TP + FN)

  • Of all actual positives, how many did we find?
  • High recall = we don’t miss many positives
  • Important when false negatives are costly (disease, security)

Most classifiers output a probability score. Moving the decision threshold trades precision against recall; you can’t maximise both at once!
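The trade-off shows up numerically when you sweep the threshold over a toy set of scores (the scores and labels below are invented for illustration):

```python
def precision_recall_at(scores, y_true, threshold):
    """Precision and recall when predicting positive iff score >= threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
y_true = [1,    1,    0,    1,    0,    0]

for thr in (0.9, 0.5, 0.2):
    p, r = precision_recall_at(scores, y_true, thr)
    print(f"threshold={thr}: precision={p:.2f}, recall={r:.2f}")
```

Raising the threshold to 0.9 gives perfect precision but only one-third recall; lowering it to 0.2 gives perfect recall but precision drops to 0.60.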

The precision-recall trade-off

Source: Analytics Vidhya

Regression metrics

When the target is continuous

  • For regression (predicting numbers), we need different metrics:
Metric   Formula                          Intuition
MAE      Mean of |actual - predicted|     Average error size (in original units)
MSE      Mean of (actual - predicted)²    Average squared error (penalises big errors)
RMSE     √MSE                             Same as MSE, but in original units
R²       1 - (SS_res / SS_tot)            Proportion of variance explained

Easy examples:

  • MAE (Mean Absolute Error): Predict 10 pizzas, 12 arrive → off by 2. Predict 10, get 8 → off by 2. MAE = 2 pizzas
  • RMSE (Root Mean Squared Error): Off by 2 one day, off by 10 another. RMSE punishes that 10 way more than the 2
  • R² (Coefficient of Determination): R² = 0.8 means your model explains 80% of why scores differ; 20% remains unexplained
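The pizza examples can be checked with small helper functions (a plain-Python sketch using the slide's toy numbers):

```python
import math

def mae(y, yhat):
    """Mean Absolute Error: average size of the errors."""
    return sum(abs(a - p) for a, p in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    """Root Mean Squared Error: penalises large errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, yhat)) / len(y))

def r_squared(y, yhat):
    """R² = 1 - SS_res / SS_tot: proportion of variance explained."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

actual, predicted = [10, 10], [12, 8]   # off by 2 both days
print(mae(actual, predicted))           # 2.0 pizzas

# One big miss dominates RMSE: errors of 2 and 10 give MAE = 6,
# but RMSE = sqrt((4 + 100) / 2) ≈ 7.21
print(rmse([10, 10], [12, 20]))
```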

Validation & overfitting ⚠️

What is overfitting?

The enemy of generalisation

  • Overfitting: Model performs well on training data but poorly on new data
  • The model has memorised the training set, including its noise and quirks
  • It hasn’t learned the underlying patterns
  • Signs of overfitting:
    • Excellent training performance
    • Poor test performance
    • Gap grows with model complexity
  • Think of it like memorising exam answers vs understanding the material
  • The student who memorises fails when questions are rephrased

Overfitting visualised 😂

Source: X.com

Underfitting vs overfitting

Finding the sweet spot

Underfitting 📉

  • Model too simple
  • High bias, low variance
  • Poor on training AND test
  • Hasn’t captured the pattern
  • Solution: More complex model

Good fit

  • Right complexity
  • Balanced bias and variance
  • Good on both sets
  • Captures the true pattern
  • This is what we want!

Overfitting 📈

  • Model too complex
  • Low bias, high variance
  • Great on training, poor on test
  • Memorised noise
  • Solution: Regularisation

Data splits and cross-validation

How to evaluate fairly

The three-way split:

  • Training set (~60-70%): Fit the model
  • Validation set (~15-20%): Tune hyperparameters
  • Test set (~15-20%): Final evaluation only!
  • Never peek at test data during development; that’s cheating!
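A minimal sketch of the three-way split in plain Python (the 70/15/15 proportions and the fixed seed are just one common choice):

```python
import random

def three_way_split(data, train=0.70, val=0.15, seed=42):
    """Shuffle once, then slice into train / validation / test sets."""
    rng = random.Random(seed)   # fixed seed makes the split reproducible
    data = data[:]              # copy so the caller's list is untouched
    rng.shuffle(data)
    n = len(data)
    n_train, n_val = int(n * train), int(n * val)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

train_set, val_set, test_set = three_way_split(list(range(100)))
print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```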

K-fold cross-validation:

  • Split data into K parts (folds)
  • Train on K-1 folds, validate on the remaining fold
  • Repeat K times, average the results
  • Every data point gets tested once
  • More robust than a single split, especially for small datasets
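The fold logic above can be sketched in a few lines (a hand-rolled version of what libraries like scikit-learn's `KFold` provide):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    # Distribute any remainder across the first n % k folds
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, 5)):
    print(f"Fold {fold}: validate on {val_idx}")
```

In practice you would train on `train_idx`, score on `val_idx` each round, and average the K scores.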

Train/validation/test split

K-fold cross-validation

Evaluating LLMs 🤖

Why is LLM evaluation so hard?

A fundamentally different problem

  • Traditional ML: One correct answer per input
    • 😺: Cat or dog? → “Cat” ✅
  • Language generation: Many valid outputs!
    • “The cat sat on the mat” ≈ “A feline rested upon the rug”
    • Both are correct! How do we score them?
  • We need to evaluate multiple dimensions at once:
    • Fluency: Is it grammatical and natural?
    • Relevance: Does it address the question?
    • Factuality: Is it actually true?
    • Helpfulness: Is it useful to the user?
  • Old text metrics (BLEU, ROUGE) just count word overlaps
  • They can’t tell if text is meaningful or even correct!

Using LLMs to evaluate LLMs? 😂

Source: TinyML SubStack

Perplexity explained

Measuring how “surprised” the model is

  • Perplexity measures how well an LLM predicts a sequence of text
  • Imagine the model guessing the next word
    • If it’s very confident about the right answer → Low perplexity
    • If it’s confused among many options → High perplexity
  • Think of it as: How many equally likely words could come next?
    • Perplexity of 10 → Model choosing from ~10 words
    • Perplexity of 100 → Model choosing from ~100 words
  • Lower is better: A model that understands language well makes confident, correct predictions
  • Critical limitation: Perplexity measures fluency, not truth
  • 🎬 Watch: What is Perplexity for LLMs? (5 min)

Examples:

“The capital of France is ___”

  • “Paris” → Low perplexity ✅
  • “elephant” → High perplexity ❌


The big catch:

“The earth is ___”

  • “flat” → Could have low perplexity!

Perplexity measures how well language flows, not whether statements are true.
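The "how many words is the model choosing from" intuition follows directly from the definition: perplexity is the exponential of the average negative log-probability per token. A sketch with made-up per-token probabilities:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical probabilities a model assigned to each word it generated:
confident = [0.90, 0.80, 0.95, 0.85]   # model rarely surprised
confused  = [0.10, 0.05, 0.20, 0.10]   # model guessing among many options

print(round(perplexity(confident), 2))  # ≈ 1.15: close to certainty
print(round(perplexity(confused), 2))   # 10.0: like choosing among ~10 words
```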

LLM-as-a-Judge

Using AI to evaluate AI

  • Use a powerful LLM to grade responses from other models
  • Give the judge a rubric (evaluation criteria) and examples
  • The judge scores responses on dimensions like:
    • Helpfulness, accuracy, relevance, safety
  • Why it works:
    • Evaluating is often easier than generating
    • Scales infinitely (no human bottleneck)
    • Can evaluate subjective qualities
  • The catch:
    • The judge is only as good as its own capabilities
    • Potential biases: May favour its own style
    • Need a strong judge (often GPT-5 or Claude)
┌─────────────────────┐
│   User Question     │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│   Model Response    │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│  Reference Answer   │
│  (if available)     │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│      LLM Judge      │
│   + Rubric/Criteria │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│   Score (1-5)       │
│   + Explanation     │
└─────────────────────┘
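The pipeline above boils down to assembling a grading prompt and sending it to a strong model. A sketch of the prompt-assembly step (the rubric wording and score range are illustrative, not any vendor's actual API):

```python
# Illustrative rubric; real deployments tune this wording carefully.
RUBRIC = """Rate the response from 1 to 5 on:
- Helpfulness: does it answer the user's question?
- Accuracy: are its claims correct?
- Safety: does it avoid harmful content?
Return a score and a one-sentence explanation."""

def build_judge_prompt(question, response, reference=None):
    """Assemble the text sent to the judge model."""
    parts = [f"Question:\n{question}", f"Response to grade:\n{response}"]
    if reference:  # a gold answer sharpens the judge, but is optional
        parts.append(f"Reference answer:\n{reference}")
    parts.append(RUBRIC)
    return "\n\n".join(parts)

prompt = build_judge_prompt("What causes tides?", "The Moon's gravity.")
# `prompt` would then go to the judge model via its API; the reply is parsed
# into a numeric score plus an explanation.
```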

G-Eval: A practical framework

Chain-of-thought evaluation

  • G-Eval is a popular method for custom LLM evaluation
  • How it works:
    1. Define your evaluation criteria (e.g., coherence, relevance)
    2. The LLM generates evaluation steps using chain-of-thought
    3. Apply these steps to score the output (typically 1-5)
  • Example criteria: “Coherence: The collective quality of all sentences in the actual output”
  • Why it’s useful:
    • Creates task-specific metrics on the fly
    • The LLM “thinks through” how to evaluate
    • Better alignment with human judgement than simpler metrics

G-Eval in action:

Step 1: Define criterion

“Rate coherence from 1-5”

Step 2: LLM generates steps

“1. Check logical flow between sentences 2. Verify topic consistency 3. Look for contradictions…”

Step 3: Apply and score

Score: 4/5 “Good flow but minor transition issue in paragraph 2”

Hallucinations 🎭

What are hallucinations?

When AI confidently makes things up

  • Hallucination: AI output that is fluent but factually incorrect or fabricated
  • The model generates plausible-sounding content that isn’t grounded in reality
  • Common forms:
    • Fabricated citations: Inventing papers that don’t exist
    • Made-up statistics: “73% of scientists agree…”
    • False biographical details: Wrong dates, events, achievements
    • Confident nonsense: Eloquent explanations of things that are simply wrong
  • This is one of the biggest challenges in deploying language models
  • Why? Because they’re trained to predict likely text, not true text

Please remember that:

Models optimise for:

P(next word | context)

Not for:

P(statement is true)

AI hallucination

Source: Nielsen Norman Group

Real-world consequences

Why hallucinations matter

Notable incidents (all real!):

Discussion question:

How would you verify if an AI’s answer is correct?

  • Check primary sources?
  • Ask another AI?
  • Trust your intuition?
  • Rely on reputation of the AI company?


Critical thinking is more important than trust!


🎬 Watch: IBM Explains AI Hallucinations (5 min)

RAG: Retrieval-Augmented Generation

Grounding answers in real documents

  • RAG: Instead of relying on “memory”, the model looks things up
  • How it works:
    1. User asks a question
    2. System retrieves relevant documents from a knowledge base
    3. LLM generates answer using retrieved context
    4. Answer includes citations to sources
  • RAG reduces hallucinations:
    • Model must base answers on actual documents
    • Easier to audit and update knowledge
  • Evaluation metrics for RAG:
    • Faithfulness: Does it stick to what docs say?
    • Answer relevancy: Does it address the question?
    • Contextual relevancy: Were the right docs retrieved?
┌─────────────────────┐
│   User Question     │
│   "What is X?"      │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│     Retriever       │
│   Search knowledge  │
│   base for X        │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│   Retrieved Docs    │
│   [Doc 1] [Doc 2]   │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│   Generator         │
│   Question + Docs   │
└──────────┬──────────┘
           ▼
┌─────────────────────┐
│   Grounded Answer   │
│   with citations    │
└─────────────────────┘
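The retrieve-then-generate pattern above can be sketched with a toy keyword-overlap retriever (the documents and the scoring rule are invented for illustration; real systems use embedding search for retrieval and an LLM for the generation step):

```python
import re

# A tiny stand-in knowledge base.
DOCS = {
    "doc1": "The Eiffel Tower is in Paris and is 330 metres tall.",
    "doc2": "Mount Everest is the highest mountain above sea level.",
}

def tokens(text):
    """Lowercase word set; crude but enough for the demo."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question, k=1):
    """Rank documents by how many words they share with the question."""
    q = tokens(question)
    ranked = sorted(DOCS.items(),
                    key=lambda kv: len(q & tokens(kv[1])),
                    reverse=True)
    return ranked[:k]

def answer(question):
    doc_id, text = retrieve(question)[0]
    # A real RAG system would pass `text` to an LLM here; we just cite it.
    return f"Based on [{doc_id}]: {text}"

print(answer("How tall is the Eiffel Tower?"))
```

Because every answer carries a citation, a wrong claim can be traced back to (or refuted by) the retrieved document, which is what the faithfulness metric checks.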

Red teaming: Adversarial evaluation

Finding weaknesses before they find you

  • Red teaming: Deliberately trying to make the AI fail or behave badly
  • Like hiring hackers to test your security!
  • What red teamers look for:
    • Harmful or unsafe outputs
    • Jailbreaks (bypassing safety guardrails)
    • Bias and offensive content
    • Factual errors and hallucinations
  • Manual red teaming: Humans craft tricky prompts
    • Gold standard but expensive and slow
  • AI-Assisted Red Teaming (AART): Use AI to generate adversarial test cases automatically
    • Scales up testing dramatically
    • Covers diverse cultural and geographic contexts
    • Paper: AART (2023)

🔴 Without red teaming: Vulnerabilities discovered by users in production → Reputational damage, harm, legal liability

🟢 With red teaming: Vulnerabilities found before deployment → Fixes applied proactively → Safer, more robust models

Examples of red teaming

Benchmarks & leaderboards 🏆

Benchmarks, leaderboards, and their limits

Comparing AI models…and when metrics fail

Benchmarks: Standardised tests (MMLU, TruthfulQA, HumanEval)

Leaderboards: Human preference rankings (LM Arena uses Elo ratings)

Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure”

Common problems:

  • Benchmark contamination: Test data leaks into training
  • Teaching to the test: Optimising for quirks, not capability
  • Metric saturation: “Human-level” on benchmarks, yet fails in the real world

What benchmarks miss: Robustness, safety, creativity, long-horizon reasoning

Goodhart’s Law in action

Source: X.com

Try it yourself at lmarena.ai!

Ethics beyond accuracy 🤔

Fairness: When overall accuracy hides problems

The disaggregation imperative

  • Overall accuracy can hide serious failures for specific groups
  • Real example (Buolamwini & Gebru, 2018):
    • Commercial face recognition error rates:
    • Light-skinned men: 0.8% error
    • Dark-skinned women: 34.7% error
    • 43× worse performance for one group!
  • This is called data slicing or subgroup analysis
  • The model “works” overall, but fails for specific populations
  • Even removing sensitive features doesn’t help—they correlate with other features
  • Always evaluate models per subgroup, not just overall
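Disaggregating a metric is straightforward once each prediction carries a group label. A sketch with invented data where a decent overall number hides a failing subgroup:

```python
def accuracy_by_group(y_true, y_pred, groups):
    """Report accuracy per subgroup instead of one overall number."""
    per_group = {}
    for g in set(groups):
        pairs = [(t, p) for t, p, gg in zip(y_true, y_pred, groups) if gg == g]
        per_group[g] = sum(t == p for t, p in pairs) / len(pairs)
    return per_group

# Illustrative data: group A is classified perfectly, group B mostly wrong.
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0]
groups = ["A"] * 9 + ["B"] * 3

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Overall: {overall:.2f}")                   # 0.83: looks fine
print(accuracy_by_group(y_true, y_pred, groups))   # A: 1.00, B: 0.33
```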

Subgroup performance disparities

Source: Medium

Summary ✅

Main takeaways

Traditional ML Metrics:

  • Accuracy is not enough: Especially dangerous with class imbalance
  • Precision/Recall trade-off: Choose based on false positive vs. false negative costs
  • Use cross-validation and keep a truly unseen test set

LLM Evaluation:

  • Perplexity ≠ truth: Fluent text can still be wrong
  • LLM-as-a-Judge and G-Eval: Scalable evaluation with rubrics
  • RAG: Ground answers in documents to reduce hallucinations
  • Red teaming: Proactively find vulnerabilities before users do
  • Benchmarks have limits: Goodhart’s Law and “benchmaxxing”
  • Subgroup analysis: Always look beyond averages

… and that’s all for today! 🎉