DATASCI 185: Introduction to AI Applications

Lecture 04: Supervised, Unsupervised, and Reinforcement Learning

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 🤓

Recap of last class

  • We discussed the foundation of AI systems: (good) data!
  • Garbage in, garbage out!
  • Different AI/ML tasks require different types of labels (classification, regression, etc.)
  • Selection bias is dangerous and hard to fix after the fact
  • Labelling is harder than it looks, and clear guidelines and multiple annotators improve quality
  • Inter-annotator agreement (Cohen’s Kappa) measures labelling reliability
  • Today: How do machines actually learn from data? 🤔

You can’t out-train bad data

Source: Programmer Humor

Lecture overview

Today’s agenda

  • The three paradigms of machine learning
  • Supervised learning: Learning from labelled examples
    • Linear regression, decision trees, neural networks
  • The bias-variance trade-off
  • Unsupervised learning: Finding hidden structure
    • Clustering and dimensionality reduction
  • Reinforcement learning: Learning from interaction
    • Agents, rewards, and policies

The three paradigms of ML

Source: Data Science Dojo

Tweet of the day 😄

The three learning paradigms 🎓

How do machines learn?

Three fundamentally different approaches

  • AI algorithms learn patterns from data
  • But what kind of data and what kind of feedback?
  • Three main paradigms:
    1. Supervised learning: Learn from labelled examples (what we saw in the previous lecture)
    2. Unsupervised learning: Find structure in unlabelled data
    3. Reinforcement learning: Learn from rewards and punishments
  • The paradigm you choose depends on what data you have and what you want to achieve

Learning paradigms overview and examples

Source: Medium

Supervised learning 📚

What is supervised learning?

Learning from examples with answers

  • Supervised learning: The model learns from labelled examples
  • Training data: Input-output pairs \((x_i, y_i)\)
  • Goal: Learn a function \(f\) such that \(f(x) \approx y\)
  • Like learning with a teacher who provides correct answers
  • The “supervision” comes from known labels
  • Most common paradigm in real-world applications
  • Examples:
    • Email → Spam/Not spam
    • Image → Cat/Dog/Bird
    • Patient data → Disease risk

A hard classification problem 😂

Source: Memedroid (!)

Classification vs regression

The two main supervised tasks

You may remember them from last class:

Classification 🏷️

  • Predict a discrete category
  • Binary: Yes/No, Spam/Ham
  • Multi-class: Cat/Dog/Bird/Fish
  • Output: Class label (or probabilities)
  • Examples:
    • Fraud detection
    • Disease diagnosis
    • Sentiment analysis

Regression 📈

  • Predict a continuous value
  • Output: A number on a continuous scale
  • Examples:
    • House price prediction
    • Stock price forecasting
    • Temperature prediction
    • Age estimation from photo

The same algorithm family often handles both tasks with minor modifications!

Linear models

The simplest supervised learners

  • Linear regression: Predict y as a weighted sum of features

\[\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n\]

  • Learn weights \(w\) that minimise prediction error
  • Logistic regression: Classification via sigmoid function

\[P(y=1|x) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \ldots)}}\]

  • Simple, interpretable, fast to train
  • Works well when relationships are approximately linear
  • Often a strong baseline before trying complex models
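Both models are one-liners in scikit-learn. A minimal sketch on invented toy data (the numbers, and the "label is 1 when x > 3" rule, are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Linear regression on invented data that follows y = 2x + 1 exactly
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
reg = LinearRegression().fit(X, y)
print(reg.coef_[0], reg.intercept_)        # recovers w1 ≈ 2, w0 ≈ 1

# Logistic regression: invented labels equal 1 whenever x > 3
labels = np.array([0, 0, 0, 1, 1])
clf = LogisticRegression().fit(X, labels)
print(clf.predict_proba([[5.0]])[0, 1])    # P(y=1 | x=5), well above 0.5
```

Note how the regression recovers the weights \(w_1 = 2\) and \(w_0 = 1\) from the formula above, while the classifier outputs a probability via the sigmoid.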

Linear regression fits a line

Source: Medium

A logistic function converts a line into a curve

Source: Wikipedia

Decision trees

Learning rules from data

  • Decision trees: Learn a hierarchy of yes/no questions
  • Each node splits data based on a feature
  • Leaves contain predictions
  • Easy to interpret: “If income > £50k AND age > 30, then approve loan”
  • Can capture non-linear relationships
  • Prone to overfitting (memorising training data)
  • Solution: Random forests, which combine many trees
    • Each tree sees different data/features
    • Average their predictions for robustness
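A short sketch of both models in scikit-learn. The loan-style features and labels below are invented toy data, not a real dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Invented toy loan data: features are [income_in_k, age], label 1 = approve
X = np.array([[60, 35], [70, 45], [30, 22], [40, 28], [80, 50], [25, 19]])
y = np.array([1, 1, 0, 0, 1, 0])

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# A random forest averages many trees, each seeing different data/features
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

applicant = [[65, 40]]   # a 40-year-old earning £65k
print(tree.predict(applicant)[0], forest.predict(applicant)[0])
```

Capping `max_depth` is one simple guard against the overfitting mentioned above; the forest's averaging is another.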

Decision tree structure

Source: Medium

Neural networks for supervised learning

Learning complex patterns

  • Neural networks: Layers of interconnected nodes
  • Each layer transforms the input
  • Can learn arbitrarily complex functions
  • Deep networks (many layers) = deep learning
  • Require more data than simpler models
  • Less interpretable (“black box”)
  • State-of-the-art for:
    • Image classification (CNNs)
    • Speech recognition
    • Natural language processing
  • Remember from Lecture 02: backpropagation enables training!
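To make "each layer transforms the input" concrete, here is a tiny 2-input, 2-hidden-unit network in NumPy with hand-picked weights (in practice the weights are learned via backpropagation, as in Lecture 02). It computes XOR, the classic function no single linear model can learn:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked weights for a 2 -> 2 -> 1 network
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])
b2 = -0.5

def forward(x):
    h = relu(x @ W1 + b1)        # hidden layer: transform the input
    return sigmoid(h @ W2 + b2)  # output layer: squash to a probability

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, forward(np.array(x, dtype=float)) > 0.5)  # XOR truth table
```

The hidden layer's non-linearity (ReLU) is what lets the network bend the input space; stacking many such layers is what "deep" learning refers to.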

Neural network architecture

Source: 3Blue1Brown

The bias-variance trade-off

The fundamental tension in ML

  • Every model makes errors! Two sources:
  • Bias: Error from oversimplified assumptions
    • Underfitting: Model too simple
    • Misses real patterns in the data
  • Variance: Error from sensitivity to training data
    • Overfitting: Model too complex
    • Memorises noise, fails on new data

\[\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}\]

  • Trade-off: Reducing one often increases the other
  • Goal: Find the sweet spot where total error is minimised
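The trade-off shows up directly when fitting polynomials of different degrees to noisy samples of a sine curve (an invented toy setup):

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 15)  # noisy samples
x_test = np.linspace(0.02, 0.98, 50)
y_test = np.sin(2 * np.pi * x_test)                             # noise-free truth

def errors(degree):
    fit = Polynomial.fit(x_train, y_train, degree)
    train = np.mean((fit(x_train) - y_train) ** 2)
    test = np.mean((fit(x_test) - y_test) ** 2)
    return train, test

# Degree 1 underfits (high bias), degree 14 fits the noise (high variance);
# a middling degree sits near the sweet spot.
for degree in (1, 3, 14):
    print(degree, errors(degree))
```

Training error always falls as the model gets more flexible; it is the error on new data that reveals the sweet spot.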

Bias-variance trade-off

Source: Vizuara

Cross-validation

Robust model evaluation

  • Problem: Single train/test split can be misleading
    • Why? The estimate depends on which points happen to land in the test split
  • Cross-validation: Multiple train/test splits
  • K-fold CV: Split data into K parts
    • Train on K-1 folds, test on 1 fold
    • Repeat K times, rotate the test fold
    • Average the results
  • Benefits:
    • Every data point is tested once
    • More reliable performance estimate
    • Helps detect overfitting
  • Common choice: K = 5 or K = 10
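The K-fold procedure above is a single call in scikit-learn; a minimal sketch on the built-in Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: five train/test rotations, every point tested exactly once
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean().round(3), scores.std().round(3))
```

Reporting the standard deviation alongside the mean is exactly the extra information a single train/test split cannot give you.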

5-fold cross-validation

Source: Vizuara

Unsupervised learning 🔍

What is unsupervised learning?

Finding structure without labels

  • Unsupervised learning: Learn from data without labels
  • No “correct answers” provided
  • Goal: Discover hidden patterns or structure
  • Like learning without a teacher: explore and discover
  • Why use it?
    • Labels are expensive (experts needed)
    • Sometimes labels don’t exist (“How many customer types?”)
    • Exploration: Understand data before modelling
    • Pre-training: Learn representations, then fine-tune
  • Some questions to ask before using these models:
    • Are there natural groups in the data?
    • Can we represent data more compactly?

Unsupervised learning finds structure

Source: University of Cambridge

Clustering

Grouping similar data points

  • Clustering: Partition data into groups (clusters)
  • Points in the same cluster are similar; points in different clusters are dissimilar
  • K-means: Most popular algorithm
    1. Choose K cluster centres randomly
    2. Assign each point to nearest centre
    3. Move centres to mean of assigned points
    4. Repeat until convergence
  • Hierarchical clustering: Builds dendrograms without pre-specifying K
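The four K-means steps above fit in a few lines of NumPy. A sketch on two invented, well-separated "customer" groups:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two invented, well-separated groups of 50 points each
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

K = 2
centres = X[rng.choice(len(X), K, replace=False)]   # 1. random centres
for _ in range(10):                                 # 4. repeat until stable
    # 2. assign each point to its nearest centre
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # 3. move each centre to the mean of its assigned points
    centres = np.array([X[labels == k].mean(axis=0) for k in range(K)])

print(centres.round(1))   # converges near the true group centres
```

In practice you would use `sklearn.cluster.KMeans`, which adds smarter initialisation and multiple restarts, but the loop is the same.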

Real-world uses:

Domain     | Application
Marketing  | Customer segmentation
Biology    | Cell types, disease subtypes
Finance    | Fraud detection
Healthcare | Patient risk groups

K-means clustering

Source: Machine Learning CoBan

Dimensionality reduction

Compressing information

  • Curse of dimensionality: High-dimensional data is hard to work with
    • Distances become meaningless
    • Need exponentially more data
    • Visualisation becomes impossible beyond two or three dimensions
  • Dimensionality reduction: Find lower-dimensional representation
  • Preserve important structure, discard noise
  • Two main approaches:
    • Linear: PCA (Principal Component Analysis)
    • Non-linear: t-SNE, UMAP
  • Applications: Visualisation, preprocessing, compression
  • Super cool example (we will see it in lecture 06!): https://projector.tensorflow.org/

Reducing dimensions while preserving structure

Source: Medium

Dimensionality reduction techniques

PCA, t-SNE, and UMAP

Linear: PCA (Principal Component Analysis)

  • Find directions of maximum variance
  • Keep top K components to reduce dimensions
  • Fast, well-understood, preserves global structure
  • Limitation: Only captures linear relationships

Non-linear: t-SNE and UMAP

  • Preserve local structure: Nearby points stay nearby
  • Excellent for visualisation
  • Reveal clusters that PCA misses
  • Caution: Cluster sizes and distances can be misleading
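A short sketch of PCA's "directions of maximum variance" idea, on invented 3-D data that really lies near a 1-D line:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 points in 3-D that lie near the line (t, 2t, -t), plus a little noise
t = rng.normal(0, 1, 200)
X = np.column_stack([t, 2 * t, -t]) + rng.normal(0, 0.05, (200, 3))

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_.round(4))
# the first component captures almost all the variance
```

The `explained_variance_ratio_` output is how you decide, in practice, how many components to keep.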

Principal Component Analysis

Source: Vizuara

Reinforcement learning 🎮

What is reinforcement learning?

Learning from interaction

  • Reinforcement learning (RL): Learn by trial and error
  • An agent interacts with an environment
  • Takes actions, receives rewards (or punishments)
  • Goal: Learn a policy that maximises cumulative reward
  • No labelled examples; the agent learns from experience
  • Like training a dog (or kids!): Reward good behaviour!
  • Different from supervised: No “correct” action given
  • Different from unsupervised: There IS a goal (maximise reward)

The RL loop

Source: Wikipedia

Key concepts in RL

The vocabulary you need

Concept     | Definition                          | Example (Chess)
Agent       | The learner/decision-maker          | The chess-playing AI
Environment | What the agent interacts with       | The chess board and opponent
State       | Current situation                   | Board position
Action      | What the agent can do               | Move a piece
Reward      | Feedback signal                     | +1 win, -1 lose, 0 otherwise
Policy      | Strategy: state → action            | “In this position, move queen”
Value       | Expected future reward from a state | How good is this position?

Fun fact: Magnus Carlsen once said he “can’t beat his phone in chess”. RL really works! 😅

The exploration-exploitation trade-off

A fundamental dilemma

  • Exploration: Try new actions to discover better strategies
  • Exploitation: Use known good actions to maximise reward
  • The dilemma:
    • Too much exploration: Waste time on bad actions
    • Too much exploitation: Miss better strategies
  • Example: Restaurant choice
    • Exploit: Go to your favourite restaurant
    • Explore: Try a new restaurant (might be better!)
  • RL algorithms must balance both
  • Common approach: ε-greedy, which explores with probability ε and exploits the best-known action otherwise (the classic formalisation of this dilemma is the multi-armed bandit problem)
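A minimal ε-greedy sketch in plain Python, using the restaurant example (the "enjoyment" scores are invented and unknown to the agent):

```python
import random

random.seed(0)

# Invented true average enjoyment of each restaurant
true_rewards = {"favourite": 0.7, "new_place": 0.9}
estimates = {name: 0.0 for name in true_rewards}
counts = {name: 0 for name in true_rewards}
epsilon = 0.1                       # explore 10% of the time

for _ in range(1000):
    if random.random() < epsilon:
        choice = random.choice(list(true_rewards))     # explore: try anything
    else:
        choice = max(estimates, key=estimates.get)     # exploit: best so far
    reward = true_rewards[choice] + random.gauss(0, 0.1)  # noisy experience
    counts[choice] += 1
    # keep a running average of observed rewards per restaurant
    estimates[choice] += (reward - estimates[choice]) / counts[choice]

print(max(estimates, key=estimates.get))   # the agent discovers "new_place"
```

With ε = 0, the agent would lock onto whichever restaurant it tried first; the small exploration budget is what lets it find the better option.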

Exploration vs exploitation

Source: Lilian Weng

Q-learning

Learning action values

  • Q-learning: Learn which actions are best in each situation
  • Think of it as keeping a score card:
    • \(Q(s, a)\) = “How good is action \(a\) in state \(s\)?”
  • The agent learns by doing:
    1. Try an action, see what reward you get → “Go right… hit a wall, ouch!”
    2. Update your score → “Going right here is bad”
    3. Repeat thousands of times → “Eventually: always go left here!”

The update rule, intuitively \(\text{new score} = \text{old score} + \text{small correction}\), is:

\[Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]\]

where \(r\) is the reward received, \(s'\) the next state, \(\alpha\) the learning rate, and \(\gamma\) the discount factor.

Source: Medium
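The scorecard-and-update loop above can be sketched on an invented five-state corridor, where the goal is on the left and a wall is on the right:

```python
import random

random.seed(0)

# Tiny corridor: states 0..4. State 0 is the goal (reward +1);
# trying to go right from state 4 hits a wall (reward -1).
states, actions = range(5), ["left", "right"]
Q = {(s, a): 0.0 for s in states for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2   # learning rate, discount, exploration

def step(s, a):
    """Return (next_state, reward, episode_done)."""
    if a == "left":
        return s - 1, (1.0 if s == 1 else 0.0), s == 1
    if s == 4:
        return s, -1.0, False            # "hit a wall, ouch!"
    return s + 1, 0.0, False

for _ in range(500):                     # many short episodes
    s = random.randrange(1, 5)
    for _ in range(20):
        if random.random() < epsilon:
            a = random.choice(actions)                     # explore
        else:
            a = max(actions, key=lambda act: Q[(s, act)])  # exploit
        s2, r, done = step(s, a)
        best_next = max(Q[(s2, a2)] for a2 in actions)
        # the "small correction": move Q(s, a) towards
        # reward + discounted best future value
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
        if done:
            break

policy = {s: max(actions, key=lambda act: Q[(s, act)]) for s in range(1, 5)}
print(policy)   # the learned policy: go left in every state
```

Note that no one ever told the agent which action is "correct"; the policy emerges from thousands of reward-driven updates.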

RLHF: RL for language models

From AlphaGo to ChatGPT

  • RLHF: Reinforcement Learning from Human Feedback
  • Key to ChatGPT’s success!
  • Process:
    1. Train base LLM on text (supervised)
    2. Collect human preferences on model outputs
    3. Train a reward model on preferences
    4. Use RL to optimise LLM for reward
  • Why it works:
    • Hard to specify “good response” in a formula
    • Humans can compare responses more easily
    • RL optimises for what humans actually want
  • More here: https://openai.com/index/learning-from-human-preferences/
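Step 3 (training a reward model on preferences) can be sketched in heavily simplified form. Assume each response is already summarised as an invented feature vector, and fit a Bradley-Terry-style model so that preferred responses score higher:

```python
import numpy as np

# Invented feature vectors summarising responses: each pair is
# (features of the human-preferred response, features of the rejected one)
pairs = [(np.array([1.0, 0.2, 0.0]), np.array([0.1, 0.9, 0.5])),
         (np.array([0.9, 0.1, 0.1]), np.array([0.2, 0.8, 0.7])),
         (np.array([0.8, 0.3, 0.2]), np.array([0.3, 0.7, 0.9]))]

def score(w, x):            # the reward model: a linear score
    return float(w @ x)

w = np.zeros(3)
lr = 0.5
for _ in range(200):
    for preferred, rejected in pairs:
        # Bradley-Terry: P(preferred beats rejected) = sigmoid(score gap)
        gap = score(w, preferred) - score(w, rejected)
        p = 1.0 / (1.0 + np.exp(-gap))
        # gradient ascent on the log-likelihood of the human preferences
        w += lr * (1.0 - p) * (preferred - rejected)

# the trained reward model now ranks every preferred response higher
```

Real RLHF replaces the linear scorer with a neural network over the full response, and then uses RL (step 4) to optimise the LLM against this learned reward; the comparison-based training signal is the same.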

RLHF pipeline

Source: Simform

RLHF

RLHF from Anthropic

Source: Anthropic’s hh-rlhf dataset (check it out, it’s really cool!)

Jailbreaking RLHF

If you can make it, you can jailbreak it!

AI safety

It’s not just about making it do what we want

  • Alignment problem: Ensuring models do what we mean, not just what we say
  • Specification gaming: Models finding perverse ways to maximise rewards (e.g., reward hacking)
  • Scalable oversight: How to supervise models that are more capable than their human evaluators
  • Red teaming: Proactively finding vulnerabilities and jailbreaks
  • Safety is not an afterthought: It must be built into the learning process from the start

AI alignment

Source: NanoBanana

Comparing the three paradigms 📊

Side-by-side comparison

Choosing the right approach

           | Supervised                | Unsupervised      | Reinforcement
Data       | Labelled (X, y)           | Unlabelled (X)    | States, actions, rewards
Goal       | Predict y from X          | Find structure    | Maximise reward
Feedback   | Correct answers           | None              | Reward signal
Analogy    | Learning with a teacher   | Exploring alone   | Learning by trial and error
Evaluation | Compare to known labels   | Internal metrics  | Cumulative reward
Examples   | Spam detection, diagnosis | Customer segments | Game playing, robotics
Difficulty | Medium (if labels exist)  | Hard to evaluate  | Hard to train

Summary 📚

Main takeaways

  • Three paradigms: Supervised (labels), unsupervised (no labels), reinforcement (rewards)

  • Supervised learning is most common: Learn from labelled examples to predict

  • The bias-variance trade-off is something to always keep in mind: Simple models underfit, complex models overfit

  • Unsupervised learning discovers hidden structure: Clustering, dimensionality reduction

  • Reinforcement learning learns from interaction: Exploration vs exploitation

  • RLHF powers modern LLMs like ChatGPT: RL to align with human preferences

… and that’s all for today! 🎉