DATASCI 185: Introduction to AI Applications

Lecture 04: Supervised, Unsupervised, and Reinforcement Learning

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 🤓

Recap of last class

  • We discussed the foundation of AI systems: (good) data!
  • Garbage in, garbage out!
  • Different AI/ML tasks require different types of labels (classification, regression, etc.)
  • Selection bias is dangerous and hard to fix after the fact
  • Labelling is harder than it looks, and clear guidelines and multiple annotators improve quality
  • Inter-annotator agreement (Cohen’s Kappa) measures labelling reliability
  • Today: How do machines actually learn from data? 🤔

You can’t out-train bad data

Source: Programmer Humor

Lecture overview

Today’s agenda

  • The three paradigms of machine learning
  • Supervised learning: Learning from labelled examples
    • Linear regression, decision trees, neural networks
  • The bias-variance trade-off
  • Unsupervised learning: Finding hidden structure
    • Clustering and dimensionality reduction
  • Reinforcement learning: Learning from interaction
    • Agents, rewards, and policies

The three paradigms of ML

Source: Data Science Dojo

Tweet of the day 😄

The three learning paradigms 🎓

How do machines learn?

Three fundamentally different approaches

  • AI algorithms learn patterns from data
  • But what kind of data and what kind of feedback?
  • Three main paradigms:
    1. Supervised learning: Learn from labelled examples (what we saw in the previous lecture)
    2. Unsupervised learning: Find structure in unlabelled data
    3. Reinforcement learning: Learn from rewards and punishments
  • The paradigm you choose depends on what data you have and what you want to achieve

Learning paradigms overview and examples

Source: Medium

Supervised learning 📚

What is supervised learning?

Learning from examples with answers

  • Supervised learning: The model learns from labelled examples
  • Training data: Input-output pairs \((x_i, y_i)\)
  • Goal: Learn a function \(f\) such that \(f(x) \approx y\)
  • Like learning with a teacher who provides correct answers
  • The “supervision” comes from known labels
  • Most common paradigm in real-world applications
  • Examples:
    • Email → Spam/Not spam
    • Image → Cat/Dog/Bird
    • Patient data → Disease risk

A hard classification problem 😂

Source: Memedroid (!)

Classification vs regression

The two main supervised tasks

You may remember them from last class:

Classification 🏷️

  • Predict a discrete category
  • Binary: Yes/No, Spam/Ham
  • Multi-class: Cat/Dog/Bird/Fish
  • Output: Class label (or probabilities)
  • Examples:
    • Fraud detection
    • Disease diagnosis
    • Sentiment analysis

Regression 📈

  • Predict a continuous value
  • Output: A number on a continuous scale
  • Examples:
    • House price prediction
    • Stock price forecasting
    • Temperature prediction
    • Age estimation from photo

The same algorithm family often handles both tasks with minor modifications!

Linear models

The simplest supervised learners

  • Linear regression: Predict y as a weighted sum of features

\[\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n\]

  • Learn weights \(w\) that minimise prediction error
  • Logistic regression: Classification via sigmoid function

\[P(y=1|x) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \ldots)}}\]

  • Simple, interpretable, fast to train
  • Works well when relationships are approximately linear
  • Often a strong baseline before trying complex models
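Both models are one-liners in scikit-learn. A minimal sketch on invented toy data (the numbers, and the "label is 1 when x > 3" rule, are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Linear regression on invented data that follows y = 2x + 1 exactly
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
reg = LinearRegression().fit(X, y)
print(reg.coef_[0], reg.intercept_)        # recovers w1 ≈ 2, w0 ≈ 1

# Logistic regression: invented labels equal 1 whenever x > 3
labels = np.array([0, 0, 0, 1, 1])
clf = LogisticRegression().fit(X, labels)
print(clf.predict_proba([[5.0]])[0, 1])    # P(y=1 | x=5), well above 0.5
```

Note how the regression recovers the weights \(w_1 = 2\) and \(w_0 = 1\) from the formula above, while the classifier outputs a probability via the sigmoid.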

Linear regression fits a line

Source: Medium

A logistic function converts a line into a curve

Source: Wikipedia

Decision trees

Learning rules from data

  • Decision trees: Learn a hierarchy of yes/no questions
  • Each node splits data based on a feature
  • Leaves contain predictions
  • Easy to interpret: “If income > £50k AND age > 30, then approve loan”
  • Can capture non-linear relationships
  • Prone to overfitting (memorising training data)
  • Solution: Random forests, which combine many trees
    • Each tree sees different data/features
    • Average their predictions for robustness
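A short sketch of both models in scikit-learn. The loan-style features and labels below are invented toy data, not a real dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Invented toy loan data: features are [income_in_k, age], label 1 = approve
X = np.array([[60, 35], [70, 45], [30, 22], [40, 28], [80, 50], [25, 19]])
y = np.array([1, 1, 0, 0, 1, 0])

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# A random forest averages many trees, each seeing different data/features
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

applicant = [[65, 40]]   # a 40-year-old earning £65k
print(tree.predict(applicant)[0], forest.predict(applicant)[0])
```

Capping `max_depth` is one simple guard against the overfitting mentioned above; the forest's averaging is another.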

Decision tree structure

Source: Medium

Neural networks for supervised learning

Learning complex patterns

  • Neural networks: Layers of interconnected nodes
  • Each layer transforms the input
  • Can learn arbitrarily complex functions
  • Deep networks (many layers) = deep learning
  • Require more data than simpler models
  • Less interpretable (“black box”)
  • State-of-the-art for:
    • Image classification (CNNs)
    • Speech recognition
    • Natural language processing
  • Remember from Lecture 02: backpropagation enables training!
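To make "each layer transforms the input" concrete, here is a tiny 2-input, 2-hidden-unit network in NumPy with hand-picked weights (in practice the weights are learned via backpropagation, as in Lecture 02). It computes XOR, the classic function no single linear model can learn:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked weights for a 2 -> 2 -> 1 network
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])
b2 = -0.5

def forward(x):
    h = relu(x @ W1 + b1)        # hidden layer: transform the input
    return sigmoid(h @ W2 + b2)  # output layer: squash to a probability

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, forward(np.array(x, dtype=float)) > 0.5)  # XOR truth table
```

The hidden layer's non-linearity (ReLU) is what lets the network bend the input space; stacking many such layers is what "deep" learning refers to.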

Neural network architecture

Source: 3Blue1Brown

The bias-variance trade-off

The fundamental tension in ML

  • Every model makes errors! Two sources:
  • Bias: Error from oversimplified assumptions
    • Underfitting: Model too simple
    • Misses real patterns in the data
  • Variance: Error from sensitivity to training data
    • Overfitting: Model too complex
    • Memorises noise, fails on new data

\[\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}\]

  • Trade-off: Reducing one often increases the other
  • Goal: Find the sweet spot where total error is minimised
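The trade-off shows up directly when fitting polynomials of different degrees to noisy samples of a sine curve (an invented toy setup):

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 15)  # noisy samples
x_test = np.linspace(0.02, 0.98, 50)
y_test = np.sin(2 * np.pi * x_test)                             # noise-free truth

def errors(degree):
    fit = Polynomial.fit(x_train, y_train, degree)
    train = np.mean((fit(x_train) - y_train) ** 2)
    test = np.mean((fit(x_test) - y_test) ** 2)
    return train, test

# Degree 1 underfits (high bias), degree 14 fits the noise (high variance);
# a middling degree sits near the sweet spot.
for degree in (1, 3, 14):
    print(degree, errors(degree))
```

Training error always falls as the model gets more flexible; it is the error on new data that reveals the sweet spot.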

Bias-variance trade-off

Source: Vizuara

Cross-validation

Robust model evaluation

  • Problem: Single train/test split can be misleading
    • Why? The estimate depends on which points happen to land in the test split
  • Cross-validation: Multiple train/test splits
  • K-fold CV: Split data into K parts
    • Train on K-1 folds, test on 1 fold
    • Repeat K times, rotate the test fold
    • Average the results
  • Benefits:
    • Every data point is tested once
    • More reliable performance estimate
    • Helps detect overfitting
  • Common choice: K = 5 or K = 10
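The K-fold procedure above is a single call in scikit-learn; a minimal sketch on the built-in Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: five train/test rotations, every point tested exactly once
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean().round(3), scores.std().round(3))
```

Reporting the standard deviation alongside the mean is exactly the extra information a single train/test split cannot give you.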

5-fold cross-validation

Source: Vizuara

Unsupervised learning 🔍

What is unsupervised learning?

Finding structure without labels

  • Unsupervised learning: Learn from data without labels
  • No “correct answers” provided
  • Goal: Discover hidden patterns or structure
  • Like learning without a teacher: explore and discover
  • Why use it?
    • Labels are expensive (experts needed)
    • Sometimes labels don’t exist (“How many customer types?”)
    • Exploration: Understand data before modelling
    • Pre-training: Learn representations, then fine-tune
  • Some questions to ask before using these models:
    • Are there natural groups in the data?
    • Can we represent data more compactly?

Unsupervised learning finds structure

Source: University of Cambridge

Clustering

Grouping similar data points

  • Clustering: Partition data into groups (clusters)
  • Points in the same cluster are similar; points in different clusters are dissimilar
  • K-means: Most popular algorithm
    1. Choose K cluster centres randomly
    2. Assign each point to nearest centre
    3. Move centres to mean of assigned points
    4. Repeat until convergence
  • Hierarchical clustering: Builds dendrograms without pre-specifying K
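The four K-means steps above fit in a few lines of NumPy. A sketch on two invented, well-separated "customer" groups:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two invented, well-separated groups of 50 points each
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

K = 2
centres = X[rng.choice(len(X), K, replace=False)]   # 1. random centres
for _ in range(10):                                 # 4. repeat until stable
    # 2. assign each point to its nearest centre
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # 3. move each centre to the mean of its assigned points
    centres = np.array([X[labels == k].mean(axis=0) for k in range(K)])

print(centres.round(1))   # converges near the true group centres
```

In practice you would use `sklearn.cluster.KMeans`, which adds smarter initialisation and multiple restarts, but the loop is the same.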

Real-world uses:

Domain     | Application
Marketing  | Customer segmentation
Biology    | Cell types, disease subtypes
Finance    | Fraud detection
Healthcare | Patient risk groups

K-means clustering

Source: Machine Learning CoBan

Dimensionality reduction

Compressing information

  • Curse of dimensionality: High-dimensional data is hard to work with
    • Distances become meaningless
    • Need exponentially more data
    • Visualisation becomes impossible beyond two or three dimensions
  • Dimensionality reduction: Find lower-dimensional representation
  • Preserve important structure, discard noise
  • Two main approaches:
    • Linear: PCA (Principal Component Analysis)
    • Non-linear: t-SNE, UMAP
  • Applications: Visualisation, preprocessing, compression
  • Super cool example (we will see it in lecture 06!): https://projector.tensorflow.org/

Reducing dimensions while preserving structure

Source: Medium

Dimensionality reduction techniques

PCA, t-SNE, and UMAP

Linear: PCA (Principal Component Analysis)

  • Find directions of maximum variance
  • Keep top K components to reduce dimensions
  • Fast, well-understood, preserves global structure
  • Limitation: Only captures linear relationships

Non-linear: t-SNE and UMAP

  • Preserve local structure: Nearby points stay nearby
  • Excellent for visualisation
  • Reveal clusters that PCA misses
  • Caution: Cluster sizes and distances can be misleading
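A short sketch of PCA's "directions of maximum variance" idea, on invented 3-D data that really lies near a 1-D line:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 points in 3-D that lie near the line (t, 2t, -t), plus a little noise
t = rng.normal(0, 1, 200)
X = np.column_stack([t, 2 * t, -t]) + rng.normal(0, 0.05, (200, 3))

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_.round(4))
# the first component captures almost all the variance
```

The `explained_variance_ratio_` output is how you decide, in practice, how many components to keep.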

Principal Component Analysis

Source: Vizuara

Reinforcement learning 🎮

What is reinforcement learning?

Learning from interaction

  • Reinforcement learning (RL): Learn by trial and error
  • An agent interacts with an environment
  • Takes actions, receives rewards (or punishments)
  • Goal: Learn a policy that maximises cumulative reward
  • No labelled examples; the agent learns from experience
  • Like training a dog (or kids!): Reward good behaviour!
  • Different from supervised: No “correct” action given
  • Different from unsupervised: There IS a goal (maximise reward)

The RL loop

Source: Wikipedia

Key concepts in RL

The vocabulary you need

Concept     | Definition                          | Example (Chess)
Agent       | The learner/decision-maker          | The chess-playing AI
Environment | What the agent interacts with       | The chess board and opponent
State       | Current situation                   | Board position
Action      | What the agent can do               | Move a piece
Reward      | Feedback signal                     | +1 win, -1 lose, 0 otherwise
Policy      | Strategy: state → action            | “In this position, move queen”
Value       | Expected future reward from a state | How good is this position?

Fun fact: Magnus Carlsen once said he “can’t beat his phone in chess”. RL really works! 😅

The exploration-exploitation trade-off

A fundamental dilemma

  • Exploration: Try new actions to discover better strategies
  • Exploitation: Use known good actions to maximise reward
  • The dilemma:
    • Too much exploration: Waste time on bad actions
    • Too much exploitation: Miss better strategies
  • Example: Restaurant choice
    • Exploit: Go to your favourite restaurant
    • Explore: Try a new restaurant (might be better!)
  • RL algorithms must balance both
  • Common approach: ε-greedy, which explores with probability ε and exploits the best-known action otherwise (the classic formalisation of this dilemma is the multi-armed bandit problem)
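A minimal ε-greedy sketch in plain Python, using the restaurant example (the "enjoyment" scores are invented and unknown to the agent):

```python
import random

random.seed(0)

# Invented true average enjoyment of each restaurant
true_rewards = {"favourite": 0.7, "new_place": 0.9}
estimates = {name: 0.0 for name in true_rewards}
counts = {name: 0 for name in true_rewards}
epsilon = 0.1                       # explore 10% of the time

for _ in range(1000):
    if random.random() < epsilon:
        choice = random.choice(list(true_rewards))     # explore: try anything
    else:
        choice = max(estimates, key=estimates.get)     # exploit: best so far
    reward = true_rewards[choice] + random.gauss(0, 0.1)  # noisy experience
    counts[choice] += 1
    # keep a running average of observed rewards per restaurant
    estimates[choice] += (reward - estimates[choice]) / counts[choice]

print(max(estimates, key=estimates.get))   # the agent discovers "new_place"
```

With ε = 0, the agent would lock onto whichever restaurant it tried first; the small exploration budget is what lets it find the better option.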

Exploration vs exploitation

Source: Lilian Weng

Q-learning

Learning action values

  • Q-learning: Learn which actions are best in each situation
  • Think of it as keeping a score card:
    • \(Q(s, a)\) = “How good is action \(a\) in state \(s\)?”
  • The agent learns by doing:
    1. Try an action, see what reward you get → “Go right… hit a wall, ouch!”
    2. Update your score → “Going right here is bad”
    3. Repeat thousands of times → “Eventually: always go left here!”

The update rule, intuitively \(\text{new score} = \text{old score} + \text{small correction}\), is:

\[Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]\]

where \(r\) is the reward received, \(s'\) the next state, \(\alpha\) the learning rate, and \(\gamma\) the discount factor.

Source: Medium
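The scorecard-and-update loop above can be sketched on an invented five-state corridor, where the goal is on the left and a wall is on the right:

```python
import random

random.seed(0)

# Tiny corridor: states 0..4. State 0 is the goal (reward +1);
# trying to go right from state 4 hits a wall (reward -1).
states, actions = range(5), ["left", "right"]
Q = {(s, a): 0.0 for s in states for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2   # learning rate, discount, exploration

def step(s, a):
    """Return (next_state, reward, episode_done)."""
    if a == "left":
        return s - 1, (1.0 if s == 1 else 0.0), s == 1
    if s == 4:
        return s, -1.0, False            # "hit a wall, ouch!"
    return s + 1, 0.0, False

for _ in range(500):                     # many short episodes
    s = random.randrange(1, 5)
    for _ in range(20):
        if random.random() < epsilon:
            a = random.choice(actions)                     # explore
        else:
            a = max(actions, key=lambda act: Q[(s, act)])  # exploit
        s2, r, done = step(s, a)
        best_next = max(Q[(s2, a2)] for a2 in actions)
        # the "small correction": move Q(s, a) towards
        # reward + discounted best future value
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
        if done:
            break

policy = {s: max(actions, key=lambda act: Q[(s, act)]) for s in range(1, 5)}
print(policy)   # the learned policy: go left in every state
```

Note that no one ever told the agent which action is "correct"; the policy emerges from thousands of reward-driven updates.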

RLHF: RL for language models

From AlphaGo to ChatGPT

  • RLHF: Reinforcement Learning from Human Feedback
  • Key to ChatGPT’s success!
  • Process:
    1. Train base LLM on text (supervised)
    2. Collect human preferences on model outputs
    3. Train a reward model on preferences
    4. Use RL to optimise LLM for reward
  • Why it works:
    • Hard to specify “good response” in a formula
    • Humans can compare responses more easily
    • RL optimises for what humans actually want
  • More here: https://openai.com/index/learning-from-human-preferences/
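Step 3 (training a reward model on preferences) can be sketched in heavily simplified form. Assume each response is already summarised as an invented feature vector, and fit a Bradley-Terry-style model so that preferred responses score higher:

```python
import numpy as np

# Invented feature vectors summarising responses: each pair is
# (features of the human-preferred response, features of the rejected one)
pairs = [(np.array([1.0, 0.2, 0.0]), np.array([0.1, 0.9, 0.5])),
         (np.array([0.9, 0.1, 0.1]), np.array([0.2, 0.8, 0.7])),
         (np.array([0.8, 0.3, 0.2]), np.array([0.3, 0.7, 0.9]))]

def score(w, x):            # the reward model: a linear score
    return float(w @ x)

w = np.zeros(3)
lr = 0.5
for _ in range(200):
    for preferred, rejected in pairs:
        # Bradley-Terry: P(preferred beats rejected) = sigmoid(score gap)
        gap = score(w, preferred) - score(w, rejected)
        p = 1.0 / (1.0 + np.exp(-gap))
        # gradient ascent on the log-likelihood of the human preferences
        w += lr * (1.0 - p) * (preferred - rejected)

# the trained reward model now ranks every preferred response higher
```

Real RLHF replaces the linear scorer with a neural network over the full response, and then uses RL (step 4) to optimise the LLM against this learned reward; the comparison-based training signal is the same.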

RLHF pipeline

Source: Simform

RLHF

RLHF from Anthropic

Source: Anthropic’s hh-rlhf dataset (check it out, it’s really cool!)

Jailbreaking RLHF

If you can make it, you can jailbreak it!

AI safety

It’s not just about making it do what we want

  • Alignment problem: Ensuring models do what we mean, not just what we say
  • Specification gaming: Models finding perverse ways to maximise rewards (e.g., reward hacking)
  • Scalable oversight: How to supervise models that are more capable than their human evaluators
  • Red teaming: Proactively finding vulnerabilities and jailbreaks
  • Safety is not an afterthought: It must be built into the learning process from the start

AI alignment

Source: NanoBanana

Comparing the three paradigms 📊

Side-by-side comparison

Choosing the right approach

           | Supervised                | Unsupervised      | Reinforcement
Data       | Labelled (X, y)           | Unlabelled (X)    | States, actions, rewards
Goal       | Predict y from X          | Find structure    | Maximise reward
Feedback   | Correct answers           | None              | Reward signal
Analogy    | Learning with a teacher   | Exploring alone   | Learning by trial and error
Evaluation | Compare to known labels   | Internal metrics  | Cumulative reward
Examples   | Spam detection, diagnosis | Customer segments | Game playing, robotics
Difficulty | Medium (if labels exist)  | Hard to evaluate  | Hard to train

Summary 📚

Main takeaways

  • Three paradigms: Supervised (labels), unsupervised (no labels), reinforcement (rewards)

  • Supervised learning is most common: Learn from labelled examples to predict

  • The bias-variance trade-off is something to always keep in mind: Simple models underfit, complex models overfit

  • Unsupervised learning discovers hidden structure: Clustering, dimensionality reduction

  • Reinforcement learning learns from interaction: Exploration vs exploitation

  • RLHF powers modern LLMs like ChatGPT: RL to align with human preferences

… and that’s all for today! 🎉