Ethics ✕ Data Science

class: center, middle, inverse, title-slide

.title[
# Ethics ✕ Data Science
]
.subtitle[
## Machine learning 101
]
.author[
### Simon Munzert and Johannes Himmelreich
]

---

# Table of contents

1. [Machine learning, deep learning, AI](#definitions)

2. [Basic concepts in machine learning](#mlconcepts)

3. [Overview of ML landscape](#landscape)

4. [Performance metrics](#metrics)

5. [AI for public policy](#aipp)

---
class: inverse, center, middle
name: definitions

# Machine learning, deep learning, AI
<html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html>

---
# What is AI?

.pull-left[
## Artificial intelligence

> "Artificial intelligence (AI) is intelligence - perceiving, synthesizing, and inferring information - demonstrated by machines, as opposed to intelligence displayed by non-human animals and humans. Example tasks in which this is done include speech recognition, computer vision, translation between (natural) languages, as well as other mappings of inputs."

<div align="right">Wikipedia, <i>Artificial intelligence</i></div>

> "The effort to automate intellectual tasks normally performed by humans."

<div align="right">Chollet and Allaire, 2018, <i>Deep Learning with R</i></div>
]

.pull-right-center[
<div align="center">
<img src="../pics/ai-picture.svg" height=270>
</div>

`Source` [Wikipedia](https://en.wikipedia.org/wiki/Artificial_intelligence)

<div align="center">
<img src="../pics/deep-blue.jpeg" height=200>
<img src="../pics/kit-knight-rider.jpeg" height=200>
</div>
]

---
# What is AI?

.pull-left[
## Machine learning

> "Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn' (...) It is seen as a part of artificial intelligence."

<div align="right">Wikipedia, <i>Machine learning</i></div>

> "Machine learning is a specific subfield of AI that aims at automatically developing programs (called models) purely from exposure to training data. This process of turning data into a program is called learning."

<div align="right">Chollet and Allaire, 2018, <i>Deep Learning with R</i></div>
]

.pull-right-center[
<div align="center">
<img src="../pics/ai-picture.svg" height=270>
</div>

`Source` [Wikipedia](https://en.wikipedia.org/wiki/Artificial_intelligence)

<div align="center">
<img src="../pics/facebook-feed.png" height=200>
</div>
]

---
# What is AI?

.pull-left[
## Data mining

> "Application of machine learning methods to large databases is called data mining. The analogy is that a large volume of earth and raw material is extracted from a mine, which when processed leads to a small amount of very precious material; similarly, in data mining, a large volume of data is processed to construct a simple model with valuable use, for example, having high predictive accuracy."

<div align="right">Alpaydin, 2014, <i>Introduction to Machine Learning</i></div>
]

.pull-right-center[
<div align="center">
<img src="../pics/ai-picture.svg" height=270>
</div>

`Source` [Wikipedia](https://en.wikipedia.org/wiki/Artificial_intelligence)

<div align="center">
<img src="../pics/netflix-recommendations.png" height=200>
</div>
]

---
# What is AI?

.pull-left[
## Deep learning

> "Deep learning is the subset of machine learning methods based on neural networks with representation learning. The adjective 'deep' refers to the use of multiple layers in the network."

<div align="right">Wikipedia, <i>Deep learning</i></div>
]

.pull-right-center[
<div align="center">
<img src="../pics/ai-picture.svg" height=270>
</div>

`Source` [Wikipedia](https://en.wikipedia.org/wiki/Artificial_intelligence)

<div align="center">
<img src="../pics/chatgpt.png" height=200>
</div>
]

---
# Corporate investment in AI

---
class: inverse, center, middle
name: mlconcepts

# Basic concepts in machine learning
<html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html>

---
# Regression vs. classification

.pull-left[
## Regression
- Predicts a continuous outcome
- Examples: predicting house prices, GDP growth, or temperature

## Classification
- Predicts a categorical outcome
- Examples: predicting whether a person will default on a loan, whether an email is spam, or whether a patient has a disease

]

.pull-right[
## Classification problems in the wild

Classification problems occur often, perhaps even more so than regression problems, e.g.:

1. A woman arrives at the emergency room with a set of symptoms. Which condition does she have?
2. An online banking service must be able to determine whether or not a transaction is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.
3. On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.

Decision-making problems often are classification problems!
]
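
The difference between the two tasks shows up in the type of the prediction. The sketch below is a minimal illustration on invented toy numbers: a least-squares line returns a number (regression), while a rule returns a category (classification). The data, the function names, and the 0.4 threshold are all assumptions made for this example.

```python
# Toy data (made up for illustration): house size in m^2 and price in 1000 EUR.
sizes = [50, 70, 90, 110]
prices = [150, 210, 270, 330]

def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    variance = sum((xi - mean_x) ** 2 for xi in x)
    slope = covariance / variance
    return mean_y - slope * mean_x, slope

intercept, slope = fit_line(sizes, prices)

def predict_price(size):
    """Regression: the prediction is a number on a continuous scale."""
    return intercept + slope * size

def predict_default(debt_to_income, threshold=0.4):
    """Classification: the prediction is one of a fixed set of categories.
    The threshold is a hypothetical rule, not one estimated from data."""
    return "default" if debt_to_income > threshold else "no default"

print(predict_price(100))     # a continuous value
print(predict_default(0.55))  # a category label
```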

---
# Supervised and unsupervised learning

.pull-left-wide2[
## Supervised learning
- The algorithm learns from labeled data, i.e., data with known outcomes
- The algorithm is trained on a training dataset and evaluated on a test dataset
- The goal is to predict unobserved outcomes

## Unsupervised learning
- The algorithm learns from unlabeled data
- There are inputs but no supervising output; we can still learn about relationships and structure from such data

## Analogies
- Supervised: Child in school learns math (with teacher’s input)
- Unsupervised: Child at home plays with toys (without teacher’s input)
]

.pull-right-small2[
<div align="center"><br><br>
<img src="../pics/supervised-unsupervised-tasks.png" height=300>
</div>
]
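
A rough sketch of the contrast, using invented one-dimensional data: the supervised part learns from labeled examples and is scored on held-out test data, while the unsupervised part only sees inputs and looks for structure. The 1-nearest-neighbor rule and the tiny two-cluster k-means are stand-ins chosen for brevity, not methods prescribed by these slides.

```python
# Supervised: labeled examples (feature, label). All values are invented.
train = [(1.0, "low"), (1.5, "low"), (8.0, "high"), (9.0, "high")]
test = [(2.0, "low"), (7.5, "high")]

def predict_1nn(x, labeled):
    """Return the label of the training point closest to x."""
    return min(labeled, key=lambda pair: abs(pair[0] - x))[1]

# Evaluate on test data that was not used for training.
accuracy = sum(predict_1nn(x, train) == y for x, y in test) / len(test)

# Unsupervised: the same kind of inputs, but without any labels.
unlabeled = [1.0, 1.5, 2.0, 7.5, 8.0, 9.0]

def two_means(points, iterations=10):
    """Tiny 1-D k-means with k=2: groups points around two centers."""
    centers = [min(points), max(points)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            groups[nearest].append(p)
        # Keep the old center if a group happens to be empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

centers, groups = two_means(unlabeled)
print(accuracy)  # share of correct predictions on the test data
print(groups)    # two groups discovered without labels
```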

---
# Training, validation, and test datasets
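
In common practice, the training set is used to fit the model, the validation set to compare or tune models, and the test set for one final evaluation. A minimal sketch of such a split follows; the function name, the 60/20/20 shares, and the seed are assumptions for illustration.

```python
import random

def three_way_split(rows, val_share=0.2, test_share=0.2, seed=42):
    """Shuffle rows and cut them into training, validation, and test parts."""
    shuffled = rows[:]                     # copy so the input stays untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_share)
    n_val = int(len(shuffled) * val_share)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```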

---
# Overfitting

---
# Overfitting in classification

**Explained:** The green line represents an overfitted model and the black line represents a regularized model. While the green line follows the training data most closely, it depends too heavily on that data and is likely to have a higher error rate on new, unseen data than the black line.
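
The same pattern can be reproduced in a few lines. The sketch below does not recreate the figure; it uses k-nearest neighbors on invented data as a stand-in, where `k = 1` plays the role of the flexible green line and `k = 3` the smoother black line. Two training labels are flipped on purpose to act as noise.

```python
from collections import Counter

# Invented 1-D data. True rule: x < 5 is "A", otherwise "B".
# The labels at x = 3.0 and x = 7.0 are deliberately wrong (noise).
train = [(1.0, "A"), (2.1, "A"), (3.0, "B"), (3.8, "A"),
         (6.1, "B"), (7.0, "A"), (8.2, "B"), (9.0, "B")]
test = [(1.5, "A"), (2.8, "A"), (3.3, "A"),
        (6.8, "B"), (7.3, "B"), (8.8, "B")]

def predict_knn(x, labeled, k):
    """Majority label among the k training points closest to x."""
    nearest = sorted(labeled, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(data, labeled, k):
    """Share of points in data whose predicted label matches the given one."""
    return sum(predict_knn(x, labeled, k) == y for x, y in data) / len(data)

for k in (1, 3):
    print(k, accuracy(train, train, k), accuracy(test, train, k))
```

With `k = 1` the model reproduces every training label, including the two noisy ones, and gets only 2 of 6 test points right. With `k = 3` it misses the two noisy training points (6 of 8 correct) but classifies all 6 test points correctly.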

---
# Overfitting, ultimately explained

---
class: inverse, center, middle
name: landscape

# Overview of ML landscape
<html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html>

---
# The ML landscape ([Microsoft.com](https://learn.microsoft.com/en-us/azure/machine-learning/algorithm-cheat-sheet?view=azureml-api-1))

---
# ML decision tree

`Source` [Sundararajan et al. 2021](https://www.um.edu.mt/library/oar/bitstream/123456789/107610/1/A%20contemporary%20review%20on%20drought%20modeling%20using%20machine%20learning%20approaches%202021.pdf)

---
# Affiliation of AI researchers

---
class: inverse, center, middle
name: metrics

# Performance metrics
<html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html>

---
# AI performance in knowledge tests

---
# AI capabilities vs. human performance

---
# ML performance benchmarking in the wild

`Source` Oueslati, 2024, Watching the Watchers: A Comparative Audit of Cloud‑Based Commercial Content Moderation Services.

---
# ML performance benchmarking in the wild

`Source` Wiik, 2024, GPT-4o vs. GPT-4 vs. Gemini 1.5 — Performance Analysis (Accuracy).