DATASCI 185: Introduction to AI Applications

Lecture 03: Dataset Design, Labels, and Tasks

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 🤓

Recap of last class

  • We traced AI from Ancient Greece to modern transformers
  • Symbolic AI (1950s-1980s): Hand-coded rules and logic
  • AI Winters: When hype exceeded reality (twice!)
  • Neural networks: From perceptrons to deep learning
  • Transformers (2017): Attention is all you need!
  • Data became the new oil… but where do the data come from? 🤔

A short history of AI

Source: Clay.Global

Lecture overview

Today’s agenda

  • Today we’ll explore the foundation of all AI systems: the data they’re trained on
  • Why data quality matters
  • Types of AI tasks
  • Sourcing and collecting data
  • Selection bias and its dangers
  • Best practices for labelling data
  • Working with multiple annotators
  • Quality control in datasets

The Modern Data Pipeline

Source: Medium

Tweet of the day

What matters more?

Take a moment to think…

“If you had to choose, which would give you better AI predictions?”

Option A: 🧠

A state-of-the-art algorithm trained on low-quality data

Option B: 📊

A simple algorithm trained on high-quality data

Discuss with your neighbour for 1 minute!

Raise your hand if you think Option A is better

And now raise your hand if you think Option B is better

A cow, a camel or a polar bear?

Source: Medium (slightly exaggerated!)

Why data quality matters 📊

Garbage in, garbage out

The fundamental principle

  • AI models are only as good as the data they’re trained on
  • No amount of algorithmic sophistication can fix bad data
  • Common data problems:
    • Missing important information (e.g., age, income, etc.)
    • Conflicting labels or formats (e.g., “yes” vs “1”)
    • Biases, not representative of reality (e.g., gender, race, etc.)
  • The quality of your models is directly tied to the quality of your data
  • Data-centric AI (Andrew Ng, 2021): focus on improving the data, not just the model

Garbage In, Garbage Out

Source: xkcd

Real-world impact of data quality

When bad data causes harm

| Case | What went wrong | Impact |
|---|---|---|
| Credit scoring | Data reflected historical gender-based income gaps | Lower credit limits for women with similar profiles |
| Facial recognition | Training data mostly light-skinned faces | Error rates 34x higher for dark-skinned women |
| Criminal risk assessment | Historical arrest data reflected biased policing | Perpetuated racial disparities in sentencing |
| Hiring AI | Trained on historical (male-dominated) hiring decisions | Penalised female candidates |

These weren’t algorithm failures—they were data failures.

Types of machine learning tasks 🎯

Classification and regression tasks

The two main types of AI tasks

Classification: Predicting categories

  • Goal: Assign inputs to discrete categories
  • Examples: Email → Spam/Not Spam; Image → Cat/Dog/Bird; Text → Positive/Negative/Neutral
  • Binary classification: Two classes
  • Multi-class: Many classes (one per input)
  • Multi-label: Multiple labels per input
  • Key challenge: How do you define the classes?
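As a minimal sketch of binary classification, here is a toy rule-based spam detector that maps each input to one of two discrete classes. The cue list and threshold are illustrative assumptions, not a trained model:

```python
import re

# Toy binary classifier: label an email SPAM if it contains enough known
# spam cue words. Cues and threshold are made up for illustration.
SPAM_CUES = {"winner", "free", "click", "gift", "congratulations"}

def classify_email(text: str) -> str:
    """Map an input to one of two discrete classes: SPAM or NOT SPAM."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits = len(words & SPAM_CUES)
    return "SPAM" if hits >= 2 else "NOT SPAM"

print(classify_email("Congratulations! Click here for a free gift"))  # SPAM
print(classify_email("Meeting moved to 3pm tomorrow"))  # NOT SPAM
```

A real classifier would learn these rules from labelled examples, but the output shape is the same: a discrete category per input.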

Classification

Regression: Predicting continuous values

  • Goal: Predict a continuous numerical value
  • Examples: Features → House price; Patient data → Hospital stay length; Weather data → Temperature
  • The choice of target variable is a design decision
  • Different targets can lead to very different models
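To contrast with classification, here is a least-squares fit on one feature predicting a continuous target. The house-price numbers are made up so the relationship is exact:

```python
# Simple linear regression by hand: predict a continuous target (price)
# from one feature (floor area). Data are synthetic for illustration.
areas = [50, 70, 90, 110, 130]     # m^2
prices = [150, 190, 230, 270, 310]  # thousands

n = len(areas)
mean_x = sum(areas) / n
mean_y = sum(prices) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(areas, prices)) / \
        sum((x - mean_x) ** 2 for x in areas)
intercept = mean_y - slope * mean_x

def predict(area):
    return intercept + slope * area

print(round(predict(100)))  # 250 (thousands): a number, not a category
```

The output is any point on a continuous scale, which is what distinguishes regression from classification.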

Regression

Other common AI tasks

Beyond classification and regression

| Task | Description | Example |
|---|---|---|
| Object Detection | Find and locate objects in images | Self-driving cars detecting pedestrians |
| Segmentation | Label every pixel in an image | Medical imaging (tumour boundaries) |
| Sequence Labelling | Label each element in a sequence | Named entity recognition in text |
| Generation | Create new content | ChatGPT, DALL-E, music generation |
| Ranking | Order items by relevance | Search results, recommendations |

Each task requires different types of labels in your dataset!

Defining your classes

Classification pitfalls

  • Classes should be:
    • Mutually exclusive: No overlap between categories
    • Collectively exhaustive: Cover all possible cases
    • Consistently definable: Clear boundaries
  • Common problems:
    • Ambiguous categories: What counts as “spam”?
    • Subjective judgments: Is this review “positive”?
    • Cultural differences: What’s “offensive” varies
    • Evolving definitions: Categories change over time
  • Solution: Clear annotation guidelines

A duck or a rabbit?

Source: The Independent

The proxy problem

When labels don’t measure what you care about

  • Often we can’t directly measure what we want to predict
  • We use proxy variables instead… but proxies can be misleading!
  • Example 01: Predicting “health needs”
    • What we want: How sick is this patient?
    • Available data: Healthcare spending
    • Spending ≠ health needs: Black patients had less spent on them due to systemic barriers
    • Algorithm learned: “Black patients need less care” → Wrong!
  • Example 02: Predicting expertise on social media
    • What we want: Who is an expert in a field?
    • Available data: Number of followers
    • Followers ≠ Expertise: Popularity can be gamed or unrelated to domain knowledge
  • Always ask: Does this label actually measure what I care about?

Model-centric vs. Data-centric AI

Source: Andrej Karpathy

Sourcing and collecting data 📥

Where do the data come from?

Common data sources

| Source | Examples | Considerations |
|---|---|---|
| Existing databases | Hospital records, transaction logs | May have missing fields, inconsistent formats |
| Public datasets | ImageNet, Wikipedia, Common Crawl | May not match your specific domain |
| Web scraping | Social media, news articles | Legal and ethical concerns |
| Surveys/Forms | User feedback, questionnaires | Response bias, sampling issues |
| Sensors/IoT | Cameras, wearables, satellites | Privacy concerns, data volume |
| Synthetic data | Generated by other models | May not capture real-world complexity |

How much data do you need?

The million-dollar question

  • There’s no universal answer… it depends on:
    • Task complexity: harder tasks need more data
    • Number of classes/features: more categories = more examples needed
    • Model architecture: deep learning is data-hungry
    • Required accuracy: higher stakes = more data
  • Rule of thumb: Start small, then scale up
  • Scaling laws: Performance improves predictably with more data
    • You can estimate how much data you’ll need by training on subsets and extrapolating
  • For most real-world problems, more high-quality data beats more sophisticated algorithms
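The "train on subsets and extrapolate" idea can be sketched with a toy statistical version: estimate a quantity from samples of growing size and watch the error shrink, roughly like 1/√n. The distribution and sample sizes below are made up for illustration:

```python
import random

# Illustrative scaling behaviour: estimation error shrinks predictably
# as the sample grows, which is why subset experiments let you
# extrapolate how much data a task will need.
random.seed(42)
TRUE_MEAN = 5.0
population = [random.gauss(TRUE_MEAN, 2.0) for _ in range(100_000)]

errors = {}
for n in [100, 1_000, 10_000, 100_000]:
    estimate = sum(population[:n]) / n
    errors[n] = abs(estimate - TRUE_MEAN)
    print(n, round(errors[n], 4))
```

Real scaling-law experiments do the same thing with model loss instead of a sample mean, but the logic of extrapolating from subsets is identical.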

Performance vs. dataset size

Source: Our World in Data

Selection bias ⚠️

What is selection bias?

When your data doesn’t represent reality

  • Selection bias is the systematic deviation between your training data and the real-world distribution
  • Also called sampling bias; closely related to distribution shift
  • The model learns patterns that won’t generalise
  • Extremely hard to fix after the fact
  • Prevention is better than cure!
  • Always ask: “In what ways might my data not be representative?”
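A tiny demonstration of why this matters: a model fit on a biased convenience sample can look accurate in-sample yet fail on the true distribution. The class proportions below are synthetic:

```python
# A convenience sample that over-represents one class makes a naive
# "model" look far better than it really is.
biased_train = ["cat"] * 90 + ["dog"] * 10   # 90% cats (biased sample)
real_world = ["cat"] * 50 + ["dog"] * 50     # true 50/50 split

# "Model": always predict the majority class seen in training
majority = max(set(biased_train), key=biased_train.count)

acc_train = sum(lbl == majority for lbl in biased_train) / len(biased_train)
acc_real = sum(lbl == majority for lbl in real_world) / len(real_world)
print(majority, acc_train, acc_real)  # cat 0.9 0.5
```

90% apparent accuracy collapses to coin-flip performance once the data match reality, and no algorithmic fix recovers the gap.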

Selection bias

Source: Wikipedia

Common causes of selection bias

Where does bias creep in?

| Type | Description | Example |
|---|---|---|
| Time bias | Data from the past applied to the future | Training on 2019 data for 2024 predictions |
| Location bias | Data from one place applied elsewhere | US healthcare model used in India |
| Demographics bias | Certain groups under/over-represented | Clinical trial with mostly young, white, male participants |
| Availability bias | Using convenient rather than representative data | Surveying only friends |
| Response bias | Who responds differs from who doesn’t | Online surveys miss offline populations |
| Long-tail bias | Rare events underrepresented | Self-driving cars: unusual scenarios |
  • Try it out: Generate an image of “a CEO” and “a nurse” using two different LLMs, such as Qwen and Gemini. What do you notice?

Prompt: Create an image of a CEO and a nurse

Qwen

Source: Qwen via nanoGPT

Gemini

Source: Gemini’s NanoBanana

Dealing with selection bias

Mitigation strategies

Before collection:

  • Enumerate potential biases before collecting data
  • Design collection to cover important subgroups
  • Over-sample rare but important cases
  • Use data augmentation to increase dataset size
    • Try it out at home! Use an image generator to create variations of the same object (different angles, lighting, etc)
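Augmentation can be as simple as creating mirrored copies of existing examples. Here is a minimal sketch on a tiny 3x3 "image" represented as nested lists of pixel values (purely illustrative):

```python
# Toy data augmentation: grow the dataset by mirroring an existing
# example instead of collecting new data.
image = [[0, 1, 2],
         [3, 4, 5],
         [6, 7, 8]]

def hflip(img):
    """Mirror left-right: reverse each row."""
    return [row[::-1] for row in img]

def vflip(img):
    """Mirror top-bottom: reverse the row order."""
    return img[::-1]

augmented = [image, hflip(image), vflip(image)]
print(len(augmented))  # 3 training examples from 1 original
```

Real pipelines apply the same idea with rotations, crops, and lighting changes, but the principle is identical: cheap label-preserving transformations of data you already have.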

After collection:

  • Use representative validation sets
  • If deploying in new locations: validate on location-specific data
  • Document known limitations of your dataset
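A representative validation set usually means stratified sampling: drawing the same fraction from each important subgroup. The group names and the 20% rate below are illustrative assumptions:

```python
import random
from collections import defaultdict

# Stratified sampling sketch: the validation set preserves subgroup
# proportions, so evaluation reflects the deployment population.
random.seed(0)
data = [(f"urban_{i}", "urban") for i in range(80)] + \
       [(f"rural_{i}", "rural") for i in range(20)]

by_group = defaultdict(list)
for item, group in data:
    by_group[group].append(item)

val = []
for group, items in by_group.items():
    k = max(1, int(0.2 * len(items)))  # take 20% of each subgroup
    val.extend(random.sample(items, k))

print(len(val))  # 16 urban + 4 rural
```

A purely random 20% draw would usually land near these proportions too, but stratifying guarantees it, which matters most for small or rare subgroups.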

Strategies to reduce bias

Source: WisdomPlexus

Labelling data 🏷️

Why labelling is hard

The human bottleneck

  • Most AI models require labelled data
  • Labels come from humans → humans make mistakes
  • Common issues:
    • Inconsistency: Different people label differently
    • Ambiguity: Some examples are genuinely unclear
    • Mislabelling: Human errors happen
    • Fatigue: Quality drops over time
    • Expertise: Some tasks need domain experts
  • Labelling is expensive and time-consuming
  • Quality control is super important!

Labelling is harder than it looks!

Source: Galliot

Example

Source: LeewayHertz

Be the annotator! 🏷️

Try labelling these examples yourself

Instructions: For each example, decide on the label. Discuss with your neighbours!

Example 1: Email Spam Detection

“Congratulations! You’ve been selected to receive a $50 Amazon gift card for completing our customer satisfaction survey. Click here: surveyrewards-amazon.com”

Is this SPAM or NOT SPAM?

Example 2: Sentiment Analysis

“The hotel room was clean and the location was perfect. However, the staff were incredibly rude and the WiFi didn’t work at all. I probably won’t stay here again.”

Is this POSITIVE, NEGATIVE, or NEUTRAL?

Example 3: Content Moderation

“This politician is an absolute moron who doesn’t understand basic economics.”

Should this be REMOVED or ALLOWED?

MNIST dataset

  • How would you label these images?

Writing good annotation guidelines

The foundation of quality labels

Good guidelines should include:

  1. Clear definitions: What does each label mean?
  2. Examples: Canonical cases for each class
  3. Counter-examples: What does NOT belong
  4. Edge cases: How to handle ambiguity
  5. Visual aids: Screenshots, annotated examples

Best practices:

  • Test guidelines with a small group first
  • Iterate based on confusion and disagreements
  • Keep guidelines living documents—update as needed

Multiple annotators and agreement

Reducing individual errors

Why use multiple annotators:

  • Individual annotators make mistakes
  • Multiple annotators → more reliable labels
  • Benefits: Catch errors, identify ambiguous cases, assess label quality, reduce biases
  • Trade-off: More expensive

Aggregation strategies:

  • Majority vote: Most common label wins (limitations: ties, equal weighting)
  • Weighted voting: Trust reliable annotators more
  • Adjudication: Expert makes final call
  • Keep all labels: Train on disagreement (for uncertainty research)
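Majority vote with a tie fallback can be sketched in a few lines; sending ties to expert adjudication is one illustrative way to handle the tie limitation noted above:

```python
from collections import Counter

# Aggregate multiple annotators' labels by majority vote; ties are
# flagged for expert adjudication rather than guessed.
def aggregate(labels):
    counts = Counter(labels)
    (top, n1), *rest = counts.most_common()
    if rest and rest[0][1] == n1:
        return "NEEDS_ADJUDICATION"  # tie: no clear majority
    return top

print(aggregate(["spam", "spam", "ham"]))  # spam
print(aggregate(["spam", "ham"]))  # NEEDS_ADJUDICATION
```

Weighted voting would replace the raw counts with per-annotator reliability scores, but the aggregation skeleton stays the same.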

Measuring agreement (IAA):

  • Inter-annotator agreement: Do annotators agree?
  • High agreement → Labels are reliable
  • Low agreement → Problem with guidelines or task
  • Common metric: Cohen’s Kappa

Cohen’s Kappa: \[\kappa = \frac{\text{obs. agreement} - \text{exp. agreement}}{1 - \text{exp. agreement}}\]

| Kappa | Interpretation |
|---|---|
| < 0.20 | Slight |
| 0.21–0.40 | Fair |
| 0.41–0.60 | Moderate |
| 0.61–0.80 | Substantial |
| 0.81–1.00 | Almost perfect |
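Cohen's kappa is straightforward to compute directly from observed and expected agreement; the two annotators' labels below are made up for illustration:

```python
# Cohen's kappa for two annotators:
# kappa = (observed agreement - expected agreement) / (1 - expected agreement)
def cohens_kappa(a, b):
    n = len(a)
    labels = set(a) | set(b)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement: chance that both pick the same label at random,
    # given each annotator's own label frequencies.
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)

ann1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
ann2 = ["yes", "yes", "no", "no", "no", "no", "yes", "yes"]
print(round(cohens_kappa(ann1, ann2), 2))  # 0.5 -> moderate agreement
```

Note how the 75% raw agreement here drops to kappa = 0.5 once chance agreement is discounted, which is exactly why raw percent agreement overstates reliability.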

Quality control

The gold standard approach

  • Quality control (QC) examples: Examples where you already know the correct answer
  • Slip QC examples into the labelling queue without telling annotators
  • Use them to:
    • Monitor annotator accuracy over time
    • Identify annotators who need retraining
    • Catch low-quality work early
  • Important: Don’t let annotators know which examples are QC
  • Flag annotators who fail too many QC checks
  • Combine with regular audits of random samples
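Scoring annotators against hidden gold examples is simple to implement. The annotator names, labels, and 80% threshold below are illustrative:

```python
# QC sketch: compare each annotator's answers on gold-standard examples
# (whose true labels we know) and flag anyone below a threshold.
gold = {"ex1": "spam", "ex2": "ham", "ex3": "spam", "ex4": "ham"}

annotations = {
    "alice": {"ex1": "spam", "ex2": "ham", "ex3": "spam", "ex4": "ham"},
    "bob":   {"ex1": "ham", "ex2": "ham", "ex3": "ham", "ex4": "spam"},
}

THRESHOLD = 0.8
status = {}
for name, labels in annotations.items():
    acc = sum(labels[ex] == ans for ex, ans in gold.items()) / len(gold)
    status[name] = "OK" if acc >= THRESHOLD else "FLAG FOR RETRAINING"
    print(name, acc, status[name])
```

In production the gold examples would be interleaved with regular work so annotators cannot tell which items are being scored.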

QC examples catch errors

Source: SuperAnnotate

Building a robust annotation pipeline

End-to-end workflow

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '17px'}}}%%
flowchart TB
    A[📊 Data Selection] --> B[📝 Guideline Design]
    B --> C[👥 Annotator Training]
    C --> D[🧪 Pilot Labelling]
    D --> E{Quality OK?}
    E -->|No| B
    E -->|Yes| F[🏭 Full Labelling]
    F --> G[✅ QC Checks]
    G --> H[📈 Aggregation]
    H --> I[🎯 Final Dataset]
    style E fill:#f9f,stroke:#333,stroke-width:2px
    style I fill:#90EE90,stroke:#333,stroke-width:2px
```

Key stages:

  1. Select data stratified by important subgroups
  2. Design guidelines with clear definitions and examples
  3. Train annotators and calibrate their judgments
  4. Pilot on a small sample, iterate on guidelines
  5. Scale up with ongoing quality control
  6. Aggregate labels using appropriate strategies
  7. Document everything for reproducibility

Summary ✅

Main takeaways

  • Data quality > Algorithm sophistication for most real-world problems

  • Problem framing determines what data you need and how to label it

  • Selection bias is a major source of error. Enumerate biases before collecting

  • Labelling is hard, so clear guidelines and quality control are very important

  • Multiple annotators improve reliability, but aggregate intelligently

  • Document everything, so your future self will thank you

… and that’s all for today! 🎉