DATASCI 185: Introduction to AI Applications

Lecture 03: Dataset Design, Labels, and Tasks

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 🤓

Recap of last class

  • We traced AI from Ancient Greece to modern transformers
  • Symbolic AI (1950s-1980s): Hand-coded rules and logic
  • AI Winters: When hype exceeded reality (twice!)
  • Neural networks: From perceptrons to deep learning
  • Transformers (2017): Attention is all you need!
  • Data became the new oil… but where do the data come from? 🤔

A short history of AI

Source: Clay.Global

Lecture overview

Today’s agenda

  • Today we’ll explore the foundation of all AI systems: the data they’re trained on
  • Why data quality matters
  • Types of AI tasks
  • Sourcing and collecting data
  • Selection bias and its dangers
  • Best practices for labelling data
  • Working with multiple annotators
  • Quality control in datasets

The Modern Data Pipeline

Source: Medium

Tweet of the day

What matters more?

Take a moment to think…

“If you had to choose, which would give you better AI predictions?”

Option A: 🧠

A state-of-the-art algorithm trained on low-quality data

Option B: 📊

A simple algorithm trained on high-quality data

Discuss with your neighbour for 1 minute!

Raise your hand if you think Option A is better

And now raise your hand if you think Option B is better

A cow, a camel or a polar bear?

Source: Medium (slightly exaggerated!)

Why data quality matters 📊

Garbage in, garbage out

The fundamental principle

  • AI models are only as good as the data they’re trained on
  • No amount of algorithmic sophistication can fix bad data
  • Common data problems:
    • Missing important information (e.g., age, income, etc.)
    • Conflicting labels or formats (e.g., “yes” vs “1”)
    • Biases, not representative of reality (e.g., gender, race, etc.)
  • The quality of your models is directly tied to the quality of your data
  • Data-centric AI (Andrew Ng, 2021): focus on improving the data, not just the model

Garbage In, Garbage Out

Source: xkcd

Real-world impact of data quality

When bad data causes harm

| Case | What went wrong | Impact |
|---|---|---|
| Credit scoring | Data reflected historical gender-based income gaps | Lower credit limits for women with similar profiles |
| Facial recognition | Training data mostly light-skinned faces | Error rates 34x higher for dark-skinned women |
| Criminal risk assessment | Historical arrest data reflected biased policing | Perpetuated racial disparities in sentencing |
| Hiring AI | Trained on historical (male-dominated) hiring decisions | Penalised female candidates |

These weren’t algorithm failures—they were data failures.

Types of machine learning tasks 🎯

Classification and regression tasks

The two main types of AI tasks

Classification: Predicting categories

  • Goal: Assign inputs to discrete categories
  • Examples: Email → Spam/Not Spam; Image → Cat/Dog/Bird; Text → Positive/Negative/Neutral
  • Binary classification: Two classes
  • Multi-class: Many classes (one per input)
  • Multi-label: Multiple labels per input
  • Key challenge: How do you define the classes?
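As a minimal sketch of binary classification, here is a toy rule-based spam detector that maps each input to one of two discrete classes. The cue list and threshold are illustrative assumptions, not a trained model:

```python
import re

# Toy binary classifier: label an email SPAM if it contains enough known
# spam cue words. Cues and threshold are made up for illustration.
SPAM_CUES = {"winner", "free", "click", "gift", "congratulations"}

def classify_email(text: str) -> str:
    """Map an input to one of two discrete classes: SPAM or NOT SPAM."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits = len(words & SPAM_CUES)
    return "SPAM" if hits >= 2 else "NOT SPAM"

print(classify_email("Congratulations! Click here for a free gift"))  # SPAM
print(classify_email("Meeting moved to 3pm tomorrow"))  # NOT SPAM
```

A real classifier would learn these rules from labelled examples, but the output shape is the same: a discrete category per input.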

Classification

Regression: Predicting continuous values

  • Goal: Predict a continuous numerical value
  • Examples: Features → House price; Patient data → Hospital stay length; Weather data → Temperature
  • The choice of target variable is a design decision
  • Different targets can lead to very different models
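To contrast with classification, here is a least-squares fit on one feature predicting a continuous target. The house-price numbers are made up so the relationship is exact:

```python
# Simple linear regression by hand: predict a continuous target (price)
# from one feature (floor area). Data are synthetic for illustration.
areas = [50, 70, 90, 110, 130]     # m^2
prices = [150, 190, 230, 270, 310]  # thousands

n = len(areas)
mean_x = sum(areas) / n
mean_y = sum(prices) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(areas, prices)) / \
        sum((x - mean_x) ** 2 for x in areas)
intercept = mean_y - slope * mean_x

def predict(area):
    return intercept + slope * area

print(round(predict(100)))  # 250 (thousands): a number, not a category
```

The output is any point on a continuous scale, which is what distinguishes regression from classification.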

Regression

Other common AI tasks

Beyond classification and regression

| Task | Description | Example |
|---|---|---|
| Object Detection | Find and locate objects in images | Self-driving cars detecting pedestrians |
| Segmentation | Label every pixel in an image | Medical imaging (tumour boundaries) |
| Sequence Labelling | Label each element in a sequence | Named entity recognition in text |
| Generation | Create new content | ChatGPT, DALL-E, music generation |
| Ranking | Order items by relevance | Search results, recommendations |

Each task requires different types of labels in your dataset!

Defining your classes

Classification pitfalls

  • Classes should be:
    • Mutually exclusive: No overlap between categories
    • Collectively exhaustive: Cover all possible cases
    • Consistently definable: Clear boundaries
  • Common problems:
    • Ambiguous categories: What counts as “spam”?
    • Subjective judgments: Is this review “positive”?
    • Cultural differences: What’s “offensive” varies
    • Evolving definitions: Categories change over time
  • Solution: Clear annotation guidelines

A duck or a rabbit?

Source: The Independent

The proxy problem

When labels don’t measure what you care about

  • Often we can’t directly measure what we want to predict
  • We use proxy variables instead… but proxies can be misleading!
  • Example 01: Predicting “health needs”
    • What we want: How sick is this patient?
    • Available data: Healthcare spending
    • Spending ≠ health needs: Black patients had less spent on them due to systemic barriers
    • Algorithm learned: “Black patients need less care” → Wrong!
  • Example 02: Predicting expertise on social media
    • What we want: Who is an expert in a field?
    • Available data: Number of followers
    • Followers ≠ Expertise: Popularity can be gamed or unrelated to domain knowledge
  • Always ask: Does this label actually measure what I care about?

Model-centric vs. Data-centric AI

Source: Andrej Karpathy

Sourcing and collecting data 📥

Where do the data come from?

Common data sources

| Source | Examples | Considerations |
|---|---|---|
| Existing databases | Hospital records, transaction logs | May have missing fields, inconsistent formats |
| Public datasets | ImageNet, Wikipedia, Common Crawl | May not match your specific domain |
| Web scraping | Social media, news articles | Legal and ethical concerns |
| Surveys/Forms | User feedback, questionnaires | Response bias, sampling issues |
| Sensors/IoT | Cameras, wearables, satellites | Privacy concerns, data volume |
| Synthetic data | Generated by other models | May not capture real-world complexity |

How much data do you need?

The million-dollar question

  • There’s no universal answer… it depends on:
    • Task complexity: harder tasks need more data
    • Number of classes/features: more categories = more examples needed
    • Model architecture: deep learning is data-hungry
    • Required accuracy: higher stakes = more data
  • Rule of thumb: Start small, then scale up
  • Scaling laws: Performance improves predictably with more data
    • You can estimate how much data you’ll need by training on subsets and extrapolating
  • For most real-world problems, more high-quality data beats more sophisticated algorithms
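The "train on subsets and extrapolate" idea can be sketched with a toy statistical version: estimate a quantity from samples of growing size and watch the error shrink, roughly like 1/√n. The distribution and sample sizes below are made up for illustration:

```python
import random

# Illustrative scaling behaviour: estimation error shrinks predictably
# as the sample grows, which is why subset experiments let you
# extrapolate how much data a task will need.
random.seed(42)
TRUE_MEAN = 5.0
population = [random.gauss(TRUE_MEAN, 2.0) for _ in range(100_000)]

errors = {}
for n in [100, 1_000, 10_000, 100_000]:
    estimate = sum(population[:n]) / n
    errors[n] = abs(estimate - TRUE_MEAN)
    print(n, round(errors[n], 4))
```

Real scaling-law experiments do the same thing with model loss instead of a sample mean, but the logic of extrapolating from subsets is identical.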

Performance vs. dataset size

Source: Our World in Data

Selection bias ⚠️

What is selection bias?

When your data doesn’t represent reality

  • Selection bias is the systematic deviation between your training data and the real-world distribution
  • Also called sampling bias; closely related to distribution shift
  • The model learns patterns that won’t generalise
  • Extremely hard to fix after the fact
  • Prevention is better than cure!
  • Always ask: “In what ways might my data not be representative?”
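A tiny demonstration of why this matters: a model fit on a biased convenience sample can look accurate in-sample yet fail on the true distribution. The class proportions below are synthetic:

```python
# A convenience sample that over-represents one class makes a naive
# "model" look far better than it really is.
biased_train = ["cat"] * 90 + ["dog"] * 10   # 90% cats (biased sample)
real_world = ["cat"] * 50 + ["dog"] * 50     # true 50/50 split

# "Model": always predict the majority class seen in training
majority = max(set(biased_train), key=biased_train.count)

acc_train = sum(lbl == majority for lbl in biased_train) / len(biased_train)
acc_real = sum(lbl == majority for lbl in real_world) / len(real_world)
print(majority, acc_train, acc_real)  # cat 0.9 0.5
```

90% apparent accuracy collapses to coin-flip performance once the data match reality, and no algorithmic fix recovers the gap.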

Selection bias

Source: Wikipedia

Common causes of selection bias

Where does bias creep in?

| Type | Description | Example |
|---|---|---|
| Time bias | Data from the past applied to the future | Training on 2019 data for 2024 predictions |
| Location bias | Data from one place applied elsewhere | US healthcare model used in India |
| Demographics bias | Certain groups under/over-represented | Clinical trial with mostly young, white, male participants |
| Availability bias | Using convenient rather than representative data | Surveying only friends |
| Response bias | Who responds differs from who doesn’t | Online surveys miss offline populations |
| Long-tail bias | Rare events underrepresented | Self-driving cars: unusual scenarios |
  • Try it out: Generate an image of “a CEO” and “a nurse” using two different LLMs, such as Qwen and Gemini. What do you notice?

Prompt: Create an image of a CEO and a nurse

Qwen

Source: Qwen via nanoGPT

Gemini

Source: Gemini’s NanoBanana

Dealing with selection bias

Mitigation strategies

Before collection:

  • Enumerate potential biases before collecting data
  • Design collection to cover important subgroups
  • Over-sample rare but important cases
  • Use data augmentation to increase dataset size
    • Try it out at home! Use an image generator to create variations of the same object (different angles, lighting, etc)
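Augmentation can be as simple as creating mirrored copies of existing examples. Here is a minimal sketch on a tiny 3x3 "image" represented as nested lists of pixel values (purely illustrative):

```python
# Toy data augmentation: grow the dataset by mirroring an existing
# example instead of collecting new data.
image = [[0, 1, 2],
         [3, 4, 5],
         [6, 7, 8]]

def hflip(img):
    """Mirror left-right: reverse each row."""
    return [row[::-1] for row in img]

def vflip(img):
    """Mirror top-bottom: reverse the row order."""
    return img[::-1]

augmented = [image, hflip(image), vflip(image)]
print(len(augmented))  # 3 training examples from 1 original
```

Real pipelines apply the same idea with rotations, crops, and lighting changes, but the principle is identical: cheap label-preserving transformations of data you already have.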

After collection:

  • Use representative validation sets
  • If deploying in new locations: validate on location-specific data
  • Document known limitations of your dataset
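A representative validation set usually means stratified sampling: drawing the same fraction from each important subgroup. The group names and the 20% rate below are illustrative assumptions:

```python
import random
from collections import defaultdict

# Stratified sampling sketch: the validation set preserves subgroup
# proportions, so evaluation reflects the deployment population.
random.seed(0)
data = [(f"urban_{i}", "urban") for i in range(80)] + \
       [(f"rural_{i}", "rural") for i in range(20)]

by_group = defaultdict(list)
for item, group in data:
    by_group[group].append(item)

val = []
for group, items in by_group.items():
    k = max(1, int(0.2 * len(items)))  # take 20% of each subgroup
    val.extend(random.sample(items, k))

print(len(val))  # 16 urban + 4 rural
```

A purely random 20% draw would usually land near these proportions too, but stratifying guarantees it, which matters most for small or rare subgroups.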

Strategies to reduce bias

Source: WisdomPlexus

Labelling data 🏷️

Why labelling is hard

The human bottleneck

  • Most AI models require labelled data
  • Labels come from humans → humans make mistakes
  • Common issues:
    • Inconsistency: Different people label differently
    • Ambiguity: Some examples are genuinely unclear
    • Mislabelling: Human errors happen
    • Fatigue: Quality drops over time
    • Expertise: Some tasks need domain experts
  • Labelling is expensive and time-consuming
  • Quality control is super important!

Labelling is harder than it looks!

Source: Galliot

Example

Source: LeewayHertz

Be the annotator! 🏷️

Try labelling these examples yourself

Instructions: For each example, decide on the label. Discuss with your neighbours!

Example 1: Email Spam Detection

“Congratulations! You’ve been selected to receive a $50 Amazon gift card for completing our customer satisfaction survey. Click here: surveyrewards-amazon.com”

Is this SPAM or NOT SPAM?

Example 2: Sentiment Analysis

“The hotel room was clean and the location was perfect. However, the staff were incredibly rude and the WiFi didn’t work at all. I probably won’t stay here again.”

Is this POSITIVE, NEGATIVE, or NEUTRAL?

Example 3: Content Moderation

“This politician is an absolute moron who doesn’t understand basic economics.”

Should this be REMOVED or ALLOWED?

MNIST dataset

  • How would you label these images?

Writing good annotation guidelines

The foundation of quality labels

Good guidelines should include:

  1. Clear definitions: What does each label mean?
  2. Examples: Canonical cases for each class
  3. Counter-examples: What does NOT belong
  4. Edge cases: How to handle ambiguity
  5. Visual aids: Screenshots, annotated examples

Best practices:

  • Test guidelines with a small group first
  • Iterate based on confusion and disagreements
  • Keep guidelines living documents—update as needed

Multiple annotators and agreement

Reducing individual errors

Why use multiple annotators:

  • Individual annotators make mistakes
  • Multiple annotators → more reliable labels
  • Benefits: Catch errors, identify ambiguous cases, assess label quality, reduce biases
  • Trade-off: More expensive

Aggregation strategies:

  • Majority vote: Most common label wins (limitations: ties, equal weighting)
  • Weighted voting: Trust reliable annotators more
  • Adjudication: Expert makes final call
  • Keep all labels: Train on disagreement (for uncertainty research)
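Majority vote with a tie fallback can be sketched in a few lines; sending ties to expert adjudication is one illustrative way to handle the tie limitation noted above:

```python
from collections import Counter

# Aggregate multiple annotators' labels by majority vote; ties are
# flagged for expert adjudication rather than guessed.
def aggregate(labels):
    counts = Counter(labels)
    (top, n1), *rest = counts.most_common()
    if rest and rest[0][1] == n1:
        return "NEEDS_ADJUDICATION"  # tie: no clear majority
    return top

print(aggregate(["spam", "spam", "ham"]))  # spam
print(aggregate(["spam", "ham"]))  # NEEDS_ADJUDICATION
```

Weighted voting would replace the raw counts with per-annotator reliability scores, but the aggregation skeleton stays the same.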

Measuring agreement (IAA):

  • Inter-annotator agreement: Do annotators agree?
  • High agreement → Labels are reliable
  • Low agreement → Problem with guidelines or task
  • Common metric: Cohen’s Kappa

Cohen’s Kappa: \[\kappa = \frac{\text{obs. agreement} - \text{exp. agreement}}{1 - \text{exp. agreement}}\]

| Kappa | Interpretation |
|---|---|
| < 0.20 | Slight |
| 0.21–0.40 | Fair |
| 0.41–0.60 | Moderate |
| 0.61–0.80 | Substantial |
| 0.81–1.00 | Almost perfect |
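Cohen's kappa is straightforward to compute directly from observed and expected agreement; the two annotators' labels below are made up for illustration:

```python
# Cohen's kappa for two annotators:
# kappa = (observed agreement - expected agreement) / (1 - expected agreement)
def cohens_kappa(a, b):
    n = len(a)
    labels = set(a) | set(b)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement: chance that both pick the same label at random,
    # given each annotator's own label frequencies.
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)

ann1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
ann2 = ["yes", "yes", "no", "no", "no", "no", "yes", "yes"]
print(round(cohens_kappa(ann1, ann2), 2))  # 0.5 -> moderate agreement
```

Note how the 75% raw agreement here drops to kappa = 0.5 once chance agreement is discounted, which is exactly why raw percent agreement overstates reliability.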

Quality control

The gold standard approach

  • Quality control (QC) examples: Examples where you already know the correct answer
  • Slip QC examples into the labelling queue without telling annotators
  • Use them to:
    • Monitor annotator accuracy over time
    • Identify annotators who need retraining
    • Catch low-quality work early
  • Important: Don’t let annotators know which examples are QC
  • Flag annotators who fail too many QC checks
  • Combine with regular audits of random samples
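Scoring annotators against hidden gold examples is simple to implement. The annotator names, labels, and 80% threshold below are illustrative:

```python
# QC sketch: compare each annotator's answers on gold-standard examples
# (whose true labels we know) and flag anyone below a threshold.
gold = {"ex1": "spam", "ex2": "ham", "ex3": "spam", "ex4": "ham"}

annotations = {
    "alice": {"ex1": "spam", "ex2": "ham", "ex3": "spam", "ex4": "ham"},
    "bob":   {"ex1": "ham", "ex2": "ham", "ex3": "ham", "ex4": "spam"},
}

THRESHOLD = 0.8
status = {}
for name, labels in annotations.items():
    acc = sum(labels[ex] == ans for ex, ans in gold.items()) / len(gold)
    status[name] = "OK" if acc >= THRESHOLD else "FLAG FOR RETRAINING"
    print(name, acc, status[name])
```

In production the gold examples would be interleaved with regular work so annotators cannot tell which items are being scored.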

QC examples catch errors

Source: SuperAnnotate

Building a robust annotation pipeline

End-to-end workflow

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '17px'}}}%%
flowchart TB
    A[📊 Data Selection] --> B[📝 Guideline Design]
    B --> C[👥 Annotator Training]
    C --> D[🧪 Pilot Labelling]
    D --> E{Quality OK?}
    E -->|No| B
    E -->|Yes| F[🏭 Full Labelling]
    F --> G[✅ QC Checks]
    G --> H[📈 Aggregation]
    H --> I[🎯 Final Dataset]
    style E fill:#f9f,stroke:#333,stroke-width:2px
    style I fill:#90EE90,stroke:#333,stroke-width:2px
```

Key stages:

  1. Select data stratified by important subgroups
  2. Design guidelines with clear definitions and examples
  3. Train annotators and calibrate their judgments
  4. Pilot on a small sample, iterate on guidelines
  5. Scale up with ongoing quality control
  6. Aggregate labels using appropriate strategies
  7. Document everything for reproducibility

Summary ✅

Main takeaways

  • Data quality > Algorithm sophistication for most real-world problems

  • Problem framing determines what data you need and how to label it

  • Selection bias is a major source of error. Enumerate biases before collecting

  • Labelling is hard, so clear guidelines and quality control are very important

  • Multiple annotators improve reliability, but aggregate intelligently

  • Document everything, so your future self will thank you

… and that’s all for today! 🎉