DATASCI 185: Introduction to AI Applications

Lecture 14: Documentation and Dataset Governance

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 📋

Recap of last class

  • Last time we explored pipelines and monitoring
  • AI systems are complex chains that can fail at any step
  • Monitoring helps catch problems before users do
  • Testing for properties, not exact outputs
  • Today: We enter Module 4—Data Ethics and Bias
  • Documentation as the first line of defence
  • If you don’t know where your data came from, you can’t fix bias
  • Meta AI research scientist Moustapha Cissé: “You are what you eat, and right now we feed our models junk food.”

Source: Labellerr

Lecture overview

Today’s agenda

Part 1: Why Documentation Matters

  • The “nutrition label” for AI
  • What happens without it (spoiler: chaos)
  • Documentation as accountability

Part 2: Datasheets for Datasets

  • What, why, how, who, when
  • Reading and writing datasheets
  • Hands-on: Create your own!

Part 3: Model Cards

  • What does the model do (and not do)?
  • Intended uses and limitations
  • Ethical considerations

Part 4: LLM Training Data Governance

  • How modern AI datasets are documented
  • LAION, Common Crawl, and data provenance
  • The FAIR principles for research data

Fun fact of the day! 😁

Source: Reddit

Revolting fact of the day! 😠

Very nice, Zuck!

Source: Mashable

Why Documentation Matters 📝

The horror stories: What happens without documentation

ImageNet’s hidden problems (2019):

  • Most influential computer vision dataset ever
  • Powered 10+ years of AI research
  • Nobody knew where the images came from!
  • Later discovered:
    • Scraped from Flickr without consent
    • Racist and sexist labels (applied by MTurk workers)
    • Non-consensual photos of minors
    • Content from revenge porn sites

The root cause?

They built it quickly and never documented what they were doing!

Thousands of AI systems were built on this undocumented foundation!

ImageNet analysis

Source: Excavating AI (great resource, check it out!)

Documentation as accountability

Without documentation, we can’t:

  • ❌ Know if data were collected ethically
  • ❌ Identify sources of bias
  • ❌ Determine if a model is appropriate for a use case
  • ❌ Assign responsibility when things go wrong
  • ❌ Reproduce or verify results
  • ❌ Update or fix problems later

With documentation, we can:

  • ✅ Make informed choices about using a dataset
  • ✅ Trace bias back to its source
  • ✅ Match models to appropriate applications
  • ✅ Hold creators accountable
  • ✅ Build trust with users and regulators

The legal angle:

Regulations are now requiring documentation:

  • EU AI Act: Mandates documentation for high-risk AI
  • NYC Law 144: Requires bias audits (needs documentation!)
  • California CCPA: Data transparency requirements

We’ll cover US vs EU regulations in detail in Lecture 17.

If you can’t document it, you can’t deploy it (legally).

The three pillars: Datasheets, Model Cards, and System Cards

Source: Laurel Papworth

They originated from landmark papers: Gebru et al. (2018) for datasheets and Mitchell et al. (2019) for model cards.

These are now industry standards adopted by Google, Microsoft, Hugging Face, and others!

Datasheets for Datasets 📊

What is a datasheet?

Inspired by electronics industry:

  • Every electronic component has a “datasheet”
  • Spec sheet with all relevant information
  • Engineers can make informed decisions

Timnit Gebru et al. (2018) proposed the same for datasets:

“A datasheet documents the motivation, composition, collection process, recommended uses, and other information about a dataset.”

Core sections:

Motivation, Composition, Collection, Preprocessing, Uses, Distribution, Maintenance

What a datasheet answers

Motivation:

  • Why was the dataset created?
  • Who created it and who funded it?
  • What task was it designed for?

Composition:

  • What types of data are included?
  • How many instances are there?
  • Is there any sensitive information?
  • Are there known errors or noise?

Collection:

  • How was data collected (scraped, survey, sensors)?
  • Was consent obtained?
  • Who were the data subjects?
  • What timeframe was covered?

Preprocessing:

  • Was data cleaned or filtered?
  • Were any instances removed?
  • How was data labelled?

Uses:

  • What tasks has it been used for?
  • What should it NOT be used for?
  • Are there known impact areas?

Distribution:

  • How is the dataset shared?
  • Is it versioned?

Maintenance:

  • Who is responsible for updates?
  • Is there a way to report issues?

Example: A good datasheet

The CHoRUS Dataset Datasheet (excerpt):

Motivation:

Develop a diverse, ethically sourced, AI-ready dataset to improve recovery from acute illness, with attention to diversity, equity, and Social Determinants of Health.

Composition:

23,400 hospital admissions from 14 hospitals. Multi-modal: EHR, waveforms, imaging (DICOM), clinical notes (OHNLP), EEG data.

Ethical considerations:

  • Multi-site IRB approvals
  • Patient-focused ethical frameworks
  • BRIDGE Center ethics expertise on AI/ML biases

Recommended uses:

AI/ML for clinical care, particularly acute illness recovery prediction.

NOT recommended for:

Re-identification of patients (prohibited), applications perpetuating bias or discrimination.

Collaborative Hospital Repository Uniting Standards for Equitable AI

Source: BRIDGE Center

Activity: Let’s read a real datasheet! 📖

Read Anthropic’s HH-RLHF Dataset Card:

  1. Visit Anthropic/hh-rlhf
  2. Read the Dataset Card carefully
  3. This dataset is used to train AI assistants to be helpful and harmless!

Questions to answer:

  • What are the two types of data in this dataset?
  • Why does Anthropic warn against using this for supervised training of dialogue agents?
  • What content warning does the dataset include?
  • How can you contact the authors with issues?

Evaluate the documentation:

  • ✅ Clear purpose: Train preference/reward models for RLHF
  • ✅ Explicit warnings about misuse (don’t train chatbots directly!)
  • ✅ Content disclaimer about harmful material
  • ✅ Links to papers for methodology details
  • ✅ Contact email provided

Discussion:

  • Would you be comfortable using this dataset?
  • What ethical responsibilities come with accessing data that contains harmful content for research purposes?

⏱️ 3 minutes to explore!

Model Cards 🤖

What is a model card?

The “user manual” for AI models:

Just like datasheets document data, model cards document models.

Margaret Mitchell et al. (2019):

“Model cards are short documents accompanying trained ML models that provide benchmarked evaluation in a variety of conditions.”

Core sections:

  1. Model details: What is this model?
  2. Intended use: What should it be used for?
  3. Factors: What affects performance?
  4. Metrics: How was it evaluated?
  5. Training data: What was it trained on?
  6. Ethical considerations: What could go wrong?
  7. Caveats and recommendations: Warnings!

Model cards paper
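The seven sections above can also be captured as structured metadata. Below is a toy sketch (the model and all values are invented; `check_use` is a hypothetical helper, not part of any real library):

```python
# A minimal model card as a Python dict, mirroring the seven core sections.
model_card = {
    "model_details": "sentiment-classifier v1.0, logistic regression",
    "intended_use": ["product feedback analysis"],
    "not_intended_for": ["hiring decisions", "medical triage"],
    "factors": "performance varies by review length and language",
    "metrics": {"accuracy_overall": 0.90},
    "training_data": "restaurant_reviews v2.1 (see its datasheet)",
    "ethical_considerations": "English-only data; may underperform elsewhere",
    "caveats": "re-evaluate before any new deployment context",
}

def check_use(card, proposed_use):
    """Flag uses the card explicitly rules out."""
    if proposed_use in card["not_intended_for"]:
        return "blocked: out of scope per model card"
    if proposed_use in card["intended_use"]:
        return "ok: documented intended use"
    return "caution: use not documented, review required"

print(check_use(model_card, "hiring decisions"))
# → blocked: out of scope per model card
```

Note how the "NOT intended for" list turns the card from passive documentation into something a deployment pipeline could enforce.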

Intended use: The most important section

Why “intended use” matters:

A hammer is a great tool. But:

  • ✅ Intended use: Driving nails
  • ❌ NOT intended for: Brain surgery

Models need the same clarity:

Model              | Intended use             | NOT for
Gemini 3           | Conversation, assistance | Medical diagnosis
Face detection     | Placing cameras          | Law enforcement ID
Sentiment analysis | Product feedback         | Hiring decisions

Without this guidance:

People will use models inappropriately and harm others.

Real example: Gemini 3 Model Card (edited and summarised)

Intended use:

“Assistance with a variety of text-based tasks”

Out-of-scope:

“Assistance with illegal activities”

“Chemical synthesis”

“Mental health crisis intervention”

Clear boundaries protect users and developers alike

Disaggregated evaluation: Breaking it down

Why overall accuracy isn’t enough:

  • A model can be 90% accurate overall
  • But 95% for Group A and 70% for Group B!

Disaggregated evaluation = report performance separately for different groups.
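In code, disaggregation just means computing the same metric per group instead of once over everything. A minimal sketch with invented predictions:

```python
# Overall accuracy can hide large gaps between groups.
# The (prediction, label, group) triples below are made up for illustration.

def accuracy(pairs):
    """Fraction of (prediction, label) pairs that match."""
    return sum(p == y for p, y in pairs) / len(pairs)

results = [
    (1, 1, "A"), (0, 0, "A"), (1, 1, "A"), (1, 1, "A"),  # group A: 4/4 correct
    (1, 0, "B"), (0, 0, "B"), (1, 1, "B"), (0, 1, "B"),  # group B: 2/4 correct
]

overall = accuracy([(p, y) for p, y, _ in results])
by_group = {
    g: accuracy([(p, y) for p, y, grp in results if grp == g])
    for g in {"A", "B"}
}

print(f"overall: {overall:.0%}")  # → overall: 75%
print(by_group)                   # → {'A': 1.0, 'B': 0.5} (order may vary)
```

A single "75% accurate" headline would hide that the model is perfect for group A and a coin flip for group B.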

Real example: OpenAI CLIP Model Card (2021)

Gender classification accuracy using FairFace dataset:

Race category  | Gender classification accuracy
Middle Eastern | 98.4% (highest)
White          | 96.5% (lowest)
All races      | > 96%

Racial classification: ~93% | Age classification: ~63%

CLIP’s bias findings:

OpenAI found significant disparities when classifying people into crime-related categories:

  • Performance varied by race and gender
  • Disparities shifted based on how classes were constructed
  • Risk of denigration harms identified

“We tested the risk of certain kinds of denigration with CLIP by classifying images of people from Fairface into crime-related and non-human animal categories.”

Industry trend: Leading companies now report disaggregated metrics in model cards.

System cards vs model cards

A newer concept: System Cards

As AI systems become more complex, a single “model card” isn’t enough. System cards document the entire AI system, not just one model.

What’s included in a system card:

  • Multiple models working together
  • How components interact
  • Human oversight mechanisms
  • Deployment context and safeguards
  • Red teaming and safety evaluations

Who’s using system cards:

  • OpenAI: Released o1 System Card in December 2024
  • Anthropic: Publishes system cards for each Claude release
  • Meta: Released 22 system cards in 2023 for their AI products

Model Card vs System Card:

Aspect   | Model Card             | System Card
Scope    | Single model           | Entire system
Focus    | Performance            | Safety + deployment
Audience | Developers             | Developers + public
Includes | Training data, metrics | Red teaming, safeguards

LLM Training Data Governance 🗂️

How large AI datasets are built

The data behind modern AI:

Large language models and image generators are trained on massive datasets scraped from the web. Understanding how these datasets are created is essential to understanding their biases.

Common Crawl: The foundation

  • Non-profit that crawls the web monthly
  • Billions of web pages archived
  • Raw HTML, metadata, and extracted text
  • Free and open for anyone to use
  • No consent from content creators

The pipeline:

  1. Common Crawl provides raw web data
  2. Organisations filter and clean it
  3. Datasets like LAION, The Pile, C4 are created
  4. These train models like GPT, DALL-E, Stable Diffusion
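Step 2 of the pipeline above, filtering and cleaning, can be sketched in a few lines. Real pipelines (e.g. the one behind C4) use far more sophisticated heuristics; the rules and pages below are invented for illustration:

```python
# Toy web-text filtering: deduplicate and drop low-quality pages.
raw_pages = [
    "Welcome to my cooking blog! Today we bake sourdough bread at home.",
    "click here click here click here",  # repetitive boilerplate
    "Welcome to my cooking blog! Today we bake sourdough bread at home.",  # dup
    "404 Not Found",                     # error page, too short
]

def keep(text):
    """Very rough quality heuristics (illustrative only)."""
    words = text.split()
    if len(words) < 5:                         # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.5:     # highly repetitive
        return False
    return True

seen, cleaned = set(), []
for page in raw_pages:
    if page not in seen and keep(page):        # dedupe + quality filter
        seen.add(page)
        cleaned.append(page)

print(cleaned)  # only the cooking-blog page survives
```

Every filtering choice like this shapes what the final model learns, which is exactly why these steps belong in a datasheet.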

Common Crawl archives the web monthly

Source: Common Crawl

Data provenance: Where did this come from?

Data provenance = Tracking data from source to use

Think of it like a family tree for your data.

Why it matters:

  1. Debugging: Where did this error originate?
  2. Compliance: Can we legally use this data?
  3. Quality: Has this data been properly validated?
  4. Reproducibility: Can we recreate this analysis?

Tools for tracking provenance:

  • Data Provenance Explorer (MIT): Generates summaries of dataset sources, licenses, and allowable uses
  • DVC (Data Version Control): Git for data
  • MLflow: Tracks experiments and data versions

Data lineage diagram

Source: Geeks for Geeks

Analogy: Like a food supply chain. If there’s contamination, you need to trace it back to the source!

Simplified example: Tracking data with Python

What this code does:

This is a simplified example of how you might track where your data comes from. In practice, tools like DVC do this automatically.

Key concepts:

  • source: Where did the data originate?
  • collected_at: When was it gathered?
  • transformations: What processing was applied?
  • version: Which version of the data is this?

Why this matters:

If your model starts behaving badly, you can trace back and ask: “Did the data change? Was a filter removed?”

# Simple data provenance tracking
# This is metadata, not actual data

provenance = {
    "dataset_name": "restaurant_reviews",
    "version": "2.1",
    "source": "yelp_api",
    "collected_at": "2024-01-15",
    "transformations": [
        "removed_duplicates",
        "filtered_english_only",
        "anonymised_usernames"
    ],
    "row_count": 50000,
    "known_issues": [
        "US restaurants only",
        "2023 reviews missing"
    ]
}

This creates a “paper trail” for your dataset.
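Continuing the sketch above: each processing step should append to the record rather than overwrite it, so history is preserved. The `record_step` helper below is hypothetical, not part of any real tool:

```python
# Append-only provenance updates: every transformation bumps the version.
def record_step(provenance, step, new_row_count=None):
    """Record a transformation and bump the minor version."""
    provenance["transformations"].append(step)
    major, minor = provenance["version"].split(".")
    provenance["version"] = f"{major}.{int(minor) + 1}"
    if new_row_count is not None:
        provenance["row_count"] = new_row_count
    return provenance

provenance = {
    "dataset_name": "restaurant_reviews",
    "version": "2.1",
    "transformations": ["removed_duplicates"],
    "row_count": 50000,
}

record_step(provenance, "filtered_short_reviews", new_row_count=48200)
print(provenance["version"])          # → 2.2
print(provenance["transformations"])  # → two steps recorded
```

Because old steps are never deleted, you can always answer "was a filter removed?" by diffing versions.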

The FAIR Principles 🔬

FAIR: The gold standard for research data

FAIR Principles (2016):

A framework for making research data more useful, developed by the scientific community.

F - Findable:

  • Data has a unique, persistent identifier (like a DOI)
  • Rich metadata describes the data
  • Registered in a searchable resource

A - Accessible:

  • Retrievable by identifier using standard protocols
  • Metadata remains accessible even if data isn’t
  • Clear authentication if access is restricted

I - Interoperable:

  • Uses standardised formats and vocabularies
  • Can be combined with other datasets
  • Machine-readable

R - Reusable:

  • Clear usage license
  • Detailed provenance information
  • Meets domain-relevant standards
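The four pillars can be made concrete as a metadata record. Field names and values below are illustrative, not a formal FAIR schema; the DOI and URL are placeholders:

```python
# A sketch of FAIR-style metadata for a hypothetical dataset.
fair_metadata = {
    # Findable: persistent identifier + rich description
    "doi": "10.5281/zenodo.0000000",                  # placeholder DOI
    "title": "Restaurant reviews benchmark",
    "keywords": ["sentiment", "reviews", "NLP"],
    # Accessible: standard protocol, clear access rules
    "access_url": "https://example.org/data/reviews",  # hypothetical URL
    "access_rights": "open",
    # Interoperable: standard, machine-readable format
    "format": "CSV (UTF-8)",
    # Reusable: license + provenance
    "license": "CC-BY-4.0",
    "provenance": "Collected via public API, 2024; deduplicated",
}

# Quick completeness check: one required field per FAIR pillar
required = ["doi", "access_url", "format", "license"]
missing = [k for k in required if not fair_metadata.get(k)]
print("FAIR-complete!" if not missing else f"missing: {missing}")
```

Funders increasingly run automated checks much like this one before accepting a data deposit.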

FAIR principles

Source: GO-FAIR

Why FAIR matters for AI:

The NIH, EU, and major funders now require FAIR data practices. AI datasets that follow FAIR principles are more trustworthy.

Data Governance in Practice 🏛️

Data rights movements: The future

New movements emerging:

1. Data dignity:

2. Opt-out rights:

  • GDPR: Right to be forgotten
  • CCPA: Right to know and delete
  • AI training opt-out (some systems support this now)

3. Data trusts:

  • Independent organisations manage data on your behalf
  • Negotiate terms and protect privacy

Data rights

Growing recognition that data ownership matters

Discussion: Should YOU be paid when your data trains an AI that makes billions?

Discussion: Who owns the data? 💭

Scenario:

You wrote a poem and posted it online in 2018. An AI company scraped it, and now their AI can write poems “inspired by” your style.

Questions to debate:

  1. Did the company do anything wrong?
  2. Should you be compensated?
  3. Should you be able to opt out retroactively?

Different perspectives:

Tech companies: “It’s fair use, like learning from reading books”

Artists: “You’re profiting from my creative work without consent”

Lawyers: “Current law wasn’t designed for this”

Users: “I just want cool AI. I don’t care about the source”

Your turn: Where do you stand?

Putting It Into Practice 🛠️

Tools for documentation

Good news: Tools exist!

For Datasheets:

  • The template questions from Gebru et al. (2018)

For Model Cards:

  • Hugging Face's model card templates on the Hub

For Data Lineage: (advanced)

  • DVC and MLflow (mentioned earlier)

Mintlify

Trend: Major platforms now require documentation to publish datasets/models!

Summary 📚

Main takeaways

  • Datasheets: Document datasets with motivation, composition, and limitations

  • Model cards: Document models with intended use, performance by group, and ethics

  • System cards: Document entire AI systems including safety evaluations

  • Data lineage: Track where data came from and how it changed

  • FAIR principles: Findable, Accessible, Interoperable, Reusable data

  • Consent matters: Much AI training data lacks proper consent

  • Career opportunity: Documentation skills are increasingly valuable

…and that’s all for today! 🎉