DATASCI 185: Introduction to AI Applications

Lecture 14: Documentation and Dataset Governance

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 📋

Recap of last class

  • Last time we explored pipelines and monitoring
  • AI systems are complex chains that can fail at any step
  • Monitoring helps catch problems before users do
  • Testing for properties, not exact outputs
  • Today: We enter Module 4—Data Ethics and Bias
  • Documentation as the first line of defence
  • If you don’t know where your data came from, you can’t fix bias
  • Meta AI research scientist Moustapha Cissé: “You are what you eat, and right now we feed our models junk food.”

Source: Labellerr

Lecture overview

Today’s agenda

Part 1: Why Documentation Matters

  • The “nutrition label” for AI
  • What happens without it (spoiler: chaos)
  • Documentation as accountability

Part 2: Datasheets for Datasets

  • What, why, how, who, when
  • Reading and writing datasheets
  • Hands-on: Create your own!

Part 3: Model Cards

  • What does the model do (and not do)?
  • Intended uses and limitations
  • Ethical considerations

Part 4: LLM Training Data Governance

  • How modern AI datasets are documented
  • LAION, Common Crawl, and data provenance
  • The FAIR principles for research data

Fun fact of the day! 😁

Source: Reddit

Revolting fact of the day! 😠

Very nice, Zuck!

Source: Mashable

Why Documentation Matters 📝

The horror stories: What happens without documentation

ImageNet’s hidden problems (2019):

  • Most influential computer vision dataset ever
  • Powered 10+ years of AI research
  • Nobody knew where the images came from!
  • Later discovered:
    • Scraped from Flickr without consent
    • Racist and sexist labels (applied by MTurk workers)
    • Non-consensual photos of minors
    • Content from revenge porn sites

The root cause?

They built it quickly and never documented what they were doing!

Thousands of AI systems were built on this undocumented foundation!

ImageNet analysis

Source: Excavating AI (great resource, check it out!)

Documentation as accountability

Without documentation, we can’t:

  • ❌ Know if data were collected ethically
  • ❌ Identify sources of bias
  • ❌ Determine if a model is appropriate for a use case
  • ❌ Assign responsibility when things go wrong
  • ❌ Reproduce or verify results
  • ❌ Update or fix problems later

With documentation, we can:

  • ✅ Make informed choices about using a dataset
  • ✅ Trace bias back to its source
  • ✅ Match models to appropriate applications
  • ✅ Hold creators accountable
  • ✅ Build trust with users and regulators

The legal angle:

Regulations are now requiring documentation:

  • EU AI Act: Mandates documentation for high-risk AI
  • NYC Law 144: Requires bias audits (needs documentation!)
  • California CCPA: Data transparency requirements

We’ll cover US vs EU regulations in detail in Lecture 17.

If you can’t document it, you can’t deploy it (legally).

The three pillars: Datasheets, Model Cards, and System Cards

Source: Laurel Papworth

They originated from landmark papers: Gebru et al. (2018) for datasheets and Mitchell et al. (2019) for model cards.

These are now industry standards adopted by Google, Microsoft, Hugging Face, and others!

Datasheets for Datasets 📊

What is a datasheet?

Inspired by electronics industry:

  • Every electronic component has a “datasheet”
  • Spec sheet with all relevant information
  • Engineers can make informed decisions

Timnit Gebru et al. (2018) proposed the same for datasets:

“A datasheet documents the motivation, composition, collection process, recommended uses, and other information about a dataset.”

Core sections:

Motivation, Composition, Collection, Preprocessing, Uses, Distribution, Maintenance

What a datasheet answers

Motivation:

  • Why was the dataset created?
  • Who created it and who funded it?
  • What task was it designed for?

Composition:

  • What types of data are included?
  • How many instances are there?
  • Is there any sensitive information?
  • Are there known errors or noise?

Collection:

  • How was data collected (scraped, survey, sensors)?
  • Was consent obtained?
  • Who were the data subjects?
  • What timeframe was covered?

Preprocessing:

  • Was data cleaned or filtered?
  • Were any instances removed?
  • How was data labelled?

Uses:

  • What tasks has it been used for?
  • What should it NOT be used for?
  • Are there known impact areas?

Distribution:

  • How is the dataset shared?
  • Is it versioned?

Maintenance:

  • Who is responsible for updates?
  • Is there a way to report issues?

Example: A good datasheet

The CHoRUS Dataset Datasheet (excerpt):

Motivation:

Develop a diverse, ethically sourced, AI-ready dataset to improve recovery from acute illness, with attention to diversity, equity, and Social Determinants of Health.

Composition:

23,400 hospital admissions from 14 hospitals. Multi-modal: EHR, waveforms, imaging (DICOM), clinical notes (OHNLP), EEG data.

Ethical considerations:

  • Multi-site IRB approvals
  • Patient-focused ethical frameworks
  • BRIDGE Center ethics expertise on AI/ML biases

Recommended uses:

AI/ML for clinical care, particularly acute illness recovery prediction.

NOT recommended for:

Re-identification of patients (prohibited), applications perpetuating bias or discrimination.

Collaborative Hospital Repository Uniting Standards for Equitable AI

Source: BRIDGE Center

Activity: Let’s read a real datasheet! 📖

Read Anthropic’s HH-RLHF Dataset Card:

  1. Visit Anthropic/hh-rlhf
  2. Read the Dataset Card carefully
  3. This dataset is used to train AI assistants to be helpful and harmless!

Questions to answer:

  • What are the two types of data in this dataset?
  • Why does Anthropic warn against using this for supervised training of dialogue agents?
  • What content warning does the dataset include?
  • How can you contact the authors with issues?

Evaluate the documentation:

  • ✅ Clear purpose: Train preference/reward models for RLHF
  • ✅ Explicit warnings about misuse (don’t train chatbots directly!)
  • ✅ Content disclaimer about harmful material
  • ✅ Links to papers for methodology details
  • ✅ Contact email provided

Discussion:

  • Would you be comfortable using this dataset?
  • What ethical responsibilities come with accessing data that contains harmful content for research purposes?

⏱️ 3 minutes to explore!

Model Cards 🤖

What is a model card?

The “user manual” for AI models:

Just like datasheets document data, model cards document models.

Margaret Mitchell et al. (2019):

“Model cards are short documents accompanying trained ML models that provide benchmarked evaluation in a variety of conditions.”

Core sections:

  1. Model details: What is this model?
  2. Intended use: What should it be used for?
  3. Factors: What affects performance?
  4. Metrics: How was it evaluated?
  5. Training data: What was it trained on?
  6. Ethical considerations: What could go wrong?
  7. Caveats and recommendations: Warnings!

Model cards paper
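The seven sections above can also be captured as structured metadata. Below is a toy sketch (the model and all values are invented; `check_use` is a hypothetical helper, not part of any real library):

```python
# A minimal model card as a Python dict, mirroring the seven core sections.
model_card = {
    "model_details": "sentiment-classifier v1.0, logistic regression",
    "intended_use": ["product feedback analysis"],
    "not_intended_for": ["hiring decisions", "medical triage"],
    "factors": "performance varies by review length and language",
    "metrics": {"accuracy_overall": 0.90},
    "training_data": "restaurant_reviews v2.1 (see its datasheet)",
    "ethical_considerations": "English-only data; may underperform elsewhere",
    "caveats": "re-evaluate before any new deployment context",
}

def check_use(card, proposed_use):
    """Flag uses the card explicitly rules out."""
    if proposed_use in card["not_intended_for"]:
        return "blocked: out of scope per model card"
    if proposed_use in card["intended_use"]:
        return "ok: documented intended use"
    return "caution: use not documented, review required"

print(check_use(model_card, "hiring decisions"))
# → blocked: out of scope per model card
```

Note how the "NOT intended for" list turns the card from passive documentation into something a deployment pipeline could enforce.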

Intended use: The most important section

Why “intended use” matters:

A hammer is a great tool. But:

  • ✅ Intended use: Driving nails
  • ❌ NOT intended for: Brain surgery

Models need the same clarity:

Model              | Intended use             | NOT for
Gemini 3           | Conversation, assistance | Medical diagnosis
Face detection     | Placing cameras          | Law enforcement ID
Sentiment analysis | Product feedback         | Hiring decisions

Without this guidance:

People will use models inappropriately and harm others.

Real example: Gemini 3 Model Card (edited and summarised)

Intended use:

“Assistance with a variety of text-based tasks”

Out-of-scope:

“Assistance with illegal activities”

“Chemical synthesis”

“Mental health crisis intervention”

Clear boundaries protect users and developers alike

Disaggregated evaluation: Breaking it down

Why overall accuracy isn’t enough:

  • A model can be 90% accurate overall
  • But 95% for Group A and 70% for Group B!

Disaggregated evaluation = report performance separately for different groups.
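In code, disaggregation just means computing the same metric per group instead of once over everything. A minimal sketch with invented predictions:

```python
# Overall accuracy can hide large gaps between groups.
# The (prediction, label, group) triples below are made up for illustration.

def accuracy(pairs):
    """Fraction of (prediction, label) pairs that match."""
    return sum(p == y for p, y in pairs) / len(pairs)

results = [
    (1, 1, "A"), (0, 0, "A"), (1, 1, "A"), (1, 1, "A"),  # group A: 4/4 correct
    (1, 0, "B"), (0, 0, "B"), (1, 1, "B"), (0, 1, "B"),  # group B: 2/4 correct
]

overall = accuracy([(p, y) for p, y, _ in results])
by_group = {
    g: accuracy([(p, y) for p, y, grp in results if grp == g])
    for g in {"A", "B"}
}

print(f"overall: {overall:.0%}")  # → overall: 75%
print(by_group)                   # → {'A': 1.0, 'B': 0.5} (order may vary)
```

A single "75% accurate" headline would hide that the model is perfect for group A and a coin flip for group B.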

Real example: OpenAI CLIP Model Card (2021)

Gender classification accuracy using FairFace dataset:

Race category  | Gender classification accuracy
Middle Eastern | 98.4% (highest)
White          | 96.5% (lowest)
All races      | > 96%

Racial classification: ~93% | Age classification: ~63%

CLIP’s bias findings:

OpenAI found significant disparities when classifying people into crime-related categories:

  • Performance varied by race and gender
  • Disparities shifted based on how classes were constructed
  • Risk of denigration harms identified

“We tested the risk of certain kinds of denigration with CLIP by classifying images of people from Fairface into crime-related and non-human animal categories.”

Industry trend: Leading companies now report disaggregated metrics in model cards.

System cards vs model cards

A newer concept: System Cards

As AI systems become more complex, a single “model card” isn’t enough. System cards document the entire AI system, not just one model.

What’s included in a system card:

  • Multiple models working together
  • How components interact
  • Human oversight mechanisms
  • Deployment context and safeguards
  • Red teaming and safety evaluations

Who’s using system cards:

  • OpenAI: Released o1 System Card in December 2024
  • Anthropic: Publishes system cards for each Claude release
  • Meta: Released 22 system cards in 2023 for their AI products

Model Card vs System Card:

Aspect   | Model Card             | System Card
Scope    | Single model           | Entire system
Focus    | Performance            | Safety + deployment
Audience | Developers             | Developers + public
Includes | Training data, metrics | Red teaming, safeguards

LLM Training Data Governance 🗂️

How large AI datasets are built

The data behind modern AI:

Large language models and image generators are trained on massive datasets scraped from the web. Understanding how these datasets are created is essential to understanding their biases.

Common Crawl: The foundation

  • Non-profit that crawls the web monthly
  • Billions of web pages archived
  • Raw HTML, metadata, and extracted text
  • Free and open for anyone to use
  • No consent from content creators

The pipeline:

  1. Common Crawl provides raw web data
  2. Organisations filter and clean it
  3. Datasets like LAION, The Pile, C4 are created
  4. These train models like GPT, DALL-E, Stable Diffusion
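Step 2 of the pipeline above, filtering and cleaning, can be sketched in a few lines. Real pipelines (e.g. the one behind C4) use far more sophisticated heuristics; the rules and pages below are invented for illustration:

```python
# Toy web-text filtering: deduplicate and drop low-quality pages.
raw_pages = [
    "Welcome to my cooking blog! Today we bake sourdough bread at home.",
    "click here click here click here",  # repetitive boilerplate
    "Welcome to my cooking blog! Today we bake sourdough bread at home.",  # dup
    "404 Not Found",                     # error page, too short
]

def keep(text):
    """Very rough quality heuristics (illustrative only)."""
    words = text.split()
    if len(words) < 5:                         # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.5:     # highly repetitive
        return False
    return True

seen, cleaned = set(), []
for page in raw_pages:
    if page not in seen and keep(page):        # dedupe + quality filter
        seen.add(page)
        cleaned.append(page)

print(cleaned)  # only the cooking-blog page survives
```

Every filtering choice like this shapes what the final model learns, which is exactly why these steps belong in a datasheet.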

Common Crawl archives the web monthly

Source: Common Crawl

Data provenance: Where did this come from?

Data provenance = Tracking data from source to use

Think of it like a family tree for your data.

Why it matters:

  1. Debugging: Where did this error originate?
  2. Compliance: Can we legally use this data?
  3. Quality: Has this data been properly validated?
  4. Reproducibility: Can we recreate this analysis?

Tools for tracking provenance:

  • Data Provenance Explorer (MIT): Generates summaries of dataset sources, licenses, and allowable uses
  • DVC (Data Version Control): Git for data
  • MLflow: Tracks experiments and data versions

Data lineage diagram

Source: Geeks for Geeks

Analogy: Like a food supply chain. If there’s contamination, you need to trace it back to the source!

Simplified example: Tracking data with Python

What this code does:

This is a simplified example of how you might track where your data comes from. In practice, tools like DVC do this automatically.

Key concepts:

  • source: Where did the data originate?
  • collected_at: When was it gathered?
  • transformations: What processing was applied?
  • version: Which version of the data is this?

Why this matters:

If your model starts behaving badly, you can trace back and ask: “Did the data change? Was a filter removed?”

# Simple data provenance tracking
# This is metadata, not actual data

provenance = {
    "dataset_name": "restaurant_reviews",
    "version": "2.1",
    "source": "yelp_api",
    "collected_at": "2024-01-15",
    "transformations": [
        "removed_duplicates",
        "filtered_english_only",
        "anonymised_usernames"
    ],
    "row_count": 50000,
    "known_issues": [
        "US restaurants only",
        "2023 reviews missing"
    ]
}

This creates a “paper trail” for your dataset.
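Continuing the sketch above: each processing step should append to the record rather than overwrite it, so history is preserved. The `record_step` helper below is hypothetical, not part of any real tool:

```python
# Append-only provenance updates: every transformation bumps the version.
def record_step(provenance, step, new_row_count=None):
    """Record a transformation and bump the minor version."""
    provenance["transformations"].append(step)
    major, minor = provenance["version"].split(".")
    provenance["version"] = f"{major}.{int(minor) + 1}"
    if new_row_count is not None:
        provenance["row_count"] = new_row_count
    return provenance

provenance = {
    "dataset_name": "restaurant_reviews",
    "version": "2.1",
    "transformations": ["removed_duplicates"],
    "row_count": 50000,
}

record_step(provenance, "filtered_short_reviews", new_row_count=48200)
print(provenance["version"])          # → 2.2
print(provenance["transformations"])  # → two steps recorded
```

Because old steps are never deleted, you can always answer "was a filter removed?" by diffing versions.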

The FAIR Principles 🔬

FAIR: The gold standard for research data

FAIR Principles (2016):

A framework for making research data more useful, developed by the scientific community.

F - Findable:

  • Data has a unique, persistent identifier (like a DOI)
  • Rich metadata describes the data
  • Registered in a searchable resource

A - Accessible:

  • Retrievable by identifier using standard protocols
  • Metadata remains accessible even if data isn’t
  • Clear authentication if access is restricted

I - Interoperable:

  • Uses standardised formats and vocabularies
  • Can be combined with other datasets
  • Machine-readable

R - Reusable:

  • Clear usage license
  • Detailed provenance information
  • Meets domain-relevant standards
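The four pillars can be made concrete as a metadata record. Field names and values below are illustrative, not a formal FAIR schema; the DOI and URL are placeholders:

```python
# A sketch of FAIR-style metadata for a hypothetical dataset.
fair_metadata = {
    # Findable: persistent identifier + rich description
    "doi": "10.5281/zenodo.0000000",                  # placeholder DOI
    "title": "Restaurant reviews benchmark",
    "keywords": ["sentiment", "reviews", "NLP"],
    # Accessible: standard protocol, clear access rules
    "access_url": "https://example.org/data/reviews",  # hypothetical URL
    "access_rights": "open",
    # Interoperable: standard, machine-readable format
    "format": "CSV (UTF-8)",
    # Reusable: license + provenance
    "license": "CC-BY-4.0",
    "provenance": "Collected via public API, 2024; deduplicated",
}

# Quick completeness check: one required field per FAIR pillar
required = ["doi", "access_url", "format", "license"]
missing = [k for k in required if not fair_metadata.get(k)]
print("FAIR-complete!" if not missing else f"missing: {missing}")
```

Funders increasingly run automated checks much like this one before accepting a data deposit.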

FAIR principles

Source: GO-FAIR

Why FAIR matters for AI:

The NIH, EU, and major funders now require FAIR data practices. AI datasets that follow FAIR principles are more trustworthy.

Data Governance in Practice 🏛️

Data rights movements: The future

New movements emerging:

1. Data dignity:

2. Opt-out rights:

  • GDPR: Right to be forgotten
  • CCPA: Right to know and delete
  • AI training opt-out (some systems support this now)

3. Data trusts:

  • Independent organisations manage data on your behalf
  • Negotiate terms and protect privacy

Data rights

Growing recognition that data ownership matters

Discussion: Should YOU be paid when your data trains an AI that makes billions?

Discussion: Who owns the data? 💭

Scenario:

You wrote a poem and posted it online in 2018. An AI company scraped it, and now their AI can write poems “inspired by” your style.

Questions to debate:

  1. Did the company do anything wrong?
  2. Should you be compensated?
  3. Should you be able to opt out retroactively?

Different perspectives:

Tech companies: “It’s fair use, like learning from reading books”

Artists: “You’re profiting from my creative work without consent”

Lawyers: “Current law wasn’t designed for this”

Users: “I just want cool AI. I don’t care about the source”

Your turn: Where do you stand?

Putting It Into Practice 🛠️

Tools for documentation

Good news: Tools exist!

For Datasheets:

  • The template questions from Gebru et al. (2018)

For Model Cards:

  • Hugging Face's model card templates on the Hub

For Data Lineage: (advanced)

  • DVC and MLflow (mentioned earlier)

Mintlify

Trend: Major platforms now require documentation to publish datasets/models!

Summary 📚

Main takeaways

  • Datasheets: Document datasets with motivation, composition, and limitations

  • Model cards: Document models with intended use, performance by group, and ethics

  • System cards: Document entire AI systems including safety evaluations

  • Data lineage: Track where data came from and how it changed

  • FAIR principles: Findable, Accessible, Interoperable, Reusable data

  • Consent matters: Much AI training data lacks proper consent

  • Career opportunity: Documentation skills are increasingly valuable

…and that’s all for today! 🎉