DATASCI 185: Introduction to AI Applications

Lecture 19: Privacy and Data Protection

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 🔒

Recap of last class

  • We covered AI regulation around the world
  • EU AI Act: World’s first comprehensive AI law with a risk-based approach (unacceptable/banned, high-risk, limited-risk, and minimal-risk tiers)
  • US approach: Sectoral regulation with state-level initiatives
  • China: State control over content with innovation goals; requires algorithm registration
  • No global consensus yet, but convergence on key principles is likely
  • Today: How does AI intersect with privacy?

Oh, well 😅
  • Note: A filming crew will be in the room next lecture to record a video about this class! If you prefer not to be filmed, please let me know 😉

Lecture overview

What we will cover today

Part 1: Privacy in the AI age

  • Why AI makes privacy harder
  • Data collection at unprecedented scale
  • Inference and prediction of sensitive attributes

Part 2: Legal frameworks

  • GDPR and data protection principles
  • Rights of individuals
  • US privacy landscape

Part 3: Technical approaches

  • Differential privacy
  • Federated learning
  • Privacy-preserving machine learning (synthetic data)

Part 4: Challenges and tensions

  • Case study: Clearview AI
  • Discussion: where’s your line on privacy?
  • What can you do?

Meme of the day 😄

Source: Caniphis

Good news of the day 📰

Source: Financial Times

Privacy in the AI age 🔍

Why AI makes privacy harder

AI makes privacy harder in ways that older laws did not anticipate:

Scale of collection:

  • Traditional surveillance was limited by staffing; AI-enabled collection runs around the clock at almost no cost
  • Every interaction becomes a data point, creating “data exhaust” (data generated as a byproduct of normal use) from ordinary activities

Inference capabilities:

  • AI can predict sensitive information from seemingly innocuous data, e.g. shopping patterns reveal health conditions, typing speed reveals mood, social networks reveal politics
  • Things you never disclosed can be inferred

Persistence:

  • Data doesn’t decay; models trained today affect you forever and decisions follow you across contexts
  • “Right to be forgotten” is technically hard

Source: AI Multiple

“If you’re not paying for the product, you are the product.” This phrase understates it! Even when you pay, your data is often the real product!

What can be inferred from your data?

Research has shown AI can predict:

Data source | What can be inferred
Facebook likes | Political views, sexuality, personality
Smartphone sensors | Depression, anxiety, Parkinson’s
Typing patterns | Age, gender, emotional state
Purchase history | Pregnancy, health conditions
Location data | Home address, workplace, religion
Voice recordings | Emotional state, health, age

Famous example: Target and pregnancy

  • Target’s marketing algorithm identified a pregnant teenager
  • Sent baby product coupons to her home
  • Her father complained to Target, saying his daughter was still in high school
  • Days later, the father apologised to Target and confirmed the pregnancy
  • The algorithm knew before her family did (NYT)

Source: Time

The inference problem: You can control what you share. You cannot control what can be inferred from what you share.
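
To make the inference problem concrete, here is a minimal sketch (not from the lecture) of how an off-the-shelf classifier can learn to predict an undisclosed attribute from seemingly innocuous behavioural features. All data below is simulated, and the feature names are purely illustrative assumptions.

```python
# Minimal sketch: inferring a sensitive attribute from "innocuous" features.
# All data is simulated; feature names are illustrative assumptions only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000

# Hidden sensitive attribute we never ask users for (e.g. a health condition)
sensitive = rng.integers(0, 2, size=n)

# "Innocuous" behavioural signals that happen to correlate with it
late_night_purchases = rng.normal(loc=2 + 3 * sensitive, scale=1.5, size=n)
unscented_lotion     = rng.normal(loc=1 + 2 * sensitive, scale=1.0, size=n)
supplement_searches  = rng.normal(loc=0.5 + 1.5 * sensitive, scale=1.0, size=n)

X = np.column_stack([late_night_purchases, unscented_lotion, supplement_searches])
X_train, X_test, y_train, y_test = train_test_split(X, sensitive, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print(f"Accuracy predicting the undisclosed attribute: {clf.score(X_test, y_test):.2f}")
```

The specific numbers do not matter; the mechanism does. None of these features is sensitive on its own, yet together they predict something the person never disclosed.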

Training data and privacy

LLMs have a training data problem:

  • Trained on internet-scale data
  • That data includes personal information
  • Models can memorise and regurgitate training data
  • Your name, address, phone number might be in there!

Demonstrated attacks:

  • Researchers extracted verbatim training data from GPT-2 (Carlini et al., 2021)
  • Including names, phone numbers, email addresses
  • Extraction attacks continue to improve
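
As a rough illustration of the general recipe behind such audits, here is a toy sketch: sample freely from a model, then flag outputs that match PII-like patterns. It assumes the Hugging Face transformers library and the public gpt2 checkpoint, and it is a simplification, not the actual pipeline from Carlini et al.

```python
# Toy sketch of a training-data leakage audit: sample from a model,
# then flag outputs that look like personal information.
# Assumes the `transformers` library and the public "gpt2" checkpoint.
import re
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Very rough PII patterns (illustrative only, not production-grade)
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

samples = generator("Contact information:", max_new_tokens=50,
                    num_return_sequences=5, do_sample=True)

for s in samples:
    text = s["generated_text"]
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.findall(text):
            print(f"possible {label} in model output: {match}")
```

Flagged strings may be memorised or simply hallucinated; a real audit would check matches against the training corpus.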

The consent problem:

  • Most people don’t know what’s in training sets
  • “Publicly available” ≠ “consented to AI training”

Source: The Hacker News

When you ask an LLM about yourself, it might actually know things from training data you never shared with it directly.

Discussion: would you share? 🤔

Quick poll (raise your hand):

Would you share your data if…

  1. A health app predicts disease risk but sells data to insurers?
  2. A smart home device improves comfort but records all conversations?
  3. A job search site personalises results but shares with employers?
  4. A social app connects you with friends but builds a profile for advertisers?
  5. An AI tutor helps you learn but reports to your school?

The usual pattern:

  • People say they care about privacy
  • But they don’t act like they care
  • This is called the privacy paradox

Why?

  • The costs of protecting privacy are immediate; the harms of sharing feel distant
  • Default settings favour sharing
  • Terms of service are unreadable
  • “Everyone does it” normalisation
  • Benefits are tangible, harms are abstract
  • We’re not good at probabilistic thinking

GDPR fundamentals

The General Data Protection Regulation (2016/679, effective 2018) is the EU’s comprehensive privacy law.

Main principles:

  1. Lawfulness, fairness, transparency: Process data legally and openly
  2. Purpose limitation: Use data only for the purpose it was collected for
  3. Data minimisation: Collect only what’s necessary
  4. Accuracy: Keep data accurate and up to date
  5. Storage limitation: Don’t keep longer than needed
  6. Integrity and confidentiality: Protect data properly
  7. Accountability: Demonstrate compliance

Lawful bases for processing include:

  • Consent (freely given, specific, informed)
  • Legal obligation
  • Public interest
  • Legitimate interests (most contested: companies use this to justify processing without consent, e.g., fraud prevention or training AI models)

GDPR applies to any organisation processing EU residents’ data, wherever they’re located. It has global reach, like the EU AI Act. This means a US startup with European users must comply with GDPR, even if it has no office in the EU.

Individual rights under GDPR

You have the right to:

Right | What it means
Access | Get a copy of your data
Rectification | Correct inaccurate data
Erasure | “Right to be forgotten”
Portability | Move data to another service
Restriction | Limit processing
Object | Stop certain processing
Not be profiled | Reject automated decisions

Article 22: automated decisions

  • You have the right not to be subject to decisions based solely on automated processing that have legal or similarly significant effects on you
  • There must be human involvement in consequential decisions
  • Exceptions exist for contracts and explicit consent

Source: EU GDPR Portal

In practice: These rights exist on paper, but exercising them is often difficult. Companies make it hard to find the right forms, respond slowly, or claim exemptions.

GDPR and AI tensions

Purpose limitation vs model training:

  • You gave data for one purpose
  • Can it train a model for another?
  • AI companies claim legitimate interest

Data minimisation vs big data:

  • AI works better with more data
  • GDPR says collect only what’s necessary
  • What’s “necessary” for a foundation model? Everything?

Right to explanation vs black boxes:

  • If an AI makes a decision about you, you can ask why
  • But many AI systems can’t explain themselves

Right to erasure vs model training:

  • Can you demand your data be removed from a trained model?
  • “Unlearning” is technically very difficult

Source: CertPro

GDPR was written before the LLM era. Applying 2018 law to 2026 technology creates interpretation challenges.

US privacy landscape

No comprehensive federal privacy law, as we saw in previous lectures

Instead, a patchwork of sector-specific rules:

Law | Scope | Protects
HIPAA | Healthcare data | Patient medical records
FERPA | Education records | Student academic data
COPPA | Children’s data | Under-13 online activity
GLBA | Financial data | Bank and loan records
FCRA | Credit reporting | Credit scores and history
ECPA | Electronic comms | Emails, calls, stored data

State laws filling the gap:

  • California (CCPA/CPRA): Most comprehensive, GDPR-like rights
  • Virginia, Colorado, Connecticut: Similar frameworks
  • Illinois BIPA: Biometric data, with private right of action
  • Many more states following, as we saw before

Source: IAPP

GDPR enforcement

Major fines (selected):

Company | Fine | What they did wrong
Meta (2023) | €1.2B | Transferred EU user data to US servers without adequate safeguards
Amazon (2021) | €746M | Processed personal data for targeted ads without valid consent
Meta (2022) | €405M | Exposed children’s accounts and contact info on Instagram
Google (2022) | €150M | Made it harder to reject cookies than to accept them
TikTok (2023) | €345M | Failed to protect children’s privacy settings and data

Patterns:

  • Big tech companies are primary targets
  • Data transfers, consent, children are focus areas
  • Fines are getting larger
  • Enforcement varies by country (Ireland’s regulator is often criticised as too lenient)

Source: Secureframe

€1.2 billion sounds big, but Meta’s revenue that year (2023) was about $135 billion. Is that a fine or a cost of doing business?

Technical approaches to privacy 🛡️

Differential privacy

Differential privacy is a mathematical framework for privacy-preserving data analysis (Dwork & Roth, 2014).

  • Add carefully calibrated random noise to query results before releasing them
  • For example: “How many people in this dataset have diabetes?” The true answer is 137, but the system returns 137 ± some noise (say, 134 or 141)
  • Individual records stay hidden, but aggregate patterns remain visible

Formal guarantee:

  • The distribution of outputs is nearly the same whether or not any single individual’s record is in the dataset
  • Privacy loss is quantified by epsilon (ε): small ε = strong privacy but noisier results; large ε = weaker privacy but more accurate results
  • This is a provable, mathematical guarantee, not just a promise

Source: Flower AI

Real-world use: Apple uses differential privacy for emoji suggestions, Google for Chrome usage stats, US Census for 2020 data.
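
Here is a minimal sketch of the Laplace mechanism for a counting query, assuming a simulated 0/1 dataset and a query whose answer changes by at most 1 when one person is added or removed (so the noise scale is 1/ε):

```python
# Minimal sketch of the Laplace mechanism for a differentially private count.
# The dataset is simulated; the epsilon values are illustrative.
import numpy as np

rng = np.random.default_rng(42)
has_diabetes = rng.integers(0, 2, size=500)   # toy dataset of 0/1 records

def private_count(values, epsilon):
    """Return a count with Laplace noise calibrated to sensitivity 1 / epsilon."""
    true_count = int(values.sum())
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

true = int(has_diabetes.sum())
for eps in (0.1, 1.0, 10.0):
    noisy = private_count(has_diabetes, eps)
    print(f"epsilon={eps:>4}: true count {true}, released {noisy:.1f}")
# Small epsilon -> more noise (stronger privacy); large epsilon -> closer to the truth.
```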

Federated learning

Federated learning trains AI models without centralising data (McMahan et al., 2017).

How it works:

  1. A central server sends the same model to many devices
  2. Each device trains the model on its own local data
  3. Each device sends back only what it learned (updated model weights), not the data itself
  4. The server averages all the updates into an improved model
  5. The improved model is sent back to devices
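
A minimal sketch of steps 1–4 (essentially federated averaging, FedAvg) under simplifying assumptions: every “device” is simulated in one process, the model is a small linear regression, and each round of local training is a single gradient step:

```python
# Minimal sketch of federated averaging (FedAvg) for a linear model.
# Everything is simulated; real systems add client sampling, weighting,
# and secure aggregation on top of this basic loop.
import numpy as np

rng = np.random.default_rng(0)
n_devices, n_features = 10, 3
global_w = np.zeros(n_features)

# Each device holds its own local data that never leaves the "device"
local_data = [(rng.normal(size=(20, n_features)),   # X: local features
               rng.normal(size=20))                 # y: local targets
              for _ in range(n_devices)]

def local_update(w, X, y, lr=0.1):
    """One gradient step of least-squares regression on local data only."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

for _ in range(5):
    # 1-3. Server sends the model out; each device trains locally and
    #      returns only its updated weights, never its data.
    updates = [local_update(global_w.copy(), X, y) for X, y in local_data]
    # 4. Server averages the returned weights into a new global model.
    global_w = np.mean(updates, axis=0)

print("global model after 5 rounds:", np.round(global_w, 3))
```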

Why it matters for privacy:

  • Your raw data never leaves your device
  • The server never sees your messages, photos, or browsing history
  • But it is not perfect: model updates can still leak information about the training data, and coordinating thousands of devices is complex

Source: Wikipedia

Example: Google’s Gboard uses federated learning to improve predictions. Your typing stays on your phone; only model improvements are shared.

Synthetic data for privacy

What if you could train a model on data that looks real but isn’t? That is the idea behind synthetic data (Jordon et al., 2022).

How it works:

  • Train a generative model on real data
  • Model learns statistical patterns and correlations
  • Generate new, fake data points that look real
  • Train your AI on synthetic data instead
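
A minimal sketch of this pipeline, using a deliberately simple generative model (a multivariate Gaussian fitted to simulated patient records); real systems rely on richer generators such as GANs or diffusion models, and the column names here are illustrative:

```python
# Minimal sketch of synthetic data generation: fit a simple generative model
# (here just a multivariate Gaussian) to real records, then sample fake ones.
# The "real" records are simulated; the columns are illustrative.
import numpy as np

rng = np.random.default_rng(1)

# Pretend these are real patient records: [age, systolic blood pressure, BMI]
real = np.column_stack([
    rng.normal(55, 12, size=1_000),
    rng.normal(130, 15, size=1_000),
    rng.normal(27, 4, size=1_000),
])

# "Train" the generative model: estimate the mean and covariance of the real data
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Generate synthetic records that follow the same statistical patterns
synthetic = rng.multivariate_normal(mean, cov, size=1_000)

print("real means     :", np.round(real.mean(axis=0), 1))
print("synthetic means:", np.round(synthetic.mean(axis=0), 1))
# Downstream models can now be trained on `synthetic` instead of `real`.
```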

Privacy benefits and limitations:

  • No real personal data in training pipeline
  • Individuals should not be identifiable from synthetic records (when the generator is built carefully)
  • Can share data freely for research and collaboration
  • However, synthetic data may not capture rare cases well
  • Risk of “overfitting” to original data (memorisation)

Source: GOV.UK

Real-world use: Healthcare organisations use synthetic patient data for research. Financial institutions test fraud detection on synthetic transactions.

Do these techniques actually help?

What they can do:

  • Allow useful research on sensitive datasets (medical, financial) while protecting privacy
  • Give engineers concrete tools, not just good intentions
  • Provide mathematical guarantees (differential privacy) or architectural separation (federated learning)

What they cannot do:

  • Stop companies from collecting data in the first place
  • Fix the power imbalance between users and platforms
  • Prevent governments from demanding access to data
  • Make users understand how their data is used

However…

The best privacy protection is not collecting data at all. But AI companies have the opposite incentive: more data = better models = more profit.

Technical fixes work within the system. They don’t change the system itself.

Watch out for “privacy washing”: companies announce differential privacy with large epsilon values (weak privacy) or federated learning that still collects metadata. The tools are real, but the way they are deployed can be misleading.

Challenges and tensions ⚖️

Case study: Clearview AI 🔍

  • In January 2020, the New York Times revealed that Clearview AI had secretly scraped more than 3 billion photos from Facebook, Instagram, LinkedIn, and other platforms; the database has since grown past 30 billion images
  • Built a facial recognition database without anyone’s consent!
  • Sold access to over 2,400 US law enforcement agencies, plus corporations and wealthy individuals
  • US police used the database nearly a million times
  • People never knew their faces were in the database

The legal fallout:

Source: Library of Congress

Clearview claimed their service was only for law enforcement. Investigation revealed they gave accounts to anyone willing to pay, including investors’ friends. The ACLU described it as putting everyone into a “perpetual police line-up”.

Quick quiz: What’s wrong and how would you fix it? 🎯

  1. A social media company uses your messages to train an AI without telling you

  2. A shopping app collects your exact location every 30 seconds, even when not in use

  3. A company keeps customer data “just in case” with no deletion policy

  4. An AI hiring tool rejects candidates without any human review

Where’s your line? 💬

Scenario:

A new app offers free health monitoring. It tracks:

  • Your heart rate and sleep
  • Your location and activity
  • What you eat and drink
  • Your social interactions

In exchange, it provides:

  • Personalised health advice
  • Early warning of health issues
  • Discounts from health insurers
  • Connection with others like you

Let’s discuss together:

  1. Would you use this app?
  2. What would make you change your mind?
  3. What data would be “too much”?
  4. Does it matter who runs it (tech company, hospital, government)?
  5. Would you feel differently if it were mandatory?

There’s no right answer. The point is to identify your own values and understand what trade-offs you’re willing to make.

What can you do?

Individual actions:

  • Review privacy settings and use privacy-focused tools (Signal, DuckDuckGo, Firefox)
  • Limit location sharing and use different email addresses for different services
  • Exercise your GDPR/CCPA rights (access, deletion requests) and be skeptical of “personalisation”

Limitations of individual action:

  • Opting out often means losing service. Privacy is collective
  • Your data can be inferred from others’ data, and power imbalance is structural

Collective action:

  • Support privacy legislation and demand transparency from companies
  • Support organisations fighting for privacy (EFF, EPIC, noyb)
  • Choose services that respect privacy and vote for privacy-protecting candidates

Using Signal instead of WhatsApp helps, but it won’t change how insurers use your health data. That takes legislation.

Summary 📝

Main takeaways

AI and privacy

  • AI enables collection at unprecedented scale
  • Inference makes “non-sensitive” data sensitive
  • Models can memorise and leak personal information

Legal frameworks

  • GDPR: Comprehensive rights-based approach
  • US: Patchwork of sectoral laws
  • Tension between AI development and data protection

Technical approaches

  • Differential privacy: Add noise, preserve patterns
  • Federated learning: Train without centralising
  • Synthetic data: Train on artificial data, protect originals
  • All have trade-offs; none is a silver bullet

Tensions

  • Same data enables benefits and surveillance
  • Trade-off between utility and privacy exists but is overstated

Key insights

  • Collection is the core problem
  • Privacy is collective, not just individual
  • Technical fixes don’t address power imbalances
  • Rules and oversight determine whether data helps or harms

The Target pregnancy case and the Clearview AI scraping both came down to the same thing: someone used your data in a way you never agreed to.

… and that’s all for today! 🎉