DATASCI 185: Introduction to AI Applications

Lecture 19: Privacy and Data Protection

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 🔒

Recap of last class

  • We covered AI regulation around the world
  • EU AI Act: World’s first comprehensive AI law with a risk-based approach (unacceptable/banned, high-risk, limited-risk, minimal-risk tiers)
  • US approach: Sectoral regulation with state-level initiatives
  • China: State control over content with innovation goals; requires algorithm registration
  • No global consensus yet, but convergence on key principles is likely
  • Today: How does AI intersect with privacy?

Oh, well 😅

Lecture overview

What we will cover today

Part 1: Privacy in the AI age

  • Why AI makes privacy harder
  • Data collection at unprecedented scale
  • Inference and prediction of sensitive attributes

Part 2: Legal frameworks

  • GDPR and data protection principles
  • Rights of individuals
  • US privacy landscape

Part 3: Technical approaches

  • Differential privacy
  • Federated learning
  • Privacy-preserving machine learning

Part 4: Challenges and tensions

  • AI as a force for good vs surveillance
  • What can you do?

Meme of the day 😄

Source: Caniphis

Privacy in the AI age 🔍

Why AI makes privacy harder

AI changes the privacy equation in several ways:

Scale of collection:

  • Traditional surveillance was limited; AI-enabled collection is cheap, pervasive, and continuous
  • Every interaction becomes a data point, creating “data exhaust” from ordinary activities

Inference capabilities:

  • AI can predict sensitive information from seemingly innocuous data, e.g. shopping patterns reveal health conditions, typing speed reveals mood, social networks reveal politics
  • Things you never disclosed can be inferred

Persistence:

  • Data doesn’t decay; models trained today affect you forever and decisions follow you across contexts
  • “Right to be forgotten” is technically hard

Source: AI Multiple

“If you’re not paying for the product, you are the product.” This phrase understates it. Even when you pay, your data is often the real product.

What can be inferred from your data?

Research has shown AI can predict:

Data source        | What can be inferred
Facebook likes     | Political views, sexuality, personality
Smartphone sensors | Depression, anxiety, Parkinson’s
Typing patterns    | Age, gender, emotional state
Purchase history   | Pregnancy, health conditions
Location data      | Home address, workplace, religion
Voice recordings   | Emotional state, health, age

Famous example: Target and pregnancy

  • Target’s marketing algorithm identified a pregnant teenager
  • Sent baby product coupons to her home
  • Her father found out before she told him
  • The algorithm knew before her family did

Source: Time

The inference problem: You can control what you share. You cannot control what can be inferred from what you share.

Training data and privacy

LLMs have a training data problem:

  • Trained on internet-scale data
  • That data includes personal information
  • Models can memorise and regurgitate training data
  • Your name, address, phone number might be in there

Demonstrated attacks:

  • Researchers extracted verbatim training data from GPT-2
  • Including names, phone numbers, email addresses
  • Extraction attacks continue to improve

The consent problem:

  • Most people don’t know what’s in training sets
  • “Publicly available” ≠ “consented to AI training”

Source: The Hacker News

When you ask an LLM about yourself, it might actually know things from training data you never shared with it directly.

Discussion: would you share? 🤔

Quick poll (raise your hand):

Would you share your data if…

  1. A health app predicts disease risk but sells data to insurers?
  2. A smart home device improves comfort but records all conversations?
  3. A job search site personalises results but shares with employers?
  4. A social app connects you with friends but builds a profile for advertisers?
  5. An AI tutor helps you learn but reports to your school?

The usual pattern:

  • People say they care about privacy
  • But they don’t act like they care
  • This is called the privacy paradox

Why?

  • Costs are immediate, harms are distant
  • Default settings favour sharing
  • Terms of service are unreadable
  • “Everyone does it” normalisation
  • Benefits are tangible, harms are abstract
  • We’re not good at probabilistic thinking

GDPR fundamentals

The General Data Protection Regulation (2018) is the EU’s comprehensive privacy law.

Key principles:

  1. Lawfulness, fairness, transparency: Process data legally and openly
  2. Data minimisation: Collect only what’s necessary
  3. Accuracy: Keep data accurate and up to date
  4. Storage limitation: Don’t keep longer than needed
  5. Integrity and confidentiality: Protect data properly
  6. Accountability: Demonstrate compliance

Lawful bases for processing:

  • Consent (freely given, specific, informed)
  • Legal obligation
  • Vital interests
  • Public interest
  • Legitimate interests (most contested)

GDPR applies to any organisation processing EU residents’ data, wherever they’re located. It has global reach, like the EU AI Act.

Individual rights under GDPR

You have the right to:

Right           | What it means
Access          | Get a copy of your data
Rectification   | Correct inaccurate data
Erasure         | “Right to be forgotten”
Portability     | Move data to another service
Restriction     | Limit processing
Object          | Stop certain processing
Not be profiled | Reject automated decisions

Article 22: automated decisions

  • You have the right not to be subject to purely automated decisions
  • There must be human involvement in consequential decisions
  • Exceptions exist for contracts and explicit consent

Source: EU GDPR Portal

In practice: These rights exist on paper, but exercising them is often difficult. Companies make it hard to find the right forms, respond slowly, or claim exemptions.

GDPR and AI tensions

Purpose limitation vs model training:

  • You gave data for one purpose
  • Can it train a model for another?
  • AI companies claim legitimate interest

Data minimisation vs big data:

  • AI works better with more data
  • GDPR says collect only what’s necessary
  • What’s “necessary” for a foundation model?

Right to explanation vs black boxes:

  • If an AI makes a decision about you, you can ask why
  • But many AI systems can’t explain themselves

Right to erasure vs model training:

  • Can you demand your data be removed from a trained model?
  • “Unlearning” is technically very difficult

Source: CertPro

GDPR was written before the LLM era. Applying 2018 law to 2026 technology creates interpretation challenges.

US privacy landscape

No comprehensive federal privacy law

Instead, a patchwork of sector-specific rules:

Law   | Scope
HIPAA | Healthcare data
FERPA | Education records
COPPA | Children’s data
GLBA  | Financial data
FCRA  | Credit reporting
ECPA  | Electronic communications

State laws filling the gap:

  • California (CCPA/CPRA): Most comprehensive, GDPR-like rights
  • Virginia, Colorado, Connecticut: Similar frameworks
  • Illinois BIPA: Biometric data, with private right of action
  • Many more states following, as we saw in previous lectures

Source: IAPP

The US approach: regulate specific harms in specific sectors rather than establishing general data rights. This leaves many gaps.

GDPR enforcement

Major fines (selected):

Company       | Fine   | Reason
Meta (2023)   | €1.2B  | EU–US data transfers
Amazon (2021) | €746M  | Targeted advertising
Meta (2022)   | €405M  | Instagram children’s data
Google (2022) | €150M  | Cookie consent
TikTok (2023) | €345M  | Children’s privacy

Patterns:

  • Big tech companies are primary targets
  • Data transfers, consent, children are focus areas
  • Fines are getting larger
  • Enforcement varies by country (Ireland, home to many tech headquarters, is often criticised as slow)

Source: Secureframe

€1.2 billion sounds big, but Meta’s revenue was roughly $135 billion in 2023, the year of the fine. Is that a fine or a cost of doing business?

Technical approaches to privacy 🛡️

Differential privacy

Differential privacy is a mathematical framework for privacy-preserving data analysis.

The core idea:

  • Add carefully calibrated noise to data or results
  • The noise makes it impossible to tell whether any individual was in the dataset
  • But aggregate patterns remain visible

Formal guarantee:

  • The output of an analysis is nearly the same whether or not any single individual is included
  • An attacker learns almost nothing about any specific person
  • Privacy loss is quantified by epsilon (ε); smaller ε means stronger privacy
  • Trade-off: More noise = more privacy = less accuracy

Source: Flower AI

Real-world use: Apple uses differential privacy for emoji suggestions, Google for Chrome usage stats, and the US Census Bureau for the 2020 census.
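The noise-adding idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production mechanism: it implements the Laplace mechanism for a counting query, which has sensitivity 1 (adding or removing one person changes the true count by at most 1), so noise with scale 1/ε gives ε-differential privacy. The function names are ours, invented for this sketch.

```python
import random

def laplace_noise(scale: float) -> float:
    # The difference of two i.i.d. Exponential(1) draws
    # follows a Laplace(0, 1) distribution; rescale it.
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def private_count(records, predicate, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1, so Laplace noise with
    scale 1/epsilon is enough to mask any single individual.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)
```

With a large ε (say 5) the noisy count stays close to the truth; with a small ε (say 0.1) the noise can swamp the signal. That is the privacy/accuracy trade-off from the bullet above made concrete.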

Federated learning

Federated learning trains AI models without centralising data.

How it works:

  1. Central server sends model to devices
  2. Each device trains on its local data
  3. Devices send model updates (not data) back
  4. Server aggregates updates into improved model
  5. Repeat

Privacy benefits and limitations:

  • Raw data never leaves your device
  • Server only sees aggregated gradients
  • Much harder to extract individual information
  • But coordination is complex and models can still leak information

Source: Wikipedia

Example: Google’s Gboard uses federated learning to improve predictions. Your typing stays on your phone; only model improvements are shared.
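The five-step loop above can be sketched as a toy federated averaging (FedAvg) round. This is a deliberately simplified illustration under our own assumptions: a one-parameter linear model, plain SGD on each client, and weight averaging proportional to each client’s data size. Real systems (like Gboard’s) add secure aggregation, compression, and far larger models.

```python
import random

def local_update(w, data, lr=0.01, epochs=5):
    """Client side: train on local data; return new weight and sample count.
    The raw (x, y) pairs never leave this function."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x   # gradient of squared error for y ≈ w*x
            w -= lr * grad
    return w, len(data)

def federated_round(global_w, client_datasets):
    """Server side: send the model out, then average the returned
    weights, weighting each client by how much data it holds."""
    updates = [local_update(global_w, d) for d in client_datasets]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total

# Three clients, each holding private samples of y ≈ 3x (never shared)
random.seed(42)
clients = [
    [(x, 3 * x + random.gauss(0, 0.1)) for x in [random.random() for _ in range(20)]]
    for _ in range(3)
]
w = 0.0
for _ in range(30):
    w = federated_round(w, clients)
# After a few rounds, w approaches the true slope of 3, yet the
# server only ever saw weights, never any client's data points.
```

Note what the server observes: a single float per client per round. That is the privacy gain, and also the source of the limitation mentioned above, since even weight updates can leak information about the underlying data.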

Synthetic data for privacy

Synthetic data is artificially generated data that preserves the statistical properties of real data without containing actual personal information.

How it works:

  • Train a generative model on real data
  • Model learns statistical patterns and correlations
  • Generate new, fake data points that look real
  • Train your AI on synthetic data instead

Privacy benefits and limitations:

  • No real personal data in training pipeline
  • Individuals can’t be identified from synthetic records
  • Can share data freely for research and collaboration
  • However, synthetic data may not capture rare cases well
  • Risk of “overfitting” to original data (memorisation)

Source: GOV.UK

Real-world use: Healthcare organisations use synthetic patient data for research. Financial institutions test fraud detection on synthetic transactions.
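The fit-then-sample workflow above can be shown with a deliberately crude “generative model”: fit a Gaussian to each column of the real data and sample fake records from it. This is our own toy sketch; real synthetic-data systems use GANs, VAEs, or copulas, and crucially this version ignores correlations between columns, illustrating why synthetic data can miss structure and rare cases.

```python
import random
import statistics

def fit_synthesizer(real_rows):
    """Learn per-column mean and standard deviation from real records.
    (A minimal stand-in for training a generative model.)"""
    columns = list(zip(*real_rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]

def sample_synthetic(params, n, rng=random):
    """Generate n fake records matching the learned per-column marginals.
    No real record appears in the output."""
    return [tuple(rng.gauss(mu, sd) for mu, sd in params) for _ in range(n)]

# Hypothetical (age, BMI) records standing in for real patient data
real = [(34, 22.1), (51, 27.8), (29, 24.0), (63, 30.2), (45, 26.5)]
params = fit_synthesizer(real)
fake = sample_synthetic(params, 1000)
```

The synthetic rows share the real columns’ means and spreads but contain no actual patient, which is the point. The memorisation risk from the bullets above appears when the generative model is complex enough to reproduce training rows verbatim.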

Do these techniques actually help?

Honest assessment:

  • They help at the margins
  • They don’t solve fundamental problems
  • Collection is still the main issue

The data minimisation solution:

  • Best privacy protection: don’t collect data
  • But AI companies have opposite incentive
  • Technical fixes work around the problem
  • They don’t address the power imbalance
  • Companies announce privacy features with large epsilon (weak privacy)
  • Hard for users to evaluate claims

Challenges and tensions ⚖️

AI as a force for good vs surveillance

The promise:

  • Disease prediction saves lives
  • Personalised education helps students
  • Recommendation systems save time
  • Fraud detection protects consumers

The same system can be both good and bad:

  • Health app: Helps you exercise OR enables insurance discrimination
  • Location data: Optimises traffic OR tracks movements
  • Facial recognition: Finds missing children OR enables authoritarian control
  • Context and governance determine outcomes

Source: ACLU

“The technology is neutral” is a dodge. Technology embeds values in its design and deployment. The question is whose values.

What can you do?

Individual actions:

  • Review privacy settings and use privacy-focused tools (Signal, DuckDuckGo, Firefox)
  • Limit location sharing and use different email addresses for different services
  • Exercise your rights (access, deletion requests) and be skeptical of “personalisation”

Limitations of individual action:

  • Opting out often means losing service; privacy is collective, not just individual
  • Your data can be inferred from others’ data, and power imbalance is structural

Collective action:

  • Support privacy legislation and demand transparency from companies
  • Support organisations fighting for privacy (EFF, EPIC, etc.)
  • Choose services that respect privacy and vote for privacy-protecting candidates

Individual action is necessary but not sufficient. Systemic problems require systemic solutions. That means policy, not just personal choices.

Case study: Clearview AI 🔍

What happened:

  • Clearview AI scraped billions of photos from social media
  • Built a facial recognition database without consent
  • Sold access to law enforcement, corporations, wealthy individuals
  • People never knew their faces were in the database

Privacy violations:

  • No consent for data collection or use
  • Violated terms of service of platforms
  • GDPR fines: €20M in multiple EU countries
  • Illinois BIPA: Multiple class action settlements
  • UK, Australia, Canada: Ordered to delete data

Source: Library of Congress

Clearview claimed their service was only for law enforcement. Investigation revealed they gave accounts to anyone willing to pay, including investors’ friends.

Quick quiz: What’s wrong and how would you fix it? 🎯

For each scenario, discuss with your neighbour: What’s the problem? How would you fix it?

  1. A social media company uses your messages to train an AI without telling you

  2. A shopping app collects your exact location every 30 seconds, even when not in use

  3. A company keeps customer data “just in case” with no deletion policy

  4. An AI hiring tool rejects candidates without any human review

Problem + Fix:

  1. Using data for undisclosed purpose → Get explicit consent, clearly disclose AI training in privacy policy
  2. Collecting more data than necessary → Only track location when the app is actively being used
  3. Keeping data indefinitely → Set clear retention limits, delete old data automatically
  4. Fully automated decisions with major impact → Always have a human review hiring decisions

Class discussion: where’s your line? 💬

Scenario:

A new app offers free health monitoring. It tracks:

  • Your heart rate and sleep
  • Your location and activity
  • What you eat and drink
  • Your social interactions

In exchange, it provides:

  • Personalised health advice
  • Early warning of health issues
  • Discounts from health insurers
  • Connection with others like you

Let’s discuss together:

  1. Would you use this app?
  2. What would make you change your mind?
  3. What data would be “too much”?
  4. Does it matter who runs it (tech company, hospital, government)?
  5. Would you feel differently if it were mandatory?

There’s no right answer. The point is to identify your own values and understand what trade-offs you’re willing to make.

Summary and takeaways 📝

Main takeaways

AI and privacy

  • AI enables collection at unprecedented scale
  • Inference makes “non-sensitive” data sensitive
  • Models can memorise and leak personal information

Legal frameworks

  • GDPR: Comprehensive rights-based approach
  • US: Patchwork of sectoral laws
  • Tension between AI development and data protection

Technical approaches

  • Differential privacy: Add noise, preserve patterns
  • Federated learning: Train without centralising
  • Synthetic data: Train on artificial data, protect originals
  • All have trade-offs; none is a silver bullet

Tensions

  • Same data enables benefits and surveillance
  • Trade-off between utility and privacy exists but is overstated

Key insights

  • Collection is the core problem
  • Privacy is collective, not just individual
  • Technical fixes don’t address power imbalances
  • Context and governance determine outcomes

“Privacy is not about having something to hide. It’s about having the power to choose what to reveal.”

… and that’s all for today! 🎉