DATASCI 185: Introduction to AI Applications

Lecture 19: Privacy and Data Protection

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 🔒

Recap of last class

  • We covered AI regulation around the world
  • EU AI Act: World’s first comprehensive AI law with a risk-based approach (unacceptable/banned, high-risk, limited-risk, and minimal-risk tiers)
  • US approach: Sectoral regulation with state-level initiatives
  • China: State control over content with innovation goals; requires algorithm registration
  • No global consensus yet, but convergence on key principles is likely
  • Today: How does AI intersect with privacy?

Oh, well 😅
  • Note: A filming crew will be in the room next lecture to record a video about this class! If you prefer not to be filmed, please let me know 😉

Lecture overview

What we will cover today

Part 1: Privacy in the AI age

  • Why AI makes privacy harder
  • Data collection at unprecedented scale
  • Inference and prediction of sensitive attributes

Part 2: Legal frameworks

  • GDPR and data protection principles
  • Rights of individuals
  • US privacy landscape

Part 3: Technical approaches

  • Differential privacy
  • Federated learning
  • Privacy-preserving machine learning (synthetic data)

Part 4: Challenges and tensions

  • Case study: Clearview AI
  • Discussion: where’s your line on privacy?
  • What can you do?

Meme of the day 😄

Source: Caniphis

Good news of the day 📰

Source: Financial Times

Privacy in the AI age 🔍

Why AI makes privacy harder

AI makes privacy harder in ways that older laws did not anticipate:

Scale of collection:

  • Traditional surveillance was limited by staffing; AI-enabled collection runs around the clock at almost no cost
  • Every interaction becomes a data point, creating “data exhaust” (data generated as a byproduct of normal use) from ordinary activities

Inference capabilities:

  • AI can predict sensitive information from seemingly innocuous data, e.g. shopping patterns reveal health conditions, typing speed reveals mood, social networks reveal politics
  • Things you never disclosed can be inferred

Persistence:

  • Data doesn’t decay; models trained today affect you forever and decisions follow you across contexts
  • “Right to be forgotten” is technically hard

Source: AI Multiple

“If you’re not paying for the product, you are the product.” This phrase understates it! Even when you pay, your data is often the real product!

What can be inferred from your data?

Research has shown AI can predict:

Data source | What can be inferred
Facebook likes | Political views, sexuality, personality
Smartphone sensors | Depression, anxiety, Parkinson’s
Typing patterns | Age, gender, emotional state
Purchase history | Pregnancy, health conditions
Location data | Home address, workplace, religion
Voice recordings | Emotional state, health, age

Famous example: Target and pregnancy

  • Target’s marketing algorithm identified a pregnant teenager
  • Sent baby product coupons to her home
  • Her father complained to Target, saying his daughter was still in high school
  • Days later, the father apologised to Target and confirmed the pregnancy
  • The algorithm knew before her family did (NYT)

Source: Time

The inference problem: You can control what you share. You cannot control what can be inferred from what you share.
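
To make the inference problem concrete, here is a minimal sketch (not from the lecture) of how an off-the-shelf classifier can learn to predict an undisclosed attribute from seemingly innocuous behavioural features. All data below is simulated, and the feature names are purely illustrative assumptions.

```python
# Minimal sketch: inferring a sensitive attribute from "innocuous" features.
# All data is simulated; feature names are illustrative assumptions only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000

# Hidden sensitive attribute we never ask users for (e.g. a health condition)
sensitive = rng.integers(0, 2, size=n)

# "Innocuous" behavioural signals that happen to correlate with it
late_night_purchases = rng.normal(loc=2 + 3 * sensitive, scale=1.5, size=n)
unscented_lotion     = rng.normal(loc=1 + 2 * sensitive, scale=1.0, size=n)
supplement_searches  = rng.normal(loc=0.5 + 1.5 * sensitive, scale=1.0, size=n)

X = np.column_stack([late_night_purchases, unscented_lotion, supplement_searches])
X_train, X_test, y_train, y_test = train_test_split(X, sensitive, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print(f"Accuracy predicting the undisclosed attribute: {clf.score(X_test, y_test):.2f}")
```

The specific numbers do not matter; the mechanism does. None of these features is sensitive on its own, yet together they predict something the person never disclosed.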

Training data and privacy

LLMs have a training data problem:

  • Trained on internet-scale data
  • That data includes personal information
  • Models can memorise and regurgitate training data
  • Your name, address, phone number might be in there!

Demonstrated attacks:

  • Researchers extracted verbatim training data from GPT-2 (Carlini et al., 2021)
  • Including names, phone numbers, email addresses
  • Extraction attacks continue to improve
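
As a rough illustration of the general recipe behind such audits, here is a toy sketch: sample freely from a model, then flag outputs that match PII-like patterns. It assumes the Hugging Face transformers library and the public gpt2 checkpoint, and it is a simplification, not the actual pipeline from Carlini et al.

```python
# Toy sketch of a training-data leakage audit: sample from a model,
# then flag outputs that look like personal information.
# Assumes the `transformers` library and the public "gpt2" checkpoint.
import re
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Very rough PII patterns (illustrative only, not production-grade)
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

samples = generator("Contact information:", max_new_tokens=50,
                    num_return_sequences=5, do_sample=True)

for s in samples:
    text = s["generated_text"]
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.findall(text):
            print(f"possible {label} in model output: {match}")
```

Flagged strings may be memorised or simply hallucinated; a real audit would check matches against the training corpus.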

The consent problem:

  • Most people don’t know what’s in training sets
  • “Publicly available” ≠ “consented to AI training”

Source: The Hacker News

When you ask an LLM about yourself, it might actually know things from training data you never shared with it directly.

Discussion: would you share? 🤔

Quick poll (raise your hand):

Would you share your data if…

  1. A health app predicts disease risk but sells data to insurers?
  2. A smart home device improves comfort but records all conversations?
  3. A job search site personalises results but shares with employers?
  4. A social app connects you with friends but builds a profile for advertisers?
  5. An AI tutor helps you learn but reports to your school?

The usual pattern:

  • People say they care about privacy
  • But they don’t act like they care
  • This is called the privacy paradox

Why?

  • The costs of protecting privacy are immediate; the harms of sharing feel distant
  • Default settings favour sharing
  • Terms of service are unreadable
  • “Everyone does it” normalisation
  • Benefits are tangible, harms are abstract
  • We’re not good at probabilistic thinking

GDPR fundamentals

The General Data Protection Regulation (2016/679, effective 2018) is the EU’s comprehensive privacy law.

Main principles:

  1. Lawfulness, fairness, transparency: Process data legally and openly
  2. Purpose limitation: Use data only for the purpose it was collected for
  3. Data minimisation: Collect only what’s necessary
  4. Accuracy: Keep data accurate and up to date
  5. Storage limitation: Don’t keep longer than needed
  6. Integrity and confidentiality: Protect data properly
  7. Accountability: Demonstrate compliance

Lawful bases for processing include:

  • Consent (freely given, specific, informed)
  • Legal obligation
  • Public interest
  • Legitimate interests (most contested: companies use this to justify processing without consent, e.g., fraud prevention or training AI models)

GDPR applies to any organisation processing EU residents’ data, wherever they’re located. It has global reach, like the EU AI Act. This means a US startup with European users must comply with GDPR, even if it has no office in the EU.

Individual rights under GDPR

You have the right to:

Right | What it means
Access | Get a copy of your data
Rectification | Correct inaccurate data
Erasure | “Right to be forgotten”
Portability | Move data to another service
Restriction | Limit processing
Object | Stop certain processing
Not be profiled | Reject automated decisions

Article 22: automated decisions

  • You have the right not to be subject to decisions based solely on automated processing that have legal or similarly significant effects on you
  • There must be human involvement in consequential decisions
  • Exceptions exist for contracts and explicit consent

Source: EU GDPR Portal

In practice: These rights exist on paper, but exercising them is often difficult. Companies make it hard to find the right forms, respond slowly, or claim exemptions.

GDPR and AI tensions

Purpose limitation vs model training:

  • You gave data for one purpose
  • Can it train a model for another?
  • AI companies claim legitimate interest

Data minimisation vs big data:

  • AI works better with more data
  • GDPR says collect only what’s necessary
  • What’s “necessary” for a foundation model? Everything?

Right to explanation vs black boxes:

  • If an AI makes a decision about you, you can ask why
  • But many AI systems can’t explain themselves

Right to erasure vs model training:

  • Can you demand your data be removed from a trained model?
  • “Unlearning” is technically very difficult

Source: CertPro

GDPR was written before the LLM era. Applying 2018 law to 2026 technology creates interpretation challenges.

US privacy landscape

No comprehensive federal privacy law, as we saw in previous lectures

Instead, a patchwork of sector-specific rules:

Law | Scope | Protects
HIPAA | Healthcare data | Patient medical records
FERPA | Education records | Student academic data
COPPA | Children’s data | Under-13 online activity
GLBA | Financial data | Bank and loan records
FCRA | Credit reporting | Credit scores and history
ECPA | Electronic comms | Emails, calls, stored data

State laws filling the gap:

  • California (CCPA/CPRA): Most comprehensive, GDPR-like rights
  • Virginia, Colorado, Connecticut: Similar frameworks
  • Illinois BIPA: Biometric data, with private right of action
  • Many more states following, as we saw before

Source: IAPP

GDPR enforcement

Major fines (selected):

Company | Fine | What they did wrong
Meta (2023) | €1.2B | Transferred EU user data to US servers without adequate safeguards
Amazon (2021) | €746M | Processed personal data for targeted ads without valid consent
Meta (2022) | €405M | Exposed children’s accounts and contact info on Instagram
Google (2022) | €150M | Made it harder to reject cookies than to accept them
TikTok (2023) | €345M | Failed to protect children’s privacy settings and data

Patterns:

  • Big tech companies are primary targets
  • Data transfers, consent, children are focus areas
  • Fines are getting larger
  • Enforcement varies by country (Ireland’s regulator is often criticised as too lenient)

Source: Secureframe

€1.2 billion sounds big, but Meta’s revenue that year (2023) was about $135 billion. Is that a fine or a cost of doing business?

Technical approaches to privacy 🛡️

Differential privacy

Differential privacy is a mathematical framework for privacy-preserving data analysis (Dwork & Roth, 2014).

  • Add carefully calibrated random noise to query results before releasing them
  • For example: “How many people in this dataset have diabetes?” The true answer is 137, but the system returns 137 ± some noise (say, 134 or 141)
  • Individual records stay hidden, but aggregate patterns remain visible

Formal guarantee:

  • The distribution of outputs is nearly the same whether or not any single individual’s record is in the dataset
  • Privacy loss is quantified by epsilon (ε): small ε = strong privacy but noisier results; large ε = weaker privacy but more accurate results
  • This is a provable, mathematical guarantee, not just a promise

Source: Flower AI

Real-world use: Apple uses differential privacy for emoji suggestions, Google for Chrome usage stats, US Census for 2020 data.
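
Here is a minimal sketch of the Laplace mechanism for a counting query, assuming a simulated 0/1 dataset and a query whose answer changes by at most 1 when one person is added or removed (so the noise scale is 1/ε):

```python
# Minimal sketch of the Laplace mechanism for a differentially private count.
# The dataset is simulated; the epsilon values are illustrative.
import numpy as np

rng = np.random.default_rng(42)
has_diabetes = rng.integers(0, 2, size=500)   # toy dataset of 0/1 records

def private_count(values, epsilon):
    """Return a count with Laplace noise calibrated to sensitivity 1 / epsilon."""
    true_count = int(values.sum())
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

true = int(has_diabetes.sum())
for eps in (0.1, 1.0, 10.0):
    noisy = private_count(has_diabetes, eps)
    print(f"epsilon={eps:>4}: true count {true}, released {noisy:.1f}")
# Small epsilon -> more noise (stronger privacy); large epsilon -> closer to the truth.
```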

Federated learning

Federated learning trains AI models without centralising data (McMahan et al., 2017).

How it works:

  1. A central server sends the same model to many devices
  2. Each device trains the model on its own local data
  3. Each device sends back only what it learned (updated model weights), not the data itself
  4. The server averages all the updates into an improved model
  5. The improved model is sent back to devices
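
A minimal sketch of steps 1–4 (essentially federated averaging, FedAvg) under simplifying assumptions: every “device” is simulated in one process, the model is a small linear regression, and each round of local training is a single gradient step:

```python
# Minimal sketch of federated averaging (FedAvg) for a linear model.
# Everything is simulated; real systems add client sampling, weighting,
# and secure aggregation on top of this basic loop.
import numpy as np

rng = np.random.default_rng(0)
n_devices, n_features = 10, 3
global_w = np.zeros(n_features)

# Each device holds its own local data that never leaves the "device"
local_data = [(rng.normal(size=(20, n_features)),   # X: local features
               rng.normal(size=20))                 # y: local targets
              for _ in range(n_devices)]

def local_update(w, X, y, lr=0.1):
    """One gradient step of least-squares regression on local data only."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

for _ in range(5):
    # 1-3. Server sends the model out; each device trains locally and
    #      returns only its updated weights, never its data.
    updates = [local_update(global_w.copy(), X, y) for X, y in local_data]
    # 4. Server averages the returned weights into a new global model.
    global_w = np.mean(updates, axis=0)

print("global model after 5 rounds:", np.round(global_w, 3))
```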

Why it matters for privacy:

  • Your raw data never leaves your device
  • The server never sees your messages, photos, or browsing history
  • But it is not perfect: model updates can still leak information about the training data, and coordinating thousands of devices is complex

Source: Wikipedia

Example: Google’s Gboard uses federated learning to improve predictions. Your typing stays on your phone; only model improvements are shared.

Synthetic data for privacy

What if you could train a model on data that looks real but isn’t? That is the idea behind synthetic data (Jordon et al., 2022).

How it works:

  • Train a generative model on real data
  • Model learns statistical patterns and correlations
  • Generate new, fake data points that look real
  • Train your AI on synthetic data instead
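
A minimal sketch of this pipeline, using a deliberately simple generative model (a multivariate Gaussian fitted to simulated patient records); real systems rely on richer generators such as GANs or diffusion models, and the column names here are illustrative:

```python
# Minimal sketch of synthetic data generation: fit a simple generative model
# (here just a multivariate Gaussian) to real records, then sample fake ones.
# The "real" records are simulated; the columns are illustrative.
import numpy as np

rng = np.random.default_rng(1)

# Pretend these are real patient records: [age, systolic blood pressure, BMI]
real = np.column_stack([
    rng.normal(55, 12, size=1_000),
    rng.normal(130, 15, size=1_000),
    rng.normal(27, 4, size=1_000),
])

# "Train" the generative model: estimate the mean and covariance of the real data
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Generate synthetic records that follow the same statistical patterns
synthetic = rng.multivariate_normal(mean, cov, size=1_000)

print("real means     :", np.round(real.mean(axis=0), 1))
print("synthetic means:", np.round(synthetic.mean(axis=0), 1))
# Downstream models can now be trained on `synthetic` instead of `real`.
```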

Privacy benefits and limitations:

  • No real personal data in training pipeline
  • Individuals should not be identifiable from synthetic records (when the generator is built carefully)
  • Can share data freely for research and collaboration
  • However, synthetic data may not capture rare cases well
  • Risk of “overfitting” to original data (memorisation)

Source: GOV.UK

Real-world use: Healthcare organisations use synthetic patient data for research. Financial institutions test fraud detection on synthetic transactions.

Do these techniques actually help?

What they can do:

  • Allow useful research on sensitive datasets (medical, financial) while protecting privacy
  • Give engineers concrete tools, not just good intentions
  • Provide mathematical guarantees (differential privacy) or architectural separation (federated learning)

What they cannot do:

  • Stop companies from collecting data in the first place
  • Fix the power imbalance between users and platforms
  • Prevent governments from demanding access to data
  • Make users understand how their data is used

However…

The best privacy protection is not collecting data at all. But AI companies have the opposite incentive: more data = better models = more profit.

Technical fixes work within the system. They don’t change the system itself.

Watch out for “privacy washing”: companies announce differential privacy with large epsilon values (weak privacy) or federated learning that still collects metadata. The tools are real, but the way they are deployed can be misleading.

Challenges and tensions ⚖️

Case study: Clearview AI 🔍

  • In January 2020, the New York Times revealed that Clearview AI had secretly scraped more than 3 billion photos from Facebook, Instagram, LinkedIn, and other platforms; the database has since grown past 30 billion images
  • Built a facial recognition database without anyone’s consent!
  • Sold access to over 2,400 US law enforcement agencies, plus corporations and wealthy individuals
  • US police used the database nearly a million times
  • People never knew their faces were in the database

The legal fallout:

Source: Library of Congress

Clearview claimed their service was only for law enforcement. Investigation revealed they gave accounts to anyone willing to pay, including investors’ friends. The ACLU described it as putting everyone into a “perpetual police line-up”.

Quick quiz: What’s wrong and how would you fix it? 🎯

  1. A social media company uses your messages to train an AI without telling you

  2. A shopping app collects your exact location every 30 seconds, even when not in use

  3. A company keeps customer data “just in case” with no deletion policy

  4. An AI hiring tool rejects candidates without any human review

Where’s your line? 💬

Scenario:

A new app offers free health monitoring. It tracks:

  • Your heart rate and sleep
  • Your location and activity
  • What you eat and drink
  • Your social interactions

In exchange, it provides:

  • Personalised health advice
  • Early warning of health issues
  • Discounts from health insurers
  • Connection with others like you

Let’s discuss together:

  1. Would you use this app?
  2. What would make you change your mind?
  3. What data would be “too much”?
  4. Does it matter who runs it (tech company, hospital, government)?
  5. Would you feel differently if it were mandatory?

There’s no right answer. The point is to identify your own values and understand what trade-offs you’re willing to make.

What can you do?

Individual actions:

  • Review privacy settings and use privacy-focused tools (Signal, DuckDuckGo, Firefox)
  • Limit location sharing and use different email addresses for different services
  • Exercise your GDPR/CCPA rights (access, deletion requests) and be skeptical of “personalisation”

Limitations of individual action:

  • Opting out often means losing service. Privacy is collective
  • Your data can be inferred from others’ data, and power imbalance is structural

Collective action:

  • Support privacy legislation and demand transparency from companies
  • Support organisations fighting for privacy (EFF, EPIC, noyb)
  • Choose services that respect privacy and vote for privacy-protecting candidates

Using Signal instead of WhatsApp helps, but it won’t change how insurers use your health data. That takes legislation.

Summary 📝

Main takeaways

AI and privacy

  • AI enables collection at unprecedented scale
  • Inference makes “non-sensitive” data sensitive
  • Models can memorise and leak personal information

Legal frameworks

  • GDPR: Comprehensive rights-based approach
  • US: Patchwork of sectoral laws
  • Tension between AI development and data protection

Technical approaches

  • Differential privacy: Add noise, preserve patterns
  • Federated learning: Train without centralising
  • Synthetic data: Train on artificial data, protect originals
  • All have trade-offs; none is a silver bullet

Tensions

  • Same data enables benefits and surveillance
  • Trade-off between utility and privacy exists but is overstated

Key insights

  • Collection is the core problem
  • Privacy is collective, not just individual
  • Technical fixes don’t address power imbalances
  • Rules and oversight determine whether data helps or harms

The Target pregnancy case and the Clearview AI scraping both came down to the same thing: someone used your data in a way you never agreed to.

… and that’s all for today! 🎉