DATASCI 185: Introduction to AI Applications

Lecture 24: Long-term Safety, Alignment, and Future of AI

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 😊

Recap of last class

  • Last class: misinformation, deepfakes, and trust online
  • Scale, speed, accessibility of synthetic content
  • Political manipulation, non-consensual intimate imagery, fraud
  • System 1 vs System 2, confirmation bias, bandwagon effect
  • Detection (AI vs AI), provenance, platform policies, media literacy
  • Today: Looking further ahead at long-term safety and alignment

Source: Chris Ume

Lecture overview

What we will cover today

Part 1: AI safety

  • Near-term vs long-term concerns
  • Why safety matters now
  • Concrete problems in AI safety

Part 2: The alignment problem

  • What is alignment?
  • Why it’s hard
  • Current approaches

Part 3: Future trajectories

  • Where is AI heading?
  • Expert disagreement
  • Scenarios to consider

Part 4: What can we do?

  • Research directions
  • Governance approaches
  • Your role

Meme of the day!

Source: Cheezburger

AI safety 🛡️

Near-term vs long-term concerns

Near-term safety concerns:

  • Bias and discrimination (already happening)
  • Privacy violations
  • Job displacement
  • Misinformation (covered last lecture)
  • Security vulnerabilities

Long-term safety concerns:

  • Alignment: AI pursuing wrong goals
  • Loss of human control
  • Concentration of power
  • Existential risk (controversial)
  • Some argue long-term concerns distract from near-term harms
  • Others argue near-term focus misses bigger picture
  • “Black Swan” events
  • They’re connected: Safe systems now → safer systems later

Source: Sætra & Danaher (2023)

Why think about long-term safety now?

  • AI capabilities advancing faster than expected
  • Safety research takes time
  • Hard to retrofit safety later
  • Better to be prepared

Historical analogies:

  • Nuclear: Developed first, safety after
  • Internet: Security an afterthought
  • Social media: Harms discovered in deployment
  • Biotech: Ongoing debate
  • We’re still early enough to shape development
  • Safety research is growing
  • Industry increasingly engaged, governments are starting to catch up

Concrete problems in AI safety

Amodei et al. (2016) identified five big challenges:

  • Safe exploration: How to learn without taking dangerous actions
  • Avoiding negative side effects: Don’t break things while achieving goals
  • Avoiding reward hacking: Don’t game the objective
  • Scalable oversight: How to supervise complex systems
  • Robustness to distributional shift: Handle novel situations safely

Why these matter:

  • Not speculative, but already happening in deployed systems
  • Scale with capability
  • Unsolved even for current AI
  • Let’s analyse these issues in more detail… 🤓

Source: Brian Christian

Safe exploration

  • Learning requires trying new things
  • Some actions are irreversible
  • “Explore safely” is hard to specify

Examples:

  • Robot learning to walk: Don’t break yourself
  • Self-driving: Don’t explore by crashing
  • Financial AI: Don’t bankrupt the company
  • Medical AI: Don’t kill patients while learning

Current approaches:

  • Simulation: Learn in low-risk environments first
  • Conservative policies: Keep actions within a safe region
  • Human oversight: Ask before novel actions
  • Reward shaping: Penalise dangerous states
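Two of these approaches can be sketched in a few lines. This is a toy illustration, not any deployed system: the states, penalty values, and transition table are all invented for the example.

```python
# Toy sketch of reward shaping and conservative exploration in a
# hypothetical gridworld. All states and numbers are made up.
SAFE_STATES = {0, 1, 2, 3}
DANGEROUS_STATES = {4}        # e.g. "cliff" cells the agent should avoid
GOAL_STATE = 3

def shaped_reward(state: int) -> float:
    """Base reward plus a safety penalty on dangerous states."""
    reward = 1.0 if state == GOAL_STATE else 0.0
    if state in DANGEROUS_STATES:
        reward -= 10.0        # heavy penalty: exploring here is costly
    return reward

def safe_actions(state: int, transitions: dict) -> list:
    """Conservative policy: only consider actions that stay in the safe region."""
    return [action for action, next_state in transitions[state].items()
            if next_state in SAFE_STATES]
```

The catch, as the next slide asks, is that this only works if we already know which states are dangerous.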

Source: The Wall Street Journal

How do you specify “safe” without already knowing everything about the domain?

Negative side effects

  • AI optimises for specified objective
  • It ignores everything else
  • Unintended consequences not in objective
  • “You didn’t say not to…”

Classic example (thought experiment):

  • Robot tasked with fetching coffee
  • Knocks over obstacles in path
  • Harms humans in the way
  • Technically: Coffee fetched ✓

Real-world version:

  • Content algorithm maximises engagement
  • Side effect: Polarisation, addiction
  • Not in objective function
  • Nobody specified “don’t harm society”

Political polarisation is real, but it is often an emergent side effect rather than a designed outcome

The world is too complex to specify everything we care about. AI must somehow learn to preserve what matters.
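One proposed mitigation is adding an impact penalty to the objective. The sketch below is purely illustrative (the items, scores, and penalty weight are invented): the naive objective maximises engagement alone, while the penalised one subtracts a crude estimate of unintended harm.

```python
# Hypothetical recommender objective with and without an impact penalty.
# All numbers are made up for illustration.
items = [
    {"name": "outrage clip", "engagement": 9.0, "est_impact": 8.0},
    {"name": "how-to video", "engagement": 6.0, "est_impact": 1.0},
]

def naive_score(item):
    return item["engagement"]                       # side effects ignored

def penalised_score(item, weight=1.0):
    return item["engagement"] - weight * item["est_impact"]

best_naive = max(items, key=naive_score)            # picks the outrage clip
best_penalised = max(items, key=penalised_score)    # picks the how-to video
```

Even on this tiny example the two objectives choose different items, which is the whole point: what you leave out of the objective shapes behaviour as much as what you put in.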

The alignment problem 🎯

What is alignment?

  • Alignment: AI systems that do what we want
  • Pursue goals we actually intend
  • Not just stated objectives
  • Behave safely even when we can’t supervise

Why the word “alignment”:

  • AI goals aligned with human values
  • Not orthogonal or opposed
  • Not pursuing random objectives
  • Not technically satisfying letter while violating spirit

Why it’s hard:

  • Hard to specify what we want
  • Humans disagree about values
  • Our stated preferences aren’t always true preferences
  • Context matters enormously

Claude can fake alignment even without being asked! 😧

Source: Anthropic

Why alignment is hard

Specification problem:

  • Can’t write down everything we care about
  • Edge cases are infinite
  • Values are context-dependent
  • Humans can’t articulate their own values perfectly

Goodhart’s Law: (remember this!)

  • “When a measure becomes a target…”
  • “…it ceases to be a good measure”
  • Optimise metric ≠ achieve goal

Examples:

  • Click rates → clickbait
  • Engagement → addiction
  • GDP → environmental destruction
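Goodhart’s law can be made concrete with a toy calculation (the strategies and numbers below are invented): once we optimise the proxy metric, the winner on the proxy is not the winner on the true goal.

```python
# Toy numeric illustration of Goodhart's law with made-up numbers:
# clicks are the proxy metric, reader satisfaction the true goal.
strategies = {
    "clickbait headline": {"clicks": 1000, "satisfaction": 0.2},
    "accurate headline":  {"clicks": 400,  "satisfaction": 0.9},
}

best_on_proxy = max(strategies, key=lambda s: strategies[s]["clicks"])
best_on_goal  = max(strategies, key=lambda s: strategies[s]["satisfaction"])

# Proxy and goal disagree: the measure stopped being a good measure.
assert best_on_proxy != best_on_goal
```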

The King Midas problem: (Russell, 2014)

  • Gets exactly what he asked for
  • Not what he wanted
  • Literal interpretation of wishes
  • Common AI failure mode

Current alignment approaches

RLHF (Reinforcement Learning from Human Feedback): (OpenAI, 2019)

  • Humans rank AI outputs
  • Train reward model on rankings
  • Optimise AI to satisfy reward model
  • Current industry standard
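The "train a reward model on rankings" step is often formulated as a pairwise (Bradley-Terry style) loss. This is a minimal sketch of that idea, not any lab’s actual training code: given a human preference "A over B", the loss is small when the reward model already scores A higher.

```python
import math

def preference_loss(reward_a: float, reward_b: float) -> float:
    """-log sigmoid(r_A - r_B): low when the preferred output A scores higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_a - reward_b))))

# Correct ranking (preferred output scores higher) -> low loss.
low = preference_loss(2.0, 0.0)
# Inverted ranking -> high loss, pushing the model to fix its scores.
high = preference_loss(0.0, 2.0)
```

Minimising this loss over many human comparisons is what turns rankings into a scalar reward signal that the policy can then be optimised against.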

Constitutional AI: (Anthropic, 2022)

  • AI critiques its own outputs
  • Uses set of principles (“constitution”)
  • Iteratively improves
  • Reduces need for human labelling
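The critique-and-revise loop can be sketched as follows. In the real method a language model does the critiquing and revising against the constitution; here a hypothetical rule-based critic stands in so the loop is runnable.

```python
# Highly simplified sketch of a Constitutional-AI-style loop.
# The principle, critic, and reviser are invented stand-ins.
PRINCIPLES = ["avoid insults"]

def critique(text: str) -> list:
    """Return the list of principles the text violates (toy rule-based check)."""
    return ["avoid insults"] if "idiot" in text else []

def revise(text: str) -> str:
    """Rewrite the text to address the critique (toy replacement)."""
    return text.replace("idiot", "person")

def constitutional_step(text: str) -> str:
    for _ in PRINCIPLES:
        if critique(text):
            text = revise(text)
    return text
```

The revised outputs then serve as training data, which is how the method reduces the need for human labelling.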

Debate and recursive reward modelling: (Leike et al., 2024)

  • AI systems argue with each other
  • Humans judge who’s more persuasive
  • Scales human oversight: judging an argument is easier than producing one

Limitations of current approaches:

  • RLHF: Can learn to game evaluators
  • Human feedback: Expensive, biased
  • Principles: Still have to specify them right
  • None are complete solutions

These methods work well enough for current systems. Whether they scale to more capable AI is unknown.

The alignment survey

Ngo et al. (2022) overview:

  • Surveys the alignment problem and current techniques from a deep learning perspective

However…

  • No solved problem: Active research area
  • Multiple approaches needed
  • Uncertainty about scaling
  • Theoretical foundations lacking

Open questions:

  1. Will RLHF scale?
  2. Can we detect deceptive alignment?
  3. How do we handle value disagreement?
  4. What’s the role of interpretability?
  5. When is good enough “good enough”?

Research directions:

  • Interpretability: Understanding what AI “thinks”
  • Robustness: Maintaining alignment under pressure
  • Scalable oversight: Supervising superhuman AI
  • Value learning: Inferring human values
  • Corrigibility: Making AI correctable

Discussion: whose values? 🤔

The values problem:

If we align AI to human values…

  • Whose human values?
  • Developers? Users? Affected parties?
  • Majority? Consensus? Universal?
  • Present generation? Future?

Discuss (2 minutes):

  1. Should AI reflect your values or “universal” values?
  2. What happens when values conflict?
  3. Who should decide?
  4. Is this a technical or political question?

Some perspectives:

  • Libertarian: Each user controls their AI
  • Democratic: Majority decides
  • Rights-based: Some things off-limits regardless
  • Technocratic: Experts decide
  • Pluralist: Multiple systems for different contexts

Alignment is not just technical. It’s deeply political. Technical solutions can’t avoid value choices.

Future trajectories 🚀

Where is AI heading?

Current trends:

  • Models getting larger and more capable
  • More general-purpose systems
  • Multimodal capabilities

Uncertainties:

  • Will scaling continue to work?
  • When do we hit diminishing returns?
  • What capabilities emerge unexpectedly?
  • How fast is too fast?

Expert disagreement:

  • Wide variation in predictions
  • Timeline estimates vary by decades
  • Some expect AGI soon, others never
  • Confidence often exceeds evidence

Source: Our World in Data

Honest answer: Nobody knows. Uncertainty is high. Be sceptical of confident predictions.

Scenarios to consider

Scenario 1: Gradual improvement

  • AI gets better slowly
  • Humans adapt
  • Society adjusts incrementally
  • Most likely?

Scenario 2: Capability jumps

  • Sudden breakthroughs
  • Unexpected capabilities
  • Rapid deployment
  • Less time to adapt

Scenario 3: Plateau

  • Current approaches hit limits
  • Progress slows dramatically
  • Different paradigms needed
  • Also possible

Scenario 4: Transformative AI

  • Systems vastly more capable than humans
  • Fundamental changes to economy, society
  • Either very good or very bad
  • Uncertain timeline

Stuart Russell’s perspective

Russell’s argument (TED talk):

  • We’re building systems whose objectives we don’t fully control
  • Standard AI paradigm: optimise given objective
  • Problem: We can’t specify objectives correctly
  • Need fundamentally different approach

His proposal:

  1. AI should be uncertain about objectives
  2. Should defer to humans
  3. Should allow itself to be switched off
  4. Objectives learned, not programmed

“You cannot fetch the coffee if you are dead”

  • AI should value human control
  • Not because told to
  • Because it’s instrumentally useful for actual goals
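The instrumental argument can be shown with a toy expected-utility calculation (the probabilities and utilities are invented for this sketch): an AI uncertain about whether its plan helps or harms does better by deferring to a human who can switch it off.

```python
# Toy "off-switch" intuition with made-up numbers: the AI believes its
# planned action helps the human (utility +1) with probability p_good,
# and harms them (utility -1) otherwise.
p_good = 0.6

# Acting directly: the AI eats the downside when it is wrong.
ev_act_directly = p_good * 1 + (1 - p_good) * (-1)   # = 0.2

# Deferring: a human who knows the true utility lets good actions
# through and switches off bad ones (utility 0 instead of -1).
ev_defer = p_good * 1 + (1 - p_good) * 0             # = 0.6
```

Because `ev_defer > ev_act_directly`, allowing itself to be switched off is instrumentally useful to an AI that is genuinely uncertain about its objective, which is Russell’s point.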

Source: TED

Russell is a leading AI researcher who takes long-term safety very seriously. Worth hearing him out.

Existential risk: the debate

Those who worry:

  • AI could become uncontrollable
  • Misaligned powerful AI = catastrophe
  • Even small probability × huge harm = important
  • “We might not get a second chance”
  • Hinton, Bengio, Russell, many others

Those who are sceptical:

  • Speculative, distracts from real present harms
  • “Sci-fi thinking” not grounded
  • We control the off switch
  • Capabilities overstated

The actual state of debate:

  • Serious researchers on both sides
  • Uncertainty and disagreement are genuine

What can we do? 🔧

Research directions

Technical safety research:

  • Interpretability: Understand AI internals
  • Robustness: Resist adversarial inputs
  • Alignment: Ensure AI pursues intended goals
  • Oversight: Scale human supervision
  • Honesty: AI that doesn’t deceive

Growing field:

  • Anthropic, OpenAI, DeepMind safety teams, academic labs
  • Still small relative to capabilities

Career opportunity:

  • High-impact work
  • Talent shortage
  • Many paths in (technical, governance, policy)

Source: The Wall Street Journal (2026)

If this interests you: Look into 80,000 Hours, AI safety bootcamps, and safety-focused labs.

Your role

We’ve already discussed what governments are (not) doing about AI. What can you do?

  • Understand the technology and debates
  • Evaluate claims critically
  • Support good governance
  • Vote based on AI policy positions

As professionals:

  • If you work with AI: Consider impacts
  • Ask safety questions in your organisation
  • Advocate for responsible development
  • Question work you find unethical

As people:

  • Engage with policy discussions
  • Don’t be passive!

Finding balance

Neither panic nor complacency:

  • Don’t dismiss all concerns as sci-fi
  • Don’t accept all concerns uncritically
  • Evidence over vibes!

What reasonable concern looks like:

  • Take problems seriously
  • But don’t catastrophise
  • Stay grounded in current reality
  • Plan for multiple futures

What reasonable optimism looks like:

  • Acknowledge real progress
  • Believe problems are solvable
  • Trust in human adaptability

Summary 📝

Main takeaways

Safety landscape

  • Near-term and long-term concerns both matter
  • Concrete problems: safe exploration, side effects, reward hacking
  • Acting now is important given uncertainty

Alignment

  • Getting AI to do what we actually want
  • Hard because: specification, Goodhart’s law, value disagreement
  • Current approaches: RLHF, Constitutional AI
  • Major open problems remain

Future trajectories

  • Genuine uncertainty about where AI is heading
  • Multiple scenarios worth considering
  • Expert disagreement is real
  • Plan for multiple futures

What we can do

  • Technical safety research
  • Governance and policy
  • Individual choices
  • Neither panic nor complacency

The future of AI is not determined. Our choices matter. Understanding the landscape is step one.

… and that’s all for today! 🎉