DATASCI 185: Introduction to AI Applications

Lecture 24: Long-term Safety, Alignment, and Future of AI

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 😊

Recap of last class

  • Last class: misinformation, deepfakes, and trust
  • Three types of false content: mis-, dis-, and malinformation, each with different intent
  • AI lowers the cost and skill needed to produce convincing fakes (text, image, audio, video)
  • Real-world harms: political manipulation, non-consensual imagery, financial fraud
  • Cognitive biases (System 1 vs 2, confirmation bias, bandwagon effect) explain why we fall for it
  • Responses: technical detection, content provenance (C2PA), platform policies, media literacy
  • Today: long-term safety and the alignment problem

Source: Chris Ume

Lecture overview

What we will cover today

Part 1: AI safety

  • Near-term vs long-term concerns
  • Why safety matters now
  • Concrete problems in AI safety

Part 2: The alignment problem

  • What is alignment?
  • Why it’s hard
  • Current approaches

Part 3: Future trajectories

  • Where is AI heading?
  • Expert disagreement
  • Scenarios to consider

Part 4: What can we do?

  • Research directions
  • Governance approaches
  • Your role

Meme of the day!

Source: Cheezburger

Funny news of the day! 🫧

Source: CNBC

AI safety 🛡️

Near-term vs long-term concerns

Near-term safety concerns:

  • Bias and discrimination (already happening)
  • Privacy violations
  • Job displacement
  • Misinformation (covered last lecture)
  • Security vulnerabilities

Long-term safety concerns:

  • Alignment: AI pursuing wrong goals
  • Loss of human control
  • Concentration of power
  • Existential risk (controversial)
  • Some argue long-term concerns distract from near-term harms
  • Others argue near-term focus misses bigger picture
  • “Black Swan” events
  • They’re connected: Safe systems now → safer systems later

Source: Sætra & Danaher (2023)

Why think about long-term safety now?

  • AI capabilities advancing faster than expected
  • Safety research takes time
  • Hard to retrofit safety later
  • Better to be prepared

Historical analogies:

  • Nuclear: developed first, safety after
  • Internet: security an afterthought
  • Social media: harms discovered in deployment
  • Biotech: ongoing debate
  • We’re still early enough to shape development
  • Safety research is growing, but still a fraction of what goes into capabilities
  • Major labs (Anthropic, DeepMind, OpenAI) now have dedicated safety teams

Concrete problems in AI safety

Amodei et al. (2016) identified five big challenges:

  • Safe exploration: how to learn without taking dangerous actions
  • Avoiding negative side effects: don’t break things while achieving goals
  • Avoiding reward hacking: don’t game the objective
  • Scalable oversight: how to supervise complex systems
  • Robustness to distributional shift: handle novel situations safely

Why these matter:

  • Not speculative, but already happening in deployed systems
  • Scale with capability
  • Unsolved even for current AI
  • Let’s analyse these issues in more detail… 🤓

Source: Brian Christian

Safe exploration

  • Learning requires trying new things
  • Some actions are irreversible
  • “Explore safely” is hard to specify

Examples:

  • Robot learning to walk: Don’t break yourself
  • Self-driving: Don’t explore by crashing
  • Financial AI: Don’t bankrupt the company
  • Medical AI: Don’t kill patients while learning

Current approaches:

  • Simulation: Learn in low-risk environments first
  • Conservative policies: Keep actions within a safe region
  • Human oversight: Ask before novel actions
  • Reward shaping: Penalise dangerous states
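A conservative policy with human oversight can be sketched in a few lines. This is a toy illustration, not a real safety mechanism: the risk numbers, action names, and threshold are all made up, and a real system would learn its risk model from data.

```python
# Sketch of a conservative policy: act only when estimated risk is low,
# otherwise defer to a human. All risk values and actions are hypothetical.
RISK_THRESHOLD = 0.2

KNOWN_RISKS = {"slow_forward": 0.05, "gentle_turn": 0.10, "sharp_swerve": 0.70}

def estimated_risk(action: str) -> float:
    # Novel (unseen) actions get maximum risk, so they trigger oversight.
    return KNOWN_RISKS.get(action, 1.0)

def decide(action: str) -> str:
    if estimated_risk(action) <= RISK_THRESHOLD:
        return f"execute:{action}"
    return f"ask_human:{action}"

print(decide("slow_forward"))  # execute:slow_forward
print(decide("sharp_swerve"))  # ask_human:sharp_swerve
print(decide("novel_action"))  # ask_human:novel_action
```

Note the design choice: unknown actions default to maximum risk, which operationalises “ask before novel actions” rather than “explore first, apologise later”.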

Source: The Wall Street Journal

How do you specify “safe” without already knowing everything about the domain?

Negative side effects

  • AI optimises for specified objective
  • It ignores everything else
  • Unintended consequences not in objective
  • “You didn’t say not to…”

Classic example (thought experiment):

  • Robot tasked with fetching coffee
  • Knocks over obstacles in path
  • Harms humans in the way
  • Technically: Coffee fetched ✓

Real-world version:

  • Content algorithm maximises engagement
  • Side effect: Polarisation, addiction
  • Not in objective function
  • Nobody specified “don’t harm society”

Political polarisation is real, but it was not designed; it emerged as a side effect

Nobody told the engagement algorithm “don’t polarise society.” If you don’t put it in the objective, the system won’t care about it.
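The engagement example can be made concrete with a toy objective (all numbers and policy names hypothetical): the polarisation side effect only influences the choice once it is written into the objective function.

```python
# Toy side-effect example: each feed policy has an engagement score and a
# polarisation side effect. Both numbers are invented for illustration.
policies = {
    "balanced_feed": {"engagement": 60, "polarisation": 5},
    "outrage_feed":  {"engagement": 90, "polarisation": 80},
}

def naive_objective(p):
    # What was actually specified: engagement only.
    return p["engagement"]

def penalised_objective(p, lam=1.0):
    # The side effect made explicit, weighted by lam.
    return p["engagement"] - lam * p["polarisation"]

best_naive = max(policies, key=lambda k: naive_objective(policies[k]))
best_penalised = max(policies, key=lambda k: penalised_objective(policies[k]))

print(best_naive)      # outrage_feed  (90 beats 60)
print(best_penalised)  # balanced_feed (55 beats 10)
```

The optimiser is doing exactly what it was told in both cases; the only difference is what made it into the objective.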

The alignment problem 🎯

What is alignment?

  • Alignment: AI systems that do what we want
  • Pursue goals we actually intend
  • Not just stated objectives
  • Behave safely even when we can’t supervise

Why the word “alignment”:

  • AI goals aligned with human values
  • Not orthogonal or opposed
  • Not pursuing random objectives
  • Not satisfying the letter of an objective while violating its spirit

Why it’s hard:

  • Hard to specify what we want
  • Humans disagree about values
  • Our stated preferences aren’t always true preferences
  • Context matters enormously

Claude can fake alignment even without being asked! 😧

Source: Anthropic

Alignment is hard!

Specification problem:

  • Can’t write down everything we care about
  • Edge cases are infinite
  • Values are context-dependent
  • Humans can’t articulate their own values perfectly

Goodhart’s Law: (remember this!)

  • “When a measure becomes a target…”
  • “…it ceases to be a good measure”
  • Optimise metric ≠ achieve goal

Examples:

  • Click rates → clickbait
  • Engagement → addiction
  • GDP → environmental destruction
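A tiny toy makes Goodhart’s law concrete (hypothetical scores): the proxy (clicks) and the true goal (reader value) roughly agree on ordinary content, but optimising the proxy selects exactly the case where they diverge.

```python
# Goodhart toy: clicks is a proxy for reader value. Numbers are invented.
headlines = {
    "Careful explainer":        {"clicks": 40, "value": 90},
    "Solid news report":        {"clicks": 55, "value": 80},
    "You won't BELIEVE this!!": {"clicks": 95, "value": 10},
}

# Optimise the measure vs optimise the goal.
by_proxy = max(headlines, key=lambda h: headlines[h]["clicks"])
by_goal = max(headlines, key=lambda h: headlines[h]["value"])

print(by_proxy)  # the clickbait wins once clicks become the target
print(by_goal)   # what we actually wanted
```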

The King Midas problem: (Russell, 2014)

  • Gets exactly what he asked for
  • Not what he wanted
  • Literal interpretation of wishes
  • Common AI failure mode

Current alignment approaches

RLHF (Reinforcement Learning from Human Feedback): (OpenAI, 2019)

  • Humans rank AI outputs
  • Train reward model on rankings
  • Optimise AI to satisfy reward model
  • Current industry standard
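The reward-modelling step at the heart of RLHF can be sketched with the Bradley-Terry preference model: fit one scalar reward per response so that human-preferred responses score higher. This is a toy with invented responses and preferences, not a real training pipeline (which would use a neural reward model over token sequences).

```python
# Toy RLHF reward modelling: learn scalar rewards from pairwise human
# preferences via gradient ascent on the Bradley-Terry log-likelihood.
import math

responses = ["helpful answer", "rude answer", "evasive answer"]
# Hypothetical human comparisons: (preferred_index, rejected_index)
preferences = [(0, 1), (0, 2), (2, 1)]

rewards = [0.0] * len(responses)
lr = 0.5
for _ in range(200):
    for win, lose in preferences:
        # P(win preferred over lose) = sigmoid(r_win - r_lose)
        p = 1 / (1 + math.exp(rewards[lose] - rewards[win]))
        # Push the preferred response up, the rejected one down.
        rewards[win] += lr * (1 - p)
        rewards[lose] -= lr * (1 - p)

best = max(range(len(responses)), key=lambda i: rewards[i])
print(responses[best])  # helpful answer
```

The “optimise AI to satisfy reward model” step would then train the policy against these learned rewards; this is also where reward hacking can creep in, since the policy optimises the learned proxy, not the humans themselves.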

Constitutional AI: (Anthropic, 2022)

  • AI critiques its own outputs
  • Uses set of principles (“constitution”)
  • Iteratively improves
  • Reduces need for human labelling
  • Amanda Askell’s work on AI ethics and philosophy

Debate and recursive reward modelling: (Leike et al., 2024)

  • AI systems argue with each other
  • Humans judge who’s more persuasive
  • Scales human oversight: it is easier to judge an argument than to produce one

Limitations of current approaches:

  • RLHF: Can learn to game evaluators
  • Human feedback: Expensive, biased
  • Principles: Still have to specify them right
  • None are complete solutions

These methods work well enough for current systems. Whether they scale to more capable AI is unknown.

The alignment survey

Ngo et al. overview (2022):

  • Surveys the alignment problem and current techniques
  • Framed from a deep learning perspective

However…

  • No solved problem: Active research area
  • Multiple approaches needed
  • Uncertainty about scaling
  • Theoretical foundations lacking

Open questions:

  1. Will RLHF scale?
  2. Can we detect deceptive alignment?
  3. How do we handle value disagreement?
  4. What’s the role of interpretability?
  5. When is good enough “good enough”?

Discussion: whose values? 🤔

The values problem:

If we align AI to human values…

  • Whose human values?
  • Developers? Users? Affected parties?
  • Majority? Consensus? Universal?
  • Present generation? Future?

Let’s discuss:

  1. Should AI reflect your values or “universal” values?
  2. What happens when values conflict?
  3. Who should decide?
  4. Is this a technical or political question?

Some perspectives:

  • Libertarian: Each user controls their AI
  • Democratic: Majority decides
  • Rights-based: Some things off-limits regardless
  • Technocratic: Experts decide
  • Pluralist: Multiple systems for different contexts

“Whose values?” is a political question. No amount of engineering can avoid it. That is what makes alignment so difficult.

Future trajectories 🚀

Where is AI heading?

Current trends:

  • Models getting larger and more capable
  • More general-purpose systems
  • Multimodal capabilities

Uncertainties:

  • Will scaling continue to work?
  • When do we hit diminishing returns?
  • What capabilities emerge unexpectedly?
  • How fast is too fast?

Expert disagreement:

  • Wide variation in predictions
  • Timeline estimates vary by decades
  • Some expect AGI soon, others never
  • Confidence often exceeds evidence

Source: Our World in Data

Honest answer: Nobody knows. Uncertainty is high. Be sceptical of confident predictions.

Scenarios to consider

Scenario 1: Gradual improvement

  • AI gets better slowly
  • Humans adapt
  • Society adjusts incrementally
  • Most likely?

Scenario 2: Capability jumps

  • Sudden breakthroughs
  • Unexpected capabilities
  • Rapid deployment
  • Less time to adapt

Scenario 3: Plateau

  • Current approaches hit limits
  • Progress slows dramatically
  • Different paradigms needed
  • Also possible

Scenario 4: Transformative AI

  • Systems vastly more capable than humans
  • Fundamental changes to economy, society
  • Either very good or very bad
  • Uncertain timeline

Stuart Russell’s perspective

Russell’s argument (TED talk):

  • We’re building systems whose objectives we don’t fully control
  • Standard AI paradigm: optimise given objective
  • Problem: We can’t specify objectives correctly
  • Need fundamentally different approach

His proposal:

  1. AI should be uncertain about objectives
  2. Should defer to humans
  3. Should allow itself to be switched off
  4. Objectives learned, not programmed

“You cannot fetch the coffee if you are dead”

  • AI should value human control
  • Not because told to
  • Because it’s instrumentally useful for actual goals
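Russell’s off-switch argument can be sketched as a tiny expected-value calculation (the probabilities and payoffs here are stylised, chosen only for illustration): an AI that is uncertain whether its plan matches the human’s goal gets higher expected value by deferring to a human who can veto it.

```python
# Stylised off-switch calculation; all numbers are illustrative.
p_good = 0.6         # AI's belief that its plan is what the human wants
value_if_good = 10   # payoff if the plan is right
value_if_bad = -100  # payoff if the plan is wrong

# Acting immediately: gamble on the plan being right.
act_now = p_good * value_if_good + (1 - p_good) * value_if_bad

# Deferring: an informed human approves good plans and vetoes bad ones
# (veto payoff 0, i.e. nothing happens).
defer = p_good * value_if_good + (1 - p_good) * 0

print(act_now, defer)  # -34.0 6.0 — deferring wins under uncertainty
```

This is the sense in which allowing itself to be switched off is instrumentally useful to the AI, not an external constraint bolted on afterwards.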

Source: TED

Russell co-wrote the most widely used AI textbook (Artificial Intelligence: A Modern Approach). When he says we have a problem, it is worth paying attention.

Existential risk: the debate

Those who worry:

  • AI could become uncontrollable
  • Misaligned powerful AI = catastrophe
  • Even small probability × huge harm = important
  • “We might not get a second chance”
  • Hinton, Bengio, Russell, many others

Those who are sceptical:

  • Speculative, distracts from real present harms
  • “Sci-fi thinking” not grounded
  • We control the off switch
  • Capabilities overstated

The actual state of debate:

  • Serious researchers on both sides
  • Uncertainty and disagreement are genuine

What can we do? 🔧

Research directions

Technical safety research:

  • Interpretability: understand AI internals
  • Robustness: resist adversarial inputs
  • Alignment: ensure AI pursues intended goals
  • Oversight: scale human supervision
  • Honesty: AI that doesn’t deceive

Growing field:

  • Anthropic, OpenAI, DeepMind safety teams, academic labs
  • Still small relative to capabilities

Career opportunity:

  • High-impact work
  • Talent shortage
  • Many paths in technical and non-technical roles

Source: The Wall Street Journal (2026)

If this interests you: Look into 80,000 Hours, AI safety bootcamps, and safety-focused labs.

Where you fit in

People who understand the technology are often absent from policy debates. That is a problem!

As citizens:

  • Scrutinise AI regulation proposals: what problem does this solve, and does it actually address it?
  • Vote, comment on public consultations, write to representatives
  • Most AI coverage confuses hype with capability; you can do better 😉

If you work with AI or data:

  • Ask what happens when your system fails or is misused
  • Know what data you train on and who it affects
  • Talk to people outside your field about how they experience AI

Thinking about AI risk clearly

Common mistakes:

  • Dismissing concerns because current AI seems harmless (ignores rapid capability gains)
  • Catastrophising based on scenarios with no evidence
  • Conflating “possible” with “probable”

A more useful framework:

  • Separate near-term harms from speculative long-term risks
  • Ask: what evidence would change my mind?
  • Experts genuinely disagree, and that is okay! Certainty is the red flag

What history suggests:

  • Technology risks are real but rarely follow worst-case predictions
  • The biggest harms tend to come from problems we did not anticipate, not the ones we worried about most

Summary 📝

Main takeaways

Safety landscape

  • Near-term and long-term concerns both matter
  • Concrete problems: safe exploration, side effects, reward hacking
  • Acting now is important given uncertainty

Alignment

  • Getting AI to do what we actually want
  • Hard because: specification, Goodhart’s law, value disagreement
  • Current approaches: RLHF, Constitutional AI
  • Major open problems remain

Future trajectories

  • Genuine uncertainty about where AI is heading
  • Multiple scenarios worth considering
  • Expert disagreement is real
  • Plan for multiple futures

What we can do

  • Technical safety research
  • Governance and policy
  • Individual choices
  • Neither panic nor complacency

The future of AI is not determined. The choices we make now, about alignment, oversight, and governance, will shape what that future looks like.

… and that’s all for today! 🎉