DATASCI 185: Introduction to AI Applications

Lecture 24: Long-term Safety, Alignment, and Future of AI

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 😊

Recap of last class

  • Last class: misinformation, deepfakes, and trust
  • Three types of false content: mis-, dis-, and malinformation, each with different intent
  • AI lowers the cost and skill needed to produce convincing fakes (text, image, audio, video)
  • Real-world harms: political manipulation, non-consensual imagery, financial fraud
  • Cognitive biases (System 1 vs 2, confirmation bias, bandwagon effect) explain why we fall for it
  • Responses: technical detection, content provenance (C2PA), platform policies, media literacy
  • Today: long-term safety and the alignment problem

Source: Chris Ume

Lecture overview

What we will cover today

Part 1: AI safety

  • Near-term vs long-term concerns
  • Why safety matters now
  • Concrete problems in AI safety

Part 2: The alignment problem

  • What is alignment?
  • Why it’s hard
  • Current approaches

Part 3: Future trajectories

  • Where is AI heading?
  • Expert disagreement
  • Scenarios to consider

Part 4: What can we do?

  • Research directions
  • Governance approaches
  • Your role

Meme of the day!

Source: Cheezburger

Funny news of the day! 🫧

Source: CNBC

AI safety 🛡️

Near-term vs long-term concerns

Near-term safety concerns:

  • Bias and discrimination (already happening)
  • Privacy violations
  • Job displacement
  • Misinformation (covered last lecture)
  • Security vulnerabilities

Long-term safety concerns:

  • Alignment: AI pursuing wrong goals
  • Loss of human control
  • Concentration of power
  • Existential risk (controversial)
  • Some argue long-term concerns distract from near-term harms
  • Others argue near-term focus misses bigger picture
  • “Black Swan” events
  • They’re connected: Safe systems now → safer systems later

Source: Sætra & Danaher (2023)

Why think about long-term safety now?

  • AI capabilities advancing faster than expected
  • Safety research takes time
  • Hard to retrofit safety later
  • Better to be prepared

Historical analogies:

  • Nuclear: developed first, safety after
  • Internet: security an afterthought
  • Social media: harms discovered in deployment
  • Biotech: ongoing debate
  • We’re still early enough to shape development
  • Safety research is growing, but still a fraction of what goes into capabilities
  • Major labs (Anthropic, DeepMind, OpenAI) now have dedicated safety teams

Concrete problems in AI safety

Amodei et al. (2016) identified five big challenges:

  • Safe exploration: how to learn without taking dangerous actions
  • Avoiding negative side effects: don’t break things while achieving goals
  • Avoiding reward hacking: don’t game the objective
  • Scalable oversight: how to supervise complex systems
  • Robustness to distributional shift: handle novel situations safely

Why these matter:

  • Not speculative, but already happening in deployed systems
  • Scale with capability
  • Unsolved even for current AI
  • Let’s analyse these issues in more detail… 🤓

Source: Brian Christian

Safe exploration

  • Learning requires trying new things
  • Some actions are irreversible
  • “Explore safely” is hard to specify

Examples:

  • Robot learning to walk: Don’t break yourself
  • Self-driving: Don’t explore by crashing
  • Financial AI: Don’t bankrupt the company
  • Medical AI: Don’t kill patients while learning

Current approaches:

  • Simulation: Learn in low-risk environments first
  • Conservative policies: Keep actions within a safe region
  • Human oversight: Ask before novel actions
  • Reward shaping: Penalise dangerous states
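A conservative policy with human oversight can be sketched in a few lines. This is a toy illustration, not a real safety mechanism: the risk numbers, action names, and threshold are all made up, and a real system would learn its risk model from data.

```python
# Sketch of a conservative policy: act only when estimated risk is low,
# otherwise defer to a human. All risk values and actions are hypothetical.
RISK_THRESHOLD = 0.2

KNOWN_RISKS = {"slow_forward": 0.05, "gentle_turn": 0.10, "sharp_swerve": 0.70}

def estimated_risk(action: str) -> float:
    # Novel (unseen) actions get maximum risk, so they trigger oversight.
    return KNOWN_RISKS.get(action, 1.0)

def decide(action: str) -> str:
    if estimated_risk(action) <= RISK_THRESHOLD:
        return f"execute:{action}"
    return f"ask_human:{action}"

print(decide("slow_forward"))  # execute:slow_forward
print(decide("sharp_swerve"))  # ask_human:sharp_swerve
print(decide("novel_action"))  # ask_human:novel_action
```

Note the design choice: unknown actions default to maximum risk, which operationalises “ask before novel actions” rather than “explore first, apologise later”.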

Source: The Wall Street Journal

How do you specify “safe” without already knowing everything about the domain?

Negative side effects

  • AI optimises for specified objective
  • It ignores everything else
  • Unintended consequences not in objective
  • “You didn’t say not to…”

Classic example (thought experiment):

  • Robot tasked with fetching coffee
  • Knocks over obstacles in path
  • Harms humans in the way
  • Technically: Coffee fetched ✓

Real-world version:

  • Content algorithm maximises engagement
  • Side effect: Polarisation, addiction
  • Not in objective function
  • Nobody specified “don’t harm society”

Political polarisation is real, but it was not designed; it emerged as a side effect

Nobody told the engagement algorithm “don’t polarise society.” If you don’t put it in the objective, the system won’t care about it.
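The engagement example can be made concrete with a toy objective (all numbers and policy names hypothetical): the polarisation side effect only influences the choice once it is written into the objective function.

```python
# Toy side-effect example: each feed policy has an engagement score and a
# polarisation side effect. Both numbers are invented for illustration.
policies = {
    "balanced_feed": {"engagement": 60, "polarisation": 5},
    "outrage_feed":  {"engagement": 90, "polarisation": 80},
}

def naive_objective(p):
    # What was actually specified: engagement only.
    return p["engagement"]

def penalised_objective(p, lam=1.0):
    # The side effect made explicit, weighted by lam.
    return p["engagement"] - lam * p["polarisation"]

best_naive = max(policies, key=lambda k: naive_objective(policies[k]))
best_penalised = max(policies, key=lambda k: penalised_objective(policies[k]))

print(best_naive)      # outrage_feed  (90 beats 60)
print(best_penalised)  # balanced_feed (55 beats 10)
```

The optimiser is doing exactly what it was told in both cases; the only difference is what made it into the objective.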

The alignment problem 🎯

What is alignment?

  • Alignment: AI systems that do what we want
  • Pursue goals we actually intend
  • Not just stated objectives
  • Behave safely even when we can’t supervise

Why the word “alignment”:

  • AI goals aligned with human values
  • Not orthogonal or opposed
  • Not pursuing random objectives
  • Not satisfying the letter of an objective while violating its spirit

Why it’s hard:

  • Hard to specify what we want
  • Humans disagree about values
  • Our stated preferences aren’t always true preferences
  • Context matters enormously

Claude can fake alignment even without being asked! 😧

Source: Anthropic

Alignment is hard!

Specification problem:

  • Can’t write down everything we care about
  • Edge cases are infinite
  • Values are context-dependent
  • Humans can’t articulate their own values perfectly

Goodhart’s Law: (remember this!)

  • “When a measure becomes a target…”
  • “…it ceases to be a good measure”
  • Optimise metric ≠ achieve goal

Examples:

  • Click rates → clickbait
  • Engagement → addiction
  • GDP → environmental destruction
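A tiny toy makes Goodhart’s law concrete (hypothetical scores): the proxy (clicks) and the true goal (reader value) roughly agree on ordinary content, but optimising the proxy selects exactly the case where they diverge.

```python
# Goodhart toy: clicks is a proxy for reader value. Numbers are invented.
headlines = {
    "Careful explainer":        {"clicks": 40, "value": 90},
    "Solid news report":        {"clicks": 55, "value": 80},
    "You won't BELIEVE this!!": {"clicks": 95, "value": 10},
}

# Optimise the measure vs optimise the goal.
by_proxy = max(headlines, key=lambda h: headlines[h]["clicks"])
by_goal = max(headlines, key=lambda h: headlines[h]["value"])

print(by_proxy)  # the clickbait wins once clicks become the target
print(by_goal)   # what we actually wanted
```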

The King Midas problem: (Russell, 2014)

  • Gets exactly what he asked for
  • Not what he wanted
  • Literal interpretation of wishes
  • Common AI failure mode

Current alignment approaches

RLHF (Reinforcement Learning from Human Feedback): (OpenAI, 2019)

  • Humans rank AI outputs
  • Train reward model on rankings
  • Optimise AI to satisfy reward model
  • Current industry standard
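The reward-modelling step at the heart of RLHF can be sketched with the Bradley-Terry preference model: fit one scalar reward per response so that human-preferred responses score higher. This is a toy with invented responses and preferences, not a real training pipeline (which would use a neural reward model over token sequences).

```python
# Toy RLHF reward modelling: learn scalar rewards from pairwise human
# preferences via gradient ascent on the Bradley-Terry log-likelihood.
import math

responses = ["helpful answer", "rude answer", "evasive answer"]
# Hypothetical human comparisons: (preferred_index, rejected_index)
preferences = [(0, 1), (0, 2), (2, 1)]

rewards = [0.0] * len(responses)
lr = 0.5
for _ in range(200):
    for win, lose in preferences:
        # P(win preferred over lose) = sigmoid(r_win - r_lose)
        p = 1 / (1 + math.exp(rewards[lose] - rewards[win]))
        # Push the preferred response up, the rejected one down.
        rewards[win] += lr * (1 - p)
        rewards[lose] -= lr * (1 - p)

best = max(range(len(responses)), key=lambda i: rewards[i])
print(responses[best])  # helpful answer
```

The “optimise AI to satisfy reward model” step would then train the policy against these learned rewards; this is also where reward hacking can creep in, since the policy optimises the learned proxy, not the humans themselves.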

Constitutional AI: (Anthropic, 2022)

  • AI critiques its own outputs
  • Uses set of principles (“constitution”)
  • Iteratively improves
  • Reduces need for human labelling
  • Amanda Askell’s work on AI ethics and philosophy

Debate and recursive reward modelling: (Leike et al., 2024)

  • AI systems argue with each other
  • Humans judge who’s more persuasive
  • Scales human oversight: it is easier to judge an argument than to produce one

Limitations of current approaches:

  • RLHF: Can learn to game evaluators
  • Human feedback: Expensive, biased
  • Principles: Still have to specify them right
  • None are complete solutions

These methods work well enough for current systems. Whether they scale to more capable AI is unknown.

The alignment survey

Ngo et al. overview (2022):

  • Surveys the alignment problem and current techniques
  • Framed from a deep learning perspective

However…

  • No solved problem: Active research area
  • Multiple approaches needed
  • Uncertainty about scaling
  • Theoretical foundations lacking

Open questions:

  1. Will RLHF scale?
  2. Can we detect deceptive alignment?
  3. How do we handle value disagreement?
  4. What’s the role of interpretability?
  5. When is good enough “good enough”?

Discussion: whose values? 🤔

The values problem:

If we align AI to human values…

  • Whose human values?
  • Developers? Users? Affected parties?
  • Majority? Consensus? Universal?
  • Present generation? Future?

Let’s discuss:

  1. Should AI reflect your values or “universal” values?
  2. What happens when values conflict?
  3. Who should decide?
  4. Is this a technical or political question?

Some perspectives:

  • Libertarian: Each user controls their AI
  • Democratic: Majority decides
  • Rights-based: Some things off-limits regardless
  • Technocratic: Experts decide
  • Pluralist: Multiple systems for different contexts

“Whose values?” is a political question. No amount of engineering can avoid it. That is what makes alignment so difficult.

Future trajectories 🚀

Where is AI heading?

Current trends:

  • Models getting larger and more capable
  • More general-purpose systems
  • Multimodal capabilities

Uncertainties:

  • Will scaling continue to work?
  • When do we hit diminishing returns?
  • What capabilities emerge unexpectedly?
  • How fast is too fast?

Expert disagreement:

  • Wide variation in predictions
  • Timeline estimates vary by decades
  • Some expect AGI soon, others never
  • Confidence often exceeds evidence

Source: Our World in Data

Honest answer: Nobody knows. Uncertainty is high. Be sceptical of confident predictions.

Scenarios to consider

Scenario 1: Gradual improvement

  • AI gets better slowly
  • Humans adapt
  • Society adjusts incrementally
  • Most likely?

Scenario 2: Capability jumps

  • Sudden breakthroughs
  • Unexpected capabilities
  • Rapid deployment
  • Less time to adapt

Scenario 3: Plateau

  • Current approaches hit limits
  • Progress slows dramatically
  • Different paradigms needed
  • Also possible

Scenario 4: Transformative AI

  • Systems vastly more capable than humans
  • Fundamental changes to economy, society
  • Either very good or very bad
  • Uncertain timeline

Stuart Russell’s perspective

Russell’s argument (TED talk):

  • We’re building systems whose objectives we don’t fully control
  • Standard AI paradigm: optimise given objective
  • Problem: We can’t specify objectives correctly
  • Need fundamentally different approach

His proposal:

  1. AI should be uncertain about objectives
  2. Should defer to humans
  3. Should allow itself to be switched off
  4. Objectives learned, not programmed

“You cannot fetch the coffee if you are dead”

  • AI should value human control
  • Not because told to
  • Because it’s instrumentally useful for actual goals
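Russell’s off-switch argument can be sketched as a tiny expected-value calculation (the probabilities and payoffs here are stylised, chosen only for illustration): an AI that is uncertain whether its plan matches the human’s goal gets higher expected value by deferring to a human who can veto it.

```python
# Stylised off-switch calculation; all numbers are illustrative.
p_good = 0.6         # AI's belief that its plan is what the human wants
value_if_good = 10   # payoff if the plan is right
value_if_bad = -100  # payoff if the plan is wrong

# Acting immediately: gamble on the plan being right.
act_now = p_good * value_if_good + (1 - p_good) * value_if_bad

# Deferring: an informed human approves good plans and vetoes bad ones
# (veto payoff 0, i.e. nothing happens).
defer = p_good * value_if_good + (1 - p_good) * 0

print(act_now, defer)  # -34.0 6.0 — deferring wins under uncertainty
```

This is the sense in which allowing itself to be switched off is instrumentally useful to the AI, not an external constraint bolted on afterwards.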

Source: TED

Russell co-wrote the most widely used AI textbook (Artificial Intelligence: A Modern Approach). When he says we have a problem, it is worth paying attention.

Existential risk: the debate

Those who worry:

  • AI could become uncontrollable
  • Misaligned powerful AI = catastrophe
  • Even small probability × huge harm = important
  • “We might not get a second chance”
  • Hinton, Bengio, Russell, many others

Those who are sceptical:

  • Speculative, distracts from real present harms
  • “Sci-fi thinking” not grounded
  • We control the off switch
  • Capabilities overstated

The actual state of debate:

  • Serious researchers on both sides
  • Uncertainty and disagreement are genuine

What can we do? 🔧

Research directions

Technical safety research:

  • Interpretability: understand AI internals
  • Robustness: resist adversarial inputs
  • Alignment: ensure AI pursues intended goals
  • Oversight: scale human supervision
  • Honesty: AI that doesn’t deceive

Growing field:

  • Anthropic, OpenAI, DeepMind safety teams, academic labs
  • Still small relative to capabilities

Career opportunity:

  • High-impact work
  • Talent shortage
  • Many paths in technical and non-technical roles

Source: The Wall Street Journal (2026)

If this interests you: Look into 80,000 Hours, AI safety bootcamps, and safety-focused labs.

Where you fit in

People who understand the technology are often absent from policy debates. That is a problem!

As citizens:

  • Scrutinise AI regulation proposals: what problem does this solve, and does it actually address it?
  • Vote, comment on public consultations, write to representatives
  • Most AI coverage confuses hype with capability; you can do better 😉

If you work with AI or data:

  • Ask what happens when your system fails or is misused
  • Know what data you train on and who it affects
  • Talk to people outside your field about how they experience AI

Thinking about AI risk clearly

Common mistakes:

  • Dismissing concerns because current AI seems harmless (ignores rapid capability gains)
  • Catastrophising based on scenarios with no evidence
  • Conflating “possible” with “probable”

A more useful framework:

  • Separate near-term harms from speculative long-term risks
  • Ask: what evidence would change my mind?
  • Experts genuinely disagree, and that is okay! Certainty is the red flag

What history suggests:

  • Technology risks are real but rarely follow worst-case predictions
  • The biggest harms tend to come from problems we did not anticipate, not the ones we worried about most

Summary 📝

Main takeaways

Safety landscape

  • Near-term and long-term concerns both matter
  • Concrete problems: safe exploration, side effects, reward hacking
  • Acting now is important given uncertainty

Alignment

  • Getting AI to do what we actually want
  • Hard because: specification, Goodhart’s law, value disagreement
  • Current approaches: RLHF, Constitutional AI
  • Major open problems remain

Future trajectories

  • Genuine uncertainty about where AI is heading
  • Multiple scenarios worth considering
  • Expert disagreement is real
  • Plan for multiple futures

What we can do

  • Technical safety research
  • Governance and policy
  • Individual choices
  • Neither panic nor complacency

The future of AI is not determined. The choices we make now, about alignment, oversight, and governance, will shape what that future looks like.

… and that’s all for today! 🎉