DATASCI 185: Introduction to AI Applications

Lecture 02: A Brief History of AI and the Recent Shift

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 🤓

Short recap of last class

Lecture overview

Today’s agenda

  • The pre-history of AI
  • Birth of AI: Dartmouth 1956
  • Symbolic AI and expert systems
  • AI winters and why they happened
  • The neural network comeback
  • The data revolution
  • Transformers explained simply
  • From GPT to ChatGPT

Click to enlarge (all images can be enlarged by clicking on them)

Source: University of Queensland’s Brain Institute

The pre-history of AI 🏛️

The pre-history of AI

Ancient myths of artificial life

  • Humans have long dreamed of creating artificial life
  • Ancient myths featured mechanical beings:
    • Talos (Greek): Bronze giant protecting Crete
    • Golem (Jewish folklore): Clay figure brought to life
    • Frankenstein (1818): Mary Shelley’s novel about creating life
  • We’ve always been fascinated with replicating intelligence
  • Can machines truly think, or only simulate thinking?

Talos

Source: Wikipedia - Talos

Early mechanical calculators

From Pascal to Babbage

Babbage’s Analytical Engine (replica)

Source: Wikipedia - Analytical Engine

Human computers

When “computer” meant a person

  • For most of history, the word “computer” referred to a person, not a machine!
  • Large teams of human computers performed complex calculations:
    • Astronomical tables
    • Ballistics trajectories
    • Census data
  • Often women, who were paid less than male mathematicians
  • This division of mental labour (Adam Smith!) showed that complex calculations could be broken into simple, mechanical steps
  • Historian Lorraine Daston: Calculation shifted from “genius” to “merely mechanical”

Human computers at NASA (1950s)

Source: NASA

Alan Turing and the foundations of computing

The theoretical breakthrough

The standard version of the Turing test

Source: Wikipedia - Turing Test

Live Turing test!

  • Guess if the following sentences were written by a human or a machine:


  • “The universe (which others call the Library) is composed of an indefinite, perhaps an infinite number of hexagonal galleries, with enormous ventilation shafts in the middle, encircled by very low railings.”
  • “This is a multifaceted issue that requires us to examine several different perspectives.”
    • Machine!
  • “The function of science fiction is not always to predict the future but sometimes to prevent it.”
  • “The epistemological ramifications of this ontological framework necessitate a paradigmatic shift in our heuristic methodologies.”
    • Machine!
  • And yes, ChatGPT did pass the Turing test

The birth of AI (1950s–1970s) 🎂

The Dartmouth Conference (1956)

Where it all began

“Every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.”

The Dartmouth workshop participants

Source: IEEE Spectrum

Symbolic AI and expert systems

Two approaches to intelligence

Symbolic AI, 1956–1980s

Expert systems, 1970s–1980s

  • Shift to narrow domains with expert knowledge

  • Encode human expertise as if-then rules

  • Examples: MYCIN (bacterial infections, video here), DENDRAL (chemical analysis, video here)

  • Edward Feigenbaum: “The problem-solving power… is primarily a consequence of the specialist’s knowledge employed”

  • Both approaches relied on hand-crafted knowledge, not learning from data

  • Neither could learn or improve automatically

The AI winters

The first winter (1974–1980)

  • Combinatorial explosion, limited computing power, brittleness
  • Funding collapsed, researchers left the field
  • Lesson: AI is harder than pioneers thought

The second winter (1987–1993)

  • Expensive to maintain, knowledge bottleneck, couldn’t learn
  • Desktop computers made specialised AI hardware obsolete
  • Market collapsed by 1987
  • “AI” became a dirty word

The pattern

  1. Bold claims attract funding
  2. Initial successes create excitement
  3. Limitations become apparent at scale
  4. Funding dries up, researchers move on
  5. Quiet research continues, laying groundwork…


Anything familiar about this pattern? 😅

Maybe this time is different? 🤷🏻‍♂️

Neural networks: rise, fall, and rise again 🧠

Early neural networks

Inspired by the brain

  • 1943: McCulloch and Pitts propose artificial neurons
    • Simple binary units that mimic biological neurons
  • 1958: Frank Rosenblatt invents the Perceptron
    • A simple neural network that could learn to classify patterns
    • Trained by adjusting weights based on errors
  • The New York Times (1958) predicted it would “walk, talk, see, write, reproduce itself”
  • Instead of programming rules, let the machine learn from examples
  • However, single-layer perceptrons couldn’t solve certain simple problems, so interest waned
  • Multi-layer networks could solve these problems, but training them was hard
  • It took nearly 20 years for neural networks to make their comeback
  • How does an artificial neuron work?
  • Each input (\(x_i\)) (e.g., pixel brightness) is multiplied by a weight (\(w_i\)) to increase or decrease the importance of that signal, then summed with a bias (\(b\)) (a threshold that determines when the neuron “fires”): \(z = \sum x_i \cdot w_i + b\)
  • The result passes through an activation function (non-linear transformation) to produce the output: \(y = f(z)\)
  • Without the activation function, the network would be just a linear model!
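The weighted sum and activation described above can be sketched in a few lines of Python. Everything here (the step activation, the hand-picked weights implementing a logical AND) is a toy illustration, not production code:

```python
# A single artificial neuron: weighted sum of inputs plus bias,
# passed through a non-linear activation function.

def step(z):
    # Threshold activation: the neuron "fires" (1) if z is non-negative, else 0.
    return 1 if z >= 0 else 0

def neuron(inputs, weights, bias):
    # z = sum(x_i * w_i) + b
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return step(z)

# Hand-picked weights and bias so the neuron computes logical AND
# of two binary inputs (chosen for illustration, not learned):
and_weights, and_bias = [1.0, 1.0], -1.5
print(neuron([1, 1], and_weights, and_bias))  # fires: 1
print(neuron([1, 0], and_weights, and_bias))  # does not fire: 0
```

A perceptron is exactly this unit plus a rule for adjusting the weights whenever the output is wrong.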

Perceptron diagram

Source: AI Mind

The backpropagation breakthrough (1986)

Learning to learn

  • 1986: Rumelhart, Hinton, and Williams popularise backpropagation
  • You can train multi-layer networks by propagating errors backwards through the layers
  • It lets networks fit highly non-linear functions by making many small, precise adjustments to the weights
  • Finally, neural networks could learn complex patterns!
  • Renewed interest in connectionism
  • But challenges remained:
    • Limited computing power
    • Not enough training data
  • Neural networks showed promise but couldn’t yet compete with other methods
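To make the idea concrete, here is backpropagation on the smallest possible network: one input, one hidden sigmoid unit, one output. The input, target, initial weights, and learning rate are toy choices for illustration; the chain-rule steps are the point:

```python
import math

# Tiny two-layer network: x --w1--> h = sigmoid(w1*x) --w2--> y_hat = w2*h,
# with squared-error loss (y_hat - y)^2. Toy values throughout.

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x, y = 1.0, 0.0        # one training example (input, target)
w1, w2 = 0.5, -0.3     # initial weights, chosen arbitrarily
lr = 0.1               # learning rate

for _ in range(100):
    # Forward pass
    h = sigmoid(w1 * x)
    y_hat = w2 * h
    loss = (y_hat - y) ** 2

    # Backward pass: propagate the error backwards through the layers
    d_yhat = 2 * (y_hat - y)       # dL/dy_hat
    d_w2 = d_yhat * h              # dL/dw2
    d_h = d_yhat * w2              # dL/dh (error flows back through w2)
    d_w1 = d_h * h * (1 - h) * x   # chain rule; sigmoid'(z) = h * (1 - h)

    w1 -= lr * d_w1
    w2 -= lr * d_w2

print(loss)  # the loss shrinks toward 0 as the weights adjust
```

Real networks have millions of weights, but the mechanism is the same: compute each weight's contribution to the error via the chain rule, then nudge it in the opposite direction.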

Backpropagation diagram

Source: Medium

The data revolution 📊

The unreasonable effectiveness of data

A paradigm shift

  • 2009: Google researchers (Halevy, Norvig, Pereira) publish a landmark paper called “The Unreasonable Effectiveness of Data”
  • Simple models with lots of data beat complex models with less data
  • Traditional AI: Design clever algorithms
  • New AI: Feed massive amounts of data to simple learning algorithms

“Now go out and gather some data, and see what it can do.”

  • This insight transformed the field
  • Data became the new oil, together with computing power and better algorithms
  • Any one of them alone wasn’t enough!

More data beats better algorithms

Source: Google Research

The ImageNet moment (2012)

Deep learning arrives

  • ImageNet: A dataset of 14 million labeled images
  • Annual competition: Classify images into 1,000 categories
  • 2012: AlexNet (Krizhevsky, Sutskever, Hinton) wins by a huge margin
    • Error rate: 15.3% (next best: 26.2%)
    • Used deep neural networks
    • Trained on Graphics Processing Units (GPUs), which handle parallel processing far better than Central Processing Units (CPUs)
  • This was a great moment for deep learning!
  • Within years, deep learning dominated computer vision
  • Proved that neural networks + big data + GPUs = breakthrough

Ilya Sutskever, Alex Krizhevsky, and Geoffrey Hinton

Source: Medium

Transformers: the architecture that changed everything ⚡

Attention is all you need (2017)

The transformer revolution

  • 2017: Google researchers publish “Attention Is All You Need” (Vaswani et al.)
  • Self-attention mechanism
  • Instead of processing sequentially:
    • Each word can “look at” every other word directly
    • Computes relevance scores between all pairs
    • Parallel processing: Much faster to train
  • This architecture powers GPT (Generative Pre-trained Transformer), BERT, and pretty much all modern AI
  • Beyond text: Now used in images, audio, video, proteins, games…

The paper that started it all

How transformers work (simplified!)

The big picture

Transformer architecture overview

Source: Hivenet

A transformer has three main parts:

  1. Embedding: Turn words into numbers
  2. Transformer Blocks: Mix and refine information (the magic happens here!)
  3. Output: Predict the next word

Step 1: Tokenisation

Breaking text into pieces

  • Text must be converted to numbers
  • Tokenisation: Split text into tokens
    • Usually words, letters, or word pieces
    • “unhappiness” → “un”, “happiness”
    • GPT-2 has a vocabulary of 50,257 tokens
  • Each token gets a unique ID number
  • Think of it as creating a dictionary where each word/piece has a code
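A toy sketch of the idea in Python. Real tokenisers such as GPT-2’s byte-pair encoder learn their vocabulary from data and split words into sub-word pieces; this one is hand-made and splits on spaces, just to show the text-to-IDs mapping:

```python
# A toy tokeniser: a hand-made "dictionary" mapping each word/piece to an ID.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "un": 5, "happiness": 6}

def tokenise(text):
    # Split on whitespace and look up each piece's unique ID.
    return [vocab[word] for word in text.lower().split()]

print(tokenise("The cat sat on the mat"))  # [0, 1, 2, 3, 0, 4]
```

Note how “the” appears twice and maps to the same ID both times: the vocabulary is fixed, and the same piece always gets the same code.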

Tokenisation diagram

Source: Medium

Step 2: Embeddings

Words as points in space

  • Each token becomes a vector (a list of numbers)
  • GPT-2: Each token → 768 numbers
  • These vectors capture meaning:
    • Similar words are close together
    • “king” - “man” + “woman” ≈ “queen”
  • Also add position information:
    • Word order matters in language!
    • “Dog bites man” ≠ “Man bites dog”
  • These embeddings are learned during training
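The famous king/queen analogy can be checked with toy vectors. These 3-dimensional embeddings are hand-made for illustration (real models learn hundreds of dimensions from data); the dimensions loosely encode [royalty, masculinity, femininity]:

```python
import math

# Hand-made toy embeddings (GPT-2 learns 768-dimensional vectors).
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    # Cosine similarity: 1 means "pointing the same way" in embedding space.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# "king" - "man" + "woman" should land closest to "queen":
target = add(sub(emb["king"], emb["man"]), emb["woman"])
closest = max(emb, key=lambda w: cosine(emb[w], target))
print(closest)
```

The arithmetic works because the vectors encode meaning geometrically: subtracting “man” removes masculinity, adding “woman” adds femininity, and royalty is untouched.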

Words plotted in 3-dimensional space. Embeddings can have hundreds or thousands of dimensions, too many for humans to visualise

Source: Google Cloud

Step 3: Self-attention

The key innovation

  • The core mechanism that makes transformers powerful
  • For each word, ask: “Which other words are relevant to understanding me?”
  • Creates three vectors for each token:
    • Query (Q): “What am I looking for?” (does “it” refer to “animal” or “street”?)
    • Key (K): “What do I contain?” (check the characteristics of “it” as a token)
    • Value (V): “What information can I provide?” (the answer: “ah, it’s the animal!”)
  • Attention score = how well Query matches Key
  • Like a search engine where each word searches for relevant context
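Scaled dot-product attention is short enough to write out in plain Python. In a real transformer, Q, K, and V come from learned linear projections of the token embeddings; here they are hand-made 2-dimensional vectors so the mechanics are visible:

```python
import math

def softmax(xs):
    # Turn raw scores into probabilities that sum to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d = len(K[0])
    out = []
    for q in Q:
        # Score: how well this query matches each key (scaled by sqrt(d)).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Output: mix of the value vectors, weighted by attention.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three tokens; the first query matches the second key strongly.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[0.0, 1.0], [4.0, 0.0], [0.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = attention(Q, K, V)
print([round(x, 2) for x in out[0]])  # mostly the second value vector
```

Every token attends to every other token in one pass, which is why transformers parallelise so well compared with sequential models.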

Self-attention: Words attend to each other

Source: Jay Alammar

Self-attention example

“The cat sat on the mat”

Source: Thomas Wiecki

  • When processing “sat”, the model might attend strongly to:
    • “cat” (who is sitting?)
    • “mat” (where are they sitting?)
  • Multi-head attention: Multiple attention patterns in parallel (division of labour again!)
    • One head might focus on subject-verb relationships
    • Another on adjective-noun relationships
    • GPT-2 uses 12 attention heads

Step 4: Feed-forward and stacking

Building deeper understanding

  • After attention, each token passes through a feed-forward network (FFN, Multi-Layer Perceptron, MLP)
  • Attention is about “looking around” to find connections between words
  • FFN is about “thinking” about what those connections actually mean for each individual word
  • This refines and transforms the representation
  • Models stack many blocks:
    • GPT-2 (small): 12 blocks
    • GPT-3: 96 blocks
    • GPT-4: Rumoured to be much larger
  • Each block builds more abstract understanding
    • Early layers: Grammar, syntax
    • Later layers: Meaning, context, reasoning
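The attention-then-FFN structure of one block can be sketched as follows. This is a heavily simplified illustration: layer normalisation is omitted, the attention step is replaced by a simple averaging stand-in, and all weights are hand-made rather than learned:

```python
def relu(xs):
    # Non-linear activation applied element-wise.
    return [max(0.0, x) for x in xs]

def feed_forward(x, W1, W2):
    # Two linear layers with a non-linearity in between, applied per token.
    h = relu([sum(xi * w for xi, w in zip(x, col)) for col in W1])
    return [sum(hi * w for hi, w in zip(h, col)) for col in W2]

def transformer_block(tokens, attend, W1, W2):
    # 1) "Look around": self-attention, with a residual connection.
    attended = attend(tokens)
    tokens = [[a + t for a, t in zip(av, tv)]
              for av, tv in zip(attended, tokens)]
    # 2) "Think": feed-forward network, with a residual connection.
    return [[f + t for f, t in zip(feed_forward(tv, W1, W2), tv)]
            for tv in tokens]

# Toy stand-in for attention: every token just averages the whole sequence.
def mean_attend(tokens):
    n = len(tokens)
    mean = [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]
    return [mean for _ in tokens]

W1 = [[1.0, 0.0], [0.0, 1.0]]  # hand-made 2x2 weights, for illustration
W2 = [[1.0, 0.0], [0.0, 1.0]]
out = transformer_block([[1.0, 0.0], [0.0, 1.0]], mean_attend, W1, W2)
print(out)
```

Stacking dozens of such blocks, each with its own learned weights, is what lets the later layers build on the patterns found by the earlier ones.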

One transformer block

Source: Mohamed Traore

Step 5: Predicting the next token

The final output

  • After all blocks, the model predicts: “What comes next?”
  • Outputs a probability for every token in vocabulary
  • Example: “The cat sat on the…”
    • “floor”: 27.4%
    • “bed”: 22.5%
    • “couch”: 17.8%
  • Temperature controls randomness:
    • Low (0.2): More predictable
    • High (1.0+): More creative/random
  • The chosen token is added, and the process repeats
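The effect of temperature is easy to see in code: dividing the scores (“logits”) by the temperature before the softmax makes the distribution peakier (low temperature) or flatter (high temperature). The logits below are made up for illustration:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Scale the logits, then convert to probabilities.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.8, 1.6]  # e.g. scores for "floor", "bed", "couch"
cold = softmax_with_temperature(logits, 0.2)  # low: top choice dominates
hot = softmax_with_temperature(logits, 2.0)   # high: nearly uniform
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```

At low temperature the model almost always picks “floor”; at high temperature all three options become plausible, which is where the “creative” behaviour comes from.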

Step 5: Predicting the next token

The final output

Probability distribution over vocabulary

Source: Transformer Explainer

From GPT to ChatGPT 🚀

The rise of large language models

Scaling up

Model   Year   Parameters   Notable Achievement
GPT-1   2018   117M         Showed pre-training works
BERT    2018   340M         Revolutionised NLP benchmarks
GPT-2   2019   1.5B         “Too dangerous to release”
GPT-3   2020   175B         Few-shot learning emergence
GPT-4   2023   ~1.7T?       Multimodal, near-human reasoning
GPT-5   2025   635B?        More efficient, even better reasoning


  • Bigger models + more data = emergent abilities
  • Capabilities appear at scale that weren’t explicitly trained
  • This is the “scaling hypothesis” (although some people, like Ilya Sutskever, argue that scaling has reached its limits)

What makes ChatGPT different?

Beyond just scaling

ChatGPT’s explosive growth

Source: Voronoi

Multimodal AI

Beyond text

  • Modern AI isn’t just about text anymore
  • Multimodal models can process:
    • Text ↔︎ Images (DALL-E, Midjourney)
    • Text ↔︎ Audio (Whisper, ElevenLabs)
    • Text ↔︎ Video (Sora, Runway)
    • Text ↔︎ Code (Codex, Copilot)
  • Same transformer architecture, different inputs/outputs
  • Vision Transformers (ViT): Treat images as sequences of patches
  • The boundaries between modalities are blurring

Multimodal AI: Understanding multiple types of data

Source: Tarun Sharma

Takeaways 📚

What changed? What stayed the same?

Continuity and transformation

What changed:

  • From human-designed rules to learned patterns
  • From narrow tasks to general capabilities
  • From mimicking human reasoning to optimising for prediction
  • From scarce data to abundant data

What stayed the same:

  • The dream of intelligent machines
  • The question: Can machines truly understand the world?
  • The challenges: Brittleness, bias, lack of common sense
  • The hype cycles: Overpromise, underdeliver (are we in one now?)

Interactive resource

Explore transformers yourself!

  • Georgia Tech’s Transformer Explainer
  • Interactive visualisation of how transformers work
  • See attention patterns in real time
  • Experiment with temperature and sampling
  • Runs GPT-2 directly in your browser!
  • Great for building intuition

https://poloclub.github.io/transformer-explainer

… and that’s all for today! 🎉

See you all soon! 😊