DATASCI 185: Introduction to AI Applications

Lecture 02: A Brief History of AI and the Recent Shift

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 🤓

Short recap of last class

Lecture overview

Today’s agenda

  • The pre-history of AI
  • Birth of AI: Dartmouth 1956
  • Symbolic AI and expert systems
  • AI winters and why they happened
  • The neural network comeback
  • The data revolution
  • Transformers explained simply
  • From GPT to ChatGPT

Click to enlarge (all images can be enlarged by clicking on them)

Source: University of Queensland’s Brain Institute

The pre-history of AI 🏛️

The pre-history of AI

Ancient myths of artificial life

  • Humans have long dreamed of creating artificial life
  • Ancient myths featured mechanical beings:
    • Talos (Greek): Bronze giant protecting Crete
    • Golem (Jewish folklore): Clay figure brought to life
    • Frankenstein (1818): Mary Shelley’s novel about creating life
  • We’ve always been fascinated with replicating intelligence
  • Can machines truly think, or only simulate thinking?

Talos

Source: Wikipedia - Talos

Early mechanical calculators

From Pascal to Babbage

Babbage’s Analytical Engine (replica)

Source: Wikipedia - Analytical Engine

Human computers

When “computer” meant a person

  • For most of history, the word “computer” referred to a person, not a machine!
  • Large teams of human computers performed complex calculations:
    • Astronomical tables
    • Ballistics trajectories
    • Census data
  • Often women, who were paid less than male mathematicians
  • This division of mental labour (Adam Smith!) showed that complex calculations could be broken into simple, mechanical steps
  • Historian Lorraine Daston: Calculation shifted from “genius” to “merely mechanical”

Human computers at NASA (1950s)

Source: NASA

Alan Turing and the foundations of computing

The theoretical breakthrough

The standard version of the Turing test

Source: Wikipedia - Turing Test

Live Turing test!

  • Guess if the following sentences were written by a human or a machine:


  • “The universe (which others call the Library) is composed of an indefinite, perhaps an infinite number of hexagonal galleries, with enormous ventilation shafts in the middle, encircled by very low railings.”
  • “This is a multifaceted issue that requires us to examine several different perspectives.”
    • Machine!
  • “The function of science fiction is not always to predict the future but sometimes to prevent it.”
  • “The epistemological ramifications of this ontological framework necessitate a paradigmatic shift in our heuristic methodologies.”
    • Machine!
  • And yes, ChatGPT did pass the Turing test

The birth of AI (1950s–1970s) 🎂

The Dartmouth Conference (1956)

Where it all began

“Every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.”

The Dartmouth workshop participants

Source: IEEE Spectrum

Symbolic AI and expert systems

Two approaches to intelligence

Symbolic AI, 1956–1980s

Expert systems, 1970s–1980s

  • Shift to narrow domains with expert knowledge

  • Encode human expertise as if-then rules

  • Examples: MYCIN (bacterial infections, video here), DENDRAL (chemical analysis, video here)

  • Edward Feigenbaum: “The problem-solving power… is primarily a consequence of the specialist’s knowledge employed”

  • Both approaches relied on hand-crafted knowledge, not learning from data

  • Neither could learn or improve automatically

The AI winters

The first winter (1974–1980)

  • Combinatorial explosion, limited computing power, brittleness
  • Funding collapsed, researchers left the field
  • Lesson: AI is harder than pioneers thought

The second winter (1987–1993)

  • Expensive to maintain, knowledge bottleneck, couldn’t learn
  • Desktop computers made specialised AI hardware obsolete
  • Market collapsed by 1987
  • “AI” became a dirty word

The pattern

  1. Bold claims attract funding
  2. Initial successes create excitement
  3. Limitations become apparent at scale
  4. Funding dries up, researchers move on
  5. Quiet research continues, laying groundwork…


Anything familiar about this pattern? 😅

Maybe this time is different? 🤷🏻‍♂️

Neural networks: rise, fall, and rise again 🧠

Early neural networks

Inspired by the brain

  • 1943: McCulloch and Pitts propose artificial neurons
    • Simple binary units that mimic biological neurons
  • 1958: Frank Rosenblatt invents the Perceptron
    • A simple neural network that could learn to classify patterns
    • Trained by adjusting weights based on errors
  • The New York Times (1958) predicted it would “walk, talk, see, write, reproduce itself”
  • Instead of programming rules, let the machine learn from examples
  • However, single-layer perceptrons couldn’t solve certain simple problems, so interest waned
  • Multi-layer networks could solve these problems, but training them was hard
  • It took nearly 20 years for neural networks to make their comeback
  • How does an artificial neuron work?
  • Each input (\(x_i\)) (e.g., pixel brightness) is multiplied by a weight (\(w_i\)) to increase or decrease the importance of that signal, then summed with a bias (\(b\)) (a threshold that determines when the neuron “fires”): \(z = \sum x_i \cdot w_i + b\)
  • The result passes through an activation function (non-linear transformation) to produce the output: \(y = f(z)\)
  • Without the activation function, the network would be just a linear model!
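The weighted sum and activation described above can be sketched in a few lines of Python. Everything here (the step activation, the hand-picked weights implementing a logical AND) is a toy illustration, not production code:

```python
# A single artificial neuron: weighted sum of inputs plus bias,
# passed through a non-linear activation function.

def step(z):
    # Threshold activation: the neuron "fires" (1) if z is non-negative, else 0.
    return 1 if z >= 0 else 0

def neuron(inputs, weights, bias):
    # z = sum(x_i * w_i) + b
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return step(z)

# Hand-picked weights and bias so the neuron computes logical AND
# of two binary inputs (chosen for illustration, not learned):
and_weights, and_bias = [1.0, 1.0], -1.5
print(neuron([1, 1], and_weights, and_bias))  # fires: 1
print(neuron([1, 0], and_weights, and_bias))  # does not fire: 0
```

A perceptron is exactly this unit plus a rule for adjusting the weights whenever the output is wrong.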

Perceptron diagram

Source: AI Mind

The backpropagation breakthrough (1986)

Learning to learn

  • 1986: Rumelhart, Hinton, and Williams popularise backpropagation
  • You can train multi-layer networks by propagating errors backwards through the layers
  • It lets networks fit highly non-linear functions by making many small, precise adjustments to the weights
  • Finally, neural networks could learn complex patterns!
  • Renewed interest in connectionism
  • But challenges remained:
    • Limited computing power
    • Not enough training data
  • Neural networks showed promise but couldn’t yet compete with other methods
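To make the idea concrete, here is backpropagation on the smallest possible network: one input, one hidden sigmoid unit, one output. The input, target, initial weights, and learning rate are toy choices for illustration; the chain-rule steps are the point:

```python
import math

# Tiny two-layer network: x --w1--> h = sigmoid(w1*x) --w2--> y_hat = w2*h,
# with squared-error loss (y_hat - y)^2. Toy values throughout.

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x, y = 1.0, 0.0        # one training example (input, target)
w1, w2 = 0.5, -0.3     # initial weights, chosen arbitrarily
lr = 0.1               # learning rate

for _ in range(100):
    # Forward pass
    h = sigmoid(w1 * x)
    y_hat = w2 * h
    loss = (y_hat - y) ** 2

    # Backward pass: propagate the error backwards through the layers
    d_yhat = 2 * (y_hat - y)       # dL/dy_hat
    d_w2 = d_yhat * h              # dL/dw2
    d_h = d_yhat * w2              # dL/dh (error flows back through w2)
    d_w1 = d_h * h * (1 - h) * x   # chain rule; sigmoid'(z) = h * (1 - h)

    w1 -= lr * d_w1
    w2 -= lr * d_w2

print(loss)  # the loss shrinks toward 0 as the weights adjust
```

Real networks have millions of weights, but the mechanism is the same: compute each weight's contribution to the error via the chain rule, then nudge it in the opposite direction.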

Backpropagation diagram

Source: Medium

The data revolution 📊

The unreasonable effectiveness of data

A paradigm shift

  • 2009: Google researchers (Halevy, Norvig, Pereira) publish a landmark paper called “The Unreasonable Effectiveness of Data”
  • Simple models with lots of data beat complex models with less data
  • Traditional AI: Design clever algorithms
  • New AI: Feed massive amounts of data to simple learning algorithms

“Now go out and gather some data, and see what it can do.”

  • This insight transformed the field
  • Data became the new oil, together with computing power and better algorithms
  • Any one of them alone wasn’t enough!

More data beats better algorithms

Source: Google Research

The ImageNet moment (2012)

Deep learning arrives

  • ImageNet: A dataset of 14 million labeled images
  • Annual competition: Classify images into 1,000 categories
  • 2012: AlexNet (Krizhevsky, Sutskever, Hinton) wins by a huge margin
    • Error rate: 15.3% (next best: 26.2%)
    • Used deep neural networks
    • Trained on Graphics Processing Units (GPUs), which handle parallel processing far better than Central Processing Units (CPUs)
  • This was a great moment for deep learning!
  • Within years, deep learning dominated computer vision
  • Proved that neural networks + big data + GPUs = breakthrough

Ilya Sutskever, Alex Krizhevsky, and Geoffrey Hinton

Source: Medium

Transformers: the architecture that changed everything ⚡

Attention is all you need (2017)

The transformer revolution

  • 2017: Google researchers publish “Attention Is All You Need” (Vaswani et al.)
  • Self-attention mechanism
  • Instead of processing sequentially:
    • Each word can “look at” every other word directly
    • Computes relevance scores between all pairs
    • Parallel processing: Much faster to train
  • This architecture powers GPT (Generative Pre-trained Transformer), BERT, and pretty much all modern AI
  • Beyond text: Now used in images, audio, video, proteins, games…

The paper that started it all

How transformers work (simplified!)

The big picture

Transformer architecture overview

Source: Hivenet

A transformer has three main parts:

  1. Embedding: Turn words into numbers
  2. Transformer Blocks: Mix and refine information (the magic happens here!)
  3. Output: Predict the next word

Step 1: Tokenisation

Breaking text into pieces

  • Text must be converted to numbers
  • Tokenisation: Split text into tokens
    • Usually words, letters, or word pieces
    • “unhappiness” → “un”, “happiness”
    • GPT-2 has a vocabulary of 50,257 tokens
  • Each token gets a unique ID number
  • Think of it as creating a dictionary where each word/piece has a code
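A toy sketch of the idea in Python. Real tokenisers such as GPT-2’s byte-pair encoder learn their vocabulary from data and split words into sub-word pieces; this one is hand-made and splits on spaces, just to show the text-to-IDs mapping:

```python
# A toy tokeniser: a hand-made "dictionary" mapping each word/piece to an ID.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "un": 5, "happiness": 6}

def tokenise(text):
    # Split on whitespace and look up each piece's unique ID.
    return [vocab[word] for word in text.lower().split()]

print(tokenise("The cat sat on the mat"))  # [0, 1, 2, 3, 0, 4]
```

Note how “the” appears twice and maps to the same ID both times: the vocabulary is fixed, and the same piece always gets the same code.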

Tokenisation diagram

Source: Medium

Step 2: Embeddings

Words as points in space

  • Each token becomes a vector (a list of numbers)
  • GPT-2: Each token → 768 numbers
  • These vectors capture meaning:
    • Similar words are close together
    • “king” - “man” + “woman” ≈ “queen”
  • Also add position information:
    • Word order matters in language!
    • “Dog bites man” ≠ “Man bites dog”
  • These embeddings are learned during training
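The famous king/queen analogy can be checked with toy vectors. These 3-dimensional embeddings are hand-made for illustration (real models learn hundreds of dimensions from data); the dimensions loosely encode [royalty, masculinity, femininity]:

```python
import math

# Hand-made toy embeddings (GPT-2 learns 768-dimensional vectors).
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    # Cosine similarity: 1 means "pointing the same way" in embedding space.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# "king" - "man" + "woman" should land closest to "queen":
target = add(sub(emb["king"], emb["man"]), emb["woman"])
closest = max(emb, key=lambda w: cosine(emb[w], target))
print(closest)
```

The arithmetic works because the vectors encode meaning geometrically: subtracting “man” removes masculinity, adding “woman” adds femininity, and royalty is untouched.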

Words plotted in 3-dimensional space. Embeddings can have hundreds or thousands of dimensions, too many for humans to visualise

Source: Google Cloud

Step 3: Self-attention

The key innovation

  • The core mechanism that makes transformers powerful
  • For each word, ask: “Which other words are relevant to understanding me?”
  • Creates three vectors for each token:
    • Query (Q): “What am I looking for?” (does “it” refer to “animal” or “street”?)
    • Key (K): “What do I contain?” (check the characteristics of “it” as a token)
    • Value (V): “What information can I provide?” (the answer: “ah, it’s the animal!”)
  • Attention score = how well Query matches Key
  • Like a search engine where each word searches for relevant context
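Scaled dot-product attention is short enough to write out in plain Python. In a real transformer, Q, K, and V come from learned linear projections of the token embeddings; here they are hand-made 2-dimensional vectors so the mechanics are visible:

```python
import math

def softmax(xs):
    # Turn raw scores into probabilities that sum to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d = len(K[0])
    out = []
    for q in Q:
        # Score: how well this query matches each key (scaled by sqrt(d)).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Output: mix of the value vectors, weighted by attention.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three tokens; the first query matches the second key strongly.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[0.0, 1.0], [4.0, 0.0], [0.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = attention(Q, K, V)
print([round(x, 2) for x in out[0]])  # mostly the second value vector
```

Every token attends to every other token in one pass, which is why transformers parallelise so well compared with sequential models.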

Self-attention: Words attend to each other

Source: Jay Alammar

Self-attention example

“The cat sat on the mat”

Source: Thomas Wiecki

  • When processing “sat”, the model might attend strongly to:
    • “cat” (who is sitting?)
    • “mat” (where are they sitting?)
  • Multi-head attention: Multiple attention patterns in parallel (division of labour again!)
    • One head might focus on subject-verb relationships
    • Another on adjective-noun relationships
    • GPT-2 uses 12 attention heads

Step 4: Feed-forward and stacking

Building deeper understanding

  • After attention, each token passes through a feed-forward network (FFN, Multi-Layer Perceptron, MLP)
  • Attention is about “looking around” to find connections between words
  • FFN is about “thinking” about what those connections actually mean for each individual word
  • This refines and transforms the representation
  • Models stack many blocks:
    • GPT-2 (small): 12 blocks
    • GPT-3: 96 blocks
    • GPT-4: Rumoured to be much larger
  • Each block builds more abstract understanding
    • Early layers: Grammar, syntax
    • Later layers: Meaning, context, reasoning
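The attention-then-FFN structure of one block can be sketched as follows. This is a heavily simplified illustration: layer normalisation is omitted, the attention step is replaced by a simple averaging stand-in, and all weights are hand-made rather than learned:

```python
def relu(xs):
    # Non-linear activation applied element-wise.
    return [max(0.0, x) for x in xs]

def feed_forward(x, W1, W2):
    # Two linear layers with a non-linearity in between, applied per token.
    h = relu([sum(xi * w for xi, w in zip(x, col)) for col in W1])
    return [sum(hi * w for hi, w in zip(h, col)) for col in W2]

def transformer_block(tokens, attend, W1, W2):
    # 1) "Look around": self-attention, with a residual connection.
    attended = attend(tokens)
    tokens = [[a + t for a, t in zip(av, tv)]
              for av, tv in zip(attended, tokens)]
    # 2) "Think": feed-forward network, with a residual connection.
    return [[f + t for f, t in zip(feed_forward(tv, W1, W2), tv)]
            for tv in tokens]

# Toy stand-in for attention: every token just averages the whole sequence.
def mean_attend(tokens):
    n = len(tokens)
    mean = [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]
    return [mean for _ in tokens]

W1 = [[1.0, 0.0], [0.0, 1.0]]  # hand-made 2x2 weights, for illustration
W2 = [[1.0, 0.0], [0.0, 1.0]]
out = transformer_block([[1.0, 0.0], [0.0, 1.0]], mean_attend, W1, W2)
print(out)
```

Stacking dozens of such blocks, each with its own learned weights, is what lets the later layers build on the patterns found by the earlier ones.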

One transformer block

Source: Mohamed Traore

Step 5: Predicting the next token

The final output

  • After all blocks, the model predicts: “What comes next?”
  • Outputs a probability for every token in vocabulary
  • Example: “The cat sat on the…”
    • “floor”: 27.4%
    • “bed”: 22.5%
    • “couch”: 17.8%
  • Temperature controls randomness:
    • Low (0.2): More predictable
    • High (1.0+): More creative/random
  • The chosen token is added, and the process repeats
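The effect of temperature is easy to see in code: dividing the scores (“logits”) by the temperature before the softmax makes the distribution peakier (low temperature) or flatter (high temperature). The logits below are made up for illustration:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Scale the logits, then convert to probabilities.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.8, 1.6]  # e.g. scores for "floor", "bed", "couch"
cold = softmax_with_temperature(logits, 0.2)  # low: top choice dominates
hot = softmax_with_temperature(logits, 2.0)   # high: nearly uniform
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```

At low temperature the model almost always picks “floor”; at high temperature all three options become plausible, which is where the “creative” behaviour comes from.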

Step 5: Predicting the next token

The final output

Probability distribution over vocabulary

Source: Transformer Explainer

From GPT to ChatGPT 🚀

The rise of large language models

Scaling up

Model   Year   Parameters   Notable Achievement
GPT-1   2018   117M         Showed pre-training works
BERT    2018   340M         Revolutionised NLP benchmarks
GPT-2   2019   1.5B         “Too dangerous to release”
GPT-3   2020   175B         Few-shot learning emergence
GPT-4   2023   ~1.7T?       Multimodal, near-human reasoning
GPT-5   2025   635B?        More efficient, even better reasoning


  • Bigger models + more data = emergent abilities
  • Capabilities appear at scale that weren’t explicitly trained
  • This is the “scaling hypothesis” (although some people, like Ilya Sutskever, argue that scaling has reached its limits)

What makes ChatGPT different?

Beyond just scaling

ChatGPT’s explosive growth

Source: Voronoi

Multimodal AI

Beyond text

  • Modern AI isn’t just about text anymore
  • Multimodal models can process:
    • Text ↔︎ Images (DALL-E, Midjourney)
    • Text ↔︎ Audio (Whisper, ElevenLabs)
    • Text ↔︎ Video (Sora, Runway)
    • Text ↔︎ Code (Codex, Copilot)
  • Same transformer architecture, different inputs/outputs
  • Vision Transformers (ViT): Treat images as sequences of patches
  • The boundaries between modalities are blurring

Multimodal AI: Understanding multiple types of data

Source: Tarun Sharma

Takeaways 📚

What changed? What stayed the same?

Continuity and transformation

What changed:

  • From human-designed rules to learned patterns
  • From narrow tasks to general capabilities
  • From mimicking human reasoning to optimising for prediction
  • From scarce data to abundant data

What stayed the same:

  • The dream of intelligent machines
  • The question: Can machines truly understand the world?
  • The challenges: Brittleness, bias, lack of common sense
  • The hype cycles: Overpromise, underdeliver (are we in one now?)

Interactive resource

Explore transformers yourself!

  • Georgia Tech’s Transformer Explainer
  • Interactive visualisation of how transformers work
  • See attention patterns in real time
  • Experiment with temperature and sampling
  • Runs GPT-2 directly in your browser!
  • Great for building intuition

https://poloclub.github.io/transformer-explainer

… and that’s all for today! 🎉

See you all soon! 😊