DATASCI 185: Introduction to AI Applications

Lecture 07: How Machines See and Hear

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 🤓

Recap of last class

  • Last time, we saw how LLMs process text
  • Words become numbers! (embeddings)
  • These numbers capture meaning, and similar words are close together
  • Remember: king − man + woman ≈ queen?
  • The model learns which words tend to appear near each other
  • What about images and sounds? 🖼️🎵
  • Spoiler: They become numbers too!

Source: DeepSet AI
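The "words become numbers" recap can be sketched in a few lines of Python (the 2-D vectors below are invented for illustration; real embeddings have thousands of dimensions):

```python
# Toy 2-D "embeddings" (made up for illustration).
king  = [0.9, 0.8]
man   = [0.9, 0.1]
woman = [0.1, 0.1]
queen = [0.1, 0.8]

# king - man + woman, component by component:
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)  # lands (approximately) on queen's vector
```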

Lecture overview

Today’s agenda

Part 1: How Machines “See”

  • What computers actually see (again: just numbers!)
  • How AI learns to recognise objects
  • Fun activity: Train your own image classifier!

Part 2: How Machines “Hear”

  • Turning invisible sound waves into pictures
  • How Siri and Alexa understand speech

Part 3: The Big Picture

  • The main insight: Everything becomes the same kind of numbers!

Part 4: What This Means for Society

  • Can you trust what you see and hear?
  • Deepfakes, bias, and privacy
  • Class discussion: Where do we draw the line?

Tweet of the day 😄

Power to the people! ✊🏻

How machines “see” images 🔍

Discussion: Describe this to a computer 🤔

Imagine you need to describe a photo to someone who can ONLY understand numbers.

How would you do it?

Take 1 minute to discuss with your neighbour! ⏱️

  • What information would you include?
  • How would you represent colours? Shapes?
  • What makes this hard?

Just like text needs to become numbers, images need to become numbers too!

And it turns out there’s a nice way to do this…

This picture is unrelated to the class. It’s here just because the cat is cute! 😄

What computers actually see

It’s just a grid of numbers!

  • A digital photo is just a grid of tiny coloured squares (pixels)
  • Each pixel has three numbers:
    • Red: 0 (none) to 255 (brightest red)
    • Green: 0 to 255
    • Blue: 0 to 255
  • Each number tells the computer how much light is at that pixel
  • Mix them together → millions of colours!
  • A small 3×3 grayscale image (which has only one channel) could look like this:
[[255, 128, 0],
 [64, 200, 150],
 [0, 50, 255]]
  • A typical photo might have millions of pixels
  • That’s millions of numbers for the AI to process
  • But here’s the problem: Raw numbers don’t tell you “this is a cute cat with a hat”, and processing millions of them one by one is inefficient! 😺
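Here is the "grid of numbers" idea in plain Python, using the 3×3 example above (the RGB-to-grayscale weights come from the ITU-R BT.601 standard):

```python
# The 3x3 grayscale image from the slide: 0 = black, 255 = white.
image = [[255, 128, 0],
         [64, 200, 150],
         [0, 50, 255]]

# All the computer can do is arithmetic on these numbers, e.g. brightness:
mean_brightness = sum(sum(row) for row in image) / 9
print(round(mean_brightness, 1))  # 122.4

# A colour pixel is three such numbers; the standard (ITU-R BT.601) way to
# collapse RGB into one grayscale value weights green most heavily:
r, g, b = 200, 100, 50
gray = 0.299 * r + 0.587 * g + 0.114 * b
```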

Apple screen under a microscope

Each pixel = three numbers (Red, Green, Blue)

Your screen is showing millions of these right now!

Source: Reddit

How AI learns to see

Like learning to read!

  • Think about how you learned to read:
  1. First, you learned to recognise letters
  2. Then you combined letters into words
  3. Then words into sentences
  4. Then sentences into meaning
  • AI vision works the same way!
  1. First, detect simple edges and colours
  2. Combine edges into shapes and textures
  3. Combine shapes into parts (eyes, wheels, petals)
  4. Combine parts into whole objects (cat, car, flower)
  • Each layer builds on the previous one 🧱

Feature hierarchy: edges → textures → parts → objects

Source: Towards AI

The convolution operation

A “sliding magnifying glass”

  • Convolution: A small filter “slides” across the image
  • At each position, it performs element-wise multiplication and sums the result
  • Different filters detect different features:
    • Vertical edges, horizontal edges
    • Corners, textures, gradients
  • The filter learns what to look for during training!
  • Output: A feature map showing where that feature appears

Analogy: Like using a stencil to find specific patterns 🔍
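The sliding-filter idea can be written out in pure Python (no padding, stride 1; the filter and tiny image below are made up for illustration):

```python
# A minimal sketch of 2D convolution: slide a kernel over the image,
# multiply element-wise at each position, and sum the result.
def convolve(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = sum(image[i + a][j + b] * kernel[a][b]
                        for a in range(kh) for b in range(kw))
            row.append(total)
        output.append(row)
    return output

# A vertical-edge filter: responds where brightness changes left to right.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

# Dark left half, bright right half -> strong response everywhere the
# filter straddles the boundary.
image = [[0, 0, 255, 255],
         [0, 0, 255, 255],
         [0, 0, 255, 255],
         [0, 0, 255, 255]]
print(convolve(image, vertical_edge))  # the feature map
```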

Convolution: filter sliding over image

Source: vdumoulin/conv_arithmetic

Feature hierarchies

From edges to objects

  • CNNs stack multiple convolutional layers
  • Each layer builds on the previous one:
    • Layer 1: Simple edges and colours
    • Layer 2: Textures and corners
    • Layer 3: Parts (eyes, wheels, leaves)
    • Layer 4+: Whole objects and scenes
  • This is called a feature hierarchy
  • The network learns to compose simple features into complex concepts
  • Similar to how our visual cortex works!

Feature extraction performed over the image of a lion

Source: Towards Data Science

The full CNN architecture

Putting it all together

What AI learns to see in different layers
  1. Input: Raw image (e.g., 224×224×3)
  2. Convolutional layers: Extract features (edges → textures → objects)
  3. Fully connected layers: Combine features for final decision
  4. Output: Class probabilities (e.g., 95% woman, 5% man)
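One way to make this pipeline concrete is to track how the image's size changes layer by layer (the layer sizes below are illustrative, not from any specific model):

```python
# Track sizes through a toy CNN: 3x3 convs with padding 1, 2x2 pooling.
def conv_out(size, kernel=3, stride=1, padding=1):
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window=2):
    return size // window

size, channels = 224, 3          # input: a 224x224x3 raw image
size = pool_out(conv_out(size))  # conv keeps 224x224, pooling -> 112
channels = 32                    # but now 32 feature maps instead of 3
size = pool_out(conv_out(size))  # -> 56
channels = 64

flattened = size * size * channels  # what the fully connected layers see
print(size, channels, flattened)
```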

Vision Transformers (ViT)

“An Image is Worth 16×16 Words”

  • In 2020, researchers asked: Can we use Transformers for images?
  • Key idea: Treat image patches like tokens!
  • The process:
    1. Divide image into fixed-size patches (e.g., 16×16 pixels)
    2. Flatten each patch into a vector
    3. Add positional embeddings (so model knows patch locations)
    4. Process through Transformer (same attention mechanism as LLMs, so patches can “look at” other patches to understand context)
  • Same architecture for text AND images
  • Foundation for multimodal models like GPT-4V and Gemini
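Steps 1-2 above (divide and flatten) can be sketched on a toy 4×4 "image" with 2×2 patches (real ViTs use 16×16 patches on full colour images):

```python
# Split an image into fixed-size patches and flatten each into a vector,
# so every patch becomes a "token" the Transformer can attend over.
def patchify(image, patch=2):
    patches = []
    for i in range(0, len(image), patch):
        for j in range(0, len(image[0]), patch):
            vec = [image[i + a][j + b]
                   for a in range(patch) for b in range(patch)]
            patches.append(vec)
    return patches

image = [[ 1,  2,  3,  4],
         [ 5,  6,  7,  8],
         [ 9, 10, 11, 12],
         [13, 14, 15, 16]]
print(patchify(image))  # 4 patches, each flattened to a 4-number vector
```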

Vision Transformer architecture

Attention

Source: Dosovitskiy et al. (2020)

CLIP: Connecting images and text

The bridge to multimodal AI

  • CLIP (Contrastive Language-Image Pre-training) by OpenAI
  • Trained on 400 million image-text pairs from the internet
  • Key innovation: Shared embedding space for images AND text
  • How it works:
    • Image encoder → image embedding
    • Text encoder → text embedding
    • Train so matching pairs are close together
  • Result: Can match images to text descriptions without task-specific training
  • Foundation for DALL-E, Stable Diffusion, and multimodal LLMs!
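The "shared embedding space" idea can be illustrated with cosine similarity (all vectors below are invented for the example; real CLIP embeddings have hundreds of dimensions):

```python
import math

# In CLIP's shared space, "how well does this caption match this image?"
# is just a vector comparison.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm

image_of_dog = [0.9, 0.1, 0.2]  # hypothetical image embedding
text_a_dog   = [0.8, 0.2, 0.1]  # hypothetical text embedding: "a dog"
text_a_plane = [0.1, 0.9, 0.7]  # hypothetical text embedding: "a plane"

print(cosine_similarity(image_of_dog, text_a_dog))    # high
print(cosine_similarity(image_of_dog, text_a_plane))  # low
```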

CLIP learns to match images with their text descriptions

Source: OpenAI CLIP

What can AI do with images?

Three main tasks

Classification vs Detection vs Segmentation
Task Question Real-world Example
Classification “What is this?” Instagram knowing your photo is a selfie
Detection “What and where?” Your phone camera finding faces
Segmentation “Which pixels are what?” iPhone’s Portrait Mode blurring backgrounds

Activity time! 🎮

Train your own AI!

Teachable Machine demo

Let’s train an image classifier: no coding required!

  1. Go to teachablemachine.withgoogle.com
  2. Click “Get Started” → “Image Project” → “Standard”
  3. Create 2-3 classes (e.g., “thumbs up”, “thumbs down”, “peace sign”)
  4. Record ~12 examples of each using your webcam (the website records 3 at a time)
  5. Click “Train Model” (takes about 10 seconds)
  6. Test it live!

Try this:

  • What happens if you show it something it wasn’t trained on?
  • Can you “fool” your model?

Teachable Machine interface

How machines “hear” audio 🎵

Sound is invisible…so how do we process it?

Sound is just vibrations in the air. We can’t see it!

But here’s a clever trick:

  1. Record the vibrations as a waveform (line going up and down)
  2. Transform the waveform into a picture called a spectrogram
  3. Now we can use the same AI that understands images!

It’s like creating a “photograph” of sound 📸🎵

This is why modern AI is so powerful! We can turn anything into pictures or numbers and use similar techniques!
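Step 2 of that trick (waveform → frequencies) can be sketched with a naive discrete Fourier transform in pure Python (real systems use the much faster FFT):

```python
import cmath, math

# A toy "window" of sound: a pure sine wave with 8 cycles per window.
N = 64
freq = 8
window = [math.sin(2 * math.pi * freq * n / N) for n in range(N)]

# Naive DFT: for each candidate frequency k, ask "how much of k is in
# this window?" One column of a spectrogram is exactly this.
def dft_magnitudes(samples):
    n_samples = len(samples)
    return [abs(sum(samples[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                    for n in range(n_samples)))
            for k in range(n_samples // 2)]

mags = dft_magnitudes(window)
print(mags.index(max(mags)))  # the peak sits at our frequency: 8
```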

Waveform vs spectrogram vs mel-spectrogram

Top: Waveform (raw sound)

Middle: Spectrogram (sound as an image!)

Bottom: Mel-spectrogram (adjusted for human hearing, with more resolution for lower frequencies and less for higher frequencies)

Source: Bäckström et al. (2026)

What the AI “sees” in your voice

In a spectrogram:

  • Horizontal axis: Time (left to right)
  • Vertical axis: Pitch (low notes at bottom, high at top)
  • Colour/brightness: How loud that frequency is

What patterns can AI find?

  • Your unique voice “fingerprint”
  • The difference between “cat” and “bat”
  • Emotion (are you happy? angry? tired?)
  • Whether you’re speaking or singing
  • What language you’re using

Fun fact: Dogs, cats, and humans all have distinctive spectrogram patterns! 🐕🐱👤

Mel spectrogram of speech. The first row is by an individual with high-pitched voice, the second row is by an individual with low-pitched voice

Source: Schnupp et al. (2012)

Activity: See your own voice! 🎤

Spectrograms in real-time

Try this later:

  1. Go to musiclab.chromeexperiments.com/Spectrogram
  2. Allow microphone access
  3. Watch what happens when you:
    • Hum a low note vs a high note
    • Say “aaaah” vs “eeeeh” vs “ooooh”
    • Whistle
    • Snap your fingers or clap

What to notice:

  • Low sounds appear at the bottom, high sounds at the top
  • Vowels create stable horizontal bands
  • Percussive sounds (claps) create vertical spikes
  • Your voice has a unique pattern, like a fingerprint!

Spectrogram of a drum machine

musiclab.chromeexperiments.com/Spectrogram

Try making different sounds and watch the patterns!

Whisper: How Siri and Alexa understand you

Whisper is OpenAI’s speech recognition system:

  • Trained on 680,000 hours of audio from the internet
  • Understands 99 languages!
  • Can handle accents, background noise, different speaking styles
  • Completely open source and free

How it works:

  1. Slice: Break audio into short windows (typically 25ms)
  2. FFT: Convert each window from waveform to frequencies (how much of each frequency is present)
  3. Mel mapping: Apply mel scale to match human perception
  4. Stack: Create 2D spectrogram from all windows
  5. Process: Use Transformers, same as text and images!
  6. Context: Use context (attention) to predict next tokens

This powers much of what Siri, Alexa, and Google Assistant do!
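Step 3 (the mel mapping) uses a standard formula, mel = 2595 · log10(1 + f/700); a quick sketch shows how it compresses high frequencies to match human hearing:

```python
import math

# Standard Hz -> mel conversion: equal steps in mel roughly match equal
# steps in perceived pitch.
def hz_to_mel(f):
    return 2595 * math.log10(1 + f / 700)

# The same 1000 Hz step shrinks in mel space as frequency rises, which is
# why mel-spectrograms give low frequencies more resolution.
print(hz_to_mel(1000) - hz_to_mel(0))     # big jump
print(hz_to_mel(8000) - hz_to_mel(7000))  # much smaller jump
```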

Activity: Test speech recognition! 🎤

Let’s try the Whisper demo:

huggingface.co/spaces/openai/whisper

Experiments to try at home:

  1. Record yourself speaking normally
  2. Try speaking with an accent
  3. Record with background noise (music, talking)
  4. Try a different language if you speak one!
  5. Speak very fast or very slow

Observe:

  • What does it get right? Wrong?
  • Does it understand your accent?
  • What about punctuation—who decides where sentences end?

Hugging Face Whisper demo

AI can compose music now 🎵

Suno.ai and AI music

Suno.ai generates complete songs from text prompts:

  • Give it a description: “upbeat pop song about studying for exams”
  • It creates: melody, harmony, rhythm, AND vocals!
  • A full song in about 30 seconds

How does it work?

  • Trained on millions of hours of music
  • Learns patterns: chord progressions, song structures, vocal styles
  • Uses the same “sound → numbers → AI” pipeline we discussed

The questions this raises:

  • Anyone can now create professional-sounding music
  • But: Who owns AI-generated music?
  • Some AI songs sound eerily similar to real artists
  • Musicians worry about their livelihoods

Suno AI music generation

suno.ai — Try generating your own song!

Discussion: If AI creates a song that sounds like Taylor Swift, is that copying? Should it be legal?

The Big Picture 🌐

The main idea: Everything becomes numbers

  • So… whether it’s text, images, or audio…
  • Everything gets converted to embeddings!
  • Same mathematical representation:
    • Text token → 4096-dimensional vector
    • Image patch → 4096-dimensional vector
    • Audio segment → 4096-dimensional vector
  • Once in embedding space, the LLM doesn’t know the original modality!
  • This enables true multimodal understanding

All modalities converge to embeddings

Text, images, and audio all become the same kind of numbers—then AI can understand them together!
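A toy illustration of that last point: once everything is a vector of the same length, downstream code literally cannot tell the modalities apart (the dimension and values below are made up):

```python
DIM = 8  # real models use thousands of dimensions

text_token  = [0.1] * DIM  # pretend: embedding of the word "cat"
image_patch = [0.2] * DIM  # pretend: embedding of a 16x16 image patch
audio_frame = [0.3] * DIM  # pretend: embedding of a 25 ms audio window

def looks_like_a_token(vec):
    # The model only checks the shape, not where the vector came from.
    return len(vec) == DIM

print(all(looks_like_a_token(v)
          for v in (text_token, image_patch, audio_frame)))  # True
```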

The three-part architecture

How multimodal LLMs work

Multimodal LLM architecture: Encoder → Projector → LLM
  1. Modality Encoder: Vision Transformer (images) or Whisper (audio) — pre-trained specialists
  2. Projection Layer: Aligns encoder outputs to LLM’s embedding space — often surprisingly simple!
  3. LLM Backbone: The “brain” (GPT, Gemini, Claude) — processes everything as tokens
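The projection layer (step 2) is often just a learned linear map from the encoder's dimension to the LLM's; here is a sketch with made-up dimensions and weights:

```python
# Illustrative dimensions: tiny encoder space -> tiny LLM space.
ENC_DIM, LLM_DIM = 4, 6

# A (LLM_DIM x ENC_DIM) weight matrix; in a real model these weights are
# learned during the alignment stage.
W = [[0.1 * (i + j) for j in range(ENC_DIM)] for i in range(LLM_DIM)]

def project(encoder_output):
    # Matrix-vector product: encoder space -> LLM embedding space.
    return [sum(w * x for w, x in zip(row, encoder_output)) for row in W]

image_embedding = [1.0, 0.5, -0.5, 2.0]
llm_token = project(image_embedding)
print(len(llm_token))  # 6 -- now it looks like any other LLM token
```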

Training multimodal models

Two-stage learning

Stage 1: Feature Alignment

  • Freeze vision encoder and LLM
  • Only train the neural network that projects the encoder outputs to the LLM’s embedding space
  • Goal: Learn that “dog image” and “dog text” should be close
  • Dataset: Millions of image-caption pairs

Stage 2: Instruction Tuning

  • Unfreeze the LLM
  • Train on question-answer datasets
  • Goal: Learn to follow complex multimodal instructions
  • Dataset: “Describe this image”, “What’s wrong with this chart?”
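The freeze/unfreeze logic of the two stages can be sketched schematically (the component names and flags below are illustrative, not a real training API):

```python
# Which parts of the model get updated in each stage?
model = {"vision_encoder": "frozen", "projector": "trainable", "llm": "frozen"}

# Stage 1 (feature alignment): only the projector learns.
trainable_stage1 = [name for name, state in model.items()
                    if state == "trainable"]

# Stage 2 (instruction tuning): unfreeze the LLM too.
model["llm"] = "trainable"
trainable_stage2 = [name for name, state in model.items()
                    if state == "trainable"]

print(trainable_stage1)  # ['projector']
print(trainable_stage2)  # ['projector', 'llm']
```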

Two-stage training process

Source: Sebastian Raschka

What this means for society ⚠️

Class discussion: Where do we draw the line? 🤔

In small groups, discuss this question:

Voice/image generation

Should AI-generated content require…

  • Watermarks that can’t be removed?
  • Disclosure that it’s AI-made?
  • Consent from people being depicted?
  • None of the above—free speech?

How would you enforce it?

Take 2 minutes, then we’ll share perspectives!

The bright side: Amazing applications

Application How it helps Impact
🏥 Medical imaging Detects cancer earlier than human doctors Saves lives
♿ Accessibility Describes images for blind users, transcribes for deaf users Independence
🌍 Language Whisper transcribes 99 languages Breaks barriers
🎬 Entertainment Dubbing films into any language with original actor’s voice Global storytelling
📚 Education AI tutors that can show and explain Personalised learning
🔬 Science Analysing microscope images, satellite data Accelerates discovery

The technology isn’t good or bad—it’s how we use it

Summary 📚

Main takeaways

  • Images = grids of numbers: AI learns to spot patterns, from edges to objects

  • Sound = pictures of vibrations: We turn audio into spectrograms and use image AI!

  • The big insight: Text, images, and audio all become embeddings—the same kind of numbers

  • Multimodal AI: ChatGPT, Claude, and Gemini can see images because everything speaks the same mathematical “language”

  • Hands-on: You trained your own AI with Teachable Machine!

  • Critical thinking: Deepfakes, bias, and privacy are real concerns we must address

… and that’s all for today! 🎉