DATASCI 185: Introduction to AI Applications

Lecture 07: How Machines See and Hear

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 🤓

Recap of last class

  • Last time, we saw how LLMs process text
  • Words become numbers! (embeddings)
  • These numbers capture meaning, and similar words are close together
  • Remember: king − man + woman ≈ queen?
  • The model learns which words tend to appear near each other
  • What about images and sounds? 🖼️🎵
  • Spoiler: They become numbers too!

Source: DeepSet AI
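The "words become numbers" recap can be sketched in a few lines of Python (the 2-D vectors below are invented for illustration; real embeddings have thousands of dimensions):

```python
# Toy 2-D "embeddings" (made up for illustration).
king  = [0.9, 0.8]
man   = [0.9, 0.1]
woman = [0.1, 0.1]
queen = [0.1, 0.8]

# king - man + woman, component by component:
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(result)  # lands (approximately) on queen's vector
```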

Lecture overview

Today’s agenda

Part 1: How Machines “See”

  • What computers actually see (again: just numbers!)
  • How AI learns to recognise objects
  • Fun activity: Train your own image classifier!

Part 2: How Machines “Hear”

  • Turning invisible sound waves into pictures
  • How Siri and Alexa understand speech

Part 3: The Big Picture

  • The main insight: Everything becomes the same kind of numbers!

Part 4: What This Means for Society

  • Can you trust what you see and hear?
  • Deepfakes, bias, and privacy
  • Class discussion: Where do we draw the line?

Tweet of the day 😄

Power to the people! ✊🏻

How machines “see” images 🔍

Discussion: Describe this to a computer 🤔

Imagine you need to describe a photo to someone who can ONLY understand numbers.

How would you do it?

Take 1 minute to discuss with your neighbour! ⏱️

  • What information would you include?
  • How would you represent colours? Shapes?
  • What makes this hard?

Just like text needs to become numbers, images need to become numbers too!

And it turns out there’s a nice way to do this…

This picture is unrelated to the class. It’s here just because the cat is cute! 😄

What computers actually see

It’s just a grid of numbers!

  • A digital photo is just a grid of tiny coloured squares (pixels)
  • Each pixel has three numbers:
    • Red: 0 (none) to 255 (brightest red)
    • Green: 0 to 255
    • Blue: 0 to 255
  • Each number tells the computer how much light is at that pixel
  • Mix them together → millions of colours!
  • A small 3×3 grayscale image (which has only one channel) could look like this:
[[255, 128, 0],
 [64, 200, 150],
 [0, 50, 255]]
  • A typical photo might have millions of pixels
  • That’s millions of numbers for the AI to process
  • But here’s the problem: Raw numbers don’t tell you “this is a cute cat with a hat”, and processing millions of them one by one is inefficient! 😺
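Here is the "grid of numbers" idea in plain Python, using the 3×3 example above (the RGB-to-grayscale weights come from the ITU-R BT.601 standard):

```python
# The 3x3 grayscale image from the slide: 0 = black, 255 = white.
image = [[255, 128, 0],
         [64, 200, 150],
         [0, 50, 255]]

# All the computer can do is arithmetic on these numbers, e.g. brightness:
mean_brightness = sum(sum(row) for row in image) / 9
print(round(mean_brightness, 1))  # 122.4

# A colour pixel is three such numbers; the standard (ITU-R BT.601) way to
# collapse RGB into one grayscale value weights green most heavily:
r, g, b = 200, 100, 50
gray = 0.299 * r + 0.587 * g + 0.114 * b
```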

Apple screen under a microscope

Each pixel = three numbers (Red, Green, Blue)

Your screen is showing millions of these right now!

Source: Reddit

How AI learns to see

Like learning to read!

  • Think about how you learned to read:
  1. First, you learned to recognise letters
  2. Then you combined letters into words
  3. Then words into sentences
  4. Then sentences into meaning
  • AI vision works the same way!
  1. First, detect simple edges and colours
  2. Combine edges into shapes and textures
  3. Combine shapes into parts (eyes, wheels, petals)
  4. Combine parts into whole objects (cat, car, flower)
  • Each layer builds on the previous one 🧱

Feature hierarchy: edges → textures → parts → objects

Source: Towards AI

The convolution operation

A “sliding magnifying glass”

  • Convolution: A small filter “slides” across the image
  • At each position, it performs element-wise multiplication and sums the result
  • Different filters detect different features:
    • Vertical edges, horizontal edges
    • Corners, textures, gradients
  • The filter learns what to look for during training!
  • Output: A feature map showing where that feature appears

Analogy: Like using a stencil to find specific patterns 🔍
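The sliding-filter idea can be written out in pure Python (no padding, stride 1; the filter and tiny image below are made up for illustration):

```python
# A minimal sketch of 2D convolution: slide a kernel over the image,
# multiply element-wise at each position, and sum the result.
def convolve(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = sum(image[i + a][j + b] * kernel[a][b]
                        for a in range(kh) for b in range(kw))
            row.append(total)
        output.append(row)
    return output

# A vertical-edge filter: responds where brightness changes left to right.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

# Dark left half, bright right half -> strong response everywhere the
# filter straddles the boundary.
image = [[0, 0, 255, 255],
         [0, 0, 255, 255],
         [0, 0, 255, 255],
         [0, 0, 255, 255]]
print(convolve(image, vertical_edge))  # the feature map
```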

Convolution: filter sliding over image

Source: vdumoulin/conv_arithmetic

Feature hierarchies

From edges to objects

  • CNNs stack multiple convolutional layers
  • Each layer builds on the previous one:
    • Layer 1: Simple edges and colours
    • Layer 2: Textures and corners
    • Layer 3: Parts (eyes, wheels, leaves)
    • Layer 4+: Whole objects and scenes
  • This is called a feature hierarchy
  • The network learns to compose simple features into complex concepts
  • Similar to how our visual cortex works!

Feature extraction performed over the image of a lion

Source: Towards Data Science

The full CNN architecture

Putting it all together

What AI learns to see in different layers
  1. Input: Raw image (e.g., 224×224×3)
  2. Convolutional layers: Extract features (edges → textures → objects)
  3. Fully connected layers: Combine features for final decision
  4. Output: Class probabilities (e.g., 95% woman, 5% man)
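One way to make this pipeline concrete is to track how the image's size changes layer by layer (the layer sizes below are illustrative, not from any specific model):

```python
# Track sizes through a toy CNN: 3x3 convs with padding 1, 2x2 pooling.
def conv_out(size, kernel=3, stride=1, padding=1):
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window=2):
    return size // window

size, channels = 224, 3          # input: a 224x224x3 raw image
size = pool_out(conv_out(size))  # conv keeps 224x224, pooling -> 112
channels = 32                    # but now 32 feature maps instead of 3
size = pool_out(conv_out(size))  # -> 56
channels = 64

flattened = size * size * channels  # what the fully connected layers see
print(size, channels, flattened)
```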

Vision Transformers (ViT)

“An Image is Worth 16×16 Words”

  • In 2020, researchers asked: Can we use Transformers for images?
  • Key idea: Treat image patches like tokens!
  • The process:
    1. Divide image into fixed-size patches (e.g., 16×16 pixels)
    2. Flatten each patch into a vector
    3. Add positional embeddings (so model knows patch locations)
    4. Process through Transformer (same attention mechanism as LLMs, so patches can “look at” other patches to understand context)
  • Same architecture for text AND images
  • Foundation for multimodal models like GPT-4V and Gemini
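Steps 1-2 above (divide and flatten) can be sketched on a toy 4×4 "image" with 2×2 patches (real ViTs use 16×16 patches on full colour images):

```python
# Split an image into fixed-size patches and flatten each into a vector,
# so every patch becomes a "token" the Transformer can attend over.
def patchify(image, patch=2):
    patches = []
    for i in range(0, len(image), patch):
        for j in range(0, len(image[0]), patch):
            vec = [image[i + a][j + b]
                   for a in range(patch) for b in range(patch)]
            patches.append(vec)
    return patches

image = [[ 1,  2,  3,  4],
         [ 5,  6,  7,  8],
         [ 9, 10, 11, 12],
         [13, 14, 15, 16]]
print(patchify(image))  # 4 patches, each flattened to a 4-number vector
```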

Vision Transformer architecture

Attention

Source: Dosovitskiy et al. (2020)

CLIP: Connecting images and text

The bridge to multimodal AI

  • CLIP (Contrastive Language-Image Pre-training) by OpenAI
  • Trained on 400 million image-text pairs from the internet
  • Key innovation: Shared embedding space for images AND text
  • How it works:
    • Image encoder → image embedding
    • Text encoder → text embedding
    • Train so matching pairs are close together
  • Result: Can match images to text descriptions without task-specific training
  • Foundation for DALL-E, Stable Diffusion, and multimodal LLMs!
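The "shared embedding space" idea can be illustrated with cosine similarity (all vectors below are invented for the example; real CLIP embeddings have hundreds of dimensions):

```python
import math

# In CLIP's shared space, "how well does this caption match this image?"
# is just a vector comparison.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm

image_of_dog = [0.9, 0.1, 0.2]  # hypothetical image embedding
text_a_dog   = [0.8, 0.2, 0.1]  # hypothetical text embedding: "a dog"
text_a_plane = [0.1, 0.9, 0.7]  # hypothetical text embedding: "a plane"

print(cosine_similarity(image_of_dog, text_a_dog))    # high
print(cosine_similarity(image_of_dog, text_a_plane))  # low
```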

CLIP learns to match images with their text descriptions

Source: OpenAI CLIP

What can AI do with images?

Three main tasks

Classification vs Detection vs Segmentation
Task Question Real-world Example
Classification “What is this?” Instagram knowing your photo is a selfie
Detection “What and where?” Your phone camera finding faces
Segmentation “Which pixels are what?” iPhone’s Portrait Mode blurring backgrounds

Activity time! 🎮

Train your own AI!

Teachable Machine demo

Let’s train an image classifier: no coding required!

  1. Go to teachablemachine.withgoogle.com
  2. Click “Get Started” → “Image Project” → “Standard”
  3. Create 2-3 classes (e.g., “thumbs up”, “thumbs down”, “peace sign”)
  4. Record ~12 examples of each using your webcam (the website records 3 at a time)
  5. Click “Train Model” (takes about 10 seconds)
  6. Test it live!

Try this:

  • What happens if you show it something it wasn’t trained on?
  • Can you “fool” your model?

Teachable Machine interface

How machines “hear” audio 🎵

Sound is invisible…so how do we process it?

Sound is just vibrations in the air. We can’t see it!

But here’s a clever trick:

  1. Record the vibrations as a waveform (line going up and down)
  2. Transform the waveform into a picture called a spectrogram
  3. Now we can use the same AI that understands images!

It’s like creating a “photograph” of sound 📸🎵

This is why modern AI is so powerful! We can turn anything into pictures or numbers and use similar techniques!
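Step 2 of that trick (waveform → frequencies) can be sketched with a naive discrete Fourier transform in pure Python (real systems use the much faster FFT):

```python
import cmath, math

# A toy "window" of sound: a pure sine wave with 8 cycles per window.
N = 64
freq = 8
window = [math.sin(2 * math.pi * freq * n / N) for n in range(N)]

# Naive DFT: for each candidate frequency k, ask "how much of k is in
# this window?" One column of a spectrogram is exactly this.
def dft_magnitudes(samples):
    n_samples = len(samples)
    return [abs(sum(samples[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                    for n in range(n_samples)))
            for k in range(n_samples // 2)]

mags = dft_magnitudes(window)
print(mags.index(max(mags)))  # the peak sits at our frequency: 8
```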

Waveform vs spectrogram vs mel-spectrogram

Top: Waveform (raw sound)

Middle: Spectrogram (sound as an image!)

Bottom: Mel-spectrogram (adjusted for human hearing, with more resolution for lower frequencies and less for higher frequencies)

Source: Bäckström et al. (2026)

What the AI “sees” in your voice

In a spectrogram:

  • Horizontal axis: Time (left to right)
  • Vertical axis: Pitch (low notes at bottom, high at top)
  • Colour/brightness: How loud that frequency is

What patterns can AI find?

  • Your unique voice “fingerprint”
  • The difference between “cat” and “bat”
  • Emotion (are you happy? angry? tired?)
  • Whether you’re speaking or singing
  • What language you’re using

Fun fact: Dogs, cats, and humans all have distinctive spectrogram patterns! 🐕🐱👤

Mel spectrogram of speech. The first row is by an individual with high-pitched voice, the second row is by an individual with low-pitched voice

Source: Schnupp et al. (2012)

Activity: See your own voice! 🎤

Spectrograms in real-time

Try this later:

  1. Go to musiclab.chromeexperiments.com/Spectrogram
  2. Allow microphone access
  3. Watch what happens when you:
    • Hum a low note vs a high note
    • Say “aaaah” vs “eeeeh” vs “ooooh”
    • Whistle
    • Snap your fingers or clap

What to notice:

  • Low sounds appear at the bottom, high sounds at the top
  • Vowels create stable horizontal bands
  • Percussive sounds (claps) create vertical spikes
  • Your voice has a unique pattern, like a fingerprint!

Spectrogram of a drum machine

musiclab.chromeexperiments.com/Spectrogram

Try making different sounds and watch the patterns!

Whisper: How Siri and Alexa understand you

Whisper is OpenAI’s speech recognition system:

  • Trained on 680,000 hours of audio from the internet
  • Understands 99 languages!
  • Can handle accents, background noise, different speaking styles
  • Completely open source and free

How it works:

  1. Slice: Break audio into short windows (typically 25ms)
  2. FFT: Convert each window from waveform to frequencies (how much of each frequency is present)
  3. Mel mapping: Apply mel scale to match human perception
  4. Stack: Create 2D spectrogram from all windows
  5. Process: Use Transformers, same as text and images!
  6. Context: Use context (attention) to predict next tokens

This powers much of what Siri, Alexa, and Google Assistant do!
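Step 3 (the mel mapping) uses a standard formula, mel = 2595 · log10(1 + f/700); a quick sketch shows how it compresses high frequencies to match human hearing:

```python
import math

# Standard Hz -> mel conversion: equal steps in mel roughly match equal
# steps in perceived pitch.
def hz_to_mel(f):
    return 2595 * math.log10(1 + f / 700)

# The same 1000 Hz step shrinks in mel space as frequency rises, which is
# why mel-spectrograms give low frequencies more resolution.
print(hz_to_mel(1000) - hz_to_mel(0))     # big jump
print(hz_to_mel(8000) - hz_to_mel(7000))  # much smaller jump
```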

Activity: Test speech recognition! 🎤

Let’s try the Whisper demo:

huggingface.co/spaces/openai/whisper

Experiments to try at home:

  1. Record yourself speaking normally
  2. Try speaking with an accent
  3. Record with background noise (music, talking)
  4. Try a different language if you speak one!
  5. Speak very fast or very slow

Observe:

  • What does it get right? Wrong?
  • Does it understand your accent?
  • What about punctuation—who decides where sentences end?

Hugging Face Whisper demo

AI can compose music now 🎵

Suno.ai and AI music

Suno.ai generates complete songs from text prompts:

  • Give it a description: “upbeat pop song about studying for exams”
  • It creates: melody, harmony, rhythm, AND vocals!
  • A full song in about 30 seconds

How does it work?

  • Trained on millions of hours of music
  • Learns patterns: chord progressions, song structures, vocal styles
  • Uses the same “sound → numbers → AI” pipeline we discussed

The questions this raises:

  • Anyone can now create professional-sounding music
  • But: Who owns AI-generated music?
  • Some AI songs sound eerily similar to real artists
  • Musicians worry about their livelihoods

Suno AI music generation

suno.ai — Try generating your own song!

Discussion: If AI creates a song that sounds like Taylor Swift, is that copying? Should it be legal?

The Big Picture 🌐

The main idea: Everything becomes numbers

  • So… whether it’s text, images, or audio…
  • Everything gets converted to embeddings!
  • Same mathematical representation:
    • Text token → 4096-dimensional vector
    • Image patch → 4096-dimensional vector
    • Audio segment → 4096-dimensional vector
  • Once in embedding space, the LLM doesn’t know the original modality!
  • This enables true multimodal understanding

All modalities converge to embeddings

Text, images, and audio all become the same kind of numbers—then AI can understand them together!
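A toy illustration of that last point: once everything is a vector of the same length, downstream code literally cannot tell the modalities apart (the dimension and values below are made up):

```python
DIM = 8  # real models use thousands of dimensions

text_token  = [0.1] * DIM  # pretend: embedding of the word "cat"
image_patch = [0.2] * DIM  # pretend: embedding of a 16x16 image patch
audio_frame = [0.3] * DIM  # pretend: embedding of a 25 ms audio window

def looks_like_a_token(vec):
    # The model only checks the shape, not where the vector came from.
    return len(vec) == DIM

print(all(looks_like_a_token(v)
          for v in (text_token, image_patch, audio_frame)))  # True
```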

The three-part architecture

How multimodal LLMs work

Multimodal LLM architecture: Encoder → Projector → LLM
  1. Modality Encoder: Vision Transformer (images) or Whisper (audio) — pre-trained specialists
  2. Projection Layer: Aligns encoder outputs to LLM’s embedding space — often surprisingly simple!
  3. LLM Backbone: The “brain” (GPT, Gemini, Claude) — processes everything as tokens
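The projection layer (step 2) is often just a learned linear map from the encoder's dimension to the LLM's; here is a sketch with made-up dimensions and weights:

```python
# Illustrative dimensions: tiny encoder space -> tiny LLM space.
ENC_DIM, LLM_DIM = 4, 6

# A (LLM_DIM x ENC_DIM) weight matrix; in a real model these weights are
# learned during the alignment stage.
W = [[0.1 * (i + j) for j in range(ENC_DIM)] for i in range(LLM_DIM)]

def project(encoder_output):
    # Matrix-vector product: encoder space -> LLM embedding space.
    return [sum(w * x for w, x in zip(row, encoder_output)) for row in W]

image_embedding = [1.0, 0.5, -0.5, 2.0]
llm_token = project(image_embedding)
print(len(llm_token))  # 6 -- now it looks like any other LLM token
```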

Training multimodal models

Two-stage learning

Stage 1: Feature Alignment

  • Freeze vision encoder and LLM
  • Only train the neural network that projects the encoder outputs to the LLM’s embedding space
  • Goal: Learn that “dog image” and “dog text” should be close
  • Dataset: Millions of image-caption pairs

Stage 2: Instruction Tuning

  • Unfreeze the LLM
  • Train on question-answer datasets
  • Goal: Learn to follow complex multimodal instructions
  • Dataset: “Describe this image”, “What’s wrong with this chart?”
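The freeze/unfreeze logic of the two stages can be sketched schematically (the component names and flags below are illustrative, not a real training API):

```python
# Which parts of the model get updated in each stage?
model = {"vision_encoder": "frozen", "projector": "trainable", "llm": "frozen"}

# Stage 1 (feature alignment): only the projector learns.
trainable_stage1 = [name for name, state in model.items()
                    if state == "trainable"]

# Stage 2 (instruction tuning): unfreeze the LLM too.
model["llm"] = "trainable"
trainable_stage2 = [name for name, state in model.items()
                    if state == "trainable"]

print(trainable_stage1)  # ['projector']
print(trainable_stage2)  # ['projector', 'llm']
```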

Two-stage training process

Source: Sebastian Raschka

What this means for society ⚠️

Class discussion: Where do we draw the line? 🤔

In small groups, discuss this question:

Voice/image generation

Should AI-generated content require…

  • Watermarks that can’t be removed?
  • Disclosure that it’s AI-made?
  • Consent from people being depicted?
  • None of the above—free speech?

How would you enforce it?

Take 2 minutes, then we’ll share perspectives!

The bright side: Amazing applications

Application How it helps Impact
🏥 Medical imaging Detects cancer earlier than human doctors Saves lives
♿ Accessibility Describes images for blind users, transcribes for deaf users Independence
🌍 Language Whisper transcribes 99 languages Breaks barriers
🎬 Entertainment Dubbing films into any language with original actor’s voice Global storytelling
📚 Education AI tutors that can show and explain Personalised learning
🔬 Science Analysing microscope images, satellite data Accelerates discovery

The technology isn’t good or bad—it’s how we use it

Summary 📚

Main takeaways

  • Images = grids of numbers: AI learns to spot patterns, from edges to objects

  • Sound = pictures of vibrations: We turn audio into spectrograms and use image AI!

  • The big insight: Text, images, and audio all become embeddings—the same kind of numbers

  • Multimodal AI: ChatGPT, Claude, and Gemini can see images because everything speaks the same mathematical “language”

  • Hands-on: You trained your own AI with Teachable Machine!

  • Critical thinking: Deepfakes, bias, and privacy are real concerns we must address

… and that’s all for today! 🎉