DATASCI 185: Introduction to AI Applications

Lecture 06: Language, Tokenisation, and Embeddings

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 🤓

Recap of last class

  • We explored model evaluation across two domains
  • Traditional ML metrics: Accuracy, precision, recall
  • The accuracy paradox: 99.9% accuracy can be useless!
  • Overfitting: When models memorise instead of learn
  • Cross-validation: More robust evaluation
  • LLM evaluation: Perplexity, LLM-as-a-Judge, G-Eval
  • Hallucinations and RAG for grounding answers
  • Today: How do LLMs actually understand language? 🤔

Lecture overview

Today’s agenda

Part 1: How LLMs “See” Text

  • The translation problem: Text to numbers
  • The processing pipeline

Part 2: Tokens and Tokenisation

  • What tokens are (hint: not always words!)
  • Tokenisation methods and quirks
  • Token limits and API costs

Part 3: Embeddings

  • Words as vectors in space
  • Semantic similarity and the king-queen example in detail

Part 4: Parameters

  • What are the billions of parameters?
  • Embeddings, weights, and biases
  • Temperature and creativity dials

Meme of the day

Source: Reddit r/Singularity

New study by Anthropic on how AI impacts coding skills

How LLMs “See” Text 👁️

The translation problem

LLMs don’t read English!

  • Here’s something you probably know by now: Computers only understand numbers
  • When you type “Hello, how are you?”, the LLM doesn’t see letters!
  • It sees something like: [15496, 11, 703, 527, 499, 30]
  • This creates a translation problem:
    • How do we convert text into numbers?
    • How do we preserve meaning in those numbers?
    • How do we capture relationships between words?
  • The solution involves two key concepts:
    • Tokenisation: Breaking text into pieces
    • Embeddings: Converting pieces into meaningful vectors

From text to output

The LLM processing pipeline

How prompting really works:

  1. Input text: “The cat sat on the mat”
  2. Tokenisation: Break into tokens → ["The", "cat", "sat", "on", "the", "mat"]
  3. Token IDs: Map to numbers → [464, 3797, 3332, 319, 262, 2603]
  4. Embeddings: Convert to vectors → Each token becomes a list of ~4,096 numbers!
  5. Processing: Pass through neural network layers
  6. Output: Generate next token probabilities
  7. Decoding: Convert back to text

Every word you type goes through this transformation!
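The seven steps above can be sketched as a toy simulation. Everything here is invented for illustration (the vocabulary, IDs, and "embeddings" are made up, and real tokenisers use subwords rather than `str.split`), but the shape of the pipeline is the same.

```python
# Toy version of the LLM pipeline: text -> tokens -> IDs -> vectors.
# All values below are invented for illustration, not real model data.

text = "The cat sat on the mat"

# Step 2: tokenisation (real tokenisers use subwords, not str.split)
tokens = text.split()                      # ['The', 'cat', 'sat', ...]

# Step 3: map tokens to integer IDs via a tiny made-up vocabulary
vocab = {"The": 464, "cat": 3797, "sat": 3332,
         "on": 319, "the": 262, "mat": 2603}
token_ids = [vocab[t] for t in tokens]     # [464, 3797, 3332, 319, 262, 2603]

# Step 4: look up an embedding vector for each ID (tiny 4-dim toy vectors;
# a real model would use ~4,096 learned numbers per token)
embeddings = {i: [((i * k) % 7) / 7.0 for k in range(1, 5)] for i in vocab.values()}
vectors = [embeddings[i] for i in token_ids]

print(token_ids)
print(len(vectors), "vectors of", len(vectors[0]), "numbers each")
```

Steps 5-7 (the neural network itself) are where the billions of parameters come in, which we cover later in the lecture.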

Source: NanoBanana

Why this matters for you

Practical implications

Understanding the pipeline helps you:

  • Write better prompts: Know what the model “sees”
  • Understand costs: API pricing is per token, not per word
  • Avoid surprises: Some words use more tokens than others
  • Debug issues: Why did the model misunderstand?
  • Use context wisely: Token limits are real constraints

Fun fact: The word “everything” is 1 token, but “ChatGPT” is 3 tokens in OpenAI’s (old) tokeniser! 🤯

Tokenisation isn’t always intuitive!

Source: OpenAI

Tokens and Tokenisation 🧩

What is a token?

The basic unit of LLM processing

  • A token is the basic unit that an LLM reads
  • Tokens are NOT always words!
  • A token can be:
    • A full word: “hello” → 1 token
    • Part of a word: “unhappiness” → “un” + “happiness” (2 tokens)
    • A single character: “!” → 1 token
    • A space: ” ” → often included with the next word
  • Rule of thumb: 1 token ≈ 4 characters in English
  • Or roughly: 100 tokens ≈ 75 words
  • Other languages often use more tokens per word
  • Let’s try it out with “Hello, it’s Danilo here!”
  • We’ll use OpenAI’s tokenizer for this example
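The rule of thumb above is easy to apply in code. This is only the ≈4-characters-per-token heuristic for English, not a real tokeniser; for exact counts you would use OpenAI's own tokeniser (e.g. the tiktoken library or the web tool below).

```python
# Rough token estimate using the "1 token ≈ 4 characters" rule of thumb.
# This is only a ballpark for English text; exact counts require the
# model's actual tokeniser (e.g. OpenAI's tiktoken library).

def estimate_tokens(text: str) -> int:
    return max(1, round(len(text) / 4))

prompt = "Hello, it's Danilo here!"
print(len(prompt), "characters ->", estimate_tokens(prompt), "tokens (estimate)")
```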

Why use tokens instead of words?

The clever engineering choice

Three approaches to breaking up text:

Method          | Example: “Evergreen” | Pros            | Cons
Word-based      | 1 token              | Intuitive       | Huge vocabulary
Character-based | 9 tokens             | Tiny vocabulary | Loses meaning
Subword (BPE)   | 2 tokens             | Best of both!   | Less intuitive


  • Modern LLMs use Byte Pair Encoding (BPE)
  • Common subwords become single tokens
  • Rare words are split into known pieces
  • This is why “ChatGPT” → [“Chat”, “G”, “PT”]
  • How does this save space?
    • Example: “unbelievable”, “unhappy”, “undo”, and “unknown” all share the same “un”
    • The model only needs to store “un” once!

Why subwords win:

  • ✅ Handles unknown words gracefully
  • ✅ Keeps vocabulary manageable (~50,000 tokens)
  • ✅ Common words stay whole
  • ✅ Rare words get broken up sensibly
  • ✅ Works across languages
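The core BPE idea fits in a few lines: start from characters and repeatedly merge the most frequent adjacent pair. This is a minimal sketch on four made-up words; production tokenisers run thousands of merges over huge corpora.

```python
from collections import Counter

# Toy Byte Pair Encoding: start from characters, repeatedly merge the
# most frequent adjacent pair. Production tokenisers are far more involved.
words = ["unhappy", "undo", "unknown", "unbelievable"]
tokens = [list(w) for w in words]           # each word as a list of characters

def most_common_pair(tokenised):
    pairs = Counter()
    for word in tokenised:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge(tokenised, pair):
    merged = []
    for word in tokenised:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged.append(out)
    return merged

# The most frequent pair is ('u', 'n') -- it appears in all four words,
# so the first merge creates the shared "un" token.
tokens = merge(tokens, most_common_pair(tokens))
print(tokens[1])   # ['un', 'd', 'o']
```

This is exactly the "store 'un' once" saving described above: after one merge, all four words share a single "un" token.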

Source: Hugging Face

Tokenisation in action

Try it yourself!

OpenAI’s Tokeniser (try it!): platform.openai.com/tokenizer

The number of tokens can change from one version to another!

Some surprising examples:

Text                  | Tokens                | Count
“Hello”               | [“Hello”]             | 1
“hello”               | [“hello”]             | 1
” hello” (with space) | [” hello”]            | 1
“Hello!”              | [“Hello”, “!”]        | 2
“everything”          | [“everything”]        | 1
“ChatGPT”             | [“Chat”, “G”, “PT”]   | 3
“São Paulo”           | [“S”, “ão”, ” Paulo”] | 3
“🎉”                  | Multiple bytes        | 2+

Notice: Capitalisation, spacing, and punctuation all affect tokenisation!

Source: OpenAI Platform

Token limits and context windows

Why your prompt has a maximum length

  • Every LLM has a context window: Maximum tokens it can process
  • This limit includes both your prompt AND the response!
  • Output tokens typically cost 2-4x more than input tokens!
Model      | Context Window
GPT-3.5    | 4,096 or 16,384 tokens
GPT-4      | 8,192 or 32,768 tokens (128,000 for GPT-4 Turbo)
Gemini Pro | 1,000,000+ tokens
  • If your prompt uses 3,500 tokens and the limit is 4,096…
  • You only have 596 tokens left for the response!
  • Exceeding limits → Error or truncated output
  • Curiosity: Non-English languages use more tokens
    • Chinese, Japanese, Korean: ~2-3x more tokens per concept
    • This means higher costs and shorter effective context
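The budget arithmetic above is worth making explicit. A small helper (the function name is my own, not from any API) for checking how much room a prompt leaves for the response:

```python
# Context-window budgeting: the window covers the prompt AND the response.
# Numbers match the example above: a 4,096-token window, a 3,500-token prompt.

def tokens_left(context_window: int, prompt_tokens: int) -> int:
    remaining = context_window - prompt_tokens
    if remaining <= 0:
        raise ValueError("Prompt alone exceeds the context window!")
    return remaining

print(tokens_left(4096, 3500))   # 596 tokens left for the response
```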

Source: Anthropic

Same meaning, different tokens:

Language | Text         | Tokens
English  | “Hello”      | 1
Chinese  | “你好”       | 4
Arabic   | “مرحبا”      | 6
Japanese | “こんにちは” | 6

Embeddings 🌌

What are embeddings?

Words as points in space

  • Once text is tokenised, each token gets converted to an embedding
  • An embedding is a vector: A list of numbers
  • Typically 768 to 4,096 numbers per token!
  • Example: “cat” → [0.23, -0.45, 0.12, -0.89, ... (4,096 numbers)]
  • These numbers capture the meaning of the word
  • Similar words have similar vectors
    • “cat” and “dog” will be close together
    • “cat” and “aeroplane” will be far apart
  • The model learns these representations during training
  • This is where the “understanding” happens!
  • Remember \(\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}\)
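The king-queen arithmetic can be checked directly with cosine similarity. The 3-dimensional vectors below are invented so the example works cleanly (real embeddings are learned and have hundreds to thousands of dimensions), but the mechanics are exactly what happens in embedding space.

```python
import math

# Toy 3-dimensional "embeddings" invented for illustration; real models
# learn hundreds to thousands of dimensions from data.
vec = {
    "king":  [0.9, 0.8, 0.1],   # roughly: royalty, masculinity, animal-ness
    "queen": [0.9, 0.1, 0.1],
    "man":   [0.1, 0.8, 0.2],
    "woman": [0.1, 0.1, 0.2],
    "cat":   [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# king - man + woman lands near queen
result = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
print(max(vec, key=lambda word: cosine(result, vec[word])))   # queen
```

The subtraction removes the "masculine" component from "king" and the addition puts the "feminine" one back, leaving royalty intact.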

Source: TensorFlow Projector

Similar words cluster together in the embedding space

Word2Vec: A brief history

The breakthrough that started it all

  • Word2Vec (2013, Google’s Mikolov et al.) revolutionised NLP
  • Two training approaches:
    • CBOW (Continuous Bag of Words): Predict word from context
    • Skip-gram: Predict context from word
  • Key insight: Words appearing in similar contexts have similar meanings
    • “I love my _____” → cat, dog, hamster are all likely
    • So cat, dog, hamster should be near each other!
  • Trained on billions of words from the web
  • Pioneered the embedding approach that modern LLMs build on (LLM embeddings are contextual rather than static, but the core idea is the same)
  • 🎬 Watch: Word2Vec Explained (10 min)
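Firth's idea can be demonstrated without training a network at all. The sketch below uses plain co-occurrence counts on a made-up corpus, not an actual Word2Vec model, but it shows the same principle: words that keep the same company end up with similar vectors.

```python
import math
from collections import Counter

# The distributional idea behind Word2Vec, shown with co-occurrence
# counts instead of a trained network. Corpus and window size are toy choices.
corpus = ("i love my cat . i love my dog . i love my hamster . "
          "i fly an aeroplane .").split()

vocab = sorted(set(corpus))
context = {w: Counter() for w in vocab}
for i, w in enumerate(corpus):             # window of one word on each side
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            context[w][corpus[j]] += 1

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in vocab)
    nu = math.sqrt(sum(u[w] ** 2 for w in vocab))
    nv = math.sqrt(sum(v[w] ** 2 for w in vocab))
    return dot / (nu * nv)

# "cat" and "dog" share the context "my _ ." -> maximum similarity
print(round(cosine(context["cat"], context["dog"]), 2))        # 1.0
print(round(cosine(context["cat"], context["aeroplane"]), 2))  # 0.5
```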

Source: Chris McCormick

“You shall know a word by the company it keeps” — J.R. Firth (1957)

Word2vec illustrated

How embeddings capture meaning

  • Imagine we have the word embeddings for a few words
  • The colours represent numbers from -2 to +2, obtained from our model
  • See how “woman” and “girl” are similar to each other, but “queen” is further away?
  • This tells us something!
  • Although we don’t know what each dimension codes for, we can see some patterns
  • There are clear places where “king” and “queen” are similar to each other and distinct from all the others
  • The model has learned something about gender and royalty!
  • Note that “water” has few connections to other words

Source: Jay Alammar

Embeddings in practice

Real-world applications

How embeddings power modern AI:

  • Semantic search 🔍
    • Search by meaning, not just keywords
    • “cheap flights” finds “budget airfare”
  • Recommendations 📚
    • Find similar products, articles, or content
    • “Users who liked X also liked Y”
  • Clustering 📊
    • Group similar documents automatically
    • Topic modelling without labels
  • RAG (Retrieval-Augmented Generation)
    • Find relevant documents for LLM context
    • Power behind ChatGPT + browsing/search

Similarity search example:

Query: “How to fix a broken phone screen?”

Most similar (by embedding):

  1. “Repairing cracked smartphone displays”
  2. “DIY phone screen replacement guide”
  3. “Mobile repair services near me”

Meaning match, not keyword match!
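Semantic search is just the cosine trick at scale: embed the query, embed the documents, and rank by similarity. The 4-dimensional vectors below are invented for illustration; in practice they would come from an embedding model or API.

```python
import math

# Semantic search sketch: rank documents by cosine similarity to the query.
# The 4-dim vectors are made up; real ones come from an embedding model.
docs = {
    "Repairing cracked smartphone displays": [0.9, 0.8, 0.1, 0.0],
    "DIY phone screen replacement guide":    [0.8, 0.9, 0.2, 0.1],
    "Best pizza recipes for beginners":      [0.0, 0.1, 0.9, 0.8],
}
query_vec = [0.9, 0.9, 0.1, 0.1]     # "How to fix a broken phone screen?"

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
print(ranked[0])   # the phone-repair documents outrank the pizza one
```

Note that the top results share no keywords with the query vector's text; only the meaning (encoded in the vectors) matches.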

Source: Pinecone

Parameters: The LLM’s Brain 🧠

What is a parameter?

The numbers that make AI tick

  • Think back to algebra: \(y = 2x + 3\)
  • The numbers 2 and 3 are parameters
  • They determine the behaviour of the function
  • In an LLM, parameters are numbers that:
    • Get set during training
    • Control how the model processes text
    • Determine what the model “knows”
  • GPT-3: 175 billion parameters
  • GPT-4: estimated 1+ trillion parameters
  • Each parameter is adjusted thousands of times during training
  • Training = finding the right numbers to make predictions accurate
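The algebra example scales down nicely to code: starting from zero, gradient descent recovers the parameters 2 and 3 of \(y = 2x + 3\) purely from examples. This is "training = finding the right numbers" in miniature.

```python
# "Training = finding the right numbers": recover the parameters of
# y = 2x + 3 from examples, by nudging guesses to reduce the error.
data = [(x, 2 * x + 3) for x in range(-5, 6)]   # ground-truth examples

a, b = 0.0, 0.0                                 # start from "random" guesses
lr = 0.01                                       # learning rate
for _ in range(2000):                           # many small updates
    grad_a = grad_b = 0.0
    for x, y in data:
        err = (a * x + b) - y                   # prediction error
        grad_a += 2 * err * x / len(data)       # mean-squared-error gradients
        grad_b += 2 * err / len(data)
    a -= lr * grad_a                            # adjust to reduce the error
    b -= lr * grad_b

print(round(a, 2), round(b, 2))   # converges to 2.0 and 3.0
```

An LLM does conceptually the same thing, except with billions of parameters instead of two.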

Source: Medium

Each arrow represents parameters that are learned during training

How many parameters?

The scale of modern LLMs

Model               | Parameters          | Training Data
GPT-2 (2019)        | 1.5 billion         | 40 GB text
GPT-3 (2020)        | 175 billion         | 570 GB text
GPT-4 (2023)        | ~1+ trillion (est.) | Undisclosed
LLaMA 2 (2023)      | 7B - 70B            | 2 trillion tokens
Gemini Ultra (2024) | Undisclosed         | Massive

To put this in perspective:

  • 175 billion parameters = 175,000,000,000 individual numbers
  • If you counted 1 per second, it would take 5,500 years!
  • Training GPT-3 required ~3.6 million GPU hours
  • Each of those parameters was updated tens of thousands of times
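The counting claim is a one-liner to verify:

```python
# Sanity check: counting 175 billion parameters at one per second.
params = 175_000_000_000
seconds_per_year = 60 * 60 * 24 * 365
years = params / seconds_per_year
print(round(years))   # 5,549 years -- "5,500" in round numbers
```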

The three types of parameters

Embeddings, weights, and biases

1. Embedding parameters

  • Store the learned vector for each token
  • Vocabulary (~50,000) × dimensions (~4,096)
  • = ~200 million parameters just for embeddings!
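The "~200 million" figure follows directly from the two numbers above:

```python
# Embedding parameter count from the figures above:
vocab_size = 50_000      # tokens in the vocabulary
dims = 4_096             # numbers per embedding vector
embedding_params = vocab_size * dims
print(f"{embedding_params:,}")   # 204,800,000 -- about 200 million
```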

2. Weight parameters

  • Control connections between neurons
  • Determine how strongly different parts influence each other
  • Like volume knobs: amplify or diminish signals

3. Bias parameters

  • Adjust thresholds for neuron activation
  • Help detect subtle patterns that might otherwise be missed
  • Like a baseline adjustment

TL;DR: Weights and biases together determine how the model processes information

Source: Medium

Simple analogy:

  • Embeddings = vocabulary
  • Weights = grammar rules
  • Biases = intuition adjustments

Training: Finding the right numbers

How parameters get their values

The training process:

  1. Start with random parameter values
  2. Show the model text: “The cat sat on the ___”
  3. Model predicts: “table” (wrong!)
  4. Calculate the error
  5. Backpropagation: Adjust parameters to reduce error
  6. Repeat billions of times

Key insight:

  • The model learns by predicting the next word
  • Every error teaches it something
  • Patterns emerge from massive repetition
  • Training GPT-3 cost ~$3.5 million in compute!
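The six training steps can be sketched as a tiny next-word model. This is a toy bigram model with one learnable score per word pair (my own simplification, nothing like a real transformer), but the loop is the real thing: predict, measure the error, nudge the parameters, repeat.

```python
import math
import random

# Minimal "predict the next word" training loop: a bigram model whose
# parameters are nudged to reduce prediction error, mirroring steps 1-6.
corpus = "the cat sat on the mat".split()
vocab = sorted(set(corpus))

random.seed(0)                              # step 1: random starting values
score = {(p, n): random.uniform(-0.1, 0.1) for p in vocab for n in vocab}

def probs(prev):
    """Softmax over the scores for the word following `prev`."""
    exps = {n: math.exp(score[(prev, n)]) for n in vocab}
    total = sum(exps.values())
    return {n: e / total for n, e in exps.items()}

for _ in range(500):                        # billions of repetitions, in reality
    for prev, nxt in zip(corpus, corpus[1:]):
        p = probs(prev)                     # steps 2-3: show text, predict
        for n in vocab:                     # steps 4-5: error -> adjust scores
            target = 1.0 if n == nxt else 0.0
            score[(prev, n)] -= 0.1 * (p[n] - target)

p_cat = probs("cat")                        # step 6: repeat until it learns
best = max(p_cat, key=p_cat.get)
print(best)   # 'sat' -- the only word that ever followed "cat"
```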

Source: Daniel McKee and Maghav Kumar

Parameters are adjusted step by step to minimise errors

Model size vs. capability

Bigger isn’t always better

The scaling hypothesis:

  • More parameters = more capability (generally)
  • But diminishing returns set in
  • A 10x bigger model isn’t 10x smarter

Small models fighting back:

  • Efficient training: Better data, longer training
  • Distillation: Teach small models from big ones
  • Quantisation: Reduce precision (32-bit → 4-bit)
  • Mixture of Experts: Only use relevant parts

Example:

  • LLaMA 2 (7B) can outperform GPT-3 (175B) on some tasks!
  • Why? Better training data and techniques
  • Quality over quantity matters for training data

Source: OpenAI Scaling Laws Paper

Hyperparameters: The creativity dials

Temperature, top-p, and top-k

Not all parameters are learned!

Hyperparameters are settings YOU control:

  • Temperature 🌡️
    • Controls randomness/creativity
    • Low (0.1-0.3): Focused, deterministic
    • High (0.7-1.0): Creative, varied
    • 0.0: Always picks the most likely token
  • Top-p (nucleus sampling)
    • Only sample from the smallest set of tokens whose combined probability reaches p
    • 0.9: Consider tokens covering the top 90% of probability mass
  • Top-k
    • Only consider the k most likely tokens
    • k=50: Choose from the 50 most likely tokens
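A sketch of the three dials, using made-up scores (logits) for a handful of candidate next tokens. Real inference does the same reshaping and filtering before sampling.

```python
import math

# Sketch of the sampling dials: temperature reshapes the probabilities,
# top-k and top-p restrict which tokens can be picked. Toy logits only.
logits = {"mat": 2.0, "sofa": 1.0, "moon": 0.1, "cheese": -1.0}

def apply_temperature(logits, temperature):
    if temperature == 0:                       # greedy: most likely token only
        return {max(logits, key=logits.get): 1.0}
    exps = {t: math.exp(l / temperature) for t, l in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def top_k(probs, k):
    kept = dict(sorted(probs.items(), key=lambda kv: -kv[1])[:k])
    z = sum(kept.values())
    return {t: p / z for t, p in kept.items()}

def top_p(probs, p):
    kept, total = {}, 0.0                      # keep tokens until mass >= p
    for t, pr in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[t] = pr
        total += pr
        if total >= p:
            break
    z = sum(kept.values())
    return {t: pr / z for t, pr in kept.items()}

print(apply_temperature(logits, 0))                       # {'mat': 1.0}
print(sorted(top_k(apply_temperature(logits, 1.0), 2)))   # ['mat', 'sofa']
```

High temperature flattens the distribution (more "creative" picks); low temperature sharpens it towards the top token.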

Temperature in action:

Source: Medium

Putting It All Together 🔗

The full picture

From your prompt to the response

The attention mechanism (recap)

How LLMs understand context

  • Attention is the key innovation of transformers
  • Every word “looks at” every other word
  • Determines which words are relevant to each other

Example:

“The animal didn’t cross the street because it was too tired”

  • What does “it” refer to?
  • Attention helps the model connect “it” to “animal”
  • Not all words matter equally for understanding

How it works:

  • Each word asks: “Who should I pay attention to?”
  • Creates weighted connections between all words
  • Allows understanding of long-range dependencies
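The "who should I pay attention to?" step can be shown for a single word. The 2-dimensional query and key vectors below are invented so that "it" lines up with "animal"; a real transformer learns these vectors and does this for every word at once.

```python
import math

# Toy scaled dot-product attention for one word: "it" scores every other
# word, softmaxes the scores, and attends most to the best match.
keys = {"animal": [1.0, 0.2], "street": [0.1, 1.0], "tired": [0.9, 0.3]}
query_it = [1.0, 0.1]                      # made-up query vector for "it"

scores = {w: sum(q * k for q, k in zip(query_it, v)) / math.sqrt(2)
          for w, v in keys.items()}        # scaled dot products
z = sum(math.exp(s) for s in scores.values())
attention = {w: math.exp(s) / z for w, s in scores.items()}

print(max(attention, key=attention.get))   # 'animal' gets the most attention
```

The weights sum to 1, so attention is literally a weighted vote over the other words.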

Source: Jay Alammar

Darker = more attention between those words

Practical tips for prompting

Using your new knowledge

Now that you understand the internals:

  • Be concise: Fewer tokens = lower cost, more room for response
  • Use clear language: Common words tokenise efficiently
  • Provide context: Help the attention mechanism
  • Be specific: The model uses your exact words
  • Test different temperatures: Match creativity to task
  • Mind the context limit: Plan for both prompt AND response

When things go wrong:

  • Model confused? → Try rephrasing with different words
  • Model doesn’t understand? → Provide more context
  • Running out of context? → Summarise earlier content

Quick reference:

Task             | Temperature
Factual Q&A      | 0.0-0.3
General chat     | 0.5-0.7
Creative writing | 0.7-1.0
Brainstorming    | 0.8-1.2

Source: Microsoft Education

Summary 📚

Main takeaways

  • LLMs don’t read text—they process tokens, numerical representations of text chunks

  • Tokenisation breaks text into pieces; ~100 tokens ≈ 75 words; affects cost and limits

  • Embeddings convert tokens to vectors, capturing meaning in ~4,096 dimensions

  • The famous king − man + woman ≈ queen shows embeddings encode semantic relationships

  • Parameters (billions of them!) are the numbers learned during training

  • Hyperparameters like temperature let you control creativity vs. determinism

  • Attention mechanisms help models understand context and word relationships

… and that’s all for today! 🎉