DATASCI 185: Introduction to AI Applications

Lecture 06: Language, Tokenisation, and Embeddings

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 🤓

Recap of last class

  • We explored model evaluation across two domains
  • Traditional ML metrics: Accuracy, precision, recall
  • The accuracy paradox: 99.9% accuracy can be useless!
  • Overfitting: When models memorise instead of learn
  • Cross-validation: More robust evaluation
  • LLM evaluation: Perplexity, LLM-as-a-Judge, G-Eval
  • Hallucinations and RAG for grounding answers
  • Today: How do LLMs actually understand language? 🤔

Lecture overview

Today’s agenda

Part 1: How LLMs “See” Text

  • The translation problem: Text to numbers
  • The processing pipeline

Part 2: Tokens and Tokenisation

  • What tokens are (hint: not always words!)
  • Tokenisation methods and quirks
  • Token limits and API costs

Part 3: Embeddings

  • Words as vectors in space
  • Semantic similarity and the king-queen example in detail

Part 4: Parameters

  • What are the billions of parameters?
  • Embeddings, weights, and biases
  • Temperature and creativity dials

Meme of the day

Source: Reddit r/Singularity

New study by Anthropic on how AI impacts coding skills

How LLMs “See” Text 👁️

The translation problem

LLMs don’t read English!

  • Here’s something you probably know by now: Computers only understand numbers
  • When you type “Hello, how are you?”, the LLM doesn’t see letters!
  • It sees something like: [15496, 11, 703, 527, 499, 30]
  • This creates a translation problem:
    • How do we convert text into numbers?
    • How do we preserve meaning in those numbers?
    • How do we capture relationships between words?
  • The solution involves two key concepts:
    • Tokenisation: Breaking text into pieces
    • Embeddings: Converting pieces into meaningful vectors

From text to output

The LLM processing pipeline

How prompting really works:

  1. Input text: “The cat sat on the mat”
  2. Tokenisation: Break into tokens → ["The", "cat", "sat", "on", "the", "mat"]
  3. Token IDs: Map to numbers → [464, 3797, 3332, 319, 262, 2603]
  4. Embeddings: Convert to vectors → Each token becomes a list of ~4,096 numbers!
  5. Processing: Pass through neural network layers
  6. Output: Generate next token probabilities
  7. Decoding: Convert back to text

Every word you type goes through this transformation!
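The seven steps above can be sketched as a toy simulation. Everything here is invented for illustration (the vocabulary, IDs, and "embeddings" are made up, and real tokenisers use subwords rather than `str.split`), but the shape of the pipeline is the same.

```python
# Toy version of the LLM pipeline: text -> tokens -> IDs -> vectors.
# All values below are invented for illustration, not real model data.

text = "The cat sat on the mat"

# Step 2: tokenisation (real tokenisers use subwords, not str.split)
tokens = text.split()                      # ['The', 'cat', 'sat', ...]

# Step 3: map tokens to integer IDs via a tiny made-up vocabulary
vocab = {"The": 464, "cat": 3797, "sat": 3332,
         "on": 319, "the": 262, "mat": 2603}
token_ids = [vocab[t] for t in tokens]     # [464, 3797, 3332, 319, 262, 2603]

# Step 4: look up an embedding vector for each ID (tiny 4-dim toy vectors;
# a real model would use ~4,096 learned numbers per token)
embeddings = {i: [((i * k) % 7) / 7.0 for k in range(1, 5)] for i in vocab.values()}
vectors = [embeddings[i] for i in token_ids]

print(token_ids)
print(len(vectors), "vectors of", len(vectors[0]), "numbers each")
```

Steps 5-7 (the neural network itself) are where the billions of parameters come in, which we cover later in the lecture.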

Source: NanoBanana

Why this matters for you

Practical implications

Understanding the pipeline helps you:

  • Write better prompts: Know what the model “sees”
  • Understand costs: API pricing is per token, not per word
  • Avoid surprises: Some words use more tokens than others
  • Debug issues: Why did the model misunderstand?
  • Use context wisely: Token limits are real constraints

Fun fact: The word “everything” is 1 token, but “ChatGPT” is 3 tokens in OpenAI’s (old) tokeniser! 🤯

Tokenisation isn’t always intuitive!

Source: OpenAI

Tokens and Tokenisation 🧩

What is a token?

The basic unit of LLM processing

  • A token is the basic unit that an LLM reads
  • Tokens are NOT always words!
  • A token can be:
    • A full word: “hello” → 1 token
    • Part of a word: “unhappiness” → “un” + “happiness” (2 tokens)
    • A single character: “!” → 1 token
    • A space: ” ” → often included with the next word
  • Rule of thumb: 1 token ≈ 4 characters in English
  • Or roughly: 100 tokens ≈ 75 words
  • Other languages often use more tokens per word
  • Let’s try it out with “Hello, it’s Danilo here!”
  • We’ll use OpenAI’s tokenizer for this example
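The rule of thumb above is easy to apply in code. This is only the ≈4-characters-per-token heuristic for English, not a real tokeniser; for exact counts you would use OpenAI's own tokeniser (e.g. the tiktoken library or the web tool below).

```python
# Rough token estimate using the "1 token ≈ 4 characters" rule of thumb.
# This is only a ballpark for English text; exact counts require the
# model's actual tokeniser (e.g. OpenAI's tiktoken library).

def estimate_tokens(text: str) -> int:
    return max(1, round(len(text) / 4))

prompt = "Hello, it's Danilo here!"
print(len(prompt), "characters ->", estimate_tokens(prompt), "tokens (estimate)")
```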

Why use tokens instead of words?

The clever engineering choice

Three approaches to breaking up text:

Method          | Example: “Evergreen” | Pros            | Cons
Word-based      | 1 token              | Intuitive       | Huge vocabulary
Character-based | 9 tokens             | Tiny vocabulary | Loses meaning
Subword (BPE)   | 2 tokens             | Best of both!   | Less intuitive


  • Modern LLMs use Byte Pair Encoding (BPE)
  • Common subwords become single tokens
  • Rare words are split into known pieces
  • This is why “ChatGPT” → [“Chat”, “G”, “PT”]
  • How does this save space?
    • Example: “unbelievable”, “unhappy”, “undo”, and “unknown” all share the same “un”
    • The model only needs to store “un” once!

Why subwords win:

  • ✅ Handles unknown words gracefully
  • ✅ Keeps vocabulary manageable (~50,000 tokens)
  • ✅ Common words stay whole
  • ✅ Rare words get broken up sensibly
  • ✅ Works across languages
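The core BPE idea fits in a few lines: start from characters and repeatedly merge the most frequent adjacent pair. This is a minimal sketch on four made-up words; production tokenisers run thousands of merges over huge corpora.

```python
from collections import Counter

# Toy Byte Pair Encoding: start from characters, repeatedly merge the
# most frequent adjacent pair. Production tokenisers are far more involved.
words = ["unhappy", "undo", "unknown", "unbelievable"]
tokens = [list(w) for w in words]           # each word as a list of characters

def most_common_pair(tokenised):
    pairs = Counter()
    for word in tokenised:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge(tokenised, pair):
    merged = []
    for word in tokenised:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged.append(out)
    return merged

# The most frequent pair is ('u', 'n') -- it appears in all four words,
# so the first merge creates the shared "un" token.
tokens = merge(tokens, most_common_pair(tokens))
print(tokens[1])   # ['un', 'd', 'o']
```

This is exactly the "store 'un' once" saving described above: after one merge, all four words share a single "un" token.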

Source: Hugging Face

Tokenisation in action

Try it yourself!

OpenAI’s Tokeniser (try it!): platform.openai.com/tokenizer

The number of tokens can change from one version to another!

Some surprising examples:

Text                  | Tokens                | Count
“Hello”               | [“Hello”]             | 1
“hello”               | [“hello”]             | 1
” hello” (with space) | [” hello”]            | 1
“Hello!”              | [“Hello”, “!”]        | 2
“everything”          | [“everything”]        | 1
“ChatGPT”             | [“Chat”, “G”, “PT”]   | 3
“São Paulo”           | [“S”, “ão”, ” Paulo”] | 3
“🎉”                  | Multiple bytes        | 2+

Notice: Capitalisation, spacing, and punctuation all affect tokenisation!

Source: OpenAI Platform

Token limits and context windows

Why your prompt has a maximum length

  • Every LLM has a context window: Maximum tokens it can process
  • This limit includes both your prompt AND the response!
  • Output tokens typically cost 2-4x more than input tokens!
Model      | Context Window
GPT-3.5    | 4,096 or 16,384 tokens
GPT-4      | 8,192 or 32,768 tokens (128,000 for GPT-4 Turbo)
Gemini Pro | 1,000,000+ tokens
  • If your prompt uses 3,500 tokens and the limit is 4,096…
  • You only have 596 tokens left for the response!
  • Exceeding limits → Error or truncated output
  • Curiosity: Non-English languages use more tokens
    • Chinese, Japanese, Korean: ~2-3x more tokens per concept
    • This means higher costs and shorter effective context
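The budget arithmetic above is worth making explicit. A small helper (the function name is my own, not from any API) for checking how much room a prompt leaves for the response:

```python
# Context-window budgeting: the window covers the prompt AND the response.
# Numbers match the example above: a 4,096-token window, a 3,500-token prompt.

def tokens_left(context_window: int, prompt_tokens: int) -> int:
    remaining = context_window - prompt_tokens
    if remaining <= 0:
        raise ValueError("Prompt alone exceeds the context window!")
    return remaining

print(tokens_left(4096, 3500))   # 596 tokens left for the response
```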

Source: Anthropic

Same meaning, different tokens:

Language | Text         | Tokens
English  | “Hello”      | 1
Chinese  | “你好”       | 4
Arabic   | “مرحبا”      | 6
Japanese | “こんにちは” | 6

Embeddings 🌌

What are embeddings?

Words as points in space

  • Once text is tokenised, each token gets converted to an embedding
  • An embedding is a vector: A list of numbers
  • Typically 768 to 4,096 numbers per token!
  • Example: “cat” → [0.23, -0.45, 0.12, -0.89, ... (4,096 numbers)]
  • These numbers capture the meaning of the word
  • Similar words have similar vectors
    • “cat” and “dog” will be close together
    • “cat” and “aeroplane” will be far apart
  • The model learns these representations during training
  • This is where the “understanding” happens!
  • Remember \(\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}\)
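The king-queen arithmetic can be checked directly with cosine similarity. The 3-dimensional vectors below are invented so the example works cleanly (real embeddings are learned and have hundreds to thousands of dimensions), but the mechanics are exactly what happens in embedding space.

```python
import math

# Toy 3-dimensional "embeddings" invented for illustration; real models
# learn hundreds to thousands of dimensions from data.
vec = {
    "king":  [0.9, 0.8, 0.1],   # roughly: royalty, masculinity, animal-ness
    "queen": [0.9, 0.1, 0.1],
    "man":   [0.1, 0.8, 0.2],
    "woman": [0.1, 0.1, 0.2],
    "cat":   [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# king - man + woman lands near queen
result = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
print(max(vec, key=lambda word: cosine(result, vec[word])))   # queen
```

The subtraction removes the "masculine" component from "king" and the addition puts the "feminine" one back, leaving royalty intact.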

Source: TensorFlow Projector

Similar words cluster together in the embedding space

Word2Vec: A brief history

The breakthrough that started it all

  • Word2Vec (2013, Google’s Mikolov et al.) revolutionised NLP
  • Two training approaches:
    • CBOW (Continuous Bag of Words): Predict word from context
    • Skip-gram: Predict context from word
  • Key insight: Words appearing in similar contexts have similar meanings
    • “I love my _____” → cat, dog, hamster are all likely
    • So cat, dog, hamster should be near each other!
  • Trained on billions of words from the web
  • Pioneered the embedding approach that modern LLMs build on (LLM embeddings are contextual rather than static, but the core idea is the same)
  • 🎬 Watch: Word2Vec Explained (10 min)
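Firth's idea can be demonstrated without training a network at all. The sketch below uses plain co-occurrence counts on a made-up corpus, not an actual Word2Vec model, but it shows the same principle: words that keep the same company end up with similar vectors.

```python
import math
from collections import Counter

# The distributional idea behind Word2Vec, shown with co-occurrence
# counts instead of a trained network. Corpus and window size are toy choices.
corpus = ("i love my cat . i love my dog . i love my hamster . "
          "i fly an aeroplane .").split()

vocab = sorted(set(corpus))
context = {w: Counter() for w in vocab}
for i, w in enumerate(corpus):             # window of one word on each side
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            context[w][corpus[j]] += 1

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in vocab)
    nu = math.sqrt(sum(u[w] ** 2 for w in vocab))
    nv = math.sqrt(sum(v[w] ** 2 for w in vocab))
    return dot / (nu * nv)

# "cat" and "dog" share the context "my _ ." -> maximum similarity
print(round(cosine(context["cat"], context["dog"]), 2))        # 1.0
print(round(cosine(context["cat"], context["aeroplane"]), 2))  # 0.5
```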

Source: Chris McCormick

“You shall know a word by the company it keeps” — J.R. Firth (1957)

Word2vec illustrated

How embeddings capture meaning

  • Imagine we have the word embeddings for a few words
  • The colours represent numbers from -2 to +2, obtained from our model
  • See how “woman” and “girl” are similar to each other, but “queen” is further away?
  • This tells us something!
  • Although we don’t know what each dimension codes for, we can see some patterns
  • There are clear places where “king” and “queen” are similar to each other and distinct from all the others
  • The model has learned something about gender and royalty!
  • Note that “water” has few connections to other words

Source: Jay Alammar

Embeddings in practice

Real-world applications

How embeddings power modern AI:

  • Semantic search 🔍
    • Search by meaning, not just keywords
    • “cheap flights” finds “budget airfare”
  • Recommendations 📚
    • Find similar products, articles, or content
    • “Users who liked X also liked Y”
  • Clustering 📊
    • Group similar documents automatically
    • Topic modelling without labels
  • RAG (Retrieval-Augmented Generation)
    • Find relevant documents for LLM context
    • Power behind ChatGPT + browsing/search

Similarity search example:

Query: “How to fix a broken phone screen?”

Most similar (by embedding):

  1. “Repairing cracked smartphone displays”
  2. “DIY phone screen replacement guide”
  3. “Mobile repair services near me”

Meaning match, not keyword match!
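Semantic search is just the cosine trick at scale: embed the query, embed the documents, and rank by similarity. The 4-dimensional vectors below are invented for illustration; in practice they would come from an embedding model or API.

```python
import math

# Semantic search sketch: rank documents by cosine similarity to the query.
# The 4-dim vectors are made up; real ones come from an embedding model.
docs = {
    "Repairing cracked smartphone displays": [0.9, 0.8, 0.1, 0.0],
    "DIY phone screen replacement guide":    [0.8, 0.9, 0.2, 0.1],
    "Best pizza recipes for beginners":      [0.0, 0.1, 0.9, 0.8],
}
query_vec = [0.9, 0.9, 0.1, 0.1]     # "How to fix a broken phone screen?"

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
print(ranked[0])   # the phone-repair documents outrank the pizza one
```

Note that the top results share no keywords with the query vector's text; only the meaning (encoded in the vectors) matches.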

Source: Pinecone

Parameters: The LLM’s Brain 🧠

What is a parameter?

The numbers that make AI tick

  • Think back to algebra: \(y = 2x + 3\)
  • The numbers 2 and 3 are parameters
  • They determine the behaviour of the function
  • In an LLM, parameters are numbers that:
    • Get set during training
    • Control how the model processes text
    • Determine what the model “knows”
  • GPT-3: 175 billion parameters
  • GPT-4: estimated 1+ trillion parameters
  • Each parameter is adjusted thousands of times during training
  • Training = finding the right numbers to make predictions accurate
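The algebra example scales down nicely to code: starting from zero, gradient descent recovers the parameters 2 and 3 of \(y = 2x + 3\) purely from examples. This is "training = finding the right numbers" in miniature.

```python
# "Training = finding the right numbers": recover the parameters of
# y = 2x + 3 from examples, by nudging guesses to reduce the error.
data = [(x, 2 * x + 3) for x in range(-5, 6)]   # ground-truth examples

a, b = 0.0, 0.0                                 # start from "random" guesses
lr = 0.01                                       # learning rate
for _ in range(2000):                           # many small updates
    grad_a = grad_b = 0.0
    for x, y in data:
        err = (a * x + b) - y                   # prediction error
        grad_a += 2 * err * x / len(data)       # mean-squared-error gradients
        grad_b += 2 * err / len(data)
    a -= lr * grad_a                            # adjust to reduce the error
    b -= lr * grad_b

print(round(a, 2), round(b, 2))   # converges to 2.0 and 3.0
```

An LLM does conceptually the same thing, except with billions of parameters instead of two.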

Source: Medium

Each arrow represents parameters that are learned during training

How many parameters?

The scale of modern LLMs

Model               | Parameters          | Training Data
GPT-2 (2019)        | 1.5 billion         | 40 GB text
GPT-3 (2020)        | 175 billion         | 570 GB text
GPT-4 (2023)        | ~1+ trillion (est.) | Undisclosed
LLaMA 2 (2023)      | 7B - 70B            | 2 trillion tokens
Gemini Ultra (2024) | Undisclosed         | Massive

To put this in perspective:

  • 175 billion parameters = 175,000,000,000 individual numbers
  • If you counted 1 per second, it would take 5,500 years!
  • Training GPT-3 required ~3.6 million GPU hours
  • Each of those parameters was updated tens of thousands of times
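The counting claim is a one-liner to verify:

```python
# Sanity check: counting 175 billion parameters at one per second.
params = 175_000_000_000
seconds_per_year = 60 * 60 * 24 * 365
years = params / seconds_per_year
print(round(years))   # 5,549 years -- "5,500" in round numbers
```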

The three types of parameters

Embeddings, weights, and biases

1. Embedding parameters

  • Store the learned vector for each token
  • Vocabulary (~50,000) × dimensions (~4,096)
  • = ~200 million parameters just for embeddings!
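The "~200 million" figure follows directly from the two numbers above:

```python
# Embedding parameter count from the figures above:
vocab_size = 50_000      # tokens in the vocabulary
dims = 4_096             # numbers per embedding vector
embedding_params = vocab_size * dims
print(f"{embedding_params:,}")   # 204,800,000 -- about 200 million
```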

2. Weight parameters

  • Control connections between neurons
  • Determine how strongly different parts influence each other
  • Like volume knobs: amplify or diminish signals

3. Bias parameters

  • Adjust thresholds for neuron activation
  • Help detect subtle patterns that might otherwise be missed
  • Like a baseline adjustment

TL;DR: Weights and biases together determine how the model processes information

Source: Medium

Simple analogy:

  • Embeddings = vocabulary
  • Weights = grammar rules
  • Biases = intuition adjustments

Training: Finding the right numbers

How parameters get their values

The training process:

  1. Start with random parameter values
  2. Show the model text: “The cat sat on the ___”
  3. Model predicts: “table” (wrong!)
  4. Calculate the error
  5. Backpropagation: Adjust parameters to reduce error
  6. Repeat billions of times

Key insight:

  • The model learns by predicting the next word
  • Every error teaches it something
  • Patterns emerge from massive repetition
  • Training GPT-3 cost ~$3.5 million in compute!
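The six training steps can be sketched as a tiny next-word model. This is a toy bigram model with one learnable score per word pair (my own simplification, nothing like a real transformer), but the loop is the real thing: predict, measure the error, nudge the parameters, repeat.

```python
import math
import random

# Minimal "predict the next word" training loop: a bigram model whose
# parameters are nudged to reduce prediction error, mirroring steps 1-6.
corpus = "the cat sat on the mat".split()
vocab = sorted(set(corpus))

random.seed(0)                              # step 1: random starting values
score = {(p, n): random.uniform(-0.1, 0.1) for p in vocab for n in vocab}

def probs(prev):
    """Softmax over the scores for the word following `prev`."""
    exps = {n: math.exp(score[(prev, n)]) for n in vocab}
    total = sum(exps.values())
    return {n: e / total for n, e in exps.items()}

for _ in range(500):                        # billions of repetitions, in reality
    for prev, nxt in zip(corpus, corpus[1:]):
        p = probs(prev)                     # steps 2-3: show text, predict
        for n in vocab:                     # steps 4-5: error -> adjust scores
            target = 1.0 if n == nxt else 0.0
            score[(prev, n)] -= 0.1 * (p[n] - target)

p_cat = probs("cat")                        # step 6: repeat until it learns
best = max(p_cat, key=p_cat.get)
print(best)   # 'sat' -- the only word that ever followed "cat"
```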

Source: Daniel McKee and Maghav Kumar

Parameters are adjusted step by step to minimise errors

Model size vs. capability

Bigger isn’t always better

The scaling hypothesis:

  • More parameters = more capability (generally)
  • But diminishing returns set in
  • A 10x bigger model isn’t 10x smarter

Small models fighting back:

  • Efficient training: Better data, longer training
  • Distillation: Teach small models from big ones
  • Quantisation: Reduce precision (32-bit → 4-bit)
  • Mixture of Experts: Only use relevant parts

Example:

  • LLaMA 2 (7B) can outperform GPT-3 (175B) on some tasks!
  • Why? Better training data and techniques
  • Quality over quantity matters for training data

Source: OpenAI Scaling Laws Paper

Hyperparameters: The creativity dials

Temperature, top-p, and top-k

Not all parameters are learned!

Hyperparameters are settings YOU control:

  • Temperature 🌡️
    • Controls randomness/creativity
    • Low (0.1-0.3): Focused, deterministic
    • High (0.7-1.0): Creative, varied
    • 0.0: Always picks the most likely token
  • Top-p (nucleus sampling)
    • Only sample from the smallest set of tokens whose combined probability reaches p
    • 0.9: Consider tokens covering the top 90% of probability mass
  • Top-k
    • Only consider the k most likely tokens
    • k=50: Choose from the 50 most likely tokens
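A sketch of the three dials, using made-up scores (logits) for a handful of candidate next tokens. Real inference does the same reshaping and filtering before sampling.

```python
import math

# Sketch of the sampling dials: temperature reshapes the probabilities,
# top-k and top-p restrict which tokens can be picked. Toy logits only.
logits = {"mat": 2.0, "sofa": 1.0, "moon": 0.1, "cheese": -1.0}

def apply_temperature(logits, temperature):
    if temperature == 0:                       # greedy: most likely token only
        return {max(logits, key=logits.get): 1.0}
    exps = {t: math.exp(l / temperature) for t, l in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def top_k(probs, k):
    kept = dict(sorted(probs.items(), key=lambda kv: -kv[1])[:k])
    z = sum(kept.values())
    return {t: p / z for t, p in kept.items()}

def top_p(probs, p):
    kept, total = {}, 0.0                      # keep tokens until mass >= p
    for t, pr in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[t] = pr
        total += pr
        if total >= p:
            break
    z = sum(kept.values())
    return {t: pr / z for t, pr in kept.items()}

print(apply_temperature(logits, 0))                       # {'mat': 1.0}
print(sorted(top_k(apply_temperature(logits, 1.0), 2)))   # ['mat', 'sofa']
```

High temperature flattens the distribution (more "creative" picks); low temperature sharpens it towards the top token.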

Temperature in action:

Source: Medium

Putting It All Together 🔗

The full picture

From your prompt to the response

The attention mechanism (recap)

How LLMs understand context

  • Attention is the key innovation of transformers
  • Every word “looks at” every other word
  • Determines which words are relevant to each other

Example:

“The animal didn’t cross the street because it was too tired”

  • What does “it” refer to?
  • Attention helps the model connect “it” to “animal”
  • Not all words matter equally for understanding

How it works:

  • Each word asks: “Who should I pay attention to?”
  • Creates weighted connections between all words
  • Allows understanding of long-range dependencies
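The "who should I pay attention to?" step can be shown for a single word. The 2-dimensional query and key vectors below are invented so that "it" lines up with "animal"; a real transformer learns these vectors and does this for every word at once.

```python
import math

# Toy scaled dot-product attention for one word: "it" scores every other
# word, softmaxes the scores, and attends most to the best match.
keys = {"animal": [1.0, 0.2], "street": [0.1, 1.0], "tired": [0.9, 0.3]}
query_it = [1.0, 0.1]                      # made-up query vector for "it"

scores = {w: sum(q * k for q, k in zip(query_it, v)) / math.sqrt(2)
          for w, v in keys.items()}        # scaled dot products
z = sum(math.exp(s) for s in scores.values())
attention = {w: math.exp(s) / z for w, s in scores.items()}

print(max(attention, key=attention.get))   # 'animal' gets the most attention
```

The weights sum to 1, so attention is literally a weighted vote over the other words.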

Source: Jay Alammar

Darker = more attention between those words

Practical tips for prompting

Using your new knowledge

Now that you understand the internals:

  • Be concise: Fewer tokens = lower cost, more room for response
  • Use clear language: Common words tokenise efficiently
  • Provide context: Help the attention mechanism
  • Be specific: The model uses your exact words
  • Test different temperatures: Match creativity to task
  • Mind the context limit: Plan for both prompt AND response

When things go wrong:

  • Model confused? → Try rephrasing with different words
  • Model doesn’t understand? → Provide more context
  • Running out of context? → Summarise earlier content

Quick reference:

Task             | Temperature
Factual Q&A      | 0.0-0.3
General chat     | 0.5-0.7
Creative writing | 0.7-1.0
Brainstorming    | 0.8-1.2

Source: Microsoft Education

Summary 📚

Main takeaways

  • LLMs don’t read text—they process tokens, numerical representations of text chunks

  • Tokenisation breaks text into pieces; ~100 tokens ≈ 75 words; affects cost and limits

  • Embeddings convert tokens to vectors, capturing meaning in ~4,096 dimensions

  • The famous king − man + woman ≈ queen shows embeddings encode semantic relationships

  • Parameters (billions of them!) are the numbers learned during training

  • Hyperparameters like temperature let you control creativity vs. determinism

  • Attention mechanisms help models understand context and word relationships

… and that’s all for today! 🎉