DATASCI 185: Introduction to AI Applications

Lecture 11: RAG, Semantic Search, and Grounding AI

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 📚

Recap of last class

  • Last time, we explored hallucinations
  • Why they happen: no fact-checking mechanism, sycophancy
  • Types: factual errors, fake citations, logical contradictions
  • Higher risk for: recent events, legal/medical, obscure topics
  • Prompting helps, but doesn’t fully solve the problem
  • Today: What if AI could access your own documents? 📄

In a RAG system, an LLM uses retrieved context to provide a response to the input question

Source: Google Research

Lecture overview

Today’s agenda

Part 1: The Problem

  • Quick recap: hallucinations (from Lecture 10)
  • The knowledge cutoff issue
  • How RAG solves both problems

Part 2: Semantic Search

  • Finding by meaning, not just keywords
  • Quick embeddings recap
  • Why this is a game-changer

Part 3: The RAG Pipeline

  • How documents become searchable
  • Chunking, embedding, retrieving
  • From question to grounded answer

Part 4: No-Code Tools

  • NotebookLM, Perplexity
  • Build your own RAG system today! 🛠️

Meme of the day 😄

Source: Bhavishya Pandit

Quick recap: Hallucinations 🤥

  • LLMs are just pattern-matching machines. They:
    • Generate text that fits statistical expectations
    • Have no internal verification system for truth
  • They’d rather make something up than admit ignorance
  • Information from credible sources gets mixed with unreliable ones during training
  • Fabricated answers sound just as convincing as accurate ones

The main question for today:

What if we could give the LLM access to real, verified information when answering?

RAG is designed to address exactly this problem!

Source: Iguazio

The solution: Give AI access to real information!

RAG = Retrieval-Augmented Generation

The main idea:

  1. When you ask a question…
  2. First search your documents for relevant information
  3. Then give that information to the LLM
  4. The LLM generates an answer using your data

Why this works:

  • LLM gets real, up-to-date information
  • Answers are grounded in your documents
  • Can cite sources, so you can verify them!
  • No need to retrain the model

It’s like giving the AI an open-book exam! 📖

RAG concept diagram

Source: AWS

How semantic search works: Embeddings

Remember from Lecture 06?

Text gets converted to vectors (lists of numbers):

Text Vector (simplified)
“king” [0.82, 0.15, -0.43, …]
“queen” [0.79, 0.18, -0.41, …]
“apple” [-0.12, 0.67, 0.23, …]

Key properties:

  • Similar meanings → similar vectors
  • “king” and “queen” are close in vector space
  • “king” and “apple” are far apart
  • Modern embedding models produce 768 to 3,072 dimensions

Model Dimensions Use Case
OpenAI ada-002 1,536 General purpose
Google Gecko 768 Lightweight
Cohere Embed v3 1,024 Multilingual

Sentence embeddings visualised

Similar sentences cluster together in embedding space

Measuring similarity: Cosine similarity

How do we measure if two vectors are similar?

  • Cosine similarity measures the angle between vectors
  • The numerator is the dot product (multiply their corresponding components and add them all up; how much they point in the same direction)
  • The denominator normalises by their lengths (magnitude)
    • If vector \(A = [0.8, 0.6]\), then \(\|A\| = \sqrt{0.8^2 + 0.6^2} = 1\)
  • If vectors are identical, cosine similarity = 1; if orthogonal, = 0; if opposite, = -1
  • For non-negative vectors (like raw word counts), scores stay in [0, 1]; embedding components can be negative, so embedding similarities can range over [-1, 1]

\[\text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \times \|B\|}\]

Interpretation:

Score Meaning Example
0.9–1.0 Very similar “car” vs “automobile”
0.7–0.9 Related “car” vs “truck”
0.4–0.7 Loosely related “car” vs “road”
0.0–0.4 Unrelated “car” vs “banana”

In RAG systems:

  • Query embedding compared to all chunk embeddings
  • Return chunks with highest similarity scores
  • Typical threshold: retrieve if score > 0.7

Mini-example (2D for simplicity):

"I love dogs"  → [0.8, 0.6]
"I adore puppies" → [0.75, 0.65]
"The weather is nice" → [-0.3, 0.9]

Cosine similarity:

  • “dogs” vs “puppies”: ≈ 1.00
  • “dogs” vs “weather”: ≈ 0.32

The search finds “puppies” when you ask about “dogs”!
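The mini-example above can be checked directly. A quick sketch in plain Python, using the made-up 2D vectors from the slide:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

dogs    = [0.8, 0.6]    # "I love dogs"
puppies = [0.75, 0.65]  # "I adore puppies"
weather = [-0.3, 0.9]   # "The weather is nice"

print(round(cosine_similarity(dogs, puppies), 2))  # 1.0
print(round(cosine_similarity(dogs, weather), 2))  # 0.32
```

Real embedding vectors have hundreds or thousands of dimensions, but the formula is identical.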

The RAG Pipeline 🔄

How RAG works

RAG pipeline

Left side (once): Prepare your documents for searching

Right side (every query): Find relevant info, then generate answer

Think of it like a librarian (retrieval) who finds the right books, paired with a scholar (generation) who reads them and answers your question! 📚

Step 1: Ingest your documents

What can you ingest?

Format Examples Challenges
PDF Papers, reports Tables, columns, headers
Word/Docs Reports, notes Formatting, styles
Web pages Articles, docs Navigation, ads
Code .py, .js files Comments vs. code
Transcripts Meeting notes Speaker identification

The challenge:

  • Documents are unstructured
  • Need to extract clean text
  • Preserve meaningful structure

Good news: Tools like NotebookLM and ChatGPT handle extraction automatically!

Document ingestion

Many document types → unified text
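Even with automatic extraction, a little normalisation helps. A minimal sketch of the kind of clean-up extraction tools do behind the scenes; the rules and regex here are illustrative, not any specific tool's pipeline:

```python
import re

def clean_extracted_text(raw):
    """Heuristic clean-up of extracted text: drop blank lines and page numbers.
    (Illustrative rules only; real extractors handle many more cases.)"""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue                                   # blank line
        if re.fullmatch(r"Page \d+( of \d+)?", line):  # page-number footer
            continue
        kept.append(line)
    return " ".join(kept)

raw = "Annual Report\n\nPage 1 of 12\nRevenue grew 14%\nin fiscal 2024."
print(clean_extracted_text(raw))  # Annual Report Revenue grew 14% in fiscal 2024.
```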

Step 2: Chunk the text

Why chunk?

  • LLMs have context window limits (4K–128K tokens)
  • Can’t fit a whole book in one prompt!
  • Need to find the most relevant parts
  • There are many strategies for chunking, each with its own trade-offs. More here

Chunking parameters:

Parameter Typical Values Trade-off
Chunk size 256–1,024 tokens Small = precise, Large = more context
Overlap 10–20% Prevents losing info at boundaries
Strategy Sentence, paragraph, semantic Depends on document structure

Rule of thumb: Start with 512 tokens, 20% overlap. Adjust based on your documents and retrieval quality.

Chunk size trade-offs:

Size Pros Cons
Small (256) Precise retrieval Loses context
Medium (512) Balanced Good default
Large (1024) Rich context May dilute relevance
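The parameters above map directly onto a simple sliding-window chunker. A sketch, using whitespace-separated words as a stand-in for tokens (real systems count tokens with the model's tokenizer):

```python
def chunk_text(words, chunk_size=512, overlap=0.2):
    """Split a word list into chunks of `chunk_size` with fractional overlap."""
    step = int(chunk_size * (1 - overlap))  # at 20% overlap, advance 80% each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the document
    return chunks

document = [f"word{i}" for i in range(1200)]  # a 1,200-"token" document
chunks = chunk_text(document, chunk_size=512, overlap=0.2)
print(len(chunks))  # 3 chunks: words 0-511, 409-920, 818-1199
```

Note how each chunk re-reads the last ~100 words of the previous one, so a sentence split at a boundary still appears whole in at least one chunk.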

Chunking visualisation

Source: Mastering LLM

Step 3: Embed, store & retrieve

Convert chunks to vectors & store:

  • Each chunk → embedding model → vector
  • Same embedding model used for queries later!

Database Type Speed (1M vectors)
Pinecone Cloud ~50ms queries
Chroma Local/Cloud ~100ms queries
FAISS Local ~10ms queries

Why vector databases?

  • Traditional databases: exact match
  • Vector databases: similarity search

Don’t worry: no-code tools handle this for you!

Retrieval (when you ask a question):

  1. Question → embedding → query vector
  2. Vector DB finds similar chunks
  3. Returns top-k most similar (k = 3–10)

Example: “What is our refund policy?”

Rank Chunk Score
1 “Returns and refunds: Customers may return…” 0.92
2 “Our guarantee covers full refunds…” 0.87
3 “Payment methods accepted…” 0.54 ❌

Parameters: top-k (3–10), threshold (0.7–1.0)
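The retrieval step can be sketched end to end. The chunk vectors below are hand-picked 2D stand-ins chosen to reproduce the example scores; real systems compare high-dimensional embeddings from a model:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hand-picked 2D vectors (illustrative only, not real embeddings)
chunk_vectors = {
    "Returns and refunds: Customers may return...": [0.92, 0.39],
    "Our guarantee covers full refunds...":          [0.87, 0.49],
    "Payment methods accepted...":                   [0.54, 0.84],
}

def retrieve(query_vec, chunk_vectors, top_k=3, threshold=0.7):
    """Score every chunk against the query, keep the top-k above the threshold."""
    scored = sorted(
        ((cosine(query_vec, v), text) for text, v in chunk_vectors.items()),
        reverse=True,
    )
    return [(round(s, 2), t) for s, t in scored[:top_k] if s >= threshold]

query_vec = [1.0, 0.0]  # pretend embedding of "What is our refund policy?"
for score, text in retrieve(query_vec, chunk_vectors):
    print(score, text)
# 0.92 Returns and refunds: Customers may return...
# 0.87 Our guarantee covers full refunds...
# (the 0.54 payment chunk is filtered out by the 0.7 threshold)
```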

Step 4: Generate with context

The final step:

The LLM receives a prompt like this:

System: Answer the user’s question using ONLY the context provided. If the answer isn’t in the context, say “I don’t have that information.”

Context: [Chunk 1]: “Returns and refunds: Customers may return items within 30 days…”. [Chunk 2]: “Our guarantee covers full refunds for defective products…”

User question: “What is your refund policy?”

The LLM now:

  • Has specific, relevant information to work with
  • Is instructed to cite sources
  • Is told to say “I don’t know” if context doesn’t contain the answer
  • Generates a grounded response

Context injection

Retrieved chunks become the LLM’s “reference material”

This is why RAG reduces hallucinations! The LLM has verified information to work with, not just its training data!
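Assembling that prompt is plain string formatting. A small sketch; the function name and layout are illustrative:

```python
def build_rag_prompt(question, retrieved_chunks):
    """Assemble a grounding prompt from retrieved chunks (layout is illustrative)."""
    context = " ".join(
        f'[Chunk {i}]: "{chunk}"' for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    system = (
        "Answer the user's question using ONLY the context provided. "
        "If the answer isn't in the context, say \"I don't have that information.\""
    )
    return f"System: {system}\n\nContext: {context}\n\nUser question: \"{question}\""

prompt = build_rag_prompt(
    "What is your refund policy?",
    ["Returns and refunds: Customers may return items within 30 days...",
     "Our guarantee covers full refunds for defective products..."],
)
print(prompt)
```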

RAG vs. Alternatives 🔄

RAG vs. fine-tuning vs. prompting

Three ways to customise LLM behaviour:

Aspect Prompt Engineering RAG Fine-tuning
What it does Careful instructions Add external knowledge Retrain model weights
Cost Free or $ $ $$$
Setup time Minutes Hours Days–Weeks
Data freshness Training cutoff Real-time Training cutoff
Accuracy (domain) Low–Medium High High
Hallucination risk High Low Medium
Cites sources ❌ No ✅ Yes ❌ No
Best for Simple tasks Knowledge-intensive QA Style/behaviour change

When to use each:

  • Prompting: Quick experiments, general tasks, no private data
  • RAG: Customer support, research, legal/medical QA, any task needing current or private information
  • Fine-tuning: Specific writing style, consistent persona, specialised domain language

Research findings: Does RAG actually help?

Yes! The evidence is strong:

Study Finding
Lewis et al. (2020) Original RAG paper: outperformed fine-tuned models on knowledge-intensive tasks
Shuster et al. (2021) RAG reduced hallucinations by ~30–50% in dialogue systems
Gao et al. (2024) Comprehensive survey: RAG approach dominant in production systems
Liu et al. (2023) “Lost in the middle”: LLMs use beginning and end of context better than middle

Hallucination rates comparison:

Setting Hallucination Rate
Base LLM (no RAG) 15–40%
LLM + RAG 5–15%
LLM + RAG + verification 2–8%

Rates vary by domain and implementation quality

The “Lost in the Middle” problem:

Liu et al. (2023) found that LLMs pay most attention to:

  1. Beginning of context (primacy)
  2. End of context (recency)
  3. Middle is often ignored!

Implication for RAG:

Put the most relevant chunks first or last, not in the middle!

Lost in the middle effect

Source: Liu et al. (2023)
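One practical mitigation of this effect: reorder retrieved chunks so the strongest ones sit at the edges of the context window. A sketch of the idea (the function is illustrative; some frameworks ship a similar "long-context reorder" step):

```python
def reorder_for_attention(scored_chunks):
    """Alternate high-scoring chunks between the front and the back of the
    context, so the weakest chunks land in the (often-ignored) middle."""
    ranked = sorted(scored_chunks, reverse=True)  # best first
    front, back = [], []
    for i, item in enumerate(ranked):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]

chunks = [(0.9, "A"), (0.5, "C"), (0.7, "B"), (0.3, "D"), (0.8, "E")]
print([text for _, text in reorder_for_attention(chunks)])  # ['A', 'B', 'D', 'C', 'E']
```

The two best chunks (0.9 and 0.8) end up first and last, and the weakest (0.3) is buried in the middle where it matters least.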

RAG Failure Modes ⚠️

When RAG goes wrong

RAG isn’t perfect. Common failure modes:

Failure Type What Happens Frequency
Retrieval failure Wrong chunks retrieved 15–25% of queries
Lost in the middle Relevant info in middle ignored Common with many chunks
Context overflow Too much text, truncated Depends on doc size
Outdated docs Stale information retrieved Depends on maintenance
Extraction errors PDF tables/images parsed incorrectly 10–30% of complex docs

Retrieval failures happen when:

  • Query uses different terms than documents
  • Question is ambiguous
  • Multiple topics compete for relevance
  • Embedding model doesn’t capture domain-specific meaning

Example: Retrieval failure

Your document says: > “The quarterly earnings call is scheduled for March 15th”

You ask: > “When is the investor meeting?”

Problem: “investor meeting” ≠ “earnings call” in embedding space

Result: Wrong chunks retrieved, wrong answer!

Mitigation strategies:

  • Use terms from your documents
  • Add synonyms to your documents
  • Use hybrid search (keyword + semantic)
  • Increase top-k (retrieve more chunks)
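Hybrid search can be sketched as a weighted blend of a keyword score and the semantic score. Everything below is illustrative: the naive word-overlap score and the 50/50 weighting are assumptions, and production systems typically use BM25 or similar for the keyword side:

```python
def keyword_score(query, chunk):
    """Fraction of query words that literally appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words)

def hybrid_score(semantic, keyword, alpha=0.5):
    """Blend semantic and keyword scores; alpha weights the semantic side."""
    return alpha * semantic + (1 - alpha) * keyword

chunk = "The quarterly earnings call is scheduled for March 15th"
print(keyword_score("when is the earnings call", chunk))  # 0.8
print(hybrid_score(semantic=0.55, keyword=0.8))           # blended score
```

In the earnings-call example, a query that reuses the document's own words scores high on keyword overlap even when the embedding similarity is mediocre, so the right chunk is still retrieved.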

Best practices for reliable RAG

For document preparation:

  1. Clean your documents
    • Remove headers, footers, page numbers
    • Fix OCR errors in scanned PDFs
  2. Use descriptive headings
    • Helps chunking and retrieval
    • “Q4 2024 Revenue” > “Section 3.2”
  3. Keep documents updated
    • Stale docs = stale answers
    • Version control your knowledge base
  4. Test with real queries
    • What questions will users actually ask?
    • Do retrieved chunks contain the answer?

For querying:

  1. Be specific
    • ✅ “What was Q4 2024 revenue?”
    • ❌ “Tell me about the company”
  2. Use document terminology
    • If doc says “associates”, ask about “associates” not “employees”
  3. Check the citations!
    • Does the answer match what the source says?
    • This is your verification step
  4. Ask follow-up questions
    • “What source did you use for that?”
    • “Can you quote the relevant passage?”

Golden rule: Trust, but verify 🔍

No-Code RAG Tools 🛠️

Tools you can use today!

Tool Free? Best For Key Feature
NotebookLM ✅ Yes Research, study notes Multi-source synthesis
ChatGPT + Files ✅ Free tier General documents Easy upload & chat
Claude + Files ✅ Free tier Long documents 200K token context
Google AI Studio ✅ Free tier Experimentation Gemini models

All of these implement RAG internally!

  • You upload documents → (Retrieval source)
  • System finds relevant parts → (Augmented)
  • LLM generates answers → (Generation)

No coding required, just upload and ask!

NotebookLM: Your AI research assistant

What is NotebookLM?

  • Free tool from Google
  • Upload up to 50 sources (PDFs, docs, websites, YouTube)
  • Creates a personal knowledge base
  • AI answers questions from YOUR sources only

Features:

Feature What It Does
Source grounding Only answers from your docs
Citations Points to exact source passages
Audio Overview Generates podcast-style summary
Study guides Creates questions & summaries
Cross-referencing Finds connections between sources

Best for: Research projects, exam prep, literature reviews, understanding complex reports

NotebookLM interface

Activity: Z.ai file upload RAG 📎

Compare: With vs. without your documents. Let’s do it together (or at home if Emory’s connection doesn’t allow us to! 😂)

Step 1: Without documents

  1. Open chat.z.ai
  2. Start a new chat
  3. Ask: “What are the assignment deadlines for DATASCI 185 at Emory University?”
  4. Note: Z.ai doesn’t know! (It’s not in training data)

Step 2: With your document

  1. Start another new chat
  2. Click the 📎 (attach) button
  3. Upload the course syllabus PDF
  4. Ask the same question: “What are the assignment deadlines?”
  5. Compare the answers

What to observe:

Without Doc With Doc
“I don’t have access to…” Specific dates from syllabus
May hallucinate generic answer Grounded in your document
No citations Can quote the source

This is RAG in action!

The LLM retrieves relevant parts of your uploaded file and uses them to generate an accurate answer.

You just built a RAG system! 🎉

Real-World Applications 🌍

RAG is everywhere!

Industry Application How RAG Helps Example Company
🏢 Customer Support AI chatbots Answer questions from product docs Intercom, Zendesk
⚖️ Legal Research assistants Search case law by meaning Harvey AI, Casetext
🏥 Healthcare Clinical support Find relevant patient records, guidelines Epic, Nuance
📚 Education Personal tutors Answer questions from course materials Khan Academy, Duolingo
💼 Finance Analyst tools Search earnings reports, SEC filings Bloomberg, Kensho
🔬 Research Literature review Find related papers, summarise findings Elicit, Semantic Scholar
💻 Developer Tools Documentation QA Answer questions from codebases GitHub Copilot, Cursor

Common thread: All need accurate, source-backed answers from specific documents… exactly what RAG provides

Market size: Enterprise RAG solutions expected to reach $40B+ by 2028 (estimates vary)

Advanced RAG: LangChain is a popular framework for building RAG applications. Feel free to explore it if you’re familiar with Python/JavaScript and want to build your own RAG system!

Summary 📚

Key takeaways

  • The problem: LLMs hallucinate and lack access to private/current information

  • Semantic search: Find by meaning using embeddings and cosine similarity

  • The RAG pipeline: Chunk → Embed → Store → Retrieve → Generate

  • Research shows: RAG reduces hallucinations by 30–50%

  • Watch out for: Retrieval failures, lost-in-the-middle, outdated docs

  • No-code tools: NotebookLM, ChatGPT with files, etc.

  • Always verify: RAG reduces errors but doesn’t eliminate them!

Quick reference:

Concept Key Numbers
Cosine similarity 0.9+ = very similar
Chunk size 256–1024 tokens
Overlap 10–20%
Top-k retrieval 3–10 chunks
Hallucination reduction 30–50%

Remember: Upload your docs, ask specific questions, and always check citations!

…and that’s all for today! 🎉