DATASCI 185: Introduction to AI Applications

Lecture 11: RAG, Semantic Search, and Grounding AI

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 📚

Recap of last class

  • Last time, we explored hallucinations
  • Why they happen: no fact-checking mechanism, sycophancy
  • Types: factual errors, fake citations, logical contradictions
  • Higher risk for: recent events, legal/medical, obscure topics
  • Prompting helps, but doesn’t fully solve the problem
  • Today: What if AI could access your own documents? 📄

In a RAG system, an LLM uses retrieved context to provide a response to the input question

Source: Google Research

Lecture overview

Today’s agenda

Part 1: The Problem

  • Quick recap: hallucinations (from Lecture 10)
  • The knowledge cutoff issue
  • How RAG solves both problems

Part 2: Semantic Search

  • Finding by meaning, not just keywords
  • Quick embeddings recap
  • Why this is a game-changer

Part 3: The RAG Pipeline

  • How documents become searchable
  • Chunking, embedding, retrieving
  • From question to grounded answer

Part 4: No-Code Tools

  • NotebookLM, Perplexity
  • Build your own RAG system today! 🛠️

Meme of the day 😄

Source: Bhavishya Pandit

Quick recap: Hallucinations 🤥

  • LLMs are just pattern-matching machines. They:
    • Generate text that fits statistical expectations
    • Have no internal verification system for truth
  • They’d rather make something up than admit ignorance
  • Information from credible sources gets mixed with unreliable ones during training
  • Fabricated answers sound just as convincing as accurate ones

The main question for today:

What if we could give the LLM access to real, verified information when answering?

RAG is designed to address exactly this problem!

Source: Iguazio

The solution: Give AI access to real information!

RAG = Retrieval-Augmented Generation

The main idea:

  1. When you ask a question…
  2. First search your documents for relevant information
  3. Then give that information to the LLM
  4. The LLM generates an answer using your data

Why this works:

  • LLM gets real, up-to-date information
  • Answers are grounded in your documents
  • Can cite sources, so you can verify them!
  • No need to retrain the model

It’s like giving the AI an open-book exam! 📖

RAG concept diagram

Source: AWS

How semantic search works: Embeddings

Remember from Lecture 06?

Text gets converted to vectors (lists of numbers):

Text Vector (simplified)
“king” [0.82, 0.15, -0.43, …]
“queen” [0.79, 0.18, -0.41, …]
“apple” [-0.12, 0.67, 0.23, …]

Key properties:

  • Similar meanings → similar vectors
  • “king” and “queen” are close in vector space
  • “king” and “apple” are far apart
  • Modern embedding models produce 768 to 3,072 dimensions

Model Dimensions Use Case
OpenAI ada-002 1,536 General purpose
Google Gecko 768 Lightweight
Cohere Embed v3 1,024 Multilingual

Sentence embeddings visualised

Similar sentences cluster together in embedding space

Measuring similarity: Cosine similarity

How do we measure if two vectors are similar?

  • Cosine similarity measures the angle between vectors
  • The numerator is the dot product (multiply their corresponding components and add them all up; how much they point in the same direction)
  • The denominator normalises by their lengths (magnitude)
    • If vector \(A = [0.8, 0.6]\), then \(\|A\| = \sqrt{0.8^2 + 0.6^2} = 1\)
  • If vectors are identical, cosine similarity = 1; if orthogonal, = 0; if opposite, = -1
  • For non-negative vectors (like raw word counts), scores stay in [0, 1]; embedding components can be negative, so embedding similarities can range over [-1, 1]

\[\text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \times \|B\|}\]

Interpretation:

Score Meaning Example
0.9–1.0 Very similar “car” vs “automobile”
0.7–0.9 Related “car” vs “truck”
0.4–0.7 Loosely related “car” vs “road”
0.0–0.4 Unrelated “car” vs “banana”

In RAG systems:

  • Query embedding compared to all chunk embeddings
  • Return chunks with highest similarity scores
  • Typical threshold: retrieve if score > 0.7

Mini-example (2D for simplicity):

"I love dogs"  → [0.8, 0.6]
"I adore puppies" → [0.75, 0.65]
"The weather is nice" → [-0.3, 0.9]

Cosine similarity:

  • “dogs” vs “puppies”: ≈ 1.00
  • “dogs” vs “weather”: ≈ 0.32

The search finds “puppies” when you ask about “dogs”!
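The mini-example above can be checked directly. A quick sketch in plain Python, using the made-up 2D vectors from the slide:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

dogs    = [0.8, 0.6]    # "I love dogs"
puppies = [0.75, 0.65]  # "I adore puppies"
weather = [-0.3, 0.9]   # "The weather is nice"

print(round(cosine_similarity(dogs, puppies), 2))  # 1.0
print(round(cosine_similarity(dogs, weather), 2))  # 0.32
```

Real embedding vectors have hundreds or thousands of dimensions, but the formula is identical.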

The RAG Pipeline 🔄

How RAG works

RAG pipeline

Left side (once): Prepare your documents for searching

Right side (every query): Find relevant info, then generate answer

Think of it like a librarian (retrieval) who finds the right books, paired with a scholar (generation) who reads them and answers your question! 📚

Step 1: Ingest your documents

What can you ingest?

Format Examples Challenges
PDF Papers, reports Tables, columns, headers
Word/Docs Reports, notes Formatting, styles
Web pages Articles, docs Navigation, ads
Code .py, .js files Comments vs. code
Transcripts Meeting notes Speaker identification

The challenge:

  • Documents are unstructured
  • Need to extract clean text
  • Preserve meaningful structure

Good news: Tools like NotebookLM and ChatGPT handle extraction automatically!

Document ingestion

Many document types → unified text
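Even with automatic extraction, a little normalisation helps. A minimal sketch of the kind of clean-up extraction tools do behind the scenes; the rules and regex here are illustrative, not any specific tool's pipeline:

```python
import re

def clean_extracted_text(raw):
    """Heuristic clean-up of extracted text: drop blank lines and page numbers.
    (Illustrative rules only; real extractors handle many more cases.)"""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue                                   # blank line
        if re.fullmatch(r"Page \d+( of \d+)?", line):  # page-number footer
            continue
        kept.append(line)
    return " ".join(kept)

raw = "Annual Report\n\nPage 1 of 12\nRevenue grew 14%\nin fiscal 2024."
print(clean_extracted_text(raw))  # Annual Report Revenue grew 14% in fiscal 2024.
```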

Step 2: Chunk the text

Why chunk?

  • LLMs have context window limits (4K–128K tokens)
  • Can’t fit a whole book in one prompt!
  • Need to find the most relevant parts
  • There are many strategies for chunking, each with its own trade-offs. More here

Chunking parameters:

Parameter Typical Values Trade-off
Chunk size 256–1,024 tokens Small = precise, Large = more context
Overlap 10–20% Prevents losing info at boundaries
Strategy Sentence, paragraph, semantic Depends on document structure

Rule of thumb: Start with 512 tokens, 20% overlap. Adjust based on your documents and retrieval quality.

Chunk size trade-offs:

Size Pros Cons
Small (256) Precise retrieval Loses context
Medium (512) Balanced Good default
Large (1024) Rich context May dilute relevance
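The parameters above map directly onto a simple sliding-window chunker. A sketch, using whitespace-separated words as a stand-in for tokens (real systems count tokens with the model's tokenizer):

```python
def chunk_text(words, chunk_size=512, overlap=0.2):
    """Split a word list into chunks of `chunk_size` with fractional overlap."""
    step = int(chunk_size * (1 - overlap))  # at 20% overlap, advance 80% each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the document
    return chunks

document = [f"word{i}" for i in range(1200)]  # a 1,200-"token" document
chunks = chunk_text(document, chunk_size=512, overlap=0.2)
print(len(chunks))  # 3 chunks: words 0-511, 409-920, 818-1199
```

Note how each chunk re-reads the last ~100 words of the previous one, so a sentence split at a boundary still appears whole in at least one chunk.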

Chunking visualisation

Source: Mastering LLM

Step 3: Embed, store & retrieve

Convert chunks to vectors & store:

  • Each chunk → embedding model → vector
  • Same embedding model used for queries later!

Database Type Speed (1M vectors)
Pinecone Cloud ~50ms queries
Chroma Local/Cloud ~100ms queries
FAISS Local ~10ms queries

Why vector databases?

  • Traditional databases: exact match
  • Vector databases: similarity search

Don’t worry: no-code tools handle this for you!

Retrieval (when you ask a question):

  1. Question → embedding → query vector
  2. Vector DB finds similar chunks
  3. Returns top-k most similar (k = 3–10)

Example: “What is our refund policy?”

Rank Chunk Score
1 “Returns and refunds: Customers may return…” 0.92
2 “Our guarantee covers full refunds…” 0.87
3 “Payment methods accepted…” 0.54 ❌

Parameters: top-k (3–10), threshold (0.7–1.0)
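The retrieval step can be sketched end to end. The chunk vectors below are hand-picked 2D stand-ins chosen to reproduce the example scores; real systems compare high-dimensional embeddings from a model:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hand-picked 2D vectors (illustrative only, not real embeddings)
chunk_vectors = {
    "Returns and refunds: Customers may return...": [0.92, 0.39],
    "Our guarantee covers full refunds...":          [0.87, 0.49],
    "Payment methods accepted...":                   [0.54, 0.84],
}

def retrieve(query_vec, chunk_vectors, top_k=3, threshold=0.7):
    """Score every chunk against the query, keep the top-k above the threshold."""
    scored = sorted(
        ((cosine(query_vec, v), text) for text, v in chunk_vectors.items()),
        reverse=True,
    )
    return [(round(s, 2), t) for s, t in scored[:top_k] if s >= threshold]

query_vec = [1.0, 0.0]  # pretend embedding of "What is our refund policy?"
for score, text in retrieve(query_vec, chunk_vectors):
    print(score, text)
# 0.92 Returns and refunds: Customers may return...
# 0.87 Our guarantee covers full refunds...
# (the 0.54 payment chunk is filtered out by the 0.7 threshold)
```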

Step 4: Generate with context

The final step:

The LLM receives a prompt like this:

System: Answer the user’s question using ONLY the context provided. If the answer isn’t in the context, say “I don’t have that information.”

Context: [Chunk 1]: “Returns and refunds: Customers may return items within 30 days…”. [Chunk 2]: “Our guarantee covers full refunds for defective products…”

User question: “What is your refund policy?”

The LLM now:

  • Has specific, relevant information to work with
  • Is instructed to cite sources
  • Is told to say “I don’t know” if context doesn’t contain the answer
  • Generates a grounded response

Context injection

Retrieved chunks become the LLM’s “reference material”

This is why RAG reduces hallucinations! The LLM has verified information to work with, not just its training data!
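Assembling that prompt is plain string formatting. A small sketch; the function name and layout are illustrative:

```python
def build_rag_prompt(question, retrieved_chunks):
    """Assemble a grounding prompt from retrieved chunks (layout is illustrative)."""
    context = " ".join(
        f'[Chunk {i}]: "{chunk}"' for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    system = (
        "Answer the user's question using ONLY the context provided. "
        "If the answer isn't in the context, say \"I don't have that information.\""
    )
    return f"System: {system}\n\nContext: {context}\n\nUser question: \"{question}\""

prompt = build_rag_prompt(
    "What is your refund policy?",
    ["Returns and refunds: Customers may return items within 30 days...",
     "Our guarantee covers full refunds for defective products..."],
)
print(prompt)
```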

RAG vs. Alternatives 🔄

RAG vs. fine-tuning vs. prompting

Three ways to customise LLM behaviour:

Aspect Prompt Engineering RAG Fine-tuning
What it does Careful instructions Add external knowledge Retrain model weights
Cost Free or $ $ $$$
Setup time Minutes Hours Days–Weeks
Data freshness Training cutoff Real-time Training cutoff
Accuracy (domain) Low–Medium High High
Hallucination risk High Low Medium
Cites sources ❌ No ✅ Yes ❌ No
Best for Simple tasks Knowledge-intensive QA Style/behaviour change

When to use each:

  • Prompting: Quick experiments, general tasks, no private data
  • RAG: Customer support, research, legal/medical QA, any task needing current or private information
  • Fine-tuning: Specific writing style, consistent persona, specialised domain language

Research findings: Does RAG actually help?

Yes! The evidence is strong:

Study Finding
Lewis et al. (2020) Original RAG paper: outperformed fine-tuned models on knowledge-intensive tasks
Shuster et al. (2021) RAG reduced hallucinations by ~30–50% in dialogue systems
Gao et al. (2024) Comprehensive survey: RAG approach dominant in production systems
Liu et al. (2023) “Lost in the middle”: LLMs use beginning and end of context better than middle

Hallucination rates comparison:

Setting Hallucination Rate
Base LLM (no RAG) 15–40%
LLM + RAG 5–15%
LLM + RAG + verification 2–8%

Rates vary by domain and implementation quality

The “Lost in the Middle” problem:

Liu et al. (2023) found that LLMs pay most attention to:

  1. Beginning of context (primacy)
  2. End of context (recency)
  3. Middle is often ignored!

Implication for RAG:

Put the most relevant chunks first or last, not in the middle!

Lost in the middle effect

Source: Liu et al. (2023)
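One practical mitigation of this effect: reorder retrieved chunks so the strongest ones sit at the edges of the context window. A sketch of the idea (the function is illustrative; some frameworks ship a similar "long-context reorder" step):

```python
def reorder_for_attention(scored_chunks):
    """Alternate high-scoring chunks between the front and the back of the
    context, so the weakest chunks land in the (often-ignored) middle."""
    ranked = sorted(scored_chunks, reverse=True)  # best first
    front, back = [], []
    for i, item in enumerate(ranked):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]

chunks = [(0.9, "A"), (0.5, "C"), (0.7, "B"), (0.3, "D"), (0.8, "E")]
print([text for _, text in reorder_for_attention(chunks)])  # ['A', 'B', 'D', 'C', 'E']
```

The two best chunks (0.9 and 0.8) end up first and last, and the weakest (0.3) is buried in the middle where it matters least.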

RAG Failure Modes ⚠️

When RAG goes wrong

RAG isn’t perfect. Common failure modes:

Failure Type What Happens Frequency
Retrieval failure Wrong chunks retrieved 15–25% of queries
Lost in the middle Relevant info in middle ignored Common with many chunks
Context overflow Too much text, truncated Depends on doc size
Outdated docs Stale information retrieved Depends on maintenance
Extraction errors PDF tables/images parsed incorrectly 10–30% of complex docs

Retrieval failures happen when:

  • Query uses different terms than documents
  • Question is ambiguous
  • Multiple topics compete for relevance
  • Embedding model doesn’t capture domain-specific meaning

Example: Retrieval failure

Your document says: > “The quarterly earnings call is scheduled for March 15th”

You ask: > “When is the investor meeting?”

Problem: “investor meeting” ≠ “earnings call” in embedding space

Result: Wrong chunks retrieved, wrong answer!

Mitigation strategies:

  • Use terms from your documents
  • Add synonyms to your documents
  • Use hybrid search (keyword + semantic)
  • Increase top-k (retrieve more chunks)
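Hybrid search can be sketched as a weighted blend of a keyword score and the semantic score. Everything below is illustrative: the naive word-overlap score and the 50/50 weighting are assumptions, and production systems typically use BM25 or similar for the keyword side:

```python
def keyword_score(query, chunk):
    """Fraction of query words that literally appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words)

def hybrid_score(semantic, keyword, alpha=0.5):
    """Blend semantic and keyword scores; alpha weights the semantic side."""
    return alpha * semantic + (1 - alpha) * keyword

chunk = "The quarterly earnings call is scheduled for March 15th"
print(keyword_score("when is the earnings call", chunk))  # 0.8
print(hybrid_score(semantic=0.55, keyword=0.8))           # blended score
```

In the earnings-call example, a query that reuses the document's own words scores high on keyword overlap even when the embedding similarity is mediocre, so the right chunk is still retrieved.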

Best practices for reliable RAG

For document preparation:

  1. Clean your documents
    • Remove headers, footers, page numbers
    • Fix OCR errors in scanned PDFs
  2. Use descriptive headings
    • Helps chunking and retrieval
    • “Q4 2024 Revenue” > “Section 3.2”
  3. Keep documents updated
    • Stale docs = stale answers
    • Version control your knowledge base
  4. Test with real queries
    • What questions will users actually ask?
    • Do retrieved chunks contain the answer?

For querying:

  1. Be specific
    • ✅ “What was Q4 2024 revenue?”
    • ❌ “Tell me about the company”
  2. Use document terminology
    • If doc says “associates”, ask about “associates” not “employees”
  3. Check the citations!
    • Does the answer match what the source says?
    • This is your verification step
  4. Ask follow-up questions
    • “What source did you use for that?”
    • “Can you quote the relevant passage?”

Golden rule: Trust, but verify 🔍

No-Code RAG Tools 🛠️

Tools you can use today!

Tool Free? Best For Key Feature
NotebookLM ✅ Yes Research, study notes Multi-source synthesis
ChatGPT + Files ✅ Free tier General documents Easy upload & chat
Claude + Files ✅ Free tier Long documents 200K token context
Google AI Studio ✅ Free tier Experimentation Gemini models

All of these implement RAG internally!

  • You upload documents → (Retrieval source)
  • System finds relevant parts → (Augmented)
  • LLM generates answers → (Generation)

No coding required, just upload and ask!

NotebookLM: Your AI research assistant

What is NotebookLM?

  • Free tool from Google
  • Upload up to 50 sources (PDFs, docs, websites, YouTube)
  • Creates a personal knowledge base
  • AI answers questions from YOUR sources only

Features:

Feature What It Does
Source grounding Only answers from your docs
Citations Points to exact source passages
Audio Overview Generates podcast-style summary
Study guides Creates questions & summaries
Cross-referencing Finds connections between sources

Best for: Research projects, exam prep, literature reviews, understanding complex reports

NotebookLM interface

Activity: Z.ai file upload RAG 📎

Compare: With vs. without your documents. Let’s do it together (or at home if Emory’s connection doesn’t allow us to! 😂)

Step 1: Without documents

  1. Open chat.z.ai
  2. Start a new chat
  3. Ask: “What are the assignment deadlines for DATASCI 185 at Emory University?”
  4. Note: Z.ai doesn’t know! (It’s not in training data)

Step 2: With your document

  1. Start another new chat
  2. Click the 📎 (attach) button
  3. Upload the course syllabus PDF
  4. Ask the same question: “What are the assignment deadlines?”
  5. Compare the answers

What to observe:

Without Doc With Doc
“I don’t have access to…” Specific dates from syllabus
May hallucinate generic answer Grounded in your document
No citations Can quote the source

This is RAG in action!

The LLM retrieves relevant parts of your uploaded file and uses them to generate an accurate answer.

You just built a RAG system! 🎉

Real-World Applications 🌍

RAG is everywhere!

Industry Application How RAG Helps Example Company
🏢 Customer Support AI chatbots Answer questions from product docs Intercom, Zendesk
⚖️ Legal Research assistants Search case law by meaning Harvey AI, Casetext
🏥 Healthcare Clinical support Find relevant patient records, guidelines Epic, Nuance
📚 Education Personal tutors Answer questions from course materials Khan Academy, Duolingo
💼 Finance Analyst tools Search earnings reports, SEC filings Bloomberg, Kensho
🔬 Research Literature review Find related papers, summarise findings Elicit, Semantic Scholar
💻 Developer Tools Documentation QA Answer questions from codebases GitHub Copilot, Cursor

Common thread: All need accurate, source-backed answers from specific documents… exactly what RAG provides

Market size: Enterprise RAG solutions expected to reach $40B+ by 2028 (estimates vary)

Advanced RAG: LangChain is a popular framework for building RAG applications. Feel free to explore it if you’re familiar with Python/JavaScript and want to build your own RAG system!

Summary 📚

Key takeaways

  • The problem: LLMs hallucinate and lack access to private/current information

  • Semantic search: Find by meaning using embeddings and cosine similarity

  • The RAG pipeline: Chunk → Embed → Store → Retrieve → Generate

  • Research shows: RAG reduces hallucinations by 30–50%

  • Watch out for: Retrieval failures, lost-in-the-middle, outdated docs

  • No-code tools: NotebookLM, ChatGPT with files, etc.

  • Always verify: RAG reduces errors but doesn’t eliminate them!

Quick reference:

Concept Key Numbers
Cosine similarity 0.9+ = very similar
Chunk size 256–1024 tokens
Overlap 10–20%
Top-k retrieval 3–10 chunks
Hallucination reduction 30–50%

Remember: Upload your docs, ask specific questions, and always check citations!

…and that’s all for today! 🎉