Lecture 11: RAG, Semantic Search, and Grounding AI
Source: Google Research
Part 1: The Problem
Part 2: Semantic Search
Part 3: The RAG Pipeline
Part 4: No-Code Tools
Source: Bhavishya Pandit
The main question for today:
What if we could give the LLM access to real, verified information when answering?
RAG = Retrieval-Augmented Generation
The main idea:
Why this works:
It’s like giving the AI an open-book exam! 📖
Source: AWS
| Feature | Keyword Search | Semantic Search |
|---|---|---|
| Matching | Exact words only | Meaning-based |
| “cheap flights” | ✅ “cheap flights” | ✅ “budget airfare” too! |
| Synonyms | ❌ Misses them | ✅ Understands them |
| Typos | ❌ Breaks results | ✅ Often still works |
| Context | ❌ Ignores it | ✅ Considers it |
| Technology | String matching | Embeddings |
Example query: “How do I fix a broken pipe?”
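A toy illustration of the gap, in plain Python (the document strings are made up for this sketch): keyword search requires every query word to appear verbatim, so the "budget airfare" document is invisible to the query "cheap flights".

```python
docs = [
    "Budget airfare deals for summer travel",    # relevant, but shares no keywords
    "Find cheap flights to Europe this weekend"  # literal keyword match
]
query = "cheap flights"

# Keyword search: every query word must appear verbatim in the document.
keyword_hits = [d for d in docs if all(w in d.lower() for w in query.split())]
print(keyword_hits)  # only the literal match survives
```

A semantic search engine would instead embed both documents and the query, and rank the "budget airfare" document highly as well, because its vector sits close to the query's in embedding space.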
Remember from Lecture 06?
Text gets converted to vectors (lists of numbers):
| Text | Vector (simplified) |
|---|---|
| “king” | [0.82, 0.15, -0.43, …] |
| “queen” | [0.79, 0.18, -0.41, …] |
| “apple” | [-0.12, 0.67, 0.23, …] |
Key properties:
| Model | Dimensions | Use Case |
|---|---|---|
| OpenAI ada-002 | 1,536 | General purpose |
| Google Gecko | 768 | Lightweight |
| Cohere Embed v3 | 1,024 | Multilingual |
How do we measure if two vectors are similar?
\[\text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \times \|B\|}\]
Interpretation:
| Score | Meaning | Example |
|---|---|---|
| 0.9–1.0 | Very similar | “car” vs “automobile” |
| 0.7–0.9 | Related | “car” vs “truck” |
| 0.4–0.7 | Loosely related | “car” vs “road” |
| 0.0–0.4 | Unrelated | “car” vs “banana” |
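The formula above is a few lines of Python. Using the simplified 3-number vectors from the embeddings table (real vectors have hundreds of dimensions, but the maths is identical):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

king  = [0.82, 0.15, -0.43]
queen = [0.79, 0.18, -0.41]
apple = [-0.12, 0.67, 0.23]

print(cosine_similarity(king, queen))  # close to 1.0: very similar
print(cosine_similarity(king, apple))  # near zero: unrelated
```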
In RAG systems:
What can you ingest?
| Format | Examples | Challenges |
|---|---|---|
| PDF | Papers, reports | Tables, columns, headers |

| Word/Docs | Reports, notes | Formatting, styles |
| Web pages | Articles, docs | Navigation, ads |
| Code | .py, .js files | Comments vs. code |
| Transcripts | Meeting notes | Speaker identification |
The challenge:
Good news: Tools like NotebookLM and ChatGPT handle extraction automatically!
Why chunk?
Chunking parameters:
| Parameter | Typical Values | Trade-off |
|---|---|---|
| Chunk size | 256–1,024 tokens | Small = precise, Large = more context |
| Overlap | 10–20% | Prevents losing info at boundaries |
| Strategy | Sentence, paragraph, semantic | Depends on document structure |
Rule of thumb: Start with 512 tokens, 20% overlap. Adjust based on your documents and retrieval quality.
Chunk size trade-offs:
| Size | Pros | Cons |
|---|---|---|
| Small (256) | Precise retrieval | Loses context |
| Medium (512) | Balanced | Good default |
| Large (1024) | Rich context | May dilute relevance |
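The chunking step above can be sketched in a few lines. This is a word-level simplification (real pipelines count tokens and often split on sentence or semantic boundaries); `chunk_text` and its parameters are illustrative, not a library API:

```python
def chunk_text(text, chunk_size=512, overlap=0.2):
    """Split text into word-based chunks with fractional overlap between them."""
    words = text.split()
    step = max(1, int(chunk_size * (1 - overlap)))  # how far the window advances
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end
        start += step
    return chunks
```

With `chunk_size=512` and `overlap=0.2`, consecutive chunks share about 100 words, so a sentence that straddles a boundary still appears whole in at least one chunk.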
Source: Mastering LLM
Convert chunks to vectors & store:
| Database | Type | Speed (1M vectors) |
|---|---|---|
| Pinecone | Cloud | ~50ms queries |
| Chroma | Local/Cloud | ~100ms queries |
| FAISS | Local | ~10ms queries |
Why vector databases?
Don’t worry: no-code tools handle this for you!
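Conceptually, the "embed and store" step just maps each chunk to a vector and keeps (vector, chunk) pairs you can search later. The `toy_embed` hashing trick below is purely a stand-in for a real embedding model (real vectors come from models like those in the earlier table); it only exists to make the sketch self-contained:

```python
import hashlib
import math

def toy_embed(text, dim=8):
    """Deterministic stand-in for a real embedding model (toy, not meaningful)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit-normalised, like most embedding APIs

store = []  # our "vector database": a list of (vector, chunk) pairs
for chunk in [
    "Returns and refunds: customers may return items within 30 days",
    "Payment methods accepted: credit card and PayPal",
]:
    store.append((toy_embed(chunk), chunk))
```

A real vector database (Pinecone, Chroma, FAISS) does the same thing plus indexing structures that make the search fast at millions of vectors.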
Retrieval (when you ask a question):
Example: “What is our refund policy?”
| Rank | Chunk | Score |
|---|---|---|
| 1 | “Returns and refunds: Customers may return…” | 0.92 |
| 2 | “Our guarantee covers full refunds…” | 0.87 |
| 3 | “Payment methods accepted…” | 0.54 ❌ |
Parameters: top-k (3–10), threshold (0.7–1.0)
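The retrieval step itself is just "score every stored chunk against the query vector, keep the best". A minimal sketch with the top-k and threshold parameters above (function and variable names are illustrative; the 2-D vectors are hand-made so the scores are easy to check):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def retrieve(query_vec, store, top_k=3, threshold=0.7):
    """Score every stored chunk, keep the top_k whose similarity clears the threshold."""
    scored = sorted(((cosine(query_vec, vec), chunk) for vec, chunk in store),
                    reverse=True)
    return [(score, chunk) for score, chunk in scored[:top_k] if score >= threshold]

# Toy 2-D "database": in practice these vectors come from an embedding model.
store = [
    ([1.0, 0.0], "Returns and refunds: Customers may return..."),
    ([0.9, 0.1], "Our guarantee covers full refunds..."),
    ([0.0, 1.0], "Payment methods accepted..."),
]
results = retrieve([1.0, 0.0], store)  # the payment chunk falls below 0.7
```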
The final step:
The LLM receives a prompt like this:
System: Answer the user’s question using ONLY the context provided. If the answer isn’t in the context, say “I don’t have that information.”
Context: [Chunk 1]: “Returns and refunds: Customers may return items within 30 days…”. [Chunk 2]: “Our guarantee covers full refunds for defective products…”
User question: “What is your refund policy?”
The LLM now:
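Stitching retrieved chunks into a prompt like the one above is plain string formatting; `build_rag_prompt` is a hypothetical helper written for this sketch, not a library function:

```python
def build_rag_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks (sketch)."""
    context = "\n".join(f'[Chunk {i + 1}]: "{c}"' for i, c in enumerate(chunks))
    system = ('Answer the user\'s question using ONLY the context provided. '
              'If the answer isn\'t in the context, say "I don\'t have that information."')
    return f'System: {system}\n\nContext:\n{context}\n\nUser question: "{question}"'

prompt = build_rag_prompt(
    "What is your refund policy?",
    ["Returns and refunds: Customers may return items within 30 days..."],
)
```

The assembled string is what actually gets sent to the LLM; the model never sees your full documents, only the retrieved chunks.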
Three ways to customise LLM behaviour:
| Aspect | Prompt Engineering | RAG | Fine-tuning |
|---|---|---|---|
| What it does | Careful instructions | Add external knowledge | Retrain model weights |
| Cost | Free or $ | $ | $$$ |
| Setup time | Minutes | Hours | Days–Weeks |
| Data freshness | Training cutoff | Real-time | Training cutoff |
| Accuracy (domain) | Low–Medium | High | High |
| Hallucination risk | High | Low | Medium |
| Cites sources | ❌ | ✅ | ❌ |
| Best for | Simple tasks | Knowledge-intensive QA | Style/behaviour change |
When to use each:
Yes! The evidence is strong:
| Study | Finding |
|---|---|
| Lewis et al. (2020) | Original RAG paper: outperformed fine-tuned models on knowledge-intensive tasks |
| Shuster et al. (2021) | RAG reduced hallucinations by ~30–50% in dialogue systems |
| Gao et al. (2024) | Comprehensive survey: RAG approach dominant in production systems |
| Liu et al. (2023) | “Lost in the middle”: LLMs use beginning and end of context better than middle |
Hallucination rates comparison:
| Setting | Hallucination Rate |
|---|---|
| Base LLM (no RAG) | 15–40% |
| LLM + RAG | 5–15% |
| LLM + RAG + verification | 2–8% |
Rates vary by domain and implementation quality
The “Lost in the Middle” problem:
Liu et al. (2023) found that LLMs pay most attention to:
Implication for RAG:
Put the most relevant chunks first or last, not in the middle!
Source: Liu et al. (2023)
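A common mitigation, given the finding above, is to interleave the retrieved chunks so the highest-scoring ones land at the start and end of the context rather than the middle. A sketch (the function name is made up; input is assumed sorted best-first):

```python
def reorder_for_attention(chunks_by_score):
    """Place the strongest chunks first and last, the weakest in the middle.

    chunks_by_score: retrieved chunks sorted best-first.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_score):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(reorder_for_attention(["best", "2nd", "3rd", "4th"]))
```

The best chunk ends up first and the second-best last, so both sit in the regions the model attends to most.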
RAG isn’t perfect. Common failure modes:
| Failure Type | What Happens | Frequency |
|---|---|---|
| Retrieval failure | Wrong chunks retrieved | 15–25% of queries |
| Lost in the middle | Relevant info in middle ignored | Common with many chunks |
| Context overflow | Too much text, truncated | Depends on doc size |
| Outdated docs | Stale information retrieved | Depends on maintenance |
| Extraction errors | PDF tables/images parsed incorrectly | 10–30% of complex docs |
Retrieval failures happen when:
Example: Retrieval failure
Your document says: > “The quarterly earnings call is scheduled for March 15th”
You ask: > “When is the investor meeting?”
Problem: “investor meeting” ≠ “earnings call” in embedding space
Result: Wrong chunks retrieved, wrong answer!
Mitigation strategies:
For document preparation:
For querying:
Golden rule: Trust, but verify 🔍
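One concrete mitigation for the paraphrase mismatch above ("investor meeting" vs "earnings call") is query expansion: search with several rewordings, not just the user's phrasing. The hand-built synonym map below is a toy; production systems typically ask an LLM to generate the rewrites:

```python
# Toy synonym map (hand-written for this sketch).
SYNONYMS = {"investor meeting": ["earnings call", "shareholder meeting"]}

def expand_query(query):
    """Return the original query plus known paraphrases, to improve retrieval recall."""
    variants = [query]
    for phrase, alternatives in SYNONYMS.items():
        if phrase in query.lower():
            variants += [query.lower().replace(phrase, alt) for alt in alternatives]
    return variants

print(expand_query("When is the investor meeting?"))
```

Each variant is embedded and searched separately, and the results are merged, so a chunk that matches any phrasing can be retrieved.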
| Tool | Free? | Best For | Key Feature |
|---|---|---|---|
| NotebookLM | ✅ Yes | Research, study notes | Multi-source synthesis |
| ChatGPT + Files | ✅ Free tier | General documents | Easy upload & chat |
| Claude + Files | ✅ Free tier | Long documents | 200K token context |
| Google AI Studio | ✅ Free tier | Experimentation | Gemini models |
All of these implement RAG internally!
No coding required, just upload and ask!
What is NotebookLM?
Features:
| Feature | What It Does |
|---|---|
| Source grounding | Only answers from your docs |
| Citations | Points to exact source passages |
| Audio Overview | Generates podcast-style summary |
| Study guides | Creates questions & summaries |
| Cross-referencing | Finds connections between sources |
Best for: Research projects, exam prep, literature reviews, understanding complex reports
Compare: With vs. without your documents. Let’s do it together (or at home if Emory’s connection doesn’t allow us to! 😂)
Step 1: Without documents
Step 2: With your document
What to observe:
| Without Doc | With Doc |
|---|---|
| “I don’t have access to…” | Specific dates from syllabus |
| May hallucinate generic answer | Grounded in your document |
| No citations | Can quote the source |
This is RAG in action!
The LLM retrieves relevant parts of your uploaded file and uses them to generate an accurate answer.
You just built a RAG system! 🎉
| Industry | Application | How RAG Helps | Example Company |
|---|---|---|---|
| 🏢 Customer Support | AI chatbots | Answer questions from product docs | Intercom, Zendesk |
| ⚖️ Legal | Research assistants | Search case law by meaning | Harvey AI, Casetext |
| 🏥 Healthcare | Clinical support | Find relevant patient records, guidelines | Epic, Nuance |
| 📚 Education | Personal tutors | Answer questions from course materials | Khan Academy, Duolingo |
| 💼 Finance | Analyst tools | Search earnings reports, SEC filings | Bloomberg, Kensho |
| 🔬 Research | Literature review | Find related papers, summarise findings | Elicit, Semantic Scholar |
| 💻 Developer Tools | Documentation QA | Answer questions from codebases | GitHub Copilot, Cursor |
Common thread: All need accurate, source-backed answers from specific documents, which is exactly what RAG provides
Market size: Enterprise RAG solutions expected to reach $40B+ by 2028 (estimates vary)
Advanced RAG: LangChain is a popular framework for building RAG applications. Feel free to explore it if you’re familiar with Python/JavaScript and want to build your own RAG system!
The problem: LLMs hallucinate and lack access to private/current information
Semantic search: Find by meaning using embeddings and cosine similarity
The RAG pipeline: Chunk → Embed → Store → Retrieve → Generate
Research shows: RAG reduces hallucinations by 30–50%
Watch out for: Retrieval failures, lost-in-the-middle, outdated docs
No-code tools: NotebookLM, ChatGPT with files, etc.
Always verify: RAG reduces errors but doesn’t eliminate them!
Quick reference:
| Concept | Key Numbers |
|---|---|
| Cosine similarity | 0.9+ = very similar |
| Chunk size | 256–1024 tokens |
| Overlap | 10–20% |
| Top-k retrieval | 3–10 chunks |
| Hallucination reduction | 30–50% |
Remember: Upload your docs, ask specific questions, and always check citations!