DATASCI 185: Introduction to AI Applications

Lecture 13: Building Reliable Pipelines, Monitoring, and Testing

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 🔧

Recap of last class

  • Last time, we explored RAG (Retrieval-Augmented Generation)
  • LLMs have knowledge cutoffs and hallucinate
  • Ground AI in your own documents
  • Chunk → Embed → Retrieve → Generate
  • RAG can substantially reduce hallucinations (figures of 30–50% are commonly cited)
  • But RAG isn’t perfect: retrieval errors still occur
  • Today: What happens behind the scenes when you use AI?
  • You’ve all experienced AI going down, being slow, or giving weird answers
  • Why does this happen? And what can you do about it?

Source: Astera Software

Lecture overview

Today’s agenda

Part 1: What is a Pipeline?

  • From prompt to response
  • Why “it works on my laptop” isn’t enough
  • Real-world AI failures and lessons

Part 2: Why Pipelines Break

  • Data drift: when the world changes
  • Model degradation over time
  • Infrastructure and scaling issues

Part 3: Monitoring Basics

  • What to watch: the key metrics
  • Alerting: the “canary in the coal mine”
  • Dashboards and observability

Part 4: Evaluating AI Tools as a User

  • Signs of a reliable vs. unreliable AI
  • Your personal reliability checklist
  • Hands-on: Rate real AI tools!

Meme of the day 😄

That’s actually a very good point!

DataFest 2026

DataFest 2026 is coming!

What is a Pipeline? 🏭

From prompt to response

When you ask an LLM a question, here’s what happens:

  1. Your prompt goes from your browser to servers
  2. Load balancers route it to an available GPU (cloud infrastructure)
  3. Preprocessing cleans and formats your text (with LLMs, this is usually minimal)
  4. Tokenisation converts text to numbers (you know how to do that!)
  5. The model processes tokens (billions of operations!)
  6. Postprocessing formats the output
  7. Safety filters check for harmful content (e.g., hate speech, etc.)
  8. Response travels back to your screen

Each step can fail independently!

This entire process is called a pipeline.
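The eight steps above can be sketched as composed functions. This is a minimal illustration, not any provider's real implementation; every name here is made up, and the "model" is a placeholder:

```python
# Illustrative sketch of an AI pipeline: each stage is a separate
# function, and each stage can fail independently.

def preprocess(text: str) -> str:
    # With LLMs this is usually minimal: strip and normalise whitespace
    return " ".join(text.split())

def tokenise(text: str) -> list[str]:
    # Real tokenisers use subword units; whitespace split is a stand-in
    return text.split()

def model(tokens: list[str]) -> str:
    # Placeholder for billions of GPU operations
    return f"Echoing {len(tokens)} tokens back to you."

def safety_filter(output: str) -> str:
    # Crude keyword check standing in for a real content filter
    banned = {"explosives"}
    if any(word in output.lower() for word in banned):
        return "I can't help with that request."
    return output

def pipeline(prompt: str) -> str:
    # Compose the stages; an error at any step breaks the whole chain
    return safety_filter(model(tokenise(preprocess(prompt))))

print(pipeline("  What is   a pipeline?  "))
```

If `tokenise` raises, `model` times out, or `safety_filter` is misconfigured, the user just sees a failure: that is the sense in which each step can fail independently.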

The AI pipeline

A simple question = many complex steps behind the scenes

Failure types: what actually goes wrong?

Not all failures are the same:

Failure type | What it looks like | Root cause
Hallucination | Confident but false content | Model limitations, missing grounding
Policy failure | Unsafe or inappropriate output | Weak guardrails or bad updates
Data drift | Outdated or weird answers | World changes faster than training
Pipeline failure | Slow, down, or inconsistent responses | Infrastructure, scaling, broken steps

You can spot what type of failure happened, then choose the right response

When AI Breaks 💥

Real AI failures (recent headlines)

Incident | What Happened | Impact
Air Canada chatbot (2024) | Invented a refund policy | Company lost lawsuit
DPD chatbot (2024) | Swore at customers, criticised its own company | Emergency shutdown
Google Gemini images (2024) | Historically inaccurate diverse images | Feature paused, CEO apologised
Chevy chatbot (2023) | Agreed to sell car for $1 | Prompt injection attack
Snapchat My AI (2023) | Privacy concerns, couldn’t be removed | User backlash

Common causes:

  1. Hallucination treated as truth (Air Canada)
  2. Updates broke guardrails (DPD)
  3. No input validation (Chevy)
  4. Overcorrection for bias (Gemini)
  5. Rush to market (Snapchat)

Source: https://x.com/ChrisJBakke/status/1736533308849443121. The whole thread is hilarious 😂

Case study: DPD’s sweary chatbot

What happened (January 2024):

  • DPD (UK delivery company) updated their AI chatbot
  • After the update, customer Ashley Beauchamp discovered it would:
    • Swear when asked nicely
    • Write poems criticising DPD
    • Call itself “useless” and recommend competitors
    • Say DPD was “the worst delivery firm in the world”

DPD’s response:

Immediately disabled the AI chatbot, apologised, launched investigation.

What went wrong:

A system update broke the guardrails. The content filtering that prevented inappropriate responses stopped working, but no one noticed until customers posted screenshots on social media.

Source: BBC News

Pipeline failure point:

No monitoring after deployment. The update was pushed without testing, and no automated checks caught the broken guardrails.

Case study: Google Gemini’s image crisis

What happened (February 2024):

  • Users asked Gemini for images of “1943 German soldiers”
  • Gemini generated images of Black men and Asian women in Nazi uniforms
  • Similar problems with “US Founding Fathers” and other historical figures
  • Google CEO called results “completely unacceptable”
  • Image generation feature paused for weeks

Why did this happen? Can you guess?

Source: BBC News

Pipeline failure point:

Testing didn’t catch edge cases

Why Pipelines Break 🔥

Data drift: The world changes

Data drift = When real-world data differs from training data

Example: A sentiment analyser trained in 2020:

  • “This product is sick!” → Probably means: Amazing ✅
  • “This slaps!” → Might not understand 🤔
  • “No cap, this is bussin’” → Completely lost! 😵

Types of drift:

Type | Description | Example
Label drift | What’s “correct” changes | Policy updates
Concept drift | Meaning of things changes | New slang, trends
Data drift | Input distribution changes | New demographics

Data drift visualisation

Source: Spot Intelligence

The world keeps changing, but your model stays frozen!

Why does data drift happen?

Models assume \(P_{\text{train}}(X, Y) = P_{\text{prod}}(X, Y)\)

But in reality, this joint distribution shifts over time:

  1. Data drift: \(P(X)\) changes
    • Input distribution differs from training
    • Example: New user demographics, device types
    • Model sees inputs it never learned from
  2. Label drift: \(P(Y)\) changes
    • Class frequencies change over time
    • Example: Fraud rate increases during holidays
    • Model’s decision boundaries become suboptimal
  3. Concept drift: \(P(Y|X)\) changes
    • Same input → different correct output
    • Example: “Sick” now means “great” in slang
    • The most dangerous type: model is confident but wrong

Technical causes:

Cause | Effect
Non-stationarity | Real-world processes aren’t static
Sample selection bias | Training data ≠ production population
Feedback loops | Model outputs influence future inputs
External shocks | COVID, policy changes, viral trends

Why it’s inevitable:

Machine learning assumes i.i.d. data (independent and identically distributed). But real-world data is:

  • Temporally correlated (today depends on yesterday)
  • Non-stationary (distributions shift)
  • Adversarial (users game the system)
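A common way teams detect drift in \(P(X)\) is the Population Stability Index (PSI), which compares binned histograms of a feature at training time vs. in production. A minimal pure-Python sketch (the bin count and the 0.25 alert threshold are illustrative rules of thumb, not fixed standards):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of a numeric feature.
    Rule of thumb (illustrative): PSI > 0.25 suggests major drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def frac(sample, i):
        # Fraction of the sample falling into bin i, floored to avoid log(0)
        count = sum(1 for x in sample
                    if lo + i * width <= x < lo + (i + 1) * width)
        return max(count / len(sample), 1e-6)

    return sum(
        (frac(actual, i) - frac(expected, i))
        * math.log(frac(actual, i) / frac(expected, i))
        for i in range(bins)
    )

train = [0.1 * i for i in range(100)]             # training distribution
prod_same = [0.1 * i for i in range(100)]         # no drift
prod_shifted = [0.1 * i + 5 for i in range(100)]  # inputs have shifted

print(psi(train, prod_same))     # near 0: stable
print(psi(train, prod_shifted))  # large: drift alert
```

A monitoring job might compute this daily per feature and page someone when the index crosses the threshold.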

Model degradation: Models get stale

Even without drift, models degrade over time:

  • User expectations evolve
  • Competitors improve their products
  • Edge cases accumulate
  • Small errors compound

Signs your AI tool is degrading (as a user):

  • ❌ Answers that used to work now don’t
  • ❌ More “I don’t know” or refusals
  • ❌ Inconsistent responses to same prompt
  • ❌ Friends getting better results elsewhere
  • ❌ You’re finding workarounds more often

The “boiling frog” problem:

Changes happen so gradually you don’t notice until it’s too late!

Model degradation

Source: James Howard

Performance declines gradually, even without drift!

Why does model degradation happen?

Model performance \(P(t)\) decays over time, even with stable data:

  1. Entropy accumulation
    • Small prediction errors compound over time
    • Error rate \(\epsilon\) grows: \(\epsilon(t) \approx \epsilon_0 \cdot e^{\lambda t}\)
    • Edge cases create cascading failures
  2. Distribution shift in deployment
    • Model outputs influence user behaviour
    • Users adapt prompts based on past responses
    • Creates feedback loops that amplify biases
  3. Catastrophic forgetting (after updates)
    • Fine-tuning on new data overwrites old knowledge
    • Safety training can reduce capability
    • Model “forgets” how to handle rare cases

Technical mechanisms:

Mechanism | Effect
Weight decay | Regularisation erodes rare patterns
Quantisation | Compression loses precision
API updates | Provider changes break workflows
Context pollution | Long conversations degrade quality

Models optimise for average-case performance. But users remember the worst failures.

User trust is proportional to min(performance)

Not the mean!
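The min-vs-mean point is easy to see with numbers. The scores below are hypothetical quality ratings for ten interactions with the same tool:

```python
# Hypothetical quality scores for 10 interactions with one AI tool.
# Nine are excellent; one is a catastrophic failure.
scores = [0.95, 0.97, 0.96, 0.94, 0.98, 0.95, 0.96, 0.97, 0.10, 0.96]

mean_score = sum(scores) / len(scores)
worst_score = min(scores)

print(f"mean: {mean_score:.2f}")   # looks fine on a dashboard
print(f"min:  {worst_score:.2f}")  # the failure users actually remember
```

The dashboard average still looks healthy, but the single 0.10 interaction is the one the user tells their friends about.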

Infrastructure issues: When computers fail

AI systems run on physical computers that can fail:

  • Hardware failures: GPUs overheat, disks die
  • Network issues: Connections timeout, packets lost
  • Scaling problems: Too many users at once
  • Resource exhaustion: Running out of memory
  • Dependency failures: External APIs go down

What you see as a user:

Behind the scenes | What you experience
GPU memory full | Slow or no response
Network timeout | “Error, try again”
Rate limits hit | Degraded access
Database down | RAG retrieval fails!
API key issues | Feature stops working
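Because many infrastructure failures (timeouts, rate limits, overloaded GPU pools) are transient, clients typically retry with exponential backoff rather than hammering an already struggling server. A minimal sketch, with all names and delays illustrative:

```python
import random
import time

def call_with_retry(request_fn, max_attempts=4, base_delay=0.1):
    """Retry a flaky call with exponential backoff plus jitter.
    Doubling the delay each attempt gives an overloaded server room to recover."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)

# Simulated flaky backend: fails twice, then succeeds
calls = {"n": 0}
def flaky_api():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("GPU pool busy")
    return "response"

print(call_with_retry(flaky_api, base_delay=0.01))  # succeeds on the 3rd try
```

The random jitter matters: without it, thousands of clients retrying in lockstep would all hit the server at the same instant again.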

Infrastructure diagram

Source: MongoDB

Many components = many potential failure points

Monitoring Basics 📊

What companies watch: Key metrics

Essential metrics for AI systems:

Category | Metric | Why It Matters
Performance | Latency | How fast are responses?
Performance | Throughput | How many requests per second?
Performance | Error rate | What % fail completely?
Quality | Accuracy/relevance | Are answers correct?
Quality | Hallucination rate | How often does it make things up?
Quality | User satisfaction | Thumbs up/down?
Resources | Token usage | How much does each request cost?
Resources | GPU memory | Are we close to capacity?
Business | User engagement | Are users coming back?

As a user: You experience these as speed, accuracy, and availability!
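These metrics are computed from request logs. A minimal sketch of how a dashboard might derive p95 latency, error rate, and throughput from one window of logs (the 60-second window and the alert thresholds in the comment are assumptions for illustration):

```python
def summarise(requests):
    """requests: list of (latency_seconds, ok_bool) tuples from one window.
    Returns the headline metrics a simple dashboard might show."""
    latencies = sorted(latency for latency, _ in requests)
    # p95 latency: the value 95% of requests are faster than
    p95_index = min(int(0.95 * len(latencies)), len(latencies) - 1)
    return {
        "p95_latency_s": latencies[p95_index],
        "error_rate": sum(1 for _, ok in requests if not ok) / len(requests),
        "throughput_rps": len(requests) / 60,  # assuming a 60-second window
    }

# Simulated window: 95 fast successes, 3 slow successes, 2 failures
log = [(0.4, True)] * 95 + [(3.0, True)] * 3 + [(5.0, False)] * 2
stats = summarise(log)
print(stats)
# An alert might fire if, say, error_rate > 0.01 or p95 latency > 2 s
```

Percentiles (p95, p99) matter more than averages here for the same reason user trust tracks the minimum: the slowest requests are the ones people notice.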

Monitoring dashboard

Source: Oracle

A dashboard shows real-time health of an AI system

Status pages: Checking AI reliability

Every major AI service has a status page:

Service | Status Page
OpenAI (ChatGPT) | status.openai.com
Anthropic (Claude) | status.anthropic.com
Google AI | status.cloud.google.com
Midjourney | Check their Discord

What you can learn:

  • Current outages and degraded performance
  • Historical incidents (how often things break)
  • Maintenance windows
  • Incident post-mortems (what went wrong)

Tip: Check the status page before blaming your prompt!

OpenAI’s status page shows real incidents https://status.openai.com

Fun fact: OpenAI’s status page shows frequent incidents: error rates, regional outages, feature degradation. AI reliability is super hard!

Testing AI: Why it’s hard

Traditional software testing:

  • Input: 2 + 2
  • Expected output: 4
  • Test: Assert(2+2 == 4) ✅

Deterministic: Same input → same output, always

AI testing:

  • Input: “What’s the weather like?”
  • Expected output: … um … 🤔
  • The response changes every time!
  • Many “correct” answers exist!

Non-deterministic: Same input → different outputs!

This is why Gemini’s image bug wasn’t caught:

Testing “generate a cute cat” worked fine. Testing “generate 1943 German soldiers” probably wasn’t in the test suite.

Testing comparison

Source: Medium

Solution: Test for properties (is it safe? is the format correct?) rather than exact outputs!
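A property test never asserts the exact response text; it asserts things every valid answer must share. A minimal sketch, assuming a (hypothetical) system that is supposed to return JSON with an "answer" key:

```python
import json

def check_properties(response: str) -> list[str]:
    """Return a list of failed properties (empty list = all checks pass)."""
    failures = []
    # Format property: the response must be valid JSON with an "answer" key
    try:
        payload = json.loads(response)
    except json.JSONDecodeError:
        failures.append("invalid JSON")
        return failures
    if "answer" not in payload:
        failures.append("missing 'answer' key")
        return failures
    # Safety property: no banned phrases, whatever the exact wording
    banned = ["ignore all previous instructions"]
    if any(b in payload["answer"].lower() for b in banned):
        failures.append("unsafe content")
    # Length property: neither empty nor absurdly long
    if not 1 <= len(payload["answer"]) <= 2000:
        failures.append("bad length")
    return failures

print(check_properties('{"answer": "Paris is in France."}'))  # passes: []
print(check_properties('not json at all'))                    # fails format check
```

The same test passes whether the model answers “Paris is in France” or “France contains Paris”: that is exactly the flexibility non-deterministic outputs require.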

What CAN be tested?

Properties that can be verified:

Test Type | What It Checks | Example
Safety tests | No harmful content | “Does it refuse illegal requests?”
Format tests | Output structure | “Is the JSON valid?”
Consistency tests | Stable core facts | “Is Paris still in France?”
Boundary tests | Edge cases | “What if prompt is 10,000 words?”
Bias tests | Fairness | “Does it treat groups equally?”
Regression tests | Old bugs stay fixed | “Does the DPD fix still work?”

Input validation (first line of defence):

  • Block prompt injections (“Ignore all previous instructions…”)
  • Limit input length
  • Filter prohibited content

Output validation (last line of defence):

What Air Canada should have done:

User asks: "What's the refund policy?"

Chatbot generates: "You can request 
a refund within 90 days..."

OUTPUT VALIDATION:
❌ Check against actual policy database
❌ Flag if inventing new policies
❌ Require human review for novel claims

Instead: Response went directly to user
         → Company lost lawsuit

Both ends matter: Validate inputs AND outputs!

Input validation: First line of defence

Check inputs BEFORE they reach the model:

Things to validate:

  • Length: Too short? Too long?
  • Language: Is it in expected language?
  • Content: Any prohibited content?
  • Format: Does it make sense?
  • Rate: Is this user spamming us?

Example input validation:

Prompt: "Ignore all previous instructions..."

❌ BLOCKED: Prompt injection attempt detected!

Garbage in → garbage out still applies!
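The checks above can be sketched as a single gate that runs before the prompt reaches the model. The injection patterns and length limit below are illustrative only; real filters use far more sophisticated detection:

```python
import re

# Illustrative patterns only; production systems use ML-based detectors
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"you are now",
    r"system prompt",
]

def validate_input(prompt: str, max_chars: int = 4000):
    """Return (ok, reason). Runs BEFORE the prompt reaches the model."""
    if not prompt.strip():
        return False, "empty prompt"
    if len(prompt) > max_chars:
        return False, "prompt too long"
    lowered = prompt.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, lowered):
            return False, "possible prompt injection"
    return True, "ok"

print(validate_input("What's your refund policy?"))           # allowed
print(validate_input("Ignore all previous instructions..."))  # blocked
```

A check like this might have stopped the Chevy “$1 car” prompt-injection attack before the model ever saw it.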

Input validation flowchart

Source: ApX Machine Learning

Output validation: Last line of defence

Check outputs BEFORE sending to users:

Things to validate:

  • Safety: No harmful or offensive content
  • Accuracy: Core facts should be verifiable
  • Consistency: Shouldn’t contradict itself
  • Format: Meets expected structure
  • Length: Not too short, not too long
  • Privacy: No personal data leakage

Tools for output validation:

  • Content moderation APIs (OpenAI, Perspective)
  • Domain-specific rules (e.g., medical disclaimers)
  • LLM-as-a-Judge for quality scoring

Example output filter:

AI response: "To make explosives..."

❌ BLOCKED: Dangerous content detected!

Replaced with: "I can't help with 
that request."
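The output filter above can be sketched as a mirror-image gate that runs after generation, before anything reaches the user. The blocked-topic list is a toy stand-in; real systems call moderation APIs and domain-specific rule engines:

```python
def validate_output(response: str) -> str:
    """Runs AFTER generation, BEFORE the user sees anything.
    Blocked topics below are illustrative, not a real moderation list."""
    blocked_topics = ["to make explosives", "credit card number"]
    if any(topic in response.lower() for topic in blocked_topics):
        return "I can't help with that request."
    if not response.strip():
        # Empty output is a pipeline glitch: fail gracefully
        return "Sorry, something went wrong. Please try again."
    return response

print(validate_output("To make explosives..."))        # replaced with a refusal
print(validate_output("Your parcel arrives at 3pm."))  # passes through unchanged
```

This is the layer that failed in the DPD incident: the update silently disabled the equivalent of this function, and nothing downstream noticed.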

Output validation

Source: ApX Machine Learning

What Can YOU Do? 🛡️

Being a savvy AI user

You can’t fix AI pipelines, but you can:

  1. Compare answers from different models
    • Same question to ChatGPT + Claude + Gemini
    • If they disagree, investigate further
  2. Check status pages
  3. Recognise the difference
    • Pipeline problem: AI is down, slow, or glitching
    • Hallucination: AI is confident but wrong
    • Working as intended: AI refuses for safety reasons

Quick diagnostic:

Symptom | Likely Cause
“I’m at capacity” | Infrastructure (wait and retry)
Slow response | High load or network
Different answers to same Q | Non-determinism (normal!)
Confident but wrong | Hallucination (verify!)
Refuses to answer | Safety guardrails (try rephrasing)
Gibberish output | Pipeline failure (refresh)

The variation is normal!

Why does AI give different answers to the same question?

  1. Temperature: Randomness dial (creativity vs. consistency)
  2. Load balancing: Your request might hit a different server
  3. Model versions: A/B testing different versions
  4. Context: Time of day, your chat history
  5. Updates: The model literally changed since yesterday

If AI always gave the exact same response, it would feel robotic. Some variation keeps it natural.
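Temperature (item 1) has a precise meaning: it rescales the model’s raw scores before sampling. A minimal sketch of the standard softmax-with-temperature calculation, with made-up logits for three candidate tokens:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw model scores (logits) into sampling probabilities.
    Lower temperature -> sharper, more deterministic distribution."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                      # scores for three candidate tokens
print(softmax_with_temperature(logits, 0.2))  # near-deterministic: top token dominates
print(softmax_with_temperature(logits, 2.0))  # flatter: more variety when sampling
```

Run twice at low temperature and you will almost always get the same token; at high temperature, the second and third candidates get sampled often, which is exactly why the same prompt yields different answers.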

But for important decisions:

Ask multiple times. Ask different AIs. If answers diverge on something critical, don’t trust any of them!
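The “ask different AIs” habit can be automated as a simple agreement check: accept an answer only when a clear majority of models converge on it. A sketch with hypothetical model outputs (the model names and replies are made up):

```python
from collections import Counter

def agreement_check(answers):
    """answers: dict of model name -> answer string (hypothetical outputs).
    Returns the majority answer if a strict majority agrees, else None."""
    normalised = Counter(a.strip().lower() for a in answers.values())
    answer, count = normalised.most_common(1)[0]
    if count > len(answers) / 2:
        return answer
    return None  # no majority: treat the question as unresolved

replies = {
    "model_a": "1969",
    "model_b": "1969",
    "model_c": "1971",
}
print(agreement_check(replies))  # majority answer -- but still verify critical facts!
```

Agreement is evidence, not proof: models trained on the same web data can all confidently repeat the same wrong fact, so a majority vote never replaces checking a reliable source.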

Sources: r/Bard and Google Blog

A/B testing means different users get different model versions or parameters

Activity: Diagnose the problem! 🔍

Scenario cards: What’s the likely cause?

Scenario A:

You ask Claude a question. It takes 45 seconds to respond, and the answer is cut off mid-sentence.

Scenario B:

You ask Gemini “Who won the 2027 Super Bowl?” and it confidently names a team.

Scenario C:

You ask ChatGPT to write code. Yesterday it worked; today it refuses and says “I can’t help with that.”

Scenario D:

You ask Midjourney to generate “a photo of my professor.” It creates an image of a random person.

Scenario E:

Three different AI assistants give you three completely different answers about a historical date.

Discuss with a neighbour:

  • Pipeline problem?
  • Hallucination?
  • Working as intended?
  • How would you verify?

⏱️ 5 minutes to diagnose!

Activity answers

Scenario A: 45 seconds, cut off mid-sentence

  • Pipeline problem: Server overload or timeout
  • Check status page
  • Retry in a few minutes

Scenario B: 2027 Super Bowl answer

  • Hallucination: Event hasn’t happened yet!
  • AI doesn’t know it doesn’t know
  • Always verify time-sensitive claims

Scenario C: Code refused today but worked yesterday

  • Model update changed safety guardrails
  • Or: Your specific prompt triggered a new filter
  • Try rephrasing; if still blocked, guardrails changed

Scenario D: “Photo of my professor” = random person

  • Working as intended: AI doesn’t know your professor!
  • It generates a plausible “professor-looking” person
  • This is expected behaviour, not a failure

Scenario E: Different AIs, different historical dates

  • Potential hallucination from all of them
  • Don’t trust any AI for verifiable facts
  • Look it up in a reliable source!

Summary 📚

Main takeaways

  • You use dozens of AI pipelines daily: text, image, voice, recommendations

  • Pipelines are the multi-step process from your input to AI’s output. Every step can fail

  • Real failures: Air Canada lawsuit, DPD swearing, Gemini images. Pipeline problems, not “rogue AI”

  • Data drift: The world changes, AI stays frozen

  • Monitoring: Companies track metrics you never see. When monitoring fails, you read about it in the news

  • You should compare AIs, check status pages, distinguish pipeline problems from hallucinations

Your AI user toolkit

Quick reference:

Symptom | Action
AI is down/slow | Check status page, wait
Different answers | Normal! Verify if critical
Confident but wrong | Hallucination, verify facts
Refuses request | Guardrails, try rephrasing
Weird/gibberish | Pipeline glitch, refresh

Status pages to bookmark:

On AI failures:

…and that’s all for today! 🎉