DATASCI 185: Introduction to AI Applications

Lecture 13: Building Reliable Pipelines, Monitoring, and Testing

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 🔧

Recap of last class

  • Last time, we explored RAG (Retrieval-Augmented Generation)
  • LLMs have knowledge cutoffs and hallucinate
  • Ground AI in your own documents
  • Chunk → Embed → Retrieve → Generate
  • RAG can substantially reduce hallucinations (figures of 30–50% are commonly cited)
  • But RAG isn’t perfect: retrieval errors still occur
  • Today: What happens behind the scenes when you use AI?
  • You’ve all experienced AI going down, being slow, or giving weird answers
  • Why does this happen? And what can you do about it?

Source: Astera Software

Lecture overview

Today’s agenda

Part 1: What is a Pipeline?

  • From prompt to response
  • Why “it works on my laptop” isn’t enough
  • Real-world AI failures and lessons

Part 2: Why Pipelines Break

  • Data drift: when the world changes
  • Model degradation over time
  • Infrastructure and scaling issues

Part 3: Monitoring Basics

  • What to watch: the key metrics
  • Alerting: the “canary in the coal mine”
  • Dashboards and observability

Part 4: Evaluating AI Tools as a User

  • Signs of a reliable vs. unreliable AI
  • Your personal reliability checklist
  • Hands-on: Rate real AI tools!

Meme of the day 😄

That’s actually a very good point!

DataFest 2026

DataFest 2026 is coming!

What is a Pipeline? 🏭

From prompt to response

When you ask an LLM a question, here’s what happens:

  1. Your prompt goes from your browser to servers
  2. Load balancers route it to an available GPU (cloud infrastructure)
  3. Preprocessing cleans and formats your text (with LLMs, this is usually minimal)
  4. Tokenisation converts text to numbers (you know how to do that!)
  5. The model processes tokens (billions of operations!)
  6. Postprocessing formats the output
  7. Safety filters check for harmful content (e.g., hate speech, etc.)
  8. Response travels back to your screen

Each step can fail independently!

This entire process is called a pipeline.
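The eight steps above can be sketched as composed functions. This is a minimal illustration, not any provider's real implementation; every name here is made up, and the "model" is a placeholder:

```python
# Illustrative sketch of an AI pipeline: each stage is a separate
# function, and each stage can fail independently.

def preprocess(text: str) -> str:
    # With LLMs this is usually minimal: strip and normalise whitespace
    return " ".join(text.split())

def tokenise(text: str) -> list[str]:
    # Real tokenisers use subword units; whitespace split is a stand-in
    return text.split()

def model(tokens: list[str]) -> str:
    # Placeholder for billions of GPU operations
    return f"Echoing {len(tokens)} tokens back to you."

def safety_filter(output: str) -> str:
    # Crude keyword check standing in for a real content filter
    banned = {"explosives"}
    if any(word in output.lower() for word in banned):
        return "I can't help with that request."
    return output

def pipeline(prompt: str) -> str:
    # Compose the stages; an error at any step breaks the whole chain
    return safety_filter(model(tokenise(preprocess(prompt))))

print(pipeline("  What is   a pipeline?  "))
```

If `tokenise` raises, `model` times out, or `safety_filter` is misconfigured, the user just sees a failure: that is the sense in which each step can fail independently.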

The AI pipeline

A simple question = many complex steps behind the scenes

Failure types: what actually goes wrong?

Not all failures are the same:

Failure type | What it looks like | Root cause
Hallucination | Confident but false content | Model limitations, missing grounding
Policy failure | Unsafe or inappropriate output | Weak guardrails or bad updates
Data drift | Outdated or weird answers | World changes faster than training
Pipeline failure | Slow, down, or inconsistent responses | Infrastructure, scaling, broken steps

You can spot what type of failure happened, then choose the right response

When AI Breaks 💥

Real AI failures (recent headlines)

Incident | What Happened | Impact
Air Canada chatbot (2024) | Invented a refund policy | Company lost lawsuit
DPD chatbot (2024) | Swore at customers, criticised its own company | Emergency shutdown
Google Gemini images (2024) | Historically inaccurate diverse images | Feature paused, CEO apologised
Chevy chatbot (2023) | Agreed to sell car for $1 | Prompt injection attack
Snapchat My AI (2023) | Privacy concerns, couldn’t be removed | User backlash

Common causes:

  1. Hallucination treated as truth (Air Canada)
  2. Updates broke guardrails (DPD)
  3. No input validation (Chevy)
  4. Overcorrection for bias (Gemini)
  5. Rush to market (Snapchat)

Source: https://x.com/ChrisJBakke/status/1736533308849443121. The whole thread is hilarious 😂

Case study: DPD’s sweary chatbot

What happened (January 2024):

  • DPD (UK delivery company) updated their AI chatbot
  • After the update, customer Ashley Beauchamp discovered it would:
    • Swear when asked nicely
    • Write poems criticising DPD
    • Call itself “useless” and recommend competitors
    • Say DPD was “the worst delivery firm in the world”

DPD’s response:

Immediately disabled the AI chatbot, apologised, launched investigation.

What went wrong:

A system update broke the guardrails. The content filtering that prevented inappropriate responses stopped working, but no one noticed until customers posted screenshots on social media.

Source: BBC News

Pipeline failure point:

No monitoring after deployment. The update was pushed without testing, and no automated checks caught the broken guardrails.

Case study: Google Gemini’s image crisis

What happened (February 2024):

  • Users asked Gemini for images of “1943 German soldiers”
  • Gemini generated images of Black men and Asian women in Nazi uniforms
  • Similar problems with “US Founding Fathers” and other historical figures
  • Google CEO called results “completely unacceptable”
  • Image generation feature paused for weeks

Why did this happen? Can you guess?

Source: BBC News

Pipeline failure point:

Testing didn’t catch edge cases

Why Pipelines Break 🔥

Data drift: The world changes

Data drift = When real-world data differs from training data

Example: A sentiment analyser trained in 2020:

  • “This product is sick!” → Probably means: Amazing ✅
  • “This slaps!” → Might not understand 🤔
  • “No cap, this is bussin’” → Completely lost! 😵

Types of drift:

Type | Description | Example
Label drift | What’s “correct” changes | Policy updates
Concept drift | Meaning of things changes | New slang, trends
Data drift | Input distribution changes | New demographics

Data drift visualisation

Source: Spot Intelligence

The world keeps changing, but your model stays frozen!

Why does data drift happen?

Models assume \(P_{\text{train}}(X, Y) = P_{\text{prod}}(X, Y)\)

But in reality, this joint distribution shifts over time:

  1. Data drift: \(P(X)\) changes
    • Input distribution differs from training
    • Example: New user demographics, device types
    • Model sees inputs it never learned from
  2. Label drift: \(P(Y)\) changes
    • Class frequencies change over time
    • Example: Fraud rate increases during holidays
    • Model’s decision boundaries become suboptimal
  3. Concept drift: \(P(Y|X)\) changes
    • Same input → different correct output
    • Example: “Sick” now means “great” in slang
    • The most dangerous type: model is confident but wrong

Technical causes:

Cause | Effect
Non-stationarity | Real-world processes aren’t static
Sample selection bias | Training data ≠ production population
Feedback loops | Model outputs influence future inputs
External shocks | COVID, policy changes, viral trends

Why it’s inevitable:

Machine learning assumes i.i.d. data (independent and identically distributed). But real-world data is:

  • Temporally correlated (today depends on yesterday)
  • Non-stationary (distributions shift)
  • Adversarial (users game the system)
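A common way teams detect drift in \(P(X)\) is the Population Stability Index (PSI), which compares binned histograms of a feature at training time vs. in production. A minimal pure-Python sketch (the bin count and the 0.25 alert threshold are illustrative rules of thumb, not fixed standards):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of a numeric feature.
    Rule of thumb (illustrative): PSI > 0.25 suggests major drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def frac(sample, i):
        # Fraction of the sample falling into bin i, floored to avoid log(0)
        count = sum(1 for x in sample
                    if lo + i * width <= x < lo + (i + 1) * width)
        return max(count / len(sample), 1e-6)

    return sum(
        (frac(actual, i) - frac(expected, i))
        * math.log(frac(actual, i) / frac(expected, i))
        for i in range(bins)
    )

train = [0.1 * i for i in range(100)]             # training distribution
prod_same = [0.1 * i for i in range(100)]         # no drift
prod_shifted = [0.1 * i + 5 for i in range(100)]  # inputs have shifted

print(psi(train, prod_same))     # near 0: stable
print(psi(train, prod_shifted))  # large: drift alert
```

A monitoring job might compute this daily per feature and page someone when the index crosses the threshold.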

Model degradation: Models get stale

Even without drift, models degrade over time:

  • User expectations evolve
  • Competitors improve their products
  • Edge cases accumulate
  • Small errors compound

Signs your AI tool is degrading (as a user):

  • ❌ Answers that used to work now don’t
  • ❌ More “I don’t know” or refusals
  • ❌ Inconsistent responses to same prompt
  • ❌ Friends getting better results elsewhere
  • ❌ You’re finding workarounds more often

The “boiling frog” problem:

Changes happen so gradually you don’t notice until it’s too late!

Model degradation

Source: James Howard

Performance declines gradually, even without drift!

Why does model degradation happen?

Model performance \(P(t)\) decays over time, even with stable data:

  1. Entropy accumulation
    • Small prediction errors compound over time
    • Error rate \(\epsilon\) grows: \(\epsilon(t) \approx \epsilon_0 \cdot e^{\lambda t}\)
    • Edge cases create cascading failures
  2. Distribution shift in deployment
    • Model outputs influence user behaviour
    • Users adapt prompts based on past responses
    • Creates feedback loops that amplify biases
  3. Catastrophic forgetting (after updates)
    • Fine-tuning on new data overwrites old knowledge
    • Safety training can reduce capability
    • Model “forgets” how to handle rare cases

Technical mechanisms:

Mechanism | Effect
Weight decay | Regularisation erodes rare patterns
Quantisation | Compression loses precision
API updates | Provider changes break workflows
Context pollution | Long conversations degrade quality

Models optimise for average-case performance. But users remember the worst failures.

User trust is proportional to min(performance)

Not the mean!
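The min-vs-mean point is easy to see with numbers. The scores below are hypothetical quality ratings for ten interactions with the same tool:

```python
# Hypothetical quality scores for 10 interactions with one AI tool.
# Nine are excellent; one is a catastrophic failure.
scores = [0.95, 0.97, 0.96, 0.94, 0.98, 0.95, 0.96, 0.97, 0.10, 0.96]

mean_score = sum(scores) / len(scores)
worst_score = min(scores)

print(f"mean: {mean_score:.2f}")   # looks fine on a dashboard
print(f"min:  {worst_score:.2f}")  # the failure users actually remember
```

The dashboard average still looks healthy, but the single 0.10 interaction is the one the user tells their friends about.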

Infrastructure issues: When computers fail

AI systems run on physical computers that can fail:

  • Hardware failures: GPUs overheat, disks die
  • Network issues: Connections timeout, packets lost
  • Scaling problems: Too many users at once
  • Resource exhaustion: Running out of memory
  • Dependency failures: External APIs go down

What you see as a user:

Behind the scenes | What you experience
GPU memory full | Slow or no response
Network timeout | “Error, try again”
Rate limits hit | Degraded access
Database down | RAG retrieval fails!
API key issues | Feature stops working
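Because many infrastructure failures (timeouts, rate limits, overloaded GPU pools) are transient, clients typically retry with exponential backoff rather than hammering an already struggling server. A minimal sketch, with all names and delays illustrative:

```python
import random
import time

def call_with_retry(request_fn, max_attempts=4, base_delay=0.1):
    """Retry a flaky call with exponential backoff plus jitter.
    Doubling the delay each attempt gives an overloaded server room to recover."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)

# Simulated flaky backend: fails twice, then succeeds
calls = {"n": 0}
def flaky_api():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("GPU pool busy")
    return "response"

print(call_with_retry(flaky_api, base_delay=0.01))  # succeeds on the 3rd try
```

The random jitter matters: without it, thousands of clients retrying in lockstep would all hit the server at the same instant again.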

Infrastructure diagram

Source: MongoDB

Many components = many potential failure points

Monitoring Basics 📊

What companies watch: Key metrics

Essential metrics for AI systems:

Category | Metric | Why It Matters
Performance | Latency | How fast are responses?
Performance | Throughput | How many requests per second?
Performance | Error rate | What % fail completely?
Quality | Accuracy/relevance | Are answers correct?
Quality | Hallucination rate | How often does it make things up?
Quality | User satisfaction | Thumbs up/down?
Resources | Token usage | How much does each request cost?
Resources | GPU memory | Are we close to capacity?
Business | User engagement | Are users coming back?

As a user: You experience these as speed, accuracy, and availability!
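These metrics are computed from request logs. A minimal sketch of how a dashboard might derive p95 latency, error rate, and throughput from one window of logs (the 60-second window and the alert thresholds in the comment are assumptions for illustration):

```python
def summarise(requests):
    """requests: list of (latency_seconds, ok_bool) tuples from one window.
    Returns the headline metrics a simple dashboard might show."""
    latencies = sorted(latency for latency, _ in requests)
    # p95 latency: the value 95% of requests are faster than
    p95_index = min(int(0.95 * len(latencies)), len(latencies) - 1)
    return {
        "p95_latency_s": latencies[p95_index],
        "error_rate": sum(1 for _, ok in requests if not ok) / len(requests),
        "throughput_rps": len(requests) / 60,  # assuming a 60-second window
    }

# Simulated window: 95 fast successes, 3 slow successes, 2 failures
log = [(0.4, True)] * 95 + [(3.0, True)] * 3 + [(5.0, False)] * 2
stats = summarise(log)
print(stats)
# An alert might fire if, say, error_rate > 0.01 or p95 latency > 2 s
```

Percentiles (p95, p99) matter more than averages here for the same reason user trust tracks the minimum: the slowest requests are the ones people notice.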

Monitoring dashboard

Source: Oracle

A dashboard shows real-time health of an AI system

Status pages: Checking AI reliability

Every major AI service has a status page:

Service | Status Page
OpenAI (ChatGPT) | status.openai.com
Anthropic (Claude) | status.anthropic.com
Google AI | status.cloud.google.com
Midjourney | Check their Discord

What you can learn:

  • Current outages and degraded performance
  • Historical incidents (how often things break)
  • Maintenance windows
  • Incident post-mortems (what went wrong)

Tip: Check the status page before blaming your prompt!

OpenAI’s status page shows real incidents https://status.openai.com

Fun fact: OpenAI’s status page shows frequent incidents: error rates, regional outages, feature degradation. AI reliability is super hard!

Testing AI: Why it’s hard

Traditional software testing:

  • Input: 2 + 2
  • Expected output: 4
  • Test: Assert(2+2 == 4) ✅

Deterministic: Same input → same output, always

AI testing:

  • Input: “What’s the weather like?”
  • Expected output: … um … 🤔
  • The response changes every time!
  • Many “correct” answers exist!

Non-deterministic: Same input → different outputs!

This is why Gemini’s image bug wasn’t caught:

Testing “generate a cute cat” worked fine. Testing “generate 1943 German soldiers” probably wasn’t in the test suite.

Testing comparison

Source: Medium

Solution: Test for properties (is it safe? is the format correct?) rather than exact outputs!
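A property test never asserts the exact response text; it asserts things every valid answer must share. A minimal sketch, assuming a (hypothetical) system that is supposed to return JSON with an "answer" key:

```python
import json

def check_properties(response: str) -> list[str]:
    """Return a list of failed properties (empty list = all checks pass)."""
    failures = []
    # Format property: the response must be valid JSON with an "answer" key
    try:
        payload = json.loads(response)
    except json.JSONDecodeError:
        failures.append("invalid JSON")
        return failures
    if "answer" not in payload:
        failures.append("missing 'answer' key")
        return failures
    # Safety property: no banned phrases, whatever the exact wording
    banned = ["ignore all previous instructions"]
    if any(b in payload["answer"].lower() for b in banned):
        failures.append("unsafe content")
    # Length property: neither empty nor absurdly long
    if not 1 <= len(payload["answer"]) <= 2000:
        failures.append("bad length")
    return failures

print(check_properties('{"answer": "Paris is in France."}'))  # passes: []
print(check_properties('not json at all'))                    # fails format check
```

The same test passes whether the model answers “Paris is in France” or “France contains Paris”: that is exactly the flexibility non-deterministic outputs require.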

What CAN be tested?

Properties that can be verified:

Test Type | What It Checks | Example
Safety tests | No harmful content | “Does it refuse illegal requests?”
Format tests | Output structure | “Is the JSON valid?”
Consistency tests | Stable core facts | “Is Paris still in France?”
Boundary tests | Edge cases | “What if prompt is 10,000 words?”
Bias tests | Fairness | “Does it treat groups equally?”
Regression tests | Old bugs stay fixed | “Does the DPD fix still work?”

Input validation (first line of defence):

  • Block prompt injections (“Ignore all previous instructions…”)
  • Limit input length
  • Filter prohibited content

Output validation (last line of defence):

What Air Canada should have done:

User asks: "What's the refund policy?"

Chatbot generates: "You can request 
a refund within 90 days..."

OUTPUT VALIDATION:
❌ Check against actual policy database
❌ Flag if inventing new policies
❌ Require human review for novel claims

Instead: Response went directly to user
         → Company lost lawsuit

Both ends matter: Validate inputs AND outputs!

Input validation: First line of defence

Check inputs BEFORE they reach the model:

Things to validate:

  • Length: Too short? Too long?
  • Language: Is it in expected language?
  • Content: Any prohibited content?
  • Format: Does it make sense?
  • Rate: Is this user spamming us?

Example input validation:

Prompt: "Ignore all previous instructions..."

❌ BLOCKED: Prompt injection attempt detected!

Garbage in → garbage out still applies!
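The checks above can be sketched as a single gate that runs before the prompt reaches the model. The injection patterns and length limit below are illustrative only; real filters use far more sophisticated detection:

```python
import re

# Illustrative patterns only; production systems use ML-based detectors
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"you are now",
    r"system prompt",
]

def validate_input(prompt: str, max_chars: int = 4000):
    """Return (ok, reason). Runs BEFORE the prompt reaches the model."""
    if not prompt.strip():
        return False, "empty prompt"
    if len(prompt) > max_chars:
        return False, "prompt too long"
    lowered = prompt.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, lowered):
            return False, "possible prompt injection"
    return True, "ok"

print(validate_input("What's your refund policy?"))           # allowed
print(validate_input("Ignore all previous instructions..."))  # blocked
```

A check like this might have stopped the Chevy “$1 car” prompt-injection attack before the model ever saw it.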

Input validation flowchart

Source: ApX Machine Learning

Output validation: Last line of defence

Check outputs BEFORE sending to users:

Things to validate:

  • Safety: No harmful or offensive content
  • Accuracy: Core facts should be verifiable
  • Consistency: Shouldn’t contradict itself
  • Format: Meets expected structure
  • Length: Not too short, not too long
  • Privacy: No personal data leakage

Tools for output validation:

  • Content moderation APIs (OpenAI, Perspective)
  • Domain-specific rules (e.g., medical disclaimers)
  • LLM-as-a-Judge for quality scoring

Example output filter:

AI response: "To make explosives..."

❌ BLOCKED: Dangerous content detected!

Replaced with: "I can't help with 
that request."
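The output filter above can be sketched as a mirror-image gate that runs after generation, before anything reaches the user. The blocked-topic list is a toy stand-in; real systems call moderation APIs and domain-specific rule engines:

```python
def validate_output(response: str) -> str:
    """Runs AFTER generation, BEFORE the user sees anything.
    Blocked topics below are illustrative, not a real moderation list."""
    blocked_topics = ["to make explosives", "credit card number"]
    if any(topic in response.lower() for topic in blocked_topics):
        return "I can't help with that request."
    if not response.strip():
        # Empty output is a pipeline glitch: fail gracefully
        return "Sorry, something went wrong. Please try again."
    return response

print(validate_output("To make explosives..."))        # replaced with a refusal
print(validate_output("Your parcel arrives at 3pm."))  # passes through unchanged
```

This is the layer that failed in the DPD incident: the update silently disabled the equivalent of this function, and nothing downstream noticed.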

Output validation

Source: ApX Machine Learning

What Can YOU Do? 🛡️

Being a savvy AI user

You can’t fix AI pipelines, but you can:

  1. Compare answers from different models
    • Same question to ChatGPT + Claude + Gemini
    • If they disagree, investigate further
  2. Check status pages
  3. Recognise the difference
    • Pipeline problem: AI is down, slow, or glitching
    • Hallucination: AI is confident but wrong
    • Working as intended: AI refuses for safety reasons

Quick diagnostic:

Symptom | Likely Cause
“I’m at capacity” | Infrastructure (wait and retry)
Slow response | High load or network
Different answers to same Q | Non-determinism (normal!)
Confident but wrong | Hallucination (verify!)
Refuses to answer | Safety guardrails (try rephrasing)
Gibberish output | Pipeline failure (refresh)

The variation is normal!

Why does AI give different answers to the same question?

  1. Temperature: Randomness dial (creativity vs. consistency)
  2. Load balancing: Your request might hit a different server
  3. Model versions: A/B testing different versions
  4. Context: Time of day, your chat history
  5. Updates: The model literally changed since yesterday

If AI always gave the exact same response, it would feel robotic. Some variation keeps it natural.
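Temperature (item 1) has a precise meaning: it rescales the model’s raw scores before sampling. A minimal sketch of the standard softmax-with-temperature calculation, with made-up logits for three candidate tokens:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw model scores (logits) into sampling probabilities.
    Lower temperature -> sharper, more deterministic distribution."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                      # scores for three candidate tokens
print(softmax_with_temperature(logits, 0.2))  # near-deterministic: top token dominates
print(softmax_with_temperature(logits, 2.0))  # flatter: more variety when sampling
```

Run twice at low temperature and you will almost always get the same token; at high temperature, the second and third candidates get sampled often, which is exactly why the same prompt yields different answers.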

But for important decisions:

Ask multiple times. Ask different AIs. If answers diverge on something critical, don’t trust any of them!
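The “ask different AIs” habit can be automated as a simple agreement check: accept an answer only when a clear majority of models converge on it. A sketch with hypothetical model outputs (the model names and replies are made up):

```python
from collections import Counter

def agreement_check(answers):
    """answers: dict of model name -> answer string (hypothetical outputs).
    Returns the majority answer if a strict majority agrees, else None."""
    normalised = Counter(a.strip().lower() for a in answers.values())
    answer, count = normalised.most_common(1)[0]
    if count > len(answers) / 2:
        return answer
    return None  # no majority: treat the question as unresolved

replies = {
    "model_a": "1969",
    "model_b": "1969",
    "model_c": "1971",
}
print(agreement_check(replies))  # majority answer -- but still verify critical facts!
```

Agreement is evidence, not proof: models trained on the same web data can all confidently repeat the same wrong fact, so a majority vote never replaces checking a reliable source.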

Sources: r/Bard and Google Blog

A/B testing means different users get different model versions or parameters

Activity: Diagnose the problem! 🔍

Scenario cards: What’s the likely cause?

Scenario A:

You ask Claude a question. It takes 45 seconds to respond, and the answer is cut off mid-sentence.

Scenario B:

You ask Gemini “Who won the 2027 Super Bowl?” and it confidently names a team.

Scenario C:

You ask ChatGPT to write code. Yesterday it worked; today it refuses and says “I can’t help with that.”

Scenario D:

You ask Midjourney to generate “a photo of my professor.” It creates an image of a random person.

Scenario E:

Three different AI assistants give you three completely different answers about a historical date.

Discuss with a neighbour:

  • Pipeline problem?
  • Hallucination?
  • Working as intended?
  • How would you verify?

⏱️ 5 minutes to diagnose!

Activity answers

Scenario A: 45 seconds, cut off mid-sentence

  • Pipeline problem: Server overload or timeout
  • Check status page
  • Retry in a few minutes

Scenario B: 2027 Super Bowl answer

  • Hallucination: Event hasn’t happened yet!
  • AI doesn’t know it doesn’t know
  • Always verify time-sensitive claims

Scenario C: Code refused today but worked yesterday

  • Model update changed safety guardrails
  • Or: Your specific prompt triggered a new filter
  • Try rephrasing; if still blocked, guardrails changed

Scenario D: “Photo of my professor” = random person

  • Working as intended: AI doesn’t know your professor!
  • It generates a plausible “professor-looking” person
  • This is expected behaviour, not a failure

Scenario E: Different AIs, different historical dates

  • Potential hallucination from all of them
  • Don’t trust any AI for verifiable facts
  • Look it up in a reliable source!

Summary 📚

Main takeaways

  • You use dozens of AI pipelines daily: text, image, voice, recommendations

  • Pipelines are the multi-step process from your input to AI’s output. Every step can fail

  • Real failures: Air Canada lawsuit, DPD swearing, Gemini images. Pipeline problems, not “rogue AI”

  • Data drift: The world changes, AI stays frozen

  • Monitoring: Companies track metrics you never see. When monitoring fails, you read about it in the news

  • You should compare AIs, check status pages, distinguish pipeline problems from hallucinations

Your AI user toolkit

Quick reference:

Symptom | Action
AI is down/slow | Check status page, wait
Different answers | Normal! Verify if critical
Confident but wrong | Hallucination, verify facts
Refuses request | Guardrails, try rephrasing
Weird/gibberish | Pipeline glitch, refresh

Status pages to bookmark:

On AI failures:

…and that’s all for today! 🎉