DATASCI 185: Introduction to AI Applications

Lecture 16: AI in Finance and Healthcare - Opportunities and Biases

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 😉

Recap of last class

  • Last time we covered types of bias in AI systems
  • Bias can enter at every stage: data collection, labelling, model design, deployment
  • Historical bias: past discrimination baked into training data
  • Representation bias: some groups underrepresented
  • Measurement bias: proxies that correlate with protected attributes
  • Fairness has multiple definitions that can conflict
  • Today: how AI is used in healthcare and finance, and where bias shows up

Source: NIST

Lecture overview

What we will cover today

Part 1: AI in healthcare

  • Traditional ML vs LLMs in medicine
  • Real deployments and RAG in clinical settings

Part 2: AI in finance

  • ML for trading, credit scoring, and fraud
  • LLMs, RAG, and agentic workflows

Part 3: When bias creeps in

  • The Optum algorithm and credit scoring
  • RAG-specific risks in both fields
  • Patterns that repeat across domains

Part 4: What can we learn?

  • Questions to ask about any AI system
  • Group discussion

Meme of the day

Source: Medium

AI in healthcare 🏥

Traditional ML in healthcare

ML has been used in medicine for years, mostly for tasks with labelled data and a clear outcome:

  • Medical imaging: CNNs detect tumours in X-rays, mammograms, retinal scans. Some match or beat radiologists
  • Risk scoring: ML models predict readmission, sepsis, mortality. Hospitals use these to allocate resources
  • Drug discovery: ML screens millions of molecules against protein targets, saving years in early development
  • ECG analysis: Deep learning flags arrhythmias from wearables in real time

All well-defined tasks: structured input in, classification or number out.

Traditional ML in healthcare works best for classification or regression on structured data.

What LLMs add to healthcare

LLMs handle tasks that involve reading, writing, and reasoning about text:

  • Clinical note summarisation: Pull out key medications, diagnoses, and follow-up instructions in seconds
  • Patient-facing chatbots: Explain diagnoses in plain language, answer follow-up questions, triage symptoms
  • Literature synthesis: Summarise findings across dozens of papers on a condition
  • Clinical trial matching: Match unstructured patient records against trial eligibility criteria (previously done by hand)
  • My suggestion: Google DeepMind’s MedGemma, a free, open-weights LLM optimised for medical text and image comprehension

All require understanding language, not crunching numbers.

LLMs reading clinical notes

Over 150 US health systems use AI-drafted messages via Epic’s MyChart (GPT-4):

  • Patient sends a question about medication side effects
  • System reads the patient’s record and drafts a response
  • Clinician reviews and sends. Saves ~30 seconds per message

Google’s Med-PaLM 2 scored 86.5% on US Medical Licensing Exam-style questions. Physicians preferred its long-form answers to physician-written answers on most evaluation axes.

But passing exams ≠ treating patients. The model cannot examine the patient, read body language, or pick up on context outside the chart.

Source: Google Research

LLMs for patient communication

LLMs are already used to talk to patients:

  • Multilingual support: Translate discharge instructions into a patient’s language without a dedicated translator
  • After-visit summaries: Plain-language version of what happened and what to do next
  • Symptom triage: Chatbots recommend whether to see a doctor, go to A&E, or stay home

John Muir Health: clinicians using AI-assisted charting saved 34 min/day on documentation. Physician turnover dropped 44%.

Liability remains unresolved. If an AI chatbot gives bad advice and a patient is harmed, who is responsible?

The technology is deployed. The governance frameworks are still catching up.

Source: NewsMedical.net

Discussion: would you trust an AI doctor?

Think about this:

You visit your GP. Instead of a doctor, an AI chatbot takes your symptoms, checks your medical history, and recommends a diagnosis and treatment plan. A doctor reviews the AI’s recommendation for 30 seconds before signing off.

Discuss with your neighbour (2 minutes):

  1. Would you be comfortable with this?
  2. Does it matter if the AI is more accurate on average than the doctor alone?
  3. What if you could see the AI’s reasoning?

Things to consider:

  • Accuracy on average ≠ accuracy for your specific case
  • Trust depends on transparency and the ability to question
  • Some patients may prefer a human for emotional and cultural reasons

AI in finance 📈

Traditional ML in finance

Finance was one of the earliest adopters of ML. Tasks are well-defined and data is plentiful:

  • Price forecasting: Neural networks on time-series data. Results are mixed; markets are noisy
  • Credit scoring: Logistic regression estimates default probability. The backbone of consumer lending
  • Fraud detection: Deep learning flags suspicious transactions in real time. Outperforms rule-based systems (McKinsey, 2023)
  • Algorithmic trading: RL agents learn strategies by trial and error in simulated markets

All involve numerical data, clear targets, and plenty of training examples.
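The credit-scoring workhorse above can be sketched in a few lines. This is a toy sketch: the weights, intercept, and applicant features below are invented for illustration, whereas real lenders fit these coefficients on large credit files.

```python
import math

# Toy logistic-regression credit scorer. The weights and intercept are
# invented for illustration, not taken from any real lending model.
WEIGHTS = {"utilisation": 2.1, "late_payments": 0.9, "years_of_history": -0.15}
INTERCEPT = -2.0

def default_probability(applicant: dict) -> float:
    """P(default) = sigmoid(w . x + b), the classic credit-scoring model."""
    z = INTERCEPT + sum(WEIGHTS[k] * applicant[k] for k in WEIGHTS)
    return 1 / (1 + math.exp(-z))

applicant = {"utilisation": 0.8, "late_payments": 2, "years_of_history": 10}
p = default_probability(applicant)
print(f"Estimated P(default) = {p:.2f}")
```

The model outputs a probability between 0 and 1; the lender then applies a cut-off to decide whether to approve the loan, which is where the fairness questions later in this lecture enter.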

Jim Simons, an extremely successful quant trader who started in 1988!

Financial ML relies on structured, numerical data: price histories, transaction records, credit files.

What LLMs add to finance

LLMs handle tasks that require reading and interpreting text. Much of the information that moves markets is in natural language:

  • Sentiment analysis: Classify financial news and earnings calls as positive, negative, or neutral
  • Document summarisation: Condense 10-K filings and earnings reports into short summaries
  • Numerical reasoning: Models like FinQA read financial tables and do multi-step calculations
  • Advisory chatbots: Answer portfolio questions, explain products, guide users through tax season

See the CFA Institute practical guide (2024) for more detail.

Source: CFA Institute

LLMs vs traditional ML: when to use which

Aspect              | Traditional ML/DL                              | LLMs
Best data type      | Numerical, time-series, tabular                | Unstructured text, documents
Typical tasks       | Classification, regression, anomaly detection  | Summarisation, Q&A, sentiment, generation
Training data       | Needs labelled examples for each task          | Can work zero-shot or few-shot
Multi-task          | Separate model per task                        | One model, different prompts
Numerical precision | Strong                                         | Weaker; struggles with exact calculations
Cost                | Lower inference cost                           | Higher; large models are expensive to run

Rule of thumb from Li et al. (2023): if your task has a clear definition (e.g. regression, classification), plenty of labelled data, and does not need common-sense reasoning, a traditional model is probably cheaper and just as good. LLMs pay off when you are dealing with unstructured text, ambiguity, or multi-step reasoning.

Finance-specific LLMs

Several LLMs have been fine-tuned for finance:

  • BloombergGPT: 363B tokens of financial data + 345B general. Beats general models on financial benchmarks
  • FinGPT: Open-source, fine-tunes LLaMA via LoRA. Under $300 per run
  • FinMA (PIXIU): 136K finance-specific instructions. Beats general LLMs on sentiment

Benchmarks (CFA Institute; Li et al., 2023):

  • Finance-tuned LLMs beat general models on sentiment and classification
  • GPT-4 still wins on numerical reasoning (more maths in pre-training)
  • For stock prediction, no model is reliable. ARIMA/LSTM still more practical

Domain-specific LLMs trade general ability for better financial language understanding. More about BloombergGPT here

Example: sentiment analysis for trading

The Financial PhraseBank dataset has annotated financial news sentences:

Prompt: “Analyse the sentiment of this financial news statement: negative, positive, or neutral.”

Text: “We have analysed Kaupthing Bank Sweden and found a business which fits well into Alandsbanken.”

Answer: Positive

A trading pipeline:

  1. Collect headlines and earnings call transcripts
  2. Score sentiment with a fine-tuned FinLLM
  3. Aggregate into a market signal
  4. Combine with quantitative data for a trading decision
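Steps 2 and 3 of the pipeline can be sketched in miniature. The headlines and their labels below are invented; in practice the labels would come from a fine-tuned FinLLM rather than being hand-assigned.

```python
# Toy version of steps 2-3: per-headline sentiment aggregated into a signal.
# Headlines and labels are invented; a fine-tuned FinLLM would supply the
# labels in a real pipeline.
scored_headlines = [
    ("Acme Corp beats earnings expectations", "positive"),
    ("Regulator opens probe into Acme accounting", "negative"),
    ("Acme announces routine board meeting", "neutral"),
]

SENTIMENT_VALUE = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}

def aggregate_signal(scored):
    """Mean sentiment in [-1, 1]: a crude stand-in for step 3, aggregation."""
    return sum(SENTIMENT_VALUE[label] for _, label in scored) / len(scored)

signal = aggregate_signal(scored_headlines)
print(f"Market signal: {signal:+.2f}")
```

Real systems weight headlines by source reliability, recency, and firm size before combining the signal with quantitative data in step 4.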

Agentic LLMs in finance

LLMs can orchestrate multi-step workflows (Li et al., 2023):

User asks an LLM agent to optimise a portfolio of equity and bond ETFs.

The LLM breaks this into steps:

  1. Look up relevant ETFs in a database
  2. Pull historical prices via API
  3. Write Python code for Sharpe ratio optimisation
  4. Run the code and interpret results
  5. Present a recommendation

The LLM acts as planner and coordinator, calling databases, APIs, and code interpreters. JPMorgan already uses AI agents for investment advice.

Source: Deloitte

The LLM does not “know” finance the way a quant does. It orchestrates tools that do the computation.
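Step 3 of the agent’s plan, the Sharpe ratio, is simple enough to show directly. The daily returns and risk-free rate below are made-up numbers for illustration.

```python
import statistics

# Step 3 of the agent workflow as plain code. Returns and the risk-free
# rate are made-up numbers, not real market data.
daily_returns = [0.001, -0.002, 0.003, 0.0005, -0.001, 0.002]
RISK_FREE_DAILY = 0.0001  # assumed daily risk-free rate

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualised Sharpe ratio: mean excess return over its standard deviation."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess) * periods_per_year ** 0.5

print(f"Annualised Sharpe: {sharpe_ratio(daily_returns, RISK_FREE_DAILY):.2f}")
```

This is exactly the kind of deterministic computation the agent delegates to a code interpreter: the LLM writes and runs code like this rather than doing the arithmetic itself.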

RAG in healthcare and finance

In healthcare:

  • Clinicians query drug interaction databases and clinical guidelines through natural language
  • Systems retrieve relevant papers and generate evidence-backed answers
  • Medical knowledge bases can be updated without retraining the model

In finance:

  • Compliance officers ask questions about new regulations. The system retrieves the actual rule text and generates grounded answers
  • Analysts query internal research repositories
  • Knowledge bases stay current as regulations change

RAG is one of the most practical near-term LLM applications in both fields because it addresses the hallucination problem directly.

Source: StackAI

RAG grounds LLM answers in actual documents. The model cites sources, not memory.
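The retrieval step can be sketched with simple word overlap standing in for the embedding similarity a production system would use. The documents below are invented snippets, not real clinical guidelines.

```python
# Minimal sketch of RAG retrieval. Word overlap stands in for embedding
# similarity, and the documents are invented snippets, not real guidelines.
documents = [
    "Warfarin interacts with aspirin and increases bleeding risk.",
    "Metformin is first-line therapy for type 2 diabetes.",
    "Statins may interact with certain antibiotics.",
]

def tokens(text: str) -> set[str]:
    """Lowercase words with trailing punctuation stripped."""
    return {w.strip(".,?").lower() for w in text.split()}

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    q = tokens(query)
    return sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)[:k]

context = retrieve("Does warfarin interact with aspirin?", documents)
# A full RAG pipeline would prepend this context to the prompt so the LLM
# answers from the retrieved documents rather than from memory.
print(context[0])
```

Note how the retrieval quality caps the answer quality: if the warfarin snippet were missing or outdated, the generated answer would be too.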

RAG risks in healthcare and finance

RAG reduces hallucination but does not eliminate risk. Both domains face domain-specific problems:

Retrieval quality:

  • If the knowledge base is incomplete or outdated, the retrieved documents will be wrong. In medicine, outdated guidelines can be dangerous
  • In finance, retrieving superseded regulations may lead to compliance failures

Bias in the knowledge base:

  • Medical literature underrepresents certain populations. RAG retrieves what exists, so gaps in research translate into gaps in answers
  • Financial knowledge bases skew toward English-language and Western-market sources

False confidence:

  • RAG answers look authoritative because they cite documents. Users may trust them more than they should
  • A clinician might skip verification if the answer comes with citations. A compliance officer might not read the full regulation

RAG is better than pure LLM generation, but “grounded” does not mean “correct.” The knowledge base itself can be biased, incomplete, or stale.

Discussion: will AI replace financial analysts?

Quick poll:

Show of hands: will AI replace more than half of entry-level financial analyst jobs within 10 years?

Discuss (2 minutes):

  1. Which analyst tasks are most automatable?
  2. Which are hardest to automate?
  3. How would this change your career preparation?

Consider:

  • Summarising reports: very automatable
  • Client relationships: not automatable
  • Novel situations: partly automatable
  • Responsibility for wrong calls: someone still has to take it

When bias creeps in 🤔

Bias can enter anywhere

Every application we discussed is vulnerable to the same biases from last lecture:

  • Historical bias: training data reflects past discrimination
  • Representation bias: some groups underrepresented
  • Measurement bias: proxies that work differently across groups
  • Feedback loops: predictions reinforce themselves over time

The next slides cover real cases in healthcare and finance.

Bias enters through the data and design choices, not through intent.

Healthcare bias: the Optum case

Obermeyer et al. (2019), published in Science, exposed bias in a widely used Optum algorithm.

What it did: Identified patients needing extra care. Getting flagged = getting help.

The problem: Used healthcare costs as a proxy for needs. But Black patients spend less even when equally sick (access barriers, insurance gaps, income inequality).

Result: At the same risk score, Black patients had 26.3% more chronic conditions. Unbiased referral would have raised the Black patient share from 17.7% to 46.5%.

Race was not an input. But costs carried racial information because healthcare access itself was unequal.

Healthcare bias: the fix

What happened next:

  • Researchers contacted Optum before publication
  • Switched from predicting costs to predicting health outcomes
  • Racial bias reduced by ~84%

Lessons:

  • The fix was simple. The hard part was recognising the problem
  • Nobody had checked if the proxy worked equally for all groups
  • Cost seemed reasonable, until someone disaggregated by race

What you optimise for matters. Costs ≠ needs when access to care is unequal.

Source: NBC News

Financial bias: credit scoring

Credit scoring decides who gets loans and at what rate. High-stakes ML.

Proxy problems:

  • Postcode correlates with race → indirect racial encoding
  • Purchase patterns correlate with gender
  • Career breaks penalise women disproportionately

Apple Card (2019): David Heinemeier Hansson’s credit limit was 20× his wife’s despite shared finances. NY regulators investigated Goldman Sachs and found no intentional discrimination.

Source: Department of Financial Services, New York State

No illegality found, but the inability to explain the algorithm undermined public trust.

Financial bias: LLM sentiment and market access

LLMs in finance carry their own bias risks:

Sentiment analysis:

  • Training data skews toward English-language media (Bloomberg, Reuters, FT)
  • Less accurate for emerging markets, smaller firms, non-English sources
  • Biased sentiment → biased trading signals → biased capital flows

Advisory chatbots:

  • Trained on advice for wealthy clients → may not serve lower-income users
  • Better in English than other languages

Summarisation:

  • Stronger on large US corporations than international firms → unequal service

Bias in financial LLMs is about whose information is well-represented and whose is not.

Common patterns across domains

The same patterns appear in both healthcare and finance, even without using race and gender as inputs:

  1. Proxy problems
    • Costs → health needs; postcode → creditworthiness
  2. Historical data
    • Healthcare access and lending were (and are) unequal
  3. Feedback loops
    • Denied credit → fewer opportunities → lower scores
    • Less care → worse health → higher costs
  4. Invisible until examined
    • Overall accuracy looked fine. Disaggregated analysis revealed disparities
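Pattern 4 is worth making concrete. Below is a toy audit that disaggregates a risk score by group; all records are fabricated, but the shape of the check mirrors the disparity Obermeyer et al. found.

```python
from collections import defaultdict

# Toy audit of "invisible until examined": scores can look fine overall
# while one group is sicker at the same risk score. Records are fabricated.
records = [
    # (group, risk_score, chronic_conditions)
    ("A", 0.9, 3), ("A", 0.9, 3), ("B", 0.9, 4), ("B", 0.9, 4),
    ("A", 0.5, 1), ("A", 0.5, 1), ("B", 0.5, 2), ("B", 0.5, 2),
]

def mean_conditions_by_group(rows, score):
    """Average chronic-condition count per group among patients at one score."""
    by_group = defaultdict(list)
    for group, s, conditions in rows:
        if s == score:
            by_group[group].append(conditions)
    return {g: sum(v) / len(v) for g, v in by_group.items()}

# At the same risk score, group B carries more chronic conditions.
print(mean_conditions_by_group(records, 0.9))
```

The aggregate accuracy of the scorer tells you nothing here; only the per-group breakdown reveals the disparity.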

Source: UBIS Global

Group discussion

Scenario:

A health insurer uses an LLM to read clinical notes and estimate costs for premium pricing:

  • Reads discharge summaries and doctor notes
  • Estimates future costs → sets premiums
  • No human reviews individual estimates

In your group (5 minutes):

  1. What bias could enter?
  2. What proxy problems do you see?
  3. How might this create a feedback loop?
  4. What safeguards would you require?

Consider:

  • Clinical notes describe patients differently by race or gender
  • Cost estimates inherit Optum-style problems
  • “Expensive” patients pay more → reduced access
  • Would you be comfortable if this system made decisions about you or someone you care about?

This scenario combines LLM text comprehension, healthcare data, financial incentives, and bias.

Main takeaways

Healthcare AI

  • Traditional ML: imaging, risk scoring, drug discovery
  • LLMs: note summarisation, patient communication, literature synthesis
  • RAG: evidence-backed clinical Q&A

Finance AI

  • ML: trading, credit scoring, fraud detection
  • LLMs: sentiment, summarisation, agentic workflows
  • RAG: compliance and regulatory Q&A

Bias and RAG risks

  • Proxy variables carry hidden assumptions
  • Historical data encodes inequality
  • RAG reduces hallucination but inherits knowledge base biases
  • Disaggregated analysis is the only way to see the problem

AI in healthcare and finance is not good or bad by default. What matters is how it is built, who it serves, and who audits it.

Further reading

Required readings:

Recommended:

Watch:

  • Coded Bias (2020). Netflix documentary on AI discrimination (90 min)

And that is all for today!