DATASCI 185: Introduction to AI Applications

Lecture 16: AI in Finance and Healthcare - Opportunities and Biases

Danilo Freire

Department of Data and Decision Sciences
Emory University

Welcome back! 😉

Recap of last class

  • Last time we covered types of bias in AI systems
  • Bias can enter at every stage: data collection, labelling, model design, deployment
  • Historical bias: past discrimination baked into training data
  • Representation bias: some groups underrepresented
  • Measurement bias: proxies that correlate with protected attributes
  • Fairness has multiple definitions that can conflict
  • Today: how AI is used in healthcare and finance, and where bias shows up

Source: NIST

Lecture overview

What we will cover today

Part 1: AI in healthcare

  • Traditional ML vs LLMs in medicine
  • Real deployments and RAG in clinical settings

Part 2: AI in finance

  • ML for trading, credit scoring, and fraud
  • LLMs, RAG, and agentic workflows

Part 3: When bias creeps in

  • The Optum algorithm and credit scoring
  • RAG-specific risks in both fields
  • Patterns that repeat across domains

Part 4: What can we learn?

  • Questions to ask about any AI system
  • Group discussion

Meme of the day

Source: Medium

AI in healthcare 🏥

Traditional ML in healthcare

ML has been used in medicine for years, mostly for tasks with labelled data and a clear outcome:

  • Medical imaging: CNNs detect tumours in X-rays, mammograms, retinal scans. Some match or beat radiologists
  • Risk scoring: ML models predict readmission, sepsis, mortality. Hospitals use these to allocate resources
  • Drug discovery: ML screens millions of molecules against protein targets, saving years in early development
  • ECG analysis: Deep learning flags arrhythmias from wearables in real time

All well-defined tasks: structured input in, classification or number out.

Traditional ML in healthcare works best for classification or regression on structured data.

What LLMs add to healthcare

LLMs handle tasks that involve reading, writing, and reasoning about text:

  • Clinical note summarisation: Pull out key medications, diagnoses, and follow-up instructions in seconds
  • Patient-facing chatbots: Explain diagnoses in plain language, answer follow-up questions, triage symptoms
  • Literature synthesis: Summarise findings across dozens of papers on a condition
  • Clinical trial matching: Match unstructured patient records against trial eligibility criteria (previously done by hand)
  • My suggestion: Google DeepMind’s MedGemma, a free, open-weights LLM optimised for medical text and image comprehension

All require understanding language, not crunching numbers.

LLMs reading clinical notes

Over 150 US health systems use AI-drafted messages via Epic’s MyChart (GPT-4):

  • Patient sends a question about medication side effects
  • System reads the patient’s record and drafts a response
  • Clinician reviews and sends. Saves ~30 seconds per message

Google’s Med-PaLM 2 scored 86.5% on US Medical Licensing Exam-style questions. Physicians preferred its long-form answers to physician-written answers on most evaluation axes.

But passing exams ≠ treating patients. The model cannot examine the patient, read body language, or pick up on context outside the chart.

Source: Google Research

LLMs for patient communication

LLMs are already used to talk to patients:

  • Multilingual support: Translate discharge instructions into a patient’s language without a dedicated translator
  • After-visit summaries: Plain-language version of what happened and what to do next
  • Symptom triage: Chatbots recommend whether to see a doctor, go to A&E, or stay home

John Muir Health: clinicians using AI-assisted charting saved 34 min/day on documentation. Physician turnover dropped 44%.

Liability remains unresolved. If an AI chatbot gives bad advice and a patient is harmed, who is responsible?

The technology is deployed. The governance frameworks are still catching up.

Source: NewsMedical.net

Discussion: would you trust an AI doctor?

Think about this:

You visit your GP. Instead of a doctor, an AI chatbot takes your symptoms, checks your medical history, and recommends a diagnosis and treatment plan. A doctor reviews the AI’s recommendation for 30 seconds before signing off.

Discuss with your neighbour (2 minutes):

  1. Would you be comfortable with this?
  2. Does it matter if the AI is more accurate on average than the doctor alone?
  3. What if you could see the AI’s reasoning?

Things to consider:

  • Accuracy on average ≠ accuracy for your specific case
  • Trust depends on transparency and the ability to question
  • Some patients may prefer a human for emotional and cultural reasons

AI in finance 📈

Traditional ML in finance

Finance was one of the earliest adopters of ML. Tasks are well-defined and data is plentiful:

  • Price forecasting: Neural networks on time-series data. Results are mixed; markets are noisy
  • Credit scoring: Logistic regression estimates default probability. The backbone of consumer lending
  • Fraud detection: Deep learning flags suspicious transactions in real time. Outperforms rule-based systems (McKinsey, 2023)
  • Algorithmic trading: RL agents learn strategies by trial and error in simulated markets

All involve numerical data, clear targets, and plenty of training examples.
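The credit-scoring workhorse above can be sketched in a few lines. This is a toy sketch: the weights, intercept, and applicant features below are invented for illustration, whereas real lenders fit these coefficients on large credit files.

```python
import math

# Toy logistic-regression credit scorer. The weights and intercept are
# invented for illustration, not taken from any real lending model.
WEIGHTS = {"utilisation": 2.1, "late_payments": 0.9, "years_of_history": -0.15}
INTERCEPT = -2.0

def default_probability(applicant: dict) -> float:
    """P(default) = sigmoid(w . x + b), the classic credit-scoring model."""
    z = INTERCEPT + sum(WEIGHTS[k] * applicant[k] for k in WEIGHTS)
    return 1 / (1 + math.exp(-z))

applicant = {"utilisation": 0.8, "late_payments": 2, "years_of_history": 10}
p = default_probability(applicant)
print(f"Estimated P(default) = {p:.2f}")
```

The model outputs a probability between 0 and 1; the lender then applies a cut-off to decide whether to approve the loan, which is where the fairness questions later in this lecture enter.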

Jim Simons, an extremely successful quant trader who started in 1988!

Financial ML relies on structured, numerical data: price histories, transaction records, credit files.

What LLMs add to finance

LLMs handle tasks that require reading and interpreting text. Much of the information that moves markets is in natural language:

  • Sentiment analysis: Classify financial news and earnings calls as positive, negative, or neutral
  • Document summarisation: Condense 10-K filings and earnings reports into short summaries
  • Numerical reasoning: Models like FinQA read financial tables and do multi-step calculations
  • Advisory chatbots: Answer portfolio questions, explain products, guide users through tax season

See the CFA Institute practical guide (2024) for more detail.

Source: CFA Institute

LLMs vs traditional ML: when to use which

Aspect              | Traditional ML/DL                              | LLMs
Best data type      | Numerical, time-series, tabular                | Unstructured text, documents
Typical tasks       | Classification, regression, anomaly detection  | Summarisation, Q&A, sentiment, generation
Training data       | Needs labelled examples for each task          | Can work zero-shot or few-shot
Multi-task          | Separate model per task                        | One model, different prompts
Numerical precision | Strong                                         | Weaker; struggles with exact calculations
Cost                | Lower inference cost                           | Higher; large models are expensive to run

Rule of thumb from Li et al. (2023): if your task has a clear definition (e.g. regression, classification), plenty of labelled data, and does not need common-sense reasoning, a traditional model is probably cheaper and just as good. LLMs pay off when you are dealing with unstructured text, ambiguity, or multi-step reasoning.

Finance-specific LLMs

Several LLMs have been fine-tuned for finance:

  • BloombergGPT: 363B tokens of financial data + 345B general. Beats general models on financial benchmarks
  • FinGPT: Open-source, fine-tunes LLaMA via LoRA. Under $300 per run
  • FinMA (PIXIU): 136K finance-specific instructions. Beats general LLMs on sentiment

Benchmarks (CFA Institute; Li et al., 2023):

  • Finance-tuned LLMs beat general models on sentiment and classification
  • GPT-4 still wins on numerical reasoning (more maths in pre-training)
  • For stock prediction, no model is reliable. ARIMA/LSTM still more practical

Domain-specific LLMs trade general ability for better financial language understanding. More about BloombergGPT here

Example: sentiment analysis for trading

The Financial PhraseBank dataset has annotated financial news sentences:

Prompt: “Analyse the sentiment of this financial news statement: negative, positive, or neutral.”

Text: “We have analysed Kaupthing Bank Sweden and found a business which fits well into Alandsbanken.”

Answer: Positive

A trading pipeline:

  1. Collect headlines and earnings call transcripts
  2. Score sentiment with a fine-tuned FinLLM
  3. Aggregate into a market signal
  4. Combine with quantitative data for a trading decision
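Steps 2 and 3 of the pipeline can be sketched in miniature. The headlines and their labels below are invented; in practice the labels would come from a fine-tuned FinLLM rather than being hand-assigned.

```python
# Toy version of steps 2-3: per-headline sentiment aggregated into a signal.
# Headlines and labels are invented; a fine-tuned FinLLM would supply the
# labels in a real pipeline.
scored_headlines = [
    ("Acme Corp beats earnings expectations", "positive"),
    ("Regulator opens probe into Acme accounting", "negative"),
    ("Acme announces routine board meeting", "neutral"),
]

SENTIMENT_VALUE = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}

def aggregate_signal(scored):
    """Mean sentiment in [-1, 1]: a crude stand-in for step 3, aggregation."""
    return sum(SENTIMENT_VALUE[label] for _, label in scored) / len(scored)

signal = aggregate_signal(scored_headlines)
print(f"Market signal: {signal:+.2f}")
```

Real systems weight headlines by source reliability, recency, and firm size before combining the signal with quantitative data in step 4.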

Agentic LLMs in finance

LLMs can orchestrate multi-step workflows (Li et al., 2023):

User asks an LLM agent to optimise a portfolio of equity and bond ETFs.

The LLM breaks this into steps:

  1. Look up relevant ETFs in a database
  2. Pull historical prices via API
  3. Write Python code for Sharpe ratio optimisation
  4. Run the code and interpret results
  5. Present a recommendation

The LLM acts as planner and coordinator, calling databases, APIs, and code interpreters. JPMorgan already uses AI agents for investment advice.

Source: Deloitte

The LLM does not “know” finance the way a quant does. It orchestrates tools that do the computation.
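Step 3 of the agent’s plan, the Sharpe ratio, is simple enough to show directly. The daily returns and risk-free rate below are made-up numbers for illustration.

```python
import statistics

# Step 3 of the agent workflow as plain code. Returns and the risk-free
# rate are made-up numbers, not real market data.
daily_returns = [0.001, -0.002, 0.003, 0.0005, -0.001, 0.002]
RISK_FREE_DAILY = 0.0001  # assumed daily risk-free rate

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualised Sharpe ratio: mean excess return over its standard deviation."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess) * periods_per_year ** 0.5

print(f"Annualised Sharpe: {sharpe_ratio(daily_returns, RISK_FREE_DAILY):.2f}")
```

This is exactly the kind of deterministic computation the agent delegates to a code interpreter: the LLM writes and runs code like this rather than doing the arithmetic itself.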

RAG in healthcare and finance

In healthcare:

  • Clinicians query drug interaction databases and clinical guidelines through natural language
  • Systems retrieve relevant papers and generate evidence-backed answers
  • Medical knowledge bases can be updated without retraining the model

In finance:

  • Compliance officers ask questions about new regulations. The system retrieves the actual rule text and generates grounded answers
  • Analysts query internal research repositories
  • Knowledge bases stay current as regulations change

RAG is one of the most practical near-term LLM applications in both fields because it addresses the hallucination problem directly.

Source: StackAI

RAG grounds LLM answers in actual documents. The model cites sources, not memory.
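The retrieval step can be sketched with simple word overlap standing in for the embedding similarity a production system would use. The documents below are invented snippets, not real clinical guidelines.

```python
# Minimal sketch of RAG retrieval. Word overlap stands in for embedding
# similarity, and the documents are invented snippets, not real guidelines.
documents = [
    "Warfarin interacts with aspirin and increases bleeding risk.",
    "Metformin is first-line therapy for type 2 diabetes.",
    "Statins may interact with certain antibiotics.",
]

def tokens(text: str) -> set[str]:
    """Lowercase words with trailing punctuation stripped."""
    return {w.strip(".,?").lower() for w in text.split()}

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    q = tokens(query)
    return sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)[:k]

context = retrieve("Does warfarin interact with aspirin?", documents)
# A full RAG pipeline would prepend this context to the prompt so the LLM
# answers from the retrieved documents rather than from memory.
print(context[0])
```

Note how the retrieval quality caps the answer quality: if the warfarin snippet were missing or outdated, the generated answer would be too.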

RAG risks in healthcare and finance

RAG reduces hallucination but does not eliminate risk. Both domains face domain-specific problems:

Retrieval quality:

  • If the knowledge base is incomplete or outdated, the retrieved documents will be wrong. In medicine, outdated guidelines can be dangerous
  • In finance, retrieving superseded regulations may lead to compliance failures

Bias in the knowledge base:

  • Medical literature underrepresents certain populations. RAG retrieves what exists, so gaps in research translate into gaps in answers
  • Financial knowledge bases skew toward English-language and Western-market sources

False confidence:

  • RAG answers look authoritative because they cite documents. Users may trust them more than they should
  • A clinician might skip verification if the answer comes with citations. A compliance officer might not read the full regulation

RAG is better than pure LLM generation, but “grounded” does not mean “correct.” The knowledge base itself can be biased, incomplete, or stale.

Discussion: will AI replace financial analysts?

Quick poll:

Show of hands: will AI replace more than half of entry-level financial analyst jobs within 10 years?

Discuss (2 minutes):

  1. Which analyst tasks are most automatable?
  2. Which are hardest to automate?
  3. How would this change your career preparation?

Consider:

  • Summarising reports: very automatable
  • Client relationships: not automatable
  • Novel situations: partly automatable
  • Responsibility for wrong calls: someone still has to take it

When bias creeps in 🤔

Bias can enter anywhere

Every application we discussed is vulnerable to the same biases from last lecture:

  • Historical bias: training data reflects past discrimination
  • Representation bias: some groups underrepresented
  • Measurement bias: proxies that work differently across groups
  • Feedback loops: predictions reinforce themselves over time

The next slides cover real cases in healthcare and finance.

Bias enters through the data and design choices, not through intent.

Healthcare bias: the Optum case

Obermeyer et al. (2019), published in Science, exposed bias in a widely used Optum algorithm.

What it did: Identified patients needing extra care. Getting flagged = getting help.

The problem: Used healthcare costs as a proxy for needs. But Black patients spend less even when equally sick (access barriers, insurance gaps, income inequality).

Result: At the same risk score, Black patients had 26.3% more chronic conditions. Unbiased referral would have raised the Black patient share from 17.7% to 46.5%.

Race was not an input. But costs carried racial information because healthcare access itself was unequal.

Healthcare bias: the fix

What happened next:

  • Researchers contacted Optum before publication
  • Switched from predicting costs to predicting health outcomes
  • Racial bias reduced by ~84%

Lessons:

  • The fix was simple. The hard part was recognising the problem
  • Nobody had checked if the proxy worked equally for all groups
  • Cost seemed reasonable, until someone disaggregated by race

What you optimise for matters. Costs ≠ needs when access to care is unequal.

Source: NBC News

Financial bias: credit scoring

Credit scoring decides who gets loans and at what rate. High-stakes ML.

Proxy problems:

  • Postcode correlates with race → indirect racial encoding
  • Purchase patterns correlate with gender
  • Career breaks penalise women disproportionately

Apple Card (2019): David Heinemeier Hansson’s credit limit was 20× his wife’s despite shared finances. NY regulators investigated Goldman Sachs and found no intentional discrimination.

Source: Department of Financial Services, New York State

No illegality found, but the inability to explain the algorithm undermined public trust.

Financial bias: LLM sentiment and market access

LLMs in finance carry their own bias risks:

Sentiment analysis:

  • Training data skews toward English-language media (Bloomberg, Reuters, FT)
  • Less accurate for emerging markets, smaller firms, non-English sources
  • Biased sentiment → biased trading signals → biased capital flows

Advisory chatbots:

  • Trained on advice for wealthy clients → may not serve lower-income users
  • Better in English than other languages

Summarisation:

  • Stronger on large US corporations than international firms → unequal service

Bias in financial LLMs is about whose information is well-represented and whose is not.

Common patterns across domains

The same patterns appear in both healthcare and finance, even without using race and gender as inputs:

  1. Proxy problems
    • Costs → health needs; postcode → creditworthiness
  2. Historical data
    • Healthcare access and lending were (and are) unequal
  3. Feedback loops
    • Denied credit → fewer opportunities → lower scores
    • Less care → worse health → higher costs
  4. Invisible until examined
    • Overall accuracy looked fine. Disaggregated analysis revealed disparities
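Pattern 4 is worth making concrete. Below is a toy audit that disaggregates a risk score by group; all records are fabricated, but the shape of the check mirrors the disparity Obermeyer et al. found.

```python
from collections import defaultdict

# Toy audit of "invisible until examined": scores can look fine overall
# while one group is sicker at the same risk score. Records are fabricated.
records = [
    # (group, risk_score, chronic_conditions)
    ("A", 0.9, 3), ("A", 0.9, 3), ("B", 0.9, 4), ("B", 0.9, 4),
    ("A", 0.5, 1), ("A", 0.5, 1), ("B", 0.5, 2), ("B", 0.5, 2),
]

def mean_conditions_by_group(rows, score):
    """Average chronic-condition count per group among patients at one score."""
    by_group = defaultdict(list)
    for group, s, conditions in rows:
        if s == score:
            by_group[group].append(conditions)
    return {g: sum(v) / len(v) for g, v in by_group.items()}

# At the same risk score, group B carries more chronic conditions.
print(mean_conditions_by_group(records, 0.9))
```

The aggregate accuracy of the scorer tells you nothing here; only the per-group breakdown reveals the disparity.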

Source: UBIS Global

Group discussion

Scenario:

A health insurer uses an LLM to read clinical notes and estimate costs for premium pricing:

  • Reads discharge summaries and doctor notes
  • Estimates future costs → sets premiums
  • No human reviews individual estimates

In your group (5 minutes):

  1. What bias could enter?
  2. What proxy problems do you see?
  3. How might this create a feedback loop?
  4. What safeguards would you require?

Consider:

  • Clinical notes describe patients differently by race or gender
  • Cost estimates inherit Optum-style problems
  • “Expensive” patients pay more → reduced access
  • Would you be comfortable if this system made decisions about you or someone you care about?

This scenario combines LLM text comprehension, healthcare data, financial incentives, and bias.

Main takeaways

Healthcare AI

  • Traditional ML: imaging, risk scoring, drug discovery
  • LLMs: note summarisation, patient communication, literature synthesis
  • RAG: evidence-backed clinical Q&A

Finance AI

  • ML: trading, credit scoring, fraud detection
  • LLMs: sentiment, summarisation, agentic workflows
  • RAG: compliance and regulatory Q&A

Bias and RAG risks

  • Proxy variables carry hidden assumptions
  • Historical data encodes inequality
  • RAG reduces hallucination but inherits knowledge base biases
  • Disaggregated analysis is the only way to see the problem

AI in healthcare and finance is not good or bad by default. What matters is how it is built, who it serves, and who audits it.

Further reading

Required readings:

Recommended:

Watch:

  • Coded Bias (2020). Netflix documentary on AI discrimination (90 min)

And that is all for today!