Topic 8: Distinguishing Goals of Data Analysis

class: center, middle, inverse, title-slide

.title[
# Topic 8: <br> Distinguishing Goals of Data Analysis
]
.author[
### Nick Hagerty <br> ECNS 460/560 <br> Montana State University
]

---

name: toc

# Table of contents

1. [The Data Generating Process](#DGP)

1. [The Potential Outcomes Model](#rubin)

1. [Descriptive, Predictive, or Causal Analysis?](#which)

---

# Interpreting relationships in data

When we say there's a relationship between two variables... how do we interpret that?
- What precisely do we mean?
- What do we want to do with this information?

---

# Distinguishing goals of data analysis

When we analyze data, we always have one of three ultimate goals:

1. .hi-turquoise[**Description:**] Simply characterize observed patterns in the data.
2. .hi[**Causation:**] Learn about causal relationships: If we change X, how will Y change?
3. .hi-purple[**Prediction:**] Be able to guess the value of one variable from other information.

Here are a few quick examples of a research question in each category:

1. .hi-turquoise[**Descriptive:**] Is wealth inequality increasing faster in the U.S. than in Europe?
2. .hi[**Causal:**] Does Medicaid coverage reduce the risk of bankruptcy?
3. .hi-purple[**Predictive:**] Can nighttime satellite imagery be used as a real-time indicator of GDP?

Discerning which type of goal you have is critical for:
* **Choosing methods:** Distinct approaches are required to achieve different goals.
* **Interpreting results:** Mistaking one goal for another can lead your audience to make very bad decisions.

---
class: inverse, middle
name: DGP

# The Data Generating Process

---

# The Data Generating Process

What's shared across all goals of data analysis:

We start with an **outcome variable** called `$Y$`. We want to know some things:
* Where do the values of `$Y$` come from?
* How does `$Y$` relate to other variables?

The **data generating process** (DGP) in the real world determines the values of `$Y$`.
* The true DGP is unknowable, a black box:

---

# The Data Generating Process

We could write this picture as a function:

$$ Y = f(x_1, x_2, x_3, ... x_N) $$

But some big questions remain:
- What are all these inputs `$x_1, x_2, ...$`?
- What is the functional form? (How does the box work?)

---

# The Data Generating Process

For some questions, the set of inputs and their functional form are discoverable.

**Physics example:** How long does it take for an object to fall from a particular height?

$$ t = \sqrt{2h/g} \approx 0.452 \sqrt{h}  $$

Newton and others did experiments to learn the functional form.

---

# The Data Generating Process

Even in physics, this is just an approximation to the real world.

The true DGP is more complicated, but it is still limited to a certain number of factors.
* We can understand the vast majority of the "duration of free fall" DGP.

In physics, there is no need for any distinction between causation and prediction.
* If your model is *correct,* they're the same thing!

---

# The Data Generating Process

**Social science is harder.** People have free will!
* Inputs? Any outcome is determined by far more than 10 or even 100 inputs.
  - Many inputs are impossible to measure.
* Functional form? No universal laws guaranteeing they are simple.
  - Inputs often interact with each other.

Characterizing the full DGP for *anything* is typically far beyond the realm of possibility.

What can we do instead? **Lower our ambitions!**

---

# Types of goals

Most data analysis in economics falls into one of **3 types of goals:**

1. .hi-turquoise[**Descriptive:**] Document or quantify observed relationships between inputs and outputs.
   - Does not not necessarily tell us about the true DGP.
   - Helps us understand facts about the world.
   - Can often inspire questions for further research.

2. .hi[**Causal:**] Try to understand *one piece* of how the box works (the true DGP).
   - When you *change* one factor, how does it change the result?
   - Helps us make decisions about what to do (in policy, business, personal life).

3. .hi-purple[**Predictive:**] Create your own box to try to match the output.
   - Doesn't matter if it works the same, or if you have the correct inputs.
   - Only matters how closely your box produces the same result.
   - Helps us know what's likely to happen in a new situation.

For more clarity, we need to take a slightly technical detour.

---
class: inverse, middle
name: rubin

# The Potential Outcomes Model

---

# Causation requires counterfactuals

**Causal effect:** the change in one variable resulting from a change in another.
- **Treatment** variable `$T$`. (E.g., taking Advil.)
- **Outcome** variable `$Y$`. (E.g., severity of pain.)

**Potential outcomes:** What would happen with and without treatment?
- `$Y^1_i =$` outcome for person `$i$` if `$T_i = 1$` (take Advil).
- `$Y^0_i =$` outcome for person `$i$` if `$T_i = 0$` (don't take Advil).

The **treatment effect** for person `$i$` is:
`$$\text{Effect}_i = Y^1_i - Y^0_i.$$`

**The Fundamental Problem of Causal Inference:** Only one potential outcome is ever observable for each person.

`$$Y_{i} =
\begin{cases}
Y_{i}^{1} & \text{if } T_{i}=1\\
Y_{i}^{0} & \text{if } T_{i}=0
\end{cases}$$`
The other one is a **counterfactual** -- what *would* have happened had `$T_i$` been different.

---

# Average Treatment Effect

We can never learn the treatment effect by looking at just one person.

Instead, we'll look at many people and try to estimate the **average treatment effect.**

`$$\begin{aligned}
ATE 
&= \mathbb{E}[Y^1_i - Y^0_i] \\
&= \overline{Y^1} - \overline{Y^0}.
\end{aligned}$$`

* `$\overline{Y^1}$`: average outcome if everyone had `$T=1$`.
* `$\overline{Y^0}$`: average outcome if everyone had `$T=0$`.

But these are still just *potential* outcomes. 😢

---
class: clear, middle

---

# Interpreting mean comparisons

What we *do* have in our data:
- `$\overline{Y_T}$`: Average outcome among people who are treated `$(T_i=1)$`.
- `$\overline{Y_U}$`: Average outcome among people who are untreated `$(T_i=0)$`.

It's often natural to look at the difference in these group means:
`$$\text{Difference in Means} = \overline{Y_T} - \overline{Y_U}$$`

How is this different from the ATE? 
--
**Treated and untreated people aren't the same.**
* Means are taken over groups of people who have different potential outcomes.
- For example: People who take Advil have higher pain than those who don't.
  * Treatment effect is probably negative, but difference in means is positive.

---

# Interpreting mean comparisons

Let's be more precise about the connection between the ATE and the difference in means.

First, we can rewrite these as potential outcomes:
`$$\begin{aligned}
\text{Difference in Means}
&= \overline{Y_T} - \overline{Y_U} \\
&= \overline{Y_T^1} - \overline{Y_U^0}
\end{aligned}$$`

Next, I'll use an algebraic trick -- add and subtract the same term:
`$$\begin{aligned}
\text{Difference in Means}
&= \overline{Y_T^1} - \overline{Y_U^0} + 0\\
&= \overline{Y_T^1} - \overline{Y_U^0} + \Big( \overline{Y_T^0} - \overline{Y_T^0} \Big)
\end{aligned}$$`

Rearranging, we get:
`$$\text{Difference in Means} = \underbrace{
  \Big(\overline{Y_T^1} - \overline{Y_T^0}\Big)
  }_\text{treatment effect} + \underbrace{
  \Big(\overline{Y_T^0} - \overline{Y_U^0}\Big)
  }_\text{selection bias}$$`

---

# Correlation is causation plus selection

`$$\text{Difference in Means} = \underbrace{
  \Big(\overline{Y_T^1} - \overline{Y_T^0}\Big)
  }_\text{treatment effect} + \underbrace{
  \Big(\overline{Y_T^0} - \overline{Y_U^0}\Big)
  }_\text{selection bias}$$`

The difference in means is a result of 2 processes:
- **Treatment:** The true causal effect of the treatment variable. (Taking Advil reduces pain.)
- **Selection:** Differences in the types of people who are treated. (People who take Advil are in more pain.)

The central challenge of causal inference is to disentangle treatment from selection. How?
- Since this formula is about *potential* outcomes, we can never prove it.
- We always need to bring in (well-grounded) assumptions.

**The idea:** find a situation where we can argue that selection bias is negligible.
- This is called a **research design** or **identification strategy.**

---

# Selection bias

Note:

1. Don't confuse selection bias with sampling bias.
   * **Sampling bias:** We've collected a sample that is unrepresentative of a broader population.
   * **Selection bias:** Within our sample, the people with `$T=1$` have different potential outcomes from people with `$T=0$`.

--
2. Selection bias is an all-encompassing category for several sources of bias:
   - Omitted variables bias
   - Reverse causality
   - Selection on gains

--
3. What is a **research design** that can allow us to eliminate selection bias and estimate the treatment effect? Examples:
   - Experimentally vary `$T$`.
   - Eliminate "bad" variation in `$T$` (e.g., by controlling for enough other variables).
   - Isolate "good" variation in `$T$` (e.g., by finding and using an instrumental variable).

---
class: inverse, middle
name: which

# Descriptive, Predictive, or Causal?

---

# Types of goals

Now we can revisit our 3 goals of data analysis.

For sake of illustration, consider a simple linear model:

`$$\begin{aligned}
Y_i
&= f(x_{0i}, x_{1i}, ..., x_{Ni}) \\
&= \beta_0 x_{0i} + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \varepsilon_i
\end{aligned}$$`

All 3 goals are trying to model the DGP, but with different emphases and interpretations.

<br>

**Three key differences** among the goals:
1. Focus: outcome vs. coefficients.
2. Whether selection bias matters.
3. Interpretation.

---

# Diff 1. Focus: outcome vs. coefficients.

`$$\color{#6A5ACD}{Y_i} = f(x_{0i}, x_{1i}, ..., x_{Ni}) = \color{#e64173}{\beta_0} x_{0i} + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \varepsilon_i$$`

.hi-turquoise[**Description:**] Want to estimate a few coefficients: `$\beta_0, \beta_1, \beta_2$`.
- Focus on showing relationships among a few variables: `$\color{#6A5ACD}{\hat{Y_{i}}}, x_{0i}, x_{1i}, x_{2i}.$`
- Give up goal of *correctly* modeling the true DGP. Just show patterns as they are.

.hi-purple[**Prediction:**] Want to estimate `$\color{#6A5ACD}{\hat{Y_{i}}}$`.
- Focus on predicting `$\color{#6A5ACD}{\hat{Y_{i}}}$` given observed data `$x_{0i}, x_{1i}, x_{2i}, ....$`
- Find a model that *works*, by any possible means.
- Give up goal of *correctly* modeling the true DGP.

.hi[**Causal inference:**] Want to estimate one coefficient `$\color{#e64173}{\hat{\beta_0}}$`, the effect of a treatment variable `$x_{0i}$`.
- Give up goal of understanding causal effects of any other factors `$x_{1i}, x_{2i}, ....$`
- Give up goal of explaining the overall level of `$\color{#6A5ACD}{Y_{i}}$`.
  - Only how much it changes in response to a change in `$x_{0i}$`.

---
# Diff 2. Whether selection bias matters.

`$$\color{#6A5ACD}{Y_i} = f(x_{0i}, x_{1i}, ..., x_{Ni}) = \color{#e64173}{\beta_0} x_{0i} + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \varepsilon_i$$`

.hi-turquoise[**Description:**] **No.** Only want to infer patterns from observed data.
- The reasons why patterns exist is an entirely separate question.

.hi-purple[**Prediction:**] **No.** Only want to infer patterns from observed data. Selection gives us valuable information to predict `$\color{#6A5ACD}{Y_i}$`.
- **Given** that a student attends summer school, what is their GPA?
- **Given** how many people are wearing shorts, will an ice cream truck show up?
- **Given** the size of its police department, how much crime does a city have?

.hi[**Causal inference:**] **Yes.** Want to infer the result of active intervention. Must eliminate selection bias to estimate the treatment effect.
- **If** a student attends summer school, how will their GPA change?
- **If** someone chooses to wear shorts, will it make an ice cream truck show up?
- **If** a city adds more police officers, will crime decrease?

---
# Diff 3. Interpretation.

`$$\color{#6A5ACD}{Y_i} = f(x_{0i}, x_{1i}, ..., x_{Ni}) = \color{#e64173}{\beta_0} x_{0i} + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \varepsilon_i$$`

.hi-turquoise[**Description:**] `$\beta_0$` represents an **association** between `$x_{0i}$` and `$\color{#6A5ACD}{Y_i}$`.
- It says how the average value of `$\color{#6A5ACD}{Y_i}$` changes as we look at different values of `$x_{0}$`.
- Only a statement about the data, not about the reasons behind the pattern.

.hi-purple[**Prediction:**] Model does **not** need to be interpretable.
- Coefficients `$\beta$` are informative only of predictive power, not causal effects.
- `$\color{#e64173}{\beta_0}$` might reflect the treatment effect of `$x_{0}$`, *or* the effects of omitted variables for which `$x_0$` happens to be a good proxy.
- Generally, the model can be treated as a black box.

.hi[**Causal inference:**] `$\color{#e64173}{\beta_0}$` is a **causal effect** of `$x_{0}$`.
- Under stated assumptions (of the identification strategy).
- Other coefficients generally lack interpretability.

---
# Prediction vs. causal inference

When would .hi[causal inference] be useful (to a policymaker, business executive, individual)? When would .hi-purple[prediction] be useful?

1. **Birth weight `$(Y)$` ~ air pollution `$(x)$`**

1. **Income `$(Y)$` ~ educational attainment `$(x)$`**

1. **Probability of default on a loan `$(Y)$`**

1. **Poverty `$(Y)$` ~ public infrastructure `$(x)$`**

---
# Prediction vs. causal inference

When would .hi[causal inference] be useful (to a policymaker, business executive, individual)? When would .hi-purple[prediction] be useful?

1. **Birth weight `$(Y)$` ~ air pollution `$(x)$`**

--
   - Causal inference: How strictly should air pollution be regulated?
   - Prediction: What areas might need more prenatal services?

--
1. **Income `$(Y)$` ~ educational attainment `$(x)$`**

--
   - Causal inference: Should I go to college? Should we increase investment in public education?
   - Prediction: What types of people should we advertise to? Choose for a tax audit?

--
1. **Probability of default on a loan `$(Y)$`**

--
   - Causal inference: How can we (borrowers) make better financial decisions?
   - Prediction: Who should we (bankers) lend money to?

--
1. **Poverty `$(Y)$` ~ public infrastructure `$(x)$`**

--
   - Causal inference: Can infrastructure investment help fight poverty?
   - Prediction: Which communities need the most aid right now?

---

# Summary

There are **3 main purposes** of data analysis:
1. .hi-turquoise[**Descriptive analysis:**] Characterize observed patterns among variables.
2. .hi[**Causal inference:**] Learn how Y changes as a result of actively changing X.
3. .hi-purple[**Prediction:**] Predict the value of one variable from other information.

Causal questions require thinking about **counterfactuals.**
* **The Fundamental Problem of Causal Inference:** We can only ever observe one of the **potential outcomes.**
* A difference in means gives the **treatment effect** plus **selection bias.**

Key differences among the 3 goals:
1. **Focus:** Prediction focuses on the outcome `$\color{#6A5ACD}{\hat{Y_{i}}}$`, causal inference on a coefficient `$\color{#e64173}{\hat{\beta_0}}$`.
2. **Selection bias** is a problem for causal inference, but not predictive or descriptive analysis.
3. **Interpretion:** Coefficients have meaning in causal inference and descriptive analysis, but not in prediction.

---

# What about methods?

What **methods** should we use for each goal?

1. .hi-turquoise[**Descriptive analysis:**] **Exploratory analysis and regression.**
   - Starting next week!
   
2. .hi[**Causal inference:**] **Econometrics.**
   - See your other classes.

3. .hi-purple[**Prediction:**] **Statistical learning / machine learning.**
   - Introduction later in the semester!