class: center, middle, inverse, title-slide .title[ # Topic 8:
Distinguishing Goals of Data Analysis ] .author[ ### Nick Hagerty
ECNS 460/560
Montana State University ] --- name: toc <style type="text/css"> # CSS for including pauses in printed PDF output (see bottom of lecture) @media print { .has-continuation { display: block !important; } } .hi { font-weight: 600; color: #e64173 !important; } .hi-pink { font-weight: 600; color: #e64173 !important; } .hi-slate { font-weight: 600; color: #314f4f !important; } .hi-turquoise { font-weight: 600; color: #44C1C4 !important; } .hi-purple { font-weight: 600; color: #6A5ACD !important; } .hi-orange { font-weight: 600; color: #FFA500 !important; } </style> # Table of contents 1. [The Data Generating Process](#DGP) 1. [The Potential Outcomes Model](#rubin) 1. [Descriptive, Predictive, or Causal Analysis?](#which) --- # Interpreting relationships in data When we say there's a relationship between two variables... how do we interpret that? - What precisely do we mean? - What do we want to do with this information? <img src="08-Distinguishing-Goals_files/figure-html/unnamed-chunk-2-1.png" width="80%" style="display: block; margin: auto;" /> --- # Distinguishing goals of data analysis When we analyze data, we always have one of three ultimate goals: 1. .hi-turquoise[**Description:**] Simply characterize observed patterns in the data. 2. .hi[**Causation:**] Learn about causal relationships: If we change X, how will Y change? 3. .hi-purple[**Prediction:**] Be able to guess the value of one variable from other information. -- Here are a few quick examples of a research question in each category: 1. .hi-turquoise[**Descriptive:**] Is wealth inequality increasing faster in the U.S. than in Europe? 2. .hi[**Causal:**] Does Medicaid coverage reduce the risk of bankruptcy? 3. .hi-purple[**Predictive:**] Can nighttime satellite imagery be used as a real-time indicator of GDP? -- Discerning which type of goal you have is critical for: * **Choosing methods:** Distinct approaches are required to achieve different goals. * **Interpreting results:** Mistaking one goal for another can lead your audience to make very bad decisions. --- class: inverse, middle name: DGP # The Data Generating Process --- # The Data Generating Process What's shared across all goals of data analysis: We start with an **outcome variable** called `\(Y\)`. We want to know some things: * Where do the values of `\(Y\)` come from? * How does `\(Y\)` relate to other variables? The **data generating process** (DGP) in the real world determines the values of `\(Y\)`. * The true DGP is unknowable, a black box: <img src="img/black-box-general.png" width="75%" style="display: block; margin: auto;" /> --- # The Data Generating Process <img src="img/black-box-general.png" width="80%" style="display: block; margin: auto;" /> We could write this picture as a function: $$ Y = f(x_1, x_2, x_3, ... x_N) $$ But some big questions remain: - What are all these inputs `\(x_1, x_2, ...\)`? - What is the functional form? (How does the box work?) --- # The Data Generating Process For some questions, the set of inputs and their functional form are discoverable. **Physics example:** How long does it take for an object to fall from a particular height? $$ t = \sqrt{2h/g} \approx 0.452 \sqrt{h} $$ Newton and others did experiments to learn the functional form. <img src="img/black-box-physics1.png" width="85%" style="display: block; margin: auto;" /> --- # The Data Generating Process Even in physics, this is just an approximation to the real world. <img src="img/black-box-physics2.png" width="85%" style="display: block; margin: auto;" /> The true DGP is more complicated, but it is still limited to a certain number of factors. * We can understand the vast majority of the "duration of free fall" DGP. -- In physics, there is no need for any distinction between causation and prediction. * If your model is *correct,* they're the same thing! --- # The Data Generating Process **Social science is harder.** People have free will! * Inputs? Any outcome is determined by far more than 10 or even 100 inputs. - Many inputs are impossible to measure. * Functional form? No universal laws guaranteeing they are simple. - Inputs often interact with each other. Characterizing the full DGP for *anything* is typically far beyond the realm of possibility. <img src="img/black-box-general.png" width="70%" style="display: block; margin: auto;" /> What can we do instead? **Lower our ambitions!** --- # Types of goals Most data analysis in economics falls into one of **3 types of goals:** 1. .hi-turquoise[**Descriptive:**] Document or quantify observed relationships between inputs and outputs. - Does not not necessarily tell us about the true DGP. - Helps us understand facts about the world. - Can often inspire questions for further research. 2. .hi[**Causal:**] Try to understand *one piece* of how the box works (the true DGP). - When you *change* one factor, how does it change the result? - Helps us make decisions about what to do (in policy, business, personal life). 3. .hi-purple[**Predictive:**] Create your own box to try to match the output. - Doesn't matter if it works the same, or if you have the correct inputs. - Only matters how closely your box produces the same result. - Helps us know what's likely to happen in a new situation. -- For more clarity, we need to take a slightly technical detour. --- class: inverse, middle name: rubin # The Potential Outcomes Model --- # Causation requires counterfactuals **Causal effect:** the change in one variable resulting from a change in another. - **Treatment** variable `\(T\)`. (E.g., taking Advil.) - **Outcome** variable `\(Y\)`. (E.g., severity of pain.) **Potential outcomes:** What would happen with and without treatment? - `\(Y^1_i =\)` outcome for person `\(i\)` if `\(T_i = 1\)` (take Advil). - `\(Y^0_i =\)` outcome for person `\(i\)` if `\(T_i = 0\)` (don't take Advil). The **treatment effect** for person `\(i\)` is: `$$\text{Effect}_i = Y^1_i - Y^0_i.$$` -- **The Fundamental Problem of Causal Inference:** Only one potential outcome is ever observable for each person. `$$Y_{i} = \begin{cases} Y_{i}^{1} & \text{if } T_{i}=1\\ Y_{i}^{0} & \text{if } T_{i}=0 \end{cases}$$` The other one is a **counterfactual** -- what *would* have happened had `\(T_i\)` been different. --- # Average Treatment Effect We can never learn the treatment effect by looking at just one person. Instead, we'll look at many people and try to estimate the **average treatment effect.** `$$\begin{aligned} ATE &= \mathbb{E}[Y^1_i - Y^0_i] \\ &= \overline{Y^1} - \overline{Y^0}. \end{aligned}$$` * `\(\overline{Y^1}\)`: average outcome if everyone had `\(T=1\)`. * `\(\overline{Y^0}\)`: average outcome if everyone had `\(T=0\)`. -- But these are still just *potential* outcomes. 😢 --- class: clear, middle <img src="img/potential_outcomes.PNG" width="55%" style="display: block; margin: auto;" /> --- # Interpreting mean comparisons What we *do* have in our data: - `\(\overline{Y_T}\)`: Average outcome among people who are treated `\((T_i=1)\)`. - `\(\overline{Y_U}\)`: Average outcome among people who are untreated `\((T_i=0)\)`. It's often natural to look at the difference in these group means: `$$\text{Difference in Means} = \overline{Y_T} - \overline{Y_U}$$` How is this different from the ATE? -- **Treated and untreated people aren't the same.** * Means are taken over groups of people who have different potential outcomes. - For example: People who take Advil have higher pain than those who don't. * Treatment effect is probably negative, but difference in means is positive. --- # Interpreting mean comparisons Let's be more precise about the connection between the ATE and the difference in means. First, we can rewrite these as potential outcomes: `$$\begin{aligned} \text{Difference in Means} &= \overline{Y_T} - \overline{Y_U} \\ &= \overline{Y_T^1} - \overline{Y_U^0} \end{aligned}$$` -- Next, I'll use an algebraic trick -- add and subtract the same term: `$$\begin{aligned} \text{Difference in Means} &= \overline{Y_T^1} - \overline{Y_U^0} + 0\\ &= \overline{Y_T^1} - \overline{Y_U^0} + \Big( \overline{Y_T^0} - \overline{Y_T^0} \Big) \end{aligned}$$` -- Rearranging, we get: `$$\text{Difference in Means} = \underbrace{ \Big(\overline{Y_T^1} - \overline{Y_T^0}\Big) }_\text{treatment effect} + \underbrace{ \Big(\overline{Y_T^0} - \overline{Y_U^0}\Big) }_\text{selection bias}$$` --- # Correlation is causation plus selection `$$\text{Difference in Means} = \underbrace{ \Big(\overline{Y_T^1} - \overline{Y_T^0}\Big) }_\text{treatment effect} + \underbrace{ \Big(\overline{Y_T^0} - \overline{Y_U^0}\Big) }_\text{selection bias}$$` The difference in means is a result of 2 processes: - **Treatment:** The true causal effect of the treatment variable. (Taking Advil reduces pain.) - **Selection:** Differences in the types of people who are treated. (People who take Advil are in more pain.) -- The central challenge of causal inference is to disentangle treatment from selection. How? - Since this formula is about *potential* outcomes, we can never prove it. - We always need to bring in (well-grounded) assumptions. -- **The idea:** find a situation where we can argue that selection bias is negligible. - This is called a **research design** or **identification strategy.** --- # Selection bias Note: 1. Don't confuse selection bias with sampling bias. * **Sampling bias:** We've collected a sample that is unrepresentative of a broader population. * **Selection bias:** Within our sample, the people with `\(T=1\)` have different potential outcomes from people with `\(T=0\)`. -- 2. Selection bias is an all-encompassing category for several sources of bias: - Omitted variables bias - Reverse causality - Selection on gains -- 3. What is a **research design** that can allow us to eliminate selection bias and estimate the treatment effect? Examples: - Experimentally vary `\(T\)`. - Eliminate "bad" variation in `\(T\)` (e.g., by controlling for enough other variables). - Isolate "good" variation in `\(T\)` (e.g., by finding and using an instrumental variable). --- class: inverse, middle name: which # Descriptive, Predictive, or Causal? --- # Types of goals Now we can revisit our 3 goals of data analysis. For sake of illustration, consider a simple linear model: `$$\begin{aligned} Y_i &= f(x_{0i}, x_{1i}, ..., x_{Ni}) \\ &= \beta_0 x_{0i} + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \varepsilon_i \end{aligned}$$` All 3 goals are trying to model the DGP, but with different emphases and interpretations. <br> **Three key differences** among the goals: 1. Focus: outcome vs. coefficients. 2. Whether selection bias matters. 3. Interpretation. --- # Diff 1. Focus: outcome vs. coefficients. `$$\color{#6A5ACD}{Y_i} = f(x_{0i}, x_{1i}, ..., x_{Ni}) = \color{#e64173}{\beta_0} x_{0i} + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \varepsilon_i$$` .hi-turquoise[**Description:**] Want to estimate a few coefficients: `\(\beta_0, \beta_1, \beta_2\)`. - Focus on showing relationships among a few variables: `\(\color{#6A5ACD}{\hat{Y_{i}}}, x_{0i}, x_{1i}, x_{2i}.\)` - Give up goal of *correctly* modeling the true DGP. Just show patterns as they are. .hi-purple[**Prediction:**] Want to estimate `\(\color{#6A5ACD}{\hat{Y_{i}}}\)`. - Focus on predicting `\(\color{#6A5ACD}{\hat{Y_{i}}}\)` given observed data `\(x_{0i}, x_{1i}, x_{2i}, ....\)` - Find a model that *works*, by any possible means. - Give up goal of *correctly* modeling the true DGP. .hi[**Causal inference:**] Want to estimate one coefficient `\(\color{#e64173}{\hat{\beta_0}}\)`, the effect of a treatment variable `\(x_{0i}\)`. - Give up goal of understanding causal effects of any other factors `\(x_{1i}, x_{2i}, ....\)` - Give up goal of explaining the overall level of `\(\color{#6A5ACD}{Y_{i}}\)`. - Only how much it changes in response to a change in `\(x_{0i}\)`. --- # Diff 2. Whether selection bias matters. `$$\color{#6A5ACD}{Y_i} = f(x_{0i}, x_{1i}, ..., x_{Ni}) = \color{#e64173}{\beta_0} x_{0i} + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \varepsilon_i$$` .hi-turquoise[**Description:**] **No.** Only want to infer patterns from observed data. - The reasons why patterns exist is an entirely separate question. .hi-purple[**Prediction:**] **No.** Only want to infer patterns from observed data. Selection gives us valuable information to predict `\(\color{#6A5ACD}{Y_i}\)`. - **Given** that a student attends summer school, what is their GPA? - **Given** how many people are wearing shorts, will an ice cream truck show up? - **Given** the size of its police department, how much crime does a city have? .hi[**Causal inference:**] **Yes.** Want to infer the result of active intervention. Must eliminate selection bias to estimate the treatment effect. - **If** a student attends summer school, how will their GPA change? - **If** someone chooses to wear shorts, will it make an ice cream truck show up? - **If** a city adds more police officers, will crime decrease? --- # Diff 3. Interpretation. `$$\color{#6A5ACD}{Y_i} = f(x_{0i}, x_{1i}, ..., x_{Ni}) = \color{#e64173}{\beta_0} x_{0i} + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \varepsilon_i$$` .hi-turquoise[**Description:**] `\(\beta_0\)` represents an **association** between `\(x_{0i}\)` and `\(\color{#6A5ACD}{Y_i}\)`. - It says how the average value of `\(\color{#6A5ACD}{Y_i}\)` changes as we look at different values of `\(x_{0}\)`. - Only a statement about the data, not about the reasons behind the pattern. .hi-purple[**Prediction:**] Model does **not** need to be interpretable. - Coefficients `\(\beta\)` are informative only of predictive power, not causal effects. - `\(\color{#e64173}{\beta_0}\)` might reflect the treatment effect of `\(x_{0}\)`, *or* the effects of omitted variables for which `\(x_0\)` happens to be a good proxy. - Generally, the model can be treated as a black box. .hi[**Causal inference:**] `\(\color{#e64173}{\beta_0}\)` is a **causal effect** of `\(x_{0}\)`. - Under stated assumptions (of the identification strategy). - Other coefficients generally lack interpretability. --- # Prediction vs. causal inference When would .hi[causal inference] be useful (to a policymaker, business executive, individual)? When would .hi-purple[prediction] be useful? 1. **Birth weight `\((Y)\)` ~ air pollution `\((x)\)`** 1. **Income `\((Y)\)` ~ educational attainment `\((x)\)`** 1. **Probability of default on a loan `\((Y)\)`** 1. **Poverty `\((Y)\)` ~ public infrastructure `\((x)\)`** --- # Prediction vs. causal inference When would .hi[causal inference] be useful (to a policymaker, business executive, individual)? When would .hi-purple[prediction] be useful? 1. **Birth weight `\((Y)\)` ~ air pollution `\((x)\)`** -- - Causal inference: How strictly should air pollution be regulated? - Prediction: What areas might need more prenatal services? -- 1. **Income `\((Y)\)` ~ educational attainment `\((x)\)`** -- - Causal inference: Should I go to college? Should we increase investment in public education? - Prediction: What types of people should we advertise to? Choose for a tax audit? -- 1. **Probability of default on a loan `\((Y)\)`** -- - Causal inference: How can we (borrowers) make better financial decisions? - Prediction: Who should we (bankers) lend money to? -- 1. **Poverty `\((Y)\)` ~ public infrastructure `\((x)\)`** -- - Causal inference: Can infrastructure investment help fight poverty? - Prediction: Which communities need the most aid right now? --- # Summary There are **3 main purposes** of data analysis: 1. .hi-turquoise[**Descriptive analysis:**] Characterize observed patterns among variables. 2. .hi[**Causal inference:**] Learn how Y changes as a result of actively changing X. 3. .hi-purple[**Prediction:**] Predict the value of one variable from other information. Causal questions require thinking about **counterfactuals.** * **The Fundamental Problem of Causal Inference:** We can only ever observe one of the **potential outcomes.** * A difference in means gives the **treatment effect** plus **selection bias.** Key differences among the 3 goals: 1. **Focus:** Prediction focuses on the outcome `\(\color{#6A5ACD}{\hat{Y_{i}}}\)`, causal inference on a coefficient `\(\color{#e64173}{\hat{\beta_0}}\)`. 2. **Selection bias** is a problem for causal inference, but not predictive or descriptive analysis. 3. **Interpretion:** Coefficients have meaning in causal inference and descriptive analysis, but not in prediction. --- # What about methods? What **methods** should we use for each goal? 1. .hi-turquoise[**Descriptive analysis:**] **Exploratory analysis and regression.** - Starting next week! 2. .hi[**Causal inference:**] **Econometrics.** - See your other classes. 3. .hi-purple[**Prediction:**] **Statistical learning / machine learning.** - Introduction later in the semester!