class: center, middle, inverse, title-slide

# Learning from Observational Data
## EC 350: Labor Economics
### Kyle Raze
### Winter 2022

---

# Learning from Observational Data

1. A taxonomy of data
    - Experimental *vs.* observational data

2. Directed acyclic graphs
    - Causal paths
    - Backdoor paths
    - Backdoor criterion

3. Regression discontinuity

---
class: inverse, middle

# A taxonomy of data

---

# A taxonomy of data

## .hi-pink[Experimental]

Data generated from a .hi-pink[randomized] experiment.

- Treatment assigned at .hi-pink[random]
- The **gold standard** of social science research
- Often difficult/impractical/unethical to conduct

--

## .hi-purple[Observational (non-experimental)]

Data generated from the .hi-purple[decisions] of various individuals in the "real world."

- Sometimes treatment is randomly assigned (*e.g.,* in a lottery), but not usually .hi-purple[(non-random!)]
- Prone to selection bias and omitted-variable bias
- Must rely on natural experiments to identify causal relationships

---

# A taxonomy of data

## **Example: Effect of job training on unemployment status**

--

.pull-left[

### .hi-pink[Experimental sample]

<style type="text/css">
/* Table width = 100% max-width */
.remark-slide table{ width: 100%; }
/* Change the background color to white for shaded rows (even rows) */
.remark-slide thead, .remark-slide tr:nth-child(2n) { background-color: white; }
</style>
<table>
<caption> <br>**Unemployed?** (.mono[=] 1 if yes, .mono[=] 0 if no)</caption>
 <thead>
  <tr>
   <th style="text-align:left;color: #708090 !important;"> </th>
   <th style="text-align:center;color: #708090 !important;"> 1 </th>
   <th style="text-align:center;color: #708090 !important;"> 2 </th>
   <th style="text-align:center;color: #708090 !important;"> 3 </th>
   <th style="text-align:center;color: #708090 !important;"> 4 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> .hi-pink[Training?]
</td> <td style="text-align:center;color: #272822 !important;"> -0.111 </td> <td style="text-align:center;color: #272822 !important;"> -0.116 </td> <td style="text-align:center;color: #272822 !important;"> -0.115 </td> <td style="text-align:center;color: #272822 !important;"> -0.113 </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;color: #c2bebe !important;color: #272822 !important;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;"> (0.044) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;"> (0.044) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;"> (0.044) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;"> (0.044) </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> **Control mean** </td> <td style="text-align:center;color: #272822 !important;"> 0.354 </td> <td style="text-align:center;color: #272822 !important;"> 0.354 </td> <td style="text-align:center;color: #272822 !important;"> 0.354 </td> <td style="text-align:center;color: #272822 !important;"> 0.354 </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> **Demographics** </td> <td style="text-align:center;color: #272822 !important;"> </td> <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td> <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td> <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> **Education** </td> <td style="text-align:center;color: #272822 !important;"> </td> <td style="text-align:center;color: #272822 !important;"> </td> <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td> <td style="text-align:center;color: #272822 
!important;"> `\(\checkmark\)` </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> **Unemployed?.sub[t-1]** </td>
   <td style="text-align:center;color: #272822 !important;"> </td>
   <td style="text-align:center;color: #272822 !important;"> </td>
   <td style="text-align:center;color: #272822 !important;"> </td>
   <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td>
  </tr>
</tbody>
</table>
.smallest[*Note:* Standard errors in parentheses.]
]

--

.pull-right[

### .hi-purple[Non-experimental sample]

<table>
<caption> <br>**Unemployed?** (.mono[=] 1 if yes, .mono[=] 0 if no)</caption>
 <thead>
  <tr>
   <th style="text-align:left;color: #708090 !important;"> </th>
   <th style="text-align:center;color: #708090 !important;"> 1 </th>
   <th style="text-align:center;color: #708090 !important;"> 2 </th>
   <th style="text-align:center;color: #708090 !important;"> 3 </th>
   <th style="text-align:center;color: #708090 !important;"> 4 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> .hi-purple[Training?]
</td> <td style="text-align:center;color: #272822 !important;"> 0.128 </td> <td style="text-align:center;color: #272822 !important;"> 0.164 </td> <td style="text-align:center;color: #272822 !important;"> 0.160 </td> <td style="text-align:center;color: #272822 !important;"> -0.182 </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;color: #c2bebe !important;color: #272822 !important;"> </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;"> (0.025) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;"> (0.027) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;"> (0.027) </td> <td style="text-align:center;color: #272822 !important;color: #c2bebe !important;"> (0.027) </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> **Control mean** </td> <td style="text-align:center;color: #272822 !important;"> 0.115 </td> <td style="text-align:center;color: #272822 !important;"> 0.115 </td> <td style="text-align:center;color: #272822 !important;"> 0.115 </td> <td style="text-align:center;color: #272822 !important;"> 0.115 </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> **Demographics** </td> <td style="text-align:center;color: #272822 !important;"> </td> <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td> <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td> <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td> </tr> <tr> <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> **Education** </td> <td style="text-align:center;color: #272822 !important;"> </td> <td style="text-align:center;color: #272822 !important;"> </td> <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td> <td style="text-align:center;color: #272822 
!important;"> `\(\checkmark\)` </td>
  </tr>
  <tr>
   <td style="text-align:left;color: #272822 !important;color: #272822 !important;"> **Unemployed?.sub[t-1]** </td>
   <td style="text-align:center;color: #272822 !important;"> </td>
   <td style="text-align:center;color: #272822 !important;"> </td>
   <td style="text-align:center;color: #272822 !important;"> </td>
   <td style="text-align:center;color: #272822 !important;"> `\(\checkmark\)` </td>
  </tr>
</tbody>
</table>
.smallest[*Note:* Standard errors in parentheses.]
]

---
class: inverse, middle

# Directed acyclic graphs

---

# Directed acyclic graphs

.pull-left[

A directed acyclic graph (DAG) can help us visualize the assumptions necessary to estimate causal relationships using observational data.

.hi-pink[Nodes] represent .hi-pink[variables].

.hi-black[Arrows] represent .hi-black[causal relationships] between variables.

]

.pull-right[

<img src="05-Observational_Data_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />

]

---

# DAGs follow two rules

.pull-left[

.center[**Rule 1 ("directed"):** No bidirectional arrows!]

<img src="05-Observational_Data_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" />

.center[.hi-red[Illegal!]]

]

--

.pull-right[

.center[**Rule 2 ("acyclic"):** No feedback loops!]

<img src="05-Observational_Data_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" />

.center[.hi-red[Illegal!]]

]

---

# Causal paths

Our objective is to **identify the causal effect** of a treatment variable .hi[D] on an outcome variable .hi[Y].

- The treatment could have a **direct effect** on the outcome: .hi[D] .mono[-->] .hi[Y].

- Alternatively, the treatment could have an **indirect effect** on the outcome through .hi[X], a mediator variable: .hi[D] .mono[-->] .hi[X] .mono[-->] .hi[Y].

<img src="05-Observational_Data_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" /> --- # Backdoor paths The presence of a confounder variable .hi[W] opens a **backdoor path** from the treatment to the outcome:<br>.center[.hi[D] .mono[<--] .hi[W] .mono[-->] .hi[Y]] <img src="05-Observational_Data_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" /> An open backdoor path creates a **spurious correlation** between the treatment and the outcome! --- # Backdoor paths ## **Example: Returns to education** .pull-left[ **Q:** How does education affect earnings? - .hi[D] .mono[=] Education (*e.g.,* going to college or not) - .hi[Y] .mono[=] Earnings as an adult - .hi[PE] .mono[=] Parental education - .hi[I] .mono[=] Family income - .hi[U] .mono[=] Unobserved characteristics (*e.g.,* family background) ] .pull-right[ <img src="05-Observational_Data_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" /> ] --- count: false # Backdoor paths ## **Example: Returns to education** .pull-left[ **Q:** How does education affect earnings? - .hi[D] .mono[=] Education (*e.g.,* going to college or not) - .hi[Y] .mono[=] Earnings as an adult - .hi[PE] .mono[=] Parental education - .hi[I] .mono[=] Family income - .hi[U] .mono[=] Unobserved characteristics (*e.g.,* family background) The presence—*or absence*—of an arrow illustrates our **causal assumptions** about how education affects earnings! ] .pull-right[ <img src="05-Observational_Data_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto;" /> ] --- # Backdoor paths ## **Example: Returns to education** .pull-left[ **Q:** What are the paths through which education affects earnings? 
] .pull-right[ <img src="05-Observational_Data_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" /> ] --- count: false # Backdoor paths ## **Example: Returns to education** .pull-left[ **Q:** What are the paths through which education affects earnings? - .hi[D] .mono[-->] .hi[Y] (causal effect) - .hi[D] .mono[<--] .hi[I] .mono[-->] .hi[Y] (backdoor path) - .hi[D] .mono[<--] .hi[PE] .mono[-->] .hi[I] .mono[-->] .hi[Y] (backdoor path) - .hi[D] .mono[<--] .hi[U] .mono[-->] .hi[PE] .mono[-->] .hi[I] .mono[-->] .hi[Y] (backdoor path) ] .pull-right[ <img src="05-Observational_Data_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" /> ] --- # Backdoor paths ## **Backdoor criterion** > The observed correlation between .hi[Y] and .hi[D] isolates the causal effect of .hi[D] on .hi[Y] if and only if all backdoor paths from .hi[D] to .hi[Y] are closed. -- **Q:** What closes a backdoor path? - **A.sub[1]:** *Conditioning* or *controlling for* the confounder variable on the path. -- - **A.sub[2]:** The presence of a collider variable on the path. --- # Backdoor paths The presence of a collider variable .hi[C] closes a backdoor path from the treatment to the outcome:<br>.center[.hi[D] .mono[-->] .hi[C] .mono[<--] .hi[Y]] <img src="05-Observational_Data_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" /> -- **The implication?** We don't want to control for collider variables! - Conditioning on a collider can open up new backdoor paths. (More on this later.) --- # Backdoor paths ## **Example: Returns to education** .pull-left[ **Q:** How could we satisfy the backdoor criterion given our assumptions about the effect of education on earnings? 
] .pull-right[ <img src="05-Observational_Data_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" /> ] --- count: false # Backdoor paths ## **Example: Returns to education** .pull-left[ **Q:** How could we satisfy the backdoor criterion given our assumptions about the effect of education on earnings? **A:** Control for family income (.hi[I]) - **Why?** Family income appears as a non-collider on each backdoor path:<br>.center[.hi[D] .mono[<--] .hi[I] .mono[-->] .hi[Y]] .center[.hi[D] .mono[<--] .hi[PE] .mono[-->] .hi[I] .mono[-->] .hi[Y]] .center[.hi[D] .mono[<--] .hi[U] .mono[-->] .hi[PE] .mono[-->] .hi[I] .mono[-->] .hi[Y]] ] .pull-right[ <img src="05-Observational_Data_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" /> ] --- # Backdoor paths ## **Example: Returns to education** .pull-left[ **Q:** Would controlling for family income isolate the causal effect of education on earnings if unobserved family background (.hi[U]) has a direct effect on earnings (.hi[Y])? ] .pull-right[ <img src="05-Observational_Data_files/figure-html/unnamed-chunk-17-1.svg" style="display: block; margin: auto;" /> ] --- count: false # Backdoor paths ## **Example: Returns to education** .pull-left[ **Q:** Would controlling for family income isolate the causal effect of education on earnings if unobserved family background (.hi[U]) has a direct effect on earnings (.hi[Y])? **A:** No! - .hi[U] is unobserved, so we can't control for it. - The backdoor path .hi[D] .mono[<--] .hi[U] .mono[-->] .hi[Y] would stay open. ] .pull-right[ <img src="05-Observational_Data_files/figure-html/unnamed-chunk-18-1.svg" style="display: block; margin: auto;" /> ] --- count: false # Backdoor paths ## **Example: Returns to education** .pull-left[ **Q:** Would controlling for family income isolate the causal effect of education on earnings if unobserved family background (.hi[U]) has a direct effect on earnings (.hi[Y])? **A:** No! 
- .hi[U] is unobserved, so we can't control for it. - The backdoor path .hi[D] .mono[<--] .hi[U] .mono[-->] .hi[Y] would stay open. **The takeaway?**<br>.hi-pink[ALL causal inference is by assumption!] ] .pull-right[ <img src="05-Observational_Data_files/figure-html/unnamed-chunk-19-1.svg" style="display: block; margin: auto;" /> ] --- class: inverse, middle # Regression discontinuity --- # Regression discontinuity There are situations in the real world where treatment is assigned in a way that is **as good as random.** - These situations can provide **valid comparison groups**, just like the ones you'd find in a randomized control trial! **Examples?** When some arbitrary threshold triggers a change in treatment: - Anti-discrimination laws only apply to firms with more than 15 employees. - Prisoners are eligible for early parole if some score exceeds a threshold. - An individual has legal access to alcohol if they are 21 or older. - You get a ticket if your speed exceeds the speed limit. - A candidate for governor wins if her vote share exceeds that of her competitors. -- Economists can (and often do) use these situations to estimate causal effects. --- # Regression discontinuity **Example:** Effect of merit scholarships on graduation - Outcome variable .mono[=] probability of graduation - Treatment .mono[=] scholarship money - "Assignment variable" .mono[=] admissions test score (*e.g.,* the SAT) - "Cutoff/threshold" .mono[=] minimum score for getting a scholarship (*e.g.,* SAT score of 1200 or higher) -- **Assumption:** Students *just below* the cutoff are comparable to those *just above* the cutoff. 
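---

# Regression discontinuity

A minimal sketch of how the comparison at the cutoff can be turned into an estimate, using simulated data. Everything here is an assumption made for illustration: the data-generating process, the true jump of 0.10, the two bandwidths, and the helper names `ols_line` and `rd_jump` are hypothetical and do not come from a real study.

```python
import random

def ols_line(xs, ys):
    """Least-squares fit of ys on xs; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rd_jump(scores, outcomes, cutoff, bandwidth):
    """Fit one line on each side of the cutoff and compare them at the cutoff."""
    below = [(s - cutoff, y) for s, y in zip(scores, outcomes)
             if cutoff - bandwidth <= s < cutoff]
    above = [(s - cutoff, y) for s, y in zip(scores, outcomes)
             if cutoff <= s <= cutoff + bandwidth]
    intercept_below, _ = ols_line(*zip(*below))
    intercept_above, _ = ols_line(*zip(*above))
    # Scores are centered at the cutoff, so each intercept is the fitted value there.
    return intercept_above - intercept_below

# Simulated data (hypothetical): scholarship for scores of 1200 or higher.
CUTOFF, TRUE_JUMP = 1200, 0.10
rng = random.Random(350)
scores = [rng.randint(800, 1600) for _ in range(200_000)]
graduated = [
    int(rng.random() < 0.30 + 0.0005 * (s - 800) + TRUE_JUMP * (s >= CUTOFF))
    for s in scores
]

# Compare a wide and a narrow window around the cutoff.
for bandwidth in (200, 50):
    print(bandwidth, round(rd_jump(scores, graduated, CUTOFF, bandwidth), 3))
```

In this simulation the relationship is linear on both sides by construction, so both bandwidths should land near the assumed jump. The narrower window uses fewer observations, so its estimate is noisier.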
---
layout: true
class: clear-slide

---

Let's start with potential graduation rates: `\(\color{#9370DB}{\mathop{E}\left[ \text{Y}_{0,i} \mid \text{SAT}_{i} \right]}\)`

<img src="05-Observational_Data_files/figure-html/s1-1.svg" style="display: block; margin: auto;" />

---
count: false

Let's start with potential graduation rates: `\(\color{#9370DB}{\mathop{E}\left[ \text{Y}_{0,i} \mid \text{SAT}_{i} \right]}\)` and `\(\color{#e64173}{\mathop{E}\left[ \text{Y}_{1,i} \mid \text{SAT}_{i} \right]}\)`.

<img src="05-Observational_Data_files/figure-html/s2-1.svg" style="display: block; margin: auto;" />

---

You only get a scholarship if your .hi-slate[SAT score exceeds the cutoff score].

<img src="05-Observational_Data_files/figure-html/s3-1.svg" style="display: block; margin: auto;" />

---

`\(\color{#e64173}{\mathop{E}\left[ \text{Y}_{1,i} \mid \text{SAT}_{i} = 1200 \right]} - \color{#9370DB}{\mathop{E}\left[ \text{Y}_{0,i} \mid \text{SAT}_{i} = 1200 \right]}\)` gives the .hi-orange[causal effect] .hi-slate[at the cutoff].

<img src="05-Observational_Data_files/figure-html/s4-1.svg" style="display: block; margin: auto;" />

---

Using real data, researchers have to estimate `\(\color{#e64173}{\mathop{E}\left[ \text{Y}_{1,i} \mid \text{SAT}_{i} \right]}\)` and `\(\color{#9370DB}{\mathop{E}\left[ \text{Y}_{0,i} \mid \text{SAT}_{i} \right]}\)`.

<img src="05-Observational_Data_files/figure-html/s5-1.svg" style="display: block; margin: auto;" />

---

One way to estimate the .hi-orange[jump] is to estimate a regression on each side of the cutoff.

<img src="05-Observational_Data_files/figure-html/s6-1.svg" style="display: block; margin: auto;" />

---
count: false

One way to estimate the .hi-orange[jump] is to estimate a regression on each side of the cutoff.

<img src="05-Observational_Data_files/figure-html/s7-1.svg" style="display: block; margin: auto;" />

---

Another way is to estimate regressions using only data closer to the cutoff.

<img src="05-Observational_Data_files/figure-html/s8-1.svg" style="display: block; margin: auto;" /> --- Different choices of samples and models can lead to different estimates of the treatment effect! <img src="05-Observational_Data_files/figure-html/s9-1.svg" style="display: block; margin: auto;" /> --- Different choices of samples and models can lead to different estimates of the treatment effect! <img src="05-Observational_Data_files/figure-html/s10-1.svg" style="display: block; margin: auto;" /> --- Different choices of samples and models can lead to different estimates of the treatment effect! <img src="05-Observational_Data_files/figure-html/s11-1.svg" style="display: block; margin: auto;" /> --- Different choices of samples and models can lead to different estimates of the treatment effect! <img src="05-Observational_Data_files/figure-html/s12-1.svg" style="display: block; margin: auto;" /> --- Different choices of samples and models can lead to different estimates of the treatment effect! <img src="05-Observational_Data_files/figure-html/s13-1.svg" style="display: block; margin: auto;" /> --- Different choices of samples and models can lead to different estimates of the treatment effect! <img src="05-Observational_Data_files/figure-html/s14-1.svg" style="display: block; margin: auto;" /> --- Some modeling choices can find an effect even if none exists! <img src="05-Observational_Data_files/figure-html/s15a-1.svg" style="display: block; margin: auto;" /> --- count: false Some modeling choices can find an effect even if none exists! <img src="05-Observational_Data_files/figure-html/s15b-1.svg" style="display: block; margin: auto;" /> --- count: false Some modeling choices can find an effect even if none exists! <img src="05-Observational_Data_files/figure-html/s15c-1.svg" style="display: block; margin: auto;" /> --- count: false Some modeling choices can find an effect even if none exists! 
<img src="05-Observational_Data_files/figure-html/s15d-1.svg" style="display: block; margin: auto;" /> --- count: false Some modeling choices can find an effect even if none exists! <img src="05-Observational_Data_files/figure-html/s15e-1.svg" style="display: block; margin: auto;" /> --- layout: false # Regression discontinuity **Q:** When should we trust a regression discontinuity comparison? - When is the comparison *internally valid*? -- **A:** When we believe that **treatment is the only thing that changes** (other than observed outcomes) at the cutoff. 1. We don't want to see evidence of people **bunching** on one side of the threshold. - This could mean that people are **manipulating the assignment variable** near the cutoff so that they get the treatment. - Example: cheating among students who anticipate being close to the cutoff as a way to increase their score just enough to get the scholarship. 2. We don't want to see a **"jump" in other variables** at the cutoff. - This would mean that people on one side of the cutoff are **no longer comparable** to people on the other side! --- # Regression discontinuity **Q:** How can we tell if the treatment actually has a causal effect on the outcome? -- **A:** The treatment has an effect if **all three** of the statements below are true. 1. We believe that the regression discontinuity comparison is **internally valid.** 2. We can see that the **outcome variable "jumps"** at the cutoff ***when we look at the raw data.*** 3. The estimate of the "jump" is **precise enough** to conclude that the effect is statistically significant.
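---

# Regression discontinuity

The checks above can be sketched in a few lines on simulated data. This is an illustration under stated assumptions, not a formal test. The helper names (`share_at_or_above`, `mean_gap`, `is_significant`), the simulated scores and incomes, the 20-point window, and the example estimate and standard error are all hypothetical. The 1.96 threshold is the conventional two-sided 5% critical value under a normal approximation.

```python
import random

def share_at_or_above(scores, cutoff, window):
    """Among scores within `window` points of the cutoff, the share at or above it."""
    near = [s for s in scores if cutoff - window <= s < cutoff + window]
    return sum(s >= cutoff for s in near) / len(near)

def mean_gap(scores, values, cutoff, window):
    """Mean of `values` just above the cutoff minus the mean just below it."""
    above = [v for s, v in zip(scores, values) if cutoff <= s < cutoff + window]
    below = [v for s, v in zip(scores, values) if cutoff - window <= s < cutoff]
    return sum(above) / len(above) - sum(below) / len(below)

def is_significant(estimate, std_error, critical_value=1.96):
    """Two-sided test at the 5% level under a normal approximation."""
    return abs(estimate / std_error) > critical_value

CUTOFF, WINDOW = 1200, 20
rng = random.Random(2022)
honest = [rng.randint(800, 1600) for _ in range(50_000)]
# Hypothetical manipulation: half of the students within 20 points below
# the cutoff end up with a score exactly at the cutoff.
manipulated = [
    CUTOFF if CUTOFF - WINDOW <= s < CUTOFF and rng.random() < 0.5 else s
    for s in honest
]
# A covariate (family income, in thousands) unrelated to scores by construction.
family_income = [rng.gauss(60, 20) for _ in honest]

# Check 1: bunching. A share far from one half suggests manipulation.
print("share above, no manipulation:", round(share_at_or_above(honest, CUTOFF, WINDOW), 2))
print("share above, manipulation:   ", round(share_at_or_above(manipulated, CUTOFF, WINDOW), 2))
# Check 2: other variables should not jump at the cutoff.
print("income gap at cutoff:", round(mean_gap(honest, family_income, CUTOFF, WINDOW), 2))
# Check 3: is a (hypothetical) estimated jump precise enough?
print("significant:", is_significant(0.10, 0.02))
```

A share close to one half and a covariate gap close to zero are consistent with an internally valid comparison, but they do not prove it: the comparison still rests on the assumption that treatment is the only thing that changes at the cutoff.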