class: center, middle, inverse, title-slide # Instrumental Variables ## EC 421, Set 11 ### Edward Rubin ### 08 March 2019 --- class: inverse, middle # Prologue --- name: schedule # Schedule ## Last Time Causality ## Today - Econ. Masters program - Review: Causality - New: Instrumental variables ## Upcoming Assignment soon. --- layout: false name: masters # Master's Program ## Applied Economics You could be a master of (applied) economics... - 1-year program including courses on applied econometrics, data science, and "big data". - Awesome opportunity to focus on applying economic methods to real-world questions/scenarios. - [Applications are due by **May 1 (2019)**](https://economics.uoregon.edu/masters/). More information: [https://economics.uoregon.edu/masters/](https://economics.uoregon.edu/masters/) --- layout: false class: inverse, middle # Causality ## Review --- layout: true # Causality ## Review --- name: causality_review In our last lecture, we returned to the concept of .hi[causality]. -- We worked through the *Rubin causal model*, in which we defined - `\(\color{#e64173}{y_{1i}:}\)` the outcome for individual `\(i\)` if she had received treatment -- - `\(\color{#6A5ACD}{y_{0i}:}\)` the outcome for individual `\(i\)` if she had not received treatment -- and we referred to individuals who did not receive treatment as *control*. -- ***If*** we were able to know both `\(\color{#e64173}{y_{1i}}\)` ***and*** `\(\color{#6A5ACD}{y_{0i}}\)`, we could calculate the causal effect of treatment for individual `\(i\)`, _i.e._, -- $$ `\begin{align} \tau_i = \color{#e64173}{y_{1i}} - \color{#6A5ACD}{y_{0i}} \end{align}` $$ --- .hi-slate[Fundamental problem of causal inference:] <br>We cannot simultaneously know `\(\color{#e64173}{y_{1i}}\)` and `\(\color{#6A5ACD}{y_{0i}}\)`. -- Either we observe individual `\(i\)` in the treatment group, _i.e._, $$ `\begin{align} \tau_i = \color{#e64173}{y_{1i}} - \color{#6A5ACD}{?} \end{align}` $$ -- or we observe `\(i\)` in the control group, _i.e._, $$ `\begin{align} \tau_i = \color{#e64173}{?} - \color{#6A5ACD}{y_{0i}} \end{align}` $$ -- but never both at the same time. --- If we want to know `\(\tau_i\)` (or at least `\(\overline{\tau}\)`), what can we do? -- **Idea:** Estimate the .hi-slate[average treatment effect] as the difference between the average outcomes in the treatment group and the control group, _i.e._, $$ `\begin{align} \color{#e64173}{\mathop{Avg}\left( y_i\mid D_i = 1 \right)} - \color{#6A5ACD}{\mathop{Avg}\left( y_i\mid D_i =0 \right)} \end{align}` $$ where `\(D_i=1\)` if `\(i\)` received treatment, and `\(D_i=0\)` if `\(i\)` is in the control group. --- **Result:** We showed that even when the treatment effect is constant (meaning `\(\tau_i=\tau\)` for all `\(i\)`), $$ `\begin{align} &\color{#e64173}{\mathop{Avg}\left( y_i\mid D_i = 1 \right)} - \color{#6A5ACD}{\mathop{Avg}\left( y_i\mid D_i =0 \right)} \\ &\quad\quad = \tau + \underbrace{\color{#e64173}{\mathop{Avg}\left(\color{#6A5ACD}{y_{0,i}} \mid D_i = 1 \right)} - \color{#6A5ACD}{\mathop{Avg}\left( y_{0,i}\mid D_i =0 \right)}}_{\color{#FFA500}{\text{Selection bias}}} \end{align}` $$ which says that the difference in the groups' means will give us a **biased estimate** for the causal effect of treatment .hi-orange[if we have selection bias.] --- **Q:** What is this .hi-orange[selection bias]? -- **A:** **(Informal)** We have selection bias when our control group doesn't offer a good comparison for our treatment group. -- Specifically, the control group doesn't give us a good .hi-orange[counterfactual] for .orange[what our treatment group would have looked like if the members had not received treatment.] -- Basically, the groups are different. -- **A:** **(Formal)** The .pink[average *untreated* outcome for a member of our **treatment group**] (which we cannot observe) differs from the .purple[average *untreated* outcome for a member of our **control group**], _i.e._, $$ `\begin{align} \color{#e64173}{\mathop{Avg}\left(\color{#6A5ACD}{y_{0,i}} \mid D_i = 1 \right)} - \color{#6A5ACD}{\mathop{Avg}\left( y_{0,i}\mid D_i =0 \right)} \end{align}` $$ --- .hi-slate[Practical problem:] Selection bias is also difficult to observe $$ `\begin{align} \underbrace{\color{#e64173}{\mathop{Avg}\left(\color{#6A5ACD}{y_{0,i}} \mid D_i = 1 \right)}}_{\color{#e64173}{\text{Unobservable}}} - \color{#6A5ACD}{\mathop{Avg}\left( y_{0,i}\mid D_i =0 \right)} \end{align}` $$ (back to the *fundamental problem of causal inference*) -- .hi-slate[Bigger problem:] If selection bias is present, our estimate for `\(\tau\)` is biased, preventing us from understanding the causal effect of treatment. -- Sounds a bit like omitted-variable bias, right? -- Our .pink[treatment] variable is correlated with something that makese the two groups different. --- **Example:** Imagine we have two people—Al and Bri—and a single binary treatment, college. We interested in the effect of college on earnings. .pull-left[.center[ .pink[Earn.sub[1,Al]] = .pink[$60K] <br>.purple[Earn.sub[0,Al]] = .purple[$30K] ]] .pull-right[ .pink[Earn.sub[1,Bri]] = .pink[$140K] <br>.purple[Earn.sub[0,Bri]] = .purple[$110K] ] -- They both have the same treatment effect (return to college) -- <br> `\(\quad\quad\tau\)`.sub[Al] = .pink[Earn.sub[1,Al]] - .purple[Earn.sub[0,Al]] = .pink[$60K] - .purple[$30K] = $30K -- <br> `\(\quad\quad\tau\)`.sub[Bri] = .pink[Earn.sub[1,Bri]] - .purple[Earn.sub[0,Bri]] = .pink[$140K] - .purple[$100K] = $30K -- but any real-world estimate would have serious selection issues since .purple[Earn.sub[0,Al]] ≠ .purple[Earn.sub[0,Bri].] --- count: false **Example:** Imagine we have two people—Al and Bri—and a single binary treatment, college. We interested in the effect of college on earnings. .pull-left[.center[ .pink[Earn.sub[1,Al]] = .pink[$60K] <br>.purple[Earn.sub[0,Al]] = .purple[$30K] ]] .pull-right[ .pink[Earn.sub[1,Bri]] = .pink[$140K] <br>.purple[Earn.sub[0,Bri]] = .purple[$110K] ] The selection bias... -- If Bri attended college (D.sub[Bri]=1) and Al did not (D.sub[Al]=0): <br> `\(\quad\quad\hat{\tau}\)` = .pink[Earn.sub[1,Bri]] - .purple[Earn.sub[0,Al]] = .pink[$140K] - .purple[$30K] = $110K -- If Al attended college (D.sub[Al]=1) and Bri did not (D.sub[Bri]=0): <br> `\(\quad\quad\hat{\tau}\)` = .pink[Earn.sub[1,Al]] - .purple[Earn.sub[0,Bri]] = .pink[$60K] - .purple[$110K] = -$50K --- We have (at least) two problems... 1. Selection bias is difficult to observe 2. If selection bias is present, our estimate for `\(\tau\)` is biased, preventing us from understanding the causal effect of treatment. -- .hi-slate[Solution:] Eliminate/minimize selection bias. -- - .hi-slate[Option 1:] .hi-pink[Distribute treatment] in a way such that the treatment and control groups are essentially identical -- (experiments). -- - .hi-slate[Option 2:] .hi-purple[Build a control] group that *matches* the treatment group -- <br>(life with observational data). --- layout: true # Instrumental variables --- class: inverse, middle --- name: intro ## Intro .hi[Instrumental variables (IV)] is one route econometricians often take toward estimating the causal effect of a treatment/program. -- *Recall:* .hi-orange[Selection bias] means our .pink[treatment] and .purple[control] groups differ on some unobserved/omitted dimension. (.hi-slate[Endogeneity]) -- .hi-pink[Instrumental variables] attempts to separate out - the .hi-slate[exogenous] part of `\(x\)`, which gives us unbiased estimates - the .hi-slate[endogenous] part of `\(x\)`, which biases our results -- If we use only the exogenous (*good*) variation in `\(x\)`, then we can avoid selection bias/omitted-variable bias. --- ## Introductory example *Example:* If we want to estimate the effect of veteran status on earnings, $$ `\begin{align} \text{Earnings}_i = \beta_0 + \beta_1 \text{Veteran}_i + u_i \tag{1} \end{align}` $$ -- We would love to calculate `\(\color{#e64173}{\text{Earnings}_{1i}} - \color{#6A5ACD}{\text{Earnings}_{0i}}\)`, but we can't. -- And OLS will likely be biased for `\((1)\)` due to selection/omitted-variable bias. --- ## Introductory example Imagine that we can split veteran status into an exogenous part and an endogenous part... -- $$ `\begin{align} \text{Earnings}_i &= \beta_0 + \beta_1 \text{Veteran}_i + u_i \tag{1} \\ &= \beta_0 + \beta_1 \left(\text{Veteran}_i^{\text{Exog.}} + \text{Veteran}_i^{\text{Endog.}}\right) + u_i \\ &= \beta_0 + \beta_1 \text{Veteran}_i^{\text{Exog.}} + \underbrace{\beta_1 \text{Veteran}_i^{\text{Endog.}} + u_i}_{w_i} \\ &= \beta_0 + \beta_1 \text{Veteran}_i^{\text{Exog.}} + w_i \end{align}` $$ -- We could use this exogenous variation in veteran status to consistently estimate `\(\beta_1\)`. -- **Q:** What would exogenous variation in veteran status mean? --- ## Introductory example **Q:** What would exogenous variation in veteran status mean? -- **A.sub[1]:** Choices to enlist in the military that are essentially random—or at least uncorrelated with omitted variables and the disturbance. -- **A.sub[2]:** .hi-orange[No selection bias:] $$ `\begin{align} \color{#e64173}{\mathop{Avg}\left(\text{Earnings}_{0i}\mid\text{Veteran}_i = 1\right)} - \color{#6A5ACD}{\mathop{Avg}\left( \text{Earnings}_{0i} \mid \text{Veteran}_i = 0 \right)} = 0 \end{align}` $$ --- name: instruments ## Instruments **Q:** How do we isolate this *exogenous variation* in our explanatory variable? -- <br>**A:** Find an instrument (an instrumental variable). -- **Q:** What's an instrument? -- <br>**A:** An .hi-pink[instrument] is a variable that is 1. **correlated** with the **explanatory variable** of interest (.hi[relevant]), 2. **uncorrelated** with the **disturbance** (.hi[exogenous]). --- ## Instruments >**Q:** What's an instrument? ><br>**A:** An .hi-pink[instrument] is a variable that is > >1. **correlated** with the **explanatory variable** of interest (.hi[relevant]), >2. **uncorrelated** with the **disturbance** (.hi[exogenous]). -- So if we want an instrument `\(z_i\)` for endogenous veteran status in $$ `\begin{align} \text{Earnings}_i = \beta_0 + \beta_1 \text{Veteran}_i + u_i \end{align}` $$ 1. .hi[Relevant:] `\(\mathop{\text{Cov}} \left( \text{Veteran}_i,\, z_i \right) \neq 0\)` 2. .hi[Exogenous:] `\(\mathop{\text{Cov}} \left( z_i,\, u_i \right) = 0\)` --- name: relevant ## Instruments: Relevance .hi[Relevance:] We need the instrument to cause a change in (correlate with) our endogenous explanatory variable. We can actually test this requirement using regression and a *t* test. -- ***Example:*** For the .pink[veteran status], consider three potential instruments: .pull-left[1\. Social security number<br>.white[blank]] -- .pull-right[.hi-slate[Probably not relevant]<br>uncorrelated with military service] -- .pull-left[2\. Physical fitness<br>.white[blank]] -- .pull-right[.hi-pink[Potentially relevant]<br>service may correlate with fitness] -- .pull-left[3\. Vietnam War draft] -- .pull-right[.hi-pink[Relevant]<br>being draw led to service] --- name: exogenous ## Instruments: Exogeneity .hi[Exogeneity:] The instrument to be independent of omitted factors that affect our outcome variable—as good as randomly assigned. `\(z_i\)` must be uncorrelated with our disturbance `\(u_i\)`. .hi[Not testable.] -- ***Example:*** For the .pink[veteran status], consider three potential instruments: .pull-left[1\. Social security number<br>.white[blank]] -- .pull-right[.hi-pink[Exogenous]<br>Indep. of other factors of service] -- .pull-left[2\. Physical fitness<br>.white[blank]] -- .pull-right[.hi-slate[Not exogenous]<br>fitness correlates with many things] -- .pull-left[3\. Vietnam War draft] -- .pull-right[.hi-pink[Exogenous]<br>the lottery was random] --- ## Instrumental review Let's recap... - Our instrument must be .hi[correlated with our endogenous variable]. - Our instrument must be .hi[uncorrelated with any other variable that affects the outcome]. -- .hi-slate[In other words:] <br>The instrument only affects our outcome through the endogenous variable. --- ## Back to our example For .pink[veteran status] we considered three potential instruments: -- .pull-left[1\. Social security number<br>.white[blank]<br>.white[blank]] .pull-right[.hi-slate[Not relevant]<br>.hi-pink[Exogenous]<br>.white[blank]] -- .pull-left[2\. Physical fitness<br>.white[blank]<br>.white[blank]] .pull-right[.hi-pink[Probably relevant]<br>.hi-slate[Not exogenous]<br>.white[blank]] -- .pull-left[3\. Vietnam War draft<br>.white[blank]<br>.white[blank]] .pull-right[.hi-pink[Relevant]<br>.hi-pink[Exogenous]<br>.white[blank]] -- Thus, only the Vietnam War's draft lottery appears to be a .hi[*valid* instrument]. --- layout: false class: clear, middle If we have a *valid* instrument (_e.g._, the draft lottery), how do we use it? --- layout: true # Instrumental variables ## Estimation --- name: iv_estimation *Recall:* We want to estimate the effect of veteran status on earnings. $$ `\begin{align} \color{#FFA500}{\text{Earnings}_i} = \beta_0 + \beta_1 \color{#6A5ACD}{\text{Veteran}_i} + u_i \end{align}` $$ -- Let's consider two related effects: -- 1. The effect of the .hi-pink[instrument] on the .hi-purple[endogenous variable], _e.g._, $$ `\begin{align} \color{#6A5ACD}{\text{Veteran}_i} = \gamma_0 + \gamma_1 \color{#e64173}{\text{Draft}_i} + v_i \end{align}` $$ -- 1. The effect of the .hi-pink[instrument] on the .hi-orange[outcome variable], _e.g._, $$ `\begin{align} \color{#FFA500}{\text{Earnings}_i} = \pi_0 + \pi_1 \color{#e64173}{\text{Draft}_i} + w_i \end{align}` $$ --- *Recall:* We want to estimate the effect of veteran status on earnings. $$ `\begin{align} \color{#FFA500}{\text{Earnings}_i} = \beta_0 + \beta_1 \color{#6A5ACD}{\text{Veteran}_i} + u_i \end{align}` $$ and we know that the draft affected veteran status. -- .center[ .hi-pink[Draft] ⟶ .hi-purple[Veteran status] ⟶ .hi-orange[Earnings] ] Using our assumptions on independence and exogeneity: (Effect of .hi-pink[the draft] on .hi-orange[earnings]) = <br> (Effect of .hi-pink[the draft] on .hi-purple[veteran status])× <br> (Effect of .hi-purple[veteran status] on .hi-orange[earnings]) --- We just wrote out an expression for the effect of .hi-pink[the draft] on .hi-orange[earnings], _i.e._, (Effect of .hi-pink[the draft] on .hi-orange[earnings]) = <br> (Effect of .hi-pink[the draft] on .hi-purple[veteran status])× <br> (Effect of .hi-purple[veteran status] on .hi-orange[earnings]) -- but we want to know the effect of .hi-purple[veteran status] on .hi-orange[earnings]. -- Rearrange! -- (Effect of .hi-purple[veteran status] on .hi-orange[earnings]) = <br> .top[(Effect of .hi-pink[the draft] on .hi-orange[earnings])] <br> .bottom[(Effect of .hi-pink[the draft] on .hi-purple[veteran status])] -- Our .hi-pink[instrument] consistently estimates both parts of this fraction! --- layout: true # Instrumental variables ## Estimation: Bring it all together --- By estimating two regressions involving our .hi-pink[instrument], 1. The effect of the .hi-pink[instrument] on the .hi-purple[endogenous variable], _e.g._, $$ `\begin{align} \color{#6A5ACD}{\text{Veteran}_i} = \gamma_0 + \gamma_1 \color{#e64173}{\text{Draft}_i} + v_i \end{align}` $$ 1. The effect of the .hi-pink[instrument] on the .hi-orange[outcome variable], _e.g._, $$ `\begin{align} \color{#FFA500}{\text{Earnings}_i} = \pi_0 + \pi_1 \color{#e64173}{\text{Draft}_i} + w_i \end{align}` $$ -- we can estimate our desired effect: -- (Effect of .hi-purple[veteran status] on .hi-orange[earnings]) = <br> .top[(Effect of .hi-pink[the draft] on .hi-orange[earnings])] <br> .bottom[(Effect of .hi-pink[the draft] on .hi-purple[veteran status])] --- count: false By estimating two regressions involving our .hi-pink[instrument], 1. The effect of the .hi-pink[instrument] on the .hi-purple[endogenous variable], _e.g._, $$ `\begin{align} \color{#6A5ACD}{\text{Veteran}_i} = \gamma_0 + \gamma_1 \color{#e64173}{\text{Draft}_i} + v_i \end{align}` $$ 1. The effect of the .hi-pink[instrument] on the .hi-orange[outcome variable], _e.g._, $$ `\begin{align} \color{#FFA500}{\text{Earnings}_i} = \pi_0 + \pi_1 \color{#e64173}{\text{Draft}_i} + w_i \end{align}` $$ we can estimate our desired effect: (Effect of .hi-purple[veteran status] on .hi-orange[earnings]) = `\(\dfrac{\pi_1}{\gamma_1}\)` --- So with instrumental variables, we estimate `\(\beta_1\)` using $$ `\begin{align} \hat{\beta}_1^\text{IV} = \dfrac{\hat{\pi}_1}{\hat{\gamma}_1} \end{align}` $$ where `\(\hat{\pi}_1\)` and `\(\hat{\gamma}_1\)` come from the two equations we just discussed. -- **Q:** Can we trust `\(\hat{\beta}_1^\text{IV}\)`? -- <br>**A:** Yes... **if we have a valid instrument.** -- $$ `\begin{align} \mathop{\text{plim}}\left( \hat{\beta}_1^\text{IV} \right) = \beta_1 + \dfrac{\mathop{\text{Cov}} \left( \color{#e64173}{\text{Instrument}},\, u \right)}{\mathop{\text{Cov}} \left( \color{#e64173}{\text{Instrument}},\, \color{#6A5ACD}{\text{Endog. variable}} \right)} \end{align}` $$ -- which equals `\(\beta_1\)` as long as our instrument is .hi-pink[exogenous] (numerator) and .hi-purple[relevant] (denominator). --- layout: false class:clear name: venn <img src="11_instrumental_variables_files/figure-html/venn_iv-1.svg" style="display: block; margin: auto;" /> --- class: clear count: false <img src="11_instrumental_variables_files/figure-html/venn_iv_endog-1.svg" style="display: block; margin: auto;" /> --- class: clear count: false <img src="11_instrumental_variables_files/figure-html/venn_iv_irrelevant-1.svg" style="display: block; margin: auto;" /> --- class: clear count: false <img src="11_instrumental_variables_files/figure-html/venn_iv_endog2-1.svg" style="display: block; margin: auto;" /> --- layout: false # Venn diagram explanation In these figures (Venn diagrams) - Each circle illustrates a variable. - Overlap gives the share of correlatation between two variables. - Dotted borders denote *omitted* variables. Take-aways - Figure 1: .hi-pink[Valid instrument] (relevant; exogenous) - Figure 2: .hi-slate[Invalid instrument] (relevant; not exogenous) - Figure 3: .hi-slate[Invalid instrument] (not relevant; not exogenous) - Figure 4: .hi-slate[Invalid instrument] (relevant; not exogenous) --- layout: false class: clear, middle Let's work an example in .mono[R]. --- layout: true # Instrumental variables ## Example in .mono[R] --- name: r_example Back to our age-old battle to estimate the returns to education. ``` #> # A tibble: 722 x 4 #> wage education education_dad education_mom #> <int> <int> <int> <int> #> 1 769 12 8 8 #> 2 808 18 14 14 #> 3 825 14 14 14 #> 4 650 12 12 12 #> 5 562 11 11 6 #> 6 600 10 8 8 #> 7 1154 15 5 14 #> 8 1000 12 11 12 #> 9 930 18 14 13 #> 10 900 15 12 12 #> # … with 712 more rows ``` --- OLS for the returns to education with will likely (definitely) be biased... $$ `\begin{align} \color{#FFA500}{\text{Wage}_i} = \beta_0 + \beta_1 \color{#6A5ACD}{\text{Education}}_i + u_i \end{align}` $$ -- .hi-slate[(Likely biased) OLS results:] <table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Term </th> <th style="text-align:right;"> Est. </th> <th style="text-align:right;"> S.E. </th> <th style="text-align:right;"> t stat. </th> <th style="text-align:left;"> p-Value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;background-color: white;"> Intercept </td> <td style="text-align:right;background-color: white;"> 176.504 </td> <td style="text-align:right;background-color: white;"> 89.152 </td> <td style="text-align:right;background-color: white;"> 1.98 </td> <td style="text-align:left;background-color: white;"> 0.0481 </td> </tr> <tr> <td style="text-align:left;background-color: white;font-weight: bold;color: #6A5ACD;"> Education </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #6A5ACD;"> 58.594 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #6A5ACD;"> 6.439 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #6A5ACD;"> 9.10 </td> <td style="text-align:left;background-color: white;font-weight: bold;color: #6A5ACD;"> <0.0001 </td> </tr> </tbody> </table> -- but what if mother's education provides a valid instrument? --- We can check/test the *relevance* of .hi-pink[mother's education] for .hi-purple[education]. -- This regression is known as the .hi-slate[*first stage*:] <br> The effect of the .pink[instrument] on our .purple[endogenous explanatory variable]. $$ `\begin{align} \color{#6A5ACD}{\text{Education}_i} = \gamma_0 + \gamma_1 \color{#e64173}{\left( \text{Mother's Education} \right)_i} + v_i \end{align}` $$ -- .hi-slate[First-stage results:] <table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Term </th> <th style="text-align:right;"> Est. </th> <th style="text-align:right;"> S.E. </th> <th style="text-align:right;"> t stat. </th> <th style="text-align:left;"> p-Value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;background-color: white;"> Intercept </td> <td style="text-align:right;background-color: white;"> 10.487 </td> <td style="text-align:right;background-color: white;"> 0.306 </td> <td style="text-align:right;background-color: white;"> 34.32 </td> <td style="text-align:left;background-color: white;"> <0.0001 </td> </tr> <tr> <td style="text-align:left;background-color: white;font-weight: bold;color: #e64173;"> Mother's Education </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #e64173;"> 0.294 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #e64173;"> 0.027 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #e64173;"> 10.75 </td> <td style="text-align:left;background-color: white;font-weight: bold;color: #e64173;"> <0.0001 </td> </tr> </tbody> </table> -- The *p*-value suggests a very strong relationship (very *relevant*). --- layout: false # Instrumental variables ## Visualizing the first stage <img src="11_instrumental_variables_files/figure-html/first_stage_plot-1.svg" style="display: block; margin: auto;" /> --- count: false # Instrumental variables ## Visualizing the first stage <img src="11_instrumental_variables_files/figure-html/first_stage_plot2-1.svg" style="display: block; margin: auto;" /> --- # Instrumental variables ## Exogeneity **Q:** What does .hi[exogeneity] mean in this case? -- <br>**A:** We need two things 1. .pink[Mother's education (our instrument)] must only affect earnings through .purple[education (our endogenous explanatory variable)]. 2. .pink[Mother's education] must be uncorrelated with other factors that affect .orange[wages (our outcome variable)]. -- We want to be able to compare two people (*A* and *B*) whose mothers have different levels of education and say that the only differences between the two people (*A* and *B*) are due to their mothers' educational levels. -- **Q:** Does *mother's education* seem likely to satisfy exogeneity? --- layout: true # Instrumental variables ## Example in .mono[R] --- Now, let's estimate the .hi-turquoise[*reduced form*]: <br> The effect of the .pink[instrument] on our .orange[outcome variable]. $$ `\begin{align} \color{#FFA500}{\text{Wage}_i} = \pi_0 + \pi_1 \color{#e64173}{\left( \text{Mother's Education} \right)_i} + w_i \end{align}` $$ -- .hi-turquoise[Reduced-form results:] <table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Term </th> <th style="text-align:right;"> Est. </th> <th style="text-align:right;"> S.E. </th> <th style="text-align:right;"> t stat. </th> <th style="text-align:left;"> p-Value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;background-color: white;"> Intercept </td> <td style="text-align:right;background-color: white;"> 633.34 </td> <td style="text-align:right;background-color: white;"> 58.58 </td> <td style="text-align:right;background-color: white;"> 10.81 </td> <td style="text-align:left;background-color: white;"> <0.0001 </td> </tr> <tr> <td style="text-align:left;background-color: white;font-weight: bold;color: #e64173;"> Mother's Education </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #e64173;"> 31.81 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #e64173;"> 5.24 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #e64173;"> 6.07 </td> <td style="text-align:left;background-color: white;font-weight: bold;color: #e64173;"> <0.0001 </td> </tr> </tbody> </table> -- **Q.sub[1]:** How do we interpret this estimated coefficient `\(\left( \hat{\pi}_1 \right)\)`? -- <br>**Q.sub[2]:** If our instrument is *valid*, can we interpret these estimates as .hi[causal]? --- So what is our IV-based estimate for the returns to education? $$ `\begin{align} \color{#FFA500}{\text{Wage}_i} = \beta_0 + \beta_1 \color{#6A5ACD}{\text{Education}}_i + u_i \end{align}` $$ -- We know that the IV estimate for `\(\beta_1\)` is $$ `\begin{align} \hat{\beta}_1^\text{IV} = \dfrac{\color{#20B2AA}{\hat{\pi}_1}}{\color{#314f4f}{\hat{\gamma}_1}} \end{align}` $$ -- 1. In the .hi-turquoise[reduced-form equation], we estimated `\(\color{#20B2AA}{\hat{\pi}_1 \approx 31.81}\)`. 2. In the .hi-slate[first-stage equation], we estimated `\(\color{#314f4f}{\hat{\gamma}_1 \approx 0.294}\)`. -- $$ `\begin{align} \implies\hat{\beta}_1^\text{IV} = \dfrac{\color{#20B2AA}{\hat{\pi}_1}}{\color{#314f4f}{\hat{\gamma}_1}} = \dfrac{\color{#20B2AA}{31.81}}{\color{#314f4f}{0.294}} \approx 108.2 \end{align}` $$ --- **Alternative:** Use the function `iv_robust()` from the `estimatr` package. This new function `iv_robust` works very similar to our good friend `lm`: `iv_robust(y ~ x | z, data = dataset)` - `formula` Specify the regression followed by `|` and your instrument (`z`). - `data` You still need a dataset. ***Note:*** As you might guess by its name, `iv_robust` calculates heteroskedasticity-robust standard errors by default. --- In practice... ```r # Estimate our IV regression iv_est <- iv_robust(wage ~ education | education_mom, data = wage_df) ``` <table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Term </th> <th style="text-align:right;"> Est. </th> <th style="text-align:right;"> S.E. </th> <th style="text-align:right;"> t stat. </th> <th style="text-align:left;"> p-Value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;background-color: white;"> Intercept </td> <td style="text-align:right;background-color: white;"> -501.474 </td> <td style="text-align:right;background-color: white;"> 226.476 </td> <td style="text-align:right;background-color: white;"> -2.21 </td> <td style="text-align:left;background-color: white;"> 0.0271 </td> </tr> <tr> <td style="text-align:left;background-color: white;font-weight: bold;color: #6A5ACD;"> Education </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #6A5ACD;"> 108.214 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #6A5ACD;"> 16.810 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #6A5ACD;"> 6.44 </td> <td style="text-align:left;background-color: white;font-weight: bold;color: #6A5ACD;"> <0.0001 </td> </tr> </tbody> </table> --- layout: true # Instrumental variables ## More --- So now we know how to "do" instrumental variables -- *when we have one endogenous variable and one exogenous variable.* -- 1. Estimate the reduced form (regress .orange[outcome var.] on .pink[instrument]). 2. Estimate the first stage (regress .purple[expl. var.] on .pink[instrument]). 3. Calculate the IV estimate using the estimates from (1) and (2). Our magical .pink[instrument] isolates the exogenous variation in our .purple[endogenous variable]. -- **Q:** What if we want more? -- (_E.g._, more instruments or endog. variables) -- <br>**A:** Too bad. --- count: false So now we know how to "do" instrumental variables *when we have one endogenous variable and one exogenous variable.* 1. Estimate the reduced form (regress .orange[outcome var.] on .pink[instrument]). 2. Estimate the first stage (regress .purple[expl. var.] on .pink[instrument]). 3. Calculate the IV estimate using the estimates from (1) and (2). Our magical .pink[instrument] isolates the exogenous variation in our .purple[endogenous variable]. **Q:** What if we want more? (_E.g._, more instruments or endog. variables) <br>**A:** .st[Too bad.] Extend IV to .hi[two-stage least squares (2SLS)]. --- layout: false class: inverse, middle # Two-stage least squares --- layout: true # Two-stage least squares ## Intro --- name: 2sls_intro The intuition and insights of IV carry over into two-stage least squares. -- **Plus:** The *first stage* that we've been discussing is actually the *first* of the *two stages* in two-stage least squares. -- $$ `\begin{align} {\color{#c5c5c5}{\text{Endogenous model}}}& &\color{#FFA500}{\text{Outcome}_i} &= \beta_0 + \beta_1 \color{#6A5ACD}{\left( \text{Endog. var.} \right)_i} + u_i\\[0.5em] {\text{First stage}}& &\color{#6A5ACD}{\left( \text{Endog. var.} \right)_i} &= \pi_0 + \pi_1 \color{#e64173}{\text{Instrument}_i} + v_i\\[0.25em] {\text{Second stage}}& &\color{#FFA500}{\text{Outcome}_i} &= \delta_0 + \delta_1 \color{#6A5ACD}{\widehat{\left( \text{Endog. var.} \right)}_i} + \varepsilon_i\\[0.5em] {\color{#c5c5c5}{\text{Reduced form}}}& &\color{#FFA500}{\text{Outcome}_i} &= \pi_0 + \pi_1 \color{#e64173}{\text{Instrument}_i} + w_i\\[0.25em] \end{align}` $$ where `\(\color{#6A5ACD}{\widehat{\left( \text{Endog. var.} \right)}_i}\)` denotes the predicted values (*fitted values*) from the first-stage regression. --- Two-stage least squares is very flexible—we include other controls, additional endogenous variables, *and* have multiple instruments. **But** don't get too distracted by this fancy flexiblity. We still need .hi[valid] instruments. --- layout: true # Two-stage least squares ## In .mono[R] --- name: 2sls_r Back to our *returns to education* example. $$ `\begin{align} \color{#FFA500}{\text{Wage}_i} = \beta_0 + \beta_1 \color{#6A5ACD}{\text{Education}}_i + u_i \end{align}` $$ Imagine that mother's *and* father's education are both valid instruments. Then our .hi-slate[first-stage regression] is $$ `\begin{align} \color{#6A5ACD}{\text{Education}}_i = \gamma_0 + \gamma_1 \color{#e64173}{\left( \text{Mother's education} \right)}_i + \gamma_2 \color{#e64173}{\left( \text{Father's education} \right)}_i + v_i \end{align}` $$ which we can estimate via OLS. -- **Q:** Why? --- $$ `\begin{align} \color{#6A5ACD}{\text{Education}}_i = \gamma_0 + \gamma_1 \color{#e64173}{\left( \text{Mother's education} \right)}_i + \gamma_2 \color{#e64173}{\left( \text{Father's education} \right)}_i + v_i \end{align}` $$ ```r stage1 <- lm(education ~ education_mom + education_dad, wage_df) ``` .hi-slate[First-stage results:] <table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Term </th> <th style="text-align:right;"> Est. </th> <th style="text-align:right;"> S.E. </th> <th style="text-align:right;"> t stat. </th> <th style="text-align:left;"> p-Value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;background-color: white;"> Intercept </td> <td style="text-align:right;background-color: white;"> 9.845 </td> <td style="text-align:right;background-color: white;"> 0.305 </td> <td style="text-align:right;background-color: white;"> 32.31 </td> <td style="text-align:left;background-color: white;"> <0.0001 </td> </tr> <tr> <td style="text-align:left;background-color: white;font-weight: bold;color: #e64173;"> Mother's Education </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #e64173;"> 0.149 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #e64173;"> 0.032 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #e64173;"> 4.62 </td> <td style="text-align:left;background-color: white;font-weight: bold;color: #e64173;"> <0.0001 </td> </tr> <tr> <td style="text-align:left;background-color: white;font-weight: bold;color: #e64173;"> Father's Education </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #e64173;"> 0.216 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #e64173;"> 0.028 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #e64173;"> 7.84 </td> <td style="text-align:left;background-color: white;font-weight: bold;color: #e64173;"> <0.0001 </td> </tr> </tbody> </table> -- Our instruments each appear to be *relevant*. -- <br>Formally, we should jointly test them (_e.g._, `\(F\)` test). --- Using our .slate[estimated first stage], we grab the *fitted* .purple[endogenous variable] $$ `\begin{align} \color{#6A5ACD}{\widehat{\text{Education}}}_i = \widehat{\gamma}_0 + \widehat{\gamma}_1 \color{#e64173}{\left( \text{Mother's education} \right)}_i + \widehat{\gamma}_2 \color{#e64173}{\left( \text{Father's education} \right)}_i \end{align}` $$ -- ```r # Add fitted values from first stage wage_df$education_hat <- stage1$fitted.values ``` -- Now we use OLS (again) to estimate the .hi-green[second-stage regression] $$ `\begin{align} \color{#FFA500}{\text{Wage}_i} = \delta_0 + \delta_1 \color{#6A5ACD}{\widehat{\text{Education}}}_i + \varepsilon_i \end{align}` $$ --- $$ `\begin{align} \color{#FFA500}{\text{Wage}_i} = \delta_0 + \delta_1 \color{#6A5ACD}{\widehat{\text{Education}}}_i + \varepsilon_i \end{align}` $$ ```r stage2 <- lm(wage ~ education_hat, wage_df) ``` .hi-green[Second-stage results:] <table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Term </th> <th style="text-align:right;"> Est. </th> <th style="text-align:right;"> S.E. </th> <th style="text-align:right;"> t stat. </th> <th style="text-align:left;"> p-Value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;background-color: white;"> Intercept </td> <td style="text-align:right;background-color: white;"> -454.683 </td> <td style="text-align:right;background-color: white;"> 198.149 </td> <td style="text-align:right;background-color: white;"> -2.29 </td> <td style="text-align:left;background-color: white;"> 0.022 </td> </tr> <tr> <td style="text-align:left;background-color: white;font-weight: bold;color: #6A5ACD;"> Fitted Education </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #6A5ACD;"> 104.789 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #6A5ACD;"> 14.462 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #6A5ACD;"> 7.25 </td> <td style="text-align:left;background-color: white;font-weight: bold;color: #6A5ACD;"> <0.0001 </td> </tr> </tbody> </table> --- layout: false class: clear .purple[Ordinary least squares] <table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Term </th> <th style="text-align:right;"> Est. </th> <th style="text-align:right;"> S.E. </th> <th style="text-align:right;"> t stat. </th> <th style="text-align:left;"> p-Value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;background-color: white;"> Intercept </td> <td style="text-align:right;background-color: white;"> 176.504 </td> <td style="text-align:right;background-color: white;"> 89.152 </td> <td style="text-align:right;background-color: white;"> 1.98 </td> <td style="text-align:left;background-color: white;"> 0.0481 </td> </tr> <tr> <td style="text-align:left;background-color: white;font-weight: bold;color: #6A5ACD;"> Education </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #6A5ACD;"> 58.594 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #6A5ACD;"> 6.439 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #6A5ACD;"> 9.10 </td> <td style="text-align:left;background-color: white;font-weight: bold;color: #6A5ACD;"> <0.0001 </td> </tr> </tbody> </table> <br>.slate[Instrumental variables] <table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Term </th> <th style="text-align:right;"> Est. </th> <th style="text-align:right;"> S.E. </th> <th style="text-align:right;"> t stat. </th> <th style="text-align:left;"> p-Value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;background-color: white;"> Intercept </td> <td style="text-align:right;background-color: white;"> -501.474 </td> <td style="text-align:right;background-color: white;"> 226.476 </td> <td style="text-align:right;background-color: white;"> -2.21 </td> <td style="text-align:left;background-color: white;"> 0.0271 </td> </tr> <tr> <td style="text-align:left;background-color: white;font-weight: bold;color: darkslategrey;"> Education </td> <td style="text-align:right;background-color: white;font-weight: bold;color: darkslategrey;"> 108.214 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: darkslategrey;"> 16.810 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: darkslategrey;"> 6.44 </td> <td style="text-align:left;background-color: white;font-weight: bold;color: darkslategrey;"> <0.0001 </td> </tr> </tbody> </table> <br>.green[Two-stage least squares w/ two instruments] <table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Term </th> <th style="text-align:right;"> Est. </th> <th style="text-align:right;"> S.E. </th> <th style="text-align:right;"> t stat. </th> <th style="text-align:left;"> p-Value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;background-color: white;"> Intercept </td> <td style="text-align:right;background-color: white;"> -454.683 </td> <td style="text-align:right;background-color: white;"> 198.149 </td> <td style="text-align:right;background-color: white;"> -2.29 </td> <td style="text-align:left;background-color: white;"> 0.022 </td> </tr> <tr> <td style="text-align:left;background-color: white;font-weight: bold;color: #8bb174;"> Education </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #8bb174;"> 104.789 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #8bb174;"> 14.462 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #8bb174;"> 7.25 </td> <td style="text-align:left;background-color: white;font-weight: bold;color: #8bb174;"> <0.0001 </td> </tr> </tbody> </table> --- layout: false # Two-stage least squares ## In .mono[R] As you probably guessed, .mono[R] will do both of the stages for you. -- `iv_robust(y ~ x1 + x2 + ⋯ | z1 + z2 + ⋯, data)` -- In our case, we have - one explanatory variable (`x`) (.purple[education]) - two instruments (`z`) (.pink[parents' educations]) ```r iv_robust(wage ~ education | education_mom + education_dad, data = wage_df) ``` <table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Term </th> <th style="text-align:right;"> Est. </th> <th style="text-align:right;"> S.E. </th> <th style="text-align:right;"> t stat. </th> <th style="text-align:left;"> p-Value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;background-color: white;"> Intercept </td> <td style="text-align:right;background-color: white;"> -454.683 </td> <td style="text-align:right;background-color: white;"> 199.946 </td> <td style="text-align:right;background-color: white;"> -2.27 </td> <td style="text-align:left;background-color: white;"> 0.0233 </td> </tr> <tr> <td style="text-align:left;background-color: white;font-weight: bold;color: #6A5ACD;"> Education, fitted </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #6A5ACD;"> 104.789 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #6A5ACD;"> 14.852 </td> <td style="text-align:right;background-color: white;font-weight: bold;color: #6A5ACD;"> 7.06 </td> <td style="text-align:left;background-color: white;font-weight: bold;color: #6A5ACD;"> <0.0001 </td> </tr> </tbody> </table> --- name: more # Two-stage least squares ## There's more! Because 2SLS .hi[isolates exogenous variation in an endogenous variable], we apply it in other settings that are biased from *endogenous* relationships. -- .hi[Common applications] - **General causal inference** for observational data (as we've seen). - **Experiments:** Randomize a treatment that affects an endog. variable. - **Measurement error:** Regress noisy `\(x_1\)` on noisy `\(x_2\)` to capture signal. - **Simultaneous relationships** (_e.g._, `\(p\)` and `\(q\)` from supply and demand). However, in any 2SLS/IV setting, you need to mind the requirements for .hi[valid instruments]—.pink[exogeneity] and .pink[relevance]. --- layout: false # Table of contents .pull-left[ ### Admin .smallest[ 1. [Schedule](#schedule) 1. [Masters program](#masters) 1. [Causality review](#causality_review) ] ] .pull-right[ ### Instrumental variables .smallest[ 1. [Introduction](#intro) 1. [What is an instrument?](#instruments) - [Relevant](#relevant) - [Exogenous](#exogenous) 1. [IV Estimation](#iv_estimation) 1. [Venn diagrams](#venn) 1. [Example in .mono[R]](#r_example) 1. [Two-stage least squares](#2sls_intro) - [Introduction](#2sls_intro) - [Back to .mono[R]](#2sls_r) 1. [More applications](#more) ] ] --- exclude: true