class: center, middle, inverse, title-slide

.title[
# ECON 4050: Introduction to Econometrics
]
.subtitle[
## Instrumental Variables and Applications
]
.author[
### Adam Soliman, PhD
]
.date[
### Clemson University
]

---

## Today

* We will introduce *instrumental variables* (IV).

* To motivate IV, we will look back to London in 1850 and learn about John Snow.

* Finally, we will introduce the IV estimator formally.

---

# Setting the Scene

.pull-left[
* In chapters [7](https://scpoecon.github.io/ScPoEconometrics/causality.html), [8](https://scpoecon.github.io/ScPoEconometrics/STAR.html) and [9](https://scpoecon.github.io/ScPoEconometrics/RDD.html) of the book (and the intro course) we talk about the merits of _experimental methods_.

* Randomized Controlled Trials (RCTs) or _quasi-experimental_ (as good as random) settings allow us to estimate **causal** effects.

* In particular the [RCT](https://scpoecon.github.io/ScPoEconometrics/causality.html#rct) should be familiar to you.
]

--

.pull-right[
* If people have some control over whether they get treated, there will be *selection*.

* RCTs break the self-selection of people into treatment by assigning treatment randomly.

* So with experimental data, we have a good solution.

* What about non-experimental data?
]

---

# Non-Experimental Data

.pull-left[
* We talked about **omitted variable bias** (OVB).

* What if there is correlation between a variable in the error term `\(u\)`, `\(x_2\)` say, and our explanatory variable `\(x_1\)`?

* We will obtain biased estimates because we cannot separate the effect of `\(x_1\)` from that of `\(x_2\)`.

* Remember that this can be so severe that we don't even get the correct sign of an effect.
]

.pull-right[
<img src="04-IV_files/figure-html/dag1-1.svg" style="display: block; margin: auto;" />
]

--

.center[**IV** provides a solution to OVB.]
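---

# OVB in a Simulation

The claim that omitted variable bias can flip the sign of an estimate can be checked on simulated data. The sketch below is only an illustration: the variable names and all coefficient values are assumptions chosen so that the omitted variable `x2` is negatively correlated with `x1` and has a large positive effect on `y`.

```r
# Simulated data; names and coefficient values are illustrative assumptions.
set.seed(1)
n  <- 10000
x2 <- rnorm(n)                            # omitted variable, ends up in the error term
x1 <- -0.8 * x2 + rnorm(n)                # regressor, negatively correlated with x2
y  <- 1 + 0.5 * x1 + 2 * x2 + rnorm(n)    # true effect of x1 is +0.5

# "short" regression omits x2, "long" regression controls for it
short <- unname(coef(lm(y ~ x1))["x1"])
long  <- unname(coef(lm(y ~ x1 + x2))["x1"])

c(short = short, long = long)
```

In this setup the short regression should return a negative coefficient on `x1` even though the true effect is positive, while the long regression should land close to 0.5.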
---
background-image: url(../img/Kensington_slums_large.jpg)
background-size: cover
class: middle

# <span style="color: #FFFFFF;">Welcome to London in 1850</span>
## <span style="color: #FFFFFF;">(Slum in Kensington)</span>

---

# John Snow's (Non) Experiment: Cholera Hits the Town

.pull-left[
* John Snow was a physician in London around 1850, when cholera broke out several times in the city.

* There was a dispute at the time about how the disease was transmitted: via air or via water?
]

.pull-right[
<img src="../img/father-thames.jpg" width="2667" style="display: block; margin: auto;" />
]

---
background-image: url(../img/slum.jpg)
background-size: cover

---
background-image: url(../img/slum.jpg)
background-size: cover

.center[
<br>
<br>
<br>
## <span style="color: #FFFFFF;">In 1850:</span>
### <span style="color: #FFFFFF;">It was unknown that germs can cause disease.</span>
### <span style="color: #FFFFFF;">Microscopes existed, but worked at rather poor resolution.</span>
### <span style="color: #FFFFFF;">Most human pathogens are not visible to the naked eye.</span>
### <span style="color: #FFFFFF;">The so-called *infection theory* (i.e. infection via *germs*) had some supporters,</span>
### <span style="color: #FFFFFF;">but the dominant idea was that disease, in general, results from [*miasmas*](https://en.wikipedia.org/wiki/Miasma_theory)</span>
]

---
background-image: url(https://media.giphy.com/media/3s4jGZP9UapxROVVN9/giphy.gif)
background-size: 800px

# Let's Go Watch a Movie

<br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br>

.center[May the force be with you. (click [here!](https://youtu.be/lNjrAXGRda4?si=Oj-MONvr9Exp-2xo))]

---

# Snow's Detective Work

.pull-left[
* Snow collected a lot of data.

* He first mapped the locations of the dead during the 1854 outbreak.

* This was the notorious *Broad Street Pump Outbreak*.
]

.pull-right[
<img src="../img/snow-map.jpg" width="1024" style="display: block; margin: auto;" />
]

---

# The `cholera` package

.pull-left[
* The `cholera` package has some interesting features.

* For example, an R version of Snow's map:
]

.pull-right[
``` r
cholera::snowMap()
```

<img src="04-IV_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />
]

---

# `cholera`

.pull-left[
...or the walking path of case number 15 in Snow's data:

<img src="04-IV_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />
]

--

.pull-right[
...or estimated Voronoi polygons for pump neighborhoods:

<img src="04-IV_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />
]

---

# Removal of the Broad Street Pump?

.pull-left[
* Snow identified the Broad Street Pump as the culprit.

* He pleaded to have its handle removed, but he was skeptical that this was why the epidemic ended.
]

.pull-right[
<img src="04-IV_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />
]

---

# Mapping London's Water Supply

* Water supply came from the River Thames.

* Different supply companies had different intake points.

* The Southwark and Vauxhall water companies took in water downstream of a major sewage discharge.

* Lambeth did not.

---

# Snow's conclusion

* Snow collected the following data:

|area                   | numhouses| deaths| deaths per 10,000 houses|
|:----------------------|---------:|------:|------------------------:|
|Southwark and Vauxhall |     40046|   1263|                      315|
|Lambeth                |     26107|     98|                       37|
|Rest of London         |    256423|   1422|                       59|

* And concluded

>that if Southwark and Vauxhall water companies had moved their water intakes upstream to where Lambeth water was taking in their supply, roughly 1,000 lives could have been saved.

* For proponents of the miasma theory, this was still not evidence enough, because there were also many factors that led to poor air quality in those areas.
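---

# Checking Snow's Arithmetic

The death rates and the "roughly 1,000 lives" figure can be recomputed from the house and death counts in Snow's table. The sketch below is illustrative: the data frame `snow` and the names `rate10k`, `counterfactual` and `saved` are our own labels, and the counterfactual simply assumes that houses supplied by Southwark and Vauxhall would have experienced Lambeth's death rate. The recomputed, rounded rates need not match every entry of the printed rate column exactly.

```r
# House and death counts copied from Snow's table
snow <- data.frame(
  area      = c("Southwark and Vauxhall", "Lambeth", "Rest of London"),
  numhouses = c(40046, 26107, 256423),
  deaths    = c(1263, 98, 1422)
)

# deaths per 10,000 houses
snow$rate10k <- round(10000 * snow$deaths / snow$numhouses)
snow

# Counterfactual (assumption): Southwark and Vauxhall houses face Lambeth's rate
lambeth_rate   <- snow$deaths[2] / snow$numhouses[2]
counterfactual <- snow$numhouses[1] * lambeth_rate
saved          <- snow$deaths[1] - counterfactual
saved
```

Under that assumption, `saved` is the number of deaths that would have been avoided among Southwark and Vauxhall customers, which is one way to read the "roughly 1,000 lives" statement.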
---
layout: false
class: separator, middle

# We Need A Model, because...

## *It takes a model to beat a model*, which is attributed to Thomas J. Sargent.

---

# Snow's Model of Cholera Transmission

* Suppose that `\(c_i\)` takes the value 1 if individual `\(i\)` dies of cholera, and 0 otherwise.

* Let `\(w_i = 1\)` mean that `\(i\)`'s water supply is impure and `\(w_i = 0\)` that it is pure. Water purity is assessed with a technology that cannot detect small microbes.

* Collect in `\(u_i\)` all unobservable factors that impact `\(i\)`'s likelihood of dying from the disease: whether `\(i\)` is poor, where exactly they reside, whether there is bad air quality in `\(i\)`'s surroundings, and other individual characteristics which impact the outcome (like the genetic makeup of `\(i\)`).

--

We can write:

$$ c_i = \alpha + \delta w_i + u_i $$

---

# Doing the Simple Thing is always right?

.pull-left[
* John Snow could have used his data to assess the correlation between drinking impure water and cholera incidence.

* Measure `\(Cor(c_i,w_i)\)`

* Suppose `\(Cor(c_i,w_i) \approx 0.5\)`. Does that prove the infection theory?
]

--

.pull-right[
Not quite. Angus Deaton says:

> The people who drank impure water were also more likely to be poor, and to live in an environment contaminated in many ways, not least by the ‘poison miasmas’ that were then thought to be the cause of cholera.
]

---

# The Simple Thing

* It does not make sense to compare someone who drinks pure water with someone who drinks impure water because *all else is not equal*: impure water is correlated with being poor, living in a bad area, bad air quality and so on - all factors that we encounter in `\(u_i\)`.

* This violates the crucial orthogonality assumption for valid OLS estimates, which in this context reads `\(E[u_i | w_i]=0\)`.

* Another way to say this is that `\(Cov(w_i, u_i) \neq 0\)`, implying that `\(w_i\)` is *endogenous*.

* There are factors in `\(u_i\)` that affect both `\(w_i\)` and `\(c_i\)`.

---

# Snow's Model and Some Algebra

Remember our simple model:

`$$c_i = \alpha + \delta w_i + u_i$$`

Now let's condition on both values of `\(w\)`:

`\begin{align}
E[c_i | w_i = 1] &= \alpha + \delta + E[u_i | w_i = 1] \\
E[c_i | w_i = 0] &= \alpha \phantom{{}+ \delta} + E[u_i | w_i = 0]
\end{align}`

--

Now subtract one line from the other:

`\begin{equation}
E[c_i | w_i = 1] - E[c_i | w_i = 0] = \delta + \left\{ E[u_i | w_i = 1] - E[u_i | w_i = 0]\right\}
\end{equation}`

* The last term `\(\left\{ E[u_i | w_i = 1] - E[u_i | w_i = 0]\right\}\)` is not equal to zero (by Deaton's argument!)

* A regression estimate for `\(\delta\)` would be biased by that quantity.

---
layout: false
class: separator, middle

# The IV Estimator

---

# John Snow Says

> [...] the mixing of the supply is of the most intimate kind. The pipes of each Company go down all the streets, and into nearly all the courts and alleys. [...] The experiment, too, is on the grandest scale. No fewer than three hundred thousand people of both sexes, of every age and occupation, and of every rank and station, from gentlefolks down to the very poor, were divided into two groups without their choice, and in most cases, without their knowledge; one group supplied with water containing the sewage of London, and amongst it, whatever might have come from the cholera patients, the other group having water quite free from such impurity.
---
background-image: url("../img/snow-supply.jpg")
background-size: cover

# London Water Supply

---

# Proposing an IV

* Snow is proposing an **instrumental variable** `\(z_i\)`, the *identity of the water supplying company* to household `\(i\)`. More formally, let's define the instrument as follows:

`\begin{align*}
z_i &= \begin{cases}
1 & \text{if water supplied by Lambeth} \\
0 & \text{if water supplied by Southwark or Vauxhall.} \\
\end{cases} \\
\end{align*}`

* `\(z_i\)` is highly correlated with the water purity `\(w_i\)`.

* However, it seems to be uncorrelated with all the other factors in `\(u_i\)` which worried us before: water supply was decided years before, and now houses on the same street have different suppliers!

---
background-image: url(../img/IV-dag.png)
background-position: 60% 50%

# Simple IV in a DAG

* `\(u\)` affects both outcome and explanatory variable

---

# Defining Snow's IV Formally

Here are the conditions for a valid instrument:

1. **Relevance** or **First Stage**: Water purity is indeed a function of supplier identity. We want that `$$E[w_i | z_i = 1] \neq E[w_i | z_i = 0]$$` i.e. the average water purity differs across suppliers. We can *verify* this condition with observational data. We want this effect to be reliably causal.

--

2. **Independence**: Whether a household has `\(z_i = 1\)` or `\(z_i = 0\)` is unrelated to `\(u\)`, hence *as good as random*. Conditioning on the value of `\(z\)` does not change the expected value of `\(u\)` - we want `$$E[u_i | z_i = 1] = E[u_i | z_i = 0].$$`

--

3. **Excludability**: The instrument should affect the outcome `\(c\)` *only* through the specified channel (i.e. via water purity `\(w\)`), and through nothing else.

---

# Defining the IV Estimator

We are now ready to define a simple IV estimator. As before, let's condition on the values of `\(z\)`:

`\begin{align}
E[c_i | z_i = 1] &= \alpha + \delta E[w_i | z_i = 1] + E[u_i | z_i = 1] \\
E[c_i | z_i = 0] &= \alpha + \delta E[w_i | z_i = 0] + E[u_i | z_i = 0]
\end{align}`

which upon differencing both lines gives

`\begin{align}
E[c_i | z_i = 1] - E[c_i | z_i = 0] &= \delta \left\{ E[w_i | z_i = 1] - E[w_i | z_i = 0]\right\} \\
&+ \underbrace{\left\{ E[u_i | z_i = 1] - E[u_i | z_i = 0] \right\}}_{=0 \text{ by the Independence Assumption}}
\end{align}`

--

* Finally, if the IV is *relevant*, i.e. `\(E[w_i | z_i = 1] - E[w_i | z_i = 0] \neq 0\)`, we can divide by that difference to obtain:

`\begin{equation}
\delta = \frac{E[c_i | z_i = 1] - E[c_i | z_i = 0]}{E[w_i | z_i = 1] - E[w_i | z_i = 0]} (\#eq:IV)
\end{equation}`

---

# Special Case: Wald Estimator

Let's say that `\(x \mapsto y\)` means that `\(x\)` is an estimate for `\(y\)`:

1. `\(\overline{c}_1 \mapsto E[c_i | z_i = 1]\)`: the proportion of households supplied by Lambeth with cholera.
1. `\(\overline{w}_1 \mapsto E[w_i | z_i = 1]\)`: the proportion of households supplied by Lambeth with bad water.
1. `\(\overline{c}_0 \mapsto E[c_i | z_i = 0]\)`: the proportion of households not supplied by Lambeth with cholera.
1. `\(\overline{w}_0 \mapsto E[w_i | z_i = 0]\)`: the proportion of households not supplied by Lambeth with bad water.

The estimator would then be

`\begin{equation}
\hat{\delta} = \frac{\overline{c}_1 - \overline{c}_0}{\overline{w}_1 - \overline{w}_0}
\end{equation}`

In this special case where all involved variables `\(c,w,z\)` are binary, the estimator is called the *Wald estimator*.

---

**Summary**: IVs are a powerful tool to establish causality in contexts where we only have observational data and are concerned that the conditional mean assumption `\(E[u_i | x_i]=0\)` is violated, so that we cannot say *all else equal, as `\(x\)` changes, `\(y\)` changes like this and that*. We then say that `\(x\)` is *endogenous*. The key features of an IV `\(z\)` are that

1. `\(z\)` is *relevant* for `\(x\)`.
For example, in a simple regression of `\(x\)` on `\(z\)` (the first stage), we want `\(z\)` to have considerable predictive power. We can *test* this condition in data.

2. We need a theory according to which it is *reasonable* to assume that `\(z\)` is *unrelated* to other unobservable factors that might impact the outcome. Hence, `\(z\)` is *exogenous* to `\(u\)`, or `\(E[u | z] = 0\)`. This is an **assumption** (i.e. we cannot test this with data).

---

# Applications

* We now look at more IV applications.

* We introduce an extension called *Two Stage Least Squares*.

* We will use `R` to compute the estimates.

* Finally we'll talk about *weak* instruments.

---
layout: false
class: separator, middle

# Prologue

## One cool `R` package a day, keeps the doctor away...

---
layout: false

# [`Rayshader`](https://www.rayshader.com)

* Can *raytrace* all kinds of elevation data.

* Can *also* make most `ggplot`s come to 3D-life!

<video width="90%" height="40%" controls id="my_video">
  <source src="https://www.tylermw.com/wp-content/uploads/2019/05/deathappended2.mp4" type="video/mp4">
</video>

---

# [`Rayshader`](https://www.rayshader.com)

* This is all it takes:

``` r
library(ggplot2)
library(rayshader)
library(readr)    # read_csv()
library(viridis)  # scale_fill_viridis()

# Data from Social Security administration
death = read_csv("https://www.tylermw.com/data/death.csv", skip = 1)
meltdeath = reshape2::melt(death, id.vars = "Year")
meltdeath$age = as.numeric(meltdeath$variable)

# make a ggplot
deathgg = ggplot(meltdeath) +
  geom_raster(aes(x=Year,y=age,fill=value)) +
  scale_x_continuous("Year",expand=c(0,0),breaks=seq(1900,2010,10)) +
  scale_y_continuous("Age",expand=c(0,0),breaks=seq(0,100,10),limits=c(0,100)) +
  scale_fill_viridis("Death\nProbability\nPer Year", trans = "log10",breaks=c(1,0.1,0.01,0.001,0.0001), labels = c("1","1/10","1/100","1/1000","1/10000")) +
  ggtitle("Death Probability vs Age and Year for the USA") +
  labs(caption = "Data Source: US Dept. of Social Security")

# give it to rayshader
plot_gg(deathgg, multicore=TRUE,height=5,width=6,scale=500)
```

* Amazing, right? 🎉 🎊

---
layout: false
class: separator, middle

# Back to school!

---

# Returns To Schooling

.pull-left[
* What's the causal impact of schooling on earnings?

* [Jacob Mincer](https://en.wikipedia.org/wiki/Jacob_Mincer) was interested in this important question.

* Here's his model:

$$ \log Y_i = \alpha + \rho S_i + \beta_1 X_i + \beta_2 X_i^2 + e_i, $$

where Y is income, S is years of schooling, and X is years of potential labor market experience.
]

.pull-right[
<img src="04-IV_files/figure-html/mincer-1.svg" style="display: block; margin: auto;" />
]

---

# Returns To Schooling

.pull-left[
$$ \log Y_i = \alpha + \rho S_i + \beta_1 X_i + \beta_2 X_i^2 + e_i $$

* He found an estimate for `\(\rho\)` of about 0.11,

* i.e. roughly an 11% earnings advantage for each additional year of education.

* Look at the DAG. Is that a good model? Well, why would it not be?
]

.pull-right[
<img src="04-IV_files/figure-html/mincer2-1.svg" style="display: block; margin: auto;" />
]

---

# Ability Bias

.pull-left[
* We compare earnings of men with certain schooling and work experience.

* Is all else equal, after controlling for those?

* Given `\(X\)`,
    * Can we find differently diligent workers out there?
    * Can we find differently able workers?
    * Do family connections of workers vary?
]

--

.pull-right[
* Yes, of course. So, *all else* is not equal at all.

* That's an issue, because for OLS consistency we require the orthogonality assumption (exogeneity) `$$E[e_i | S_i, X_i] = 0$$` which fails if such omitted factors are related to schooling.

* More formally, ability bias arises when more able individuals both obtain more schooling and earn higher wages. This creates an endogeneity problem if ability is unobserved and omitted, as it can bias estimates of the true effect of schooling on wages.

* Therefore, let's introduce **ability** `\(A_i\)` explicitly.
]

---

# Mincer with Unobserved Ability

.pull-left[
* In fact we have *two* unobservables: `\(e\)` and `\(A\)`.

* Of course we can't tell them apart.

* So we define a new unobservable factor `$$u_i = e_i + A_i$$`
]

--

.pull-right[
<img src="04-IV_files/figure-html/mincer3-1.svg" style="display: block; margin: auto;" />
]

---

# Mincer with Unobserved Ability

.pull-left[
* In terms of an equation: `$$\log Y_i = \alpha + \rho S_i + \beta_1 X_i + \beta_2 X_i^2 + \underbrace{u_i}_{A_i + e_i}$$`

* Sometimes, this does not matter, and the OLS bias is small.

* But sometimes it does, and we get it totally wrong! An example follows.
]

.pull-right[
<img src="04-IV_files/figure-html/mincer4-1.svg" style="display: block; margin: auto;" />
]

---

# Angrist and Krueger (1991): Birthdate is as good as Random

.pull-left[
* Angrist and Krueger (AK91) is an influential study addressing ability bias.

* Idea:
    1. Construct an IV that encodes the *birth date of the student*.
    1. A child born just after the cutoff date will start school later!

* Suppose all children who reach the age of 6 by 31 December 2021 are required to enroll in the first grade of school in September of that year (2021).
]

--

.pull-right[
* If born on 31/12/2015, they will be about 5 years and 8 months old by the time they start school (in September 2021).

* If born on 01/01/2016, they will be about 6 years and 8 months old when *they* enter school (in September 2022).

* However, people can legally drop out of school on their 16th birthday!

* So, among people who drop out, some got more schooling than others.

* AK91 construct a *quarter of birth* dummy as IV: it affects schooling, but is not related to `\(A\)`!
]

---

# Birthdate Setup

Let's set up 2 children:

``` r
library(lubridate)
born1 = as.Date("2015-12-31")
born2 = as.Date("2016-01-01")
school1 = as.Date("2021-09-01")
school2 = as.Date("2022-09-01")
```

How many days of school do they get if they drop out on their 16th birthday?

---

# Birthdate Setup

Let's set up 2 children:

``` r
library(lubridate)
born1 = as.Date("2015-12-31")
born2 = as.Date("2016-01-01")
school1 = as.Date("2021-09-01")
school2 = as.Date("2022-09-01")
```

How many days of school do they get if they drop out on their 16th birthday?

``` r
dropout1 = born1 %m+% years(16)
dropout2 = born2 %m+% years(16)
(schooldays1 = dropout1 - school1)
```

```
## Time difference of 3773 days
```

``` r
(schooldays2 = dropout2 - school2)
```

```
## Time difference of 3409 days
```

---

# AK91 IV setup

.pull-left[
* *quarter of birth* dummy `\(z\)`: affects schooling, but not related to `\(A\)`!

* In particular: whether born in the 4th quarter or not.
]

.pull-right[
<img src="04-IV_files/figure-html/ak-mod-1.svg" style="display: block; margin: auto;" />
]

---

# AK91 Estimation: Two Stage Least Squares (2SLS)

AK91 allow us to introduce a widely used variation of our simple IV estimator: **2SLS**

1. We estimate a **first stage model** which uses only exogenous variables (like `\(z\)`) to explain our endogenous regressor `\(s\)`.
2. We then use the first stage model to *predict* values of `\(s\)`, and use these predictions in place of `\(s\)` in what is called the **second stage** model.

Performing this procedure is supposed to take out any impact of `\(A\)` in the correlation we observe in our data between `\(s\)` and `\(y\)`.

`\begin{align}
\text{1st Stage: }s_i &= \alpha_0 + \alpha_1 z_i + \eta_i \\
\text{2nd Stage: }y_i &= \beta_0 + \beta_1 \hat{s}_i + u_i
\end{align}`

**Conditions:**

1. Relevance of the IV: `\(\alpha_1 \neq 0\)` (the only testable one)
1. Independence (IV assignment as good as random): `\(E[\eta | z] = 0\)`
    * Assumption. Argue based on contextual evidence or plausibility of instrument’s randomness
1. Exogeneity (our exclusion restriction): `\(E[u | z] = 0\)`
    * Assumption.
Argue using theory/sound argument, asserting that any correlation between instrument and dependent variable is solely through instrument’s effect on schooling, not other pathways

---
layout: false
class: separator, middle

# Let's do Angrist and Krueger (1991)!

---

# Data on birth quarter and wages and quick transformation

Let's load the data and look at a quick summary

``` r
library(dplyr)         # mutate(), group_by(), summarise()
library(modelsummary)  # datasummary_skim()

data("ak91", package = "masteringmetrics")
datasummary_skim(data.frame(ak91),histogram = TRUE)
```

### AK91 Data Transformations

* We want to create the `q4` dummy which equals 1 if you are born in the 4th quarter.

* Create `factor` versions of quarter and year of birth.

``` r
ak91 <- mutate(ak91,
               qob_fct = factor(qob),
               q4 = as.integer(qob == "4"),
               yob_fct = factor(yob))

# get mean log wage and schooling by year/quarter
ak91_age <- ak91 %>%
  group_by(qob, yob) %>%
  summarise(lnw = mean(lnw), s = mean(s)) %>%
  mutate(q4 = (qob == 4))
```

---

# AK91 Figure 1: First Stage!

Let's now reproduce AK91's first figure, which shows education as a function of quarter of birth!

``` r
ggplot(ak91_age, aes(x = yob + (qob - 1) / 4, y = s )) +
  geom_line() +
  geom_label(mapping = aes(label = qob, color = q4)) +
  guides(color = "none") +
  scale_x_continuous("Year of birth", breaks = 1930:1940) +
  scale_y_continuous("Years of Education", breaks = seq(12.2, 13.2, by = 0.2),
                     limits = c(12.2, 13.2)) +
  theme_bw()
```

---

# AK91 Figure 1: First Stage!

.left-thin[
1. The numbers label mean education *by* quarter of birth groups.

1. The 4th quarters **did** get more education in most years!

1. There is a general trend.
]

.right-wide[
<img src="04-IV_files/figure-html/ak91-dummy-1.svg" style="display: block; margin: auto;" />
]

---

# AK91 Figure 2: Impact of IV on outcome

What about earnings for those groups?

``` r
ggplot(ak91_age, aes(x = yob + (qob - 1) / 4, y = lnw)) +
  geom_line() +
  geom_label(mapping = aes(label = qob, color = q4)) +
  scale_x_continuous("Year of birth", breaks = 1930:1940) +
  scale_y_continuous("Log weekly wages") +
  guides(color = "none") +
  theme_bw()
```

---

# AK91 Figure 2: Impact of IV on outcome

.left-thin[
1. The 4th quarters are among the high earners by birth year.

1. In general, weekly wages seem to decline somewhat over time.
]

.right-wide[
<img src="04-IV_files/figure-html/ak91-wage-1.svg" style="display: block; margin: auto;" />
]

---

# Running IV estimation in `R`

<br>
<br>

.pull-left[
* Several options (as always with `R`! 😉)

* We will use the [`iv_robust`](https://declaredesign.org/r/estimatr/reference/iv_robust.html) function from the `estimatr` package.

* *Robust*? It computes standard errors that are robust to heteroskedasticity. [Details here.](https://declaredesign.org/r/estimatr/articles/mathematical-notes.html)
]

.pull-right[
``` r
library(estimatr)
# create a list of models
mod <- list()

# standard (biased!) OLS
mod$ols <- lm(lnw ~ s, data = ak91)

# IV: born in q4 is TRUE?
# doing IV manually in 2 stages.
mod[["1. stage"]] <- lm(s ~ q4, data = ak91)
ak91$shat <- predict(mod[["1. stage"]])
mod[["2. stage"]] <- lm(lnw ~ shat, data = ak91)

# run 2SLS
# doing IV all in one go
# notice the formula!
# formula = y ~ x | z
mod$`2SLS` <- iv_robust(lnw ~ s | q4, data = ak91, diagnostics = TRUE)
```
]

---
count: false

# Running IV estimation in `R`

<br>
<br>

.pull-left[
* Several options (as always with `R`! 😉)

* We will use the [`iv_robust`](https://declaredesign.org/r/estimatr/reference/iv_robust.html) function from the `estimatr` package.

* *Robust*? It computes standard errors that are robust to heteroskedasticity. [Details here.](https://declaredesign.org/r/estimatr/articles/mathematical-notes.html)

* Notice the `predict` to get `\(\hat{s}\)`.
]

.pull-right[
``` r
library(estimatr)
# create a list of models
mod <- list()

# standard (biased!) OLS
mod$ols <- lm(lnw ~ s, data = ak91)

# IV: born in q4 is TRUE?
# doing IV manually in 2 stages.
mod[["1. stage"]] <- lm(s ~ q4, data = ak91)
*ak91$shat <- predict(mod[["1. stage"]])
mod[["2. stage"]] <- lm(lnw ~ shat, data = ak91)

# run 2SLS
# doing IV all in one go
# notice the formula!
# formula = y ~ x | z
mod$`2SLS` <- iv_robust(lnw ~ s | q4, data = ak91, diagnostics = TRUE)
```
]

---

# AK91 Results Table

.pull-left[
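|             | ols      | 1. stage  | 2. stage | 2SLS             |
|:------------|---------:|----------:|---------:|-----------------:|
| (Intercept) | 4.995*** | 12.747*** | 4.955*** | 4.955***         |
|             | (0.004)  | (0.007)   | (0.381)  | (0.358)          |
| s           | 0.071*** |           |          | 0.074**          |
|             | (0.000)  |           |          | (0.028)          |
| q4          |          | 0.092***  |          |                  |
|             |          | (0.013)   |          |                  |
| shat        |          |           | 0.074*   |                  |
|             |          |           | (0.030)  |                  |
| R2          | 0.117    | 0.000     | 0.000    | 0.117            |
| RMSE        | 0.64     | 3.28      | 0.68     | 0.64             |
| 1. Stage F: |          |           |          | 48.9904279657299 |

Note: + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001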
]

.pull-right[
* OLS is likely downward biased (measurement error in schooling).

* First stage: the IV `q4` is statistically significant, but its effect is small: being born in Q4 is associated with 0.092 more years of education. `\(R^2\)` is 0%! But the F-stat is large. 😅

* The second stage has the same point estimate as `2SLS` but a different standard error (the one from the manual 2. stage is wrong).
]

---

# Remember the F-Statistic?

* We encountered this before: it's useful to test restricted vs unrestricted models against each other.

* Here, we are interested in whether our instruments are *jointly* significant. Of course, with only one IV, that's not more informative than the t-stat of that IV.

* This F-stat compares the predictive power of the first stage with and without the IVs. If they have very similar predictive power, the F-stat will be low, and we will not be able to reject the H0 that our IVs are **jointly insignificant** in the first stage model. 😞

---

# Additional Control Variables

* We saw a clear time trend in education earlier.

* There are also business-cycle fluctuations in earnings.

* We should somehow control for different time periods.

* Also, we can use more than one IV! Here is how:

## In R...

``` r
# we keep adding to our `mod` list:
mod$ols_yr <- update(mod$ols, . ~ . + yob_fct) # previous OLS model

# add exogenous vars on both sides of the `|` !
mod[["2SLS_yr"]] <- estimatr::iv_robust(lnw ~ s + yob_fct | q4 + yob_fct,
                                        data = ak91, diagnostics = TRUE)

# use all quarters as IVs
mod[["2SLS_all"]] <- estimatr::iv_robust(lnw ~ s + yob_fct | qob_fct + yob_fct,
                                         data = ak91, diagnostics = TRUE)
```

---

# Additional Control Variables

.left-wide[
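|               | ols     | 2SLS    | ols_yr  | 2SLS_yr | 2SLS_all     |
|:--------------|--------:|--------:|--------:|--------:|-------------:|
| (Intercept)   | 5.00*** | 4.96*** | 5.02*** | 4.97*** | 4.59***      |
|               | (0.00)  | (0.36)  | (0.01)  | (0.35)  | (0.25)       |
| s             | 0.07*** | 0.07**  | 0.07*** | 0.08**  | 0.11***      |
|               | (0.00)  | (0.03)  | (0.00)  | (0.03)  | (0.02)       |
| R2            | 0.117   | 0.117   | 0.118   | 0.117   | 0.091        |
| RMSE          | 0.64    | 0.64    | 0.64    | 0.64    | 0.65         |
| Instruments   | none    | Q4      | none    | Q4      | All Quarters |
| Year of birth | no      | no      | yes     | yes     | yes          |

Note: + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001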
]

.right-thin[
**Adding year controls** leaves OLS mostly unchanged and slightly increases the 2SLS estimate.

**Using all quarters as IVs** increases the precision of the 2SLS estimate a lot! The point estimate is 10.5% now!
]

---

# AK91: Taking Stock - The Quarter of Birth (QOB) IV

.pull-left[
* This will produce consistent estimates if
    1. The IV predicts the endogenous regressor well.
    2. The IV is as good as random / independent of omitted variables (OVs).
    3. The IV can only impact the outcome through schooling.

* How does the QOB perform along those lines?
]

--

.pull-right[
1. The plot of the first stage and the high F-stat offer compelling evidence for **relevance**. ✅

2. Is QOB **independent** of, say, *maternal characteristics*? Birthdays are not really random - there are birth seasons for certain socioeconomic backgrounds. Mothers with the highest schooling tend to give birth in the second quarter (not in the 4th! ✅).

3. Exclusion: What if the youngest kids (born in Q4!) are the disadvantaged ones early on, which has long-term negative impacts? That would mean `\(E[u|z] \neq 0\)`! Well, with QOB the youngest ones actually do better (more schooling and higher wage)! ✅
]

---
layout: false
class: separator, middle

# Mechanics of IV

## Identification and Inference

---

# IV Identification

Let's go back to our simple linear model:

$$ y = \beta_0 + \beta_1 x + u $$

where we fear that `\(Cov(x,u) \neq 0\)`, i.e. that `\(x\)` is *endogenous*.

## Conditions for IV

1. **first stage** or **relevance**: `\(Cov(z,x) \neq 0\)`
2. **IV exogeneity**: `\(Cov(z,u) = 0\)`: the IV is exogenous in the outcome equation.

---

# Valid Model (A) vs Invalid Model (B) for IV `z`

<img src="04-IV_files/figure-html/IV-dag2-1.svg" style="display: block; margin: auto;" />

---

# IV Identification

.pull-left[
>## Conditions for IV
>1. **first stage** or **relevance**: `\(Cov(z,x) \neq 0\)`
>2. **IV exogeneity**: `\(Cov(z,u) = 0\)`: the IV is exogenous in the outcome equation.
]

.pull-right[
* How does this *identify* `\(\beta_1\)`?

* (How can we express `\(\beta_1\)` in terms of population moments to pin its value down?)
]

---

# IV Identification

`\begin{align}
Cov(z,y) &= Cov(z, \beta_0 + \beta_1 x + u) \\
&= \beta_1 Cov(z,x) + Cov(z,u)
\end{align}`

.pull-left[
Under condition 2. above (**IV exogeneity**), we have `\(Cov(z,u)=0\)`, hence

$$ Cov(z,y) = \beta_1 Cov(z,x) $$
]

--

.pull-right[
and under condition 1. (**relevance**), we have `\(Cov(z,x)\neq0\)`, so that we can divide the equation through to obtain

$$ \beta_1 = \frac{Cov(z,y)}{Cov(z,x)}. $$

* `\(\beta_1\)` is *identified* via population moments `\(Cov(z,y)\)` and `\(Cov(z,x)\)`.

* We can *estimate* those moments via their *sample analogs*.
]

---

# IV Estimator

Just plugging in for the population moments:

`$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (z_i - \bar{z})(y_i - \bar{y})}{\sum_{i=1}^n (z_i - \bar{z})(x_i - \bar{x})}$$`

* The intercept estimate is `\(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\)`

--

* Given both assumptions 1. and 2. are satisfied, we say that *the IV estimator is consistent for `\(\beta_1\)`*. We write

$$ \text{plim}(\hat{\beta}_1) = \beta_1 $$

in words: the *probability limit* of `\(\hat{\beta}_1\)` is the true `\(\beta_1\)`.

* If this is true, we say that this estimator is **consistent**.

---

# IV Inference

Assuming `\(E(u^2|z) = \sigma^2\)`, the variance of the IV slope estimator is

`$$Var(\hat{\beta}_{1,IV}) = \frac{\sigma^2}{n \sigma_x^2 \rho_{x,z}^2}$$`

* `\(\sigma_x^2\)` is the population variance of `\(x\)`,

* `\(\sigma^2\)` the one of `\(u\)`, and

* `\(\rho_{x,z}\)` is the population correlation between `\(x\)` and `\(z\)`.

--

You can see 2 important things here:

1. Without the term `\(\rho_{x,z}^2\)`, this is **like OLS variance**.
2. As sample size `\(n\)` increases, the **variance decreases**.

---

# IV Variance is Always Larger than OLS Variance

* Replace `\(\rho_{x,z}^2\)` with `\(R_{x,z}^2\)`, i.e. the R-squared of a regression of `\(x\)` on `\(z\)`:

`$$Var(\hat{\beta}_{1,IV}) = \frac{\sigma^2}{n \sigma_x^2 R_{x,z}^2}$$`

1. Given `\(R_{x,z}^2 < 1\)` in most real life situations, we have that `\(Var(\hat{\beta}_{1,IV}) > Var(\hat{\beta}_{1,OLS})\)` almost certainly.

--

1. The higher the correlation between `\(z\)` and `\(x\)`, the closer their `\(R_{x,z}^2\)` is to 1. With `\(R_{x,z}^2 = 1\)` we get back to the OLS variance. This is no surprise, because that implies that `\(x\)` is perfectly predicted by `\(z\)` (in effect, `\(z = x\)`). So, if you have a valid, exogenous regressor `\(x\)`, you should *not* perform IV estimation using `\(z\)` to obtain `\(\hat{\beta}\)`, since your variance will be unnecessarily large.

---

# Returns to Education for Married Women

Consider the following model for married women's wages:

$$ \log wage = \beta_0 + \beta_1 educ + u $$

Let's run an OLS regression on this, and then compare it to an IV estimate using *father's education*. Keep in mind that this is a valid IV `\(z\)` if

1. *fatheduc* and *educ* are correlated
2. *fatheduc* and `\(u\)` are not correlated.

---

# Returns to Education for Married Women

``` r
data(mroz,package = "wooldridge")
mods = list()
mods$OLS <- lm(lwage ~ educ, data = mroz)
mods[['First Stage']] <- lm(educ ~ fatheduc, data = subset(mroz, inlf == 1))
mods$IV <- estimatr::iv_robust(lwage ~ educ | fatheduc, data = mroz)
```
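|             | OLS     | First Stage | IV      |
|:------------|:-------:|:-----------:|:-------:|
| (Intercept) | -0.185  | 10.237      | 0.441   |
|             | (0.185) | (0.276)     | (0.467) |
| educ        | 0.109   |             | 0.059   |
|             | (0.014) |             | (0.037) |
| fatheduc    |         | 0.269       |         |
|             |         | (0.029)     |         |
| Num.Obs.    | 428     | 428         | 428     |
| R2          | 0.118   | 0.173       | 0.093   |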
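---

# The IV Estimator by Hand

The IV slope is a ratio of two sample covariances, so it can be computed without any IV package. The following is a minimal sketch in base R on simulated data. The data-generating process, the variable `ability`, and all coefficients are assumptions made up for illustration; this is not the `mroz` sample.

```r
# Simulated data (hypothetical): x is endogenous because it shares the
# unobserved component `ability` with the error term u.
set.seed(4050)
n <- 5000
z <- rnorm(n)                          # instrument: relevant and exogenous by construction
ability <- rnorm(n)                    # unobserved confounder
x <- 1 + 0.8 * z + ability + rnorm(n)  # endogenous regressor
u <- ability + rnorm(n)                # error term, correlated with x through ability
y <- 2 + 0.5 * x + u                   # true beta_1 is 0.5

# OLS slope: Cov(x, y) / Var(x)
b1_ols <- cov(x, y) / var(x)

# IV slope and intercept: sample analogs of the population moments
b1_iv <- cov(z, y) / cov(z, x)
b0_iv <- mean(y) - b1_iv * mean(x)

# Same slope in two stages: regress x on z, then y on the fitted values
x_hat <- fitted(lm(x ~ z))
b1_2sls <- unname(coef(lm(y ~ x_hat))["x_hat"])

round(c(OLS = b1_ols, IV = b1_iv, TwoStage = b1_2sls), 3)
```

By construction the true slope is 0.5. OLS should overshoot because `x` and `u` share `ability`, while the covariance ratio and the two-stage slope coincide. The standard errors reported by the second-stage `lm` are not valid IV standard errors, so a routine such as `estimatr::iv_robust` is still needed for inference.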
---

# IV Standard Errors

<img src="04-IV_files/figure-html/se-plot-1.svg" style="display: block; margin: auto;" />

---

# IV with a Weak Instrument

* IV is consistent under the given assumptions.
* However, *even* a very small `\(Cor(z,u)\)` can wrong-foot us.
* Combined with a small correlation between `\(x\)` and `\(z\)`, it can produce badly **inconsistent** estimates.

.pull-left[
<br>
<br>
$$ \text{plim}(\hat{\beta}_{1,IV}) = \beta_1 + \frac{Cor(z,u)}{Cor(z,x)} \cdot \frac{\sigma_u}{\sigma_x} $$
]

--

.pull-right[
* Suppose `\(Cor(z,u)\)` is very small, but not exactly zero.
* A **weak instrument** is one with only a small absolute value for `\(Cor(z,x)\)`.
* Dividing by a small `\(Cor(z,x)\)` blows up the second term in the probability limit.
* Even with a very big sample size `\(n\)`, our estimator would *not* converge to the true population parameter `\(\beta_1\)`, because we are using a weak instrument.
]

---

# Weak Stuff

To illustrate this point, let's assume we want to look at the impact of the number of packs of cigarettes smoked per day by pregnant women (*packs*) on the birthweight of their child (*bwght*):

$$ \log(bwght) = \beta_0 + \beta_1 packs + u $$

We are worried that smoking behavior is correlated with a range of other health-related variables which are in `\(u\)` and which could impact the birthweight of the child. So we look for an IV. Suppose we use the price of cigarettes (*cigprice*), assuming that the price of cigarettes is uncorrelated with factors in `\(u\)`. Let's run the first stage, regressing *packs* on *cigprice*, and then show the 2SLS estimates:

---

# Weak Stuff

``` r
data(bwght, package = "wooldridge")
mods <- list()
# First stage: packs smoked per day on the cigarette price
mods[["First Stage"]] <- lm(packs ~ cigprice, data = bwght)
# IV: packs instrumented by cigprice; diagnostics = TRUE adds the first-stage F-statistic
mods[["IV"]] <- estimatr::iv_robust(log(bwght) ~ packs | cigprice, data = bwght, diagnostics = TRUE)
```
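|             | First Stage | IV      |
|:------------|:-----------:|:-------:|
| (Intercept) | 0.067       | 4.448   |
|             | (0.103)     | (0.940) |
| cigprice    | 0.000       |         |
|             | (0.001)     |         |
| packs       |             | 2.989   |
|             |             | (8.996) |
| R2          | 0.000       | -23.230 |
| 1. Stage F: |             | 0.121   |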
---

# Weak Stuff

.pull-left[
* The first column shows a very weak first stage. *cigprice* has zero impact on packs, it seems!
* `\(R^2\)` is zero.
* What if we use this IV nevertheless?
]

.pull-right[
* In the second column: a very large, positive(!) impact of packs smoked on birthweight. 🤔
* Huge standard error, though.
* An `\(R^2\)` of -23?!
* F-stat of first stage: 0.121. This corresponds to a p-value of 0.728: we **cannot** reject the H0 of an insignificant first stage here *at all*.
* So: **invalid** approach. ❌
]

---

# Summary of IV conditions

1. **Relevance**: The instrument must be correlated with the endogenous explanatory variable it is instrumenting for. This is often tested using the first-stage regression, where a significant relationship indicates that the instrument has explanatory power over the endogenous variable.
    * This is often written as `\({Cov}(Z, X) \neq 0\)`
2. **Exogeneity**: The instrument must be uncorrelated with the error term in the outcome equation. This means that the instrument should not have a direct effect on the dependent variable except through the endogenous explanatory variable.
    * This condition is usually written as `\(Cov(Z, \varepsilon) = 0\)`
3. **Exclusion Restriction**: The instrument should only affect the outcome through its effect on the endogenous variable, not through any other channels. This condition requires a strong theoretical or empirical justification.
    * Mathematically, this means that, conditional on X, the expected value of Y does not vary with Z, or `\(E[Y | X, Z] = E[Y | X]\)`. This condition is essential for isolating the effect of X on Y and ruling out any direct effect of Z on Y.

---

## Examples of Good Instrumental Variables

1. **Distance to the Nearest College as an Instrument for Education**: Used to study the impact of education on earnings. The idea is that distance affects the likelihood of attending college but isn't directly related to an individual's income, aside from its effect on education.
2. **Weather as an Instrument for Agricultural Output**: For studying the effects of crop yields on income or consumption in farming communities. Weather can influence crop yield independently of the farmer's skill or economic environment, making it an external instrument.
3. **Random Assignment in Lottery Systems**: Often used in studies of educational or housing interventions. For example, in lottery-based school admissions, the lottery outcome can serve as an instrument for studying the effects of attending a particular school on later outcomes.
4. **Military Draft Eligibility as an Instrument for Veteran Status**: Commonly applied in studies examining the effect of military service on earnings or health outcomes. Eligibility based on birth year influences the likelihood of serving but doesn’t directly affect outcomes like income, making it a strong instrument.
5. **Rainfall for Studying the Impact of Agricultural Shocks on Migration or Labor**: Rainfall variability can be used to instrument for economic conditions in agriculture-dependent areas, allowing researchers to study the effect of shocks on migration or labor decisions.
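---

# Weak Instruments: A Simulation Sketch

The weak-instrument formula from the earlier slide can be illustrated with simulated data. This is a minimal sketch in base R under a made-up data-generating process: the function `simulate_iv`, its argument `pi_z` (the first-stage strength), and all coefficients are hypothetical. The instrument is given a small correlation with `u` on purpose, and only the strength of its link to `x` changes between the two runs.

```r
# Hypothetical simulation: z slightly violates exogeneity, and pi_z controls
# how strongly z moves x in the first stage.
simulate_iv <- function(pi_z, n = 1e5, beta1 = 0.5, seed = 1) {
  set.seed(seed)
  z <- rnorm(n)
  u <- 0.05 * z + rnorm(n)            # small, non-zero Cor(z, u)
  x <- pi_z * z + 0.5 * u + rnorm(n)  # x is endogenous through u
  y <- 1 + beta1 * x + u
  b1_iv <- cov(z, y) / cov(z, x)
  # Sample analog of the second term in the probability limit
  bias_term <- cor(z, u) / cor(z, x) * sd(u) / sd(x)
  c(cor_zu = cor(z, u), cor_zx = cor(z, x),
    b1_iv = b1_iv, bias_term = bias_term)
}

strong <- simulate_iv(pi_z = 1)     # instrument strongly related to x
weak   <- simulate_iv(pi_z = 0.05)  # instrument only weakly related to x
round(rbind(strong, weak), 3)
```

In both runs the gap between `b1_iv` and the true slope of 0.5 equals `bias_term`, because the sample moments obey the same algebra as the population moments. The point of the comparison is that the same small `Cor(z,u)` should do far more damage when `Cor(z,x)` is small, and a larger `n` does not remove it.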