We introduced the DiD method.
We looked at the case of minimum wages in New Jersey and Pennsylvania.
We highlighted some drawbacks of the method and learned the underlying assumptions.
We will introduce instrumental variables (IV).
To motivate IV, we will look back to London in 1850 and learn about John Snow.
We will finally introduce the IV estimator formally.
If people have some control over whether they get treated, there will be selection into treatment.
RCTs can break the self-selection of people into treatment by assigning randomly.
So with experimental data, we have a good solution.
What about non-experimental data?
We talked about omitted variable bias.
What if a variable in the error term $u$, say $x_2$, is correlated with our explanatory variable $x_1$?
We will obtain biased estimates because we cannot separate the effect of $x_1$ from the effect of $x_2$.
Remember that this can be so severe that we don't even get the correct sign of an effect.
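The sign reversal can be illustrated with a small simulation. This is only a sketch under assumed numbers: the coefficients (+1 for $x_1$, -3 for $x_2$), the strength of the correlation and the sample size are made up for illustration and do not come from any real data.

```r
set.seed(1)
n <- 10000

# x2 is the omitted variable; x1 is positively correlated with it
x2 <- rnorm(n)
x1 <- 0.8 * x2 + rnorm(n)

# Assumed true model: x1 has a positive effect, x2 a strong negative one
y <- 1 * x1 - 3 * x2 + rnorm(n)

# Short regression: x2 is left in the error term
short <- lm(y ~ x1)
# Long regression: x2 is controlled for
long <- lm(y ~ x1 + x2)

b_short <- unname(coef(short)["x1"])
b_long  <- unname(coef(long)["x1"])
print(c(short = b_short, long = b_long))
```

In this setup the short regression attributes part of the negative effect of $x_2$ to $x_1$, so its slope comes out negative even though the true effect of $x_1$ is positive.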
IV provides a solution to omitted variable bias (OVB).
John Snow was a physician in London around 1850, when cholera broke out several times in the city.
There was a dispute at the time about how the disease is transmitted: via air or via water?
It was not yet known that germs can cause disease.
Microscopes existed, but worked at rather poor resolution.
Most human pathogens are not visible to the naked eye.
The so-called infection theory (i.e. infection via germs) had some supporters,
but the dominant idea was that disease, in general, results from miasmas (bad air).
Snow collected a lot of data.
He first mapped the location of dead during the 1854 outbreak.
This was the notorious Broad Street pump outbreak.
The cholera R package has some interesting features.
For example, an R version of Snow's map:
cholera::snowMap()
...or the walking path of case number 15 in Snow's data:
...or estimated Voronoi polygons for pump neighborhoods:
Snow identified the Broad Street pump as the culprit.
He pleaded to have its handle removed.
The handle was removed, but Snow himself was sceptical that this was the reason the epidemic ended.
London's water supply came from the River Thames.
Different supply companies had different intake points.
The Southwark and Vauxhall water companies took in water downstream of a major sewage discharge.
The Lambeth company did not.
Area | Number of houses | Cholera deaths | Deaths per 10,000 houses |
---|---|---|---|
Southwark and Vauxhall | 40046 | 1263 | 315 |
Lambeth | 26107 | 98 | 37 |
Rest of London | 256423 | 1422 | 59 |
These numbers suggest that if the Southwark and Vauxhall water companies had moved their water intakes upstream to where Lambeth was taking in its supply, roughly 1,000 lives could have been saved.
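The back-of-the-envelope calculation behind that number can be sketched in base R. The house and death counts are copied from the table above; the variable names (`rate10k`, `lives_saved`, ...) are made up for this sketch. Rates recomputed this way need not match the rounded figures in the table exactly.

```r
# Counts taken from the table above
snow <- data.frame(
  area      = c("Southwark and Vauxhall", "Lambeth", "Rest of London"),
  numhouses = c(40046, 26107, 256423),
  deaths    = c(1263, 98, 1422)
)

# Cholera deaths per 10,000 houses, recomputed from the raw counts
snow$rate10k <- 10000 * snow$deaths / snow$numhouses

# Counterfactual: Southwark and Vauxhall houses facing Lambeth's death rate
sv           <- snow[snow$area == "Southwark and Vauxhall", ]
lambeth_rate <- snow$rate10k[snow$area == "Lambeth"]

counterfactual_deaths <- sv$numhouses * lambeth_rate / 10000
lives_saved           <- sv$deaths - counterfactual_deaths

print(snow)
print(round(lives_saved))
```

Under the assumption that the Lambeth rate would have applied to every Southwark and Vauxhall house, the difference is a bit more than 1,100 deaths, which is consistent with the "roughly 1,000" stated above.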
Suppose that $c_i$ takes the value 1 if individual $i$ dies of cholera, and 0 otherwise.
Let $w_i = 1$ mean that $i$'s water supply is impure, and $w_i = 0$ that it is pure. Water purity is assessed with a technology that cannot detect small microbes.
Collect in $u_i$ all unobservable factors that affect $i$'s likelihood of dying from the disease: whether $i$ is poor, where exactly they live, whether the air quality in $i$'s surroundings is bad, and other individual characteristics that affect the outcome (like $i$'s genetic makeup).
We can write:
$$c_i = \alpha + \delta w_i + u_i$$
John Snow could have used his data to assess the correlation between drinking impure water and cholera incidence, i.e. to measure $\text{Cor}(c_i, w_i)$.
Suppose $\text{Cor}(c_i, w_i) \approx 0.5$. Does that prove the infection theory?
Not quite. Angus Deaton says:
The people who drank impure water were also more likely to be poor, and to live in an environment contaminated in many ways, not least by the ‘poison miasmas’ that were then thought to be the cause of cholera.
☹️
It does not make sense to compare someone who drinks pure water with someone who drinks impure water,
because all else is not equal: impure water is correlated with being poor, living in a bad area, bad air quality and so on, all factors that we collected in $u_i$.
This violates the crucial orthogonality assumption for valid OLS estimates, which in this context is $E[u_i | w_i] = 0$.
Another way to say this is that $\text{Cov}(w_i, u_i) \neq 0$, implying that $w_i$ is endogenous.
There are factors in $u_i$ that affect both $w_i$ and $c_i$.
Remember our simple model: $c_i = \alpha + \delta w_i + u_i$. Now let's condition on both values of $w_i$:
$$\begin{aligned} E[c_i | w_i = 1] &= \alpha + \delta + E[u_i | w_i = 1] \\ E[c_i | w_i = 0] &= \alpha + E[u_i | w_i = 0] \end{aligned}$$
Now subtract the second line from the first:
$$E[c_i | w_i = 1] - E[c_i | w_i = 0] = \delta + \{E[u_i | w_i = 1] - E[u_i | w_i = 0]\}$$
The last term $\{E[u_i | w_i = 1] - E[u_i | w_i = 0]\}$ is not equal to zero (by what Deaton said!).
A regression estimate for $\delta$ would be biased by that quantity.
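A small simulation can make this decomposition concrete. It is a sketch with invented numbers: the true effect `delta`, the intercept `alpha`, and the way poverty raises both exposure to impure water and the risk of death are assumptions for illustration, not estimates from Snow's data.

```r
set.seed(2)
n     <- 100000
alpha <- 0.02   # assumed baseline death probability
delta <- 0.10   # assumed true effect of impure water

# Poverty is unobserved: it raises exposure to impure water AND death risk
poor    <- rbinom(n, 1, 0.5)
w       <- rbinom(n, 1, ifelse(poor == 1, 0.8, 0.2))
c_death <- rbinom(n, 1, alpha + delta * w + 0.05 * poor)

# u collects everything in the outcome that is not alpha + delta * w
u <- c_death - alpha - delta * w

# Naive comparison of death rates by water purity
naive     <- mean(c_death[w == 1]) - mean(c_death[w == 0])
# Selection term: difference in the average of u across the two groups
selection <- mean(u[w == 1]) - mean(u[w == 0])

print(c(naive = naive, delta = delta, selection = selection))
```

By construction, `naive` equals `delta + selection` exactly in the sample, and because the poor are over-represented among impure-water drinkers the selection term is positive, so the naive comparison overstates the effect here.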
[...] the mixing of the supply is of the most intimate kind. The pipes of each Company go down all the streets, and into nearly all the courts and alleys. [...] The experiment, too, is on the grandest scale. No fewer than three hundred thousand people of both sexes, of every age and occupation, and of every rank and station, from gentlefolks down to the very poor, were divided into two groups without their choice, and in most cases, without their knowledge; one group supplied with water containing the sewage of London, and amongst it, whatever might have come from the cholera patients, the other group having water quite free from such impurity.
More formally, let's define the instrument as follows:
$$z_i = \begin{cases} 1 & \text{if water supplied by Lambeth} \\ 0 & \text{if water supplied by Southwark or Vauxhall} \end{cases}$$
$z_i$ is highly correlated with the water purity $w_i$.
However, it seems to be uncorrelated with all the other factors in $u_i$ that worried us before: water supply was decided years earlier, and now houses on the same street have different suppliers!
Here are the conditions for a valid instrument:
Relevance or First Stage: Water purity is indeed a function of supplier identity. We want $E[w_i | z_i = 1] \neq E[w_i | z_i = 0]$, i.e. the average water purity differs across suppliers. We can verify this condition with observational data. We want this effect to be reliably causal.
Independence: Whether a household has $z_i = 1$ or $z_i = 0$ is unrelated to $u_i$, hence as good as random. Conditioning on the value of $z_i$ does not change the expected value of $u_i$: we want $E[u_i | z_i = 1] = E[u_i | z_i = 0]$.
Excludability: The instrument should affect the outcome $c_i$ only through the specified channel (i.e. via water purity $w_i$), and through nothing else.
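Of the three conditions, relevance is the one that can be checked directly in data. The sketch below does this on simulated data; the supplier shares and purity probabilities are invented for illustration. It regresses $w_i$ on $z_i$ and compares the slope with the difference in group means.

```r
set.seed(4)
n <- 5000

# z = 1: Lambeth, z = 0: Southwark or Vauxhall (simulated, not real data)
z <- rbinom(n, 1, 0.4)
# Impure water (w = 1) is rare with Lambeth and common otherwise
w <- rbinom(n, 1, ifelse(z == 1, 0.1, 0.9))

# First stage: with a binary z, the OLS slope is the difference in means
first_stage <- lm(w ~ z)
slope       <- unname(coef(first_stage)["z"])
diff_means  <- mean(w[z == 1]) - mean(w[z == 0])

print(summary(first_stage)$coefficients)
print(c(slope = slope, diff_means = diff_means))
```

A slope far from zero, with a small standard error, is evidence for relevance. Independence and excludability involve the unobserved $u_i$ and cannot be checked in this way; they have to be argued from the setting.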
We are now ready to define a simple IV estimator. Like before, let's condition on the values of $z_i$:
$$\begin{aligned} E[c_i | z_i = 1] &= \alpha + \delta E[w_i | z_i = 1] + E[u_i | z_i = 1] \\ E[c_i | z_i = 0] &= \alpha + \delta E[w_i | z_i = 0] + E[u_i | z_i = 0] \end{aligned}$$
which upon differencing both lines gives
$$E[c_i | z_i = 1] - E[c_i | z_i = 0] = \delta \{E[w_i | z_i = 1] - E[w_i | z_i = 0]\} + \underbrace{\{E[u_i | z_i = 1] - E[u_i | z_i = 0]\}}_{=0 \text{ by the independence (exogeneity) assumption}}$$
Relevance guarantees that the first term in braces is nonzero, so we can solve for $\delta$:
$$\delta = \frac{E[c_i | z_i = 1] - E[c_i | z_i = 0]}{E[w_i | z_i = 1] - E[w_i | z_i = 0]}$$
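Let's say that $x \mapsto y$ means that $x$ is an estimate for $y$. Using sample means within each supplier group, we have $\bar{c}_1 \mapsto E[c_i | z_i = 1]$, $\bar{c}_0 \mapsto E[c_i | z_i = 0]$, $\bar{w}_1 \mapsto E[w_i | z_i = 1]$ and $\bar{w}_0 \mapsto E[w_i | z_i = 0]$.
The estimator would then be
$$\hat{\delta} = \frac{\bar{c}_1 - \bar{c}_0}{\bar{w}_1 - \bar{w}_0}$$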
In this special case, where all involved variables $c_i$, $w_i$, $z_i$ are binary, the estimator is called the Wald estimator.
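The Wald estimator is simple enough to compute by hand. The following sketch applies it to simulated data in which the instrument is assigned independently of poverty; all probabilities and the true effect `delta` are assumptions chosen for illustration, not values from Snow's data.

```r
set.seed(3)
n     <- 200000
alpha <- 0.02   # assumed baseline death probability
delta <- 0.10   # assumed true effect of impure water

poor <- rbinom(n, 1, 0.5)   # unobserved confounder
z    <- rbinom(n, 1, 0.5)   # supplier, assigned independently of poor

# Impure water depends strongly on the supplier, and also on poverty
w       <- rbinom(n, 1, ifelse(z == 1, 0.05, 0.65) + 0.30 * poor)
c_death <- rbinom(n, 1, alpha + delta * w + 0.10 * poor)

# Wald estimator: ratio of differences in group means across suppliers
wald <- (mean(c_death[z == 1]) - mean(c_death[z == 0])) /
        (mean(w[z == 1]) - mean(w[z == 0]))

# Naive comparison by water purity, for contrast
naive <- mean(c_death[w == 1]) - mean(c_death[w == 0])

print(c(wald = wald, naive = naive, truth = delta))
```

In this simulation the Wald estimate lands close to the assumed `delta`, while the naive comparison is pushed upwards by poverty. That only works because `z` was generated independently of `poor` and affects deaths solely through `w`, which mirrors the independence and excludability conditions.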
Summary: IVs are a powerful tool to establish causality in contexts with observational data only, where we are concerned that the conditional mean assumption $E[u_i | x_i] = 0$ is violated. In that case we cannot say "all else equal, as $x$ changes, $y$ changes like this and that", and we say that $x$ is endogenous. The key features of a valid instrument $z$ are the three conditions stated above: relevance, independence and excludability.