# Load packages
library(pacman)
p_load(tidyverse, patchwork, haven, huxtable, modelsummary, fixest, magrittr, here)
# Load allsites.dta
all_df = read_dta("data/allsites.dta")
Problem Set 3
01 Load the allsites.dta data. You can use read_dta() from haven to read Stata data files into R. Now view it using the View() function. You should see nice variable labels. Get to know the data a bit. Describe anything that sticks out to you. glimpse() is a handy function for getting to know a dataset.
Answer: Done.
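For reference, the commands in question (a minimal sketch; output omitted):
# Open the interactive data viewer (e.g., in RStudio)
View(all_df)
# Compact overview of columns, types, and example values
glimpse(all_df)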
02 Let’s start simple. Regress the 2000 log median house value for a Census tract on the indicator for whether the tract was listed on the National Priorities List before 2000. Report your results, interpreting the coefficient of interest.
Answer:
# First regression
est2 = feols(lnmdvalhs0 ~ npl2000, data = all_df, vcov = 'hetero')
# Table of results
est2 %>% modelsummary(gof_map = 'nobs', output = 'huxtable')
|             | Model 1 |
|-------------|---------|
| (Intercept) | 11.720  |
|             | (0.003) |
| npl2000     | -0.021  |
|             | (0.017) |
| Num.Obs.    | 42974   |
In this regression, listing a tract on the NPL is associated with a reduction in median home value of about 2 percent, though the estimate is not statistically distinguishable from zero.
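Strictly speaking, a log-point coefficient maps to a percent change via exp(b) - 1, which is close to b for small estimates:
# Exact percent change implied by the log-point estimate
exp(-0.021) - 1  # roughly -0.021, i.e., about a 2 percent reduction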
03 Does clustering your errors by state affect your standard errors? Do you think it should? Explain.
Hint: The feols() function from fixest allows you to cluster and is quite fast.
Answer:
# Same regression, now clustering errors by state
est3 = feols(lnmdvalhs0 ~ npl2000, data = all_df, cluster = ~statefips)
# Table of results
list(
  'Heterosk. robust' = est2,
  'Cluster robust' = est3
) %>% modelsummary(gof_map = 'nobs', output = 'huxtable')
|             | Heterosk. robust | Cluster robust |
|-------------|------------------|----------------|
| (Intercept) | 11.720           | 11.720         |
|             | (0.003)          | (0.092)        |
| npl2000     | -0.021           | -0.021         |
|             | (0.017)          | (0.049)        |
| Num.Obs.    | 42974            | 42974          |
It seems at least plausible that treatment (NPL listing) and disturbances (to property values) are spatially correlated, perhaps within state. Thus, we might think it is important to allow the disturbances to correlate within state. When we do so, our standard errors more than double.
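As a side note, cluster-robust inference leans on having a reasonably large number of clusters, so it is worth counting them (a quick sketch, assuming statefips identifies states):
# Number of state clusters (few clusters would warrant caution)
n_distinct(all_df$statefips)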
04 Now run the three regressions described below, sequentially adding more controls to the regressions you ran in 02 and 03. We are still principally interested in the effect of listing on the NPL.
- Control for 1980 housing values
- Also control for economic and demographic variables. (Report which variables you included.)
- Also add state fixed effects.
Briefly interpret your results.
Answer:
# Control for 1980 housing values
est4a = feols(
  lnmdvalhs0 ~ npl2000 + lnmeanhs8,
  cluster = ~statefips,
  data = all_df
)
# Also control for economic and demographic variables
est4b = feols(
  lnmdvalhs0 ~ npl2000 + lnmeanhs8 + pop_den8 + shrblk8,
  cluster = ~statefips,
  data = all_df
)
# Also add state fixed effects
est4c = feols(
  lnmdvalhs0 ~ npl2000 + lnmeanhs8 + pop_den8 + shrblk8 | statefips,
  cluster = ~statefips,
  data = all_df
)
# Table of results
list(
  est2, est3, est4a, est4b, est4c
) %>% modelsummary(gof_map = 'nobs', output = 'huxtable')
|             | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 |
|-------------|---------|---------|---------|---------|---------|
| (Intercept) | 11.720  | 11.720  | 2.404   | 2.403   |         |
|             | (0.003) | (0.092) | (0.444) | (0.515) |         |
| npl2000     | -0.021  | -0.021  | 0.040   | 0.096   | 0.032   |
|             | (0.017) | (0.049) | (0.025) | (0.019) | (0.022) |
| lnmeanhs8   |         |         | 0.856   | 0.850   | 0.795   |
|             |         |         | (0.041) | (0.047) | (0.041) |
| pop_den8    |         |         |         | 0.000   | 0.000   |
|             |         |         |         | (0.000) | (0.000) |
| shrblk8     |         |         |         | -0.167  | -0.153  |
|             |         |         |         | (0.061) | (0.060) |
| Num.Obs.    | 42974   | 42974   | 42974   | 42974   | 42974   |
Adding controls actually flips the sign of the point estimate, though it is only statistically significant in column (4). When we add state fixed effects in column (5), the point estimate again does not statistically differ from zero (though it is not zero, and the confidence interval is fairly wide).
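To make that last claim concrete, the interval can be pulled directly from the fitted model (a sketch; confint() on a fixest model defaults to the vcov chosen at estimation, here clustered by state):
# 95% confidence interval for the NPL coefficient in the FE specification
confint(est4c, "npl2000")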
05 Under what conditions will the coefficients on the NPL indicator (in 02–04) be unbiased for the effect of NPL listing on local property values?
Answer: If, conditional on some set of covariates, a tract’s NPL status in 2000 is “as good as random” (exogenous… uncorrelated with potential outcomes), then the regression that includes those covariates will produce consistent estimates for the average effect of NPL listing on median property values in the tract.
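Stated a bit more formally (a sketch in potential-outcomes notation, where $Y_i(1)$ and $Y_i(0)$ are tract $i$'s potential log median values, $D_i$ is its NPL indicator, and $X_i$ collects the covariates):

$$\big(Y_i(0),\, Y_i(1)\big) \,\perp\, D_i \mid X_i$$

That is, conditional on $X_i$, listing is independent of potential outcomes. For the linear regressions above, we additionally need $X_i$ to enter the model correctly (or flexibly enough).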
06 Let’s compare the covariates for the treated (NPL listed) and control (non-NPL listed) tracts. First use allcovariates.dta to compare the covariates (the controls you used above) between the tracts listed on the NPL and the tracts that were not listed on the NPL (before 2000).
Does it look like the covariates are balanced across NPL and non-NPL listed tracts?
Notes: The all in the filename means that the file has all of the Census tracts. This comparison should be done via regression. You do not need to cluster your errors (though in real life you would want to).
Answer: Lots of options here. I’m providing five…
# Load the data
all_cov = read_dta("data/allcovariates.dta")
# Example regressions: one covariate per regression
est6a = feols(shrblk8 ~ npl2000, data = all_cov)
est6b = feols(pop_den8 ~ npl2000, data = all_cov)
est6c = feols(avhhin8 ~ npl2000, data = all_cov)
est6d = feols(ownocc8 ~ npl2000, data = all_cov)
est6e = feols(povrat8 ~ npl2000, data = all_cov)
# Table of results
list(
  shrblk8 = est6a,
  pop_den8 = est6b,
  avhhin8 = est6c,
  ownocc8 = est6d,
  povrat8 = est6e
) %>% modelsummary(gof_map = 'nobs', output = 'huxtable', vcov = 'HC3')
|             | shrblk8 | pop_den8   | avhhin8    | ownocc8  | povrat8 |
|-------------|---------|------------|------------|----------|---------|
| (Intercept) | 0.117   | 5424.074   | 21510.179  | 800.653  | 0.113   |
|             | (0.001) | (43.605)   | (39.636)   | (2.135)  | (0.000) |
| npl2000     | -0.025  | -4017.126  | -1170.021  | 107.208  | -0.007  |
|             | (0.006) | (84.425)   | (206.217)  | (14.925) | (0.003) |
| Num.Obs.    | 48245   | 48245      | 48245      | 48245    | 48245   |
Moving across the columns, we find statistically significant relationships between NPL listing and many demographics:
- Tracts with a larger share of black residents in 1980 are less likely to be listed.
- Tracts with higher population density (in 1980) are less likely to be listed.
- Tracts with higher income (in 1980) are less likely to be listed.
- Tracts with higher shares of owner-occupied housing (in 1980) are more likely to be listed.
- Tracts with higher rates of poverty (in 1980) are less likely to be listed.
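An alternative one-shot balance check (a sketch; est6_joint is a name introduced here, and wald()'s keep argument is a regular expression): regress the treatment indicator on all five covariates at once and test them jointly.
# Joint balance check: regress treatment on all covariates at once
est6_joint = feols(
  npl2000 ~ shrblk8 + pop_den8 + avhhin8 + ownocc8 + povrat8,
  data = all_cov
)
# Jointly test the covariates (all of their names end in "8")
wald(est6_joint, keep = "8$")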
07 Repeat the exercise from 06 with the dataset named sitecovariates.dta. This dataset focuses on tracts that received an HRS test in 1982. Separately compare the balance for treated and control units using the following three definitions of treated and control:
- NPL-listed prior to 2000 vs. not NPL-listed prior to 2000 (what we’ve been doing so far)
- HRS score above 28.5 vs. HRS score below 28.5
- HRS score in [16.5, 28.5) vs. HRS score in [28.5, 40.5]
What are your conclusions from these three comparisons?
Answer: I’m focusing on the tract’s 1980 share of black residents:
# Load the data
site_cov = read_dta("data/sitecovariates.dta")
# Example regressions
est7a = feols(
  shrblk8 ~ npl2000,
  data = site_cov
)
est7b = feols(
  shrblk8 ~ I(hrs_82 > 28.5),
  data = site_cov
)
est7c = feols(
  shrblk8 ~ I(hrs_82 > 28.5),
  data = filter(site_cov, between(hrs_82, 16.5, 40.5))
)
# Table of results
list(
  est7a, est7b, est7c
) %>% modelsummary(gof_map = 'nobs', output = 'huxtable', vcov = 'HC3')
|                  | Model 1 | Model 2 | Model 3 |
|------------------|---------|---------|---------|
| (Intercept)      | 0.118   | 0.113   | 0.082   |
|                  | (0.019) | (0.017) | (0.022) |
| npl2000          | -0.046  |         |         |
|                  | (0.021) |         |         |
| I(hrs_82 > 28.5) |         | -0.041  | 0.003   |
|                  |         | (0.020) | (0.027) |
| Num.Obs.         | 487     | 487     | 227     |
Columns (1) and (2) are essentially the same, as npl2000 is very strongly correlated with the indicator for whether the HRS score exceeds 28.5. We still see imbalance in the 1980 share of black residents in columns (1) and (2). However, when we narrow in on the threshold, we see much better balance (though the standard error is a bit larger than we might want).
08 Suppose we want to instrument NPL listing with HRS score. What assumptions are necessary?
Answer: We have three requirements:
1. Exogeneity: HRS score is exogenous (seems unlikely; we probably want to use the discrete cutoff instead).
2. Relevance: HRS score actually affects NPL listing (true).
3. Monotonicity: HRS score moves NPL listing in only one direction (likely true).
09 Imagine we instead want to estimate the effect of NPL listing using a regression discontinuity where HRS score is the running variable and the cutoff is 28.5. What assumptions are necessary?
Answer: We need the potential outcomes to vary smoothly across the cutoff—in other words: we cannot have selection onto the two sides of the cutoff. We also need NPL listing to actually be affected by the HRS cutoff at 28.5.
10 Consider the following three “facts”:
- The EPA states that the 28.5 cutoff was selected because it produced a manageable number of sites.
- None of the individuals involved in identifying the site, testing the level of pollution, or running the 1982 HRS test knew the cutoff/threshold.
- EPA documentation emphasizes that the HRS test is an imperfect scoring measure.
Do these three facts suggest that the requirements for IV and/or RD are valid or invalid? Explain.
Answer: Facts 2 and 3 seem to suggest a valid IV/RD: they suggest sorting did not happen. The first fact may be a little concerning if we think the cutoff was a function of some political decision that created imbalance across the threshold. However, choosing the cutoff based on the number of sites above/below the threshold seems less concerning.
11 Now using the 2miledata.dta dataset: Estimate the effect of NPL listing on log median house value (in 2000) using HRS score as an instrument. Estimate the three standard regressions of 2SLS (first stage, reduced form, second stage) and briefly discuss them. Cluster your errors at the state level.
Make sure that your second-stage estimate is equal to the ratio of your reduced form and the first stage.
Answer:
# Load data
main_df = read_dta("data/2miledata.dta")
# First stage
est11_1 = feols(npl2000 ~ hrs_82, data = main_df)
# Reduced form
est11_2 = feols(log(mdvalhs0) ~ hrs_82, data = main_df)
# Actual 2SLS estimate
est11_3 = feols(
  log(mdvalhs0) ~ 1 | npl2000 ~ hrs_82,
  data = main_df
)
# Table of results
list(
  'First stage' = est11_1,
  'Reduced form' = est11_2,
  '2SLS' = est11_3
) %>% modelsummary(gof_map = 'nobs', output = 'huxtable', vcov = ~statefips)
|             | First stage | Reduced form | 2SLS    |
|-------------|-------------|--------------|---------|
| (Intercept) | 0.004       | 11.430       | 11.429  |
|             | (0.039)     | (0.108)      | (0.106) |
| hrs_82      | 0.020       | 0.006        |         |
|             | (0.001)     | (0.002)      |         |
| fit_npl2000 |             |              | 0.309   |
|             |             |              | (0.084) |
| Num.Obs.    | 483         | 483          | 483     |
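Briefly: the first stage says an extra HRS point raises the probability of listing by about 2 percentage points, and the reduced form says an extra point is associated with roughly 0.6 percent higher 2000 values. As the prompt requires, the 2SLS coefficient equals their ratio (0.006 / 0.020 ≈ 0.3, which matches 0.309 once you use unrounded coefficients). A quick check:
# Second stage = reduced form / first stage (matches the fit_npl2000 coefficient)
coef(est11_2)["hrs_82"] / coef(est11_1)["hrs_82"]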
12 Does adding state fixed effects to the 2SLS estimates in 11 change your estimates? Use the first-stage and reduced-form estimates to explain what is happening. Do you think adding additional controls could be important?
Answer: Controls are likely important, as we are still not actually using the discontinuity.
# First stage
est12_1 = feols(
  npl2000 ~ hrs_82 | statefips,
  data = main_df
)
# Reduced form
est12_2 = feols(
  log(mdvalhs0) ~ hrs_82 | statefips,
  data = main_df
)
# Actual 2SLS estimate
est12_3 = feols(
  log(mdvalhs0) ~ 1 | statefips | npl2000 ~ hrs_82,
  data = main_df
)
# Table of results
list(
  'First stage' = est12_1,
  'Reduced form' = est12_2,
  '2SLS' = est12_3
) %>% modelsummary(gof_map = 'nobs', output = 'huxtable', vcov = ~statefips)
|             | First stage | Reduced form | 2SLS    |
|-------------|-------------|--------------|---------|
| hrs_82      | 0.020       | 0.003        |         |
|             | (0.001)     | (0.001)      |         |
| fit_npl2000 |             |              | 0.165   |
|             |             |              | (0.072) |
| Num.Obs.    | 483         | 483          | 483     |
Adding state fixed effects leaves the first stage essentially unchanged but cuts the reduced-form estimate roughly in half, which in turn halves our treatment-effect estimate.
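The same ratio logic as in 11 makes the mechanics explicit (the displayed coefficients are rounded; the unrounded ratio matches 0.165):
# With state FEs: reduced form / first stage still equals the 2SLS estimate
coef(est12_2)["hrs_82"] / coef(est12_1)["hrs_82"]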
13 Repeat the 2SLS analysis from 11 but change your instrument to an indicator for whether HRS score is above 28.5 (no fixed effects). How do your results change? Briefly discuss.
Answer:
# First stage
est13_1 = feols(
  npl2000 ~ I(hrs_82 > 28.5),
  data = main_df
)
# Reduced form
est13_2 = feols(
  log(mdvalhs0) ~ I(hrs_82 > 28.5),
  data = main_df
)
# Actual 2SLS estimate
est13_3 = feols(
  log(mdvalhs0) ~ 1 | npl2000 ~ I(hrs_82 > 28.5),
  data = main_df
)
# Table of results
list(
  'First stage' = est13_1,
  'Reduced form' = est13_2,
  '2SLS' = est13_3
) %>% modelsummary(gof_map = 'nobs', output = 'huxtable', vcov = ~statefips)
|                  | First stage | Reduced form | 2SLS    |
|------------------|-------------|--------------|---------|
| (Intercept)      | 0.164       | 11.528       | 11.493  |
|                  | (0.032)     | (0.091)      | (0.098) |
| I(hrs_82 > 28.5) | 0.826       | 0.179        |         |
|                  | (0.033)     | (0.062)      |         |
| fit_npl2000      |             |              | 0.216   |
|                  |             |              | (0.073) |
| Num.Obs.         | 483         | 483          | 483     |
The first stage is much stronger (it’s a fuzzy RD, but it’s quite strong), and the reduced-form effect is also much larger (the instrument is now a binary indicator rather than a one-point change in HRS score). Together these yield a treatment-effect estimate of 0.216, somewhat smaller than the 0.309 from 11.
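One way to back the “much stronger” claim with a number (a sketch; fitstat() is fixest’s post-estimation helper, and "ivf" requests the first-stage F statistic):
# First-stage F statistics: binary-cutoff instrument vs. continuous HRS score
fitstat(est13_3, "ivf")
fitstat(est11_3, "ivf")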
14 Based upon your first stage estimates in 13: If we want to estimate the effect using a regression discontinuity, will it be a sharp or fuzzy RD? Briefly explain your answer.
Answer: It is a fuzzy RD because treatment status (NPL listing) is not perfectly predicted by being above/below the threshold.
15 Create the standard plots of a regression discontinuity (remember to bin your observations using the running variable, HRS score):
- The outcome variable vs. the running variable
- The treatment variable vs. the running variable
- Covariates vs. the running variable
- Bin counts vs. the running variable
Answer:
# Create plotting dataset
plot_df = main_df %>%
  mutate(run = hrs_82 - 28.5, run_bin = run %/% 1 * 1) %>%
  group_by(run_bin) %>%
  transmute(
    n = n(),
    mean_y = mean(mdvalhs0, na.rm = T),
    mean_trt = mean(npl2000),
    mean_share_black = mean(shrblk8_nbr)
  ) %>%
  ungroup()
# Outcome
p1 = ggplot(data = plot_df, aes(x = run_bin, y = mean_y, group = run_bin < 0)) +
  geom_vline(xintercept = 0, linetype = "dashed", size = 0.2, color = "grey40") +
  geom_line(size = 0.25) +
  geom_point(size = 2.5) +
  geom_smooth(se = F, method = loess) +
  scale_y_continuous("Mean property value", labels = scales::dollar) +
  scale_x_continuous("Distance from HRS threshold") +
  theme_minimal()
# Treatment
p2 = ggplot(data = plot_df, aes(x = run_bin, y = mean_trt, group = run_bin < 0)) +
  geom_vline(xintercept = 0, linetype = "dashed", size = 0.2, color = "grey40") +
  geom_line(size = 0.25) +
  geom_point(size = 2.5) +
  geom_smooth(se = F, method = loess) +
  scale_y_continuous("Share of treated tracts", labels = scales::percent) +
  scale_x_continuous("Distance from HRS threshold") +
  theme_minimal()
# Covariates
p3 = ggplot(data = plot_df, aes(x = run_bin, y = mean_share_black, group = run_bin < 0)) +
  geom_vline(xintercept = 0, linetype = "dashed", size = 0.2, color = "grey40") +
  geom_line(size = 0.25) +
  geom_point(size = 2.5) +
  geom_smooth(se = F, method = loess) +
  scale_y_continuous("Mean share of tract that is black", labels = scales::percent) +
  scale_x_continuous("Distance from HRS threshold") +
  theme_minimal()
# Counts
p4 = ggplot(data = plot_df, aes(x = run_bin, y = n, group = run_bin < 0)) +
  geom_vline(xintercept = 0, linetype = "dashed", size = 0.2, color = "grey40") +
  geom_line(size = 0.25) +
  geom_point(size = 2.5) +
  geom_smooth(se = F, method = loess) +
  scale_y_continuous("Number of observations") +
  scale_x_continuous("Distance from HRS threshold") +
  theme_minimal()
# Combine the four panels (patchwork)
p1 + p2 + p3 + p4
(Four-panel figure: binned means of the outcome, treatment share, 1980 black share, and bin counts, each plotted against distance from the HRS threshold, with loess fits on either side of the cutoff.)
16 Based upon your figures in 15, does this look like a plausible regression discontinuity? Use the figures to explain your answer.
Answer:
- Treatment looks good for a fuzzy RD: We see a big, discontinuous jump across the threshold.
- We do not see a corresponding jump in property values (which is more about treatment effects than validity of the RD).
- The share of the census tract that is black seems to vary fairly smoothly across the threshold.
- The number of observations in the bins jumps a bit at the threshold, which may warrant some concern.
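That last concern is testable: a McCrary-style density test asks whether tracts bunch on one side of the cutoff. A sketch, assuming the rddensity package is installed (it is not loaded above):
# Manipulation/density test of the running variable at the 28.5 cutoff
library(rddensity)
summary(rddensity(X = main_df$hrs_82, c = 28.5))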
17 Time for RD estimates! Limit your analysis to HRS scores in [16.5, 40.5] and re-estimate the 2SLS from 13. Add controls if you think they’re necessary. Report your results.
Answer:
# Fuzzy RD estimate
est17_3a = feols(
  log(mdvalhs0) ~ 1 | 0 | npl2000 ~ I(hrs_82 > 28.5),
  data = main_df %>% filter(hrs_82 %>% between(16.5, 40.5))
)
# Add state fixed effects
est17_3b = feols(
  log(mdvalhs0) ~ 1 | statefips | npl2000 ~ i(hrs_82 > 28.5),
  data = main_df %>% filter(hrs_82 %>% between(16.5, 40.5))
)
# Table of results
list(
  'Fuzzy RD' = est17_3a,
  'With FEs' = est17_3b
) %>% modelsummary(gof_map = 'nobs', output = 'huxtable', vcov = ~statefips)
|             | Fuzzy RD | With FEs |
|-------------|----------|----------|
| (Intercept) | 11.555   |          |
|             | (0.092)  |          |
| fit_npl2000 | 0.056    | 0.092    |
|             | (0.087)  | (0.073)  |
| Num.Obs.    | 226      | 226      |
Without state fixed effects, we estimate a 5.6 percent increase in property values that does not differ statistically from zero. With state fixed effects, we estimate an increase of approximately 9.2 percent, but this estimate also does not differ statistically from zero. The standard errors on these point estimates are quite large: the confidence interval for the second estimate covers a 6.6 percent decrease to a 25 percent increase.
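For reference, that interval can be recovered from the clustered coefficient table (a sketch using a normal approximation, so the endpoints will differ slightly from ones based on exact t critical values and unrounded estimates):
# 95% CI for the FE estimate, clustering by state
ct = coeftable(est17_3b, vcov = ~statefips)
ct["fit_npl2000", "Estimate"] + c(-1.96, 1.96) * ct["fit_npl2000", "Std. Error"]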
18 Which of your estimates do you find most credible? Explain.
Answer: Based upon the institutional information, it seems like the fuzzy RD likely provides the best apples-to-apples comparison: folks who barely missed out on getting listed vs. folks who barely got listed. That said, this design still has some shortcomings: (1) the bin-count graph does not look great, (2) it seems pretty underpowered, with only 226 observations in our “window” (at the least, the standard errors are large), and (3) the point estimate is a hyper-local LATE.