Problem Set 3: Causality, IV, and Final Review

EC 421: Introduction to Econometrics

Author

Edward Rubin

1 Instructions

Optional!!! This problem set is entirely optional. If you submit it, it will replace your lowest problem set grade (unless it is lower than your lowest grade). If you do not submit this assignment, your grade will be unchanged. Either way, I encourage you to work through the problems to prepare for the final exam.

Due Upload your answers to Canvas before 11:59 PM on Wednesday, 11 March 2026.

Important You must submit your answers as an HTML or PDF file, built from an RMarkdown (.Rmd) or Quarto (.qmd) file. Do not submit the .Rmd or .qmd file. You will not receive credit for it.

If we ask you to create a figure or run a regression, then the figure or regression results should be in the document that you submit (not just the code—we want the actual figure or regression output with coefficients, standard errors, etc.).

Integrity If you are suspected of cheating, then you will receive a zero—for the assignment and possibly for the course.

README! The data for this problem set are stored in data-ps3.csv. They build on the county-level life-expectancy dataset from Problem Set 1 and add two new tax-related variables from the original county-characteristics source:

Variable names and descriptions
Variable name	Variable type	Variable description
`county_name`	character	County name
`county_code`	integer	County FIPS code
`state_abb`	character	State abbreviation
`income_quartile`	character	Income quartile (1 or 4)
`life_exp`	numeric	Life expectancy (years)
`pct_uninsured`	numeric	Percent uninsured
`poverty_rate`	numeric	Share below poverty line
`pct_religious`	numeric	Percent religious
`pct_black`	numeric	Percent Black
`pct_hispanic`	numeric	Percent Hispanic
`unemployment_rate`	numeric	Unemployment rate
`median_hh_inc`	numeric	Median household income
`is_urban`	integer	Urban county indicator (1 or 0)
`pop`	integer	County population
`pop_density`	numeric	Population density
`pct_smoke`	numeric	Percent who smoke
`pct_obese`	numeric	Percent obese (BMI)
`pct_exercise`	numeric	Percent who exercise
`tax_rate`	numeric	Local tax rate
`tax_prog`	numeric	Tax progressivity

Recall that the data came from a study titled The Association Between Income and Life Expectancy in the United States, 2001-2014 (by Raj Chetty, Augustin Bergeron, David Cutler, Benjamin Scuderi, Michael Stepner, and Nicholas Turner).

You can find more information about the life-expectancy variables here and the county characteristics here.

Objective This problem set has three goals:

Practice reasoning carefully about causality in observational data.
Implement and interpret instrumental-variables and two-stage least squares estimators in R.
Review the main ideas from lecture to prepare for the final exam.

2 Setup

[00] Load your R packages and the dataset (data-ps3.csv). You will likely want tidyverse, scales, fixest, and here.

3 Section 1: Causality and IV

In this section, we will study the relationship between smoking and life expectancy. A natural starting point is the regression

\[ \text{(Life expectancy)}_i = \beta_0 + \beta_1 \text{(Smoking share)}_i + u_i, \]

but there are obvious reasons to worry that smoking is endogenous—even at the county level. We will therefore compare OLS and IV estimates.

[01] Explain why smoking is likely endogenous in this regression. Give specific reasons and discuss the direction of bias if you can.

[02] Use R to report

the total number of observations,
the number of counties,
the means of tax_rate and tax_prog,
for each of the two income quartiles (1 and 4): the means of life_exp and pct_smoke.

Hint: You can use group_by and summarise to get summary statistics grouped by another variable.
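As a minimal sketch of the group_by and summarise pattern from the hint, the example below uses a small made-up data frame. The names toy, group, and value are placeholders and are not part of data-ps3.csv.

```r
library(dplyr)

# Toy data, made up for illustration (not the problem-set data)
toy = tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# One row per group: the group's number of rows and its mean of `value`
toy_summary = toy %>%
  group_by(group) %>%
  summarise(n_obs = n(), mean_value = mean(value))

toy_summary
```

The same pattern should carry over to the problem-set data once you swap in the relevant grouping variable and outcome columns. dplyr is loaded automatically when you load tidyverse.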

[03] Create a figure that helps you visualize the raw relationship between smoking and life expectancy. Make the figure informative. At a minimum, it should

put pct_smoke on the horizontal axis,
put life_exp on the vertical axis,
distinguish the two income quartiles somehow (e.g., make two separate plots or color by quartile).

Bonus: Include fitted regression lines in your plot.

Hint: In ggplot2, you can use geom_smooth(method = 'lm') to add a fitted regression line to a scatter plot.
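The sketch below shows one possible way to combine a scatter plot, coloring by group, and geom_smooth(method = 'lm'). It uses simulated data; the names toy, x, y, and group are placeholders, and the styling choices are only suggestions.

```r
library(ggplot2)

# Simulated toy data (not the problem-set data)
set.seed(421)
toy = data.frame(
  x = runif(100, 0, 10),
  group = rep(c("a", "b"), each = 50)
)
toy$y = 2 + 0.5 * toy$x + ifelse(toy$group == "b", 3, 0) + rnorm(100)

# Scatter plot colored by group, with one fitted OLS line per group
toy_plot = ggplot(toy, aes(x = x, y = y, color = group)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "x (toy regressor)", y = "y (toy outcome)", color = "Group")

toy_plot
```

Because color is mapped inside aes(), geom_smooth fits a separate line for each group. If the grouping variable is stored as a number, wrapping it in factor() keeps the colors discrete.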

Briefly describe what you see.

[04] Estimate the following model by OLS using heteroskedasticity-robust standard errors:

\[\begin{aligned} \text{(Life expectancy)}_i = \beta_0 &+ \beta_1 \text{(Smoking share)}_i + \beta_2 \text{(Poverty rate)}_i \\ &+ \beta_3 \text{(Black share)}_i + \beta_4 \text{(Hispanic share)}_i \\ &+ \beta_5 \text{(Urban indicator)}_i + \beta_6 1\{ \text{(Income quartile)}_i = 4 \} \\ &+ u_i \end{aligned}\]

where \(1\{ \text{(Income quartile)}_i = 4 \}\) is an indicator for being in the highest income quartile (with the lowest quartile as the reference group).

Report the results (e.g., in a table).
Then interpret the coefficient on pct_smoke.

Hint: You have a few options for how to implement this regression’s indicator variable:

create a new variable in the dataset that is 1 when income_quartile == 4 and 0 otherwise, and then include that variable in the regression (e.g., using mutate(top_quartile = as.integer(income_quartile == 4))),
use fixest’s built-in i function inside the regression, where you define the reference level (the level to be dropped, here: 1), e.g., y ~ x + i(income_quartile, ref = 1).
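The two options can be sketched on simulated data as follows. The example follows the displayed model, which includes an indicator for quartile 4, so quartile 1 is the omitted group. The names toy, quartile, top_quartile, x, and y are hypothetical, and vcov = "hetero" is one way to request heteroskedasticity-robust standard errors in feols.

```r
library(dplyr)
library(fixest)

# Simulated toy data (not the problem-set data)
set.seed(421)
toy = tibble(
  quartile = rep(c(1, 4), each = 100),
  x = rnorm(200)
) %>%
  mutate(y = 1 + 2 * x + 3 * (quartile == 4) + rnorm(200))

# Option 1: build the indicator by hand, then include it as a regressor
toy = toy %>% mutate(top_quartile = as.integer(quartile == 4))
est_manual = feols(y ~ x + top_quartile, data = toy, vcov = "hetero")

# Option 2: let i() create the indicator, dropping quartile 1 as the reference
est_i = feols(y ~ x + i(quartile, ref = 1), data = toy, vcov = "hetero")

# Side-by-side table; the two indicator coefficients should match
etable(est_manual, est_i)
```

Both options estimate the same model, so the choice is mostly about convenience. The i() approach avoids adding a column to the dataset, while the hand-built indicator makes the coding explicit.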

[05] Suppose we want to use the local tax rate tax_rate and local tax progressivity tax_prog as instruments.

Estimate the first-stage regression using tax_rate and tax_prog as instruments for pct_smoke—keeping the same controls as in [04]. Use heteroskedasticity-robust standard errors.

Do the two tax variables appear relevant? Report the results of the regression and explain your answer about the relevance of your instruments.

Bonus: Use the wald function from fixest to jointly test the significance of the two variables.
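A minimal sketch of a joint test with wald, using simulated data. The variables z1, z2, w, and x are made up, and the keep argument takes a regular expression that selects which coefficients to test (here, every coefficient whose name starts with z).

```r
library(fixest)

# Simulated toy data (not the problem-set data)
set.seed(421)
n = 500
toy = data.frame(z1 = rnorm(n), z2 = rnorm(n), w = rnorm(n))
toy$x = 0.5 * toy$z1 + 0.3 * toy$z2 + 0.2 * toy$w + rnorm(n)

# First-stage-style regression with heteroskedasticity-robust standard errors
first_stage = feols(x ~ z1 + z2 + w, data = toy, vcov = "hetero")

# Joint test that the coefficients on z1 and z2 are both zero
joint_test = wald(first_stage, keep = "^z", print = FALSE)
joint_test
```

The returned vector contains the test statistic, its p-value, and the two degrees of freedom. With your own variable names, adjust the regular expression in keep so that it matches exactly the coefficients you want to test.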

[06] Estimate the reduced-form regression by replacing pct_smoke in [04] with the two tax variables.

Report your results.
Do we need exogeneity to interpret the reduced-form coefficients as causal effects of the tax variables on life expectancy?
How do you interpret the reduced-form coefficients?

[07] Estimate the causal effect of smoking on life expectancy using two-stage least squares (2SLS), instrumenting pct_smoke with tax_rate and tax_prog. Keep the same controls as in [04] and use heteroskedasticity-robust standard errors.

Report your 2SLS results.

You may use feols from fixest, iv_robust from estimatr, or any other package you like.
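If you go with feols, the IV formula has the form outcome ~ controls | endogenous ~ instruments. The sketch below illustrates the syntax on simulated data and compares it with running the two stages by hand. All names (toy, z1, z2, w, v, x, y, x_hat) are hypothetical, and v plays the role of a confounder that the researcher would not observe.

```r
library(fixest)

# Simulated toy data (not the problem-set data)
set.seed(421)
n = 1000
toy = data.frame(z1 = rnorm(n), z2 = rnorm(n), w = rnorm(n), v = rnorm(n))
# v affects both x and y, which makes x endogenous when v is omitted
toy$x = 0.6 * toy$z1 + 0.4 * toy$z2 + 0.3 * toy$w + toy$v + rnorm(n)
toy$y = 1 - 0.5 * toy$x + 0.2 * toy$w + 2 * toy$v + rnorm(n)

# 2SLS in one call: controls before the bar, `endogenous ~ instruments` after
iv_fit = feols(y ~ w | x ~ z1 + z2, data = toy, vcov = "hetero")

# The same point estimate by hand: first stage, then regress y on fitted x
toy$x_hat = fitted(feols(x ~ z1 + z2 + w, data = toy))
manual_fit = feols(y ~ x_hat + w, data = toy)

coef(iv_fit)[["fit_x"]]
coef(manual_fit)[["x_hat"]]
```

The two point estimates agree, but the standard errors from the by-hand second stage are not valid because they ignore that x_hat was estimated. That is one reason to prefer a dedicated 2SLS routine for inference.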

[08] Compare your OLS and 2SLS (IV) estimates for smoking. Which one would you trust more for causal inference here, and why? Be explicit about 2SLS’s assumptions.

[09] Beyond exogeneity issues, the relevance of your instrument(s) is a huge concern when using instrumental variables or two-stage least squares. Use the probability limit of the instrumental variables estimator to discuss how a weak instrument could substantially bias the estimator. \[\text{plim} \hat{\beta}^\text{IV} = \beta + \dfrac{\text{Cov}(z,\, u)}{\text{Cov}(z,\, x)}\] for instrument \(z\), endogenous regressor \(x\), and disturbance \(u\).

[10] On the topic of weak instruments: We should probably account for the fact that our disturbance likely correlates within state.

Re-run the first-stage regression, but with standard errors that allow the disturbance to correlate (cluster) at the state level (e.g., cluster = ~state_abb).

Do the tax variables still appear relevant? Report your results and explain your answer.
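A minimal sketch of how clustering changes the standard errors but not the coefficients, using simulated data in which both the regressor and the disturbance share a state-level component. The names toy, state, z, and x are made up for illustration.

```r
library(fixest)

# Simulated toy data: 30 states with 20 observations each
set.seed(421)
toy = data.frame(state = rep(paste0("s", 1:30), each = 20))
toy$z = rep(rnorm(30), each = 20)                       # varies only across states
toy$x = 0.5 * toy$z + rep(rnorm(30), each = 20) + rnorm(600)  # state-level shock

# Same regression, two choices of standard errors
fs_hetero = feols(x ~ z, data = toy, vcov = "hetero")
fs_cluster = feols(x ~ z, data = toy, cluster = ~state)

# Identical coefficients; only the standard errors change
etable(fs_hetero, fs_cluster)
```

When a regressor and the disturbance both correlate within clusters, heteroskedasticity-robust standard errors can understate the uncertainty, so a variable that looks strongly significant may look weaker after clustering.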

4 Section 2: Final Review

This section is mostly conceptual. Keep your answers concise but precise.

[11] Define the concept of exogeneity in the context of regression analysis. Why is it important for causal inference?

[12] In one or two sentences each, define

a population model,
a sample model,
a parameter,
an estimator,
an estimate,
a standard error.

[13] Explain the two conditions required for omitted-variable bias. If the omitted variable increases \(y\) and is positively correlated with an included regressor \(x_j\), what direction is the bias in the OLS estimate of \(\beta_j\)?

[14] Consider the regression

\[\text{Wage}_i = \beta_0 + \beta_1 \text{Female}_i + \beta_2 \text{Education}_i + \beta_3 (\text{Female}_i \times \text{Education}_i) + u_i\]

Interpret \(\beta_1\), \(\beta_2\), and \(\beta_3\).

[15] Suppose heteroskedasticity is present, but the exogeneity assumption still holds. What happens to

the OLS coefficient estimates,
the usual OLS standard errors,
OLS efficiency?

[16] Give the best econometric tool for each setting below and explain why in one sentence.

You believe OLS coefficients are fine, but you do not know the form of the heteroskedasticity.
Observations may be correlated within states, classrooms, or firms.
You want to correct your standard errors because the disturbance correlates with itself through time.
You want a heteroskedasticity test that is less tied to a specific functional-form assumption.

[17] What does it mean for an estimator to be consistent? Can an estimator be biased but consistent? Can an estimator be unbiased but inconsistent?

[18] What happens to the OLS slope estimate when an explanatory variable is measured with classical measurement error? Why is this often called attenuation bias?

[19] How does autocorrelation affect OLS in

a static time-series model, and
a dynamic model with a lagged dependent variable?

[20] Define stationarity. Why can non-stationary series generate spurious regressions? Give

one case where differencing is appropriate,
the null hypothesis in a Dickey-Fuller or augmented Dickey-Fuller test.