Problem Set 3: Causality, Instrumental Variables, and Review

EC 421: Introduction to Econometrics

Author

Edward Rubin

1 Instructions

Due Upload your answers on Canvas before 11:59 pm (Pacific) on Saturday, 07 June 2025.

Optional This assignment is optional. If you submit it on Canvas, we will grade it, and it will replace the lowest grade among your assignments. If you do not submit it, your grade will not change.

Important You must submit your answers as an HTML or PDF file. Do not submit an .Rmd or .qmd file; you will not receive credit for either.

Integrity If you are suspected of cheating, then you will receive a zero—for the assignment and possibly for the course. We may report you to the dean. Cheating includes copying from your classmates, from the internet, and from previous assignments.

Objective This problem set has three main purposes: (1) reinforce what you learned about causality and instrumental variables; (2) review the material from our course; (3) help you prepare for the final.

2 Causality

[01] Use plain English (i.e., no math) to explain The Fundamental Problem of Causal Inference.

[02] Use mathematical notation to explain The Fundamental Problem of Causal Inference.

[03] Explain what we mean by causality (i.e., based upon the potential outcomes framework).

[04] What does the potential outcomes framework (as defined in class—also called the Rubin causal model) assume about causal effects?

[05] Suppose I’m interested in estimating the effect of healthcare access on longevity. Use the potential outcomes framework to explain why it’s likely a bad idea to use the difference in means between individuals with and without healthcare

\[ \textit{Avg}(\text{Longevity}_{i}|\text{Healthcare}_{i}=1) - \textit{Avg}(\text{Longevity}_{i}|\text{Healthcare}_{i}=0) \]

to estimate the effect of healthcare on longevity.

Assume \(\text{Longevity}_{i}\) denotes individual \(i\)’s years lived and \(\text{Healthcare}_{i}\) is an indicator variable for whether individual \(i\) has access to healthcare.
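The setup in this question can be explored with a small simulation. The sketch below is illustrative only: the data are simulated, and the variable `income`, the true effect of 2 years, and every coefficient are assumptions made up for the example rather than anything from the course. It builds both potential outcomes for each individual, assigns healthcare non-randomly, and compares the difference in observed means with the true effect.

```python
import numpy as np

rng = np.random.default_rng(421)
n = 100_000

# Hypothetical unobserved trait that raises both longevity and healthcare access.
income = rng.normal(size=n)

# Potential outcomes: years lived without (y0) and with (y1) healthcare access.
true_effect = 2.0  # assumed constant effect, chosen for illustration
y0 = 75 + 3 * income + rng.normal(size=n)
y1 = y0 + true_effect

# Non-random access: higher-income individuals are more likely to have healthcare.
healthcare = (income + rng.normal(size=n) > 0).astype(int)

# Only one potential outcome is ever observed for each individual.
longevity = np.where(healthcare == 1, y1, y0)

# Difference in observed group means.
naive_diff = longevity[healthcare == 1].mean() - longevity[healthcare == 0].mean()

# Difference in untreated potential outcomes between the two groups.
# This is computable here only because the simulation reveals y0 for everyone.
selection_bias = y0[healthcare == 1].mean() - y0[healthcare == 0].mean()

print(f"True effect:         {true_effect:.2f}")
print(f"Difference in means: {naive_diff:.2f}")
print(f"Selection bias:      {selection_bias:.2f}")
```

In real data `y0` is never observed for individuals with healthcare, so the selection-bias term cannot be computed directly.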

[06] Suppose I ran a regression instead of just taking a difference in means, e.g.,

\[ \text{Longevity}_{i} = \beta_{0} + \beta_{1} \text{Healthcare}_{i} + u_{i} \]

[07] Does this regression have a better chance at estimating the causal effect of healthcare on longevity? Explain your answer.

[08] Would your answers to the questions above change if you knew that access to healthcare was randomized? Why or why not? Explain your answer.

3 Instrumental variables

[09] In plain English (again, no math), explain the goal of instrumental variables (IV).

[10] What are the two main requirements for a variable to be a valid instrument (as defined in class)?

[11] For outcome variable \(y\), explanatory variable \(x\), and instrument \(z\), explain the regressions you would run for

  • the reduced form;
  • the first stage;
  • the second stage.

[12] Explain why it makes sense that the IV estimator divides the coefficient from the reduced form by the coefficient from the first stage.
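The three regressions and the ratio described above can be sketched on simulated data. Everything in this example is an assumption for illustration: the data-generating process, the confounder `w`, the true coefficient `beta = 1.5`, and the helper `ols_slope` are hypothetical and not taken from the course materials.

```python
import numpy as np

rng = np.random.default_rng(421)
n = 100_000

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

z = rng.normal(size=n)                 # instrument
w = rng.normal(size=n)                 # unobserved confounder
x = 0.5 * z + w + rng.normal(size=n)   # x depends on the instrument and the confounder
beta = 1.5                             # assumed true effect of x on y
y = beta * x + 2 * w + rng.normal(size=n)

reduced_form = ols_slope(z, y)         # regress y on z
first_stage = ols_slope(z, x)          # regress x on z
iv_estimate = reduced_form / first_stage

# Second stage: regress y on the first-stage fitted values of x.
first_stage_intercept = x.mean() - first_stage * z.mean()
x_hat = first_stage_intercept + first_stage * z
second_stage = ols_slope(x_hat, y)

ols_estimate = ols_slope(x, y)         # naive regression of y on x

print(f"OLS:                        {ols_estimate:.3f}")
print(f"Reduced form / first stage: {iv_estimate:.3f}")
print(f"Second stage:               {second_stage:.3f}")
```

With one instrument and one endogenous regressor, the second-stage slope and the ratio of the reduced-form and first-stage slopes are numerically identical, which can be checked by comparing the last two printed values.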

[13] Imagine you work for a real-estate startup, and your boss asks you to estimate how proximity to a bus stop affects apartment rent.

Write out a regression model for your boss’s desired relationship (i.e., the model, including the \(\beta\)s, that you would estimate).

[14] Suppose your boss suggests that you use house age as an instrumental variable. Using the two requirements of a valid instrument, discuss whether this instrument is a “good” or “bad” idea.

4 Additional topics

[15] Why do we care about stationarity?

[16] Compare and contrast the concepts of autocorrelation and heteroskedasticity.

[17] Which of the OLS assumptions relate to the mean of the disturbance, and which relate to the variance of the disturbance?

[18] Define exogeneity and explain what happens when we violate it.

[19] Write out a simple linear regression model (for the effect of \(x\) on \(y\)) for each of the following specifications and explain how you interpret the coefficient on \(x\).

  • Fully linear model where \(y\) and \(x\) are both continuous;
  • Log-linear model where \(y\) and \(x\) are both continuous;
  • Log-log model where \(y\) and \(x\) are both continuous;
  • Linear model where \(y\) is binary and \(x\) is continuous;
  • Linear model where \(y\) is continuous and \(x\) is binary.

[20] Explain what information standard errors provide in the context of OLS regression.
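One way to build intuition for this question is a repeated-sampling simulation. The sketch below is illustrative only: the model `y = 1 + 2x + u`, the sample size, the number of replications, and the helper `slope_and_se` are assumptions chosen for the example. It draws many samples, estimates the slope in each, and compares the spread of those estimates with the average standard error reported by the usual homoskedastic formula.

```python
import numpy as np

rng = np.random.default_rng(421)
n, reps = 200, 2000

def slope_and_se(x, y):
    """OLS slope and its conventional (homoskedastic) standard error."""
    x_dev = x - x.mean()
    ssx = x_dev @ x_dev
    b1 = (x_dev @ y) / ssx
    b0 = y.mean() - b1 * x.mean()
    resid = y - b0 - b1 * x
    s2 = (resid @ resid) / (len(y) - 2)  # estimated disturbance variance
    return b1, np.sqrt(s2 / ssx)

slopes = np.empty(reps)
ses = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=n)
    y = 1 + 2 * x + rng.normal(size=n)   # assumed true slope of 2
    slopes[r], ses[r] = slope_and_se(x, y)

sd_of_slopes = slopes.std(ddof=1)  # spread of the estimates across samples
mean_se = ses.mean()               # average reported standard error

print(f"SD of slope estimates across samples: {sd_of_slopes:.4f}")
print(f"Average estimated standard error:     {mean_se:.4f}")
```

The disturbances in this simulation are homoskedastic and independent by construction, so the conventional formula is appropriate here; the comparison need not hold when those conditions fail.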