Problem Set 3: Causality, Instrumental Variables, and Review

EC 421: Introduction to Econometrics

Author

Edward Rubin

1 Instructions

Due Upload your answer on Canvas before 11:59 pm (Pacific) on Saturday, 07 June 2025.

Optional: This assignment is optional. If you submit it on Canvas, we will grade it, and it will replace the lowest grade among your assignments. If you do not submit it, your grade will not change.

Important You must submit your answers as an HTML or PDF file. Do not submit an .RMD or .qmd file. You will not receive credit for them.

Integrity If you are suspected of cheating, then you will receive a zero—for the assignment and possibly for the course. We may report you to the dean. Cheating includes copying from your classmates, from the internet, and from previous assignments.

Objective This problem set has three main purposes: (1) reinforce what you learned about causality and instrumental variables; (2) review the material from our course; (3) help you prepare for the final.

2 Causality

[01] Use plain English (i.e., no math) to explain The Fundamental Problem of Causal Inference.

Answer The fundamental problem of causal inference is that we can never observe the treated and untreated outcomes for the same individual at the same time.

[02] Use mathematical notation to explain The Fundamental Problem of Causal Inference.

Answer We want to be able to calculate \(\tau_i = Y_{1i} - Y_{0i}\), where \(Y_{1i}\) is the outcome for individual \(i\) if they are treated and \(Y_{0i}\) is the outcome for individual \(i\) if they are untreated. However, we can only observe one of these outcomes for each individual at a time, leading to the fundamental problem of causal inference.

[03] Explain what we mean by causality (i.e., based upon the potential outcomes framework).

Answer Again, in the potential outcomes framework, causality refers to the comparison between two alternate states of the world, i.e., comparing the treated outcome to the untreated outcome. The difference between the two is the effect caused by treatment.

[04] What does the potential outcomes framework (as defined in class—also called the Rubin causal model) assume about causal effects?

Answer The potential outcomes framework assumes that an individual’s potential outcomes are not affected by the treatment status of other individuals.

[05] Suppose I’m interested in estimating the effect of healthcare access on longevity. Use the potential outcomes framework to explain why it’s likely a bad idea to use the difference in means between individuals with and without healthcare

\[ \textit{Avg}(\text{Longevity}_{i}|\text{Healthcare}_{i}=1) - \textit{Avg}(\text{Longevity}_{i}|\text{Healthcare}_{i}=0) \]

to estimate the effect of healthcare on longevity.

Assume \(\text{Longevity}_{i}\) denotes individual \(i\)’s years lived and \(\text{Healthcare}_{i}\) is an indicator variable for whether individual \(i\) has access to healthcare.

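Answer The difference in means between a treated group (here: those with healthcare) and an untreated group (those without healthcare) is equal to the treatment effect plus selection bias, i.e., \[ \textit{Avg}(\text{Longevity}_{i}|\text{Healthcare}_{i}=1) - \textit{Avg}(\text{Longevity}_{i}|\text{Healthcare}_{i}=0) = \tau + \text{Selection bias} \]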

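Here selection bias is \[ \text{Selection bias} = \textit{Avg}(\text{L}_{0i} | \text{Healthcare}_{i}=1) - \textit{Avg}(\text{L}_{0i} | \text{Healthcare}_{i}=0) \] where \(\text{L}_{1i}\) is the longevity outcome for individual \(i\) if they have healthcare and \(\text{L}_{0i}\) is the longevity outcome for individual \(i\) if they do not have healthcare. Selection bias is likely nonzero here: people with healthcare access probably differ from people without it in ways that affect longevity (e.g., income), so the two groups would likely have different average longevity even if neither had healthcare.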

Note: Here, as we did in class, I’m imposing the assumption that the treatment effect is constant across individuals, which is not always true in practice.
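As a rough numerical check of this decomposition, the Python sketch below simulates hypothetical data in which people who would live longer without healthcare are also more likely to have healthcare. Every number in it (a constant effect of 2 years, the selection rule, the sample size) is an assumption made for illustration, not something from the course. Because this is a simulation, we get to see both potential outcomes, which is never possible with real data.

```python
import random
from statistics import mean

random.seed(421)
TAU = 2.0   # assumed constant treatment effect (years)
N = 10_000  # assumed sample size

people = []
for _ in range(N):
    l0 = random.gauss(75, 5)   # longevity without healthcare
    l1 = l0 + TAU              # longevity with healthcare
    # Assumed selection rule: people with higher l0 tend to have healthcare
    healthcare = 1 if l0 + random.gauss(0, 5) > 75 else 0
    people.append((l0, l1, healthcare))

treated = [p for p in people if p[2] == 1]
untreated = [p for p in people if p[2] == 0]

# Computable from observed data: treated see l1, untreated see l0
diff_in_means = mean(l1 for _, l1, _ in treated) - mean(l0 for l0, _, _ in untreated)

# Only computable in a simulation: untreated outcomes for BOTH groups
selection_bias = mean(l0 for l0, _, _ in treated) - mean(l0 for l0, _, _ in untreated)

print(f"difference in means:  {diff_in_means:.2f}")
print(f"tau + selection bias: {TAU + selection_bias:.2f}")
```

The two printed values agree, and the difference in means overstates the assumed effect because the selection bias is positive under this selection rule.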

[06] Suppose I ran a regression instead of just taking a difference in means, e.g.,

\[ \text{Longevity}_{i} = \beta_{0} + \beta_{1} \text{Healthcare}_{i} + u_{i} \]

Answer There wasn’t much of a question here. Let’s move on.

[07] Does this regression have a better chance at estimating the causal effect of healthcare on longevity? Explain your answer.

Answer This regression will suffer from the same selection bias as the difference in means. Regression is just a statistical technique; it does not fix selection bias.
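One way to see why the regression cannot do better: with a single binary regressor, the OLS slope is numerically identical to the difference in group means. The sketch below checks this on hypothetical simulated data (the data-generating numbers and the helper `ols_slope` are assumptions for illustration).

```python
import random
from statistics import mean

random.seed(421)

# Hypothetical data: binary healthcare indicator and longevity in years
healthcare = [random.randint(0, 1) for _ in range(500)]
longevity = [70 + 3 * d + random.gauss(0, 5) for d in healthcare]

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

beta1_hat = ols_slope(healthcare, longevity)
diff_in_means = (
    mean(y for d, y in zip(healthcare, longevity) if d == 1)
    - mean(y for d, y in zip(healthcare, longevity) if d == 0)
)

print(f"OLS slope:           {beta1_hat:.6f}")
print(f"difference in means: {diff_in_means:.6f}")
```

Since the two numbers are the same, any selection bias in the difference in means carries over to the regression estimate.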

[08] Would your answers to the questions above change if you knew that access to healthcare was randomized? Why or why not? Explain your answer.

Answer Yes! If healthcare access were randomized, then selection bias (on average) disappears, as the treated group and the control group have been pulled from the same population: there is no longer selection into treatment.
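A small simulation can illustrate the point. The sketch below assigns healthcare by a coin flip, so treatment is unrelated to anyone's untreated longevity. The constant effect of 2 years and the sample size are assumptions for illustration only.

```python
import random
from statistics import mean

random.seed(421)
TAU = 2.0   # assumed constant treatment effect (years)
N = 10_000  # assumed sample size

l0 = [random.gauss(75, 5) for _ in range(N)]           # untreated longevity
healthcare = [random.randint(0, 1) for _ in range(N)]  # coin-flip assignment

l0_treated = [y for y, d in zip(l0, healthcare) if d == 1]
l0_untreated = [y for y, d in zip(l0, healthcare) if d == 0]

# Selection bias: gap in UNTREATED outcomes between the two groups
selection_bias = mean(l0_treated) - mean(l0_untreated)

# Observed difference in means: treated outcomes are l0 + TAU
diff_in_means = mean(y + TAU for y in l0_treated) - mean(l0_untreated)

print(f"selection bias:      {selection_bias:.3f}")
print(f"difference in means: {diff_in_means:.3f}")
```

Under random assignment the selection bias is close to zero (it is exactly zero only on average), so the difference in means lands close to the assumed effect.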

3 Instrumental variables

[09] In plain English (again, no math), explain the goal of instrumental variables (IV).

Answer The goal of IV is to use an instrumental variable to extract exogenous variation in the treatment (variable) of interest. With this exogenous variation, we are able to estimate the causal effect of the treatment without selection bias.

[10] What are the two main requirements for a variable to be a valid instrument (as defined in class)?

Answer The two main requirements for a variable to be a valid instrument are:

Relevance: The instrument must correlate with the treatment variable (the endogenous explanatory variable).
Exogeneity: The instrument must be independent of (here: have zero covariance/correlation with) the disturbance in the outcome equation (the equation we are trying to estimate). In other words, the instrument should not be related to the outcome variable except through its effect on the treatment variable.

[11] For outcome variable \(y\), explanatory variable \(x\), and instrument \(z\), explain the regressions you would run for

the reduced form;
the first stage;
the second stage.

Answer Here are the regressions:

Reduced form: \(y_i = \gamma_0 + \gamma_1 z_i + v_i\);
First stage: \(x_i = \pi_0 + \pi_1 z_i + w_i\);
Second stage: \(y_i = \beta_0 + \beta_1 \hat{x}_i + u_i\), where \(\hat{x}_i\) is the predicted value of \(x_i\) from the first stage regression.

[12] Explain why it makes sense that the IV estimator divides the coefficient from the reduced form by the coefficient from the first stage.

Answer This “ratio” makes sense because

the numerator (from the reduced form) estimates how changes in the instrument translate into changes in our outcome;
the denominator (from the first stage) estimates how changes in the instrument translate into changes in our treatment variable.

Dividing the numerator (changes in \(y\) per change in \(z\)) by the denominator (changes in \(x\) per change in \(z\)) gives us the change in \(y\) per change in \(x\), which is the causal effect we are trying to estimate.

Because the instrument is exogenous, we can “trust” the estimates in each part of the ratio.
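The ratio logic can be checked numerically. The sketch below simulates hypothetical data with an unobserved confounder, then compares (a) the reduced-form coefficient divided by the first-stage coefficient, (b) the second-stage coefficient from regressing \(y\) on \(\hat{x}\), and (c) naive OLS of \(y\) on \(x\). The data-generating process, the true effect of 1.5, and the helper `ols` are all assumptions for illustration.

```python
import random

random.seed(421)
N = 20_000
BETA1 = 1.5  # assumed true causal effect of x on y

z, x, y = [], [], []
for _ in range(N):
    zi = random.gauss(0, 1)  # instrument: unrelated to the confounder
    ci = random.gauss(0, 1)  # unobserved confounder (part of the disturbance)
    xi = 1 + 0.8 * zi + ci + random.gauss(0, 1)
    yi = 2 + BETA1 * xi + 2 * ci + random.gauss(0, 1)
    z.append(zi)
    x.append(xi)
    y.append(yi)

def ols(x, y):
    """Return (intercept, slope) from a simple regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return y_bar - slope * x_bar, slope

_, gamma1_hat = ols(z, y)      # reduced form: y on z
pi0_hat, pi1_hat = ols(z, x)   # first stage: x on z
iv_ratio = gamma1_hat / pi1_hat

x_hat = [pi0_hat + pi1_hat * zi for zi in z]
_, beta1_2sls = ols(x_hat, y)  # second stage: y on fitted x
_, beta1_ols = ols(x, y)       # naive OLS: y on x

print(f"reduced form / first stage: {iv_ratio:.3f}")
print(f"second-stage coefficient:   {beta1_2sls:.3f}")
print(f"naive OLS coefficient:      {beta1_ols:.3f}")
```

With one instrument and one endogenous variable, the ratio and the second-stage coefficient are the same number, and both land near the assumed effect, while naive OLS is pulled away from it by the confounder.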

[13] Imagine you work for a real-estate startup, and your boss asks you to estimate how proximity to a bus stop affects apartment rent.

Write out a regression model for your boss’s desired relationship (i.e., the model, including \(\beta\)s that you would estimate).

Answer One possible simple linear regression model is \[ \text{Rent}_{i} = \beta_0 + \beta_1 (\text{Bus Stop Distance})_{i} + u_{i} \]

Of course, you could add other terms (e.g., quadratic effects of distance), other variables (with or without interactions), and/or change the functional form (e.g., a log-log model).

[14] Suppose your boss suggests that you use house age as an instrumental variable. Using the two requirements of a valid instrument, discuss whether this instrument is a “good” or “bad” idea.

Answer Let’s evaluate the two requirements:

Relevance: It’s definitely possible that house age correlates with distance to bus stops. If older houses are closer to the city center, and the city center has more bus stops, you could find that house age is relevant for bus-stop proximity. This assumption is testable!
Exogeneity: I don’t think house age is exogenous. House age likely directly affects rent, which means it is correlated with the disturbance in our main equation of interest. House age likely correlates with a bunch of other variables in the disturbance (e.g., square footage, lot size, general quality).

Because the proposed instrument fails the exogeneity requirement, it is a bad instrument (and idea).

4 Additional topics

[15] Why do we care about stationarity?

Answer We care about stationarity because we’ve been assuming that the data are “well behaved”, i.e., that the properties of the data do not change over time. If things like the mean, variance, or covariance are changing over time, then we can find spurious results.
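To illustrate what a spurious result can look like, the sketch below regresses one random walk on another, independent random walk. A random walk is a standard example of a nonstationary series because its variance grows over time. The series length, number of replications, and helper functions are assumptions for illustration.

```python
import math
import random

random.seed(421)
T = 100     # length of each series (assumed)
REPS = 300  # number of simulated pairs of series (assumed)

def random_walk(t):
    """Nonstationary series: each value is the previous value plus noise."""
    level, path = 0.0, []
    for _ in range(t):
        level += random.gauss(0, 1)
        path.append(level)
    return path

def slope_t_stat(x, y):
    """t statistic for the slope in a simple regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((a - x_bar) ** 2 for a in x)
    slope = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / s_xx
    intercept = y_bar - slope * x_bar
    ssr = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    se = math.sqrt(ssr / (n - 2) / s_xx)
    return slope / se

# The two walks are independent, so the true relationship is zero
rejections = sum(
    abs(slope_t_stat(random_walk(T), random_walk(T))) > 1.96
    for _ in range(REPS)
)
rejection_rate = rejections / REPS
print(f"share of 'significant' slopes: {rejection_rate:.2f}")
```

If the usual inference were valid here, we would reject a true null of no relationship about 5 percent of the time. With these nonstationary series, the rejection share is far higher.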

[16] Compare and contrast the concepts of autocorrelation and heteroskedasticity.

Answer Autocorrelation occurs when a variable correlates with itself through time. We often are concerned about the disturbance being autocorrelated. This is mainly a time-series concept.

Heteroskedasticity occurs when the disturbance’s variance differs across observations. This is mainly a cross-sectional concept, but it can also occur in time-series data.

Both autocorrelation and heteroskedasticity violate OLS assumptions. In static models (or models with only lagged explanatory variables), they cause similar issues: the standard OLS standard errors are biased, and OLS is inefficient.

[17] Which of the OLS assumptions relate to the mean of the disturbance and which relate to the variance of the disturbance?

Answer The assumption of exogeneity relates to the mean of the disturbance, i.e., \(E[u|X] = 0\).

The assumption of homoskedasticity relates to the variance of the disturbance, i.e., \(Var[u|X] = \sigma^2\).

[18] Define exogeneity and explain what happens when we violate it.

Answer Exogeneity means that the disturbance is mean independent of the explanatory variables, i.e., \(E[u | X] = 0\): the disturbance averages to zero regardless of the values of the explanatory variables. When we violate exogeneity, OLS estimates of the regression coefficients are biased and inconsistent.

[19] Write out a simple linear regression model (for the effect of \(x\) on \(y\)) for each of the following specifications and explain how you interpret the coefficient on \(x\).

Fully linear model where \(y\) and \(x\) are both continuous;
Log-linear model where \(y\) and \(x\) are both continuous;
Log-log model where \(y\) and \(x\) are both continuous;
Linear model where \(y\) is binary and \(x\) is continuous;
Linear model where \(y\) is continuous and \(x\) is binary.

Answer Here are the models and interpretations:

Fully linear model

\(y_i = \beta_0 + \beta_1 x_i + u_i\);
\(\beta_1\): when \(x_i\) increases by one unit, we expect \(y_i\) to increase by \(\beta_1\) units.

Log-linear model

\(\log(y_i) = \beta_0 + \beta_1 x_i + u_i\);
\(\beta_1\): when \(x_i\) increases by one unit, we expect \(y_i\) to increase by approximately \(100 \cdot \beta_1\) percent.

Log-log model

\(\log(y_i) = \beta_0 + \beta_1 \log(x_i) + u_i\);
\(\beta_1\): when \(x_i\) increases by one percent, we expect \(y_i\) to increase by \(\beta_1\) percent.

Linear model with binary \(y\) and continuous \(x\)

\(y_i = \beta_0 + \beta_1 x_i + u_i\);
\(\beta_1\): when \(x_i\) increases by one unit, we expect the probability that \(y_i = 1\) to increase by \(\beta_1\).

Linear model with continuous \(y\) and binary \(x\)

\(y_i = \beta_0 + \beta_1 x_i + u_i\);
\(\beta_1\) gives the average difference in \(y_i\) between the two groups defined by \(x_i\) (where \(x_i = 1\) is one group and \(x_i = 0\) is the other group).
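A note on the log-linear interpretation above: reading \(\beta_1\) as a \(100 \cdot \beta_1\) percent change is an approximation that works well when \(\beta_1\) is small. Holding the disturbance fixed, the exact percent change in \(y_i\) from a one-unit increase in \(x_i\) is \(100 \cdot (e^{\beta_1} - 1)\). The sketch below compares the two for a few illustrative values of \(\beta_1\) (the values and function names are chosen here, not taken from the course).

```python
import math

def approx_pct_change(beta1):
    """Rule-of-thumb interpretation: 100 * beta1 percent."""
    return 100 * beta1

def exact_pct_change(beta1):
    """Exact percent change in y implied by a one-unit increase in x."""
    return 100 * (math.exp(beta1) - 1)

# Illustrative coefficient values
for beta1 in (0.01, 0.05, 0.5):
    approx = approx_pct_change(beta1)
    exact = exact_pct_change(beta1)
    print(f"beta1 = {beta1}: approx {approx:.2f}%, exact {exact:.2f}%")
```

The gap between the two is negligible for small coefficients and grows as \(\beta_1\) gets larger, so the rule of thumb is best reserved for small values.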

[20] Explain what information standard errors provide in the context of OLS regression.

Answer Standard errors provide information about the uncertainty inherent to our estimates of the population parameters. They tell us the (estimated) standard deviation of the sampling distribution of our estimator: bigger standard errors mean more uncertainty.
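The "standard deviation of the sampling distribution" idea can be checked by simulation. The sketch below draws many hypothetical samples from the same assumed population, estimates the slope and its standard error in each, and compares the standard deviation of the slope estimates to the average estimated standard error. The population model, sample size, and number of replications are assumptions for illustration.

```python
import math
import random
from statistics import mean, stdev

random.seed(421)
N = 100     # observations per sample (assumed)
REPS = 500  # number of simulated samples (assumed)

def slope_and_se(x, y):
    """OLS slope and its standard error (assuming homoskedasticity)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((a - x_bar) ** 2 for a in x)
    slope = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / s_xx
    intercept = y_bar - slope * x_bar
    ssr = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return slope, math.sqrt(ssr / (n - 2) / s_xx)

slopes, standard_errors = [], []
for _ in range(REPS):
    x = [random.gauss(0, 1) for _ in range(N)]
    y = [1 + 2 * xi + random.gauss(0, 1) for xi in x]  # assumed population model
    b1, se = slope_and_se(x, y)
    slopes.append(b1)
    standard_errors.append(se)

sd_of_estimates = stdev(slopes)              # spread of the sampling distribution
avg_standard_error = mean(standard_errors)   # what a single sample would report

print(f"SD of slope estimates:  {sd_of_estimates:.4f}")
print(f"average standard error: {avg_standard_error:.4f}")
```

The two numbers are close: the standard error computed from a single sample is an estimate of how much the slope estimate would vary across repeated samples.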