Problem Set 1: Heteroskedasticity, Clustering, and Omitted-Variable Bias

EC 421: Introduction to Econometrics

Author

Edward Rubin

1 Instructions

Due: Upload your PDF or HTML answers on Canvas before 11:59PM on Monday, 02 Feb. 2026.

Important: Submit your answers as an HTML or PDF file. The submitted file should be built from an RMarkdown (.rmd) or Quarto (.qmd) file. Do not submit the .rmd or .qmd file. You will not receive credit for it.

If we ask you to create a figure or run a regression, then the figure or the regression results should be in the document that you submit (not just the code—we want the actual figure or regression output with coefficients, standard errors, etc.).

Integrity: If you are suspected of cheating, then you will receive a zero for the assignment and possibly for the course. We may report you to the dean. Cheating includes copying from your classmates, from the internet, and from previous assignments.

Objective: This problem set has three goals: (1) build your understanding of how violations of OLS assumptions affect inference; (2) continue building your R analytical/coding skillset; (3) practice writing up your results in a clear and concise manner.

README! The data for this problem set (stored as data-ps1.csv) come from a study titled The Association Between Income and Life Expectancy in the United States, 2001-2014 (by Raj Chetty, Augustin Bergeron, David Cutler, Benjamin Scuderi, Michael Stepner, and Nicholas Turner). The authors created a nice website for the project (The Health Inequality Project website). Below is a brief description of the project (written by the authors; see footnote 1):

How can we reduce socioeconomic disparities in health outcomes? Although it is well known that there are significant differences in health and longevity between income groups, debate remains about the magnitudes and determinants of these differences. We use new data from 1.4 billion anonymous earnings and mortality records to construct more precise estimates of the relationship between income and life expectancy at the national level than was feasible in prior work. We then construct new local area (county and metro area) estimates of life expectancy by income group and identify factors that are associated with higher levels of life expectancy for low-income individuals.

Our findings show that disparities in life expectancy are not inevitable. There are cities throughout America — from New York to San Francisco to Birmingham, AL — where gaps in life expectancy are relatively small or are narrowing over time. Replicating these successes more broadly will require targeted local efforts, focusing on improving health behaviors among the poor in cities such as Las Vegas and Detroit. Our findings also imply that federal programs such as Social Security and Medicare are less redistributive than they might appear because low-income individuals obtain these benefits for significantly fewer years than high-income individuals, especially in cities like Detroit.

Going forward, the challenge is to understand the mechanisms that lead to better health and longevity for low-income individuals in some parts of the U.S. To facilitate future research and monitor local progress, we have posted annual statistics on life expectancy by income group and geographic area (state, CZ, and county) at The Health Inequality Project website. Using these data, researchers will be able to study why certain places have high or improving levels of life expectancy and ultimately apply these lessons to reduce health disparities in other parts of the country.

Variable names and descriptions

Variable name      Variable type  Variable description
county_name        character      County name
county_code        integer        County FIPS code
state_abb          character      State abbreviation
income_quartile    character      Income quartile (1 or 4)
life_exp           numeric        Life expectancy (years)
pct_uninsured      numeric        Percent uninsured
poverty_rate       numeric        Share below poverty line
pct_religious      numeric        Percent religious
pct_black          numeric        Percent Black
pct_hispanic       numeric        Percent Hispanic
unemployment_rate  numeric        Unemployment rate
median_hh_inc      numeric        Median household income
is_urban           integer        Urban county indicator (1 or 0)
pop                integer        County population
pop_density        numeric        Population density
pct_smoke          numeric        Percent who smoke
pct_obese          numeric        Percent obese (BMI)
pct_exercise       numeric        Percent who exercise

You can find more information about the life-expectancy variables here and the county characteristics here.

2 Setup

[00] Let’s start by loading the R packages.

Reminder: You will need to install any packages that are not already installed. After you’ve installed them one time, you will not need to install them again. (The pacman package makes this easier; see the hint below.)

Hints

  • Use pacman to load/install desired packages.
  • You will likely want to use tidyverse, here, and fixest (among others).

As before, here’s an example where I load four packages:

  • tidyverse (for data manipulation),
  • scales (for formatting numbers),
  • fixest (for regressions),
  • here (for managing file paths).
# Load packages using 'pacman'
library(pacman)
p_load(tidyverse, scales, fixest, here)

[01] Now load the data (stored in data-ps1.csv).

Remember: The here() function helps with the file path to the data file (as does creating Projects in RStudio).

3 Get to know your data

As with the first problem set, the first step in any data analysis is to get to know your data.

[02] Use R to

  • show the number of observations in the dataset,
  • show the number of observations without any missing data,
  • show the names of the variables in the dataset.

Hints

  1. The functions dim(), nrow(), ncol() show the number of rows and columns in a dataset, e.g., nrow(some_data).
  2. The function na.omit() removes observations with any missing data.
  3. The function names() shows the variable names in a dataset, e.g., names(some_data).
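To make the three hints concrete, here is a minimal sketch on a tiny made-up data frame. The data frame toy_df and its values are hypothetical (they are not from data-ps1.csv); only the functions come from the hints above.

```r
# Toy data frame (hypothetical values) with one missing entry
toy_df = data.frame(
  county = c("A", "B", "C", "D"),
  life_exp = c(78.2, NA, 81.5, 79.9),
  pct_smoke = c(0.21, 0.18, 0.12, 0.25)
)

# Number of observations (rows); dim() gives rows and columns together
n_obs = nrow(toy_df)
dim(toy_df)

# Number of observations without any missing data:
# na.omit() drops every row that contains at least one NA
n_complete = nrow(na.omit(toy_df))

# Names of the variables
var_names = names(toy_df)

n_obs
n_complete
var_names
```

The same three calls should work unchanged on the problem-set data once you swap toy_df for your own data frame.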

4 Visualize the data

Let’s visualize some of the key variables in the dataset.

[03] Important: The dataset contains a variable called income_quartile, which indicates whether the observation is in the lowest income quartile (1) or the highest income quartile (4). In other words, for every county, there are two observations: income_quartile = 1 for the individuals in the county with incomes below the 25th percentile of income in the US, and income_quartile = 4 for the individuals in the county with incomes above the 75th percentile of income in the US.

Create a histogram of life expectancy (life_exp) for individuals in the lowest income quartile (income_quartile = 1) and another histogram of life expectancy for individuals in the highest income quartile (income_quartile = 4).

You can create two separate plots or try to combine them into a single plot.

Note: Make sure to label your axes. A title would be good too. Aesthetics (colors, themes, etc.) are up to you.
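If you have not built a grouped histogram in ggplot2 before, the sketch below shows one possible layout on simulated data. Everything in it is an assumption for illustration: toy_df, group, and outcome are hypothetical names, and faceting is only one of several reasonable ways to show two histograms.

```r
library(ggplot2)

# Simulated data: two groups with different means (hypothetical values)
set.seed(421)
toy_df = data.frame(
  group = rep(c(1, 4), each = 200),
  outcome = c(rnorm(200, mean = 10, sd = 2), rnorm(200, mean = 14, sd = 2))
)

# One histogram per group, stacked vertically so the x axes line up
hist_plot = ggplot(
  data = toy_df,
  aes(x = outcome, fill = factor(group))
) +
  geom_histogram(bins = 30, color = "white") +
  facet_wrap(~ group, ncol = 1) +
  labs(
    x = "Outcome (simulated units)",
    y = "Number of observations",
    fill = "Group",
    title = "Distribution of a simulated outcome by group"
  ) +
  theme_minimal()

hist_plot
```

Stacking the panels with a shared x axis makes it easier to compare where each distribution sits; overlaying the two histograms with some transparency is another option.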

[04] Repeat [03] but instead plot histograms of the share of the population that smokes (pct_smoke) for individuals in the lowest income quartile (income_quartile = 1) and for individuals in the highest income quartile (income_quartile = 4).

[05] One more pair of histograms: Repeat [03] but instead plot histograms of the share of the population that exercises (pct_exercise) for individuals in the lowest income quartile (income_quartile = 1) and for individuals in the highest income quartile (income_quartile = 4).

[06] Summarize the three pairs of histograms you created in [03], [04], and [05]. What are your takeaways about life expectancy, smoking rates, and exercise rates in the US across low- and high-income quartiles?

[07] One more plot: Now create a single scatter plot of life expectancy (life_exp on the y axis) against the share of the population that recently exercised (pct_exercise). Color the points by income quartile (income_quartile).

Hint: You can use the color aesthetic in ggplot2 to color points by a categorical variable, e.g.,

ggplot(
  data = some_df,
  aes(x = some_x, y = some_y, color = factor(some_category))
) +
geom_point()

For more examples, see here.

[08] Summarize the scatter plot you created in [07]. What are your takeaways about the relationship between life expectancy and exercise rates in the US across low- and high-income quartiles?

5 Omitted-variable bias

[09] Suppose we are interested in estimating the effect of exercise on life expectancy, i.e., \[ \text{(Life expectancy)}_i = \beta_0 + \beta_1 \text{Exercise}_i + u_i \] Explain how the scatterplot from [07] shows that we should be concerned about omitted-variable bias for this regression.

[10] Using what you observe in the scatterplot from [07], explain whether you expect the OLS estimate of \(\beta_1\) to be biased upward or downward. Explain your answer.

[11] Create a new variable called high_income that equals 1 if the observation is in the highest income quartile (income_quartile = 4) and 0 otherwise.

Hint: You can use the mutate() function from the dplyr package to create new variables, e.g.,

some_df =
  some_df |>
  mutate(new_variable = as.numeric(old_variable == 'some_value'))

The code above creates a new variable called new_variable that equals 1 if old_variable equals 'some_value' and 0 otherwise. The as.numeric() function converts the logical variable (TRUE/FALSE) into a numeric variable (1/0).
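Here is a small runnable version of that pattern on made-up data, keeping the intermediate logical column so you can see what as.numeric() does. The names toy_df, category, is_four_lgl, and is_four are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical values)
toy_df = tibble(
  id = 1:4,
  category = c(1, 4, 4, 1)
)

toy_df =
  toy_df |>
  mutate(
    # The comparison alone returns TRUE/FALSE
    is_four_lgl = category == 4,
    # Wrapping it in as.numeric() converts TRUE/FALSE to 1/0
    is_four = as.numeric(category == 4)
  )

toy_df
```

If the column you are comparing is stored as character rather than numeric, compare against a string instead (for example, '4').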

[12] Now use R to estimate

  1. the regression model above, and
  2. a regression model that includes exercise and the high-income variable as regressors, \[ \text{(Life expectancy)}_i = \beta_0 + \beta_1 \text{Exercise}_i + \beta_2 \text{(High income)}_i + u_i \]

Provide a table of your results.
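One way to put two regressions side by side is etable() from fixest. The sketch below uses simulated data, so sim_df, y, x, and d are hypothetical names and the coefficients in the data-generating process are made up; it only illustrates the mechanics of estimating two nested models and printing them in one table.

```r
library(fixest)

# Simulated data (hypothetical): d is a binary regressor correlated with x
set.seed(421)
n = 500
d = rbinom(n, size = 1, prob = 0.5)
x = 0.3 + 0.2 * d + rnorm(n, sd = 0.1)
y = 1 + 2 * x + 3 * d + rnorm(n)
sim_df = data.frame(y, x, d)

# Model 1: y on x only
est_short = feols(y ~ x, data = sim_df)
# Model 2: y on x and d
est_long = feols(y ~ x + d, data = sim_df)

# One table with a column per model
etable(est_short, est_long)
```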

[13] Compare the estimates of \(\beta_1\) from the two regressions in [12]. How does including income quartile as a regressor change the estimate of the effect of exercise on life expectancy? Is this consistent with your expectations from [10]? Explain.

[14] Suppose you have access to a dataset with many more observations, but the dataset does not have information on income. Would increasing the sample size help address omitted-variable bias in this case? Explain your reasoning.

[15] Imagine that people are not entirely truthful when reporting their exercise habits. How would this untruthful reporting affect OLS estimates of the effect of exercise on life expectancy? Explain your reasoning.

6 Testing for heteroskedasticity

[16] Does the scatterplot from [07] suggest that we should be concerned about heteroskedasticity? Explain your reasoning.

[17] If heteroskedasticity is present, how would it affect the OLS estimates in [12]? Explain.

[18] Conduct a Goldfeld-Quandt test for heteroskedasticity on the regression model from [12b] (the model with both exercise and high income as regressors). Use pct_exercise as the variable to order the observations. Report your results and interpret them.
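As a reminder of the mechanics, here is a minimal sketch of a Goldfeld-Quandt test on simulated data. Several choices are assumptions made for illustration rather than requirements of this question: the data are heteroskedastic by construction, the two groups are the first and last thirds of the ordered sample, and the test is one-sided (variance increasing in x). Use the version covered in class if it differs.

```r
# Simulated data (hypothetical): disturbance sd grows with x
set.seed(421)
n = 600
d = rbinom(n, size = 1, prob = 0.5)
x = runif(n, min = 0, max = 1)
u = rnorm(n, sd = 0.5 + 2 * x)
y = 1 + 2 * x + 3 * d + u
sim_df = data.frame(y, x, d)

# Step 1: Order the observations by the variable of interest
sim_df = sim_df[order(sim_df$x), ]

# Step 2: Split off the first and last thirds (assumed group size)
n_g = floor(n / 3)
low_df = head(sim_df, n_g)
high_df = tail(sim_df, n_g)

# Step 3: Estimate the model separately in each group and compute each SSE
est_low = lm(y ~ x + d, data = low_df)
est_high = lm(y ~ x + d, data = high_df)
sse_low = sum(residuals(est_low)^2)
sse_high = sum(residuals(est_high)^2)

# Step 4: F statistic and one-sided p-value
# k = number of estimated parameters (intercept, x, d)
k = 3
gq_stat = sse_high / sse_low
gq_p = pf(gq_stat, df1 = n_g - k, df2 = n_g - k, lower.tail = FALSE)

gq_stat
gq_p
```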

[19] Now use the White test to test for heteroskedasticity in the regression model from [12b]. Report your results and interpret them.

Important: If you re-ordered your dataset in [18], make sure to re-estimate the regression model from [12b] using this new re-ordered dataset before adding your residuals onto the data frame. Otherwise, your results will be incorrect (because the residuals will not correspond to the correct observations).

Hint: The square of a binary indicator variable is just the variable itself.
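The sketch below walks through a White test by hand on simulated data. The data, the names (sim_df, e2, est_aux), and the heteroskedastic design are all hypothetical. Because d is binary, d^2 equals d, so the auxiliary regression includes d only once.

```r
# Simulated data (hypothetical): disturbance sd grows with x
set.seed(421)
n = 600
d = rbinom(n, size = 1, prob = 0.5)
x = runif(n, min = 0, max = 1)
u = rnorm(n, sd = 0.5 + 2 * x)
y = 1 + 2 * x + 3 * d + u
sim_df = data.frame(y, x, d)

# Step 1: Estimate the model and store the squared residuals
est = lm(y ~ x + d, data = sim_df)
sim_df$e2 = residuals(est)^2

# Step 2: Regress squared residuals on the regressors, their squares,
# and their interaction (d^2 is dropped because it equals d)
est_aux = lm(e2 ~ x + d + I(x^2) + x:d, data = sim_df)
r2_aux = summary(est_aux)$r.squared

# Step 3: LM statistic = n * R^2, compared to a chi-squared distribution
# with degrees of freedom equal to the number of auxiliary slope terms
n_terms = length(coef(est_aux)) - 1
lm_stat = n * r2_aux
white_p = pchisq(lm_stat, df = n_terms, lower.tail = FALSE)

lm_stat
white_p
```

Note that the residuals are added to the same data frame that was used to estimate the model, so each residual lines up with its own observation.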

[20] Do the Goldfeld-Quandt and White tests always agree on whether heteroskedasticity is present? Explain why or why not.

7 Living with heteroskedasticity

[21] Let’s update the model slightly. Estimate the following regression model: \[ \text{(Life expectancy)}_i = \beta_0 + \beta_1 \text{Exercise}_i + \beta_2 \text{(High income)}_i + \beta_3 \text{Exercise}_i \times \text{(High income)}_i + u_i \] Show the results of this regression and interpret the coefficient and p-value on the interaction term.

[22] Update your standard errors to be heteroskedasticity robust.

Hint: Remember that the fixest package allows you to get heteroskedasticity-robust standard errors using the vcov = 'het' argument in the feols() function, the summary() function, and/or the etable() function.
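A minimal sketch of the two approaches on simulated data follows. The data and names (sim_df, est_iid, est_het) are hypothetical, and the sketch spells the option out as "hetero", which is the keyword given in the fixest documentation.

```r
library(fixest)

# Simulated data (hypothetical): disturbance sd grows with x
set.seed(421)
n = 600
d = rbinom(n, size = 1, prob = 0.5)
x = runif(n, min = 0, max = 1)
y = 1 + 2 * x + 3 * d + 0.5 * x * d + rnorm(n, sd = 0.5 + 2 * x)
sim_df = data.frame(y, x, d)

# Classical (IID) standard errors
est_iid = feols(y ~ x + d + x:d, data = sim_df, vcov = "iid")

# Option 1: Request robust standard errors at estimation time
est_het = feols(y ~ x + d + x:d, data = sim_df, vcov = "hetero")

# Option 2: Re-compute standard errors for an existing estimate
est_het_2 = summary(est_iid, vcov = "hetero")

# Same coefficients in every column; only the standard errors change
etable(est_iid, est_het, est_het_2)
```

Either option changes only the standard errors (and the t statistics and p-values built from them); the coefficient estimates are identical.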

[23] Does using heteroskedasticity-robust standard errors change any of your conclusions/inferences from [21]? Explain.

[24] Would it make sense to use weighted least squares (WLS) here? Explain why or why not.

Hints: Think about what the data represent, how WLS works, and the example we discussed in class.

8 Clustering

[25] Explain the concept of correlated disturbances and how it affects inference for OLS estimates.

[26] Why might we expect correlated disturbances in this dataset? Explain your reasoning.

[27] Update your regression from [21] to use cluster-robust standard errors, clustering at the state level (state_abb). Show the results.

Hint: Remember that the feols() function allows you to get cluster-robust standard errors via cluster = ~ some_var (where some_var is the clustering variable).
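Here is a short sketch of that syntax on simulated grouped data. All names (sim_df, g, est_het, est_cl) and the data-generating process are hypothetical; the group-level shock is included so that disturbances are correlated within each group by construction.

```r
library(fixest)

# Simulated data (hypothetical): 40 groups with 25 observations each
set.seed(421)
n_groups = 40
n_per = 25
g = rep(1:n_groups, each = n_per)

# Both x and the disturbance share a group-level component
x = rnorm(n_groups)[g] + rnorm(n_groups * n_per)
group_shock = rnorm(n_groups)[g]
y = 1 + 2 * x + group_shock + rnorm(n_groups * n_per)
sim_df = data.frame(y, x, g)

# Heteroskedasticity-robust standard errors
est_het = feols(y ~ x, data = sim_df, vcov = "hetero")

# Cluster-robust standard errors, clustering on g
est_cl = feols(y ~ x, data = sim_df, cluster = ~ g)

# Same coefficients; the standard errors differ
etable(est_het, est_cl)
```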

[28] How do the cluster-robust standard errors compare to the heteroskedasticity-robust standard errors from [22]? Explain any differences you observe. Did accounting for potential correlated disturbances change any of your conclusions/inferences from [22]?

Footnotes

  1. You can find the full paper on JAMA and a non-technical summary from the authors here.