# Problem Set 1: Heteroskedasticity, Clustering, and Omitted-Variable Bias

EC 421: Introduction to Econometrics

```r
# Load packages using 'pacman'
library(pacman)
p_load(tidyverse, scales, fixest, here)
```
## 1 Instructions
Due Upload your PDF or HTML answers on Canvas before 11:59PM on Monday, 02 Feb. 2026.
Important Submit your answers as an HTML or PDF file. The submitted file should be built from an RMarkdown (.rmd) or Quarto (.qmd) file. Do not submit the .rmd or .qmd file. You will not receive credit for it.
If we ask you to create a figure or run a regression, then the figure or the regression results should be in the document that you submit (not just the code—we want the actual figure or regression output with coefficients, standard errors, etc.).
Integrity If you are suspected of cheating, then you will receive a zero—for the assignment and possibly for the course. We may report you to the dean. Cheating includes copying from your classmates, from the internet, and from previous assignments.
Objective This problem set has three goals: (1) build your understanding of how violations of OLS assumptions affect inference; (2) continue building your R analytical/coding skillset; (3) practice writing up your results in a clear and concise manner.
README! The data for this problem set (stored as data-ps1.csv) come from a study titled The Association Between Income and Life Expectancy in the United States, 2001-2014 (by Raj Chetty, Augustin Bergeron, David Cutler, Benjamin Scuderi, Michael Stepner, and Nicholas Turner). The authors created a nice website for the project here. Below is a brief description of the project (written by the authors):
How can we reduce socioeconomic disparities in health outcomes? Although it is well known that there are significant differences in health and longevity between income groups, debate remains about the magnitudes and determinants of these differences. We use new data from 1.4 billion anonymous earnings and mortality records to construct more precise estimates of the relationship between income and life expectancy at the national level than was feasible in prior work. We then construct new local area (county and metro area) estimates of life expectancy by income group and identify factors that are associated with higher levels of life expectancy for low-income individuals.
Our findings show that disparities in life expectancy are not inevitable. There are cities throughout America — from New York to San Francisco to Birmingham, AL — where gaps in life expectancy are relatively small or are narrowing over time. Replicating these successes more broadly will require targeted local efforts, focusing on improving health behaviors among the poor in cities such as Las Vegas and Detroit. Our findings also imply that federal programs such as Social Security and Medicare are less redistributive than they might appear because low-income individuals obtain these benefits for significantly fewer years than high-income individuals, especially in cities like Detroit.
Going forward, the challenge is to understand the mechanisms that lead to better health and longevity for low-income individuals in some parts of the U.S. To facilitate future research and monitor local progress, we have posted annual statistics on life expectancy by income group and geographic area (state, CZ, and county) at The Health Inequality Project website. Using these data, researchers will be able to study why certain places have high or improving levels of life expectancy and ultimately apply these lessons to reduce health disparities in other parts of the country.
| Variable name | Variable type | Variable description |
|---|---|---|
| `county_name` | character | County name |
| `county_code` | integer | County FIPS code |
| `state_abb` | character | State abbreviation |
| `income_quartile` | integer | Income quartile (1 or 4) |
| `life_exp` | numeric | Life expectancy (years) |
| `pct_uninsured` | numeric | Percent uninsured |
| `poverty_rate` | numeric | Share below poverty line |
| `pct_religious` | numeric | Percent religious |
| `pct_black` | numeric | Percent Black |
| `pct_hispanic` | numeric | Percent Hispanic |
| `unemployment_rate` | numeric | Unemployment rate |
| `median_hh_inc` | numeric | Median household income |
| `is_urban` | integer | Urban county indicator (1 or 0) |
| `pop` | integer | County population |
| `pop_density` | numeric | Population density |
| `pct_smoke` | numeric | Percent who smoke |
| `pct_obese` | numeric | Percent obese (BMI) |
| `pct_exercise` | numeric | Percent who exercise |
You can find more information about the life-expectancy variables here and the county characteristics here.
## 2 Setup
[00] Let’s start by loading the R packages.
Reminder: You will need to install any packages that are not already installed. After you’ve installed them one time, you will not need to install them again. (The pacman package makes this easier; see the hint below.)
Hints

- Use `pacman` to load/install desired packages.
- You will likely want to use `tidyverse`, `here`, and `fixest` (among others).

As before, here's an example where I load five packages: `pacman` (to load/install the others), `tidyverse` (for data manipulation), `scales` (for formatting numbers), `fixest` (for regressions), and `here` (for managing file paths).
[01] Now load the data (stored in data-ps1.csv).
Remember: The here() function helps with the file path to the data file (as does creating Projects in RStudio).
Answer I’m using read_csv() from the tidyverse package to load the data.
```r
# Load data
life_df = here('problem-sets', '001', 'data-ps1.csv') |> read_csv()
```

```
Rows: 3106 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): county_name, state_abb
dbl (16): county_code, income_quartile, life_exp, pct_uninsured, poverty_rat...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```
## 3 Get to know your data
As with the first problem set, the first step in any data analysis is to get to know your data.
[02] Use R to
- show the number of observations in the dataset,
- show the number of observations without any missing data,
- show the names of the variables in the dataset.
Hints

- The functions dim(), nrow(), and ncol() show the number of rows and columns in a dataset, e.g., nrow(some_data).
- The function na.omit() removes observations with any missing data.
- The function names() shows the variable names in a dataset, e.g., names(some_data).
Answer

```r
# The number of observations in the dataset
life_df |> nrow() |> comma()
```

```
[1] "3,106"
```

```r
# The number of observations without missing data
life_df |> na.omit() |> nrow() |> comma()
```

```
[1] "3,106"
```

```r
# The names of the variables
life_df |> names()
```

```
 [1] "county_name"       "county_code"       "state_abb"
 [4] "income_quartile"   "life_exp"          "pct_uninsured"
 [7] "poverty_rate"      "pct_religious"     "pct_black"
[10] "pct_hispanic"      "unemployment_rate" "median_hh_inc"
[13] "is_urban"          "pop"               "pop_density"
[16] "pct_smoke"         "pct_obese"         "pct_exercise"
```
We have 3,106 total observations in the dataset, and 3,106 observations without any missing data.
The names of the 18 variables are printed above.
## 4 Visualize the data
Let’s visualize some of the key variables in the dataset.
[03] Important The dataset contains a variable called income_quartile, which indicates whether the observation is in the lowest income quartile (1) or the highest income quartile (4). In other words, for every county, there are two observations: income_quartile = 1 for the individuals in the county with incomes below the 25th percentile of income in the US, and income_quartile = 4 for the individuals in the county with incomes above the 75th percentile of income in the US.
Create a histogram of life expectancy (life_exp) for individuals in the lowest income quartile (income_quartile = 1) and another histogram of life expectancy for individuals in the highest income quartile (income_quartile = 4).
You can create two separate plots, or you can try to combine them into a single plot.
Note: Make sure to label your axes. A title would be good too. Aesthetics (colors, themes, etc.) are up to you.
Answer Using ggplot2:

```r
# Histogram of life expectancy by income quartile
ggplot(data = life_df, aes(x = life_exp, fill = factor(income_quartile))) +
  geom_histogram(
    bins = 50, color = 'white', alpha = .80,
    position = 'identity', linewidth = .25
  ) +
  geom_hline(yintercept = 0, color = 'black', linewidth = .25) +
  scale_x_continuous('Life expectancy', labels = comma) +
  scale_y_continuous('Number of counties', labels = comma) +
  scale_fill_viridis_d(
    name = 'Income quartile',
    labels = c('1 (lowest)', '4 (highest)'),
    option = 'rocket',
    begin = .15,
    end = .75
  ) +
  theme_minimal(base_size = 12, base_family = 'Fira Sans Condensed') +
  theme(legend.position = 'bottom')
```

[04] Repeat [03] but instead plot histograms of the share of the population that smokes (pct_smoke) for individuals in the lowest income quartile (income_quartile = 1) and for individuals in the highest income quartile (income_quartile = 4).
Answer

```r
# Histogram of share of population that smokes by income quartile
ggplot(data = life_df, aes(x = pct_smoke, fill = factor(income_quartile))) +
  geom_histogram(
    bins = 50, color = 'white', alpha = .80,
    position = 'identity', linewidth = .25
  ) +
  geom_hline(yintercept = 0, color = 'black', linewidth = .25) +
  scale_x_continuous('Percent smokes', labels = percent) +
  scale_y_continuous('Number of counties', labels = comma) +
  scale_fill_viridis_d(
    name = 'Income quartile',
    labels = c('1 (lowest)', '4 (highest)'),
    option = 'rocket',
    begin = .15,
    end = .75
  ) +
  theme_minimal(base_size = 12, base_family = 'Fira Sans Condensed') +
  theme(legend.position = 'bottom')
```

[05] One more pair of histograms: Repeat [03] but instead plot histograms of the share of the population that exercises (pct_exercise) for individuals in the lowest income quartile (income_quartile = 1) and for individuals in the highest income quartile (income_quartile = 4).
Answer

```r
# Histogram of share of population that exercises by income quartile
ggplot(data = life_df, aes(x = pct_exercise, fill = factor(income_quartile))) +
  geom_histogram(
    bins = 50, color = 'white', alpha = .80,
    position = 'identity', linewidth = .25
  ) +
  geom_hline(yintercept = 0, color = 'black', linewidth = .25) +
  scale_x_continuous('Percent exercises', labels = percent) +
  scale_y_continuous('Number of counties', labels = comma) +
  scale_fill_viridis_d(
    name = 'Income quartile',
    labels = c('1 (lowest)', '4 (highest)'),
    option = 'rocket',
    begin = .15,
    end = .75
  ) +
  theme_minimal(base_size = 12, base_family = 'Fira Sans Condensed') +
  theme(legend.position = 'bottom')
```

[06] Summarize the three histograms you created in [03], [04], and [05]. What are your takeaways about life expectancy, smoking rates, and exercise rates in the US across low- and high-income quartiles?
Answer The difference in life expectancy between the lowest and highest income quartiles is striking: the histograms barely overlap, and the median life expectancy for individuals in the first quartile (low-income) is more than seven years lower than that of individuals in the fourth quartile (high-income). The smoking and exercise histograms point in the same direction: the low-income quartile tends to have higher smoking rates and lower exercise rates than the high-income quartile.
[07] One more plot: Now create a single scatter plot of life expectancy (life_exp on the y axis) against the share of the population that recently exercised (pct_exercise). Color the points by income quartile (income_quartile).
Hint You can use the color aesthetic in ggplot2 to color points by a categorical variable, e.g.,

```r
ggplot(
  data = some_df,
  aes(x = some_x, y = some_y, color = factor(some_category))
) +
  geom_point()
```

For more examples, see here.
Answer

```r
# Scatter plot of life expectancy against percent exercise, colored by income quartile
ggplot(data = life_df, aes(x = pct_exercise, y = life_exp, color = factor(income_quartile))) +
  geom_point(alpha = .60, size = 1.5) +
  scale_x_continuous('Percent exercises', labels = percent) +
  scale_y_continuous('Life expectancy', labels = comma) +
  scale_color_viridis_d(
    name = 'Income quartile',
    labels = c('1 (lowest)', '4 (highest)'),
    option = 'rocket',
    begin = .15,
    end = .75
  ) +
  theme_minimal(base_size = 12, base_family = 'Fira Sans Condensed') +
  theme(legend.position = 'bottom')
```

[08] Summarize the scatter plot you created in [07]. What are your takeaways about the relationship between life expectancy and exercise rates in the US across low- and high-income quartiles?
Answer Lots of options here. Higher exercise rates correlate with higher life expectancy. However, being in the higher income quartile also correlates with both (a) higher exercise rates and (b) higher life expectancy.
## 5 Omitted-variable bias
[09] Suppose we are interested in estimating the effect of exercise on life expectancy, i.e., \[ \text{(Life expectancy)}_i = \beta_0 + \beta_1 \text{Exercise}_i + u_i \] Explain how the scatterplot from [07] shows that we should be concerned about omitted-variable bias for this regression.
Answer The scatterplot shows that income quartile is correlated with both exercise rates and life expectancy. Assuming income impacts life expectancy (and is thus in the disturbance), omitting income quartile from the regression will lead to omitted-variable bias in the OLS estimate of \(\beta_1\).
[10] Using what you observe in the scatterplot from [07], explain whether you expect the OLS estimate of \(\beta_1\) to be biased upward or downward. Explain your answer.
Answer Higher income positively correlates with higher exercise rates. Higher income likely leads to higher life expectancy. Using our handy signing the bias formula, we expect the OLS estimates of \(\beta_1\) to be biased upward—overstating the effect of exercise (when we omit income).
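To make the signing-the-bias logic explicit, here is one common way to write the omitted-variable bias result for this simple regression (a sketch; "Income" stands in for the omitted income measure, with coefficient \(\beta_2\) in the true model):

\[ E\left[\hat{\beta}_1\right] = \beta_1 + \beta_2 \, \frac{\text{Cov}\left(\text{Exercise}_i, \text{Income}_i\right)}{\text{Var}\left(\text{Exercise}_i\right)} \]

Here \(\beta_2 > 0\) (income raises life expectancy) and the covariance is positive (higher-income areas exercise more), so the bias term is positive and OLS overstates \(\beta_1\).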
[11] Create a new variable called high_income that equals 1 if the observation is in the highest income quartile (income_quartile = 4) and 0 otherwise.
Hint: You can use the mutate() function from the dplyr package to create new variables, e.g.,

```r
some_df =
  some_df |>
  mutate(new_variable = as.numeric(old_variable == 'some_value'))
```

The code above creates a new variable called new_variable that equals 1 if old_variable equals 'some_value' and 0 otherwise. The as.numeric() function converts the logical variable (TRUE/FALSE) into a numeric variable (1/0).
Answer

```r
# Create 'high_income' variable
life_df =
  life_df |>
  mutate(high_income = as.numeric(income_quartile == 4))
```

[12] Now use R to estimate
- the regression model above, and
- a regression model that includes exercise and the high-income variable as regressors, \[ \text{(Life expectancy)}_i = \beta_0 + \beta_1 \text{Exercise}_i + \beta_2 \text{(High income)}_i + u_i \]
Provide a table of your results.
Answer

```r
# Estimate regression models
model_a =
  feols(life_exp ~ pct_exercise, data = life_df)
model_b =
  feols(life_exp ~ pct_exercise + high_income, data = life_df)
# Show results
etable(model_a, model_b)
```

```
                          model_a           model_b
Dependent Var.:          life_exp          life_exp
Constant        67.99*** (0.1885) 76.26*** (0.1622)
pct_exercise    19.94*** (0.2545) 4.360*** (0.2655)
high_income                       6.003*** (0.0830)
_______________ _________________ _________________
S.E. type                     IID               IID
Observations                3,106             3,106
R2                        0.66410           0.87496
Adj. R2                   0.66399           0.87488
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
[13] Compare the estimates of \(\beta_1\) from the two regressions in [12]. How does including income quartile as a regressor change the estimate of the effect of exercise on life expectancy? Is this consistent with your expectations from [10]? Explain.
Answer Including the high-income variable substantially decreased the estimate of \(\beta_1\)—moving from approximately 20 to 4. This decrease is consistent with our expectation from [10] that the OLS estimate of \(\beta_1\) would be biased upward when omitting income.
[14] Suppose you have access to a dataset with many more observations, but the dataset does not have information on income. Would increasing the sample size help address omitted-variable bias in this case? Explain your reasoning.
Answer No. Omitted-variable bias comes from violating exogeneity. OLS is both biased and inconsistent when exogeneity is violated. Increasing the sample size will not fix the bias/inconsistency problem.
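A quick simulation can illustrate the point. This is a sketch with made-up, simulated data (the variable names and the `sim_ovb()` helper are hypothetical, not part of the problem set): the slope on exercise stays biased no matter how large the sample gets.

```r
# Simulated example: omitted-variable bias does not vanish as n grows.
set.seed(421)
sim_ovb = function(n) {
  income   = rnorm(n)
  exercise = 0.5 * income + rnorm(n)                    # exercise correlates with income
  life_exp = 70 + 2 * exercise + 5 * income + rnorm(n)  # true slope on exercise = 2
  # Estimate the regression that omits income
  coef(lm(life_exp ~ exercise))['exercise']
}
sim_ovb(1e3)  # biased well above the true value of 2
sim_ovb(1e5)  # still biased; more data does not help
```

Both estimates land near 4 (the true slope plus the bias term), regardless of sample size.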
[15] Imagine that people are not entirely truthful when reporting their exercise habits. How would this untruthful reporting affect OLS estimates of the effect of exercise on life expectancy? Explain your reasoning.
Answer If people misreport their exercise habits, then we have measurement error in a regressor, and OLS is biased. If the measurement error is classical (random noise, uncorrelated with the true value and the disturbance), then OLS estimates are biased toward zero (attenuation bias). The more mismeasurement, the more the estimates are biased toward zero.
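A small simulated sketch of attenuation bias (all variable names here are made up for illustration): with noise in the reported regressor whose variance equals the variance of the true regressor, the attenuation factor is \(1/2\).

```r
# Simulated example: classical measurement error attenuates the OLS slope.
set.seed(421)
n = 1e5
exercise_true     = rnorm(n)
life_exp          = 70 + 3 * exercise_true + rnorm(n)  # true slope = 3
exercise_reported = exercise_true + rnorm(n)           # noisy self-reports
b_noisy = coef(lm(life_exp ~ exercise_reported))['exercise_reported']
b_noisy  # roughly 3 * Var(x) / (Var(x) + Var(noise)) = 1.5
```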
## 6 Testing for heteroskedasticity
[16] Does the scatterplot from [07] suggest that we should be concerned about heteroskedasticity? Explain your reasoning.
Answer Yes, the scatterplot suggests that the variance of life expectancy gets smaller as exercise rates increase. In addition, we seem to have higher-variance disturbances for the low-income quartile. Both of these patterns suggest heteroskedasticity.
[17] If heteroskedasticity is present, how would it affect the OLS estimates in [12]? Explain.
Answer Heteroskedasticity does not bias OLS estimates of the coefficients. However, heteroskedasticity does make the usual OLS standard errors invalid, which affects inference (hypothesis tests and confidence intervals). It also means OLS is no longer the most efficient linear unbiased estimator (BLUE).
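A simulated sketch of the first claim (hypothetical data; not the problem-set variables): even with strongly heteroskedastic disturbances, OLS slope estimates center on the true value across repeated samples.

```r
# Simulated example: heteroskedasticity leaves the OLS slope unbiased.
set.seed(421)
slope_draws = replicate(500, {
  x = runif(200)
  y = 1 + 2 * x + rnorm(200, sd = 0.2 + 2 * x)  # disturbance variance grows with x
  coef(lm(y ~ x))['x']
})
mean(slope_draws)  # close to the true slope of 2
```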
[18] Conduct a Goldfeld-Quandt test for heteroskedasticity on the regression model from [12b] (the model with both exercise and high income as regressors). Use pct_exercise as the variable to order the observations. Report your results and interpret them.
Answer

```r
# Arrange data by 'pct_exercise'
life_df =
  life_df |>
  arrange(pct_exercise)
# Run regressions for top and bottom thirds (1,035 observations each)
gq1 =
  feols(
    life_exp ~ pct_exercise + high_income,
    data = life_df |> head(1035)
  )
gq2 =
  feols(
    life_exp ~ pct_exercise + high_income,
    data = life_df |> tail(1035)
  )
# SSE for each regression
sse1 = sum(residuals(gq1)^2)
sse2 = sum(residuals(gq2)^2)
# Goldfeld-Quandt test statistic
gq_stat = sse1 / sse2
# p-Value
gq_pv = pf(gq_stat, df1 = 1035 - 3, df2 = 1035 - 3, lower.tail = FALSE)
```

The Goldfeld-Quandt test statistic is 1.124, with a p-value of 0.031. Since the p-value is less than 0.05, we reject the null hypothesis of homoskedasticity at the 5% significance level. Thus, we have statistically significant evidence of heteroskedasticity in the regression model.
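As a cross-check, the lmtest package offers a canned Goldfeld-Quandt test, `gqtest()` (assuming lmtest is installed). Here is a sketch on simulated heteroskedastic data; with the problem-set data you would instead pass the life-expectancy formula, set `order.by = ~ pct_exercise`, and use `data = life_df`.

```r
# Sketch: lmtest::gqtest() on simulated data whose variance rises with x
library(lmtest)
set.seed(421)
x = runif(600)
y = 1 + 2 * x + rnorm(600, sd = 0.5 + 2 * x)   # heteroskedastic by construction
# fraction = 1/3 drops the middle third, mirroring the manual test above
gqtest(y ~ x, order.by = ~ x, fraction = 1/3, alternative = 'two.sided')
```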
[19] Now use the White test to test for heteroskedasticity in the regression model from [12b]. Report your results and interpret them.
Important: If you re-ordered your dataset in [18], make sure to re-estimate the regression model from [12b] using this new re-ordered dataset before adding your residuals onto the data frame. Otherwise, your results will be incorrect (because the residuals will not correspond to the correct observations).
Hint: Squaring a binary indicator variable just returns the variable itself.
Answer

```r
# Run the original regression
reg19 =
  feols(
    life_exp ~ pct_exercise + high_income,
    data = life_df
  )
# Get residuals and add to data frame
life_df$resid = residuals(reg19)
# Run White regression
white_reg =
  feols(
    I(resid^2) ~ pct_exercise + I(pct_exercise^2) + high_income + I(pct_exercise * high_income),
    data = life_df
  )
# White test statistic: n times the R2 of the White regression
white_stat = nrow(life_df) * summary(white_reg)$sq.cor
# p-value
white_pv = pchisq(white_stat, df = 4, lower.tail = FALSE)
```

The White test statistic is 31.243, with a p-value of 0.0000027. Since the p-value is less than 0.05, we reject the null hypothesis of homoskedasticity at the 5% significance level. Again, we have statistically significant evidence of heteroskedasticity in the regression model.
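A canned alternative (assuming the lmtest package is installed) is `bptest()` with a White-style auxiliary specification: the regressors, their squares, and their interactions. A sketch on simulated data; with the problem-set data you would pass the life-expectancy model and the auxiliary terms used above.

```r
# Sketch: White-style test via lmtest::bptest() on simulated heteroskedastic data
library(lmtest)
set.seed(421)
n = 600
x = runif(n)
d = rbinom(n, 1, 0.5)                         # binary regressor, like high_income
y = 1 + 2 * x + d + rnorm(n, sd = 0.5 + 2 * x)
# Auxiliary regressors: x, x^2, d, and x*d (d^2 = d, so it is not repeated)
bptest(y ~ x + d, varformula = ~ x + I(x^2) + d + x:d)
```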
[20] Do the Goldfeld-Quandt and White tests always agree on whether heteroskedasticity is present? Explain why or why not.
Answer Not necessarily. The two tests look for different patterns: the Goldfeld-Quandt test only detects variance that changes with the chosen ordering variable, while the White test looks for a more general relationship between the squared disturbances and the regressors (including their squares and interactions), so the tests can disagree. In this case, both tests reject homoskedasticity, but the Goldfeld-Quandt test finds more marginal evidence (a larger p-value), while the White test finds much stronger evidence.
## 7 Living with heteroskedasticity
[21] Let’s update the model slightly. Estimate the following regression model: \[ \text{(Life expectancy)}_i = \beta_0 + \beta_1 \text{Exercise}_i + \beta_2 \text{(High income)}_i + \beta_3 \text{Exercise}_i \times \text{(High income)}_i + u_i \] Show the results of this regression and interpret the coefficient and p-value on the interaction term.
Answer

```r
# Estimate regression model with interaction term
model21 =
  feols(
    life_exp ~ pct_exercise * high_income,
    data = life_df
  )
# Show results
etable(model21)
```

```
                                     model21
Dependent Var.:                     life_exp
Constant                   75.94*** (0.2201)
pct_exercise               4.895*** (0.3640)
high_income                6.838*** (0.3980)
pct_exercise x high_income  -1.141* (0.5318)
__________________________ _________________
S.E. type                                IID
Observations                           3,106
R2                                   0.87514
Adj. R2                              0.87502
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
The coefficient on the interaction term is -1.141, with a p-value of 0.032. This indicates that the effect of exercise on life expectancy is significantly different (lower) for high-income individuals than for low-income individuals. Specifically, moving from 0% to 100% “exercisers” (i.e., from 0 to 1) has a 1.141-year smaller effect on life expectancy for high-income individuals than for low-income individuals.
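One way to read the interaction: the implied exercise slope for the high-income group is \(\beta_1 + \beta_3\). A quick check, copying the coefficients from the table above:

```r
# Implied exercise slopes by income group, using coefficients from the table above
slope_low  = 4.895             # low-income group: beta1
slope_high = 4.895 + (-1.141)  # high-income group: beta1 + beta3
slope_high
# 3.754
```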
[22] Update your standard errors to be heteroskedasticity robust.
Hint: Remember that the fixest package allows you to get heteroskedasticity-robust standard errors using the vcov = 'het' argument in the feols() function, the summary() function, and/or the etable() function.
Answer

```r
# Use heteroskedasticity-robust standard errors
model21 |> etable(vcov = 'het')
```

```
                                     model21
Dependent Var.:                     life_exp
Constant                   75.94*** (0.2413)
pct_exercise               4.895*** (0.4068)
high_income                6.838*** (0.4755)
pct_exercise x high_income  -1.141. (0.6266)
__________________________ _________________
S.E. type                  Heteroskeda.-rob.
Observations                           3,106
R2                                   0.87514
Adj. R2                              0.87502
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
[23] Does using heteroskedasticity-robust standard errors change any of your conclusions/inferences from [21]? Explain.
Answer To a degree, yes. The standard errors increase by about 20%, which moves the p-value on the interaction term from significant at the 5% level to significant only at the 10% level. Thus, using heteroskedasticity-robust standard errors does change our inference about the statistical significance of the interaction term.
[24] Would it make sense to use weighted least squares (WLS) here? Explain why or why not.
Hints: Think about what the data represent, how WLS works, and the example we discussed in class.
Answer It certainly could. The data are heteroskedastic, and WLS is more efficient than OLS in the presence of heteroskedasticity (if we know the correct weights). However, we do not know the correct weights. We could use population as weights, as each observation is actually a summary of a group defined by county and income-quartile. Because these groups have different sizes, population weighting could make some sense—though this difference-in-sizes may not be the sole source of heteroskedasticity.
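If we did weight by population, both base R and fixest support weights directly. A sketch on simulated grouped data (the variable names here are hypothetical; with the problem-set data the weight would be the pop column of life_df):

```r
# Sketch: population-weighted least squares on simulated grouped data,
# where larger groups have less noisy outcomes
set.seed(421)
sim_df = data.frame(
  x   = runif(200),
  pop = sample(1e3:1e6, 200)
)
sim_df$y = 1 + 2 * sim_df$x + rnorm(200, sd = 1 / sqrt(sim_df$pop / 1e4))
# Base R: lm() with weights proportional to population
wls = lm(y ~ x, data = sim_df, weights = pop)
coef(wls)['x']  # close to the true slope of 2
```

With fixest, the analogous call would be feols(life_exp ~ pct_exercise * high_income, data = life_df, weights = ~ pop).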
## 8 Clustering
[25] Explain the concept of correlated disturbances and how it affects inference for OLS estimates.
Answer As the name suggests, correlated disturbances occur when the disturbances of different observations are not independent—i.e., there is something that links the disturbances of different observations (causing them to correlate).

When disturbances correlate, we violate the classical OLS assumption \(\text{Cov}(u_i, u_j) = 0\) for \(i \neq j\). This violation biases the usual OLS standard errors, which again affects inference (hypothesis tests and confidence intervals).
[26] Why might we expect correlated disturbances in this dataset? Explain your reasoning.
Answer We might expect the disturbances to correlate here because we have two observations per county (one for low-income individuals and one for high-income individuals). If there are county-specific factors that affect life expectancy (aside from the included regressors), then these factors will be in the disturbance term for both observations in the county, causing the disturbances to correlate. Similarly, we might expect larger areas (e.g., states) to have correlated disturbances.
[27] Update your regression from [21] to use cluster-robust standard errors, clustering at the state level (state_abb). Show the results.
Hint: Remember that the feols() function allows you to get cluster-robust standard errors via cluster = ~ some_var (where some_var is the clustering variable).

Answer

```r
# Use cluster-robust standard errors (clustered by state)
model21 |> etable(cluster = ~ state_abb)
```

```
                                     model21
Dependent Var.:                     life_exp
Constant                   75.94*** (0.4514)
pct_exercise               4.895*** (0.8347)
high_income                6.838*** (0.5194)
pct_exercise x high_income   -1.141 (0.7530)
__________________________ _________________
S.E.: Clustered               by: state_abb
Observations                           3,106
R2                                   0.87514
Adj. R2                              0.87502
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
[28] How do the cluster-robust standard errors compare to the heteroskedasticity-robust standard errors from [22]? Explain any differences you observe. Did accounting for potential correlated disturbances change any of your conclusions/inferences from [22]?
Answer The cluster-robust standard errors (allowing for correlated disturbances at the state level) are about 20% larger than the heteroskedasticity-robust standard errors (and approximately 40% larger than the “classic” OLS standard errors). This increase in the standard errors moves the p-value on the interaction term to above 0.10, meaning we would no longer consider the interaction between high income and exercise to be statistically significant at conventional significance levels.