Problem Set 1: Heteroskedasticity, Clustering, and Omitted-Variable Bias
EC 421: Introduction to Econometrics
Author
Edward Rubin
1 Instructions
Due: Upload your PDF or HTML answers on Canvas before 11:59 PM on Wednesday, 29 April 2026.
Important: Submit your answers as an HTML or PDF file. The submitted file should be built from an RMarkdown (.rmd) or Quarto (.qmd) file. Do not submit the .rmd or .qmd file. You will not receive credit for it.
If we ask you to create a figure or run a regression, then the figure or the regression results should be in the document that you submit (not just the code—we want the actual figure or regression output with coefficients, standard errors, etc.).
Integrity: If you are suspected of cheating, then you will receive a zero for the assignment and possibly for the course. We may report you to the dean. Cheating includes copying from your classmates, from the internet, and from previous assignments.
Objective: This problem set has three goals: (1) build your understanding of how violations of OLS assumptions affect inference; (2) continue building your R analytical/coding skillset; (3) practice writing up your results in a clear and concise manner.
README! The dataset (stored as data-ps1.csv) for this problem set is the same dataset that you used in the first problem set.
Reminder: The data come from a study titled The Association Between Income and Life Expectancy in the United States, 2001-2014 (by Raj Chetty, Augustin Bergeron, David Cutler, Benjamin Scuderi, Michael Stepner, and Nicholas Turner). The authors created a nice website for the project here. Below is a brief description of the project (written by the authors) [1]:
How can we reduce socioeconomic disparities in health outcomes? Although it is well known that there are significant differences in health and longevity between income groups, debate remains about the magnitudes and determinants of these differences. We use new data from 1.4 billion anonymous earnings and mortality records to construct more precise estimates of the relationship between income and life expectancy at the national level than was feasible in prior work. We then construct new local area (county and metro area) estimates of life expectancy by income group and identify factors that are associated with higher levels of life expectancy for low-income individuals.
Our findings show that disparities in life expectancy are not inevitable. There are cities throughout America — from New York to San Francisco to Birmingham, AL — where gaps in life expectancy are relatively small or are narrowing over time. Replicating these successes more broadly will require targeted local efforts, focusing on improving health behaviors among the poor in cities such as Las Vegas and Detroit. Our findings also imply that federal programs such as Social Security and Medicare are less redistributive than they might appear because low-income individuals obtain these benefits for significantly fewer years than high-income individuals, especially in cities like Detroit.
Going forward, the challenge is to understand the mechanisms that lead to better health and longevity for low-income individuals in some parts of the U.S. To facilitate future research and monitor local progress, we have posted annual statistics on life expectancy by income group and geographic area (state, CZ, and county) at The Health Inequality Project website. Using these data, researchers will be able to study why certain places have high or improving levels of life expectancy and ultimately apply these lessons to reduce health disparities in other parts of the country.
Variable names and descriptions

| Variable name | Variable type | Variable description |
|---|---|---|
| county_name | character | County name |
| county_code | integer | County FIPS code |
| state_abb | character | State abbreviation |
| income_quartile | numeric | Income quartile (either 1 or 4) |
| life_exp | numeric | Life expectancy (years) |
| pct_uninsured | numeric | Proportion uninsured |
| poverty_rate | numeric | Share below poverty line |
| pct_religious | numeric | Proportion religious |
| pct_black | numeric | Proportion Black |
| pct_hispanic | numeric | Proportion Hispanic |
| unemployment_rate | numeric | Unemployment rate |
| median_hh_inc | numeric | Median household income (in thousands of dollars) |
| is_urban | integer | Urban county indicator (1 or 0) |
| pop | integer | County population |
| pop_density | numeric | Population density |
| pct_smoke | numeric | Proportion who smoke |
| pct_obese | numeric | Proportion obese (BMI) |
| pct_exercise | numeric | Proportion who exercise |
You can find more information about the life-expectancy variables here and the county characteristics here.
2 Setup
[00] Create a new RMarkdown (.rmd) or Quarto (.qmd) file for this problem set.
If you are new to Quarto, then check out this guide to getting started with Quarto in RStudio.
[01] Let’s start by loading the R packages and then the data.
File management: I recommend using “Projects” in RStudio to help with file management. If you create a Project, then R will automatically start looking for files in the Project folder. Click here for a short guide to using Projects in RStudio.
Additional hints
Use pacman to load/install desired packages.
You will likely want to use tidyverse, scales, fixest, and here (among others).
The here() function helps with file paths, especially if you create Projects in RStudio.
Rows: 3106 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): county_name, state_abb
dbl (16): county_code, income_quartile, life_exp, pct_uninsured, poverty_rat...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Here, I am telling R that relative to my current working directory (the folder R is looking at), the CSV file data-ps1.csv is in the 001 folder, which is inside the problem-sets folder.
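The loading step described above might look like the following sketch. It assumes the folder layout mentioned in the text (problem-sets/001/data-ps1.csv inside your Project). The object names data_path and ps_df are placeholders, and the file.exists() guard is only there so the snippet runs even when the file is not in that location.

```r
# Load (and, if needed, install) packages with pacman
library(pacman)
p_load(tidyverse, fixest, here)

# Build the path relative to the Project root (folder layout assumed from the text)
data_path <- here("problem-sets", "001", "data-ps1.csv")

# Read the data only if the file is where we expect it
if (file.exists(data_path)) {
  ps_df <- read_csv(data_path)
}
```

If your file lives somewhere else, change the folder names passed to here() rather than hard-coding a full path.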
3 Get to Know Your Data
As always, the first step in any data analysis is to get to know your data.
[02] Use R to report
the number of observations,
the number of variables,
the number of complete observations,
the number of distinct counties,
the number of distinct states,
the number of observations in each income quartile,
the number of missing values in each variable.
Hints
The functions nrow() and ncol() show the number of rows and columns in a dataset.
The function complete.cases() tells you whether an observation has any missing values.
The function n_distinct() counts the number of unique values.
The function count() is useful for counting observations by groups.
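As a small illustration of the functions in these hints, here is a sketch on a made-up toy dataset (toy_df and its columns are hypothetical, not the problem-set data).

```r
library(dplyr)

# Toy data (made up for illustration): two rows per county, one missing value
toy_df <- tibble(
  county = c("A", "A", "B", "B", "C", "C"),
  state  = c("OR", "OR", "OR", "OR", "WA", "WA"),
  group  = c(1, 4, 1, 4, 1, 4),
  y      = c(78.1, 85.3, NA, 86.0, 77.4, 84.9)
)

n_obs      <- nrow(toy_df)                  # number of observations (rows)
n_vars     <- ncol(toy_df)                  # number of variables (columns)
n_complete <- sum(complete.cases(toy_df))   # rows with no missing values
n_counties <- n_distinct(toy_df$county)     # number of distinct counties
group_n    <- count(toy_df, group)          # number of rows in each group
na_by_var  <- colSums(is.na(toy_df))        # missing values in each variable
```

The same calls should carry over to the real dataset once you swap in its object and variable names.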
[03] Create a new variable high_income which equals 1 if income_quartile == 4 and equals 0 otherwise.
Hint: You can use mutate().
[04] Make a summary table that reports the mean and standard deviation of the following variables by income group:
life expectancy (life_exp),
smoking rate (pct_smoke),
obesity rate (pct_obese),
exercise rate (pct_exercise).
In one or two sentences, summarize the differences you see across income groups.
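One possible pattern for an indicator variable and a grouped summary is sketched below on made-up data. The names toy_df, top_group, and y are hypothetical; the real table would repeat the mean and standard deviation for each requested variable.

```r
library(dplyr)

# Toy data (made up): two observations in each of two groups
toy_df <- tibble(
  quartile = c(1, 1, 4, 4),
  y        = c(78, 80, 85, 87)
)

summary_df <- toy_df |>
  # Indicator: 1 if the row is in the top group, 0 otherwise
  mutate(top_group = as.integer(quartile == 4)) |>
  # One row of summary statistics per group
  group_by(top_group) |>
  summarize(mean_y = mean(y), sd_y = sd(y))

summary_df
```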
4 Visualize the Data
Let’s visualize several key relationships before estimating regression models.
[05] Create side-by-side boxplots (geom_boxplot) of life expectancy (life_exp) by income quartile (income_quartile). Separate the figure by urban status (is_urban).
Note: Make sure to label your axes. A title would be good too. Aesthetics (colors, themes, etc.) are up to you.
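A minimal sketch of side-by-side boxplots split into panels, using simulated data (the variables group, urban, and y are stand-ins, and the numbers are invented). Wrapping the grouping variable in factor() makes ggplot2 treat it as categorical.

```r
library(ggplot2)

# Simulated data (invented values): two groups, each observed in both panels
set.seed(421)
toy_df <- data.frame(
  group = rep(c(1, 4), each = 50),
  urban = rep(c(0, 1), times = 50),
  y     = rnorm(100, mean = rep(c(78, 86), each = 50), sd = 2)
)

box_plot <- ggplot(toy_df, aes(x = factor(group), y = y)) +
  geom_boxplot() +
  # One panel per value of urban; label_both prints "urban: 0" and "urban: 1"
  facet_wrap(~ urban, labeller = label_both) +
  labs(x = "Group", y = "Outcome", title = "Outcome by group and urban status") +
  theme_minimal()

box_plot
```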
[06] Create a density plot of smoking rates (pct_smoke) by income quartile (income_quartile).
Hint: geom_density() works similarly to geom_histogram(), but it shows a smoothed density rather than counts.
[07] Create a scatter plot of life expectancy (life_exp, on the y axis) against smoking rates (pct_smoke, on the x axis). Color the points by income quartile (income_quartile) and add separate linear fitted lines for each income quartile.
Hints
Use the color aesthetic to color points by income quartile.
Add geom_smooth(method = 'lm', se = FALSE) to add linear fitted lines.
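The two hints can be combined as in the sketch below, which uses simulated data (x, y, and group are hypothetical stand-ins). Note the factor() around the grouping variable: if a numeric variable is mapped to color directly, ggplot2 uses a continuous color scale and fits a single line rather than one line per group.

```r
library(ggplot2)

# Simulated data (invented values): two groups with different levels of y
set.seed(421)
n <- 200
toy_df <- data.frame(group = rep(c(1, 4), each = n / 2))
toy_df$x <- runif(n, 0.1, 0.4) - 0.02 * (toy_df$group == 4)
toy_df$y <- 80 + 6 * (toy_df$group == 4) - 10 * toy_df$x + rnorm(n)

scatter_plot <- ggplot(toy_df, aes(x = x, y = y, color = factor(group))) +
  geom_point(alpha = 0.5) +
  # Because color is a factor, a separate linear fit is drawn for each group
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "x", y = "y", color = "Group", title = "y against x, by group") +
  theme_minimal()

scatter_plot
```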
[08] Summarize your figures from [05], [06], and [07]. What do they suggest about life expectancy, smoking, and income?
5 Omitted-Variable Bias
Suppose we are interested in estimating the relationship between smoking and life expectancy: \[
\text{(Life expectancy)}_i = \beta_0 + \beta_1 \text{(Smoking rate)}_i + u_i.
\]
[09] Explain why the scatter plot from [07] suggests that omitted-variable bias may be a concern for this regression.
[10] Based on the sign of the relationships in [07], do you expect omitting income group to bias the OLS estimate of \(\beta_1\) upward or downward? Explain your reasoning.
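For reference, here is a sketch of the standard omitted-variable bias result in this setting. Suppose the true model also includes the high-income indicator with coefficient \(\beta_2\), and let \(\delta_1\) (a symbol introduced here for illustration) denote the slope from regressing the high-income indicator on the smoking rate. Then the estimator from the regression that omits high income satisfies \[
\text{plim}\, \hat{\beta}_1 = \beta_1 + \beta_2 \, \delta_1,
\qquad
\delta_1 = \frac{\mathrm{Cov}\left(\text{(Smoking rate)}_i,\ \text{(High income)}_i\right)}{\mathrm{Var}\left(\text{(Smoking rate)}_i\right)}.
\]
The direction of the bias is the sign of the product \(\beta_2 \, \delta_1\).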
[11] Estimate the following two regressions and provide a table of the results:
the simple regression with only smoking, \[
\text{(Life expectancy)}_i = \beta_0 + \beta_1 \text{(Smoking rate)}_i + u_i;
\]
the regression that also controls for high income, \[
\text{(Life expectancy)}_i = \beta_0 + \beta_1 \text{(Smoking rate)}_i + \beta_2 \text{(High income)}_i + u_i.
\]
[12] Compare the estimates of \(\beta_1\) from the two regressions in [11]. Is the change consistent with your prediction from [10]? Interpret the coefficients using a 10-percentage-point increase in smoking.
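To see the mechanics of omitted-variable bias, the sketch below simulates data with a known slope and compares a short and a long regression. Everything here is invented for illustration (x, d, y, and the parameter values), and it uses base R's lm() rather than any particular package.

```r
# Simulated data (invented): d is a binary control that is related to x
set.seed(421)
n <- 1000
d <- rbinom(n, 1, 0.5)
x <- 0.30 - 0.10 * d + rnorm(n, sd = 0.05)   # x is lower when d = 1
y <- 80 - 5 * x + 6 * d + rnorm(n)           # true slope on x is -5

short_fit <- lm(y ~ x)       # omits d
long_fit  <- lm(y ~ x + d)   # controls for d
aux_fit   <- lm(d ~ x)       # how the omitted variable moves with x

b_short <- coef(short_fit)[["x"]]
b_long  <- coef(long_fit)[["x"]]
b_d     <- coef(long_fit)[["d"]]
delta   <- coef(aux_fit)[["x"]]

# In-sample identity: short slope = long slope + (coef. on d) * (aux. slope)
c(short = b_short, implied = b_long + b_d * delta)

# If x is a proportion, a 10-percentage-point increase is a change of 0.10
effect_10pp <- 0.10 * b_long
```

In this simulation the omitted variable raises y and is negatively related to x, so the short regression's slope is pushed below the long regression's slope.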
[13] Now estimate a model that controls for smoking, high income, and whether the county is urban: \[
\text{(Life expectancy)}_i =
\beta_0 + \beta_1 \text{(Smoking rate)}_i
+ \beta_2 \text{(High income)}_i
+ \beta_3 \text{(Urban county)}_i + u_i.
\]
Does adding the urban indicator meaningfully change the estimated coefficient on smoking? Explain what this tells us about omitted-variable bias.
6 Testing for Heteroskedasticity
[14] Using the regression from [13], plot the residuals against smoking rates.
Does the plot suggest that heteroskedasticity may be present? Explain.
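A residual plot can be built by pulling the residuals out of a fitted model with resid() and plotting them against the regressor. The sketch below does this on simulated data whose error spread grows with x by construction (all names and numbers are invented).

```r
library(ggplot2)

# Simulated data (invented): the error standard deviation increases with x
set.seed(421)
n <- 300
x <- runif(n, 0.1, 0.4)
y <- 80 - 5 * x + rnorm(n, sd = 2 * x)

fit <- lm(y ~ x)

# Store the regressor and the residuals together for plotting
resid_df <- data.frame(x = x, e = resid(fit))

resid_plot <- ggplot(resid_df, aes(x = x, y = e)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "x", y = "Residual", title = "Residuals against x") +
  theme_minimal()

resid_plot
```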
[15] If heteroskedasticity is present in the regression from [13], how would it affect the OLS coefficient estimates and the “classic” OLS standard errors? Explain.
[16] Conduct a Goldfeld-Quandt test for heteroskedasticity on the regression from [13]. Order the observations by smoking rates (pct_smoke) and compare the bottom third of observations to the top third of observations.
Report your test statistic, p-value, and conclusion.
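The steps of a Goldfeld-Quandt test can be sketched as follows on simulated data (names and numbers are invented). This version puts the group expected to have the larger variance in the numerator and computes a one-sided p-value; conventions differ, so follow the version from lecture if it is different.

```r
# Simulated data (invented): error variance increases with x by construction
set.seed(421)
n <- 300
x <- runif(n, 0.1, 0.4)
y <- 80 - 5 * x + rnorm(n, sd = 5 * x)
sim_df <- data.frame(x = x, y = y)

# 1. Order the observations by the variable suspected of driving the variance
sim_df <- sim_df[order(sim_df$x), ]

# 2. Split off the bottom and top thirds (the middle third is dropped)
n_third <- floor(n / 3)
low_df  <- head(sim_df, n_third)
high_df <- tail(sim_df, n_third)

# 3. Run the same regression on each subsample and keep the SSEs
sse_low  <- sum(resid(lm(y ~ x, data = low_df))^2)
sse_high <- sum(resid(lm(y ~ x, data = high_df))^2)

# 4. F statistic and p-value; k = number of estimated parameters per regression
k <- 2
gq_stat <- sse_high / sse_low
gq_p <- pf(gq_stat, df1 = n_third - k, df2 = n_third - k, lower.tail = FALSE)

c(statistic = gq_stat, p_value = gq_p)
```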
[17] Now conduct a White test for heteroskedasticity on the regression from [13].
Important: Because high_income and is_urban are binary variables, their squares are just the original variables. You do not need to include I(high_income^2) or I(is_urban^2) in the auxiliary regression.
Hint: Your auxiliary regression should include pct_smoke, I(pct_smoke^2), high_income, is_urban, and the pairwise interactions among pct_smoke, high_income, and is_urban.
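The mechanics of a White test are sketched below for a smaller model with one continuous regressor (x) and one binary regressor (d), using simulated data. The names and numbers are invented. The auxiliary regression here has four regressors, so the test has four degrees of freedom; the model in [17] has more terms.

```r
# Simulated data (invented): error variance depends on x by construction
set.seed(421)
n <- 500
x <- runif(n, 0.1, 0.4)
d <- rbinom(n, 1, 0.5)
y <- 80 - 5 * x + 6 * d + rnorm(n, sd = 5 * x)
sim_df <- data.frame(x = x, d = d, y = y)

# 1. Estimate the main regression and square its residuals
fit <- lm(y ~ x + d, data = sim_df)
sim_df$e2 <- resid(fit)^2

# 2. Auxiliary regression: levels, the square of x, and the interaction.
#    The square of d is omitted because d is binary (d^2 = d).
aux_fit <- lm(e2 ~ x + I(x^2) + d + x:d, data = sim_df)

# 3. LM statistic = n * R^2, compared to a chi-squared distribution whose
#    degrees of freedom equal the number of auxiliary regressors
r2_aux     <- summary(aux_fit)$r.squared
white_stat <- n * r2_aux
white_df   <- length(coef(aux_fit)) - 1
white_p    <- pchisq(white_stat, df = white_df, lower.tail = FALSE)

c(statistic = white_stat, df = white_df, p_value = white_p)
```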
[18] Do the Goldfeld-Quandt test and White test always have to agree? Explain. What happened in this application?
7 Living with Heteroskedasticity
[19] Estimate the following model, which allows the relationship between smoking and life expectancy to differ by income group: \[
\begin{aligned}
\text{(Life expectancy)}_i =
\beta_0 &+ \beta_1 \text{(Smoking rate)}_i
+ \beta_2 \text{(High income)}_i
+ \beta_3 \text{(Urban county)}_i \\
&+ \beta_4 \text{(Smoking rate)}_i \times \text{(High income)}_i
+ u_i.
\end{aligned}
\]
Show the results and interpret the coefficient on the interaction term.
[20] Using the estimates from [19], calculate the implied smoking-life-expectancy slope for the lowest-income group and for the highest-income group. Interpret each slope using a 10-percentage-point increase in smoking.
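As a reminder of how slopes work in a model with an interaction, differentiating the model in [19] with respect to the smoking rate gives \[
\frac{\partial\, \text{(Life expectancy)}_i}{\partial\, \text{(Smoking rate)}_i}
= \beta_1 + \beta_4 \text{(High income)}_i,
\]
so the slope is \(\beta_1\) when \(\text{(High income)}_i = 0\) and \(\beta_1 + \beta_4\) when \(\text{(High income)}_i = 1\). Assuming the smoking rate is measured as a proportion between 0 and 1 (as the variable table suggests), a 10-percentage-point increase is a change of 0.10, so the implied change in life expectancy is \[
\Delta \text{(Life expectancy)}_i = 0.10 \times \left( \beta_1 + \beta_4 \text{(High income)}_i \right).
\]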
[21] Update the standard errors from [19] to be heteroskedasticity robust. Does using heteroskedasticity-robust standard errors change your conclusions about the interaction term?
Hint: The fixest package lets you use heteroskedasticity-robust standard errors with vcov = 'het'.
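A sketch of how the vcov argument works in fixest, using simulated data (names and numbers invented). The coefficient estimates are identical across the two fits; only the standard errors change.

```r
library(fixest)

# Simulated data (invented): error variance depends on x by construction
set.seed(421)
n <- 500
sim_df <- data.frame(x = runif(n, 0.1, 0.4), d = rbinom(n, 1, 0.5))
sim_df$y <- 80 - 5 * sim_df$x + 6 * sim_df$d -
  3 * sim_df$x * sim_df$d + rnorm(n, sd = 5 * sim_df$x)

# x * d expands to x + d + x:d
fit_iid <- feols(y ~ x * d, data = sim_df, vcov = "iid")  # classic standard errors
fit_het <- feols(y ~ x * d, data = sim_df, vcov = "het")  # heteroskedasticity robust

# Side-by-side table of the two sets of results
etable(fit_iid, fit_het)
```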
[22] Estimate the model from [19] using weighted least squares with county population (pop) as the weights. Show the WLS results next to the unweighted results, using heteroskedasticity-robust standard errors for both.
Then explain how the WLS estimates differ from the unweighted estimates. Would population weighting automatically solve heteroskedasticity? Why or why not?
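Weights can be passed to feols() through the weights argument as a one-sided formula. The sketch below uses simulated data with a made-up weight variable w (all names and numbers are invented) and shows the weighted and unweighted fits side by side.

```r
library(fixest)

# Simulated data (invented): w plays the role of a population-style weight
set.seed(421)
n <- 500
sim_df <- data.frame(
  x = runif(n, 0.1, 0.4),
  w = round(runif(n, 1e3, 1e6))
)
sim_df$y <- 80 - 5 * sim_df$x + rnorm(n, sd = 30 / sqrt(sim_df$w))

# Unweighted and weighted fits, both with heteroskedasticity-robust SEs
fit_ols <- feols(y ~ x, data = sim_df, vcov = "het")
fit_wls <- feols(y ~ x, data = sim_df, weights = ~ w, vcov = "het")

etable(fit_ols, fit_wls)
```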
8 Clustering
[23] Explain the concept of correlated disturbances. Why might correlated disturbances be a concern in this dataset?
[24] Update the regression from [19] to use cluster-robust standard errors, clustering at the county level (county_code). Show the results.
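In fixest, cluster-robust standard errors can be requested with the cluster argument (a one-sided formula naming the cluster variable). The sketch below uses simulated data with a hypothetical group identifier g and a shock shared within each group; all names and numbers are invented.

```r
library(fixest)

# Simulated data (invented): 40 groups of 10 observations each
set.seed(421)
n_groups <- 40
n_per    <- 10
sim_df <- data.frame(g = rep(seq_len(n_groups), each = n_per))

# Group-level components shared by all observations in the same group
group_shock <- rnorm(n_groups)
group_x     <- rnorm(n_groups)

sim_df$x <- group_x[sim_df$g] + rnorm(nrow(sim_df), sd = 0.5)
sim_df$y <- 1 + 2 * sim_df$x + group_shock[sim_df$g] + rnorm(nrow(sim_df))

# Same regression, two different variance estimators
fit_het <- feols(y ~ x, data = sim_df, vcov = "het")      # robust, no clustering
fit_cl  <- feols(y ~ x, data = sim_df, cluster = ~ g)     # clustered by g

etable(fit_het, fit_cl)
```

Changing the cluster variable only requires changing the formula passed to cluster.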
[25] Compare the county-clustered standard errors from [24] to the heteroskedasticity-robust standard errors from [21]. Did clustering at the county level change any of your conclusions?
[26] Now cluster the standard errors at the state level (state_abb). Show the results and compare them to the county-clustered results. Why might state-level clustering produce different standard errors?
[27] Suppose a classmate reports only the conventional OLS standard errors from [19] and concludes that “the standard errors are small, so the estimates are precise.” Based on your work in this problem set, explain what is incomplete about this statement.
[28] Write a short final paragraph that summarizes what you learned about the relationship between smoking and life expectancy in this dataset. Your paragraph should mention omitted-variable bias, heteroskedasticity, and clustering.
Footnotes
1. You can find the full paper on JAMA and a non-technical summary from the authors here.