Problem Set 1: Heteroskedasticity, Clustering, and OLS Assumptions

EC 421: Introduction to Econometrics

Author

Edward Rubin

1 Instructions

Due: Upload your answer on Canvas before midnight on Saturday, 08 February 2025.

Important: You must submit your answers as an HTML or PDF file, built from an RMarkdown (.rmd) or Quarto (.qmd) file. Do not submit the .rmd or .qmd file. You will not receive credit for it.

If we ask you to create a figure or run a regression, then the figure or the regression results should be in the document that you submit (not just the code—we want the actual figure or regression output with coefficients, standard errors, etc.).

Integrity: If you are suspected of cheating, then you will receive a zero—for the assignment and possibly for the course. We may report you to the dean. Cheating includes copying from your classmates, from the internet, and from previous assignments.

README! The data in this problem set come from the 2023 American Community Survey (ACS; downloaded from IPUMS USA). The ACS annually surveys approximately 3.5 million households. I’ve provided a random subset of 10,000 individuals—all of whom are at least 18 years old. The data are stored in a CSV file named data-acs.csv.

The table below describes each variable in the dataset.

Variable names and descriptions
Variable name Variable description
state The state FIPS code for the individual’s state of residence (2-digit numeric code). (character)
county The county FIPS code for the individual’s county of residence (5-digit numeric code). (character)
sex The individual’s sex (Female or Male). (character)
age The individual’s age (18 to 99). (integer)
race The individual’s race (6 broad categories). (character)
hispanic Whether the individual is Hispanic or Non-Hispanic. (character)
educ A rough estimate of the individual’s years of education (1 = first grade; 17 = graduate school). (integer)
empstat The individual’s employment status (Employed, Unemployed, Not in labor force). (character)
hrs_wk The number of hours the individual works per week. (integer)
income The individual’s income in dollars. (integer)
deg_bachelors A binary indicator for whether the individual has a bachelor’s degree. (integer)
deg_masters A binary indicator for whether the individual has a master’s degree. (integer)
deg_profession A binary indicator for whether the individual has a professional degree (e.g., law or medicine). (integer)
deg_phd A binary indicator for whether the individual has a doctorate. (integer)
i_female A binary indicator for whether the individual’s sex is female. (integer)
i_black A binary indicator for whether the individual is Black. (integer)
i_white A binary indicator for whether the individual is White. (integer)
i_hispanic A binary indicator for whether the individual is Hispanic. (integer)
i_workforce A binary indicator for whether the individual is in the workforce (employed or unemployed). (integer)
i_employed A binary indicator for whether the individual is employed. (integer)

Objective: This problem set has three purposes: (1) reinforce the metrics topics we introduced (esp. heteroskedasticity, clustering, and violations of OLS’s assumptions) in class; (2) build your R toolset; (3) start building your intuition about causality within econometrics/regression.

2 Setup

[01] Load your R packages. You will probably want tidyverse, here, and fixest (among others).

Reminder: pacman and its p_load() function make package management easier—you just use p_load() to load packages, and R will install any packages that are not already installed.

Answer
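A minimal sketch of one way to do this, assuming pacman is already installed:

```r
# Load pacman, then load (and install, if needed) the other packages
library(pacman)
p_load(tidyverse, here, fixest)
```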

[02] Now load the data (stored in data-acs.csv).

Hints:

  1. If the first problem set did not go well, check out our solutions! In addition to showing you how we solved the last problem set, our answers will help you with the various steps of this problem set.
  2. This dataset is the same as the last dataset but with two additional variables (state and county).

I saved the data as a CSV, so you’ll want to use a function that can read CSV files—for example, read_csv() in the readr package, which is part of the tidyverse.

Answer
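One possible approach, assuming the CSV sits in your project's root folder (adjust the path if you store it elsewhere). I call the resulting data frame acs_df in the sketches that follow; the name is just an illustrative choice.

```r
# Read the ACS data; here() builds the path from the project root
acs_df = read_csv(here("data-acs.csv"))
```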

3 Visualize the data

Before we start, let’s take a look at the data—and practice our data visualization skills.

[03] Create a histogram that describes the distribution of log-10 income for each of the following four groups:

  • females aged 45 and under;
  • non-females aged 45 and under;
  • females over 45;
  • non-females over 45.

To be clear: You should have four separate histograms. Don’t forget to label the axes and title each plot.

Hints:

  • You can tell ggplot() to use log base-10 scaling on the x-axis by using scale_x_log10(). (You could also use the log10() function to create a new variable, but I like the first option more.)
  • Remember the filter() function.
  • Alternatively, you could use the fill aesthetic in ggplot().
  • You might also check out the facet_grid() function from ggplot2.

Answer
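A sketch of one way to build the four histograms with facet_grid(), assuming the data frame is named acs_df. The group labels and the number of bins are illustrative choices, not requirements.

```r
# Create grouping variables, then facet by sex group and age group
acs_df |>
  mutate(
    sex_group = if_else(i_female == 1, "Female", "Non-female"),
    age_group = if_else(age <= 45, "45 and under", "Over 45")
  ) |>
  ggplot(aes(x = income)) +
  geom_histogram(bins = 30) +
  scale_x_log10() +
  facet_grid(age_group ~ sex_group) +
  labs(
    x = "Income (log-10 scale)",
    y = "Count",
    title = "Distribution of income by sex and age group"
  )
```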

[04] What do you notice about the distribution of income in the four groups?

Answer The distribution of log-income is fairly symmetric in all four groups. The distributions for females and non-females are much more similar in the under-45 group than in the over-45 group.

[05] Why would we want to use log-10 income (instead of plain income)?

Answer We use log-10 income to compress the distribution. Because the distribution of income is highly skewed, the log transformation makes the distribution more symmetric and more easily visualized.

[06] Does plotting the distribution of log-10 income instead of income accurately represent the data? Explain your answer.

Answer It depends on the goal of the visualization. If the goal is to show the distribution of income, then the log-10 transformation may not accurately represent the data; it may hide the actual skew of the data. If the goal is to get a sense of the variability in income, then maybe the log-10 visualization helps.

[07] What happens to the individuals with 0 income when we plot the distribution of log-10 income?

Answer The log of 0 is undefined (it tends to negative infinity). So, when we plot the distribution of log-10 income, the individuals with 0 income are dropped from the plot and are not visible.

4 Heteroskedasticity

[08] Let’s go back to our first regression in the last problem set. We estimated the model:

\[ \text{Income}_i = \beta_0 + \beta_1 \text{Education}_i + u_i \]

Estimate the model again. Use the lm() function. Set the option na.action = na.exclude inside of lm().

Now plot—scatterplot using geom_point()—the residuals against education. What do you notice?

Answer
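A sketch of one way to estimate the model and plot the residuals, assuming the data frame is named acs_df; reg_08 is just an illustrative object name.

```r
# Estimate income on education; na.exclude keeps residuals aligned with the data
reg_08 = lm(income ~ educ, data = acs_df, na.action = na.exclude)

# Attach the residuals and plot them against education
acs_df |>
  mutate(resid_08 = resid(reg_08)) |>
  ggplot(aes(x = educ, y = resid_08)) +
  geom_point(alpha = 0.3) +
  labs(x = "Years of education", y = "Residual")
```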

The scatterplot suggests the disturbances may not be homoskedastic. The residuals have larger variance for individuals with more education.

[09] We don’t need to rely on scatterplots to detect heteroskedasticity. We have formal tests! Use the Goldfeld-Quandt test to test for heteroskedasticity in the model. Make sure to report the p-value and your conclusion.

Hint: The course notes walk you through this test.

Note: I want you to do the test manually (do not use the gqtest() function from the lmtest package).

Answer
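A sketch of the manual Goldfeld-Quandt test, assuming complete-case data sorted by education and split at the median (the split point is an illustrative choice; the course notes may suggest a different one). The high-education group's variance goes in the numerator because [08] suggests variance increases with education.

```r
# Keep complete cases and order the observations by education
gq_df = acs_df |> filter(!is.na(income), !is.na(educ)) |> arrange(educ)
n = nrow(gq_df)

# Estimate the model separately on the low- and high-education halves
reg_low  = lm(income ~ educ, data = gq_df |> slice(1:floor(n / 2)))
reg_high = lm(income ~ educ, data = gq_df |> slice((floor(n / 2) + 1):n))

# Goldfeld-Quandt statistic: ratio of the two groups' mean squared residuals
f_stat = (sum(resid(reg_high)^2) / reg_high$df.residual) /
  (sum(resid(reg_low)^2) / reg_low$df.residual)

# p-value from the F distribution with the two residual degrees of freedom
pf(f_stat, df1 = reg_high$df.residual, df2 = reg_low$df.residual, lower.tail = FALSE)
```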

[10] Now use the White test to test for heteroskedasticity in the model. Make sure to report the p-value and your conclusion.
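A sketch of the White test for this single-regressor model, assuming the lm object reg_08 and data frame acs_df from [08]. With one explanatory variable, the auxiliary regression only needs education and its square.

```r
# Squared residuals from the regression in [08]
white_df = acs_df |> mutate(e2 = resid(reg_08)^2)

# Auxiliary regression: squared residuals on education and its square
reg_white = lm(e2 ~ educ + I(educ^2), data = white_df)

# LM statistic: n * R-squared, compared to a chi-squared with 2 degrees of freedom
lm_stat = nobs(reg_white) * summary(reg_white)$r.squared
pchisq(lm_stat, df = 2, lower.tail = FALSE)
```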

[11] What is the difference between the Goldfeld-Quandt test and the White test?

[12] Assuming that you found significant evidence of heteroskedasticity, what are some ways to address the issue?

[13] “Fix” your standard errors using White’s heteroskedasticity-robust standard errors. How much do the standard errors change?

Hint: You can set vcov = 'het' inside of feols() to get heteroskedasticity-robust standard errors. For example, feols(y ~ x, data = data, vcov = 'het').
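A sketch of one way to compare the two sets of standard errors with feols(), assuming the data frame acs_df; the object names are illustrative.

```r
# Same model with IID and with heteroskedasticity-robust (White) standard errors
reg_iid = feols(income ~ educ, data = acs_df, vcov = 'iid')
reg_het = feols(income ~ educ, data = acs_df, vcov = 'het')

# Side-by-side table to compare the standard errors
etable(reg_iid, reg_het)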

5 Clustering

Suppose we are interested in estimating the returns to education but we suspect the returns may differ by individuals’ races.

[14] Estimate the model:

\[ \text{Income}_i = \beta_0 + \beta_1 \text{Education}_i + \beta_2 \mathbb{I}(\text{White})_i + \beta_3 \text{Education}_i \times \mathbb{I}(\text{White})_i + u_i \] where \(\mathbb{I}(\text{White})_i\) is the indicator for whether the individual is White (i_white).

Use our “standard” (IID) standard errors (the ones from EC 320 that assume homoskedasticity).

Report the coefficient estimates and standard errors in a table.
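A sketch of one way to estimate this specification with feols(), assuming the data frame acs_df; reg_14 is an illustrative name. The * operator expands to the two main effects plus their interaction.

```r
# Income on education, the White indicator, and their interaction (IID standard errors)
reg_14 = feols(income ~ educ * i_white, data = acs_df, vcov = 'iid')
etable(reg_14)
```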

[15] Interpret the coefficient on the interaction term in [14].

[16] Re-estimate the model from [14] using cluster-robust standard errors. Use the cluster option in feols() to cluster the standard errors by county (county), e.g., feols(y ~ x, data, cluster = ~ county).
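A sketch, assuming acs_df and the reg_14 object from [14]; reg_16 is an illustrative name.

```r
# Same specification as [14], but cluster the standard errors by county
reg_16 = feols(income ~ educ * i_white, data = acs_df, cluster = ~ county)

# Compare the IID and cluster-robust standard errors
etable(reg_14, reg_16)
```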

[17] Compare the standard errors from [14] and [16]. What do you notice? Are they meaningfully different? Explain.

[18] Why would we potentially want to cluster our standard errors at the county level in this dataset?

[19] Why are the coefficients in [14] and [16] the same?

6 General questions

[20] Which assumption of OLS is violated when the error term is heteroskedastic?

[21] Which assumption of OLS is violated when we need to cluster our standard errors?

[22] Why is OLS still unbiased when the disturbance is heteroskedastic?

[23] Why is OLS inefficient when the disturbance is heteroskedastic?