Problem Set 1: Heteroskedasticity, Clustering, and OLS Assumptions

1 Instructions

Due Upload your answers to Canvas before midnight on Friday, 02 May 2025.

Important You must submit your answers as an HTML or PDF file, built from an RMarkdown (.rmd) or Quarto (.qmd) file. Do not submit the .rmd or .qmd file. You will not receive credit for it. The submitted file should include your R code, your responses, and the relevant figures/regression results.

If we ask you to create a figure or run a regression, then the figure or the regression results should be in the document that you submit (not just the code—we want the actual figure or regression output with coefficients, standard errors, etc.).

Integrity If you are suspected of cheating, then you will receive a zero—for the assignment and possibly for the course. We may report you to the dean. Cheating includes copying from your classmates, from the internet, and from previous assignments.

README! The data in this problem set come from Tony McGovern’s archive of results for the 2016, 2020, and 2024 presidential elections.

The table below describes each variable in the dataset.

Variable names and descriptions
Variable name	Variable description
`state_name`	The state’s name for the given county. (`character`)
`state`	The two-letter abbreviation for the county’s state. (`character`)
`county_name`	The county’s name. (`character`)
`county_fips`	The county’s FIPS code (5-digit numeric code, stored as text). (`character`)
`confederate`	Binary indicator for whether the county is in a state that was part of the Confederacy during the Civil War. (`integer`: `0` or `1`)
`votes_gop_2016`	The number of votes cast for the Republican candidate in the 2016 presidential election. (`integer`)
`votes_dnc_2016`	The number of votes cast for the Democratic candidate in the 2016 presidential election. (`integer`)
`votes_gop_2020`	The number of votes cast for the Republican candidate in the 2020 presidential election. (`integer`)
`votes_dnc_2020`	The number of votes cast for the Democratic candidate in the 2020 presidential election. (`integer`)
`votes_gop_2024`	The number of votes cast for the Republican candidate in the 2024 presidential election. (`integer`)
`votes_dnc_2024`	The number of votes cast for the Democratic candidate in the 2024 presidential election. (`integer`)

Objective This problem set has three purposes: (1) reinforce the metrics topics we introduced in class (esp. heteroskedasticity, clustering, and violations of OLS’s assumptions); (2) build your R toolset; (3) start building your intuition about causality within econometrics/regression.

2 Setup

[01] Load your R packages. You will probably want tidyverse, here, and fixest (among others).

Reminder: pacman and its p_load() function make package management easier: you just use p_load() to load packages, and R will install the packages if they’re not already installed.

[02] Now load the data (stored in vote-data.csv).

I saved the data as a CSV, so you’ll want to use a function that can read CSV files—for example, read_csv() in the readr package, which is part of the tidyverse.
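As a minimal sketch of the read step, the example below writes a two-row toy CSV to a temporary file and reads it back with read_csv(). The rows and FIPS codes are made up for illustration. In the actual problem set you would point read_csv() at vote-data.csv instead (for example via here("vote-data.csv"), assuming the file sits in your project root). Forcing county_fips to character is one way to make sure leading zeros survive.

```r
library(readr)

# Write a two-row toy CSV to a temporary file (made-up rows, not the real data)
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("county_fips,votes_gop_2024,votes_dnc_2024",
    "00101,600,400",
    "00102,250,750"),
  toy_path
)

# Read it back, keeping the FIPS column as character and the counts as integers
toy <- read_csv(
  toy_path,
  col_types = cols(county_fips = col_character(), .default = col_integer())
)
```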

Hint: If the first problem set did not go well, check out our solutions! In addition to showing you how we solved the last problem set, our answers will help you with the various steps of this problem set.

[03] Get to know the dataset. Try the skim() function from the skimr package to get a summary of the dataset and to answer the following questions:

How many observations are in the dataset?
How many numeric variables are there?
How many observations have missing values?
How many unique counties are in the dataset?

3 Visualize the data

Throughout the problem set, we’re going to investigate how the 2016 and 2020 election results may help explain the 2024 election results. We will start by visualizing the data.

[04] Create a histogram of the number of votes cast for the Republican candidate in the 2024 presidential election (votes_gop_2024).

[05] What do you notice about the distribution of votes in the previous histogram? Why is the distribution so skewed?

[06] Repeat the histogram from [04] but use a log-10 scale on the x-axis.

Hint: You can tell ggplot() to use log base-10 scaling on the x-axis by using scale_x_log10(). (You could also use the log10() function to create a new variable, but I like the first option more.)
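The sketch below shows the scale_x_log10() option from the hint on simulated, right-skewed counts. The data frame toy and its votes column are hypothetical stand-ins, not the election data.

```r
library(ggplot2)

set.seed(1)
# Simulated, right-skewed counts standing in for a vote-count column
toy <- data.frame(votes = round(rlnorm(500, meanlog = 8, sdlog = 1.5)) + 1)

# Histogram on the raw scale
p_raw <- ggplot(toy, aes(x = votes)) +
  geom_histogram(bins = 30)

# Same histogram, but with a log base-10 x-axis
p_log <- p_raw + scale_x_log10()
```

Printing p_raw and p_log one after the other makes it easy to compare the two views of the same variable.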

[07] Did the log-10 transformation help visualize the distribution of votes? Did it provide any new insights?

[08] We’re actually interested in the share of votes won by the candidates—not just the total number of votes. Create three new variables:

share_gop_2016: the share of votes cast for the Republican candidate in the 2016 presidential election;
share_gop_2020: the share of votes cast for the Republican candidate in the 2020 presidential election;
share_gop_2024: the share of votes cast for the Republican candidate in the 2024 presidential election.

Hints:

You can use the mutate() function from the dplyr package to create new variables.
The GOP share of votes is the number of votes for the GOP candidate divided by the total number of votes cast for both candidates (here: votes for GOP plus votes for DNC).
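A minimal sketch of the share calculation described in the hints, using three made-up rows (the counts are hypothetical, not from vote-data.csv). The same pattern would apply to the 2016 and 2020 columns.

```r
library(dplyr)

# Toy rows with made-up vote counts
toy <- tibble(
  votes_gop_2024 = c(600, 250, 4000),
  votes_dnc_2024 = c(400, 750, 6000)
)

toy <- toy |>
  mutate(
    # Two-party share: GOP votes divided by GOP plus DNC votes
    share_gop_2024 = votes_gop_2024 / (votes_gop_2024 + votes_dnc_2024)
  )
```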

[09] Plot the histogram of the share of votes cast for the Republican candidate in the 2024 presidential election (share_gop_2024).

[10] Why is this histogram also so skewed? What are we missing?

[11] Create a scatterplot of the share of votes cast for the Republican candidate in the 2024 presidential election (share_gop_2024) against the share of votes cast for the Republican candidate in the 2020 presidential election (share_gop_2020). Does the scatterplot suggest that the share of votes cast for the Republican candidate in 2024 is correlated with the share of votes cast for the Republican candidate in 2020?

4 Heteroskedasticity

[12] From the scatterplot in [11], do you think the following regression model would have a heteroskedastic disturbance? Explain your answer.

\[ \text{(GOP share 2024)}_i = \beta_0 + \beta_1 \text{(GOP share 2020)}_i + u_i \]

[13] Estimate the model from [12]. Report the results and interpret both the intercept and the slope coefficient.

[14] Now plot a scatterplot (using geom_point()) of the residuals from [13] against the regression’s explanatory variable (share_gop_2020). Does the scatterplot suggest that the model has a heteroskedastic disturbance? Explain your answer.

[15] While certainly helpful, we don’t need to rely on scatterplots to detect heteroskedasticity. We have formal tests! Use the Goldfeld-Quandt test to test for heteroskedasticity in the previous regression model. Put 1,034 observations in the first group and 1,034 observations in the second group.

Make sure to report the p-value and your conclusion.

Hint: The course notes walk you through this test—as do the videos from lab.

Note: I want you to do the test manually (do not use the gqtest() function from the lmtest package).
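For reference, here is one way the manual steps of a Goldfeld-Quandt test might be laid out in base R. It runs on simulated data (not the election data), uses hypothetical group sizes of 100, and puts the larger SSE in the numerator. The course notes may set up the groups, the degrees of freedom, or the p-value convention differently, so treat this as a sketch rather than the required procedure.

```r
set.seed(42)
n <- 300
# Simulated data: the disturbance's spread grows with x
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.2 + 2 * x)
toy <- data.frame(x, y)

# 1. Sort by the explanatory variable
toy <- toy[order(toy$x), ]

# 2. Take the lowest and highest n_g observations (the middle is dropped)
n_g <- 100
lo <- head(toy, n_g)
hi <- tail(toy, n_g)

# 3. Estimate the same regression in each group and compute each SSE
sse_lo <- sum(resid(lm(y ~ x, data = lo))^2)
sse_hi <- sum(resid(lm(y ~ x, data = hi))^2)

# 4. Test statistic: larger SSE over smaller SSE, so it is at least 1
gq_stat <- max(sse_lo, sse_hi) / min(sse_lo, sse_hi)

# 5. p-value from an F distribution with (n_g - k, n_g - k) degrees of freedom
k <- 2  # parameters per regression: intercept and slope
gq_p <- pf(gq_stat, df1 = n_g - k, df2 = n_g - k, lower.tail = FALSE)
```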

[16] How would the White test work in this case? What regressions would you need to run? Don’t run them—just explain.

[17] Why do we typically prefer the White test over the Goldfeld-Quandt test?

[18] One approach to “fixing” heteroskedasticity is to check your specification. Let’s integrate a new variable into our model. Some political commentators have suggested that historical race-related factors may be important in explaining the 2024 election results. For example, the counties that were part of the Confederacy during the Civil War may have different voting patterns than counties that were not part of the Confederacy. Let’s test this hypothesis.

Estimate a model that includes the Confederacy indicator (confederate) as an additional explanatory variable and the interaction between the Confederacy indicator and the 2020 GOP vote share. The model should look like this:

\[ \begin{aligned} \text{(GOP share 2024)}_i = &\beta_0 + \beta_1 \text{(GOP share 2020)}_i + \beta_2 \mathbb{I}(\text{Confederate})_i \\ &+ \beta_3 \text{(GOP share 2020)}_i \times \mathbb{I}(\text{Confederate})_i + u_i \end{aligned} \]

where \(\mathbb{I}(\text{Confederate})_i\) is an indicator for whether county \(i\) is in a state that was part of the Confederacy during the Civil War.

Report your results.
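A model with an indicator and its interaction can be written compactly in an R formula. The sketch below uses simulated data with hypothetical variables x (continuous) and d (a 0/1 indicator). It only illustrates the formula syntax, not the election regression itself.

```r
library(fixest)

set.seed(5)
n <- 200
# Simulated data: x is continuous, d is a 0/1 indicator
toy <- data.frame(x = runif(n), d = rbinom(n, size = 1, prob = 0.4))
toy$y <- 0.2 + 0.8 * toy$x + 0.1 * toy$d - 0.3 * toy$x * toy$d +
  rnorm(n, sd = 0.1)

# x * d expands to x + d + x:d, so the slope on x may differ when d = 1
est_int <- feols(y ~ x * d, data = toy)
coef(est_int)
```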

[19] What does the coefficient on the interaction term suggest? Does it support the hypothesis that the counties that were part of the Confederacy during the Civil War have different voting patterns than counties that were not part of the Confederacy?

[20] Plot the residuals from the regression in [18] against the explanatory variable (share_gop_2020). Does the scatterplot suggest that we “fixed” the heteroskedasticity issue? Explain your answer.

[21] The “standard” approach to dealing with heteroskedasticity is to use heteroskedasticity-robust standard errors. Estimate the model from [18] again, but this time use heteroskedasticity-robust standard errors.

Hints:

You can set vcov = 'het' inside of feols() to get heteroskedasticity-robust standard errors. For example, feols(y ~ x, data = data, vcov = 'het').
Alternatively, you can use the summary() function on an object estimated by feols() and set vcov = 'het' inside of summary(). For example, summary(feols(y ~ x, data = data), vcov = 'het').
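As a small illustration of the two hints, the sketch below estimates one model on simulated data (not the election data) with two variance-covariance choices and places the standard errors side by side. It spells the option out as "hetero", fixest's full name for heteroskedasticity-robust standard errors. The objects toy, est_iid, and est_het are hypothetical names.

```r
library(fixest)

set.seed(7)
n <- 500
# Simulated data with a disturbance whose spread grows with x
toy <- data.frame(x = runif(n))
toy$y <- 1 + 2 * toy$x + rnorm(n, sd = 0.1 + toy$x)

# Same model, two variance-covariance estimators
est_iid <- feols(y ~ x, data = toy, vcov = "iid")
est_het <- feols(y ~ x, data = toy, vcov = "hetero")

# Standard errors side by side
cbind(iid = se(est_iid), hetero = se(est_het))
```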

[22] Compare the heteroskedasticity-robust standard errors to the standard errors from [18]. Are they meaningfully different? Explain.

[23] Why are the coefficients in [18] and [21] the same?

[24] Our third approach for dealing with heteroskedasticity is to downweight the “noisy” observations and upweight the “less noisy” observations, i.e., to use weighted least squares (WLS).

Explain why weighting by the total number of votes cast in the 2024 presidential election (votes_gop_2024 + votes_dnc_2024) could be a reasonable approach to WLS in this case.

Hint: Check the part of the notes where we walk through an example of WLS.

[25] Create a variable for the total votes cast in the 2024 presidential election (votes_gop_2024 + votes_dnc_2024).

Now estimate the model from [18] again, but this time use WLS with weights equal to the total number of votes cast in the 2024 presidential election (this new variable).

Remember, you can use the weights argument in feols() to specify the weights. For example, feols(y ~ x, data = data, weights = ~w), where w is the name of the variable that contains the weights. Notice that the variable name passed to weights must be preceded by a ~ (tilde).
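The sketch below shows only the weights syntax on simulated data. The weight variable w is arbitrary and hypothetical, so the output says nothing about whether weighting is appropriate for the election data.

```r
library(fixest)

set.seed(11)
n <- 400
# Simulated data with an arbitrary positive weight variable w
toy <- data.frame(x = runif(n), w = sample(50:5000, n, replace = TRUE))
toy$y <- 0.5 + 1.5 * toy$x + rnorm(n)

# Unweighted OLS and WLS with weights taken from the column w
est_ols <- feols(y ~ x, data = toy)
est_wls <- feols(y ~ x, data = toy, weights = ~w)

# Coefficients side by side
cbind(ols = coef(est_ols), wls = coef(est_wls))
```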

[26] Compare the WLS standard errors to the heteroskedasticity-robust standard errors from [21]. Are they meaningfully different? Explain.

5 Clustering

[27] In class we discussed the idea of correlated disturbances (a.k.a. clustered disturbances). In the context of this dataset, why might the disturbances be correlated across observations (counties)?

[28] Update the regression from [18] using cluster-robust standard errors. Cluster by state (i.e., using the state variable).

Remember, you can

use the cluster option in feols(), e.g., feols(y ~ x, data, cluster = ~ state);
use the cluster option in summary(), e.g., summary(feols(y ~ x, data), cluster = ~ state).
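A minimal sketch of the cluster option on simulated data (not the election data). The grouping variable g is hypothetical, and the disturbance includes a component shared within each group.

```r
library(fixest)

set.seed(3)
n_groups <- 30
n_per <- 20
toy <- data.frame(g = rep(seq_len(n_groups), each = n_per))

# Group-level components shared by every observation in a group
shock <- rnorm(n_groups)
x_group <- rnorm(n_groups)
toy$x <- x_group[toy$g] + rnorm(nrow(toy))
toy$y <- 1 + 0.5 * toy$x + shock[toy$g] + rnorm(nrow(toy))

# Same model: heteroskedasticity-robust vs. cluster-robust standard errors
est_het <- feols(y ~ x, data = toy, vcov = "hetero")
est_cl <- feols(y ~ x, data = toy, cluster = ~g)

# Standard errors side by side
cbind(hetero = se(est_het), cluster = se(est_cl))
```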

[29] Compare the standard errors from [21] (heteroskedasticity-robust) and [28] (cluster-robust). What do you notice? Are they meaningfully different? Explain.

[30] Which standard errors do you prefer? Why?