Midterm: Take-Home Portion

EC 421: Introduction to Econometrics

Author

Edward Rubin

1 Instructions

Due: Upload your answers on Canvas before midnight on Sunday, 14 May 2023.

Materials and Academic Integrity: You must work alone. You may use any materials (course notes, Google, books, ChatGPT), but you must put the answers in your own words. If you are suspected of cheating, you will receive a zero—for the assignment and possibly for the course. We may also report you to the dean. Cheating includes discussing the exam with your classmates and copying directly from other sources (including ChatGPT).

Submission: As with your homework, you must submit your answers as an HTML or PDF file, built from an RMarkdown (.Rmd) or Quarto (.qmd) file. Do not submit the .Rmd or .qmd source file itself; you will not receive credit for it.

If we ask you to create a figure or run a regression, then the figure or the regression results should be in the document that you submit (not just the code—we want the actual figure or regression output with coefficients, standard errors, etc.).

README! The data in this problem set are a random sample of 10,000 households from the American Housing Survey (AHS), which is sponsored by the Department of Housing and Urban Development (HUD) and conducted by the U.S. Census Bureau.

The table below describes each variable in the dataset.

Variable names and descriptions

Variable name        Variable description
black                Indicator: Head of household identified as Black
hispanic             Indicator: Head of household identified as Hispanic
white                Indicator: Head of household identified as White
rental               Indicator: Property is rented
household_income     Household income (2021 US dollars)
received_foodstamps  Indicator: Someone in household received food stamps
cost_housing         Total cost of housing (2021 US dollars)
has_roaches          Indicator: Roaches observed in home
has_rodents          Indicator: Rodents observed in home
in_metro             Indicator: Household is located in a metropolitan area
cbsa                 The household’s metropolitan area (core-based statistical area)

2 Prompts

[00] Load your R packages and the dataset (data-midterm.csv). You will probably want tidyverse, here, and fixest.
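A minimal setup sketch along these lines (the object name `ahs_df` is just a suggestion, and the code assumes data-midterm.csv sits in your project root):

```r
library(tidyverse)  # data manipulation (dplyr, readr) and plotting (ggplot2)
library(here)       # build file paths relative to the project root
library(fixest)     # OLS estimation with flexible standard-error corrections

# Load the AHS sample; here() resolves the path from the project root
ahs_df <- read_csv(here("data-midterm.csv"))
```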

[01] (10 points) In this exam, we are going to focus on the variable has_roaches (that is, it will be our outcome variable).

The has_roaches variable is a binary (indicator) variable that equals 1 when the members of the household observed cockroaches in their home and equals 0 if they did not.

Use the data and R to answer the following questions:

  • What percent of all households observed cockroaches in their home?
  • What percent of rental (rental == 1) households observed cockroaches in their home?
  • What percent of Black (black == 1) households observed cockroaches in their home?
  • What percent of urban (in_metro == 1) households observed cockroaches in their home?
  • What percent of households who received food stamps (received_foodstamps == 1) observed cockroaches in their home?
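Because has_roaches is an indicator, its mean is the share of households that observed roaches, so each percentage above is just a (possibly filtered) mean. A sketch, with a tiny toy tibble standing in for the real `ahs_df` so the snippet runs on its own:

```r
library(dplyr)

# Toy stand-in for ahs_df (in practice, use the data loaded in [00])
ahs_df <- tibble(
  has_roaches = c(1, 0, 0, 1),
  rental      = c(1, 1, 0, 0)
)

# Overall percent: the mean of a 0/1 variable is a share; multiply by 100
mean(ahs_df$has_roaches) * 100

# Percent within a subgroup, e.g., rental households
ahs_df |>
  filter(rental == 1) |>
  summarize(pct_roaches = 100 * mean(has_roaches))
```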

[02] (5 points) Suppose we are interested in regressing the indicator variable has_roaches on some of the other variables in the dataset.

I asked ChatGPT “How do I interpret the results of an OLS regression when the outcome variable is a binary variable?”. Below is the answer.

Ordinary Least Squares (OLS) regression is typically used for continuous outcomes. When the outcome or dependent variable is binary (0 or 1), logistic regression is generally more appropriate.

In OLS, the interpretation of the regression coefficients is straightforward: a one-unit increase in the predictor variable results in a corresponding change in the outcome variable, holding all other predictors constant. However, if your outcome variable is binary, this interpretation doesn’t hold anymore, because the outcome is not continuous, but rather “yes” (1) or “no” (0).

Still, if you’ve run an OLS regression with a binary outcome, here’s a rough guide on how to interpret the results:

Intercept (Constant): The estimated outcome when all predictors are zero. This might not make much sense when your outcome is binary.

Regression Coefficients: For each one-unit increase in the predictor variable, the difference in the mean values of the outcome variable, assuming all other variables are held constant. This interpretation can be problematic when the outcome is binary because the outcome is not really a “mean value.”

Is ChatGPT correct? Explain why or why not.

[03] (15 points) Regress has_roaches on the indicator for whether the head of the household is Black (black), the indicator for whether the home is rented (rental), and their interaction. Interpret the intercept and each of the coefficients.
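In fixest, an interacted specification like the one described above can be estimated with the `*` operator, which expands to both main effects plus their interaction. A sketch with simulated stand-in data (in practice, you would use the data loaded in [00]):

```r
library(fixest)

# Toy stand-in for the AHS sample so the snippet runs on its own
set.seed(421)
ahs_df <- data.frame(
  has_roaches = rbinom(1000, 1, 0.15),
  black       = rbinom(1000, 1, 0.12),
  rental      = rbinom(1000, 1, 0.35)
)

# black * rental expands to black + rental + black:rental
est_03 <- feols(has_roaches ~ black * rental, data = ahs_df)
summary(est_03)  # coefficients, standard errors, etc.
```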

[04] (5 points) Based upon your results in [03], would you say that there is significant evidence that Black families, on average, face worse housing conditions? Explain your answer.

[05] (5 points) The regression in [03] doesn’t include anything about whether the home is in an urban or rural setting—a potential source of omitted-variable bias.

When I asked ChatGPT how omitting this information on urban/rural setting would affect our results, it said:

Omitting a relevant variable from a regression model can lead to omitted variable bias, which occurs when a variable that is correlated with both the dependent variable and one or more of the independent variables is left out of the model.

In your case, if “urban” is a relevant variable, i.e., it is correlated with both the presence of roaches (has_roaches) and with the predictors (black, rental, and their interaction), then omitting it from your model could bias your estimates.

Is ChatGPT right? Explain why or why not—and fill in any information that would improve upon its answer.

[06] (5 points) Re-estimate the regression in [03] using only data from urban (metro) areas (i.e., in_metro == 1).

(No need to interpret the results—just show the regression output.)

[07] (5 points) Do the estimates in [06] suggest that omitting “urban” was causing bias? Explain your answer.

[08] (5 points) Correct your standard errors in [06] for heteroskedasticity.

(Again, no need to interpret the results—just show the regression output.)
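With fixest, heteroskedasticity-robust (HC1) standard errors only require the `vcov` argument; the point estimates are unchanged and only the standard errors are recomputed. A sketch with simulated stand-in data for the urban subsample:

```r
library(fixest)

# Toy stand-in for the urban subsample (in practice, filter in_metro == 1)
set.seed(421)
urban_df <- data.frame(
  has_roaches = rbinom(800, 1, 0.20),
  black       = rbinom(800, 1, 0.12),
  rental      = rbinom(800, 1, 0.45)
)

# vcov = "hetero" requests HC1 heteroskedasticity-robust standard errors
est_08 <- feols(has_roaches ~ black * rental, data = urban_df, vcov = "hetero")
est_08

# Equivalently, re-summarize an existing estimate with robust errors:
# summary(est_06, vcov = "hetero")
```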

[09] (5 points) I asked ChatGPT why we might want to cluster our standard errors (i.e., correct our standard errors for correlated disturbances) in this context. Its answer is below. Explain whether its answer is correct or incorrect—and add any additional information that would improve the answer.

Clustering standard errors is a way to adjust for the fact that observations within the same group or cluster may be correlated, and thus are not truly independent. This lack of independence can lead to underestimation of standard errors, which in turn can make your statistical tests overly optimistic (i.e., you might find statistical significance where there is none).

In the context of your example, suppose households are nested within neighborhoods, and you have reason to believe that households within the same neighborhood are more similar to each other (in terms of has_roaches) than to households in other neighborhoods. This might be due to shared environmental factors, similar housing conditions, or any number of unobserved factors.

If you ignore this clustering and run a standard OLS regression, you’re implicitly assuming that all of your observations are independent. If that’s not true, then your standard errors could be underestimated, leading to overconfident inference (p-values that are too small, confidence intervals that are too narrow).

By clustering the standard errors at the neighborhood level, you’re allowing for intra-cluster correlation, adjusting the standard errors upward to account for the fact that observations within the same neighborhood are not truly independent. This results in more accurate (usually larger) standard errors, which makes your statistical inference more reliable.

In a nutshell, clustering the standard errors is a way to get more accurate standard errors when you have reason to believe that observations within certain groups are not independent. It’s a common technique in fields like economics, sociology, and political science, where data often have a hierarchical or clustered structure.
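One detail worth adding to ChatGPT's "neighborhood" example: in this dataset, the natural cluster variable is cbsa (the household's metropolitan area). In fixest, clustering only requires the `cluster` argument. A sketch with simulated stand-in data:

```r
library(fixest)

# Toy stand-in for the AHS sample, with a metro-area identifier like cbsa
set.seed(421)
toy_df <- data.frame(
  has_roaches = rbinom(1000, 1, 0.15),
  black       = rbinom(1000, 1, 0.12),
  rental      = rbinom(1000, 1, 0.35),
  cbsa        = sample(LETTERS[1:20], 1000, replace = TRUE)
)

# cluster = ~cbsa allows disturbances to be correlated within each metro area
est_cl <- feols(has_roaches ~ black * rental, data = toy_df, cluster = ~cbsa)
est_cl
```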