You are not allowed to work with anyone else. Working with anyone else will be considered cheating. You will receive a zero for both parts of the exam and may fail the class.
You can use online materials, books, notes, solutions, etc. However, you still must put all of your answers in your own words. Copying other people’s words will be considered cheating.
Shan and Ed will not help you debug your code. Do not ask.
2 General instructions
Data You will need the data contained in midterm-data.csv.
Points There are 71 points available on this portion of the midterm. The in-class portion of the midterm was worth 120 points. Your total midterm grade will be the sum of the points you earned on the two parts divided by 191 (= 120 + 71).
Due Upload your answers on Canvas before 11:59 pm (Pacific) on Tuesday, 21 February 2023.
Important You must submit two files:
your typed responses/answers to the question (in a Word file or something similar)
the R script you used to generate your answers. Each student must turn in her/his own answers.
If you are using RMarkdown (or Quarto), you can turn in a single file, but it must be an HTML or PDF file with both your R code and your answers.
All figures and regression output (tables) should be visible in the file you submit for your writeup. You will not get credit for just the code. We will not run your code. If we ask you to create a figure or run a regression, then the figure or the regression results should be in the document that you submit (not just the code—we want the actual figure or regression output with coefficients, standard errors, etc.).
Do not write your answers in the comments of your R script file. You will not receive credit for answers written in the R script.
README! We are using a dataset that is very similar to the datasets in the first two problem sets. As before, the data come from 2021 American Community Survey (ACS) public-use microdata downloaded from the US Census (with codebooks).
These new data are a random sample of 7,500 employed (in 2021) individuals living throughout California with income less than one million dollars.
The table at the end of this document describes each variable in the data.
3 Problems
Note: You should load the data and whichever R packages you think will be helpful.
[1] (4 points) Regress hours worked (hrs_work_perwk) on an indicator for whether the individual is female (i_female). Include the regression output in your submission.
Rows: 7500 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): hh_id
dbl (7): hh_puma, i_kids, income, age, i_female, yrs_education, hrs_work_perwk
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Run the regression
reg01 = feols(hrs_work_perwk ~ i_female, data = census_df)
# The results
reg01
[2] (5 points) Use the regression output from the previous question to answer: Do non-females work significantly less than 40 hours each week (on average)? What about females? Explain your answer.
Answer
Note: You have some options here on exactly how you answer this question. Here’s one way to reason through it…
The estimated intercept here is 39.24, which tells us that non-females work, on average, 39.24 hours per week. The standard error is 0.20 hours, so the 95% confidence interval for the average weekly hours of non-female workers is approximately 38.84 to 39.64 hours (the estimate plus or minus two standard errors). This interval excludes 40 hours, so we reject 40 hours, but in practical terms the estimate is quite close to 40 hours.
Female workers work significantly fewer hours in a week (at the 5 percent level) than non-female workers. And non-female workers work less than 40 hours. Thus, the average hours worked in a week by a female worker is likely significantly less than 40 hours (our estimate is 39.24 - 3.57 = 35.67).
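The interval reasoning above can be reproduced with a few lines of arithmetic. This is a minimal sketch that plugs in the rounded values quoted in the answer (39.24, 0.20, and -3.57) rather than reading them from the fitted model; the object names are illustrative only.

```r
# Rounded estimates as quoted in the answer above
est_intercept = 39.24   # average weekly hours, non-female workers
se_intercept  = 0.20    # standard error of the intercept
est_female_gap = -3.57  # coefficient on i_female

# Approximate 95% confidence interval: estimate +/- two standard errors
# (the exact normal critical value is about 1.96)
ci_95 = est_intercept + c(-1, 1) * 2 * se_intercept

# t statistic for the null hypothesis that average hours equal 40
t_stat = (est_intercept - 40) / se_intercept

# Point estimate of average weekly hours for female workers
est_female = est_intercept + est_female_gap

print(round(ci_95, 2))
print(round(t_stat, 2))
print(est_female)
```

A formal test of "40 hours" for female workers would also need the standard error of the sum of the two coefficients, which depends on their covariance and is not reported in the text.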
[3] (5 points) Does using heteroskedasticity-robust standard errors change your answers to the last question? Explain.
Answer
Now we update our regression to use heteroskedasticity-robust standard errors.
# Run the regression
reg03 = feols(hrs_work_perwk ~ i_female, data = census_df, vcov = 'hetero')
# The results
reg03
The standard errors hardly change at all, which means we do not change our inferences from above.
The intercept’s standard error is now slightly smaller, so we still reject “40 hours” for non-female workers.
Female workers still work significantly less than non-female workers (who work less than 40 hours per week).
[4] (4 points) Now regress hours worked on the indicator for whether the individual is female (i_female) and the indicator for whether there are children in the household (i_kids). Include the regression output in your submission.
Answer
# Run the regression
reg04 = feols(hrs_work_perwk ~ i_female + i_kids, data = census_df)
# The results
reg04
Note Students were free to use heteroskedasticity-robust standard errors but were not required to.
[5] (5 points) Interpret each of the coefficient estimates (and the intercept) from the previous regression.
Answer
The intercept tells us that non-females without kids work, on average, 39.11 hours (all else equal).
The coefficient on i_female tells us that females, on average, work 3.56 hours fewer than non-females (all else equal).
The coefficient on i_kids tells us individuals with children, on average, work 0.34 hours more than individuals without kids (all else equal).
[6] (4 points) Now estimate a regression model that also includes the interaction between female and has kids (also include the uninteracted variables too). Include the regression output in your submission.
Answer
# Run the regression
reg06 = feols(hrs_work_perwk ~ i_female + i_kids + i_female:i_kids, data = census_df)
# The results
reg06
[7] (5 points) Interpret each of the coefficient estimates (and intercept) from the previous regression.
Answer
The intercept tells us that, on average, non-female individuals without children work approximately 38.89 hours.
The coefficient on i_female tells us that, on average, females without children work 3.10 hours less than non-females without children.
The coefficient on i_kids tells us that, on average, non-females with children work 0.89 hours more than non-females without children.
The significant interaction tells us that the “effect” of children for females is 1.21 hours less than the effect of children for non-females (all else equal).
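Written out, the interacted model makes these interpretations easier to check. The symbols below ($\beta_0$ through $\beta_3$ and $u_i$) are notation assumed for this sketch, not notation taken from the exam, and the conditional means assume the error has mean zero within each group.

```latex
\begin{align*}
\text{Hours}_i &= \beta_0 + \beta_1 \text{Female}_i + \beta_2 \text{Kids}_i
  + \beta_3 (\text{Female}_i \times \text{Kids}_i) + u_i \\
E[\text{Hours}_i \mid \text{Female}_i = 0, \text{Kids}_i = 1]
  - E[\text{Hours}_i \mid \text{Female}_i = 0, \text{Kids}_i = 0] &= \beta_2 \\
E[\text{Hours}_i \mid \text{Female}_i = 1, \text{Kids}_i = 1]
  - E[\text{Hours}_i \mid \text{Female}_i = 1, \text{Kids}_i = 0] &= \beta_2 + \beta_3
\end{align*}
```

The interaction coefficient $\beta_3$ is therefore the difference between the kids "effect" for females and the kids "effect" for non-females.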
[8] (5 points) Using the estimated coefficients: What is the (numeric) difference between the number of hours worked by females with children versus females without children?
Answer
One option for the solution
The average hours worked by females with children is 38.89 - 3.10 + 0.89 - 1.21 = 35.47.
The average hours by females without children is 38.89 - 3.10 = 35.79.
The difference between these two groups is 35.47 - 35.79 = -0.32.
Another option
The difference between females with children and females without children is the sum of the coefficients on i_kids and i_female:i_kids, i.e., 0.89 - 1.21 = -0.32.
[9] (5 points) Using the same regression estimates: What is the average of hours worked for non-females with children?
Answer
The average hours worked by non-females with children is 38.89 + 0.89 = 39.78.
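The group averages used in the last two answers can be tabulated in one pass. This is a sketch that hard-codes the rounded coefficients quoted above instead of extracting them from reg06; the helper name avg_hours and the other object names are hypothetical.

```r
# Rounded coefficient estimates quoted in the answers above
b0   = 38.89   # intercept: non-females without children
b_f  = -3.10   # coefficient on i_female
b_k  = 0.89    # coefficient on i_kids
b_fk = -1.21   # coefficient on i_female:i_kids

# Predicted average weekly hours for 0/1 indicators female and kids
avg_hours = function(female, kids) {
  b0 + b_f * female + b_k * kids + b_fk * female * kids
}

# Predicted averages for all four groups
groups = expand.grid(female = 0:1, kids = 0:1)
groups$avg_hours = avg_hours(groups$female, groups$kids)
print(groups)

# Females with children minus females without children
diff_female = avg_hours(1, 1) - avg_hours(1, 0)
print(round(diff_female, 2))
```

Because the intercept and the i_female coefficient cancel in the difference, diff_female is just the sum of the i_kids and interaction coefficients.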
[10] (5 points) Explain whether you think omitting ‘education level’ could cause bias.
Answer
Perhaps. For education level to bias our estimates, two things must be true:
An individual’s level of education must affect the number of hours an individual works.
Level of education must correlate with one of our included regressors.
The first requirement seems reasonable: people with different levels of education likely work in different industries, which might affect their hours worked.
The second requirement is plausible: females have different levels of education than non-females and/or individuals with children have different levels of education than individuals without children.
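The two requirements line up with the two factors in the textbook omitted-variable bias expression. The sketch below is a simplified version with one included regressor $x_i$ and one omitted variable $z_i$; the symbols are assumed for illustration, and the exam's model has more than one included regressor.

```latex
\begin{align*}
y_i &= \beta_0 + \beta_1 x_i + \beta_2 z_i + u_i \\
\operatorname{plim}\, \hat{\beta}_1^{\text{short}} &= \beta_1
  + \beta_2 \, \frac{\operatorname{Cov}(x_i, z_i)}{\operatorname{Var}(x_i)}
\end{align*}
```

Here $\hat{\beta}_1^{\text{short}}$ is the estimate from regressing $y_i$ on $x_i$ alone. The bias term is zero if either $\beta_2 = 0$ (education does not affect hours) or $\operatorname{Cov}(x_i, z_i) = 0$ (education is uncorrelated with the included regressor).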
[11] (4 points) Add education level to the regression model (it should now have female, kids, female interacted with kids and years of education). Make sure the regression output is in your submission.
Answer
# Run the regression
reg11 = feols(hrs_work_perwk ~ i_female + i_kids + i_female:i_kids + yrs_education, data = census_df)
# The results
reg11
[12] (5 points) Using your most recent regression estimates—and any other information that is helpful—explain whether it looks like education was causing omitted-variable bias.
Answer
The “effect” of kids for non-female individuals in the sample changed a little bit, while most of the other estimates did not change much. The fact that including years of education did not change our estimates much suggests that if omitting education was causing bias, it was not causing a lot of bias.
We can directly investigate the correlation of education with our included variables using regression:
# For females
reg12_female = feols(i_female ~ yrs_education, data = census_df)
# For kids
reg12_kids = feols(i_kids ~ yrs_education, data = census_df)
# The results
etable(reg12_female, reg12_kids)
These regressions show that there is a statistically significant relationship between our included regressors and education, but they also show that the relationship is not very large.
Together these pieces of evidence suggest that education was not causing a lot of omitted-variable bias in our previous estimates.
[13] (5 points) Why did the intercept change so much with this new regression? Explain your answer.
Answer
The intercept changed because its interpretation changed.
Previously the intercept told us about the expected hours worked in a week for non-females without kids.
Now the intercept tells us the expected hours worked in a week for non-females without kids with zero years of education.
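One way to see this is to re-center education so that the intercept refers to a person with average education instead of zero years. The sketch below uses simulated stand-in data (not the exam data) and base R's lm() instead of feols() to stay free of package dependencies; every name and number in it is hypothetical.

```r
# Simulated stand-in data; NOT the exam data
set.seed(1)
n = 500
sim_df = data.frame(
  i_female      = rbinom(n, 1, 0.5),
  yrs_education = sample(8:20, n, replace = TRUE)
)
sim_df$hrs = 30 + 0.5 * sim_df$yrs_education - 3 * sim_df$i_female +
  rnorm(n, sd = 5)

# Raw model: the intercept is the expected hours for a non-female
# with zero years of education
fit_raw = lm(hrs ~ i_female + yrs_education, data = sim_df)

# Centered model: the intercept is the expected hours for a non-female
# with the sample-average years of education
mean_educ = mean(sim_df$yrs_education)
fit_ctr = lm(hrs ~ i_female + I(yrs_education - mean_educ), data = sim_df)

# The slopes match across the two models; only the intercept moves
b_raw = coef(fit_raw)
b_ctr = coef(fit_ctr)
print(b_raw["(Intercept)"])
print(b_ctr["(Intercept)"])
```

The centered intercept equals the raw intercept plus the education slope times average education, which is why adding a regressor whose values are far from zero can shift the intercept a lot without changing the other coefficients much.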
[14] (5 points) Should we be worried about measurement error in this context? Explain.
Answer
I’d argue that we probably do not need to worry about measurement error very much in this context. Our explanatory variables are fairly easy to measure accurately: whether the household has young children, whether someone is female, and years of education. The main way measurement error could arise here is if people do not accurately answer the survey that generated the data.
[15] (5 points) Write a one-paragraph summary of your findings from the preceding regressions (for example, discussing the sign, magnitude, and significance of the various relationships).
Answer
Pretty open to different answers here—just needs to make sense (and follow the prompt).
4 Description of variables
Variable names, descriptions, and sources
Variable name     Variable description
hh_id             Household identifier
hh_puma           Urban-area identifier
i_kids            Indicator: Does the household have young children?
income            Individual income
age               Individual age
i_female          Indicator: Is the individual identified as female?
yrs_education     Approximate number of years of education
hrs_work_perwk    Number of hours the individual worked in an average week