# Load packages using 'pacman'
library(pacman)
p_load(tidyverse, scales, patchwork, fixest, here)

Problem Set 0: Review
EC 421: Introduction to Econometrics
1 Instructions
Due: Upload your PDF or HTML answers on Canvas before 5:00 PM on Friday, 18 April 2025.
Important: You must submit your answers as an HTML or PDF file, built from an RMarkdown (.rmd) or Quarto (.qmd) file. Do not submit the .rmd or .qmd file. You will not receive credit for it.
If we ask you to create a figure or run a regression, then the figure or the regression results should be in the document that you submit (not just the code—we want the actual figure or regression output with coefficients, standard errors, etc.).
Integrity: If you are suspected of cheating, then you will receive a zero for the assignment and possibly for the course. We may report you to the dean. Cheating includes copying from your classmates, from the internet, and from previous assignments.
Objective: This problem set has three goals: (1) review the central econometrics topics you covered in EC320; (2) refresh (or build) your R toolset; (3) start building your intuition about causality within econometrics/regression.
README! The data in this problem set come from the 2023 American Community Survey (ACS; downloaded from IPUMS USA). The ACS annually surveys approximately 3.5 million households. I’ve provided a random subset of 10,000 individuals—all of whom are at least 18 years old. The data are stored in a CSV file named data-acs.csv.
The table below describes each variable in the dataset.
| Variable name | Variable description |
|---|---|
| sex | The individual’s sex (Female or Male). (character) |
| age | The individual’s age (18 to 99). (integer) |
| race | The individual’s race (6 broad categories). (character) |
| hispanic | Whether the individual is Hispanic or Non-Hispanic. (character) |
| educ | A rough estimate of the individual’s years of education (1 = first grade; 17 = graduate school). (integer) |
| empstat | The individual’s employment status (Employed, Unemployed, Not in labor force). (character) |
| hrs_wk | The number of hours the individual works per week. (integer) |
| income | The individual’s income in dollars. (integer) |
| deg_bachelors | A binary indicator for whether the individual has a bachelor’s degree. (integer) |
| deg_masters | A binary indicator for whether the individual has a master’s degree. (integer) |
| deg_profession | A binary indicator for whether the individual has a professional degree (e.g., law or medicine). (integer) |
| deg_phd | A binary indicator for whether the individual has a doctorate. (integer) |
| i_female | A binary indicator for whether the individual’s sex is female. (integer) |
| i_black | A binary indicator for whether the individual is Black. (integer) |
| i_white | A binary indicator for whether the individual is White. (integer) |
| i_hispanic | A binary indicator for whether the individual is Hispanic. (integer) |
| i_workforce | A binary indicator for whether the individual is in the workforce (employed or unemployed). (integer) |
| i_employed | A binary indicator for whether the individual is employed. (integer) |
2 Setup
[01] Load your R packages (and install any packages that are not already installed).
- You will likely want to use the tidyverse and here packages (among others).
- Also: pacman and its p_load() function make package management easier: you just use p_load() to load packages, and R will install any packages that are not already installed. E.g., use p_load(tidyverse, here) after you load the pacman package with library(pacman). Remember that you will have to install pacman (install.packages("pacman")) if you have not installed it already.
[02] Now load the data (stored in data-acs.csv).
As described above, I saved the data as a CSV, so you’ll want to use a function that can read CSV files.
Examples of functions that can read a CSV file:
- read_csv() in the readr package, which is part of the tidyverse;
- fread() in the data.table package;
- read.csv(), which is available without loading any packages.
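As a minimal sketch of the read step, the snippet below writes a tiny made-up table to a temporary CSV file and reads it back with base R's read.csv(). The names toy_path and toy_df and every value in the file are hypothetical and are not taken from data-acs.csv; with the real data you would pass the path to your copy of the file instead.

```r
# Toy example: the values below are made up and are NOT from the ACS data.
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "age,hrs_wk,empstat",
    "34,40,Employed",
    "71,0,Not in labor force",
    "22,15,Employed"
  ),
  toy_path
)

# read.csv() ships with base R, so no packages need to be loaded.
toy_df <- read.csv(toy_path)

# Check the number of rows and the class of each column.
nrow(toy_df)
sapply(toy_df, class)
```

read_csv() and fread() take the file path as their first argument in the same way, so swapping one in should only require loading the corresponding package.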
3 Get to know your data
In this problem set, we are going to explore the relationship between hours worked, education, and demographics. Let’s get to know the data a bit better.
[03] How many observations (rows) are in the dataset? How many of the observations have exactly 0 hours worked per week (hrs_wk)?
Hints:
- The functions dim() or nrow() show the number of rows in a dataset, e.g., nrow(some_data).
- You can use the filter() function (from the tidyverse) to filter your dataset to observations with a specific value, e.g., my_data |> filter(my_variable == 0) would filter the dataset my_data to the observations for which my_variable is equal to 0.
- You can combine hints 1 and 2 to find the number of observations with hrs_wk == 0 by using nrow() on the filtered dataset, e.g., my_data |> filter(my_variable == 0) |> nrow().
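The hints above can be tried on a toy dataset first. This sketch assumes the dplyr package (part of the tidyverse) is installed; toy_df and its values are made up for illustration.

```r
library(dplyr)

# Toy data: made-up values, for illustration only.
toy_df <- tibble(my_variable = c(0, 40, 0, 25, 0, 38))

# Hint 1: total number of rows.
n_total <- nrow(toy_df)

# Hints 2 and 3: keep rows where my_variable equals 0, then count them.
n_zero <- toy_df |>
  filter(my_variable == 0) |>
  nrow()

c(total = n_total, zeros = n_zero)
```

Note that filter() drops rows for which the condition evaluates to NA, so missing values are not counted as zeros.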
[04] It’s good to know which variables are in the dataset and what type (class()) they are. How many categorical variables are in the dataset?
Hint: You have many options here; try glimpse() (in the tidyverse), summary(), or skim() (from the skimr package). Also: If you used read_csv() or fread() to load the data, then just typing the name of the dataset will display the first few rows and the class of each variable.
[05] How many observations are missing data on hours worked (hrs_wk)?
Hints:
- The function is.na() detects whether observations are missing.
- You can filter your dataset to observations missing a variable using the filter() function, for example, my_data |> filter(is.na(my_variable)) would filter the dataset my_data to observations missing values for the variable my_variable.
- You could also sum the results of is.na() to see how many of them are missing. is.na() returns TRUE or FALSE. TRUE is a 1, and FALSE is a 0.
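A short base R sketch of the counting trick in the last hint, using a made-up vector (the name my_variable follows the hints; the values are hypothetical):

```r
# Toy vector with two missing values (made up for illustration).
my_variable <- c(40, NA, 0, 25, NA, 38)

# is.na() returns TRUE or FALSE for each element.
is.na(my_variable)

# Summing the logical vector counts the TRUEs, i.e., the missing values.
n_missing <- sum(is.na(my_variable))
n_missing
```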
[06] We’re also going to be interested in the variable for years of education (educ). How many observations are missing their values for education?
4 Summarizing data
Time to make a few figures. Simple summaries and visualizations are fantastic ways to get to know the data and to try to figure out any potential issues/features. In this case, they will also provide insights into the distribution of income and education in the United States (in 2023).
[07] Before you make any figures, calculate the mean and median of the variables hrs_wk, educ, i_female, age, and income.
Hints:
- If a variable is missing values, then the mean and median will be missing too. You can use the na.rm = TRUE argument to remove missing values from the calculation, e.g., mean(my_variable, na.rm = TRUE).
- You can also use the mean() and median() functions directly. You can use the summarise_all() function to calculate the mean and median of all variables in a dataset, and select() allows you to select specific variables.
Example: Calculating the mean and standard deviation of income:
# Calculate the mean and standard deviation of 'income'
# (assumes the data were loaded into an object named 'acs_df')
acs_df |>
  select(income) |>
  summarise_all(list(mean = mean, stnd_dev = sd), na.rm = TRUE)

[08] What does the mean of a (binary) indicator variable like i_female tell us?
[09] Create a histogram of the hours worked variable to visualize its distribution in the dataset.
Important: Make sure to label your axes and title your plot.
Hints: You have a few options for creating histograms:
- ggplot2 includes the geom_histogram() function;
- hist() is a base R function that can create histograms.
Note that both functions allow you to select the number of bins in the histogram. ggplot2 uses either the bins or the binwidth arguments; hist() uses the breaks argument.
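To see how the breaks argument controls the bins, here is a minimal base R sketch on made-up hours. The vector toy_hours, the 10-hour bin width, and the labels are all assumptions chosen for illustration.

```r
# Toy data: made-up weekly hours, for illustration only.
toy_hours <- c(0, 0, 8, 20, 20, 32, 40, 40, 40, 45, 50, 60)

# 'breaks' sets the bin edges: here, 10-hour bins from 0 to 60.
toy_hist <- hist(
  toy_hours,
  breaks = seq(0, 60, by = 10),
  main = "Distribution of weekly hours (toy data)",
  xlab = "Hours worked per week",
  ylab = "Number of individuals"
)

# hist() also returns the bin counts, which helps when checking the plot.
toy_hist$counts
```

In ggplot2, the analogous choice would be passing binwidth = 10 to geom_histogram().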
[10] Now create a histogram of age.
[11] Why might age matter for the distribution of hours worked? Briefly explain your answer.
[12] Repeat the hours-worked histogram from [09] for individuals who are between the ages of 25 and 64 (age >= 25 & age <= 64).
Hint: You can use the filter() function to select observations that meet certain criteria, e.g., filter(my_data, i_female == 1) would filter the dataset my_data to the observations for whom i_female is equal to the value 1.
Important: Again, don’t forget to label your plot’s axes. A title would be good too.
[13] Did changing the sample in [12] produce changes to the histogram that match your hypothesis? Explain.
5 Analyzing hours worked
Time to start analyzing the data! What correlates with (or causes) hours worked?
[14] Start with a simple linear regression of the relationship between hours worked and education.
In other words: regress hrs_wk on educ (with an intercept).
Note: Use the full dataset unless otherwise specified.
Generate a summary of the regression (estimated intercept, coefficient, and standard errors). You have a few options here:
- use the tidy() function from the broom package on the output of the lm() function;
- use the summary() function on the output of lm();
- use feols() (from the fixest package) to estimate your regression (and possibly use etable() to display the results);
- use the modelsummary() function from the modelsummary package.
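As a sketch of the lm() plus summary() option, the example below fits a simple regression on made-up data. The object names toy_df and toy_fit and the variables x and y are hypothetical stand-ins, not the problem set's variables.

```r
# Toy data: made-up values, for illustration only.
toy_df <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(1, 3, 2, 5, 4, 6)
)

# Estimate y = b0 + b1 * x by OLS (lm() includes an intercept by default).
toy_fit <- lm(y ~ x, data = toy_df)

# Full summary: coefficients, standard errors, t statistics, R-squared.
summary(toy_fit)

# Just the coefficient table, as a matrix.
coef(summary(toy_fit))
```

The other options listed above report the same estimates in different layouts, so the choice is mostly about how you want the table to look in your document.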
[15] Interpret the intercept and coefficient from the regression in [14].
[16] Based upon the regression in [14], what is the expected hours worked for someone with 13 years of education?
[17] The regression in [14] included individuals that work zero hours per week. Repeat the regression in [14] but only include individuals that work more than zero hours per week.
Hint: Remember your friend filter().
[18] How did focusing on individuals that work more than zero hours per week change the regression results?
[19] Why did focusing on individuals that work more than zero hours per week change the regression results?
[20] Wait… we should have plotted the data before running any regressions. Make a scatterplot of hours worked (y axis) against years of education (x axis). What do you think? Is the graphical relationship as strong as the regression suggested?
Hint: You can use the geom_point() function in ggplot2 to create a scatterplot. You can also add a regression line using the geom_smooth() function.
Important: Make sure to label your axes and title your plot.
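A minimal sketch of a labeled scatterplot with a fitted line, assuming the ggplot2 package is installed. The data, object names, and labels are made up for illustration.

```r
library(ggplot2)

# Toy data: made-up values, for illustration only.
toy_df <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(1, 3, 2, 5, 4, 6)
)

toy_plot <- ggplot(toy_df, aes(x = x, y = y)) +
  geom_point() +
  # method = "lm" draws the OLS line; se = FALSE hides the confidence band.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    title = "Toy scatterplot with a fitted line",
    x = "x (toy regressor)",
    y = "y (toy outcome)"
  )

toy_plot
```

When many observations share the same (x, y) values, points overlap; lowering the alpha argument of geom_point() or using geom_jitter() are two common ways to make the density visible.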
6 Explaining who works
[21] Let’s dig into the zero-hours-worked topic. First, create a new binary variable (i_zero_hrs) that is equal to 1 if the individual works zero hours per week and 0 otherwise.
Hint: You can use the mutate() function to create a new variable in your dataset. For example, my_data = my_data |> mutate(new_variable = old_variable == 0) would add a new variable called new_variable that is TRUE (which R treats as 1) if old_variable is equal to 0 and FALSE (which R treats as 0) otherwise. Wrapping the comparison in as.integer() stores the variable as 1 and 0 directly.
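The sketch below shows the difference between the logical and integer versions of such an indicator on made-up data, assuming the dplyr package is installed. The column name new_logical is hypothetical and added only for comparison.

```r
library(dplyr)

# Toy data: made-up values, for illustration only.
my_data <- tibble(old_variable = c(0, 40, 0, 25))

my_data <- my_data |>
  mutate(
    # The comparison alone yields a logical (TRUE/FALSE) column.
    new_logical = old_variable == 0,
    # as.integer() converts TRUE/FALSE to 1/0.
    new_variable = as.integer(old_variable == 0)
  )

my_data
```

Both columns carry the same information; the integer version simply makes the 0/1 coding explicit when you print the data or compute a mean.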
[22] Now regress this new zero-hours indicator on the indicator for whether the individual is female (i_female) (and an intercept).
Provide a summary (e.g., table) of the regression results.
[23] Interpret the intercept and coefficient from the regression in [22].
Hint: Remember that a regression with a binary dependent variable can be interpreted as modeling the probability that the dependent variable is equal to one.
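To make the probability interpretation concrete, this toy sketch regresses a made-up binary outcome y on a made-up binary regressor d and compares the estimates with the share of y == 1 in each group. All names and values are hypothetical.

```r
# Toy data: made-up values, for illustration only.
toy_df <- data.frame(
  d = c(0, 0, 0, 0, 1, 1, 1, 1),  # binary regressor
  y = c(0, 0, 0, 1, 0, 1, 1, 1)   # binary outcome
)

# OLS with a binary outcome (a linear probability model).
toy_fit <- lm(y ~ d, data = toy_df)
coef(toy_fit)

# Share of y == 1 within each group, for comparison.
share_d0 <- mean(toy_df$y[toy_df$d == 0])
share_d1 <- mean(toy_df$y[toy_df$d == 1])
c(share_d0 = share_d0, share_d1 = share_d1)
```

With a single binary regressor, the intercept matches the share for the d == 0 group, and the intercept plus the coefficient on d matches the share for the d == 1 group.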
[24] Now regress the zero-hours indicator on (1) i_female, (2) educ, and (3) the interaction between i_female and educ (and an intercept).
Hint: To take the interaction between two variables, you can use the : operator in the regression formula. For example, lm(y ~ x1 + x2 + x1:x2) would include the interaction between x1 and x2.
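A small sketch of the formula syntax on made-up data: in R's formula language, x1 * x2 is shorthand for x1 + x2 + x1:x2, so the two fits below should return the same coefficients. The data and object names are hypothetical.

```r
# Toy data: made-up values, for illustration only.
toy_df <- data.frame(
  x1 = c(0, 0, 0, 0, 1, 1, 1, 1),
  x2 = c(1, 2, 3, 4, 1, 2, 3, 4),
  y  = c(1, 2, 2, 4, 2, 5, 6, 9)
)

# Main effects plus the interaction, written out with ':'.
fit_colon <- lm(y ~ x1 + x2 + x1:x2, data = toy_df)

# The same model, written with the '*' shorthand.
fit_star <- lm(y ~ x1 * x2, data = toy_df)

coef(fit_colon)
all.equal(coef(fit_colon), coef(fit_star))
```

The coefficient labeled x1:x2 is the interaction term: it measures how the slope on x2 differs between the x1 == 1 and x1 == 0 groups.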
[25] Interpret the intercept and each of the coefficients from the regression in [24].
[26] Based on the regression in [24], what is the probability that a female with 13 years of education is working zero hours per week?
[27] What percent of the variation in the zero-hours indicator is explained by the regression in [24]?
[28] Could age be causing omitted variable bias in the OLS estimates above—for example, in [22]? Why or why not? Explain your answer.
[29] What must be true for the OLS estimates in [22] to be unbiased?
7 Wrap up
[30] What are your main takeaways/insights about hours worked, education, and demographics from this problem set and its data? Explain your answer using figures/regressions from above and any additional analyses you think are relevant.
Reminder: Submit your final file to Canvas as PDF or HTML only.
(Do not submit it as .rmd or .qmd.)