```r
# Calculate the mean and standard deviation of 'hrs_wk'
# (assumes dplyr is loaded and the data are stored in `acs_df`)
acs_df |>
  select(hrs_wk) |>
  summarise_all(list(mean = mean, stnd_dev = sd), na.rm = TRUE)
```
Problem Set 0: Review
EC 421: Introduction to Econometrics
1 Instructions
Due: Upload your answer on Canvas before midnight on Tuesday, 28 January 2025.
Important: You must submit your answers as an HTML or PDF file, built from an RMarkdown (`.rmd`) or Quarto (`.qmd`) file. Do not submit the `.rmd` or `.qmd` file. You will not receive credit for it.
If we ask you to create a figure or run a regression, then the figure or the regression results should be in the document that you submit (not just the code—we want the actual figure or regression output with coefficients, standard errors, etc.).
Integrity: If you are suspected of cheating, then you will receive a zero for the assignment and possibly for the course. We may report you to the dean. Cheating includes copying from your classmates, from the internet, and from previous assignments.
Objective: This problem set has three goals: (1) review the central econometrics topics you covered in EC320; (2) refresh (or build) your R toolset; (3) start building your intuition about causality within econometrics/regression.
README! The data in this problem set come from the 2023 American Community Survey (ACS; downloaded from IPUMS USA). The ACS annually surveys approximately 3.5 million households. I’ve provided a random subset of 10,000 individuals, all of whom are at least 18 years old. The data are stored in a CSV file named `data-acs.csv`.
The table below describes each variable in the dataset.
| Variable name | Variable description |
|---|---|
| `sex` | The individual’s sex (`Female` or `Male`). (`character`) |
| `age` | The individual’s age (`18` to `99`). (`integer`) |
| `race` | The individual’s race (6 broad categories). (`character`) |
| `hispanic` | Whether the individual is `Hispanic` or `Non-Hispanic`. (`character`) |
| `educ` | A rough estimate of the individual’s years of education (`1` = first grade; `17` = graduate school). (`integer`) |
| `empstat` | The individual’s employment status (`Employed`, `Unemployed`, `Not in labor force`). (`character`) |
| `hrs_wk` | The number of hours the individual works per week. (`integer`) |
| `income` | The individual’s income in dollars. (`integer`) |
| `deg_bachelors` | A binary indicator for whether the individual has a bachelor’s degree. (`integer`) |
| `deg_masters` | A binary indicator for whether the individual has a master’s degree. (`integer`) |
| `deg_profession` | A binary indicator for whether the individual has a professional degree (e.g., law or medicine). (`integer`) |
| `deg_phd` | A binary indicator for whether the individual has a doctorate. (`integer`) |
| `i_female` | A binary indicator for whether the individual’s sex is female. (`integer`) |
| `i_black` | A binary indicator for whether the individual is Black. (`integer`) |
| `i_white` | A binary indicator for whether the individual is White. (`integer`) |
| `i_hispanic` | A binary indicator for whether the individual is Hispanic. (`integer`) |
| `i_workforce` | A binary indicator for whether the individual is in the workforce (employed or unemployed). (`integer`) |
| `i_employed` | A binary indicator for whether the individual is employed. (`integer`) |
2 Setup
[01] Load your `R` packages (and install any packages that are not already installed).
- You will probably want `tidyverse` and `here` (among others).
- Also: `pacman` and its `p_load()` function make package management easier: you just use `p_load()` to load packages, and `R` will install the packages if they’re not already installed. E.g., use `p_load(tidyverse, here)` after you load the `pacman` package with `library(pacman)`. Remember that you will have to install `pacman` (`install.packages("pacman")`) if you have not installed it already.
[02] Now load the data (stored in `data-acs.csv`).
As described above, I saved the data as a CSV, so you’ll want to use a function that can read CSV files.
Examples of functions that can read a CSV file:
- `read_csv()` in the `readr` package, which is part of the `tidyverse`;
- `fread()` in the `data.table` package;
- `read.csv()`, which is available without loading any packages.
3 Get to know your data
[03] Use `dim()` or `nrow()` to confirm that you have 10,000 observations (rows) in your dataset.
[04] It’s good to know which variables are in the dataset and what type (`class()`) they are. How many categorical variables are in the dataset?
Hint: You have many options here; try `glimpse()` (in the `tidyverse`), `summary()`, or `skim()` (from the `skimr` package). Also: If you used `read_csv()` or `fread()` to load the data, then just typing the name of the dataset will display the first few rows and the class of each variable.
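
If none of those packages is available, base `R` offers similar checks. The sketch below uses a made-up three-column data frame (`toy_df` is a hypothetical name, not the ACS data) to show how to list each variable's class.

```r
# A small made-up data frame with one character and two integer columns
toy_df <- data.frame(
  sex    = c("Female", "Male"),
  age    = c(34L, 51L),
  income = c(52000L, NA)
)

# One class per column
toy_classes <- sapply(toy_df, class)
toy_classes

# str() prints the dimensions plus the class and first values of each column
str(toy_df)
```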
[05] How many observations are missing data on `income`?
Hints:
- The function `is.na()` detects whether observations are missing.
- You can filter your dataset to observations missing a variable using the `filter()` function, for example, `my_data |> filter(is.na(my_variable))` would filter the dataset `my_data` to observations missing values for the variable `my_variable`.
- You could also sum the results of `is.na()` to see how many of them are missing. `is.na()` returns `TRUE` or `FALSE`. `TRUE` is a `1`, and `FALSE` is a `0`.
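
The counting trick in the last hint can be seen on a short made-up vector (`toy_values` is a hypothetical name used only for this illustration):

```r
# A made-up vector with two missing values
toy_values <- c(10, NA, 25, NA, 40)

# One TRUE/FALSE per element
is.na(toy_values)

# TRUE counts as 1 and FALSE as 0, so the sum is the number of missing values
n_missing <- sum(is.na(toy_values))
n_missing   # 2

# The mean of the same logical vector is the share of values that are missing
mean(is.na(toy_values))   # 0.4
```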
[06] How many observations are missing data on education (`educ`)?
[07] If we regress `income` on `educ`, how many observations will be in that regression? Explain your answer.
4 Summarize income, education, and age
Time to make a few figures. Simple summaries and visualizations are fantastic ways to get to know the data and to try to figure out any potential issues/features. In this case, they will also provide insights into the distribution of income and education in the United States (in 2023).
[08] Calculate the mean and median of the variables `income`, `educ`, and `age`.
Hint: You can use the `mean()` and `median()` functions directly. You can also use the `summarise_all()` function to calculate the mean and median of all variables in a dataset, and `select()` allows you to select specific variables.
Example: Calculating the mean and standard deviation of `hrs_wk`:
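
A self-contained variant of this calculation on made-up data is sketched below. It uses `summarise()` with `across()`, which current versions of `dplyr` recommend over the superseded `summarise_all()`; the data frame `toy_df` and its values are assumptions for illustration only.

```r
library(dplyr)

# Made-up hours data with one missing value (NOT the ACS data)
toy_df <- data.frame(
  hrs_wk = c(40, 35, NA, 20, 45),
  other  = 1:5
)

# Keep only `hrs_wk`, then apply both functions to every remaining column.
# The lambdas pass na.rm = TRUE so the missing value is ignored.
toy_summary <- toy_df |>
  select(hrs_wk) |>
  summarise(across(
    everything(),
    list(
      mean     = ~ mean(.x, na.rm = TRUE),
      stnd_dev = ~ sd(.x, na.rm = TRUE)
    )
  ))

# Columns are named <variable>_<function>: hrs_wk_mean, hrs_wk_stnd_dev
toy_summary
```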
[09] In [08], you should have found that the mean of `income` is much larger than the median. What does this result suggest about the distribution of income in the dataset?
[10] Create a histogram of `income` to visualize the distribution of income in the dataset.
Important: Make sure to label your axes and title your plot.
Hints: You have a few options for creating histograms:
- `ggplot2` includes the `geom_histogram()` function;
- `hist()` is a base `R` function that can create histograms.
Note that both functions allow you to select the number of bins in the histogram. `ggplot2` uses either the `bins` or the `binwidth` arguments; `hist()` uses the `breaks` argument.
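
As a rough illustration of the `breaks` argument and of labeling, the sketch below draws a histogram with base `hist()`. The values are simulated (they are not the ACS data), and `toy_income` and `toy_hist` are made-up names.

```r
# Simulated, right-skewed values standing in for an income-like variable
# (these are NOT the ACS data)
set.seed(421)
toy_income <- rlnorm(500, meanlog = 10, sdlog = 1)

# `breaks` gives a suggested number of bins; hist() may adjust it so the
# break points fall on round values. `main`, `xlab`, and `ylab` set the
# title and the axis labels.
toy_hist <- hist(
  toy_income,
  breaks = 30,
  main   = "Distribution of simulated values",
  xlab   = "Simulated value",
  ylab   = "Number of observations"
)

# hist() also returns the bin counts, which add up to the sample size
sum(toy_hist$counts)
```

With `ggplot2`, the analogous choices would be `geom_histogram(bins = 30)` together with `labs()` for the title and axis labels.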
[11] In a couple (2–3) sentences, explain whether the histogram in [10] supports recent concerns/discussions about income inequality in the United States.
[12] One may be concerned that our sense of income is a bit distorted because we (1) have individuals who are out of the workforce and/or (2) have individuals outside of their “prime working years”. Repeat the histogram in [10] for individuals who are (1) in the workforce (`i_workforce == 1`) and (2) between the ages of 25 and 64 (`age >= 25 & age <= 64`).
Hint: You can use the `filter()` function to select observations that meet certain criteria, e.g., `filter(my_data, i_female == 1)` would filter the dataset `my_data` to the observations for whom `i_female` is equal to the value `1`.
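
The sketch below shows `filter()` with more than one condition on a made-up data frame (`toy_df` and its values are assumptions for illustration). Conditions separated by commas must all hold, which is equivalent to joining them with `&`.

```r
library(dplyr)

# Made-up data (NOT the ACS data)
toy_df <- data.frame(
  i_female = c(1, 0, 1, 1, 0),
  age      = c(22, 35, 41, 67, 58)
)

# Keep rows where BOTH conditions are true
toy_subset <- filter(toy_df, i_female == 1, age >= 30)

# Same result, written with an explicit `&`
toy_subset_alt <- filter(toy_df, i_female == 1 & age >= 30)

nrow(toy_subset)   # 2
```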
Important: Again, don’t forget to label your plot’s axes. A title would be good too.
[13] Did changing the sample in [12] change the distribution of income? Briefly explain your answer.
5 Analyze the returns to education
Throughout the class we’ve been talking about the returns to education… let’s run a few regressions to actually investigate these returns.
[14] Let’s start with a simple linear regression of the relationship between income and education. In other words: regress `income` on `educ` (with an intercept).
Note: Use the full dataset unless otherwise specified.
Generate a summary of the regression (estimated intercept, coefficient, and standard errors). You have a few options here:
- use the `tidy()` function from the `broom` package on the output of the `lm()` function;
- use the `summary()` function on the output of `lm()`;
- use `feols()` (from the `fixest` package) to estimate your regression (and possibly use `etable()` to display the results);
- use the `modelsummary()` function from the `modelsummary` package.
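
For a sense of what such a summary contains, here is a minimal sketch of the `lm()` plus `summary()` route on simulated data. The variables `x` and `y`, the data frame `toy_df`, and the coefficients used to simulate the data are all made up for illustration.

```r
# Simulate a simple linear relationship (NOT the ACS data)
set.seed(421)
toy_df <- data.frame(x = rnorm(100))
toy_df$y <- 2 + 3 * toy_df$x + rnorm(100)

# Regress y on x; lm() includes an intercept by default
toy_fit <- lm(y ~ x, data = toy_df)

# The coefficient table has one row per term (intercept and x) and columns
# for the estimate, standard error, t statistic, and p-value
toy_table <- summary(toy_fit)$coefficients
toy_table
```

If `broom` is installed, `broom::tidy(toy_fit)` would return the same information as a data frame.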
[15] Interpret the intercept and coefficient from the regression in [14].
[16] Based upon the regression in [14], what is the estimated income for someone with 12 years of education (i.e., a high school diploma)?
[17] Compare your estimate in [16] to the mean income for individuals with 12 years of education. Do the linear-regression-based estimates get close to the mean? Should they? Explain your answer.
Hint: Remember your friend `filter()` from earlier in problem [12].
[18] Earlier we examined how the distribution of income changes when we restrict the sample to individuals in the workforce. Explain how omitting these variables (`i_workforce` and `age`) from the regression in [14] might bias the estimated returns to education.
[19] Now add workforce participation (`i_workforce`) and age (`age`) to the regression from [14]. Provide a summary (e.g., table) of the regression results.
[20] Do the results in [19] suggest that our simple linear regression (in [14]) had an issue with omitted variable bias? Explain.
[21] One more regression: Add the binary indicator for whether the individual has a bachelor’s degree (`deg_bachelors`) to the regression in [19]. Provide a summary of the regression results.
[22] Why do you think the returns to education changed so much when we included the binary indicator for a bachelor’s degree in the regression?
6 Shifting gears: Who graduates?
The previous results suggest that maybe there’s something special about having a bachelor’s degree. Let’s explore this a bit more.
[21] First off, what share of the sample has a bachelor’s degree?
[22] Now regress the indicator for whether the individual has a bachelor’s degree (`deg_bachelors`) on age (`age`), the indicator for whether the individual is female (`i_female`), and the interaction between age and female.
Provide a summary of the regression results.
Hint: To take the interaction between two variables, you can use the `:` operator in the regression formula. For example, `lm(y ~ x1 + x2 + x1:x2)` would include the interaction between `x1` and `x2`.
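
The formula syntax in this hint can be tried on simulated data. In the sketch below, `toy_df`, `x1`, `x2`, `y`, and the simulated coefficients are all made up; it also shows that `x1 * x2` is shorthand for `x1 + x2 + x1:x2`.

```r
# Simulate data with an interaction effect (NOT the ACS data)
set.seed(421)
toy_df <- data.frame(
  x1 = rnorm(200),
  x2 = rbinom(200, size = 1, prob = 0.5)
)
toy_df$y <- 1 + 0.5 * toy_df$x1 - 0.3 * toy_df$x2 +
  0.8 * toy_df$x1 * toy_df$x2 + rnorm(200)

# Main effects plus the interaction, spelled out with `:`
fit_colon <- lm(y ~ x1 + x2 + x1:x2, data = toy_df)

# `*` expands to the same three terms
fit_star <- lm(y ~ x1 * x2, data = toy_df)

# Both fits report the terms (Intercept), x1, x2, and x1:x2
names(coef(fit_colon))
all.equal(coef(fit_colon), coef(fit_star))
```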
[23] Interpret the intercept and each of the coefficients from the regression in [22].
Hint: Remember that a regression with a binary dependent variable can be interpreted as modeling the probability that the dependent variable is equal to one.
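
The reasoning behind this hint can be written out in one step. The notation is introduced here for illustration and is not from the problem set: $D$ is `deg_bachelors`, $A$ is `age`, $F$ is `i_female`, and $\beta_0, \dots, \beta_3$ are the regression coefficients. The last line assumes the linear model correctly describes the conditional mean.

$$
\begin{aligned}
E[D \mid A, F] &= 1 \cdot P(D = 1 \mid A, F) + 0 \cdot P(D = 0 \mid A, F) \\
&= P(D = 1 \mid A, F) \\
&= \beta_0 + \beta_1 A + \beta_2 F + \beta_3 (A \times F)
\end{aligned}
$$

Because $D$ only takes the values 0 and 1, its conditional mean is a probability, which is why fitted values from such a regression can be read as estimated probabilities.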
[24] Based on the regression in [22], what is the probability that a 25-year-old female has a bachelor’s degree? What about a 25-year-old male?
7 Wrap up
[25] What are your main takeaways/insights about income and education from this problem set and its data? Explain your answer using figures/regressions from above and any additional analyses you think are relevant.