Problem Set 0: Review

EC 421: Introduction to Econometrics

Author

Edward Rubin

1 Instructions

Due: Upload your PDF or HTML answers on Canvas before 11:59PM on Tuesday, 20 Jan. 2026.

Important: You must submit your answers as an HTML or PDF file. The submitted file should be built from an RMarkdown (.rmd) or Quarto (.qmd) file. Do not submit the .rmd or .qmd file. You will not receive credit for it.

If we ask you to create a figure or run a regression, then the figure or the regression results should be in the document that you submit (not just the code—we want the actual figure or regression output with coefficients, standard errors, etc.).

Integrity: If you are suspected of cheating, then you will receive a zero for the assignment and possibly for the course. We may report you to the dean. Cheating includes copying from your classmates, from the internet, and from previous assignments.

Objective: This problem set has three goals: (1) review the central econometrics topics you covered in EC320; (2) refresh (or build) your R toolset; (3) start building your intuition about causality within econometrics/regression.

README! The data in this problem set come from a classic labor economics paper (“Evaluating the Econometric Evaluations of Training Programs” by Robert LaLonde) that examined a job-skill training program using a variety of analyses. The program, the National Supported Work Demonstration (NSW), aimed to provide individuals struggling on the job market with job-related training. As with LaLonde and many other papers, we are going to explore how this training program affected participants’ earnings.

The table below describes each variable in the dataset. The dataset that I am sharing with you is a modified version of the original datasets available here.

Variable names and descriptions
Variable name	Variable description
`treat`	indicator for participation in the NSW program (1 if participated, 0 otherwise)
`re75`	real earnings in 1975 (1982 dollars)
`re78`	real earnings in 1978 (1982 dollars)
`age`	age measured in years
`education`	education measured in years
`black`	indicator for race (1 if black, 0 otherwise)
`hispanic`	indicator for Hispanic ethnicity (1 if Hispanic, 0 otherwise)
`married`	indicator for marital status (1 if married, 0 otherwise)
`nodegree`	indicator for lacking a high school diploma (1 if no degree, 0 otherwise)
`data_id`	character variable describing the data source (‘NSW’ or ‘PSID’)

2 Setup

[00] Let’s start by loading the R packages.

You will need to install any packages that are not already installed. After you’ve installed them one time, you will not need to install them again. (The pacman package makes this easier; see the hint below.)

You will likely want to use tidyverse and here (among others).
Also, pacman and its p_load() function make package management easier: you just use p_load() to load packages, and R will install any packages that are not already installed. E.g., use p_load(tidyverse, here) after you load the pacman package with library(pacman). Remember that you will have to install pacman (install.packages("pacman")) if you have not installed it already.

Here’s an example where I load five packages:

tidyverse (for data manipulation),
scales (for formatting numbers),
patchwork (for combining plots),
fixest (for regressions),
here (for managing file paths).

# Load packages using 'pacman'
library(pacman)
p_load(tidyverse, scales, patchwork, fixest, here)

[01] Now load the data (stored in data-ps0.csv).

As described above, I saved the data as a CSV, so you’ll want to use a function that can read CSV files.

Examples of functions that can read a CSV file:

read_csv() in the readr package, which is part of the tidyverse;
fread() in the data.table package;
read.csv(), which is available without loading any packages.

Hint: Use the here() function to create the file path to the data file. For example, if your data file is in a folder called data in your project directory and is called my-data.csv, then you would use here('data', 'my-data.csv') to create the file path, e.g.,

# Load data
acs_df = here('data', 'my-data.csv') |> read_csv()

You will need to adjust the file path to (1) match where the data file is stored in your project directory and (2) match the name of our data file.

3 Get to know your data

The first step in any data analysis is to get to know your data. This includes understanding the variables, their types, their distributions, and any missing data.

[02] Let’s start simply: How many observations (rows) are in the dataset? How many variables? Are any observations missing data?

Hints:

The functions dim(), nrow(), ncol() show the number of rows and columns in a dataset, e.g., nrow(some_data).
The function na.omit() removes observations with any missing data.
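
As a minimal sketch of how these functions fit together, the example below builds a small, made-up data frame (toy_df and its columns are hypothetical and are not the problem-set data) and counts its rows, its columns, and the rows with any missing value. It uses only base R.

```r
# Toy data (hypothetical; NOT the problem-set data)
toy_df = data.frame(
  wage = c(10, 12, NA, 15),
  age = c(25, 31, 40, NA),
  group = c("a", "b", "a", "b")
)

# Number of rows and columns
dim(toy_df)   # rows first, then columns
nrow(toy_df)
ncol(toy_df)

# na.omit() drops every row that has at least one missing value,
# so the difference in row counts is the number of incomplete rows
n_missing = nrow(toy_df) - nrow(na.omit(toy_df))
n_missing

# Count of missing values in each variable
colSums(is.na(toy_df))
```

Comparing the row count before and after na.omit() is one of several ways to check for missing data. The colSums(is.na(...)) line also shows which variables are affected.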

[03] Are there any variables that are not numeric? If so, which ones?

Hint: The glimpse() function from the tibble package (part of the tidyverse) provides a nice summary of each variable in a dataset, including its type.
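
If you would like a base-R check alongside glimpse(), the sketch below is one option. The data frame toy_df and its columns are made up for illustration. The code lists each variable's class and then keeps the names of the variables that are not numeric.

```r
# Toy data (hypothetical; NOT the problem-set data)
toy_df = data.frame(
  earnings = c(1200.5, 0, 830),
  married = c(1L, 0L, 1L),
  source = c("A", "B", "A")
)

# The class (type) of each variable
var_types = sapply(toy_df, class)
var_types

# Names of the variables that are not numeric
non_numeric = names(toy_df)[!sapply(toy_df, is.numeric)]
non_numeric
```

With the tidyverse loaded, glimpse(toy_df) prints similar type information in a more compact layout.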

4 Summarizing data

Time to make a few figures. Simple summaries and visualizations are fantastic ways to get to know the data and to try to figure out any potential issues/features.

[04] Create histograms of individuals’ real earnings in 1975 and 1978 (re75 and re78 variables).

Important: Make sure to label your axes. A title would be good too. Aesthetics (colors, themes, etc.) are up to you.
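
For reference, here is a rough sketch of a labeled histogram in ggplot2 (part of the tidyverse). The data are simulated and the variable name income is hypothetical, so you will need to swap in the problem-set data and variables. The bin count, colors, and theme are arbitrary choices.

```r
library(ggplot2)

# Simulated data (hypothetical; NOT the problem-set data)
set.seed(421)
toy_df = data.frame(income = rexp(500, rate = 1 / 20000))

# Histogram with labeled axes and a title
p = ggplot(toy_df, aes(x = income)) +
  geom_histogram(bins = 30, fill = "grey40", color = "white") +
  scale_x_continuous(labels = scales::comma) +
  labs(
    x = "Income (dollars)",
    y = "Number of individuals",
    title = "Distribution of simulated income"
  ) +
  theme_minimal()

# Print the plot
p
```

The labs() function sets the axis labels and title in one place. The scales::comma labeller formats the axis values with thousands separators.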

[05] Do the histograms in [04] provide any insights about the NSW program’s efficacy? Explain your answer.

[06] Now create separate histograms of individuals’ real earnings in 1975 by whether they participated in the NSW program (treat == 1) or not (treat == 0).

Hint: You can use the filter() function to create separate datasets for participants and non-participants.
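
A small sketch of that idea, using a made-up data frame (toy_df, in_program, and income are hypothetical names, not the problem-set variables):

```r
library(dplyr)

# Toy data (hypothetical; NOT the problem-set data)
toy_df = data.frame(
  in_program = c(1, 0, 1, 0, 0),
  income = c(500, 2000, 0, 3500, 1800)
)

# One dataset per group
program_df = toy_df |> filter(in_program == 1)
other_df = toy_df |> filter(in_program == 0)

# Check the group sizes
nrow(program_df)
nrow(other_df)
```

Each filtered dataset can then go into its own ggplot() call. If you load patchwork, you can combine two plots side by side with `+`.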

[07] Participants in the NSW program received jobs training between 1975 and 1978—i.e., after their real earnings were recorded in 1975. Based upon the histograms in [06], do NSW participants and non-participants appear to have similar earnings distributions in 1975 (prior to the program)? Briefly explain your answer.

[08] Why would a difference in earnings distributions in 1975 (before the NSW program) between participants and non-participants be a problem for estimating the effect of the NSW program on earnings in 1978?

Hint: Think about exogeneity and omitted-variable bias.

[09] Do the participants and non-participants appear to differ along other dimensions? Create two histograms that compare the two groups along another variable.

[10] Another way to summarize the data is to look at summary statistics. Find the mean of real earnings in 1975 (re75) for both participants and non-participants.
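
One way to compute a mean by group is dplyr's group_by() followed by summarize(). The sketch below uses a made-up data frame (the names toy_df, in_program, and income are hypothetical); other approaches, such as filtering and then calling mean(), work just as well.

```r
library(dplyr)

# Toy data (hypothetical; NOT the problem-set data)
toy_df = data.frame(
  in_program = c(1, 0, 1, 0, 0),
  income = c(500, 2000, 0, 3500, 1800)
)

# Mean income and number of observations in each group
group_means = toy_df |>
  group_by(in_program) |>
  summarize(mean_income = mean(income), n = n())

group_means
```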

[11] Finally, let’s formally (statistically) test whether participants and non-participants differed in their pre-treatment, 1975 real earnings.

Regress real earnings in 1975 (re75) on the indicator for participation in the NSW program (treat) (with an intercept).
Provide a summary of the regression (estimated intercept, coefficient, standard errors, etc.).
Interpret the intercept and coefficient from the regression.
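
The mechanics of regressing an outcome on an indicator can be sketched as follows. The data are simulated, and the names y and d are hypothetical. The example uses base R's lm() so that it runs without extra packages; feols() from fixest accepts the same formula and data arguments, e.g., feols(y ~ d, data = toy_df).

```r
# Simulated data (hypothetical; NOT the problem-set data)
set.seed(421)
toy_df = data.frame(d = rep(c(0, 1), each = 50))
toy_df$y = 10 + 2 * toy_df$d + rnorm(100)

# Regress y on the indicator d (R includes an intercept by default)
toy_reg = lm(y ~ d, data = toy_df)

# Full regression summary
summary(toy_reg)

# Only the coefficient table:
# estimates, standard errors, t statistics, and p-values
coef_table = summary(toy_reg)$coefficients
coef_table
```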

[12] Using the regression results from [11]: Conduct a hypothesis test of whether NSW participants and non-participants differed in their real earnings in 1975. Use a significance level of 5%.

Use the following steps to guide your hypothesis test:

State the null and alternative hypotheses.
Report the test statistic and p-value.
State whether you reject or fail to reject the null hypothesis.
Provide a brief conclusion in the context of the problem.
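
To illustrate where the test statistic and p-value come from, the sketch below runs a two-sided t test on a slope coefficient using simulated data (the names y and d are hypothetical). It pulls both numbers from the lm() coefficient table and then reproduces them by hand. It shows the mechanics only; the hypotheses and conclusion for the problem set are yours to write.

```r
# Simulated data (hypothetical; NOT the problem-set data)
set.seed(421)
toy_df = data.frame(d = rep(c(0, 1), each = 50))
toy_df$y = 10 + 2 * toy_df$d + rnorm(100)
toy_reg = lm(y ~ d, data = toy_df)
coef_table = summary(toy_reg)$coefficients

# H0: the coefficient on d equals 0; HA: it does not equal 0
# Test statistic and p-value as reported by R
t_stat = coef_table["d", "t value"]
p_value = coef_table["d", "Pr(>|t|)"]

# The same numbers by hand: estimate divided by standard error,
# then a two-sided p-value from the t distribution
t_manual = coef_table["d", "Estimate"] / coef_table["d", "Std. Error"]
p_manual = 2 * pt(abs(t_manual), df = df.residual(toy_reg), lower.tail = FALSE)

# Decision at the 5% level: TRUE means reject H0
reject_null = p_value < 0.05
reject_null
```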

5 Analyzing the NSW’s impact

Now that we have a sense of the data, let’s dig into the impact of the NSW program.

[13] Regress real earnings in 1978 (re78) on the indicator for participation in the NSW program (treat) (with an intercept).

Report your results (e.g., a table with the estimated intercept, coefficient, standard errors).

[14] Interpret the intercept and coefficient from the regression in [13].

[15] Based upon the regression in [13] and your work in the preceding section: Did the NSW program appear to help participants increase their earnings? Explain your answer.

[16] Given our previous findings that NSW participants and non-participants differed in their pre-treatment earnings (and along other dimensions), let’s control for each of the following variables: re75, age, education, black, hispanic, married, and nodegree.

Regress real earnings in 1978 (re78) on the indicator for participation in the NSW program (treat) and each of the variables listed above (along with an intercept). Report the results (again, in a table).

Just to be clear: Your regression should have one outcome variable (re78) and eight independent variables (treat, re75, age, education, black, hispanic, married, and nodegree).
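
In R's formula syntax, you add control variables by joining them with `+` on the right-hand side. Here is a minimal sketch with simulated data, where y, d, x1, and x2 are hypothetical names. It fits a regression without controls and one with controls, and then places the two estimated coefficients on d next to each other. It uses base R's lm(); feols() from fixest takes the same formula.

```r
# Simulated data (hypothetical; NOT the problem-set data)
set.seed(421)
n = 200
x1 = rnorm(n)
x2 = rnorm(n)
d = as.numeric(x1 + rnorm(n) > 0)   # d is correlated with x1
y = 1 + 2 * d + 3 * x1 + rnorm(n)
toy_df = data.frame(y, d, x1, x2)

# Regression without controls
reg_short = lm(y ~ d, data = toy_df)

# Regression with controls, joined by '+'
reg_long = lm(y ~ d + x1 + x2, data = toy_df)

# Coefficient on d from each regression
c(short = coef(reg_short)[["d"]], long = coef(reg_long)[["d"]])
```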

[17] Does “controlling” for these additional variables meaningfully change your estimate of the effect of the NSW program on real earnings in 1978? Explain your answer.

[18] It turns out that the individuals in the dataset come from two different sources:

a randomized experiment of the NSW program (the ‘NSW’ data, where the variable data_id equals "NSW"),
a non-experimental comparison group from the Panel Study of Income Dynamics (the ‘PSID’ data, where the variable data_id equals "PSID").

Create a new dataset that only includes individuals from the randomized NSW experiment (i.e., subset/filter the data to only include observations where data_id == 'NSW').

Hint: You can use the filter() function to create a new dataset that only includes observations that meet certain criteria. For example,

marry_df = nsw_df |> filter(married == 1)

creates a new dataset called marry_df that only includes observations where the married variable equals 1.

[19] In this new dataset (the individuals where data_id == 'NSW'), whether individuals participated in the NSW program (treat variable) was randomly assigned.

Using this new dataset, regress real earnings in 1978 (re78) on the indicator for participation in the NSW program (treat) (with an intercept). Report your results (a table with the estimated intercept, coefficient, standard errors, etc.).

[20] Based upon the regression in [19]: Did the NSW program appear to help participants increase their earnings? Explain your answer.

Hint: Why is this regression different from the regression in [13]? How does the random assignment of NSW participation for this subset of individuals change “things”?

Reminder: Submit your final file to Canvas as PDF or HTML only.
(Do not submit it as .rmd or .qmd.)