Problem Set 0: Review

EC 421: Introduction to Econometrics

Author

Edward Rubin

1 Instructions

Due Upload your HTML or PDF answers on Canvas before 11:59PM on Sunday, 19 April 2026.

Important You must submit your answers as an HTML or PDF file. The submitted file should be built from an RMarkdown (.rmd) or Quarto (.qmd) file. Do not submit the .rmd or .qmd file. You will not receive credit for it.

If we ask you to create a figure or run a regression, then the figure or the regression results should be in the document that you submit (not just the code—we want the actual figure or regression output with coefficients, standard errors, etc.).

Integrity If you are suspected of cheating, then you will receive a zero—for the assignment and possibly for the course. We may report you to the dean. Cheating includes copying from your classmates, from the internet, and from previous assignments.

Objective This problem set has three goals: (1) review the central econometrics topics you covered in EC320; (2) refresh (or build) your R toolset; (3) start building your intuition about causality within econometrics/regression.

README! The data for this problem set (stored as data-ps0.csv) come from a study titled The Association Between Income and Life Expectancy in the United States, 2001-2014 (by Raj Chetty, Augustin Bergeron, David Cutler, Benjamin Scuderi, Michael Stepner, and Nicholas Turner). The authors created a nice website for the project here. Below is a brief description of the project (written by the authors)¹:

How can we reduce socioeconomic disparities in health outcomes? Although it is well known that there are significant differences in health and longevity between income groups, debate remains about the magnitudes and determinants of these differences. We use new data from 1.4 billion anonymous earnings and mortality records to construct more precise estimates of the relationship between income and life expectancy at the national level than was feasible in prior work. We then construct new local area (county and metro area) estimates of life expectancy by income group and identify factors that are associated with higher levels of life expectancy for low-income individuals.

Our findings show that disparities in life expectancy are not inevitable. There are cities throughout America — from New York to San Francisco to Birmingham, AL — where gaps in life expectancy are relatively small or are narrowing over time. Replicating these successes more broadly will require targeted local efforts, focusing on improving health behaviors among the poor in cities such as Las Vegas and Detroit. Our findings also imply that federal programs such as Social Security and Medicare are less redistributive than they might appear because low-income individuals obtain these benefits for significantly fewer years than high-income individuals, especially in cities like Detroit.

Going forward, the challenge is to understand the mechanisms that lead to better health and longevity for low-income individuals in some parts of the U.S. To facilitate future research and monitor local progress, we have posted annual statistics on life expectancy by income group and geographic area (state, CZ, and county) at The Health Inequality Project website. Using these data, researchers will be able to study why certain places have high or improving levels of life expectancy and ultimately apply these lessons to reduce health disparities in other parts of the country.

Variable names and descriptions
Variable name	Variable type	Variable description
`county_name`	character	County name
`county_code`	integer	County FIPS code
`state_abb`	character	State abbreviation
`income_quartile`	numeric	Income quartile (either 1 or 4)
`life_exp`	numeric	Life expectancy (years)
`pct_uninsured`	numeric	Proportion uninsured
`poverty_rate`	numeric	Share below poverty line
`pct_religious`	numeric	Proportion religious
`pct_black`	numeric	Proportion Black
`pct_hispanic`	numeric	Proportion Hispanic
`unemployment_rate`	numeric	Unemployment rate
`median_hh_inc`	numeric	Median household income (in thousands of dollars)
`is_urban`	integer	Urban county indicator (1 or 0)
`pop`	integer	County population
`pop_density`	numeric	Population density
`pct_smoke`	numeric	Proportion who smoke
`pct_obese`	numeric	Proportion obese (BMI)
`pct_exercise`	numeric	Proportion who exercise

You can find more information about the life-expectancy variables here and the county characteristics here.

2 Setup

[00] Let’s start by loading the R packages.

You will need to install any packages that are not already installed. After you’ve installed them one time, you will not need to install them again. (The pacman package makes this easier; see the hint below.)

You will likely want to use tidyverse and here (among others).
Also: pacman and its p_load() function make package management easier—you just use p_load() to load packages, and R will install the packages if they’re not already installed. E.g., use p_load(tidyverse, here) after you load the pacman package with library(pacman). Remember that you will have to install pacman (install.packages("pacman")) if you have not installed it already.

Here’s an example where I load five packages:

tidyverse (for data manipulation),
scales (for formatting numbers),
patchwork (for combining plots),
fixest (for regressions),
here (for managing file paths).

# Load packages using 'pacman'
library(pacman)
p_load(tidyverse, scales, patchwork, fixest, here)

[01] Now load the data (stored in data-ps0.csv).

As described above, I saved the data as a CSV, so you’ll want to use a function that can read CSV files.

Examples of functions that can read a CSV file:

read_csv() in the readr package, which is part of the tidyverse;
fread() in the data.table package (really nice for large datasets; not part of the tidyverse);
read.csv(), which is available without loading any packages (but is slower and less convenient).

Hint: Use the here() function to create the file path to the data file. For example, if your data file is in a folder called data in your project directory and is called my_data.csv, then you would use here('data', 'my-data.csv') to create the file path, e.g.,

# Load a (pretend) dataset called 'my-data.csv' from a folder called 'data'
my_df = here('data', 'my-data.csv') |> read_csv()

You will need to adjust the file path to (1) match where the data file is stored in your project directory and (2) match the name of our data file.

3 Get to know your data

The first step in any data analysis is to get to know your data. This includes understanding the variables, their types, their distributions, whether there are any missing data, etc.

[02] Let’s start simply: How many observations (rows) are in the dataset? How many variables? Are any observations missing data?

Hints:

The functions dim(), nrow(), ncol() show the number of rows and columns in a dataset, e.g., nrow(some_data).
The function na.omit() removes observations with any missing data.

[03] Are there any variables that are not numeric? If so, which ones?

Hint: The glimpse() function from the tibble package (part of the tidyverse) provides a nice summary of each variable in a dataset, including its type.

4 Summarizing data

Time to make a few figures. Simple summaries and visualizations are fantastic ways to get to know the data and to try to figure out any potential issues/features.

[04] Create a single histogram of the median household income (median_hh_inc in the dataset).

Important: Make sure to label your axes. A title would be good too. Aesthetics (colors, themes, etc.) are up to you.

Hints:

Median income is measured in thousands of dollars.
You may want to format the x-axis labels with scales::dollar to make the numbers easier to read.

[05] Repeat the histogram in [04] but with life expectancy (life_exp) on the x-axis instead of median household income.

[06] That’s strange… why does the histogram in [04] look “normal” (i.e., bell-shaped), while the histogram in [05] looks “bimodal” (i.e., has two peaks)?

[07] The histogram in [05] ignores the fact that each county has two observations in the dataset: the life expectancy for the individuals in the lowest income quartile in the county (the county’s bottom 25% of the income distribution) and the life expectancy for individuals in the highest income quartile in the county (the county’s top 25% of the income distribution).

Create two separate histograms of life expectancy: one for the lowest income quartile (income_quartile == 1) and one for the highest income quartile (income_quartile == 4).

Hints:

You can use the filter() function to create separate datasets for the lowest and highest income quartiles.
The variable income_quartile tells you whether the observation belongs to the lowest income quartile (when income_quartile == 1) or the highest income quartile (when income_quartile == 4).

Extra: If you’re interested, you can use the patchwork package to combine the two histograms into a single figure.

[08] One more histogram. Create a histogram of the difference in life expectancy between the highest and lowest income quartiles for each county.

Note: This histogram requires that you create a new variable that captures the difference in life expectancies between Q4 and Q1 for each county. There are several ways to accomplish this task. Here’s one route:

Create a new dataset that only includes observations for the lowest income quartile (filter-ing to rows where income_quartile == 1).
Create a new dataset that only includes observations for the highest income quartile (filter-ing to rows where income_quartile == 4).
Merge the two datasets together by county (using the county_code variable).
Create a new variable that captures the difference in life expectancies between the highest and lowest income quartiles for each county (e.g., life_exp_diff = life_exp_q4 - life_exp_q1).

Here’s what the code for these steps might look like:

# Step 1: Create dataset for lowest income quartile
life_q1 =
  life_df |>
  filter(income_quartile == 1) |>
  select(county_code, life_exp_q1 = life_exp)
# Step 2: Create dataset for highest income quartile
life_q4 =
  life_df |>
  filter(income_quartile == 4) |>
  select(county_code, life_exp_q4 = life_exp)
# Step 3: Merge datasets
life_diff =
  life_q1 |>
  full_join(life_q4, by = "county_code")
# Step 4: Create new variable for difference in life expectancy
life_diff =
  life_diff |>
  mutate(life_exp_diff = life_exp_q4 - life_exp_q1)

Then make your histogram of the life_exp_diff variable.

[09] Summarize your findings from the histograms/reflections in [04] through [08]. What do these histograms tell you about how life expectancy varies across counties and also within counties?

5 Digging into the relationship between income and life expectancy

[10] Plot a scatter plot (geom_point if you are using ggplot2) of life expectancy (life_exp) on the y-axis and median household income (median_hh_inc) on the x-axis. Color the points by income quartile.

Hint: You can use the color aesthetic in ggplot2 to color points by a categorical variable, e.g., aes(color = income_quartile). If R sees income_quartile as a numeric variable, you can convert it to a factor with as.factor(income_quartile), for example, aes(color = as.factor(income_quartile)).

[11] Summarize the relationship between life expectancy and median household income that you see in the scatter plot in [10]. Imagine you’re writing a quick summary for a policymaker who is interested in the relationship between income and life expectancy. What would you say?

Hint: The median household income variable is for the entire county, so you can think of it as measuring whether a county is generally more affluent. The life expectancy variable is for a specific income quartile in the county, so you can think of it as measuring the life expectancy of either the lowest or highest income quartile in the county.

[12] Time for a regression. Regress life expectancy (life_exp) on median household income (median_hh_inc). Show your results (e.g., summary() or a nice table).

[13] Interpret the intercept and coefficient from the regression in [12].

[14] That simple linear regression was nice, but we already know from the scatter plot in [10] that there are differences in life expectancy between the lowest and highest income quartiles. Let’s dig into those differences.

Repeat the regression in [12] for two separate subsets of the data: the lowest income quartile (income_quartile == 1) and the highest income quartile (income_quartile == 4).

Report your results (you don’t need to interpret them yet).

To be clear, you’re estimating two different simple linear regressions. Each estimates the model below separately for the lowest and highest income quartiles: \[ (\text{Life expectancy})_i = \beta_0 + \beta_1 (\text{Median household income})_i + u_i \]

Hint: You can use the filter() function to create separate datasets for the lowest and highest income quartiles, and then run the regression separately on each dataset.

[15] Do the regression results in [14] suggest that the association between life expectancy and median household income is the same for the lowest and highest income quartiles? Explain your answer.

[16] Now run a single regression that includes all observations. The model you are estimating is \[\begin{align} (\text{Life expectancy})_i = \beta_0 &+ \beta_1 (\text{Median household income})_i + \beta_2 (\text{Income quartile} = 4)_i \\ &+ \beta_3 (\text{Median household income})_i \times (\text{Income quartile} = 4)_i + u_i \end{align}\] In other words, you have a coefficient on median household income, a coefficient on an indicator variable for the highest income quartile, and an interaction term between median household income and the income-quartile indicator variable.

Estimate the model and report your results.

Hint:

You will probably want to create a new variable that is an indicator for the highest income quartile, e.g., is_q4 = as.integer(income_quartile == 4).
Recall that in R you can create an interaction term with : in the formula, e.g., y ~ x1 + x2 + x1 : x2 (or even y ~ x1 * x2).

[17] Interpret the intercept and each of the three coefficients in the regression in [16].

[18] Explain how the regression results in [16] relate to the regression results in [14].

Hint: Recall that an interaction between a continuous variable and a binary variable allows you to estimate two different slopes.

[19] Do the regression results in [16] suggest that the association between life expectancy and median household income is the same for the lowest and highest income quartiles? Explain your answer.

[20] Write a one-paragraph summary of your findings from this assignment. Feel free to include additional figures, tables, or analyses if you think they would be helpful for summarizing your findings. Again, imagine you’re writing a quick summary for your supervisor or for a policymaker who is interested in the relationship between income and life expectancy. What would you say? How would you summarize the relationship? Would you say that the relationship is causal?

Reminder Submit your final file to Canvas as PDF or HTML only.
(Do not submit it as .rmd or .qmd.)

Footnotes

You can find the full paper on JAMA and a non-technical summary from the authors here.↩︎