Problem Set 2: Time Series

EC 421: Introduction to Econometrics

Author

Edward Rubin

1 Instructions

Due Upload your answer on Canvas before midnight on Tuesday, 27 May 2025.

Important You must submit your answers as an HTML or PDF file, built from an RMarkdown (.RMD) or Quarto (.qmd) file. Do not submit the .RMD or .qmd file. You will not receive credit for it.

If we ask you to create a figure or run a regression, then the figure or the regression results should be in the document that you submit (not just the code—we want the actual figure or regression output with coefficients, standard errors, etc.).

Integrity If you are suspected of cheating, then you will receive a zero—for the assignment and possibly for the course. We may report you to the dean. Cheating includes copying from your classmates, from the internet, and from previous assignments.

README! It’s important that you understand the variables for this problem set. The data for this problem set come from two main sources:

  1. annual life expectancy at birth from the Human Mortality Database;
  2. real GDP (2017 dollars), CPI (indexed to 1982–1984), and US population from the St. Louis Federal Reserve’s FRED database.

The table below describes each variable in the dataset.

Variable names and descriptions

Variable name   Variable description
year            The year of the observations (t).
exp_female      The estimated life expectancy for a female born in year t.
exp_male        The estimated life expectancy for a male born in year t.
exp_pop         The estimated life expectancy for a person born in year t.
pop             Estimated US population in the given year (in millions).
gdp             Real US GDP (in hundreds of billions of 2017 dollars).
cpi             Consumer price index (CPI), indexed 1982-1984=100.
inf             The inflation rate (based upon the CPI) in the given year (in percentage points, i.e., 7.67 is 7.67%).

Objective This problem set has three main purposes: (1) reinforce what you learned about time-series data and its analysis; (2) continue building your R toolset; (3) develop your intuition on how to analyze “real-world” data.

2 Load the data

[01] Load your R packages and the dataset (data-ps2.csv). You will probably want tidyverse and here.

Answer

# Load packages using 'pacman'
library(pacman)
p_load(skimr, tidyverse, patchwork, fixest, collapse, here)
# Load the data
ps_df = here('data-life.csv') |> read_csv()
Rows: 72 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (8): year, exp_female, exp_male, exp_pop, pop, gdp, cpi, inf

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

[02] Now check out the data.

  • Which years do the data cover?
  • Are the data in any order?
  • Are there any missing values?

Hint: Some combination of skim() from skimr and glimpse() from dplyr could be helpful here.

Answer

# Use skim and glimpse to check out the data
ps_df |> skim()
Data summary
Name ps_df
Number of rows 72
Number of columns 8
_______________________
Column type frequency:
numeric 8
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 1987.50 20.93 1952.00 1969.75 1987.50 2005.25 2023.00 ▇▇▇▇▇
exp_female 0 1 77.59 2.97 71.50 74.59 78.34 79.93 81.46 ▃▅▂▇▇
exp_male 0 1 71.28 3.68 65.54 67.03 71.32 74.62 76.47 ▇▂▅▃▇
exp_pop 0 1 74.42 3.37 68.38 70.68 74.86 77.26 78.95 ▆▂▃▅▇
pop 0 1 248.82 53.54 157.49 204.40 243.86 296.82 337.01 ▆▇▆▆▇
gdp 0 1 76.47 74.76 3.67 10.59 50.46 132.33 277.21 ▇▃▂▂▁
cpi 0 1 125.62 84.35 26.57 38.30 115.95 196.84 304.70 ▇▃▃▃▂
inf 0 1 3.51 2.73 -0.32 1.61 2.90 4.23 13.50 ▇▇▂▁▁
ps_df |> glimpse()
Rows: 72
Columns: 8
$ year       <dbl> 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961,…
$ exp_female <dbl> 71.50, 71.91, 72.70, 72.80, 72.95, 72.74, 73.00, 73.31, 73.…
$ exp_male   <dbl> 65.54, 65.80, 66.57, 66.58, 66.61, 66.36, 66.60, 66.75, 66.…
$ exp_pop    <dbl> 68.38, 68.72, 69.50, 69.56, 69.64, 69.41, 69.67, 69.90, 69.…
$ pop        <dbl> 157.4926, 160.1462, 162.9683, 165.8723, 168.8577, 171.9068,…
$ gdp        <dbl> 3.67341, 3.89218, 3.90549, 4.25478, 4.49353, 4.74039, 4.812…
$ cpi        <dbl> 26.56667, 26.76833, 26.86500, 26.79583, 27.19083, 28.11333,…
$ inf        <dbl> 2.2843943, 0.7590966, 0.3611232, -0.2574601, 1.4741098, 3.3…

The data cover the years 1952–2023 with no missing values. The data are ordered by year, increasing from 1952 to 2023.
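The same three checks can also be computed directly rather than read off the summary output. A minimal sketch in base R, assuming a small made-up data frame (`toy_df` and its values are hypothetical stand-ins for `ps_df`, not the real data):

```r
# Hypothetical stand-in for ps_df (made-up values, not the real data)
toy_df = data.frame(
  year = 1952:1956,
  exp_pop = c(68.4, 68.7, 69.5, 69.6, 69.6)
)
# Which years do the data cover?
year_range = range(toy_df$year)
# Are the rows sorted by year (ascending)?
is_sorted = !is.unsorted(toy_df$year)
# How many missing values are in each column?
n_missing = colSums(is.na(toy_df))
```

Swapping `toy_df` for `ps_df` should give the same kind of answers for the actual dataset.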

3 Time-series plots

[03] Time for some time-series plots.

Create five “classic” time-series plots (i.e., line plots with time on the x-axis):

  • population (pop),
  • GDP (gdp),
  • CPI (cpi),
  • inflation rate (inf),
  • female, male, and population life expectancy (exp_female, exp_male, and exp_pop) all on the same plot.

Hint: For the last plot, you can either use geom_line() three different times or you can use pivot_longer() to reshape the data and then specify color in aes().

Answer

# Create the time-series plots
p1 = ps_df |> 
  ggplot(aes(x = year, y = pop)) +
  geom_line() +
  labs(title = 'US Population (1952-2023)',
       x = 'Year',
       y = 'Population (millions)') +
  theme_minimal(base_size = 9, base_family = 'Fira Sans Condensed')
p2 = ps_df |>
  ggplot(aes(x = year, y = gdp)) +
  geom_line() +
  labs(title = 'US GDP (1952-2023)',
       x = 'Year',
       y = 'GDP (hundreds of billions of dollars)') +
  theme_minimal(base_size = 9, base_family = 'Fira Sans Condensed')
p3 = ps_df |>
  ggplot(aes(x = year, y = cpi)) +
  geom_line() +
  labs(title = 'US CPI (1952-2023)',
       x = 'Year',
       y = 'CPI (1982-1984=100)') +
  theme_minimal(base_size = 9, base_family = 'Fira Sans Condensed')
p4 = ps_df |>
  ggplot(aes(x = year, y = inf)) +
  geom_line() +
  labs(title = 'US Inflation Rate (1952-2023)',
       x = 'Year',
       y = 'Inflation Rate (%)') +
  theme_minimal(base_size = 9, base_family = 'Fira Sans Condensed')
p5 = ps_df |> pivot_longer(cols = starts_with('exp_')) |>
  ggplot(aes(x = year, y = value, color = name)) +
  geom_line() +
  labs(title = 'Life Expectancy (1952-2023)',
       x = 'Year',
       y = 'Life Expectancy (years)') +
  theme_minimal(base_size = 9, base_family = 'Fira Sans Condensed') +
  scale_color_viridis_d('Group', labels = c('Female', 'Male', 'Total')) +
  theme(legend.position = 'bottom')
# Combine the plots
p1 + p2 + p3 + p4 + p5

Time-series plots of population, GDP, CPI, inflation, and life expectancy.

[04] Do any of the time-series plots in [03] suggest the variables are autocorrelated? Explain your answer.

Answer

Yes. Except for the inflation rate, each of the time-series plots is strongly suggestive of autocorrelation. In fact, most of the time-series plots show a clear upward trend, which is a strong indicator of autocorrelation. The inflation rate does not show a clear trend, but it does have periods of high and low inflation that could also suggest some degree of autocorrelation.
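A plot-based impression can be backed up with a number: the lag-1 sample autocorrelation. The sketch below is an illustration on simulated series (both series and the seed are assumptions made for the example, not the problem-set data); it uses `acf()` from base R's stats package.

```r
set.seed(421)
n = 72
# Simulated (hypothetical) series: one trending, one pure noise
trending = cumsum(rnorm(n, mean = 1))  # random walk with drift
noise = rnorm(n)                       # no dependence across time
# acf() stores lag 0 first, so the lag-1 autocorrelation is element 2
rho_trend = acf(trending, lag.max = 1, plot = FALSE)$acf[2]
rho_noise = acf(noise, lag.max = 1, plot = FALSE)$acf[2]
```

The trending series should produce a lag-1 autocorrelation close to one, while the noise series should be near zero. Applying the same call to a column such as `ps_df$pop` would give the corresponding number for the real data.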

[05] We’re often interested in GDP per capita (GDP divided by population), rather than plain GDP. Create a new variable in your dataset called gdppc that is GDP per capita. Then create a time-series plot of gdppc (with time on the x-axis).

Answer

# Create a new variable for GDP per capita
ps_df = ps_df |> 
  mutate(gdppc = gdp / pop)
# Create the time-series plot
ps_df |> 
  ggplot(aes(x = year, y = gdppc)) +
  geom_line() +
  labs(title = 'US GDP per Capita (1952-2023)',
       x = 'Year',
       y = 'GDP per Capita (100K USD)') +
  theme_minimal(base_size = 9, base_family = 'Fira Sans Condensed') +
  theme(legend.position = 'bottom')

Time-series plot of GDP per capita.

[06] Does the time-series for GDP per capita have a different shape/trend than plain GDP? In other words: Did dividing one time-series by the other address any of your concerns from [04]? Explain your answer.

Answer No, dividing GDP by population did not address the concerns from [04]. The time-series for GDP per capita still shows a clear upward trend, which suggests that the variable is autocorrelated.

4 Time-series analyses

[07] Let’s estimate a static model: Regress the population’s life expectancy (exp_pop variable) on the GDP per capita (gdppc variable) and the inflation rate (inf variable). Report your results (don’t interpret them).

Answer

# Static regression
reg07 = feols(exp_pop ~ gdppc + inf, data = ps_df)
etable(reg07)
                            reg07
Dependent Var.:           exp_pop
                                 
Constant        70.45*** (0.3698)
gdppc           13.79*** (0.7816)
inf               0.1028 (0.0640)
_______________ _________________
S.E. type                     IID
Observations                   72
R2                        0.81944
Adj. R2                   0.81421
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[08] Interpret the results from your regression in [07]. What do the coefficients on gdppc and inf suggest about the relationship between GDP per capita, inflation, and life expectancy?

Answer The coefficient on GDP per capita is strongly significant and positive. The interpretation: For each additional 100,000 USD in GDP per capita, life expectancy increases by approximately 13.79 years, all else equal.

The coefficient on inflation is positive but not statistically significant. Its interpretation is that a one-percentage-point increase in the inflation rate is associated with a 0.10-year increase in life expectancy, all else equal.

[09] Do the signs of the coefficients on gdppc and inf make sense? Why or why not?

Answer The sign on GDP per capita seems to make sense: more wealth often associates with better health outcomes and longer lives. The inflation result is less clear—though it also is not statistically significant. It’s important to keep in mind that we’re looking at a static model of how economic conditions at the time of birth affect estimated life expectancy. Perhaps inflation at time of birth is not meaningfully related to life expectancy.

[10] Now add population (pop) to the static regression in [07]. Report your results (you don’t need to interpret them).

Answer

# Static regression
reg10 = feols(exp_pop ~ gdppc + inf + pop, data = ps_df)
etable(reg10)
                             reg10
Dependent Var.:            exp_pop
                                  
Constant          53.27*** (1.140)
gdppc            -8.251*** (1.493)
inf               -0.0311 (0.0319)
pop             0.0941*** (0.0062)
_______________ __________________
S.E. type                      IID
Observations                    72
R2                         0.95918
Adj. R2                    0.95738
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[11] Did including population meaningfully change the results from the regression in [07]? Explain your answer. What’s going on here?

Answer Including population substantially changed the results. GDP per capita now has a significant and negative coefficient. Inflation is still not statistically significant, but it also has a negative coefficient.

If we think about the changes in terms of the omitted-variables formula, we’ve included a variable (population) that is correlated with the previous variables and may have an effect on life expectancy, changing the estimated coefficients on GDP per capita and inflation.
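To see how an omitted, correlated regressor can flip a sign, here is a small simulation sketch. Everything in it is hypothetical (the variables `x1` and `x2`, the coefficients, and the seed are chosen for illustration); it does not claim to reproduce what happened with the real data.

```r
set.seed(421)
n = 500
x2 = rnorm(n)
x1 = 0.9 * x2 + rnorm(n, sd = 0.3)  # x1 is strongly correlated with x2
y = -1 * x1 + 3 * x2 + rnorm(n)     # true effect of x1 is negative
# "Short" regression omits x2; "long" regression includes it
b_short = coef(lm(y ~ x1))[["x1"]]
b_long = coef(lm(y ~ x1 + x2))[["x1"]]
```

With these made-up parameters, the short regression loads the effect of `x2` onto `x1` and reports a positive slope, while the long regression recovers a negative one.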

[12] Maybe we need to think more about time. Estimate a dynamic model that includes the variables from [10] plus the lag of GDP per capita and the lag of inflation.

Answer

# Dynamic regression (adds one-year lags of gdppc and inf)
reg12 = feols(exp_pop ~ gdppc + inf + pop + L(gdppc) + L(inf), data = ps_df)
NOTE: 1 observation removed because of NA values (RHS: 1).
etable(reg12)
                             reg12
Dependent Var.:            exp_pop
                                  
Constant          53.10*** (1.261)
gdppc               -6.820 (9.384)
inf              -0.1212* (0.0561)
pop             0.0944*** (0.0069)
L(gdppc)            -1.482 (10.28)
L(inf)            0.1126* (0.0508)
_______________ __________________
S.E. type                      IID
Observations                    71
R2                         0.96070
Adj. R2                    0.95767
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[13] What is the estimated total effect of GDP per capita on life expectancy?

Answer The estimated total effect of GDP per capita on life expectancy is the sum of the coefficient on GDP per capita and the coefficient on the lagged GDP per capita. In this case, the total effect is approximately -8.302.
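Written out with the point estimates reported in the table for [12]:

\[ \hat{\beta}_{\text{gdppc}} + \hat{\beta}_{L(\text{gdppc})} = -6.820 + (-1.482) = -8.302 \]

The table does not report a standard error for this sum, so the calculation on its own says nothing about whether the total effect is statistically significant.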

[14] What assumptions do we need for OLS to be unbiased in [07], [10], and [12]? Are the assumptions realistic in this setting? Explain your answer specific to this setting.

Answer The main assumption required for OLS to be unbiased here is exogeneity. (We also need variation in our explanatory variables.)

Exogeneity here requires that GDP per capita, inflation, and population are independent from any other factors that affect life expectancy (i.e., the disturbance). This assumption is likely violated in our setting, as many factors like healthcare access, medical technology, smoking tendencies, and environmental conditions all affect life expectancy and are likely correlated with GDP per capita, inflation, and population, especially since they are all trending over time.

[15] In this setting, what could cause the disturbance to be autocorrelated? Give at least one example.

Answer As explained above, healthcare access, medical technology, smoking tendencies, and environmental conditions all affect life expectancy (i.e., are in the disturbance) and are likely autocorrelated.

[16] What issues would autocorrelation in the disturbance cause for OLS when estimating the model in [07] (or [12])?

Answer Autocorrelation in the disturbance would (1) cause OLS to be inefficient and (2) bias our standard errors (leading to incorrect inference).

[17] Use the residuals from your estimated model in [12] to test for first-order autocorrelation. Show the steps of your test and report your conclusion.

Hint: Remember that if you use the residuals() function, you may need to concatenate an NA to the beginning of the residuals vector to make the residual vector the same length as the number of observations in your dataset. For example, you can use c(NA, residuals(your_regression)).

Answer

# Grab the residuals
ps_df$e12 = c(NA, reg12$residuals)
# Run the regression
reg17 = feols(e12 ~ L(e12), ps_df)
NOTE: 2 observations removed because of NA values (LHS: 1, RHS: 2).
# Results
etable(reg17)
                             reg17
Dependent Var.:                e12
                                  
Constant          -0.0082 (0.0489)
L(e12)          0.7829*** (0.0745)
_______________ __________________
S.E. type                      IID
Observations                    70
R2                         0.61873
Adj. R2                    0.61313
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The coefficient on the lagged residual is statistically significant at the 5-percent level (and beyond). Accordingly, we reject the null hypothesis of no autocorrelation in the disturbance. We conclude that there is statistically significant evidence of first-order autocorrelation in the disturbance.
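The steps of this test can also be sketched in base R on simulated data, which makes the mechanics easy to check because the true autocorrelation is known. The data-generating process, the value 0.8, and all variable names here are assumptions for illustration only.

```r
set.seed(421)
n = 72
x = rnorm(n)
# Simulated AR(1) disturbance: u_t = 0.8 * u_{t-1} + noise (hypothetical rho)
u = as.numeric(stats::filter(rnorm(n), 0.8, method = "recursive"))
y = 1 + 2 * x + u
# Step 1: estimate the model and keep the residuals
e = residuals(lm(y ~ x))
# Step 2: regress the residual on its own first lag
e_lag = c(NA, e[-n])
test_reg = lm(e ~ e_lag)
# Step 3: read off the estimate and p-value on the lagged residual
rho_hat = coef(test_reg)[["e_lag"]]
p_value = summary(test_reg)$coefficients["e_lag", "Pr(>|t|)"]
```

A small p-value on `e_lag` leads to rejecting the null hypothesis of no first-order autocorrelation, which is the same logic applied to `reg17`.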

[18] What can one do to “deal with” autocorrelation in the disturbance? Discuss at least two options.

Answer There are several options to deal with autocorrelation in the disturbance:

  1. Use autocorrelation-robust standard errors: This approach “fixes” the bias in our standard errors caused by autocorrelation and just lives with OLS’s inefficiency.
  2. Fix the specification, e.g. use a dynamic model: As with heteroskedasticity, we can try to fix the issue in our disturbance by changing the model specification. In this case, if we fail to include a lagged outcome variable that should be there, then we would cause autocorrelation in the disturbance. Including a lagged outcome variable could help address this issue.
  3. Use a feasible generalized least squares (FGLS) estimator: FGLS integrates autocorrelation into the model, allowing for more efficient estimation of the coefficients (under some assumptions).
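As a rough illustration of the FGLS option, the sketch below applies a Cochrane-Orcutt-style quasi-differencing step to simulated data. This is one common way to do FGLS with AR(1) disturbances and is not necessarily the version covered in class; the data-generating process and all names are hypothetical.

```r
set.seed(421)
n = 200
x = as.numeric(stats::filter(rnorm(n), 0.5, method = "recursive"))
u = as.numeric(stats::filter(rnorm(n), 0.8, method = "recursive"))  # AR(1) disturbance
y = 1 + 2 * x + u
# Step 1: OLS and its residuals
e = residuals(lm(y ~ x))
# Step 2: estimate rho from e_t = rho * e_{t-1} + error (no intercept)
rho_hat = coef(lm(e[-1] ~ 0 + e[-n]))[[1]]
# Step 3: quasi-difference both variables and re-estimate
y_star = y[-1] - rho_hat * y[-n]
x_star = x[-1] - rho_hat * x[-n]
fgls = lm(y_star ~ x_star)
# The slope estimates the original slope (2 in this simulation);
# the intercept estimates the original intercept times (1 - rho)
b1_fgls = coef(fgls)[["x_star"]]
```

The quasi-differenced regression has a disturbance that is approximately free of first-order autocorrelation, which is what buys the efficiency gain under the AR(1) assumption.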

[19] Time for a dynamic model with a lagged outcome variable. Specifically, add a lagged outcome as a regressor to the model in [12].

Answer

# The model
reg19 = feols(exp_pop ~ gdppc + inf + pop + L(gdppc) + L(inf) + L(exp_pop), data = ps_df)
NOTE: 1 observation removed because of NA values (RHS: 1).
etable(reg19)
                             reg19
Dependent Var.:            exp_pop
                                  
Constant             1.716 (3.441)
gdppc              13.79** (4.619)
inf               -0.0389 (0.0269)
pop                0.0002 (0.0070)
L(gdppc)          -14.21** (4.909)
L(inf)           0.0655** (0.0241)
L(exp_pop)      0.9762*** (0.0644)
_______________ __________________
S.E. type                      IID
Observations                    71
R2                         0.99144
Adj. R2                    0.99063
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[20] Now test the disturbance in [19] for autocorrelation. Discuss the conclusion of your test.

Answer

# Grab the residuals
ps_df$e19 = c(NA, reg19$residuals)
# Run the regression
reg20 = feols(
  e19 ~
  gdppc + inf + pop + L(gdppc) + L(inf) + L(exp_pop) + L(e19),
  data = ps_df
)
NOTE: 2 observations removed because of NA values (LHS: 1, RHS: 2).
# Results
etable(reg20)
                           reg20
Dependent Var.:              e19
                                
Constant           2.157 (3.472)
gdppc              4.946 (4.979)
inf              0.0024 (0.0263)
pop              0.0042 (0.0072)
L(gdppc)          -5.519 (5.325)
L(inf)          -0.0060 (0.0236)
L(exp_pop)      -0.0418 (0.0653)
L(e19)          0.3599* (0.1542)
_______________ ________________
S.E. type                    IID
Observations                  70
R2                       0.08096
Adj. R2                 -0.02280
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The coefficient on the lagged residual is again statistically significant at the 5-percent level. We thus reject the null hypothesis of no autocorrelation in the disturbance and conclude that there is statistically significant evidence of first-order autocorrelation in the disturbance.

[21] What happens to OLS when we have a lagged outcome variable and the disturbance is autocorrelated?

Answer When we have a lagged outcome variable and the disturbance is autocorrelated, OLS is biased and inconsistent.
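A short simulation can make the inconsistency concrete. In this sketch (all parameter values are assumptions chosen for illustration), the true coefficient on the lagged outcome is 0.5, and the disturbance follows an AR(1) process with coefficient 0.5. Even with a very large sample, OLS does not settle near 0.5.

```r
set.seed(421)
n = 5000
# AR(1) disturbance: u_t = 0.5 * u_{t-1} + noise
u = as.numeric(stats::filter(rnorm(n), 0.5, method = "recursive"))
# Outcome with a lagged outcome on the right-hand side: y_t = 0.5 * y_{t-1} + u_t
y = as.numeric(stats::filter(u, 0.5, method = "recursive"))
# OLS of y_t on y_{t-1}
b_hat = coef(lm(y[-1] ~ y[-n]))[[2]]
```

Under this design the OLS estimate converges to roughly 0.8 (the lag-1 autocorrelation of the implied AR(2) process) rather than the true 0.5, because the lagged outcome is correlated with the current disturbance.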

[22] Let’s take a step back. Do you think it even makes sense to include a lagged outcome variable in the model here? Why or why not? Explain.

Answer I don’t think the lagged outcome variable conceptually makes sense in this context. The lagged outcome variable implies that higher life expectancy the year before I was born causes me to have a longer life expectancy. This idea doesn’t make much sense to me: life expectancy is typically a function of many factors (healthcare, environmental quality, etc.) but not usually considered to be a function of the life expectancy of people born the previous year.

5 Stationarity

[23] Return to the time-series figures in [03] through [07]. Do any of the figures suggest that the variables may be non-stationary? Explain your answer.

Answer Yes! Except for inflation, all of the time-series plots suggest some type of non-stationarity due to their clear upward trends. These trends are consistent with either the mean or the variance changing over time.

[24] If the data are non-stationary, what problems might this cause for statistical analyses?

Answer If the data are non-stationary, it can cause us to recover spurious results—i.e., finding relationships that do not actually exist.
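The classic demonstration of spurious results regresses one random walk on another, completely independent one. A minimal simulation sketch (sample size, number of simulations, and seed are assumptions for illustration):

```r
set.seed(421)
n_obs = 72   # same length as the problem-set data
n_sim = 200  # number of simulated datasets
# For each simulation: regress one random walk on an independent random walk
p_values = replicate(n_sim, {
  y = cumsum(rnorm(n_obs))
  x = cumsum(rnorm(n_obs))
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})
# Share of simulations that "find" a relationship at the 5-percent level
reject_rate = mean(p_values < 0.05)
```

If the test behaved as intended, the rejection rate would be about 5 percent. With two unrelated random walks it is typically far higher, which is exactly the spurious-regression problem.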

[25] Suppose that we’re concerned that the CPI variable is non-stationary. Why might the inflation variable still have a chance at being stationary?

Explain your answer using the definition of inflation below: \[ \text{inflation} = \frac{\text{CPI}_t - \text{CPI}_{t-1}}{\text{CPI}_{t-1}} \times 100 \]

Answer In class we discussed the concept of difference stationarity. Because inflation is based upon differences in CPI, it may be stationary even if CPI is not.
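One way to see the link to differencing is to rewrite the definition. The log approximation used here is a standard result for small changes; it is not part of the problem set itself.

\[ \text{inflation}_t = \left( \frac{\text{CPI}_t}{\text{CPI}_{t-1}} - 1 \right) \times 100 \approx 100 \times \left( \log \text{CPI}_t - \log \text{CPI}_{t-1} \right) \]

So inflation is, up to scale, approximately the first difference of log CPI.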

[26] One approach to deal with non-stationarity (as discussed in class) is to take differences, i.e., \(\Delta x_t = x_{t} - x_{t-1}\).

Create “differenced” versions of the variables exp_pop, gdppc, and inf.

Now plot the time-series of the differenced variables. Do the time-series of the differenced variables suggest that differencing “helped” with stationarity? Explain your answer.

Answer

# Create differenced variables
ps_df = ps_df |> 
  mutate(
    d_exp_pop = exp_pop - lag(exp_pop),
    d_gdppc = gdppc - lag(gdppc),
    d_inf = inf - lag(inf)
  )
# Create the time-series plots
p1 = ps_df |> 
  ggplot(aes(x = year, y = d_exp_pop)) +
  geom_line() +
  labs(title = 'Differenced Life Expectancy (1952-2023)',
       x = 'Year',
       y = 'Years') +
  theme_minimal(base_size = 9, base_family = 'Fira Sans Condensed')
p2 = ps_df |>
  ggplot(aes(x = year, y = d_gdppc)) +
  geom_line() +
  labs(title = 'Differenced GDP per Capita (1952-2023)',
       x = 'Year',
       y = '100K USD') +
  theme_minimal(base_size = 9, base_family = 'Fira Sans Condensed')
p3 = ps_df |>
  ggplot(aes(x = year, y = d_inf)) +
  geom_line() +
  labs(title = 'Differenced Inflation Rate (1952-2023)',
       x = 'Year',
       y = 'Inf. Rate (%)') +
  theme_minimal(base_size = 9, base_family = 'Fira Sans Condensed')
p1 / p2 / p3
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).

Differencing helps some…

  • After differencing, life expectancy is no longer trending upward, and its variance appears fairly constant through time.
  • Differenced inflation seems stationary in its mean and variance.

Differenced GDP per capita is still trending upward, and its variance appears to be increasing with time.

[27] Now estimate the differenced version of the model from [07], i.e., estimate the following model: \[ \Delta(\text{Life expectancy})_{t} = \beta_0 + \beta_1 \Delta \text{GDPpc}_t + \beta_2 \Delta \text{Inflation}_t + \Delta u_t \] Report your results (you don’t need to interpret them).

Answer

# Differenced regression
reg27 = feols(d_exp_pop ~ d_gdppc + d_inf, data = ps_df)
NOTE: 1 observation removed because of NA values (LHS: 1, RHS: 1).
etable(reg27)
                           reg27
Dependent Var.:        d_exp_pop
                                
Constant         0.0656 (0.0556)
d_gdppc           6.935* (3.308)
d_inf           -0.0367 (0.0247)
_______________ ________________
S.E. type                    IID
Observations                  71
R2                       0.07381
Adj. R2                  0.04657
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

[28] Interpret the results from your regression in [27].

Answer First note that the coefficients in this differenced model can be interpreted in the same way as the coefficients in the non-differenced models.

  • The coefficient on GDP per capita says a 100,000-dollar increase in GDP per capita is associated with a 6.935-year increase in life expectancy, all else equal. This coefficient is statistically significant at the 5-percent level.
  • The coefficient on inflation is not statistically significant, but it suggests that a one-percentage-point increase in the inflation rate is associated with a 0.0367-year decrease in life expectancy, all else equal.

[29] How do the estimated effects of GDP per capita and inflation in [27] compare to the estimated effects in [07]? What do you think is going on here?

Answer The estimated effect of GDP per capita is much smaller than in [07]—as is the (still not statistically significant) effect of inflation. The non-stationarity of the variables may have been leading us to overstate the effects of the variables, relative to the differenced model.

[30] Which—if any—of the estimated models in this assignment do you think is “best”? Explain your answer.

Answer While I don’t think I trust any of the estimates to provide unbiased causal effects, the differenced model seems to be a bit more reasonable, as it has taken more steps to address the clear non-stationarity issues in the data. We still might want to consider testing for autocorrelation and/or correcting our standard errors…