Problem Set 3

class: center, middle, inverse, title-slide

# Problem Set 3
## Time Series and Autocorrelation
### <strong>EC 421:</strong> Introduction to Econometrics
### <br>Due <em>before</em> midnight (11:59pm) on Saturday, 03 March 2019

---

class: clear

.mono[DUE] Your solutions to this problem set are due *before* midnight on Sunday, 03 March 2019. Your files must be uploaded to [Canvas](https://canvas.uoregon.edu/)—including (1) your responses/answers to the question and (2) the .mono[R] script you used to generate your answers. Each student must turn in her/his own answers.

.mono[OBJECTIVE] This problem set has three purposes: (1) reinforce the econometrics topics we reviewed in class; (2) build your .mono[R] toolset; (3) start building your intuition about causality and time series within econometrics.

## Problem 1: Time Series

Imagine that we are interested in estimating the effect of monthly oil prices on monthly natural gas prices. The dataset `gas_oil.csv` contains these prices—the monthly average oil price (the price in dollars per barrel of *Brent Crude oil*, as measured by the [US EIA](https://www.eia.gov/dnav/pet/hist/RBRTED.htm)) and the monthly average price of natural gas (dollars per million BTUs for natural gas at the *Henry Hub*, recorded by the [US EIA](https://www.eia.gov/dnav/ng/hist/rngwhhdm.htm)).

The table on the last page describes the variables in this dataset.

**1a.** First, we consider the possibility that `$P_t^\text{Gas}$` (the price of natural gas in month `$t$`) only depends upon a constant `$\beta_0$`, `$P_t^\text{Oil}$` (the price of oil in month `$t$`), and a random disturbance `$u_t$`.

$$
`\begin{align}
  P_t^\text{Gas} = \beta_0 + \beta_1 P_t^\text{Oil} + u_t \tag{1a}
\end{align}`
$$

If model `$(1\text{a})$` is the true model, should we expect OLS to be consistent for `$\beta_1$`? **Explain.**

**1b.** Read `gas_oil.csv` and estimate model `$(1\text{a})$` with OLS. Interpret your estimate for `$\beta_1$` and comment on its statistical significance.

**1c.** In (1b), you should have found that the coefficient on `$P_t^\text{Oil}$` is statistically significant. Does this finding also mean that the price of oil explains a lot of the variation in the price of natural gas?

*Hint:* What is the R.super[2]? (In .mono[R], you can find R.super[2] using `summary()` applied to a model you estimated with `lm()`.)

**1d.** The model that we estimated in (1a) is a static model—meaning it does not allow previous periods' prices to affect the current price of natural gas. Suppose we think believe that the previous two months' oil prices also affect the price of natural gas, *i.e.*,

$$
`\begin{align}
  P_t^\text{Gas} = \beta_0 + \beta_1 P_t^\text{Oil} + \beta_2 P_{t-1}^\text{Oil} + \beta_3 P_{t-2}^\text{Oil} + u_t \tag{1d}
\end{align}`
$$

Estimate this model and compare your new estimate for `$\beta_1$` to your previous estimate (from model 1a).

*Hint:* Use the function `lag(x, n)` from the `dplyr` package to take the `n`.sup[th] lag of variable `x`.

**1e.** Interpret your estimated coefficients for `$\beta_2$` and `$\beta_3$`. Are they statistically significant?

**1f.** Has the amount of variation that we can explain increased very much? Compare the R.super[2] values for model (1a) and (1d). Also consider the *adjusted* R.super[2].

---
class: clear

**1g.** Formally test model `$(1\text{a})$` vs. model `$(1\text{d})$` using an `$F$` test.

*Hint:* You can test one model against another model in .mono[R] using the `waldtest()` function from the `lmtest` package. For example,

```r
# OLS model of y on x and two lags
est_model <- lm(y ~ x + lag(x) + lag(x, 2), data = example_df)
# Jointly test the coefficients on lag(x) and lag(x, 2)
waldtest(est_model, c("lag(x)", "lag(x, 2)"), test = "F")
```
calculates an `$F$` test for the coefficients on `lag(x)` and `lag(x, 2)` in the model `est_model`.

**Note:** For some reason, `lag(x, n)` needs to have a space between the comma (`,`) and `n` when you use `waldtest` to test lags.

**1h.** If model `$(1\text{d})$` is the true model, should we expect OLS to be consistent for `$\beta_1$`? **Explain.**

**1i.** Suppose we now think that the actual model includes the current price of oil *and* the previous month's prices of oil and natural gas, *i.e.*,

$$
`\begin{align}
  P_t^\text{Gas} = \beta_0 + \beta_1 P_t^\text{Oil} + \beta_2 P_{t-1}^\text{Oil} + \beta_3 P_{t-1}^\text{Gas} + u_t \tag{1i}
\end{align}`
$$

Estimate this model. Interpret the coefficients on `$\beta_1$` and `$\beta_3$`. How has your estimate on `$\beta_1$` changed?

**1j.** Compare the R.super[2] from model `$(1\text{i})$` to the R.super[2]s of the previous models. Explain what happened.

**1k.** If we assume `$u_t$` in `$(1\text{i})$` (**A**) follows our assumption of *contemporaneous exogeneity* and (**B**) is not autocorrelated, should we expect OLS to produce consistent estimates for the `$\beta$`s in this model? **Explain.**

---
class: clear

## Problem 2: Autocorrelation

**2a.** After starting to estimate these time-series models, you remember that autocorrelation affects OLS. For each of the three models above (1a, 1d, and 1i), explain how autocorrelation will affect OLS.

*Hint:* It will affect two of the models the same way and one of them differently.

**2b.** Add the residuals from your estimate of model `$(1\text{i})$` to your dataset.

**Important:** Don't forget that you will need to tell .mono[R] that you have a missing observation (since we have a lag in our model).

```r
# Add residuals from our estimated model in 1i to dataset 'price_df'
price_df$e_1i <- c(NA, residuals(ols_1i))
```

Here, I'm adding a new column to the dataset `price_df` for the residuals from the model I saved as `ols_1i`. The first observation is missing, because our model `ols_1i` includes a single lag.

**2c.** Construct two plots with the residuals from `$(1\text{i})$`: .hi[1] plot the residuals against the time variable (`t_month`) and .hi[2] plot the residuals against their lag. Do you see any evidence of autocorrelation? What would autocorrelation look like?

I strongly encourage you to use `ggplot2` for these graphs.

**2d.** Add the residuals from the models in `$(1\text{a})$` and `$(1\text{d})$` to your dataset. See below (we have to keep track of missing observations due to lags).

```r
# Residuals from the model in 1a
price_df$e_1a <- residuals(ols_1a)
# Residuals from the model in 1d
price_df$e_1d <- c(NA, NA, residuals(ols_1d))
```

**2e.** Repeat the plots from above—.hi[1] plot the residuals against the time variable (`t_month`) and .hi[2] plot the residuals against their lag—for both sets of residuals, *i.e.*, for the residuals from `$(1\text{a})$` and for the residuals from `$(1\text{d})$`. You should end up with four graphs for this part.

**2f.** Why do you think the residuals from `$(1\text{a})$` and `$(1\text{d})$` appear to have autocorrelation, while the residuals in `$(1\text{i})$` show much less evidence of autocorrelation?

*Hint:* Think back to our discussion of the ways we can work/live with autocorrelation.

**2g.** Following the steps for the Breusch-Godfrey test that we discussed in class, test the residuals from the model in `$(1\text{i})$` for second-order autocorrelation.

*Hint:* You can use the `waldtest()` from the `lmtest` package, as shown in the lecture slides.

**2h.** If we assume `$u_t$` is **not** autocorrelated, then can we trust OLS to be consistent for its estimates of the coefficients in model `$(1\text{i})$`? **Explain.**

**2i.** Should we interpret our estimates from `$(1\text{i})$` as causal? **Explain.**

---
class: clear

### Description of variables and names
<br>
<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Variable </th>
   <th style="text-align:left;"> Description </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> .mono-small[month_year] </td>
   <td style="text-align:left;"> The observation's month and year (.mono-small[character]) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> .mono-small[month] </td>
   <td style="text-align:left;"> The month (.mono-small[numeric]) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> .mono-small[year] </td>
   <td style="text-align:left;"> The year (.mono-small[numeric]) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> .mono-small[price_gas] </td>
   <td style="text-align:left;"> The average (Henry Hub) price of natural gas, $ per 1MM BTU (.mono-small[numeric]) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> .mono-small[price_oil] </td>
   <td style="text-align:left;"> The average (Brent Crude) price of oil, $ per barrel (.mono-small[numeric]) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> .mono-small[t_month] </td>
   <td style="text-align:left;"> Time, measured by months in the dataset (.mono-small[numeric]) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> .mono-small[t] </td>
   <td style="text-align:left;"> Time, approximately by fractions of years (.mono-small[numeric]) </td>
  </tr>
</tbody>
</table>