Problem Set 4

class: center, middle, inverse, title-slide

# Problem Set 4
## Nonstationarity, Causality, Instrumental Variables
### <strong>EC 421:</strong> Introduction to Econometrics
### <br>Due <em>before</em> midnight (11:59pm) on Saturday, 16 March 2019

---

class: clear

.mono[DUE] Your solutions to this problem set are due *before* midnight on Sunday, 03 March 2019. Your files must be uploaded to [Canvas](https://canvas.uoregon.edu/)—including (1) your responses/answers to the question and (2) the .mono[R] script you used to generate your answers. Each student must turn in her/his own answers.

.mono[OBJECTIVE] This problem set has three purposes: (1) reinforce econometrics topics from class; (2) build your .mono[R] toolset; (3) strengthen your intuition on causality and time series.

## Problem 1: Nonstationarity—the Basics

**1a.** Define stationarity.

*Note:* You can define it using math or words (or both).

**1b.** If our disturbance term `$u_t$` follows a .pink[random walk], *i.e.*,
$$
`\begin{align}
  u_{t} = u_{t-1} + \varepsilon_t
\end{align}`
$$
then it's variance is `$\mathop{\text{Var}} \left( u_t \right) = t \sigma_{\varepsilon}^2$`. Explain how this expression of its variance shows that the disturbance is .purple[nonstationary] (*i.e.*, it violates .pink[stationarity]).

**1c.** We previously discussed autocorrelated distrubances, *e.g.*, an AR(1) process such that
$$
`\begin{align}
  u_{t} = \rho u_{t-1} + \varepsilon_t
\end{align}`
$$
Under which circumstances would this AR(1) process become a random walk?

*Hint:* Consider the values of `$\rho$`.

---
class: clear

## Problem 2: Nonstationarity—the Simulation

In this problem, we are going to create two independent, .hi-purple[nonstationary] time series. Specifically, we'll create two random walks. Then, we'll regress the first random walk on the second random walk.

*Hint:* Generating random walks is *nearly* identical to generating AR(1) processes, as you did in lab.

**2a.** Generate the first 30-period random walk. We will name it `v`.
$$
`\begin{align}
  v_t = v_{t-1} + \varepsilon_t
\end{align}`
$$
where `$\varepsilon_t$` comes from a normal distribution with mean 0 and standard deviation 1.

Here is some .mono[R] to help.

```r
# Set a seed (so your results stay the same)
set.seed(123)
# Generate the initial number, (this will be v[1])
v <- rnorm(1, mean = 0, sd = 1)
# For loop to create the random walk
for (t in 2:30) {
  # Create the 'next' observation
  ...
}
```
while you're filling in the `for` loop, keep in mind (**1**) our equation for the random walk at the beginning of this question (meaning `$v_t$` depends upon `$v_{t-1}$` and `$\varepsilon_t$`) and (**2**) the fact that you can reference different observations in .mono[R], *e.g.*,

- `v[t]` refers to the `$t$`.super[th] observation
- `v[t-1]` refers to the `$(t-1)$`.super[th] observation
- `v[3]` refers to the `$3$`.super[rd] observation

If you need more help on for loops, don't forget there are lab materials on Canvas and resources online (*e.g.*, [datamentor.io](https://www.datamentor.io/r-programming/for-loop/) and [datacamp.com](https://www.datacamp.com/community/tutorials/tutorial-on-loops-in-r) have lots of resources).

**2b.** Generate a second 30-period random walk called `w`. This part is exactly the same as (2a), but you **use a different seed** (*i.e.*, `set.seed(456)`) and **name the variable** `w`.

**2c.** We .orange[independently] generated these two time series. Ideally (from a statistical point of view), should we find a statistically significant relationship between the two series? Explain.

**2d.** Regress `w` on `v`. Report the results from the `$t$` test. Do they match your expectations from (2c)?

---
class: clear

## Problem 3: Causality

Following the Rubin causal model, imagine that we observe the following data (which would be impossible observe in real life):

.center[
.bold[Table: Imaginary dataset]
]
<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> .math[i] </th>
   <th style="text-align:right;"> Trt. </th>
   <th style="text-align:right;"> y.sub[1] </th>
   <th style="text-align:right;"> y.sub[0] </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;background-color: white;"> 1 </td>
   <td style="text-align:right;background-color: white;"> 0 </td>
   <td style="text-align:right;background-color: white;"> 12 </td>
   <td style="text-align:right;background-color: white;"> 8 </td>
  </tr>
  <tr>
   <td style="text-align:right;background-color: white;"> 2 </td>
   <td style="text-align:right;background-color: white;"> 0 </td>
   <td style="text-align:right;background-color: white;"> 7 </td>
   <td style="text-align:right;background-color: white;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:right;background-color: white;"> 3 </td>
   <td style="text-align:right;background-color: white;"> 1 </td>
   <td style="text-align:right;background-color: white;"> 5 </td>
   <td style="text-align:right;background-color: white;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:right;background-color: white;"> 4 </td>
   <td style="text-align:right;background-color: white;"> 1 </td>
   <td style="text-align:right;background-color: white;"> 6 </td>
   <td style="text-align:right;background-color: white;"> 4 </td>
  </tr>
</tbody>
</table>

**3a.** Calculate the treatment effect **for each individaul** (*i.e.*, `$\tau_i$`).

**3b.** **[T/F]** The treatment effect is constant across individuals.

**3c.** Calculate the **average treatment effect**.

**3d** **Estimate the average treatment effect** by comparing the **mean of the treatment group** to the **mean of the control group**.

**3e.** Should we expect our estimator in (3d) to provide unbiased estimates? **Explain.**

**3f.** Why would it be impossible to actually observe all of the data in the table (in real life)?

**3g.** How does your answer in (3f) relate to *the fundametal problem of causal inference*?

---
class: clear

## Problem 4: Instrumental Variables

Let's return to our question of the returns to education. Specifically, we will use the dataset `wages.csv`, which .<sup>†</sup>

.footnote[These data come from `wage1` in the `wooldridge` package. I took a subset of variables and renamed them.]

We're interested in estimating `$\beta_1$` in
$$
`\begin{align}
  \text{Wage}_i = \beta_0 + \beta_1 \text{Education}_i + u_i
\end{align}`
$$
but we have a problem with omitted-variable bias. Instrumental variables can potentially help.

**4a.** Load and inspect the dataset.

**4b.** What are the two requirements for a valid instrument?

**4c.** As we've discussed, we need an instrument for (endogenous) education. Do you think the variable `n_kids`—the number of children—would be a valid instrument? Explain why it passes/fails ech of the two requirements for a valid instrument.

**4d.** We can test the *relevance* of our instrument by estimating the first stage, *i.e.*, regressing our endogenous variable .mono[education] on our (potential) instrument .mono[n_kids].

Do it.

Is there evidence that our potential instrument is relevant? Explain using a statistical test and interpret the coefficient.

**4e.** Let's assume *number of children* is a valid instrument for education.

Using the number of children (`n_kids`) as an instrument for education (`education`), estimate the returns to education via instrumental variables (IV).

Interpret the coefficient that gives the returns to education and its significance.

*Hint:* Recall that we can use `iv_robust(y ~ x | z, data)` from the `estimatr` package to get IV/2SLS estimates of the effect of `x` on `y` with the instrument `z` (and dataset `data`).

**4f.** How do your estimates of the returns to education from instrumental variables (IV) compare to estimates using plain ordinary least squares (OLS)?

*Hint:* You'll need to estimate the model using OLS.

**4g.** **Extra credit:** Explain which estimates you would trust more (or why you distrust both).