class: center, middle, inverse, title-slide

# Adventures in Simulation in .mono[R]
## EC 425/525, Lab 6
### Edward Rubin
### 11 May 2019

---
class: inverse, middle

# Prologue

---
name: schedule

# Schedule

## Last time

Plotting

## Today

Simulation

---
layout: true

# Simulation

---
name: simulation
class: inverse, middle

---
name: motive

## Motivation

As we've discussed, simulation can be a quick and effective way to better understand how an estimator performs/behaves.

--

You just need to be careful to .hi[ask a clear, answerable question] and then .hi[run a simulation] that corresponds to, and answers, this question.

--

In addition, simulations can be computationally intense—they are often the first time you have to really think about efficiency in coding.

---
name: gen-outline

## Generic outline

The general outline for a simulation is fairly consistent.

--

1\. .hi[Define the population] via a data-generating process (DGP)..super[.pink[†]]

.footnote[.pink[†] Some people prefer to actually construct the population in this step and then repeatedly sample from this fixed population. Others stick with a population defined by the DGP.]

--

2\. Iterate. In each iteration:
  - .hi[Sample] from your population.
  - Construct .hi[estimates/inferences] that relate to your original question.

--

3\. .hi[Summarize] results.

---

## Practical issues

This semi-theoretical framework needs a few practical reminders.

--

1. Always set a seed *at the beginning* of your simulation (`set.seed()`).

--

1. Parallelize where/when possible (_e.g._, the `furrr` package).

--

1. Writing a function for a single iteration can be helpful (see the outline above).

--

1. There is a (big) difference between unbiasedness and consistency.

--

1. You build simulations/DGPs with assumptions.

--

1. Analytical results can inform and/or replace simulations.

---
layout: false
class: inverse, middle

# Example simulation

---
layout: true

# Simulation

---
name: ex-q

## The question

.qa[Q] We've shown that instrumental variables (IV) is consistent. How does it perform (_i.e._, is it unbiased) in finite (small) samples?

.note[Note] This question is definitely answerable analytically.

--

Nevertheless, let's see how IV performs at several small-ish sample sizes. While we're at it, let's confirm OLS is indeed biased in this setting.

---
name: ex-dgp

## DGP

We want a valid instrument for a setting in which treatment is endogenous.

$$
`\begin{align}
  \text{Y}_{i} = \alpha + \tau \text{D}_{i} + \varepsilon_i
\end{align}`
$$

So we want

1. .hi-slate[Endogenous treatment:] `\(\mathop{\text{Cov}} \left( \text{D}_{i},\, \varepsilon_i \right) \neq 0\)`
1. .hi-pink[Predictive:] `\(\mathop{\text{Cov}} \left( \text{Z}_{i},\, \text{D}_{i} \right) \neq 0\)`
1. .hi-pink[Excludability:] `\(\mathop{\text{Cov}} \left( \text{Z}_{i},\, \varepsilon_i \right) = 0\)`

--

where (2) and (3) imply `\(\text{Z}_{i}\)` is a valid instrument.
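---

## DGP: a quick check

It can help to verify that draws from our DGP actually satisfy (1)–(3). Below is a minimal sketch that previews the means and covariance matrix we define on the next two slides; the object names (`draw`) and the large `n` (only there to push sample covariances near their population values) are illustrative choices, not part of our simulation.

```r
# Minimal sketch: check conditions (1)-(3) in one large draw.
# Σ and μ preview the values we define on the next two slides.
library(MASS)
Σ <- matrix(data = c(1, 0.6, 0.6, 0.6, 1, 0, 0.6, 0, 1), ncol = 3)
μ <- c(10, 0, 3)
draw <- mvrnorm(n = 1e6, mu = μ, Sigma = Σ)
colnames(draw) <- c("D", "ε", "Z")
cov(draw[, "D"], draw[, "ε"]) # ≈ 0.6: treatment is endogenous
cov(draw[, "Z"], draw[, "D"]) # ≈ 0.6: Z predicts D
cov(draw[, "Z"], draw[, "ε"]) # ≈ 0.0: Z is excludable
```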
---

## DGP

In other words, the variance-covariance matrix of `\(\text{D}_{i}\)`, `\(\varepsilon_i\)`, and `\(\text{Z}_{i}\)` is

$$
`\begin{align}
  \Sigma =
  \begin{bmatrix}
    \sigma^2_{\text{D}} & \sigma_{\text{D},\varepsilon} & \sigma_{\text{D},\text{Z}} \\
    \sigma_{\text{D},\varepsilon} & \sigma^2_{\varepsilon} & 0 \\
    \sigma_{\text{D},\text{Z}} & 0 & \sigma^2_{\text{Z}}
  \end{bmatrix}
\end{align}`
$$

--

If we assume unit variances and set the nonzero covariances to 0.6, then

$$
`\begin{align}
  \Sigma =
  \begin{bmatrix}
    1 & 0.6 & 0.6 \\
    0.6 & 1 & 0 \\
    0.6 & 0 & 1
  \end{bmatrix}
\end{align}`
$$

---

## DGP

To simplify our lives, let's assume that `\(\text{D}_{i}\)`, `\(\varepsilon_i\)`, and `\(\text{Z}_{i}\)` come from a multivariate normal distribution. We've already defined their covariance matrix; we still need to define their means: `\(\mu_{\text{D}} = 10\)`, `\(\mu_{\varepsilon} = 0\)`, and `\(\mu_{\text{Z}} = 3\)`.

--

Finally, we need to define the way in which `\(\text{D}_{i}\)` and `\(\varepsilon_i\)` affect `\(\text{Y}_{i}\)`.

$$
`\begin{align}
  \text{Y}_{i} = 7 + 1 \times \text{D}_{i} + \varepsilon_i
\end{align}`
$$

_i.e._, `\(\tau = 1\)`.

---

## DGP

Lucky for us, .mono[R]'s `MASS` package has a function `mvrnorm()` that draws `n` random observations from a multivariate normal distribution with means `mu` and variance-covariance matrix `Sigma`.

---
name: ex-sample

## Sampling from our DGP

We're ready to write a function that performs one iteration. Our function will accept a single argument `n`, the sample size.

--

```r
sim_iter <- function(n) {
  # Define our variance-covariance matrix (D, ε, Z)
  Σ <- matrix(data = c(1, 0.6, 0.6, 0.6, 1, 0, 0.6, 0, 1), ncol = 3)
  # Our vector of means (D, ε, Z)
  μ <- c(10, 0, 3)
  # Draw n observations; convert to data frame
  sample_df <- MASS::mvrnorm(n = n, mu = μ, Sigma = Σ) %>% data.frame()
  # Name variables
  names(sample_df) <- c("D", "ε", "Z")
  # Calculate Y
  sample_df %<>% mutate(Y = 7 + 1 * D + ε)
}
```

---

## Estimation

Now we just need to estimate `\(\beta_\text{IV}\)` and `\(\beta_\text{OLS}\)`. We'll use `estimatr`.

--

.note[Previous] OLS estimates of the effect of `x` on `y`

```r
lm_robust(y ~ x)
```

--

.note[New] IV estimates of the effect of `x` on `y` with instrument `z`

```r
iv_robust(y ~ x | z)
```

---
layout: false
class: clear, middle

```r
sim_iter <- function(n) {
  # Define our variance-covariance matrix (D, ε, Z)
  Σ <- matrix(data = c(1, 0.6, 0.6, 0.6, 1, 0, 0.6, 0, 1), ncol = 3)
  # Our vector of means (D, ε, Z)
  μ <- c(10, 0, 3)
  # Draw n observations; convert to data frame
  smpl_df <- MASS::mvrnorm(n = n, mu = μ, Sigma = Σ) %>% data.frame()
  # Name variables
  names(smpl_df) <- c("D", "ε", "Z")
  # Calculate Y
  smpl_df %<>% mutate(Y = 7 + 1 * D + ε)
  # Estimates
  est_df <- bind_rows(
    # The OLS estimates
    lm_robust(Y ~ D, data = smpl_df) %>% tidy() %>% mutate(est = "OLS"),
    # The IV estimates
    iv_robust(Y ~ D | Z, data = smpl_df) %>% tidy() %>% mutate(est = "IV")
  )
  return(est_df)
}
```

---
layout: true

# Simulation

---
name: ex-iter

## Repeat

Now we want to run `sim_iter()` *many* times.

--

And we're going to do it in parallel—using the `furrr` package.

--

The output of `sim_iter()` is a data frame, so we can use a function from `furrr` that expects its iterations to return data frames, namely, `future_map_dfr()`. The suffix `_dfr` means the function will row-bind the data frames returned by the individual iterations.

--

We'll also use the `rep()` function, which repeats things, _e.g._, `rep("a", 3)` repeats `"a"` three times.
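---

## A quick test

Before parallelizing, it's worth a quick serial sanity check. Here is a minimal sketch (assuming `sim_iter()`, `dplyr`, `magrittr`, and `estimatr` are loaded) using `purrr`'s serial counterpart, `map_dfr()`; the name `test_df` and the three iterations are illustrative choices.

```r
# Serial sanity check of sim_iter() before going parallel
# (assumes sim_iter(), dplyr, magrittr, and estimatr are loaded)
p_load(purrr)
set.seed(12345)
# Run three iterations at n = 50; row-bind the resulting data frames
test_df <- map_dfr(rep(50, 3), sim_iter)
test_df
```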
---

Assuming we've already entered `sim_iter()` into memory, we can run our simulation 5,000 times, each with sample size 50—in parallel!

```r
# Load furrr
p_load(furrr)
# Tell R to parallelize with 4 cores
plan(multiprocess, workers = 4)
# Set a seed
set.seed(12345)
# Run simulation with sample size 50
sim50 <- future_map_dfr(
  # Repeat sample size 50 a total of 5,000 times
  rep(50, 5000),
  # Our function
  sim_iter,
  # Let furrr know we want to set a seed
  .options = future_options(seed = T)
)
```

---
layout: false
class: clear, middle

Sample size 50 (5,000 iterations)

<img src="06RSim_files/figure-html/plot-sim50-1.svg" style="display: block; margin: auto;" />

---
layout: true
class: clear, middle

---

Let's vary the sample size and see what happens.

---

Sample size 10 (5,000 iterations)

<img src="06RSim_files/figure-html/plot-sim10-1.svg" style="display: block; margin: auto;" />

---

Sample size 25 (5,000 iterations)

<img src="06RSim_files/figure-html/plot-sim25-1.svg" style="display: block; margin: auto;" />

---

Sample size 50 (5,000 iterations)

<img src="06RSim_files/figure-html/plot-sim50-2-1.svg" style="display: block; margin: auto;" />

---

Sample size 100 (5,000 iterations)

<img src="06RSim_files/figure-html/plot-sim100-1.svg" style="display: block; margin: auto;" />

---
layout: true

# Simulation

---
name: ex-a

## Assumptions

Keep in mind that we made several assumptions about

- the distribution (joint normality is very restrictive)
- the variances (all equal to one; draws are independent and errors homoskedastic)
- the covariances (the nonzero covariances are all equal)
- a strong instrument

---
name: loops

## Looping

There are .b[many] ways to iterate/loop in .mono[R]:

- `for()`, `while()`, *etc.*
- `lapply()`, `mapply()`, *etc.*
- `parallel`: `mclapply()`, `mcmapply()`, *etc.*
- `foreach`
- `future`, `furrr`, and `future.apply`: `future_lapply()`, `future_map()`, *etc.*

--

They are not all equal/identical.

- Only a few (`for()` and `foreach`) can access values from previous iterations.
- A subset is parallelizable (`parallel`, `foreach`, the `future` family).
- Behavior can be OS specific (especially `parallel`).

---

## `for()`

You'll often hear that you should never use `for()` loops in .mono[R]. This opinion is a bit extreme, but there are a few reasons to avoid them.

1. `for()` is not parallelized (though `foreach` can be).
1. `for()` doesn't clean up after itself—leaving objects in memory between iterations and after the loop finishes..super[.pink[†]]
1. `for()` loops generally ["grow" data](https://privefl.github.io/blog/why-loops-are-slow-in-r/), which can be slow/inefficient.

See [Grant McDermott's lectures](https://github.com/uo-ec607) for further justification.

.footnote[.pink[†] This feature *can* be desirable—*e.g.*, linking iterations.]

---
layout: false

# Table of contents

.pull-left[

### Simulation

.small[

1. [Motivation](#motive)
1. [Generic outline](#gen-outline)
1. [Example](#ex-q)
    - [The question](#ex-q)
    - [DGP](#ex-dgp)
    - [Sampling from the DGP](#ex-sample)
    - [Iterating](#ex-iter)
    - [Assumptions](#ex-a)
1. [Loops](#loops)

]]

---
exclude: true
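---

# Simulation

## Summarizing (bonus)

Step 3 of our outline is to .hi[summarize] results. Here is a minimal sketch, assuming `sim50` from the run above and `dplyr` loaded; the summary-column names (`mean_est`, `sd_est`) are illustrative. We compare each estimator's mean estimate against the truth, `\(\tau = 1\)`.

```r
# A sketch of step 3 (summarize), assuming sim50 from the run above
# (and dplyr loaded). Compare estimates of τ with the truth (τ = 1).
sim50 %>%
  # Keep only the coefficient on D
  filter(term == "D") %>%
  # Summarize by estimator (OLS vs. IV)
  group_by(est) %>%
  summarize(mean_est = mean(estimate), sd_est = sd(estimate))
```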