---
title: "Adventures in Simulation in .mono[R]"
subtitle: "EC 425/525, Lab 6"
author: "Edward Rubin"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  xaringan::moon_reader:
    css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css']
    # self_contained: true
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
class: inverse, middle

```{R, setup, include = F}
# devtools::install_github("dill/emoGG")
library(pacman)
p_load(
  broom, tidyverse,
  latex2exp, ggplot2, ggthemes, ggforce, viridis, extrafont, gridExtra,
  kableExtra, snakecase, janitor,
  data.table, dplyr, estimatr,
  lubridate, knitr, parallel, furrr,
  lfe,
  here, magrittr
)
# Define pink color
red_pink <- "#e64173"
turquoise <- "#20B2AA"
orange <- "#FFA500"
red <- "#fb6107"
blue <- "#3b3b9a"
green <- "#8bb174"
grey_light <- "grey70"
grey_mid <- "grey50"
grey_dark <- "grey20"
purple <- "#6A5ACD"
slate <- "#314f4f"
# Dark slate grey: #314f4f
# Knitr options
opts_chunk$set(
  comment = "#>",
  fig.align = "center",
  fig.height = 7,
  fig.width = 10.5,
  warning = F,
  message = F
)
opts_chunk$set(dev = "svg")
options(device = function(file, width, height) {
  svg(tempfile(), width = width, height = height)
})
options(knitr.table.format = "html")
```

# Prologue

---
name: schedule

# Schedule

## Last time

Plotting

## Today

Simulation
---
layout: true
# Simulation

---
name: simulation
class: inverse, middle
---
name: motive
## Motivation

As we've discussed, simulation can be a quick and effective way to better understand how an estimator performs/behaves.

--

You just need to be careful to .hi[ask a clear, answerable question] and then .hi[run a simulation] that corresponds/answers this question.

--

In addition, simulations can be computationally intense—they are often the first time you have to really think about efficiency in coding.
---
name: gen-outline
## Generic outline

The general outline for a simulation is fairly consistent.

--

1\. .hi[Define the population] via a data-generating process (DGP)..super[.pink[†]]

.footnote[.pink[†] Some people prefer to actually construct the population in this step and then repeatedly sample from this fixed population. Others stick with a population defined by the DGP.]

--

2\. Iterate. In each iteration:
  - .hi[Sample] from your population.
  - Construct .hi[estimates/inferences] that relate to your original question.

--

3\. .hi[Summarize] results.

---
## Practical issues

This semi-theoretical framework needs a few practical reminders.

--

1. Always set a seed *at the beginning* of your simulation (`set.seed()`).

--

1. Parallelize where/when possible (_e.g._, the `furrr` package).

--

1. Writing a function for a single iteration can be helpful (see above).

--

1. There is a (big) difference between unbiasedness and consistency.

--

1. You build simulations/DGPs with assumptions.

--

1. Analytical results can inform and/or replace simulations.
---
layout: false
class: inverse, middle

# Example simulation
---
layout: true

# Simulation

---
name: ex-q
## The question

.qa[Q] We've shown that instrumental variables (IV) is consistent, how does it perform (_i.e._, is it unbiased) in finite (small) samples?

.note[Note] This question is definitely answerable analytically.

--

Nevertheless, let's see how IV performs at several small-ish sample sizes.

While we're at it, let's confirm OLS is indeed biased in this setting.

---
name: ex-dgp
## DGP

We want a valid instrument for a setting in which treatment is endogenous.
$$
\begin{align}
  \text{Y}_{i} = \alpha + \tau \text{D}_{i} + \varepsilon_i
\end{align}
$$
So we want
1. .hi-slate[Endogenous treatment:] $\mathop{\text{Cov}} \left( \text{D}_{i},\, \varepsilon_i \right) \neq 0$
1. .hi-pink[Predictive:] $\mathop{\text{Cov}} \left( \text{Z}_{i},\, \text{D}_{i} \right) \neq 0$
1. .hi-pink[Excludability:] $\mathop{\text{Cov}} \left( \text{Z}_{i},\, \varepsilon_i \right) = 0$

--

where (2) and (3) imply $\text{Z}_{i}$ is a valid instrument.
---
## DGP

In other words, the variance-covariance matrix of $\text{D}_{i}$, $\varepsilon_i$, and $\text{Z}_{i}$ is
$$
\begin{align}
  \Sigma = \begin{bmatrix}
    \sigma^2_{\text{D}} & \sigma_{\text{D},\varepsilon} & \sigma_{\text{D},\text{Z}} \\
    \sigma_{\text{D},\varepsilon} & \sigma^2_{\varepsilon} & 0 \\
    \sigma_{\text{D},\text{Z}} & 0 & \sigma^2_{\text{Z}}
  \end{bmatrix}
\end{align}
$$
--
If we assume unit variances and covariances are 0.6, then
$$
\begin{align}
  \Sigma = \begin{bmatrix}
    1 & 0.6 & 0.6 \\
    0.6 & 1 & 0 \\
    0.6 & 0 & 1
  \end{bmatrix}
\end{align}
$$
---
## DGP

To simplify our lives, let's assume that $\text{D}_{i}$, $\varepsilon_i$, and $\text{Z}_{i}$ come from a multivariate normal distribution.

We defined their covariance matrix. We need to define their means.

$\mu_{\text{D}} = 10$, $\mu_{\varepsilon} = 0$, and $\mu_{\text{Z}} = 3$.

--

Finally, we need to define the way in which $\text{D}_{i}$ and $\varepsilon_i$ affect $\text{Y}_{i}$.
$$
\begin{align}
  \text{Y}_{i} = 7 + 1 \times \text{D}_{i} + \varepsilon_i
\end{align}
$$
_i.e._, $\tau = 1$.
---
## DGP

Lucky for us, .mono[R]'s `MASS` package has a function `mvrnorm()` that draws `n` random observations from a multivariate normal distribution with means `mu` and variance-covariance matrix `Sigma`.


---
name: ex-sample
## Sampling from our DPG

We're ready to write a function that performs one iteration.

Our function will accept a single argument `n`, the sample size.

--

```{R, iter-fun, eval = F}
sim_iter <- function(n) {
  # Define our variance-covariance matrix (D, ε, Z)
  Σ <- matrix(data = c(1, 0.6, 0.6, 0.6, 1, 0, 0.6, 0, 1), ncol = 3)
  # Our vector of means (D, ε, Z)
  μ = c(10, 0, 3)
  # Draw n observations; convert to tibble
  sample_df <- MASS::mvrnorm(n = n, mu = μ, Sigma = Σ) %>% tibble()
  # Name variables
  names(sample_df) <- c("D", "ε", "Z")
  # Calculate Y
  sample_df %<>% mutate(Y = 7 + 1 * D + ε)
}
```
---
## Estimation

Now we just need to estimate $\beta_\text{IV}$ and $\beta_\text{OLS}$. We'll use `estimatr`.

--

.note[Previous] OLS estimates of the effect of `x` on `y`
```{R, ex-lm, eval = F}
lm_robust(y ~ x)
```

--

.note[New] IV estimates of the effect of `x` on `y` with instrument `z`
```{R, ex-iv, eval = F}
iv_robust(y ~ x | z)
```
---
layout: false
class: clear, middle

```{R, iter-fun-full, eval = T}
sim_iter <- function(n) {
  # Define our variance-covariance matrix (D, ε, Z)
  Σ <- matrix(data = c(1, 0.6, 0.6, 0.6, 1, 0, 0.6, 0, 1), ncol = 3)
  # Our vector of means (D, ε, Z)
  μ = c(10, 0, 3)
  # Draw n observations; convert to tibble
  smpl_df <- MASS::mvrnorm(n = n, mu = μ, Sigma = Σ) %>% data.frame()
  # Name variables
  names(smpl_df) <- c("D", "ε", "Z")
  # Calculate Y
  smpl_df %<>% mutate(Y = 7 + 1 * D + ε)
  # Estimates
  est_df <- bind_rows(
    # The OLS estimates
    lm_robust(Y ~ D, data = smpl_df) %>% tidy() %>% mutate(est = "OLS"),
    # The IV estimates
    iv_robust(Y ~ D | Z, data = smpl_df) %>% tidy() %>% mutate(est = "IV")
  )
  return(est_df)
}
```
---
layout: true
# Simulation

---
name: ex-iter
## Repeat

Now we want run `sim_iter()` *many* times.

--

And we're going to do it in parallel—using the `furrr` package.

--

The output of `sim_iter()` is a data frame, so we can actually use a function from `furrr` that expects outputted data frames, namely, `future_map_dfr`.

The suffix `_dfr` means the function will row-bind the data frames returned by individual iterations.

--

We'll also use the `rep()` function which repeats things, _e.g._, `rep("a", 3)` repeats `"a"` three times.
---

Assuming we've already entered `sim_iter()` into memory, we can run our simulation 5,000 times, each with sample size 50—in parallel!

```{R, ex50, cache = T}
# Load furrr
p_load(furrr)
# Tell R to parallelize with 4 cores
plan(multiprocess, workers = 4)
# Set a seed
set.seed(12345)
# Run simulation with sample size 50
sim50 <- future_map_dfr(
  # Repeat sample size 50 for 5000 times
  rep(50, 5000),
  # Our function
  sim_iter,
  # Let furrr know we want to set a seed
  .options = future_options(seed = T)
)
```

---
layout: false
class: clear, middle

Sample size 50 (5,000 iterations)
```{R, plot-sim50, echo = F}
ggplot(
  data = sim50 %>% filter(term == "D"),
  aes(x = estimate, fill = est)
) +
geom_density(color = NA, alpha = 0.9) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 1, color = red_pink, size = 1, linetype = "longdash") +
geom_vline(
  xintercept = filter(sim50, term == "D" & est == "OLS")$estimate %>% mean(),
  color = slate
) +
geom_vline(
  xintercept = filter(sim50, term == "D" & est == "IV")$estimate %>% mean(),
  color = orange
) +
scale_fill_manual("Estimator", values = c(orange, slate)) +
theme_pander(base_size = 22, base_family = "Fira Sans Book")
```
---
layout: true
class: clear, middle

---

Let's vary the sample size and see what happens.

---

Sample size 10 (5,000 iterations)

```{R, ex10, cache = T, include = F}
# Load furrr
p_load(furrr)
# Tell R to parallelize with 4 cores
plan(multiprocess, workers = 4)
# Set a seed
set.seed(12345)
# Run simulation with sample size 10
sim10 <- future_map_dfr(
  # Repeat sample size 10 for 5000 times
  rep(10, 5000),
  # Our function
  sim_iter,
  # Let furrr know we want to set a seed
  .options = future_options(seed = T)
)
```

```{R, plot-sim10, echo = F}
ggplot(
  data = sim10 %>% filter(term == "D"),
  aes(x = estimate, fill = est)
) +
geom_density(color = NA, alpha = 0.9) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 1, color = red_pink, size = 1, linetype = "longdash") +
geom_vline(
  xintercept = filter(sim10, term == "D" & est == "OLS")$estimate %>% mean(),
  color = slate
) +
geom_vline(
  xintercept = filter(sim10, term == "D" & est == "IV")$estimate %>% mean(),
  color = orange
) +
xlim(-2.7, 2.5) +
# xlim(-3, 3) +
scale_fill_manual("Estimator", values = c(orange, slate)) +
theme_pander(base_size = 22, base_family = "Fira Sans Book")
```
---

Sample size 25 (5,000 iterations)

```{R, ex25, cache = T, include = F}
# Load furrr
p_load(furrr)
# Tell R to parallelize with 4 cores
plan(multiprocess, workers = 4)
# Set a seed
set.seed(12345)
# Run simulation with sample size 25
sim25 <- future_map_dfr(
  # Repeat sample size 25 for 5000 times
  rep(25, 5000),
  # Our function
  sim_iter,
  # Let furrr know we want to set a seed
  .options = future_options(seed = T)
)
```

```{R, plot-sim25, echo = F}
ggplot(
  data = sim25 %>% filter(term == "D"),
  aes(x = estimate, fill = est)
) +
geom_density(color = NA, alpha = 0.9) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 1, color = red_pink, size = 1, linetype = "longdash") +
geom_vline(
  xintercept = filter(sim25, term == "D" & est == "OLS")$estimate %>% mean(),
  color = slate
) +
geom_vline(
  xintercept = filter(sim25, term == "D" & est == "IV")$estimate %>% mean(),
  color = orange
) +
xlim(-2.7, 2.5) +
# xlim(-1.5, 2.5) +
scale_fill_manual("Estimator", values = c(orange, slate)) +
theme_pander(base_size = 22, base_family = "Fira Sans Book")
```
---

Sample size 50 (5,000 iterations)
```{R, plot-sim50-2, echo = F}
ggplot(
  data = sim50 %>% filter(term == "D"),
  aes(x = estimate, fill = est)
) +
geom_density(color = NA, alpha = 0.9) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 1, color = red_pink, size = 1, linetype = "longdash") +
geom_vline(
  xintercept = filter(sim50, term == "D" & est == "OLS")$estimate %>% mean(),
  color = slate
) +
geom_vline(
  xintercept = filter(sim50, term == "D" & est == "IV")$estimate %>% mean(),
  color = orange
) +
xlim(-2.7, 2.5) +
# xlim(-0.5, 2.25) +
scale_fill_manual("Estimator", values = c(orange, slate)) +
theme_pander(base_size = 22, base_family = "Fira Sans Book")
```
---

Sample size 100 (5,000 iterations)

```{R, ex100, cache = T, include = F}
# Load furrr
p_load(furrr)
# Tell R to parallelize with 4 cores
plan(multiprocess, workers = 4)
# Set a seed
set.seed(12345)
# Run simulation with sample size 100
sim100 <- future_map_dfr(
  # Repeat sample size 100 for 5000 times
  rep(100, 5000),
  # Our function
  sim_iter,
  # Let furrr know we want to set a seed
  .options = future_options(seed = T)
)
```

```{R, plot-sim100, echo = F}
ggplot(
  data = sim100 %>% filter(term == "D"),
  aes(x = estimate, fill = est)
) +
geom_density(color = NA, alpha = 0.9) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 1, color = red_pink, size = 1, linetype = "longdash") +
geom_vline(
  xintercept = filter(sim100, term == "D" & est == "OLS")$estimate %>% mean(),
  color = slate
) +
geom_vline(
  xintercept = filter(sim100, term == "D" & est == "IV")$estimate %>% mean(),
  color = orange
) +
xlim(-2.7, 2.5) +
# xlim(0.25, 2) +
scale_fill_manual("Estimator", values = c(orange, slate)) +
theme_pander(base_size = 22, base_family = "Fira Sans Book")
```
---
layout: true
# Simulation

---
name: ex-a

## Assumptions

Keep in mind that we made several assumptions about

- the distribution (joint normality is very restrictive)
- variance (all equal, independent, and homoskedastic)
- covariances (again, all equal)
- strong instrument
---
name: loops

## Looping

There are .b[many] ways to iterate/loop in .mono[R]:
- `for()`, `while()`, *etc.*
- `lapply()`, `mapply()`, *etc.*
- `parallel`: `mclapply()`, `mcmapply()`, *etc.*
- `foreach`
- `future`, `furrr`, and `future.apply`: `future_lapply()`, `future_map()`, *etc.*

--

They are not all equal/identical.
- Few can access values from previous iterations (`for()` and `foreach`).
- A subset is parallelizable (`parallel`, `foreach`, the `future` family).
- Behavior can be OS specific (especially `parallel`).

---

## `for()`

You'll often hear that you should never use `for()` loops in .mono[R].

This opinion is a bit extreme, but there are a few reasons to avoid them.

1. `for()` is not parallelized (though `foreach` can be).

1. `for()` doesn't clean up after itself—leaving objects in memory between iterations and after the loop finishes..super[.pink[†]]

1. `for()` loops generally ["grow" data](https://privefl.github.io/blog/why-loops-are-slow-in-r/) which can be slow/inefficient.

See [Grant McDermott's lectures](https://github.com/uo-ec607) for further justification.

.footnote[.pink[†] This feature *can* be desirable—*e.g.*, linking iterations.]
---
layout: false
# Table of contents

.pull-left[
### Simulation
.small[
1. [Motivation](#motive)
1. [Generic outline](#gen-outline)
1. [Example](#ex-q)
  - [The question](#ex-q)
  - [DGP](#ex-dgp)
  - [Sampling from the DGP](#ex-sample)
  - [Iterating](#ex-iter)
  - [Assumptions](#ex-a)
1. [Loops](#loops)
]]

---
exclude: true

```{R, generate pdfs, include = F, eval = T}
source("../../ScriptsR/unpause.R")
unpause("06RSim.Rmd", ".", T, T)
```