---
title: "Lab .mono[002]"
subtitle: "Cross validation and simulation"
author: "Edward Rubin"
#date: "`r format(Sys.time(), '%d %B %Y')`"
date: "24 January 2020"
output:
xaringan::moon_reader:
css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css']
# self_contained: true
nature:
highlightStyle: github
highlightLines: true
countIncrementalSlides: false
---
exclude: true
```{R, setup, include = F}
library(pacman)
p_load(
tidyverse,
ggplot2, ggthemes,
latex2exp, viridis, extrafont, gridExtra, plotly, ggformula, scales,
kableExtra, snakecase, janitor,
data.table,
lubridate, knitr,
caret,
huxtable, DT,
here, magrittr, parallel, future.apply, tictoc
)
# Define colors
red_pink = "#e64173"
turquoise = "#20B2AA"
orange = "#FFA500"
red = "#fb6107"
blue = "#3b3b9a"
green = "#8bb174"
grey_light = "grey70"
grey_mid = "grey50"
grey_dark = "grey20"
purple = "#6A5ACD"
slate = "#314f4f"
# Knitr options
opts_chunk$set(
comment = "#>",
fig.align = "center",
fig.height = 7,
fig.width = 10.5,
warning = F,
message = F
)
opts_chunk$set(dev = "svg")
options(device = function(file, width, height) {
svg(tempfile(), width = width, height = height)
})
options(knitr.table.format = "html")
```
---
layout: true
# Admin
---
class: inverse, middle
---
name: admin
.b[In lab today]
- Check in
- Cross validation and parameter tuning (with `caret`)
- Simulation
.b.it[Upcoming]
- Keep up with readings (.it[ISL] 3–4, 6)
- Assignment will be posted soon
---
name: check-in
layout: false
# Check in
## Some questions
(Be honest)
1. How is .b[this class] going?
- Are there areas/topics you would like more coverage/review?
- How is the speed?
- How were the assignments?
- Is there more we or you could be doing?
--
1. How is your .b[quarter] going?
1. How is your .b[program] going?
--
1. .b[Anything] else?
---
layout: true
# Cross validation
---
class: inverse, middle
---
name: cv-review
## Review
.b[Cross validation].super[.pink[†]] (CV) aims to .pink[estimate out-of-sample performance]
.footnote[
.pink[†] Plus other resampling (and specifically hold-out) methods
]
1. .hi-pink[efficiently], using the full dataset to both .it[train] and .it[test] (validate)
1. .hi-pink[without overfitting], separating observations used to .it[train] or .it[test]
--
.b[CV] (_e.g._, LOOCV and $k$-fold) aids with .purple[several tasks]
- .hi-purple[model selection]: choosing key parameters for a model's flexibility
_e.g._, .it[K] in KNN, polynomials and interactions for regression (.it[AKA] tuning)
- .hi-purple[model assessment]: determining how well the model performs
_e.g._, estimating the true test MSE, MAE, or error rate
---
layout: true
# Cross validation
## $k$-fold cross validation, review
1. .b[Divide] training data into $k$ folds (mutually exclusive groups)
1. .it[For each fold] $i$:
- .b[Train] your model on all observations, excluding members of $i$
- .b[Test] and .b[assess] model.sub[i] on the members of fold $i$, _e.g._, MSE.sub[i]
1. .b[Estimate test performance] by averaging across folds, _e.g._, Avg(MSE.sub[i])
---
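The three steps above can be sketched "by hand" in a few lines of R. (A minimal, hypothetical example: `df` stands in for a data frame with outcome `y` and predictor `x`, and a linear model stands in for whatever method you are assessing.)
```{R, manual-cv, eval = F}
# Number of folds
k = 5
# Randomly assign each row of 'df' to one of k folds
fold = sample(rep(1:k, length.out = nrow(df)))
# For each fold i: train on the other folds, test on fold i, record MSE_i
mse = sapply(1:k, function(i) {
  est = lm(y ~ x, data = df[fold != i, ])
  pred = predict(est, newdata = df[fold == i, ])
  mean((df$y[fold == i] - pred)^2)
})
# Estimate test MSE by averaging across folds
mean(mse)
```
---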
--
```{R, data-cv, include = F, cache = T}
# Generate data
X = 40
Y = 12
set.seed(12345)
cv_df = expand_grid(
x = 1:X,
y = 1:Y
) %>% mutate(
id = 1:(X*Y),
grp = sample(X * Y) %% 5 + 1
)
# Find groups
a = seq(1, X*Y, by = X*Y/5)
b = c(a[-1] - 1, X*Y)
```
```{R, plot-cvk-0b, echo = F, fig.height = 3}
ggplot(data = cv_df, aes(x, y, color = grp == 1, fill = grp == 1)) +
geom_point(shape = 21, size = 4.5, stroke = 0.5) +
scale_fill_manual("", values = c("white", slate)) +
scale_color_manual("", values = c(purple, slate)) +
theme_void() +
theme(legend.position = "none")
```
Each fold takes a turn at .hi-slate[validation]. The other $k-1$ folds .hi-purple[train].
---
```{R, plot-cvk-1, echo = F, fig.height = 3}
ggplot(
data = cv_df,
aes(x, y, color = between(id, a[1], b[1]), fill = between(id, a[1], b[1]))
) +
geom_point(shape = 21, size = 4.5, stroke = 0.5) +
scale_fill_manual("", values = c("white", slate)) +
scale_color_manual("", values = c(purple, slate)) +
theme_void() +
theme(legend.position = "none")
```
For $k=5$, fold number $1$ as the .hi-slate[validation set] produces MSE.sub[k=1].
---
```{R, plot-cvk-2, echo = F, fig.height = 3}
ggplot(
data = cv_df,
aes(x, y, color = between(id, a[2], b[2]), fill = between(id, a[2], b[2]))
) +
geom_point(shape = 21, size = 4.5, stroke = 0.5) +
scale_fill_manual("", values = c("white", slate)) +
scale_color_manual("", values = c(purple, slate)) +
theme_void() +
theme(legend.position = "none")
```
For $k=5$, fold number $2$ as the .hi-slate[validation set] produces MSE.sub[k=2].
---
```{R, plot-cvk-3, echo = F, fig.height = 3}
ggplot(
data = cv_df,
aes(x, y, color = between(id, a[3], b[3]), fill = between(id, a[3], b[3]))
) +
geom_point(shape = 21, size = 4.5, stroke = 0.5) +
scale_fill_manual("", values = c("white", slate)) +
scale_color_manual("", values = c(purple, slate)) +
theme_void() +
theme(legend.position = "none")
```
For $k=5$, fold number $3$ as the .hi-slate[validation set] produces MSE.sub[k=3].
---
```{R, plot-cvk-4, echo = F, fig.height = 3}
ggplot(
data = cv_df,
aes(x, y, color = between(id, a[4], b[4]), fill = between(id, a[4], b[4]))
) +
geom_point(shape = 21, size = 4.5, stroke = 0.5) +
scale_fill_manual("", values = c("white", slate)) +
scale_color_manual("", values = c(purple, slate)) +
theme_void() +
theme(legend.position = "none")
```
For $k=5$, fold number $4$ as the .hi-slate[validation set] produces MSE.sub[k=4].
---
```{R, plot-cvk-5, echo = F, fig.height = 3}
ggplot(
data = cv_df,
aes(x, y, color = between(id, a[5], b[5]), fill = between(id, a[5], b[5]))
) +
geom_point(shape = 21, size = 4.5, stroke = 0.5) +
scale_fill_manual("", values = c("white", slate)) +
scale_color_manual("", values = c(purple, slate)) +
theme_void() +
theme(legend.position = "none")
```
For $k=5$, fold number $5$ as the .hi-slate[validation set] produces MSE.sub[k=5].
---
layout: true
# Cross validation
---
name: cv-independence
## Independence
.b[Resampling methods] rest on an independence-type assumption: our resampling must mimic the original sampling procedure.
--
If observations are "linked" but we resample independently, CV may break.
--
If we have .hi-pink[repeated observations] on individuals $i$ through time $t$:
- It's pretty likely $y_{i,t}$ and $y_{i,t+1}$ are related—and maybe $y_{i,t+\ell}$.
- Initial sample may draw individuals $i$, but .it[standard] CV ignores time.
--
In other cases, .hi-pink[some individuals are linked with other individuals], *e.g.*,
- $y_{i,t}$ and $y_{j,t}$ may be correlated if $i$ and $j$ live together
- Also: $y_{i,t}$ and $y_{j,t+\ell}$ could be correlated
---
## Independence
In other words: Spatial or temporal .b[dependence] between observations .b[breaks the separation between training and testing samples].
--
Breaking this train-test separation leads us back to
- .b[Overfitting] the sample—since training and testing samples are linked
- .b[Overestimating model performance]—the estimated test MSE will behave more like a training MSE.
--
.b[Solutions] to this problem involve .hi-pink[matching the resampling process] to the .hi-pink[original sampling and underlying dependence].
.ex[Examples:] Spatial $k$-fold CV (SKCV) and blocked time-series CV
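With `caret`, one option for grouped data is `groupKFold()`, which keeps all rows sharing an identifier in the same fold. (A sketch; `df` and its identifier column `i` are hypothetical.)
```{R, ex-group-cv, eval = F}
# Assign each individual's rows to a single fold
folds = groupKFold(df$i, k = 5)
# Use the resulting fold indices when training
group_control = trainControl(method = "cv", index = folds)
```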
---
## Dependence
.qa[Q] So how big of a deal is this type of dependence?
--
.qa[A] Let's see! (Sounds like it's time for a simulation.)
---
layout: true
# Simulations
---
class: inverse, middle
---
name: sim-mc
## Monte Carlo
.b[Monte Carlo simulation].super[.pink[†]] allows us to model probabilistic questions.
.footnote[
.pink[†] Also called Monte Carlo methods, experiments, *etc.*
]
1. .b[Generate a population] defined by some data-generating process (DGP), a model of how your fake/simulated world works.
--
1. Repeatedly .b[draw samples] $s$ from your population. .it[For each] $s$:
- Apply the relevant methods, sampling techniques, estimators, *etc.*
- Record your outcome(s) of interest (_e.g._, MSE or error rate), $O_s$
--
1. Use the .b[distribution] of the $O_s$ to .it[learn] about your methods, sampling techniques, and/or estimators—the mean, bias, variability *etc.*
--
.note[Ex.sub[1]] We can change the DGP in (1) to see how performance in (3) changes.
.note[Ex.sub[2]] Holding DGP (1) constant, we can compare competing methods (2).
---
layout: false
class: clear, middle
Sound familiar? Monte Carlo is closely related to resampling methods.
---
layout: true
# Monte Carlo
---
name: sim-intro
## Introductory example: Define the question
It's always helpful to clearly define the question you want to answer, _e.g._:
.b[Example question:] Is OLS unbiased when $\varepsilon_i\sim$ Uniform(0, 1)?
--
Now we know our goal: Determine unbiasedness (mean of distribution).
---
## Introductory example: DGP and population
We'll use the DGP $y_i = 3 + 6 x_i + \varepsilon_i$,
where $\varepsilon_i\sim$ Uniform(0, 1), and $x_i\sim$ N(0, 1).
--
```{R, ex-pop, eval = F}
# Set seed
set.seed(123)
# Define population size
pop_size = 1e4
# Generate population
ols_pop = tibble(
# Generate disturbances: Using Uniform(0,1)
e = runif(pop_size, min = 0, max = 1),
# Generate x: Using N(0,1)
x = rnorm(pop_size),
# Calculate y
y = 3 + 6 * x + e
)
```
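Before simulating, notice that these disturbances are not mean zero: $E[\varepsilon_i] = 1/2$ for Uniform(0, 1). The population's conditional mean is therefore
$$
\begin{align}
 E\left[ y_i \mid x_i \right] = 3 + 6 x_i + E\left[ \varepsilon_i \right] = 3.5 + 6 x_i
\end{align}
$$
which hints at where our intercept estimates should center.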
---
name: ex-fun
## Aside: Writing custom functions
In R you can create your own functions with the `function()` function.
.col-left[
You define the arguments.
```{R, ex-function}
# Our function: Take the product
multiply = function(a, b, c) {
a * b * c
}
# Try the function
multiply(a = 1, b = 2, c = 3)
```
]
--
.col-right[
You can define default values too.
```{R, ex-function2}
# Our function: Take the product
multiply = function(a, b, c = 3) {
a * b * c
}
# Try the function
multiply(a = 1, b = 2)
```
]
--
.left95[.note[Note] Functions return the last .it[unassigned] object.]
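You can also make the returned object explicit with `return()`, which behaves identically here:
```{R, ex-function-return, eval = F}
# Same function, with an explicit return()
multiply = function(a, b, c = 3) {
  return(a * b * c)
}
```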
---
## Aside: Writing custom functions
Typically you will have a slightly more complex function, _e.g._,
.col-left[
```{R, ex-function-fancy}
super_fancy = function(a, b, c = 3) {
# Test if 'c' == 3
if (c == 3) {
# If yes: divide 'a' by 'b'
step1 = a / b
} else {
# If no: multiply 'a' by 'b'
step1 = a * b
}
# Now multiply the result by 7
step2 = step1 * 7
# And now multiply by zero
step2 * 0
}
```
]
--
.col-right[
```{R, ex-function-fancy-run}
# Try our function
super_fancy(a = 12, b = 7, c = 4)
```
]
---
## Introductory example: A single iteration
Now we want to write a function that
1. .b[samples] from the population `ols_pop`
2. .b[estimates OLS] regression on the sample
3. .b[records] the OLS estimate
--
```{R, ex-fun, eval = F}
ols_function = function(iter, sample_size) {
# Draw a sample of size 'sample_size'
ols_sample = sample_n(ols_pop, size = sample_size)
# Estimate OLS
ols_est = lm(y ~ x, data = ols_sample)
# Return coefficients and iteration number
data.frame(b0 = coef(ols_est)[1], b1 = coef(ols_est)[2], iter)
}
```
---
## Introductory example: Run the simulation
For the simulation, we run our single-iteration function `ols_function()`
.b[a bunch] of times (like 10,000) and then collect the results..super[.pink[†]]
.footnote[
.pink[†] The `apply()` family and parallelization are both key here (`future.apply` combines them).
]
```{R, ex-sim-eval1, include = F, eval = F}
# Set up parallelization (with 12 cores)
# Warning: Can take a little time to run (esp w/out 12 cores)
plan(multiprocess, workers = 12)
# Set seed
set.seed(12345)
# Run simulation
ols_sim = future_lapply(
X = 1:1e4,
FUN = ols_function,
sample_size = 50,
future.seed = T
) %>% bind_rows()
# Save simulation
write_csv(
x = ols_sim,
path = here("simple-sim.csv")
)
```
```{R, ex-sim-load, include = F}
# Load simulation results
ols_sim = here("simple-sim.csv") %>% read_csv()
```
```{R, ex-sim, eval = F}
# Set up parallelization (with 12 cores)
# Warning: Can take a little time to run (esp w/out 12 cores)
plan(multiprocess, workers = 12)
# Set seed
set.seed(12345)
# Run simulation
ols_sim = future_lapply(
X = 1:1e4,
FUN = ols_function,
sample_size = 50,
future.seed = T
) %>% bind_rows()
```
---
## Introductory example: Results
Now we're ready to summarize the results.
We wanted to know .b[if OLS is still unbiased].
Thus, plotting the .b[distributions of estimates] for $\beta_0$ and $\beta_1$ will be of interest—.b[especially the means] of these distributions.
.note[Recall:] If an estimator is unbiased, then the mean of its distribution should line up with the parameter it is attempting to estimate.
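A quick numerical check of the same question (using the `ols_sim` results from the previous slide) compares each estimate's mean to its true parameter:
```{R, ex-sim-check, eval = F}
# Mean of the intercept estimates (true parameter: 3)
mean(ols_sim$b0)
# Mean of the slope estimates (true parameter: 6)
mean(ols_sim$b1)
```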
---
layout: true
class: clear, middle
---
.b[Considering OLS's unbiasedness:] The distribution of $\hat{\beta}_0$
```{R, plot-ex-b0, echo = F}
ggplot(ols_sim, aes(x = b0)) +
geom_density(fill = "black", alpha = 0.7) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 3, color = "orange", size = 1) +
geom_vline(xintercept = mean(ols_sim$b0), color = "black", size = 1) +
scale_y_continuous("Density") +
scale_x_continuous(expression(beta[0]*" estimates")) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book")
```
---
.b[Considering OLS's unbiasedness:] The distribution of $\hat{\beta}_1$
```{R, plot-ex-b1, echo = F}
ggplot(ols_sim, aes(x = b1)) +
geom_density(fill = "black", alpha = 0.7) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 6, color = "orange", size = 1) +
geom_vline(xintercept = mean(ols_sim$b1), color = "black", size = 1) +
scale_y_continuous("Density") +
scale_x_continuous(expression(beta[1]*" estimates")) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book")
```
---
layout: false
# Monte Carlo
## Introductory example: Conclusion
When our disturbances are distributed as Uniform(0,1)
1. OLS is biased for $\beta_0$ (the intercept)
1. OLS is still unbiased for $\beta_1$ (the slope)
... and we were able to learn all of this information by simulation.
---
class: clear, middle
Now back to our question on $k$-fold cross validation and interdependence.
---
layout: true
# Simulation: $k$-fold CV and dependence
---
name: sim-kcv-intro
## Our question
Let's write our previous question in a way we can try to answer it.
--
.b[Question:] How does correlation between observations affect the performance of $k$-fold cross validation?
---
layout: true
# Simulation: $k$-fold CV and dependence
## Our question
Let's write our previous question in a way we can try to answer it.
.b[Question:] How does .hi-purple[correlation between observations] affect the .hi-orange[performance] of $\color{#e64173}{k}$.hi-pink[-fold cross validation]?
---
We need to define some terms.
---
.hi-purple.it[correlation between observations]
- Observations can correlate in many ways. Let's keep it simple: we will simulate repeated observations (through time, $t$) on individuals, $i$.
- .hi-purple[DGP:] Continuous $y_{i,t}$ has a predictor $x_{i,t}$ and two random disturbances
$$
\begin{align}
y_{i,t} &= 3 x_{i,t} - 2 x_{i,t}^2 + 0.1 x_{i,t}^4 + \gamma_i + \varepsilon_{i,t} \\
x_{i,t} &= 0.9 x_{i,t-1} + \eta_{i,t} \\
\varepsilon_{i,t} &= 0.9 \varepsilon_{i,t-1} + \nu_{i,t}
\end{align}
$$
---
.hi-orange.it[performance]
- We will focus on .hi-orange[MSE] for observations the model never saw.
- .it[Note:] .it[En route], we will meet .it[root] mean squared error (RMSE).
$$
\begin{align}
\text{RMSE}=\sqrt{\text{MSE}}
\end{align}
$$
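In R, given vectors of outcomes and predictions (hypothetical `y` and `y_hat`), that is simply
```{R, ex-rmse, eval = F}
# RMSE: square the errors, average, then take the square root
sqrt(mean((y - y_hat)^2))
```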
---
$\color{#e64173}{k}$.hi-pink[-fold cross validation]
- We'll stick with 5-fold cross validation.
- Our answer shouldn't change too much based upon $k$.
- Feel free to experiment later...
---
layout: true
class: clear
---
name: sim-kcv-population
### Set up population
- Set seed
- Define number of individuals `N` and time periods `T`
```{R, sim-cv-setup}
# Set seed
set.seed(12345)
# Size of the population
N = 1000
# Number of time periods per individual
T = 50
# Create the indices of our population
pop_df = expand_grid(
i = 1:N,
t = 1:T
)
```
---
### Build population
- Generate temporally correlated disturbances and predictor $(x_{i,t})$
- Calculate outcome variable $y_{i,t}$
```{R, sim-cv-pop}
# Predictor and disturbance for (i,t): AR(1) within individual (not across individuals)
pop_df %<>%
group_by(i) %>%
mutate(
x = arima.sim(model = list(ar = c(0.9)), n = T) %>% as.numeric(),
e_it = arima.sim(model = list(ar = c(0.9)), n = T) %>% as.numeric()
) %>% ungroup()
# Disturbance for (i): Constant within individual
pop_df %<>%
group_by(i) %>%
mutate(e_i = runif(n = 1, min = -100, max = 100)) %>%
ungroup()
# Calculate 'y'; drop disturbances
pop_df %<>% mutate(
y = e_i + 3 * x - 2 * x^2 + 0.1 * x^4 + e_it
) %>% select(i, t, y, x)
```
---
class: clear
Notice the .b[correlation within individual] across time.
```{R, plot-sim-cv-i, echo = F}
# Plot a few observations' time series
ggplot(data = pop_df %>% filter(i < 5), aes(x = t, y = y, color = as.factor(i))) +
geom_line(size = 0.8) +
geom_point(size = 2.5) +
scale_y_continuous(expression(y[it])) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
scale_color_viridis_d(option = "magma", end = 0.9) +
theme(legend.position = "none")
```
---
class: clear
Viewing the correlation between $x_{i,t}$ and $y_{i,t}$.
```{R, plot-sim-cv-it, echo = F}
# Plot x,y for the first few individuals
ggplot(data = pop_df %>% filter(i < 10), aes(x = x, y = y, color = as.factor(i))) +
geom_line(size = 0.8) +
geom_point(size = 2.5) +
scale_x_continuous(expression(x[it])) +
scale_y_continuous(expression(y[it])) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
scale_color_viridis_d(option = "magma", end = 0.9) +
theme(legend.position = "none")
```
---
class: middle
.b[Next steps:] Write out a single iteration (to become a function).
1. Draw a sample.
1. Estimate a model: We're going to use KNN regression.
1. Tune our model: We will use 5-fold CV to choose 'K'.
1. Assess the model's performance (CV and in the population).
---
name: sim-kcv-sample
### Draw a sample
```{R, sim-cv-sample}
# Define sample size (will be an input of our function)
n_i = 50
# Sample individual identifiers
i_sampled = sample.int(N, size = n_i)
# Subset to the sampled individuals
sample_df = pop_df %>% filter(i %in% i_sampled)
```
`sample.int(n, size)` randomly draws `size` integers between 1 and `n` (without replacement, by default).
Note that we are .b[sampling individuals] (`i`), not individual observations (rows).
---
name: sim-kcv-model
### Tune and estimate a model
```{R, sim-cv-train-eval1, eval = F, include = F}
# Define number of folds
k_folds = 5
# k-fold CV
cv_output = train(
# The relationship: y as a function of x
y ~ .,
# The method of estimating the model (KNN)
method = "knn",
# The training data (which will be used for cross validation)
data = sample_df %>% select(y, x),
# Controls for the model training: k-fold cross validation
trControl = trainControl(method = "cv", number = k_folds),
# Allow cross validation to choose best value of K (# nearest neighbors)
tuneGrid = expand.grid(k = seq(1, 500, by = 5))
)
# Save the output
write_rds(x = cv_output, path = here("cv-example.rds"))
```
```{R, load-cv-example, include = F, eval = T}
# Define number of folds
k_folds = 5
# Load
cv_output = here("cv-example.rds") %>% read_rds()
```
```{R, sim-cv-train-fake, eval = F}
# Define number of folds
k_folds = 5
# k-fold CV
cv_output = train(
# The relationship: y as a function of x
y ~ .,
# The method of estimating the model (KNN)
method = "knn",
# The training data (which will be used for cross validation)
data = sample_df %>% select(y, x),
# Controls for the model training: k-fold cross validation
trControl = trainControl(method = "cv", number = k_folds),
# Allow cross validation to choose best value of K (# nearest neighbors)
tuneGrid = expand.grid(k = seq(1, 500, by = 5))
)
```
`train()` (from the `caret` package) assists with many training/tuning tasks—note the `tuneGrid` argument—and can also pre-process your data.
---
You can actually plot the results from `train()`, _i.e._,
```{R, plot-cv-fake, eval = F}
ggplot(cv_output) + theme_minimal()
```
```{R, plot-cv-real, echo = F, fig.height = 6}
ggplot(cv_output) + theme_minimal(base_size = 20, base_family = "Fira Sans Book")
```
---
The output from `train()` also contains a lot of additional information, _e.g._,
```{R, cv-output, eval = F}
cv_output$results
```
```{R, cv-output-fake, echo = F}
cv_output$results %>%
datatable(
rownames = F,
options = list(dom = 't'),
width = '95%', height = '50%'
) %>% formatRound(columns = names(cv_output$results)[2:7], digits = 3)
```
---
The output from `train()` also contains a lot of additional information, _e.g._,
```{R, cv-output2, eval = F}
cv_output$resample
```
```{R, cv-output2-fake, echo = F}
cv_output$resample %>%
datatable(
rownames = F,
options = list(dom = 't'),
width = '50%', height = '33%'
) %>% formatRound(columns = names(cv_output$resample)[1:3], digits = 3)
```
---
class: middle
Now we .b[assess the performance] of our chosen/tuned model.
1. Record the .hi-purple[CV-based MSE] (already calculated by `train()`).
2. Since we know the population, we can calculate the .hi-orange[true test MSE].
.hi-orange[(2)] is usually not available to us—you can see how simulation helps.
---
### Assess performance
- Subset to unseen data (individuals not sampled)
- Evaluate prediction performance on new individuals
```{R, sim-cv-assess}
# Subset to unseen data
test_df = pop_df %>% filter(i %>% is_in(i_sampled) %>% not())
# Make predictions on the unseen individuals
predictions = predict(
cv_output,
newdata = test_df
)
# Calculate RMSE among the unseen individuals
rmse_true = mean((test_df$y - predictions)^2) %>% sqrt()
rmse_true
```
---
Comparing .hi-purple[CV RMSE] to .hi-orange[true test RMSE]
```{R, sim-assess-plot, echo = F}
# Plot the true test MSE against the estimated MSE
ggplot(cv_output) +
geom_segment(
data = cv_output$results %>% filter(RMSE == min(RMSE)),
aes(x = k, xend = k, y = RMSE, yend = rmse_true),
color = purple,
linetype = "longdash",
size = 0.3
) +
geom_point(
data = cv_output$results %>% filter(RMSE == min(RMSE)),
aes(x = k, y = RMSE),
color = purple,
size = 3
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
geom_hline(yintercept = rmse_true, color = orange) +
scale_y_continuous("RMSE") +
scale_x_continuous("K (# of nearest neighbors)")
```
---
class: middle
Now we basically wrap the last 7 slides into a function and we're set!
---
name: sim-kcv-function
### Function: One iteration of the simulation
```{R, sim-cv-function, eval = F}
# Our function for a single iteration of the simulation
sim_fun = function(iter, n_i = 50, k_folds = 5) {
# Sample individual identifiers
i_sampled = sample.int(N, size = n_i)
# Subset to the sampled individuals
sample_df = pop_df %>% filter(i %in% i_sampled)
# k-fold CV
cv_output = cv_function(k_folds = k_folds)
# Find the estimated MSE
mse_est = mean(cv_output$resample$RMSE^2)
# Subset to unseen data
test_df = pop_df %>% filter(i %>% is_in(i_sampled) %>% not())
# Make predictions on true test data
predictions = predict(cv_output, newdata = test_df)
# Calculate MSE among the unseen individuals
mse_true = mean((test_df$y - predictions)^2)
# Output results
data.frame(iter, k = cv_output$bestTune$k, mse_est, mse_true)
}
```
Notice that we can define default values for our function's arguments.
---
### Function: $k$-fold CV of KNN model for a sample
```{R, sim-cv-function-2, eval = F}
# Wrapper function for caret::train()
cv_function = function(k_folds) {
  train(
    # The relationship: y as a function of x
    y ~ .,
    # The method of estimating the model (KNN)
    method = "knn",
    # The training data (which will be used for cross validation)
    data = sample_df %>% select(y, x),
    # Controls for the model training: k-fold cross validation
    trControl = trainControl(method = "cv", number = k_folds),
    # Allow cross validation to choose best value of K (# nearest neighbors)
    tuneGrid = expand.grid(k = seq(1, 200, by = 10))
  )
}
```
.it[Note] This function must be defined before `sim_fun()` (which calls it) is run.
---
class: middle
Now run our one-iteration function `sim_fun()` for many iterations!
---
### Run the simulation!
```{R, sim-cv-run, eval = F}
# Set seed
set.seed(123)
# Run simulation 1,000 times in parallel (and time)
tic()
sim_df = mclapply(
X = 1:1e3,
FUN = sim_fun,
mc.cores = 11
) %>% bind_rows()
toc()
# Save dataset
write_csv(
x = sim_df,
path = here("cv-sim-knn.csv")
)
```
`tic()` and `toc()` come from the `tictoc` package and help with timing tasks.
`mclapply()` is a parallelized `lapply()` from `parallel` (sorry, no Windows).
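If you are on Windows, one portable sketch uses `future.apply` (loaded above); `multisession` is one backend choice, and `future.seed` handles parallel-safe random numbers:
```{R, sim-cv-run-windows, eval = F}
# Portable alternative to mclapply() (also works on Windows)
plan(multisession, workers = 11)
sim_df = future_lapply(
  X = 1:1e3,
  FUN = sim_fun,
  future.seed = T
) %>% bind_rows()
```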
---
class: middle
How about some results?
---
layout: true
class: clear, middle
---
```{R, load-knn-sim, include = F, cache = T}
# Read results from KNN simulation
knn_sim = here("cv-sim-knn.csv") %>% read_csv()
knn_ind = here("cv-sim-knn-uncorrelated.csv") %>% read_csv()
# Gather to wide
knn_sim_wide = knn_sim %>% gather("mse_type", "value", -iter, -k)
knn_ind_wide = knn_ind %>% gather("mse_type", "value", -iter, -k)
```
.b[With dependence:] distributions of the .hi-orange[true MSE] and the .hi-purple[CV-based MSE].
```{R, plot-sim-knn-1, echo = F}
# Plot resulting distributions of MSE
ggplot(
data = knn_sim_wide,
aes(x = value, fill = mse_type)
) +
geom_density(color = NA, alpha = 0.8) +
geom_hline(yintercept = 0) +
geom_vline(
data = knn_sim_wide %>% group_by(mse_type) %>% summarize(value = mean(value)),
aes(xintercept = value, color = mse_type)
) +
scale_y_continuous("Density") +
scale_x_continuous("MSE") +
scale_fill_manual(
"MSE",
labels = c("Estimated", "True"),
values = c(purple, orange)
) +
scale_color_manual(
"MSE",
labels = c("Estimated", "True"),
values = c(purple, orange)
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
theme(
legend.position = "bottom",
axis.text.y = element_blank()
)
```
---
With .b[independence across observations:] .hi-orange[true MSE] and the .hi-purple[CV-based MSE].
```{R, plot-sim-knn-1-ind, echo = F}
# Plot resulting distributions of MSE
ggplot(
data = knn_ind_wide,
aes(x = value, fill = mse_type)
) +
geom_density(color = NA, alpha = 0.8) +
geom_hline(yintercept = 0) +
geom_vline(
data = knn_ind_wide %>% group_by(mse_type) %>% summarize(value = mean(value)),
aes(xintercept = value, color = mse_type)
) +
scale_y_continuous("Density") +
scale_x_continuous("MSE") +
scale_fill_manual(
"MSE",
labels = c("Estimated", "True"),
values = c(purple, orange)
) +
scale_color_manual(
"MSE",
labels = c("Estimated", "True"),
values = c(purple, orange)
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
theme(
legend.position = "bottom",
axis.text.y = element_blank()
)
```
---
.b[Dependence:] Comparing .b[true MSE] and .b[CV-based MSE] (.pink[45° line])
Tendency to .b[underestimate test MSE] (rather than .grey[overestimate]): `r transmute(knn_sim, p = mse_true > mse_est)$p %>% mean %>% scales::percent()`
```{R, plot-sim-knn-1b, echo = F}
# Plot resulting distributions of MSE
ggplot(
data = knn_sim,
aes(x = (mse_true), y = (mse_est))
) +
geom_point(
aes(color = mse_true > mse_est),
shape = 19, size = 2.5, alpha = 0.5
) +
# geom_smooth(se = F, color = orange, size = 1, method = "lm") +
geom_abline(intercept = 0, slope = 1, color = red_pink, size = 1) +
scale_x_continuous(expression(MSE[True])) +
scale_y_continuous(expression(MSE[CV])) +
scale_color_manual(
values = c("grey70", "grey20")
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
theme(legend.position = "none")
```
---
.b[Independence:] Comparing .b[true MSE] and .b[CV-based MSE] (.pink[45° line])
Less likely (`r transmute(knn_ind, p = mse_true > mse_est)$p %>% mean %>% scales::percent()`) to underestimate true, test MSE
```{R, plot-sim-knn-1b-ind, echo = F}
# Plot resulting distributions of MSE
ggplot(
data = knn_ind,
aes(x = (mse_true), y = (mse_est))
) +
geom_point(
aes(color = mse_true > mse_est),
shape = 19, size = 2.5, alpha = 0.5
) +
# geom_smooth(se = F, color = orange, size = 1, method = "lm") +
geom_abline(intercept = 0, slope = 1, color = red_pink, size = 1) +
scale_x_continuous(expression(MSE[True])) +
scale_y_continuous(expression(MSE[CV])) +
scale_color_manual(
values = c("grey70", "grey20")
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
theme(legend.position = "none")
```
---
.b[Dependence:] The .hi-slate[difference] between the true and CV-estimated MSE.
```{R, plot-sim-knn-2a, echo = F}
# Plot resulting distributions of difference in MSEs
ggplot(
data = knn_sim,
aes(x = mse_true - mse_est)
) +
geom_density(fill = slate, color = NA, alpha = 0.8) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
geom_vline(
xintercept = knn_sim %>% transmute(dif = mse_true - mse_est) %$%dif %>% mean(),
color = slate
) +
scale_y_continuous("Density") +
scale_x_continuous(
expression(MSE[True]*" - MSE"[CV]),
breaks = seq(-1000, 1500, 500)
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
theme(
legend.position = "bottom",
axis.text.y = element_blank()
)
```
---
.b[Independence:] The .hi-slate[difference] between the true and CV-estimated MSE.
```{R, plot-sim-knn-2a-ind, echo = F}
# Plot resulting distributions of difference in MSEs
ggplot(
data = knn_ind,
aes(x = mse_true - mse_est)
) +
geom_density(fill = slate, color = NA, alpha = 0.8) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
geom_vline(
xintercept = knn_ind %>% transmute(dif = mse_true - mse_est) %$%dif %>% mean(),
color = slate
) +
scale_y_continuous("Density") +
scale_x_continuous(
expression(MSE[True]*" - MSE"[CV]),
# breaks = seq(-1000, 1500, 500)
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
theme(
legend.position = "bottom",
axis.text.y = element_blank()
)
```
---
.b[Dependence:] .hi-pink[Percent difference] between the true and CV-estimated MSE.
```{R, plot-sim-knn-2b, echo = F}
# Plot resulting distributions of the percentage difference in MSEs
ggplot(
data = knn_sim,
aes(x = (mse_true - mse_est)/mse_true)
) +
geom_density(fill = red_pink, color = NA, alpha = 0.8) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
geom_vline(
xintercept = knn_sim %>% transmute(dif = (mse_true - mse_est)/mse_true) %$%dif %>% mean(),
color = red_pink
) +
scale_y_continuous("Density") +
scale_x_continuous(
"Pct. error in estimated MSE",
labels = scales::percent
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
theme(
legend.position = "bottom",
axis.text.y = element_blank()
)
```
---
.b[Independence:] .hi-pink[Percent difference] between the true and CV-estimated MSE.
```{R, plot-sim-knn-2b-in, echo = F}
# Plot resulting distributions of the percentage difference in MSEs
ggplot(
data = knn_ind,
aes(x = (mse_true - mse_est)/mse_true)
) +
geom_density(fill = red_pink, color = NA, alpha = 0.8) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
geom_vline(
xintercept = knn_ind %>% transmute(dif = (mse_true - mse_est)/mse_true) %$%dif %>% mean(),
color = red_pink
) +
scale_y_continuous("Density") +
scale_x_continuous(
"Pct. error in estimated MSE",
labels = scales::percent
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
theme(
legend.position = "bottom",
axis.text.y = element_blank()
)
```
---
.b[Dependence: KNN flexibility] (K) and percentage error
```{R, plot-sim-knn-3, echo = F}
# Plot resulting distributions of the percentage difference in MSEs
ggplot(
data = knn_sim,
aes(x = k, y = (mse_true - mse_est)/mse_true)
) +
geom_point(color = red_pink, alpha = 0.8) +
geom_smooth(se = F, method = "loess") +
# geom_hline(yintercept = 0) +
# geom_vline(xintercept = 0) +
scale_x_continuous("K (# of nearest neighbors)") +
scale_y_continuous(
"Pct. error in estimated MSE",
labels = scales::percent
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
theme(
legend.position = "bottom",
# axis.text.y = element_blank()
)
```
---
.b[Independence: KNN flexibility] (K) and percentage error
```{R, plot-sim-knn-3-ind, echo = F}
# Plot resulting distributions of the percentage difference in MSEs
ggplot(
data = knn_ind,
aes(x = k, y = (mse_true - mse_est)/mse_true)
) +
geom_point(color = red_pink, alpha = 0.8) +
geom_smooth(se = F, method = "loess") +
# geom_hline(yintercept = 0) +
# geom_vline(xintercept = 0) +
scale_x_continuous("K (# of nearest neighbors)") +
scale_y_continuous(
"Pct. error in estimated MSE",
labels = scales::percent
) +
theme_minimal(base_size = 20, base_family = "Fira Sans Book") +
theme(
legend.position = "bottom",
# axis.text.y = element_blank()
)
```
---
layout: false
# Table of contents
.col-left[
.smaller[
#### Admin
- [Today and upcoming](#admin)
- [Check in](#check-in)
#### Cross validation
- [Review](#cv-review)
- [Independence](#cv-independence)
#### Simulation
- [Monte Carlo](#sim-mc)
- [Introductory example](#sim-intro)
- [Writing functions](#ex-fun)
]
]
.col-right[
.smaller[
#### $k$-fold CV and dependence
- [Simulation setup](#sim-kcv-intro)
- [Define/build population](#sim-kcv-population)
- [Sample](#sim-kcv-sample)
- [Tune/train the model](#sim-kcv-model)
- [Iteration function](#sim-kcv-function)
]
]