class: center, middle, inverse, title-slide

# Lab .mono[002]
## Cross validation and simulation
### Edward Rubin
### 24 January 2020

---
exclude: true

---
layout: true
# Admin

---
class: inverse, middle

---
name: admin

.b[In lab today]

- Check in
- Cross validation and parameter tuning (with `caret`)
- Simulation

.b.it[Upcoming]

- Keep up with readings (.it[ISL] 3–4, 6)
- Assignment will be posted soon

---
name: check-in
layout: false
# Check in

## Some questions

(Be honest)

1. How is .b[this class] going?
  - Are there areas/topics you would like more coverage/review?
  - How is the speed?
  - How were the assignments?
  - Is there more we or you could be doing?

--

1. How is your .b[quarter] going?

1. How is your .b[program] going?

--

1. .b[Anything] else?

---
layout: true
# Cross validation

---
class: inverse, middle

---
name: cv-review

## Review

.b[Cross validation].super[.pink[†]] (CV) aims to .pink[estimate out-of-sample performance]

.footnote[
.pink[†] Plus other resampling (and specifically hold-out) methods
]

1. .hi-pink[efficiently], using the full dataset to both .it[train] and .it[test] (validate)
1. .hi-pink[without overfitting], separating observations used to .it[train] or .it[test]

--

.b[CV] (_e.g._, LOOCV and `\(k\)`-fold) aids with .purple[several tasks]

- .hi-purple[model selection]: choosing key parameters for a model's flexibility
<br>_e.g._, .it[K] in KNN, polynomials and interactions for regression (.it[AKA] tuning)

- .hi-purple[model assessment]: determining how well the model performs
<br>_e.g._, estimating the true test MSE, MAE, or error rate

---
layout: true
# Cross validation
## `\(k\)`-fold cross validation, review

1. .b[Divide] training data into `\(k\)` folds (mutually exclusive groups)
1. .it[For each fold] `\(i\)`:
  - .b[Train] your model on all observations, excluding members of `\(i\)`
  - .b[Test] and .b[assess] model.sub[i] on the members of fold `\(i\)`, _e.g._, MSE.sub[i]
1. .b[Estimate test performance] by averaging across folds, _e.g._, Avg(MSE.sub[i])

---

--

<img src="002-slides_files/figure-html/plot-cvk-0b-1.svg" style="display: block; margin: auto;" />

Each fold takes a turn at .hi-slate[validation]. The other `\(k-1\)` folds .hi-purple[train].

---

<img src="002-slides_files/figure-html/plot-cvk-1-1.svg" style="display: block; margin: auto;" />

For `\(k=5\)`, fold number `\(1\)` as the .hi-slate[validation set] produces MSE.sub[k=1].

---

<img src="002-slides_files/figure-html/plot-cvk-2-1.svg" style="display: block; margin: auto;" />

For `\(k=5\)`, fold number `\(2\)` as the .hi-slate[validation set] produces MSE.sub[k=2].

---

<img src="002-slides_files/figure-html/plot-cvk-3-1.svg" style="display: block; margin: auto;" />

For `\(k=5\)`, fold number `\(3\)` as the .hi-slate[validation set] produces MSE.sub[k=3].

---

<img src="002-slides_files/figure-html/plot-cvk-4-1.svg" style="display: block; margin: auto;" />

For `\(k=5\)`, fold number `\(4\)` as the .hi-slate[validation set] produces MSE.sub[k=4].

---

<img src="002-slides_files/figure-html/plot-cvk-5-1.svg" style="display: block; margin: auto;" />

For `\(k=5\)`, fold number `\(5\)` as the .hi-slate[validation set] produces MSE.sub[k=5].

---
layout: true
# Cross validation

---
name: cv-independence

## Independence

.b[Resampling methods] assume something similar to independence: our resampling must match the original sampling procedure.

--

If observations are "linked" but we resample independently, CV may break.
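--

_E.g._, a minimal sketch (with a hypothetical data frame `df`) of how standard `\(k\)`-fold CV assigns folds: at the .it[row] level, blind to any links between rows.

```r
# Hypothetical sketch: randomly assign each row of 'df' to one of 5 folds,
# ignoring that some rows may belong to the same individual
df$fold = sample(rep(1:5, length.out = nrow(df)))
```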
--

If we have .hi-pink[repeated observations] on individuals `\(i\)` through time `\(t\)`:

- It's pretty likely `\(y_{i,t}\)` and `\(y_{i,t+1}\)` are related—and maybe `\(y_{i,t+\ell}\)`.
- The initial sample may draw individuals `\(i\)`, but .it[standard] CV ignores time.

--

In other cases, .hi-pink[some individuals are linked with other individuals], *e.g.*,

- `\(y_{i,t}\)` and `\(y_{j,t}\)` may be correlated if `\(i\)` and `\(j\)` live together
- Also: `\(y_{i,t}\)` and `\(y_{j,t+\ell}\)` could be correlated

---

## Independence

In other words: Spatial or temporal .b[dependence] between observations .b[breaks the separation between training and testing samples].

--

Breaking this train-test separation leads us back to

- .b[Overfitting] the sample—since training and testing samples are linked
- .b[Overestimating model performance]—the estimated test MSE will be more of a training MSE

--

.b[Solutions] to this problem involve .hi-pink[matching the resampling process] to the .hi-pink[original sampling and underlying dependence].

.ex[Examples:] Spatial `\(k\)`-fold CV (SKCV) and blocked time-series CV

---

## Dependence

.qa[Q] So how big of a deal is this type of dependence?

--

.qa[A] Let's see! (Sounds like it's time for a simulation.)

---
layout: true
# Simulations

---
class: inverse, middle

---
name: sim-mc

## Monte Carlo

.b[Monte Carlo simulation].super[.pink[†]] allows us to model probabilistic questions.

.footnote[
.pink[†] Also called Monte Carlo methods, experiments, *etc.*
]

1. .b[Generate a population] defined by some data-generating process (DGP), a model of how your fake/simulated world works.

--

1. Repeatedly .b[draw samples] `\(s\)` from your population. .it[For each] `\(s\)`:
  - Apply the relevant methods, sampling techniques, estimators, *etc.*
  - Record your outcome(s) of interest (_e.g._, MSE or error rate), `\(O_s\)`

--

1. Use the .b[distribution] of the `\(O_s\)` to .it[learn] about your methods, sampling techniques, and/or estimators—the mean, bias, variability, *etc.*

--

.note[Ex.sub[1]] We can change the DGP in (1) to see how performance in (3) changes.
<br>.note[Ex.sub[2]] Holding the DGP (1) constant, we can compare competing methods (2).

---
layout: false
class: clear, middle

Sound familiar? Monte Carlo is closely related to resampling methods.

---
layout: true
# Monte Carlo

---
name: sim-intro

## Introductory example: Define the question

It's always helpful to clearly define the question you want to answer, _e.g._:

.b[Example question:] Is OLS unbiased when `\(\varepsilon_i\sim\)` Uniform(0, 1)?

--

Now we know our goal: Determine unbiasedness (the mean of the estimator's distribution).

---

## Introductory example: DGP and population

We'll use the DGP `\(y_i = 3 + 6 x_i + \varepsilon_i\)`,
<br>where `\(\varepsilon_i\sim\)` Uniform(0, 1), and `\(x_i\sim\)` N(0, 1).

--

```r
# Set seed
set.seed(123)
# Define population size
pop_size = 1e4
# Generate population
ols_pop = tibble(
  # Generate disturbances: Using Uniform(0,1)
  e = runif(pop_size, min = 0, max = 1),
  # Generate x: Using N(0,1)
  x = rnorm(pop_size),
  # Calculate y
  y = 3 + 6 * x + e
)
```

---
name: ex-fun

## Aside: Writing custom functions

In R you can create your own functions with the `function()` function.

.col-left[

You define the arguments.

```r
# Our function: Take the product
multiply = function(a, b, c) {
  a * b * c
}
# Try the function
multiply(a = 1, b = 2, c = 3)
```

```
#> [1] 6
```

]

--

.col-right[

You can define default values too.
```r
# Our function: Take the product
multiply = function(a, b, c = 3) {
  a * b * c
}
# Try the function
multiply(a = 1, b = 2)
```

```
#> [1] 6
```

]

--

.left95[.note[Note] Functions return the last .it[unassigned] object.]

---

## Aside: Writing custom functions

Typically you will have a slightly more complex function, _e.g._,

.col-left[

```r
super_fancy = function(a, b, c = 3) {
  # Test if 'c' == 3
  if (c == 3) {
    # If yes: divide 'a' by 'b'
    step1 = a / b
  } else {
    # If no: multiply 'a' by 'b'
    step1 = a * b
  }
  # Now multiply the result by 7
  step2 = step1 * 7
  # And now multiply by zero
  step2 * 0
}
```

]

--

.col-right[

```r
# Try our function
super_fancy(a = 12, b = 7, c = 4)
```

```
#> [1] 0
```

]

---

## Introductory example: A single iteration

Now we want to write a function that

1. .b[samples] from the population `ols_pop`
2. .b[estimates] an OLS regression on the sample
3. .b[records] the OLS estimates

--

```r
ols_function = function(iter, sample_size) {
  # Draw a sample of size 'sample_size'
  ols_sample = sample_n(ols_pop, size = sample_size)
  # Estimate OLS
  ols_est = lm(y ~ x, data = ols_sample)
  # Return coefficients and iteration number
  data.frame(b0 = coef(ols_est)[1], b1 = coef(ols_est)[2], iter)
}
```

---

## Introductory example: Run the simulation

For the simulation, we run our single-iteration function `ols_function()`
<br>.b[a bunch] of times (like 10,000) and then collect the results..super[.pink[†]]

.footnote[
.pink[†] The `apply()` family and parallelization are both key here (`future.apply` combines them).
]

```r
# Set up parallelization (with 12 cores)
# Warning: Can take a little time to run (especially without 12 cores)
plan(multiprocess, workers = 12)
# Set seed
set.seed(12345)
# Run simulation
ols_sim = future_lapply(
  X = 1:1e4,
  FUN = ols_function,
  sample_size = 50,
  future.seed = T
) %>% bind_rows()
```

---

## Introductory example: Results

Now we're ready to summarize the results.

We wanted to know .b[if OLS is still unbiased]. Thus, plotting the .b[distributions of estimates] for `\(\beta_0\)` and `\(\beta_1\)` will be of interest—.b[especially the means] of these distributions.

.note[Recall:] If an estimator is unbiased, then the mean of its distribution should line up with the parameter it is attempting to estimate.

---
layout: true
class: clear, middle

---

.b[Considering OLS's unbiasedness:] The distribution of `\(\hat{\beta}_0\)`

<img src="002-slides_files/figure-html/plot-ex-b0-1.svg" style="display: block; margin: auto;" />

---

.b[Considering OLS's unbiasedness:] The distribution of `\(\hat{\beta}_1\)`

<img src="002-slides_files/figure-html/plot-ex-b1-1.svg" style="display: block; margin: auto;" />

---
layout: false
# Monte Carlo
## Introductory example: Conclusion

When our disturbances are distributed as Uniform(0,1)

1. OLS is biased for `\(\beta_0\)` (the intercept)
1. OLS is still unbiased for `\(\beta_1\)` (the slope)

... and we were able to learn all of this by simulation.

---
class: clear, middle

Now back to our question on `\(k\)`-fold cross validation and interdependence.

---
layout: true
# Simulation: `\(k\)`-fold CV and dependence

---
name: sim-kcv-intro

## Our question

Let's write our previous question in a way we can try to answer it.

--

.b[Question:] How does correlation between observations affect the performance of `\(k\)`-fold cross validation?

---
layout: true
# Simulation: `\(k\)`-fold CV and dependence
## Our question

Let's write our previous question in a way we can try to answer it.
.b[Question:] How does .hi-purple[correlation between observations] affect the .hi-orange[performance] of `\(\color{#e64173}{k}\)`.hi-pink[-fold cross validation]?

---

We need to define some terms.

---

.hi-purple.it[correlation between observations]

- Observations can correlate in many ways. Let's keep it simple: we will simulate repeated observations (through time, `\(t\)`) on individuals, `\(i\)`.
- .hi-purple[DGP:] Continuous `\(y_{i,t}\)` has a predictor `\(x_{i,t}\)` and two random disturbances

$$
`\begin{align}
  y_{i,t} &= 3 x_{i,t} - 2 x_{i,t}^2 + 0.1 x_{i,t}^4 + \gamma_i + \varepsilon_{i,t} \\
  x_{i,t} &= 0.9 x_{i,t-1} + \eta_{i,t} \\
  \varepsilon_{i,t} &= 0.9 \varepsilon_{i,t-1} + \nu_{i,t}
\end{align}`
$$

---

.hi-orange.it[performance]

- We will focus on .hi-orange[MSE] for observations the model never saw.
- .it[Note:] .it[En route], we will meet .it[root] mean squared error (RMSE).

$$
`\begin{align}
  \text{RMSE}=\sqrt{\text{MSE}}
\end{align}`
$$

---

`\(\color{#e64173}{k}\)`.hi-pink[-fold cross validation]

- We'll stick with 5-fold cross validation.
- Our answer shouldn't change too much based upon `\(k\)`.
- Feel free to experiment later...

---
layout: true
class: clear

---
name: sim-kcv-population

### Set up population

- Set seed
- Define number of individuals `N` and time periods `T`

```r
# Set seed
set.seed(12345)
# Size of the population
N = 1000
# Number of time periods per individual
T = 50
# Create the indices of our population
pop_df = expand_grid(
  i = 1:N,
  t = 1:T
)
```

---

### Build population

- Generate temporally correlated disturbances and predictor `\((x_{i,t})\)`
- Calculate outcome variable `\(y_{i,t}\)`

```r
# Predictor 'x' and disturbance 'e_it': AR(1) within individual (not across individuals)
pop_df %<>% group_by(i) %>% mutate(
  x = arima.sim(model = list(ar = c(0.9)), n = T) %>% as.numeric(),
  e_it = arima.sim(model = list(ar = c(0.9)), n = T) %>% as.numeric()
) %>% ungroup()
# Disturbance for (i): Constant within individual
pop_df %<>% group_by(i) %>% mutate(e_i = runif(n = 1, min = -100, max = 100)) %>% ungroup()
# Calculate 'y'; drop disturbances
pop_df %<>% mutate(
  y = e_i + 3 * x - 2 * x^2 + 0.1 * x^4 + e_it
) %>% select(i, t, y, x)
```

---
class: clear

Notice the .b[correlation within individual] across time.

<img src="002-slides_files/figure-html/plot-sim-cv-i-1.svg" style="display: block; margin: auto;" />

---
class: clear

Viewing the correlation in `\(x_{i,t}\)` and `\(y_{i,t}\)`.

<img src="002-slides_files/figure-html/plot-sim-cv-it-1.svg" style="display: block; margin: auto;" />

---
class: middle

.b[Next steps:] Write out a single iteration (to become a function).

1. Draw a sample.
1. Estimate a model: We're going to use KNN regression.
1. Tune our model: We will use 5-fold CV to choose 'K'.
1. Assess the model's performance (CV and in the population).

---
name: sim-kcv-sample

### Draw a sample

```r
# Define sample size (will be an input of our function)
n_i = 50
# Draw the sampled individuals
i_sampled = sample.int(N, size = n_i)
# Subset to the sampled individuals
sample_df = pop_df %>% filter(i %in% i_sampled)
```

`sample.int(n, size)` draws `size` integers between 1 and `n`.

Note that we are .b[sampling individuals] (`i`).
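--

.note[Aside] A hypothetical, naive alternative would sample .it[rows] rather than individuals, scattering each individual's time series across samples and recreating the dependence problem from earlier.

```r
# Hypothetical naive alternative: sample rows, ignoring individuals
naive_df = pop_df %>% sample_n(size = n_i * T)
```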
---
name: sim-kcv-model

### Tune and estimate a model

```r
# Define number of folds
k_folds = 5
# k-fold CV
cv_output = train(
  # The relationship: y as a function of x
  y ~ .,
  # The method of estimating the model (KNN)
  method = "knn",
  # The training data (which will be used for cross validation)
  data = sample_df %>% select(y, x),
  # Controls for the model training: k-fold cross validation
  trControl = trainControl(method = "cv", number = k_folds),
  # Allow cross validation to choose best value of K (# nearest neighbors)
  tuneGrid = expand.grid(k = seq(1, 500, by = 5))
)
```

`train()` (from the `caret` package) assists in many training/tuning tasks (note the `tuneGrid` argument)—including pre-processing data.

---

You can plot the results from `train()` directly, _e.g._,

```r
ggplot(cv_output) + theme_minimal()
```

<img src="002-slides_files/figure-html/plot-cv-real-1.svg" style="display: block; margin: auto;" />

---

The output from `train()` also contains a lot of additional information, _e.g._,

```r
cv_output$results
```

(one row per candidate value of K: the CV-averaged RMSE, `\(R^2\)`, and MAE, plus their standard deviations across folds)
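---

A quick sketch of pulling the winner out of the `train()` object: `bestTune` holds the chosen K, and `results` holds the CV performance for every candidate.

```r
# The value of K that minimized CV RMSE
cv_output$bestTune
# The CV RMSE of the winning model (averaged across folds)
min(cv_output$results$RMSE)
```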
---

The output from `train()` also contains a lot of additional information, _e.g._,

```r
cv_output$resample
```

(the fold-by-fold performance of the chosen model: RMSE, `\(R^2\)`, and MAE for each held-out fold)
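---

Since `resample` reports each fold's .it[root] MSE, we can back out a CV-based estimate of the MSE by squaring and averaging. We use exactly this step in the simulation function below.

```r
# CV-based estimate of the test MSE: average the squared fold-level RMSEs
mse_est = mean(cv_output$resample$RMSE^2)
```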
---
class: middle

Now we .b[assess the performance] of our chosen/tuned model.

1. Record the .hi-purple[CV-based MSE] (already calculated by `train()`).
2. Since we know the population, we can calculate the .hi-orange[true test MSE].

.hi-orange[(2)] is usually not available to us—you can see how simulation helps.

---

### Assess performance

- Subset to unseen data (individuals not sampled)
- Evaluate prediction performance on new individuals

```r
# Subset to unseen data
test_df = pop_df %>% filter(i %>% is_in(i_sampled) %>% not())
# Make predictions on the unseen individuals
predictions = predict(
  cv_output,
  newdata = test_df
)
# Calculate RMSE in the full population
rmse_true = mean((test_df$y - predictions)^2) %>% sqrt()
rmse_true
```

```
#> [1] 60.98547
```

---

Comparing .hi-purple[CV RMSE] to .hi-orange[true test RMSE]

<img src="002-slides_files/figure-html/sim-assess-plot-1.svg" style="display: block; margin: auto;" />

---
class: middle

Now we basically wrap the last 7 slides into a function and we're set!

---
name: sim-kcv-function

### Function: One iteration of the simulation

```r
# Our function for a single iteration of the simulation
sim_fun = function(iter, n_i = 50, k_folds = 5) {
  # Draw the sampled individuals
  i_sampled = sample.int(N, size = n_i)
  # Subset to the sampled individuals
  sample_df = pop_df %>% filter(i %in% i_sampled)
  # k-fold CV
  cv_output = cv_function(k_folds = k_folds, sample_df = sample_df)
  # Find the estimated MSE
  mse_est = mean(cv_output$resample$RMSE^2)
  # Subset to unseen data
  test_df = pop_df %>% filter(i %>% is_in(i_sampled) %>% not())
  # Make predictions on true test data
  predictions = predict(cv_output, newdata = test_df)
  # Calculate MSE on the unseen population
  mse_true = mean((test_df$y - predictions)^2)
  # Output results
  data.frame(iter, k = cv_output$bestTune$k, mse_est, mse_true)
}
```

Notice that we can define default values for our function's arguments.

---

### Function: `\(k\)`-fold CV of KNN model for a sample

```r
# Wrapper function for caret::train()
cv_function = function(k_folds, sample_df) {
  train(
    # The relationship: y as a function of x
    y ~ .,
    # The method of estimating the model (KNN)
    method = "knn",
    # The training data (which will be used for cross validation)
    data = sample_df %>% select(y, x),
    # Controls for the model training: k-fold cross validation
    trControl = trainControl(method = "cv", number = k_folds),
    # Allow cross validation to choose best value of K (# nearest neighbors)
    tuneGrid = expand.grid(k = seq(1, 200, by = 10))
  )
}
```

.it[Note] This function would need to be defined before we run `sim_fun()`.

---
class: middle

Now run our one-iteration function `sim_fun()` for many iterations!

---

### Run the simulation!

```r
# Set seed
set.seed(123)
# Run simulation 1,000 times in parallel (and time it)
tic()
sim_df = mclapply(
  X = 1:1e3,
  FUN = sim_fun,
  mc.cores = 11
) %>% bind_rows()
toc()
# Save dataset
write_csv(
  x = sim_df,
  path = here("cv-sim-knn.csv")
)
```

`tic()` and `toc()` come from the `tictoc` package and help with timing tasks.
<br>`mclapply()` is a parallelized `lapply()` from `parallel` (sorry, no Windows).

---
class: middle

How about some results?

---
layout: true
class: clear, middle

---

.b[With dependence:] distributions of the .hi-orange[true MSE] and the .hi-purple[CV-based MSE].

<img src="002-slides_files/figure-html/plot-sim-knn-1-1.svg" style="display: block; margin: auto;" />

---

With .b[independence across observations:] .hi-orange[true MSE] and the .hi-purple[CV-based MSE].
<img src="002-slides_files/figure-html/plot-sim-knn-1-ind-1.svg" style="display: block; margin: auto;" /> --- .b[Dependence:] Comparing .b[true MSE] and .b[CV-based MSE] (.pink[45° line]) <br>Tendency to .b[underestimate test MSE] (rather than .grey[overestimate]): 70.2% <img src="002-slides_files/figure-html/plot-sim-knn-1b-1.svg" style="display: block; margin: auto;" /> --- .b[Independence:] Comparing .b[true MSE] and .b[CV-based MSE] (.pink[45° line]) <br>Less likely (53.9%) to underestimate true, test MSE <img src="002-slides_files/figure-html/plot-sim-knn-1b-ind-1.svg" style="display: block; margin: auto;" /> --- .b[Dependence:] The .hi-slate[difference] between the true and CV-estimated MSE. <img src="002-slides_files/figure-html/plot-sim-knn-2a-1.svg" style="display: block; margin: auto;" /> --- .b[Independence:] The .hi-slate[difference] between the true and CV-estimated MSE. <img src="002-slides_files/figure-html/plot-sim-knn-2a-ind-1.svg" style="display: block; margin: auto;" /> --- .b[Dependence:] .hi-pink[Percent difference] between the true and CV-estimated MSE. <img src="002-slides_files/figure-html/plot-sim-knn-2b-1.svg" style="display: block; margin: auto;" /> --- .b[Independence:] .hi-pink[Percent difference] between the true and CV-estimated MSE. <img src="002-slides_files/figure-html/plot-sim-knn-2b-in-1.svg" style="display: block; margin: auto;" /> --- .b[Dependence: KNN flexibility] (K) and percentage error <img src="002-slides_files/figure-html/plot-sim-knn-3-1.svg" style="display: block; margin: auto;" /> --- .b[Independence: KNN flexibility] (K) and percentage error <img src="002-slides_files/figure-html/plot-sim-knn-3-ind-1.svg" style="display: block; margin: auto;" /> --- layout: false # Table of contents .col-left[ .smaller[ #### Admin - [Today and upcoming](#admin) - [Check in](#check-in) #### Cross validation - [Review](#cv-review) - [Independence](#cv-independence) #### Simulation - [Monte Carlo](#sim-mc) - [Introductory example](#sim-intro) - [Writing functions](#ex-fun) ] ] .col-right[ .smaller[ #### `\(k\)`-fold CV and dependence - [Simulation setup](#sim-kcv-intro) - [Define/build population](#sim-kcv-population) - [Sample](#sim-kcv-sample) - [Tune/train the model](#sim-kcv-model) - [Iteration function](#sim-kcv-function) ] ]