EC524W25: Lab 002
Validation
Lab goals:
- Wrap up 001: Go over the solutions to the problems from last week
- Setup: Setup our data and load packages
- Resampling methods: Introduction to holdout sets and a quick analysis
Getting started
Download the zip file under lab 002 on the course GitHub repository. Unzip the file in your “lab” folder. Click on the .Rproj file to open RStudio.
Setup
First, install pacman, a package manager that will help us install and load packages. Recall: we only need to install packages once. Ex.
install.packages("pacman")
However, we always have to load them in each new R session. We do so by running the following code:
library(pacman)
Sometimes we have to load a bunch of packages at once, which can be wordy. Ex.
# Load ALL packages
library(tidyverse)
library(tidymodels)
library(broom)
library(magrittr)
library(janitor)
library(here)
The pacman package can simplify our code with the p_load function. This function allows us to load multiple packages at once. Further, if a package is not installed, p_load will install it for you. Ex.
# Load ALL packages
pacman::p_load(tidyverse, data.table, broom, magrittr, janitor, skimr)
You can use either method to load packages. The p_load function is a bit more concise and I use it often, but not always.
New packages can be scary. To learn more, read the vignette. Ex.
browseVignettes(package = 'here')
Vignettes are short introductions to a package and its functions. They are great for a quick overview and reference, but not all packages have them.
Randomization seeds
Since we are using some randomization, let’s set a seed so we all get the same results. A random seed is a number used to initialize a pseudorandom number generator. Ex.
# Set seed
set.seed(123)
Setting a seed ensures that the results of our code are reproducible. This matters for debugging and sharing code, so it is good practice to set a seed at the beginning of your script whenever you use randomization.
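For example, re-running the same draw after re-setting the seed reproduces identical numbers. A tiny check, using runif() purely for illustration:
# Same seed, same draw
set.seed(123)
runif(3)
set.seed(123)
runif(3) # identical to the first call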
Data
Load both the train.csv and test.csv data sets. We will be using both.
# Load training data
house_df = read_csv("data/train.csv")
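The test data can be loaded the same way. A minimal sketch, assuming the file sits at data/test.csv and naming the object test_df:
# Load test data (the path and object name here are assumptions)
test_df = read_csv("data/test.csv")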
Double check that everything is loaded correctly. Ex.
# Print first 10 observations
head(house_df, 10)
These data have really crappy column names. Let’s clean them up using the clean_names() function from the janitor package.
# Clean column names
house_df = house_df %>% clean_names()
For today, we only need a subset. Let’s trim down our data set to four columns:
- id: house id
- sale_price: sale price of the house
- age: age of the house at the time of the sale (difference between year sold and year built)
- area: the non-basement square-foot area of the house
We can do this in a few different ways, but dplyr::transmute() is a very convenient function to use here.
# Subset our data and create new columns
house_df %<>% transmute(
  id = id,
  sale_price = sale_price / 10000,
  age = yr_sold - year_built,
  area = gr_liv_area
)
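If transmute() feels unfamiliar, a rough equivalent is to build the columns with mutate() and then keep only the ones we need with select(). It is shown commented out below, since house_df has already been transformed above and no longer contains the original columns:
# Alternative sketch: mutate() the new columns, then select() the four we want
# house_df = house_df %>%
#   mutate(
#     sale_price = sale_price / 10000,
#     age = yr_sold - year_built,
#     area = gr_liv_area
#   ) %>%
#   select(id, sale_price, age, area)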
It’s always a good idea to look at the new dataframe and make sure it is exactly what we expect it to be. Any of the following is a great way to check:
- summary(): provides a quick overview of the data
- glimpse(): provides a more detailed overview of each column
# Helpful, exploratory functions
# summary(house_df)
# glimpse(house_df)
# skimr::skim(house_df)
# View(house_df)
:: notation
When we use :: notation, we are specifying that we want to use a function from a specific package. This is useful when we have functions with the same name in different packages. Ex.
# Use the `skim()` function from the `skimr` package
skimr::skim(house_df)
Sometimes I want to use a single function from a package and don’t want to load the entire package. This is how.
Resampling
Today we are introducing resampling via holdout sets, a very straightforward method for estimating a model's out-of-sample performance. The key concept is that we partition our data into two sets: a training set and a validation set. The training set is used to fit the model, while the validation set is used to evaluate the model’s performance.
Create a validation set
Start by creating a single validation set composed of 30% of our training data. We can draw the validation sample randomly using the dplyr function sample_frac(). The argument size allows us to choose the desired sample of 30%. dplyr’s setdiff() function will give us the remaining, non-validation observations from the original training data.
# Draw validation set
validation_df = house_df %>% sample_frac(size = 0.3)
# Find remaining training set
train_df = setdiff(house_df, validation_df)
Tip. For more info, read the help files with ?sample_frac() and ?setdiff().
Finally, let’s check our work and make sure that train_df + validation_df = house_df.
# Check that dimensions make sense
nrow(house_df) == nrow(validation_df) + nrow(train_df)
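If you want one more sanity check, you can also confirm that no house shows up in both sets. A quick sketch, assuming id uniquely identifies each house:
# No id should appear in both the validation and training sets
sum(validation_df$id %in% train_df$id) == 0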
Model fit
With training and validation sets in hand, we can train a learner on the training set and evaluate its performance on the validation set. We will use a linear regression model to predict sale_price using age and area. We will give the model some flexibility by including polynomial terms for age and area.
We proceed with the following steps (algorithm):
- Train a regression model with various degrees of flexibility on the train_df
- Calculate MSE on the validation_df
- Determine which degree of flexibility minimizes validation MSE
Model specification
We will fit a model of the form:
\[\begin{align*} Price_i = &\beta_0 + \beta_1 * age_i^2 + \beta_2 * age_i + \beta_3 * area_i^2 + \\ &\beta_4 * area_i + \beta_5 * age_i^2 \cdot area_i^2 + \beta_6 * age_i^2 \cdot area_i + \\ & \beta_7 * age_i \cdot area_i^2 + \beta_8 * area_i \cdot age_i \end{align*}\]
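We will not type out all of these terms by hand. In R, interacting two raw polynomials inside a formula generates exactly this set of regressors; for the degree-2 case written above, the call looks roughly like:
# The interaction of two degree-2 raw polynomials reproduces the terms above
lm(sale_price ~ poly(age, 2, raw = T) * poly(area, 2, raw = T), data = train_df)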
Often in programming, we want to automate the process of fitting a model under different specifications. We can do this by defining a function.
We define a function below that fits a model to the training data and returns the validation MSE. The function takes two arguments:
- deg_age
- deg_area
These arguments represent the polynomial degrees for age and area that we want in our model.
# Define function
fit_model = function(deg_age, deg_area) {
  # Estimate the model with training data
  est_model = lm(
    sale_price ~ poly(age, deg_age, raw = T) * poly(area, deg_area, raw = T),
    data = train_df)
  # Make predictions on the validation data
  y_hat = predict(est_model, newdata = validation_df, se.fit = F)
  # Calculate our validation MSE
  mse = mean((validation_df$sale_price - y_hat)^2)
  # Return the output of the function
  return(mse)
}
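Before iterating over every combination, it can help to test the function on a single specification and check that it returns a single number. For example:
# Validation MSE for degree-2 polynomials in both age and area
fit_model(deg_age = 2, deg_area = 2)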
First, let’s create a dataframe with 2 columns and 4 × 6 = 24 rows using the expand_grid() function, one row per combination of degrees. We will attach each model fit’s MSE as an additional column.
# Take all possible combinations of our degrees
deg_df = expand_grid(deg_age = 1:6, deg_area = 1:4)
Now let’s iterate over all possible combinations (4 × 6 = 24) of polynomial specifications and see which model fit produces the smallest validation MSE.
# Iterate over set of possibilities (returns a vector of validation-set MSEs)
mse_v = mapply(
  FUN = fit_model,
  deg_age = deg_df$deg_age,
  deg_area = deg_df$deg_area
)
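If you prefer staying inside the tidyverse, purrr’s map2_dbl() does the same job as mapply() here; a sketch of the equivalent call (commented out, since it simply repeats the computation above):
# Alternative iteration with purrr (loaded with the tidyverse)
# mse_v = purrr::map2_dbl(deg_df$deg_age, deg_df$deg_area, fit_model)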
Now that we have a length-24 vector of validation MSEs, one for each polynomial combination, let’s attach it as an additional column to the deg_df dataframe we created a moment ago and arrange the rows from smallest to largest MSE.
# Add validation-set MSEs to 'deg_df'
deg_df$mse = mse_v
# Which set of parameters minimizes validation-set MSE?
arrange(deg_df, mse)
# A tibble: 24 × 3
deg_age deg_area mse
<int> <int> <dbl>
1 4 2 15.8
2 6 2 16.1
3 4 3 18.0
4 3 2 18.3
5 3 3 18.4
6 6 4 19.1
7 5 2 19.4
8 2 3 19.9
9 3 4 20.5
10 6 1 21.0
# ℹ 14 more rows
Now let’s turn this table into a more visually appealing plot. I used ggplot2 to make the following heat map. Recall: lower MSE is better.
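The figure itself is not reproduced here, but a minimal ggplot2 sketch along these lines produces a similar heat map:
# Heat map of validation-set MSE across polynomial degrees
ggplot(deg_df, aes(x = deg_age, y = deg_area, fill = mse)) +
  geom_tile() +
  labs(x = "Degree of age polynomial", y = "Degree of area polynomial", fill = "Validation MSE")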
Next week
Next week, we build on these concepts, moving into LOOCV and k-fold cross validation.
General tip: Get started on projects early. Run into problems sooner rather than later so you can get help from me, Ed, or your classmates.
Have a great weekend!!
Acknowledgements
This document builds upon notes created by Stephen Reed, a previous GE for this course, which you can find here.