class: center, middle, inverse, title-slide

# Data Science for Economics
## Lecture 11: Shrinkage methods
### Dawie van Lill
### 21 February 2022

---

## Packages

I thought slides would be easier to present than notes. The code and slides are on the Github page.

Here are the packages that we will be using for this session.

```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, ggplot2, rsample, caret, glmnet, vip, tidyverse, pdp, AmesHousing)
```

Let us give a quick rundown of some of the potentially new packages.

**rsample**: Tools for data splitting and resampling

**caret**: Machine learning (model training, resampling and tuning)

**glmnet**: Regularised regression (ridge, lasso and elastic net)

---

## Linear regression (one variable)

A quick example of linear regression using the machine learning process.

We want to model the linear relationship between the total above ground living space of a home (`Gr_Liv_Area`) and its sale price (`Sale_Price`).

First, we construct our training sample.

```r
set.seed(123)
ames <- AmesHousing::make_ames()
split <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train <- training(split)
ames_test <- testing(split)
```

Second, we fit the model on the training data.

```r
model1 <- lm(Sale_Price ~ Gr_Liv_Area, data = ames_train)
```

In the next slide we show the results from this regression.

---

<img src="11-regularisation_files/figure-html/regression_1-1.png" width="720" style="display: block; margin: auto;" />

---

<img src="11-regularisation_files/figure-html/regression_2-1.png" width="720" style="display: block; margin: auto;" />

---

## Evaluate results

```r
summary(model1)
```

```
## 
## Call:
## lm(formula = Sale_Price ~ Gr_Liv_Area, data = ames_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -474682  -30794   -1678   23353  328183 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15938.173   3851.853   4.138 3.65e-05 ***
## Gr_Liv_Area   109.667      2.421  45.303  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56790 on 2047 degrees of freedom
## Multiple R-squared:  0.5007, Adjusted R-squared:  0.5004 
## F-statistic:  2052 on 1 and 2047 DF,  p-value: < 2.2e-16
```

---

## Multiple linear regression

You can include more than one predictor, such as above ground square footage and the year the house was built.

```r
model2 <- lm(Sale_Price ~ Gr_Liv_Area + Year_Built, data = ames_train)
```

You can add as many predictors as you want.

```r
model3 <- lm(Sale_Price ~ ., data = ames_train)
```

Let us now test the results from the models to see which one is the most accurate.

---

## Linear regression

```r
set.seed(123)
(cv_model1 <- train(
  form = Sale_Price ~ Gr_Liv_Area,
  data = ames_train,
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
))
```

```
## Linear Regression 
## 
## 2049 samples
##    1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1843, 1844, 1844, 1844, 1844, 1844, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   56644.76  0.510273  38851.99
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
```

---

## Multiple linear regression

```r
set.seed(123)
(cv_model2 <- train(
  form = Sale_Price ~ Gr_Liv_Area + Year_Built,
  data = ames_train,
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
))
```

```
## Linear Regression 
## 
## 2049 samples
##    2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1843, 1844, 1844, 1844, 1844, 1844, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   46865.68  0.6631008  31695.48
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
```
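---

## Comparing the models

The out-of-sample comparison on the next slide is produced with **caret**'s `resamples()` helper, which collects the cross-validation results of several fitted models. The call shown in that output corresponds to the chunk below.

```r
# Gather the 10-fold CV results of both models and summarise them
summary(resamples(list(model1 = cv_model1, model2 = cv_model2)))
```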
---

### Out of sample performance

```
## 
## Call:
## summary.resamples(object = resamples(list(model1 = cv_model1, model2
##  = cv_model2)))
## 
## Models: model1, model2 
## Number of resamples: 10 
## 
## MAE 
##            Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## model1 34076.73 37656.23 39785.18 38851.99 40200.92 42058.68    0
## model2 29227.14 30885.17 32003.59 31695.48 32710.41 33942.26    0
## 
## RMSE 
##            Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## model1 45604.65 55896.58 57000.74 56644.76 59544.08 66198.59    0
## model2 37174.26 42650.00 46869.84 46865.68 51155.14 55780.47    0
## 
## Rsquared 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## model1 0.4230788 0.4621034 0.5090642 0.5102730 0.5681246 0.5996400    0
## model2 0.5829425 0.6075293 0.6865871 0.6631008 0.6976664 0.7254572    0
```

---

## Regularised regression

Data typically have a large number of features. As the number of features grows, models tend to **overfit the data**.

Regularisation methods provide a means to constrain (regularise) the estimated coefficients.

The easiest way to understand regularisation is to see how and why it is applied to OLS.

The objective in OLS is to find the *hyperplane* that minimises the SSE between the observed and predicted response values.

This means identifying the hyperplane that minimises the grey lines in the next slide.

---

<img src="11-regularisation_files/figure-html/regression_3-1.png" width="720" style="display: block; margin: auto;" />

---

## Regularised regression

OLS adheres to some specific assumptions

- Linear relationship
- More observations (n) than features (p) - which means n > p
- No or little multicollinearity

However, for many datasets, such as those involved in text mining, we have more features than observations.

If p > n there are many potential problems

- The model is less interpretable
- There are infinite solutions to the OLS problem

We assume that a small subset of these features exhibits the strongest effects.

This is called the **sparsity principle**

---

## Regularised regression

The objective function of a regularised regression model includes a penalty parameter `\(P\)`.

`$$\min(\text{SSE} + P)$$`

The penalty parameter constrains the size of the coefficients.

Coefficients can only increase if there is a comparable decrease in the SSE (i.e. the loss function).

This generalises to other linear models as well (such as logistic regression).

Three common penalties

1. Ridge
2. Lasso [**think cowboys!**]
3. Elastic net (combination of ridge and lasso)

---

## Ridge regression

Ridge regression controls the coefficients by adding `\(\lambda \sum_{j=1}^{p}\beta^{2}_{j}\)` to the objective function:

`$$\min(\text{SSE} + \lambda \sum_{j=1}^{p}\beta^{2}_{j})$$`

The size of this penalty is referred to as the `\(L^2\)` norm, or Euclidean norm.

The penalty can take on different values depending on the tuning parameter `\(\lambda\)`.

If `\(\lambda = 0\)`, then we have OLS regression. As `\(\lambda \rightarrow \infty\)` the penalty becomes large and forces the coefficients towards zero.

**Important to note**: ridge does not force coefficients all the way to zero, only asymptotically. It does not perform feature selection.

The following slide illustrates how the ridge regression coefficient values behave as the tuning parameter approaches zero.

---

<img src="11-regularisation_files/figure-html/regression_5-1.png" width="720" style="display: block; margin: auto;" />
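---

## Ridge shrinkage: a quick illustration

The sketch below is not part of the Ames analysis; it simply demonstrates the shrinkage behaviour described above on simulated data (the object names `X_sim`, `y_sim` and `ridge_sim` are made up for illustration). Even at the largest `\(\lambda\)` on the path, the ridge coefficients are tiny but not exactly zero.

```r
library(glmnet)

# Simulate a small regression problem with 10 features
set.seed(123)
n <- 100; p <- 10
X_sim <- matrix(rnorm(n * p), n, p)
y_sim <- drop(X_sim %*% runif(p, 1, 3) + rnorm(n))

# alpha = 0 selects the ridge penalty; glmnet fits a whole path of lambda values
ridge_sim <- glmnet(X_sim, y_sim, alpha = 0)

coef(ridge_sim, s = min(ridge_sim$lambda))  # little shrinkage
coef(ridge_sim, s = max(ridge_sim$lambda))  # heavy shrinkage, but no exact zeros
```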
---

## Lasso

Similar to the ridge penalty.

Swap out the `\(L^2\)` norm for the `\(L^{1}\)` norm: `\(\lambda \sum_{j=1}^{p}|\beta_{j}|\)`

`$$\min(\text{SSE} + \lambda \sum_{j=1}^{p}|\beta_{j}|)$$`

The ridge penalty pushes variables to approximately zero; the lasso actually pushes coefficients all the way to zero.

Automated **feature selection**!

Look at the figure on the following slide. Can you see the difference?

---

<img src="11-regularisation_files/figure-html/regression_6-1.png" width="720" style="display: block; margin: auto;" />

---

## Elastic nets

A generalisation of the ridge and lasso penalties:

`$$\min(\text{SSE} + \lambda_1 \sum_{j=1}^{p}\beta^{2}_{j} + \lambda_2 \sum_{j=1}^{p}|\beta_{j}|)$$`

The lasso performs feature selection, **but** when we have two strongly correlated features one may be pushed to zero while the other remains.

The ridge penalty is more effective at handling correlated features.

The elastic net combines bits of both these methods.

If you are interested in these topics I suggest you read more on them in **ISLR**.

Penalty models are easy to implement and the results are surprisingly good.

---

<img src="11-regularisation_files/figure-html/regression_7-1.png" width="720" style="display: block; margin: auto;" />

---

## glmnet

We will be using `glmnet` to conduct our regularised regression analysis.

This package is extremely efficient and fast. It utilises Fortran code.

For those who prefer Python, the `Scikit-Learn` modules offer similar functionality. There are also other R packages available (e.g. `h2o`, `elasticnet` and `penalized`).

We have to do some basic transformations to use this package.

```r
# Create training feature matrices
# we use model.matrix(...)[, -1] to discard the intercept
X <- model.matrix(Sale_Price ~ ., ames_train)[, -1]

# transform y with log transformation
Y <- log(ames_train$Sale_Price)
```

---

## glmnet

Let us become more familiar with the different components of this package.

`alpha` tells **glmnet** which method to implement.

- `alpha` = 0: ridge penalty
- `alpha` = 1: lasso penalty
- 0 < `alpha` < 1: elastic net model

**glmnet** does two things that one needs to be aware of.

1. It standardises the features
2. It fits the model across a wide range of `\(\lambda\)` values (automatically)

---

## Parameter tuning

`\(\lambda\)` is the tuning parameter and it helps prevent overfitting the training data.

To identify the optimal `\(\lambda\)` value we can use `\(k\)`-fold cross validation.

```r
# Apply CV ridge regression to Ames data
ridge <- cv.glmnet(x = X, y = Y, alpha = 0)

# Apply CV lasso regression to Ames data
lasso <- cv.glmnet(x = X, y = Y, alpha = 1)
```

Now let's plot the result...

---

<img src="11-regularisation_files/figure-html/tuning-1.png" width="720" style="display: block; margin: auto;" />

---

```r
# Ridge model
min(ridge$cvm)       # minimum MSE
```

```
## [1] 0.02158662
```

```r
ridge$lambda.min     # lambda for this min MSE
```

```
## [1] 0.1389619
```

```r
# Lasso model
min(lasso$cvm)       # minimum MSE
```

```
## [1] 0.02342113
```

```r
lasso$lambda.min     # lambda for this min MSE
```

```
## [1] 0.003606096
```

```r
lasso$nzero[lasso$lambda == lasso$lambda.min]  # no. of non-zero coefficients at min MSE
```

```
## s47 
## 135
```

---

<img src="11-regularisation_files/figure-html/ridge_lasso-1.png" width="720" style="display: block; margin: auto;" />
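---

## Tuning `alpha` and lambda with caret

The results on the next slide come from a `cv_glmnet` object whose construction is not shown in these slides. Below is a sketch of how such an object can be built with **caret**, grid-searching `alpha` and `\(\lambda\)` jointly with 10-fold cross validation; the pre-processing steps and grid size are assumptions, not the author's exact settings.

```r
# Tune alpha and lambda jointly with 10-fold cross-validation (illustrative settings)
set.seed(123)
cv_glmnet <- train(
  x = X,
  y = Y,
  method = "glmnet",
  preProcess = c("zv", "center", "scale"),
  trControl = trainControl(method = "cv", number = 10),
  tuneLength = 10
)
```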
---

```r
# model with lowest RMSE
cv_glmnet$bestTune
```

```
##   alpha     lambda
## 7   0.1 0.02006835
```

```r
# results for model with lowest RMSE
cv_glmnet$results %>%
  filter(alpha == cv_glmnet$bestTune$alpha, lambda == cv_glmnet$bestTune$lambda)
```

```
##   alpha     lambda      RMSE  Rsquared      MAE     RMSESD RsquaredSD
## 1   0.1 0.02006835 0.1460178 0.8749723 0.086671 0.03174582 0.04965123
##         MAESD
## 1 0.007970299
```

```r
# predict sales price on training data
pred <- predict(cv_glmnet, X)

# compute RMSE of the re-transformed predictions
RMSE(exp(pred), exp(Y))
```

```
## [1] 23281.59
```

```r
# RMSE of multiple linear regression was 26098.00
```

---

<img src="11-regularisation_files/figure-html/vip-1.png" width="720" style="display: block; margin: auto;" />
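---

## Variable importance

The figure on the previous slide ranks features by their importance in the tuned `cv_glmnet` model. The code that produced it is not shown here; a minimal sketch using the **vip** package (loaded at the start of the session) is given below, where the number of features displayed and the point geometry are assumptions.

```r
# Variable importance plot for the tuned glmnet model (illustrative settings)
vip(cv_glmnet, num_features = 20, geom = "point")
```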