Abstract
Ridge regression and the lasso are two popular regularization techniques for linear regression. In this document, we explore the intuition behind these methods in a simplified setting where the number of observations equals the number of predictors, as in Hastie et al. (2015). We show that ridge regression shrinks all coefficients by the same proportion, while the lasso can shrink some coefficients to exactly zero, effectively performing feature selection.
To better understand the behavior of ridge regression and the lasso, let’s consider a simplified case where the number of observations (\(n\)) equals the number of predictors (\(p\)), and the design matrix (\(\mathbf{X}\)) is the identity matrix: 1’s on the diagonal and 0’s in all off-diagonal elements. Additionally, assume that we are performing regression without an intercept.
Let’s set up this scenario. First, we define the dimensions of the data, \(n = p = 11\).
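In R, this might be:
n <- p <- 11 # number of observations and number of predictors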
Next, we create the design matrix \(\mathbf{X}\) as a diagonal matrix.
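Since every diagonal entry is 1, this is just the identity matrix; for example:
X <- diag(n) # n x n identity matrix
X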
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## [1,] 1 0 0 0 0 0 0 0 0 0 0
## [2,] 0 1 0 0 0 0 0 0 0 0 0
## [3,] 0 0 1 0 0 0 0 0 0 0 0
## [4,] 0 0 0 1 0 0 0 0 0 0 0
## [5,] 0 0 0 0 1 0 0 0 0 0 0
## [6,] 0 0 0 0 0 1 0 0 0 0 0
## [7,] 0 0 0 0 0 0 1 0 0 0 0
## [8,] 0 0 0 0 0 0 0 1 0 0 0
## [9,] 0 0 0 0 0 0 0 0 1 0 0
## [10,] 0 0 0 0 0 0 0 0 0 1 0
## [11,] 0 0 0 0 0 0 0 0 0 0 1
In this setting, the response vector (\(\mathbf{y}\)) is simply the sequence of integers from \(-5\) to \(5\).
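In R:
y <- -5:5 # integers from -5 to 5
y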
## [1] -5 -4 -3 -2 -1 0 1 2 3 4 5
In this simplified setting, the least squares problem reduces to finding the coefficients \(\beta_1, \ldots, \beta_p\) that minimize the sum of squared residuals:
\[\sum_{j=1}^p(y_j - \beta_j)^2\]
where \(y_j\) is the \(j\)th response value.
The least squares solution in this case is simply:
\[\hat{\beta}_j = y_j\]
meaning that each coefficient estimate is equal to its corresponding response value. Let’s calculate the OLS estimates:
ols_model <- lm(y ~ 0 + X) # fit OLS model without intercept
beta_ols <- as.vector(coef(ols_model))
beta_ols
## [1] -5 -4 -3 -2 -1 0 1 2 3 4 5
Note that the fit here is perfect, i.e., \(R^2 = 1\). This is expected: with \(\mathbf{X}\) equal to the identity matrix and \(n = p\), the fitted values reproduce \(\mathbf{y}\) exactly.
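We can confirm this directly from the fitted model object:
summary(ols_model)$r.squared # equal to 1: every residual is zero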
Ridge regression, in this context, involves finding the coefficients \(\beta_1, \ldots, \beta_p\) that minimize:
\[\sum_{j=1}^p(y_j - \beta_j)^2 + \lambda \sum_{j=1}^p \beta_j^2\]
where \(\lambda\) is the regularization parameter.
The ridge regression estimates in this case are given by:
library(glmnet) # penalized regression (ridge and lasso)
lambda_ridge <- 0.1
ridge_model <- glmnet(X, y, alpha = 0, lambda = lambda_ridge, standardize = FALSE, intercept = FALSE) # alpha = 0: ridge penalty
beta_ridge <- as.vector(coef(ridge_model))[-1] # drop the (zero) intercept
round(beta_ridge, 1)
## [1] -3.7 -3.0 -2.2 -1.5 -0.7 0.0 0.7 1.5 2.2 3.0 3.7
Ridge regression shrinks every least squares coefficient toward zero by the same proportion, regardless of the value of the response. This is a key property of ridge regression: it reduces the variance of the estimates and can improve the model’s generalization performance.
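Since the shrinkage is proportional, the ratio of the ridge estimates to the OLS estimates should be (roughly) constant. A quick check, skipping the coefficient whose OLS estimate is zero:
nonzero <- beta_ols != 0
round(beta_ridge[nonzero] / beta_ols[nonzero], 2) # an approximately constant shrinkage factor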
The lasso involves finding the coefficients that minimize:
\[\sum_{j=1}^p(y_j - \beta_j)^2 + \lambda \sum_{j=1}^p|\beta_j|\]
where \(|\beta_j|\) denotes the absolute value of \(\beta_j\).
The lasso estimates in this case are given by:
lambda_lasso <- 0.1
lasso_model <- glmnet(X, y, alpha = 1, lambda = lambda_lasso, standardize = FALSE, intercept = FALSE) # alpha = 1: lasso penalty
beta_lasso <- as.vector(coef(lasso_model))[-1] # drop the (zero) intercept
round(beta_lasso, 1)
## [1] -3.9 -2.9 -1.9 -0.9 0.0 0.0 0.0 0.9 1.9 2.9 3.9
The lasso estimates can shrink some coefficients to exactly zero, effectively performing feature selection. This makes the lasso particularly useful when dealing with high-dimensional data or when a sparse model is desired.
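Because the lasso produces exact zeros (and glmnet stores them as such), we can inspect which coefficients were dropped:
which(beta_lasso == 0) # indices of the coefficients shrunk exactly to zero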
Formally, in this setting, the ridge regression estimates take the form:
\[\hat{\beta}_j^R = \frac{y_j}{1 + \lambda}\]
meaning that every least squares estimate is shrunk toward zero by the same factor, \(1/(1+\lambda)\). (The lambda argument passed to glmnet is on a different scale than the \(\lambda\) in this formula, since glmnet scales the squared-error loss by \(1/(2n)\) and applies its own internal standardization, so the ridge estimates printed above are not literally \(y_j/(1+0.1)\); the proportional pattern of shrinkage is what matters.)
The lasso estimates, however, take the form:
\[ \hat{\beta}_j^L = \begin{cases} y_j - \lambda/2 & \text{if } y_j > \lambda/2 \\ y_j + \lambda/2 & \text{if } y_j < -\lambda/2 \\ 0 & \text{if } |y_j| \leq \lambda/2 \end{cases} \]
This means that the lasso shrinks each coefficient towards zero by a constant amount (\(\lambda/2\)), and coefficients with absolute values less than or equal to \(\lambda/2\) are shrunk to exactly zero. This type of shrinkage is known as soft-thresholding.
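As a sketch, we can implement both closed-form solutions directly. The \(\lambda\) below is purely illustrative and, as noted above, is not on the same scale as the lambda argument passed to glmnet:
ridge_closed_form <- function(y, lambda) y / (1 + lambda) # proportional shrinkage
soft_threshold <- function(y, lambda) sign(y) * pmax(abs(y) - lambda / 2, 0) # lasso soft-thresholding
lambda_demo <- 2
round(ridge_closed_form(y, lambda_demo), 2)
round(soft_threshold(y, lambda_demo), 2)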
Let’s visualize the behavior of ridge regression and the lasso in this simplified setting.
library(tibble)  # for tibble()
library(ggplot2) # for the plots

df_lasso <- tibble(
  y = y, beta_ols = beta_ols, beta_lasso = beta_lasso
)

p_lasso <-
  df_lasso |>
  ggplot(aes(x = y, y = beta_ols)) +
  geom_hline(yintercept = 0) +
  geom_line(linetype = "dashed", color = "grey") + # OLS estimates (45-degree reference)
  geom_line(aes(y = beta_lasso), color = "red") +  # lasso estimates
  theme_light() +
  labs(
    title = "lasso",
    x = "y",
    y = "beta"
  )
df_ridge <- tibble(
  y = y, beta_ols = beta_ols, beta_ridge = beta_ridge
)

p_ridge <-
  df_ridge |>
  ggplot(aes(x = y, y = beta_ols)) +
  geom_hline(yintercept = 0) +
  geom_line(linetype = "dashed", color = "grey") + # OLS estimates (45-degree reference)
  geom_line(aes(y = beta_ridge), color = "red") +  # ridge estimates
  theme_light() +
  labs(
    title = "ridge",
    x = "y",
    y = "beta"
  )
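One way to display the two panels side by side, ridge on the left and the lasso on the right (the patchwork package is assumed here):
library(patchwork) # assumed here; combines ggplot objects with +
p_ridge + p_lasso # ridge on the left, lasso on the right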
The key difference between ridge regression and the lasso is how they shrink the coefficients: ridge regression shrinks all coefficients by the same proportion (left panel, in red), while the lasso shrinks each coefficient toward zero by a constant amount and sets the smallest ones exactly to zero, effectively performing feature selection (right panel, in red).
Another approach to feature selection is subset selection, where we consider all possible subsets of predictors and select the best subset based on some criterion. For comparison, let’s perform subset selection using the regsubsets function from the leaps package. We limit the number of predictors to 10 to keep the computation feasible.
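A minimal sketch of this step, assuming the columns of \(\mathbf{X}\) are labeled a through k as in the output below:
library(leaps) # for regsubsets()
colnames(X) <- letters[1:11] # name the predictors a through k
subset_model <- regsubsets(X, y, nvmax = 10, intercept = FALSE) # exhaustive search, subsets of size up to 10, no intercept
summary(subset_model)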
## Subset selection object
## 11 Variables
## Forced in Forced out
## a FALSE FALSE
## b FALSE FALSE
## c FALSE FALSE
## d FALSE FALSE
## e FALSE FALSE
## f FALSE FALSE
## g FALSE FALSE
## h FALSE FALSE
## i FALSE FALSE
## j FALSE FALSE
## k FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
## a b c d e f g h i j k
## 1 ( 1 ) "*" " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) "*" "*" " " " " " " " " " " " " " " "*" "*"
## 5 ( 1 ) "*" "*" "*" " " " " " " " " " " " " "*" "*"
## 6 ( 1 ) "*" "*" "*" " " " " " " " " " " "*" "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" " " " " " " " " "*" "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" " " " " " " "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*"
As the output shows, for each subset size the best subset selection method picks the subset of predictors that minimizes the residual sum of squares (RSS), and the search is exhaustive, i.e., all possible subsets of each size are considered. In this simplified setting, the RSS of a subset is the sum of the squared responses that are left out, so the best subset of size \(k\) keeps the \(k\) predictors whose responses are largest in absolute value; this is why the selected subsets are nested and the predictor with \(y_j = 0\) (column f) is never selected.
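One quick way to check this ordering, using the a–k labels from above:
letters[order(abs(y), decreasing = TRUE)] # predictors ordered by decreasing |y_j|
Ties in \(|y_j|\) (for example, \(-5\) and \(5\)) can be broken in either order, which is why either member of such a pair may enter the selected subset first.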
Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical learning with sparsity: The lasso and generalizations. CRC Press.