class: center, middle, inverse, title-slide .title[ # Big Data and Economics ] .subtitle[ ## Linear Model Selection and Regularization ] .author[ ### Kyle Coombs ] .date[ ### Bates College |
ECON/DCS 368
]

---
name: toc

<style type="text/css">
@media print {
  .has-continuation {
    display: block !important;
  }
}
</style>

# Table of contents

- [Prologue](#prologue)

- [Linear Model Selection](#linear-model-selection)

- [Ridge Regression](#ridge)

- [Least Absolute Shrinkage and Selection Operator (LASSO)](#lasso)

---
name: prologue
class: inverse, center, middle

# Prologue

---
# Regressions

- What do we typically do when we run OLS?

- We run a regression with all the variables we think are important

- But what happens when we have more variables than observations?

---
# Too many variables

- Most of the analysis we have done in this class has focused on the case where we have a small number of variables relative to the number of observations

- But sometimes you have WIDE data

- In this case, you have a large number of variables `\(J\)` relative to the number of observations `\(n\)`

- If you try to use OLS with all the variables, you will run into problems. Why?

--

- The number of variables is larger than the number of observations!

- Uh oh

---
# Example of wide data

```
## # A tibble: 6 × 1,001
##        y     P_1     P_2    P_3    P_4     P_5     P_6     P_7    P_8    P_9
##    <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>
## 1   8.33 -0.560  -0.710   2.20  -0.715 -0.0736 -0.602   1.07   -0.728  0.356
## 2 -63.4  -0.230   0.257   1.31  -0.753 -1.17   -0.994  -0.0273 -1.54  -0.658
## 3  -8.21  1.56   -0.247  -0.265 -0.939 -0.635   1.03   -0.0333 -0.693  0.855
## 4  11.7   0.0705 -0.348   0.543 -1.05  -0.0288  0.751  -1.52    0.119  1.15
## 5  35.9   0.129  -0.952  -0.414 -0.437  0.671  -1.51    0.790  -1.36   0.276
## 6 -41.4   1.72   -0.0450 -0.476  0.331 -1.65   -0.0951 -0.211   0.590  0.144
## # ℹ 991 more variables: P_10 <dbl>, P_11 <dbl>, P_12 <dbl>, P_13 <dbl>,
## #   P_14 <dbl>, P_15 <dbl>, P_16 <dbl>, P_17 <dbl>, P_18 <dbl>, P_19 <dbl>,
## #   P_20 <dbl>, P_21 <dbl>, P_22 <dbl>, P_23 <dbl>, P_24 <dbl>, P_25 <dbl>,
## #   P_26 <dbl>, P_27 <dbl>, P_28 <dbl>, P_29 <dbl>, P_30 <dbl>, P_31 <dbl>,
## #   P_32 <dbl>, P_33 <dbl>, P_34 <dbl>, P_35 <dbl>, P_36 <dbl>, P_37 <dbl>,
## #   P_38 <dbl>, P_39 <dbl>, P_40 <dbl>, P_41 <dbl>, P_42 <dbl>, P_43 <dbl>,
## #   P_44 <dbl>, P_45 <dbl>, P_46 <dbl>, P_47 <dbl>, P_48 <dbl>, P_49 <dbl>, …
```
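---
# Aside: simulating wide data

- The chunk that built `wide_df` isn't shown on the previous slide, so here is a minimal sketch of how data like it *might* be simulated -- the seed and coefficient choices are assumptions for illustration, not the actual generating code

```r
library(tidyverse)

set.seed(368)  # hypothetical seed
n <- 100      # observations
J <- 1000     # predictors: far more variables than observations!

# Standard-normal predictors named P_1, ..., P_1000
X <- matrix(rnorm(n * J), nrow = n,
            dimnames = list(NULL, paste0("P_", 1:J)))

# y is a linear combination of the predictors plus noise
beta <- rnorm(J)
y    <- as.vector(X %*% beta + rnorm(n))

wide_df <- bind_cols(tibble(y = y), as_tibble(X))
dim(wide_df)  # 100 rows, 1,001 columns
```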
---
# What if I run a regression?

It's a mess to include all of the variables

```r
etable(feols(y ~ ..('^P'), data = wide_df))
```

```
## The variables 'P_100', 'P_101' and 899 others have been removed because of collinearity (see $collin.var).
```

```
##                 feols(y ~ ..
## Dependent Var.:            y
##
## Constant         357.7 (NaN)
## P_1              246.0 (NaN)
## P_2             -58.65 (NaN)
## P_3             -21.77 (NaN)
## P_4             -78.74 (NaN)
## P_5              14.93 (NaN)
## P_6              280.5 (NaN)
## P_7             -109.2 (NaN)
## P_8             -181.9 (NaN)
## ... (P_9 through P_99 omitted: every coefficient has a NaN standard error) ...
## _______________ ____________
## S.E. type       NA (not-ava.)
## Observations             100
## R2                         1
## Adj. R2             NaNe+Inf
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

- With 100 observations, fixest drops 901 collinear predictors, fits the remaining 99 (plus a constant) perfectly ( `\(R^2 = 1\)` ), and cannot compute a single standard error

---
# How can we cut down on variables?

- How can we cut down on the number of variables?

- What would be the regression tree approach?

--

- Iteratively split the training data using the variables that minimize the residual sum of squares, then use test data to determine the optimal number of leaves

- This is a form of **variable selection**

- But it forces us to turn continuous data binary ( `\(X < c\)` vs. `\(X \geq c\)` )

- But what other ways are available?

---
class: inverse, center, middle
name: linear-model-selection

# Linear Model Selection

---
# Typical OLS

- Good old-fashioned regression minimizes the residual sum of squares (RSS)

$$
`\begin{aligned}
\min_{\beta} \sum_{i=1}^n \underbrace{(y_i - \beta_0 - \sum_{j=1}^J \beta_j x_{ij})^2}_{\text{RSS}}
\end{aligned}`
$$

- What does that mean?

--

- We are trying to find the `\(\beta\)`s that best predict the dependent variable `\(y\)` as a linear combination of the independent variables `\(x\)`
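---
# Minimizing the RSS by hand

- To make "find the `\(\beta\)`s that minimize the RSS" concrete, here is a minimal sketch on simulated data (the variable names and numbers are made up for illustration): `optim()` numerically searches for the `\(\beta\)`s with the smallest RSS and should land roughly where `lm()` does

```r
set.seed(368)  # hypothetical seed
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 - 1 * x2 + rnorm(n)  # true betas: (2, 3, -1)

# 1) The usual way: lm() solves the minimization problem for us
coef(lm(y ~ x1 + x2))

# 2) By hand: treat the RSS as a function of b = (b0, b1, b2)
#    and minimize it numerically
rss <- function(b) sum((y - b[1] - b[2] * x1 - b[3] * x2)^2)
optim(par = c(0, 0, 0), fn = rss)$par  # ~ (2, 3, -1), matching lm()
```

- Same answer either way: OLS *is* the solution to that minimization problem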
---
# Adding dimensions with OLS

- Each additional variable `\(x_{j}\)` adds a new dimension to the problem

- That is, each additional variable is a new axis in `\(J\)`-dimensional space, where `\(J\)` is the number of variables

- (You've likely never thought about it that way before, but any regression is a multi-dimensional problem)

- If you have more variables than observations, you have more dimensions than observations

- Why is that a problem? Well, solve this equation:

$$ x + y = 5 $$

- How many solutions are there? Infinitely many: one equation cannot pin down two unknowns

- Now solve this system of equations:

$$ x + y = 5 $$
$$ x + 2y = 10 $$

- Now there is exactly one solution ( `\(x = 0, y = 5\)` ): two independent equations pin down two unknowns

- The same logic applies to regression, where observations play the role of equations and coefficients the role of unknowns (though it is a bit more complicated)

---
class: inverse, center, middle
name: ridge

# Ridge Regression

---
# Shrinkage

- In OLS, we are trying to minimize the residual sum of squares (RSS)

- In machine learning, there are shrinkage methods that add a penalty term to the RSS

- These penalize coefficients that are too large

$$
`\begin{aligned}
\min_{\beta} \underbrace{\text{model fit}}_{\text{RSS}} + \text{penalty on size of coefficients}
\end{aligned}`
$$

- Why penalize large coefficients?

- Large coefficients are more likely to be overfitting the data, since they are more sensitive to small changes in the data

- By penalizing large coefficients, we are reducing the complexity of the model, and thus its variance

- Intuitively, the larger a `\(\beta\)`, the further your model is from the null hypothesis of `\(\beta = 0\)`, which is the simplest model

- What happens if we reduce bias in the model?

--

- We increase variance!

---
# Ridge Regression

- So what form do these penalties take?

- Well, Ridge Regression is one such example

- Ridge regression adds a penalty term to the RSS that is proportional to the sum of the squared coefficients

- Essentially, it adds a constraint to the optimization problem

$$
`\begin{aligned}
\min_{\beta} \underbrace{\sum_{i=1}^n (y_i - \beta_0 - \sum_{j=1}^J \beta_j x_{ij})^2}_{\text{model fit}} + \underbrace{\lambda \sum_{j=1}^J \beta_j^2}_{\text{penalty}} = RSS + \lambda \sum_{j=1}^J \beta_j^2
\end{aligned}`
$$

- `\(\lambda\)` is the "tuning parameter" that controls the strength of the penalty

- In order to minimize, we need to find the `\(\beta\)`s that minimize both the RSS and the penalty

- That means we need smaller `\(\beta\)`s -- and necessarily a simpler, less variable model

- Literally, we shrink the `\(\beta\)`s towards zero

---
# Ridge Regression

![Ridge vs. OLS](img/ols_vs_ridge.png)

Example taken from [Dr. Samuel E. Jackson's online textbook](https://bookdown.org/ssjackson300/Machine-Learning-Lecture-Notes/ridge-regression.html#minimisation)

---
# Ridge Regression coefficients

![Ridge vs. LASSO](img/lasso_v_ridge_coefficient.jpg)

---
# Ridge Regression flaws

- Ridge regression keeps all the variables in the model -- it just shrinks the coefficients

- But what if some variables are truly just noise, i.e. they are not correlated with the dependent variable?

- Sure, we can check by hand, but shouldn't we just toss them?
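---
# Ridge in R with glmnet

- Before turning to an alternative, here is a minimal sketch of Ridge on the wide `wide_df` data using the **glmnet** library (which comes up again in a few slides); the penalty value `\(\lambda = 1\)` is an arbitrary choice for illustration

```r
library(glmnet)

# glmnet wants a numeric matrix of predictors and a response vector
X <- as.matrix(wide_df[, -1])  # the P_1, ..., P_1000 columns
y <- wide_df$y

# alpha = 0 selects the Ridge penalty (alpha = 1 would be LASSO)
ridge_fit <- glmnet(X, y, alpha = 0)

# Coefficients at the arbitrary penalty lambda = 1: every one of the
# 1,000 predictors stays in the model, just shrunk toward zero
coef(ridge_fit, s = 1)[1:6, ]
```

- Unlike OLS, Ridge happily returns a coefficient for all 1,000 predictors even though there are only 100 observations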
---
class: inverse, center, middle
name: lasso

# LASSO

---
# LASSO

- LASSO stands for Least Absolute Shrinkage and Selection Operator

- It is another shrinkage method that adds a penalty term to the RSS

- But now the penalty term is proportional to the sum of the absolute values of the coefficients

$$
`\begin{aligned}
\min_{\beta} \underbrace{\sum_{i=1}^n (y_i - \beta_0 - \sum_{j=1}^J \beta_j x_{ij})^2}_{\text{model fit}} + \underbrace{\lambda \sum_{j=1}^J |\beta_j|}_{\text{penalty}} = RSS + \lambda \sum_{j=1}^J |\beta_j|
\end{aligned}`
$$

- Instead of the squared penalty on coefficient size, you have the absolute value

- The magic of the absolute value is that it can shrink coefficients all the way to zero with a sufficiently large `\(\lambda\)`

- This means that LASSO can select variables: `\(\beta_j = 0\)` means that `\(x_j\)` is not in the model

- **Intuition**: The absolute value constraint has a "sharp" corner at zero, so it can "cut" coefficients exactly to zero; the Ridge constraint is a circle, so it can only shrink coefficients to the edge of the circle

- Selection is a big advantage over Ridge Regression

- Of course, that can also be a disadvantage if you want to keep all the variables in the model

- It leads to more bias

---
# LASSO visualization

![LASSO vs. Ridge](img/lasso_v_ridge.jpg)

---
# LASSO vs. Ridge coefficients

![LASSO vs. Ridge](img/lasso_v_ridge_coefficient.jpg)

---
class: inverse, center, middle
name: other-regularization

# Other details on Regularization

---
# K-fold cross-validation: How to pick `\(\lambda\)`

- The `\(\lambda\)` in Ridge and LASSO is a "tuning parameter," which controls the strength of the penalty

- You need to do `\(K\)`-fold cross-validation:

1. Choose the number of "folds" or groups, `\(K\)` (usually 5 or 10)
2. Randomly split the data into `\(K\)` folds
3. Create a grid of feasible `\(\lambda\)` values to check
4. For each value of `\(\lambda\)` and each fold `\(k\)`:
    - Fit Ridge or LASSO on the other `\(K-1\)` folds
    - Calculate `\(MSE_k(\lambda)\)` on the held-out fold `\(k\)`
5. Calculate the average `\(MSE_k\)` for each `\(\lambda\)`
`$$MSE_{CV}(\lambda) = \frac{1}{K} \sum_{k=1}^K MSE_k(\lambda)$$`
6. Pick the `\(\lambda\)` with the lowest `\(MSE_{CV}(\lambda)\)`

- You know what's neat? You can do this in R with the **glmnet** library! (There is a sketch on the appendix slide at the end of the deck)

- It will even plot the results for you, so you can see the optimal `\(\lambda\)`

---
# Drawbacks of LASSO and Ridge

- Regularization/coefficient shrinkage is useful for reducing variance and overfitting

- But it can also lead to bias

- The more you shrink the coefficients, the more bias you introduce

- You are no longer finding the best linear unbiased estimator (BLUE) that you find with OLS

- Instead, you get the best linear biased estimator (BLBE) because you trade some bias for less variance

- Sometimes you're okay with that!

---
# Why are you okay with bias?

- Sometimes you don't mind being a little off in your predictions

- Imagine you are predicting the number of people who will show up to a party

- Someone tells you there's a 50% chance 0 people come and a 50% chance 100 people come

- That's not very helpful

--

- But what if they predict 45-55 people will show up and then 40 people show up?

- That's wrong, but not so wrong as to cause problems

- It is even less helpful if they tell you that to make an accurate prediction they need to know:
  - The number of invites, the number of RSVPs
  - The weather
  - The day of the week, the time of day
  - The variety of chips you're serving
  - What is on TV that night
  - etc.
- I'd rather have a simple model that is a little wrong than a complicated model that swings wildly in its predictions

---
# Warning

- Regularization is a useful tool for reducing variance and overfitting

- But just because you *can* run a regression technique doesn't mean you *should*

- You should always think about the problem you are trying to solve and the data you have

- Is it worth trying a technique?

- Will this technique help you solve your problem?

- Will it help you understand your data?

- Or are you just trying to seem flashy?

---
# Conclusion

- Regularization is a useful tool for reducing variance and overfitting

- It recognizes that sometimes you are okay with a little bias if it means you get less variance

- It relies on a tuning parameter `\(\lambda\)` that controls the strength of the penalty from adding more complexity to a regression model

- LASSO can be used to select variables, while Ridge just reduces the magnitude of the coefficients

---
# What next?

- Try an activity: [ISLR lab using tidymodels](https://emilhvitfeldt.github.io/ISLR-tidymodels-labs/06-regularization.html)

- Before class: work through the lab sections on Ridge and LASSO in a .Rmd file that you create

- Write up short answers to the following questions:

1. What are the coefficients in the Ridge and LASSO regressions when the penalty is zero? Why?
2. How does tidymodels pick the optimal `\(\lambda\)` in each method?
3. What is the optimal `\(\lambda\)` in Ridge and LASSO?

---
class: inverse, center, middle

# Next lecture: Regular expressions and word clouds

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>
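---
# Appendix: picking `\(\lambda\)` with glmnet

- A minimal sketch of the `\(K\)`-fold cross-validation recipe from earlier, again assuming the wide `wide_df` data from the example slides: `cv.glmnet()` builds the grid of `\(\lambda\)`s, runs the folds, and averages the `\(MSE_k\)`s for you

```r
library(glmnet)

X <- as.matrix(wide_df[, -1])  # the P_1, ..., P_1000 columns
y <- wide_df$y

# alpha = 1 selects the LASSO penalty; 10-fold CV is the default
cv_fit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)

# The lambda with the lowest cross-validated MSE
cv_fit$lambda.min

# Plot the cross-validated MSE against log(lambda)
plot(cv_fit)

# Coefficients at the chosen lambda: many are exactly zero --
# that is LASSO selecting variables
coef(cv_fit, s = "lambda.min")
```

- Swapping `alpha = 1` for `alpha = 0` runs the same exercise for Ridge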