library(tidymodels)
library(tidyverse)
library(magrittr)
library(DataExplorer)
library(kableExtra)
library(stargazer)
library(ROCit)
library(caret)
library(glmnet)
set.seed(100)

Linear Regression

Preface

In this exercise we will go over some regression and classification models. Since this is a class for economists, you are probably familiar with linear regression. However, a good grasp of the parts of linear regression can help us understand other models: we can detect exactly where those models differ from the old workhorse.

Question: Can you tell what the parts of linear regression are?

Basically, when we estimate the \(\beta\)’s we make several assumptions:

  1. Linearity: The effect of each x on the y is constant.
  2. Separability: The effect of each x on the y is unrelated to the effects of the other x’s.
  3. \(u|X \sim {\sf N}(0, \sigma^2)\): Everything that should be included in the regression is in there, and the rest is distributed around the observed values. This assumption also includes homoscedasticity - no matter what the X’s are, the variance of the error is the same.

There are some more assumptions that we will not go into.

When we estimate the \(\beta\)s we are basically saying that changes in the X’s change the Y by those \(\beta\)s. Now imagine a set of observations of some variable of interest - test scores, for example - and some X’s that can explain the Y’s. What we are seeing is just some numbers; we still can’t assert any relation between the variables. Remember! The data itself can never assert causality - additional assumptions are always required. If we argue that some variable has some effect on Y, we need to explain why this effect best fits our data.

Exercises
  1. Can we use the data for prediction without assumptions? Why?

Linearity

What end does the linearity assumption serve? It basically says that the effect of each explanatory variable on the dependent variable is constant. It’s quite a strong assumption; even if we accept that class size affects students’ achievements - can we really be sure that the difference between 16 and 17 students in a class is the same as the difference between 34 and 35?

Separability

The separability assumption means that each x in the X’s affects Y individually. Again, this is quite a strong assumption. Maybe the quality of the teacher doesn’t play a big role in small classes but does play a huge role in big classes? We will see some models that don’t assume separability. However, we can also tweak the linear regression model by adding interactions between the X’s, as in the sketch below.
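
A minimal sketch of how an interaction enters R’s formula interface; score, size, teacher_quality and the classes data frame are hypothetical:

# size * teacher_quality expands to size + teacher_quality + size:teacher_quality,
# so the interaction enters alongside both main effects.
lm(score ~ size * teacher_quality, data = classes)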

Exercises
  1. What is the downside of adding interactions? (Hint: think about the ratio n/k, where n is the number of observations and k is the number of covariates.)

Normal Distribution of u|X

This assumption folds a few assumptions into one. It means that the part of Y that we don’t explain with the X’s has an expectation of zero, that its distribution is normal, and that the variance of this distribution doesn’t depend on X. Each of those assumptions is strong.

Exercises
  1. Why are those assumptions strong? Can you come up with a story that wouldn’t fit?
  2. The confidence intervals of the \(\beta\)s are derived from those assumptions. Explain how (an intuitive explanation is enough).

Data

We will use the wine dataset in this section. The wine dataset contains 1599 reviews of red wines. Each row consists of the characteristics of a wine and its score on a scale of 1-8 stars. Start by exploring the data.

Exercises
  1. Download the data from the course repo. We will use the winequality_red dataset.

  2. Plot the histogram of every variable. (Hint: you can use the DataExplorer package). You should get something like the following graph.

  3. Plot boxplots of every variable against the quality variable. It should look like this:
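
A minimal sketch for the three items above, assuming the repo file is a CSV named winequality_red.csv (adjust the path to wherever you saved it):

wine <- read_csv("winequality_red.csv")  # item 1: load the data
plot_histogram(wine)                     # item 2: histogram of every variable
plot_boxplot(wine, by = "quality")       # item 3: boxplots against quality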

Model

We will now fit a model to the data. We will not do any preparation except for splitting the data into train and test sets. It is very common, and sometimes even necessary, in ML to split the data into a train set - on which you fit the model - and a test set - on which you evaluate it. We expect a good model to be able to predict outcomes on data it has never seen. It is recommended to use the tidymodels package for this part. You can find some materials here.

Exercises
  1. Split the data into 70% train and 30% test. Save the train set in wine_train and the test set in wine_test. Use set.seed(100) before your code.

Note: If you are using tidymodels, note that juice() is not relevant here since we didn’t make any changes to the data.
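
A minimal sketch with rsample (loaded as part of tidymodels):

set.seed(100)
wine_split <- initial_split(wine, prop = 0.7)  # 70% train / 30% test
wine_train <- training(wine_split)
wine_test  <- testing(wine_split)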

  2. Fit the model: use linear regression where quality is the Y and the remaining variables are the X’s. You should get something like the following.
## # A tibble: 12 x 5
##    term                  estimate std.error statistic  p.value
##    <chr>                    <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)           18.3     26.2          0.699 4.84e- 1
##  2 fixed.acidity          0.0259   0.0320       0.810 4.18e- 1
##  3 volatile.acidity      -1.17     0.152       -7.70  2.95e-14
##  4 citric.acid           -0.202    0.180       -1.12  2.63e- 1
##  5 residual.sugar         0.0221   0.0186       1.19  2.36e- 1
##  6 chlorides             -2.25     0.532       -4.23  2.56e- 5
##  7 free.sulfur.dioxide    0.00380  0.00267      1.42  1.55e- 1
##  8 total.sulfur.dioxide  -0.00316  0.000864    -3.65  2.72e- 4
##  9 density              -14.1     26.7         -0.529 5.97e- 1
## 10 pH                    -0.401    0.237       -1.69  9.07e- 2
## 11 sulphates              0.966    0.142        6.82  1.49e-11
## 12 alcohol                0.267    0.0329       8.12  1.20e-15
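
A sketch of one way to produce this table with parsnip and broom:

lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(quality ~ ., data = wine_train)
tidy(lm_fit)  # the coefficient table above
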
  3. Predict: based on the model you fitted, predict the outcomes of the test set. Don’t forget that the predictions are real numbers - round them to whole stars. Print the first 6 predictions.
##    .pred_reg
## 1          5
## 2          5
## 4          6
## 11         5
## 12         6
## 13         5
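
A sketch of how these predictions can be produced; .pred is parsnip’s default prediction column, and .pred_reg is just the rounded version shown above:

wine_pred <- predict(lm_fit, new_data = wine_test) %>%
  mutate(.pred_reg = round(.pred))  # round to whole stars
head(wine_pred$.pred_reg)
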
  4. Metrics: What are the RMSE, R-squared and MAE?

Reminder: the Root Mean Squared Error (RMSE) is the square of each error, averaged, and rooted: \(\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}\). The Mean Absolute Error (MAE) is the average of the absolute errors: \(\frac{1}{n}\sum_{i=1}^{n}|y_i-\hat{y}_i|\). You should get something like this:
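
One way to compute all three with yardstick (a sketch):

wine_test %>%
  bind_cols(predict(lm_fit, new_data = wine_test)) %>%
  metrics(truth = quality, estimate = .pred)  # rmse, rsq and mae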

  5. Confidence intervals and t-tests are essential tools for determining whether we estimated the “real” \(\beta\) or not. The RMSE also helps to assess whether your model is adequate. What is the main difference between these tests?

Logistic Regression

Thus far we have treated the dependent variable as continuous (even though it wasn’t really so). A different kind of question focuses on a discrete dependent variable, in particular a binary Y.

Exercises
  1. Can we use linear regression for binary outcomes? Why?

Data

We will use “heart data” - a dataset about heart attacks. Take some time to explore it here.

Exercises
  1. Load the data from the course repo. Look at the head of the data. It should look like this:
## # A tibble: 6 x 14
##     age   sex    cp trestbps  chol   fbs restecg thalach exang oldpeak slope
##   <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1    63     1     3      145   233     1       0     150     0     2.3     0
## 2    37     1     2      130   250     0       1     187     0     3.5     0
## 3    41     0     1      130   204     0       0     172     0     1.4     2
## 4    56     1     1      120   236     0       1     178     0     0.8     2
## 5    57     0     0      120   354     0       1     163     1     0.6     2
## 6    57     1     0      140   192     0       1     148     0     0.4     1
## # ... with 3 more variables: ca <dbl>, thal <dbl>, target <dbl>
  2. Plot the histogram of every variable. (Hint: you can use the DataExplorer package). You should get something like the following graph.
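
A sketch for both items, assuming the repo file is a CSV named heart.csv:

heart <- read_csv("heart.csv")
head(heart)            # item 1: the first rows, as above
plot_histogram(heart)  # item 2: histogram of every variable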

Linear Regression

There is no technical problem in using linear regression for binary outcomes. The problem is that the fitted values are not probabilities - they can be above 1 and below 0. That’s not a good feature for a binary model.

Exercises
  1. Split the data into 70% train and 30% test. Save the train set in heart_train and the test set in heart_test. Use set.seed(100) before your code.

  2. Fit the model: use linear regression where target is the Y and the remaining variables are the X’s. You should get something like the following.

## # A tibble: 14 x 5
##    term         estimate std.error statistic    p.value
##    <chr>           <dbl>     <dbl>     <dbl>      <dbl>
##  1 (Intercept)  0.903     0.365       2.48   0.0141    
##  2 age         -0.00128   0.00330    -0.389  0.698     
##  3 sex         -0.241     0.0590     -4.09   0.0000636 
##  4 cp           0.128     0.0279      4.59   0.00000800
##  5 trestbps    -0.00295   0.00158    -1.86   0.0644    
##  6 chol        -0.000554  0.000518   -1.07   0.286     
##  7 fbs          0.00530   0.0740      0.0717 0.943     
##  8 restecg      0.0368    0.0495      0.744  0.458     
##  9 thalach      0.00323   0.00146     2.21   0.0285    
## 10 exang       -0.112     0.0639     -1.75   0.0812    
## 11 oldpeak     -0.0709    0.0315     -2.25   0.0255    
## 12 slope        0.0663    0.0534      1.24   0.216     
## 13 ca          -0.0738    0.0272     -2.71   0.00726   
## 14 thal        -0.0730    0.0438     -1.67   0.0968
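
A sketch mirroring the wine section: split with the same seed, then a linear probability model of target on everything else:

set.seed(100)
heart_split <- initial_split(heart, prop = 0.7)
heart_train <- training(heart_split)
heart_test  <- testing(heart_split)

lpm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(target ~ ., data = heart_train)
tidy(lpm_fit)
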
  3. Predict: based on the model you fitted, predict the outcomes of the test set. Don’t forget that the predictions are real numbers; before you round them, print the largest and the smallest of them. You should get something like this:
## [1] 1.185397
## [1] -0.3348134
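
A sketch of how these extremes can be pulled out:

lpm_pred <- predict(lpm_fit, new_data = heart_test)$.pred
max(lpm_pred)  # largest raw prediction
min(lpm_pred)  # smallest raw prediction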

Is there any problem with these numbers?

  4. Metrics: Plot a ROC curve. You should get the following graph:
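
A sketch with the ROCit package loaded above; score takes the raw predictions and class the observed 0/1 target:

roc_obj <- rocit(score = lpm_pred, class = heart_test$target)
plot(roc_obj)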

Logistic Regression

Since linear regression has these unwanted features, we can use logistic regression instead. In logistic regression we pass the linear index through the logistic function, \(P(Y=1|X)=\frac{1}{1+e^{-X\beta}}\), which guarantees that we get a probability as our result - it can’t deviate from [0,1].

Exercises
  1. Train a logistic regression model. Use the glm() function. Show a summary of the results. You should get something like this:
## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = heart_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4487  -0.4195   0.1533   0.6777   2.5945  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.985022   3.007597   1.325  0.18518    
## age         -0.014248   0.027096  -0.526  0.59901    
## sex         -2.035554   0.541345  -3.760  0.00017 ***
## cp           0.939408   0.222372   4.224 2.39e-05 ***
## trestbps    -0.025575   0.012615  -2.027  0.04262 *  
## chol        -0.006086   0.004185  -1.454  0.14587    
## fbs         -0.023626   0.583572  -0.040  0.96771    
## restecg      0.319575   0.406256   0.787  0.43150    
## thalach      0.025961   0.013069   1.986  0.04699 *  
## exang       -0.704569   0.484908  -1.453  0.14623    
## oldpeak     -0.546538   0.262465  -2.082  0.03731 *  
## slope        0.427791   0.403262   1.061  0.28877    
## ca          -0.515384   0.211149  -2.441  0.01465 *  
## thal        -0.546001   0.340902  -1.602  0.10924    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 293.58  on 212  degrees of freedom
## Residual deviance: 158.77  on 199  degrees of freedom
## AIC: 186.77
## 
## Number of Fisher Scoring iterations: 6
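
A sketch of the call shown in the summary above:

logit_fit <- glm(target ~ ., family = "binomial", data = heart_train)
summary(logit_fit)
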
  2. Predict: based on the model you fitted, predict the outcomes of the test set. Don’t forget that the predictions are real numbers; before you round them, print the largest and the smallest of them. You should get something like this:
## [1] 0.9958782
## [1] 0.001592446
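
A sketch; type = "response" returns fitted probabilities rather than log-odds:

logit_pred <- predict(logit_fit, newdata = heart_test, type = "response")
max(logit_pred)
min(logit_pred)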

Is there a problem with these numbers?

  3. Metrics: In binary classification models we evaluate the model’s performance using the terms “Sensitivity”, “Specificity” and “Accuracy”. How can you calculate these measures? Calculate them for our predictions. (You can use some packages to do that.)
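
One way, sketched with caret’s confusionMatrix(), which reports all three measures:

logit_class <- factor(round(logit_pred), levels = c(0, 1))
truth       <- factor(heart_test$target, levels = c(0, 1))
confusionMatrix(logit_class, truth)  # accuracy, sensitivity, specificity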

Regularization

The regression models we used thus far take the data and try to figure out which parameters fit it best. The problem is that these models “force” themselves on reality; you can almost always find parameters that fit your data. In that case the model will reproduce your in-sample values very well, but it will perform poorly on data that wasn’t used for fitting. In other words, the model fits both the “DGP” (Data Generating Process) and the noise (the idiosyncratic parts of the data). This problem is called “overfitting”. To mitigate it, one can add a penalty to the estimation equation. Since regression is essentially a minimization problem, adding a positive term that grows with the \(\beta\)’s forces the parameters to shrink or drop out.

Ridge

Ridge is one specification of regularization. In ridge we are minimizing the following equation:

\[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda \sum_{j=1}^{p} \beta_j^2 . \]

The penalty is on the squared coefficients. \(\lambda\) is a parameter - the higher it is, the bigger the penalty. Notice that - unlike what you know from OLS - the \(\beta\) estimated by ridge is biased.

Exercises
  1. Another version of regularized regression is the LASSO, in which the penalty is on the absolute values (in other contexts these are also called L1 and L2 regularization, after the norm in their penalty terms). Why is it necessary to use an absolute or squared term in the penalty?
  2. Advanced: One of the features of the lasso is “variable selection” - some of the covariates end up exactly at zero. This does not happen in ridge. Why?

We will now use the heart data again.

Exercises
  1. Use the glmnet package for this part. Use the glmnet function to fit a ridge model to the data. Then use the plot function to plot the log-lambda on the x-axis and the coefficients on the y-axis. You should get the following graph:

  2. In order to actually pick the best lambda we will use cross-validation. Use the cv.glmnet function to fit the model. Use the plot function to show the log-lambda against the MSE (a sketch for both items follows).
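
A sketch for items 1-2; glmnet expects a numeric predictor matrix, and alpha = 0 selects the ridge penalty. Note that type.measure = "mse" is an assumption here, chosen to match the MSE plot asked for (the binomial default is the deviance):

x_train <- as.matrix(select(heart_train, -target))
y_train <- heart_train$target

ridge_fit <- glmnet(x_train, y_train, family = "binomial", alpha = 0)
plot(ridge_fit, xvar = "lambda")  # coefficients against log(lambda)

cv_ridge <- cv.glmnet(x_train, y_train, family = "binomial",
                      alpha = 0, type.measure = "mse")
plot(cv_ridge)  # cross-validated MSE against log(lambda)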

  3. Plot the coefficients estimated by the model. Are they different from those in the logit section? Are all the covariates different from zero? Use the tidy function from broom to present the results.

  4. What can be the problem in econometrics (as opposed to statistics) when a covariate equals zero? (We will come back to this question later in the course.)

  5. Predict: Use the model you fitted to predict the target variable. Use two different \(\lambda\)’s: the minimal \(\lambda\) and the 1se \(\lambda\) (use the s argument of the predict function, as in the sketch below). Compare the results to one another and to the logistic regression.
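
A sketch using the s argument of predict():

x_test <- as.matrix(select(heart_test, -target))
pred_min <- predict(cv_ridge, newx = x_test, s = "lambda.min", type = "response")
pred_1se <- predict(cv_ridge, newx = x_test, s = "lambda.1se", type = "response")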

  6. Metrics: Create a confusion matrix for each prediction. It should look like this:

## Confusion Matrix and Statistics
## 
##    
##      0  1
##   0 36  6
##   1  5 43
##                                           
##                Accuracy : 0.8778          
##                  95% CI : (0.7918, 0.9374)
##     No Information Rate : 0.5444          
##     P-Value [Acc > NIR] : 1.202e-11       
##                                           
##                   Kappa : 0.7541          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8780          
##             Specificity : 0.8776          
##          Pos Pred Value : 0.8571          
##          Neg Pred Value : 0.8958          
##              Prevalence : 0.4556          
##          Detection Rate : 0.4000          
##    Detection Prevalence : 0.4667          
##       Balanced Accuracy : 0.8778          
##                                           
##        'Positive' Class : 0               
## 
## Confusion Matrix and Statistics
## 
##    
##      0  1
##   0 33  5
##   1  8 44
##                                           
##                Accuracy : 0.8556          
##                  95% CI : (0.7657, 0.9208)
##     No Information Rate : 0.5444          
##     P-Value [Acc > NIR] : 3.463e-10       
##                                           
##                   Kappa : 0.7071          
##                                           
##  Mcnemar's Test P-Value : 0.5791          
##                                           
##             Sensitivity : 0.8049          
##             Specificity : 0.8980          
##          Pos Pred Value : 0.8684          
##          Neg Pred Value : 0.8462          
##              Prevalence : 0.4556          
##          Detection Rate : 0.3667          
##    Detection Prevalence : 0.4222          
##       Balanced Accuracy : 0.8514          
##                                           
##        'Positive' Class : 0               
## 
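
A sketch that produces the two matrices above, rounding each vector of predicted probabilities into classes:

truth <- factor(heart_test$target, levels = c(0, 1))
confusionMatrix(factor(round(as.vector(pred_min)), levels = c(0, 1)), truth)
confusionMatrix(factor(round(as.vector(pred_1se)), levels = c(0, 1)), truth)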

Good Luck!