library(tidymodels)
library(tidyverse)
library(magrittr)
library(DataExplorer)
library(kableExtra)
library(stargazer)
library(ROCit)
library(caret)
library(glmnet)
set.seed(100)
In this exercise we will go over some regression and classification models. Since this is a class for economists, you are probably familiar with linear regression. However, a good grasp of the parts of linear regression can help us understand other models: we can detect where those models differ from the old workhorse of linear regression.
Question: Can you tell what are the parts of linear regression?
Basically, when we estimate the \(\beta\)'s we rely on several assumptions:
- The effect of each x on the y is constant.
- The effect of each x on the y is unrelated to the effect of the other x's.
There are some more assumptions that we will not go into.
When we estimate the \(\beta\)'s we are basically saying that changes in the X's change Y according to those \(\beta\)'s. Now imagine a set of observations of some variable of interest - test scores, for example - and some X's that can explain the Y's. What we are seeing is just numbers; we still can't assert any relation between the variables. Remember! The data itself can never assert causality - additional assumptions are always required. If we argue that some variable has an effect on Y, we need to explain why this effect best fits our data.
What end does the linearity assumption serve? It basically says that the effect of each explanatory variable on the dependent variable is constant. It's quite a strong assumption; even if we accept that class size affects students' achievements - can we really be sure that the difference between 16 and 17 students in a class is the same as the difference between 34 and 35?
The separability assumption means that each x in the X's affects Y individually. Again, this is quite a strong assumption. Maybe the quality of the teacher doesn't play a big role in small classes but does play a huge role in big classes? We will see some models that don't assume separability. However, we can also tweak the linear regression model by adding interactions between the X's, as in the sketch below.
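For example, in R's formula syntax an interaction is added with * or :. A hypothetical illustration (the variable names and the simulated data are made up and not part of the exercise):
# Simulated data where the effect of teacher quality grows with class size
schools <- tibble(
  class_size      = sample(15:40, 200, replace = TRUE),
  teacher_quality = rnorm(200),
  score           = 70 - 0.3 * class_size + 0.1 * class_size * teacher_quality + rnorm(200)
)
# class_size * teacher_quality expands to both main effects plus their interaction
lm(score ~ class_size * teacher_quality, data = schools)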
A further assumption concerns the part of Y that we don't explain with the x's, and it folds a few assumptions into one: this unexplained part has an expectation of zero, its distribution is normal, and the variance of that distribution doesn't depend on X. Each of these assumptions is strong on its own.
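In notation, writing the unexplained part of Y as \(\varepsilon\), these assumptions amount to roughly:
\[ E[\varepsilon \mid X] = 0, \qquad \varepsilon \sim N(0, \sigma^2), \qquad Var(\varepsilon \mid X) = \sigma^2 . \]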
We will use the wine dataset in this section. It contains 1599 reviews of red wines; every row consists of the characteristics of a wine and its score on a scale of 1-8 stars. Start by exploring the data.
Download the data from the course repo. We will use the winequality_red dataset.
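A minimal sketch for loading the data, assuming the file was saved locally as winequality_red.csv (if the repo stores it semicolon-delimited, use read_delim() with delim = ";" instead):
wine <- read_csv("winequality_red.csv")
glimpse(wine)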
Plot the histogram of every variable (hint: you can use the DataExplorer package). You should get something like the following graph.
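One way to produce it, assuming the data frame from the previous step is called wine:
# One histogram per variable, in a single call
plot_histogram(wine)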
Plot boxplots of every variable against the quality variable. It should look like this:
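A sketch with DataExplorer again:
# Boxplot of every variable, grouped by quality
plot_boxplot(wine, by = "quality")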
We will now fit a model to the data. We will not do any preparation except for dividing the data into train and test sets. It is very common, and sometimes even necessary, in ML to split the data into a train set - on which you fit the model - and a test set - on which you evaluate it. We expect a good model to be able to predict outcomes on data it has never seen. It is recommended to use the tidymodels package for this part. You can find some materials here.
Split the data into 70% train and 30% test. Save the train set in wine_train and the test set in wine_test. Use set.seed(100) before your code. Note: if you are using tidymodels, juice() is not relevant here since we didn't make any changes to the data.
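A minimal sketch with rsample's initial_split() (loaded with tidymodels):
set.seed(100)
wine_split <- initial_split(wine, prop = 0.7)
wine_train <- training(wine_split)
wine_test  <- testing(wine_split)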
Fit the model: use linear regression where quality is the Y and the remaining variables are the X's. You should get something like the following.
## # A tibble: 12 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 18.3 26.2 0.699 4.84e- 1
## 2 fixed.acidity 0.0259 0.0320 0.810 4.18e- 1
## 3 volatile.acidity -1.17 0.152 -7.70 2.95e-14
## 4 citric.acid -0.202 0.180 -1.12 2.63e- 1
## 5 residual.sugar 0.0221 0.0186 1.19 2.36e- 1
## 6 chlorides -2.25 0.532 -4.23 2.56e- 5
## 7 free.sulfur.dioxide 0.00380 0.00267 1.42 1.55e- 1
## 8 total.sulfur.dioxide -0.00316 0.000864 -3.65 2.72e- 4
## 9 density -14.1 26.7 -0.529 5.97e- 1
## 10 pH -0.401 0.237 -1.69 9.07e- 2
## 11 sulphates 0.966 0.142 6.82 1.49e-11
## 12 alcohol 0.267 0.0329 8.12 1.20e-15
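A sketch of one way to produce a table like the one above with parsnip and broom (both loaded with tidymodels):
# Linear regression: quality as Y, every remaining column as an X
lm_spec <- linear_reg() %>% set_engine("lm")
lm_fit  <- fit(lm_spec, quality ~ ., data = wine_train)
tidy(lm_fit)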
Predict: use the fitted model to predict quality on the test set. The predictions should look something like this:
## .pred_reg
## 1 5
## 2 5
## 4 6
## 11 5
## 12 6
## 13 5
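A sketch of the prediction step; rounding to whole stars to form .pred_reg is an assumption made to mimic the output above:
wine_pred <- predict(lm_fit, new_data = wine_test) %>%
  bind_cols(wine_test)
# Rounded predictions, as in the printout above
wine_pred %>%
  transmute(.pred_reg = round(.pred)) %>%
  head()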
Metrics: what are the rmse, R-Squared and MAE? Reminder: the Root Mean Squared Error (RMSE) squares each error, averages the squares, and takes the square root; the Mean Absolute Error (MAE) is the average of the absolute errors. You should get something like this:
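A sketch with yardstick (loaded with tidymodels), computed on the unrounded predictions:
wine_metric_set <- metric_set(rmse, rsq, mae)
wine_metric_set(wine_pred, truth = quality, estimate = .pred)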
Thus far we have treated the dependent variable as continuous (even though it wasn't really so). A different kind of question focuses on a discrete dependent variable, in particular a binary Y.
We will use “heart data” - a dataset about heart attacks. Take some time to explore it here.
## # A tibble: 6 x 14
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 63 1 3 145 233 1 0 150 0 2.3 0
## 2 37 1 2 130 250 0 1 187 0 3.5 0
## 3 41 0 1 130 204 0 0 172 0 1.4 2
## 4 56 1 1 120 236 0 1 178 0 0.8 2
## 5 57 0 0 120 354 0 1 163 1 0.6 2
## 6 57 1 0 140 192 0 1 148 0 0.4 1
## # ... with 3 more variables: ca <dbl>, thal <dbl>, target <dbl>
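A sketch that would load the data and produce the preview above; the file name heart.csv is an assumption:
heart <- read_csv("heart.csv")
head(heart)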
Plot the histogram of every variable (hint: you can use the DataExplorer package). You should get something like the following graph.
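As with the wine data, one call does the job:
plot_histogram(heart)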
There is no technical problem in using linear regression for binary outcomes. The problem is that the predicted values are not probabilities - they can be above 1 or below 0. That is not a good feature for a binary model.
Split the data into 70% train and 30% test. Save the train set in heart_train and the test set in heart_test. Use set.seed(100) before your code.
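The same rsample sketch as before, now on the heart data:
set.seed(100)
heart_split <- initial_split(heart, prop = 0.7)
heart_train <- training(heart_split)
heart_test  <- testing(heart_split)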
Fit the model: use linear regression where target is the Y and the remaining variables are the X's. You should get something like the following.
## # A tibble: 14 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.903 0.365 2.48 0.0141
## 2 age -0.00128 0.00330 -0.389 0.698
## 3 sex -0.241 0.0590 -4.09 0.0000636
## 4 cp 0.128 0.0279 4.59 0.00000800
## 5 trestbps -0.00295 0.00158 -1.86 0.0644
## 6 chol -0.000554 0.000518 -1.07 0.286
## 7 fbs 0.00530 0.0740 0.0717 0.943
## 8 restecg 0.0368 0.0495 0.744 0.458
## 9 thalach 0.00323 0.00146 2.21 0.0285
## 10 exang -0.112 0.0639 -1.75 0.0812
## 11 oldpeak -0.0709 0.0315 -2.25 0.0255
## 12 slope 0.0663 0.0534 1.24 0.216
## 13 ca -0.0738 0.0272 -2.71 0.00726
## 14 thal -0.0730 0.0438 -1.67 0.0968
## [1] 1.185397
## [1] -0.3348134
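The two bare numbers above look like the maximum and minimum of the fitted model's predictions on the test set (an assumption). A sketch that reproduces the fit and this check:
lm_heart <- fit(linear_reg() %>% set_engine("lm"), target ~ ., data = heart_train)
tidy(lm_heart)
heart_lm_pred <- predict(lm_heart, new_data = heart_test)
max(heart_lm_pred$.pred)
min(heart_lm_pred$.pred)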
Is there any problem with these numbers?
Since linear regression has these unwanted features, we can use logistic regression instead. In logistic regression we use the logistic distribution to make sure that the result is a probability - it cannot fall outside [0, 1].
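Formally, the model for the probability is
\[ P(Y = 1 \mid X) = \frac{e^{X\beta}}{1 + e^{X\beta}}, \]
which always lies strictly between 0 and 1.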
Fit the model: use logistic regression with the glm() function. Show a summary of the results. You should get something like this:
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4487 -0.4195 0.1533 0.6777 2.5945
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.985022 3.007597 1.325 0.18518
## age -0.014248 0.027096 -0.526 0.59901
## sex -2.035554 0.541345 -3.760 0.00017 ***
## cp 0.939408 0.222372 4.224 2.39e-05 ***
## trestbps -0.025575 0.012615 -2.027 0.04262 *
## chol -0.006086 0.004185 -1.454 0.14587
## fbs -0.023626 0.583572 -0.040 0.96771
## restecg 0.319575 0.406256 0.787 0.43150
## thalach 0.025961 0.013069 1.986 0.04699 *
## exang -0.704569 0.484908 -1.453 0.14623
## oldpeak -0.546538 0.262465 -2.082 0.03731 *
## slope 0.427791 0.403262 1.061 0.28877
## ca -0.515384 0.211149 -2.441 0.01465 *
## thal -0.546001 0.340902 -1.602 0.10924
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 293.58 on 212 degrees of freedom
## Residual deviance: 158.77 on 199 degrees of freedom
## AIC: 186.77
##
## Number of Fisher Scoring iterations: 6
## [1] 0.9958782
## [1] 0.001592446
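The two numbers above look like the maximum and minimum of the predicted probabilities on the test set (an assumption). A sketch of the whole step:
logit_fit <- glm(target ~ ., family = "binomial", data = heart_train)
summary(logit_fit)
# Predicted probabilities are bounded between 0 and 1
p_hat <- predict(logit_fit, newdata = heart_test, type = "response")
max(p_hat)
min(p_hat)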
Is there a problem with these numbers?
The regression models we have used thus far take the data and try to figure out which parameters fit best. The problem is that these models "force" themselves on reality; you can almost always find parameters that fit your data. In this case the model's predictions will be very good at reproducing your in-sample values, but it will perform poorly on data that wasn't used for fitting. In other words, the model fits both the "DGP" (Data Generating Process) and the noise (the idiosyncratic parts of the data). This problem is called "overfitting". In order to mitigate it, one can add a penalty to the estimation equation. Since regression is essentially a minimization problem, adding a positive term involving the \(\beta\)'s forces the parameters to be dropped or shrunk.
Ridge is one specification of regularization. In ridge we are minimizing the following equation:
\[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda \sum_{j=1}^{p} \beta_j^2 . \]
The penalty is on the squared values of the coefficients. \(\lambda\) is a parameter - the higher it is, the stronger the penalty. Notice that - unlike what you know from OLS - the \(\beta\) estimated by ridge is biased.
We will now use the heart data again.
Use the glmnet package for this part. Use the glmnet function to fit a ridge model to the data. Then use the plot function to plot the "log - lambda" on the x-axis and the coefficients on the y-axis. You should get the following graph:
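A sketch with glmnet; it expects a numeric predictor matrix and a response vector, and alpha = 0 selects the ridge penalty:
x_train <- model.matrix(target ~ ., data = heart_train)[, -1]  # drop the intercept column
y_train <- heart_train$target
ridge_fit <- glmnet(x_train, y_train, family = "binomial", alpha = 0)
plot(ridge_fit, xvar = "lambda")  # coefficients against log-lambda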
In order to actually pick the best lambda we will use cross-validation. Use the cv.glmnet function to fit the model. Use the plot function to show the "log lambda" against the MSE.
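A sketch of the cross-validation step; type.measure = "mse" is an assumption made to match the plot described (the default measure for a binomial family is the deviance):
set.seed(100)
ridge_cv <- cv.glmnet(x_train, y_train, family = "binomial",
                      alpha = 0, type.measure = "mse")
plot(ridge_cv)  # log-lambda against the cross-validated MSE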
Plot the coefficients estimated by the model. Are they different from those in the logit section? Are all the coefficients different from zero? Use the tidy function from broom to present the results.
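One way to inspect them, assuming the ridge_cv object from the previous step:
coef(ridge_cv, s = "lambda.min")        # coefficients at the best lambda
ridge_path <- tidy(ridge_cv$glmnet.fit) # broom's tidy() works on the glmnet path
head(ridge_path)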
What can be the problem in econometrics (as opposed to statistics) when a coefficient is set to zero? (We will come back to this question down the road.)
Predict: use the model you fitted and predict the target variable. Use two different \(\lambda\)'s: the minimal \(\lambda\) and the 1se \(\lambda\) (use the s argument in the predict function). Compare the results to one another and to the logistic regression.
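A sketch, reusing model.matrix() for the test set:
x_test <- model.matrix(target ~ ., data = heart_test)[, -1]
pred_min <- predict(ridge_cv, newx = x_test, s = "lambda.min", type = "class")
pred_1se <- predict(ridge_cv, newx = x_test, s = "lambda.1se", type = "class")
table(pred_min, pred_1se)  # compare the two sets of predicted classes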
Metrics: create a confusion matrix for each prediction. It should look like this:
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 36 6
## 1 5 43
##
## Accuracy : 0.8778
## 95% CI : (0.7918, 0.9374)
## No Information Rate : 0.5444
## P-Value [Acc > NIR] : 1.202e-11
##
## Kappa : 0.7541
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8780
## Specificity : 0.8776
## Pos Pred Value : 0.8571
## Neg Pred Value : 0.8958
## Prevalence : 0.4556
## Detection Rate : 0.4000
## Detection Prevalence : 0.4667
## Balanced Accuracy : 0.8778
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 33 5
## 1 8 44
##
## Accuracy : 0.8556
## 95% CI : (0.7657, 0.9208)
## No Information Rate : 0.5444
## P-Value [Acc > NIR] : 3.463e-10
##
## Kappa : 0.7071
##
## Mcnemar's Test P-Value : 0.5791
##
## Sensitivity : 0.8049
## Specificity : 0.8980
## Pos Pred Value : 0.8684
## Neg Pred Value : 0.8462
## Prevalence : 0.4556
## Detection Rate : 0.3667
## Detection Prevalence : 0.4222
## Balanced Accuracy : 0.8514
##
## 'Positive' Class : 0
##
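A sketch with caret's confusionMatrix(), assuming the predicted classes from the previous step:
confusionMatrix(factor(pred_min, levels = c("0", "1")),
                factor(heart_test$target, levels = c("0", "1")))
confusionMatrix(factor(pred_1se, levels = c("0", "1")),
                factor(heart_test$target, levels = c("0", "1")))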
Good Luck!