class: center, middle, inverse, title-slide # PRIMJENJENA STATISTIKA ## Predavanje 11: Linearna regresija ### Luka Sikic, PhD ### Fakultet hrvatskih studija |
Github PS
--- class: inverse, middle # PREGLED PREDAVANJA --- layout: true # PREGLED PREDAVANJA --- <br> <br> ## CILJEVI <br> <br> - Linearna regresija <br> - Multivarijatna linearna regresija <br> - Karakteristike procijenjenog modela <br> - Testiranje hipoteza <br> - Pretpostavke modela <br> - Provjera modela <br> - Izbor parametara modela --- layout:false class: middle, inverse # LINEARNA REGRESIJA <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> (Osnova za razumijevanje velikog broja statističkih modela!) --- layout:true # LINEARNA REGRESIJA --- <style type="text/css"> @media print { .has-continuation { display: block !important; } } remark-slide-content { font-size: 22px; padding: 20px 80px 20px 80px; } .remark-code, .remark-inline-code { background: #f0f0f0; } .remark-code { font-size: 16px; } .huge .remark-code { /*Change made here*/ font-size: 200% !important; } .mid .remark-code { /*Change made here*/ font-size: 70% !important; } .tiny .remark-code { /*Change made here*/ font-size: 50% !important; } </style> .hi[**Pregled podataka**] ``` #> dan.sleep baby.sleep dan.grump day #> 1 7.59 10.18 56 1 #> 2 7.91 11.66 60 2 #> 3 5.14 7.92 82 3 #> 4 7.71 9.61 55 4 #> 5 6.68 9.75 67 5 #> 6 5.99 5.04 72 6 #> 7 8.19 10.45 53 7 #> 8 7.19 8.27 60 8 #> 9 7.40 6.06 60 9 #> 10 6.58 7.09 71 10 #> 11 6.49 11.68 72 11 #> 12 6.27 6.13 65 12 #> 13 5.95 7.83 74 13 #> 14 6.65 5.60 67 14 #> 15 6.41 6.03 66 15 ``` --- .hi[**Pregled podataka**] <img src="11_REG_files/figure-html/regr0-1.svg" style="display: block; margin: auto;" /> <br> <br> .footnote[[*]Dijagram rasipanja koji pokazuje neraspolozenje kao funkciju sati spavanja.] --- .hi[**Napravi regresijski pravac**] <img src="11_REG_files/figure-html/regression1a-1.svg" style="display: block; margin: auto;" /> .footnote[[*]Regresijski pravac koji prikazuje odnos neraspolozenja i sati spavanja.] --- .hi[**Napravi (loš) regresijski pravac**] <img src="11_REG_files/figure-html/regression1b-1.svg" style="display: block; margin: auto;" /> .footnote[[*]Regresijski pravac koji loše prikazuje odnos neraspolozenja i sati spavanja.] --- .hi[**Formula regresijskog pravca**] `$$\hat{Y_i} = b_1 X_i + b_0$$` .hi[**Pogreška regresijskog modela**] `$$\epsilon_i = Y_i - \hat{Y}_i$$` .hi[**Regresijski model za procjenu**] `$$Y_i = b_1 X_i + b_0 + \epsilon_i$$` .hi[**OLS model**] `$$\sum_i (Y_i - \hat{Y}_i)^2$$` `$$\sum_i {\epsilon_i}^2$$` --- .hi[**Grafički prikaz OLS modela**] <img src="11_REG_files/figure-html/regression3a-1.svg" style="display: block; margin: auto;" /> --- .hi[**Grafički prikaz OLS modela**] <img src="11_REG_files/figure-html/regression3b-1.svg" style="display: block; margin: auto;" /> .footnote[[*]Prikaz reziduala vezanih uz loš regresijski pravac.] --- .hi[**Procjeni regresijski model**] ```r # Procjeni regresijski model i spremi rezultate u objekt regression.1 <- lm( formula = dan.grump ~ dan.sleep, data = parenthood ) ``` ```r # Pogledaj rezultate print( regression.1 ) ``` ``` #> #> Call: #> lm(formula = dan.grump ~ dan.sleep, data = parenthood) #> #> Coefficients: #> (Intercept) dan.sleep #> 125.956 -8.937 ``` .hi[**Formula procijenjenog regresijskog modela**] `$$\hat{Y}_i = -8.94 \ X_i + 125.96$$` --- layout:false class: middle, inverse # VIŠESTRUKA REGRESIJA <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> (Regresija sa više varijabli!) --- layout:true # VIŠESTRUKA REGRESIJA --- <br> <br> .hi[**Formula 1**] `$$Y_i = b_2 X_{i2} + b_1 X_{i1} + b_0 + \epsilon_i$$` <br> <br> .hi[**Sintaksa u R**] <br> ``` dan.grump ~ dan.sleep + baby.sleep ``` --- .hi[**Grafički prikaz**] <img src="../Foto/scatter3d.png" width="420" style="display: block; margin: auto;" /> --- .hi[**Procijeni regresijski model**] ```r # Provedi višestruku regresiju u R regression.2 <- lm( formula = dan.grump ~ dan.sleep + baby.sleep, data = parenthood ) # Pregledaj rezultate print(regression.2) ``` ``` #> #> Call: #> lm(formula = dan.grump ~ dan.sleep + baby.sleep, data = parenthood) #> #> Coefficients: #> (Intercept) dan.sleep baby.sleep #> 125.96557 -8.95025 0.01052 ``` .hi[**Formula 2**] `$$Y_i = \left( \sum_{k=1}^K b_{k} X_{ik} \right) + b_0 + \epsilon_i$$` --- layout:false class: middle, inverse # KARAKTERISTIKE PROCIJENJENOG MODELA <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> (Koliko kvalitetena procjena...) --- layout:true # KARAKTERISTIKE PROCIJENJENOG MODELA --- .hi[**Izračun kvadrata odstupanja**] - Rezidualna odstupanja `$$\mbox{SS}_{res} = \sum_i (Y_i - \hat{Y}_i)^2$$` - Ukupna odstupanja `$$\mbox{SS}_{tot} = \sum_i (Y_i - \bar{Y})^2$$` - Izračunaj u programu ```r X <- parenthood$dan.sleep # Nezavisna varijabla Y <- parenthood$dan.grump # Zavisna varijabla ``` ```r # Procijenjene vrijednosti Y.pred <- -8.94 * X + 125.97 ``` --- .hi[**Izračun kvadrata odstupanja**] ```r # Izračunaj sumu rezidualnih odstupanja SS.resid <- sum((Y - Y.pred)^2) print(SS.resid) # Prikaži ``` ``` #> [1] 1838.722 ``` ```r # Izračunaj sumu ukupnih odstupanja SS.tot <- sum((Y - mean(Y))^2) print(SS.tot) # Prikaži ``` ``` #> [1] 9998.59 ``` --- - Formula `$$R^2 = 1 - \frac{\mbox{SS}_{res}}{\mbox{SS}_{tot}}$$` - Izračunaj vrijednost ```r # Unesi vrijednosti R.squared <- 1 - (SS.resid / SS.tot) print(R.squared) # Prikaži ``` ``` #> [1] 0.8161018 ``` - Usporedi sa korelacijom ```r r <- cor(X, Y) # Izračunaj korelaciju print( r^2 ) # Prikaži ``` ``` #> [1] 0.8161027 ``` --- .hi[**Prilagodjeni R^2 koeficijent**] <br> <br> `$$\mbox{adj. } R^2 = 1 - \left(\frac{\mbox{SS}_{res}}{\mbox{SS}_{tot}} \times \frac{N-1}{N-K-1} \right)$$` --- layout:false class: middle, inverse # TESTIRANJE HIPOTEZA <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> (Hipoteze kod regresijskog modela!) --- layout:true # TESTIRANJE HIPOTEZA --- .hi[**Za cijeli model**] <br> <br> - Nulta hipoteza <br> `$$H_0: Y_i = b_0 + \epsilon_i$$` <br> - Alternativna hipoteza <br> `$$H_1: Y_i = \left( \sum_{k=1}^K b_{k} X_{ik} \right) + b_0 + \epsilon_i$$` --- .hi[**Za cijeli model**] - Izračun F statistike `$$\mbox{SS}_{mod} = \mbox{SS}_{tot} - \mbox{SS}_{res}$$` `$$\begin{array}{rcl} \mbox{MS}_{mod} &=& \displaystyle\frac{\mbox{SS}_{mod} }{df_{mod}} \\ \\ \mbox{MS}_{res} &=& \displaystyle\frac{\mbox{SS}_{res} }{df_{res} } \end{array}$$` `$$F = \frac{\mbox{MS}_{mod}}{\mbox{MS}_{res}}$$` --- .hi[**Za pojedinačne koeficijente**] <br> - Hipoteze <br> `$$\begin{array}{rl} H_0: & b = 0 \\ H_1: & b \neq 0 \end{array}$$` <br> - t-test <br> `$$t = \frac{\hat{b}}{\mbox{SE}({\hat{b})}}$$` --- - Rezultati modela ```r # Pogledaj rezultate modela print( regression.2 ) ``` ``` #> #> Call: #> lm(formula = dan.grump ~ dan.sleep + baby.sleep, data = parenthood) #> #> Coefficients: #> (Intercept) dan.sleep baby.sleep #> 125.96557 -8.95025 0.01052 ``` --- - Rezultati modela višestruke linearne regresije .tiny[ ```r # Pogkedaj rezultate summary(regression.2) ``` ``` #> #> Call: #> lm(formula = dan.grump ~ dan.sleep + baby.sleep, data = parenthood) #> #> Residuals: #> Min 1Q Median 3Q Max #> -11.0345 -2.2198 -0.4016 2.6775 11.7496 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 125.96557 3.04095 41.423 <2e-16 *** #> dan.sleep -8.95025 0.55346 -16.172 <2e-16 *** #> baby.sleep 0.01052 0.27106 0.039 0.969 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 4.354 on 97 degrees of freedom #> Multiple R-squared: 0.8161, Adjusted R-squared: 0.8123 #> F-statistic: 215.2 on 2 and 97 DF, p-value: < 2.2e-16 ``` ] --- layout:false class: middle, inverse # PRETPOSTAVKE REGRESIJSKOG MODELA <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> (Uvjeti koji moraju biti zadovoljeni!) --- layout:true # PRETPOSTAVKE REGRESIJSKOG MODELA --- .hi[**Uvjeti:**] <br> <br> - Normalnost distribucije (reziduala) <br> - Linearnost <br> - Homogenost varijance <br> - Nekoreliranost(prediktora) <br> - Nezavisnost rezidualne strukture <br> - Nema značajnih outliera <br> --- layout:false class: middle, inverse # PROVJERA MODELA <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> (Da li su pretpostavke zadovoljene!) --- layout:true # PROVJERA MODELA --- .hi[**Ekstremni podatci:**] - Outlier-i <img src="../Foto/outlier.png" width="120" style="display: block; margin: auto;" /> --- .hi[**Ekstremni podatci:**] - utjecaj poluge (leverage) <img src="../Foto/leverage.png" width="120" style="display: block; margin: auto;" /> --- .hi[**Ekstremni podatci:**] - utjecaj opservacije <img src="../Foto/influence.png" width="120" style="display: block; margin: auto;" /> --- - Formula ```r # Izračunaj mjeru Cook-ove udaljenosti cooks.distance( model = regression.2 ) ``` --- ```r # Prikaži Cook-ovu mjeru grafički plot(x = regression.2, which = 4) ``` <img src="11_REG_files/figure-html/unnamed-chunk-18-1.svg" style="display: block; margin: auto;" /> --- ```r # Provedi procjenu bez ekstremnih opservacija lm(formula = dan.grump ~ dan.sleep + baby.sleep, data = parenthood, subset = -64) ``` ``` #> #> Call: #> lm(formula = dan.grump ~ dan.sleep + baby.sleep, data = parenthood, #> subset = -64) #> #> Coefficients: #> (Intercept) dan.sleep baby.sleep #> 126.3553 -8.8283 -0.1319 ``` --- .hi[**Provjera normalnosti reziduala**] ```r # Prikaži grafički plot(x = regression.2, which = 1 ) # Resid vs. fitted ``` <img src="11_REG_files/figure-html/unnamed-chunk-20-1.svg" style="display: block; margin: auto;" /> --- .hi[**Provjera normalnosti reziduala**] ```r plot(x = regression.2, which = 2 ) # QQ-plot ``` <img src="11_REG_files/figure-html/unnamed-chunk-21-1.svg" style="display: block; margin: auto;" /> --- .hi[**Provjera normalnosti reziduala**] ```r # Prikaži reziduale na histogramu hist( x = residuals(regression.2), xlab = "Vrijednost reziduala", main = "", breaks = 20) ``` <img src="11_REG_files/figure-html/unnamed-chunk-22-1.svg" style="display: block; margin: auto;" /> --- .hi[**Provjera linearnosti**] ```r # Spremi fit vrijednosti u objekt yhat.2 <- fitted.values(object = regression.2) # Prikaži grafički plot(x = yhat.2, y = parenthood$dan.grump, xlab = "Fit", ylab = "Observed") ``` <img src="11_REG_files/figure-html/unnamed-chunk-23-1.svg" style="display: block; margin: auto;" /> --- ```r # Prikaži reyidualne vs. procijenjene fitted vrijednosti plot(x = regression.2, which = 1) ``` <img src="11_REG_files/figure-html/unnamed-chunk-24-1.svg" style="display: block; margin: auto;" /> --- ```r # Prikaži rezidualne vs fitted vrijednosti car::residualPlots(model = regression.2) ``` <img src="11_REG_files/figure-html/unnamed-chunk-25-1.svg" style="display: block; margin: auto;" /> ``` #> Test stat Pr(>|Test stat|) #> dan.sleep 2.1604 0.03323 * #> baby.sleep -0.5445 0.58733 #> Tukey test 2.1615 0.03066 * #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- .hi[**Provjera homogenosti varijance**] ```r plot(x = regression.2, which = 3) ``` <img src="11_REG_files/figure-html/unnamed-chunk-26-1.svg" style="display: block; margin: auto;" /> --- ```r # Provedi test homogenosti varijance car::ncvTest(regression.2) ``` ``` #> Non-constant Variance Score Test #> Variance formula: ~ fitted.values #> Chisquare = 0.09317511, Df = 1, p = 0.76018 ``` ```r # Provedi drugi test varijance library(car) lmtest::coeftest( regression.2, vcov= hccm ) ``` ``` #> #> t test of coefficients: #> #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 125.965566 3.247285 38.7910 <2e-16 *** #> dan.sleep -8.950250 0.615820 -14.5339 <2e-16 *** #> baby.sleep 0.010524 0.291565 0.0361 0.9713 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- .hi[**Provjera kolinearnosti**] `$$\mbox{VIF}_k = \frac{1}{1-{R^2_{(-k)}}}$$` ```r # Provedi test vif( mod = regression.2 ) ``` ``` #> dan.sleep baby.sleep #> 1.651038 1.651038 ``` ```r # Pogledaj korelaciju cor( parenthood ) ``` ``` #> dan.sleep baby.sleep dan.grump day #> dan.sleep 1.00000000 0.62794934 -0.90338404 -0.09840768 #> baby.sleep 0.62794934 1.00000000 -0.56596373 -0.01043394 #> dan.grump -0.90338404 -0.56596373 1.00000000 0.07647926 #> day -0.09840768 -0.01043394 0.07647926 1.00000000 ``` --- layout:false class: middle, inverse # IZBOR PARAMETARA MODELA <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> (Uvjeti koji moraju biti zadovoljeni!) --- layout:true # IZBOR PARAMETARA MODELA --- - Informacijski kriterij (AIC) `$$\mbox{AIC} = \displaystyle\frac{\mbox{SS}_{res}}{\hat{\sigma}}^2+ 2K$$` - Selekcija unatrag (backward selection) ```r # Specificiraj puni model full.model <- lm(formula = dan.grump ~ dan.sleep + baby.sleep + day, data = parenthood) ``` - Selekcija unaprijed (forward selection) ```r # Specificiraj osnovni model nul.model <- lm(dan.grump ~ 1, parenthood) ``` --- .tiny[ ```r # Selekcija unatrag step(object = full.model, direction = "backward") ``` ``` #> Start: AIC=299.08 #> dan.grump ~ dan.sleep + baby.sleep + day #> #> Df Sum of Sq RSS AIC #> - baby.sleep 1 0.1 1837.2 297.08 #> - day 1 1.6 1838.7 297.16 #> <none> 1837.1 299.08 #> - dan.sleep 1 4909.0 6746.1 427.15 #> #> Step: AIC=297.08 #> dan.grump ~ dan.sleep + day #> #> Df Sum of Sq RSS AIC #> - day 1 1.6 1838.7 295.17 #> <none> 1837.2 297.08 #> - dan.sleep 1 8103.0 9940.1 463.92 #> #> Step: AIC=295.17 #> dan.grump ~ dan.sleep #> #> Df Sum of Sq RSS AIC #> <none> 1838.7 295.17 #> - dan.sleep 1 8159.9 9998.6 462.50 ``` ``` #> #> Call: #> lm(formula = dan.grump ~ dan.sleep, data = parenthood) #> #> Coefficients: #> (Intercept) dan.sleep #> 125.956 -8.937 ``` ] --- - Selekcija unaprijed .tiny[ ```r # Definiraj selekcijsku funkciju (unaprijed) step(object = nul.model, direction = "forward", scope = dan.grump ~ dan.sleep + baby.sleep + day) ``` ``` #> Start: AIC=462.5 #> dan.grump ~ 1 #> #> Df Sum of Sq RSS AIC #> + dan.sleep 1 8159.9 1838.7 295.17 #> + baby.sleep 1 3202.7 6795.9 425.89 #> <none> 9998.6 462.50 #> + day 1 58.5 9940.1 463.92 #> #> Step: AIC=295.17 #> dan.grump ~ dan.sleep #> #> Df Sum of Sq RSS AIC #> <none> 1838.7 295.17 #> + day 1 1.55760 1837.2 297.08 #> + baby.sleep 1 0.02858 1838.7 297.16 ``` ``` #> #> Call: #> lm(formula = dan.grump ~ dan.sleep, data = parenthood) #> #> Coefficients: #> (Intercept) dan.sleep #> 125.956 -8.937 ``` ] --- - Usporedba dva modela ```r # Procjeni dva ugnježdena modela M0 <- lm( dan.grump ~ dan.sleep + day, parenthood ) M1 <- lm( dan.grump ~ dan.sleep + day + baby.sleep, parenthood ) # Usporedi modele AIC( M0, M1 ) ``` ``` #> df AIC #> M0 4 582.8681 #> M1 5 584.8646 ``` --- `$$F = \frac{(\mbox{SS}_{res}^{(0)} - \mbox{SS}_{res}^{(1)})/k}{(\mbox{SS}_{res}^{(1)})/(N-p-1)}$$` `$$\mbox{SS}_\Delta = \mbox{SS}_{res}^{(0)} - \mbox{SS}_{res}^{(1)}$$` `$$\mbox{SS}_\Delta = \sum_{i} \left( \hat{y}_i^{(1)} - \hat{y}_i^{(0)} \right)^2$$` ```r # Provedi hijerarhijsku regresiju anova(M0, M1) ``` ``` #> Analysis of Variance Table #> #> Model 1: dan.grump ~ dan.sleep + day #> Model 2: dan.grump ~ dan.sleep + day + baby.sleep #> Res.Df RSS Df Sum of Sq F Pr(>F) #> 1 97 1837.2 #> 2 96 1837.1 1 0.063688 0.0033 0.9541 ``` --- layout:false class: middle, inverse # Hvala na pažnji! <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> Zadnje predavanje u ovom kolegiju :-)