Probability, Statistics & Modeling II
Non-parametric tests & discrete data
Parametric assumptions
Also called: homoscedasticity
Can be tested with:
car::leveneTest(...)
lmtest::bptest(...)
Normally met when:
Can be tested with:
shapiro.test()
Assumption | Test | Potential fix |
---|---|---|
Independence of errors | Residual-predictor plot, correlation | Autocorrelation-sensitive methods |
Homoscedasticity | Levene’s Test, plot | Box-Cox transformation |
Normality | K-S test, Shapiro-Wilk test | Transforming data |
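The slides run these checks in R (car::leveneTest, shapiro.test). As a minimal cross-check sketch in Python, the same tests are available in scipy; the two simulated residual groups below are an assumed toy example, not data from the slides:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(0, 1, 100)   # simulated residuals, group A (toy data)
group_b = rng.normal(0, 1, 100)   # simulated residuals, group B (toy data)

# Homoscedasticity: Levene's test (analogue of car::leveneTest)
lev_stat, lev_p = stats.levene(group_a, group_b)

# Normality: Shapiro-Wilk test (analogue of shapiro.test)
sw_stat, sw_p = stats.shapiro(np.concatenate([group_a, group_b]))

# Large p-values mean "no evidence the assumption is violated"
print(round(lev_p, 3), round(sw_p, 3))
```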
Parametric case:
Non-parametric case
Problem: class rank violates the parametric assumptions (e.g. it is not normally distributed)
Spearman’s correlation test
Idea: rank both variables, then compute Pearson’s correlation on the ranks
cor.test(iq, class_rank, method = 'spearman')
## Warning in cor.test.default(iq, class_rank, method = "spearman"): Cannot
## compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: iq and class_rank
## S = 158810, p-value = 0.6421
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.04704638
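The rank-based idea can be verified directly: Spearman’s ρ is Pearson’s r computed on the ranks. A sketch in Python (scipy assumed; the simulated iq/class_rank data below are a stand-in, not the slides’ dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
iq = rng.normal(100, 15, 50)                          # toy IQ scores
class_rank = rng.integers(1, 200, 50).astype(float)   # toy ordinal ranks

# Spearman's rho directly...
rho, p = stats.spearmanr(iq, class_rank)

# ...equals Pearson's r on the (average-tied) ranks
rho_manual, _ = stats.pearsonr(stats.rankdata(iq), stats.rankdata(class_rank))

print(round(rho, 6) == round(rho_manual, 6))  # True
```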
Parametric case: Independent samples t-test
t.test(males, females)
##
## Welch Two Sample t-test
##
## data: males and females
## t = -5.0547, df = 197.52, p-value = 9.796e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -14.496541 -6.359724
## sample estimates:
## mean of x mean of y
## 98.14252 108.57065
Non-parametric case: alternative to the independent samples t-test
Problem: the variable “rank” violates the parametric assumptions (e.g. it is not normally distributed)
Wilcoxon Rank Sum Test
Idea: rank all observations across both groups, then compare the rank sums:
males | rank.males. | females | rank.females. |
---|---|---|---|
3 | 27.5 | 8 | 75.5 |
3 | 27.5 | 7 | 65.0 |
8 | 82.0 | 2 | 15.5 |
6 | 63.0 | 3 | 25.5 |
7 | 72.5 | 8 | 75.5 |
10 | 96.0 | 1 | 5.5 |
wilcox.test(males, females)
##
## Wilcoxon rank sum test with continuity correction
##
## data: males and females
## W = 4564.5, p-value = 0.2853
## alternative hypothesis: true location shift is not equal to 0
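The rank-sum logic in the table above can be reproduced by hand: pool both samples, rank them (ties get average ranks), and derive W from the rank sum of one group. A Python cross-check (scipy assumed; the simulated rank data are a toy stand-in):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
males = rng.integers(1, 11, 30).astype(float)     # toy "rank" scores
females = rng.integers(1, 11, 30).astype(float)

# Rank the pooled sample (ties get average ranks, as in the table above)
pooled_ranks = stats.rankdata(np.concatenate([males, females]))
r1 = pooled_ranks[: len(males)].sum()             # rank sum of first group

# U statistic from the rank sum (this equals the W that R reports)
u_manual = r1 - len(males) * (len(males) + 1) / 2

u_scipy, p = stats.mannwhitneyu(males, females, alternative="two-sided")
print(u_manual == u_scipy)  # True
```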
For the non-parametric “dependent t-test”, you’d have to use the ‘paired’ argument (same is in the t.test
function).
The Kruskal-Wallis Test
kruskal.test(deprivation ~ area, data = deprivation)
##
## Kruskal-Wallis rank sum test
##
## data: deprivation by area
## Kruskal-Wallis chi-squared = 40.92, df = 3, p-value = 6.8e-09
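Kruskal-Wallis extends the rank-sum idea to k groups: \(H = \frac{12}{N(N+1)}\sum_i R_i^2/n_i - 3(N+1)\). A hand computation on toy data without ties (so no tie correction is needed), checked against scipy’s implementation:

```python
import numpy as np
from scipy import stats

# Three toy groups with no ties
groups = [np.array([1., 2., 3.]), np.array([4., 5., 6.]), np.array([7., 8., 9.])]

pooled = np.concatenate(groups)
ranks = stats.rankdata(pooled)          # here simply 1..9
n_total = len(pooled)

# H from the per-group rank sums
idx, h = 0, 0.0
for g in groups:
    r_sum = ranks[idx: idx + len(g)].sum()
    h += r_sum ** 2 / len(g)
    idx += len(g)
h = 12 / (n_total * (n_total + 1)) * h - 3 * (n_total + 1)

h_scipy, p = stats.kruskal(*groups)
print(round(h, 6), round(h_scipy, 6))   # 7.2 7.2
```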
No anti-virus software | Anti-virus software | |
---|---|---|
Hacked | 300 | 250 |
Not hacked | 200 | 250 |
Association test
Chi-square test
No anti-virus software | Anti-virus software | Sum | |
---|---|---|---|
Hacked | 300 | 250 | 550 |
Not hacked | 200 | 250 | 450 |
Sum | 500 | 500 | 1000 |
Idea of the Chi-square test:
\(E_{i,j} = (total_i \times total_j)/total\)
No anti-virus software | Anti-virus software | Sum | |
---|---|---|---|
Hacked | ? | ? | 550 |
Not hacked | ? | ? | 450 |
Sum | 500 | 500 | 1000 |
Example: cell [hacked, no anti-virus software] -> cell [1,1]
\(E_{i,j} = (total_i \times total_j)/total\)
\(E_{1,1} = (550 \times 500)/1000 = 275\)
No anti-virus software | Anti-virus software | Sum | |
---|---|---|---|
Hacked | 275 | 275 | 550 |
Not hacked | 225 | 225 | 450 |
Sum | 500 | 500 | 1000 |
\(\chi^2 = \sum\frac{(O - E)^2}{E}\)
For cell[2,1]:
\(cell[2,1] = \frac{(200-225)^2}{225} = \frac{(-25)^2}{225} = \frac{625}{225} = 2.78\)
Summing over all four cells: \(\chi^2 = 2.27 + 2.27 + 2.78 + 2.78 = 10.10\)
No anti-virus software | Anti-virus software | |
---|---|---|
Hacked | 300 (275) | 250 (275) |
Not hacked | 200 (225) | 250 (225) |
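The expected counts and the resulting χ² can be cross-checked outside R; here is the same arithmetic as a minimal sketch in Python (numpy/scipy assumed):

```python
import numpy as np
from scipy import stats

observed = np.array([[300, 250],    # hacked
                     [200, 250]])   # not hacked

# E_ij = row total * column total / grand total
row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row @ col / observed.sum()       # [[275, 275], [225, 225]]

chi2_manual = ((observed - expected) ** 2 / expected).sum()

chi2, p, dof, exp_scipy = stats.chi2_contingency(observed, correction=False)
print(round(chi2_manual, 3), round(chi2, 3), dof)  # 10.101 10.101 1
```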
##
## Pearson's Chi-squared test
##
## data: data1
## X-squared = 10.101, df = 1, p-value = 0.001482
There is a significant association between being hacked (hacked vs not hacked) and the use of anti-virus software (no anti-virus software vs anti-virus software), χ²(1) = 10.10, p = .001.
But where does this association stem from? What drives it?
No anti-virus software | Anti-virus software | |
---|---|---|
Hacked | 300 | 250 |
Not hacked | 200 | 250 |
Interpret as:
knitr::kable(chisq.test(data1, correct = F)$stdres)
No anti-virus software | Anti-virus software | |
---|---|---|
Hacked | 3.178209 | -3.178209 |
Not hacked | -3.178209 | 3.178209 |
No anti-virus software | Anti-virus software | |
---|---|---|
Hacked | 3.18 (O > E) | -3.18 (O < E) |
Not hacked | -3.18 (O < E) | 3.18 (O > E) |
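The standardized (adjusted) residuals that chisq.test(...)$stdres returns are \((O - E)/\sqrt{E(1 - p_{row})(1 - p_{col})}\); a Python cross-check of the table above (numpy assumed):

```python
import numpy as np

observed = np.array([[300., 250.],
                     [200., 250.]])

n = observed.sum()
row_p = observed.sum(axis=1, keepdims=True) / n     # 0.55, 0.45
col_p = observed.sum(axis=0, keepdims=True) / n     # 0.50, 0.50
expected = row_p * col_p * n

# Adjusted standardized residuals; interpret like z-scores (|z| > 1.96 notable)
stdres = (observed - expected) / np.sqrt(expected * (1 - row_p) * (1 - col_p))
print(np.round(stdres, 2))   # cells are approx. +/-3.18, matching the R output
```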
Non AV | standard AV | premium AV | |
---|---|---|---|
No access | 200 | 250 | 150 |
Files stolen | 400 | 300 | 200 |
Ransomware | 350 | 150 | 150 |
knitr::kable(addmargins(data2, c(1,2)))
Non AV | standard AV | premium AV | Sum | |
---|---|---|---|---|
No access | 200 | 250 | 150 | 600 |
Files stolen | 400 | 300 | 200 | 900 |
Ransomware | 350 | 150 | 150 | 650 |
Sum | 950 | 700 | 500 | 2150 |
Same steps:
Non AV | standard AV | premium AV | |
---|---|---|---|
No access | 15.99 | 15.29 | 0.78 |
Files stolen | 0.01 | 0.17 | 0.41 |
Ransomware | 13.73 | 17.95 | 0.01 |
##
## Pearson's Chi-squared test
##
## data: data2
## X-squared = 64.344, df = 4, p-value = 3.537e-13
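The same steps scale to any r×c table; cross-checking the 3×3 example and its per-cell contributions in Python (scipy assumed):

```python
import numpy as np
from scipy import stats

observed = np.array([[200, 250, 150],    # no access
                     [400, 300, 200],    # files stolen
                     [350, 150, 150]])   # ransomware

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)

# Per-cell contributions (O - E)^2 / E, as in the table above
contributions = (observed - expected) ** 2 / expected
print(round(chi2, 2), dof)          # 64.34 4
print(np.round(contributions, 2))   # the largest cells drive the association
```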
Follow-up tests for the interpretation.
-> What drives the significant association?
knitr::kable(round(chisq.test(data2, correct = F)$stdres, 2))
Non AV | standard AV | premium AV | |
---|---|---|---|
No access | -6.30 | 5.61 | 1.19 |
Files stolen | 0.20 | 0.65 | -0.96 |
Ransomware | 5.94 | -6.18 | -0.13 |
Remember: interpretation like z-scores
Non AV | standard AV | premium AV | |
---|---|---|---|
No access | (O < E) | (O > E) | (O ≈ E) |
Files stolen | (O ≈ E) | (O ≈ E) | (O ≈ E) |
Ransomware | (O > E) | (O < E) | (O ≈ E) |
The significant association between … was driven by four significant deviations between the observed and expected values.
## vandalised yes no
## area natural surveillance
## urban yes 911 44
## no 538 456
## suburb yes 3 2
## no 43 279
Simple extension of the r*c calculation?
## , , area = urban
##
## natural surveillance
## vandalised yes no
## yes 0.6287095 0.3712905
## no 0.0880000 0.9120000
##
## , , area = suburb
##
## natural surveillance
## vandalised yes no
## yes 0.065217391 0.9347826
## no 0.007117438 0.9928826
If the data were independent…
then the expected count = joint prob. * n, where
joint prob. = product of the marginal probabilities
\(\mu_{i,j} = n \times marginal_i \times marginal_j\)
No anti-virus software | Anti-virus software | Sum | |
---|---|---|---|
Hacked | 0.3 | 0.25 | 0.55 |
Not hacked | 0.2 | 0.25 | 0.45 |
Sum | 0.5 | 0.50 | 1.00 |
\(cell[1,2] = 0.50*0.55*1000 = 275\)
\(cell[2,2] = 0.50*0.45*1000 = 225\)
Log transformation
\(\mu_{i,j} = n \times marginal_i \times marginal_j\)
is equivalent to
\(\log(\mu_{i,j}) = \log(n) + \log(marginal_i) + \log(marginal_j)\)
Hence: “loglinear” model
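Numerically the multiplicative and additive forms agree, which is the point of the name; a toy check using the 2×2 marginals from earlier:

```python
import math

n = 1000
p_hacked = 0.55          # row marginal probability
p_no_av = 0.50           # column marginal probability

# Multiplicative form: mu_ij = n * marginal_i * marginal_j
mu = n * p_hacked * p_no_av                      # 275.0

# Additive form on the log scale
log_mu = math.log(n) + math.log(p_hacked) + math.log(p_no_av)

print(mu, round(math.exp(log_mu), 10))           # both 275.0
```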
## vandalised natural.surveillance area Freq
## 1 yes yes urban 911
## 2 no yes urban 44
## 3 yes no urban 538
## 4 no no urban 456
## 5 yes yes suburb 3
## 6 no yes suburb 2
## 7 yes no suburb 43
## 8 no no suburb 279
indep_model = glm(formula = Freq ~ vandalised + natural.surveillance + area
, data = data3_
, family = poisson)
##
## Call:
## glm(formula = Freq ~ vandalised + natural.surveillance + area,
## family = poisson, data = data3_)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7 8
## 14.522 -17.683 -7.817 3.426 -12.440 -8.832 -8.436 19.639
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.29154 0.03667 171.558 < 2e-16 ***
## vandalisedno -0.64931 0.04415 -14.707 < 2e-16 ***
## natural.surveillanceno 0.31542 0.04244 7.431 1.08e-13 ***
## areasuburb -1.78511 0.05976 -29.872 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2851.5 on 7 degrees of freedom
## Residual deviance: 1286.0 on 4 degrees of freedom
## AIC: 1343.1
##
## Number of Fisher Scoring iterations: 6
pchisq(1286, 4, lower.tail = F)
## [1] 3.610223e-277
Reject H0 that the model is a good representation.
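The deviance goodness-of-fit test compares the residual deviance to a χ² distribution with the residual degrees of freedom; R’s pchisq(..., lower.tail = F) translates directly to scipy’s survival function. A sketch:

```python
from scipy import stats

# Independence model: residual deviance 1286 on 4 df -> model adequacy rejected
p_indep = stats.chi2.sf(1286, df=4)

# For comparison, a deviance of 0.37 on 1 df would NOT be rejected
p_small = stats.chi2.sf(0.37, df=1)

print(p_indep < 0.001, round(p_small, 3))   # True 0.543
```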
knitr::kable(cbind(indep_model$data, round(fitted(indep_model), 2)))
vandalised | natural.surveillance | area | Freq | round(fitted(indep_model), 2) |
---|---|---|---|---|
yes | yes | urban | 911 | 539.98 |
no | yes | urban | 44 | 282.09 |
yes | no | urban | 538 | 740.23 |
no | no | urban | 456 | 386.70 |
yes | yes | suburb | 3 | 90.60 |
no | yes | suburb | 2 | 47.33 |
yes | no | suburb | 43 | 124.19 |
no | no | suburb | 279 | 64.88 |
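The independence-model fitted values are just \(n \cdot p(vandalised) \cdot p(surveillance) \cdot p(area)\); reproducing the first fitted cell by hand in Python (numpy assumed):

```python
import numpy as np

# 2x2x2 counts, axes: (vandalised yes/no, surveillance yes/no, area urban/suburb)
counts = np.array([[[911., 3.], [538., 43.]],
                   [[44., 2.], [456., 279.]]])
n = counts.sum()                               # 2276

p_vand = counts.sum(axis=(1, 2)) / n           # marginal of vandalised
p_surv = counts.sum(axis=(0, 2)) / n           # marginal of surveillance
p_area = counts.sum(axis=(0, 1)) / n           # marginal of area

# Fitted count under mutual independence for [yes, yes, urban]
fitted = n * p_vand[0] * p_surv[0] * p_area[0]
print(round(fitted, 2))                        # 539.98, matching the glm fit
```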
Coefficient for “vandalised=no”
exp(-0.64931)
## [1] 0.5224061
The estimated odds of a playground not being vandalised (vs. being vandalised) are 0.52:1, regardless of whether there was natural surveillance and regardless of the area.
Also called: the saturated model
full_model = glm(formula = Freq ~ vandalised * natural.surveillance * area
, data = data3_
, family = poisson)
What do you expect?
knitr::kable(cbind(full_model$data, round(fitted(full_model), 2)))
vandalised | natural.surveillance | area | Freq | round(fitted(full_model), 2) |
---|---|---|---|---|
yes | yes | urban | 911 | 911 |
no | yes | urban | 44 | 44 |
yes | no | urban | 538 | 538 |
no | no | urban | 456 | 456 |
yes | yes | suburb | 3 | 3 |
no | yes | suburb | 2 | 2 |
yes | no | suburb | 43 | 43 |
no | no | suburb | 279 | 279 |
step(full_model)
## Start: AIC=65.04
## Freq ~ vandalised * natural.surveillance * area
##
## Df Deviance AIC
## - vandalised:natural.surveillance:area 1 0.37399 63.417
## <none> 0.00000 65.043
##
## Step: AIC=63.42
## Freq ~ vandalised + natural.surveillance + area + vandalised:natural.surveillance +
## vandalised:area + natural.surveillance:area
##
## Df Deviance AIC
## <none> 0.37 63.42
## - natural.surveillance:area 1 92.02 153.06
## - vandalised:area 1 187.75 248.80
## - vandalised:natural.surveillance 1 497.37 558.41
##
## Call: glm(formula = Freq ~ vandalised + natural.surveillance + area +
## vandalised:natural.surveillance + vandalised:area + natural.surveillance:area,
## family = poisson, data = data3_)
##
## Coefficients:
## (Intercept) vandalisedno
## 6.8139 -3.0158
## natural.surveillanceno areasuburb
## -0.5249 -5.5283
## vandalisedno:natural.surveillanceno vandalisedno:areasuburb
## 2.8479 2.0545
## natural.surveillanceno:areasuburb
## 2.9860
##
## Degrees of Freedom: 7 Total (i.e. Null); 1 Residual
## Null Deviance: 2851
## Residual Deviance: 0.374 AIC: 63.42
best_model = step(full_model, trace = 0)
summary(best_model)
##
## Call:
## glm(formula = Freq ~ vandalised + natural.surveillance + area +
## vandalised:natural.surveillance + vandalised:area + natural.surveillance:area,
## family = poisson, data = data3_)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7
## 0.02044 -0.09256 -0.02658 0.02890 -0.33428 0.49134 0.09452
## 8
## -0.03690
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.81387 0.03313 205.699 < 2e-16
## vandalisedno -3.01575 0.15162 -19.891 < 2e-16
## natural.surveillanceno -0.52486 0.05428 -9.669 < 2e-16
## areasuburb -5.52827 0.45221 -12.225 < 2e-16
## vandalisedno:natural.surveillanceno 2.84789 0.16384 17.382 < 2e-16
## vandalisedno:areasuburb 2.05453 0.17406 11.803 < 2e-16
## natural.surveillanceno:areasuburb 2.98601 0.46468 6.426 1.31e-10
##
## (Intercept) ***
## vandalisedno ***
## natural.surveillanceno ***
## areasuburb ***
## vandalisedno:natural.surveillanceno ***
## vandalisedno:areasuburb ***
## natural.surveillanceno:areasuburb ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2851.46098 on 7 degrees of freedom
## Residual deviance: 0.37399 on 1 degrees of freedom
## AIC: 63.417
##
## Number of Fisher Scoring iterations: 4
Can we reject the H0 of model adequacy?
pchisq(0.37, 1, lower.tail = F)
## [1] 0.5430043
No!
vandalised | natural.surveillance | area | Freq | round(fitted(best_model), 2) |
---|---|---|---|---|
yes | yes | urban | 911 | 910.38 |
no | yes | urban | 44 | 44.62 |
yes | no | urban | 538 | 538.62 |
no | no | urban | 456 | 455.38 |
yes | yes | suburb | 3 | 3.62 |
no | yes | suburb | 2 | 1.38 |
yes | no | suburb | 43 | 42.38 |
no | no | suburb | 279 | 279.62 |
coefficients(best_model)
## (Intercept) vandalisedno
## 6.8138656 -3.0157544
## natural.surveillanceno areasuburb
## -0.5248611 -5.5282675
## vandalisedno:natural.surveillanceno vandalisedno:areasuburb
## 2.8478892 2.0545341
## natural.surveillanceno:areasuburb
## 2.9860144
vandalisedno:areasuburb
2.0545341
exp(2.055)
## [1] 7.806838
Exponentiated interaction ==> OR
Playgrounds in the suburb have estimated odds of not being vandalised that are 7.81 times the estimated odds for playgrounds in urban areas. This is independent of “natural surveillance”.
Follow the steps here (https://data.library.virginia.edu/an-introduction-to-loglinear-models/)
Next week: Reading week
Week 6: Open Science (lecture + tutorial)