Probability, Statistics & Modeling II
Module recap and Q&A
How to do that comparison?
One model is nested in another if you can always obtain the first model by constraining some of the parameters of the second model.
Nice explanation in this SO answer
Model 1: \(Y \sim x1 + x2 + x1:x2 + x3\)
Model 2: \(Y \sim x1 + x2\)
Can we constrain the parameters of Model 1 to obtain Model 2?
Model 1: \(Y \sim \beta_1x1 + \beta_2x2 + \beta_3(x1:x2) + \beta_4x3\)
Model 2: \(Y \sim \beta_1x1 + \beta_2x2\)
Can we constrain the parameters of Model 1 to obtain Model 2?
–> Yes: set \(\beta_3 = \beta_4 = 0\) so that \(Model 1 = Model 2\)
Model 1: \(Y \sim x1 + x2\)
Model 2: \(Y \sim x1\)
Nested?
Model 1: \(Y \sim x1 + x2\)
Model 2: \(Y \sim x1 + x3\)
Nested?
Model 1: \(Y \sim x1 + x2 + x3 + x4\)
Model 2: \(Y \sim x5\)
Nested?
\(income \sim age + gender\)
\(income \sim age + gender + education\)
Nested?
M1: \(income \sim age + gender\)
M2: \(income \sim age + gender + education\)
M3: \(income \sim ethnicity + familystatus\)
\(income \sim age + gender\)
\(income \sim age + gender + education\)
In essence: do we really need the additional predictor \(education\)?
Nested structure allows for formal statistical tests!
\(income \sim age + gender + education\)
vs.
\(income \sim ethnicity + familystatus\)
No formal test possible?
In essence: you have to make a judgment without formal statistical test.
No anti-virus software | Anti-virus software | Sum | |
---|---|---|---|
Hacked | 300 | 250 | 550 |
Not hacked | 200 | 250 | 450 |
Sum | 500 | 500 | 1000 |
Idea of the Chi-square test:
\(E_{i,j} = \frac{(total_i * total_j)}{total}\)
\(\chi^2 = \sum\frac{(O - E^2)}{E}\)
Thus: if sign. –> there is a sign. association between r and c
## vandalised yes no
## area natural surveillance
## urban yes 911 44
## no 538 456
## suburb yes 3 2
## no 43 279
example = array(c(40, 70, 80, 30), dim=c(2,2))
dimnames(example) = list('gender' = c('male', 'female')
, 'UK' = c('yes', 'no')
)
ftable(example)
## UK yes no
## gender
## male 40 80
## female 70 30
(exampledata = as.data.frame(as.table(example)))
## gender UK Freq
## 1 male yes 40
## 2 female yes 70
## 3 male no 80
## 4 female no 30
indep_model = glm(formula = Freq ~ gender + UK
, data = exampledata
, family = poisson)
Next:
summary(indep_model)
##
## Call:
## glm(formula = Freq ~ gender + UK, family = poisson, data = exampledata)
##
## Deviance Residuals:
## 1 2 3 4
## -2.750 2.666 2.455 -3.058
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.094e+00 1.135e-01 36.078 <2e-16 ***
## genderfemale -1.823e-01 1.354e-01 -1.347 0.178
## UKno 7.856e-12 1.348e-01 0.000 1.000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 31.869 on 3 degrees of freedom
## Residual deviance: 30.048 on 1 degrees of freedom
## AIC: 59.135
##
## Number of Fisher Scoring iterations: 4
pchisq(30.048, 1, lower.tail = F)
## [1] 4.21483e-08
gender | UK | Freq | round(fitted(indep_model), 2) |
---|---|---|---|
male | yes | 40 | 60 |
female | yes | 70 | 50 |
male | no | 80 | 60 |
female | no | 30 | 50 |
Next step: full model
full_model = glm(formula = Freq ~ gender*UK
, data = exampledata
, family = poisson)
gender | UK | Freq | round(fitted(full_model), 2) |
---|---|---|---|
male | yes | 40 | 40 |
female | yes | 70 | 70 |
male | no | 80 | 80 |
female | no | 30 | 30 |
## UK yes no
## gender
## male 40 80
## female 70 30
There is an association between gender and “UK”.
coefficients(full_model)
## (Intercept) genderfemale UKno genderfemale:UKno
## 3.6888795 0.5596158 0.6931472 -1.5404450
exp(coefficients(full_model))
## (Intercept) genderfemale UKno genderfemale:UKno
## 40.0000000 1.7500000 2.0000000 0.2142857
Remeber: we’re modelling the log (hence log-linear model)
General strategy:
When you ran your test/model:
Your interpretation becomes very difficult if you do not know the question you want to answer.
Always start with the question!
CLASS TEST