Probability, Statistics & Modeling II

Module recap and Q&A

- Model comparison
- The loglinear model
- Interpretation

- you can model an outcome variable in many ways
- \(income \sim age + gender\)
- \(income \sim age + gender + education\)
- \(income \sim ethnicity + familystatus\)

- Which model explains the data (i.e. the outcome) better?

How to do that comparison?

One model is nested in another if you can always obtain the first model by constraining some of the parameters of the second model.

Nice explanation in this SO answer

Model 1: \(Y \sim x1 + x2 + x1:x2 + x3\)

Model 2: \(Y \sim x1 + x2\)

Can we constrain the parameters of Model 1 to obtain Model 2?

Model 1: \(Y \sim \beta_1x1 + \beta_2x2 + \beta_3(x1:x2) + \beta_4x3\)

Model 2: \(Y \sim \beta_1x1 + \beta_2x2\)

Can we constrain the parameters of Model 1 to obtain Model 2?

–> Yes: set \(\beta_3 = \beta_4 = 0\) so that \(Model 1 = Model 2\)

Model 1: \(Y \sim x1 + x2\)

Model 2: \(Y \sim x1\)

Nested?

Model 1: \(Y \sim x1 + x2\)

Model 2: \(Y \sim x1 + x3\)

Nested?

Model 1: \(Y \sim x1 + x2 + x3 + x4\)

Model 2: \(Y \sim x5\)

Nested?

\(income \sim age + gender\)

\(income \sim age + gender + education\)

Nested?

M1: \(income \sim age + gender\)

M2: \(income \sim age + gender + education\)

M3: \(income \sim ethnicity + familystatus\)

\(income \sim age + gender\)

\(income \sim age + gender + education\)

In essence: do we really need the additional predictor \(education\)?

Nested structure allows for formal statistical tests!

- if nested, we can test whether a simpler model is significantly worse than a more complex model
- if the model comparison is sign., then choose the more complex model
- if the test is not sign., choose the simpler one (Ockham’s razor principle)

\(income \sim age + gender + education\)

vs.

\(income \sim ethnicity + familystatus\)

No formal test possible?

- for non-nested models, compare goodness of fit indices
- e.g. sum of squared residuals, mean absolute error, …
- other fit indices: AIC, Log-likelihood, BIC

In essence: you have to make a judgment without formal statistical test.

No anti-virus software | Anti-virus software | Sum | |
---|---|---|---|

Hacked | 300 | 250 | 550 |

Not hacked | 200 | 250 | 450 |

Sum | 500 | 500 | 1000 |

Idea of the Chi-square test:

- Observed values \(O\)
- Expected values (if there were no association) \(E\)
- rows: \(i\)
- columns: \(j\)

\(E_{i,j} = \frac{(total_i * total_j)}{total}\)

\(\chi^2 = \sum\frac{(O - E^2)}{E}\)

- Null-hypothesis: there is no association between the two factors
- Alt. hypothesis: there is a significant association

Thus: if sign. –> there is a sign. association between r and c

```
## vandalised yes no
## area natural surveillance
## urban yes 911 44
## no 538 456
## suburb yes 3 2
## no 43 279
```

- GLM with link function for count data
- count data aptly modelled as a Poisson distrubuted variable

- we build the “independence” model
- no relationships between variables

- we assess the \(H_0\) of model adequacy
- if significant: model not adequate for the data
- if non-sign.: model is considered adequate

- we build more complex models
- e.g. with dependencies (i.e. interactions) between variables

- remember, we’re modelling counts that come about due to a combination of factors
- thus: the saturated (= full) model will explain the data perfectly

```
example = array(c(40, 70, 80, 30), dim=c(2,2))
dimnames(example) = list('gender' = c('male', 'female')
, 'UK' = c('yes', 'no')
)
ftable(example)
```

```
## UK yes no
## gender
## male 40 80
## female 70 30
```

`(exampledata = as.data.frame(as.table(example)))`

```
## gender UK Freq
## 1 male yes 40
## 2 female yes 70
## 3 male no 80
## 4 female no 30
```

```
indep_model = glm(formula = Freq ~ gender + UK
, data = exampledata
, family = poisson)
```

Next:

- look at \(H_0\) of model adequacy
- look at predicted values

`summary(indep_model)`

```
##
## Call:
## glm(formula = Freq ~ gender + UK, family = poisson, data = exampledata)
##
## Deviance Residuals:
## 1 2 3 4
## -2.750 2.666 2.455 -3.058
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.094e+00 1.135e-01 36.078 <2e-16 ***
## genderfemale -1.823e-01 1.354e-01 -1.347 0.178
## UKno 7.856e-12 1.348e-01 0.000 1.000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 31.869 on 3 degrees of freedom
## Residual deviance: 30.048 on 1 degrees of freedom
## AIC: 59.135
##
## Number of Fisher Scoring iterations: 4
```

`pchisq(30.048, 1, lower.tail = F)`

`## [1] 4.21483e-08`

gender | UK | Freq | round(fitted(indep_model), 2) |
---|---|---|---|

male | yes | 40 | 60 |

female | yes | 70 | 50 |

male | no | 80 | 60 |

female | no | 30 | 50 |

Next step: full model

```
full_model = glm(formula = Freq ~ gender*UK
, data = exampledata
, family = poisson)
```

gender | UK | Freq | round(fitted(full_model), 2) |
---|---|---|---|

male | yes | 40 | 40 |

female | yes | 70 | 70 |

male | no | 80 | 80 |

female | no | 30 | 30 |

```
## UK yes no
## gender
## male 40 80
## female 70 30
```

There is an association between gender and “UK”.

`coefficients(full_model)`

```
## (Intercept) genderfemale UKno genderfemale:UKno
## 3.6888795 0.5596158 0.6931472 -1.5404450
```

`exp(coefficients(full_model))`

```
## (Intercept) genderfemale UKno genderfemale:UKno
## 40.0000000 1.7500000 2.0000000 0.2142857
```

Remeber: we’re modelling the log (hence log-linear model)

- UK_no: 2.00
- The odds of a person being from the UK are 1:2.00, regardless of their gender.

- gender_female: 1.75
- The odds of a person being female are 1.75:1, regardless of their UK status.

- gender_female:UK_no: 0.21
- People that are female have estimated odds of not being from the UK is 0.21 times the odds for males of not being from the UK.

- extends Chisquare idea to mutliple dimensions
- brings in the modelling aspect
- aim: find a model that is simpler than the full model
- core: simplest model to explain the data

General strategy:

- there’s always a hypothesis
- make the hypothesis explicit
- every RQ must come down to one or multiple hypotheses

- difference in means 2 groups (t-test, rank sum test)
- difference in means 2+ groups (ANOVA, Kruskal-Wallis test)
- predictor combinations to explain an outcome (model comparison tests)
- linear models
- logistic regression models
- log linear models

When you ran your test/model:

- ask yourself: what did I test?
- which hypothesis was behind the test?
- what does the hyp. testing result reveal?
- how does this feed back to my RQ?

- forgetting to re-transform coefficients in logistic regression or loglinear models
- forgetting the unit of interpretation of coefficients
- forgetting the direction of effects
- attributing causality to correlational data

**Your interpretation becomes very difficult if you do not know the question you want to answer.**

Always start with the question!

CLASS TEST

- Tuesday, 19 March 2019
- 10am-12pm
- 60 min
- 10 questions (5 MC, 5 open)