# PSM 2

### 12 March 2019

Probability, Statistics & Modeling II

## Today

Module recap and Q&A

• Model comparison
• The loglinear model
• Interpretation

## When do we need it?

• you can model an outcome variable in many ways
• $$income \sim age + gender$$
• $$income \sim age + gender + education$$
• $$income \sim ethnicity + familystatus$$
• Which model explains the data (i.e. the outcome) better?

How to do that comparison?

## Nested vs unnested models

One model is nested in another if you can always obtain the first model by constraining some of the parameters of the second model.

Nice explanation in this SO answer

## Nested vs unnested models

Model 1: $$Y \sim x1 + x2 + x1:x2 + x3$$

Model 2: $$Y \sim x1 + x2$$

Can we constrain the parameters of Model 1 to obtain Model 2?

## Model parameters

Model 1: $$Y \sim \beta_1x1 + \beta_2x2 + \beta_3(x1:x2) + \beta_4x3$$

Model 2: $$Y \sim \beta_1x1 + \beta_2x2$$

Can we constrain the parameters of Model 1 to obtain Model 2?

–> Yes: set $$\beta_3 = \beta_4 = 0$$ so that $$Model 1 = Model 2$$

## Nested vs unnested models

Model 1: $$Y \sim x1 + x2$$

Model 2: $$Y \sim x1$$

Nested?

## Nested vs unnested models

Model 1: $$Y \sim x1 + x2$$

Model 2: $$Y \sim x1 + x3$$

Nested?

## Nested vs unnested models

Model 1: $$Y \sim x1 + x2 + x3 + x4$$

Model 2: $$Y \sim x5$$

Nested?

## Nested models

$$income \sim age + gender$$

$$income \sim age + gender + education$$

Nested?

## Nested models

M1: $$income \sim age + gender$$

M2: $$income \sim age + gender + education$$

M3: $$income \sim ethnicity + familystatus$$

## Nested models

$$income \sim age + gender$$

$$income \sim age + gender + education$$

In essence: do we really need the additional predictor $$education$$?

Nested structure allows for formal statistical tests!

## Formal model comparison logic

• if nested, we can test whether a simpler model is significantly worse than a more complex model
• if the model comparison is sign., then choose the more complex model
• if the test is not sign., choose the simpler one (Ockham’s razor principle)

## Non-nested models

$$income \sim age + gender + education$$

vs.

$$income \sim ethnicity + familystatus$$

No formal test possible?

## Non-nested models

• for non-nested models, compare goodness of fit indices
• e.g. sum of squared residuals, mean absolute error, …
• other fit indices: AIC, Log-likelihood, BIC

In essence: you have to make a judgment without formal statistical test.

## A step back:

No anti-virus software Anti-virus software Sum
Hacked 300 250 550
Not hacked 200 250 450
Sum 500 500 1000

## For r-by-c tables

Idea of the Chi-square test:

• Observed values $$O$$
• Expected values (if there were no association) $$E$$
• rows: $$i$$
• columns: $$j$$

$$E_{i,j} = \frac{(total_i * total_j)}{total}$$

## Chi-square test for r-by-c

$$\chi^2 = \sum\frac{(O - E^2)}{E}$$

• Null-hypothesis: there is no association between the two factors
• Alt. hypothesis: there is a significant association

Thus: if sign. –> there is a sign. association between r and c

## More dimensions?

##                             vandalised yes  no
## area   natural surveillance
## urban  yes                             911  44
##        no                              538 456
## suburb yes                               3   2
##        no                               43 279

## The Log-Linear Model

• GLM with link function for count data
• count data aptly modelled as a Poisson distrubuted variable

## Stepwise

1. we build the “independence” model
• no relationships between variables
2. we assess the $$H_0$$ of model adequacy
• if significant: model not adequate for the data
• if non-sign.: model is considered adequate
3. we build more complex models
• e.g. with dependencies (i.e. interactions) between variables

## Stepwise

• remember, we’re modelling counts that come about due to a combination of factors
• thus: the saturated (= full) model will explain the data perfectly

## Example

example = array(c(40, 70, 80, 30), dim=c(2,2))
dimnames(example) = list('gender' = c('male', 'female')
, 'UK' = c('yes', 'no')
)
ftable(example)
##        UK yes no
## gender
## male       40 80
## female     70 30

## Example

(exampledata = as.data.frame(as.table(example)))
##   gender  UK Freq
## 1   male yes   40
## 2 female yes   70
## 3   male  no   80
## 4 female  no   30

## Example

indep_model = glm(formula = Freq ~ gender + UK
, data = exampledata
, family = poisson)

Next:

• look at $$H_0$$ of model adequacy
• look at predicted values

summary(indep_model)
##
## Call:
## glm(formula = Freq ~ gender + UK, family = poisson, data = exampledata)
##
## Deviance Residuals:
##      1       2       3       4
## -2.750   2.666   2.455  -3.058
##
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)   4.094e+00  1.135e-01  36.078   <2e-16 ***
## genderfemale -1.823e-01  1.354e-01  -1.347    0.178
## UKno          7.856e-12  1.348e-01   0.000    1.000
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 31.869  on 3  degrees of freedom
## Residual deviance: 30.048  on 1  degrees of freedom
## AIC: 59.135
##
## Number of Fisher Scoring iterations: 4

pchisq(30.048, 1, lower.tail = F)
## [1] 4.21483e-08

## Predicted values

gender UK Freq round(fitted(indep_model), 2)
male yes 40 60
female yes 70 50
male no 80 60
female no 30 50

## Example

Next step: full model

full_model = glm(formula = Freq ~ gender*UK
, data = exampledata
, family = poisson)

## Example

gender UK Freq round(fitted(full_model), 2)
male yes 40 40
female yes 70 70
male no 80 80
female no 30 30

## Conclusion

##        UK yes no
## gender
## male       40 80
## female     70 30

There is an association between gender and “UK”.

## Making sense of the coefficients

coefficients(full_model)
##       (Intercept)      genderfemale              UKno genderfemale:UKno
##         3.6888795         0.5596158         0.6931472        -1.5404450
exp(coefficients(full_model))
##       (Intercept)      genderfemale              UKno genderfemale:UKno
##        40.0000000         1.7500000         2.0000000         0.2142857

Remeber: we’re modelling the log (hence log-linear model)

• UK_no: 2.00
• The odds of a person being from the UK are 1:2.00, regardless of their gender.
• gender_female: 1.75
• The odds of a person being female are 1.75:1, regardless of their UK status.
• gender_female:UK_no: 0.21
• People that are female have estimated odds of not being from the UK is 0.21 times the odds for males of not being from the UK.

## Log-linear model

• extends Chisquare idea to mutliple dimensions
• brings in the modelling aspect
• aim: find a model that is simpler than the full model
• core: simplest model to explain the data

## Interpretation of results

General strategy:

• there’s always a hypothesis
• make the hypothesis explicit
• every RQ must come down to one or multiple hypotheses

## Hypotheses

• difference in means 2 groups (t-test, rank sum test)
• difference in means 2+ groups (ANOVA, Kruskal-Wallis test)
• predictor combinations to explain an outcome (model comparison tests)
• linear models
• logistic regression models
• log linear models

## Interpretation strategy

• ask yourself: what did I test?
• which hypothesis was behind the test?
• what does the hyp. testing result reveal?
• how does this feed back to my RQ?

## Pitfalls

• forgetting to re-transform coefficients in logistic regression or loglinear models
• forgetting the unit of interpretation of coefficients
• forgetting the direction of effects
• attributing causality to correlational data

Your interpretation becomes very difficult if you do not know the question you want to answer.