Probability, Statistics & Modeling II
The GLM for group mean comparisons
income ~ age
income ~ age*gender
failed ~ hours_spent
Mean:
## Analyst Manager
## 82.92 101.02
SD:
## Analyst Manager
## 18.11735 27.84028
Is fraud by managers more damaging than fraud by analysts?
Yes, because:
## Analyst Manager
## 82.92 101.02
And: Manager > Analyst
Do the samples stem from different distributions?
mean(df$damage[df$role == 'Manager']) - mean(df$damage[df$role == 'Analyst'])
## [1] 18.1
Wanted: a value that expresses the frequentist probability of observing a mean difference of 18.10 (or a more extreme one) if the null hypothesis were true.
–> called the p-value
compared against a threshold (α): the Type I error rate we deem acceptable
p < .05
p < .005
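The logic behind the p-value can be sketched with a small simulation (invented group size and distributions, not the lecture data): if both groups are drawn from one common distribution (i.e., H0 is true), how often does a mean difference of 18.10 or more extreme occur by chance?

```r
# Sketch: simulated null distribution of the mean difference.
# Group size (50) and the shared mean/SD are assumptions for illustration.
set.seed(1)
null_diffs <- replicate(10000, {
  x <- rnorm(50, mean = 85, sd = 23)  # under H0, both groups share
  y <- rnorm(50, mean = 85, sd = 23)  # one and the same distribution
  mean(y) - mean(x)
})
# two-sided: proportion of simulated differences at least as extreme as 18.10
mean(abs(null_diffs) >= 18.10)
```

With these settings the proportion comes out very close to zero, mirroring the tiny p-value the t-test reports below.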
For two groups: t-test
(non-parametric tests –> next week)
t.test(df$damage ~ df$role)
##
## Welch Two Sample t-test
##
## data: df$damage by df$role
## t = -3.8531, df = 84.191, p-value = 0.0002269
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -27.441161 -8.758839
## sample estimates:
## mean in group Analyst mean in group Manager
## 82.92 101.02
t = -3.8531, df = 84.191, p-value = 0.0002269
The damage in $ lost was higher for managers (M = 101.02, SD = 27.84) than for analysts (M = 82.92, SD = 18.12), t(84.19) = -3.85, p < .001.
Note: always three decimals for the p-value, unless p < .001.
(more in week 7)
Alt. hypothesis: the means are affected by the factor Position (3 levels)
Total variance = explained variance + unexplained variance
damage | grandmean | squared_diff |
---|---|---|
60 | 84.80667 | 615.37071 |
119 | 84.80667 | 1169.18404 |
124 | 84.80667 | 1536.11738 |
58 | 84.80667 | 718.59738 |
79 | 84.80667 | 33.71738 |
90 | 84.80667 | 26.97071 |
121 | 84.80667 | 1309.95738 |
108 | 84.80667 | 537.93071 |
130 | 84.80667 | 2042.43738 |
117 | 84.80667 | 1036.41071 |
Total variance = explained variance + unexplained variance
109143.40 = explained variance + unexplained variance
explained variance
Shortcut:
group | groupmean | grandmean | squared_diff |
---|---|---|---|
Manager | 104.44 | 84.81 | 385.47 |
Analyst | 81.84 | 84.81 | 8.80 |
CEO | 68.14 | 84.81 | 277.78 |
SUM | - | - | 672.05 |
672.05 * 50
## [1] 33602.5
109143.40 = 33602.50 + unexplained variance
–>
unexplained variance = 75540.90
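The decomposition can be verified on a tiny toy data set (invented numbers, not the lecture data):

```r
# Toy example: total SS = explained SS + unexplained SS
damage <- c(60, 58, 119, 79, 124, 90)
role   <- factor(c('Analyst', 'Analyst', 'Manager', 'Manager', 'CEO', 'CEO'))
grand_mean  <- mean(damage)
group_means <- tapply(damage, role, mean)              # per-group means
total_ss       <- sum((damage - grand_mean)^2)
explained_ss   <- sum(table(role) * (group_means - grand_mean)^2)
unexplained_ss <- sum((damage - group_means[role])^2)  # within-group deviations
all.equal(total_ss, explained_ss + unexplained_ss)     # TRUE
```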
We want to know: how much more variance is explained compared to non-explained.
source | variance |
---|---|
explained (factor Position) | 33602.50 |
unexplained | 75540.90 |
total | 109143.40 |
But: different number of values used for calculation!
df = number of values that are free to vary
source | variance | df |
---|---|---|
explained (factor Position) | 33602.50 | 2 |
unexplained | 75540.90 | 147 |
total | 109143.40 | 149 |
source | variance | df | mean SSq |
---|---|---|---|
explained (factor Position) | 33602.50 | 2 | 16801 |
unexplained | 75540.90 | 147 | 514 |
total | 109143.40 | 149 | - |
How much more variance is explained compared to non-explained?
F-statistic = mean SSq explained / mean SSq unexplained
16801 / 514
## [1] 32.68677
The explained variance (due to the factor Position) is 32.69 times higher than the unexplained variance.
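The same F value, and its p-value, can be computed by hand from the mean squares, using R's F distribution function pf():

```r
ms_explained   <- 33602.50 / 2            # mean SSq explained
ms_unexplained <- 75540.90 / 147          # mean SSq unexplained
f_value <- ms_explained / ms_unexplained  # about 32.69
# probability of an F this large or larger under H0, with df = (2, 147)
pf(f_value, df1 = 2, df2 = 147, lower.tail = FALSE)
```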
Is this significant?
summary(aov(df__$damage ~ df__$role))
## Df Sum Sq Mean Sq F value Pr(>F)
## df__$role 2 33602 16801 32.69 1.79e-12 ***
## Residuals 147 75541 514
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The one-way ANOVA revealed that there was a significant main effect of Position (CEO, Manager, Analyst) on the damage in USD, F(2, 147) = 32.69, p < .001.
Now we know whether there is an overall effect …
Important: only now do you have statistical justification to proceed with follow-up contrasts.
If the ANOVA is not significant (n.s.) –> the analysis stops here!
Step 1: The one-way ANOVA revealed that there was a significant main effect of Position (CEO, Manager, Analyst) on the damage in USD, F(2, 147) = 32.69, p < .001.
Step 2: follow-up contrasts
t.test(df__$damage[df__$role != 'Analyst'] ~ df__$role[df__$role != 'Analyst'])
t.test(df__$damage[df__$role != 'Manager'] ~ df__$role[df__$role != 'Manager'])
t.test(df__$damage[df__$role != 'CEO'] ~ df__$role[df__$role != 'CEO'])
comparison | t | df | p |
---|---|---|---|
CEO vs Manager | -7.4734 | 65.80 | < .001 |
CEO vs Analysts | 4.1717 | 87.70 | < .001 |
Manager vs Analysts | -4.33 | 80.31 | < .001 |
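The three contrasts can also be obtained in one call with pairwise.t.test() from base R. The data below are a simulated stand-in for df__ (group means and SDs taken from the slides); note that pairwise.t.test() corrects the p-values for multiple comparisons (Holm method by default), which the three separate t.test() calls do not.

```r
# Simulated stand-in for df__ (group means/SDs as on the slides)
set.seed(42)
role   <- factor(rep(c('Analyst', 'CEO', 'Manager'), each = 50))
damage <- c(rnorm(50, 81.84, 18.1),
            rnorm(50, 68.14, 13.3),
            rnorm(50, 104.44, 31.7))
# pool.sd = FALSE -> Welch-style tests, as in t.test() above;
# p-values are Holm-adjusted across the three comparisons
pairwise.t.test(damage, role, pool.sd = FALSE)
```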
 | mean |
---|---|
Analyst | 81.84 |
CEO | 68.14 |
Manager | 104.44 |
Follow-up contrasts revealed that the damage (in $) was smaller when caused by CEOs (M = 68.14, SD = 13.31) than when caused by Managers (M = 104.44, SD = 31.67), t(65.80) = -7.47, p < .001. …
role*gender
–> 2 by 3 ANOVA
lm(damage ~ role, data=df__)
 | beta | SE | t-statistic | p-value |
---|---|---|---|---|
(Intercept) | 81.84 | 3.205884 | 25.528057 | 0.0000000 |
roleCEO | -13.70 | 4.533805 | -3.021744 | 0.0029654 |
roleManager | 22.60 | 4.533805 | 4.984775 | 0.0000017 |
The intercept is the mean of the reference group; the other beta coefficients are group mean differences relative to it!
 | beta |
---|---|
(Intercept) | 81.84 |
roleCEO | -13.70 |
roleManager | 22.60 |
knitr::kable(tapply(df__$damage, df__$role, mean), col.names = c('mean'))
 | mean |
---|---|
Analyst | 81.84 |
CEO | 68.14 |
Manager | 104.44 |
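Putting the two tables together: the intercept is the mean of the reference group (Analyst, the first factor level), and adding a dummy coefficient to it yields that group's mean.

```r
# Numbers taken from the coefficient table above
intercept    <- 81.84    # mean of the reference group (Analyst)
beta_ceo     <- -13.70
beta_manager <- 22.60
intercept + beta_ceo      # 68.14 = CEO mean
intercept + beta_manager  # 104.44 = Manager mean
```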
Look at the output:
Linear Model:
F-statistic: 32.69 on 2 and 147 DF, p-value: 1.793e-12
ANOVA:
Df Sum Sq Mean Sq F value Pr(>F)
df__$role 2 33602 16801 32.69 1.79e-12 ***
Residuals 147 75541 514
Same omnibus logic:
# Managers and Analysts only
t = -3.8531, df = 84.191, p-value = 0.0002269
summary(aov(df$damage ~ df$role))
## Df Sum Sq Mean Sq F value Pr(>F)
## df$role 1 8190 8190 14.85 0.000208 ***
## Residuals 98 54063 552
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(-3.8531)^2
## [1] 14.84638
sqrt(14.84)
## [1] 3.852272
If the ANOVA is a linear regression,
so is the t-test:
lm(damage ~ role, data=df)
 | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 82.92 | 3.321626 | 24.963681 | 0.0000000 |
roleManager | 18.10 | 4.697488 | 3.853123 | 0.0002083 |
Tutorial
Next week