Lecture 5: Hypothesis Testing

L2 - Statistics

Gustave Kenedi

2026-03-09

Recap quiz

Or click here: link to Wooclap

Today’s lecture

The big question: How do we test whether our data provides evidence for or against a claim about the population?

Introduction to hypothesis testing

From confidence intervals to hypothesis tests

Last lecture: We learned how to construct confidence intervals to quantify uncertainty about population parameters.

Example: Life satisfaction in France

We estimated that the average life satisfaction in France is 7.0 (95% CI: 6.9 to 7.1) based on European Social Survey data.

This lecture: Instead of estimating a range of plausible values, we often want to test specific claims about the population.

Research questions

  • Is the average life satisfaction in France different from 7?
  • Do men and women have different levels of life satisfaction?
  • Is trust in parliament higher than trust in politicians?
  • Does interpersonal trust really vary by education level?

What is a hypothesis test?

A hypothesis test is a statistical method for making decisions about population parameters based on sample data.

The logic:

  1. Start with a claim (hypothesis) about the population
  2. Collect sample data
  3. Calculate how compatible the data is with the claim
  4. Make a decision: reject the claim or fail to reject it

Important

We never “prove” a hypothesis is true. We only assess whether the data provides sufficient evidence against a claimed value.

A simple example

Suppose a politician claims that 60% of French adults trust parliament.

You survey 100 randomly selected adults and find that 45% trust parliament.

Questions:

  • Is this evidence against the politician’s claim?
  • Could we observe 45% just by chance if the true proportion is 60%?
  • How unlikely would our result need to be before we reject the claim?

Hypothesis testing provides a formal framework to answer these questions.
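A quick back-of-the-envelope check in R previews the logic (numbers from the example above; this is a sketch, not yet the formal test): under the politician's claim, the number of trusters in a sample of 100 follows a Binomial(100, 0.6) distribution.

```r
# If the true proportion really were 0.60, how surprising would it be
# to see 45 or fewer trusters out of 100? (exact binomial tail probability)
pbinom(45, size = 100, prob = 0.60)
```

The tail probability is well under 1%, so a sample share of 45% would be very unlikely if the claim were true.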

Connection to confidence intervals

Hypothesis testing and confidence intervals are two sides of the same coin.

Equivalence

  • If a 95% confidence interval for \(\mu\) is \((6.9, 7.1)\), then we would reject any hypothesis that \(\mu\) equals a value outside this interval (e.g., \(\mu = 6.5\)) at the 5% significance level.
  • If a hypothesized value \(\mu_0\) falls inside the 95% CI, we would fail to reject the hypothesis that \(\mu = \mu_0\) at the 5% level.

Today: We’ll learn the formal hypothesis testing framework, which gives us more flexibility:

  • One-sided vs two-sided tests
  • Comparing groups
  • Testing relationships between variables

Today’s lecture

The big question: How do we test whether our data provides evidence for or against a claim about the population?

1. Introduction to hypothesis testing

2. The hypothesis testing framework

3. One-sample tests

4. Two-sample tests

5. Paired tests

6. Chi-square test of independence

7. Randomization methods

8. Statistical power and sample size

9. Practical guidelines

The hypothesis testing framework

The six steps of hypothesis testing

Every hypothesis test follows the same structure:

  1. State the hypotheses: Define null (\(H_0\)) and alternative (\(H_A\)) hypotheses

  2. Choose significance level: Set significance level \(\alpha\) (typically 0.05)

  3. Calculate test statistic: Compute from sample data

  4. Find the p-value: Probability of observing our result (or more extreme) if \(H_0\) is true

  5. Make a decision: Reject \(H_0\) if p-value < \(\alpha\)

  6. Interpret in context: State conclusion in plain language

Step 1: State the hypotheses

Every hypothesis test involves two competing hypotheses:

Definitions

Null hypothesis (\(H_0\)): The claim we are testing, typically representing “no effect” or “no difference”

Alternative hypothesis (\(H_A\) or \(H_1\)): What we conclude if we reject \(H_0\)

Key principle: \(H_0\) always contains an equality (=, ≤, or ≥)

Example: Life satisfaction

  • \(H_0: \mu = 7\) (average life satisfaction equals 7)
  • \(H_A: \mu \neq 7\) (average life satisfaction differs from 7)

One-sided vs two-sided tests

The alternative hypothesis can be:

Two-sided (non-directional): \[H_0: \mu = \mu_0 \quad \text{vs} \quad H_A: \mu \neq \mu_0\]

Used when we care about differences in either direction.

One-sided (directional): \[H_0: \mu \leq \mu_0 \quad \text{vs} \quad H_A: \mu > \mu_0\] or \[H_0: \mu \geq \mu_0 \quad \text{vs} \quad H_A: \mu < \mu_0\]

Used when we only care about differences in one direction.
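In R, the two variants differ only in which tail(s) of the null distribution the p-value accumulates. A sketch with a hypothetical test statistic:

```r
# Hypothetical: observed t-statistic of 2.0 with 49 degrees of freedom
t_obs <- 2.0
df    <- 49

2 * pt(-abs(t_obs), df = df)  # two-sided: both tails
1 - pt(t_obs, df = df)        # one-sided, right tail (H_A: mu > mu0)
pt(t_obs, df = df)            # one-sided, left tail (H_A: mu < mu0)
```

When the estimate falls in the hypothesized direction, the one-sided p-value is half the two-sided one.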

Step 2: Choose significance level (\(\alpha\))

The significance level (\(\alpha\)) is the probability threshold for rejecting \(H_0\).

Common choices:

  • \(\alpha = 0.05\) (5% level) — most common
  • \(\alpha = 0.01\) (1% level) — more stringent
  • \(\alpha = 0.10\) (10% level) — more lenient

Interpretation:

\(\alpha = 0.05\) means we’re willing to reject \(H_0\) when there’s only a 5% chance of observing data this extreme if \(H_0\) were true.

Note

The significance level \(\alpha\) represents the maximum probability of a Type I error we’re willing to tolerate (more on this soon!).

Step 3: Calculate the test statistic

A test statistic measures how far our sample estimate is from the hypothesized value, in units of standard errors.

For a population mean: \[t = \frac{\bar{x} - \mu_0}{SE(\bar{x})} = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}\]

where:

  • \(\bar{x}\) = sample mean
  • \(\mu_0\) = hypothesized population mean
  • \(s\) = sample standard deviation
  • \(n\) = sample size

Interpretation: The test statistic tells us “how many standard errors away” our sample estimate is from the hypothesized value.

Step 4: Calculate the p-value

Definition

The p-value is the probability of observing a test statistic as extreme as (or more extreme than) what we actually observed, assuming the null hypothesis \(H_0\) is true.

Example setup — testing whether mean life satisfaction in France equals 7:

  • \(H_0: \mu = 7\)   vs   \(H_A: \mu \neq 7\)
  • Significance level: \(5\%\)
  • Observed sample mean: \(\bar{x} = 7.4\), with \(s = 2.1\), \(n = 150\)
  • Test statistic: \(t = \dfrac{7.4 - 7}{2.1/\sqrt{150}} \approx 2.33\)
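The same calculation in R (values from the slide; the normal approximation is used here for simplicity):

```r
# H0: mu = 7 vs H_A: mu != 7, with xbar = 7.4, s = 2.1, n = 150
xbar <- 7.4; s <- 2.1; n <- 150; mu0 <- 7

se    <- s / sqrt(n)
t_obs <- (xbar - mu0) / se       # ≈ 2.33

2 * (1 - pnorm(abs(t_obs)))      # two-sided p-value ≈ 0.02
```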

Step 4: The null distribution

Under \(H_0\), if \(\mu = 7\), repeated samples of size \(n = 150\) would give sample means distributed as:

This is the distribution of \(\bar{x}\) we’d expect to see if the null were true.

Step 4: Where did our sample fall?

We observed \(\bar{x} = 7.4\). How does it sit relative to the null distribution?

Our result looks unusual under \(H_0\) — but how unusual exactly?

Step 4: The p-value area

The p-value = \(P(|\bar{x} - 7| \geq 0.4 \mid H_0)\) — the shaded area in both tails:

p-value \(\approx\) 0.02 — the data are unlikely under \(H_0\). We reject \(H_0\) at the 5% level. For a two-tailed test, the p-value is computed by: 2 * (1 - pnorm(abs(obs_value - mu0) / se))

Step 4: How the p-value changes with the test statistic

The further the observed \(\bar{x}\) from \(\mu_0\), the smaller the p-value:

Interpreting p-values

What the p-value is (and isn’t)

The p-value IS:

  • The probability of obtaining results as extreme as observed, assuming \(H_0\) is true
  • A measure of compatibility between data and \(H_0\)

The p-value is NOT:

  • The probability that \(H_0\) is true
  • The probability that the results occurred by chance
  • The probability that you’ve made a wrong decision

Common interpretations:

  • Small p-value (< \(\alpha\)): Data is incompatible with \(H_0\) → reject \(H_0\)
  • Large p-value (≥ \(\alpha\)): Data is compatible with \(H_0\) → fail to reject \(H_0\)

Warning

“Fail to reject \(H_0\)” ≠ “Accept \(H_0\)” or “Prove \(H_0\) is true”

We simply don’t have sufficient evidence against \(H_0\).

Step 5: Make a decision

Compare the p-value to the significance level \(\alpha\):

Decision rule:

  • If p-value < \(\alpha\): Reject \(H_0\) (results are “statistically significant”)
  • If p-value ≥ \(\alpha\): Fail to reject \(H_0\) (results are “not statistically significant”)

The 0.05 threshold is arbitrary!

The conventional threshold of \(\alpha = 0.05\) is not a magical number. It’s a historical convention introduced by Fisher (1925) — chosen as a convenient round number, not for any deep statistical reason.

  • There’s nothing special about 5% vs 4.9% or 5.1%
  • Results with p = 0.049 and p = 0.051 are not fundamentally different
  • Don’t treat 0.05 as a bright line — report actual p-values and effect sizes!

Step 6: Interpret in context

Always translate your statistical decision back into the real-world context.

Bad interpretations:

  • ❌ “We reject the null hypothesis”
  • ❌ “The p-value is 0.03”
  • ❌ “The result is significant”

Good interpretations:

  • ✅ “The data provide strong evidence that average life satisfaction in France differs from 7 (p = 0.03)”
  • ✅ “We found a statistically significant difference in life satisfaction between men and women (p = 0.02)”
  • ✅ “There is insufficient evidence to conclude that trust in parliament differs from trust in politicians (p = 0.18)”

Your turn! #1

A government report states that French adults have an average interest in politics of 5 on a 0 (Not at all interested) – 10 (Very interested) scale. You survey a random sample of 50 French university students and find: \(\bar{x} = 5.8\), with standard deviation \(s = 2.3\).

Work through all six steps to test whether students differ from the general population:

  1. State the hypotheses. Write \(H_0\) and \(H_A\).

  2. Choose \(\alpha\). Use \(\alpha = 0.05\).

  3. Compute the test statistic. Calculate \(t = (\bar{x} - \mu_0)/(s / \sqrt{n})\).

  4. Compute the p-value. Use 2 * pt(-abs(t_stat), df = n - 1) in R. Express it as \(P(|\bar{X} - 5| \geq {\color{gray}{?}} \mid H_0)\).

  5. Make a decision. Do you reject \(H_0\) at the 5% level?

  6. Interpret in context. Write one sentence summarising what you conclude about political interest among French university students.

Type I and Type II errors

In hypothesis testing, we can make two types of errors:

                      H₀ is true                H₀ is false (Hₐ is true)
  Reject H₀           Type I error              Correct decision
                      (False Positive)          (Power)
                      Probability = α           Probability = 1 − β
  Fail to reject H₀   Correct decision          Type II error
                      Probability = 1 − α       (False Negative)
                                                Probability = β

Definitions

  • Type I error (α): Rejecting \(H_0\) when it’s actually true (false positive)
  • Type II error (β): Failing to reject \(H_0\) when it’s actually false (false negative)
  • Power (1 - β): Probability of correctly rejecting a false \(H_0\)

Understanding Type I and Type II errors

Medical test analogy

Think of a medical test for a disease:

  • Type I error: Test says you have the disease when you don’t (false positive)
  • Type II error: Test says you don’t have the disease when you do (false negative)

In statistical testing:

  • We control Type I error by choosing \(\alpha\) (typically 0.05)
  • We cannot directly control Type II error without knowing the true parameter value
  • Type II error depends on: sample size, effect size, and \(\alpha\)

The trade-off:

  • Decreasing \(\alpha\) (e.g., from 0.05 to 0.01) reduces Type I error
  • But it increases Type II error (harder to detect real effects)
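One way to see that \(\alpha\) really is the Type I error rate is a small simulation (simulated data, not the ESS): test a true null many times and count the rejections.

```r
# H0 is true here (mu really is 0), so at alpha = 0.05 we should
# falsely reject in roughly 5% of repeated samples
set.seed(123)
p_values <- replicate(5000,
                      t.test(rnorm(30, mean = 0, sd = 1), mu = 0)$p.value)
mean(p_values < 0.05)  # close to 0.05
```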

Today’s lecture

The big question: How do we test whether our data provides evidence for or against a claim about the population?

1. Introduction to hypothesis testing

2. The hypothesis testing framework

3. One-sample tests

4. Two-sample tests

5. Paired tests

6. Chi-square test of independence

7. Randomization methods

8. Statistical power and sample size

9. Practical guidelines

One-sample tests

One-sample test for means

Goal: Test whether a population mean \(\mu\) equals a specific hypothesized value \(\mu_0\).

Hypotheses: \[H_0: \mu = \mu_0 \quad \text{vs} \quad H_A: \mu \neq \mu_0\]

Test statistic: \[T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}\]

Under \(H_0\): \(T \sim t_{n-1}\)

If \(\sigma\) is known

If we know the true population SD \(\sigma\), we can use the standard normal distribution \(N(0,1)\). Since in practice we estimate \(\sigma\) with the sample SD \(s\), we use the t-distribution.

P-value (two-sided): \(P(|T| \geq |t_{\text{obs}}|)\), where \(t_{\text{obs}} = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}\) and \(T \sim t_{n-1}\)

Example: Trust in parliament

Research question: Is the average trust in parliament among French adults equal to 5 (the midpoint of the 0-10 scale)?

# Load ESS data
ess_france <- read.csv("https://www.dropbox.com/scl/fi/25h4lezq3zuzla94ejmqq/ess_france_11ed.csv?rlkey=lf23ra0i6bvq5dzgtt2u1gy9a&dl=1")

# Clean trust in parliament variable
ess_trust_parl <- ess_france |>
  filter(trstprl %in% 0:10) |>
  select(trstprl)

# Summary statistics
n <- nrow(ess_trust_parl)
xbar <- mean(ess_trust_parl$trstprl)
s <- sd(ess_trust_parl$trstprl)
n = 1727 
Sample mean = 4.15 
Sample SD = 2.485 

Example: Trust in parliament (cont.)

The sample mean (\(\bar{x}\) = 4.15) lies below the scale midpoint of 5 — but is the difference statistically significant?

Example: Trust in parliament (cont.)

Step 1: State hypotheses:

  • \(H_0: \mu = 5\) (average trust equals 5)
  • \(H_A: \mu \neq 5\) (average trust differs from 5)

Step 2: Choose \(\alpha\): \(0.05\)

Step 3: Calculate test statistic \(\dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}\):

mu0 <- 5
t_stat <- (xbar - mu0) / (s / sqrt(n))
Test statistic: t = -14.218

Step 4: Calculate p-value:

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = n - 1)
P-value = <2e-16

Example: Trust in parliament (cont.)

Using R’s built-in function:

# Perform one-sample t-test
t.test(ess_trust_parl$trstprl, mu = 5)

    One Sample t-test

data:  ess_trust_parl$trstprl
t = -14.218, df = 1726, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
 4.032712 4.267230
sample estimates:
mean of x 
 4.149971 

Visualizing the result

Decision: Reject \(H_0\) (p < 0.05) at the 5% level. The data provide strong evidence that average trust in parliament differs from 5. Can we reject at the 1% level?

CI and hypothesis testing

The CI–hypothesis test duality

The 95% CI does not contain \(\mu_0 = 5\) \(\Rightarrow\) we reject \(H_0\) at \(\alpha = 0.05\).

More generally: a two-sided test at level \(\alpha\) rejects \(H_0\) if and only if \(\mu_0\) falls outside the \((1-\alpha)\) CI.
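A sketch of the duality on simulated data (any sample would do): the hypothesized value \(\mu_0 = 5\) lies outside the 95% CI exactly when the two-sided p-value is below 0.05.

```r
# CI-test duality: compare the CI and the p-value for the same test
set.seed(1)
x  <- rnorm(100, mean = 5.4, sd = 2)
tt <- t.test(x, mu = 5)

tt$conf.int  # does the interval contain 5?
tt$p.value   # < 0.05 exactly when 5 is outside the interval
```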

Quick example: Student sleep

Health guidelines recommend at least 7 hours of sleep per night. You survey a random sample of 40 students and find that they report sleeping an average of 6 hours and 18 minutes per night, with a standard deviation of 1 hour and 24 minutes. Test at \(\alpha = 0.05\) whether students sleep less than the recommended amount.

Step 1: State the hypothesis

\(H_0: \mu = 7\) vs \(H_A: \mu < 7\) (one-sided)

Step 2: Choose significance level

\(\alpha = 0.05\)

Step 3: Compute the test statistic:

\[ t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}} = \dfrac{6.3 - 7}{1.4/\sqrt{40}} \approx -3.16 \]

Step 4: Compute the p-value (one-sided, left tail):

n_sl <- 40
t_sl <- (6.3 - 7) / (1.4 / sqrt(n_sl))
pt(t_sl, df = n_sl - 1)
[1] 0.001513711

Steps 5–6: Make a decision \(\rightarrow\) reject or fail to reject \(H_0\)?

\(p \approx 0.002 < 0.05 \rightarrow\) reject \(H_0\). Strong evidence that students sleep significantly less than the recommended 7 hours.

One-sample test for proportions

Goal: Test whether a population proportion \(p\) equals a specific hypothesized value \(p_0\).

Hypotheses: \[H_0: p = p_0 \quad \text{vs} \quad H_A: p \neq p_0\]

Test statistic (large sample): \[z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}}\]

where \(\hat{p}\) is the sample proportion. (Denominator uses \(p_0\) because \(H_0\) assumed to be true.)

Under \(H_0\): \(z \sim N(0, 1)\) approximately (when \(np_0 \geq 10\) and \(n(1-p_0) \geq 10\))
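The large-sample condition is worth checking in code before relying on the normal approximation (hypothetical \(n\) and \(p_0\)):

```r
# Normal-approximation check for a one-sample proportion test
n  <- 100
p0 <- 0.20
n * p0 >= 10 && n * (1 - p0) >= 10  # TRUE: 20 and 80 both exceed 10
```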

Example: High interpersonal trust

Research question: Is the proportion of French adults with high interpersonal trust greater than 20%?

# Create high trust indicator
ess_trust <- ess_france |>
  filter(ppltrst %in% 0:10) |>
  mutate(high_trust = ppltrst >= 7) |>
  select(high_trust)

# Summary
n_trust <- nrow(ess_trust)
p_hat <- mean(ess_trust$high_trust)
num_high_trust <- sum(ess_trust$high_trust)
n = 1760 
Number with high trust = 363 
Sample proportion = 0.206 

Example: High interpersonal trust (cont.)

Testing \(H_0: p \le 0.20\) vs \(H_A: p > 0.20\).

Manual calculation first:

p0 <- 0.20

# Step 3: Test statistic
z_stat <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n_trust)
z = 0.656 
# Step 4: One-sided p-value
p_value <- 1 - pnorm(z_stat)
p-value = 0.256 

Confirm with prop.test():

prop.test(x = num_high_trust, n = n_trust, p = 0.20, alternative = "greater", correct = FALSE)

    1-sample proportions test without continuity correction

data:  num_high_trust out of n_trust, null probability 0.2
X-squared = 0.42969, df = 1, p-value = 0.2561
alternative hypothesis: true p is greater than 0.2
95 percent confidence interval:
 0.1908427 1.0000000
sample estimates:
      p 
0.20625 

Visualizing the result

Decision: Fail to reject \(H_0\) (p > 0.05) at 5% level. No strong evidence that the proportion with high interpersonal trust exceeds 20%.

Your turn! #2

  1. Load the ESS France data and tidyverse. Select the life satisfaction variable (stflife) keeping only values between 0 and 10. The goal is to test whether average life satisfaction equals 6.5 (use \(\alpha = 0.05\)).
library(tidyverse)

ess_france <- read.csv("https://www.dropbox.com/scl/fi/25h4lezq3zuzla94ejmqq/ess_france_11ed.csv?rlkey=lf23ra0i6bvq5dzgtt2u1gy9a&dl=1")
  2. State the null and alternative hypotheses.

  3. Manually compute the test statistic \(t = (\bar{x} - \mu_0) / (s / \sqrt{n})\) and the two-sided p-value using pt().

  4. Verify your result with t.test().

  5. Make a decision and interpret.

  6. Compute the 95% confidence interval for average life satisfaction. Compare it with the decision you made above.

Today’s lecture

The big question: How do we test whether our data provides evidence for or against a claim about the population?

1. Introduction to hypothesis testing

2. The hypothesis testing framework

3. One-sample tests

4. Two-sample tests

5. Paired tests

6. Chi-square test of independence

7. Randomization methods

8. Statistical power and sample size

9. Practical guidelines

Two-sample tests

Two-sample t-test: Independent groups

Goal: Test whether two population means are equal: \(\mu_1 = \mu_2\)

Hypotheses: \[H_0: \mu_1 = \mu_2 \quad \text{vs} \quad H_A: \mu_1 \neq \mu_2\]

Or equivalently: \(H_0: \mu_1 - \mu_2 = 0\) vs \(H_A: \mu_1 - \mu_2 \neq 0\)

Test statistic: \[T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{SE(\bar{X}_1 - \bar{X}_2)} = \frac{(\bar{X}_1 - \bar{X}_2) - 0}{SE(\bar{X}_1 - \bar{X}_2)}\]

Two versions:

  1. Equal variances (pooled t-test): Assumes \(\sigma_1^2 = \sigma_2^2\)
  2. Unequal variances (Welch’s t-test): Does not assume equal variances (default in R)

Standard error for difference in means

Unequal variances — Welch’s t-test (recommended default in R): \[SE_{\text{Welch}} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]

Degrees of freedom are approximated by the Welch-Satterthwaite equation we discussed in the confidence interval lecture (R handles this automatically).

Equal variances — pooled t-test (only if variances are truly equal): \[SE_{\text{pooled}} = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \qquad s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}\]

Tip

Rule of thumb: Always use Welch’s t-test. When the variances really are equal it gives essentially the same answer as the pooled test, so there is little downside to using it as the default. In R, t.test() uses Welch’s test by default.
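In t.test() the choice is controlled by the var.equal argument (FALSE by default, i.e. Welch). A sketch on simulated groups with unequal variances:

```r
# Welch (default) vs pooled t-test
set.seed(42)
g1 <- rnorm(40, mean = 5, sd = 2)  # more variable group
g2 <- rnorm(60, mean = 4, sd = 1)  # less variable group

t.test(g1, g2)                     # Welch's t-test (default)
t.test(g1, g2, var.equal = TRUE)   # pooled t-test (assumes equal variances)
```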

Example: Education and trust

Research question: Does interpersonal trust differ between those with and without higher education?

# Create education groups
ess_educ_trust <- ess_france |>
  filter(ppltrst %in% 0:10) |>
  mutate(higher_educ = case_when(education == "High school diploma or less" ~ "No higher education",
                                 education != "High school diploma or less" ~ "Higher education")) |>
  filter(!is.na(higher_educ)) |>
  select(ppltrst, higher_educ)

# Summary statistics by group (stored for reuse below)
educ_summary <- ess_educ_trust |>
  group_by(higher_educ) |>
  summarise(n = n(),
            mean = mean(ppltrst),
            sd = sd(ppltrst),
            se = sd / sqrt(n))
educ_summary
# A tibble: 2 × 5
  higher_educ             n  mean    sd     se
  <chr>               <int> <dbl> <dbl>  <dbl>
1 Higher education      733  5.15  2.07 0.0763
2 No higher education  1027  4.13  2.16 0.0676

Example: Education and trust (cont.)

Visualize the distributions — do the groups look different?

Example: Education and trust (cont.)

Before testing — visualize the group means with 95% CIs:

The CIs don’t overlap — strong visual evidence of a difference. Let’s confirm with a formal test.

Example: Education and trust (cont.)

Manual computation — Welch’s t-test:

# Group statistics
stats <- educ_summary

n1 <- stats$n[1]
xbar1 <- stats$mean[1]
s1 <- stats$sd[1]
n2 <- stats$n[2]
xbar2 <- stats$mean[2]
s2 <- stats$sd[2]

# SE (Welch)
se <- sqrt(s1^2/n1 + s2^2/n2)
SE = 0.1019 
# Test statistic
t_stat <- (xbar1 - xbar2) / se
t = 10.007 
# Approximate p-value (using large-sample z approximation for simplicity)
p_value <- 2 * pnorm(-abs(t_stat))
p-value ≈ <2e-16

Confirm with t.test():

(uses exact Welch-Satterthwaite df)

t.test(ppltrst ~ higher_educ,
       data = ess_educ_trust)

    Welch Two Sample t-test

data:  ppltrst by higher_educ
t = 10.007, df = 1619.7, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Higher education and group No higher education is not equal to 0
95 percent confidence interval:
 0.8197456 1.2194366
sample estimates:
   mean in group Higher education mean in group No higher education 
                         5.150068                          4.130477 

Conclusion: Those with higher education have significantly higher trust (difference ≈ 1.02, p < 0.001).

Quick example: Study hours and exam outcome

A university records the daily study hours of students who sat a statistics exam. Among the 40 students who passed, the average was 5.3 hours per day with a standard deviation of 1.8 hours. Among the 30 students who failed, the average was 3.2 hours per day with a standard deviation of 1.5 hours. Test at \(\alpha = 0.05\) whether students who passed studied more on average.

Step 1: State the hypotheses:

\(H_0: \mu_1 = \mu_2\) vs \(H_A: \mu_1 > \mu_2\) (one-sided)

Step 2: Choose significance level:

\(\alpha = 0.05\)

Step 3: Compute the test statistic:

\[t = \dfrac{\bar{x}_1 - \bar{x}_2}{SE} = \dfrac{5.3 - 3.2}{\sqrt{\dfrac{1.8^2}{40} + \dfrac{1.5^2}{30}}} \approx 5.32\]

Step 4: Compute the p-value (one-sided, right tail; use df = 67.19 (Welch-Satterthwaite)):

t_ts <- (5.3 - 3.2) / sqrt(1.8^2 / 40 + 1.5^2 / 30)
1 - pt(t_ts, df = 67.19)
[1] 6.491827e-07

Steps 5–6: Make a decision \(\rightarrow\) reject or fail to reject \(H_0\)?

\(p < 0.001 < 0.05 \rightarrow\) reject \(H_0\). Strong evidence that students who passed studied significantly more hours per day than those who failed.

Two-sample test for proportions

Goal: Test whether two population proportions are equal: \(p_1 = p_2\)

Test statistic: \[z = \frac{\hat{p}_1 - \hat{p}_2}{SE}, \qquad SE = \sqrt{\hat{p}(1-\hat{p})\!\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}\]

where \(\hat{p} = \dfrac{x_1 + x_2}{n_1 + n_2}\) is the pooled proportion (estimated under \(H_0: p_1 = p_2\)).

Under \(H_0\): \(z \sim N(0, 1)\) approximately (requires \(n_i\hat{p}(1-\hat{p}) \geq 5\) for each group)

Example: High trust by education

Research question: Does the proportion with high interpersonal trust differ by education?

# Create high trust indicator by education
ess_trust_educ <- ess_educ_trust |>
  mutate(high_trust = ppltrst >= 7)

# Summary
prop_table <- ess_trust_educ |>
  group_by(higher_educ) |>
  summarise(n = n(),
            n_high_trust = sum(high_trust),
            prop_high_trust = mean(high_trust))

prop_table
# A tibble: 2 × 4
  higher_educ             n n_high_trust prop_high_trust
  <chr>               <int>        <int>           <dbl>
1 Higher education      733          225           0.307
2 No higher education  1027          138           0.134

Example: High trust by education (cont.)

Manual calculation first:

# Group proportions
p1_hat <- prop_table$prop_high_trust[1]  # Higher education
p2_hat <- prop_table$prop_high_trust[2]  # No higher education

n_high1 <- prop_table$n_high_trust[1]
n_high2 <- prop_table$n_high_trust[2]
n1 <- prop_table$n[1]
n2 <- prop_table$n[2]

# Pooled proportion (under H0: p1 = p2)
p_pool <- (n_high1 + n_high2) / (n1 + n2)
Pooled proportion: 0.206 
# SE and z-statistic
SE_prop <- sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
z_stat_prop <- (p1_hat - p2_hat) / SE_prop
z = 8.822 
p_value <- 2 * pnorm(-abs(z_stat_prop))
p-value = <2e-16 

Verify with prop.test():

prop.test(x = prop_table$n_high_trust,
          n = prop_table$n,
          correct = FALSE)

    2-sample test for equality of proportions without continuity correction

data:  prop_table$n_high_trust out of prop_table$n
X-squared = 77.82, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
 0.1332162 0.2119553
sample estimates:
   prop 1    prop 2 
0.3069577 0.1343720 

Conclusion: The proportion with high trust is significantly higher among those with higher education (p < 0.001).

Your turn! #3

Part 1: Test whether average life satisfaction differs between men and women:

  1. Calculate summary statistics by gender

  2. Compute a 95% confidence interval for both means. Do they overlap?

  3. Perform a two-sample t-test manually (using df = 1730.8) and using t.test.

  4. Interpret: is the difference statistically significant? At what level? Practically meaningful?

Part 2: Test whether the proportion who are “very satisfied” with democracy (stfdem ≥ 8) differs between those with and without higher education:

  1. Compute 95% confidence intervals for both proportions. Do they overlap?

  2. Perform a two-sample proportion test manually and using prop.test.

  3. Interpret: is the difference statistically significant? At what level? Practically meaningful?

Today’s lecture

The big question: How do we test whether our data provides evidence for or against a claim about the population?

1. Introduction to hypothesis testing

2. The hypothesis testing framework

3. One-sample tests

4. Two-sample tests

5. Paired tests

6. Chi-square test of independence

7. Randomization methods

8. Statistical power and sample size

9. Practical guidelines

Paired tests

Paired vs independent samples

Independent samples: Two separate groups with no relationship between observations (e.g., men vs women)

Paired samples: Two measurements on the same individuals or matched pairs (e.g., before vs after, parliament vs politicians)

Why does it matter?

  • Paired data has less variability because we control for individual differences
  • Use a different test that accounts for the pairing
  • More statistical power to detect differences

Key principle

With paired data, we analyze the differences within pairs, not the two groups separately.

Paired t-test: Theory

Goal: Test whether the mean difference between paired observations equals a hypothesized value \(\delta_0\).

Hypotheses: \[H_0: \mu_d = \delta_0 \quad \text{vs} \quad H_A: \mu_d \neq \delta_0\]

where \(\mu_d = \mu_1 - \mu_2\) is the population mean of the pairwise differences. Typically, \(\delta_0 = 0\) (no difference).

Method:

  1. Calculate differences: \(d_i = x_{1i} - x_{2i}\) for each pair \(i\)
  2. Perform a one-sample t-test on the differences

Test statistic: \(t = \frac{\bar{d} - \delta_0}{s_d / \sqrt{n}}\)

When \(\delta_0 = 0\) this simplifies to \(t = \frac{\bar{d}}{s_d / \sqrt{n}}\).
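The equivalence between the paired test and a one-sample test on the differences is easy to verify on toy data (made-up numbers):

```r
# A paired t-test is a one-sample t-test on the within-pair differences
before <- c(5, 6, 4, 7, 6)
after  <- c(6, 7, 5, 7, 8)

t.test(after, before, paired = TRUE)  # paired t-test
t.test(after - before, mu = 0)        # same t, df, and p-value
```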

Example: Trust in parliament vs politicians

Research question: Do French adults trust parliament and politicians equally?

# Prepare paired data
ess_paired <- ess_france |>
  filter(trstprl %in% 1:10 & trstplt %in% 0:10) |>
  select(trstprl, trstplt) |>
  mutate(diff = trstprl - trstplt)

# Summary statistics
n = 1490 
Mean trust in parliament: 4.783 
Mean trust in politicians: 4.128 
Mean difference: 0.655 
SD of differences: 1.892 

Example: Trust comparison (cont.)

Visualize the paired data:

Example: Trust comparison (cont.)

Visualize the differences:

Example: Trust comparison (cont.)

Paired t-test = one-sample t-test on the differences \(D_i =\) Parliament \(-\) Politicians

Parliament: \(\bar{x} = 4.78\)    Politicians: \(\bar{x} = 4.13\)

\[\bar{d} = 4.783 - 4.128 = 0.655, \qquad s_d = 1.892, \qquad n = 1490\]

\[t = \frac{\bar{d}}{s_d / \sqrt{n}} = \frac{0.655}{1.892 / \sqrt{1490}} \approx 13.36, \qquad p < 0.001\]

Confirm with t.test():

t.test(ess_paired$trstprl, ess_paired$trstplt, paired = TRUE)
    Paired t-test

data:  ess_paired$trstprl and ess_paired$trstplt
t = 13.361, df = 1489, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 0.5588648 0.7512023
sample estimates:
mean difference 
      0.6550336 

Conclusion: French adults trust parliament significantly more than politicians (mean difference = 0.66, p < 0.001).

Your turn! #4

Test whether satisfaction with the state of the economy (stfeco) differs from satisfaction with the government (stfgov).

  1. Create a new variable equal to the difference between the two satisfaction measures. Make sure to drop values not between 0 and 10.

  2. Perform a paired t-test manually and using t.test.

  3. Interpret: Are people more satisfied with the economy or government?

Today’s lecture

The big question: How do we test whether our data provides evidence for or against a claim about the population?

1. Introduction to hypothesis testing

2. The hypothesis testing framework

3. One-sample tests

4. Two-sample tests

5. Paired tests

6. Chi-square test of independence

7. Randomization methods

8. Statistical power and sample size

9. Practical guidelines

Chi-square test of independence

Testing relationships between categorical variables

So far: Testing hypotheses about means and proportions

New question: Are two categorical variables independent, or is there an association between them?

Examples:

  • Is trust level (low/medium/high) independent of education (lower/higher)?
  • Is life satisfaction (dissatisfied/neutral/satisfied) independent of gender?
  • Is voting intention independent of age group?

Tool: Chi-square test of independence (also called chi-square test of association)

Contingency tables

A contingency table (or cross-tabulation) shows the frequency distribution of two categorical variables.

Example: Trust level by education

# Create trust categories
ess_trust_cat <- ess_educ_trust |>
  filter(ppltrst %in% 0:10) |> 
  mutate(trust_level = case_when(
    ppltrst <= 3 ~ "Low trust (0-3)",
    ppltrst <= 6 ~ "Medium trust (4-6)",
    ppltrst <= 10 ~ "High trust (7-10)"
  ))

# Create contingency table
cont_table <- table(ess_trust_cat$higher_educ, ess_trust_cat$trust_level)
cont_table
                     
                      High trust (7-10) Low trust (0-3) Medium trust (4-6)
  Higher education                  225             151                357
  No higher education               138             378                511

Observed vs expected frequencies

Under independence: The expected frequency in each cell is: \[E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}\]

Intuition: Independence means \(P(\text{row } i \cap \text{col } j) = P(\text{row } i) \times P(\text{col } j)\). Estimating marginal probabilities from the data: \[E_{ij} = \underbrace{N}_{\text{\# obs.}} \times \underbrace{\frac{R_i}{N}}_{\hat{P}(\text{row } i)} \times \underbrace{\frac{C_j}{N}}_{\hat{P}(\text{col } j)} = \frac{R_i \times C_j}{N}\]

i.e. “how many observations should be in this cell if the two variables were unrelated?”

Expected frequencies under independence

# Add marginal totals
addmargins(cont_table)
                     
                      High trust (7-10) Low trust (0-3) Medium trust (4-6)  Sum
  Higher education                  225             151                357  733
  No higher education               138             378                511 1027
  Sum                               363             529                868 1760
# Calculate expected frequencies
expected <- chisq.test(cont_table)$expected
round(expected, 1)
                     
                      High trust (7-10) Low trust (0-3) Medium trust (4-6)
  Higher education                151.2           220.3              361.5
  No higher education             211.8           308.7              506.5

Compare to observed:

cont_table
                     
                      High trust (7-10) Low trust (0-3) Medium trust (4-6)
  Higher education                  225             151                357
  No higher education               138             378                511

Notice: Far more higher-education respondents fall in “high trust” than expected under independence (225 observed vs. 151.2 expected)!


Chi-square test statistic

Measures the discrepancy between observed and expected frequencies:

\[\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]

where:

  • \(O_{ij}\) = observed frequency in cell \((i,j)\)
  • \(E_{ij}\) = expected frequency under independence

Under \(H_0\) (independence): \[\chi^2 \sim \chi^2_{(r-1)(c-1)}\]

where \(r\) = number of rows, \(c\) = number of columns.

Degrees of freedom: \((r-1)(c-1) = (2-1)(3-1) = 2\) for our example.
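With df = 2, the rejection threshold at \(\alpha = 0.05\) can be read off with qchisq(), and tail probabilities with pchisq():

```r
# Critical value: reject H0 if the chi-square statistic exceeds this
qchisq(0.95, df = 2)
# [1] 5.991465

# Tail probability (p-value) for a hypothetical statistic of 5:
pchisq(5, df = 2, lower.tail = FALSE)
# [1] 0.082085
```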

The chi-square distribution

Note: Chi-square distribution is right-skewed and only takes positive values.

Example: Trust by education (cont.)

Hypotheses:

  • \(H_0\): Trust level and education are independent
  • \(H_A\): Trust level and education are associated

Manual computation:

# Observed and expected
O <- as.matrix(cont_table)
E <- chisq.test(cont_table)$expected

# Chi-square statistic: sum of (O - E)^2 / E
chi_manual <- sum((O - E)^2 / E)
χ² = 99.24 
df_chi <- (nrow(O) - 1) * (ncol(O) - 1)
p_value <- pchisq(chi_manual, df = df_chi, lower.tail = FALSE)
p-value = <2e-16 

Confirm with chisq.test():

chi_result <- chisq.test(cont_table)
chi_result

    Pearson's Chi-squared test

data:  cont_table
X-squared = 99.24, df = 2, p-value < 2.2e-16

We reject \(H_0\): trust level and education are not independent \(\rightarrow\) people with higher education tend to report systematically different levels of interpersonal trust.

Visualizing the association

Your turn! #5

Test whether life satisfaction is independent of gender:

  1. Create a contingency table of life satisfaction categories (low: 0-4, medium: 5-7, high: 8-10) by gender.
ess_lifesat_cat <- ess_france |>
  filter(stflife %in% 0:10, gndr %in% 1:2) |>
  mutate(gender   = recode(gndr, `1` = "Male", `2` = "Female"),
         sat_level = case_when(stflife <= 4  ~ "Low (0-4)",
                               stflife <= 7  ~ "Medium (5-7)",
                               stflife <= 10 ~ "High (8-10)"))
  2. Calculate expected frequencies by hand (\(E_{ij} = \text{row total} \times \text{col total} / n\)).

  3. Compute the chi-square statistic manually using \(\chi^2 = \sum (O-E)^2/E\).

  4. Find the p-value with pchisq().

  5. Verify with chisq.test().

  6. Interpret: Is there evidence of an association?


Randomization methods

Beyond theory-based tests

So far: We’ve used theory-based tests that rely on distributional assumptions (t-distribution, normal distribution, chi-square distribution)

Alternative approach: Randomization tests (also called permutation tests)

  • Don’t rely on distributional assumptions
  • Use simulation to create the null distribution
  • More intuitive: “What would we see if there were no effect?”

The infer package provides a tidy workflow for this approach!

The infer workflow

Every hypothesis test with infer follows four steps:

  1. specify(): Specify the variables (and the success level, for proportions)

  2. hypothesize(): Declare the null hypothesis

  3. generate(): Generate simulated samples under \(H_0\)

  4. calculate(): Calculate the test statistic for each simulation

Then: visualize() and get_p_value()

Permutation test for two means

Logic: If education and trust are unrelated, we could randomly shuffle the education labels and get similar differences by chance.

Method:

  1. Calculate observed difference in means
  2. Randomly permute the group labels many times (e.g., 1,000)
  3. Calculate the difference in means for each permutation
  4. Compare observed difference to permutation distribution

P-value: Proportion of permutations with difference ≥ observed (in absolute value)
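The four steps above can be sketched in base R without infer (simulated data — the group sizes and means below are hypothetical stand-ins for the ESS variables):

```r
set.seed(1)
# Simulated trust scores for two education groups (hypothetical values)
trust <- c(rnorm(50, mean = 5.9, sd = 2.3),
           rnorm(50, mean = 5.2, sd = 2.5))
group <- rep(c("Higher", "None"), each = 50)

# Step 1: observed difference in means
obs_diff <- mean(trust[group == "Higher"]) - mean(trust[group == "None"])

# Steps 2-3: shuffle the labels many times, recompute the difference
perm_diffs <- replicate(1000, {
  shuffled <- sample(group)   # random permutation of the group labels
  mean(trust[shuffled == "Higher"]) - mean(trust[shuffled == "None"])
})

# Step 4: two-sided p-value = share of permutations at least as extreme
p_value <- mean(abs(perm_diffs) >= abs(obs_diff))
```

infer wraps exactly this logic in a tidy pipeline, as the next slides show.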

Example: Education and trust (permutation)

# Step 1: Calculate observed statistic
obs_diff <- ess_educ_trust |>
  specify(ppltrst ~ higher_educ) |>
  calculate(stat = "diff in means", order = c("Higher education", "No higher education"))

obs_diff
Response: ppltrst (numeric)
Explanatory: higher_educ (factor)
# A tibble: 1 × 1
   stat
  <dbl>
1  1.02
# Steps 2-4: Generate null distribution via permutation
set.seed(2024)
null_dist <- ess_educ_trust |>
  specify(ppltrst ~ higher_educ) |>
  hypothesize(null = "independence") |>
  generate(reps = 10000, type = "permute") |>
  calculate(stat = "diff in means", order = c("Higher education", "No higher education"))

Visualizing the null distribution

# Visualize
null_dist |>
  visualize() +
  shade_p_value(obs_stat = obs_diff, direction = "both") +
  labs(title = "Permutation distribution of difference in means",
       subtitle = "Red line shows observed difference",
       x = "Difference in mean trust (Higher ed - No higher ed)",
       y = "Count")

Calculate p-value

# Get p-value
p_value_perm <- null_dist |>
  get_p_value(obs_stat = obs_diff, direction = "both")

p_value_perm
# A tibble: 1 × 1
  p_value
    <dbl>
1       0

A permutation p-value of exactly 0 just means that none of the 10,000 permuted differences were as extreme as the observed one, i.e. \(p < 1/10{,}000\).

Compare to theory-based test:

# Theory-based test (from earlier)
t.test(ppltrst ~ higher_educ, data = ess_educ_trust)$p.value
[1] 6.469825e-23

Consistent: both show overwhelming evidence of a difference (p < 0.001).

Theory-based vs randomization

When to use each approach?

Theory-based tests:

  • Faster (no simulation needed)
  • Well-established and widely recognized
  • Require distributional assumptions
  • Work well when assumptions are met

Randomization tests:

  • More flexible (fewer assumptions)
  • More intuitive interpretation
  • Computationally intensive
  • Ideal for small samples or non-normal data
  • Exact p-values when all permutations are enumerated; with random permutations, precision grows with the number of reps

In practice: Theory-based tests are usually fine if assumptions are met.

Use randomization when: sample is small, data is skewed, or you want to be conservative.

Your turn! #6

Perform a permutation test for the difference in life satisfaction between men and women using infer.

  1. Calculate observed difference.

  2. Generate 10,000 permutations

  3. Visualize and calculate p-value.

  4. Compare to theory-based t-test.


Statistical power and sample size

Statistical power

Statistical power = \(1 - \beta\) = Probability of correctly rejecting a false null hypothesis

Factors affecting power:

  1. Effect size: Larger true effects → easier to detect → higher power
  2. Sample size: Larger samples → more information → higher power
  3. Significance level: Larger \(\alpha\) → easier to reject \(H_0\) → higher power (but more Type I errors!)
  4. Variability: Less noise → clearer signal → higher power

Rule of thumb:

Aim for power ≥ 0.80 (80% chance of detecting a true effect)

Power analysis visualization

Setup: Testing \(H_0: \mu = 7.0\) when the true value is \(\mu = 7.3\) (effect size = 0.3, SE = 0.1)


Power analysis: effect of larger \(n\)

Power analysis: effect of larger effect size

Computing power

Power = probability of correctly rejecting \(H_0\) when it is false

Power is the probability that \(\bar{X}\) exceeds the critical value \(c\) when the true mean is \(\mu_1\):

\[\text{Power} = P(\bar{X} > c \mid \mu = \mu_1), \qquad c = \mu_0 + t_{\alpha/2,\,n-1} \cdot \frac{\sigma}{\sqrt{n}}\]

Under the true distribution \(\bar{X} \sim \mathcal{N}(\mu_1,\, \sigma^2/n)\). Standardise:

\[\text{Power} = P\!\left(Z > \frac{c - \mu_1}{\sigma/\sqrt{n}}\right)\]

Substitute \(c = \mu_0 + t_{\alpha/2,\,n-1} \cdot \sigma/\sqrt{n}\) and simplify:

\[\text{Power} = P\!\left(Z > t_{\alpha/2,\,n-1} - \frac{\mu_1 - \mu_0}{\sigma/\sqrt{n}}\right) = 1 - \Phi\!\left(t_{\alpha/2,\,n-1} - \frac{\mu_1 - \mu_0}{\sigma/\sqrt{n}}\right)\]

\(t_{\alpha/2,\,n-1}\) replaces \(z_{\alpha/2}\) because \(\sigma\) is unknown — but \(\Phi\) (normal CDF) remains because we are standardising \(\bar{X}\), not the t-statistic itself.

Computing power

As the effect \((\mu_1 - \mu_0)\) or \(n\) grows, the argument of \(\Phi\) shrinks \(\Rightarrow\) power increases.

The three levers that increase power:

  • Larger effect size \(|\mu_1 - \mu_0|\) — the further the truth is from \(H_0\), the easier it is to detect
  • Larger \(n\) — shrinks the SE, pulling the two distributions apart
  • Larger \(\alpha\) — moves the critical value closer to \(\mu_0\), but increases Type I error

In R: use pwr::pwr.t.test() (one-sample) or pwr::pwr.t2n.test() (two-sample):

library(pwr)
pwr.t.test(d = effect / sigma, n = n, sig.level = 0.05, type = "one.sample")

where d = (μ₁ − μ₀) / σ is Cohen’s d (standardised effect size).
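To see the sample-size lever numerically, the power expression from the previous slide can be evaluated over a grid of \(n\) (a sketch using the normal approximation for the tail probability, as in the derivation):

```r
# Power of a one-sample test of H0: mu = mu0 when the true mean is mu1
power_for_n <- function(n, mu0 = 7.0, mu1 = 7.3, sigma = 1.0, alpha = 0.05) {
  t_crit <- qt(1 - alpha/2, df = n - 1)          # t critical value
  1 - pnorm(t_crit - (mu1 - mu0) / (sigma / sqrt(n)))
}

# Power rises monotonically with n (for a fixed effect of 0.3)
round(sapply(c(25, 50, 100, 200), power_for_n), 3)
```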

Computing power: example

Setup from the visualization: \(\mu_0 = 7.0\), true \(\mu_1 = 7.3\), \(\sigma = 1.0\), \(n = 100\), \(\alpha = 0.05\)

Analytically (manual):

mu0   <- 7.0
mu1   <- 7.3
sigma <- 1.0
n     <- 100
alpha <- 0.05

se     <- sigma / sqrt(n)
t_crit <- qt(1 - alpha/2, df = n - 1)
# Argument of Phi in the power formula: t_crit - (mu1 - mu0)/SE
z_arg    <- t_crit - (mu1 - mu0) / se
power_an <- 1 - pnorm(z_arg)
SE = 0.1 
Argument of Phi = -1.0158 
Power = 0.845 

Using pwr (Cohen’s d = 0.3 / 1.0 = 0.3):

library(pwr)
pwr.t.test(d = (mu1 - mu0) / sigma, n = n,
           sig.level = alpha, type = "one.sample")

     One-sample t test power calculation 

              n = 100
              d = 0.3
      sig.level = 0.05
          power = 0.8439471
    alternative = two.sided
# the slight difference comes from the normal
# approximation (manual) vs the exact
# t-distribution used by pwr.t.test()

Both approaches agree closely. With \(n = 100\) and a small effect (\(d = 0.3\)), power ≈ 84% — just above the conventional 80% target.

Sample size and power

How many observations do we need?

This depends on:

  • The effect size we want to detect
  • The desired power (typically 0.80)
  • The significance level \(\alpha\) (typically 0.05)

Sample size and power

Derivation: start from the power formula and set it equal to the desired power \(1 - \beta\):

\[1 - \Phi\!\left(t_{\alpha/2,\,n-1} - \frac{|\mu_1 - \mu_0|}{\sigma/\sqrt{n}}\right) = 1 - \beta\]

Apply \(\Phi^{-1}\) to both sides (recall \(\Phi^{-1}(p) = z_{p} = -z_{1-p}\), i.e., it’s qnorm):

\[t_{\alpha/2,\,n-1} - \frac{|\mu_1 - \mu_0|}{\sigma/\sqrt{n}} = -z_{1-\beta}\]

Approximate \(t_{\alpha/2,\,n-1} \approx z_{\alpha/2}\) (valid for large \(n\)) and rearrange to isolate \(\sqrt{n}\):

\[\frac{|\mu_1 - \mu_0|}{\sigma/\sqrt{n}} = z_{\alpha/2} + z_{1-\beta} \implies \sqrt{n} = \frac{(z_{\alpha/2} + z_{1-\beta})\,\sigma}{|\mu_1 - \mu_0|}\]

Square both sides:

\[\boxed{n = \left(\frac{z_{\alpha/2} + z_{1-\beta}}{|\mu_1 - \mu_0|/\sigma}\right)^2}\]

where \(|\mu_1 - \mu_0|/\sigma\) is Cohen’s \(d\), i.e., the standardised effect size (always positive)
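The boxed formula drops straight into a small helper function (a sketch; d is Cohen's standardised effect size, and the result is rounded up since \(n\) must be an integer):

```r
# Required n for a two-sided test at level alpha with a given power,
# using the normal approximation from the derivation above
n_required <- function(d, alpha = 0.05, power = 0.80) {
  z_a <- qnorm(1 - alpha/2)   # z_{alpha/2}
  z_b <- qnorm(power)         # z_{1-beta}
  ceiling(((z_a + z_b) / d)^2)
}

n_required(d = 0.3 / 2)   # effect 0.3 points, sigma = 2  ->  349
```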

Sample size and power

Example: What’s the sample size needed to detect an effect of 0.3 points in life satisfaction (\(\sigma\) = 2) with 80% power at \(\alpha\) = 0.05? (Note: \(z_{0.8} \approx 0.84\).)

Let’s use the formula from the previous slide:

\[\begin{align*} n &= \left(\frac{z_{\alpha/2} + z_{1-\beta}}{|\mu_1 - \mu_0|/\sigma}\right)^2\\ &\approx \left(\frac{1.96 + 0.84}{0.3/2}\right)^2\\ &\approx \left(\frac{2.8}{0.15}\right)^2\\ &\approx 348 \end{align*}\]

\(\rightarrow\) to detect an effect of at least 0.3 points (with \(\sigma\) = 2) with 80% power at the 5% level we need a sample size of roughly 350 individuals

Sample size and power

Example: What’s the sample size needed to detect an effect of 0.3 points in life satisfaction (\(\sigma\) = 2) with 80% power at \(\alpha\) = 0.05?

R computation using the formula:

delta <- 0.3   # minimum effect to detect
sigma <- 2.0   # population SD (estimated from data)

# Critical values for α = 0.05 (two-sided)
# and power = 0.80 (β = 0.20)
z_alpha2 <- qnorm(0.975)   # z_{1-α/2} ≈ 1.96
z_alpha2
[1] 1.959964
z_beta   <- qnorm(0.80)    # z_{1-β}   ≈ 0.84
z_beta
[1] 0.8416212
# Required n
n_manual <- ((z_alpha2 + z_beta) / (delta / sigma))^2
n_manual
[1] 348.8391

Verify with power.t.test():

power.t.test(delta = delta, sd = sigma,
             power = 0.80,
             sig.level = 0.05,
             type = "one.sample")

     One-sample t test power calculation 

              n = 350.7647
          delta = 0.3
             sd = 2
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

The normal approximation gives \(n \approx 349\); the exact t-based calculation in power.t.test() gives 350.8, so in practice we round up to \(n = 351\).

Your turn! #7

You are planning a study on life satisfaction in France. Based on previous research, you assume \(\sigma = 2.0\).

Part 1: Compute power

A previous study found \(\bar{x} = 6.8\). You want to test \(H_0: \mu = 7.0\) at \(\alpha = 0.05\) with \(n = 80\) observations. What is the power of this test to detect the true effect?

  1. Compute power manually using the formula (use qt and pnorm)
  2. Verify with power.t.test()

Part 2: Minimum sample size

You want to design a new study to detect a difference of at least 0.4 points from \(\mu_0 = 7.0\) with 80% power at \(\alpha = 0.05\).

  1. Compute the required \(n\) manually using the formula (use qnorm)
  2. Verify with power.t.test() (and round up)


Practical guidelines

Choosing the right test

| Research question | Test | Test statistic | R function |
|---|---|---|---|
| One group: mean vs. a value | One-sample t-test | \(t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}\) | t.test(x, mu = …) |
| One group: proportion vs. a value | One-sample proportion test | \(z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}\) | prop.test(x, n, p = …) |
| Two groups (indep.): compare means | Two-sample t-test (Welch’s) | \(t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}\) | t.test(y ~ x, data = …) |
| Two groups (indep.): compare proportions | Two-sample proportion test | \(z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})(1/n_1+1/n_2)}}\) | prop.test(x, n) |
| Two groups (paired): compare means | Paired t-test | \(t = \dfrac{\bar{d}}{s_d/\sqrt{n}}\) | t.test(x, y, paired = TRUE) |
| Categorical association: test independence | Chi-square test | \(\chi^2 = \sum_{i,j} \dfrac{(O_{ij}-E_{ij})^2}{E_{ij}}\) | chisq.test(table(x, y)) |

The ASA statement on p-values

In 2016, the American Statistical Association issued guidance on proper use of p-values:

Six principles

  1. P-values can indicate how incompatible the data are with a specified statistical model
  2. P-values do NOT measure the probability that the studied hypothesis is true
  3. Scientific conclusions should NOT be based solely on whether a p-value passes a threshold
  4. Proper inference requires full reporting and transparency
  5. A p-value does NOT measure the size of an effect or importance of a result
  6. By itself, a p-value does NOT provide a good measure of evidence

Bottom line: Report effect sizes, confidence intervals, and context — not just p-values!

Reporting your results

Good practice includes:

  1. Effect size: Mean difference, odds ratio, etc.

  2. Confidence interval: Shows precision and range of plausible values

  3. Test statistic and p-value: Provides formal test result

  4. Sample size: Allows assessment of power

  5. Context and interpretation: What does it mean in plain language?

Example:

“French adults with higher education have significantly higher interpersonal trust (mean = 5.88, SD = 2.31) compared to those without higher education (mean = 5.21, SD = 2.52). The difference of 0.67 points [95% CI: 0.50, 0.85] represents a small-to-medium effect and is statistically significant (t = 7.73, p < .001).”
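Every number in that sentence can be pulled from a single t.test object. A sketch with simulated data (the group sizes, means, and SDs below are made up to resemble the example):

```r
set.seed(42)
# Hypothetical trust scores for two education groups
higher <- rnorm(700,  mean = 5.9, sd = 2.3)
none   <- rnorm(1000, mean = 5.2, sd = 2.5)

tt <- t.test(higher, none)             # Welch two-sample t-test

tt$estimate                            # group means (effect size ingredients)
unname(diff(rev(tt$estimate)))         # mean difference (higher - none)
tt$conf.int                            # 95% CI for the difference
c(t = unname(tt$statistic), p = tt$p.value)   # test statistic and p-value
```

Reporting all four pieces — estimate, interval, statistic, sample sizes — takes one extra line of code and makes the result far more informative than a bare p-value.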

See you next week!