class: center, inverse, middle

<style type="text/css">
.pull-left { float: left; width: 44%; }
.pull-right { float: right; width: 44%; }
.pull-right ~ p { clear: both; }
.pull-left-wide { float: left; width: 66%; }
.pull-right-wide { float: right; width: 66%; }
.pull-right-wide ~ p { clear: both; }
.pull-left-narrow { float: left; width: 30%; }
.pull-right-narrow { float: right; width: 30%; }
.tiny123 { font-size: 0.40em; }
.small123 { font-size: 0.80em; }
.large123 { font-size: 2em; }
.red { color: red }
.orange { color: orange }
.green { color: green }
</style>

# Statistics
## Testing relationships using qualitative data
### (Chapter 16)
### Christian Vedel,<br>Department of Economics<br>University of Southern Denmark
### Email: [christian-vs@sam.sdu.dk](mailto:christian-vs@sam.sdu.dk)
### Updated 2026-04-27

---
class: middle

# Today's lecture
.pull-left-wide[
**Using chi-square tests to assess whether observed categorical distributions match theory and whether two discrete variables are related**

- **Section 1:** The `\(\chi^2\)`-test
- **Section 2:** Test of a distribution
- **Section 3:** Testing the relationship between two discrete random variables
- **Section 4:** Test for homogeneity
]

.pull-right-narrow[
![](Figures/dalle_chi2.png)
]

---
class: inverse, middle, center
# The `\(\chi^2\)`-test

---
# Motivation
.pull-left-wide[
**Discrete random variables** arise when we deal with qualitative data:

- highest level of education attained
- gender
- type of car purchased (gas, diesel, hybrid, electric)

We can test whether the observed distribution of such a variable matches a theoretical one.
]

---
# The `\(\chi^2\)` test: setup
.pull-left-wide[
Let `\(X\)` be the variable of interest, taking `\(K\)` possible values `\(x_1, x_2, \ldots, x_K\)` (called **categories**). The probability of each category `\(k\)` is `\(f(x_k) = p_k\)`.

Hypotheses:
`$$\begin{align*} H_0 & : p_1=\pi_1,\; p_2=\pi_2,\ldots,p_K=\pi_K \\ H_1 & : \text{at least one } p_k \not= \pi_k \end{align*}$$`

where `\(\pi_1, \ldots, \pi_K\)` are the probabilities from the given (theoretical) distribution.
]
.pull-right-narrow[
.small123[
**Note:** `\(\pi_k\)` is the hypothesised probability; `\(p_k\)` is the true (unknown) probability; `\(Z_k\)` is the observed count.
]
]

---
# The `\(\chi^2\)` test statistic
.pull-left-wide[
With a simple random sample of `\(n\)` observations, let `\(Z_k\)` be the number of elements with value `\(x_k\)`. Then `\(Z_k/n\)` estimates `\(p_k\)`.

> **Test statistic:**
`$$\chi^2 = \sum_{k=1}^K \frac{(Z_k - n\pi_k)^2}{n\pi_k}$$`

Under `\(H_0\)` and for large `\(n\)`, `\(\chi^2\)` approximately follows `\(\chi^2(K-1)\)`. The `\(K-1\)` degrees of freedom arise because the `\(K\)` probabilities must sum to 1, so only `\(K-1\)` of them are free.

Decision rule: reject `\(H_0\)` if `\(\chi^2 > \chi^2_{1-\alpha}(K-1)\)`.
]
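
---
# The `\(\chi^2\)` test in R

How the computation looks in practice: a minimal sketch in base R, with made-up counts for a six-sided die. `chisq.test()` runs the same test in one call.

```r
# Goodness-of-fit: is the die fair? (made-up counts for illustration)
observed <- c(18, 25, 15, 24, 16, 22)    # observed counts Z_k for faces 1-6
p0 <- rep(1 / 6, 6)                      # hypothesised probabilities pi_k

expected <- sum(observed) * p0           # expected counts n * pi_k
chi2 <- sum((observed - expected)^2 / expected)
chi2                                     # the test statistic
qchisq(0.95, df = length(observed) - 1)  # critical value with K - 1 = 5 df
1 - pchisq(chi2, df = 5)                 # p-value

# The same test in one call
chisq.test(observed, p = p0)
```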

---
# .red[Raise your hand 1: The `\(\chi^2\)`-test]
.pull-left-wide[
**Q1.** A `\(\chi^2\)` test checks whether a six-sided die is fair (`\(K=6\)`). How many degrees of freedom does the test statistic have?

A: 5

B: 6

C: 1

D: Depends on the sample size `\(n\)`
]

--

.pull-left-wide[
**Q2.** The minimum `\(\chi^2\)` value that leads to rejection at `\(\alpha=0.05\)` for this test is:

A: `\(\chi^2_{0.95}(5) \approx 11.07\)`

B: `\(\chi^2_{0.95}(6) \approx 12.59\)`

C: `\(\chi^2_{0.975}(5) \approx 12.83\)`

D: `\(z_{0.95}^2 \approx 2.71\)`
]

---
# .red[Practice 1: Chi-square goodness-of-fit]
.pull-left-wide[
A die is rolled 120 times. Observed counts:

| Face | 1 | 2 | 3 | 4 | 5 | 6 |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| Count | 25 | 17 | 22 | 18 | 21 | 17 |

1. State `\(H_0\)` and compute the expected counts under `\(H_0\)`.
2. Compute the `\(\chi^2\)` test statistic.
3. Test at `\(\alpha=0.05\)`. (`\(\chi^2_{0.95}(5) \approx 11.07\)`)
]

---
class: inverse, middle, center
# Test of a distribution

---
# Known and unknown distributions
.pull-left-wide[
**Known distribution:** the theoretical probabilities `\(\pi_k\)` are fully specified.

Example: test whether a die is fair (`\(\pi_k = 1/6\)`) or a coin is unbiased (`\(\pi_1 = \pi_2 = 0.5\)`).

Proceed directly with the test statistic from Section 1.
]

--

.pull-left-wide[
**Unknown distribution:** we know the distributional family but not all of its parameters.

Example: `\(X \sim \text{Binomial}(n,p)\)` with `\(n\)` known but `\(p\)` unknown. Estimate `\(p\)` from the sample, then compute the `\(\pi_k\)` using the estimate.

> When one parameter is estimated, the test statistic follows `\(\chi^2(K-2)\)`: one degree of freedom is lost for each estimated parameter.
]

---
class: inverse, middle, center
# Testing the relationship between two discrete random variables

---
# Independence and hypotheses
.pull-left-wide[
Recall that `\(X\)` and `\(Y\)` are independent if and only if
`$$f(x,y) = f_X(x) \cdot f_Y(y)$$`

Hypotheses:
`$$\begin{align*} H_0 & : f(x,y) = f_X(x)\cdot f_Y(y) \text{ for all } x,y \\ H_1 & : f(x,y) \not= f_X(x)\cdot f_Y(y) \text{ for some } x,y \end{align*}$$`

We interpret `\(f(x,y)\)` as the unknown joint probabilities and `\(f_X(x)\cdot f_Y(y)\)` as the distribution under independence.
]

---
# Setup and estimated probabilities
.pull-left-wide[
Suppose `\(X\)` takes `\(J\)` values `\(x_1,\ldots,x_J\)` and `\(Y\)` takes `\(L\)` values `\(y_1,\ldots,y_L\)`, giving `\(K = J \cdot L\)` categories.

With a sample of `\(n\)` pairs `\((X_1,Y_1),\ldots,(X_n,Y_n)\)`, estimate the marginals:
`$$\begin{align*} \hat{f}_X(x_j) & = \frac{\#\text{ elements with } X=x_j}{n} \\ \hat{f}_Y(y_l) & = \frac{\#\text{ elements with } Y=y_l}{n} \end{align*}$$`

Let `\(Z_{jl}\)` be the number of elements with `\(X=x_j\)` and `\(Y=y_l\)`.
]

---
# Test statistic for independence
.pull-left-wide[
> **Test statistic:**
`$$\chi^2 = \sum_{j=1}^J\sum_{l=1}^L \frac{\left[Z_{jl} - n\cdot\hat{f}_X(x_j)\cdot\hat{f}_Y(y_l)\right]^2}{n\cdot\hat{f}_X(x_j)\cdot\hat{f}_Y(y_l)}$$`

Degrees of freedom: starting from `\(K - 1 = JL - 1\)`,

- `\(J-1\)` are lost estimating `\(\hat{f}_X(x)\)`
- `\(L-1\)` are lost estimating `\(\hat{f}_Y(y)\)`
- leaving `\((JL-1)-(J-1)-(L-1) = (J-1)(L-1)\)`

Reject `\(H_0\)` if `\(\chi^2 > \chi^2_{1-\alpha}\!\left((J-1)(L-1)\right)\)`.
]
.pull-right-narrow[
.small123[
**Expected count** under independence:
`$$n\cdot\hat{f}_X(x_j)\cdot\hat{f}_Y(y_l) = \frac{\text{row total}\times\text{col total}}{n}$$`
]
]
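
---
# Test of independence in R

A minimal sketch with a made-up `\(2 \times 3\)` table of counts; `chisq.test()` applied to a matrix of counts runs the same test in one call.

```r
# Independence test: made-up J = 2 by L = 3 table of counts Z_jl
tab <- matrix(c(20, 30, 10,
                25, 15, 20),
              nrow = 2, byrow = TRUE)

n <- sum(tab)
expected <- outer(rowSums(tab), colSums(tab)) / n  # row total * col total / n
chi2 <- sum((tab - expected)^2 / expected)
chi2                                   # the test statistic
qchisq(0.95, df = (2 - 1) * (3 - 1))   # critical value with (J-1)(L-1) df

# The same test in one call (the default Yates correction only affects 2x2 tables)
chisq.test(tab)
```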

---
# .red[Raise your hand 2: Test of independence]
.pull-left-wide[
**Q1.** A `\(\chi^2\)` independence test between `\(X\)` (3 values) and `\(Y\)` (4 values) has how many degrees of freedom?

A: `\((J-1)(L-1) = 6\)`

B: `\(J \cdot L - 1 = 11\)`

C: `\(J + L - 2 = 5\)`

D: Depends on the sample size `\(n\)`
]

--

.pull-left-wide[
**Q2.** A test of homogeneity (whether `\(X\)` has the same distribution across groups defined by `\(Y\)`) is mathematically equivalent to:

A: A test of independence between `\(X\)` and `\(Y\)`

B: A two-sample `\(t\)`-test of the means of `\(X\)`

C: A test of whether `\(X\)` is normally distributed

D: A test of whether the variances of `\(X\)` are equal across groups
]

---
# .red[Practice 2: Test of independence]
.pull-left-wide[
A survey of 200 people records education level and whether they voted. Observed counts:

| | Voted | Did not vote | Total |
|---|:---:|:---:|:---:|
| Low education | 30 | 50 | 80 |
| Medium education | 45 | 35 | 80 |
| High education | 35 | 5 | 40 |
| **Total** | **110** | **90** | **200** |

1. Compute the expected counts under independence.
2. Compute the `\(\chi^2\)` test statistic.
3. Test at `\(\alpha=0.05\)`. (`\(\chi^2_{0.95}(2) \approx 5.99\)`)
]

---
class: inverse, middle, center
# Test for homogeneity

---
# Homogeneity
.pull-left-wide[
A **test of homogeneity** asks: is the distribution of `\(X\)` the same across groups defined by `\(Y\)`?

**Example:** Is the age distribution of men and women the same? Let `\(X\)` = age, `\(Y\)` = gender. We compare the conditional distributions `\(f_{X|Y}(x \mid Y=1)\)` and `\(f_{X|Y}(x \mid Y=2)\)`.
]

--

.pull-left-wide[
If `\(f_{X|Y}(x \mid Y=1) = f_{X|Y}(x \mid Y=2)\)` for every `\(x\)`, then `\(X\)` and `\(Y\)` are **independent**.

> A test of homogeneity of `\(X\)` across groups defined by `\(Y\)` is therefore equivalent to a test of **independence** between `\(X\)` and `\(Y\)`: use exactly the same `\(\chi^2\)` procedure.
]

---
# Before next time
.pull-left[
- Read the assigned reading
- Next time: Simple linear regression `\(\rightarrow\)` Chapter 17
]

.pull-right[
![](Figures/books.jpg)
]
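
---
# .red[Practice 2: checking your answer in R]

A minimal sketch for verifying the by-hand calculation, using the counts from the Practice 2 slide.

```r
# Education level vs. voting: observed counts from Practice 2
tab <- matrix(c(30, 50,
                45, 35,
                35,  5),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("Low", "Medium", "High"),
                              c("Voted", "Did not vote")))

expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)  # expected counts
chi2 <- sum((tab - expected)^2 / expected)
chi2                          # approx. 27.0
qchisq(0.95, df = 2)          # approx. 5.99, so H0 is rejected

# The same test in one call (no Yates correction outside 2x2 tables)
chisq.test(tab)
```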