class: center, inverse, middle

<style type="text/css">
.pull-left { float: left; width: 44%; }
.pull-right { float: right; width: 44%; }
.pull-right ~ p { clear: both; }
.pull-left-wide { float: left; width: 66%; }
.pull-right-wide { float: right; width: 66%; }
.pull-right-wide ~ p { clear: both; }
.pull-left-narrow { float: left; width: 30%; }
.pull-right-narrow { float: right; width: 30%; }
.tiny123 { font-size: 0.40em; }
.small123 { font-size: 0.80em; }
.large123 { font-size: 2em; }
.red { color: red }
.orange { color: orange }
.green { color: green }
</style>

# Statistics
## Testing relationships using qualitative data
### (Chapter 16)
### Christian Vedel,<br>Department of Economics<br>University of Southern Denmark
### Email: [christian-vs@sam.sdu.dk](mailto:christian-vs@sam.sdu.dk)
### Updated 2026-04-27

---
class: middle

# Today's lecture
.pull-left-wide[
**Using chi-square tests to assess whether observed categorical distributions match theory and whether two discrete variables are related**

- **Section 1:** The `\(\chi^2\)`-test
- **Section 2:** Test of a distribution
- **Section 3:** Testing the relationship between two discrete random variables
- **Section 4:** Test for homogeneity
]

.pull-right-narrow[
![](Figures/dalle_chi2.png)
]

---
class: inverse, middle, center
# The `\(\chi^2\)`-test

---
# Motivation
.pull-left-wide[
**Discrete random variables** arise when we deal with qualitative data:

- highest level of education attained
- gender
- type of car purchased (gas, diesel, hybrid, electric)

We can test whether the observed distribution of such a variable matches a theoretical one.
]

---
# The `\(\chi^2\)` test: setup
.pull-left-wide[
Let `\(X\)` be the variable of interest, taking `\(K\)` possible values `\(x_1, x_2, \ldots, x_K\)` (called **categories**). The probability of each category `\(k\)` is `\(f(x_k) = p_k\)`.

Hypotheses:
`$$\begin{align*} H_0 & : p_1=\pi_1,\; p_2=\pi_2,\ldots,p_K=\pi_K \\ H_1 & : \text{at least one } p_k \not= \pi_k \end{align*}$$`

where `\(\pi_1, \ldots, \pi_K\)` are the probabilities from the given (theoretical) distribution.
]
.pull-right-narrow[
.small123[
**Note:** `\(\pi_k\)` is the hypothesised probability; `\(p_k\)` is the true (unknown) probability; `\(Z_k\)` is the observed count.
]
]

---
# The `\(\chi^2\)` test statistic
.pull-left-wide[
With a simple random sample of `\(n\)` observations, let `\(Z_k\)` be the number of elements with value `\(x_k\)`. Then `\(Z_k/n\)` estimates `\(p_k\)`.

> **Test statistic:**
`$$\chi^2 = \sum_{k=1}^K \frac{(Z_k - n\pi_k)^2}{n\pi_k}$$`

Under `\(H_0\)` and for large `\(n\)`, `\(\chi^2\)` approximately follows `\(\chi^2(K-1)\)`. The `\(K-1\)` degrees of freedom arise because the `\(K\)` probabilities must sum to 1, so only `\(K-1\)` of them are free.

Decision rule: reject `\(H_0\)` if `\(\chi^2 > \chi^2_{1-\alpha}(K-1)\)`.
]
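
---
# The `\(\chi^2\)` test in R

How the computation looks in practice: a minimal sketch in base R, with made-up counts for a six-sided die. `chisq.test()` runs the same test in one call.

```r
# Goodness-of-fit: is the die fair? (made-up counts for illustration)
observed <- c(18, 25, 15, 24, 16, 22)    # observed counts Z_k for faces 1-6
p0 <- rep(1 / 6, 6)                      # hypothesised probabilities pi_k

expected <- sum(observed) * p0           # expected counts n * pi_k
chi2 <- sum((observed - expected)^2 / expected)
chi2                                     # the test statistic
qchisq(0.95, df = length(observed) - 1)  # critical value with K - 1 = 5 df
1 - pchisq(chi2, df = 5)                 # p-value

# The same test in one call
chisq.test(observed, p = p0)
```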

---
# .red[Raise your hand 1: The `\(\chi^2\)`-test]
.pull-left-wide[
**Q1.** A `\(\chi^2\)` test checks whether a six-sided die is fair (`\(K=6\)`). How many degrees of freedom does the test statistic have?

A: 5

B: 6

C: 1

D: Depends on the sample size `\(n\)`
]

--

.pull-left-wide[
**Q2.** The minimum `\(\chi^2\)` value that leads to rejection at `\(\alpha=0.05\)` for this test is:

A: `\(\chi^2_{0.95}(5) \approx 11.07\)`

B: `\(\chi^2_{0.95}(6) \approx 12.59\)`

C: `\(\chi^2_{0.975}(5) \approx 12.83\)`

D: `\(z_{0.95}^2 \approx 2.71\)`
]

---
# .red[Practice 1: Chi-square goodness-of-fit]
.pull-left-wide[
A die is rolled 120 times. Observed counts:

| Face | 1 | 2 | 3 | 4 | 5 | 6 |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| Count | 25 | 17 | 22 | 18 | 21 | 17 |

1. State `\(H_0\)` and compute the expected counts under `\(H_0\)`.
2. Compute the `\(\chi^2\)` test statistic.
3. Test at `\(\alpha=0.05\)`. (`\(\chi^2_{0.95}(5) \approx 11.07\)`)
]

---
class: inverse, middle, center
# Test of a distribution

---
# Known and unknown distributions
.pull-left-wide[
**Known distribution:** the theoretical probabilities `\(\pi_k\)` are fully specified.

Example: test whether a die is fair (`\(\pi_k = 1/6\)`) or a coin is unbiased (`\(\pi_1 = \pi_2 = 0.5\)`).

Proceed directly with the test statistic from Section 1.
]

--

.pull-left-wide[
**Unknown distribution:** we know the distributional family but not all of its parameters.

Example: `\(X \sim \text{Binomial}(n,p)\)` with `\(n\)` known but `\(p\)` unknown. Estimate `\(p\)` from the sample, then compute the `\(\pi_k\)` using the estimate.

> When one parameter is estimated, the test statistic follows `\(\chi^2(K-2)\)`: one degree of freedom is lost for each estimated parameter.
]

---
class: inverse, middle, center
# Testing the relationship between two discrete random variables

---
# Independence and hypotheses
.pull-left-wide[
Recall that `\(X\)` and `\(Y\)` are independent if and only if
`$$f(x,y) = f_X(x) \cdot f_Y(y)$$`

Hypotheses:
`$$\begin{align*} H_0 & : f(x,y) = f_X(x)\cdot f_Y(y) \text{ for all } x,y \\ H_1 & : f(x,y) \not= f_X(x)\cdot f_Y(y) \text{ for some } x,y \end{align*}$$`

We interpret `\(f(x,y)\)` as the unknown joint probabilities and `\(f_X(x)\cdot f_Y(y)\)` as the distribution under independence.
]

---
# Setup and estimated probabilities
.pull-left-wide[
Suppose `\(X\)` takes `\(J\)` values `\(x_1,\ldots,x_J\)` and `\(Y\)` takes `\(L\)` values `\(y_1,\ldots,y_L\)`, giving `\(K = J \cdot L\)` categories.

With a sample of `\(n\)` pairs `\((X_1,Y_1),\ldots,(X_n,Y_n)\)`, estimate the marginals:
`$$\begin{align*} \hat{f}_X(x_j) & = \frac{\#\text{ elements with } X=x_j}{n} \\ \hat{f}_Y(y_l) & = \frac{\#\text{ elements with } Y=y_l}{n} \end{align*}$$`

Let `\(Z_{jl}\)` be the number of elements with `\(X=x_j\)` and `\(Y=y_l\)`.
]

---
# Test statistic for independence
.pull-left-wide[
> **Test statistic:**
`$$\chi^2 = \sum_{j=1}^J\sum_{l=1}^L \frac{\left[Z_{jl} - n\cdot\hat{f}_X(x_j)\cdot\hat{f}_Y(y_l)\right]^2}{n\cdot\hat{f}_X(x_j)\cdot\hat{f}_Y(y_l)}$$`

Degrees of freedom: starting from `\(K - 1 = JL - 1\)`,

- `\(J-1\)` are lost estimating `\(\hat{f}_X(x)\)`
- `\(L-1\)` are lost estimating `\(\hat{f}_Y(y)\)`
- leaving `\((JL-1)-(J-1)-(L-1) = (J-1)(L-1)\)`

Reject `\(H_0\)` if `\(\chi^2 > \chi^2_{1-\alpha}\!\left((J-1)(L-1)\right)\)`.
]
.pull-right-narrow[
.small123[
**Expected count** under independence:
`$$n\cdot\hat{f}_X(x_j)\cdot\hat{f}_Y(y_l) = \frac{\text{row total}\times\text{col total}}{n}$$`
]
]
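
---
# Test of independence in R

A minimal sketch with a made-up `\(2 \times 3\)` table of counts; `chisq.test()` applied to a matrix of counts runs the same test in one call.

```r
# Independence test: made-up J = 2 by L = 3 table of counts Z_jl
tab <- matrix(c(20, 30, 10,
                25, 15, 20),
              nrow = 2, byrow = TRUE)

n <- sum(tab)
expected <- outer(rowSums(tab), colSums(tab)) / n  # row total * col total / n
chi2 <- sum((tab - expected)^2 / expected)
chi2                                   # the test statistic
qchisq(0.95, df = (2 - 1) * (3 - 1))   # critical value with (J-1)(L-1) df

# The same test in one call (the default Yates correction only affects 2x2 tables)
chisq.test(tab)
```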

---
# .red[Raise your hand 2: Test of independence]
.pull-left-wide[
**Q1.** A `\(\chi^2\)` independence test between `\(X\)` (3 values) and `\(Y\)` (4 values) has how many degrees of freedom?

A: `\((J-1)(L-1) = 6\)`

B: `\(J \cdot L - 1 = 11\)`

C: `\(J + L - 2 = 5\)`

D: Depends on the sample size `\(n\)`
]

--

.pull-left-wide[
**Q2.** A test of homogeneity (whether `\(X\)` has the same distribution across groups defined by `\(Y\)`) is mathematically equivalent to:

A: A test of independence between `\(X\)` and `\(Y\)`

B: A two-sample `\(t\)`-test of the means of `\(X\)`

C: A test of whether `\(X\)` is normally distributed

D: A test of whether the variances of `\(X\)` are equal across groups
]

---
# .red[Practice 2: Test of independence]
.pull-left-wide[
A survey of 200 people records education level and whether they voted. Observed counts:

| | Voted | Did not vote | Total |
|---|:---:|:---:|:---:|
| Low education | 30 | 50 | 80 |
| Medium education | 45 | 35 | 80 |
| High education | 35 | 5 | 40 |
| **Total** | **110** | **90** | **200** |

1. Compute the expected counts under independence.
2. Compute the `\(\chi^2\)` test statistic.
3. Test at `\(\alpha=0.05\)`. (`\(\chi^2_{0.95}(2) \approx 5.99\)`)
]

---
class: inverse, middle, center
# Test for homogeneity

---
# Homogeneity
.pull-left-wide[
A **test of homogeneity** asks: is the distribution of `\(X\)` the same across groups defined by `\(Y\)`?

**Example:** Is the age distribution of men and women the same? Let `\(X\)` = age, `\(Y\)` = gender. We compare the conditional distributions `\(f_{X|Y}(x \mid Y=1)\)` and `\(f_{X|Y}(x \mid Y=2)\)`.
]

--

.pull-left-wide[
If `\(f_{X|Y}(x \mid Y=1) = f_{X|Y}(x \mid Y=2)\)` for every `\(x\)`, then `\(X\)` and `\(Y\)` are **independent**.

> A test of homogeneity of `\(X\)` across groups defined by `\(Y\)` is therefore equivalent to a test of **independence** between `\(X\)` and `\(Y\)`: use exactly the same `\(\chi^2\)` procedure.
]

---
# Before next time
.pull-left[
- Read the assigned reading
- Next time: Simple linear regression `\(\rightarrow\)` Chapter 17
]

.pull-right[
![](Figures/books.jpg)
]
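
---
# .red[Practice 2: checking your answer in R]

A minimal sketch for verifying the by-hand calculation, using the counts from the Practice 2 slide.

```r
# Education level vs. voting: observed counts from Practice 2
tab <- matrix(c(30, 50,
                45, 35,
                35,  5),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("Low", "Medium", "High"),
                              c("Voted", "Did not vote")))

expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)  # expected counts
chi2 <- sum((tab - expected)^2 / expected)
chi2                          # approx. 27.0
qchisq(0.95, df = 2)          # approx. 5.99, so H0 is rejected

# The same test in one call (no Yates correction outside 2x2 tables)
chisq.test(tab)
```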