class: center, inverse, middle <style type="text/css"> .pull-left { float: left; width: 44%; } .pull-right { float: right; width: 44%; } .pull-right ~ p { clear: both; } .pull-left-wide { float: left; width: 66%; } .pull-right-wide { float: right; width: 66%; } .pull-right-wide ~ p { clear: both; } .pull-left-narrow { float: left; width: 30%; } .pull-right-narrow { float: right; width: 30%; } .tiny123 { font-size: 0.40em; } .small123 { font-size: 0.80em; } .large123 { font-size: 2em; } .red { color: red } .orange { color: orange } .green { color: green } </style> # Statistics ## Estimating the mean value using stratified sampling and cluster sampling ### (Chapter 11) ### Christian Vedel,<br>Department of Economics<br>University of Southern Denmark ### Email: [christian-vs@sam.sdu.dk](mailto:christian-vs@sam.sdu.dk) ### Updated 2026-04-16 --- class: middle # Today's lecture .pull-left-wide[ **Sampling designs that exploit prior knowledge of the population** - **Section 1:** Stratification and the stratified sample average - **Section 2:** Properties of the stratified sample average - **Section 3:** Choice of stratum sample size - **Section 4:** Cluster sampling ] .pull-right-narrow[  ] --- class: inverse, middle, center # Stratification --- # Stratification .pull-left-wide[ - We may often know more about a population than just the characteristic of interest - This knowledge can potentially be used to select a sample that gives us more information than a simple random sample ] -- .pull-left-wide[ - One such method is **stratification**, which relies on prior knowledge of the distribution of some characteristic in the population ] --- # Strata .pull-left-wide[ - Suppose each element in the population has two characteristics: - the characteristic of interest `\(X\)` (e.g., income) - a second characteristic `\(Y\)`, with a known distribution (e.g., gender) ] -- .pull-left-wide[ - A **stratum** (plural: **strata**) consists of all elements in the population with the same 
value of `\(Y\)` - If `\(Y\)` can take `\(m\)` different values, there will be `\(m\)` strata ] -- .pull-left-wide[ - **Stratification** divides the population into strata based on `\(Y\)`, then draws a simple random sample within each stratum ] --- # Law of iterated expectations .pull-left-wide[ - When stratifying, we want to recover the population mean of `\(X\)` from the stratum means ] -- .pull-left-wide[ - The **law of iterated expectations** gives us: `$$E_X(X) = E_Y \left[ E_{X|Y}(X|Y) \right] = \sum_{j = 1}^m E_{X|Y}(X | Y = y_j) \cdot f_Y(y_j)$$` ] --- # Stratified population mean .pull-left-wide[ - The population mean of `\(X\)` is a weighted average of stratum means, with weights `\(w_j = f_Y(y_j) = N_j / N\)`: `$$\mu = E(X) = \sum_{j=1}^m \mu_j \cdot w_j$$` where `\(N_j\)` is the number of elements in stratum `\(j\)` and `\(N\)` is the total population size ] -- .pull-left-wide[ - In the gender example, the weights are the population fractions of men and women ] --- # Stratum sample average .pull-left-wide[ - By the analogy principle: replace unknown population quantities with sample counterparts - The weights `\(w_j\)` are known from the population; the only unknowns are `\(\mu_j\)` ] -- .pull-left-wide[ - Within each stratum we have a simple random sample, so: `$$\bar{X}_j = \frac{1}{n_j} \cdot \sum_{i = 1}^{n_j} X_{ij}$$` where `\(n_j\)` is the stratum sample size and `\(X_{ij}\)` is the `\(i\)`-th element in stratum `\(j\)` ] --- # Stratified sample average .pull-left-wide[ > The **stratified sample average** is the weighted sum of stratum sample averages: `$$\bar{X}_{st} = \sum_{j = 1}^m \bar{X}_j \cdot w_j$$` ] -- .pull-left-wide[ - `\(\bar{X}_{st}\)` is a weighted average of the `\(m\)` stratum sample averages - The weights `\(w_j = N_j / N\)` are the known population stratum proportions ] --- # Recap: Stratification .pull-left-wide[ **Stratification** - *Why?* To increase precision by exploiting known population structure - *When?* When we have 
prior knowledge of the distribution of some characteristic `\(Y\)` **Stratified sample average** `$$\bar{X}_{st} = \sum_{j = 1}^m \bar{X}_j \cdot w_j$$` i.e., the weighted sum of stratum sample averages ] --- # .red[Raise your hand 1: Stratification]
.pull-left-wide[ **Q1.** A population is split into two strata: 60% women (`\(\bar{X}_W = 300\)`) and 40% men (`\(\bar{X}_M = 400\)`). What is `\(\bar{X}_{st}\)`? - **A)** 340 — weighted by population stratum proportions - **B)** 350 — simple average of the two stratum means - **C)** 360 — weighted by sample rather than population shares ] -- .pull-left-wide[ **Q2.** Why is the simple sample average biased when sampling is non-proportional? - **A)** Because `\(\bar{X}_j\)` is biased within each stratum - **B)** Because over-sampled strata get too much weight relative to their population share - **C)** Because stratification changes the population mean `\(\mu\)` ] --- # .red[Practice 1: Stratified sample average] .pull-left-wide[ A company of 1,000 employees is divided into three departments: | Department | `\(N_j\)` | `\(n_j\)` | `\(\bar{X}_j\)` (avg. salary, €k) | |---|---|---|---| | Sales | 500 | 10 | 40 | | IT | 300 | 6 | 50 | | Finance | 200 | 4 | 60 | 1. Calculate `\(w_j\)` for each department 2. Calculate `\(\bar{X}\)` (simple average of all observations) 3. Calculate `\(\bar{X}_{st}\)` ] --- class: inverse, middle, center # Properties of the stratified sample average --- # Properties: overview .pull-left-wide[ We evaluate `\(\bar{X}_{st}\)` against the three desired estimator properties: - **Unbiasedness** - **Efficiency** - **Consistency** ] -- .pull-left-wide[ - Throughout, we compare `\(\bar{X}_{st}\)` with the simple sample average `\(\bar{X}\)` - `\(\bar{X}\)` is the simple average of all observations - `\(\bar{X}_{st}\)` is the weighted average of stratum averages: "We take the weighted average of each of the within-group averages." 
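]

---

# In code: simple vs. stratified average

The comparison above can be sketched numerically. A minimal Python sketch with made-up numbers (not the practice data): two strata, deliberately over-sampled in a non-proportional way, so the simple average drifts while the stratified average does not.

```python
# Hypothetical illustration: stratum A holds 80% of the population
# (mean ~10), stratum B holds 20% (mean ~50), but we draw 20
# observations from each, so the sample is non-representative.
import random

random.seed(1)

w = {"A": 0.8, "B": 0.2}  # known population stratum shares w_j = N_j / N
sample = {"A": [random.gauss(10, 2) for _ in range(20)],
          "B": [random.gauss(50, 2) for _ in range(20)]}

def mean(xs):
    return sum(xs) / len(xs)

x_bar = mean(sample["A"] + sample["B"])            # simple average of all observations
x_bar_st = sum(w[j] * mean(sample[j]) for j in w)  # weighted average of stratum averages

# True mean is 0.8*10 + 0.2*50 = 18; x_bar should land near 30 (biased),
# x_bar_st near 18.
print(round(x_bar, 1), round(x_bar_st, 1))
```

.pull-left-wide[
Note that the weights `\(w_j\)` come from the population, not from the sample; that is what corrects the deliberate over-sampling of stratum B.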
] --- # Unbiasedness .pull-left-wide[ - A stratified sample is generally non-representative (if `\(n_j/n \neq N_j/N\)` for some `\(j\)`) - As a result, the **simple** sample average is a **biased** estimator of `\(\mu\)` in a stratified sample ] -- .pull-left-wide[ - The **stratified** sample average is unbiased: `$$E \left( \bar{X}_{st} \right) = \sum_{j = 1}^m \left[ E \left( \bar{X}_j \right) \cdot w_j \right] = \sum_{j = 1}^m \mu_j \cdot w_j = \mu$$` ] -- .pull-left-wide[ - Therefore: when sampling is stratified, always use `\(\bar{X}_{st}\)`, not `\(\bar{X}\)` ] --- # Efficiency: Variance .pull-left-wide[ - The variance of the stratum `\(j\)` sample average is: `$$Var \left( \bar{X}_j \right) = \frac{\sigma_j^2}{n_j}$$` where `\(\sigma_j^2\)` is the population variance of `\(X\)` within stratum `\(j\)` ] -- .pull-left-wide[ > The variance of the stratified sample average is: `$$Var \left( \bar{X}_{st} \right) = \sum_{j=1}^m \frac{\sigma_j^2}{n_j} \cdot w_j^2$$` ] -- .pull-left-wide[ - This depends on stratum sample sizes, within-stratum variances, and stratum weights ] --- # Efficiency: comparison .pull-left-wide[ - **Result 1:** If `\(X\)` and `\(Y\)` are **independent**, stratifying makes no difference — `\(Var(\bar{X}_{st}) = Var(\bar{X})\)` ] .pull-right-narrow[ .small123[ *Why result 1:* `\(\sigma_j^2 = \sigma^2\)` under independence; with proportional allocation `\(n_j = n w_j\)`: `$$\sum_j w_j^2 \frac{\sigma^2}{n w_j} = \frac{\sigma^2}{n}\sum_j w_j = \frac{\sigma^2}{n}$$` ] ] -- .pull-left-wide[ - **Result 2:** If `\(X\)` and `\(Y\)` are **correlated**, the average within-stratum variance is smaller than the overall variance - In this case stratification reduces variance: `\(\bar{X}_{st}\)` is more efficient than `\(\bar{X}\)` ] .pull-right-narrow[ .small123[ *Why result 2:* correlation `\(\Rightarrow\)` `\(\sum_j w_j \sigma_j^2 < \sigma^2\)` (law of total variance); with `\(n_j = n w_j\)`: `$$\sum_j w_j^2 \frac{\sigma_j^2}{n w_j} = \frac{1}{n}\sum_j w_j \sigma_j^2 < \frac{\sigma^2}{n}$$` ] ] -- .pull-left-wide[ **Intuition:** - Correlation 
means people *within* a stratum resemble each other more than the population at large - Stratifying forces coverage of all groups — bad-luck samples that over-represent one group are ruled out - Low within-stratum variance `\(\Rightarrow\)` precise weighted average ] --- # Consistency .pull-left-wide[ - The variance of `\(\bar{X}_{st}\)` decreases as stratum sample sizes increase - The estimator is unbiased: `\(E(\bar{X}_{st}) = \mu\)` ] .pull-right-narrow[ .small123[ *Why:* We showed `\(Var(\bar{X}_{st}) \leq \frac{\sigma^2}{n}\)`, so: `$$0 \leq Var(\bar{X}_{st}) \leq \frac{\sigma^2}{n} \xrightarrow{n\to\infty} 0$$` Variance is squeezed to zero. ] ] -- .pull-left-wide[ - Therefore, `\(\bar{X}_{st}\)` is a **consistent** estimator of `\(\mu\)` ] --- # Recap: Properties .pull-left-wide[ **Properties of `\(\bar{X}_{st}\)`:** - **Unbiased:** `\(E \left( \bar{X}_{st} \right) = \mu\)` - **Efficient:** `\(Var \left( \bar{X}_{st} \right) = \sum_{j=1}^m \dfrac{\sigma_j^2}{n_j} \cdot w_j^2\)` — beats `\(\bar{X}\)` when `\(X\)` and `\(Y\)` are correlated - **Consistent:** unbiased and variance falls with sample size ] --- # .red[Raise your hand 2: Properties]
.pull-left-wide[ **Q1.** You stratify by gender and estimate mean income. It turns out that within each gender, incomes vary just as much as in the full population. What follows? - **A)** Gender and income are uncorrelated in your data — stratifying by gender gives no efficiency gain over simple random sampling - **B)** You should use a different stratifying variable, but `\(\bar{X}_{st}\)` is still more efficient than `\(\bar{X}\)` - **C)** The sample is too small — within-stratum variance should shrink as the sample grows ] -- .pull-left-wide[ **Q2.** A researcher stratifies by region but reports `\(\bar{X}\)` (plain average) instead of `\(\bar{X}_{st}\)`. Under what condition is this a problem? - **A)** Always — `\(\bar{X}\)` is biased whenever the data come from a stratified design - **B)** Only if strata were sampled disproportionately; with proportional allocation `\(\bar{X}\)` is unbiased - **C)** Only if the strata have different variances ] --- # .red[Practice 2: Properties of] `\(\bar{X}_{st}\)` .pull-left-wide[ Use the same company data as in Practice 1. Within-stratum variances: Sales `\(\sigma^2 = 100\)`, IT `\(\sigma^2 = 64\)`, Finance `\(\sigma^2 = 36\)`. Overall population variance: `\(\sigma^2 = 90\)`. 1. Calculate `\(Var(\bar{X}_{st})\)` 2. Calculate `\(Var(\bar{X})\)` for a simple random sample of `\(n = 20\)` 3. Is `\(\bar{X}_{st}\)` more efficient here? Interpret why ] --- class: inverse, middle, center # Choice of stratum sample size --- # Choice of stratum sample size .pull-left-wide[ - The variance of `\(\bar{X}_{st}\)` depends on the stratum sample sizes `\(n_j\)` - To gain from stratification, we must choose the `\(n_j\)` wisely: specifically, to minimise `\(Var(\bar{X}_{st})\)` ] -- .pull-left-wide[ Two common strategies: 1. **Optimal allocation** — minimise variance given total `\(n\)` 2. 
**Proportional allocation** — match population shares ] --- # Optimal allocation .pull-left-wide[ > With known stratum variances `\(\sigma_j^2\)`, the variance-minimising stratum sample size is: `$$n_j^* = n \cdot \frac{\sigma_j \cdot w_j}{\sum_{j=1}^m \sigma_j \cdot w_j}$$` ] -- .pull-left-wide[ - Allocate *more* observations to strata that are: - **larger** in the population (`\(w_j\)` is high) - **more variable** in `\(X\)` (`\(\sigma_j\)` is high) ] --- # Optimal allocation vs simple sample average .pull-left-wide[ - With optimal allocation: `$$Var \left( \bar{X}_{st} \right) = \frac{\left( \sum_{j=1}^m \sigma_j \cdot w_j \right)^2}{n}$$` ] -- .pull-left-wide[ - `\(Var(\bar{X}_{st}) < Var(\bar{X})\)` when the weighted average of stratum standard deviations is smaller than the overall standard deviation - This holds whenever within-stratum variance is smaller than overall variance ] -- .pull-right-narrow[ .small123[ *Why `\(Var(\bar{X}_{st}) < Var(\bar{X})\)`?:* Need to show `\(S = \sum_j w_j \sigma_j \leq \sigma\)`. By Jensen's inequality (`\(\sqrt{\cdot}\)` is concave): `$$S = \sum_j w_j \sqrt{\sigma_j^2} \leq \sqrt{\sum_j w_j \sigma_j^2} \leq \sqrt{\sigma^2} = \sigma$$` The last step uses the law of total variance: `\(\sigma^2 = \underbrace{\sum_j w_j \sigma_j^2}_{\text{within}} + \underbrace{\sum_j w_j(\mu_j - \mu)^2}_{\geq\, 0}\)` ] ] --- # Optimal allocation in practice .pull-left-wide[ - Optimal allocation requires knowing `\(\sigma_j^2\)` — which is usually unknown ] -- .pull-left-wide[ - Practical solution: run a **pilot study** 1. Collect a small preliminary sample 2. Estimate `\(\sigma_j^2\)` from the pilot 3. 
Plug into the formula to determine `\(n_j^*\)` for the main study ] -- .red[ - Pilot studies are common in practice, but they add cost and complexity to the research process - If the pilot is too small, the `\(\sigma_j^2\)` estimates may be noisy, leading to suboptimal allocation - Often researchers therefore find proportional allocation a more attractive alternative. ] --- # Proportional allocation .pull-left-wide[ > **Proportional allocation** sets stratum sizes proportional to population shares: `$$\frac{n_j}{n} = \frac{N_j}{N}$$` ] -- .pull-left-wide[ - With proportional allocation, weighting is automatic: `$$\bar{X}_{st, prop} = \frac{1}{n} \cdot \sum_{j=1}^m \sum_{i=1}^{n_j} X_{ij} = \bar{X}$$` - .red[This is just the simple average — no weights needed] ] --- # Proportional allocation: efficiency .pull-left-wide[ - Variance under proportional allocation: `$$Var \left( \bar{X}_{st, prop} \right) = \frac{1}{n} \sum_{j=1}^m \sigma_j^2 \cdot w_j$$` ] -- .pull-left-wide[ - Since `\(\sum_j w_j \sigma_j^2 \leq \sigma^2\)` (law of total variance), proportional allocation is **more efficient** than simple random sampling - It is **less efficient** than optimal allocation — but requires no knowledge of `\(\sigma_j^2\)` ] --- # Predetermined stratification .pull-left-wide[ - In practice, we often receive data from a statistical agency, not design the sample ourselves - If they used stratified sampling, the stratum structure is given to us ] -- .pull-left-wide[ - To correctly use such a sample, we need, for each stratum: - the sample size `\(n_j\)` - the population proportion `\(w_j\)` ] -- .pull-left-wide[ - **Bias:** `\(\bar{X}_{st}\)` remains unbiased as long as we use the correct weights `\(w_j = N_j/N\)` — it does not matter who designed the sample - **Variance:** the same formula applies; we just need to know `\(n_j\)` and `\(w_j\)` ] --- # Stratification without replacement .pull-left-wide[ - We have assumed simple random sampling (with replacement) within each stratum - 
In practice, draws are often **without replacement** — observations are not independent ] -- .pull-left-wide[ - **Bias:** unaffected — `\(\bar{X}_{st}\)` remains unbiased regardless of replacement - **Variance:** the formula changes — sampling without replacement introduces a finite-population correction factor that *reduces* variance ] --- # Stratification without replacement: variances .pull-left-wide[ - Without replacement from the full population: `$$Var \left( \bar{X}_{norep} \right) = \frac{\sigma^2}{n} \cdot \frac{N}{N - 1} \cdot \left( 1 - \frac{n}{N} \right)$$` ] -- .pull-left-wide[ - For the stratified average without replacement: `$$Var \left( \bar{X}_{st, norep} \right) = \sum_{j=1}^m \left[ w_j^2 \cdot \frac{\sigma_j^2}{n_j} \cdot \frac{N_j}{N_j - 1} \cdot \left( 1 - \frac{n_j}{N_j} \right) \right]$$` - The optimal stratum size formula also changes accordingly ] --- # Recap: Choice of stratum sample size .pull-left-wide[ **Strategies for choosing `\(n_j\)`:** - **Optimal allocation:** minimise `\(Var(\bar{X}_{st})\)` — requires `\(\sigma_j^2\)` (use pilot study if unknown) - **Proportional allocation:** `\(n_j/n = N_j/N\)` — simple, avoids explicit weighting - **Predetermined stratification:** `\(n_j\)` given by the data provider **Sampling without replacement:** unbiasedness and consistency unchanged; variance formula differs ] -- .pull-left-wide[ **Why not always stratify?** - Requires a complete frame with stratum membership known *before* sampling - Efficiency gain is negligible if `\(X\)` and `\(Y\)` are weakly correlated - Optimal allocation requires `\(\sigma_j^2\)` — often unknown - Often data just happens to exist and then we use it. ] --- # .red[Raise your hand 3: Stratum sample size]
.pull-left-wide[ **Q1.** Under optimal allocation, what determines how large `\(n_j^*\)` should be? - **A)** Only the population share `\(w_j\)` — larger strata always get more observations - **B)** Only the within-stratum variance `\(\sigma_j^2\)` — more variable strata need more data - **C)** Both `\(w_j\)` and `\(\sigma_j\)` — the formula is `\(n_j^* \propto \sigma_j \cdot w_j\)` ] -- .pull-left-wide[ **Q2.** With proportional allocation, why does the stratified average simplify to a plain unweighted average? - **A)** Because all stratum variances are assumed equal under proportional allocation - **B)** Because `\(n_j/n = N_j/N\)`, so counting observations equally already gives the right weights - **C)** Because proportional allocation makes stratified sampling equivalent to SRS ] --- # .red[Practice 3: Optimal allocation] .pull-left-wide[ A population of 800 households is divided into three income strata: | Stratum | `\(N_j\)` | `\(\sigma_j\)` (income s.d., €k) | |---|---|---| | Low | 400 | 5 | | Middle | 300 | 8 | | High | 100 | 15 | Total sample size `\(n = 40\)`. 1. Calculate `\(w_j\)` for each stratum 2. Calculate `\(n_j^*\)` under optimal allocation 3. Compare to proportional allocation. Which stratum changes most, and why? ] --- class: inverse, middle, center # Cluster sampling --- # Cluster sampling .pull-left-wide[ - The cost of sampling can be substantial when elements are hard to reach: - geographical distance - costs of running separate experiments for different individuals ] -- .pull-left-wide[ - It may be cheaper to group elements into **clusters** (e.g., cities or schools) - We use simple random sampling to select some of the clusters ] -- .pull-left-wide[ - Within selected clusters, we either: - include all elements (*one-step method*), or - draw a simple random sample within each cluster (*two-step method*) ] --- # Strata vs. 
clusters .pull-left[ **Strata** - Groups are **homogeneous** internally - Purpose: reduce variance of `\(\bar{X}_{st}\)` - Sample from **all** strata - Gain: statistical efficiency ] .pull-right[ **Clusters** - Groups are **heterogeneous** — each is a mini-population - Purpose: reduce cost - Sample **some** clusters only - Gain: logistical feasibility ] -- .pull-left-wide[ - Cluster sampling is motivated by **cost**, not efficiency — but lower cost may allow a larger `\(n\)` ] --- # Cluster sample average .pull-left-wide[ - Suppose the population is divided into `\(K\)` clusters of equal size `\(\bar{N} = N/K\)` - We select `\(k\)` of the `\(K\)` clusters ] -- .pull-left-wide[ > The **cluster sample average** is: `$$\bar{X}_{cluster} = \frac{1}{k} \cdot \sum_{j=1}^k \bar{X}_j = \frac{1}{k \cdot \bar{N}} \cdot \sum_{j=1}^k \sum_{i=1}^{\bar{N}} X_{ij}$$` ] --- # .red[Raise your hand 4: Cluster sampling]
.pull-left-wide[ **Q1.** What is the key structural difference between a stratum and a cluster? - **A)** Strata are homogeneous (low within-group variance); clusters are heterogeneous (representative of the population) - **B)** Clusters are defined by a statistical criterion; strata are defined by convenience - **C)** Strata are non-overlapping subgroups; clusters can share elements ] -- .pull-left-wide[ **Q2.** Why does cluster sampling reduce costs compared to simple random sampling? - **A)** Because fewer variables are measured within each cluster - **B)** Because surveying whole clusters concentrates effort geographically or organisationally, reducing travel and contact costs - **C)** Because clusters always have lower variance than the full population, so fewer total observations are needed ] --- # .red[Practice 4: Cluster sample average] .pull-left-wide[ A country has `\(K = 50\)` schools of equal size (`\(\bar{N} = 30\)` students each). A researcher selects `\(k = 5\)` schools at random and records the test score of every student. Sample means per school: 72, 68, 75, 70, 65. 1. Calculate `\(\bar{X}_{cluster}\)` 2. Is `\(\bar{X}_{cluster}\)` an unbiased estimator of the population mean `\(\mu\)`? Why or why not? 3. How would you expect `\(Var(\bar{X}_{cluster})\)` to compare to `\(Var(\bar{X})\)` for a simple random sample of the same total `\(n\)`? ] --- # Key takeaways .pull-left-wide[ 1. **Stratification** — exploit known population structure to increase precision - Stratified sample average: `\(\bar{X}_{st} = \sum_{j=1}^m \bar{X}_j \cdot w_j\)` - Properties: unbiased, efficient (when `\(X \perp\!\!\!\not\perp Y\)`), consistent ] -- .pull-left-wide[ 2. **Stratum sample size** - Optimal allocation: `\(n_j^* \propto \sigma_j \cdot w_j\)` - Proportional allocation: `\(n_j/n = N_j/N\)` - Sampling without replacement changes the variance formula but not unbiasedness or consistency ] -- .pull-left-wide[ 3. 
**Cluster sampling** — group elements to reduce cost - Clusters are heterogeneous (unlike homogeneous strata) - Cluster average: `\(\bar{X}_{cluster} = \frac{1}{k} \sum_{j=1}^k \bar{X}_j\)` - Cost advantage comes with a variance penalty (intra-cluster correlation) ] --- # Before next time .pull-left[ - Read the assigned reading - Next time: Regression analysis `\(\rightarrow\)` Chapter 12 ] .pull-right[  ]
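
---

# Appendix: allocation rules in code

The allocation formulas from today can be checked numerically. A minimal Python sketch with made-up strata (not the practice data), comparing optimal and proportional allocation:

```python
# Hypothetical strata: (population share w_j, within-stratum s.d. sigma_j).
n = 100
strata = {
    "low":    (0.5,  4.0),
    "middle": (0.3,  9.0),
    "high":   (0.2, 20.0),
}

# Optimal allocation: n_j* = n * sigma_j * w_j / sum of sigma * w over strata
denom = sum(w * s for w, s in strata.values())
optimal = {j: n * w * s / denom for j, (w, s) in strata.items()}

# Proportional allocation: n_j = n * w_j
proportional = {j: n * w for j, (w, _) in strata.items()}

def var_st(alloc):
    """Variance of the stratified average: sum of w_j^2 * sigma_j^2 / n_j."""
    return sum(w ** 2 * s ** 2 / alloc[j] for j, (w, s) in strata.items())

for j in strata:
    print(f"{j:>6}: optimal {optimal[j]:5.1f}, proportional {proportional[j]:5.1f}")
print(f"Var optimal {var_st(optimal):.4f} vs Var proportional {var_st(proportional):.4f}")
```

Under these assumed numbers the high-variance stratum is sampled far beyond its population share (about 46 of 100 draws, versus 20 under proportional allocation), and the variance under optimal allocation matches the closed form `\(\left( \sum_j \sigma_j w_j \right)^2 / n\)` from the slides.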