class: center, inverse, middle <style type="text/css"> .pull-left { float: left; width: 44%; } .pull-right { float: right; width: 44%; } .pull-right ~ p { clear: both; } .pull-left-wide { float: left; width: 66%; } .pull-right-wide { float: right; width: 66%; } .pull-right-wide ~ p { clear: both; } .pull-left-narrow { float: left; width: 30%; } .pull-right-narrow { float: right; width: 30%; } .tiny123 { font-size: 0.40em; } .small123 { font-size: 0.80em; } .large123 { font-size: 2em; } .red { color: red } .orange { color: orange } .green { color: green } </style> # Statistics ## Estimating the mean value using stratified sampling and cluster sampling ### (Chapter 11) ### Christian Vedel,<br>Department of Economics<br>University of Southern Denmark ### Email: [christian-vs@sam.sdu.dk](mailto:christian-vs@sam.sdu.dk) ### Updated 2026-04-16 --- class: middle # Today's lecture .pull-left-wide[ **Sampling designs that exploit prior knowledge of the population** - **Section 1:** Stratification and the stratified sample average - **Section 2:** Properties of the stratified sample average - **Section 3:** Choice of stratum sample size - **Section 4:** Cluster sampling ] .pull-right-narrow[  ] --- class: inverse, middle, center # Stratification --- # Stratification .pull-left-wide[ - We may often know more about a population than just the characteristic of interest - This knowledge can potentially be used to select a sample that gives us more information than a simple random sample ] -- .pull-left-wide[ - One such method is **stratification**, which relies on prior knowledge of the distribution of some characteristic in the population ] --- # Strata .pull-left-wide[ - Suppose each element in the population has two characteristics: - the characteristic of interest `\(X\)` (e.g., income) - a second characteristic `\(Y\)`, with a known distribution (e.g., gender) ] -- .pull-left-wide[ - A **stratum** (plural: **strata**) consists of all elements in the population with the same 
value of `\(Y\)` - If `\(Y\)` can take `\(m\)` different values, there will be `\(m\)` strata ] -- .pull-left-wide[ - **Stratification** divides the population into strata based on `\(Y\)`, then draws a simple random sample within each stratum ] --- # Law of iterated expectations .pull-left-wide[ - When stratifying, we want to recover the population mean of `\(X\)` from the stratum means ] -- .pull-left-wide[ - The **law of iterated expectations** gives us: `$$E_X(X) = E_Y \left[ E_{X|Y}(X|Y) \right] = \sum_{j = 1}^m E_{X|Y}(X | Y = y_j) \cdot f_Y(y_j)$$` ] --- # Stratified population mean .pull-left-wide[ - The population mean of `\(X\)` is a weighted average of stratum means, with weights `\(w_j = f_Y(y_j) = N_j / N\)`: `$$\mu = E(X) = \sum_{j=1}^m \mu_j \cdot w_j$$` where `\(N_j\)` is the number of elements in stratum `\(j\)` and `\(N\)` is the total population size ] -- .pull-left-wide[ - In the gender example, the weights are the population fractions of men and women ] --- # Stratum sample average .pull-left-wide[ - By the analogy principle: replace unknown population quantities with sample counterparts - The weights `\(w_j\)` are known from the population; the only unknowns are `\(\mu_j\)` ] -- .pull-left-wide[ - Within each stratum we have a simple random sample, so: `$$\bar{X}_j = \frac{1}{n_j} \cdot \sum_{i = 1}^{n_j} X_{ij}$$` where `\(n_j\)` is the stratum sample size and `\(X_{ij}\)` is the `\(i\)`-th element in stratum `\(j\)` ] --- # Stratified sample average .pull-left-wide[ > The **stratified sample average** is the weighted sum of stratum sample averages: `$$\bar{X}_{st} = \sum_{j = 1}^m \bar{X}_j \cdot w_j$$` ] -- .pull-left-wide[ - `\(\bar{X}_{st}\)` is a weighted average of the `\(m\)` stratum sample averages - The weights `\(w_j = N_j / N\)` are the known population stratum proportions ] --- # Recap: Stratification .pull-left-wide[ **Stratification** - *Why?* To increase precision by exploiting known population structure - *When?* When we have 
prior knowledge of the distribution of some characteristic `\(Y\)` **Stratified sample average** `$$\bar{X}_{st} = \sum_{j = 1}^m \bar{X}_j \cdot w_j$$` i.e., the weighted sum of stratum sample averages ] --- # .red[Raise your hand 1: Stratification]
.pull-left-wide[ **Q1.** A population is split into two strata: 60% women (`\(\bar{X}_W = 300\)`) and 40% men (`\(\bar{X}_M = 400\)`). What is `\(\bar{X}_{st}\)`? - **A)** 340 — weighted by population stratum proportions - **B)** 350 — simple average of the two stratum means - **C)** 360 — weighted by sample rather than population shares ] -- .pull-left-wide[ **Q2.** Why is the simple sample average biased when sampling is non-proportional? - **A)** Because `\(\bar{X}_j\)` is biased within each stratum - **B)** Because over-sampled strata get too much weight relative to their population share - **C)** Because stratification changes the population mean `\(\mu\)` ] --- # .red[Practice 1: Stratified sample average] .pull-left-wide[ A company of 1,000 employees is divided into three departments: | Department | `\(N_j\)` | `\(n_j\)` | `\(\bar{X}_j\)` (avg. salary, €k) | |---|---|---|---| | Sales | 500 | 10 | 40 | | IT | 300 | 6 | 50 | | Finance | 200 | 4 | 60 | 1. Calculate `\(w_j\)` for each department 2. Calculate `\(\bar{X}\)` (simple average of all observations) 3. Calculate `\(\bar{X}_{st}\)` ] --- class: inverse, middle, center # Properties of the stratified sample average --- # Properties: overview .pull-left-wide[ We evaluate `\(\bar{X}_{st}\)` against the three desired estimator properties: - **Unbiasedness** - **Efficiency** - **Consistency** ] -- .pull-left-wide[ - Throughout, we compare `\(\bar{X}_{st}\)` with the simple sample average `\(\bar{X}\)` - `\(\bar{X}\)` is the simple average of all observations - `\(\bar{X}_{st}\)` is the weighted average of stratum averages: "We take the weighted average of each of the within-group averages." 
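]

---

# In code: simple vs. stratified average

The comparison above can be sketched numerically. A minimal Python sketch with made-up numbers (not the practice data): two strata, deliberately over-sampled in a non-proportional way, so the simple average drifts while the stratified average does not.

```python
# Hypothetical illustration: stratum A holds 80% of the population
# (mean ~10), stratum B holds 20% (mean ~50), but we draw 20
# observations from each, so the sample is non-representative.
import random

random.seed(1)

w = {"A": 0.8, "B": 0.2}  # known population stratum shares w_j = N_j / N
sample = {"A": [random.gauss(10, 2) for _ in range(20)],
          "B": [random.gauss(50, 2) for _ in range(20)]}

def mean(xs):
    return sum(xs) / len(xs)

x_bar = mean(sample["A"] + sample["B"])            # simple average of all observations
x_bar_st = sum(w[j] * mean(sample[j]) for j in w)  # weighted average of stratum averages

# True mean is 0.8*10 + 0.2*50 = 18; x_bar should land near 30 (biased),
# x_bar_st near 18.
print(round(x_bar, 1), round(x_bar_st, 1))
```

.pull-left-wide[
Note that the weights `\(w_j\)` come from the population, not from the sample; that is what corrects the deliberate over-sampling of stratum B.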
] --- # Unbiasedness .pull-left-wide[ - A stratified sample is generally non-representative (if `\(n_j/n \neq N_j/N\)` for some `\(j\)`) - As a result, the **simple** sample average is a **biased** estimator of `\(\mu\)` in a stratified sample ] -- .pull-left-wide[ - The **stratified** sample average is unbiased: `$$E \left( \bar{X}_{st} \right) = \sum_{j = 1}^m \left[ E \left( \bar{X}_j \right) \cdot w_j \right] = \sum_{j = 1}^m \mu_j \cdot w_j = \mu$$` ] -- .pull-left-wide[ - Therefore: when sampling is stratified, always use `\(\bar{X}_{st}\)`, not `\(\bar{X}\)` ] --- # Efficiency: Variance .pull-left-wide[ - The variance of the stratum `\(j\)` sample average is: `$$Var \left( \bar{X}_j \right) = \frac{\sigma_j^2}{n_j}$$` where `\(\sigma_j^2\)` is the population variance of `\(X\)` within stratum `\(j\)` ] -- .pull-left-wide[ > The variance of the stratified sample average is: `$$Var \left( \bar{X}_{st} \right) = \sum_{j=1}^m \frac{\sigma_j^2}{n_j} \cdot w_j^2$$` ] -- .pull-left-wide[ - This depends on stratum sample sizes, within-stratum variances, and stratum weights ] --- # Efficiency: comparison .pull-left-wide[ - **Result 1:** If `\(X\)` and `\(Y\)` are **independent**, stratifying makes no difference — `\(Var(\bar{X}_{st}) = Var(\bar{X})\)` ] .pull-right-narrow[ .small123[ *Why result 1:* `\(\sigma_j^2 = \sigma^2\)` under independence; with proportional allocation `\(n_j = n w_j\)`: `$$\sum_j w_j^2 \frac{\sigma^2}{n w_j} = \frac{\sigma^2}{n}\sum_j w_j = \frac{\sigma^2}{n}$$` ] ] -- .pull-left-wide[ - **Result 2:** If `\(X\)` and `\(Y\)` are **correlated**, the average within-stratum variance is smaller than the overall variance - In this case stratification reduces variance: `\(\bar{X}_{st}\)` is more efficient than `\(\bar{X}\)` ] .pull-right-narrow[ .small123[ *Why result 2:* correlation `\(\Rightarrow\)` `\(\sum_j w_j \sigma_j^2 < \sigma^2\)` (law of total variance); with `\(n_j = n w_j\)`: `$$\sum_j w_j^2 \frac{\sigma_j^2}{n w_j} = \frac{1}{n}\sum_j w_j \sigma_j^2 < \frac{\sigma^2}{n}$$` ] ] -- .pull-left-wide[ **Intuition:** - Correlation 
means people *within* a stratum resemble each other more than the population at large - Stratifying forces coverage of all groups — bad-luck samples that over-represent one group are ruled out - Low within-stratum variance `\(\Rightarrow\)` precise weighted average ] --- # Consistency .pull-left-wide[ - The variance of `\(\bar{X}_{st}\)` decreases as stratum sample sizes increase - The estimator is unbiased: `\(E(\bar{X}_{st}) = \mu\)` ] .pull-right-narrow[ .small123[ *Why:* We showed `\(Var(\bar{X}_{st}) \leq \frac{\sigma^2}{n}\)`, so: `$$0 \leq Var(\bar{X}_{st}) \leq \frac{\sigma^2}{n} \xrightarrow{n\to\infty} 0$$` Variance is squeezed to zero. ] ] -- .pull-left-wide[ - Therefore, `\(\bar{X}_{st}\)` is a **consistent** estimator of `\(\mu\)` ] --- # Recap: Properties .pull-left-wide[ **Properties of `\(\bar{X}_{st}\)`:** - **Unbiased:** `\(E \left( \bar{X}_{st} \right) = \mu\)` - **Efficient:** `\(Var \left( \bar{X}_{st} \right) = \sum_{j=1}^m \dfrac{\sigma_j^2}{n_j} \cdot w_j^2\)` — beats `\(\bar{X}\)` when `\(X\)` and `\(Y\)` are correlated - **Consistent:** unbiased and variance falls with sample size ] --- # .red[Raise your hand 2: Properties]
.pull-left-wide[ **Q1.** You stratify by gender and estimate mean income. It turns out that within each gender, incomes vary just as much as in the full population. What follows? - **A)** Gender and income are uncorrelated in your data — stratifying by gender gives no efficiency gain over simple random sampling - **B)** You should use a different stratifying variable, but `\(\bar{X}_{st}\)` is still more efficient than `\(\bar{X}\)` - **C)** The sample is too small — within-stratum variance should shrink as the sample grows ] -- .pull-left-wide[ **Q2.** A researcher stratifies by region but reports `\(\bar{X}\)` (plain average) instead of `\(\bar{X}_{st}\)`. Under what condition is this a problem? - **A)** Always — `\(\bar{X}\)` is biased whenever the data come from a stratified design - **B)** Only if strata were sampled disproportionately; with proportional allocation `\(\bar{X}\)` is unbiased - **C)** Only if the strata have different variances ] --- # .red[Practice 2: Properties of] `\(\bar{X}_{st}\)` .pull-left-wide[ Use the same company data as in Practice 1. Within-stratum variances: Sales `\(\sigma^2 = 100\)`, IT `\(\sigma^2 = 64\)`, Finance `\(\sigma^2 = 36\)`. Overall population variance: `\(\sigma^2 = 90\)`. 1. Calculate `\(Var(\bar{X}_{st})\)` 2. Calculate `\(Var(\bar{X})\)` for a simple random sample of `\(n = 20\)` 3. Is `\(\bar{X}_{st}\)` more efficient here? Interpret why ] --- class: inverse, middle, center # Choice of stratum sample size --- # Choice of stratum sample size .pull-left-wide[ - The variance of `\(\bar{X}_{st}\)` depends on the stratum sample sizes `\(n_j\)` - To gain from stratification, we must choose the `\(n_j\)` wisely: specifically, to minimise `\(Var(\bar{X}_{st})\)` ] -- .pull-left-wide[ Two common strategies: 1. **Optimal allocation** — minimise variance given total `\(n\)` 2. 
**Proportional allocation** — match population shares ] --- # Optimal allocation .pull-left-wide[ > With known stratum variances `\(\sigma_j^2\)`, the variance-minimising stratum sample size is: `$$n_j^* = n \cdot \frac{\sigma_j \cdot w_j}{\sum_{j=1}^m \sigma_j \cdot w_j}$$` ] -- .pull-left-wide[ - Allocate *more* observations to strata that are: - **larger** in the population (`\(w_j\)` is high) - **more variable** in `\(X\)` (`\(\sigma_j\)` is high) ] --- # Optimal allocation vs simple sample average .pull-left-wide[ - With optimal allocation: `$$Var \left( \bar{X}_{st} \right) = \frac{\left( \sum_{j=1}^m \sigma_j \cdot w_j \right)^2}{n}$$` ] -- .pull-left-wide[ - `\(Var(\bar{X}_{st}) < Var(\bar{X})\)` when the weighted average of stratum standard deviations is smaller than the overall standard deviation - This holds whenever within-stratum variance is smaller than overall variance ] -- .pull-right-narrow[ .small123[ *Why `\(Var(\bar{X}_{st}) < Var(\bar{X})\)`?:* Need to show `\(S = \sum_j w_j \sigma_j \leq \sigma\)`. By Jensen's inequality (`\(\sqrt{\cdot}\)` is concave): `$$S = \sum_j w_j \sqrt{\sigma_j^2} \leq \sqrt{\sum_j w_j \sigma_j^2} \leq \sqrt{\sigma^2} = \sigma$$` The last step uses the law of total variance: `\(\sigma^2 = \underbrace{\sum_j w_j \sigma_j^2}_{\text{within}} + \underbrace{\sum_j w_j(\mu_j - \mu)^2}_{\geq\, 0}\)` ] ] --- # Optimal allocation in practice .pull-left-wide[ - Optimal allocation requires knowing `\(\sigma_j^2\)` — which is usually unknown ] -- .pull-left-wide[ - Practical solution: run a **pilot study** 1. Collect a small preliminary sample 2. Estimate `\(\sigma_j^2\)` from the pilot 3. 
Plug into the formula to determine `\(n_j^*\)` for the main study ] -- .red[ - Pilot studies are common in practice, but they add cost and complexity to the research process - If the pilot is too small, the `\(\sigma_j^2\)` estimates may be noisy, leading to suboptimal allocation - Often researchers therefore find proportional allocation a more attractive alternative. ] --- # Proportional allocation .pull-left-wide[ > **Proportional allocation** sets stratum sizes proportional to population shares: `$$\frac{n_j}{n} = \frac{N_j}{N}$$` ] -- .pull-left-wide[ - With proportional allocation, weighting is automatic: `$$\bar{X}_{st, prop} = \frac{1}{n} \cdot \sum_{j=1}^m \sum_{i=1}^{n_j} X_{ij} = \bar{X}$$` - .red[This is just the simple average — no weights needed] ] --- # Proportional allocation: efficiency .pull-left-wide[ - Variance under proportional allocation: `$$Var \left( \bar{X}_{st, prop} \right) = \frac{1}{n} \sum_{j=1}^m \sigma_j^2 \cdot w_j$$` ] -- .pull-left-wide[ - Since `\(\sum_j w_j \sigma_j^2 \leq \sigma^2\)` (law of total variance), proportional allocation is **more efficient** than simple random sampling - It is **less efficient** than optimal allocation — but requires no knowledge of `\(\sigma_j^2\)` ] --- # Predetermined stratification .pull-left-wide[ - In practice, we often receive data from a statistical agency, not design the sample ourselves - If they used stratified sampling, the stratum structure is given to us ] -- .pull-left-wide[ - To correctly use such a sample, we need, for each stratum: - the sample size `\(n_j\)` - the population proportion `\(w_j\)` ] -- .pull-left-wide[ - **Bias:** `\(\bar{X}_{st}\)` remains unbiased as long as we use the correct weights `\(w_j = N_j/N\)` — it does not matter who designed the sample - **Variance:** the same formula applies; we just need to know `\(n_j\)` and `\(w_j\)` ] --- # Stratification without replacement .pull-left-wide[ - We have assumed simple random sampling (with replacement) within each stratum - 
In practice, draws are often **without replacement** — observations are not independent ] -- .pull-left-wide[ - **Bias:** unaffected — `\(\bar{X}_{st}\)` remains unbiased regardless of replacement - **Variance:** the formula changes — sampling without replacement introduces a finite-population correction factor that *reduces* variance ] --- # Stratification without replacement: variances .pull-left-wide[ - Without replacement from the full population: `$$Var \left( \bar{X}_{norep} \right) = \frac{\sigma^2}{n} \cdot \frac{N}{N - 1} \cdot \left( 1 - \frac{n}{N} \right)$$` ] -- .pull-left-wide[ - For the stratified average without replacement: `$$Var \left( \bar{X}_{st, norep} \right) = \sum_{j=1}^m \left[ w_j^2 \cdot \frac{\sigma_j^2}{n_j} \cdot \frac{N_j}{N_j - 1} \cdot \left( 1 - \frac{n_j}{N_j} \right) \right]$$` - The optimal stratum size formula also changes accordingly ] --- # Recap: Choice of stratum sample size .pull-left-wide[ **Strategies for choosing `\(n_j\)`:** - **Optimal allocation:** minimise `\(Var(\bar{X}_{st})\)` — requires `\(\sigma_j^2\)` (use pilot study if unknown) - **Proportional allocation:** `\(n_j/n = N_j/N\)` — simple, avoids explicit weighting - **Predetermined stratification:** `\(n_j\)` given by the data provider **Sampling without replacement:** unbiasedness and consistency unchanged; variance formula differs ] -- .pull-left-wide[ **Why not always stratify?** - Requires a complete frame with stratum membership known *before* sampling - Efficiency gain is negligible if `\(X\)` and `\(Y\)` are weakly correlated - Optimal allocation requires `\(\sigma_j^2\)` — often unknown - Often data just happens to exist and then we use it. ] --- # .red[Raise your hand 3: Stratum sample size]
.pull-left-wide[ **Q1.** Under optimal allocation, what determines how large `\(n_j^*\)` should be? - **A)** Only the population share `\(w_j\)` — larger strata always get more observations - **B)** Only the within-stratum variance `\(\sigma_j^2\)` — more variable strata need more data - **C)** Both `\(w_j\)` and `\(\sigma_j\)` — the formula is `\(n_j^* \propto \sigma_j \cdot w_j\)` ] -- .pull-left-wide[ **Q2.** With proportional allocation, why does the stratified average simplify to a plain unweighted average? - **A)** Because all stratum variances are assumed equal under proportional allocation - **B)** Because `\(n_j/n = N_j/N\)`, so counting observations equally already gives the right weights - **C)** Because proportional allocation makes stratified sampling equivalent to SRS ] --- # .red[Practice 3: Optimal allocation] .pull-left-wide[ A population of 800 households is divided into three income strata: | Stratum | `\(N_j\)` | `\(\sigma_j\)` (income s.d., €k) | |---|---|---| | Low | 400 | 5 | | Middle | 300 | 8 | | High | 100 | 15 | Total sample size `\(n = 40\)`. 1. Calculate `\(w_j\)` for each stratum 2. Calculate `\(n_j^*\)` under optimal allocation 3. Compare to proportional allocation. Which stratum changes most, and why? ] --- class: inverse, middle, center # Cluster sampling --- # Cluster sampling .pull-left-wide[ - The cost of sampling can be substantial when elements are hard to reach: - geographical distance - costs of running separate experiments for different individuals ] -- .pull-left-wide[ - It may be cheaper to group elements into **clusters** (e.g., cities or schools) - We use simple random sampling to select some of the clusters ] -- .pull-left-wide[ - Within selected clusters, we either: - include all elements (*one-step method*), or - draw a simple random sample within each cluster (*two-step method*) ] --- # Strata vs. 
clusters .pull-left[ **Strata** - Groups are **homogeneous** internally - Purpose: reduce variance of `\(\bar{X}_{st}\)` - Sample from **all** strata - Gain: statistical efficiency ] .pull-right[ **Clusters** - Groups are **heterogeneous** — each is a mini-population - Purpose: reduce cost - Sample **some** clusters only - Gain: logistical feasibility ] -- .pull-left-wide[ - Cluster sampling is motivated by **cost**, not efficiency — but lower cost may allow a larger `\(n\)` ] --- # Cluster sample average .pull-left-wide[ - Suppose the population is divided into `\(K\)` clusters of equal size `\(\bar{N} = N/K\)` - We select `\(k\)` of the `\(K\)` clusters ] -- .pull-left-wide[ > The **cluster sample average** is: `$$\bar{X}_{cluster} = \frac{1}{k} \cdot \sum_{j=1}^k \bar{X}_j = \frac{1}{k \cdot \bar{N}} \cdot \sum_{j=1}^k \sum_{i=1}^{\bar{N}} X_{ij}$$` ] --- # .red[Raise your hand 4: Cluster sampling]
.pull-left-wide[ **Q1.** What is the key structural difference between a stratum and a cluster? - **A)** Strata are homogeneous (low within-group variance); clusters are heterogeneous (representative of the population) - **B)** Clusters are defined by a statistical criterion; strata are defined by convenience - **C)** Strata are non-overlapping subgroups; clusters can share elements ] -- .pull-left-wide[ **Q2.** Why does cluster sampling reduce costs compared to simple random sampling? - **A)** Because fewer variables are measured within each cluster - **B)** Because surveying whole clusters concentrates effort geographically or organisationally, reducing travel and contact costs - **C)** Because clusters always have lower variance than the full population, so fewer total observations are needed ] --- # .red[Practice 4: Cluster sample average] .pull-left-wide[ A country has `\(K = 50\)` schools of equal size (`\(\bar{N} = 30\)` students each). A researcher selects `\(k = 5\)` schools at random and records the test score of every student. Sample means per school: 72, 68, 75, 70, 65. 1. Calculate `\(\bar{X}_{cluster}\)` 2. Is `\(\bar{X}_{cluster}\)` an unbiased estimator of the population mean `\(\mu\)`? Why or why not? 3. How would you expect `\(Var(\bar{X}_{cluster})\)` to compare to `\(Var(\bar{X})\)` for a simple random sample of the same total `\(n\)`? ] --- # Key takeaways .pull-left-wide[ 1. **Stratification** — exploit known population structure to increase precision - Stratified sample average: `\(\bar{X}_{st} = \sum_{j=1}^m \bar{X}_j \cdot w_j\)` - Properties: unbiased, efficient (when `\(X \perp\!\!\!\not\perp Y\)`), consistent ] -- .pull-left-wide[ 2. **Stratum sample size** - Optimal allocation: `\(n_j^* \propto \sigma_j \cdot w_j\)` - Proportional allocation: `\(n_j/n = N_j/N\)` - Sampling without replacement changes the variance formula but not unbiasedness or consistency ] -- .pull-left-wide[ 3. 
**Cluster sampling** — group elements to reduce cost - Clusters are heterogeneous (unlike homogeneous strata) - Cluster average: `\(\bar{X}_{cluster} = \frac{1}{k} \sum_{j=1}^k \bar{X}_j\)` - Cost advantage comes with a variance penalty (intra-cluster correlation) ] --- # Before next time .pull-left[ - Read the assigned reading - Next time: Regression analysis `\(\rightarrow\)` Chapter 12 ] .pull-right[  ]
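
---

# Appendix: allocation rules in code

The allocation formulas from today can be checked numerically. A minimal Python sketch with made-up strata (not the practice data), comparing optimal and proportional allocation:

```python
# Hypothetical strata: (population share w_j, within-stratum s.d. sigma_j).
n = 100
strata = {
    "low":    (0.5,  4.0),
    "middle": (0.3,  9.0),
    "high":   (0.2, 20.0),
}

# Optimal allocation: n_j* = n * sigma_j * w_j / sum of sigma * w over strata
denom = sum(w * s for w, s in strata.values())
optimal = {j: n * w * s / denom for j, (w, s) in strata.items()}

# Proportional allocation: n_j = n * w_j
proportional = {j: n * w for j, (w, _) in strata.items()}

def var_st(alloc):
    """Variance of the stratified average: sum of w_j^2 * sigma_j^2 / n_j."""
    return sum(w ** 2 * s ** 2 / alloc[j] for j, (w, s) in strata.items())

for j in strata:
    print(f"{j:>6}: optimal {optimal[j]:5.1f}, proportional {proportional[j]:5.1f}")
print(f"Var optimal {var_st(optimal):.4f} vs Var proportional {var_st(proportional):.4f}")
```

Under these assumed numbers the high-variance stratum is sampled far beyond its population share (about 46 of 100 draws, versus 20 under proportional allocation), and the variance under optimal allocation matches the closed form `\(\left( \sum_j \sigma_j w_j \right)^2 / n\)` from the slides.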