class: center, inverse, middle <style type="text/css"> .pull-left { float: left; width: 44%; } .pull-right { float: right; width: 44%; } .pull-right ~ p { clear: both; } .pull-left-wide { float: left; width: 66%; } .pull-right-wide { float: right; width: 66%; } .pull-right-wide ~ p { clear: both; } .pull-left-narrow { float: left; width: 30%; } .pull-right-narrow { float: right; width: 30%; } .tiny123 { font-size: 0.40em; } .small123 { font-size: 0.80em; } .large123 { font-size: 2em; } .red { color: red } .orange { color: orange } .green { color: green } </style> # Statistics ## The sampling process ### (Chapter 9) ### Christian Vedel,<br>Department of Economics<br>University of Southern Denmark ### Email: [christian-vs@sam.sdu.dk](mailto:christian-vs@sam.sdu.dk) ### Updated 2026-04-09 --- class: middle # Today's lecture .pull-left-wide[ **How samples are drawn from populations — and what can go wrong** - **Section 1:** Deductive vs. inductive reasoning — and why sampling is the foundation of all inference - **Section 2:** Designing the sampling process - **Section 3:** Formal model for the sample - **Section 4:** Representative samples - **Section 5:** Sources of error and observation errors ] .pull-right-narrow[  ] ??? 10:16 --- # What this lecture unlocks .pull-left-wide[ Everything from Chapter 10 onwards rests on one question: **how was the sample drawn?** - **Ch. 10 — Estimating the mean:** The sample average `\(\bar{X}\)` is an unbiased estimator of `\(\mu\)` and has variance `\(\sigma^2/n\)` *only if* the sample is a representative draw where `\(X_1, \ldots, X_n\)` are i.i.d. - **Ch. 11 — Stratified estimation:** Uses the selection mechanism `\(f_{Z_i}(z)\)` explicitly to correct for known non-representativeness. *Also the basis for IPW estimation in Causal Inference.* - **Ch. 13 — Confidence intervals:** The stated coverage (e.g. 95%) is only correct when the sampling assumptions hold. - **Ch. 
14 — Hypothesis tests:** A test at size `\(\alpha\)` is only valid for the population actually sampled — not for the target population if the two differ. ] .pull-right-narrow[ > A non-representative sample means your formulas are correct but your answers are wrong — silently. ] ??? 10:18 --- # The minimal requirement for all inference .pull-left-wide[ - Say we want to estimate the population mean `\(\mu\)` - The natural estimator is the sample average: `\(\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i\)` - But when does `\(\bar{X}\)` give us a good estimate of `\(\mu\)`? ] -- .pull-left-wide[ - We would want `\(\bar{X}\)` to be **unbiased**: `\(E(\bar{X}) = \mu\)` - One solution: Assume `\(E(X_i) = \mu\)` for every unit `\(i\)` ] -- .pull-left-wide[ - That holds if and only if `\(X_i\)` has the same distribution as `\(X\)` in the population: `$$f_{X_i}(x) = f_X(x) \quad \text{for all } i \quad \Leftarrow \textbf{ representative sample}$$` - Then: `$$E(\bar{X}) = \frac{1}{n}\sum_{i=1}^n E(X_i) = \frac{1}{n} \cdot n \cdot \mu = \mu \checkmark$$` ] ??? 10:22 --- class: inverse, middle, center # Deductive and inductive reasoning --- # Deductive vs. inductive reasoning .pull-left[ ### Deductive reasoning - Start from a **known population** - Derive what samples look like `$$\text{Population} \rightarrow \text{Sample}$$` .red[*This is the theory we have now learned.*] ] -- .pull-right[ ### Inductive reasoning - Start from an **observed sample** - Learn about the population `$$\text{Sample} \rightarrow \text{Population}$$` .red[*This is the task we are faced with in practice.*] ] ??? 
10:24 --- # Deductive reasoning .pull-left-wide[ - Up until now, we assumed that random variables represent populations with **known distributions** - The distribution of the random variable was the same as the distribution of the population - This allowed us to calculate moments (mean, variance) and probabilities ] -- .pull-left-wide[ - This type of analysis — moving from general (population) to specific (sample) — is called **deductive reasoning** ] ??? 10:25 --- # Inductive reasoning .pull-left-wide[ - We will now reverse the process - We try to learn something about the population from a sample - This type of analysis — moving from specific (sample) to general (population) — is called **inductive reasoning** ] -- .pull-left-wide[ - The knowledge we acquire about the population allows us to make **predictions**: statements about what future samples would look like - Note that predictions themselves are based on deductive reasoning ] ??? 10:26 --- # Statistical inference .pull-left-wide[ - The main goal of **statistical inference** is to obtain knowledge about an unknown distribution (e.g., the distribution of income) ] -- .pull-left-wide[ - We can never observe the full population — we have **limited information** - Less data `\(\Rightarrow\)` less information `\(\Rightarrow\)` more uncertainty - Goal: use available information **efficiently** (low variance) and **accurately** (low bias) - A bad sample wastes or corrupts the information we do have ] ??? 10:28 --- # .red[Raise your hand 1: Reasoning and inference]
*(Timer: 00:30)*
.pull-left-wide[ **Q1.** You know that Danish mean income is 450,000 kr. You pick one Dane at random and expect their income to be around 450,000 kr. This is: - **A)** Deductive — applying a known population fact to an individual case - **B)** Inductive — any claim about an unknown individual generalises from data - **C)** Neither — "expecting" something is not reasoning, it's just a guess ] -- .pull-left-wide[ **Q2.** We want to estimate mean income in Denmark. We select *one* person completely at random. Is this a good estimate of the mean? - **A)** Yes - **B)** No ] ??? 10:32 --- class: inverse, middle, center # Designing the sampling process --- # Designing the sampling process .pull-left-wide[ - There are four considerations when designing a sampling process — we will go through each in turn: #### Consideration 1: `\(\rightarrow\)` Choice of **population** and characteristics of interest #### Consideration 2: `\(\rightarrow\)` Specification of **sampling units** and **sampling frame** #### Consideration 3: `\(\rightarrow\)` Specification of the **sampling method** #### Consideration 4: `\(\rightarrow\)` **Measurement** of characteristics of interest ] ??? 10:33 --- class: inverse, middle, center # Consideration 1: Population and characteristics of interest --- # Choice of population .pull-left-wide[ - What you want to know determines the population — but this is harder in practice ] -- .pull-left-wide[ - **Ethics:** harmful substances can't be tested on humans `\(\Rightarrow\)` use lab substitutes - **Cost:** can't sample everyone `\(\Rightarrow\)` use a subpopulation and generalise ] ??? 10:35 --- # Population vs target population .pull-left-wide[ - **Target population:** the population you care about - **Study population:** what you actually sample from — may differ ] -- .pull-left-wide[ - Key question: how well does the study population mimic the target? - Lab cells `\(\approx\)` human cells? - One city `\(\approx\)` the whole country? ] ??? 
10:37 --- # Characteristics of interest .pull-left-wide[ - Measuring the right thing is non-trivial - School quality `\(\rightarrow\)` test scores — But scores reflect prior skills too ] -- .pull-left-wide[ - What even are "skills"? IQ? Test-taking ability? Concentration? ] -- .pull-left-wide[ - Statistics can't answer this — requires **domain knowledge** ] ??? 10:39 --- class: inverse, middle, center # Consideration 2: Sampling units and sampling frame --- # Sampling units .pull-left-wide[ - The **sampling units** are the subset of elements from the population that we extract ] -- .pull-left-wide[ - Usually individual people — but not always - E.g., studying student performance: the sampling unit is a **school**, but the characteristic of interest is measured at the **student** level - Grouping units (schools, households) is often cheaper ] ??? 10:41 --- # Sampling frame .pull-left-wide[ - A **sampling frame** lists all sampling units and how to reach them - Can be precise or vague ] -- .pull-left-wide[ - Precise: the Danish CPR register — every resident has a unique number, so each unit can be contacted directly - Vague: intercepting people at a shopping mall — the "frame" is just whoever happens to be present ] ??? 10:43 --- # Sampling characteristic .pull-left-wide[ - The contact detail used to reach a unit is its **sampling characteristic** (e.g., CPR number, address, presence at a time/place) ] ??? 
10:44 --- # Putting it together: an example dataset .pull-left-wide[ | .orange[CPR] | .green[Income (kr)] | .red[Age] | .red[Gender] | |---|---|---|---| | .orange[010190-1234] | .green[320,000] | .red[34] | .red[F] | | .orange[150385-5679] | .green[510,000] | .red[40] | .red[M] | | .orange[220772-9012] | .green[280,000] | .red[52] | .red[F] | | .orange[030601-3457] | .green[450,000] | .red[23] | .red[M] | | .orange[091268-7890] | .green[390,000] | .red[56] | .red[F] | ] -- .pull-left-wide[ - .orange[**Sampling characteristic**] — how we find and contact the unit - .green[**Characteristic of interest**] — what we actually want to measure - .red[**Background characteristics**] — known attributes that may inform the sampling design (e.g., for stratification) ] ??? 10:46 --- class: inverse, middle, center # Consideration 3: Sampling method --- # Sampling method .pull-left-wide[ - The method used to select sampling units from the sampling frame is called the **sampling method** - The type of method used is very important: it determines the quality of the sample and the type of statistical methods that can be used ] -- .pull-left-wide[ - We can divide methods into two broad categories: - **Probability sampling** - **Non-probability sampling** ] ??? 10:47 --- # Probability sampling .pull-left-wide[ - **Probability sampling** means that the probability of selecting each sampling unit is known - For example, compile a list of all addresses in Denmark and randomly choose 5,000 of them — the probability of choosing each address is equal - More complicated methods assign *different* probabilities to different sampling units ] -- .pull-left-wide[ - Probability sampling has two major advantages: - It makes it possible to **assess the quality** of the study - It allows researchers to obtain a picture of the **true population** ] ??? 
10:49 --- # Non-probability sampling .pull-left-wide[ - **Non-probability sampling** means that the probability of selecting each sampling unit is unknown - This makes it impossible to assess the quality of the study - Used in practice because it is cheap and/or because probability sampling is infeasible ] -- .pull-left-wide[ - Treating a non-probability sample as a probability sample results in errors - We will discuss four examples: 1. Convenience sampling 2. Purposive sampling 3. Quota sampling 4. Snowball sampling ] ??? 10:51 --- # Convenience sampling .pull-left-wide[ - **Convenience sampling** = the sample is chosen based on how easy it is to select the sampling units (e.g., interviewing people at a shopping mall) - Advantage: cheap - Disadvantage: the probability of a unit being selected is unknown ] -- .pull-left-wide[ - Mall survey: people *at* the mall have atypical shopping attitudes - `\(\Rightarrow\)` sample does not represent the population ] ??? 10:53 --- # Purposive sampling .pull-left-wide[ - **Purposive sampling** = the researcher identifies in advance who will be studied (e.g., experts) - Advantage: need fewer units, so cheaper - Disadvantage: we do not obtain a true picture of the entire population (not everyone is an expert!) ] ??? 10:54 --- # Quota sampling .pull-left-wide[ - **Quota sampling** = it is decided in advance that the sample must include a certain proportion of units with certain **background characteristics** (e.g., immigrants, minorities) - Advantage: more information about groups that would otherwise be underrepresented - Disadvantage: may not give a true picture of the population ] -- .pull-left-wide[ - However, if we know the distribution of the background characteristics in the population (e.g., what fraction are immigrants), we can use this to improve quality - This is the main idea behind *stratification* ] ??? 
10:56 --- # Snowball sampling .pull-left-wide[ - **Snowball sampling** = selected sampling units are used to identify new sampling units - Used when it is difficult to find people with certain characteristics (e.g., illegal gamblers, illegal drug users) ] ??? 10:57 --- # .red[Raise your hand 2: Sampling methods]
*(Timer: 00:20)*
.pull-left-wide[ **Q1.** A researcher surveys students who happen to be in the university cafeteria at noon. Which method is this? - **A)** Probability sampling — the cafeteria is a defined location with a countable population - **B)** Convenience sampling — selection probability is unknown - **C)** Quota sampling — students are a well-defined demographic group ] -- .pull-left-wide[ **Q2.** Snowball sampling is most likely to produce a non-representative sample because: - **A)** Any bias in the initial seed group propagates through every subsequent referral - **B)** It relies on social networks, so the sample clusters around connected subgroups - **C)** Isolated individuals — with no ties to the seed group — have zero probability of selection ] ??? 11:00 --- class: inverse, middle, center # Consideration 4: Measurement of characteristics --- # Measurement of characteristics .pull-left-wide[ - Once the sampling units are selected, we need to measure their characteristics - This can be done in two ways: - **Measurement by observation** - **Self-reporting** ] ??? 11:16 --- # Measurement by observation .pull-left-wide[ - **Measurement by observation** = the researcher observes and records the characteristic (e.g., clinical studies) - It is important not to influence behaviour or characteristics when observing them ] -- .pull-left-wide[ - Individuals may try to eat healthier to impress the researcher - Some objects may be destroyed in the process of measuring their characteristics (e.g., resistance testing of drugs) - You might misinterpret the observed characteristics (e.g., is a smile genuine or fake?) - What does "boy" mean in old census data? (Sometimes farm servant) ] ??? 11:18 --- # Self-reporting .pull-left-wide[ - **Self-reporting** = the element itself measures its own characteristics - Key issue: is the element willing to provide accurate measurements? 
- Individuals may under-report income if they fear higher taxes - Individuals may not know how to assess certain characteristics (e.g., their IQ) ] -- .pull-left-wide[ - One advantage of self-reporting: we can obtain two types of information: - Characteristics based on past behaviour (*revealed preference*) - Characteristics based on planned future behaviour (*stated preference*) ] ??? 11:20 --- # Self-reporting: questions .pull-left-wide[ - Usually, self-reported measures are obtained by asking questions - Several issues need to be considered: - *How* we ask (phone, email, printed survey) — affects information quality and accuracy - How the *question itself* is presented — the order of questions can strongly influence answers ] ??? 11:22 --- # Reliability and validity .pull-left-wide[ - In order to have confidence in the results of a study, we need to ensure that the study is reliable and valid ] -- .pull-left-wide[ - **Reliability** = ability to repeat the sample selection and measurement and obtain similar results - **Validity** = the results of the study represent the stated object of the study (i.e., we measure what we intend to measure) ] ??? 11:24 --- # Key takeaways: designing the sampling process .pull-left-wide[ - **Deductive vs. inductive reasoning** — moving from population to sample vs. from sample to population - **Choice of population** — the population studied may differ from the target population - **Characteristics of interest** — deciding what to measure and how to measure it ] -- .pull-left-wide[ - **Sampling units** — can be cheaper to select several elements together - **Sampling frame** — identifies all potential sampling units and how to reach them - **Sampling methods** — probability vs. non-probability sampling - **Measurement** — observation vs. self-reports - **Reliability and validity** ] ??? 
11:26 --- class: inverse, middle, center # Formal model for the sample --- # Sample .pull-left-wide[ - From now on, unless otherwise mentioned, we assume **probability sampling** - With probability sampling, it is uncertain which units will be selected `\(\Rightarrow\)` we can represent the process using random variables ] -- .pull-left-wide[ - Let `\(X_i\)` be the random variable representing the value of the studied characteristic of element `\(i\)` - A **sample** containing `\(n\)` elements is represented by `\(n\)` random variables: $$ (X_1, X_2, \ldots, X_n) $$ ] ??? 11:28 --- # Realized sample .pull-left-wide[ - A sample is an "abstract" notion: we have `\(n\)` units, but we do not know who/what each unit is until selection is performed ] -- .pull-left-wide[ - Once we perform the actual selection, we can observe who/what is unit `\(i\)` and the value of its characteristic `\(X_i\)` - A **realized sample** with `\(n\)` elements consists of `\(n\)` observed values: $$ (x_1, x_2, \ldots, x_n) $$ ] ??? 11:30 --- # Sample distribution .pull-left-wide[ - A sample is a collection of `\(n\)` random variables - Each random variable has a distribution, but we are interested in how the distributions of all units are intertwined ] -- .pull-left-wide[ - The **sample distribution** is the joint distribution of `\((X_1, X_2, \ldots, X_n)\)`, given by `\(f(x_1, x_2, \ldots, x_n)\)`: - the probability function, if `\(X_1, \ldots, X_n\)` are discrete - the probability density function, if `\(X_1, \ldots, X_n\)` are continuous ] ??? 
11:32

---
class: inverse, middle, center

# A model for the sample

---
# Selection characteristic

.pull-left-wide[
- In practice, a sample is selected based on a characteristic *other than* the one being studied (e.g., CPR number, phone number)
- This is called the **selection characteristic**, denoted `\(Z\)`
- Let `\(Z_i\)` be the random variable for the value of `\(Z\)` for unit `\(i\)`; then `\(f_{Z_i}(z_i)\)` represents the selection mechanism
]

--

.pull-left-wide[
- Example: select 5,000 people based on CPR number
  - Get a list of all `\(N\)` CPR numbers in Denmark
  - For any CPR number `\(z\)`: `\(f_{Z_i}(z) = \dfrac{1}{N}\)`, since each of the `\(N\)` numbers on the list is equally likely to be drawn as unit `\(i\)`
  - If you only want a sample of women: `\(f_{Z_i}(z) = 0\)` for all men's CPR numbers
]

???
11:35

---
# Identified and non-identified elements

.pull-left-wide[
- When we select a unit via `\(Z\)`, we need to know: **does `\(Z\)` pin down a unique individual?**
- An element is **identified** if no other element shares the same value of `\(Z\)` — otherwise *non-identified*
]

--

.pull-left-wide[
- CPR number `\(\Rightarrow\)` **identified**: each number belongs to exactly one person
- Address `\(\Rightarrow\)` **non-identified**: selecting "Nyborgvej 55" could yield any of the people living there
- This distinction determines how well we know *who* we are actually sampling
]

???
11:38

---
# Selection characteristic and characteristic of interest

.pull-left-wide[
- Once we select a unit via `\(Z\)`, we observe its characteristic of interest `\(X\)`
- If the unit is non-identified, we don't know in advance *which* `\(X\)` value we will get
- The link between `\(Z\)` and `\(X\)` is the conditional distribution: `$$f_{X_i \mid Z_i}(x \mid z)$$`
]

--

.pull-left-wide[
- Example: we select address "Nyborgvej 55" — what is the probability the person there earns 200,000 kr?
- If three people live there with different incomes, this probability is less than 1
]

???
11:41

---
# Identified vs. non-identified: formal result

.pull-left-wide[
- **Identified** (`\(Z = z_0\)` maps to exactly one person with `\(X = x_0\)`): `$$f_{X_i \mid Z_i}(x_0 \mid z_0) = 1$$` Selecting `\(z_0\)` guarantees we observe `\(x_0\)`
]

--

.pull-left-wide[
- **Non-identified** (`\(Z = z^*\)` shared by multiple people with different `\(X\)`): `$$f_{X_i \mid Z_i}(x^* \mid z^*) < 1$$` Selecting `\(z^*\)` gives an uncertain draw — we don't know whose income we will observe
]

???
11:43

---
# Distribution of the characteristic of interest

.pull-left-wide[
- The joint distribution of the characteristic of interest and the selection characteristic is `\(f_{X_i, Z_i}(x, z)\)`
]

--

.pull-left-wide[
- From the definition of conditional and joint probability, the **marginal distribution** of the characteristic of interest is: `$$f_{X_i}(x) = \sum_z f_{X_i, Z_i}(x, z) = \sum_z \left[ f_{X_i \mid Z_i}(x \mid z) \cdot f_{Z_i}(z) \right]$$`
]

--

.pull-left-wide[
- **Why do we want this?** — `\(f_{X_i}(x)\)` tells us the distribution of `\(X\)` *as seen through our sampling mechanism*
- If `\(f_{X_i}(x) = f_X(x)\)`: the sample is **representative** — we recover the true population distribution
- If `\(f_{X_i}(x) \neq f_X(x)\)`: the sampling mechanism has distorted what we observe
- The formula shows exactly how: both *who we contact* (`\(f_{Z_i}\)`) and *who responds* (`\(f_{X_i \mid Z_i}\)`) shape the final distribution
]

???
11:45

---
# .red[Raise your hand 3: Formal model]
*(Timer: 00:20)*
.pull-left-wide[ **Q1.** We draw a sample of `\(n = 1{,}000\)` CPR numbers from the Danish population of `\(N\)` people, each with equal probability. The selection mechanism `\(f_{Z_i}(z)\)` is: - **A)** `\(f_{Z_i}(z) = \frac{1}{N}\)` for each valid CPR number `\(z\)` - **B)** `\(f_{Z_i}(z) = \frac{1}{1{,}000}\)` for each valid CPR number `\(z\)` - **C)** `\(f_{Z_i}(z) = \frac{1{,}000}{N}\)` for each valid CPR number `\(z\)` ] -- .pull-left-wide[ **Q2.** Income `\(X\)` and CPR number `\(Z\)` are independent. This means: - **A)** The sample is representative — `\(f_{X_i}(x) = f_X(x)\)` regardless of how `\(Z\)` is distributed - **B)** The sample is non-representative — independence means the selection ignores income entirely - **C)** Nothing about representativeness — independence is a property of the variables, not the sample ] ??? 11:48 --- class: inverse, middle, center # Representative sample --- # Representative sample .pull-left-wide[ - In most cases, we would like to obtain a "true" picture of the population without having to study the whole population - In the population, the relative frequency of a characteristic `\(x_i\)` tells us how often that characteristic occurs ] -- .pull-left-wide[ > A sample is **representative** for `\(X\)` if the marginal distribution of `\(X_i\)` equals the relative frequency function of `\(X\)` in the population: > `$$f_{X_i}(x) = f_X(x) \quad \text{for all } i = 1, 2, \ldots, n$$` - A sample that does not satisfy this condition is a **non-representative sample** ] ??? 11:50 --- # Conditions for a representative sample .pull-left-wide[ - Recall the marginal distribution: `$$f_{X_i}(x) = \sum_z \left[ f_{X_i \mid Z_i}(x \mid z) \cdot f_{Z_i}(z) \right]$$` - The sample is representative if both of the following hold: ] -- .pull-left-wide[ 1. The distribution of the selection characteristic matches the population: `\(f_{Z_i}(z) = f_Z(z)\)` 2. 
For a given `\(z\)`, the probability of selecting a unit with `\(X_i = x\)` equals the population proportion: `\(f_{X_i \mid Z_i}(x \mid z) = f_{X \mid Z}(x \mid z)\)` ] -- .pull-left-wide[ - We typically assume condition 2 holds and focus on how the selection mechanism `\(f_{Z_i}(z)\)` can lead to non-representative samples - Exception: if `\(X\)` and `\(Z\)` are **independent**, the sample is representative even if `\(f_{Z_i}(z) \neq f_Z(z)\)` ] ??? 11:53 --- class: inverse, middle, center # Consequences of a non-representative sample --- # Consequences of a non-representative sample .pull-left-wide[ - In general, a non-representative sample produces misleading results - Consider the mean of a characteristic in the population (same as the expected value): `$$E(X) = \sum_x x \cdot f_X(x)$$` ] -- .pull-left-wide[ - The mean for the `\(i\)`-th element in a non-representative sample: `$$E(X_i) = \sum_x x \cdot f_{X_i}(x) \neq \sum_x x \cdot f_X(x)$$` ] -- .pull-left-wide[ - If we *know* the distribution of the selection characteristic in the population (e.g., stratified sampling), we can correct the calculation of `\(E(X_i)\)` to recover `\(E(X)\)` - In that case, a representative sample is not strictly needed — but we must know *in which way* the sample is non-representative ] .pull-right-narrow[ .orange[*Emerging method: **Debiasing**: Deals with this in very large datasets. All the hype in ML/Econometrics now.*] ] ??? 11:59 --- # .red[Practice: Non-representative Bernoulli sample] *You will derive a proof that oversampling one group biases the sample mean — even when sampling within the frame is done correctly.* .pull-left-wide[ .small123[ We want to estimate `\(p\)` = fraction of Danes who exercise regularly. We recruit by standing outside a gym. Let `\(X_i = 1\)` (exercises regularly), `\(X_i = 0\)` (does not). Among gym visitors, a fraction `\(\alpha > p\)` exercise regularly — gym-goers are not a random cross-section of the population. 
**a)** Write `\(f_X(x)\)` for `\(x \in \{0,1\}\)` and show `\(E(X) = p\)`.

**b)** Our sample is drawn uniformly from gym visitors. Write `\(f_{X_i \mid Z_i}(1 \mid z)\)` for any gym visitor `\(z\)`. Are non-exercisers excluded from the sample?

**c)** Use the marginal distribution formula:
`$$f_{X_i}(x) = \sum_z f_{X_i \mid Z_i}(x \mid z) \cdot f_{Z_i}(z)$$`
to show `\(f_{X_i}(1) = \alpha\)`. Compute `\(E(X_i)\)`, state the direction of bias, and state the condition on `\(\alpha\)` under which the sample is representative.
]
]

---
class: inverse, middle, center

# Sources of error in the sampling process

---
# Sources of error

.pull-left-wide[
- In practice, researchers sometimes wrongly believe they are sampling based on the population distribution of `\(Z\)`
- There are several situations in which this may occur:
  1. **Errors in the sampling frame**
  2. **Non-respondents**
  3. **Self-selection**
]

---
# Errors in the sampling frame

.pull-left-wide[
- Sometimes, the sampling frame is not precise enough, and parts of the population are omitted
]

--

.pull-left-wide[
- Car show survey: visitors are more interested in "show cars" than hybrids
- `\(X\)` (hybrid preference) and `\(Z\)` (attending the show) are not independent
- `\(\Rightarrow\)` hybrid interest is *under-represented*
]

---
# Non-respondents

.pull-left-wide[
- Not everybody can or is willing to take part in a study — these individuals are called **non-respondents**
- If the probability of being a non-respondent is correlated with the characteristic of interest, the sample is non-representative
]

--

.pull-left-wide[
- Reasons for non-response:
  - Sensitive questions (e.g., income, sexual orientation)
  - Time required to answer
  - General attitude toward the survey
]

---
# Modelling non-response

.pull-left-wide[
- Let `\(Z_i\)` indicate if person `\(i\)` is a respondent (`\(Z_i = 1\)`) or not (`\(Z_i = 0\)`)
- The selection mechanism is:
`$$f_{Z_i}(z) = \begin{cases} 0, & z = 0 \\ 1, & z = 1 \end{cases}$$`
]

--
.pull-left-wide[ - This gives the marginal distribution of the characteristic of interest as: `$$f_{X_i}(x) = f_{X \mid Z}(x \mid 1)$$` - In other words, the distribution of `\(X_i\)` equals the distribution *among responders* - This equals the population distribution of `\(X\)` only if `\(X\)` and `\(Z\)` are independent - Note: non-response and sampling frame errors both imply that some elements have **zero probability** of being selected ] --- # Self-selection .pull-left-wide[ - **Self-selection** occurs when someone's behaviour affects the probability that the person will be selected for the sample ] -- .pull-left-wide[ - Wage study: only employed people can be sampled - Low-wage workers may have chosen *not* to work `\(\Rightarrow\)` selection probability = 0 - The observed wage distribution is truncated from below ] -- .pull-left-wide[ - In less extreme cases, the probability of being selected is non-zero but still affected by individual behaviour (e.g., online reviews) - Note: non-respondents can be seen as an extreme case of self-selection ] --- # .red[Raise your hand 5: Sources of error]
*(Timer: 00:20)*
.pull-left-wide[ **Q1.** An online survey about social media use is shared only on Facebook. Which error type is *most* relevant? - **A)** Self-selection — only people who feel strongly about social media will bother to respond - **B)** Sampling frame error — people without a Facebook account are excluded from the outset - **C)** Non-response — a large share of Facebook users will see the survey but not complete it ] -- .pull-left-wide[ **Q2.** We ask random employees about their wages. Our estimate of the mean wage is too high. What might be the cause?: - **A)** Observation error — respondents round wages up when self-reporting - **B)** Self-selection — low-wage workers are less likely to be employed and thus not sampled - **C)** Non-response — low-wage workers are less likely to respond ] --- class: inverse, middle, center # Observation errors --- # Observation errors .pull-left-wide[ - **Observation errors** occur when the actual value of the characteristic of interest differs from the measured (observed) value ] -- .pull-left-wide[ - We can divide observation errors into two types: - **Response errors** = the responder misunderstands the question or the answer is reported incorrectly - *Question error* = the answer is incompatible with the question - *Interview error* = the interviewer influenced the respondent to answer incorrectly - *Recording or coding error* = the answer was recorded or coded incorrectly - **Measurement error** = errors due to difficulty in measuring the characteristic accurately ] --- # Modelling observation errors .pull-left-wide[ - We assume that the observed value of the characteristic is a function of the "true" value and the observation error - We denote the correct value with a `\(*\)` superscript: `\(X^*\)` ] -- .pull-left-wide[ - The **observation error** `\(\epsilon\)` is the difference between the observed and correct value: `$$\epsilon = X - X^*$$` - The observed value equals the correct value plus the error: `$$X = X^* + 
\epsilon$$` ] --- # Consequences of observation errors .pull-left-wide[ - The consequences depend on what we are studying - We typically assume: `\(E(\epsilon) = 0\)` (mean-zero errors) - Then the mean of the characteristic of interest is unaffected: `$$E(X) = E(X^* + \epsilon) = E(X^*) + E(\epsilon) = E(X^*)$$` ] -- .pull-left-wide[ - If `\(E(\epsilon) \neq 0\)` (e.g., an incorrectly calibrated scale), then `\(E(X) \neq E(X^*)\)` ] -- .pull-left-wide[ - The variance *is* affected even with mean-zero errors - Assuming `\(\text{Cov}(X^*, \epsilon) = 0\)`: `$$\text{Var}(X) = \text{Var}(X^* + \epsilon) = \text{Var}(X^*) + \text{Var}(\epsilon) \geq \text{Var}(X^*)$$` - In other words, observation errors add noise to the observed characteristic ] --- # .red[Raise your hand 6: Observation errors]
*(Timer: 00:20)*
.pull-left-wide[ **Q1.** A study finds `\(\text{Var}(X) > \text{Var}(X^*)\)` even though `\(E(\epsilon) = 0\)`. The best explanation is: - **A)** The sample is non-representative — units with extreme values of `\(X\)` are over-sampled, stretching the distribution - **B)** Observation errors add noise to `\(X\)`, increasing its variance even when they are mean-zero - **C)** The variance formula is `\(\text{Var}(X) = \text{Var}(X^*) \cdot \text{Var}(\epsilon)\)`, so any error variance multiplies the true variance ] -- .pull-left-wide[ **Q2.** Income is under-reported by high earners. The effect on the estimated population mean is: - **A)** No effect — even systematic errors cancel out once you average over enough observations - **B)** Downward bias — the mean of `\(X\)` is lower than the mean of `\(X^*\)` - **C)** Upward bias — since only high earners under-report, their observed incomes are still the highest in the sample and dominate the mean ] --- # Next time .pull-left[ - Next time: **Estimating the mean** ] .pull-right[  ]
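---
# Appendix: simulating a non-representative sample

The gym exercise can be checked by simulation. A minimal Python sketch, assuming illustrative values `p = 0.30` and `alpha = 0.70` (these numbers are not from the exercise): sampling at the gym means each `\(X_i \sim \text{Bernoulli}(\alpha)\)` rather than `\(\text{Bernoulli}(p)\)`, so the sample mean converges to `\(\alpha\)`, not `\(p\)`.

```python
import numpy as np

rng = np.random.default_rng(0)
p, alpha, n = 0.30, 0.70, 100_000  # population rate, gym-goer rate (assumed values)

# Representative sample: every person equally likely, so X_i ~ Bernoulli(p)
representative = rng.binomial(1, p, size=n)

# Convenience sample at the gym: f_{X_i}(1) = alpha, not p
gym = rng.binomial(1, alpha, size=n)

print(round(representative.mean(), 2))  # near p = 0.30
print(round(gym.mean(), 2))             # near alpha = 0.70: upward bias of alpha - p
```

Note that more data does not help: the gym-based estimate converges precisely, but to the wrong number.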
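---
# Appendix: simulating observation errors

A minimal Python sketch of `\(X = X^* + \epsilon\)` with a mean-zero error (the normal distributions and their parameters are illustrative assumptions): the mean is unaffected, but the variance is inflated by `\(\text{Var}(\epsilon)\)`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

x_star = rng.normal(10.0, 2.0, size=n)  # true values X*, Var(X*) = 4
eps = rng.normal(0.0, 1.0, size=n)      # mean-zero observation error, Var(eps) = 1
x = x_star + eps                        # observed values X = X* + eps

print(round(x.mean(), 1))  # near E(X*) = 10: mean unaffected
print(round(x.var(), 1))   # near Var(X*) + Var(eps) = 5: noisier than X*
```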