class: center, inverse, middle <style type="text/css"> .pull-left { float: left; width: 44%; } .pull-right { float: right; width: 44%; } .pull-right ~ p { clear: both; } .pull-left-wide { float: left; width: 66%; } .pull-right-wide { float: right; width: 66%; } .pull-right-wide ~ p { clear: both; } .pull-left-narrow { float: left; width: 30%; } .pull-right-narrow { float: right; width: 30%; } .tiny123 { font-size: 0.40em; } .small123 { font-size: 0.80em; } .large123 { font-size: 2em; } .red { color: red } .orange { color: orange } .green { color: green } </style> # Statistics ## The sampling process ### (Chapter 9) ### Christian Vedel,<br>Department of Economics<br>University of Southern Denmark ### Email: [christian-vs@sam.sdu.dk](mailto:christian-vs@sam.sdu.dk) ### Updated 2026-04-09 --- class: middle # Today's lecture .pull-left-wide[ **How samples are drawn from populations — and what can go wrong** - **Section 1:** Deductive vs. inductive reasoning — and why sampling is the foundation of all inference - **Section 2:** Designing the sampling process - **Section 3:** Formal model for the sample - **Section 4:** Representative samples - **Section 5:** Sources of error and observation errors ] .pull-right-narrow[  ] ??? 10:16 --- # What this lecture unlocks .pull-left-wide[ Everything from Chapter 10 onwards rests on one question: **how was the sample drawn?** - **Ch. 10 — Estimating the mean:** The sample average `\(\bar{X}\)` is an unbiased estimator of `\(\mu\)` and has variance `\(\sigma^2/n\)` *only if* the sample is a representative draw where `\(X_1, \ldots, X_n\)` are i.i.d. - **Ch. 11 — Stratified estimation:** Uses the selection mechanism `\(f_{Z_i}(z)\)` explicitly to correct for known non-representativeness. *Also the basis for IPW estimation in Causal Inference.* - **Ch. 13 — Confidence intervals:** The stated coverage (e.g. 95%) is only correct when the sampling assumptions hold. - **Ch. 
14 — Hypothesis tests:** A test at size `\(\alpha\)` is only valid for the population actually sampled — not for the target population if the two differ. ] .pull-right-narrow[ > A non-representative sample means your formulas are correct but your answers are wrong — silently. ] ??? 10:18 --- # The minimal requirement for all inference .pull-left-wide[ - Say we want to estimate the population mean `\(\mu\)` - The natural estimator is the sample average: `\(\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i\)` - But when does `\(\bar{X}\)` give us a good estimate of `\(\mu\)`? ] -- .pull-left-wide[ - We would want `\(\bar{X}\)` to be **unbiased**: `\(E(\bar{X}) = \mu\)` - One solution: Assume `\(E(X_i) = \mu\)` for every unit `\(i\)` ] -- .pull-left-wide[ - That holds if and only if `\(X_i\)` has the same distribution as `\(X\)` in the population: `$$f_{X_i}(x) = f_X(x) \quad \text{for all } i \quad \Leftarrow \textbf{ representative sample}$$` - Then: `$$E(\bar{X}) = \frac{1}{n}\sum_{i=1}^n E(X_i) = \frac{1}{n} \cdot n \cdot \mu = \mu \checkmark$$` ] ??? 10:22 --- class: inverse, middle, center # Deductive and inductive reasoning --- # Deductive vs. inductive reasoning .pull-left[ ### Deductive reasoning - Start from a **known population** - Derive what samples look like `$$\text{Population} \rightarrow \text{Sample}$$` .red[*This is the theory we have now learned.*] ] -- .pull-right[ ### Inductive reasoning - Start from an **observed sample** - Learn about the population `$$\text{Sample} \rightarrow \text{Population}$$` .red[*This is the task we are faced with in practice.*] ] ??? 
10:24 --- # Deductive reasoning .pull-left-wide[ - Up until now, we assumed that random variables represent populations with **known distributions** - The distribution of the random variable was the same as the distribution of the population - This allowed us to calculate moments (mean, variance) and probabilities ] -- .pull-left-wide[ - This type of analysis — moving from general (population) to specific (sample) — is called **deductive reasoning** ] ??? 10:25 --- # Inductive reasoning .pull-left-wide[ - We will now reverse the process - We try to learn something about the population from a sample - This type of analysis — moving from specific (sample) to general (population) — is called **inductive reasoning** ] -- .pull-left-wide[ - The knowledge we acquire about the population allows us to make **predictions**: statements about what future samples would look like - Note that predictions themselves are based on deductive reasoning ] ??? 10:26 --- # Statistical inference .pull-left-wide[ - The main goal of **statistical inference** is to obtain knowledge about an unknown distribution (e.g., the distribution of income) ] -- .pull-left-wide[ - We can never observe the full population — we have **limited information** - Less data `\(\Rightarrow\)` less information `\(\Rightarrow\)` more uncertainty - Goal: use available information **efficiently** (low variance) and **accurately** (low bias) - A bad sample wastes or corrupts the information we do have ] ??? 10:28 --- # .red[Raise your hand 1: Reasoning and inference]
*(Timer: 00:30)*
.pull-left-wide[ **Q1.** You know that Danish mean income is 450,000 kr. You pick one Dane at random and expect their income to be around 450,000 kr. This is: - **A)** Deductive — applying a known population fact to an individual case - **B)** Inductive — any claim about an unknown individual generalises from data - **C)** Neither — "expecting" something is not reasoning, it's just a guess ] -- .pull-left-wide[ **Q2.** We want to estimate mean income in Denmark. We select *one* person completely at random. Is this a good estimate of the mean? - **A)** Yes - **B)** No ] ??? 10:32 --- class: inverse, middle, center # Designing the sampling process --- # Designing the sampling process .pull-left-wide[ - There are four considerations when designing a sampling process — we will go through each in turn: #### Consideration 1: `\(\rightarrow\)` Choice of **population** and characteristics of interest #### Consideration 2: `\(\rightarrow\)` Specification of **sampling units** and **sampling frame** #### Consideration 3: `\(\rightarrow\)` Specification of the **sampling method** #### Consideration 4: `\(\rightarrow\)` **Measurement** of characteristics of interest ] ??? 10:33 --- class: inverse, middle, center # Consideration 1: Population and characteristics of interest --- # Choice of population .pull-left-wide[ - What you want to know determines the population — but this is harder in practice ] -- .pull-left-wide[ - **Ethics:** harmful substances can't be tested on humans `\(\Rightarrow\)` use lab substitutes - **Cost:** can't sample everyone `\(\Rightarrow\)` use a subpopulation and generalise ] ??? 10:35 --- # Population vs target population .pull-left-wide[ - **Target population:** the population you care about - **Study population:** what you actually sample from — may differ ] -- .pull-left-wide[ - Key question: how well does the study population mimic the target? - Lab cells `\(\approx\)` human cells? - One city `\(\approx\)` the whole country? ] ??? 
10:37 --- # Characteristics of interest .pull-left-wide[ - Measuring the right thing is non-trivial - School quality `\(\rightarrow\)` test scores — But scores reflect prior skills too ] -- .pull-left-wide[ - What even are "skills"? IQ? Test-taking ability? Concentration? ] -- .pull-left-wide[ - Statistics can't answer this — requires **domain knowledge** ] ??? 10:39 --- class: inverse, middle, center # Consideration 2: Sampling units and sampling frame --- # Sampling units .pull-left-wide[ - The **sampling units** are the subset of elements from the population that we extract ] -- .pull-left-wide[ - Usually individual people — but not always - E.g., studying student performance: the sampling unit is a **school**, but the characteristic of interest is measured at the **student** level - Grouping units (schools, households) is often cheaper ] ??? 10:41 --- # Sampling frame .pull-left-wide[ - A **sampling frame** lists all sampling units and how to reach them - Can be precise or vague ] -- .pull-left-wide[ - Precise: the Danish CPR register — every resident has a unique number, so each unit can be contacted directly - Vague: intercepting people at a shopping mall — the "frame" is just whoever happens to be present ] ??? 10:43 --- # Sampling characteristic .pull-left-wide[ - The contact detail used to reach a unit is its **sampling characteristic** (e.g., CPR number, address, presence at a time/place) ] ??? 
10:44 --- # Putting it together: an example dataset .pull-left-wide[ | .orange[CPR] | .green[Income (kr)] | .red[Age] | .red[Gender] | |---|---|---|---| | .orange[010190-1234] | .green[320,000] | .red[34] | .red[F] | | .orange[150385-5679] | .green[510,000] | .red[40] | .red[M] | | .orange[220772-9012] | .green[280,000] | .red[52] | .red[F] | | .orange[030601-3457] | .green[450,000] | .red[23] | .red[M] | | .orange[091268-7890] | .green[390,000] | .red[56] | .red[F] | ] -- .pull-left-wide[ - .orange[**Sampling characteristic**] — how we find and contact the unit - .green[**Characteristic of interest**] — what we actually want to measure - .red[**Background characteristics**] — known attributes that may inform the sampling design (e.g., for stratification) ] ??? 10:46 --- class: inverse, middle, center # Consideration 3: Sampling method --- # Sampling method .pull-left-wide[ - The method used to select sampling units from the sampling frame is called the **sampling method** - The type of method used is very important: it determines the quality of the sample and the type of statistical methods that can be used ] -- .pull-left-wide[ - We can divide methods into two broad categories: - **Probability sampling** - **Non-probability sampling** ] ??? 10:47 --- # Probability sampling .pull-left-wide[ - **Probability sampling** means that the probability of selecting each sampling unit is known - For example, compile a list of all addresses in Denmark and randomly choose 5,000 of them — the probability of choosing each address is equal - More complicated methods assign *different* probabilities to different sampling units ] -- .pull-left-wide[ - Probability sampling has two major advantages: - It makes it possible to **assess the quality** of the study - It allows researchers to obtain a picture of the **true population** ] ??? 
10:49 --- # Non-probability sampling .pull-left-wide[ - **Non-probability sampling** means that the probability of selecting each sampling unit is unknown - This makes it impossible to assess the quality of the study - Used in practice because it is cheap and/or because probability sampling is infeasible ] -- .pull-left-wide[ - Treating a non-probability sample as a probability sample results in errors - We will discuss four examples: 1. Convenience sampling 2. Purposive sampling 3. Quota sampling 4. Snowball sampling ] ??? 10:51 --- # Convenience sampling .pull-left-wide[ - **Convenience sampling** = the sample is chosen based on how easy it is to select the sampling units (e.g., interviewing people at a shopping mall) - Advantage: cheap - Disadvantage: the probability of a unit being selected is unknown ] -- .pull-left-wide[ - Mall survey: people *at* the mall have atypical shopping attitudes - `\(\Rightarrow\)` sample does not represent the population ] ??? 10:53 --- # Purposive sampling .pull-left-wide[ - **Purposive sampling** = the researcher identifies in advance who will be studied (e.g., experts) - Advantage: need fewer units, so cheaper - Disadvantage: we do not obtain a true picture of the entire population (not everyone is an expert!) ] ??? 10:54 --- # Quota sampling .pull-left-wide[ - **Quota sampling** = it is decided in advance that the sample must include a certain proportion of units with certain **background characteristics** (e.g., immigrants, minorities) - Advantage: more information about groups that would otherwise be underrepresented - Disadvantage: may not give a true picture of the population ] -- .pull-left-wide[ - However, if we know the distribution of the background characteristics in the population (e.g., what fraction are immigrants), we can use this to improve quality - This is the main idea behind *stratification* ] ??? 
10:56 --- # Snowball sampling .pull-left-wide[ - **Snowball sampling** = selected sampling units are used to identify new sampling units - Used when it is difficult to find people with certain characteristics (e.g., illegal gamblers, illegal drug users) ] ??? 10:57 --- # .red[Raise your hand 2: Sampling methods]
*(Timer: 00:20)*
.pull-left-wide[ **Q1.** A researcher surveys students who happen to be in the university cafeteria at noon. Which method is this? - **A)** Probability sampling — the cafeteria is a defined location with a countable population - **B)** Convenience sampling — selection probability is unknown - **C)** Quota sampling — students are a well-defined demographic group ] -- .pull-left-wide[ **Q2.** Snowball sampling is most likely to produce a non-representative sample because: - **A)** Any bias in the initial seed group propagates through every subsequent referral - **B)** It relies on social networks, so the sample clusters around connected subgroups - **C)** Isolated individuals — with no ties to the seed group — have zero probability of selection ] ??? 11:00 --- class: inverse, middle, center # Consideration 4: Measurement of characteristics --- # Measurement of characteristics .pull-left-wide[ - Once the sampling units are selected, we need to measure their characteristics - This can be done in two ways: - **Measurement by observation** - **Self-reporting** ] ??? 11:16 --- # Measurement by observation .pull-left-wide[ - **Measurement by observation** = the researcher observes and records the characteristic (e.g., clinical studies) - It is important not to influence behaviour or characteristics when observing them ] -- .pull-left-wide[ - Individuals may try to eat healthier to impress the researcher - Some objects may be destroyed in the process of measuring their characteristics (e.g., resistance testing of drugs) - You might misinterpret the observed characteristics (e.g., is a smile genuine or fake?) - What does "boy" mean in old census data? (Sometimes farm servant) ] ??? 11:18 --- # Self-reporting .pull-left-wide[ - **Self-reporting** = the element itself measures its own characteristics - Key issue: is the element willing to provide accurate measurements? 
- Individuals may under-report income if they fear higher taxes - Individuals may not know how to assess certain characteristics (e.g., their IQ) ] -- .pull-left-wide[ - One advantage of self-reporting: we can obtain two types of information: - Characteristics based on past behaviour (*revealed preference*) - Characteristics based on planned future behaviour (*stated preference*) ] ??? 11:20 --- # Self-reporting: questions .pull-left-wide[ - Usually, self-reported measures are obtained by asking questions - Several issues need to be considered: - *How* we ask (phone, email, printed survey) — affects information quality and accuracy - How the *question itself* is presented — the order of questions can strongly influence answers ] ??? 11:22 --- # Reliability and validity .pull-left-wide[ - In order to have confidence in the results of a study, we need to ensure that the study is reliable and valid ] -- .pull-left-wide[ - **Reliability** = ability to repeat the sample selection and measurement and obtain similar results - **Validity** = the results of the study represent the stated object of the study (i.e., we measure what we intend to measure) ] ??? 11:24 --- # Key takeaways: designing the sampling process .pull-left-wide[ - **Deductive vs. inductive reasoning** — moving from population to sample vs. from sample to population - **Choice of population** — the population studied may differ from the target population - **Characteristics of interest** — deciding what to measure and how to measure it ] -- .pull-left-wide[ - **Sampling units** — can be cheaper to select several elements together - **Sampling frame** — identifies all potential sampling units and how to reach them - **Sampling methods** — probability vs. non-probability sampling - **Measurement** — observation vs. self-reports - **Reliability and validity** ] ??? 
11:26 --- class: inverse, middle, center # Formal model for the sample --- # Sample .pull-left-wide[ - From now on, unless otherwise mentioned, we assume **probability sampling** - With probability sampling, it is uncertain which units will be selected `\(\Rightarrow\)` we can represent the process using random variables ] -- .pull-left-wide[ - Let `\(X_i\)` be the random variable representing the value of the studied characteristic of element `\(i\)` - A **sample** containing `\(n\)` elements is represented by `\(n\)` random variables: $$ (X_1, X_2, \ldots, X_n) $$ ] ??? 11:28 --- # Realized sample .pull-left-wide[ - A sample is an "abstract" notion: we have `\(n\)` units, but we do not know who/what each unit is until selection is performed ] -- .pull-left-wide[ - Once we perform the actual selection, we can observe who/what is unit `\(i\)` and the value of its characteristic `\(X_i\)` - A **realized sample** with `\(n\)` elements consists of `\(n\)` observed values: $$ (x_1, x_2, \ldots, x_n) $$ ] ??? 11:30 --- # Sample distribution .pull-left-wide[ - A sample is a collection of `\(n\)` random variables - Each random variable has a distribution, but we are interested in how the distributions of all units are intertwined ] -- .pull-left-wide[ - The **sample distribution** is the joint distribution of `\((X_1, X_2, \ldots, X_n)\)`, given by `\(f(x_1, x_2, \ldots, x_n)\)`: - the probability function, if `\(X_1, \ldots, X_n\)` are discrete - the probability density function, if `\(X_1, \ldots, X_n\)` are continuous ] ??? 
11:32

---
class: inverse, middle, center

# A model for the sample

---
# Selection characteristic

.pull-left-wide[
- In practice, a sample is selected based on a characteristic *other than* the one being studied (e.g., CPR number, phone number)
- This is called the **selection characteristic**, denoted `\(Z\)`
- Let `\(Z_i\)` be the random variable for the value of `\(Z\)` for unit `\(i\)`; then `\(f_{Z_i}(z_i)\)` represents the selection mechanism
]

--

.pull-left-wide[
- Example: select 5,000 people based on CPR number
  - Get a list of all `\(N\)` CPR numbers in Denmark
  - For any CPR number `\(z\)`: `\(f_{Z_i}(z) = \dfrac{1}{N}\)`, since each of the `\(N\)` numbers on the list is equally likely to be drawn as unit `\(i\)`
  - If you only want a sample of women: `\(f_{Z_i}(z) = 0\)` for all men's CPR numbers
]

???
11:35

---
# Identified and non-identified elements

.pull-left-wide[
- When we select a unit via `\(Z\)`, we need to know: **does `\(Z\)` pin down a unique individual?**
- An element is **identified** if no other element shares the same value of `\(Z\)` — otherwise *non-identified*
]

--

.pull-left-wide[
- CPR number `\(\Rightarrow\)` **identified**: each number belongs to exactly one person
- Address `\(\Rightarrow\)` **non-identified**: selecting "Nyborgvej 55" could yield any of the people living there
- This distinction determines how well we know *who* we are actually sampling
]

???
11:38

---
# Selection characteristic and characteristic of interest

.pull-left-wide[
- Once we select a unit via `\(Z\)`, we observe its characteristic of interest `\(X\)`
- If the unit is non-identified, we don't know in advance *which* `\(X\)` value we will get
- The link between `\(Z\)` and `\(X\)` is the conditional distribution: `$$f_{X_i \mid Z_i}(x \mid z)$$`
]

--

.pull-left-wide[
- Example: we select address "Nyborgvej 55" — what is the probability the person there earns 200,000 kr?
- If three people live there with different incomes, this probability is less than 1
]

???
11:41

---
# Identified vs. non-identified: formal result

.pull-left-wide[
- **Identified** (`\(Z = z_0\)` maps to exactly one person with `\(X = x_0\)`): `$$f_{X_i \mid Z_i}(x_0 \mid z_0) = 1$$` Selecting `\(z_0\)` guarantees we observe `\(x_0\)`
]

--

.pull-left-wide[
- **Non-identified** (`\(Z = z^*\)` shared by multiple people with different `\(X\)`): `$$f_{X_i \mid Z_i}(x^* \mid z^*) < 1$$` Selecting `\(z^*\)` gives an uncertain draw — we don't know whose income we will observe
]

???
11:43

---
# Distribution of the characteristic of interest

.pull-left-wide[
- The joint distribution of the characteristic of interest and the selection characteristic is `\(f_{X_i, Z_i}(x, z)\)`
]

--

.pull-left-wide[
- From the definition of conditional and joint probability, the **marginal distribution** of the characteristic of interest is: `$$f_{X_i}(x) = \sum_z f_{X_i, Z_i}(x, z) = \sum_z \left[ f_{X_i \mid Z_i}(x \mid z) \cdot f_{Z_i}(z) \right]$$`
]

--

.pull-left-wide[
- **Why do we want this?** — `\(f_{X_i}(x)\)` tells us the distribution of `\(X\)` *as seen through our sampling mechanism*
- If `\(f_{X_i}(x) = f_X(x)\)`: the sample is **representative** — we recover the true population distribution
- If `\(f_{X_i}(x) \neq f_X(x)\)`: the sampling mechanism has distorted what we observe
- The formula shows exactly how: both *who we contact* (`\(f_{Z_i}\)`) and *who responds* (`\(f_{X_i \mid Z_i}\)`) shape the final distribution
]

???
11:45

---
# .red[Raise your hand 3: Formal model]
*(Timer: 00:20)*
.pull-left-wide[ **Q1.** We draw a sample of `\(n = 1{,}000\)` CPR numbers from the Danish population of `\(N\)` people, each with equal probability. The selection mechanism `\(f_{Z_i}(z)\)` is: - **A)** `\(f_{Z_i}(z) = \frac{1}{N}\)` for each valid CPR number `\(z\)` - **B)** `\(f_{Z_i}(z) = \frac{1}{1{,}000}\)` for each valid CPR number `\(z\)` - **C)** `\(f_{Z_i}(z) = \frac{1{,}000}{N}\)` for each valid CPR number `\(z\)` ] -- .pull-left-wide[ **Q2.** Income `\(X\)` and CPR number `\(Z\)` are independent. This means: - **A)** The sample is representative — `\(f_{X_i}(x) = f_X(x)\)` regardless of how `\(Z\)` is distributed - **B)** The sample is non-representative — independence means the selection ignores income entirely - **C)** Nothing about representativeness — independence is a property of the variables, not the sample ] ??? 11:48 --- class: inverse, middle, center # Representative sample --- # Representative sample .pull-left-wide[ - In most cases, we would like to obtain a "true" picture of the population without having to study the whole population - In the population, the relative frequency of a characteristic `\(x_i\)` tells us how often that characteristic occurs ] -- .pull-left-wide[ > A sample is **representative** for `\(X\)` if the marginal distribution of `\(X_i\)` equals the relative frequency function of `\(X\)` in the population: > `$$f_{X_i}(x) = f_X(x) \quad \text{for all } i = 1, 2, \ldots, n$$` - A sample that does not satisfy this condition is a **non-representative sample** ] ??? 11:50 --- # Conditions for a representative sample .pull-left-wide[ - Recall the marginal distribution: `$$f_{X_i}(x) = \sum_z \left[ f_{X_i \mid Z_i}(x \mid z) \cdot f_{Z_i}(z) \right]$$` - The sample is representative if both of the following hold: ] -- .pull-left-wide[ 1. The distribution of the selection characteristic matches the population: `\(f_{Z_i}(z) = f_Z(z)\)` 2. 
For a given `\(z\)`, the probability of selecting a unit with `\(X_i = x\)` equals the population proportion: `\(f_{X_i \mid Z_i}(x \mid z) = f_{X \mid Z}(x \mid z)\)` ] -- .pull-left-wide[ - We typically assume condition 2 holds and focus on how the selection mechanism `\(f_{Z_i}(z)\)` can lead to non-representative samples - Exception: if `\(X\)` and `\(Z\)` are **independent**, the sample is representative even if `\(f_{Z_i}(z) \neq f_Z(z)\)` ] ??? 11:53 --- class: inverse, middle, center # Consequences of a non-representative sample --- # Consequences of a non-representative sample .pull-left-wide[ - In general, a non-representative sample produces misleading results - Consider the mean of a characteristic in the population (same as the expected value): `$$E(X) = \sum_x x \cdot f_X(x)$$` ] -- .pull-left-wide[ - The mean for the `\(i\)`-th element in a non-representative sample: `$$E(X_i) = \sum_x x \cdot f_{X_i}(x) \neq \sum_x x \cdot f_X(x)$$` ] -- .pull-left-wide[ - If we *know* the distribution of the selection characteristic in the population (e.g., stratified sampling), we can correct the calculation of `\(E(X_i)\)` to recover `\(E(X)\)` - In that case, a representative sample is not strictly needed — but we must know *in which way* the sample is non-representative ] .pull-right-narrow[ .orange[*Emerging method: **Debiasing**: Deals with this in very large datasets. All the hype in ML/Econometrics now.*] ] ??? 11:59 --- # .red[Practice: Non-representative Bernoulli sample] *You will derive a proof that oversampling one group biases the sample mean — even when sampling within the frame is done correctly.* .pull-left-wide[ .small123[ We want to estimate `\(p\)` = fraction of Danes who exercise regularly. We recruit by standing outside a gym. Let `\(X_i = 1\)` (exercises regularly), `\(X_i = 0\)` (does not). Among gym visitors, a fraction `\(\alpha > p\)` exercise regularly — gym-goers are not a random cross-section of the population. 
**a)** Write `\(f_X(x)\)` for `\(x \in \{0,1\}\)` and show `\(E(X) = p\)`.

**b)** Our sample is drawn uniformly from gym visitors. Write `\(f_{X_i \mid Z_i}(1 \mid z)\)` for any gym visitor `\(z\)`. Are non-exercisers excluded from the sample?

**c)** Use the marginal distribution formula:
`$$f_{X_i}(x) = \sum_z f_{X_i \mid Z_i}(x \mid z) \cdot f_{Z_i}(z)$$`
to show `\(f_{X_i}(1) = \alpha\)`. Compute `\(E(X_i)\)`, state the direction of bias, and state the condition on `\(\alpha\)` under which the sample is representative.
]
]

---
class: inverse, middle, center

# Sources of error in the sampling process

---
# Sources of error

.pull-left-wide[
- In practice, researchers sometimes wrongly believe they are sampling based on the population distribution of `\(Z\)`
- There are several situations in which this may occur:
  1. **Errors in the sampling frame**
  2. **Non-respondents**
  3. **Self-selection**
]

---
# Errors in the sampling frame

.pull-left-wide[
- Sometimes, the sampling frame is not precise enough, and parts of the population are omitted
]

--

.pull-left-wide[
- Car show survey: visitors are more interested in "show cars" than hybrids
- `\(X\)` (hybrid preference) and `\(Z\)` (attending the show) are not independent
- `\(\Rightarrow\)` hybrid interest is *under-represented*
]

---
# Non-respondents

.pull-left-wide[
- Not everybody can or is willing to take part in a study — these individuals are called **non-respondents**
- If the probability of being a non-respondent is correlated with the characteristic of interest, the sample is non-representative
]

--

.pull-left-wide[
- Reasons for non-response:
  - Sensitive questions (e.g., income, sexual orientation)
  - Time required to answer
  - General attitude toward the survey
]

---
# Modelling non-response

.pull-left-wide[
- Let `\(Z_i\)` indicate if person `\(i\)` is a respondent (`\(Z_i = 1\)`) or not (`\(Z_i = 0\)`)
- The selection mechanism is:
`$$f_{Z_i}(z) = \begin{cases} 0, & z = 0 \\ 1, & z = 1 \end{cases}$$`
]

--
.pull-left-wide[ - This gives the marginal distribution of the characteristic of interest as: `$$f_{X_i}(x) = f_{X \mid Z}(x \mid 1)$$` - In other words, the distribution of `\(X_i\)` equals the distribution *among responders* - This equals the population distribution of `\(X\)` only if `\(X\)` and `\(Z\)` are independent - Note: non-response and sampling frame errors both imply that some elements have **zero probability** of being selected ] --- # Self-selection .pull-left-wide[ - **Self-selection** occurs when someone's behaviour affects the probability that the person will be selected for the sample ] -- .pull-left-wide[ - Wage study: only employed people can be sampled - Low-wage workers may have chosen *not* to work `\(\Rightarrow\)` selection probability = 0 - The observed wage distribution is truncated from below ] -- .pull-left-wide[ - In less extreme cases, the probability of being selected is non-zero but still affected by individual behaviour (e.g., online reviews) - Note: non-respondents can be seen as an extreme case of self-selection ] --- # .red[Raise your hand 5: Sources of error]
*(Timer: 00:20)*
.pull-left-wide[ **Q1.** An online survey about social media use is shared only on Facebook. Which error type is *most* relevant? - **A)** Self-selection — only people who feel strongly about social media will bother to respond - **B)** Sampling frame error — people without a Facebook account are excluded from the outset - **C)** Non-response — a large share of Facebook users will see the survey but not complete it ] -- .pull-left-wide[ **Q2.** We ask random employees about their wages. Our estimate of the mean wage is too high. What might be the cause?: - **A)** Observation error — respondents round wages up when self-reporting - **B)** Self-selection — low-wage workers are less likely to be employed and thus not sampled - **C)** Non-response — low-wage workers are less likely to respond ] --- class: inverse, middle, center # Observation errors --- # Observation errors .pull-left-wide[ - **Observation errors** occur when the actual value of the characteristic of interest differs from the measured (observed) value ] -- .pull-left-wide[ - We can divide observation errors into two types: - **Response errors** = the responder misunderstands the question or the answer is reported incorrectly - *Question error* = the answer is incompatible with the question - *Interview error* = the interviewer influenced the respondent to answer incorrectly - *Recording or coding error* = the answer was recorded or coded incorrectly - **Measurement error** = errors due to difficulty in measuring the characteristic accurately ] --- # Modelling observation errors .pull-left-wide[ - We assume that the observed value of the characteristic is a function of the "true" value and the observation error - We denote the correct value with a `\(*\)` superscript: `\(X^*\)` ] -- .pull-left-wide[ - The **observation error** `\(\epsilon\)` is the difference between the observed and correct value: `$$\epsilon = X - X^*$$` - The observed value equals the correct value plus the error: `$$X = X^* + 
\epsilon$$` ] --- # Consequences of observation errors .pull-left-wide[ - The consequences depend on what we are studying - We typically assume: `\(E(\epsilon) = 0\)` (mean-zero errors) - Then the mean of the characteristic of interest is unaffected: `$$E(X) = E(X^* + \epsilon) = E(X^*) + E(\epsilon) = E(X^*)$$` ] -- .pull-left-wide[ - If `\(E(\epsilon) \neq 0\)` (e.g., an incorrectly calibrated scale), then `\(E(X) \neq E(X^*)\)` ] -- .pull-left-wide[ - The variance *is* affected even with mean-zero errors - Assuming `\(\text{Cov}(X^*, \epsilon) = 0\)`: `$$\text{Var}(X) = \text{Var}(X^* + \epsilon) = \text{Var}(X^*) + \text{Var}(\epsilon) \geq \text{Var}(X^*)$$` - In other words, observation errors add noise to the observed characteristic ] --- # .red[Raise your hand 6: Observation errors]
*(Timer: 00:20)*
.pull-left-wide[ **Q1.** A study finds `\(\text{Var}(X) > \text{Var}(X^*)\)` even though `\(E(\epsilon) = 0\)`. The best explanation is: - **A)** The sample is non-representative — units with extreme values of `\(X\)` are over-sampled, stretching the distribution - **B)** Observation errors add noise to `\(X\)`, increasing its variance even when they are mean-zero - **C)** The variance formula is `\(\text{Var}(X) = \text{Var}(X^*) \cdot \text{Var}(\epsilon)\)`, so any error variance multiplies the true variance ] -- .pull-left-wide[ **Q2.** Income is under-reported by high earners. The effect on the estimated population mean is: - **A)** No effect — even systematic errors cancel out once you average over enough observations - **B)** Downward bias — the mean of `\(X\)` is lower than the mean of `\(X^*\)` - **C)** Upward bias — since only high earners under-report, their observed incomes are still the highest in the sample and dominate the mean ] --- # Next time .pull-left[ - Next time: **Estimating the mean** ] .pull-right[  ]
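---
# Appendix: simulating a non-representative sample

The gym exercise can be checked by simulation. A minimal Python sketch, assuming illustrative values `p = 0.30` and `alpha = 0.70` (these numbers are not from the exercise): sampling at the gym means each `\(X_i \sim \text{Bernoulli}(\alpha)\)` rather than `\(\text{Bernoulli}(p)\)`, so the sample mean converges to `\(\alpha\)`, not `\(p\)`.

```python
import numpy as np

rng = np.random.default_rng(0)
p, alpha, n = 0.30, 0.70, 100_000  # population rate, gym-goer rate (assumed values)

# Representative sample: every person equally likely, so X_i ~ Bernoulli(p)
representative = rng.binomial(1, p, size=n)

# Convenience sample at the gym: f_{X_i}(1) = alpha, not p
gym = rng.binomial(1, alpha, size=n)

print(round(representative.mean(), 2))  # near p = 0.30
print(round(gym.mean(), 2))             # near alpha = 0.70: upward bias of alpha - p
```

Note that more data does not help: the gym-based estimate converges precisely, but to the wrong number.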
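---
# Appendix: simulating observation errors

A minimal Python sketch of `\(X = X^* + \epsilon\)` with a mean-zero error (the normal distributions and their parameters are illustrative assumptions): the mean is unaffected, but the variance is inflated by `\(\text{Var}(\epsilon)\)`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

x_star = rng.normal(10.0, 2.0, size=n)  # true values X*, Var(X*) = 4
eps = rng.normal(0.0, 1.0, size=n)      # mean-zero observation error, Var(eps) = 1
x = x_star + eps                        # observed values X = X* + eps

print(round(x.mean(), 1))  # near E(X*) = 10: mean unaffected
print(round(x.var(), 1))   # near Var(X*) + Var(eps) = 5: noisier than X*
```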