L2 - Statistics
2026-02-06
Imagine you are the campaign manager of a new political candidate: Pumpernickel. She wants to know the share of the population who will vote for her. How would you proceed?
\(\rightarrow\) Too costly: only survey a fraction or sample of the population.
\(\rightarrow\) How many people do you survey? How would you select them?
\(\rightarrow\) Will the sample give you the exact population share? Probably not: you obtain an estimate of this share.
Definitions
Population: the entire group about which we want information, denoted \(N\).
Sample: the fraction of the population for which we collect information, denoted \(n\).
Sampling rate: ratio between the sample size and the population size, denoted \(t = n/N\).
Sampling error: the error incurred when the statistical characteristics of a population are estimated from a sample of that population.
Goal: use information from the sample to draw conclusions about the population as a whole. In order to ensure accurate inferences about the population, the sample needs to be representative of the population.
When it is costly to collect information on the whole population, and you want to know some characteristics of a population, you only survey a sample of individuals.
Two main challenges:
Selecting the sample
Measuring the level of uncertainty on the statistics you get
How do you ensure your sample is representative?
\(\rightarrow\) Does it sound like a good idea? Why?
The best way to obtain a representative sample of the population of interest is to randomly select the individuals who are included in the sample.
– You can do so if you have access to the exhaustive list of individuals in the population of interest: the sampling frame
– In practice, having access to such an exhaustive list is often difficult
Definition
The random variables \(X_1, ..., X_n\) are called a random sample of size n from a population if \(X_1, ..., X_n\) are mutually independent random variables and each \(X_i\) has the same probability distribution. Alternatively, \(X_1, ..., X_n\) are called independent and identically distributed random variables. This is commonly abbreviated to i.i.d. random variables.
Let’s imagine the population of interest is 1,000 individuals.
We make some simplifying assumptions:
individuals truthfully report their voting intentions
their current preferences correspond to their preferences on election day
we know that Pumpernickel’s vote share is 23%
– 0.23 is the true population proportion (denoted \(p\))
In practice, you do not know the true population parameter.
Let’s survey 20 individuals.
We ensure they are selected at random.
How many report they will vote for Pumpernickel?
5 out of 20 report they will vote for Pumpernickel: 5/20 = 25%
0.25 can be thought of as our guess of the vote share for Pumpernickel
The goal for the class is for the estimates to be good guesses for the population parameters:
\[ \text{Data} \rightarrow \text{Calculation} \rightarrow \hat{\text{Estimates}} \xrightarrow{\text{hopefully!}} \text{Truth} \]
For example,
\[ X_1, \ldots, X_n \rightarrow \frac{1}{n}\sum_{i=1}^{n}X_i \rightarrow \bar{x} \xrightarrow{\text{hopefully!}} \mu \]
What would happen if we took a new sample? Would we also get 5 Pumpernickel voters as before?
What if we repeated this survey multiple times?
Probably not. The samples will vary from draw to draw.
Key to this observation: these are randomly drawn samples.
Create a data.frame containing the proportions of Pumpernickel voters from the previous slide. Name it vote_share and name the variable containing the proportions prop_votes. (Hint: to create a data.frame you need to use the data.frame() function.) The proportions are as follows: (0.25, 0.25, 0.35, 0.20, 0.20, 0.35, 0.1, 0.35, 0.35, 0.2, 0.25, 0.25, 0.2, 0.15, 0.25, 0.2, 0.15, 0.15, 0.2, 0.1).
Create a histogram of these proportions using ggplot2.
What do you observe?
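One possible solution, as a sketch (the binwidth is an illustrative choice; any reasonable value works). The ggplot2 call is guarded with requireNamespace() so the script also runs where ggplot2 is not installed:

```r
# Build the data.frame of sample proportions from the slide
vote_share <- data.frame(
  prop_votes = c(0.25, 0.25, 0.35, 0.20, 0.20, 0.35, 0.1, 0.35, 0.35, 0.2,
                 0.25, 0.25, 0.2, 0.15, 0.25, 0.2, 0.15, 0.15, 0.2, 0.1)
)

# Histogram of the proportions (guarded so the script runs without ggplot2)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  ggplot(vote_share, aes(x = prop_votes)) +
    geom_histogram(binwidth = 0.05) +
    labs(x = "Proportion of Pumpernickel voters", y = "Count")
}

mean(vote_share$prop_votes)  # the estimates cluster around the true p = 0.23
```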
Demonstrated the statistical concept of sampling
Objective: know the proportion of Pumpernickel voters, \(p\)
Methods:
Census: time-consuming (and in many cases very costly);
Random Sampling: extract a sample of 20 voters from the population to obtain an estimate, \(\hat{p}\). Our first estimate of the proportion of Pumpernickel voters was 0.25, and it turned out to be around the mean of our other estimates
Important: each sample was drawn randomly \(\rightarrow\) samples are different from each other!
\(\rightarrow\) different proportions 👉 sampling variation
This diagram applies to all estimates.
# voter_population <- read.csv("https://www.dropbox.com/scl/fi/bizhr0jjvrzm4kyhggvos/population_clean.csv?rlkey=ryeswyb5mgbvvdqaptwtleybj&dl=1")
voter_population <- read.csv("data/population_clean.csv")
head(voter_population)
  voter_id       candidate
1        1 Other candidate
2        2 Other candidate
3        3 Other candidate
4        4 Other candidate
5        5    Pumpernickel
6        6 Other candidate
2 variables:
voter_id: voter identifier
candidate: candidate of voter
To draw a random sample of 50 voters from the population, we use the moderndive function rep_sample_n:
virtual_shovel <- voter_population |>
  rep_sample_n(size = 50)
head(virtual_shovel)
# A tibble: 6 × 3
# Groups: replicate [1]
  replicate voter_id candidate
      <int>    <int> <chr>
1         1      287 Other candidate
2         1      466 Other candidate
3         1      530 Other candidate
4         1      995 Other candidate
5         1      488 Other candidate
6         1      814 Pumpernickel
replicate tells us the ID of the sample. Here: 1.
sample_1 <- virtual_shovel |>
summarise(
# number of observations in sample
n_sample = n(),
# number of Pumpernickel voters in sample
n_pumpernickel = sum(candidate == "Pumpernickel")) |>
mutate(
# proportion of Pumpernickel voters in sample
prop_pumpernickel = n_pumpernickel / n_sample)
sample_1
# A tibble: 1 × 4
replicate n_sample n_pumpernickel prop_pumpernickel
<int> <int> <int> <dbl>
1 1 50 13 0.26
👉 26% are Pumpernickel voters! This is an estimate of the proportion of Pumpernickel voters in the population. What if we try again?
What if we try many times, like, 100 times?
100 samples (replicates) of size 50.
virtual_samples <- voter_population |>
# get 100 samples of size 50
rep_sample_n(size = 50, reps = 100)
virtual_samples
# A tibble: 5,000 × 3
# Groups: replicate [100]
replicate voter_id candidate
<int> <int> <chr>
1 1 912 Other candidate
2 1 551 Other candidate
3 1 352 Other candidate
4 1 833 Other candidate
5 1 862 Pumpernickel
6 1 826 Other candidate
7 1 217 Pumpernickel
8 1 696 Pumpernickel
9 1 683 Other candidate
10 1 451 Other candidate
# ℹ 4,990 more rows
Compute the proportion of Pumpernickel voters in each sample.
virtual_prop_pumpernickel <- virtual_samples |>
group_by(replicate) |>
summarise(prop_pumpernickel = mean(candidate == "Pumpernickel"))
virtual_prop_pumpernickel
# A tibble: 100 × 2
replicate prop_pumpernickel
<int> <dbl>
1 1 0.26
2 2 0.2
3 3 0.2
4 4 0.28
5 5 0.28
6 6 0.3
7 7 0.3
8 8 0.26
9 9 0.3
10 10 0.22
# ℹ 90 more rows
Just as before, the virtual sampler also creates random samples.
The prop_pumpernickel column in the virtual_prop_pumpernickel data.frame differs across samples.
And again, we can visualize the sampling distribution:
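A self-contained way to reproduce this kind of plot is sketched below. The lecture uses rep_sample_n() from moderndive; here base R's sample() plays that role, and the population construction and seed are illustrative assumptions:

```r
set.seed(1)  # for reproducibility (assumption: any seed works)

# A population of 1,000 voters in which 23% support Pumpernickel
population <- c(rep("Pumpernickel", 230), rep("Other candidate", 770))

# 100 samples of size 50: proportion of Pumpernickel voters in each
prop_pumpernickel <- replicate(100, mean(sample(population, 50) == "Pumpernickel"))

# Base-R histogram of the sampling distribution
hist(prop_pumpernickel,
     main = "Sampling distribution (100 samples of size 50)",
     xlab = "Proportion of Pumpernickel voters")
```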
Instead of taking only 100 samples, let’s take 1000!
Load the data into an object voter_population
Obtain 1,000 samples of size 50 using the rep_sample_n() function from the moderndive package.
Calculate the proportion of Pumpernickel voters in each sample.
Plot a histogram of the obtained voter shares in each sample.
What do you observe? Which voter shares occur most frequently? How does the shape of the histogram compare to when we took only 100 samples?
How likely is it that we sample 50 voters of which less than 15% are Pumpernickel voters?
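A possible solution sketch, again using base R's sample() in place of rep_sample_n() (the population vector and seed are assumptions):

```r
set.seed(42)  # assumption: any seed works
population <- c(rep("Pumpernickel", 230), rep("Other candidate", 770))

# 1,000 samples of size 50 (the lecture uses moderndive::rep_sample_n())
props <- replicate(1000, mean(sample(population, 50) == "Pumpernickel"))

hist(props, breaks = 20,
     main = "1,000 samples of size 50",
     xlab = "Proportion of Pumpernickel voters")

# Empirical answer to the last question: share of samples with less than 15%
mean(props < 0.15)
```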
Looks remarkably close to a normal distribution \(\rightarrow\) the more samples we take, the more their sampling distribution will resemble a normal distribution
Imagine you could change the size of your samples and had the option of the following sizes: 25, 50 and 100.
If your goal is still to estimate the share of Pumpernickel voters in the population, which sample size would you choose?
Let’s repeat what we did previously but for different sample sizes.
Let’s take 1000 samples each for \(n = 25\), \(n = 50\), \(n = 100\).
We will use rep_sample_n() again.
Generate all samples of different sizes:
Compute proportion of Pumpernickel voters
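A base-R sketch of these two steps (sample() stands in for rep_sample_n(); the seed and population construction are assumptions). It produces a summary like the comparison table that follows:

```r
set.seed(7)  # assumption: any seed works
population <- c(rep("Pumpernickel", 230), rep("Other candidate", 770))

# For each sample size, draw 1,000 samples and record the proportion
sizes <- c(25, 50, 100)
summary_by_size <- t(sapply(sizes, function(n) {
  props <- replicate(1000, mean(sample(population, n) == "Pumpernickel"))
  c(sample_size = n, mean = mean(props), sd = sd(props))
}))
summary_by_size  # the sd column shrinks as the sample size grows
```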
Comparison of sampling distribution under different sample sizes
| Sample Size | Mean | Standard Deviation |
|---|---|---|
| 25 | 0.2287 | 0.0826 |
| 50 | 0.2299 | 0.0578 |
| 100 | 0.2306 | 0.0401 |
We have discussed using sample data to make inference about the population.
A parameter is a number that describes the population. In practice, parameters are unknown because we cannot examine the entire population.
A statistic is a number that can be calculated from sample data without using any unknown parameters. In practice, we use statistics to estimate parameters.
Definition
Let \(X_1, ..., X_n\) be a random sample of size \(n\) from a population and let \(g(x_1, ..., x_n)\) be a function of the realizations of \((X_1, ..., X_n)\). Then the random variable (or random vector) \(T_n = g(X_1, ..., X_n)\) is called a (sample) statistic. The probability distribution of a statistic \(T_n\) is called the sampling distribution of \(T_n\).
We will denote \(t_n = g(x_1, ..., x_n)\) the realization of this random variable for a realization of the sample \((x_1, ..., x_n)\).
Let \(X_1, ..., X_n\) be a random sample from a population with mean \(\mu\) and variance \(\sigma^2 < \infty\). The sample mean is the arithmetic average of the values in a random sample:
\[ \bar{X_n} = \frac{X_1 + X_2 + ... + X_n}{n} = \frac{1}{n} \sum_{i=1}^{n} X_i \]
We denote \(\bar x\) the realization of the random variable \(\bar{X_n}\):
\[ \bar{x} = \frac{x_1 + x_2 + ... + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
Key insight: the sample mean is a random variable! It changes from sample to sample, and therefore has a probability distribution (as we saw earlier).
Question: On average, does \(\bar{X_n}\) equal the population mean \(\mu\)?
Answer: Yes! Let’s prove it:
\[\mathbb{E}[\bar{X_n}] = \mathbb{E}\left[\frac{1}{n} \sum_{i=1}^n X_i\right]\]
Using linearity of expectation:
\[= \frac{1}{n} \sum_{i=1}^n \mathbb{E}[X_i] = \frac{1}{n} \sum_{i=1}^n \mu = \frac{1}{n} \cdot n\mu = \mu\]
Conclusion: \(\boxed{\mathbb{E}[\bar{X_n}] = \mu}\) ✓
We say that the sample mean is an unbiased estimator of \(\mu\)!
Question: How spread out is the distribution of \(\bar{X_n}\)?
Answer: We can calculate the variance of the sample mean, the sampling variance of the sample mean.
\[\mathbb{V}(\bar{X_n}) = \mathbb{V}\left(\frac{1}{n} \sum_{i=1}^n X_i\right) = \frac{1}{n^2} \mathbb{V}\left( \sum_{i=1}^n X_i\right)\]
Since the \(X_i\) are independent and each have variance \(\sigma^2\):
\[= \frac{1}{n^2} \sum_{i=1}^n \mathbb{V}(X_i) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}\]
Conclusion: \(\boxed{\mathbb{V}(\bar{X_n}) = \frac{\sigma^2}{n}}\) ✓
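These two results can be checked with a quick simulation (a sketch; the choices of \(\mu\), \(\sigma\), \(n\), and the seed are arbitrary assumptions):

```r
set.seed(123)  # assumption: any seed works
mu <- 2; sigma <- 3; n <- 40

# 10,000 sample means, each from a sample of size n
xbars <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

mean(xbars)  # close to mu = 2 (unbiasedness)
var(xbars)   # close to sigma^2 / n = 9 / 40 = 0.225
```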
This last result is important: it states that the larger the sample, the more precise our estimate of the mean (as we saw earlier).

| Sample Size | Mean | Standard Deviation | \(\sigma/\sqrt{n}\) |
|---|---|---|---|
| N = 1000 | \(\mu\) = 0.23 | \(\sigma\) = 0.421 | |
| 25 | 0.2287 | 0.0826 | 0.0842 |
| 50 | 0.2299 | 0.0578 | 0.0595 |
| 100 | 0.2306 | 0.0401 | 0.0421 |

The standard deviation of a (sample) statistic has a very important name: the standard error.
It plays a central role in how we make statistical inferences.
Theorem (Weak Law of Large Numbers)
Let \(X_1, ..., X_n\) be i.i.d. random variables with a finite expected value \(\mathbb{E}[X_i] = \mu\) and \(\mathbb{V}(X_i) = \sigma^2 < \infty\). Then, for every \(\epsilon > 0\),
\[ \lim_{n \rightarrow \infty} P(|\bar{X_n} - \mu| < \epsilon) = 1. \]
We say that the sample mean, \(\bar{X_n}\), converges in probability to \(\mu\). In other words, as the sample size increases, the sample mean gets arbitrarily close to the population mean with probability approaching one.
This is a profound result: if you have a large sample, you can be almost certain that the sample mean is close to the population mean!
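A quick illustration of this convergence (a sketch; the Bernoulli parameter 0.23 echoes the Pumpernickel example, and the seed is an assumption):

```r
set.seed(1)  # assumption: any seed works

# Bernoulli(0.23) draws: voting intentions in an ever-growing sample
x <- rbinom(100000, size = 1, prob = 0.23)

# Running sample mean after the first k observations, for each k
running_mean <- cumsum(x) / seq_along(x)
running_mean[c(10, 100, 1000, 10000, 100000)]  # settles towards 0.23
```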
Definition
A (point) estimator, \(\hat{\theta}\), is any function \(g(X_1, ..., X_n)\) of a sample; that is, any (sample) statistic is a (point) estimator.
Example: suppose you are interested in estimating the average income in France. Denoting \(X\) the random variable of income in France, you are effectively interested in \(\mathbb{E}[X]\). What are the desirable properties of estimators of \(\mathbb{E}[X]\)?
\(\rightarrow\) we saw earlier that the sample mean \(\bar{X_n}\) is an unbiased estimator of the population mean.
While unbiasedness is desirable (on average, the estimator equals the true value), it does not guarantee our estimator will usually be close to the true value.
Definition
An estimator \(\hat{\theta}_n\) of \(\theta\) is consistent if it converges in probability to \(\theta\): for every \(\epsilon > 0\), \(\lim_{n \rightarrow \infty} P(|\hat{\theta}_n - \theta| < \epsilon) = 1\).
In words: if we have a large sample, our estimate \(\hat{\theta}\) is very likely close to the true \(\theta\).
\(\rightarrow\) we saw earlier that the WLLN states that the sample mean \(\bar{X_n}\) is a consistent estimator of the population mean.
The sample variance is:
\[ S_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X_n})^2 \]
The sample standard deviation is: \(S_n = \sqrt{S_n^2}\)
Question: Is the sample variance an unbiased estimator of the population variance? (Recall: an estimator \(\hat{\theta}\) is unbiased for \(\theta\) if \(\mathbb{E}[\hat{\theta}] = \theta\).)
Answer: No! Let’s prove it together.
Let \(X_1, ..., X_n\) be i.i.d. random variables with a finite expected value \(E[X_i] = \mu\) and \(\mathbb{V}(X_i) = \sigma^2 < \infty\).
Show that \(\mathbb{E}[S_n^2] = \mathbb{E}[X^2] - \mathbb{E}[\bar{X_n}^2]\).
Show that \(\mathbb{E}[X^2] = \sigma^2 + \mu^2\). Hint: recall the formula for the variance using expectations.
Show that \(\mathbb{E}[\bar{X_n}^2] = \frac{\sigma^2}{n} + \mu^2\). Hint: the formula for \(\mathbb{V}(\bar{X_n})\) will be useful.
Combine the above results to show that \(\mathbb{E}[S_n^2] \neq \sigma^2\).
Load this voter data. Use R to compute the variance of Pumpernickel voters: (1) manually (i.e., taking the sum of deviations from the mean), and (2) using the var command. What do you notice?
Read the sd command’s help file (it is simpler than var’s) to see if you can understand what’s going on.
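A sketch of what the exercise reveals, using a hypothetical stand-in vector since the dataset is linked from the slide (here x plays the role of a 0/1 Pumpernickel indicator):

```r
# Stand-in data (assumption): 0/1 indicator of voting for Pumpernickel
x <- c(rep(1, 23), rep(0, 77))
n <- length(x)

# (1) "Manual" variance: average squared deviation from the mean
var_manual <- sum((x - mean(x))^2) / n

# (2) R's var(): divides by (n - 1), not n
var_r <- var(x)

c(manual = var_manual, var = var_r)  # var() is slightly larger
```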
The bias problem is solved with Bessel’s correction. Instead of dividing by \(n\), we divide by \((n-1)\):
\[ S_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X_n})^2 \]
Now the estimator is unbiased for the population variance:
\[ \boxed{\mathbb{E}[S_n^2] = \sigma^2} \quad ✓ \]
Note: this unbiased estimator of the population variance is what is typically referred to as the sample variance. This is what the R command var (or sd) computes!
The red line consistently underestimates the true variance. This is bias!
The blue line stays centered on the true value. Bessel’s correction works!
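The simulation behind such a plot can be sketched as follows (the values of \(\sigma^2\), \(n\), the number of replications, and the seed are assumptions):

```r
set.seed(2024)  # assumption: any seed works
sigma2 <- 4; n <- 10

# 20,000 samples of size n; both variance estimators computed on each
draws <- replicate(20000, {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  c(biased   = sum((x - mean(x))^2) / n,  # divides by n
    unbiased = var(x))                    # divides by n - 1 (Bessel)
})

rowMeans(draws)  # biased ~ sigma2 * (n-1)/n = 3.6; unbiased ~ sigma2 = 4
```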
We now know how estimators behave across repeated samples: the sample mean is unbiased and consistent, with variance \(\sigma^2/n\).
But here’s the challenge: we usually have only ONE sample.
How do we know how far our single estimate is from the true population parameter?
Answer: The central limit theorem!
The miraculous result: For large samples, the sampling distribution of the sample mean is approximately normal, regardless of the underlying distribution!
Formally: Let \(X_1, ..., X_n\) be i.i.d. random variables with \(\mathbb{E}[X_i] = \mu\) and \(\mathbb{V}(X_i) = \sigma^2 > 0\). Then:
\[ \bar{X_n} \sim \mathcal{N}\left(\mu_{\bar{X_n}}, \sigma_{\bar{X_n}}^2 \right) = \mathcal{N} \left(\mu, \frac{\sigma^2}{n} \right) \quad \text{for large } n \text{ } (\geq 30) \]
We say that the sampling distribution of the sample mean follows a normal distribution with mean \(\mu\) and variance \(\frac{\sigma^2}{n}\) if \(n\) is large.
Or equivalently (standardized mean):
\[ Z_n = \sqrt{n}\frac{\bar{X_n} - \mu}{\sigma} \sim \mathcal{N}(0, 1) \]
The CLT holds regardless of how the original \(X_i\)’s are distributed (uniform, skewed, anything!)
It only requires \(n\) to be large enough (rule of thumb: \(n > 30\))
This lets us use the normal distribution for inference, even without knowing the population distribution
It’s the foundation of most statistical inference methods
See it? Regardless of the original population distribution (uniform, exponential, binomial), as \(n\) increases, the distribution of the sample mean becomes more and more normal! That’s the CLT!
The CLT can just as easily be applied to a sample proportion, \(\hat{p}\). Substitute \(\mu = p\) and \(\sigma^2 = p \cdot (1-p)\) and you obtain:
\[ Z_n = \sqrt{n}\frac{\hat{p} - p}{\sqrt{p \cdot (1-p)}} \sim \mathcal{N}(0, 1) \]
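A quick simulation check of this statement (a sketch; the values of p, n, and the seed are assumptions):

```r
set.seed(5)  # assumption: any seed works
p <- 0.23; n <- 200

# 10,000 sample proportions, each from n Bernoulli(p) draws
p_hat <- rbinom(10000, size = n, prob = p) / n

# Standardize as in the formula above
z <- sqrt(n) * (p_hat - p) / sqrt(p * (1 - p))

c(mean = mean(z), sd = sd(z))  # close to 0 and 1, as N(0, 1) predicts
```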
R and normal distributions
We can use the normal distribution to find percentiles or probabilities. Without computers, we use a normal probability table, which lists z-scores and associated percentiles. With R, the pnorm function gives us the probability associated with any cutoff on a normal curve. In the same way, we can get the z-score associated with a given probability using the qnorm function.
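For example (the cutoff values are chosen purely for illustration):

```r
pnorm(1.96)    # about 0.975: area to the left of z = 1.96
qnorm(0.975)   # about 1.96: pnorm and qnorm are inverses
pnorm(0.5, mean = 1, sd = 2)  # works for any normal, not just the standard one
```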
Use pnorm to confirm the figure on slide 65, i.e., within +/- 1 (2) standard deviation of the mean, the standard normal distribution contains 68% (95%) of observations.
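One way to check this:

```r
# Share of a standard normal within 1 and 2 standard deviations of the mean
pnorm(1) - pnorm(-1)   # about 0.68
pnorm(2) - pnorm(-2)   # about 0.95
```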
A region has household incomes with mean \(\mu\) = 58,000 euros and standard deviation \(\sigma\) = 15,000 euros. An economist surveys 100 randomly selected households. What is the probability that the sample mean income is less than 55,000 euros?
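A worked answer, using the CLT result that \(\bar{X_n} \sim \mathcal{N}(\mu, \sigma^2/n)\):

```r
mu <- 58000; sigma <- 15000; n <- 100

# Standard error of the sample mean
se <- sigma / sqrt(n)      # 1500

# z-score of 55,000 and the corresponding probability
z <- (55000 - mu) / se     # -2
pnorm(z)                   # about 0.023

# Equivalently, without standardizing by hand
pnorm(55000, mean = mu, sd = se)
```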
If \(n\) is small and/or \(\sigma\) is unknown, the standardized sample mean follows a Student \(t\) distribution instead of normal.
\[ T = \frac{\bar{X_n} - \mu}{S_n/\sqrt{n}} \sim t_{n-1} \] where \(n-1\) is the number of degrees of freedom.
The \(t\) distribution has heavier tails than normal (more uncertainty), which makes sense for small samples.
When \(n \to \infty\), the \(t\) distribution converges to the normal distribution.
Key insight: For small samples (small \(df\)), the \(t\)-distribution has heavier tails than normal, meaning more extreme values are more likely. This gives wider confidence intervals and more conservative tests—appropriate when we have less information!
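This can be seen directly by comparing quantiles with qt() and qnorm() (the degrees-of-freedom values are chosen for illustration):

```r
# 97.5th percentile cutoffs: t distribution vs standard normal
qt(0.975, df = 5)     # about 2.57: heavier tails, wider intervals
qt(0.975, df = 30)    # about 2.04
qt(0.975, df = 1000)  # about 1.96: close to the normal cutoff
qnorm(0.975)          # 1.96
```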
The Student \(t\) distribution was developed by W. S. Gosset in 1908.
Gosset worked for the Guinness brewery, which would not permit him to publish research in his own name. So he used the pseudonym “Student.”
Thus, the distribution is known as the Student \(t\) distribution—named after a pseudonym!
The \(t\) distribution is crucial for small-sample inference and remains one of the most widely used distributions in statistics.
Lecture 3: Sampling and Inferential Statistics