Lecture 3: Sampling and Inferential Statistics

L2 - Statistics

Gustave Kenedi

2026-02-06

Recap quiz


Today’s lecture

1. Sampling: an example

2. Statistical inference

3. Quantifying uncertainty

Sampling: an example

Political polling

Imagine you are the campaign manager of a new political candidate: Pumpernickel. She wants to know the share of the population who will vote for her. How would you proceed?

  1. Ask every single voter in the country who they intend to vote for?

\(\rightarrow\) Too costly: only survey a fraction or sample of the population.

  2. What characteristics should the sample have?

\(\rightarrow\) How many people do you survey? How would you select them?

  3. Once you’ve conducted your survey, would you get the exact share of voters?

\(\rightarrow\) Probably not: you obtain an estimate of this share.

Population vs. sample

Definitions

  • Population: the entire group about which we want information, denoted \(N\).

  • Sample: the fraction of the population for which we collect information, denoted \(n\).

  • Sampling rate: ratio between the sample size and the population size, denoted \(t = n/N\).

  • Sampling error: the error incurred when the statistical characteristics of a population are estimated from a sample of that population.

Goal: use information from the sample to draw conclusions about the population as a whole. In order to ensure accurate inferences about the population, the sample needs to be representative of the population.

Sampling

When it is costly to collect information on the whole population, and you want to know some characteristics of a population, you only survey a sample of individuals.

Two main challenges:

  1. Selecting the sample

  2. Measuring the level of uncertainty on the statistics you get

Selection and sampling

How to ensure your sample is representative?

  • For you, the easiest way of surveying people could be to do so at the Cergy RER A station and ask people as they exit the station.

\(\rightarrow\) Does it sound like a good idea? Why?

  • Very likely that this is a biased sample, in the sense that it is not representative of the population of interest (French voters)
  • The best way of making a representative sample of the population of interest is to randomly select the individuals who are included in the sample

    – You can do so if you have access to the exhaustive list of individuals in the population of interest: the sample frame

    – Having access to the exhaustive list of observations of a population is most of the time difficult

Random sample

Definition

The random variables \(X_1, ..., X_n\) are called a random sample of size n from a population if \(X_1, ..., X_n\) are mutually independent random variables and each \(X_i\) has the same probability distribution. Alternatively, \(X_1, ..., X_n\) are called independent and identically distributed random variables. This is commonly abbreviated to i.i.d. random variables.
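As a quick sketch (the population here is assumed for illustration, not part of the lecture's data): in base R, sample() with replace = TRUE produces i.i.d. draws from a finite population. Note that a simple random sample without replacement is not strictly independent, though it is nearly so when \(n\) is small relative to \(N\).

```r
# Sketch: i.i.d. draws from an assumed population of 1,000 voters,
# 23% of whom support Pumpernickel.
set.seed(42)  # for reproducibility
population <- rep(c("Pumpernickel", "Other candidate"), times = c(230, 770))
x <- sample(population, size = 20, replace = TRUE)  # i.i.d. draws
mean(x == "Pumpernickel")  # one realization of the sample proportion
```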

Random sampling

Let’s imagine the population of interest is 1,000 individuals.

We make some simplifying assumptions:

  • individuals truthfully report their voting intentions

  • their current preferences correspond to their preferences on election day

  • we know that Pumpernickel’s vote share is 23%

    – 0.23 is the true population proportion (denoted \(p\))

Random sampling

Let’s imagine the population of interest is 1,000 individuals.

  • In practice, you do not know the true population parameter.

  • Let’s survey 20 individuals

  • We ensure they are selected at random

  • How many report they will vote for Pumpernickel?


5 out of 20 report they will vote for Pumpernickel: 5/20 = 25%.

0.25 can be thought of as our guess of the vote share for Pumpernickel.

Parameters and statistics

  • We denote our guess of the vote share for Pumpernickel \(\hat{p}\).
  • The hat symbol \(\hat{\ }\) denotes that this statistic has been estimated.
  • It corresponds to the estimate of the population parameter \(p\), which is the true vote share for Pumpernickel.

The goal for the class is for the estimates to be good guesses for the population parameters:

\[ \text{Data} \rightarrow \text{Calculation} \rightarrow \hat{\text{Estimates}} \xrightarrow{\text{hopefully!}} \text{Truth} \]

For example,

\[ X \rightarrow \frac{1}{n}\sum_{i=1}^{n}X_i \rightarrow \bar{x} \xrightarrow{\text{hopefully!}} \mu \]
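In base R, the Calculation step for the sample mean is just mean() (the numbers below are illustrative, with 1 coding a Pumpernickel voter):

```r
# Data -> Calculation -> Estimate, with 1 = Pumpernickel voter, 0 = other.
data <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0)
x_bar <- sum(data) / length(data)  # (1/n) * sum_i x_i
x_bar              # the estimate: here 6/20 = 0.3
mean(data)         # mean() performs exactly this calculation
```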

Sampling variation

  • What would happen if we took a new sample? Would we also get 5 Pumpernickel voters as before?

  • What if we repeated this survey multiple times?

  • Probably not. The samples will vary from draw to draw.

  • Key to this observation: these are randomly drawn samples.


Your turn! #1

  1. Create a data.frame containing the proportions of Pumpernickel voters from the previous slide. Name it vote_share and name the variable containing the proportions prop_votes. (Hint: to create a data.frame you need to use the data.frame() function.) The proportions are as follows: (0.25, 0.25, 0.35, 0.20, 0.20, 0.35, 0.1, 0.35, 0.35, 0.2, 0.25, 0.25, 0.2, 0.15, 0.25, 0.2, 0.15, 0.15, 0.2, 0.1).

  2. Create a histogram of these proportions using ggplot2.

  3. What do you observe?

Sampling distribution: histogram

What did we just do?

  • Demonstrated the statistical concept of sampling

  • Objective: know the proportion of Pumpernickel voters, \(p\)

  • Methods:

    • Census: time-consuming (and in many cases very costly);

    • Random Sampling: extract a sample of 20 voters from the population to obtain an estimate, \(\hat{p}\). Our first estimate of the proportion of Pumpernickel voters was 0.25, and it turned out to be around the mean of our other estimates.

  • Important: each sample was drawn randomly \(\rightarrow\) samples are different from each other!
    \(\rightarrow\) different proportions 👉 sampling variation

Lesson 2 of statistics: sampling variation is everywhere!

This graph applies to all estimates:

Taking random samples

  • All the voters in the population are stored in a csv file here
# voter_population <- read.csv("https://www.dropbox.com/scl/fi/bizhr0jjvrzm4kyhggvos/population_clean.csv?rlkey=ryeswyb5mgbvvdqaptwtleybj&dl=1")

voter_population <- read.csv("data/population_clean.csv")

head(voter_population)
  voter_id       candidate
1        1 Other candidate
2        2 Other candidate
3        3 Other candidate
4        4 Other candidate
5        5    Pumpernickel
6        6 Other candidate

2 variables:

  • voter_id: voter identifier

  • candidate: candidate of voter

nrow(voter_population)
[1] 1000
  • Let’s take virtual draws from our voter population.

  • We’ll use the virtual shovel to take a sample of 50 voters from our voting population.

Using a virtual shovel once

  • We will take a first sample, this time of size 50, using the moderndive function rep_sample_n.
library(moderndive) # load moderndive package, remember to install before if not already done

virtual_shovel <- voter_population |>  # notice that moderndive functions can be piped
  rep_sample_n(size = 50) # take a sample of 50 voters
# display the sample's first 6 rows
head(virtual_shovel)
# A tibble: 6 × 3
# Groups:   replicate [1]
  replicate voter_id candidate      
      <int>    <int> <chr>          
1         1      287 Other candidate
2         1      466 Other candidate
3         1      530 Other candidate
4         1      995 Other candidate
5         1      488 Other candidate
6         1      814 Pumpernickel   
  • Column replicate tells us the ID of the sample. Here: 1.
# number of observations in sample
nrow(virtual_shovel)
[1] 50

Proportion of Pumpernickel voters

sample_1 <- virtual_shovel |> 
  summarise(
    # number of observations in sample
    n_sample = n(),
    # number of Pumpernickel voters in sample
    n_pumpernickel = sum(candidate == "Pumpernickel")) |> 
  mutate(
    # proportion of Pumpernickel voters in sample
    prop_pumpernickel = n_pumpernickel / n_sample)
sample_1
# A tibble: 1 × 4
  replicate n_sample n_pumpernickel prop_pumpernickel
      <int>    <int>          <int>             <dbl>
1         1       50             13              0.26
  1. Compute:
  • number of observations in sample (i.e. 50 in this case)
  • number of Pumpernickel voters in sample
  2. Compute proportion of Pumpernickel voters

👉 26% of our sample are Pumpernickel voters! This is an estimate of the proportion of Pumpernickel voters in the population. What if we try again?

What if we try many times, like, 100 times?

Using the virtual shovel 100 times

100 samples (replicates) of size 50.

virtual_samples <- voter_population |> 
  # get 100 samples of size 50
  rep_sample_n(size = 50, reps = 100)
virtual_samples
# A tibble: 5,000 × 3
# Groups:   replicate [100]
   replicate voter_id candidate      
       <int>    <int> <chr>          
 1         1      912 Other candidate
 2         1      551 Other candidate
 3         1      352 Other candidate
 4         1      833 Other candidate
 5         1      862 Pumpernickel   
 6         1      826 Other candidate
 7         1      217 Pumpernickel   
 8         1      696 Pumpernickel   
 9         1      683 Other candidate
10         1      451 Other candidate
# ℹ 4,990 more rows

Compute the proportion of Pumpernickel voters in each sample.

virtual_prop_pumpernickel <- virtual_samples |> 
  group_by(replicate) |> 
  summarise(prop_pumpernickel = mean(candidate == "Pumpernickel"))
virtual_prop_pumpernickel
# A tibble: 100 × 2
   replicate prop_pumpernickel
       <int>             <dbl>
 1         1              0.26
 2         2              0.2 
 3         3              0.2 
 4         4              0.28
 5         5              0.28
 6         6              0.3 
 7         7              0.3 
 8         8              0.26
 9         9              0.3 
10        10              0.22
# ℹ 90 more rows

Sampling variation

  • Just as before, the virtual sampler also creates random samples.

  • The prop_pumpernickel column in the virtual_prop_pumpernickel data.frame differs across samples.

  • And again, we can visualize the sampling distribution:

virtual_prop_pumpernickel |> 
  ggplot(aes(x = prop_pumpernickel)) +
  geom_histogram(fill = "#DC267F") +
  labs(x = "Proportion of Pumpernickel voters",
       y = "Frequency",
       title = "Distribution of 100 samples of size 50") +
  theme_minimal(base_size = 16)

Your turn! #2

Instead of taking only 100 samples, let’s take 1000!

  1. Load the data into an object voter_population

  2. Obtain 1,000 samples of size 50 using the rep_sample_n() function from the moderndive package.

  3. Calculate the proportion of Pumpernickel voters in each sample.

  4. Plot a histogram of the obtained voter shares in each sample.

  5. What do you observe? Which voter shares occur most frequently? How does the shape of the histogram compare to when we took only 100 samples?

  6. How likely is it that we sample 50 voters of which less than 15% are Pumpernickel voters?

Sampling distribution of 1,000 samples

Looks remarkably close to a normal distribution \(\rightarrow\) the more samples we take, the more closely their histogram resembles the underlying sampling distribution, which is approximately normal

Role of sample size

Imagine you could change the size of your samples and had the option of the following sizes: 25, 50 and 100.

If your goal is still to estimate the share of Pumpernickel voters in the population, which sample size would you choose?

Role of sample size

Let’s repeat what we did previously but for different sample sizes.

Let’s take 1000 samples each for \(n = 25\), \(n = 50\), \(n = 100\).

We will use rep_sample_n() again.

Generate all samples of different sizes:

# Sample size: 25
virtual_samples_25 <- voter_population |>  
  rep_sample_n(size = 25, reps = 1000)

# Sample size: 50
virtual_samples_50 <- voter_population |>  
  rep_sample_n(size = 50, reps = 1000)

# Sample size: 100
virtual_samples_100 <- voter_population |>  
  rep_sample_n(size = 100, reps = 1000)

Compute proportion of Pumpernickel voters

# Sample size: 25
virtual_prop_voters_25 <- virtual_samples_25 |> 
  group_by(replicate) |> 
  summarise(prop_pumpernickel = mean(candidate == "Pumpernickel"),
            sample_n = n())

Role of sample size

Comparison of sampling distribution under different sample sizes

Sample size and sampling distributions

  • The larger the sample size, the narrower the resulting sampling distribution.
  • In other words, there are fewer differences due to sampling variation.
  • Holding constant the number of replicates (i.e. 1000 in our case), bigger samples will yield normal distributions with smaller standard deviations.
  Sample Size     Mean   Standard Deviation
           25   0.2287               0.0826
           50   0.2299               0.0578
          100   0.2306               0.0401
  • Remember that the standard deviation measures the spread of a variable around its mean.
  • So as the sample size increases, our estimates of the true proportion of Pumpernickel voters get more precise.
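The shrinking spread can be checked by simulation; here is a sketch in base R (the population is assumed: 1,000 voters, 23% of whom support Pumpernickel):

```r
# Simulate 1,000 sample proportions for each sample size and compare
# the standard deviations of the resulting sampling distributions.
set.seed(1)
population <- rep(c(1, 0), times = c(230, 770))  # 1 = Pumpernickel voter
sd_of_props <- function(n, reps = 1000) {
  sd(replicate(reps, mean(sample(population, size = n))))
}
sapply(c(25, 50, 100), sd_of_props)  # spread shrinks as n grows
```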

Today’s lecture

1. Sampling: an example

2. Statistical inference

3. Quantifying uncertainty

Statistical inference

Parameter and statistics

We have discussed using sample data to make inference about the population.

A parameter is a number that describes the population. In practice, parameters are unknown because we cannot examine the entire population.

A statistic is a number that can be calculated from sample data without using any unknown parameters. In practice, we use statistics to estimate parameters.

Definition

Let \(X_1, ..., X_n\) be a random sample of size \(n\) from a population and let \(g(x_1, ..., x_n)\) be a function of the realizations of \((X_1, ..., X_n)\). Then the random variable (or random vector) \(T_n = g(X_1, ..., X_n)\) is called a (sample) statistic. The probability distribution of a statistic \(T_n\) is called the sampling distribution of \(T_n\).

We will denote \(t_n = g(x_1, ..., x_n)\) the realization of this random variable for a realization of the sample \((x_1, ..., x_n)\).

Sample mean: definition

Let \(X_1, ..., X_n\) be a random sample from a population with mean \(\mu\) and variance \(\sigma^2 < \infty\). The sample mean is the arithmetic average of the values in a random sample:

\[ \bar{X_n} = \frac{X_1 + X_2 + ... + X_n}{n} = \frac{1}{n} \sum_{i=1}^{n} X_i \]

We denote \(\bar x\) the realization of the random variable \(\bar{X_n}\):

\[ \bar{x} = \frac{x_1 + x_2 + ... + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

Key insight: the sample mean is a random variable! It changes from sample to sample, and therefore has a probability distribution (as we saw earlier).

Sample mean: properties

Question: On average, does \(\bar{X_n}\) equal the population mean \(\mu\)?

Answer: Yes! Let’s prove it:

\[\mathbb{E}[\bar{X_n}] = \mathbb{E}\left[\frac{1}{n} \sum_{i=1}^n X_i\right]\]

Using linearity of expectation:

\[= \frac{1}{n} \sum_{i=1}^n \mathbb{E}[X_i] = \frac{1}{n} \sum_{i=1}^n \mu = \frac{1}{n} \cdot n\mu = \mu\]

Conclusion: \(\boxed{\mathbb{E}[\bar{X_n}] = \mu}\)

We say that the sample mean is an unbiased estimator of \(\mu\)!
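A quick simulation check of unbiasedness (a sketch; the Bernoulli population with p = 0.23 is assumed for illustration): averaging many sample means recovers \(\mu\).

```r
# E[X-bar] = mu: average 10,000 sample means of Bernoulli(0.23) samples.
set.seed(2)
mu <- 0.23
xbars <- replicate(10000, mean(rbinom(50, size = 1, prob = mu)))
mean(xbars)  # very close to 0.23
```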

Sample mean: precision

Question: How spread out is the distribution of \(\bar{X_n}\)?

Answer: We can calculate the variance of the sample mean, the sampling variance of the sample mean.

\[\mathbb{V}(\bar{X_n}) = \mathbb{V}\left(\frac{1}{n} \sum_{i=1}^n X_i\right) = \frac{1}{n^2} \mathbb{V}\left( \sum_{i=1}^n X_i\right)\]

Since the \(X_i\) are independent and each have variance \(\sigma^2\):

\[= \frac{1}{n^2} \sum_{i=1}^n \mathbb{V}(X_i) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}\]

Conclusion: \(\boxed{\mathbb{V}(\bar{X_n}) = \frac{\sigma^2}{n}}\)
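The same simulation idea verifies the variance formula (again a sketch with an assumed Bernoulli population, so that \(\sigma^2 = p(1-p)\)):

```r
# V(X-bar) = sigma^2 / n, checked by simulation.
set.seed(3)
p <- 0.23; n <- 50
xbars <- replicate(10000, mean(rbinom(n, size = 1, prob = p)))
c(simulated = var(xbars), theory = p * (1 - p) / n)  # should roughly agree
```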

Variance of sample mean

This last result is important: it states that the larger the sample, the more precise our estimate of the mean (as we saw earlier).

  Population: N = 1000, \(\mu\) = 0.23, \(\sigma\) = 0.421

  Sample Size     Mean   Standard Deviation   \(\sigma/\sqrt{n}\)
           25   0.2287               0.0826                0.0842
           50   0.2299               0.0578                0.0595
          100   0.2306               0.0401                0.0421

The standard deviation of a statistic has a very important name: the standard error.


It plays a central role in how we make statistical inferences.

The weak law of large numbers (WLLN)

Definition

Let \(X_1, ..., X_n\) be i.i.d. random variables with a finite expected value \(\mathbb{E}[X_i] = \mu\) and \(\mathbb{V}(X_i) = \sigma^2 < \infty\). Then, for every \(\epsilon > 0\),

\[ \lim_{n \rightarrow \infty} P(|\bar{X_n} - \mu| < \epsilon) = 1. \]

We say that the sample mean, \(\bar{X_n}\), converges in probability to \(\mu\). In other words, as the sample size grows, the sample mean gets arbitrarily close to the population mean with probability approaching 1.

This is a profound result: with a large enough sample, you can be almost certain that the sample mean is very close to the population mean!
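A sketch of the WLLN in base R: the running mean of i.i.d. Bernoulli(0.23) draws settles down near 0.23 (the success probability is an assumption for illustration).

```r
# Running sample mean of i.i.d. Bernoulli(0.23) draws.
set.seed(4)
x <- rbinom(10000, size = 1, prob = 0.23)
running_mean <- cumsum(x) / seq_along(x)
running_mean[c(10, 100, 1000, 10000)]  # drifts toward 0.23 as n grows
```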

Illustration: WLLN in action

Properties of estimators

Definition

A (point) estimator, \(\hat{\theta}\), is any function \(g(X_1, ..., X_n)\) of a sample; that is, any (sample) statistic is a (point) estimator.

Example: suppose you are interested in estimating the average income in France. Denoting \(X\) the random variable of income in France, you are effectively interested in \(\mathbb{E}[X]\). What are the desirable properties of estimators of \(\mathbb{E}[X]\)?

  1. Unbiasedness: an estimator, \(\hat{\theta}\), is unbiased for \(\theta\) if \(\mathbb{E}[\hat{\theta}] = \theta\). The bias of \(\hat{\theta}\) in estimating \(\theta\) is equal to \(B = \mathbb{E}[\hat{\theta}] - \theta\).

\(\rightarrow\) we saw earlier that the sample mean \(\bar{X_n}\) is an unbiased estimator of the population mean.

While unbiasedness is desirable (on average, the estimator equals the true value), it does not guarantee our estimator will usually be close to the true value.

  2. Consistency: an estimator \(\hat{\theta}\) is consistent for \(\theta\) if \(\hat{\theta}\) converges in probability to \(\theta\).

In words: if we have a large sample, our estimate \(\hat{\theta}\) is very likely close to the true \(\theta\).

\(\rightarrow\) as we saw earlier, the WLLN states that the sample mean \(\bar{X_n}\) is a consistent estimator of the population mean.

Sample variance and standard deviation: definitions

The sample variance is:

\[ S_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 \]

The sample standard deviation is: \(S_n = \sqrt{S_n^2}\)

Sample variance properties

Question: Is the sample variance an unbiased estimator of the population variance? (Recall: an estimator \(\hat{\theta}\) is unbiased for \(\theta\) if \(\mathbb{E}[\hat{\theta}] = \theta\).)

Answer: No! Let’s prove it together.

Your turn! #3

Let \(X_1, ..., X_n\) be i.i.d. random variables with a finite expected value \(E[X_i] = \mu\) and \(\mathbb{V}(X_i) = \sigma^2 < \infty\).

  1. Show that \(\mathbb{E}[S_n^2] = \mathbb{E}[X^2] - \mathbb{E}[\bar{X_n}^2]\).

  2. Show that \(\mathbb{E}[X^2] = \sigma^2 + \mu^2\). Hint: recall the formula for the variance using expectations.

  3. Show that \(\mathbb{E}[\bar{X_n}^2] = \frac{\sigma^2}{n} + \mu^2\). Hint: the formula for \(\mathbb{V}(\bar{X_n})\) will be useful.

  4. Combine the above results to show that \(\mathbb{E}[S_n^2] \neq \sigma^2\).

  5. Load this voter data. Use R to compute the variance of Pumpernickel voters: (1) manually (i.e., taking the sum of deviations from the mean), and (2) using the var command. What do you notice?

  6. Read the help sd (it is simpler than var) command’s help file to see if you can understand what’s going on.

Bessel’s correction

The bias problem is solved with Bessel’s correction. Instead of dividing by \(n\), we divide by \((n-1)\):

\[ S_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X_n})^2 \]

Now the estimator is unbiased for the population variance:

\[ \boxed{\mathbb{E}[S_n^2] = \sigma^2} \quad ✓ \]

Note: this unbiased estimator of the sample variance is typically referred to as the sample variance. This is what the R command var (or sd) computes!
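You can verify this directly in base R (illustrative numbers): the manual 1/n computation disagrees with var(), which divides by n - 1.

```r
# Biased (1/n) vs. Bessel-corrected (1/(n-1)) variance on a toy vector.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
biased   <- sum((x - mean(x))^2) / n        # divides by n
unbiased <- sum((x - mean(x))^2) / (n - 1)  # divides by n - 1
c(biased = biased, unbiased = unbiased, var = var(x))  # var() = unbiased
```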

Illustration: the biased estimator alone

The red line consistently underestimates the true variance. This is bias!

Illustration: Bessel’s correction fixes the bias

The blue line stays centered on the true value. Bessel’s correction works!

Today’s lecture

1. Sampling: an example

2. Statistical inference

3. Quantifying uncertainty

Quantifying uncertainty

The big question

We now know:

  1. ✓ The importance of random sampling
  2. ✓ That samples vary: sampling variation
  3. ✓ That larger samples are more precise
  4. ✓ What makes a “good” estimator (unbiasedness, consistency)

But here’s the challenge: we usually have only ONE sample.

How do we know how far our single estimate is from the true population parameter?

Answer: The central limit theorem!

Central limit theorem

The miraculous result: For large samples, the sampling distribution of the sample mean is approximately normal, regardless of the underlying distribution!

Formally: Let \(X_1, ..., X_n\) be i.i.d. random variables with \(\mathbb{E}[X_i] = \mu\) and \(\mathbb{V}(X_i) = \sigma^2 > 0\). Then:

\[ \bar{X_n} \sim \mathcal{N}\left(\mu_{\bar{X_n}}, \sigma_{\bar{X_n}}^2 \right) = \mathcal{N} \left(\mu, \frac{\sigma^2}{n} \right) \quad \text{for large } n \text{ } (\geq 30) \]

We say that the sampling distribution of the sample mean follows a normal distribution with mean \(\mu\) and variance \(\frac{\sigma^2}{n}\) if \(n\) is large.

Or equivalently (standardized mean):

\[ Z_n = \sqrt{n}\frac{\bar{X_n} - \mu}{\sigma} \sim \mathcal{N}(0, 1) \]
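A simulation sketch (an exponential population is assumed for illustration, so the \(X_i\) are heavily skewed): the standardized sample means behave like draws from \(\mathcal{N}(0, 1)\).

```r
# CLT check: standardized means of Exp(1) samples (mu = 1, sigma = 1).
set.seed(5)
n <- 100
xbars <- replicate(10000, mean(rexp(n, rate = 1)))
z <- sqrt(n) * (xbars - 1) / 1    # Z_n = sqrt(n) * (xbar - mu) / sigma
c(mean = mean(z), sd = sd(z))     # close to 0 and 1 despite skewed X_i
```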

A quick refresher on the (standard) normal distribution

Why the CLT is magical

  • The CLT holds regardless of how the original \(X_i\)’s are distributed (uniform, skewed, anything!)

  • It only requires \(n\) to be large enough (rule of thumb: \(n > 30\))

  • This lets us use the normal distribution for inference, even without knowing the population distribution

  • It’s the foundation of most statistical inference methods

CLT in action: non-normal becomes normal!

See it? Regardless of the original population distribution (uniform, exponential, binomial), as \(n\) increases, the distribution of the sample mean becomes more and more normal! That’s the CLT!

CLT for proportions

The CLT can just as easily be applied to a sample proportion, \(\hat{p}\). Substitute \(\mu = p\) and \(\sigma^2 = p \cdot (1-p)\) and you obtain:

\[ Z_n = \sqrt{n}\frac{\hat{p} - p}{\sqrt{p \cdot (1-p)}} \sim \mathcal{N}(0, 1) \]
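For example, the normal approximation lets us attach a probability to a sample proportion falling below some cutoff (here 15%, with p = 0.23 and n = 50 taken from the earlier polling example):

```r
# P(p-hat < 0.15) for n = 50 draws with true p = 0.23, via the CLT.
p <- 0.23; n <- 50
se <- sqrt(p * (1 - p) / n)     # standard error of p-hat
pnorm(0.15, mean = p, sd = se)  # roughly 0.09
```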

R and normal distributions

We can use the normal distribution to find percentiles or probabilities. Without computers, we use a normal probability table, which lists Z scores and associated percentiles.

R and normal distributions

We can use the normal distribution to find percentiles or probabilities. With R, the pnorm function gives us the probability associated with any cutoff on a normal curve.

pnorm(q = 1, mean = 0, sd = 1)
[1] 0.8413447
openintro::normTail(m = 0, s = 1, L = 1)

R and normal distributions

We can use the normal distribution to find percentiles or probabilities. With R, the pnorm function gives us the probability associated with any cutoff on a normal curve.

pnorm(q = 0, mean = 0, sd = 1)
[1] 0.5
openintro::normTail(m = 0, s = 1, L = 0)

R and normal distributions

We can use the normal distribution to find percentiles or probabilities. In the same way, we can get the z score associated with a given probability using the qnorm function.

qnorm(p = 0.8, mean = 0, sd = 1)
[1] 0.8416212
openintro::normTail(m = 0, s = 1, L = 0.8416)

Your turn! #4

  1. Use pnorm to confirm the figure on slide 65, i.e., within +/- 1 (2) standard deviation of the mean, the standard normal distribution contains 68% (95%) of observations.

  2. A region has household incomes with mean \(\mu\) = 58,000 euros and standard deviation \(\sigma\) = 15,000 euros. An economist surveys 100 randomly selected households. What is the probability that the sample mean income is less than 55,000 euros?

  3. A country’s true unemployment rate is 6.2%. A labor economist draws a random sample of 1,000 working-age adults. What is the probability that the sample unemployment rate falls between 5% and 7%?

  4. Monthly consumer spending at retail stores has mean \(\mu\) = 320 euros and standard deviation \(\sigma\) = 85 euros. A retail analyst collects data from 50 randomly selected consumers. Below what value does the lowest 10% of sample mean spending fall?

What about (small) samples?

If \(n\) is small and/or \(\sigma\) is unknown (and must be estimated by \(S\)), the standardized sample mean follows a Student \(t\) distribution instead of the normal.

\[ T = \frac{\bar{X_n} - \mu}{S/\sqrt{n}} \sim t_{n-1} \] where \(n-1\) is the degrees of freedom.

The \(t\) distribution has heavier tails than normal (more uncertainty), which makes sense for small samples.

When \(n \to \infty\), the \(t\) distribution converges to the normal distribution.
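You can see the convergence in the critical values: qt() approaches qnorm() as the degrees of freedom grow.

```r
# 97.5th-percentile critical values: t vs. standard normal.
sapply(c(5, 30, 1000), function(df) qt(0.975, df = df))  # 2.571 2.042 1.962
qnorm(0.975)                                             # 1.960
```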

Illustration: t vs. Normal for different sample sizes

Key insight: For small samples (small \(df\)), the \(t\)-distribution has heavier tails than normal, meaning more extreme values are more likely. This gives wider confidence intervals and more conservative tests—appropriate when we have less information!

Fun fact: History of the t distribution

The Student \(t\) distribution was developed by W. S. Gosset in 1908.

Gosset worked for the Guinness brewery, which would not permit him to publish research in his own name. So he used the pseudonym “Student.”

Thus, the distribution is known as the Student \(t\) distribution—named after a pseudonym!

The \(t\) distribution is crucial for small-sample inference and remains one of the most widely used distributions in statistics.

See you next week!