# Applied Data Analysis for Public Policy Studies
## Sampling
### Michele Fioretti
### SciencesPo Paris <br> 2022-08-29

---

# Recap from Last Week

* *Multiple Linear Regression Model*: `$y_i = b_0 + b_1 x_{1,i} + \dots + b_k x_{k,i} + e_i$`

* Interpretation: effect holding all other independent variables constant

* Important extensions: *standardized regression*, *log models*, *interaction terms*

## Today<sup>1</sup>

.footnote[
[1]: This lecture is very heavily based on the wonderful [sampling chapter](https://moderndive.com/7-sampling.html) of [ModernDive](https://moderndive.com/)]

* Fun activity to discover *sampling*, sampling variation and sampling distributions.

* Sampling terminology: population, sample, population parameter, point estimate or sample statistic, etc.

* Definition of an ***unbiased estimator***.

* Fundamental statistical theorem for inference: ***Central Limit Theorem***.

---

# What's the Proportion of Green Pasta?

.center[
<img src="../img/photos/pasta/pasta_bowl.JPG" width="600px" style="display: block; margin: auto;" />
]

We could count every green pasta but that would be tedious! 😩 What else could we do?

---

# Sampling

* We took a sample of 20 pasta from the bowl, making sure to select them at **random**.

* Here is what we found.

Color | Count | Proportion
:------:|:------:|:--------:
Green   |  14        |  0.70
Red   |  5     |   0.25
Yellow   |  1    |     0.05

* 0.70 can be thought of as our guess of the proportion of green pasta in the entire bowl.

---

# Sampling Variation

* What would happen if we took a *new* sample (putting the 20 previous pasta back in the bowl)? Would we also get 14 *greens* as before?

* What if we repeated this activity multiple times?

* Probably not. The samples will vary from draw to draw.

* Why?

* Key to this observation: these are ***randomly*** drawn samples.

---

# Taking 18 Samples (One per Student)

* Because we can't do this activity in class, we drew 18 samples of 20 pasta *(with replacement)*.

* This is what each looked like:

![:scale 100%](../img/photos/pasta/sample1.JPG) | ![:scale 100%](../img/photos/pasta/sample2.JPG) | ![:scale 100%](../img/photos/pasta/sample3.JPG) | ![:scale 100%](../img/photos/pasta/sample4.JPG) | ![:scale 100%](../img/photos/pasta/sample5.JPG) | ![:scale 100%](../img/photos/pasta/sample6.JPG)
:------:|:------:|:--------:|:--------:|:--------:|:--------:
![:scale 100%](../img/photos/pasta/sample7.JPG) | ![:scale 100%](../img/photos/pasta/sample8.JPG) | ![:scale 100%](../img/photos/pasta/sample9.JPG) | ![:scale 100%](../img/photos/pasta/sample10.JPG) | ![:scale 100%](../img/photos/pasta/sample11.JPG) | ![:scale 100%](../img/photos/pasta/sample12.JPG)
![:scale 100%](../img/photos/pasta/sample13.JPG) | ![:scale 100%](../img/photos/pasta/sample14.JPG) | ![:scale 100%](../img/photos/pasta/sample15.JPG) | ![:scale 100%](../img/photos/pasta/sample16.JPG) | ![:scale 100%](../img/photos/pasta/sample17.JPG) | ![:scale 100%](../img/photos/pasta/sample18.JPG)

---

# Taking 18 Samples (One per Student)

* Because we can't do this activity in class, we drew 18 samples of 20 pasta (with replacement) at home.

* For each sample, we computed the share of green pasta.

.pull-left[
Sample # | Count | Proportion
:------:|:------:|:--------:
1 | 14 | 0.70
2 | 14 | 0.70
3 | 10 | 0.50
4 | 10 | 0.50
5 | 6 | 0.30
6 | 10 | 0.50
7 | 8 | 0.40
8 | 9 | 0.45
9 | 11 | 0.55
]

.pull-right[
Sample # | Count | Proportion
:------:|:------:|:--------:
10 | 8 | 0.40
11 | 7 | 0.35
12 | 9 | 0.45
13 | 9 | 0.45
14 | 14 | 0.70
15 | 11 | 0.55
16 | 10 | 0.50
17 | 7 | 0.35
18 | 13 | 0.65
]

---

# Task 1 (10 minutes)

1. Create a data.frame containing the proportions of green pasta from the previous slide. Name it `pasta` and name the variable containing the proportions `prop_green`. (Hint: to create a data.frame you need to use the `data.frame()` function.)

1. Create a histogram of these proportions using `ggplot2`. Use these parameters in `geom_histogram()`: `boundary = 0.325, binwidth = 0.05`.

1. What do you observe?

---

# Sampling Distribution: Histogram

.pull-right[
<img src="chapter6_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />
]
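
One possible way to work through Task 1 is sketched below. It assumes `ggplot2` is installed; the proportions are copied from the table of 18 samples, and the axis labels are only illustrative.

```r
library(ggplot2)

# Proportion of green pasta in each of the 18 hand-drawn samples of 20
pasta <- data.frame(
  prop_green = c(0.70, 0.70, 0.50, 0.50, 0.30, 0.50, 0.40, 0.45, 0.55,
                 0.40, 0.35, 0.45, 0.45, 0.70, 0.55, 0.50, 0.35, 0.65)
)

# Histogram with the bin settings suggested in the task
ggplot(pasta, aes(x = prop_green)) +
  geom_histogram(boundary = 0.325, binwidth = 0.05, color = "white") +
  labs(x = "Proportion of green pasta in a sample of 20",
       y = "Number of samples")

# Average proportion across the 18 samples
mean(pasta$prop_green)
```

The proportions differ from sample to sample, with most of them falling between 0.35 and 0.55.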

---

# What Did We Just Do?

* Demonstrated the statistical concept of ***sampling***.

* *Objective*: know the proportion of green pasta

* *Methods*:

1. **Census**: time-consuming (and in many cases very costly);
  
--
  
  1. **Sampling**: extract a *sample* of 20 pasta from the bowl to obtain an ***estimate***.  
  Our first ***estimate*** of the proportion of green pasta was 0.70, but it was actually larger than most other ***estimates***.
  
--

* *Important*: each *sample* was drawn ***randomly*** `$\rightarrow$` samples are different from each other! `$\rightarrow$` different proportions 👉 ***sampling variation***

---

# Taking Virtual (not Real) Samples

* All the pasta in the bowl are stored in a csv file [here](https://www.dropbox.com/s/qpjsk0rfgc0gx80/pasta.csv?dl=1).

```r
bowl <- read.csv("https://www.dropbox.com/s/qpjsk0rfgc0gx80/pasta.csv?dl=1")

head(bowl)
```

```
##   pasta_ID  color
## 1        1 yellow
## 2        2    red
## 3        3  green
## 4        4 yellow
## 5        5    red
## 6        6  green
```

* `color`: pasta color

```r
nrow(bowl)
```

```
## [1] 713
```

* Instead of selecting pasta with our hands, we'll take *virtual* draws from the bowl.

* We'll use the *virtual shovel* to take a sample of 50 pasta from our virtual bowl.

---

# Using A Virtual Shovel Once

* We will take a first sample of size 50, using the `moderndive` function `rep_sample_n`.

--
.pull-left[

```r
# load moderndive (for rep_sample_n()) and dplyr (for %>%, summarize(), mutate())
library(moderndive)
library(dplyr)
set.seed(123) # we will all have the same draw!

virtual_shovel <- bowl %>% # notice that moderndive functions can be "piped"
  rep_sample_n(size = 50) # take a sample of 50 pasta

# display the sample's first 6 rows
head(virtual_shovel)
```

```
## # A tibble: 6 x 3
## # Groups:   replicate [1]
##   replicate pasta_ID color 
##       <int>    <int> <fct> 
## 1         1      415 yellow
## 2         1      463 yellow
## 3         1      179 green 
## 4         1      526 red   
## 5         1      195 green 
## 6         1      118 yellow
```

]

```r
# number of observations in sample
nrow(virtual_shovel)
```

```
## [1] 50
```

* Column `replicate` tells us the ID of the sample. Here: `1` since we create only one sample (see option `reps`).

```r
# take 4 samples of size 50 (reps = 4)
virtual_shovel_4 <- bowl %>% 
  rep_sample_n(size = 50, reps = 4)

unique(virtual_shovel_4$replicate)
```

```
## [1] 1 2 3 4
```
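
To make the mechanics less of a black box, here is a rough base-R sketch of what a virtual shovel does: draw row numbers at random without replacement (the default behaviour of `rep_sample_n()`) and tag each draw with a replicate ID. The `bowl` below is a stand-in built for illustration, not the real csv file: it assumes 352 green pasta out of 713 and lumps all other colors together as `"other"`. The names `shovel_1` and `shovel_4` are hypothetical.

```r
# Stand-in bowl (assumption): 352 green out of 713, other colors lumped together
bowl <- data.frame(
  pasta_ID = 1:713,
  color    = rep(c("green", "other"), times = c(352, 361))
)

set.seed(123)

# One virtual shovel: pick 50 distinct rows at random
rows     <- sample(nrow(bowl), size = 50, replace = FALSE)
shovel_1 <- bowl[rows, ]

# Four shovels: repeat the draw and label each sample with a replicate ID
shovel_4 <- do.call(rbind, lapply(1:4, function(r) {
  rows <- sample(nrow(bowl), size = 50, replace = FALSE)
  cbind(replicate = r, bowl[rows, ])
}))

# Number of rows in each replicate
table(shovel_4$replicate)
```

Because the stand-in bowl and the random draws differ from the ones used with `rep_sample_n()`, the sampled pasta will not match the output shown on the slide.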


---

# Proportion of Green Pasta

.pull-left[
1. Compute:
  * number of green pasta in sample,
  * number of observations in sample (i.e. 50 in this case)

1. Compute proportion of green pasta

]

```r
sample_1 <- virtual_shovel %>% 
  summarize(
    # number of green pasta in sample
    num_green = sum(color == "green"),
    # number of observations in sample
    sample_n = n()) %>% 
  mutate(
    # proportion of green pasta in sample
    prop_green = num_green / sample_n)
sample_1
```

```
## # A tibble: 1 x 4
##   replicate num_green sample_n prop_green
##       <int>     <int>    <int>      <dbl>
## 1         1        22       50       0.44
```


* 👉 44% of the pasta in this sample are green! This is an ***estimate*** of the proportion of green pasta in the bowl. What if we try again?

* What if we try many times, like, 33 times?

---

# Using The Virtual Shovel 33 Times

33 samples (*replicates*) of size 50.

```r
set.seed(123) # we will all have the same draw!
virtual_samples <- bowl %>%
  # get 33 samples of size 50
  rep_sample_n(size = 50, reps = 33)
virtual_samples
```

```
## # A tibble: 1,650 x 3
## # Groups:   replicate [33]
##    replicate pasta_ID color 
##        <int>    <int> <fct> 
##  1         1      415 yellow
##  2         1      463 yellow
##  3         1      179 green 
##  4         1      526 red   
##  5         1      195 green 
##  6         1      118 yellow
##  7         1      299 green 
##  8         1      229 green 
##  9         1      244 green 
## 10         1       14 green 
## # … with 1,640 more rows
```

Compute the proportion of green pasta in each sample.

```r
virtual_prop_green <- virtual_samples %>% 
  group_by(replicate) %>% # calculate stat by sample
  summarize(
    num_green = sum(color == "green"),
    sample_n = n()) %>% 
  mutate(prop_green = num_green / sample_n)
virtual_prop_green
```

```
## # A tibble: 33 x 4
##    replicate num_green sample_n prop_green
##        <int>     <int>    <int>      <dbl>
##  1         1        22       50       0.44
##  2         2        17       50       0.34
##  3         3        25       50       0.5 
##  4         4        22       50       0.44
##  5         5        22       50       0.44
##  6         6        20       50       0.4 
##  7         7        24       50       0.48
##  8         8        26       50       0.52
##  9         9        32       50       0.64
## 10        10        32       50       0.64
## # … with 23 more rows
```

---

# (Virtual!) Sampling Variation

* The `prop_green` column in the `virtual_prop_green` data.frame differs across samples.

* And again, we can visualize the ***sampling distribution***:

```r
ggplot(virtual_prop_green, aes(x = prop_green)) +
  geom_histogram(binwidth = 0.02, 
                 boundary = 0.51,
                 color = "white",
                 fill = "darkgreen") +
  scale_y_continuous(breaks = seq(0, 12, by = 2)) +
  labs(x = "Proportion of 50 pasta that were green",
       y = "Frequency",
       title = "Distribution of 33 samples of size 50") +
  theme_bw(base_size = 20)
```

.pull-right[
<img src="chapter6_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" />
]

---

# Task 2 (10 minutes)

Instead of taking only 33 samples, let's take ***1000***!

1. Why do we not take 1000 samples "by hand"?

1. Load the [data](https://www.dropbox.com/s/qpjsk0rfgc0gx80/pasta.csv?dl=1) into an object `pasta`.

1. Obtain 1000 samples of size 50 using the `rep_sample_n()` function from the `moderndive` package. Note: use `set.seed(1234)` to allow replication (`?set.seed()`)

1. Calculate the proportion of green pasta in each sample.

1. Plot a histogram of the obtained proportion of green pasta in each sample.

1. What do you observe? Which proportions occur most frequently? How does the shape of the histogram compare to when we took only 33 samples?

1. How likely is it that we sample 50 pasta of which less than 20% are green?

---

# Sampling Distribution of 1000 Samples

Looks remarkably close to a ***normal distribution*** `$\rightarrow$` the more samples we take, the more clearly the histogram of their proportions reveals the bell shape of the ***sampling distribution***, which resembles a ***normal distribution***.
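
A minimal base-R sketch of this experiment follows. It does not use the real csv file: `bowl_color` is a stand-in vector that assumes 352 green pasta out of 713, with every other color labelled `"other"`. It draws 1000 samples of size 50, plots their proportions of green, and computes the share of samples with less than 20% green.

```r
# Stand-in bowl (assumption): 352 green out of 713, other colors lumped together
bowl_color <- rep(c("green", "other"), times = c(352, 361))

set.seed(1234)

# 1000 samples of size 50 (without replacement), one proportion per sample
prop_green <- replicate(1000, mean(sample(bowl_color, size = 50) == "green"))

# Histogram of the 1000 sample proportions; bins are centered on multiples of 0.02
hist(prop_green,
     breaks = seq(-0.01, 1.01, by = 0.02),
     main   = "1000 samples of size 50",
     xlab   = "Proportion of green pasta")

# Share of samples in which less than 20% of the pasta are green
mean(prop_green < 0.20)
```

With a bowl that is roughly half green, samples with less than 20% green should be extremely rare, so the last line is expected to return a value at or very near zero.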

---

# Role of Sample Size

Imagine you could change the size of your samples and had the option of the following sizes: 25, 50 and 100.

If your goal is still to estimate the proportion of the bowl’s pasta that are green, which shovel would you choose?

---

# Role of Sample Size

* Let's repeat what we did previously but for different sample sizes.

* Let's take 1000 samples each for `$n=25,n=50,n=100$`.

* We will use `rep_sample_n()` again.

Generate all samples of different sizes:

```r
set.seed(1234)

# Sample size: 25
virtual_samples_25 <- bowl %>% 
  rep_sample_n(size = 25, reps = 1000)

# Sample size: 50
virtual_samples_50 <- bowl %>% 
  rep_sample_n(size = 50, reps = 1000)

# Sample size: 100
virtual_samples_100 <- bowl %>% 
  rep_sample_n(size = 100, reps = 1000)
```

Compute proportion of green pasta:

```r
# Sample size: 25
# The same code is used for the other sample sizes
virtual_prop_green_25 <- virtual_samples_25 %>% 
  group_by(replicate) %>% 
  summarize(
    num_green = sum(color == "green"),
    sample_n = n()) %>% 
  mutate(prop_green = num_green / sample_n)
```

---

# Role of Sample Size

---

# Sample Size and Sampling Distributions

* The larger the sample size, the *narrower* the resulting ***sampling distribution***.

* In other words, there are fewer differences due to ***sampling variation***.

* Holding constant the number of replicates (i.e. 1000 in our case), ***bigger samples*** will yield *normal distributions* with ***smaller standard deviations***.

Sample Size | Standard Deviation
:---------:|:--------------:
25          | 0.10
50          | 0.07
100         |0.05

* Remember that the ***standard deviation*** measures the *spread* of a variable around its mean.

* So as the sample size increases, our ***estimates*** of the true proportion of the bowl's green pasta get more *precise*.
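
The pattern in the table can be checked with a short simulation. The sketch below uses base R only and a stand-in bowl (an assumption: 352 green pasta out of 713, all other colors labelled `"other"`), so the exact numbers may differ slightly from those obtained with `rep_sample_n()` on the real data.

```r
# Stand-in bowl (assumption): 352 green out of 713, other colors lumped together
bowl_color <- rep(c("green", "other"), times = c(352, 361))

set.seed(1234)

# For each sample size: draw 1000 samples and take the standard deviation
# of the 1000 sample proportions
sizes <- c(25, 50, 100)
sd_by_size <- sapply(sizes, function(n) {
  props <- replicate(1000, mean(sample(bowl_color, size = n) == "green"))
  sd(props)
})
names(sd_by_size) <- paste("n =", sizes)

round(sd_by_size, 2)
```

The standard deviations should shrink as `n` grows, roughly in line with the table.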

---

# Sampling Framework

* We used sampling for the purpose of ***estimation***.

* We extracted samples in order to ***estimate*** the proportion of the bowl's pasta that are green.

* 2 key concepts relating to sampling for estimation:

1. The effect of *sampling variation* on our estimates: different samples give different estimates. 
  
  1. The effect of sample size on *sampling variation*: the bigger the size of our sample, the closer our estimate should be to the true value.

---

# Sampling Glossary 📖

.pull-left[
***Population:*** collection of individuals or observations we are interested in.  
`$N = 713$` pasta.

***Population parameter:*** numerical summary quantity about the population that is unknown but that we want to know.  
*Examples:* population mean `$(\mu)$`, proportion of green pasta `$(p)$`.

***Census:*** exhaustive enumeration or counting of all `$N$` individuals or observations in the population in order to compute the population parameter’s value *exactly*.

***Sampling:*** collecting sample(s) of size `$n$` from the population of size `$N$`.
]

.pull-right[
* ***Point estimate*** or ***Sample statistic:*** summary statistic computed from a sample that estimates an unknown population parameter.  
*Example:* *sample proportion* of green pasta `$(\hat{p})$`. The "hat" on top of the `$p$` indicates that it is an *estimate* of the population proportion `$p$`.

* ***Representative sampling:*** does the sample *look like* the population?

* ***Biased sampling:*** did some pasta have a higher chance of being included in a sample than others?

* ***Random sampling:*** randomly sampling in an unbiased fashion.

]

---

# Statistical Definitions

* We have been computing `$\hat{p}$` all along to estimate `$p$`.

* We plotted the *sampling distribution* to display the *sampling variation* of the *sample proportion* `$\hat{p}$`.

* We computed the *standard deviation* of the *sampling distribution* of `$\hat{p}$`. This standard deviation has a special name: ***standard error*** of the *point estimate* `$\hat{p}$`.

* Let's reproduce the summary table with proper labels:

Sample Size `$(n)$` | Standard Error of `$\hat{p}$`
:---------:|:--------------:
25          | 0.10
50          | 0.07
100         |0.05

* Key takeaway: as the *sample size* `$n$` goes up, the “typical” error of your *point estimate* will go down, as quantified by the *standard error*.
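
For comparison, the textbook approximation for the standard error of a sample proportion is `$\sqrt{p(1-p)/n}$`. The sketch below evaluates it for the three sample sizes, taking `$p = 352/713$` (the bowl's proportion of green pasta, about 0.49) as given. The formula is not derived in these slides and is shown only as a cross-check.

```r
# Proportion of green pasta in the bowl (352 of 713, about 0.49)
p <- 352 / 713

# Sample sizes from the table
n <- c(25, 50, 100)

# Approximate standard error of the sample proportion for each n
se <- sqrt(p * (1 - p) / n)
names(se) <- paste("n =", n)

round(se, 2)
```

Rounded to two decimals, these values agree with the table. The formula assumes independent draws and ignores the small correction for sampling without replacement from a bowl of only 713 pasta.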

---

# Putting It All Together

* ***Point estimates*** from ***random samples*** provide a *good guess* of the true unknown ***population parameter***.

* How good? Sometimes `$\hat{p}$` will be far from `$p$`, sometimes close. There's ***sampling variation***.

* ***On average***, our estimates will be correct. This is because of random sampling. We say that: 
> ### `$\hat{p}$` is an ***unbiased estimator*** of `$p$`, i.e. `$\mathop{\mathbb{E}}[\hat{p}] = p$`

* What is the true population proportion `$p$` of green pasta in the population of `$N=713$` pasta?

```r
sum(bowl$color == "green")/nrow(bowl)
```

```
## [1] 0.4936886
```

* Let's insert the ***true population proportion*** `$p=0.49$` into our previous plots!

---

# Visualizing Unbiasedness and Sampling Variation
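
One way to draw such a picture is sketched below in base R. It relies on a stand-in bowl (an assumption: 352 green pasta out of 713, all other colors labelled `"other"`) rather than the real csv file. For each sample size it plots the 1000 sample proportions, marks the true proportion `p` with a vertical line, and records the average estimate.

```r
# Stand-in bowl (assumption): 352 green out of 713, other colors lumped together
bowl_color <- rep(c("green", "other"), times = c(352, 361))
p <- mean(bowl_color == "green")  # true population proportion

set.seed(1234)
sizes <- c(25, 50, 100)

op <- par(mfrow = c(1, 3))  # three histograms side by side
avg_estimate <- sapply(sizes, function(n) {
  props <- replicate(1000, mean(sample(bowl_color, size = n) == "green"))
  hist(props, xlim = c(0, 1), main = paste("n =", n),
       xlab = "Proportion of green pasta")
  abline(v = p, col = "red", lwd = 2)  # true proportion
  mean(props)                          # average of the 1000 estimates
})
par(op)

# Difference between the average estimate and the true proportion
round(avg_estimate - p, 3)
```

All three histograms should be centered on the vertical line (unbiasedness), while their spread shrinks as `n` grows (less sampling variation).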

---

# Some Sampling Scenarios

Scenario | Population parameter | Notation | Point estimate | Symbol(s)
:--: | :--: | :--: |:--: | :--:
1 | Population proportion | `$p$` | Sample proportion | `$\widehat{p}$`
2 | Population mean | `$\mu$` | Sample mean | `$\overline{x}$` or `$\widehat{\mu}$`
3 | Difference in population proportions | `$p_1 - p_2$` | Difference in sample proportions | `$\widehat{p}_1 - \widehat{p}_2$`
4 | Difference in population means | `$\mu_1 - \mu_2$` | Difference in sample means | `$\overline{x}_1 - \overline{x}_2$`
5 | Population regression slope | `$\beta_1$` | Fitted regression slope | `$b_1$` or `$\widehat{\beta}_1$`
6 | Population regression intercept | `$\beta_0$` | Fitted regression intercept | `$b_0$` or `$\widehat{\beta}_0$`

---

# The Central Limit Theorem (CLT)

* The fact that the sampling distributions of our sample statistics become more bell-shaped and narrower as the sample size grows is well known in statistics.

* It's due to a famous result known as the ***central limit theorem***.

> ### *Central Limit Theorem:* regardless of what the underlying population distribution looks like, **when sample *means* are based on larger and larger sample sizes, the sampling distribution of these sample *means* becomes both more and more normally shaped and more and more narrow**.

* In other words, their sampling distribution increasingly follows a ***normal distribution*** and the *variation of these sampling distributions gets smaller*, as quantified by their ***standard errors***.
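
The theorem can be illustrated with a population that looks nothing like a bell curve. The sketch below is a hypothetical example that is not part of the pasta activity: it draws samples from a right-skewed exponential distribution and tracks the spread and the skewness of the sample means as the sample size grows. The helper names `sample_means` and `skewness` are made up for this illustration.

```r
set.seed(1234)

# Sample means of `reps` samples of size n from a right-skewed population
# (exponential with mean 1; an arbitrary choice for illustration)
sample_means <- function(n, reps = 5000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

# Simple moment-based skewness: close to 0 for a symmetric distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

sizes <- c(2, 10, 100)
summary_by_size <- t(sapply(sizes, function(n) {
  m <- sample_means(n)
  c(n = n, sd = sd(m), skewness = skewness(m))
}))

round(summary_by_size, 2)
```

As `n` increases, the standard deviation of the sample means should fall (narrower sampling distribution) and the skewness should move toward zero (more normally shaped).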

---

# Central Limit Theorem - NYTimes video

.center[
<iframe width="684" height="464" src="https://www.youtube.com/embed/jvoxEYmQHNM" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
]

---

# THANKS

To the amazing [moderndive](https://moderndive.com/) team!

---

class: title-slide-final, middle
background-image: url(../img/logo/ScPo-econ.png)
background-size: 250px
background-position: 9% 19%

# SEE YOU NEXT WEEK!

|                                                                                                            |                                   |
| :--------------------------------------------------------------------------------------------------------- | :-------------------------------- |
| <a href="mailto:michele.fioretti@sciencespo.fr">.ScPored[<i class="fa fa-paper-plane fa-fw"></i>]</a>               | michele.fioretti@sciencespo.fr       |
| <a href="https://michelefioretti.github.io/ScPoEconometrics-Slides/">.ScPored[<i class="fa fa-link fa-fw"></i>]</a> | Slides |
| <a href="https://michelefioretti.github.io/ScPoEconometrics/">.ScPored[<i class="fa fa-link fa-fw"></i>]</a> | Book |
| <a href="http://twitter.com/ScPoEcon">.ScPored[<i class="fa fa-twitter fa-fw"></i>]</a>                          | @ScPoEcon                         |
| <a href="http://github.com/ScPoEcon">.ScPored[<i class="fa fa-github fa-fw"></i>]</a>                          | @ScPoEcon                       |