ECON 4050: Introduction to Econometrics

class: center, middle, inverse, title-slide

.title[
# ECON 4050: Introduction to Econometrics
]
.subtitle[
## Sampling
]
.author[
### Adam Soliman, PhD
]
.date[
### Clemson University
]

---

# Today - Sampling

.footnote[This lecture is very heavily based on the wonderful [sampling chapter](https://moderndive.com/7-sampling.html) of [ModernDive](https://moderndive.com/)]

* Fun activity to discover sampling, sampling variation and sampling distributions.

* Sampling terminology: population, sample, population parameter, point estimate or sample statistic, etc.

* Definition of an ***unbiased estimator***.

* Fundamental statistical theorem for inference: ***Central Limit Theorem***.

---

# What's the proportion of green pasta?

.center[
<img src="../img/photos/pasta/pasta_bowl.JPG" width="600px" style="display: block; margin: auto;" />
]

We could count every green pasta but that would be tedious! 😩 What else could we do?

---

# Sampling

.pull-left[
* Let's take a sample of 20 pasta.

* My friend made sure to select them at **random**.

* Here is what we found.

Color | Count | Proportion
:------:|:------:|:--------:
Green   |  14        |  0.70
Red   |  5     |   0.25
Yellow   |  1    |     0.05

* 0.70 can be thought of as our guess of the proportion of green pasta in the entire bowl.
]

.pull-right[
<div><img src="../img/photos/pasta/sample1.JPG"?></div>
]

---

# Sampling Variation

* What would happen if we took a *new* sample (putting the 20 previous pasta back in the bowl)? Would we also get 14 *greens* as before?

* What if we repeated this activity multiple times?

* Probably not. The samples will vary from draw to draw.

* Key to this observation: these are *randomly* drawn samples.

---

# Taking 18 Samples

* Because we don't have pasta with us in class, he drew 18 samples of 20 pasta (with replacement) at home.

* This is what each looked like:

![:scale 100%](../img/photos/pasta/sample1.JPG) | ![:scale 100%](../img/photos/pasta/sample2.JPG) | ![:scale 100%](../img/photos/pasta/sample3.JPG) | ![:scale 100%](../img/photos/pasta/sample4.JPG) | ![:scale 100%](../img/photos/pasta/sample5.JPG) | ![:scale 100%](../img/photos/pasta/sample6.JPG)
:------:|:------:|:--------:|:--------:|:--------:|:--------:
![:scale 100%](../img/photos/pasta/sample7.JPG) | ![:scale 100%](../img/photos/pasta/sample8.JPG) | ![:scale 100%](../img/photos/pasta/sample9.JPG) | ![:scale 100%](../img/photos/pasta/sample10.JPG) | ![:scale 100%](../img/photos/pasta/sample11.JPG) | ![:scale 100%](../img/photos/pasta/sample12.JPG)
![:scale 100%](../img/photos/pasta/sample13.JPG) | ![:scale 100%](../img/photos/pasta/sample14.JPG) | ![:scale 100%](../img/photos/pasta/sample15.JPG) | ![:scale 100%](../img/photos/pasta/sample16.JPG) | ![:scale 100%](../img/photos/pasta/sample17.JPG) | ![:scale 100%](../img/photos/pasta/sample18.JPG)

---

# Taking 18 Samples

* Because we don't have pasta with us in class, he drew 18 samples of 20 pasta (with replacement) at home.

* For each sample, we computed the share of green pasta.

.pull-left[
Sample # | Count | Proportion
:------:|:------:|:--------:
1 | 14 | 0.70
2 | 14 | 0.70
3 | 10 | 0.50
4 | 10 | 0.50
5 | 6 | 0.30
6 | 10 | 0.50
7 | 8 | 0.40
8 | 9 | 0.45
9 | 11 | 0.55
]

.pull-right[
Sample # | Count | Proportion
:------:|:------:|:--------:
10 | 8 | 0.40
11 | 7 | 0.35
12 | 9 | 0.45
13 | 9 | 0.45
14 | 14 | 0.70
15 | 11 | 0.55
16 | 10 | 0.50
17 | 7 | 0.35
18 | 13 | 0.65
]

---

# Sample Distribution: Histogram

.pull-left[
<img src="../img/photos/hist_building.gif" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter_sampling_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />
]

---

# What Did We Just Do?

* Demonstrated the statistical concept of ***sampling***.

* *Objective*: know the proportion of green pasta

* *Methods*:

1. **Census**: time-consuming (and in many cases very costly);
  
  1. **Sampling**: extract a *sample* of 20 pasta from the bowl to obtain an ***estimate***.  
  Out first ***estimate*** of the proportion of green pasta was 0.70, but it was actually larger than most other ***estimates***.
  
--

* *Important*: each *sample* was drawn ***randomly*** `$\rightarrow$` samples are different from each other! `$\rightarrow$` different proportions 👉 ***sampling variation***

---

# Taking Virtual (not Real) Samples

.pull-left[
* He counted the exact number of green, red and yellow pasta in the bowl 🧐 `#confinement`

* All the pasta in the bowl are stored in a csv file [here](https://github.com/adamsoliman/IntroEconometrics/blob/master/data%20for%20tasks/pasta.csv).

``` r
bowl <- read.csv("~/Library/CloudStorage/Dropbox/Clemson/Econometrics Course/data for tasks/pasta.csv")

head(bowl, 6)
```

```
##   pasta_ID  color
## 1        1 yellow
## 2        2    red
## 3        3  green
## 4        4 yellow
## 5        5    red
## 6        6  green
```
]

.pull-right[
* `pasta_ID`: pasta identifier

* `color`: pasta color

``` r
nrow(bowl)
```

```
## [1] 713
```

* Instead of selecting pasta with our hands, we'll take *virtual* draws from the bowl.

* We'll use the *virtual shovel* to take a sample of 50 pasta from our virtual bowl.
]

---

# Using A Virtual Shovel Once

* We will take a first sample of size 50, using the `moderndive` function `rep_sample_n`.

``` r
#load moderndive package
library(moderndive)
virtual_shovel <- bowl %>% # notice that moderndive functions can be "pipped"
  rep_sample_n(size = 50) # take a sample of 50 pasta
```

.pull-left[

``` r
# display the sample's first 6 rows
head(virtual_shovel)
```

```
## # A tibble: 6 × 3
## # Groups:   replicate [1]
##   replicate pasta_ID color 
##       <int>    <int> <chr> 
## 1         1      284 green 
## 2         1      101 green 
## 3         1      623 yellow
## 4         1      645 green 
## 5         1      400 red   
## 6         1       98 yellow
```

* Column `replicate` tells us the ID of the sample. Here: `1`.
]

.pull-right[

``` r
# number of observations in sample
nrow(virtual_shovel)
```

```
## [1] 50
```
]

---

# Proportion of Green Pasta

.pull-left[

``` r
sample_1 <- virtual_shovel %>% 
  summarize(
    # number of green pasta in sample
    num_green = sum(color == "green"),
    # number of observations in sample
    sample_n = n()) %>% 
  mutate(
    # proportion of green pasta in sample
    prop_green = num_green / sample_n)
sample_1
```

```
## # A tibble: 1 × 4
##   replicate num_green sample_n prop_green
##       <int>     <int>    <int>      <dbl>
## 1         1        23       50       0.46
```
]

.pull-right[
1. Compute:
  * sum of green pasta in sample,
  * number of observations in sample (i.e. 50 in this case)

1. Compute proportion of green pasta

👉 0.46 are green! This is an ***estimate*** of the proportion of green pasta in the bowl. What if we try again?

What if we try many times, like, 33 times?
]

---

# Using The Virtual Shovel 33 Times

.pull-left[

33 samples (*replicates*) of size 50.

``` r
virtual_samples <- bowl %>%
  # get 33 samples of size 50
  rep_sample_n(size = 50, reps = 33)
virtual_samples
```

```
## # A tibble: 1,650 × 3
## # Groups:   replicate [33]
##    replicate pasta_ID color 
##        <int>    <int> <chr> 
##  1         1      495 yellow
##  2         1      534 green 
##  3         1      297 yellow
##  4         1      208 green 
##  5         1      131 green 
##  6         1      569 red   
##  7         1      522 yellow
##  8         1      248 green 
##  9         1      365 red   
## 10         1      665 yellow
## # ℹ 1,640 more rows
```
]

.pull-right[

Compute the proportion of green pasta in each sample.

``` r
virtual_prop_green <- virtual_samples %>% 
  group_by(replicate) %>% # calculate stat by sample
  summarize(num_green = sum(color == "green"),
    sample_n = n()) %>% 
  mutate(prop_green = num_green / sample_n)
virtual_prop_green
```

```
## # A tibble: 33 × 4
##    replicate num_green sample_n prop_green
##        <int>     <int>    <int>      <dbl>
##  1         1        24       50       0.48
##  2         2        25       50       0.5 
##  3         3        27       50       0.54
##  4         4        23       50       0.46
##  5         5        25       50       0.5 
##  6         6        22       50       0.44
##  7         7        18       50       0.36
##  8         8        30       50       0.6 
##  9         9        29       50       0.58
## 10        10        18       50       0.36
## # ℹ 23 more rows
```
]

---

# (Virtual!) Sampling Variation

.pull-left[
* Just as when we did it, the virtual sampler *also* creates random samples.

* The `prop_green` column in the `virtual_prop_green` data.frame differs across samples.

* And again, we can visualize the ***sampling distribution***:

``` r
ggplot(virtual_prop_green, aes(x = prop_green)) +
  geom_histogram(binwidth = 0.02, 
                 boundary = 0.51,
                 color = "white",
                 fill = "darkgreen") +
  scale_y_continuous(breaks = seq(0, 12, by = 2)) +
  labs(x = "Proportion of 50 pasta that were green",
       y = "Frequency",
       title = "Distribution of 33 samples of size 50") +
  theme_bw(base_size = 20)
```
]

.pull-right[
<img src="chapter_sampling_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" />
]

---

# Sampling Distribution of 1000 Samples

Looks remarkably close to a ***normal distribution*** `$\rightarrow$` the more samples we take, the more their ***sampling distribution*** will resemble a ***normal distribution***.

---

# Role of Sample Size

Imagine you could change the size of your samples and had the option of the following sizes: 25, 50 and 100.

If your goal is still to estimate the proportion of the bowl’s pasta that are green, which shovel would you choose?

---

# Role of Sample Size

* Let's repeat what we did previously but for different sample sizes.

* Let's take 1000 samples each for `$n=25,n=50,n=100$`.

* We will use `rep_sample_n()` again.

.pull-left[

Generate all samples of different sizes:

``` r
# Sample size: 25
virtual_samples_25 <- bowl %>% 
  rep_sample_n(size = 25, reps = 1000)

# Sample size: 50
virtual_samples_50 <- bowl %>% 
  rep_sample_n(size = 50, reps = 1000)

# Sample size: 100
virtual_samples_100 <- bowl %>% 
  rep_sample_n(size = 100, reps = 1000)
```
]

.pull-right[

Compute proportion of green pasta:

``` r
# Sample size: 25
# The same code is used for the other sample sizes
virtual_prop_green_25 <- virtual_samples_25 %>% 
  group_by(replicate) %>% 
  summarize(
    num_green = sum(color == "green"),
    sample_n = n()) %>% 
  mutate(prop_green = num_green / sample_n)
```
]

---

# Role of Sample Size

---

# Sample Size and Sampling Distributions

* The larger the sample size, the *narrower* the resulting ***sampling distribution***.

* In other words, there are fewer differences due to ***sampling variation***.

* Holding constant the number of replicates (i.e. 1000 in our case), ***bigger samples*** will yield *normal distributions* with ***smaller standard deviations***.

Sample Size | Standard Deviation
:---------:|:--------------:
25          | 0.10
50          | 0.07
100         |0.05

* Remember that the ***standard deviation*** measures the *spread* of a variable around its mean.

* So as the sample size increases, our ***estimates*** of the true proportion of the bowl's green pasta get more *precise*.

---

# Sampling Framework

* We used sampling for the purpose of ***estimation***.

* We extracted samples in order to ***estimate*** the proportion of the bowl's pasta that are green.

* 2 key concepts relating to sampling for estimation:

1. The effect of ***sampling variation*** on our estimates: different samples give different estimates. 
  
  1. The effect of sample size on ***sampling variation***: the bigger the size of our sample the closer our estimate should be from the true value.

---

# Sampling Glossary 📖

.pull-left[
***Population:*** collection of individuals or observations we are interested in.  
`$N = 713$` pasta.

***Population parameter:*** numerical summary quantity about the population that is unknown but that we want to know.  
*Examples:* population mean `$(\mu)$`, proportion of green pasta `$(p)$`.

***Census:*** exhaustive enumeration or counting of all `$N$` individuals or observations in the population in order to compute the population parameter’s value *exactly*.

***Sampling:*** collecting sample(s) of size `$n$` from the population of size `$N$`.
]

.pull-right[
* ***Point estimate*** or ***Sample statistic:*** summary statistic computed from a sample that estimates an unknown population parameter.  
*Example:* *sample proportion* of green pasta `$(\hat{p})$`. The "hat" on top of the `$p$` indicates that it is an *estimate* of the population proportion `$p$`.

* ***Representative sampling:*** does the sample *look like* the population?

* ***Biased sampling:*** did all pasta have an equal chance of being included in a sample?

* ***Random sampling:*** randomly sampling in an unbiased fashion.

]

---

# Statistical Definitions

* We have been estimating `$\hat{p}$` all along.

* We plotted the *sampling distribution* to display the *sampling variation* of the *sample proportion* `$\hat{p}$`.

* We computed the *standard deviation* of the *sampling distribution* of `$\hat{p}$`. This standard deviation has a special name: ***standard error*** of the *point estimate* `$\hat{p}$`.

* Let's reproduce the summary table and labeling properly:

Sample Size `$(n)$` | Standard Error of `$\hat{p}$`
:---------:|:--------------:
25          | 0.10
50          | 0.07
100         |0.05

* Key takeaway: as the *sample size* `$n$` goes up, the “typical” error of your *point estimate* will go down, as quantified by the *standard error*.

---

# Putting It All Together

* ***Point estimates*** from ***random samples*** provide a *good guess* of the true unknown ***population parameter***.

* How good? Sometimes `$\hat{p}$` will be far from `$p$`, sometimes close. There's ***sampling variation***.

* ***On average***, our estimates will be correct. This is because of random sampling. We say that: 
> ### `$\hat{p}$` is an ***unbiased estimator*** of `$p$`, i.e. `$\mathop{\mathbb{E}}[\hat{p}] = p$`

* What is the true population proportion `$p$` of green pasta in the population of `$N=713$` pasta?

``` r
sum(bowl$color == "green")/nrow(bowl)
```

```
## [1] 0.4936886
```

* Let's insert the ***true population proportion*** `$p=0.49$` into our previous plots!

---

# Visualizing Unbiasedness and Sampling Variation

---
# Some Sampling Scenarios

Scenario | Population parameter | Notation | Point estimate | Symbol(s)
:--: | :--: | :--: |:--: | :--:
1 | Population proportion | `$p$` | Sample proportion | `$\widehat{p}$`
2 | Population mean | `$\mu$` | Sample mean | `$\overline{x}$` or `$\widehat{\mu}$`
3 | Difference in population proportions | `$p_1 - p_2$` | Difference in sample proportions | `$\widehat{p}_1 - \widehat{p}_2$`
4 | Difference in population means | `$\mu_1 - \mu_2$` | Difference in sample means | `$\overline{x}_1 - \overline{x}_2$`
5 | Population regression slope | `$\beta_1$` | Fitted regression slope | `$b_1$` or `$\widehat{\beta}_1$`
6 | Population regression intercept | `$\beta_0$` | Fitted regression intercept | `$b_0$` or `$\widehat{\beta}_0$`

---

# The Central Limit Theorem (CLT)

* The fact that our sample statistics ***converge*** to a *central limit* is well known in statistics.

* It's due to a famous result known as the ***central limit theorem***.

> ### *Central Limit Theorem:* regardless of how the underlying population distribution looks like, **when sample *means* are based on larger and larger sample sizes, the sampling distribution of these sample *means* becomes both more and more normally shaped and more and more narrow**.

* In other words, their sampling distribution increasingly follows a ***normal distribution*** and the *variation of these sampling distributions gets smaller*, as quantified by their ***standard errors***.

---

# Task

1. Why do we not take 1000 samples "by hand"?

1. Install the `moderndive` package. To install a new package, i.e., for the first time, run `install.packages("moderndive")` in the console. Then, in your script, right below `library(tidyverse)`, add `library(moderndive)`. You will never need to install the package again moving forward.

1. Load the [data](https://github.com/adamsoliman/IntroEconometrics/blob/master/data%20for%20tasks/pasta.csv) into R, and call the object `pastabowl`.

1. Obtain 100 samples of size 10 using the `rep_sample_n()` function.

1. Calculate the proportion of green pasta in each sample.

1. Plot a histogram of the obtained proportion of green pasta in each sample.

1. Redo the procedure above with 100 samples of size 30 using the `rep_sample_n()` function. Then calculate the proportion of green pasta in each sample and plot a histogram.

1. Redo the procedure above with 100 samples of size 1000 using the `rep_sample_n()` function. Then calculate the proportion of green pasta in each sample and plot a histogram.

1. What are the two key concepts relating to sampling for estimation that we just highlighted?

---

# On the way to causality [chapter sampling]

✅ How to manage data? Read it, tidy it, visualise it!

✅  How to summarise relationships between variables? Simple and multiple linear regression, non-linear regressions, interactions...

✅ What is causality?

✅ **What if we don't observe an entire population?** Sampling!

❌  Are our findings just due to randomness?

❌ How to find exogeneity in practice?

---
class: title-slide-final, middle

# THANKS

To the amazing [moderndive](https://moderndive.com/) team!