class: center, middle, inverse, title-slide

.title[
# ECON 4050: Introduction to Econometrics
]
.subtitle[
## Sampling
]
.author[
### Adam Soliman, PhD
]
.date[
### Clemson University
]

---

# Today - Sampling

.footnote[This lecture is very heavily based on the wonderful [sampling chapter](https://moderndive.com/7-sampling.html) of [ModernDive](https://moderndive.com/)]

* Fun activity to discover sampling, sampling variation and sampling distributions.

* Sampling terminology: population, sample, population parameter, point estimate or sample statistic, etc.

* Definition of an ***unbiased estimator***.

* Fundamental statistical theorem for inference: ***Central Limit Theorem***.

---

# What's the proportion of green pasta?

.center[
<img src="../img/photos/pasta/pasta_bowl.JPG" width="600px" style="display: block; margin: auto;" />
]

We could count every green pasta but that would be tedious! 😩 What else could we do?

---

# Sampling

.pull-left[

* Let's take a sample of 20 pasta.

* My friend made sure to select them at **random**.

* Here is what we found.

Color | Count | Proportion
:------:|:------:|:--------:
Green | 14 | 0.70
Red | 5 | 0.25
Yellow | 1 | 0.05

* 0.70 can be thought of as our guess of the proportion of green pasta in the entire bowl.

]

.pull-right[
<div><img src="../img/photos/pasta/sample1.JPG"></div>
]

---

# Sampling Variation

* What would happen if we took a *new* sample (putting the 20 previous pasta back in the bowl)? Would we also get 14 *greens* as before?

--

* What if we repeated this activity multiple times?

* We would probably not get 14 greens every time: the samples will vary from draw to draw.

--

* Key to this observation: these are *randomly* drawn samples.

---

# Taking 18 Samples

* Because we don't have pasta with us in class, my friend drew 18 samples of 20 pasta (with replacement) at home.
--

* This is what each looked like:

[Photo grid: the 18 samples of 20 pasta]

---

# Taking 18 Samples

* Because we don't have pasta with us in class, my friend drew 18 samples of 20 pasta (with replacement) at home.

* For each sample, we computed the share of green pasta.

.pull-left[

Sample # | Count | Proportion
:------:|:------:|:--------:
1 | 14 | 0.70
2 | 14 | 0.70
3 | 10 | 0.50
4 | 10 | 0.50
5 | 6 | 0.30
6 | 10 | 0.50
7 | 8 | 0.40
8 | 9 | 0.45
9 | 11 | 0.55

]

.pull-right[

Sample # | Count | Proportion
:------:|:------:|:--------:
10 | 8 | 0.40
11 | 7 | 0.35
12 | 9 | 0.45
13 | 9 | 0.45
14 | 14 | 0.70
15 | 11 | 0.55
16 | 10 | 0.50
17 | 7 | 0.35
18 | 13 | 0.65

]

---

# Sampling Distribution: Histogram

.pull-left[
<img src="../img/photos/hist_building.gif" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter_sampling_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />
]

---

# What Did We Just Do?

* Demonstrated the statistical concept of ***sampling***.

--

* *Objective*: know the proportion of green pasta

* *Methods*:
    1. **Census**: time-consuming (and in many cases very costly);
    1. **Sampling**: extract a *sample* of 20 pasta from the bowl to obtain an ***estimate***. Our first ***estimate*** of the proportion of green pasta was 0.70, but it was actually larger than most other ***estimates***.

--

* *Important*: each *sample* was drawn ***randomly*** `\(\rightarrow\)` samples are different from each other! `\(\rightarrow\)` different proportions 👉 ***sampling variation***

---

# Taking Virtual (not Real) Samples

.pull-left[

* My friend counted the exact number of green, red and yellow pasta in the bowl 🧐 `#confinement`

* All the pasta in the bowl are stored in a CSV file [here](https://github.com/adamsoliman/IntroEconometrics/blob/master/data%20for%20tasks/pasta.csv).

``` r
bowl <- read.csv("~/Library/CloudStorage/Dropbox/Clemson/Econometrics Course/data for tasks/pasta.csv")
head(bowl, 6)
```

```
##   pasta_ID  color
## 1        1 yellow
## 2        2    red
## 3        3  green
## 4        4 yellow
## 5        5    red
## 6        6  green
```

]

--

.pull-right[

* `pasta_ID`: pasta identifier

* `color`: pasta color

``` r
nrow(bowl)
```

```
## [1] 713
```

* Instead of selecting pasta with our hands, we'll take *virtual* draws from the bowl.

* We'll use the *virtual shovel* to take a sample of 50 pasta from our virtual bowl.

]

---

# Using A Virtual Shovel Once

* We will take a first sample of size 50, using the `moderndive` function `rep_sample_n`.

--

``` r
# load the tidyverse (for %>% and the dplyr verbs used below) and moderndive
library(tidyverse)
library(moderndive)

virtual_shovel <- bowl %>%  # notice that moderndive functions can be "piped"
  rep_sample_n(size = 50)   # take a sample of 50 pasta
```

.pull-left[

``` r
# display the sample's first 6 rows
head(virtual_shovel)
```

```
## # A tibble: 6 × 3
## # Groups:   replicate [1]
##   replicate pasta_ID color
##       <int>    <int> <chr>
## 1         1      284 green
## 2         1      101 green
## 3         1      623 yellow
## 4         1      645 green
## 5         1      400 red
## 6         1       98 yellow
```

* Column `replicate` tells us the ID of the sample. Here: `1`.

]

.pull-right[

``` r
# number of observations in sample
nrow(virtual_shovel)
```

```
## [1] 50
```

]

---

# Proportion of Green Pasta

.pull-left[

``` r
sample_1 <- virtual_shovel %>%
  summarize(
    # number of green pasta in sample
    num_green = sum(color == "green"),
    # number of observations in sample
    sample_n = n()) %>%
  mutate(
    # proportion of green pasta in sample
    prop_green = num_green / sample_n)
sample_1
```

```
## # A tibble: 1 × 4
##   replicate num_green sample_n prop_green
##       <int>     <int>    <int>      <dbl>
## 1         1        23       50       0.46
```

]

.pull-right[

1. Compute:
    * sum of green pasta in sample,
    * number of observations in sample (i.e. 50 in this case)

1. Compute proportion of green pasta

👉 0.46 of the sample is green!

This is an ***estimate*** of the proportion of green pasta in the bowl.

What if we try again?
What if we try many times, like, 33 times?

]

---

# Using The Virtual Shovel 33 Times

.pull-left[

33 samples (*replicates*) of size 50.

``` r
virtual_samples <- bowl %>%
  # get 33 samples of size 50
  rep_sample_n(size = 50, reps = 33)
virtual_samples
```

```
## # A tibble: 1,650 × 3
## # Groups:   replicate [33]
##    replicate pasta_ID color
##        <int>    <int> <chr>
##  1         1      495 yellow
##  2         1      534 green
##  3         1      297 yellow
##  4         1      208 green
##  5         1      131 green
##  6         1      569 red
##  7         1      522 yellow
##  8         1      248 green
##  9         1      365 red
## 10         1      665 yellow
## # ℹ 1,640 more rows
```

]

--

.pull-right[

Compute the proportion of green pasta in each sample.

``` r
virtual_prop_green <- virtual_samples %>%
  group_by(replicate) %>%  # calculate stat by sample
  summarize(num_green = sum(color == "green"),
            sample_n = n()) %>%
  mutate(prop_green = num_green / sample_n)
virtual_prop_green
```

```
## # A tibble: 33 × 4
##    replicate num_green sample_n prop_green
##        <int>     <int>    <int>      <dbl>
##  1         1        24       50       0.48
##  2         2        25       50       0.5
##  3         3        27       50       0.54
##  4         4        23       50       0.46
##  5         5        25       50       0.5
##  6         6        22       50       0.44
##  7         7        18       50       0.36
##  8         8        30       50       0.6
##  9         9        29       50       0.58
## 10        10        18       50       0.36
## # ℹ 23 more rows
```

]

---

# (Virtual!) Sampling Variation

.pull-left[

* Just as when we did it, the virtual sampler *also* creates random samples.

* The `prop_green` column in the `virtual_prop_green` data.frame differs across samples.
* And again, we can visualize the ***sampling distribution***:

``` r
ggplot(virtual_prop_green, aes(x = prop_green)) +
  geom_histogram(binwidth = 0.02, boundary = 0.51,
                 color = "white", fill = "darkgreen") +
  scale_y_continuous(breaks = seq(0, 12, by = 2)) +
  labs(x = "Proportion of 50 pasta that were green",
       y = "Frequency",
       title = "Distribution of 33 samples of size 50") +
  theme_bw(base_size = 20)
```

]

--

.pull-right[
<img src="chapter_sampling_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" />
]

---

# Sampling Distribution of 1000 Samples

<img src="chapter_sampling_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" />

--

Looks remarkably close to a ***normal distribution*** `\(\rightarrow\)` the more samples we take, the more clearly the histogram of their proportions traces out the bell shape of the ***sampling distribution***.

---

# Role of Sample Size

Imagine you could change the size of your samples and had the option of the following sizes: 25, 50 and 100.

If your goal is still to estimate the proportion of the bowl’s pasta that are green, which shovel would you choose?

---

# Role of Sample Size

* Let's repeat what we did previously but for different sample sizes.

* Let's take 1000 samples each for `\(n=25,n=50,n=100\)`.

--

* We will use `rep_sample_n()` again.
--

.pull-left[

Generate all samples of different sizes:

``` r
# Sample size: 25
virtual_samples_25 <- bowl %>%
  rep_sample_n(size = 25, reps = 1000)

# Sample size: 50
virtual_samples_50 <- bowl %>%
  rep_sample_n(size = 50, reps = 1000)

# Sample size: 100
virtual_samples_100 <- bowl %>%
  rep_sample_n(size = 100, reps = 1000)
```

]

--

.pull-right[

Compute proportion of green pasta:

``` r
# Sample size: 25
# The same code is used for the other sample sizes
virtual_prop_green_25 <- virtual_samples_25 %>%
  group_by(replicate) %>%
  summarize(
    num_green = sum(color == "green"),
    sample_n = n()) %>%
  mutate(prop_green = num_green / sample_n)
```

]

---

# Role of Sample Size

<img src="chapter_sampling_files/figure-html/unnamed-chunk-17-1.svg" style="display: block; margin: auto;" />

---

# Sample Size and Sampling Distributions

* The larger the sample size, the *narrower* the resulting ***sampling distribution***.

* In other words, there are fewer differences due to ***sampling variation***.

--

* Holding constant the number of replicates (i.e. 1000 in our case), ***bigger samples*** will yield *sampling distributions* with ***smaller standard deviations***.

Sample Size | Standard Deviation
:---------:|:--------------:
25 | 0.10
50 | 0.07
100 | 0.05

--

* Remember that the ***standard deviation*** measures the *spread* of a variable around its mean.

* So as the sample size increases, our ***estimates*** of the true proportion of the bowl's green pasta get more *precise*.

---

# Sampling Framework

* We used sampling for the purpose of ***estimation***.

* We extracted samples in order to ***estimate*** the proportion of the bowl's pasta that are green.

--

* 2 key concepts relating to sampling for estimation:

    1. The effect of ***sampling variation*** on our estimates: different samples give different estimates.

    1. The effect of sample size on ***sampling variation***: the bigger our sample, the closer our estimate should be to the true value.
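---

# Sketch: Both Key Concepts in One Loop

The two key concepts above can be checked in a few lines. This is a minimal base-R sketch, not the code behind the slides: it assumes a stand-in bowl (`bowl_colors`, a hypothetical vector with 352 green out of 713 pasta, the non-green colors lumped together) and uses `sample()` in place of `rep_sample_n()`. The helper name `sample_props` is also made up for illustration.

``` r
set.seed(4050)  # arbitrary seed so the draws can be reproduced

# Assumption: stand-in for the real bowl; only "green" vs "not green" matters here
bowl_colors <- c(rep("green", 352), rep("other", 361))

# Draw `reps` samples of size `n` (without replacement within a sample)
# and return the proportion of green pasta in each sample
sample_props <- function(n, reps = 1000) {
  replicate(reps, mean(sample(bowl_colors, size = n) == "green"))
}

sizes <- c(25, 50, 100)
props_by_size <- lapply(sizes, sample_props)
names(props_by_size) <- paste0("n_", sizes)

# Concept 1: estimates differ from sample to sample
range(props_by_size$n_25)

# Concept 2: the spread of the estimates shrinks as the sample size grows
sd_by_size <- sapply(props_by_size, sd)
round(sd_by_size, 2)
```

Each entry of `sd_by_size` plays the role of one row in the standard deviation table. The exact digits depend on the seed, so they may differ slightly from the table.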
---

# Sampling Glossary 📖

.pull-left[

***Population:*** collection of individuals or observations we are interested in. `\(N = 713\)` pasta.

***Population parameter:*** numerical summary quantity about the population that is unknown but that we want to know. *Examples:* population mean `\((\mu)\)`, proportion of green pasta `\((p)\)`.

***Census:*** exhaustive enumeration or counting of all `\(N\)` individuals or observations in the population in order to compute the population parameter’s value *exactly*.

***Sampling:*** collecting sample(s) of size `\(n\)` from the population of size `\(N\)`.

]

.pull-right[

* ***Point estimate*** or ***Sample statistic:*** summary statistic computed from a sample that estimates an unknown population parameter. *Example:* *sample proportion* of green pasta `\((\hat{p})\)`. The "hat" on top of the `\(p\)` indicates that it is an *estimate* of the population proportion `\(p\)`.

* ***Representative sampling:*** does the sample *look like* the population?

* ***Biased sampling:*** did all pasta have an equal chance of being included in a sample?

* ***Random sampling:*** randomly sampling in an unbiased fashion.

]

---

# Statistical Definitions

* We have been computing the estimate `\(\hat{p}\)` all along.

* We plotted the *sampling distribution* to display the *sampling variation* of the *sample proportion* `\(\hat{p}\)`.

* We computed the *standard deviation* of the *sampling distribution* of `\(\hat{p}\)`. This standard deviation has a special name: ***standard error*** of the *point estimate* `\(\hat{p}\)`.

--

* Let's reproduce the summary table, this time labeling it properly:

Sample Size `\((n)\)` | Standard Error of `\(\hat{p}\)`
:---------:|:--------------:
25 | 0.10
50 | 0.07
100 | 0.05

* Key takeaway: as the *sample size* `\(n\)` goes up, the “typical” error of your *point estimate* will go down, as quantified by the *standard error*.
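---

# Sketch: Where Do These Standard Errors Come From?

The table above was obtained by simulation. As a cross-check, the textbook formula for the standard error of a sample proportion, `\(\text{SE}(\hat{p}) = \sqrt{p(1-p)/n}\)`, gives nearly the same numbers. The formula is not derived in this lecture and assumes independent draws, so it slightly overstates the spread when sampling without replacement from only 713 pasta. The value `p_guess = 0.5` is an assumption, picked because the histograms are centered near one half.

``` r
# Assumption: a rough guess for the proportion of green pasta
p_guess <- 0.5

# Sample sizes used on the slides
n <- c(25, 50, 100)

# Textbook standard error of a sample proportion (independent draws assumed)
se_theory <- sqrt(p_guess * (1 - p_guess) / n)

round(se_theory, 2)  # 0.10 0.07 0.05
```

Because `\(n\)` sits under a square root, quadrupling the sample size (25 to 100) only halves the standard error (0.10 to 0.05).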
---

# Putting It All Together

* ***Point estimates*** from ***random samples*** provide a *good guess* of the true unknown ***population parameter***.

--

* How good? Sometimes `\(\hat{p}\)` will be far from `\(p\)`, sometimes close. There's ***sampling variation***.

--

* ***On average***, our estimates will be correct. This is because of random sampling. We say that:

> ### `\(\hat{p}\)` is an ***unbiased estimator*** of `\(p\)`, i.e. `\(\mathop{\mathbb{E}}[\hat{p}] = p\)`

--

* What is the true population proportion `\(p\)` of green pasta in the population of `\(N=713\)` pasta?

--

``` r
sum(bowl$color == "green")/nrow(bowl)
```

```
## [1] 0.4936886
```

--

* Let's insert the ***true population proportion*** `\(p=0.49\)` into our previous plots!

---

# Visualizing Unbiasedness and Sampling Variation

<img src="chapter_sampling_files/figure-html/unnamed-chunk-20-1.svg" style="display: block; margin: auto;" />

---

# Some Sampling Scenarios

Scenario | Population parameter | Notation | Point estimate | Symbol(s)
:--: | :--: | :--: | :--: | :--:
1 | Population proportion | `\(p\)` | Sample proportion | `\(\widehat{p}\)`
2 | Population mean | `\(\mu\)` | Sample mean | `\(\overline{x}\)` or `\(\widehat{\mu}\)`
3 | Difference in population proportions | `\(p_1 - p_2\)` | Difference in sample proportions | `\(\widehat{p}_1 - \widehat{p}_2\)`
4 | Difference in population means | `\(\mu_1 - \mu_2\)` | Difference in sample means | `\(\overline{x}_1 - \overline{x}_2\)`
5 | Population regression slope | `\(\beta_1\)` | Fitted regression slope | `\(b_1\)` or `\(\widehat{\beta}_1\)`
6 | Population regression intercept | `\(\beta_0\)` | Fitted regression intercept | `\(b_0\)` or `\(\widehat{\beta}_0\)`

---

# The Central Limit Theorem (CLT)

* The fact that our sample statistics ***converge*** to a *central limit* is well known in statistics.

--

* It's due to a famous result known as the ***central limit theorem***.
--

> ### *Central Limit Theorem:* regardless of what the underlying population distribution looks like, **when sample *means* are based on larger and larger sample sizes, the sampling distribution of these sample *means* becomes both more and more normally shaped and more and more narrow**.

--

* In other words, their sampling distribution increasingly follows a ***normal distribution*** and the *variation of these sampling distributions gets smaller*, as quantified by their ***standard errors***.

---

# Task

1. Why do we not take 1000 samples "by hand"?

1. Install the `moderndive` package. To install a new package, i.e., for the first time, run `install.packages("moderndive")` in the console. Then, in your script, right below `library(tidyverse)`, add `library(moderndive)`. You will never need to install the package again moving forward.

1. Load the [data](https://github.com/adamsoliman/IntroEconometrics/blob/master/data%20for%20tasks/pasta.csv) into R, and call the object `pastabowl`.

1. Obtain 100 samples of size 10 using the `rep_sample_n()` function.

1. Calculate the proportion of green pasta in each sample.

1. Plot a histogram of the obtained proportion of green pasta in each sample.

1. Redo the procedure above with 100 samples of size 30 using the `rep_sample_n()` function. Then calculate the proportion of green pasta in each sample and plot a histogram.

1. Redo the procedure above with 100 samples of size 1000 using the `rep_sample_n()` function. Then calculate the proportion of green pasta in each sample and plot a histogram.

1. What are the two key concepts relating to sampling for estimation that we just highlighted?

---

# On the way to causality [chapter sampling]

✅ How to manage data? Read it, tidy it, visualise it!

✅ How to summarise relationships between variables? Simple and multiple linear regression, non-linear regressions, interactions...

✅ What is causality?

✅ **What if we don't observe an entire population?** Sampling!

❌ Are our findings just due to randomness?

❌ How to find exogeneity in practice?

---

class: title-slide-final, middle

# THANKS

To the amazing [moderndive](https://moderndive.com/) team!
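
---

# Appendix: Checking the CLT by Simulation

The Central Limit Theorem slide claims that sample means become more normally shaped and more tightly concentrated as the sample size grows, whatever the population looks like. The base-R sketch below tries this on a deliberately skewed population. The exponential population, the sample sizes 5, 30 and 200, and the helper names `sample_means` and `skewness` are all assumptions made for illustration; none of them come from the pasta example.

``` r
set.seed(4050)  # arbitrary seed so the draws can be reproduced

# Assumption: a strongly right-skewed population (exponential with mean 1)
sample_means <- function(n, reps = 5000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

# Simple moment-based skewness: roughly 0 for a symmetric, bell-shaped distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

sizes <- c(5, 30, 200)
means_by_size <- lapply(sizes, sample_means)

clt_summary <- data.frame(
  n        = sizes,
  std_err  = sapply(means_by_size, sd),       # spread of the sample means
  skewness = sapply(means_by_size, skewness)  # asymmetry of the sample means
)
clt_summary
```

Both columns should shrink as `n` grows: a smaller `std_err` is the "more and more narrow" part of the theorem, and a skewness closer to zero is the "more and more normally shaped" part.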