Sampling - Tasks

Task 1

1. Create a data.frame containing the proportions of green pasta from the previous slide. Name it pasta and name the variable containing the proportions prop_green. (Hint: to create a data.frame, use the data.frame() function.)

pasta <- data.frame(
    prop_green = c(0.7, 0.7, 0.5, 0.5, 0.3, 0.5, 0.4, 0.45, 0.55,
                   0.4, 0.35, 0.45, 0.45, 0.7, 0.55, 0.5, 0.35, 0.65)
)

2. Create a histogram of these proportions using ggplot2. Use these parameters in geom_histogram(): boundary = 0.325, binwidth = 0.05.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

pasta %>%
    ggplot(aes(x = prop_green)) +
    geom_histogram(boundary = 0.325, binwidth = 0.05, color = "white", fill = "darkgreen")

3. What do you observe?

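The histogram is roughly bell-shaped and centred around 0.5: it begins to resemble a normal distribution, although with so few samples the shape is still uneven.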
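To back up the visual impression with numbers, here is a minimal base R sketch that summarises the same proportions. The names n_samples, centre, spread and bins are introduced only for this illustration, and the bin edges are an assumption chosen to match boundary = 0.325 and binwidth = 0.05 from the histogram.

```r
# Proportions copied from the Task 1 data.frame
pasta <- data.frame(
    prop_green = c(0.7, 0.7, 0.5, 0.5, 0.3, 0.5, 0.4, 0.45, 0.55,
                   0.4, 0.35, 0.45, 0.45, 0.7, 0.55, 0.5, 0.35, 0.65)
)

n_samples <- nrow(pasta)            # number of recorded proportions
centre    <- mean(pasta$prop_green) # where the histogram is centred
spread    <- sd(pasta$prop_green)   # how far proportions vary around it

# Bin edges of width 0.05 that include 0.325 (assumed to mirror the plot)
edges <- 0.275 + 0.05 * (0:9)
bins  <- cut(pasta$prop_green, breaks = edges)

# Count of proportions per bin, i.e. the heights of the histogram bars
table(bins)
```

The mean gives the centre of the histogram and the table shows how many bars on each side of it are populated, which is a quick check on how symmetric the shape really is.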

Task 2

Instead of taking only 33 samples, let’s take 1000!

1. Why do we not take 1000 samples “by hand”?

It would be too time-consuming.

2. Load the data into an object named pasta.

pasta <- read.csv("https://www.dropbox.com/s/qpjsk0rfgc0gx80/pasta.csv?dl=1")

3. Obtain 1000 samples of size 50 using the rep_sample_n() function from the moderndive package.

library(moderndive)
virtual_samples <- pasta %>% 
    rep_sample_n(size = 50, reps = 1000)
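For intuition about what repeated sampling does, the following base R sketch imitates it on a made-up bowl. It is not the moderndive implementation: draw_samples() and bowl are hypothetical names, the bowl composition (half green) is invented, and drawing without replacement within each sample is assumed here to mirror the rep_sample_n() default.

```r
set.seed(2021)  # arbitrary seed so the sketch is reproducible

# Hypothetical bowl of 1,000 pieces, half of them green (invented composition,
# NOT the course data)
bowl <- data.frame(color = rep(c("green", "other"), each = 500))

# Draw `reps` samples of `size` rows each and label them with a replicate id
draw_samples <- function(data, size, reps) {
    samples <- lapply(seq_len(reps), function(i) {
        rows <- sample(nrow(data), size = size, replace = FALSE)
        data.frame(replicate = i, color = data$color[rows])
    })
    do.call(rbind, samples)  # stack all samples into one long data.frame
}

virtual_samples_demo <- draw_samples(bowl, size = 50, reps = 1000)
head(virtual_samples_demo)
```

The result has one row per sampled piece (50 × 1000 rows) and a replicate column identifying the sample each row belongs to, which is the layout the next step groups on.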

4. Calculate the proportion of green pasta in each sample.

virtual_prop_green <- virtual_samples %>% 
    group_by(replicate) %>% 
    summarize(
        num_green = sum(color == "green"),
        sample_n = n()) %>% 
    mutate(prop_green = num_green / sample_n)

## `summarise()` ungrouping output (override with `.groups` argument)
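The proportion of green pasta in a sample is simply the mean of the logical vector color == "green", so the count-then-divide step can also be written in one go. The sketch below shows this in base R on a tiny invented data set; toy_samples and prop_green_toy are hypothetical names used only here.

```r
# Toy stand-in for virtual_samples: 3 replicates of 4 draws each (invented data)
toy_samples <- data.frame(
    replicate = rep(1:3, each = 4),
    color = c("green", "red",   "green", "green",
              "red",   "red",   "green", "yellow",
              "green", "green", "green", "green")
)

# TRUE counts as 1 and FALSE as 0, so the mean of the comparison
# is the share of green pieces within each replicate
prop_green_toy <- tapply(toy_samples$color == "green",
                         toy_samples$replicate,
                         mean)
prop_green_toy
```

In the dplyr pipeline the same idea would read summarize(prop_green = mean(color == "green")) after group_by(replicate).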

5. Plot a histogram of the obtained proportion of green pasta in each sample.

virtual_prop_green %>%
    ggplot(aes(x = prop_green)) +
    geom_histogram(
        binwidth = 0.02,
        boundary = 0.41,
        color = "white",
        fill = "darkgreen"
    ) +
    labs(
        x = "Proportion of green pasta in sample",
        y = "Frequency",
        title = "Distribution of 1000 samples of size 50"
    ) +
    theme_bw(base_size = 14)

6. What do you observe? Which proportions occur most frequently? How does the shape of the histogram compare to when we took only 33 samples?

The distribution looks very close to a normal distribution, and the proportions that occur most frequently are around 0.5. Compared with taking only 33 samples, the histogram is much smoother and far more clearly bell-shaped.

7. How likely is it that we draw a sample of 50 pasta of which fewer than 20% are green?

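It is extremely unlikely. The sample proportions cluster around 0.5, so with samples of size 50 a proportion below 0.2 lies far out in the left tail of the sampling distribution.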
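A rough back-of-the-envelope check of this answer can be sketched with the binomial distribution. The true share of green pasta is assumed to be 0.5 (suggested by the histogram, not confirmed by the data), and draws are treated as independent, which is only an approximation when sampling without replacement. The names n, p_assumed, max_green and p_below are introduced for this illustration.

```r
n         <- 50   # sample size used in Task 2
p_assumed <- 0.5  # ASSUMED true proportion of green pasta

# Fewer than 20% of 50 means fewer than 10 green pieces, i.e. at most 9
max_green <- 9

# P(X <= 9) for X ~ Binomial(50, 0.5)
p_below <- pbinom(max_green, size = n, prob = p_assumed)
p_below
```

With the simulated samples from step 4 at hand, the corresponding empirical estimate would be mean(virtual_prop_green$prop_green < 0.2), the share of the 1000 samples whose proportion falls below 0.2.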


Florian Oswald, Gustave Kenedi and Pierre Villedieu

2021-03-31
