1/. Create a data.frame containing the proportions of green pasta from the previous slide. Name it pasta and name the variable containing the proportions prop_green. (Hint: to create a data.frame you need to use the data.frame() function.)
pasta <- data.frame(prop_green = c(0.7,0.7,0.5,0.5,0.3,0.5,0.4,0.45,0.55,0.4,0.35,0.45,0.45,0.7,0.55,0.5,0.35,0.65))
2/. Create a histogram of these proportions using ggplot2. Use these parameters in geom_histogram(): boundary = 0.325, binwidth = 0.05.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
pasta %>%
  ggplot(aes(x = prop_green)) +
  geom_histogram(boundary = 0.325, binwidth = 0.05, color = "white", fill = "darkgreen")
3/. What do you observe?
The histogram begins to resemble a normal distribution: it is roughly bell-shaped, with most proportions close to 0.5.
Instead of taking only 33 samples, let’s take 1000!
1. Why do we not take 1000 samples “by hand”?
It would be too time-consuming.
2. Load the data into an object pasta.
pasta <- read.csv("https://www.dropbox.com/s/qpjsk0rfgc0gx80/pasta.csv?dl=1")
3. Obtain 1000 samples of size 50 using the rep_sample_n() function from the moderndive package.
library(moderndive)
virtual_samples <- pasta %>%
  rep_sample_n(size = 50, reps = 1000)
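To make the idea of repeated sampling concrete, here is a minimal base R sketch of the same procedure. It does not use the real data: the object bowl is a hypothetical stand-in with 1000 pieces, half of them green, and the seed is arbitrary. Like rep_sample_n() with its default settings, sample() draws without replacement.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical bowl (an assumption, not the pasta data): 1000 pieces, half green
bowl <- rep(c("green", "red"), each = 500)

# Draw 1000 samples of size 50 without replacement
# and record the share of green pieces in each sample
prop_green <- replicate(1000, mean(sample(bowl, size = 50) == "green"))

length(prop_green)  # one proportion per sample
range(prop_green)   # smallest and largest observed proportion
```

Each element of prop_green plays the role of one replicate in virtual_samples, already reduced to its proportion of green pasta.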
4. Calculate the proportion of green pasta in each sample.
virtual_prop_green <- virtual_samples %>%
  group_by(replicate) %>%
  summarize(
    num_green = sum(color == "green"),
    sample_n = n()) %>%
  mutate(prop_green = num_green / sample_n)
## `summarise()` ungrouping output (override with `.groups` argument)
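As a possible shortcut, the proportion can also be computed in one step, because the mean of a logical vector is the share of TRUE values. The sketch below uses a small made-up data frame (toy_samples is hypothetical, not the pasta data) and sets the .groups argument explicitly, which should avoid the ungrouping message in dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data: 2 replicates of 4 pasta pieces each
toy_samples <- data.frame(
  replicate = rep(1:2, each = 4),
  color = c("green", "red", "green", "green",
            "red", "red", "green", "red")
)

# mean() of a logical vector gives the proportion of TRUE values
toy_prop_green <- toy_samples %>%
  group_by(replicate) %>%
  summarize(prop_green = mean(color == "green"), .groups = "drop")

toy_prop_green
```

Replicate 1 contains three green pieces out of four and replicate 2 one out of four, so the two proportions are 0.75 and 0.25.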
5. Plot a histogram of the obtained proportion of green pasta in each sample.
virtual_prop_green %>%
  ggplot(aes(x = prop_green)) +
  geom_histogram(
    binwidth = 0.02,
    boundary = 0.41,
    color = "white",
    fill = "darkgreen") +
  labs(x = "Proportion of green pasta in sample",
       y = "Frequency",
       title = "Distribution of 1000 samples of size 50") +
  theme_bw(base_size = 14)
6. What do you observe? Which proportions occur most frequently? How does the shape of the histogram compare to when we took only 33 samples?
The distribution looks very close to a normal distribution. The proportions that occur most frequently are around 0.5. Compared to taking only 33 samples, the histogram is much closer to a normal distribution.
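7. How likely is it that a sample of 50 pasta pieces contains less than 20% green ones?
It is extremely unlikely: the simulated proportions cluster around 0.5, so a proportion below 0.2 lies far out in the left tail of the distribution.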
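For a rough sense of how unlikely this is, the sketch below computes the probability under a binomial model. Two assumptions go into it: the true proportion of green pasta is taken to be 0.5 (read off the histogram, not a known value), and the binomial model treats draws as independent, which is only an approximation when sampling without replacement from a finite bowl.

```r
n <- 50          # sample size used in the exercise
p_green <- 0.5   # assumed proportion of green pasta, not the true bowl value

# "Less than 20% green" in a sample of 50 means at most 9 green pieces
max_green <- 9

# P(X <= 9) for X ~ Binomial(n = 50, p = 0.5)
prob_below_20 <- pbinom(max_green, size = n, prob = p_green)
prob_below_20
```

Under these assumptions the probability is tiny, far below one in ten thousand, which agrees with what the histogram suggests.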