---
title: "Sampling - Tasks"
author: "Florian Oswald, Gustave Kenedi and Pierre Villedieu"
date: "`r Sys.Date()`"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Task 1

1/. Create a data.frame containing the proportions of green pasta from the previous slide. Name it `pasta` and name the variable containing the proportions `prop_green`. (Hint: to create a data.frame you need to use the `data.frame()` function.)

```{r}
pasta <- data.frame(prop_green = c(0.7,0.7,0.5,0.5,0.3,0.5,0.4,0.45,0.55,0.4,0.35,0.45,0.45,0.7,0.55,0.5,0.35,0.65))
```

2/. Create a histogram of these proportions using `ggplot2`. Use these parameters in `geom_histogram()`: `boundary = 0.325, binwidth = 0.05`.

```{r}
library(tidyverse)
pasta %>%
    ggplot(aes(x = prop_green)) +
    geom_histogram(boundary = 0.325, binwidth = 0.05, color = "white", fill = "darkgreen")
```

3/. What do you observe?

**It kind of starts looking a bit like a normal distribution.**


## Task 2 

Instead of taking only 33 samples, let's take ***1000***!

1\. Why do we not take 1000 samples "by hand"?

**It would be too time-consuming.**

2\. Load the [data](https://www.dropbox.com/s/qpjsk0rfgc0gx80/pasta.csv?dl=1) into an object `pasta`.

```{r}
pasta <- read.csv("https://www.dropbox.com/s/qpjsk0rfgc0gx80/pasta.csv?dl=1")
```

3\. Obtain 1000 samples of size 50 using the `rep_sample_n()` function from the `moderndive` package.

```{r}
library(moderndive)
virtual_samples <- pasta %>% 
    rep_sample_n(size = 50, reps = 1000)
```

4\. Calculate the proportion of green pasta in each sample.

```{r}
virtual_prop_green <- virtual_samples %>% 
    group_by(replicate) %>% 
    summarize(
        num_green = sum(color == "green"),
        sample_n = n()) %>% 
    mutate(prop_green = num_green / sample_n)
```

5\. Plot a histogram of the obtained proportion of green pasta in each sample.

```{r}
virtual_prop_green %>% ggplot(
    aes(x = prop_green)) +
    geom_histogram(
        binwidth = 0.02,
        boundary = 0.41,
        color = "white",
        fill = "darkgreen") +
    labs(x = "Proportion of green pasta in sample",
         y = "Frequency",
         title = "Distribution of 1000 samples of size 50") +
    theme_bw(base_size = 14)
```

6\. What do you observe? Which proportions occur most frequently? How does the shape of the histogram compare to when we took only 33 samples?

**The distribution looks very close to a normal distribution. The proportions that occur most frequently are around 0.5. Compared to taking only 33 samples, the distribution looks significantly closer to a normal distribution.**

7\. How likely is it that we sample 50 pasta of which less than 20% are green?

**It is extremely unlikely.**