ScPoProgramming

.title[
# ScPoProgramming
]
.subtitle[
## <code>R</code> Tidyverse
]
.author[
### Florian Oswald
]
.date[
### SciencesPo Paris <br> 2024-10-14
]

---

# Due Credit

1. The first part of these slides was developed jointly with
    - [Gustave Kenedi](https://gustavekenedi.github.io/)
    - [Mylène Feuillade](https://github.com/mylenefeuillade)
    - [Pierre Vielledieu](https://github.com/pvielledieu)
    
2. The second part is copied from [Grant McDermott's](https://grantmcdermott.com/) amazing data science lectures.

---
  
# Working With Data

* Economists work with `data`.

* According to a [2014 NYTimes article](https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html), "data scientists [...] spend from ***50 percent to 80 percent of their time*** mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."

* In the next two lectures you will learn the basics of ***tidying***, ***visualising*** and ***summarising*** data

---

# Tidying Data

---
layout: true

---

# Intro to `dplyr`

* [`dplyr`](https://dplyr.tidyverse.org) is part of the [`tidyverse`](https://www.tidyverse.org) package family.

* [`data.table`](https://github.com/Rdatatable/data.table/wiki) is an alternative. It is very fast and highly recommended for **big** data. We will dedicate an entire [lecture](https://raw.githack.com/floswald/lectures/master/05-datatable/05-datatable.html#1) to it! 💪

* Both have pros and cons. We'll start you off with `dplyr`.

---

# `dplyr` Overview

* You are ***highly encouraged*** to read through [Hadley Wickham's chapter on data transforms](https://r4ds.had.co.nz/transform.html). It's clear and concise.

* Also check out this great "cheatsheet" [here](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf).

* The package is organized around a set of **verbs**, i.e. *actions* to be taken.

* We operate on `data.frames` or `tibbles` (*nicer looking* data.frames.)

* All *verbs* work as follows:

`$$\text{verb}(\underbrace{\text{data.frame}}_{\text{1st argument}}, \underbrace{\text{what to do}}_\text{2nd argument})$$`

* Alternatively you can (and should) use the `pipe` operator `%>%` (part of the `magrittr` package):

`$$\underbrace{\text{data.frame}}_{\text{1st argument}} \underbrace{\text{ %>% }}_{\text{"pipe" operator}} \text{verb}(\underbrace{\text{what to do}}_\text{2nd argument})$$`
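
A minimal sketch of the two equivalent call styles. The tibble `toy` and its columns are made up for illustration and are not part of the course data:

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(id = 1:4, score = c(10, 25, 5, 40))

# Style 1: the data frame is the first argument of the verb
direct <- filter(toy, score > 8)

# Style 2: the pipe passes `toy` in as that first argument
piped <- toy %>% filter(score > 8)

# Both styles give the same result
identical(direct, piped)
```

Both calls keep the same three rows; the piped version simply reads left to right.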

---
background-image: url("https://media.giphy.com/media/VzeoEQydDR2c93O0HG/giphy.gif")
background-position: 60% 50%

<br>
<br>
<br>
## THE PIPE in `R`?! 
## `%>%` or `|>`
<br>
<br>
## YES!! THE PIPE!!! 
## LIKE OUR UNIX PIPE!! `|`

---

# Main `dplyr` Verbs

1. `filter()`: Choose observations based on a certain condition (i.e. subset)

1. `arrange()`: Reorder rows

1. `select()`: Select variables by name

1. `mutate()`: Create new variables out of existing ones

1. `summarise()`: Collapse data to a single summary

1. `group_by()`: All the above can be used in conjunction with `group_by()` to use functions on groups rather than entire data
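
`arrange()` is the one verb in this list that the following slides do not walk through, so here is a small sketch. The tibble `toy` is hypothetical:

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(
  state = c("Ohio", "Iowa", "Utah", "Iowa"),
  share = c(44, 48, 39, 51)
)

# Rows are sorted in ascending order by default
asc <- toy %>% arrange(share)

# Wrapping the column in desc() sorts from largest to smallest
dsc <- toy %>% arrange(desc(share))
```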

---

# Data: 2016 US election polls from the `dslabs` package

* This dataset contains __real__ data on polls conducted during the 2016 US presidential election, compiled by [fivethirtyeight](https://fivethirtyeight.com)

```r
library(dslabs)
library(tidyverse)
data(polls_us_election_2016, package = "dslabs") # load data from package
polls_us_election_2016 <- as_tibble(polls_us_election_2016) # convert to a tibble
head(polls_us_election_2016[,1:6]) # show first 6 lines of first 6 variables
```

```
## # A tibble: 6 × 6
##   state startdate  enddate    pollster                             grade sampl…¹
##   <fct> <date>     <date>     <fct>                                <fct>   <int>
## 1 U.S.  2016-11-03 2016-11-06 ABC News/Washington Post             A+       2220
## 2 U.S.  2016-11-01 2016-11-07 Google Consumer Surveys              B       26574
## 3 U.S.  2016-11-02 2016-11-06 Ipsos                                A-       2195
## 4 U.S.  2016-11-04 2016-11-07 YouGov                               B        3677
## 5 U.S.  2016-11-03 2016-11-06 Gravis Marketing                     B-      16639
## 6 U.S.  2016-11-03 2016-11-06 Fox News/Anderson Robbins Research/… A        1295
## # … with abbreviated variable name ¹samplesize
```

🚨 This is a `tibble` (more informative than `data.frame`)

---

# Data: 2016 US election polls from the `dslabs` package

What variables does this dataset contain?

```r
str(polls_us_election_2016)
```

```
## tibble [4,208 × 15] (S3: tbl_df/tbl/data.frame)
##  $ state           : Factor w/ 57 levels "Alabama","Alaska",..: 50 50 50 50 50 50 50 50 37 50 ...
##  $ startdate       : Date[1:4208], format: "2016-11-03" "2016-11-01" ...
##  $ enddate         : Date[1:4208], format: "2016-11-06" "2016-11-07" ...
##  $ pollster        : Factor w/ 196 levels "ABC News/Washington Post",..: 1 63 81 194 65 55 18 113 195 76 ...
##  $ grade           : Factor w/ 10 levels "D","C-","C","C+",..: 10 6 8 6 5 9 8 8 NA 8 ...
##  $ samplesize      : int [1:4208] 2220 26574 2195 3677 16639 1295 1426 1282 8439 1107 ...
##  $ population      : chr [1:4208] "lv" "lv" "lv" "lv" ...
##  $ rawpoll_clinton : num [1:4208] 47 38 42 45 47 ...
##  $ rawpoll_trump   : num [1:4208] 43 35.7 39 41 43 ...
##  $ rawpoll_johnson : num [1:4208] 4 5.46 6 5 3 3 5 6 6 7.1 ...
##  $ rawpoll_mcmullin: num [1:4208] NA NA NA NA NA NA NA NA NA NA ...
##  $ adjpoll_clinton : num [1:4208] 45.2 43.3 42 45.7 46.8 ...
##  $ adjpoll_trump   : num [1:4208] 41.7 41.2 38.8 40.9 42.3 ...
##  $ adjpoll_johnson : num [1:4208] 4.63 5.18 6.84 6.07 3.73 ...
##  $ adjpoll_mcmullin: num [1:4208] NA NA NA NA NA NA NA NA NA NA ...
```

---

# `dplyr` Verbs

### Filter observations

```r
filter()
```

---

```r
*polls_us_election_2016
```

```
## # A tibble: 4,208 × 15
##    state     startdate  enddate    polls…¹ grade sampl…² popul…³ rawpo…⁴ rawpo…⁵
##    <fct>     <date>     <date>     <fct>   <fct>   <int> <chr>     <dbl>   <dbl>
##  1 U.S.      2016-11-03 2016-11-06 ABC Ne… A+       2220 lv         47      43  
##  2 U.S.      2016-11-01 2016-11-07 Google… B       26574 lv         38.0    35.7
##  3 U.S.      2016-11-02 2016-11-06 Ipsos   A-       2195 lv         42      39  
##  4 U.S.      2016-11-04 2016-11-07 YouGov  B        3677 lv         45      41  
##  5 U.S.      2016-11-03 2016-11-06 Gravis… B-      16639 rv         47      43  
##  6 U.S.      2016-11-03 2016-11-06 Fox Ne… A        1295 lv         48      44  
##  7 U.S.      2016-11-02 2016-11-06 CBS Ne… A-       1426 lv         45      41  
##  8 U.S.      2016-11-03 2016-11-05 NBC Ne… A-       1282 lv         44      40  
##  9 New Mexi… 2016-11-06 2016-11-06 Zia Po… <NA>     8439 lv         46      44  
## 10 U.S.      2016-11-04 2016-11-07 IBD/TI… A-       1107 lv         41.2    42.7
## # … with 4,198 more rows, 6 more variables: rawpoll_johnson <dbl>,
## #   rawpoll_mcmullin <dbl>, adjpoll_clinton <dbl>, adjpoll_trump <dbl>,
## #   adjpoll_johnson <dbl>, adjpoll_mcmullin <dbl>, and abbreviated variable
## #   names ¹pollster, ²samplesize, ³population, ⁴rawpoll_clinton, ⁵rawpoll_trump
```

---

```r
polls_us_election_2016 %>%
* filter(samplesize > 2000)
```

```
## # A tibble: 403 × 15
##    state     startdate  enddate    polls…¹ grade sampl…² popul…³ rawpo…⁴ rawpo…⁵
##    <fct>     <date>     <date>     <fct>   <fct>   <int> <chr>     <dbl>   <dbl>
##  1 U.S.      2016-11-03 2016-11-06 ABC Ne… A+       2220 lv         47      43  
##  2 U.S.      2016-11-01 2016-11-07 Google… B       26574 lv         38.0    35.7
##  3 U.S.      2016-11-02 2016-11-06 Ipsos   A-       2195 lv         42      39  
##  4 U.S.      2016-11-04 2016-11-07 YouGov  B        3677 lv         45      41  
##  5 U.S.      2016-11-03 2016-11-06 Gravis… B-      16639 rv         47      43  
##  6 New Mexi… 2016-11-06 2016-11-06 Zia Po… <NA>     8439 lv         46      44  
##  7 U.S.      2016-11-05 2016-11-07 The Ti… <NA>     2521 lv         45      40  
##  8 U.S.      2016-11-01 2016-11-07 USC Do… <NA>     2972 lv         43.6    46.8
##  9 Georgia   2016-11-03 2016-11-06 Gravis… B-       2002 rv         44      48  
## 10 Virginia  2016-11-01 2016-11-02 Reming… <NA>     3076 lv         46      44  
## # … with 393 more rows, 6 more variables: rawpoll_johnson <dbl>,
## #   rawpoll_mcmullin <dbl>, adjpoll_clinton <dbl>, adjpoll_trump <dbl>,
## #   adjpoll_johnson <dbl>, adjpoll_mcmullin <dbl>, and abbreviated variable
## #   names ¹pollster, ²samplesize, ³population, ⁴rawpoll_clinton, ⁵rawpoll_trump
```

---

.right-wide[
Standard suite of comparison operators:
- `>`: greater than,
- `<`: smaller than,
- `>=`: greater than or equal to,
- `<=`: smaller than or equal to,
- `!=`: not equal to,
- `==`: equal to.

Logical operators:
1. `x & y`: `x` **and** `y`
1. `x | y`: `x` **or** `y`
1. `!y`: **not** `y`
]
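
A short base `R` sketch of how these operators combine element by element. The two vectors are made up for illustration:

```r
# Hypothetical sample sizes and grades for five polls
samplesize <- c(500, 2500, 1200, 3000, 800)
grade      <- c("A", "B", "A", "A", "C")

big     <- samplesize > 2000  # one TRUE/FALSE per poll
grade_a <- grade == "A"

both   <- big & grade_a  # TRUE only where both conditions hold
either <- big | grade_a  # TRUE where at least one condition holds
not_a  <- !grade_a       # flips TRUE and FALSE

# Positions of the polls that satisfy both conditions
which(both)
```

`filter()` keeps exactly the rows for which such a logical vector is `TRUE`.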

---

.right-wide[
`R` has the convenient `x %in% y` operator, which returns `TRUE` if `x` is *a member of* `y`. Negate it with `!(x %in% y)`.

```r
3 %in% 1:3
```

```
## [1] TRUE
```

```r
c(2,5) %in% 2:10  # also vectorized
```

```
## [1] TRUE TRUE
```

```r
c("S","Po") %in% c("Sciences","Po")  # also strings
```

```
## [1] FALSE  TRUE
```
]

---

.right-wide[
*Example:* Which A-graded poll with more than 2,000 respondents had Trump winning more than 45% of the vote?
]

---

.right-wide[
*Example:* Which A-graded poll with more than 2,000 respondents had Trump winning more than 45% of the vote?
]

```r
*polls_us_election_2016
```

---

.right-wide[
*Example:* Which A-graded poll with more than 2,000 respondents had Trump winning more than 45% of the vote?
]

```r
polls_us_election_2016 %>%
* filter(grade == "A" & samplesize > 2000 & rawpoll_trump > 45)
```

```
## # A tibble: 1 × 15
##   state   startdate  enddate    pollster   grade sampl…¹ popul…² rawpo…³ rawpo…⁴
##   <fct>   <date>     <date>     <fct>      <fct>   <int> <chr>     <dbl>   <dbl>
## 1 Indiana 2016-04-26 2016-04-28 Marist Co… A        2149 rv           41      48
## # … with 6 more variables: rawpoll_johnson <dbl>, rawpoll_mcmullin <dbl>,
## #   adjpoll_clinton <dbl>, adjpoll_trump <dbl>, adjpoll_johnson <dbl>,
## #   adjpoll_mcmullin <dbl>, and abbreviated variable names ¹samplesize,
## #   ²population, ³rawpoll_clinton, ⁴rawpoll_trump
```

---

# `dplyr` Verbs

### Filter

### Create new variable(s)

```r
mutate()
```

---

.right-wide[
*Example:* What was
1. the combined vote share of Trump and Clinton for each poll?
2. the difference between Trump's raw poll vote share and 538's adjusted vote share?
]

---

.right-wide[
*Example:* What was
1. the combined vote share of Trump and Clinton for each poll?
2. the difference between Trump's raw poll vote share and 538's adjusted vote share?
]

```r
polls_us_election_2016
```

---

.right-wide[
*Example:* What was
1. the combined vote share of Trump and Clinton for each poll?
2. the difference between Trump's raw poll vote share and 538's adjusted vote share?
]

```r
polls_us_election_2016 %>%
* mutate(trump_clinton_tot = rawpoll_trump + rawpoll_clinton,
*        trump_raw_adj_diff = rawpoll_trump - adjpoll_trump)
```

```
## # A tibble: 4,208 × 17
##    state     startdate  enddate    polls…¹ grade sampl…² popul…³ rawpo…⁴ rawpo…⁵
##    <fct>     <date>     <date>     <fct>   <fct>   <int> <chr>     <dbl>   <dbl>
##  1 U.S.      2016-11-03 2016-11-06 ABC Ne… A+       2220 lv         47      43  
##  2 U.S.      2016-11-01 2016-11-07 Google… B       26574 lv         38.0    35.7
##  3 U.S.      2016-11-02 2016-11-06 Ipsos   A-       2195 lv         42      39  
##  4 U.S.      2016-11-04 2016-11-07 YouGov  B        3677 lv         45      41  
##  5 U.S.      2016-11-03 2016-11-06 Gravis… B-      16639 rv         47      43  
##  6 U.S.      2016-11-03 2016-11-06 Fox Ne… A        1295 lv         48      44  
##  7 U.S.      2016-11-02 2016-11-06 CBS Ne… A-       1426 lv         45      41  
##  8 U.S.      2016-11-03 2016-11-05 NBC Ne… A-       1282 lv         44      40  
##  9 New Mexi… 2016-11-06 2016-11-06 Zia Po… <NA>     8439 lv         46      44  
## 10 U.S.      2016-11-04 2016-11-07 IBD/TI… A-       1107 lv         41.2    42.7
## # … with 4,198 more rows, 8 more variables: rawpoll_johnson <dbl>,
## #   rawpoll_mcmullin <dbl>, adjpoll_clinton <dbl>, adjpoll_trump <dbl>,
## #   adjpoll_johnson <dbl>, adjpoll_mcmullin <dbl>, trump_clinton_tot <dbl>,
## #   trump_raw_adj_diff <dbl>, and abbreviated variable names ¹pollster,
## #   ²samplesize, ³population, ⁴rawpoll_clinton, ⁵rawpoll_trump
```

---

.right-wide[
*Example:* What was
1. the combined vote share of Trump and Clinton for each poll?
2. the difference between Trump's raw poll vote share and 538's adjusted vote share?
]

```r
polls_us_election_2016 %>%
  mutate(trump_clinton_tot = rawpoll_trump + rawpoll_clinton,
         trump_raw_adj_diff = rawpoll_trump - adjpoll_trump) %>%
* names()
```

```
##  [1] "state"              "startdate"          "enddate"           
##  [4] "pollster"           "grade"              "samplesize"        
##  [7] "population"         "rawpoll_clinton"    "rawpoll_trump"     
## [10] "rawpoll_johnson"    "rawpoll_mcmullin"   "adjpoll_clinton"   
## [13] "adjpoll_trump"      "adjpoll_johnson"    "adjpoll_mcmullin"  
## [16] "trump_clinton_tot"  "trump_raw_adj_diff"
```

---

# `dplyr` Verbs

### Filter

### Mutate

### Keep some variable(s)

```r
select()
```

---

.right-wide[
*Example:* Only keep the variables `state, startdate, enddate, pollster, rawpoll_clinton, rawpoll_trump`
]

---

.right-wide[
*Example:* Only keep the variables `state, startdate, enddate, pollster, rawpoll_clinton, rawpoll_trump`
]

```r
polls_us_election_2016
```

---

.right-wide[
*Example:* Only keep the variables `state, startdate, enddate, pollster, rawpoll_clinton, rawpoll_trump`
]

```r
polls_us_election_2016 %>%
* select(state,startdate,enddate,pollster,rawpoll_clinton,rawpoll_trump)
```

```
## # A tibble: 4,208 × 6
##    state      startdate  enddate    pollster                     rawpo…¹ rawpo…²
##    <fct>      <date>     <date>     <fct>                          <dbl>   <dbl>
##  1 U.S.       2016-11-03 2016-11-06 ABC News/Washington Post        47      43  
##  2 U.S.       2016-11-01 2016-11-07 Google Consumer Surveys         38.0    35.7
##  3 U.S.       2016-11-02 2016-11-06 Ipsos                           42      39  
##  4 U.S.       2016-11-04 2016-11-07 YouGov                          45      41  
##  5 U.S.       2016-11-03 2016-11-06 Gravis Marketing                47      43  
##  6 U.S.       2016-11-03 2016-11-06 Fox News/Anderson Robbins R…    48      44  
##  7 U.S.       2016-11-02 2016-11-06 CBS News/New York Times         45      41  
##  8 U.S.       2016-11-03 2016-11-05 NBC News/Wall Street Journal    44      40  
##  9 New Mexico 2016-11-06 2016-11-06 Zia Poll                        46      44  
## 10 U.S.       2016-11-04 2016-11-07 IBD/TIPP                        41.2    42.7
## # … with 4,198 more rows, and abbreviated variable names ¹rawpoll_clinton,
## #   ²rawpoll_trump
```

---

.right-wide[
*Example:* Only keep the variables `state, startdate, enddate, pollster, rawpoll_clinton, rawpoll_trump`
]

```r
polls_us_election_2016 %>%
  select(state,startdate,enddate,pollster,rawpoll_clinton,rawpoll_trump) %>%
* names()
```

```
## [1] "state"           "startdate"       "enddate"         "pollster"       
## [5] "rawpoll_clinton" "rawpoll_trump"
```

---

# `dplyr` Verbs

### Filter

### Mutate

### Select

### Compute statistic

```r
summarise()
```

---

```r
polls_us_election_2016
```

---

```r
polls_us_election_2016 %>%
* summarise(max_trump = max(rawpoll_trump))
```

```
## # A tibble: 1 × 1
##   max_trump
##       <dbl>
## 1        68
```

---

# `dplyr` Verbs

### Filter

### Mutate

### Select

### Summarise

### Apply function by group

```r
group_by()
```

---

```r
polls_us_election_2016
```

---

```r
polls_us_election_2016 %>%
* group_by(grade)
```

```
## # A tibble: 4,208 × 15
## # Groups:   grade [11]
##    state     startdate  enddate    polls…¹ grade sampl…² popul…³ rawpo…⁴ rawpo…⁵
##    <fct>     <date>     <date>     <fct>   <fct>   <int> <chr>     <dbl>   <dbl>
##  1 U.S.      2016-11-03 2016-11-06 ABC Ne… A+       2220 lv         47      43  
##  2 U.S.      2016-11-01 2016-11-07 Google… B       26574 lv         38.0    35.7
##  3 U.S.      2016-11-02 2016-11-06 Ipsos   A-       2195 lv         42      39  
##  4 U.S.      2016-11-04 2016-11-07 YouGov  B        3677 lv         45      41  
##  5 U.S.      2016-11-03 2016-11-06 Gravis… B-      16639 rv         47      43  
##  6 U.S.      2016-11-03 2016-11-06 Fox Ne… A        1295 lv         48      44  
##  7 U.S.      2016-11-02 2016-11-06 CBS Ne… A-       1426 lv         45      41  
##  8 U.S.      2016-11-03 2016-11-05 NBC Ne… A-       1282 lv         44      40  
##  9 New Mexi… 2016-11-06 2016-11-06 Zia Po… <NA>     8439 lv         46      44  
## 10 U.S.      2016-11-04 2016-11-07 IBD/TI… A-       1107 lv         41.2    42.7
## # … with 4,198 more rows, 6 more variables: rawpoll_johnson <dbl>,
## #   rawpoll_mcmullin <dbl>, adjpoll_clinton <dbl>, adjpoll_trump <dbl>,
## #   adjpoll_johnson <dbl>, adjpoll_mcmullin <dbl>, and abbreviated variable
## #   names ¹pollster, ²samplesize, ³population, ⁴rawpoll_clinton, ⁵rawpoll_trump
```

---

```r
polls_us_election_2016 %>%
  group_by(grade) %>%
* class()
```

```
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
```

---

```r
polls_us_election_2016 %>%
* group_by(grade)
```

---

```r
polls_us_election_2016 %>%
  group_by(grade) %>%
* summarise(mean_vote_clinton = mean(rawpoll_clinton))
```

```
## # A tibble: 11 × 2
##    grade mean_vote_clinton
##    <fct>             <dbl>
##  1 D                  46.7
##  2 C-                 43.2
##  3 C                  41.8
##  4 C+                 44.2
##  5 B-                 43.9
##  6 B                  37.3
##  7 B+                 44.1
##  8 A-                 43.0
##  9 A                  45.3
## 10 A+                 45.8
## 11 <NA>               43.2
```
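
`summarise()` can compute several statistics per group in one call. A minimal sketch on a made-up tibble; `toy`, `n_polls` and `mean_share` are illustrative names:

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(
  grade = c("A", "A", "B", "B", "B"),
  share = c(40, 44, 50, 46, NA)
)

by_grade <- toy %>%
  group_by(grade) %>%
  summarise(
    n_polls    = n(),                       # number of rows in each group
    mean_share = mean(share, na.rm = TRUE)  # ignore missing values
  )

by_grade
```

The result has one row per value of `grade`, just like the table above.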

---
layout: true

# Chaining ⛓ Commands Together

---

Works for all `dplyr` verbs:

```r
polls_us_election_2016
```

---

Works for all `dplyr` verbs:

```r
polls_us_election_2016 %>%
*   mutate(trump_clinton_diff =
*            rawpoll_trump -
*            rawpoll_clinton)
```

```
## # A tibble: 4,208 × 16
##    state     startdate  enddate    polls…¹ grade sampl…² popul…³ rawpo…⁴ rawpo…⁵
##    <fct>     <date>     <date>     <fct>   <fct>   <int> <chr>     <dbl>   <dbl>
##  1 U.S.      2016-11-03 2016-11-06 ABC Ne… A+       2220 lv         47      43  
##  2 U.S.      2016-11-01 2016-11-07 Google… B       26574 lv         38.0    35.7
##  3 U.S.      2016-11-02 2016-11-06 Ipsos   A-       2195 lv         42      39  
##  4 U.S.      2016-11-04 2016-11-07 YouGov  B        3677 lv         45      41  
##  5 U.S.      2016-11-03 2016-11-06 Gravis… B-      16639 rv         47      43  
##  6 U.S.      2016-11-03 2016-11-06 Fox Ne… A        1295 lv         48      44  
##  7 U.S.      2016-11-02 2016-11-06 CBS Ne… A-       1426 lv         45      41  
##  8 U.S.      2016-11-03 2016-11-05 NBC Ne… A-       1282 lv         44      40  
##  9 New Mexi… 2016-11-06 2016-11-06 Zia Po… <NA>     8439 lv         46      44  
## 10 U.S.      2016-11-04 2016-11-07 IBD/TI… A-       1107 lv         41.2    42.7
## # … with 4,198 more rows, 7 more variables: rawpoll_johnson <dbl>,
## #   rawpoll_mcmullin <dbl>, adjpoll_clinton <dbl>, adjpoll_trump <dbl>,
## #   adjpoll_johnson <dbl>, adjpoll_mcmullin <dbl>, trump_clinton_diff <dbl>,
## #   and abbreviated variable names ¹pollster, ²samplesize, ³population,
## #   ⁴rawpoll_clinton, ⁵rawpoll_trump
```

---

Works for all `dplyr` verbs:

```r
polls_us_election_2016 %>%
    mutate(trump_clinton_diff = 
             rawpoll_trump - 
             rawpoll_clinton) %>%
*   filter(trump_clinton_diff>5 &
*          state == "Iowa" &
*          is.na(rawpoll_johnson))
```

```
## # A tibble: 3 × 16
##   state startdate  enddate    pollster grade samplesize popula…¹ rawpo…² rawpo…³
##   <fct> <date>     <date>     <fct>    <fct>      <int> <chr>      <dbl>   <dbl>
## 1 Iowa  2016-09-09 2016-09-29 Ipsos    A-           343 lv          42.2    48.8
## 2 Iowa  2016-09-02 2016-09-22 Ipsos    A-           344 lv          40.5    50.7
## 3 Iowa  2016-08-26 2016-09-15 Ipsos    A-           347 lv          40.6    49.5
## # … with 7 more variables: rawpoll_johnson <dbl>, rawpoll_mcmullin <dbl>,
## #   adjpoll_clinton <dbl>, adjpoll_trump <dbl>, adjpoll_johnson <dbl>,
## #   adjpoll_mcmullin <dbl>, trump_clinton_diff <dbl>, and abbreviated variable
## #   names ¹population, ²rawpoll_clinton, ³rawpoll_trump
```

---

Works for all `dplyr` verbs:

```r
polls_us_election_2016 %>%
    mutate(trump_clinton_diff = 
             rawpoll_trump - 
             rawpoll_clinton) %>%
    filter(trump_clinton_diff>5 &
           state == "Iowa" &
           is.na(rawpoll_johnson)) %>%
*   select(pollster)
```

```
## # A tibble: 3 × 1
##   pollster
##   <fct>   
## 1 Ipsos   
## 2 Ipsos   
## 3 Ipsos
```

---

Works for all `dplyr` verbs:

```
## [1] Ipsos Ipsos Ipsos
## 196 Levels: ABC News/Washington Post ... Zogby Interactive/JZ Analytics
```
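
The command that produced this last output is not shown on the slide. A factor vector like this is what `pull()` returns, so it is plausible (but an assumption) that the final `select(pollster)` was swapped for `pull(pollster)`. A sketch of the difference on a made-up tibble:

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(
  pollster = factor(c("Ipsos", "YouGov", "Ipsos")),
  n        = c(343, 500, 347)
)

as_table  <- toy %>% select(pollster)  # still a tibble, with one column
as_vector <- toy %>% pull(pollster)    # the column itself (here a factor)

class(as_table)
class(as_vector)
```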

---

But also with other `R` commands:

```r
polls_us_election_2016$samplesize %>% mean(na.rm = TRUE)
```

```
## [1] 1148.216
```

```r
polls_us_election_2016 %>% count()
```

```
## # A tibble: 1 × 1
##       n
##   <int>
## 1  4208
```

#### (only for 🤓  nerds:)

* The `%>%` pipe is part of the `magrittr` package. `R` version 4.1.0 added a *native pipe*, `|>`. You could use it like this:

```r
polls_us_election_2016$samplesize |> mean(na.rm = TRUE)
```

```
## [1] 1148.216
```
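
A self-contained sketch showing that the native pipe is just another way of writing the function call. It needs `R` 4.1.0 or later; the vector `x` is made up:

```r
# Requires R >= 4.1.0 for the native pipe |>
x <- c(3, NA, 9)

direct <- mean(x, na.rm = TRUE)
piped  <- x |> mean(na.rm = TRUE)  # x becomes the first argument of mean()

identical(direct, piped)
```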

---

# Missing Values: `NA`

.pull-left[
* Whenever a value is *missing*, we code it as `NA`.
    
    ```r
    x <- NA
    ```
* `R` propagates `NA` through operations:
    
    ```r
    NA > 5
    ```
    
    ```
    ## [1] NA
    ```
    
    ```r
    NA + 10
    ```
    
    ```
    ## [1] NA
    ```
* The `is.na(x)` function returns `TRUE` if `x` is `NA`.
    
    ```r
    is.na(x)
    ```
    
    ```
    ## [1] TRUE
    ```
]

.pull-right[
* What is confusing is that 
    
    ```r
    NA == NA
    ```
    
    ```
    ## [1] NA
    ```

* An example makes this easier to see:
    
    ```r
    # Let x be Mary's age. We don't know how old she is.
    x <- NA
    
    # Let y be John's age. We don't know how old he is.
    y <- NA
    
    # Are John and Mary the same age?
    x == y
    ```
    
    ```
    ## [1] NA
    ```
    
    ```r
    # We don't know!
    ```
]
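
Because `NA` propagates, summary functions such as `mean()` return `NA` as soon as one value is missing. A short base `R` sketch with a made-up vector `ages`:

```r
# Hypothetical ages, one of them unknown
ages <- c(31, NA, 45)

mean(ages)                # NA: one unknown value makes the mean unknown
mean(ages, na.rm = TRUE)  # drop missing values before averaging
sum(is.na(ages))          # count the missing values
ages[!is.na(ages)]        # keep only the observed values
```

This is why `na.rm = TRUE` shows up in the `mean()` call on the earlier slide.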

---

# Task 1: Data wrangling

Load the data by running the following code:

```r
library(dslabs)  # need to install.packages("dslabs") first? 
data(polls_us_election_2016)
```

1. Which polls had a missing `grade`?

1. Which polls were (i) polled by American Strategies, GfK Group or Merrill Poll, (ii) had a sample size greater than 1,000, _and_ (iii) started on October 20th, 2016? (*Hint: for (i) `%in%` might come in handy. Recall that vectors are created using the `c()` function. For (iii) make sure to check the format of the variable containing the poll's start date.*)

1. Which polls (i) did not have missing poll data for Johnson, (ii) had a combined raw poll vote share for Trump and Clinton greater than 95% _and_ (iii) were done in the state of Ohio?  (*Hint: it might be practical to first create a variable containing the combined raw poll vote share for Trump and Clinton and then filter.*)

1. Which state had the highest average Trump vote share for polls which had at least a sample size of 2,000? (*Hint: you'll have to use `filter`, `group_by`, `summarise` and `arrange`. To obtain ranking in descending order check `arrange`'s help page.*)

---

# Visualising Data

---
layout: true

---

background-image: url("../img/logo/ggplot2.svg")
background-position: 90% 5%
background-size: 150px

# Base `R` and `ggplot2`

* Base `R` plotting is fairly good.

* There is an extremely powerful alternative: `ggplot2` (part of the `tidyverse` suite) `$\rightarrow$` what we'll be using

* Let's go back to the `gapminder` dataset to run the examples.

---

# The `gapminder` dataset: Overview

* Let's first load the `gapminder` dataset with these commands:
    
    ```r
    library(dslabs)
    data(gapminder, package = "dslabs")
    ```

* Here are the first 3 rows and last 2 rows.
    
    ```r
    head(gapminder, n = 3)
    ```
    
    ```
    ##   country year infant_mortality life_expectancy fertility population
    ## 1 Albania 1960            115.4           62.87      6.19    1636054
    ## 2 Algeria 1960            148.2           47.50      7.65   11124892
    ## 3  Angola 1960            208.0           35.98      7.32    5270844
    ##           gdp continent          region
    ## 1          NA    Europe Southern Europe
    ## 2 13828152297    Africa Northern Africa
    ## 3          NA    Africa   Middle Africa
    ```
    
    ```r
    tail(gapminder, n = 2)
    ```
    
    ```
    ##        country year infant_mortality life_expectancy fertility population gdp
    ## 10544   Zambia 2016               NA           57.10        NA         NA  NA
    ## 10545 Zimbabwe 2016               NA           61.69        NA         NA  NA
    ##       continent         region
    ## 10544    Africa Eastern Africa
    ## 10545    Africa Eastern Africa
    ```

---

# Task 2: Understanding the data

Load the data by running the following code:

```r
library(dslabs)
data(gapminder, package = "dslabs")
```

1. Compute the average population per continent per year, `mean_pop`, and assign the output to a new object `gapminder_mean`. (*Hint: you should have one observation (row) per continent for each year. You'll have to use `group_by` and `summarise`.*)

---

# gg is for Grammar of Graphics<sup>1</sup>

.footnote[
[1]: The following slides are taken from [Garrick Aden-Buie](https://www.garrickadenbuie.com/)'s wonderful [Gentle Guide to the Grammar of Graphics with `ggplot2`](https://pkg.garrickadenbuie.com/gentle-ggplot2/#1)
]

---

# gg is for Grammar of Graphics

```r
data %>%
  ggplot()
```

```r
ggplot(data)
```

#### Tidy Data

1. Each variable forms a ***column***

2. Each observation forms a ***row***

3. Each observational unit forms a table

1. What information do I want to use in my visualization?

1. Is that data contained in ***one column/row*** for a given data point?
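
As a sketch of what the tidy data rules mean in practice: in the made-up table below the years are spread over column names, so `year` is not a column you could map to an axis. `pivot_longer()` from `tidyr` (part of the tidyverse) reshapes it to one row per country and year. All names and numbers here are hypothetical:

```r
library(tidyr)
library(tibble)

# Hypothetical "wide" table: one column per year
wide <- tibble(
  country = c("A", "B"),
  `1960`  = c(1.6, 11.1),
  `2016`  = c(2.9, 40.6)
)

# Reshape so that each row is one country-year observation
long <- pivot_longer(
  wide,
  cols      = -country,               # every column except country
  names_to  = "year",                 # old column names go here
  values_to = "population_millions"   # cell values go here
)

# Column names arrive as text, so convert the year to a number
long$year <- as.integer(long$year)

long
```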

---
layout: true

# gg is for Grammar of Graphics

```r
+ aes()
```

---

- year

- population

- country

---

- year → **x**

- population → **y**

- country → *shape*, *color*, etc.

---

```r
aes(
  x = year,
  y = population,
  color = country
)
```

---
layout: true

# gg is for Grammar of Graphics

```r
+ geom_*()
```

---

<img src="chapter_tidy_files/figure-html/geom_demo-1.svg" width="650px" style="display: block; margin: auto;" />

---

.right-wide[
Here are [some of the most widely used geoms](https://eric.netlify.com/2017/08/10/most-popular-ggplot2-geoms/)

.small[
| Type | Function |
|:----:|:--------:|
| Point | `geom_point()` |
| Line | `geom_line()` |
| Bar | `geom_bar()`, `geom_col()` |
| Histogram | `geom_histogram()` |
| Regression | `geom_smooth()` |
| Boxplot | `geom_boxplot()` |
| Text | `geom_text()` |
| Vert./Horiz. Line | `geom_vline()`, `geom_hline()` |
| Count | `geom_count()` |
| Density | `geom_density()` |

<https://eric.netlify.com/2017/08/10/most-popular-ggplot2-geoms/>
]
]

---

# (Y)Our first plot!
---

```r
gapminder_mean
```

```
## # A tibble: 285 × 3
## # Groups:   continent [5]
##    continent  year mean_pop
##    <fct>     <int>    <dbl>
##  1 Africa     1960 5464985.
##  2 Africa     1961 5598112.
##  3 Africa     1962 5736073.
##  4 Africa     1963 5878867.
##  5 Africa     1964 6026474.
##  6 Africa     1965 6178906.
##  7 Africa     1966 6336258.
##  8 Africa     1967 6498656.
##  9 Africa     1968 6666202.
## 10 Africa     1969 6839011.
## # … with 275 more rows
```

---

```r
gapminder_mean %>%
* ggplot()
```

.right-wide[
<img src="chapter_tidy_files/figure-html/first-plot1a-out-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

```r
gapminder_mean %>%
  ggplot() +
* aes(x = year,
*     y = mean_pop)
```

.right-wide[
<img src="chapter_tidy_files/figure-html/first-plot1b-out-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

```r
gapminder_mean %>%
  ggplot() +
  aes(x = year,
      y = mean_pop) +
* geom_point()
```

.right-wide[
<img src="chapter_tidy_files/figure-html/first-plot1c-out-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

```r
gapminder_mean %>%
  ggplot() +
  aes(x = year,
      y = mean_pop,
*     color = continent) +
  geom_point()
```

.right-wide[
<img src="chapter_tidy_files/figure-html/first-plot1-out-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

```r
gapminder_mean %>%
  ggplot() +
  aes(x = year,
      y = mean_pop,
      color = continent) +
  geom_point() +
* geom_line()
```

.right-wide[
<img src="chapter_tidy_files/figure-html/first-plot2-fake-out-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

```r
gapminder_mean %>%
  ggplot() +
  aes(x = year,
      y = mean_pop,
      color = continent) +
* # geom_point() +
  geom_line()
```

.right-wide[
<img src="chapter_tidy_files/figure-html/first-plot2-line-out-1.svg" width="100%" style="display: block; margin: auto;" />
]

---

```r
g <- gapminder_mean %>%
  ggplot() +
  aes(x = year,
      y = mean_pop,
      color = continent) +
* # geom_point() +
  geom_line()
g
# graphs can be saved as
# objects!
```

.right-wide[
<img src="chapter_tidy_files/figure-html/save-plot-out-1.svg" width="100%" style="display: block; margin: auto;" />
]

---
layout: true

# gg is for Grammar of Graphics

```r
+ facet_wrap() 
+ facet_grid()
```
---

```r
g + facet_wrap(~ continent)
```

<img src="chapter_tidy_files/figure-html/geom_facet-1.svg" width="90%" style="display: block; margin: auto;" />

---

```r
g + facet_grid(~ continent)
```

<img src="chapter_tidy_files/figure-html/geom_grid-1.svg" width="90%" style="display: block; margin: auto;" />
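
A sketch of how the two facet functions differ, using the built-in `mtcars` data rather than `gapminder` (an assumption made only to keep the example self-contained):

```r
library(ggplot2)

# Built-in mtcars data, used here only for illustration
p <- ggplot(mtcars) +
  aes(x = wt, y = mpg) +
  geom_point()

p_wrap <- p + facet_wrap(~ cyl)    # panels wrap into rows and columns as needed
p_cols <- p + facet_grid(~ cyl)    # one column of panels per value of cyl
p_rows <- p + facet_grid(cyl ~ .)  # one row of panels per value of cyl
```

In `facet_grid()` the variable to the left of `~` defines rows and the one to the right defines columns; the `.` stands for "nothing on this side".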

---
layout: true

# gg is for Grammar of Graphics

```r
+ labs()
```
---

```r
g + labs(x = "Year", y = "Average Population", color = "Continent")
```

<img src="chapter_tidy_files/figure-html/labs-ex-1.svg" width="90%" style="display: block; margin: auto;" />

---
layout: true

# gg is for Grammar of Graphics

```r
+ scale_*_*()
```
---

What parameter do you want to adjust? → `<aes>` <br>
What type is the parameter? → `<type>`

- I want to change my discrete x-axis<br>`scale_x_discrete()`
- I want to change range of point sizes from continuous variable<br>`scale_size_continuous()`
- I want to rescale y-axis as log10<br>`scale_y_log10()`
- I want to use a different color palette<br>`scale_fill_discrete()`<br>`scale_color_manual()`

---

```r
g + scale_color_viridis_d()
```

<img src="chapter_tidy_files/figure-html/scale_ex1-1.svg" width="90%" style="display: block; margin: auto;" />

---

```r
g + scale_y_log10()
```

<img src="chapter_tidy_files/figure-html/scale_ex2-1.svg" width="90%" style="display: block; margin: auto;" />

---

```r
g + scale_x_continuous(breaks = seq(1950, 2020, 10))
```

<img src="chapter_tidy_files/figure-html/scale_ex4-1.svg" width="90%" style="display: block; margin: auto;" />

---
layout: true

# gg is for Grammar of Graphics

---

layout:true

---

# Delving Deeper into ggplot

* Each graph is different and `ggplot2` provides a zillion options to customize your graph to perfection.

* Excellent cheatsheet on the [project website](https://ggplot2.tidyverse.org).

* [Garrick Aden-Buie](https://www.garrickadenbuie.com/)'s wonderful [Gentle Guide to the Grammar of Graphics with `ggplot2`](https://pkg.garrickadenbuie.com/gentle-ggplot2/#1) from which the previous slides were taken.

---

# Types of Plots

***Histograms:*** count how many observations fall within each bin.

***Boxplots:*** display the distribution of a variable.

---

# Types of Plots

***Histograms:*** count how many observations fall within each bin.

***Boxplots:*** display the distribution of a variable.

***Scatter plots:*** shows the association between two variables.
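
A minimal sketch of the three plot types. It uses the built-in `mtcars` data rather than `gapminder` (an assumption made only to keep the example self-contained), and the object names `p_hist`, `p_box` and `p_scatter` are illustrative.

```r
library(ggplot2)

# Histogram: one variable, counts per bin (no y aesthetic needed)
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5, boundary = 10, colour = "white")

# Boxplot: distribution of mpg within each cylinder group
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# Scatter plot: association between weight and mpg
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 3, alpha = 0.5)
```

Typing the name of a plot object (e.g. `p_scatter`) in the console draws it.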

---

# Task 3: Visualising data

Using the `gapminder` data, create the following plots using `ggplot2`.

1. A histogram of life expectancy in 2015. (*Hint: do you need to specify a `y` in `aes()` for a histogram?*) Once you've created the histogram, within the appropriate `geom_*` set: `binwidth` to 5, `boundary` to 45, `colour` to "white" and `fill` to "#d90502". What does each of these options do? <br> *Optional:* Using the previous graph, facet it by continent such that each continent's plot is a new row. (*Hint: check the help for `facet_grid`.*)

1. A boxplot of average life expectancy per year by continent. Within the appropriate `geom_*` set: `colour` to "black" and `fill` to "#d90502". (*Hint: you need to group by both `continent` and `year`.*)

1. A scatter plot of fertility rate (y-axis) with respect to infant mortality (x-axis) in 2015. Once you've created the scatter plot, within the appropriate `geom_*` set: `size` to 3, `alpha` to 0.5, `colour` to "#d90502". Add labels (`labs`) to the plot so that it is cleaner.

---

# Summarising

---

---

# Summarising Data

* In general, we can learn from the data by visualising it and/or computing summary statistics

* Let's now turn to summary statistics!

* In particular, let's look at two features: *central tendency* and *spread*.

---

# Central Tendency

.pull-left[
`mean`: the average of all values in `x`: `$\bar{x} = \frac{1}{N}\sum_{i=1}^N x_i$`.

```r
x <- c(1,2,2,2,2,100)
mean(x)
```

```
## [1] 18.16667
```

```r
mean(x) == sum(x) / length(x)
```

```
## [1] TRUE
```
]

.pull-right[
`median`: the value `$x_j$` below and above which 50% of the values in `x` lie. `$m$` is the median if
    `$$\Pr(X \leq m) \geq 0.5 \text{ and } \Pr(X \geq m) \geq 0.5$$`
    
The median is robust against *outliers*.

```r
median(x)
```

```
## [1] 2
```
]
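
To see what "robust against outliers" means in practice, here is a small sketch that drops the value `100` from the vector used on this slide and recomputes both statistics (`x_clean` is an illustrative name):

```r
x <- c(1, 2, 2, 2, 2, 100)
x_clean <- x[x < 100]   # drop the outlier, leaving 1, 2, 2, 2, 2

# The mean moves a lot once the outlier is removed
mean(x)
mean(x_clean)

# The median is the same with or without the outlier
median(x)
median(x_clean)
```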

---

# Spread

.pull-left[
Another interesting feature is how much a variable is *spread out* about its center (the mean in this case).

The *variance* is such a measure.
    `$$Var(X) = \frac{1}{N} \sum_{i=1}^N(x_i-\bar{x})^2$$`
    
Consider two `normal distributions` with equal mean at `0`:
]

.pull-right[
<img src="chapter_tidy_files/figure-html/unnamed-chunk-65-1.svg" style="display: block; margin: auto;" />

Compute with:

```r
var(x)
```
]
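
One detail worth checking by hand: R's `var()` computes the *sample* variance, which divides by `N - 1`, whereas the formula on this slide divides by `N`. The sketch below compares the two on a small illustrative vector (the names `var_pop` and `var_sample` are made up for this example).

```r
x <- c(1, 2, 2, 2, 2, 100)
n <- length(x)

var_pop    <- sum((x - mean(x))^2) / n        # formula on the slide (divides by N)
var_sample <- sum((x - mean(x))^2) / (n - 1)  # what var() computes

all.equal(var(x), var_sample)   # var() matches the N - 1 version
all.equal(sd(x), sqrt(var(x)))  # sd() is the square root of var()
var_pop < var_sample            # dividing by N gives a slightly smaller number
```

For large `N` the two versions are almost identical.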
---

# Tabulating Data

`table(x)` is a useful function that counts the occurrences of each unique value in `x`:

```r
table(gapminder$continent)
```

```
## 
##   Africa Americas     Asia   Europe  Oceania 
##     2907     2052     2679     2223      684
```

The same can be achieved using the `count` function (from `dplyr`)

```r
gapminder %>% count(continent)
```

```
##   continent    n
## 1    Africa 2907
## 2  Americas 2052
## 3      Asia 2679
## 4    Europe 2223
## 5   Oceania  684
```

---

# Tabulating Data

Given two variables, `table` produces a contingency table:

```r
gapminder_new <- gapminder %>%
  filter(year == 2015) %>%
  mutate(fertility_above_2 = (fertility > 2.1)) # dummy variable for fertility rate above replacement rate
```

.pull-left[
```r
table(gapminder_new$fertility_above_2)
```

```
## 
## FALSE  TRUE 
##    80   104
```
]

.pull-right[
```r
table(gapminder_new$fertility_above_2,gapminder_new$continent)
```

```
##        
##         Africa Americas Asia Europe Oceania
##   FALSE      2       15   20     39       4
##   TRUE      49       20   27      0       8
```
]

With `prop.table`, we can get proportions:

```r
# proportions by row
prop.table(table(gapminder_new$fertility_above_2,gapminder_new$continent), margin = 1)
# proportions by column
prop.table(table(gapminder_new$fertility_above_2,gapminder_new$continent), margin = 2) 
```

* ⚠️ To include `NA`s in a `table` or cross-tabulation, set the argument `useNA = "always"` or `useNA = "ifany"`.
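
A minimal sketch of the `useNA` options on a hypothetical toy vector (not the `gapminder` data):

```r
# Hypothetical toy vector with one missing value
continent <- c("Africa", "Asia", NA, "Asia", "Europe")

tab_default <- table(continent)                   # default: the NA is silently dropped
tab_ifany   <- table(continent, useNA = "ifany")  # adds an <NA> cell because one exists
tab_always  <- table(c("a", "b"), useNA = "always")  # shows <NA> even when its count is 0

tab_default
tab_ifany
tab_always
```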

---

# Tabulating Data

Again the `count` function can get you there as well:

```r
gapminder_new %>%
  count(continent, fertility_above_2)
```

```
##    continent fertility_above_2  n
## 1     Africa             FALSE  2
## 2     Africa              TRUE 49
## 3   Americas             FALSE 15
## 4   Americas              TRUE 20
## 5   Americas                NA  1
## 6       Asia             FALSE 20
## 7       Asia              TRUE 27
## 8     Europe             FALSE 39
## 9    Oceania             FALSE  4
## 10   Oceania              TRUE  8
```

Note that `count` will display `NA`s only if there are some.

---

# How are x and y related? Covariance and Correlation

Two main statistics to characterise the relationship between `$x$` and `$y$`:
1. Covariance
2. Correlation

---

# Covariance

* The covariance is a measure of __joint variability__ of two variables.
    `$$Cov(x,y) = \frac{1}{N} \sum_{i=1}^N(x_i-\bar{x})(y_i-\bar{y})$$`

* The `cov` function computes the covariance:

```r
cov(gapminder_new$fertility,gapminder_new$infant_mortality, use = "complete.obs")
```

```
## [1] 24.21146
```

* Difficult to interpret, because its magnitude depends on the units and the dispersion of both variables.

---

# Correlation

* The correlation is a measure of the strength of the __linear association__ between two variables.
    `$$Cor(x,y) = \frac{Cov(x,y)}{\sqrt{Var(x)}\sqrt{Var(y)}}$$`

* The `cor` function computes the correlation:

```r
cor(gapminder_new$fertility,gapminder_new$infant_mortality, use = "complete.obs")
```

```
## [1] 0.8286402
```

---

# Correlation

* **Correlation is always between -1 and 1!**
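
The sketch below illustrates why the correlation is easier to interpret than the covariance. It uses the built-in `mtcars` data (an assumption to keep it self-contained): changing the units of one variable rescales the covariance but leaves the correlation untouched.

```r
x <- mtcars$wt    # car weight, in 1000 lbs
y <- mtcars$mpg   # miles per gallon

cov(x, y)
cov(x * 1000, y)  # same data with weight in lbs: covariance is 1000 times larger

cor(x, y)
cor(x * 1000, y)  # correlation is unchanged

# Correlation is the covariance rescaled by both standard deviations
all.equal(cor(x, y), cov(x, y) / (sd(x) * sd(y)))
```

R's `cov`, `var` and `sd` all use the `N - 1` denominator, so the factor cancels in the ratio and the result agrees with the formula on the earlier slide.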

---

# Correlation

* [App](https://gustavek.shinyapps.io/corr_continuous/)

---

# Task 4: Summarising data

1. Compute the mean of GDP in 2011 and assign to object `mean_GDP`. You should exclude missing values. (*Hint: read the help for `mean` to remove `NA`s*).

1. Compute the median of GDP in 2011 and assign to object `median_GDP`. Again, you should exclude missing values. Is it greater or smaller than the average?

1. Create a density plot of GDP in 2011 using `geom_density`. A density plot is a way of representing the distribution of a numeric variable. Add the following code to your plot to show the median and mean as vertical lines. What do you observe?
`geom_vline(xintercept = mean_GDP, colour = "red") +` <br>
    `geom_vline(xintercept = median_GDP, colour = "orange")`

1. Compute the correlation between fertility and infant mortality in 2015. To drop `NA`s in either variable set the argument `use` to "pairwise.complete.obs" in your `cor()` function. Is this correlation consistent with the graph you produced in Task 3?

In your free time, you can do this tutorial:

```r
library(ScPoApps) # this may take a while to install
runTutorial('chapter2')
```

---

# `joins`: Merging Datasets

### The remainder of these slides comes from [Grant McDermott's](https://github.com/uo-ec607/lectures/) great course

---

---
name: joins

# Joins

One of the mainstays of the dplyr package is merging data with the family of [join operations](https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html).
- `inner_join(df1, df2)`
- `left_join(df1, df2)`
- `right_join(df1, df2)`
- `full_join(df1, df2)`
- `semi_join(df1, df2)`
- `anti_join(df1, df2)`

You will find it helpful to see visual depictions of the different join operations [here](https://r4ds.hadley.nz/joins.html). You should read as much of that chapter as possible.

For the simple examples that I'm going to show here, we'll need some data sets that come bundled with the [**nycflights13**](http://github.com/hadley/nycflights13) package. 
- Load it now and then inspect these data frames in your own console.

```r
library(nycflights13)
flights 
planes
```
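
Before turning to the real data, it may help to see how the six joins differ on two tiny tables. The data frames `df1` and `df2` below are hypothetical; they share a single key column `id`, which dplyr picks up automatically (it prints a message saying so).

```r
library(dplyr)

# Hypothetical toy tables sharing the key column "id"
df1 <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
df2 <- data.frame(id = c(2, 3, 4), y = c("B", "C", "D"))

inner_join(df1, df2)  # ids 2, 3: only rows matched in both tables
left_join(df1, df2)   # ids 1, 2, 3: all of df1, y is NA for id 1
right_join(df1, df2)  # ids 2, 3, 4: all of df2, x is NA for id 4
full_join(df1, df2)   # ids 1, 2, 3, 4: every row from either table
semi_join(df1, df2)   # rows of df1 that have a match; only df1's columns
anti_join(df1, df2)   # rows of df1 that have no match (id 1)
```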

---

# Joins (cont.)

Let's perform a [left join](https://stat545.com/bit001_dplyr-cheatsheet.html#left_joinsuperheroes-publishers) on the flights and planes datasets. 
- *Note*: I'm going to subset columns after the join, but only to keep the text on the slide.

```r
left_join(flights, planes) %>%
  select(year, month, day, dep_time, arr_time, carrier, flight, tailnum, type, model)
```

```
## # A tibble: 336,776 × 10
##     year month   day dep_time arr_time carrier flight tailnum type  model
##    <int> <int> <int>    <int>    <int> <chr>    <int> <chr>   <chr> <chr>
##  1  2013     1     1      517      830 UA        1545 N14228  <NA>  <NA> 
##  2  2013     1     1      533      850 UA        1714 N24211  <NA>  <NA> 
##  3  2013     1     1      542      923 AA        1141 N619AA  <NA>  <NA> 
##  4  2013     1     1      544     1004 B6         725 N804JB  <NA>  <NA> 
##  5  2013     1     1      554      812 DL         461 N668DN  <NA>  <NA> 
##  6  2013     1     1      554      740 UA        1696 N39463  <NA>  <NA> 
##  7  2013     1     1      555      913 B6         507 N516JB  <NA>  <NA> 
##  8  2013     1     1      557      709 EV        5708 N829AS  <NA>  <NA> 
##  9  2013     1     1      557      838 B6          79 N593JB  <NA>  <NA> 
## 10  2013     1     1      558      753 AA         301 N3ALAA  <NA>  <NA> 
## # … with 336,766 more rows
```

---

# Joins (cont.)

(*continued from previous slide*)

Note that dplyr made a reasonable guess about which columns to join on (i.e. columns that share the same name). It also told us its choices:

```
*## Joining, by = c("year", "tailnum")
```

However, there's an obvious problem here: the variable "year" does not have a consistent meaning across our joining datasets!
- In one it refers to the *year of flight*, in the other it refers to *year of construction*.

Luckily, there's an easy way to avoid this problem. 
- See if you can figure it out before turning to the next slide.
- Try `?dplyr::join`.

---

# Joins (cont.)

(*continued from previous slide*)

You just need to be more explicit in your join call by using the `by = ` argument.
- You can also rename any ambiguous columns to avoid confusion.

```r
left_join(
  flights,
  planes %>% rename(year_built = year), ## Not necessary w/ below line, but helpful
  by = "tailnum" ## Be specific about the joining column
  ) %>%
  select(year, month, day, dep_time, arr_time, carrier, flight, tailnum, year_built, type, model) %>%
  head(3) ## Just to save vertical space on the slide
```

```
## # A tibble: 3 × 11
##    year month   day dep_time arr_time carrier flight tailnum year_…¹ type  model
##   <int> <int> <int>    <int>    <int> <chr>    <int> <chr>     <int> <chr> <chr>
## 1  2013     1     1      517      830 UA        1545 N14228     1999 Fixe… 737-…
## 2  2013     1     1      533      850 UA        1714 N24211     1998 Fixe… 737-…
## 3  2013     1     1      542      923 AA        1141 N619AA     1990 Fixe… 757-…
## # … with abbreviated variable name ¹year_built
```

---

# Joins (cont.)

(*continued from previous slide*)

Last thing I'll mention for now; note what happens if we again specify the join column... but don't rename the ambiguous "year" column in at least one of the given data frames.

```r
left_join(
  flights,
  planes, ## Not renaming "year" to "year_built" this time
  by = "tailnum"
  ) %>%
  select(contains("year"), month, day, dep_time, arr_time, carrier, flight, tailnum, type, model) %>%
  head(3)
```

```
## # A tibble: 3 × 11
##   year.x year.y month   day dep_time arr_time carrier flight tailnum type  model
##    <int>  <int> <int> <int>    <int>    <int> <chr>    <int> <chr>   <chr> <chr>
## 1   2013   1999     1     1      517      830 UA        1545 N14228  Fixe… 737-…
## 2   2013   1998     1     1      533      850 UA        1714 N24211  Fixe… 737-…
## 3   2013   1990     1     1      542      923 AA        1141 N619AA  Fixe… 757-…
```

Make sure you know what "year.x" and "year.y" are. Again, it pays to be specific.
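
If you prefer not to rename columns beforehand, the join functions also take a `suffix` argument that replaces the default `.x` / `.y` endings. A minimal sketch on hypothetical stand-ins for the two tables (`flights_toy` and `planes_toy` are made up for this example):

```r
library(dplyr)

# Hypothetical stand-ins for the flights and planes tables
flights_toy <- data.frame(tailnum = c("N1", "N2"), year = c(2013, 2013))
planes_toy  <- data.frame(tailnum = c("N1", "N2"), year = c(1999, 1990))

joined <- left_join(
  flights_toy, planes_toy,
  by = "tailnum",
  suffix = c("_flight", "_built")  # applied to non-key columns that share a name
)
joined
```

The result has the columns `tailnum`, `year_flight` and `year_built`, so there is no doubt about which year is which.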

---

# `tidyr`: Reshaping Datasets

### By [Grant McDermott](https://github.com/uo-ec607/lectures/).

---

---
# Key tidyr verbs

1. `pivot_longer`: Pivot wide data into long format (i.e. "melt").<sup>1</sup>

2. `pivot_wider`: Pivot long data into wide format (i.e. "cast").<sup>2</sup>

3. `separate`: Separate (i.e. split) one column into multiple columns.

4. `unite`: Unite (i.e. combine) multiple columns into one.

.footnote[
<sup>1</sup> Updated version of `tidyr::gather`.

<sup>2</sup> Updated version of `tidyr::spread`.
]

</br>

Let's practice these verbs together in class.
- Side question: Which of `pivot_longer` vs `pivot_wider` produces "tidy" data?
  
---

# 1) tidyr::pivot_longer

```r
stocks = data.frame( ## Could use "tibble" instead of "data.frame" if you prefer
  time = as.Date('2009-01-01') + 0:1,
  X = rnorm(2, 0, 1),
  Y = rnorm(2, 0, 2),
  Z = rnorm(2, 0, 4)
  )
stocks
```

```
##         time         X         Y         Z
## 1 2009-01-01 -1.366401 0.9747837  2.553785
## 2 2009-01-02 -1.023099 2.4229061 -2.239642
```

```r
stocks %>% pivot_longer(-time, names_to="stock", values_to="price")
```

```
## # A tibble: 6 × 3
##   time       stock  price
##   <date>     <chr>  <dbl>
## 1 2009-01-01 X     -1.37 
## 2 2009-01-01 Y      0.975
## 3 2009-01-01 Z      2.55 
## 4 2009-01-02 X     -1.02 
## 5 2009-01-02 Y      2.42 
## 6 2009-01-02 Z     -2.24
```

---

# 1) tidyr::pivot_longer *cont.*

Let's quickly save the "tidy" (i.e. long) stocks data frame for use on the next slide.

```r
## Same call as on the previous slide, now assigned to an object
tidy_stocks = 
  stocks %>% 
  pivot_longer(-time, names_to="stock", values_to="price")
```

---
name: pivot_wider

# 2) tidyr::pivot_wider

```r
tidy_stocks %>% pivot_wider(names_from=stock, values_from=price)
```

```
## # A tibble: 2 × 4
##   time           X     Y     Z
##   <date>     <dbl> <dbl> <dbl>
## 1 2009-01-01 -1.37 0.975  2.55
## 2 2009-01-02 -1.02 2.42  -2.24
```

```r
tidy_stocks %>% pivot_wider(names_from=time, values_from=price)
```

```
## # A tibble: 3 × 3
##   stock `2009-01-01` `2009-01-02`
##   <chr>        <dbl>        <dbl>
## 1 X           -1.37         -1.02
## 2 Y            0.975         2.42
## 3 Z            2.55         -2.24
```

</br>
Note that the second example, which has combined different pivoting arguments, has effectively transposed the data.

---

# Aside: Remembering the pivot_* syntax

There's a long-running joke about no-one being able to remember Stata's "reshape" command. ([Exhibit A](https://twitter.com/helleringer143/status/1117234887902285836).)

It's easy to see this happening with the `pivot_*` functions too. However, I find that I never forget the commands as long as I remember the argument order is *"names"* then *"values"*.

---
name: separate

# 3) tidyr::separate

```r
economists = data.frame(name = c("Adam.Smith", "Paul.Samuelson", "Milton.Friedman"))
economists
```

```
##              name
## 1      Adam.Smith
## 2  Paul.Samuelson
## 3 Milton.Friedman
```

```r
economists %>% separate(name, c("first_name", "last_name")) 
```

```
##   first_name last_name
## 1       Adam     Smith
## 2       Paul Samuelson
## 3     Milton  Friedman
```

</br>

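This command is pretty smart. But to avoid ambiguity, you can also specify the separator with `separate(..., sep = "\\.")`. Note that `sep` is interpreted as a regular expression, so a literal dot must be escaped (an unescaped `"."` would match every character).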
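
A short sketch of a case where the explicit separator matters. By default `separate` splits on any run of non-alphanumeric characters, so a hyphenated first name would be cut in two. The data frame `econ2` is a hypothetical extension of the example above.

```r
library(tidyr)

# Hypothetical example: a first name that itself contains a non-alphanumeric character
econ2 <- data.frame(name = c("Adam.Smith", "Jean-Baptiste.Say"))

# `sep` is a regular expression, so the dot is escaped to match a literal "."
res <- separate(econ2, name, c("first_name", "last_name"), sep = "\\.")
res
```

With the escaped dot, "Jean-Baptiste" stays in one piece and "Say" lands in `last_name`.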

---

# 3) tidyr::separate *cont.*

A related function is `separate_rows`, for splitting up cells that contain multiple fields or observations (a frustratingly common occurrence with survey data).

```r
jobs = data.frame(
  name = c("Jack", "Jill"),
  occupation = c("Homemaker", "Philosopher, Philanthropist, Troublemaker") 
  ) 
jobs
```

```
##   name                                occupation
## 1 Jack                                 Homemaker
## 2 Jill Philosopher, Philanthropist, Troublemaker
```

```r
## Now split out Jill's various occupations into different rows
jobs %>% separate_rows(occupation)
```

```
## # A tibble: 4 × 2
##   name  occupation    
##   <chr> <chr>         
## 1 Jack  Homemaker     
## 2 Jill  Philosopher   
## 3 Jill  Philanthropist
## 4 Jill  Troublemaker
```
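
`separate_rows` also accepts a `sep` argument. The sketch below uses a hypothetical variant of the data (`jobs2`) in which one occupation consists of two words, to show when you would want to set it.

```r
library(tidyr)

# Hypothetical variant of the jobs data with a two-word occupation
jobs2 <- data.frame(
  name = c("Jack", "Jill"),
  occupation = c("Data Scientist", "Philosopher, Philanthropist")
)

# The default separator also splits on the space inside "Data Scientist"
default_split <- separate_rows(jobs2, occupation)

# Splitting only on comma-plus-space keeps "Data Scientist" intact
comma_split <- separate_rows(jobs2, occupation, sep = ", ")

default_split
comma_split
```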

---
name: unite

# 4) tidyr::unite

```r
gdp = data.frame(
  yr = rep(2016, times = 4),
  mnth = rep(1, times = 4),
  dy = 1:4,
  gdp = rnorm(4, mean = 100, sd = 2)
  )
gdp 
```

```
##     yr mnth dy       gdp
## 1 2016    1  1 101.47968
## 2 2016    1  2  96.36929
## 3 2016    1  3 103.08410
## 4 2016    1  4  99.88610
```

```r
## Combine "yr", "mnth", and "dy" into one "date" column
gdp %>% unite(date, c("yr", "mnth", "dy"), sep = "-")
```

```
##       date       gdp
## 1 2016-1-1 101.47968
## 2 2016-1-2  96.36929
## 3 2016-1-3 103.08410
## 4 2016-1-4  99.88610
```

---

# 4) tidyr::unite *cont.*

Note that `unite` will automatically create a character variable. You can see this better if we convert it to a tibble.

```r
gdp_u = gdp %>% unite(date, c("yr", "mnth", "dy"), sep = "-") %>% as_tibble()
gdp_u
```

```
## # A tibble: 4 × 2
##   date       gdp
##   <chr>    <dbl>
## 1 2016-1-1 101. 
## 2 2016-1-2  96.4
## 3 2016-1-3 103. 
## 4 2016-1-4  99.9
```

If you want to convert it to something else (e.g. date or numeric) then you will need to modify it using `mutate`. See the next slide for an example, using the [lubridate](https://lubridate.tidyverse.org/) package's super helpful date conversion functions.

---

# 4) tidyr::unite *cont.*

*(continued from previous slide)*

```r
library(lubridate)
gdp_u %>% mutate(date = ymd(date))
```

```
## # A tibble: 4 × 2
##   date         gdp
##   <date>     <dbl>
## 1 2016-01-01 101. 
## 2 2016-01-02  96.4
## 3 2016-01-03 103. 
## 4 2016-01-04  99.9
```

---

# Other tidyr goodies

Use `crossing` to get the full combination of a group of variables.

```r
crossing(side=c("left", "right"), height=c("top", "bottom"))
```

```
## # A tibble: 4 × 2
##   side  height
##   <chr> <chr> 
## 1 left  bottom
## 2 left  top   
## 3 right bottom
## 4 right top
```

See `?expand` and `?complete` for more specialised functions that allow you to fill in (implicit) missing data or variable combinations in existing data frames.
- You'll encounter this during your next assignment.
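
As a rough illustration of what "filling in implicit missing data" means, here is a sketch with `complete` on a hypothetical three-row panel (`panel` and its values are invented for this example):

```r
library(tidyr)

# Hypothetical panel with one implicit gap: there is no row for country "B" in 2021
panel <- data.frame(
  country = c("A", "A", "B"),
  year    = c(2020, 2021, 2020),
  gdp     = c(100, 102, 50)
)

# complete() adds the missing country-year combination, with gdp set to NA
panel_full <- complete(panel, country, year)
panel_full
```

The missing observation is now an explicit row with `NA`, which makes gaps visible when you summarise or plot the data.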

---
class: inverse, center, middle
name: summary

# Summary
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---

# Key verbs

### dplyr
1. `filter`
2. `arrange`
3. `select`
4. `mutate`
5. `summarise`

### tidyr
1. `pivot_longer`
2. `pivot_wider`
3. `separate`
4. `unite`

Other useful items include: pipes (`%>%`), grouping (`group_by`), joining functions (`left_join`, `inner_join`, etc.).
---

class: title-slide-final, middle
background-image: url(../img/logo/ScPo-econ.png)
background-size: 250px
background-position: 9% 19%

# Next Week: `data.table`

|                                                                                                            |                                   |
| :--------------------------------------------------------------------------------------------------------- | :-------------------------------- |
| <a href="https://floswald.github.io/ScPoProgramming">.ScPored[<i class="fa fa-link fa-fw"></i>]</a> | Course Website |
| <a href="http://twitter.com/ScPoEcon">.ScPored[<i class="fa fa-twitter fa-fw"></i>]</a>                          | @ScPoEcon                         |
| <a href="http://github.com/floswald">.ScPored[<i class="fa fa-github fa-fw"></i>]</a>                          | @floswald                       |