Topic 4: Data Wrangling in the Tidyverse

class: center, middle, inverse, title-slide

.title[
# Topic 4: Data Wrangling in the Tidyverse
]
.author[
### Nick Hagerty* ECNS 460/560 Montana State University
]
.date[
### .smaller[ *Parts of these slides are adapted from <a href="https://raw.githack.com/uo-ec607/lectures/master/05-tidyverse/05-tidyverse.html">“Data Science for Economists”</a> by Grant R. McDermott, used under the <a href="https://github.com/uo-ec607/lectures/blob/master/LICENSE">MIT License</a>. Images are from <a href="https://moderndive.com/3-wrangling.html">ModernDive</a> by Chester Ismay and Albert Y. Kim, used under <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC BY-NC-SA 4.0</a>.]
]

---

name: toc

</style>

# Agenda for the next 2-3 weeks

**Where we've been:**

1. R Basics

1. Programming

1. Productivity tools

**Where we're going:**

1. Data wrangling in the tidyverse

1. Data cleaning

1. Introduce the term project

1. Data acquisition & webscraping

---

# Table of contents

1. [Tidyverse basics](#basics)

1. [Data wrangling with dplyr](#dplyr)

1. [Joining data with dplyr](#joins)

1. [Data tidying with tidyr](#tidyr)

1. [Importing data with readr](#readr)

1. [Summary](#summary)

---
class: inverse, middle
name: basics

# Tidyverse basics

---

# The tidyverse

**A whole "universe" of functions within R**
- The most powerful, intuitive, and popular approach to data cleaning, wrangling, and visualization in R

**Advantages:**
- Consistent philosophy and syntax
- "Verb" based approach makes it more familiar to users of Stata/SAS/SPSS
- Serves as the front-end for many other big data and ML tools

**Why did I make you learn base R to start with?**
- The tidyverse can't do everything
- The best solution to many problems involves a combination of tidyverse and base R
- Ultimately you want to learn about many tools and choose the best one for the job

.footnote[
Note: Another approach with many fans is [**data.table**](http://r-datatable.com/), which you can learn about on your own!
]

---

# Tidyverse vs. base R

There is often a direct correspondence between a tidyverse command and its base R equivalent.

| tidyverse  |  base |
|---|---|
| `?readr::read_csv`  | `?utils::read.csv` |
|  `?dplyr::if_else` |  `?base::ifelse` |
|  `?tibble::tibble` |  `?base::data.frame` |

Notice these generally follow a `tidyverse::snake_case` vs `base::period.case` rule.

The tidyverse alternative typically offers some enhancements or other useful options (and sometimes restrictions) over its base counterpart.
- Remember: There are always multiple ways to achieve a single goal in R.

---

# Tidying Data

The two most important properties of tidy data are:

1. Each column is a unique variable.
2. Each row is a single observation.

![](img/tidy-1.png)

Tidy data is easier to work with, because you have a consistent way of referring to variables and observations. It then becomes easy to manipulate, visualize, and model.

.smaller[Image is from ["R for Data Science"](https://r4ds.had.co.nz/tidy-data.html) by Hadley Wickham & Garrett Grolemund, used under [CC BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/us/) and is not included under this project's overall CC license.]

---

# Wide vs. Long Formats

Both of these data sets display information on heart rate observed in individuals across 3 different time periods:

```
##      name time1 time2 time3
## 1  Wilbur    67    56    70
## 2 Petunia    80    90    67
## 3 Gregory    64    50   101
```

```
##      name time heartrate
## 1  Wilbur    1        67
## 2 Petunia    1        80
## 3 Gregory    1        64
## 4  Wilbur    2        56
## 5 Petunia    2        90
## 6 Gregory    2        50
## 7  Wilbur    3        70
## 8 Petunia    3        67
## 9 Gregory    3        10
```

Which dataframe is in *tidy* format?

---

# Wide vs. Long Formats

**Wide** data:
- Row = patient. Columns = repeated observations over time.
- Often easier to take in at a glance (as in a spreadsheet).

**Long** data:
- Row = one observation. Columns = ID variables + observed variable.
- Usually easier to clean, merge with other data, and avoid errors.

Tidy data is more likely to be **long**.
- Most R packages have been written assuming your data is in long format.

"Tidy datasets are all alike but every messy dataset is messy in its own way."  
– Hadley Wickham

---

# Tidyverse packages

We need to install and load a couple of packages. Run these preliminaries:

```r
install.packages('tidyverse')
install.packages('nycflights13')
library(tidyverse)
```

```
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```

We see that we have actually loaded a number of packages (which could also be loaded individually): **ggplot2**, **tibble**, **dplyr**, etc.
- We can also see information about the package versions and some [namespace conflicts](https://raw.githack.com/uo-ec607/lectures/master/04-rlang/04-rlang.html#59).

---

# Tidyverse packages

The tidyverse actually comes with a lot more packages than those that are just loaded automatically.

```r
tidyverse_packages()
```

```
##  [1] "broom"         "cli"           "crayon"        "dbplyr"       
##  [5] "dplyr"         "dtplyr"        "forcats"       "googledrive"  
##  [9] "googlesheets4" "ggplot2"       "haven"         "hms"          
## [13] "httr"          "jsonlite"      "lubridate"     "magrittr"     
## [17] "modelr"        "pillar"        "purrr"         "readr"        
## [21] "readxl"        "reprex"        "rlang"         "rstudioapi"   
## [25] "rvest"         "stringr"       "tibble"        "tidyr"        
## [29] "xml2"          "tidyverse"
```

All of these are super useful
- **lubridate** helps us work with dates
- **rvest** is for webscraping

This lecture will focus on two that are automatically loaded: **dplyr** and **tidyr**.

---

# Pipes: |> or %>%

Pipes take the **output** of one function and feed it into the **first argument** of the next (which you then skip).

`dataframe |> filter(condition)` is equivalent to `filter(dataframe, condition)`.

Note: `|>` on these slides is generated by the two characters `| >`, without the space.

**Older version** of the pipe: `%>%`
* From the `magrittr` package loaded with the tidyverse
* Works identically to `|>` in most situations.

**Keyboard shortcut:** Ctl/Cmd + Shift + M

* Have to turn on a setting in RStudio options to make `|>` the default

---

# Pipes: |> or %>%

Pipes can dramatically improve the experience of reading and writing code. Compare:

.small[

```r
## These next two lines of code do exactly the same thing.

mpg |> filter(manufacturer=="audi") |> group_by(model) |> summarize(hwy_mean = mean(hwy))

summarize(group_by(filter(mpg, manufacturer=="audi"), model), hwy_mean = mean(hwy))
```
]

The first line reads from left to right, exactly how you think about the operations.

The second line totally inverts this logical order (the final operation comes first!)

---

# Pipes: |> or %>%

Best practice is to put each function on its own line and indent. Look how much more readable this is:

```r
mpg |> 
  filter(manufacturer == "audi") |> 
  group_by(model) |> 
  summarize(hwy_mean = mean(hwy))
```

```
## # A tibble: 3 × 2
## model hwy_mean
## <chr> <dbl>
## 1 a4 28.3
## 2 a4 quattro 25.8
## 3 a6 quattro 24
```

Vertical space costs nothing and makes for much more readable/writable code than cramming things horizontally.

All together, this multi-line line of code is called a **pipeline**.

---
class: inverse, middle
name: dplyr

# dplyr

---

# Key dplyr verbs

**There are five key dplyr verbs that you need to learn.**

1. `filter`: Filter (i.e. subset) rows based on their values.

2. `arrange`: Arrange (i.e. reorder) rows based on their values.

3. `select`: Select (i.e. subset) columns by their names:

4. `mutate`: Create new columns.

5. `summarize`: Collapse multiple rows into a single summary value.1

.footnote[
1 Many sources spell `summarise` with an "s" -- either one works!
]

Let's practice these functions together using the `starwars` data frame that comes pre-packaged with dplyr.

---

# 1) dplyr::filter

![](img/filter.png)

---

# 1) dplyr::filter

We can chain multiple filter commands with the pipe (`|>`), or just separate them within a single filter command using commas.

```r
starwars |> 
  filter( 
    species == "Human", 
    height >= 190
    ) 
```

```
## # A tibble: 4 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> 
## 1 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
## 2 Qui-Gon Jinn 193 89 brown fair blue 92 male mascu… <NA> 
## 3 Dooku 193 80 white fair brown 102 male mascu… Serenno
## 4 Bail Presto… 191 NA black tan brown 67 male mascu… Aldera…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>, and abbreviated variable names ¹hair_color, ²skin_color,
## # ³eye_color, ⁴birth_year, ⁵homeworld
## # ℹ Use `colnames()` to see all variable names
```

---

# 1) dplyr::filter

Regular expressions work well too.

```r
starwars |> 
  filter(str_detect(name, "Skywalker"))
```

```
## # A tibble: 3 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> 
## 1 Luke Sky… 172 77 blond fair blue 19 male mascu…
## 2 Anakin S… 188 84 blond fair blue 41.9 male mascu…
## 3 Shmi Sky… 163 NA black fair brown 72 fema… femin…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
```

---

# 1) dplyr::filter

A very common `filter` use case is identifying (or removing) missing data cases.

```r
starwars |> 
  filter(is.na(height))
```

```
## # A tibble: 6 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> 
## 1 Arvel Crynyd NA NA brown fair brown NA male mascu… <NA> 
## 2 Finn NA NA black dark dark NA male mascu… <NA> 
## 3 Rey NA NA brown light hazel NA fema… femin… <NA> 
## 4 Poe Dameron NA NA brown light brown NA male mascu… <NA> 
## 5 BB8 NA NA none none black NA none mascu… <NA> 
## 6 Captain Pha… NA NA unknown unknown unknown NA <NA> <NA> <NA> 
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>, and abbreviated variable names ¹hair_color, ²skin_color,
## # ³eye_color, ⁴birth_year, ⁵homeworld
## # ℹ Use `colnames()` to see all variable names
```

To remove missing observations, simply use negation: `filter(!is.na(height))`.  
Try this yourself.

---

# 2) dplyr::arrange

![](img/arrange.png)

---

# 2) dplyr::arrange

`arrange` sorts your data frame by a particular column (numerically, or alphabetically)

```r
starwars |> 
  arrange(birth_year)
```

```
## # A tibble: 87 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> 
## 1 Wicket Sys… 88 20 brown brown brown 8 male mascu… Endor 
## 2 IG-88 200 140 none metal red 15 none mascu… <NA> 
## 3 Luke Skywa… 172 77 blond fair blue 19 male mascu… Tatooi…
## 4 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…
## 5 Wedge Anti… 170 77 brown fair hazel 21 male mascu… Corell…
## 6 Plo Koon 188 80 none orange black 22 male mascu… Dorin 
## 7 Biggs Dark… 183 84 black light brown 24 male mascu… Tatooi…
## 8 Han Solo 180 80 brown fair brown 29 male mascu… Corell…
## 9 Lando Calr… 177 79 black dark brown 31 male mascu… Socorro
## 10 Boba Fett 183 78.2 black fair brown 31.5 male mascu… Kamino 
## # … with 77 more rows, 4 more variables: species <chr>, films <list>,
## # vehicles <list>, starships <list>, and abbreviated variable names
## # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
```

---

# 2) dplyr::arrange

We can also arrange items in descending order using `arrange(desc())`.

```r
starwars |> 
  arrange(desc(birth_year))
```

```
## # A tibble: 87 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> 
## 1 Yoda 66 17 white green brown 896 male mascu… <NA> 
## 2 Jabba Desi… 175 1358 <NA> green-… orange 600 herm… mascu… Nal Hu…
## 3 Chewbacca 228 112 brown unknown blue 200 male mascu… Kashyy…
## 4 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
## 5 Dooku 193 80 white fair brown 102 male mascu… Serenno
## 6 Qui-Gon Ji… 193 89 brown fair blue 92 male mascu… <NA> 
## 7 Ki-Adi-Mun… 198 82 white pale yellow 92 male mascu… Cerea 
## 8 Finis Valo… 170 NA blond fair blue 91 male mascu… Corusc…
## 9 Palpatine 170 75 grey pale yellow 82 male mascu… Naboo 
## 10 Cliegg Lars 183 NA brown fair blue 82 male mascu… Tatooi…
## # … with 77 more rows, 4 more variables: species <chr>, films <list>,
## # vehicles <list>, starships <list>, and abbreviated variable names
## # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
```

---

# 3) dplyr::select

![](img/select.png)

---

# 3) dplyr::select

Use commas to select multiple columns out of a data frame. (You can also use "first:last" for consecutive columns). Deselect a column with "-".

```r
starwars |> 
  select(name:skin_color, species, -height)
```

```
## # A tibble: 87 × 5
## name mass hair_color skin_color species
## <chr> <dbl> <chr> <chr> <chr> 
## 1 Luke Skywalker 77 blond fair Human 
## 2 C-3PO 75 <NA> gold Droid 
## 3 R2-D2 32 <NA> white, blue Droid 
## 4 Darth Vader 136 none white Human 
## 5 Leia Organa 49 brown light Human 
## 6 Owen Lars 120 brown, grey light Human 
## 7 Beru Whitesun lars 75 brown light Human 
## 8 R5-D4 32 <NA> white, red Droid 
## 9 Biggs Darklighter 84 black light Human 
## 10 Obi-Wan Kenobi 77 auburn, white fair Human 
## # … with 77 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

---

# 3) dplyr::select

You can also rename some (or all) of your selected variables in place.

```r
starwars |>
  select(alias=name, planet=homeworld) 
```

```
## # A tibble: 87 × 2
## alias planet 
## <chr> <chr> 
## 1 Luke Skywalker Tatooine
## 2 C-3PO Tatooine
## 3 R2-D2 Naboo 
## 4 Darth Vader Tatooine
## 5 Leia Organa Alderaan
## 6 Owen Lars Tatooine
## 7 Beru Whitesun lars Tatooine
## 8 R5-D4 Tatooine
## 9 Biggs Darklighter Tatooine
## 10 Obi-Wan Kenobi Stewjon 
## # ℹ 77 more rows
```

If you just want to rename columns without subsetting them, you can use `rename`. Try this!

---

# 3) dplyr::select

The `select(contains(PATTERN))` option provides a nice shortcut in relevant cases.

```r
starwars |> 
  select(name, contains("color"))
```

```
## # A tibble: 87 × 4
## name hair_color skin_color eye_color
## <chr> <chr> <chr> <chr> 
## 1 Luke Skywalker blond fair blue 
## 2 C-3PO <NA> gold yellow 
## 3 R2-D2 <NA> white, blue red 
## 4 Darth Vader none white yellow 
## 5 Leia Organa brown light brown 
## 6 Owen Lars brown, grey light blue 
## 7 Beru Whitesun lars brown light blue 
## 8 R5-D4 <NA> white, red red 
## 9 Biggs Darklighter black light brown 
## 10 Obi-Wan Kenobi auburn, white fair blue-gray
## # … with 77 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

Some other selection helpers: `starts_with()`, `ends_with()`, `all_of(c("name1", "name2"))`, `matches()`.

---

# 4) dplyr::mutate

![](img/mutate.png)

---

# 4) dplyr::mutate

You can create new columns from scratch, or (more commonly) as transformations of existing columns.

```r
starwars |> 
  select(name, birth_year) |>
  mutate(dog_years = birth_year * 7) |>
  mutate(comment = paste0(name, " is ", dog_years, " in dog years."))
```

```
## # A tibble: 87 × 4
## name birth_year dog_years comment 
## <chr> <dbl> <dbl> <chr> 
## 1 Luke Skywalker 19 133 Luke Skywalker is 133 in dog years. 
## 2 C-3PO 112 784 C-3PO is 784 in dog years. 
## 3 R2-D2 33 231 R2-D2 is 231 in dog years. 
## 4 Darth Vader 41.9 293. Darth Vader is 293.3 in dog years. 
## 5 Leia Organa 19 133 Leia Organa is 133 in dog years. 
## 6 Owen Lars 52 364 Owen Lars is 364 in dog years. 
## 7 Beru Whitesun lars 47 329 Beru Whitesun lars is 329 in dog yea…
## 8 R5-D4 NA NA R5-D4 is NA in dog years. 
## 9 Biggs Darklighter 24 168 Biggs Darklighter is 168 in dog year…
## 10 Obi-Wan Kenobi 57 399 Obi-Wan Kenobi is 399 in dog years. 
## # … with 77 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

---

# 4) dplyr::mutate

*Note:* `mutate` is order aware. So you can chain multiple mutates in a single call.

```r
starwars |> 
  select(name, birth_year) |>
  mutate(
    dog_years = birth_year * 7,     # Separate with a comma
    comment = paste0(name, " is ", dog_years, " in dog years.")
    )
```

---

# 4) dplyr::mutate

Boolean, logical and conditional operators all work well with `mutate` too.

```r
starwars |> 
  select(name, height) |>
  filter(name %in% c("Luke Skywalker", "Anakin Skywalker")) |> 
  mutate(tall1 = height > 180) |>
  mutate(tall2 = if_else(height > 180, "Tall", "Short"))
```

```
## # A tibble: 2 × 4
## name height tall1 tall2
## <chr> <int> <lgl> <chr>
## 1 Luke Skywalker 172 FALSE Short
## 2 Anakin Skywalker 188 TRUE Tall
```

---

# 4) dplyr::mutate

Lastly, combining `mutate` with `across` allows you to easily perform the same operation on a subset of variables.

```r
starwars |> 
  select(name:eye_color) |> 
  mutate(across(where(is.character), toupper))
```

```
## # A tibble: 87 × 6
## name height mass hair_color skin_color eye_color
## <chr> <int> <dbl> <chr> <chr> <chr> 
## 1 LUKE SKYWALKER 172 77 BLOND FAIR BLUE 
## 2 C-3PO 167 75 <NA> GOLD YELLOW 
## 3 R2-D2 96 32 <NA> WHITE, BLUE RED 
## 4 DARTH VADER 202 136 NONE WHITE YELLOW 
## 5 LEIA ORGANA 150 49 BROWN LIGHT BROWN 
## 6 OWEN LARS 178 120 BROWN, GREY LIGHT BLUE 
## 7 BERU WHITESUN LARS 165 75 BROWN LIGHT BLUE 
## 8 R5-D4 97 32 <NA> WHITE, RED RED 
## 9 BIGGS DARKLIGHTER 183 84 BLACK LIGHT BROWN 
## 10 OBI-WAN KENOBI 182 77 AUBURN, WHITE FAIR BLUE-GRAY
## # … with 77 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

---

# 5) dplyr::summarize

![](img/summarize.png)
![](img/group_by.png)

---

# 5) dplyr::summarize

Particularly useful in combination with the `group_by` command.

```r
starwars |> 
  group_by(species) |> 
  summarize(mean_height = mean(height))
```

```
## # A tibble: 38 × 2
## species mean_height
## <chr> <dbl>
## 1 Aleena 79 
## 2 Besalisk 198 
## 3 Cerean 198 
## 4 Chagrian 196 
## 5 Clawdite 168 
## 6 Droid NA 
## 7 Dug 112 
## 8 Ewok 88 
## 9 Geonosian 183 
## 10 Gungan 209.
## # … with 28 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

---

# 5) dplyr::summarize

Notice that some of these summarized values are missing. If we want to ignore missing values, use `na.rm = T`:

```r
## Much better
starwars |> 
  group_by(species) |> 
  summarize(mean_height = mean(height, na.rm = T))
```

```
## # A tibble: 38 × 2
## species mean_height
## <chr> <dbl>
## 1 Aleena 79 
## 2 Besalisk 198 
## 3 Cerean 198 
## 4 Chagrian 196 
## 5 Clawdite 168 
## 6 Droid 131.
## 7 Dug 112 
## 8 Ewok 88 
## 9 Geonosian 183 
## 10 Gungan 209.
## # … with 28 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

---

# 5) dplyr::summarize

The same `across`-based workflow that we saw with `mutate` a few slides back also works with `summarize`.

```r
starwars |> 
  group_by(species) |>
  summarize(across(where(is.numeric), mean))
```

```
## # A tibble: 38 × 4
## species height mass birth_year
## <chr> <dbl> <dbl> <dbl>
## 1 Aleena 79 15 NA
## 2 Besalisk 198 102 NA
## 3 Cerean 198 82 92
## 4 Chagrian 196 NA NA
## 5 Clawdite 168 55 NA
## 6 Droid NA NA NA
## 7 Dug 112 40 NA
## 8 Ewok 88 20 8
## 9 Geonosian 183 80 NA
## 10 Gungan 209. NA NA
## # ℹ 28 more rows
```

---

# 5) dplyr::summarize

The same `across`-based workflow that we saw with `mutate` a few slides back also works with `summarize`. Though to add arguments, we have to use an **anonymous function:**

```r
starwars |> 
  group_by(species) |>
  summarize(across(where(is.numeric), ~ mean(.x, na.rm=T)))
```

```
## # A tibble: 38 × 4
## species height mass birth_year
## <chr> <dbl> <dbl> <dbl>
## 1 Aleena 79 15 NaN 
## 2 Besalisk 198 102 NaN 
## 3 Cerean 198 82 92 
## 4 Chagrian 196 NaN NaN 
## 5 Clawdite 168 55 NaN 
## 6 Droid 131. 69.8 53.3
## 7 Dug 112 40 NaN 
## 8 Ewok 88 20 8 
## 9 Geonosian 183 80 NaN 
## 10 Gungan 209. 74 52 
## # ℹ 28 more rows
```

---

# Other dplyr goodies

`ungroup`: For ungrouping after using `group_by`.
- Use after doing your grouped `summarize` or `mutate` operation, or everything else you do will be super slow.

`slice`: Subset rows by position rather than filtering by values.
- E.g. `starwars |> slice(1:10)`

`pull`: Extract a column from as a data frame as a vector or scalar.
- E.g. `starwars |> filter(sex=="female") |> pull(height)`

`distinct` and `count`: List unique values, with or without their number of appearances.
- E.g. `starwars |> distinct(species)`, or `starwars |> count(species)`
- `count` is equivalent to `group_by` and `summarize` with `n()`:

```r
starwars |> group_by(species) |> summarize(n = n())
```

---

# Challenge

Take a few minutes to try this on your own:

**List the most common eye colors among female Star Wars characters in descending order of frequency.**

---

# Storing results in memory

So far we haven't been saving the dataframes that result from our code in memory. Usually, we will want to use them for the next task.

**Solution:** Create a new object each time you write a pipeline.

```r
women = starwars |> filter(sex == "female")
brown_eyed_women = women |> filter(eye_color == "brown")
```

Resist the temptation to use the same object name. This is called **clobbering** and it ruins your ability to easily look back at a previous version of your object.

```r
# DON'T do this
starwars = starwars |> filter(sex == "female")
```

By keeping multiple copies of very similar dataframes, will you waste your computer's memory? Usually, no -- R is smart and stores only the changes between objects.

---
class: inverse, middle
name: joins

# Joins

---

# Joining operations

Most data analysis requires combining information from multiple data tables. To combine tables, we use **joins** (similar to `base::merge()`).

![](img/join-inner.png)

There are multiple types of joins.

An **inner join** returns a dataframe of all rows that appear in both x and y.
- It keeps *only* rows that appear in both tables.

.smaller[Image is from ["R for Data Science"](https://r4ds.had.co.nz/relational-data.html) by Hadley Wickham & Garrett Grolemund, used under [CC BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/us/) and not included under this resource's overall CC license.]

---

# Joining operations

.pull-left[
A **left join** keeps all rows in x.
- For matched rows, it merges in the values of y
- For unmatched rows, it fills in NA values for the variables coming from y.

A **right join** keeps all rows in y.
- Same as a left join, just reversed.

A **full join** keeps all rows in x and y.

.smaller[Image is from ["R for Data Science"](https://r4ds.had.co.nz/relational-data.html) by Wickham & Grolemund, used under [CC BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/us/) and not included under this resource's overall CC license.]

]
.pull-right[
![](img/join-outer.png)
]

---

# Joining operations

**Left joins are safer than inner joins**
- It's too easy to lose observations with inner joins
- Left joins preserve all the original observations

**Use the left join** unless you have a strong reason to prefer another.

For the examples that I'm going to show here, we'll need some data sets that come bundled with the [**nycflights13**](http://github.com/hadley/nycflights13) package. 
- Load it now and then inspect these data frames on your own (click on them in your Environment pane).
- Pay attention to what the `year` column means in each data frame.

```r
library(nycflights13)
data(flights)
data(planes)
```

---

# Joining operations

Let's perform a left join on the flights and planes datasets.

```r
flights |> 
  left_join(planes)
```

```
## Joining, by = c("year", "tailnum")
```

```
## # A tibble: 336,776 × 10
## year month day dep_time arr_time carrier flight tailnum type model
## <int> <int> <int> <int> <int> <chr> <int> <chr> <chr> <chr>
## 1 2013 1 1 517 830 UA 1545 N14228 <NA> <NA> 
## 2 2013 1 1 533 850 UA 1714 N24211 <NA> <NA> 
## 3 2013 1 1 542 923 AA 1141 N619AA <NA> <NA> 
## 4 2013 1 1 544 1004 B6 725 N804JB <NA> <NA> 
## 5 2013 1 1 554 812 DL 461 N668DN <NA> <NA> 
## 6 2013 1 1 554 740 UA 1696 N39463 <NA> <NA> 
## 7 2013 1 1 555 913 B6 507 N516JB <NA> <NA> 
## 8 2013 1 1 557 709 EV 5708 N829AS <NA> <NA> 
## 9 2013 1 1 557 838 B6 79 N593JB <NA> <NA> 
## 10 2013 1 1 558 753 AA 301 N3ALAA <NA> <NA> 
## # … with 336,766 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

---

# Joining operations

(*continued from previous slide*)

Note that dplyr made a reasonable guess about which columns to join on (i.e. columns that share the same name). It also told us its choices:

```
*## Joining with `by = join_by(year, tailnum)`
```

**What's the problem here?**

The variable "year" does not have a consistent meaning across our joining datasets!
- In one it refers to the *year of flight*, in the other it refers to *year of construction*.

Luckily, there's an easy way to avoid this problem. 
- Can you figure it out? Try `?dplyr::join`.

---

# Joining operations

(*continued from previous slide*)

You just need to be more explicit in your join call by using the `by = ` argument.
- Here, it would be good to rename any ambiguous columns to avoid confusion.

```r
flights |>
  left_join(
    planes |> rename(year_built = year),
    by = "tailnum"
    )
```

```
## # A tibble: 3 × 11
## year month day dep_time arr_time carrier flight tailnum year_…¹ type model
## <int> <int> <int> <int> <int> <chr> <int> <chr> <int> <chr> <chr>
## 1 2013 1 1 517 830 UA 1545 N14228 1999 Fixe… 737-…
## 2 2013 1 1 533 850 UA 1714 N24211 1998 Fixe… 737-…
## 3 2013 1 1 542 923 AA 1141 N619AA 1990 Fixe… 757-…
## # … with abbreviated variable name ¹year_built
```

---
class: inverse, middle
name: tidyr

# tidyr

---

# Key tidyr verbs

1. `pivot_longer`: Pivot wide data into long format.

2. `pivot_wider`: Pivot long data into wide format.

3. `separate`: Separate (i.e. split) one column into multiple columns.

4. `unite`: Unite (i.e. combine) multiple columns into one.

Which of `pivot_longer` vs `pivot_wider` produces "tidy" data?
  
---

# 1) tidyr::pivot_longer

```r
stocks = data.frame( ## Could use "tibble" instead of "data.frame" if you prefer
  time = as.Date('2009-01-01') + 0:1,
  X = rnorm(2, 10, 1),
  Y = rnorm(2, 10, 2),
  Z = rnorm(2, 10, 5)
  )
stocks
```

```
##         time        X         Y         Z
## 1 2009-01-01 9.353794 10.892745  9.648994
## 2 2009-01-02 9.657906  7.384239 19.472134
```

```r
tidy_stocks = stocks |> 
  pivot_longer(cols=X:Z, names_to="stock", values_to="price")
tidy_stocks
```

```
## # A tibble: 6 × 3
## time stock price
## <date> <chr> <dbl>
## 1 2009-01-01 X 9.35
## 2 2009-01-01 Y 10.9 
## 3 2009-01-01 Z 9.65
## 4 2009-01-02 X 9.66
## 5 2009-01-02 Y 7.38
## 6 2009-01-02 Z 19.5
```

---

# 2) tidyr::pivot_wider

Now we can use pivot_wider to go back to the original dataframe:

```r
tidy_stocks |> pivot_wider(names_from=stock, values_from=price)
```

```
## # A tibble: 2 × 4
## time X Y Z
## <date> <dbl> <dbl> <dbl>
## 1 2009-01-01 9.35 10.9 9.65
## 2 2009-01-02 9.66 7.38 19.5
```

Or, we can put it into a new ("transposed") format, in which the observations are stocks and the columns are dates:

```r
tidy_stocks |> pivot_wider(names_from=time, values_from=price)
```

```
## # A tibble: 3 × 3
## stock `2009-01-01` `2009-01-02`
## <chr> <dbl> <dbl>
## 1 X 9.35 9.66
## 2 Y 10.9 7.38
## 3 Z 9.65 19.5
```

---

# 3) tidyr::separate

`separate` helps when you have more than one value in a single column:

```r
economists = data.frame(name = c("Adam_Smith", "Paul_Samuelson", "Milton_Friedman"))
economists
```

```
##              name
## 1      Adam_Smith
## 2  Paul_Samuelson
## 3 Milton_Friedman
```

```r
economists |> separate(name, c("first_name", "last_name")) 
```

```
##   first_name last_name
## 1       Adam     Smith
## 2       Paul Samuelson
## 3     Milton  Friedman
```

This command is pretty smart. But to avoid ambiguity, you can also specify the separation character with the `sep` argument:

```r
economists |> separate(name, c("first_name", "last_name"), sep = "_")
```

---

# 3) tidyr::separate

Related is `separate_rows`, for splitting cells with multiple values into multiple rows:

```r
jobs = data.frame(
  name = c("Joe", "Jill"),
  occupation = c("President", "First Lady, Professor, Grandmother")
  ) 
jobs
```

```
##   name                         occupation
## 1  Joe                          President
## 2 Jill First Lady, Professor, Grandmother
```

```r
# Now split out Jill's various occupations into different rows
jobs |> separate_rows(occupation)
```

```
## # A tibble: 5 × 2
## name occupation 
## <chr> <chr> 
## 1 Joe President 
## 2 Jill First 
## 3 Jill Lady 
## 4 Jill Professor 
## 5 Jill Grandmother
```

---

# 3) tidyr::separate

Related is `separate_rows`, for splitting cells with multiple values into multiple rows:

```r
jobs = data.frame(
  name = c("Joe", "Jill"),
  occupation = c("President", "First Lady, Professor, Grandmother")
  ) 
jobs
```

```
##   name                         occupation
## 1  Joe                          President
## 2 Jill First Lady, Professor, Grandmother
```

```r
# Now split out Jill's various occupations into different rows
jobs |> separate_rows(occupation, sep = ", ")
```

```
## # A tibble: 4 × 2
## name occupation 
## <chr> <chr> 
## 1 Joe President 
## 2 Jill First Lady 
## 3 Jill Professor 
## 4 Jill Grandmother
```

---

# 4) tidyr::unite

```r
gdp = data.frame(
  yr = rep(2016, times = 4),
  mnth = rep(1, times = 4),
  dy = 1:4,
  gdp = rnorm(4, mean = 100, sd = 2)
  )
gdp 
```

```
##     yr mnth dy       gdp
## 1 2016    1  1  98.44216
## 2 2016    1  2  98.98793
## 3 2016    1  3  99.91618
## 4 2016    1  4 100.97192
```

```r
## Combine "yr", "mnth", and "dy" into one "date" column
gdp |> unite(date, c("yr", "mnth", "dy"), sep = "-")
```

```
##       date       gdp
## 1 2016-1-1  98.44216
## 2 2016-1-2  98.98793
## 3 2016-1-3  99.91618
## 4 2016-1-4 100.97192
```

---

# 4) tidyr::unite

Note that `unite` will automatically create a character variable.

If you want to convert it to something else (e.g. date or numeric) then you will need to modify it using `mutate`. This example uses the [lubridate](https://lubridate.tidyverse.org/) package's super helpful date conversion functions.

```r
library(lubridate)
gdp_u |> mutate(date = ymd(date))
```

```
## # A tibble: 4 × 2
## date gdp
## <date> <dbl>
## 1 2016-01-01 98.4
## 2 2016-01-02 99.0
## 3 2016-01-03 99.9
## 4 2016-01-04 101.
```

---

# Challenge

**Using `nycflights13`, create a table of average arrival delay (in minutes) by day (in rows) and carrier (in columns).**

---

# Challenge

**Using `nycflights13`, create a table of average arrival delay (in minutes) by day (in rows) and carrier (in columns).**

Hint: Recall that you can tabulate summary statistics using `group_by` and `summarize`:

```r
flights |> 
  group_by(carrier) |> 
  summarize(avg_late = mean(arr_delay, na.rm=T))
```

```
## # A tibble: 16 × 2
## carrier avg_late
## <chr> <dbl>
## 1 9E 7.38 
## 2 AA 0.364
## 3 AS -9.93 
## 4 B6 9.46 
## 5 DL 1.64 
## 6 EV 15.8 
## 7 F9 21.9 
## 8 FL 20.1 
## 9 HA -6.92 
## 10 MQ 10.8 
## 11 OO 11.9 
## 12 UA 3.56 
## 13 US 2.13 
## 14 VX 1.76 
## 15 WN 9.65 
## 16 YV 15.6
```

---
class: inverse, middle
name: readr

# Importing data with readr

---

# Download this data

Most data does not come nicely in R packages! You will download it from somewhere and load it in R.

**Go here:** <https://github.com/nytimes/covid-19-data/tree/master/colleges> 
and click on the **Raw CSV** link.

* This is a list of Covid-19 case counts reported at U.S. colleges and universities between July 2020 and May 2021.

In your browser, **save** this file somewhere sensible on your computer (perhaps a folder you already created to put your work for this class).

Your computer stores all its programs, files, and information in a **filesystem** consisting of a set of nested **folders** or **directories.**

---

# Paths and directories

The **working directory** is your "current location" in the filesystem. What's your current working directory?

```r
getwd()
```

```
## [1] "C:/git/covid-example"
```

* This is an example of a full or **absolute path.**
* It starts with `C:/` on Windows or `/` on Mac.

**Relative paths** are defined relative to the full path of the working directory.
* Let's say your working directory is `"C:/git/ecns-560/"`
* If you had a file saved at `"C:/git/ecns-560/assignment-3/assignment-3.Rmd"`
* Its relative path would be `"assignment-3/assignment-3.Rmd"`

It's always best to use **relative paths,** though you *can* also use an absolute path in any given situation.

---
  
# Setting the working directory
  
You can change your working directory with `setwd(dir)`. Now:

1. **Find the location you saved your CSV file and copy the filepath.**

2. **Use the console in R to set your working directory to that location.**

For example:

```r
setwd("C:/git/covid-example")
```

**Do not use `setwd()` in scripts.** Setting the working directory is one of the *few exceptions* to the rule that you should always write all your code in scripts.

* Someone else's working directory will be different from yours!
* You want your code to be portable.

---

# Read in data with readr

The tidyverse package `readr` provides lots of options to read data into R. Read in the college Covid data like this:

```r
dat = read_csv("colleges.csv")
```

`View` this data to take a look at it. It automatically formatted everything as a nice and neat data frame!

```r
str(dat)
```

```
## spc_tbl_ [1,948 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date : Date[1:1948], format: "2021-05-26" "2021-05-26" ...
## $ state : chr [1:1948] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ county : chr [1:1948] "Madison" "Montgomery" "Limestone" "Lee" ...
## $ city : chr [1:1948] "Huntsville" "Montgomery" "Athens" "Auburn" ...
## $ ipeds_id : chr [1:1948] "100654" "100724" "100812" "100858" ...
## $ college : chr [1:1948] "Alabama A&M University" "Alabama State University" "Athens State University" "Auburn University" ...
## $ cases : num [1:1948] 41 2 45 2742 220 ...
## $ cases_2021: num [1:1948] NA NA 10 567 80 NA 49 53 10 35 ...
## $ notes : chr [1:1948] NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. date = col_date(format = ""),
## .. state = col_character(),
## .. county = col_character(),
## .. city = col_character(),
## .. ipeds_id = col_character(),
## .. college = col_character(),
## .. cases = col_double(),
## .. cases_2021 = col_double(),
## .. notes = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
```

---

# Challenge

**Which state had the least total reported Covid-19 cases at colleges and universities?**

Is it Montana?

---

# Other tips for importing data

* Many things can go wrong with CSV files. When this happens, `read_csv` has many options -- look at the help file.

* You can read files directly from the internet without downloading them first (but you usually shouldn't!):
.smaller[

```r
url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/colleges/colleges.csv"
dat = read_csv(url)
```
]

* `write_csv` outputs a dataframe as a CSV to your computer.

* `readxl::read_excel` reads directly from Excel spreadsheets.

* `googlesheets4::read_sheet` reads directly from Google Sheets.

* `haven::read_dta` reads directly from Stata files! (Stata can't read R files....)

---
class: inverse, middle
name: summary

# Summary

---

# Key tidyverse functions

.pull-left[
### dplyr
1. `filter`
2. `arrange`
3. `select`
4. `mutate`
5. `summarize`
]
.pull-right[
### tidyr
1. `pivot_longer`
2. `pivot_wider`
3. `separate`
4. `unite`
]

.pull-left[
### Miscellaneous
1. Pipes (`|>`)
2. Grouping (`group_by`)
3. Joining (`left_join`)
]
.pull-right[
### Importing data
1. `getwd` & `setwd`
1. `read_csv`
]