+ - 0:00:00
Notes for current slide
Notes for next slide

Topic 4: Data Wrangling in the Tidyverse

Nick Hagerty*
ECNS 460/560
Montana State University


*Parts of these slides are adapted from “Data Science for Economists” by Grant R. McDermott, used under the MIT License. Images are from ModernDive by Chester Ismay and Albert Y. Kim, used under CC BY-NC-SA 4.0.

1 / 67

Agenda for the next 2-3 weeks

Where we've been:

  1. R Basics

  2. Programming

  3. Productivity tools


Where we're going:

  1. Data wrangling in the tidyverse

  2. Data cleaning

  3. Introduce the term project

  4. Data acquisition & webscraping

2 / 67

Tidyverse basics

4 / 67

The tidyverse

A whole "universe" of functions within R

  • The most powerful, intuitive, and popular approach to data cleaning, wrangling, and visualization in R

Advantages:

  • Consistent philosophy and syntax
  • "Verb" based approach makes it more familiar to users of Stata/SAS/SPSS
  • Serves as the front-end for many other big data and ML tools

Why did I make you learn base R to start with?

  • The tidyverse can't do everything
  • The best solution to many problems involves a combination of tidyverse and base R
  • Ultimately you want to learn about many tools and choose the best one for the job

Note: Another approach with many fans is data.table, which you can learn about on your own!

5 / 67

Tidyverse vs. base R

There is often a direct correspondence between a tidyverse command and its base R equivalent.

tidyverse base
?readr::read_csv ?utils::read.csv
?dplyr::if_else ?base::ifelse
?tibble::tibble ?base::data.frame

Notice these generally follow a tidyverse::snake_case vs base::period.case rule.

The tidyverse alternative typically offers some enhancements or other useful options (and sometimes restrictions) over its base counterpart.

  • Remember: There are always multiple ways to achieve a single goal in R.
6 / 67

Tidying Data

The two most important properties of tidy data are:

  1. Each column is a unique variable.
  2. Each row is a single observation.

Tidy data is easier to work with, because you have a consistent way of referring to variables and observations. It then becomes easy to manipulate, visualize, and model.

Image is from "R for Data Science" by Hadley Wickham & Garrett Grolemund, used under CC BY-NC-ND 3.0 and is not included under this project's overall CC license.

7 / 67

Wide vs. Long Formats

Both of these data sets display information on heart rate observed in individuals across 3 different time periods:

## name time1 time2 time3
## 1 Wilbur 67 56 70
## 2 Petunia 80 90 67
## 3 Gregory 64 50 101
## name time heartrate
## 1 Wilbur 1 67
## 2 Petunia 1 80
## 3 Gregory 1 64
## 4 Wilbur 2 56
## 5 Petunia 2 90
## 6 Gregory 2 50
## 7 Wilbur 3 70
## 8 Petunia 3 67
## 9 Gregory 3 10

Which dataframe is in tidy format?

8 / 67

Wide vs. Long Formats

Wide data:

  • Row = patient. Columns = repeated observations over time.
  • Often easier to take in at a glance (as in a spreadsheet).

Long data:

  • Row = one observation. Columns = ID variables + observed variable.
  • Usually easier to clean, merge with other data, and avoid errors.

Tidy data is more likely to be long.

  • Most R packages have been written assuming your data is in long format.

"Tidy datasets are all alike but every messy dataset is messy in its own way."
– Hadley Wickham

9 / 67

Tidyverse packages

We need to install and load a couple of packages. Run these preliminaries:

install.packages('tidyverse')
install.packages('nycflights13')
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
10 / 67

Tidyverse packages

We need to install and load a couple of packages. Run these preliminaries:

install.packages('tidyverse')
install.packages('nycflights13')
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

We see that we have actually loaded a number of packages (which could also be loaded individually): ggplot2, tibble, dplyr, etc.

10 / 67

Tidyverse packages

The tidyverse actually comes with a lot more packages than those that are just loaded automatically.

tidyverse_packages()
## [1] "broom" "cli" "crayon" "dbplyr"
## [5] "dplyr" "dtplyr" "forcats" "googledrive"
## [9] "googlesheets4" "ggplot2" "haven" "hms"
## [13] "httr" "jsonlite" "lubridate" "magrittr"
## [17] "modelr" "pillar" "purrr" "readr"
## [21] "readxl" "reprex" "rlang" "rstudioapi"
## [25] "rvest" "stringr" "tibble" "tidyr"
## [29] "xml2" "tidyverse"

All of these are super useful

  • lubridate helps us work with dates
  • rvest is for webscraping

This lecture will focus on two that are automatically loaded: dplyr and tidyr.

11 / 67

Pipes: |> or %>%

Pipes take the output of one function and feed it into the first argument of the next (which you then skip).

dataframe |> filter(condition) is equivalent to filter(dataframe, condition).

Note: |> on these slides is generated by the two characters | >, without the space.

12 / 67

Pipes: |> or %>%

Pipes take the output of one function and feed it into the first argument of the next (which you then skip).

dataframe |> filter(condition) is equivalent to filter(dataframe, condition).

Note: |> on these slides is generated by the two characters | >, without the space.


Older version of the pipe: %>%

  • From the magrittr package loaded with the tidyverse
  • Works identically to |> in most situations.


Keyboard shortcut: Ctl/Cmd + Shift + M

  • Have to turn on a setting in RStudio options to make |> the default
12 / 67

Pipes: |> or %>%

Pipes can dramatically improve the experience of reading and writing code. Compare:

## These next two lines of code do exactly the same thing.
mpg |> filter(manufacturer=="audi") |> group_by(model) |> summarize(hwy_mean = mean(hwy))
summarize(group_by(filter(mpg, manufacturer=="audi"), model), hwy_mean = mean(hwy))

The first line reads from left to right, exactly how you think about the operations.

The second line totally inverts this logical order (the final operation comes first!)

13 / 67

Pipes: |> or %>%

Best practice is to put each function on its own line and indent. Look how much more readable this is:

mpg |>
filter(manufacturer == "audi") |>
group_by(model) |>
summarize(hwy_mean = mean(hwy))
## # A tibble: 3 × 2
## model hwy_mean
## <chr> <dbl>
## 1 a4 28.3
## 2 a4 quattro 25.8
## 3 a6 quattro 24

Vertical space costs nothing and makes for much more readable/writable code than cramming things horizontally.

All together, this multi-line line of code is called a pipeline.

14 / 67

dplyr

15 / 67

Key dplyr verbs

There are five key dplyr verbs that you need to learn.

  1. filter: Filter (i.e. subset) rows based on their values.

  2. arrange: Arrange (i.e. reorder) rows based on their values.

  3. select: Select (i.e. subset) columns by their names:

  4. mutate: Create new columns.

  5. summarize: Collapse multiple rows into a single summary value.1

1 Many sources spell summarise with an "s" -- either one works!

16 / 67

Key dplyr verbs

There are five key dplyr verbs that you need to learn.

  1. filter: Filter (i.e. subset) rows based on their values.

  2. arrange: Arrange (i.e. reorder) rows based on their values.

  3. select: Select (i.e. subset) columns by their names:

  4. mutate: Create new columns.

  5. summarize: Collapse multiple rows into a single summary value.1

1 Many sources spell summarise with an "s" -- either one works!


Let's practice these functions together using the starwars data frame that comes pre-packaged with dplyr.

16 / 67

1) dplyr::filter

17 / 67

1) dplyr::filter

We can chain multiple filter commands with the pipe (|>), or just separate them within a single filter command using commas.

starwars |>
filter(
species == "Human",
height >= 190
)
## # A tibble: 4 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
## 2 Qui-Gon Jinn 193 89 brown fair blue 92 male mascu… <NA>
## 3 Dooku 193 80 white fair brown 102 male mascu… Serenno
## 4 Bail Presto… 191 NA black tan brown 67 male mascu… Aldera…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## # ³​eye_color, ⁴​birth_year, ⁵​homeworld
## # ℹ Use colnames() to see all variable names
18 / 67

1) dplyr::filter

Regular expressions work well too.

starwars |>
filter(str_detect(name, "Skywalker"))
## # A tibble: 3 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sky… 172 77 blond fair blue 19 male mascu…
## 2 Anakin S… 188 84 blond fair blue 41.9 male mascu…
## 3 Shmi Sky… 163 NA black fair brown 72 fema… femin…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
19 / 67

1) dplyr::filter

A very common filter use case is identifying (or removing) missing data cases.

starwars |>
filter(is.na(height))
## # A tibble: 6 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Arvel Crynyd NA NA brown fair brown NA male mascu… <NA>
## 2 Finn NA NA black dark dark NA male mascu… <NA>
## 3 Rey NA NA brown light hazel NA fema… femin… <NA>
## 4 Poe Dameron NA NA brown light brown NA male mascu… <NA>
## 5 BB8 NA NA none none black NA none mascu… <NA>
## 6 Captain Pha… NA NA unknown unknown unknown NA <NA> <NA> <NA>
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## # ³​eye_color, ⁴​birth_year, ⁵​homeworld
## # ℹ Use colnames() to see all variable names
20 / 67

1) dplyr::filter

A very common filter use case is identifying (or removing) missing data cases.

starwars |>
filter(is.na(height))
## # A tibble: 6 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Arvel Crynyd NA NA brown fair brown NA male mascu… <NA>
## 2 Finn NA NA black dark dark NA male mascu… <NA>
## 3 Rey NA NA brown light hazel NA fema… femin… <NA>
## 4 Poe Dameron NA NA brown light brown NA male mascu… <NA>
## 5 BB8 NA NA none none black NA none mascu… <NA>
## 6 Captain Pha… NA NA unknown unknown unknown NA <NA> <NA> <NA>
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## # ³​eye_color, ⁴​birth_year, ⁵​homeworld
## # ℹ Use colnames() to see all variable names

To remove missing observations, simply use negation: filter(!is.na(height)).
Try this yourself.

20 / 67

2) dplyr::arrange

21 / 67

2) dplyr::arrange

arrange sorts your data frame by a particular column (numerically, or alphabetically)

starwars |>
arrange(birth_year)
## # A tibble: 87 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Wicket Sys… 88 20 brown brown brown 8 male mascu… Endor
## 2 IG-88 200 140 none metal red 15 none mascu… <NA>
## 3 Luke Skywa… 172 77 blond fair blue 19 male mascu… Tatooi…
## 4 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…
## 5 Wedge Anti… 170 77 brown fair hazel 21 male mascu… Corell…
## 6 Plo Koon 188 80 none orange black 22 male mascu… Dorin
## 7 Biggs Dark… 183 84 black light brown 24 male mascu… Tatooi…
## 8 Han Solo 180 80 brown fair brown 29 male mascu… Corell…
## 9 Lando Calr… 177 79 black dark brown 31 male mascu… Socorro
## 10 Boba Fett 183 78.2 black fair brown 31.5 male mascu… Kamino
## # … with 77 more rows, 4 more variables: species <chr>, films <list>,
## # vehicles <list>, starships <list>, and abbreviated variable names
## # ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
## # ℹ Use print(n = ...) to see more rows, and colnames() to see all variable names
22 / 67

2) dplyr::arrange

We can also arrange items in descending order using arrange(desc()).

starwars |>
arrange(desc(birth_year))
## # A tibble: 87 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Yoda 66 17 white green brown 896 male mascu… <NA>
## 2 Jabba Desi… 175 1358 <NA> green-… orange 600 herm… mascu… Nal Hu…
## 3 Chewbacca 228 112 brown unknown blue 200 male mascu… Kashyy…
## 4 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
## 5 Dooku 193 80 white fair brown 102 male mascu… Serenno
## 6 Qui-Gon Ji… 193 89 brown fair blue 92 male mascu… <NA>
## 7 Ki-Adi-Mun… 198 82 white pale yellow 92 male mascu… Cerea
## 8 Finis Valo… 170 NA blond fair blue 91 male mascu… Corusc…
## 9 Palpatine 170 75 grey pale yellow 82 male mascu… Naboo
## 10 Cliegg Lars 183 NA brown fair blue 82 male mascu… Tatooi…
## # … with 77 more rows, 4 more variables: species <chr>, films <list>,
## # vehicles <list>, starships <list>, and abbreviated variable names
## # ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
## # ℹ Use print(n = ...) to see more rows, and colnames() to see all variable names
23 / 67

3) dplyr::select

24 / 67

3) dplyr::select

Use commas to select multiple columns out of a data frame. (You can also use "first:last" for consecutive columns). Deselect a column with "-".

starwars |>
select(name:skin_color, species, -height)
## # A tibble: 87 × 5
## name mass hair_color skin_color species
## <chr> <dbl> <chr> <chr> <chr>
## 1 Luke Skywalker 77 blond fair Human
## 2 C-3PO 75 <NA> gold Droid
## 3 R2-D2 32 <NA> white, blue Droid
## 4 Darth Vader 136 none white Human
## 5 Leia Organa 49 brown light Human
## 6 Owen Lars 120 brown, grey light Human
## 7 Beru Whitesun lars 75 brown light Human
## 8 R5-D4 32 <NA> white, red Droid
## 9 Biggs Darklighter 84 black light Human
## 10 Obi-Wan Kenobi 77 auburn, white fair Human
## # … with 77 more rows
## # ℹ Use print(n = ...) to see more rows
25 / 67

3) dplyr::select

You can also rename some (or all) of your selected variables in place.

starwars |>
select(alias=name, planet=homeworld)
## # A tibble: 87 × 2
## alias planet
## <chr> <chr>
## 1 Luke Skywalker Tatooine
## 2 C-3PO Tatooine
## 3 R2-D2 Naboo
## 4 Darth Vader Tatooine
## 5 Leia Organa Alderaan
## 6 Owen Lars Tatooine
## 7 Beru Whitesun lars Tatooine
## 8 R5-D4 Tatooine
## 9 Biggs Darklighter Tatooine
## 10 Obi-Wan Kenobi Stewjon
## # ℹ 77 more rows
26 / 67

3) dplyr::select

You can also rename some (or all) of your selected variables in place.

starwars |>
select(alias=name, planet=homeworld)
## # A tibble: 87 × 2
## alias planet
## <chr> <chr>
## 1 Luke Skywalker Tatooine
## 2 C-3PO Tatooine
## 3 R2-D2 Naboo
## 4 Darth Vader Tatooine
## 5 Leia Organa Alderaan
## 6 Owen Lars Tatooine
## 7 Beru Whitesun lars Tatooine
## 8 R5-D4 Tatooine
## 9 Biggs Darklighter Tatooine
## 10 Obi-Wan Kenobi Stewjon
## # ℹ 77 more rows

If you just want to rename columns without subsetting them, you can use rename. Try this!

26 / 67

3) dplyr::select

The select(contains(PATTERN)) option provides a nice shortcut in relevant cases.

starwars |>
select(name, contains("color"))
## # A tibble: 87 × 4
## name hair_color skin_color eye_color
## <chr> <chr> <chr> <chr>
## 1 Luke Skywalker blond fair blue
## 2 C-3PO <NA> gold yellow
## 3 R2-D2 <NA> white, blue red
## 4 Darth Vader none white yellow
## 5 Leia Organa brown light brown
## 6 Owen Lars brown, grey light blue
## 7 Beru Whitesun lars brown light blue
## 8 R5-D4 <NA> white, red red
## 9 Biggs Darklighter black light brown
## 10 Obi-Wan Kenobi auburn, white fair blue-gray
## # … with 77 more rows
## # ℹ Use print(n = ...) to see more rows

Some other selection helpers: starts_with(), ends_with(), all_of(c("name1", "name2")), matches().

27 / 67

4) dplyr::mutate

28 / 67

4) dplyr::mutate

You can create new columns from scratch, or (more commonly) as transformations of existing columns.

starwars |>
select(name, birth_year) |>
mutate(dog_years = birth_year * 7) |>
mutate(comment = paste0(name, " is ", dog_years, " in dog years."))
## # A tibble: 87 × 4
## name birth_year dog_years comment
## <chr> <dbl> <dbl> <chr>
## 1 Luke Skywalker 19 133 Luke Skywalker is 133 in dog years.
## 2 C-3PO 112 784 C-3PO is 784 in dog years.
## 3 R2-D2 33 231 R2-D2 is 231 in dog years.
## 4 Darth Vader 41.9 293. Darth Vader is 293.3 in dog years.
## 5 Leia Organa 19 133 Leia Organa is 133 in dog years.
## 6 Owen Lars 52 364 Owen Lars is 364 in dog years.
## 7 Beru Whitesun lars 47 329 Beru Whitesun lars is 329 in dog yea…
## 8 R5-D4 NA NA R5-D4 is NA in dog years.
## 9 Biggs Darklighter 24 168 Biggs Darklighter is 168 in dog year…
## 10 Obi-Wan Kenobi 57 399 Obi-Wan Kenobi is 399 in dog years.
## # … with 77 more rows
## # ℹ Use print(n = ...) to see more rows
29 / 67

4) dplyr::mutate

Note: mutate is order aware. So you can chain multiple mutates in a single call.

starwars |>
select(name, birth_year) |>
mutate(
dog_years = birth_year * 7, # Separate with a comma
comment = paste0(name, " is ", dog_years, " in dog years.")
)
## # A tibble: 87 × 4
## name birth_year dog_years comment
## <chr> <dbl> <dbl> <chr>
## 1 Luke Skywalker 19 133 Luke Skywalker is 133 in dog years.
## 2 C-3PO 112 784 C-3PO is 784 in dog years.
## 3 R2-D2 33 231 R2-D2 is 231 in dog years.
## 4 Darth Vader 41.9 293. Darth Vader is 293.3 in dog years.
## 5 Leia Organa 19 133 Leia Organa is 133 in dog years.
## 6 Owen Lars 52 364 Owen Lars is 364 in dog years.
## 7 Beru Whitesun lars 47 329 Beru Whitesun lars is 329 in dog yea…
## 8 R5-D4 NA NA R5-D4 is NA in dog years.
## 9 Biggs Darklighter 24 168 Biggs Darklighter is 168 in dog year…
## 10 Obi-Wan Kenobi 57 399 Obi-Wan Kenobi is 399 in dog years.
## # … with 77 more rows
## # ℹ Use print(n = ...) to see more rows
30 / 67

4) dplyr::mutate

Boolean, logical and conditional operators all work well with mutate too.

starwars |>
select(name, height) |>
filter(name %in% c("Luke Skywalker", "Anakin Skywalker")) |>
mutate(tall1 = height > 180) |>
mutate(tall2 = if_else(height > 180, "Tall", "Short"))
## # A tibble: 2 × 4
## name height tall1 tall2
## <chr> <int> <lgl> <chr>
## 1 Luke Skywalker 172 FALSE Short
## 2 Anakin Skywalker 188 TRUE Tall
31 / 67

4) dplyr::mutate

Lastly, combining mutate with across allows you to easily perform the same operation on a subset of variables.

starwars |>
select(name:eye_color) |>
mutate(across(where(is.character), toupper))
## # A tibble: 87 × 6
## name height mass hair_color skin_color eye_color
## <chr> <int> <dbl> <chr> <chr> <chr>
## 1 LUKE SKYWALKER 172 77 BLOND FAIR BLUE
## 2 C-3PO 167 75 <NA> GOLD YELLOW
## 3 R2-D2 96 32 <NA> WHITE, BLUE RED
## 4 DARTH VADER 202 136 NONE WHITE YELLOW
## 5 LEIA ORGANA 150 49 BROWN LIGHT BROWN
## 6 OWEN LARS 178 120 BROWN, GREY LIGHT BLUE
## 7 BERU WHITESUN LARS 165 75 BROWN LIGHT BLUE
## 8 R5-D4 97 32 <NA> WHITE, RED RED
## 9 BIGGS DARKLIGHTER 183 84 BLACK LIGHT BROWN
## 10 OBI-WAN KENOBI 182 77 AUBURN, WHITE FAIR BLUE-GRAY
## # … with 77 more rows
## # ℹ Use print(n = ...) to see more rows
32 / 67

5) dplyr::summarize

33 / 67

5) dplyr::summarize

Particularly useful in combination with the group_by command.

starwars |>
group_by(species) |>
summarize(mean_height = mean(height))
## # A tibble: 38 × 2
## species mean_height
## <chr> <dbl>
## 1 Aleena 79
## 2 Besalisk 198
## 3 Cerean 198
## 4 Chagrian 196
## 5 Clawdite 168
## 6 Droid NA
## 7 Dug 112
## 8 Ewok 88
## 9 Geonosian 183
## 10 Gungan 209.
## # … with 28 more rows
## # ℹ Use print(n = ...) to see more rows
34 / 67

5) dplyr::summarize

Notice that some of these summarized values are missing. If we want to ignore missing values, use na.rm = T:

## Much better
starwars |>
group_by(species) |>
summarize(mean_height = mean(height, na.rm = T))
## # A tibble: 38 × 2
## species mean_height
## <chr> <dbl>
## 1 Aleena 79
## 2 Besalisk 198
## 3 Cerean 198
## 4 Chagrian 196
## 5 Clawdite 168
## 6 Droid 131.
## 7 Dug 112
## 8 Ewok 88
## 9 Geonosian 183
## 10 Gungan 209.
## # … with 28 more rows
## # ℹ Use print(n = ...) to see more rows
35 / 67

5) dplyr::summarize

The same across-based workflow that we saw with mutate a few slides back also works with summarize.

starwars |>
group_by(species) |>
summarize(across(where(is.numeric), mean))
## # A tibble: 38 × 4
## species height mass birth_year
## <chr> <dbl> <dbl> <dbl>
## 1 Aleena 79 15 NA
## 2 Besalisk 198 102 NA
## 3 Cerean 198 82 92
## 4 Chagrian 196 NA NA
## 5 Clawdite 168 55 NA
## 6 Droid NA NA NA
## 7 Dug 112 40 NA
## 8 Ewok 88 20 8
## 9 Geonosian 183 80 NA
## 10 Gungan 209. NA NA
## # ℹ 28 more rows
36 / 67

5) dplyr::summarize

The same across-based workflow that we saw with mutate a few slides back also works with summarize. Though to add arguments, we have to use an anonymous function:

starwars |>
group_by(species) |>
summarize(across(where(is.numeric), ~ mean(.x, na.rm=T)))
## # A tibble: 38 × 4
## species height mass birth_year
## <chr> <dbl> <dbl> <dbl>
## 1 Aleena 79 15 NaN
## 2 Besalisk 198 102 NaN
## 3 Cerean 198 82 92
## 4 Chagrian 196 NaN NaN
## 5 Clawdite 168 55 NaN
## 6 Droid 131. 69.8 53.3
## 7 Dug 112 40 NaN
## 8 Ewok 88 20 8
## 9 Geonosian 183 80 NaN
## 10 Gungan 209. 74 52
## # ℹ 28 more rows
37 / 67

Other dplyr goodies

ungroup: For ungrouping after using group_by.

  • Use after doing your grouped summarize or mutate operation, or everything else you do will be super slow.
38 / 67

Other dplyr goodies

ungroup: For ungrouping after using group_by.

  • Use after doing your grouped summarize or mutate operation, or everything else you do will be super slow.

slice: Subset rows by position rather than filtering by values.

  • E.g. starwars |> slice(1:10)
38 / 67

Other dplyr goodies

ungroup: For ungrouping after using group_by.

  • Use after doing your grouped summarize or mutate operation, or everything else you do will be super slow.

slice: Subset rows by position rather than filtering by values.

  • E.g. starwars |> slice(1:10)

pull: Extract a column from as a data frame as a vector or scalar.

  • E.g. starwars |> filter(sex=="female") |> pull(height)
38 / 67

Other dplyr goodies

ungroup: For ungrouping after using group_by.

  • Use after doing your grouped summarize or mutate operation, or everything else you do will be super slow.

slice: Subset rows by position rather than filtering by values.

  • E.g. starwars |> slice(1:10)

pull: Extract a column from as a data frame as a vector or scalar.

  • E.g. starwars |> filter(sex=="female") |> pull(height)

distinct and count: List unique values, with or without their number of appearances.

  • E.g. starwars |> distinct(species), or starwars |> count(species)
  • count is equivalent to group_by and summarize with n():
starwars |> group_by(species) |> summarize(n = n())
38 / 67

Challenge

Take a few minutes to try this on your own:

List the most common eye colors among female Star Wars characters in descending order of frequency.

39 / 67

Storing results in memory

So far we haven't been saving the dataframes that result from our code in memory. Usually, we will want to use them for the next task.

Solution: Create a new object each time you write a pipeline.

women = starwars |> filter(sex == "female")
brown_eyed_women = women |> filter(eye_color == "brown")

Resist the temptation to use the same object name. This is called clobbering and it ruins your ability to easily look back at a previous version of your object.

# DON'T do this
starwars = starwars |> filter(sex == "female")
40 / 67

Storing results in memory

So far we haven't been saving the dataframes that result from our code in memory. Usually, we will want to use them for the next task.

Solution: Create a new object each time you write a pipeline.

women = starwars |> filter(sex == "female")
brown_eyed_women = women |> filter(eye_color == "brown")

Resist the temptation to use the same object name. This is called clobbering and it ruins your ability to easily look back at a previous version of your object.

# DON'T do this
starwars = starwars |> filter(sex == "female")

By keeping multiple copies of very similar dataframes, will you waste your computer's memory? Usually, no -- R is smart and stores only the changes between objects.

40 / 67

Joins

41 / 67

Joining operations

Most data analysis requires combining information from multiple data tables. To combine tables, we use joins (similar to base::merge()).

42 / 67

Joining operations

Most data analysis requires combining information from multiple data tables. To combine tables, we use joins (similar to base::merge()).

There are multiple types of joins.

An inner join returns a dataframe of all rows that appear in both x and y.

  • It keeps only rows that appear in both tables.

Image is from "R for Data Science" by Hadley Wickham & Garrett Grolemund, used under CC BY-NC-ND 3.0 and not included under this resource's overall CC license.

42 / 67

Joining operations

A left join keeps all rows in x.

  • For matched rows, it merges in the values of y
  • For unmatched rows, it fills in NA values for the variables coming from y.

A right join keeps all rows in y.

  • Same as a left join, just reversed.

A full join keeps all rows in x and y.


Image is from "R for Data Science" by Wickham & Grolemund, used under CC BY-NC-ND 3.0 and not included under this resource's overall CC license.

43 / 67

Joining operations

Left joins are safer than inner joins

  • It's too easy to lose observations with inner joins
  • Left joins preserve all the original observations

Use the left join unless you have a strong reason to prefer another.

44 / 67

Joining operations

Left joins are safer than inner joins

  • It's too easy to lose observations with inner joins
  • Left joins preserve all the original observations

Use the left join unless you have a strong reason to prefer another.


For the examples that I'm going to show here, we'll need some data sets that come bundled with the nycflights13 package.

  • Load it now and then inspect these data frames on your own (click on them in your Environment pane).
  • Pay attention to what the year column means in each data frame.
library(nycflights13)
data(flights)
data(planes)
44 / 67

Joining operations

Let's perform a left join on the flights and planes datasets.

flights |>
left_join(planes)
## Joining, by = c("year", "tailnum")
## # A tibble: 336,776 × 10
## year month day dep_time arr_time carrier flight tailnum type model
## <int> <int> <int> <int> <int> <chr> <int> <chr> <chr> <chr>
## 1 2013 1 1 517 830 UA 1545 N14228 <NA> <NA>
## 2 2013 1 1 533 850 UA 1714 N24211 <NA> <NA>
## 3 2013 1 1 542 923 AA 1141 N619AA <NA> <NA>
## 4 2013 1 1 544 1004 B6 725 N804JB <NA> <NA>
## 5 2013 1 1 554 812 DL 461 N668DN <NA> <NA>
## 6 2013 1 1 554 740 UA 1696 N39463 <NA> <NA>
## 7 2013 1 1 555 913 B6 507 N516JB <NA> <NA>
## 8 2013 1 1 557 709 EV 5708 N829AS <NA> <NA>
## 9 2013 1 1 557 838 B6 79 N593JB <NA> <NA>
## 10 2013 1 1 558 753 AA 301 N3ALAA <NA> <NA>
## # … with 336,766 more rows
## # ℹ Use print(n = ...) to see more rows
45 / 67

Joining operations

(continued from previous slide)

Note that dplyr made a reasonable guess about which columns to join on (i.e. columns that share the same name). It also told us its choices:

## Joining with by = join_by(year, tailnum)

What's the problem here?

46 / 67

Joining operations

(continued from previous slide)

Note that dplyr made a reasonable guess about which columns to join on (i.e. columns that share the same name). It also told us its choices:

## Joining with by = join_by(year, tailnum)

What's the problem here?

The variable "year" does not have a consistent meaning across our joining datasets!

  • In one it refers to the year of flight, in the other it refers to year of construction.
46 / 67

Joining operations

(continued from previous slide)

Note that dplyr made a reasonable guess about which columns to join on (i.e. columns that share the same name). It also told us its choices:

## Joining with by = join_by(year, tailnum)

What's the problem here?

The variable "year" does not have a consistent meaning across our joining datasets!

  • In one it refers to the year of flight, in the other it refers to year of construction.

Luckily, there's an easy way to avoid this problem.

  • Can you figure it out? Try ?dplyr::join.
46 / 67

Joining operations

(continued from previous slide)

You just need to be more explicit in your join call by using the by = argument.

  • Here, it would be good to rename any ambiguous columns to avoid confusion.
flights |>
left_join(
planes |> rename(year_built = year),
by = "tailnum"
)
## # A tibble: 3 × 11
## year month day dep_time arr_time carrier flight tailnum year_…¹ type model
## <int> <int> <int> <int> <int> <chr> <int> <chr> <int> <chr> <chr>
## 1 2013 1 1 517 830 UA 1545 N14228 1999 Fixe… 737-…
## 2 2013 1 1 533 850 UA 1714 N24211 1998 Fixe… 737-…
## 3 2013 1 1 542 923 AA 1141 N619AA 1990 Fixe… 757-…
## # … with abbreviated variable name ¹​year_built
47 / 67

tidyr

48 / 67

Key tidyr verbs

  1. pivot_longer: Pivot wide data into long format.

  2. pivot_wider: Pivot long data into wide format.

  3. separate: Separate (i.e. split) one column into multiple columns.

  4. unite: Unite (i.e. combine) multiple columns into one.

49 / 67

Key tidyr verbs

  1. pivot_longer: Pivot wide data into long format.

  2. pivot_wider: Pivot long data into wide format.

  3. separate: Separate (i.e. split) one column into multiple columns.

  4. unite: Unite (i.e. combine) multiple columns into one.


Which of pivot_longer vs pivot_wider produces "tidy" data?

49 / 67

1) tidyr::pivot_longer

stocks = data.frame( ## Could use "tibble" instead of "data.frame" if you prefer
time = as.Date('2009-01-01') + 0:1,
X = rnorm(2, 10, 1),
Y = rnorm(2, 10, 2),
Z = rnorm(2, 10, 5)
)
stocks
## time X Y Z
## 1 2009-01-01 9.353794 10.892745 9.648994
## 2 2009-01-02 9.657906 7.384239 19.472134
tidy_stocks = stocks |>
pivot_longer(cols=X:Z, names_to="stock", values_to="price")
tidy_stocks
## # A tibble: 6 × 3
## time stock price
## <date> <chr> <dbl>
## 1 2009-01-01 X 9.35
## 2 2009-01-01 Y 10.9
## 3 2009-01-01 Z 9.65
## 4 2009-01-02 X 9.66
## 5 2009-01-02 Y 7.38
## 6 2009-01-02 Z 19.5
50 / 67

2) tidyr::pivot_wider

Now we can use pivot_wider to go back to the original dataframe:

tidy_stocks |> pivot_wider(names_from=stock, values_from=price)
## # A tibble: 2 × 4
## time X Y Z
## <date> <dbl> <dbl> <dbl>
## 1 2009-01-01 9.35 10.9 9.65
## 2 2009-01-02 9.66 7.38 19.5

Or, we can put it into a new ("transposed") format, in which the observations are stocks and the columns are dates:

tidy_stocks |> pivot_wider(names_from=time, values_from=price)
## # A tibble: 3 × 3
## stock 2009-01-01 2009-01-02
## <chr> <dbl> <dbl>
## 1 X 9.35 9.66
## 2 Y 10.9 7.38
## 3 Z 9.65 19.5
51 / 67

3) tidyr::separate

separate helps when you have more than one value in a single column:

economists = data.frame(name = c("Adam_Smith", "Paul_Samuelson", "Milton_Friedman"))
economists
## name
## 1 Adam_Smith
## 2 Paul_Samuelson
## 3 Milton_Friedman
economists |> separate(name, c("first_name", "last_name"))
## first_name last_name
## 1 Adam Smith
## 2 Paul Samuelson
## 3 Milton Friedman
52 / 67

3) tidyr::separate

separate helps when you have more than one value in a single column:

economists = data.frame(name = c("Adam_Smith", "Paul_Samuelson", "Milton_Friedman"))
economists
## name
## 1 Adam_Smith
## 2 Paul_Samuelson
## 3 Milton_Friedman
economists |> separate(name, c("first_name", "last_name"))
## first_name last_name
## 1 Adam Smith
## 2 Paul Samuelson
## 3 Milton Friedman

This command is pretty smart. But to avoid ambiguity, you can also specify the separation character with the sep argument:

economists |> separate(name, c("first_name", "last_name"), sep = "_")
52 / 67

3) tidyr::separate

Related is separate_rows, for splitting cells with multiple values into multiple rows:

jobs = data.frame(
name = c("Joe", "Jill"),
occupation = c("President", "First Lady, Professor, Grandmother")
)
jobs
## name occupation
## 1 Joe President
## 2 Jill First Lady, Professor, Grandmother
# Now split out Jill's various occupations into different rows
jobs |> separate_rows(occupation)
## # A tibble: 5 × 2
## name occupation
## <chr> <chr>
## 1 Joe President
## 2 Jill First
## 3 Jill Lady
## 4 Jill Professor
## 5 Jill Grandmother
53 / 67

3) tidyr::separate

Related is separate_rows, for splitting cells with multiple values into multiple rows:

jobs = data.frame(
name = c("Joe", "Jill"),
occupation = c("President", "First Lady, Professor, Grandmother")
)
jobs
## name occupation
## 1 Joe President
## 2 Jill First Lady, Professor, Grandmother
# Now split out Jill's various occupations into different rows
jobs |> separate_rows(occupation, sep = ", ")
## # A tibble: 4 × 2
## name occupation
## <chr> <chr>
## 1 Joe President
## 2 Jill First Lady
## 3 Jill Professor
## 4 Jill Grandmother
54 / 67

4) tidyr::unite

gdp = data.frame(
yr = rep(2016, times = 4),
mnth = rep(1, times = 4),
dy = 1:4,
gdp = rnorm(4, mean = 100, sd = 2)
)
gdp
## yr mnth dy gdp
## 1 2016 1 1 98.44216
## 2 2016 1 2 98.98793
## 3 2016 1 3 99.91618
## 4 2016 1 4 100.97192
## Combine "yr", "mnth", and "dy" into one "date" column
gdp |> unite(date, c("yr", "mnth", "dy"), sep = "-")
## date gdp
## 1 2016-1-1 98.44216
## 2 2016-1-2 98.98793
## 3 2016-1-3 99.91618
## 4 2016-1-4 100.97192
55 / 67

4) tidyr::unite

Note that unite will automatically create a character variable.

If you want to convert it to something else (e.g. date or numeric) then you will need to modify it using mutate. This example uses the lubridate package's super helpful date conversion functions.

library(lubridate)
gdp_u |> mutate(date = ymd(date))
## # A tibble: 4 × 2
## date gdp
## <date> <dbl>
## 1 2016-01-01 98.4
## 2 2016-01-02 99.0
## 3 2016-01-03 99.9
## 4 2016-01-04 101.
56 / 67

Challenge

Using nycflights13, create a table of average arrival delay (in minutes) by day (in rows) and carrier (in columns).

57 / 67

Challenge

Using nycflights13, create a table of average arrival delay (in minutes) by day (in rows) and carrier (in columns).

Hint: Recall that you can tabulate summary statistics using group_by and summarize:

flights |>
group_by(carrier) |>
summarize(avg_late = mean(arr_delay, na.rm=T))
## # A tibble: 16 × 2
## carrier avg_late
## <chr> <dbl>
## 1 9E 7.38
## 2 AA 0.364
## 3 AS -9.93
## 4 B6 9.46
## 5 DL 1.64
## 6 EV 15.8
## 7 F9 21.9
## 8 FL 20.1
## 9 HA -6.92
## 10 MQ 10.8
## 11 OO 11.9
## 12 UA 3.56
## 13 US 2.13
## 14 VX 1.76
## 15 WN 9.65
## 16 YV 15.6
58 / 67

Importing data with readr

59 / 67

Download this data

Most data does not come nicely in R packages! You will download it from somewhere and load it in R.

Go here: https://github.com/nytimes/covid-19-data/tree/master/colleges
and click on the Raw CSV link.

  • This is a list of Covid-19 case counts reported at U.S. colleges and universities between July 2020 and May 2021.

In your browser, save this file somewhere sensible on your computer (perhaps a folder you already created to put your work for this class).

Your computer stores all its programs, files, and information in a filesystem consisting of a set of nested folders or directories.

60 / 67

Paths and directories

The working directory is your "current location" in the filesystem. What's your current working directory?

getwd()
## [1] "C:/git/covid-example"
  • This is an example of a full or absolute path.
  • It starts with C:/ on Windows or / on Mac.

Relative paths are defined relative to the full path of the working directory.

  • Let's say your working directory is "C:/git/ecns-560/"
  • If you had a file saved at "C:/git/ecns-560/assignment-3/assignment-3.Rmd"
  • Its relative path would be "assignment-3/assignment-3.Rmd"

It's always best to use relative paths, though you can also use an absolute path in any given situation.

61 / 67

Setting the working directory

You can change your working directory with setwd(dir). Now:

  1. Find the location you saved your CSV file and copy the filepath.

  2. Use the console in R to set your working directory to that location.

For example:

setwd("C:/git/covid-example")

Do not use setwd() in scripts. Setting the working directory is one of the few exceptions to the rule that you should always write all your code in scripts.

  • Someone else's working directory will be different from yours!
  • You want your code to be portable.
62 / 67

Read in data with readr

The tidyverse package readr provides lots of options to read data into R. Read in the college Covid data like this:

dat = read_csv("colleges.csv")

View this data to take a look at it. It automatically formatted everything as a nice and neat data frame!

str(dat)
## spc_tbl_ [1,948 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date : Date[1:1948], format: "2021-05-26" "2021-05-26" ...
## $ state : chr [1:1948] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ county : chr [1:1948] "Madison" "Montgomery" "Limestone" "Lee" ...
## $ city : chr [1:1948] "Huntsville" "Montgomery" "Athens" "Auburn" ...
## $ ipeds_id : chr [1:1948] "100654" "100724" "100812" "100858" ...
## $ college : chr [1:1948] "Alabama A&M University" "Alabama State University" "Athens State University" "Auburn University" ...
## $ cases : num [1:1948] 41 2 45 2742 220 ...
## $ cases_2021: num [1:1948] NA NA 10 567 80 NA 49 53 10 35 ...
## $ notes : chr [1:1948] NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. date = col_date(format = ""),
## .. state = col_character(),
## .. county = col_character(),
## .. city = col_character(),
## .. ipeds_id = col_character(),
## .. college = col_character(),
## .. cases = col_double(),
## .. cases_2021 = col_double(),
## .. notes = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
63 / 67

Challenge

Which state had the least total reported Covid-19 cases at colleges and universities?

Is it Montana?

64 / 67

Other tips for importing data

  • Many things can go wrong with CSV files. When this happens, read_csv has many options -- look at the help file.

  • You can read files directly from the internet without downloading them first (but you usually shouldn't!):

    url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/colleges/colleges.csv"
    dat = read_csv(url)
  • write_csv outputs a dataframe as a CSV to your computer.

  • readxl::read_excel reads directly from Excel spreadsheets.

  • googlesheets4::read_sheet reads directly from Google Sheets.

  • haven::read_dta reads directly from Stata files! (Stata can't read R files....)

65 / 67

Summary

66 / 67

Key tidyverse functions

dplyr

  1. filter
  2. arrange
  3. select
  4. mutate
  5. summarize

tidyr

  1. pivot_longer
  2. pivot_wider
  3. separate
  4. unite


Miscellaneous

  1. Pipes (|>)
  2. Grouping (group_by)
  3. Joining (left_join)

Importing data

  1. getwd & setwd
  2. read_csv
67 / 67

Agenda for the next 2-3 weeks

Where we've been:

  1. R Basics

  2. Programming

  3. Productivity tools


Where we're going:

  1. Data wrangling in the tidyverse

  2. Data cleaning

  3. Introduce the term project

  4. Data acquisition & webscraping

2 / 67
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow