class: center, middle, inverse, title-slide .title[ # Topic 4: Data Wrangling in the Tidyverse ] .author[ ### Nick Hagerty*
ECNS 460/560
Montana State University ] .date[ ### .smaller[
*Parts of these slides are adapted from
“Data Science for Economists”
by Grant R. McDermott, used under the
MIT License
. Images are from
ModernDive
by Chester Ismay and Albert Y. Kim, used under
CC BY-NC-SA 4.0
.] ] --- name: toc <style type="text/css"> # CSS for including pauses in printed PDF output (see bottom of lecture) @media print { .has-continuation { display: block !important; } } .small { font-size: 90%; } .smaller { font-size: 80%; } </style> # Agenda for the next 2-3 weeks **Where we've been:** 1. R Basics 1. Programming 1. Productivity tools <br> **Where we're going:** 1. Data wrangling in the tidyverse 1. Data cleaning 1. Introduce the term project 1. Data acquisition & webscraping --- # Table of contents 1. [Tidyverse basics](#basics) 1. [Data wrangling with dplyr](#dplyr) 1. [Joining data with dplyr](#joins) 1. [Data tidying with tidyr](#tidyr) 1. [Importing data with readr](#readr) 1. [Summary](#summary) --- class: inverse, middle name: basics # Tidyverse basics --- # The tidyverse **A whole "universe" of functions within R** - The most powerful, intuitive, and popular approach to data cleaning, wrangling, and visualization in R **Advantages:** - Consistent philosophy and syntax - "Verb" based approach makes it more familiar to users of Stata/SAS/SPSS - Serves as the front-end for many other big data and ML tools **Why did I make you learn base R to start with?** - The tidyverse can't do everything - The best solution to many problems involves a combination of tidyverse and base R - Ultimately you want to learn about many tools and choose the best one for the job .footnote[ Note: Another approach with many fans is [**data.table**](http://r-datatable.com/), which you can learn about on your own! ] --- # Tidyverse vs. base R There is often a direct correspondence between a tidyverse command and its base R equivalent. | tidyverse | base | |---|---| | `?readr::read_csv` | `?utils::read.csv` | | `?dplyr::if_else` | `?base::ifelse` | | `?tibble::tibble` | `?base::data.frame` | Notice these generally follow a `tidyverse::snake_case` vs `base::period.case` rule. The tidyverse alternative typically offers some enhancements or other useful options (and sometimes restrictions) over its base counterpart. - Remember: There are always multiple ways to achieve a single goal in R. --- # Tidying Data The two most important properties of tidy data are: 1. Each column is a unique variable. 2. Each row is a single observation.  Tidy data is easier to work with, because you have a consistent way of referring to variables and observations. It then becomes easy to manipulate, visualize, and model. .smaller[Image is from ["R for Data Science"](https://r4ds.had.co.nz/tidy-data.html) by Hadley Wickham & Garrett Grolemund, used under [CC BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/us/) and is not included under this project's overall CC license.] --- # Wide vs. Long Formats Both of these data sets display information on heart rate observed in individuals across 3 different time periods: ``` ## name time1 time2 time3 ## 1 Wilbur 67 56 70 ## 2 Petunia 80 90 67 ## 3 Gregory 64 50 101 ``` ``` ## name time heartrate ## 1 Wilbur 1 67 ## 2 Petunia 1 80 ## 3 Gregory 1 64 ## 4 Wilbur 2 56 ## 5 Petunia 2 90 ## 6 Gregory 2 50 ## 7 Wilbur 3 70 ## 8 Petunia 3 67 ## 9 Gregory 3 10 ``` Which dataframe is in *tidy* format? --- # Wide vs. Long Formats **Wide** data: - Row = patient. Columns = repeated observations over time. - Often easier to take in at a glance (as in a spreadsheet). **Long** data: - Row = one observation. Columns = ID variables + observed variable. - Usually easier to clean, merge with other data, and avoid errors. Tidy data is more likely to be **long**. - Most R packages have been written assuming your data is in long format. "Tidy datasets are all alike but every messy dataset is messy in its own way." – Hadley Wickham --- # Tidyverse packages We need to install and load a couple of packages. Run these preliminaries: ```r install.packages('tidyverse') install.packages('nycflights13') library(tidyverse) ``` ``` ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ── ## ✔ dplyr 1.1.4 ✔ readr 2.1.5 ## ✔ forcats 1.0.0 ✔ stringr 1.5.1 ## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1 ## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1 ## ✔ purrr 1.0.2 ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag() masks stats::lag() ## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors ``` -- We see that we have actually loaded a number of packages (which could also be loaded individually): **ggplot2**, **tibble**, **dplyr**, etc. - We can also see information about the package versions and some [namespace conflicts](https://raw.githack.com/uo-ec607/lectures/master/04-rlang/04-rlang.html#59). --- # Tidyverse packages The tidyverse actually comes with a lot more packages than those that are just loaded automatically. ```r tidyverse_packages() ``` ``` ## [1] "broom" "cli" "crayon" "dbplyr" ## [5] "dplyr" "dtplyr" "forcats" "googledrive" ## [9] "googlesheets4" "ggplot2" "haven" "hms" ## [13] "httr" "jsonlite" "lubridate" "magrittr" ## [17] "modelr" "pillar" "purrr" "readr" ## [21] "readxl" "reprex" "rlang" "rstudioapi" ## [25] "rvest" "stringr" "tibble" "tidyr" ## [29] "xml2" "tidyverse" ``` All of these are super useful - **lubridate** helps us work with dates - **rvest** is for webscraping This lecture will focus on two that are automatically loaded: **dplyr** and **tidyr**. --- # Pipes: |> or %>% Pipes take the **output** of one function and feed it into the **first argument** of the next (which you then skip). `dataframe |> filter(condition)` is equivalent to `filter(dataframe, condition)`. Note: `|>` on these slides is generated by the two characters `| >`, without the space. -- <br> **Older version** of the pipe: `%>%` * From the `magrittr` package loaded with the tidyverse * Works identically to `|>` in most situations. <br> **Keyboard shortcut:** Ctl/Cmd + Shift + M * Have to turn on a setting in RStudio options to make `|>` the default --- # Pipes: |> or %>% Pipes can dramatically improve the experience of reading and writing code. Compare: .small[ ```r ## These next two lines of code do exactly the same thing. mpg |> filter(manufacturer=="audi") |> group_by(model) |> summarize(hwy_mean = mean(hwy)) summarize(group_by(filter(mpg, manufacturer=="audi"), model), hwy_mean = mean(hwy)) ``` ] The first line reads from left to right, exactly how you think about the operations. The second line totally inverts this logical order (the final operation comes first!) --- # Pipes: |> or %>% Best practice is to put each function on its own line and indent. Look how much more readable this is: ```r mpg |> filter(manufacturer == "audi") |> group_by(model) |> summarize(hwy_mean = mean(hwy)) ``` ``` ## # A tibble: 3 × 2 ## model hwy_mean ## <chr> <dbl> ## 1 a4 28.3 ## 2 a4 quattro 25.8 ## 3 a6 quattro 24 ``` Vertical space costs nothing and makes for much more readable/writable code than cramming things horizontally. All together, this multi-line line of code is called a **pipeline**. --- class: inverse, middle name: dplyr # dplyr --- # Key dplyr verbs **There are five key dplyr verbs that you need to learn.** 1. `filter`: Filter (i.e. subset) rows based on their values. 2. `arrange`: Arrange (i.e. reorder) rows based on their values. 3. `select`: Select (i.e. subset) columns by their names: 4. `mutate`: Create new columns. 5. `summarize`: Collapse multiple rows into a single summary value.<sup>1</sup> .footnote[ <sup>1</sup> Many sources spell `summarise` with an "s" -- either one works! ] -- </br> Let's practice these functions together using the `starwars` data frame that comes pre-packaged with dplyr. --- # 1) dplyr::filter  --- # 1) dplyr::filter We can chain multiple filter commands with the pipe (`|>`), or just separate them within a single filter command using commas. ```r starwars |> filter( species == "Human", height >= 190 ) ``` ``` ## # A tibble: 4 × 14 ## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵ ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> ## 1 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi… ## 2 Qui-Gon Jinn 193 89 brown fair blue 92 male mascu… <NA> ## 3 Dooku 193 80 white fair brown 102 male mascu… Serenno ## 4 Bail Presto… 191 NA black tan brown 67 male mascu… Aldera… ## # … with 4 more variables: species <chr>, films <list>, vehicles <list>, ## # starships <list>, and abbreviated variable names ¹hair_color, ²skin_color, ## # ³eye_color, ⁴birth_year, ⁵homeworld ## # ℹ Use `colnames()` to see all variable names ``` --- # 1) dplyr::filter Regular expressions work well too. ```r starwars |> filter(str_detect(name, "Skywalker")) ``` ``` ## # A tibble: 3 × 14 ## name height mass hair_color skin_color eye_color birth_year sex gender ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 Luke Sky… 172 77 blond fair blue 19 male mascu… ## 2 Anakin S… 188 84 blond fair blue 41.9 male mascu… ## 3 Shmi Sky… 163 NA black fair brown 72 fema… femin… ## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>, ## # vehicles <list>, starships <list> ``` --- # 1) dplyr::filter A very common `filter` use case is identifying (or removing) missing data cases. ```r starwars |> filter(is.na(height)) ``` ``` ## # A tibble: 6 × 14 ## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵ ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> ## 1 Arvel Crynyd NA NA brown fair brown NA male mascu… <NA> ## 2 Finn NA NA black dark dark NA male mascu… <NA> ## 3 Rey NA NA brown light hazel NA fema… femin… <NA> ## 4 Poe Dameron NA NA brown light brown NA male mascu… <NA> ## 5 BB8 NA NA none none black NA none mascu… <NA> ## 6 Captain Pha… NA NA unknown unknown unknown NA <NA> <NA> <NA> ## # … with 4 more variables: species <chr>, films <list>, vehicles <list>, ## # starships <list>, and abbreviated variable names ¹hair_color, ²skin_color, ## # ³eye_color, ⁴birth_year, ⁵homeworld ## # ℹ Use `colnames()` to see all variable names ``` -- To remove missing observations, simply use negation: `filter(!is.na(height))`. Try this yourself. --- # 2) dplyr::arrange  --- # 2) dplyr::arrange `arrange` sorts your data frame by a particular column (numerically, or alphabetically) ```r starwars |> arrange(birth_year) ``` ``` ## # A tibble: 87 × 14 ## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵ ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> ## 1 Wicket Sys… 88 20 brown brown brown 8 male mascu… Endor ## 2 IG-88 200 140 none metal red 15 none mascu… <NA> ## 3 Luke Skywa… 172 77 blond fair blue 19 male mascu… Tatooi… ## 4 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera… ## 5 Wedge Anti… 170 77 brown fair hazel 21 male mascu… Corell… ## 6 Plo Koon 188 80 none orange black 22 male mascu… Dorin ## 7 Biggs Dark… 183 84 black light brown 24 male mascu… Tatooi… ## 8 Han Solo 180 80 brown fair brown 29 male mascu… Corell… ## 9 Lando Calr… 177 79 black dark brown 31 male mascu… Socorro ## 10 Boba Fett 183 78.2 black fair brown 31.5 male mascu… Kamino ## # … with 77 more rows, 4 more variables: species <chr>, films <list>, ## # vehicles <list>, starships <list>, and abbreviated variable names ## # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld ## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names ``` --- # 2) dplyr::arrange We can also arrange items in descending order using `arrange(desc())`. ```r starwars |> arrange(desc(birth_year)) ``` ``` ## # A tibble: 87 × 14 ## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵ ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> ## 1 Yoda 66 17 white green brown 896 male mascu… <NA> ## 2 Jabba Desi… 175 1358 <NA> green-… orange 600 herm… mascu… Nal Hu… ## 3 Chewbacca 228 112 brown unknown blue 200 male mascu… Kashyy… ## 4 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi… ## 5 Dooku 193 80 white fair brown 102 male mascu… Serenno ## 6 Qui-Gon Ji… 193 89 brown fair blue 92 male mascu… <NA> ## 7 Ki-Adi-Mun… 198 82 white pale yellow 92 male mascu… Cerea ## 8 Finis Valo… 170 NA blond fair blue 91 male mascu… Corusc… ## 9 Palpatine 170 75 grey pale yellow 82 male mascu… Naboo ## 10 Cliegg Lars 183 NA brown fair blue 82 male mascu… Tatooi… ## # … with 77 more rows, 4 more variables: species <chr>, films <list>, ## # vehicles <list>, starships <list>, and abbreviated variable names ## # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld ## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names ``` --- # 3) dplyr::select  --- # 3) dplyr::select Use commas to select multiple columns out of a data frame. (You can also use "first:last" for consecutive columns). Deselect a column with "-". ```r starwars |> select(name:skin_color, species, -height) ``` ``` ## # A tibble: 87 × 5 ## name mass hair_color skin_color species ## <chr> <dbl> <chr> <chr> <chr> ## 1 Luke Skywalker 77 blond fair Human ## 2 C-3PO 75 <NA> gold Droid ## 3 R2-D2 32 <NA> white, blue Droid ## 4 Darth Vader 136 none white Human ## 5 Leia Organa 49 brown light Human ## 6 Owen Lars 120 brown, grey light Human ## 7 Beru Whitesun lars 75 brown light Human ## 8 R5-D4 32 <NA> white, red Droid ## 9 Biggs Darklighter 84 black light Human ## 10 Obi-Wan Kenobi 77 auburn, white fair Human ## # … with 77 more rows ## # ℹ Use `print(n = ...)` to see more rows ``` --- # 3) dplyr::select You can also rename some (or all) of your selected variables in place. ```r starwars |> select(alias=name, planet=homeworld) ``` ``` ## # A tibble: 87 × 2 ## alias planet ## <chr> <chr> ## 1 Luke Skywalker Tatooine ## 2 C-3PO Tatooine ## 3 R2-D2 Naboo ## 4 Darth Vader Tatooine ## 5 Leia Organa Alderaan ## 6 Owen Lars Tatooine ## 7 Beru Whitesun lars Tatooine ## 8 R5-D4 Tatooine ## 9 Biggs Darklighter Tatooine ## 10 Obi-Wan Kenobi Stewjon ## # ℹ 77 more rows ``` -- If you just want to rename columns without subsetting them, you can use `rename`. Try this! --- # 3) dplyr::select The `select(contains(PATTERN))` option provides a nice shortcut in relevant cases. ```r starwars |> select(name, contains("color")) ``` ``` ## # A tibble: 87 × 4 ## name hair_color skin_color eye_color ## <chr> <chr> <chr> <chr> ## 1 Luke Skywalker blond fair blue ## 2 C-3PO <NA> gold yellow ## 3 R2-D2 <NA> white, blue red ## 4 Darth Vader none white yellow ## 5 Leia Organa brown light brown ## 6 Owen Lars brown, grey light blue ## 7 Beru Whitesun lars brown light blue ## 8 R5-D4 <NA> white, red red ## 9 Biggs Darklighter black light brown ## 10 Obi-Wan Kenobi auburn, white fair blue-gray ## # … with 77 more rows ## # ℹ Use `print(n = ...)` to see more rows ``` Some other selection helpers: `starts_with()`, `ends_with()`, `all_of(c("name1", "name2"))`, `matches()`. --- # 4) dplyr::mutate  --- # 4) dplyr::mutate You can create new columns from scratch, or (more commonly) as transformations of existing columns. ```r starwars |> select(name, birth_year) |> mutate(dog_years = birth_year * 7) |> mutate(comment = paste0(name, " is ", dog_years, " in dog years.")) ``` ``` ## # A tibble: 87 × 4 ## name birth_year dog_years comment ## <chr> <dbl> <dbl> <chr> ## 1 Luke Skywalker 19 133 Luke Skywalker is 133 in dog years. ## 2 C-3PO 112 784 C-3PO is 784 in dog years. ## 3 R2-D2 33 231 R2-D2 is 231 in dog years. ## 4 Darth Vader 41.9 293. Darth Vader is 293.3 in dog years. ## 5 Leia Organa 19 133 Leia Organa is 133 in dog years. ## 6 Owen Lars 52 364 Owen Lars is 364 in dog years. ## 7 Beru Whitesun lars 47 329 Beru Whitesun lars is 329 in dog yea… ## 8 R5-D4 NA NA R5-D4 is NA in dog years. ## 9 Biggs Darklighter 24 168 Biggs Darklighter is 168 in dog year… ## 10 Obi-Wan Kenobi 57 399 Obi-Wan Kenobi is 399 in dog years. ## # … with 77 more rows ## # ℹ Use `print(n = ...)` to see more rows ``` --- # 4) dplyr::mutate *Note:* `mutate` is order aware. So you can chain multiple mutates in a single call. ```r starwars |> select(name, birth_year) |> mutate( dog_years = birth_year * 7, # Separate with a comma comment = paste0(name, " is ", dog_years, " in dog years.") ) ``` ``` ## # A tibble: 87 × 4 ## name birth_year dog_years comment ## <chr> <dbl> <dbl> <chr> ## 1 Luke Skywalker 19 133 Luke Skywalker is 133 in dog years. ## 2 C-3PO 112 784 C-3PO is 784 in dog years. ## 3 R2-D2 33 231 R2-D2 is 231 in dog years. ## 4 Darth Vader 41.9 293. Darth Vader is 293.3 in dog years. ## 5 Leia Organa 19 133 Leia Organa is 133 in dog years. ## 6 Owen Lars 52 364 Owen Lars is 364 in dog years. ## 7 Beru Whitesun lars 47 329 Beru Whitesun lars is 329 in dog yea… ## 8 R5-D4 NA NA R5-D4 is NA in dog years. ## 9 Biggs Darklighter 24 168 Biggs Darklighter is 168 in dog year… ## 10 Obi-Wan Kenobi 57 399 Obi-Wan Kenobi is 399 in dog years. ## # … with 77 more rows ## # ℹ Use `print(n = ...)` to see more rows ``` --- # 4) dplyr::mutate Boolean, logical and conditional operators all work well with `mutate` too. ```r starwars |> select(name, height) |> filter(name %in% c("Luke Skywalker", "Anakin Skywalker")) |> mutate(tall1 = height > 180) |> mutate(tall2 = if_else(height > 180, "Tall", "Short")) ``` ``` ## # A tibble: 2 × 4 ## name height tall1 tall2 ## <chr> <int> <lgl> <chr> ## 1 Luke Skywalker 172 FALSE Short ## 2 Anakin Skywalker 188 TRUE Tall ``` --- # 4) dplyr::mutate Lastly, combining `mutate` with `across` allows you to easily perform the same operation on a subset of variables. ```r starwars |> select(name:eye_color) |> mutate(across(where(is.character), toupper)) ``` ``` ## # A tibble: 87 × 6 ## name height mass hair_color skin_color eye_color ## <chr> <int> <dbl> <chr> <chr> <chr> ## 1 LUKE SKYWALKER 172 77 BLOND FAIR BLUE ## 2 C-3PO 167 75 <NA> GOLD YELLOW ## 3 R2-D2 96 32 <NA> WHITE, BLUE RED ## 4 DARTH VADER 202 136 NONE WHITE YELLOW ## 5 LEIA ORGANA 150 49 BROWN LIGHT BROWN ## 6 OWEN LARS 178 120 BROWN, GREY LIGHT BLUE ## 7 BERU WHITESUN LARS 165 75 BROWN LIGHT BLUE ## 8 R5-D4 97 32 <NA> WHITE, RED RED ## 9 BIGGS DARKLIGHTER 183 84 BLACK LIGHT BROWN ## 10 OBI-WAN KENOBI 182 77 AUBURN, WHITE FAIR BLUE-GRAY ## # … with 77 more rows ## # ℹ Use `print(n = ...)` to see more rows ``` --- # 5) dplyr::summarize   --- # 5) dplyr::summarize Particularly useful in combination with the `group_by` command. ```r starwars |> group_by(species) |> summarize(mean_height = mean(height)) ``` ``` ## # A tibble: 38 × 2 ## species mean_height ## <chr> <dbl> ## 1 Aleena 79 ## 2 Besalisk 198 ## 3 Cerean 198 ## 4 Chagrian 196 ## 5 Clawdite 168 ## 6 Droid NA ## 7 Dug 112 ## 8 Ewok 88 ## 9 Geonosian 183 ## 10 Gungan 209. ## # … with 28 more rows ## # ℹ Use `print(n = ...)` to see more rows ``` --- # 5) dplyr::summarize Notice that some of these summarized values are missing. If we want to ignore missing values, use `na.rm = T`: ```r ## Much better starwars |> group_by(species) |> summarize(mean_height = mean(height, na.rm = T)) ``` ``` ## # A tibble: 38 × 2 ## species mean_height ## <chr> <dbl> ## 1 Aleena 79 ## 2 Besalisk 198 ## 3 Cerean 198 ## 4 Chagrian 196 ## 5 Clawdite 168 ## 6 Droid 131. ## 7 Dug 112 ## 8 Ewok 88 ## 9 Geonosian 183 ## 10 Gungan 209. ## # … with 28 more rows ## # ℹ Use `print(n = ...)` to see more rows ``` --- # 5) dplyr::summarize The same `across`-based workflow that we saw with `mutate` a few slides back also works with `summarize`. ```r starwars |> group_by(species) |> summarize(across(where(is.numeric), mean)) ``` ``` ## # A tibble: 38 × 4 ## species height mass birth_year ## <chr> <dbl> <dbl> <dbl> ## 1 Aleena 79 15 NA ## 2 Besalisk 198 102 NA ## 3 Cerean 198 82 92 ## 4 Chagrian 196 NA NA ## 5 Clawdite 168 55 NA ## 6 Droid NA NA NA ## 7 Dug 112 40 NA ## 8 Ewok 88 20 8 ## 9 Geonosian 183 80 NA ## 10 Gungan 209. NA NA ## # ℹ 28 more rows ``` --- # 5) dplyr::summarize The same `across`-based workflow that we saw with `mutate` a few slides back also works with `summarize`. Though to add arguments, we have to use an **anonymous function:** ```r starwars |> group_by(species) |> summarize(across(where(is.numeric), ~ mean(.x, na.rm=T))) ``` ``` ## # A tibble: 38 × 4 ## species height mass birth_year ## <chr> <dbl> <dbl> <dbl> ## 1 Aleena 79 15 NaN ## 2 Besalisk 198 102 NaN ## 3 Cerean 198 82 92 ## 4 Chagrian 196 NaN NaN ## 5 Clawdite 168 55 NaN ## 6 Droid 131. 69.8 53.3 ## 7 Dug 112 40 NaN ## 8 Ewok 88 20 8 ## 9 Geonosian 183 80 NaN ## 10 Gungan 209. 74 52 ## # ℹ 28 more rows ``` --- # Other dplyr goodies `ungroup`: For ungrouping after using `group_by`. - Use after doing your grouped `summarize` or `mutate` operation, or everything else you do will be super slow. -- `slice`: Subset rows by position rather than filtering by values. - E.g. `starwars |> slice(1:10)` -- `pull`: Extract a column from as a data frame as a vector or scalar. - E.g. `starwars |> filter(sex=="female") |> pull(height)` -- `distinct` and `count`: List unique values, with or without their number of appearances. - E.g. `starwars |> distinct(species)`, or `starwars |> count(species)` - `count` is equivalent to `group_by` and `summarize` with `n()`: ```r starwars |> group_by(species) |> summarize(n = n()) ``` --- # Challenge Take a few minutes to try this on your own: **List the most common eye colors among female Star Wars characters in descending order of frequency.** --- # Storing results in memory So far we haven't been saving the dataframes that result from our code in memory. Usually, we will want to use them for the next task. **Solution:** Create a new object each time you write a pipeline. ```r women = starwars |> filter(sex == "female") brown_eyed_women = women |> filter(eye_color == "brown") ``` Resist the temptation to use the same object name. This is called **clobbering** and it ruins your ability to easily look back at a previous version of your object. ```r # DON'T do this starwars = starwars |> filter(sex == "female") ``` -- By keeping multiple copies of very similar dataframes, will you waste your computer's memory? Usually, no -- R is smart and stores only the changes between objects. --- class: inverse, middle name: joins # Joins --- # Joining operations Most data analysis requires combining information from multiple data tables. To combine tables, we use **joins** (similar to `base::merge()`).  -- There are multiple types of joins. An **inner join** returns a dataframe of all rows that appear in both x and y. - It keeps *only* rows that appear in both tables. .smaller[Image is from ["R for Data Science"](https://r4ds.had.co.nz/relational-data.html) by Hadley Wickham & Garrett Grolemund, used under [CC BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/us/) and not included under this resource's overall CC license.] --- # Joining operations .pull-left[ A **left join** keeps all rows in x. - For matched rows, it merges in the values of y - For unmatched rows, it fills in NA values for the variables coming from y. A **right join** keeps all rows in y. - Same as a left join, just reversed. A **full join** keeps all rows in x and y. </br> .smaller[Image is from ["R for Data Science"](https://r4ds.had.co.nz/relational-data.html) by Wickham & Grolemund, used under [CC BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/us/) and not included under this resource's overall CC license.] ] .pull-right[  ] --- # Joining operations **Left joins are safer than inner joins** - It's too easy to lose observations with inner joins - Left joins preserve all the original observations **Use the left join** unless you have a strong reason to prefer another. -- </br> For the examples that I'm going to show here, we'll need some data sets that come bundled with the [**nycflights13**](http://github.com/hadley/nycflights13) package. - Load it now and then inspect these data frames on your own (click on them in your Environment pane). - Pay attention to what the `year` column means in each data frame. ```r library(nycflights13) data(flights) data(planes) ``` --- # Joining operations Let's perform a left join on the flights and planes datasets. ```r flights |> left_join(planes) ``` ``` ## Joining, by = c("year", "tailnum") ``` ``` ## # A tibble: 336,776 × 10 ## year month day dep_time arr_time carrier flight tailnum type model ## <int> <int> <int> <int> <int> <chr> <int> <chr> <chr> <chr> ## 1 2013 1 1 517 830 UA 1545 N14228 <NA> <NA> ## 2 2013 1 1 533 850 UA 1714 N24211 <NA> <NA> ## 3 2013 1 1 542 923 AA 1141 N619AA <NA> <NA> ## 4 2013 1 1 544 1004 B6 725 N804JB <NA> <NA> ## 5 2013 1 1 554 812 DL 461 N668DN <NA> <NA> ## 6 2013 1 1 554 740 UA 1696 N39463 <NA> <NA> ## 7 2013 1 1 555 913 B6 507 N516JB <NA> <NA> ## 8 2013 1 1 557 709 EV 5708 N829AS <NA> <NA> ## 9 2013 1 1 557 838 B6 79 N593JB <NA> <NA> ## 10 2013 1 1 558 753 AA 301 N3ALAA <NA> <NA> ## # … with 336,766 more rows ## # ℹ Use `print(n = ...)` to see more rows ``` --- # Joining operations (*continued from previous slide*) Note that dplyr made a reasonable guess about which columns to join on (i.e. columns that share the same name). It also told us its choices: ``` *## Joining with `by = join_by(year, tailnum)` ``` **What's the problem here?** -- The variable "year" does not have a consistent meaning across our joining datasets! - In one it refers to the *year of flight*, in the other it refers to *year of construction*. -- Luckily, there's an easy way to avoid this problem. - Can you figure it out? Try `?dplyr::join`. --- # Joining operations (*continued from previous slide*) You just need to be more explicit in your join call by using the `by = ` argument. - Here, it would be good to rename any ambiguous columns to avoid confusion. ```r flights |> left_join( planes |> rename(year_built = year), by = "tailnum" ) ``` ``` ## # A tibble: 3 × 11 ## year month day dep_time arr_time carrier flight tailnum year_…¹ type model ## <int> <int> <int> <int> <int> <chr> <int> <chr> <int> <chr> <chr> ## 1 2013 1 1 517 830 UA 1545 N14228 1999 Fixe… 737-… ## 2 2013 1 1 533 850 UA 1714 N24211 1998 Fixe… 737-… ## 3 2013 1 1 542 923 AA 1141 N619AA 1990 Fixe… 757-… ## # … with abbreviated variable name ¹year_built ``` --- class: inverse, middle name: tidyr # tidyr --- # Key tidyr verbs 1. `pivot_longer`: Pivot wide data into long format. 2. `pivot_wider`: Pivot long data into wide format. 3. `separate`: Separate (i.e. split) one column into multiple columns. 4. `unite`: Unite (i.e. combine) multiple columns into one. -- </br> Which of `pivot_longer` vs `pivot_wider` produces "tidy" data? --- # 1) tidyr::pivot_longer ```r stocks = data.frame( ## Could use "tibble" instead of "data.frame" if you prefer time = as.Date('2009-01-01') + 0:1, X = rnorm(2, 10, 1), Y = rnorm(2, 10, 2), Z = rnorm(2, 10, 5) ) stocks ``` ``` ## time X Y Z ## 1 2009-01-01 9.353794 10.892745 9.648994 ## 2 2009-01-02 9.657906 7.384239 19.472134 ``` ```r tidy_stocks = stocks |> pivot_longer(cols=X:Z, names_to="stock", values_to="price") tidy_stocks ``` ``` ## # A tibble: 6 × 3 ## time stock price ## <date> <chr> <dbl> ## 1 2009-01-01 X 9.35 ## 2 2009-01-01 Y 10.9 ## 3 2009-01-01 Z 9.65 ## 4 2009-01-02 X 9.66 ## 5 2009-01-02 Y 7.38 ## 6 2009-01-02 Z 19.5 ``` --- # 2) tidyr::pivot_wider Now we can use pivot_wider to go back to the original dataframe: ```r tidy_stocks |> pivot_wider(names_from=stock, values_from=price) ``` ``` ## # A tibble: 2 × 4 ## time X Y Z ## <date> <dbl> <dbl> <dbl> ## 1 2009-01-01 9.35 10.9 9.65 ## 2 2009-01-02 9.66 7.38 19.5 ``` Or, we can put it into a new ("transposed") format, in which the observations are stocks and the columns are dates: ```r tidy_stocks |> pivot_wider(names_from=time, values_from=price) ``` ``` ## # A tibble: 3 × 3 ## stock `2009-01-01` `2009-01-02` ## <chr> <dbl> <dbl> ## 1 X 9.35 9.66 ## 2 Y 10.9 7.38 ## 3 Z 9.65 19.5 ``` --- # 3) tidyr::separate `separate` helps when you have more than one value in a single column: ```r economists = data.frame(name = c("Adam_Smith", "Paul_Samuelson", "Milton_Friedman")) economists ``` ``` ## name ## 1 Adam_Smith ## 2 Paul_Samuelson ## 3 Milton_Friedman ``` ```r economists |> separate(name, c("first_name", "last_name")) ``` ``` ## first_name last_name ## 1 Adam Smith ## 2 Paul Samuelson ## 3 Milton Friedman ``` -- This command is pretty smart. But to avoid ambiguity, you can also specify the separation character with the `sep` argument: ```r economists |> separate(name, c("first_name", "last_name"), sep = "_") ``` --- # 3) tidyr::separate Related is `separate_rows`, for splitting cells with multiple values into multiple rows: ```r jobs = data.frame( name = c("Joe", "Jill"), occupation = c("President", "First Lady, Professor, Grandmother") ) jobs ``` ``` ## name occupation ## 1 Joe President ## 2 Jill First Lady, Professor, Grandmother ``` ```r # Now split out Jill's various occupations into different rows jobs |> separate_rows(occupation) ``` ``` ## # A tibble: 5 × 2 ## name occupation ## <chr> <chr> ## 1 Joe President ## 2 Jill First ## 3 Jill Lady ## 4 Jill Professor ## 5 Jill Grandmother ``` --- # 3) tidyr::separate Related is `separate_rows`, for splitting cells with multiple values into multiple rows: ```r jobs = data.frame( name = c("Joe", "Jill"), occupation = c("President", "First Lady, Professor, Grandmother") ) jobs ``` ``` ## name occupation ## 1 Joe President ## 2 Jill First Lady, Professor, Grandmother ``` ```r # Now split out Jill's various occupations into different rows jobs |> separate_rows(occupation, sep = ", ") ``` ``` ## # A tibble: 4 × 2 ## name occupation ## <chr> <chr> ## 1 Joe President ## 2 Jill First Lady ## 3 Jill Professor ## 4 Jill Grandmother ``` --- # 4) tidyr::unite ```r gdp = data.frame( yr = rep(2016, times = 4), mnth = rep(1, times = 4), dy = 1:4, gdp = rnorm(4, mean = 100, sd = 2) ) gdp ``` ``` ## yr mnth dy gdp ## 1 2016 1 1 98.44216 ## 2 2016 1 2 98.98793 ## 3 2016 1 3 99.91618 ## 4 2016 1 4 100.97192 ``` ```r ## Combine "yr", "mnth", and "dy" into one "date" column gdp |> unite(date, c("yr", "mnth", "dy"), sep = "-") ``` ``` ## date gdp ## 1 2016-1-1 98.44216 ## 2 2016-1-2 98.98793 ## 3 2016-1-3 99.91618 ## 4 2016-1-4 100.97192 ``` --- # 4) tidyr::unite Note that `unite` will automatically create a character variable. If you want to convert it to something else (e.g. date or numeric) then you will need to modify it using `mutate`. This example uses the [lubridate](https://lubridate.tidyverse.org/) package's super helpful date conversion functions. ```r library(lubridate) gdp_u |> mutate(date = ymd(date)) ``` ``` ## # A tibble: 4 × 2 ## date gdp ## <date> <dbl> ## 1 2016-01-01 98.4 ## 2 2016-01-02 99.0 ## 3 2016-01-03 99.9 ## 4 2016-01-04 101. ``` --- # Challenge **Using `nycflights13`, create a table of average arrival delay (in minutes) by day (in rows) and carrier (in columns).** --- # Challenge **Using `nycflights13`, create a table of average arrival delay (in minutes) by day (in rows) and carrier (in columns).** Hint: Recall that you can tabulate summary statistics using `group_by` and `summarize`: ```r flights |> group_by(carrier) |> summarize(avg_late = mean(arr_delay, na.rm=T)) ``` ``` ## # A tibble: 16 × 2 ## carrier avg_late ## <chr> <dbl> ## 1 9E 7.38 ## 2 AA 0.364 ## 3 AS -9.93 ## 4 B6 9.46 ## 5 DL 1.64 ## 6 EV 15.8 ## 7 F9 21.9 ## 8 FL 20.1 ## 9 HA -6.92 ## 10 MQ 10.8 ## 11 OO 11.9 ## 12 UA 3.56 ## 13 US 2.13 ## 14 VX 1.76 ## 15 WN 9.65 ## 16 YV 15.6 ``` --- class: inverse, middle name: readr # Importing data with readr --- # Download this data Most data does not come nicely in R packages! You will download it from somewhere and load it in R. **Go here:** <https://github.com/nytimes/covid-19-data/tree/master/colleges> and click on the **Raw CSV** link. * This is a list of Covid-19 case counts reported at U.S. colleges and universities between July 2020 and May 2021. In your browser, **save** this file somewhere sensible on your computer (perhaps a folder you already created to put your work for this class). Your computer stores all its programs, files, and information in a **filesystem** consisting of a set of nested **folders** or **directories.** --- # Paths and directories The **working directory** is your "current location" in the filesystem. What's your current working directory? ```r getwd() ``` ``` ## [1] "C:/git/covid-example" ``` * This is an example of a full or **absolute path.** * It starts with `C:/` on Windows or `/` on Mac. **Relative paths** are defined relative to the full path of the working directory. * Let's say your working directory is `"C:/git/ecns-560/"` * If you had a file saved at `"C:/git/ecns-560/assignment-3/assignment-3.Rmd"` * Its relative path would be `"assignment-3/assignment-3.Rmd"` It's always best to use **relative paths,** though you *can* also use an absolute path in any given situation. --- # Setting the working directory You can change your working directory with `setwd(dir)`. Now: 1. **Find the location you saved your CSV file and copy the filepath.** 2. **Use the console in R to set your working directory to that location.** For example: ```r setwd("C:/git/covid-example") ``` **Do not use `setwd()` in scripts.** Setting the working directory is one of the *few exceptions* to the rule that you should always write all your code in scripts. * Someone else's working directory will be different from yours! * You want your code to be portable. --- # Read in data with readr The tidyverse package `readr` provides lots of options to read data into R. Read in the college Covid data like this: ```r dat = read_csv("colleges.csv") ``` `View` this data to take a look at it. It automatically formatted everything as a nice and neat data frame! ```r str(dat) ``` ``` ## spc_tbl_ [1,948 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame) ## $ date : Date[1:1948], format: "2021-05-26" "2021-05-26" ... ## $ state : chr [1:1948] "Alabama" "Alabama" "Alabama" "Alabama" ... ## $ county : chr [1:1948] "Madison" "Montgomery" "Limestone" "Lee" ... ## $ city : chr [1:1948] "Huntsville" "Montgomery" "Athens" "Auburn" ... ## $ ipeds_id : chr [1:1948] "100654" "100724" "100812" "100858" ... ## $ college : chr [1:1948] "Alabama A&M University" "Alabama State University" "Athens State University" "Auburn University" ... ## $ cases : num [1:1948] 41 2 45 2742 220 ... ## $ cases_2021: num [1:1948] NA NA 10 567 80 NA 49 53 10 35 ... ## $ notes : chr [1:1948] NA NA NA NA ... ## - attr(*, "spec")= ## .. cols( ## .. date = col_date(format = ""), ## .. state = col_character(), ## .. county = col_character(), ## .. city = col_character(), ## .. ipeds_id = col_character(), ## .. college = col_character(), ## .. cases = col_double(), ## .. cases_2021 = col_double(), ## .. notes = col_character() ## .. ) ## - attr(*, "problems")=<externalptr> ``` --- # Challenge **Which state had the least total reported Covid-19 cases at colleges and universities?** Is it Montana? --- # Other tips for importing data * Many things can go wrong with CSV files. When this happens, `read_csv` has many options -- look at the help file. * You can read files directly from the internet without downloading them first (but you usually shouldn't!): .smaller[ ```r url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/colleges/colleges.csv" dat = read_csv(url) ``` ] * `write_csv` outputs a dataframe as a CSV to your computer. * `readxl::read_excel` reads directly from Excel spreadsheets. * `googlesheets4::read_sheet` reads directly from Google Sheets. * `haven::read_dta` reads directly from Stata files! (Stata can't read R files....) --- class: inverse, middle name: summary # Summary --- # Key tidyverse functions .pull-left[ ### dplyr 1. `filter` 2. `arrange` 3. `select` 4. `mutate` 5. `summarize` ] .pull-right[ ### tidyr 1. `pivot_longer` 2. `pivot_wider` 3. `separate` 4. `unite` ] </br> .pull-left[ ### Miscellaneous 1. Pipes (`|>`) 2. Grouping (`group_by`) 3. Joining (`left_join`) ] .pull-right[ ### Importing data 1. `getwd` & `setwd` 1. `read_csv` ]