class: center, middle, inverse, title-slide .title[ # Guest Lecture ] .subtitle[ ## Data cleaning & wrangling: data.table ] .author[ ### Grant McDermott ] .date[ ### University of Oregon |
EC 607
] --- class: middle Today we look at [Grant McDermott's](https://grantmcdermott.com) great [slides](https://github.com/uo-ec607) on `data.table`. I'll make some minor modifications, but it's largely his work. I am definitely more a `data.table` person than Grant, so it's good to see his balanced view here. In particular this is well taken: >Don't be a fanatic. Insisting on ecosystem purity is rarely worth it. Thanks Grant! 🙏 --- name: toc <style type="text/css"> .large4 { font-size: 400% } .large2 { font-size: 200% } .small90 { font-size: 90% } .small75 { font-size: 75% } </style> # Table of contents 1. [Prologue](#prologue) 2. [Introduction](#intro) 3. [data.table basics](#basics) 4. [Working with rows: DT[i, ]](#i) 5. [Manipulating columns: DT[, j]](#j) 6. [Grouping: DT[, , by]](#by) 7. [Keys](#keys) 8. [Merging datasets](#merge) 9. [Reshaping data](#reshape) 10. [data.table + tidyverse workflows](#tidyverse) 11. [Summary](#summary) --- class: inverse, center, middle name: prologue # Prologue <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Checklist We'll be using the following packages in today's lecture: - Already installed: **dplyr**, **ggplot2**, **nycflights13** - New: **data.table**, **tidyfast**, **dtplyr**, **microbenchmark** -- The following code chunk will install (if necessary) and load everything for you. ```r if (!require(pacman)) install.packages('pacman', repos = 'https://cran.rstudio.com') pacman::p_load(dplyr, data.table, dtplyr, tidyfast, microbenchmark, ggplot2, nycflights13) options(dplyr.summarise.inform = FALSE) ## Turn off annoying dplyr group_by messages (optional) ``` --- class: inverse, center, middle name: intro # Introduction <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Why learn data.table? The **tidyverse** is great.
As I keep hinting, it will also provide a bridge to many of the big data tools that we'll encounter later in the course (SQL databases, etc.) -- So why bother learning another data wrangling package/syntax? -- When it comes to **data.table**, I can think of at least five reasons: 1. Concise 2. Insanely fast 3. Memory efficient 4. Feature rich (and stable) 5. Dependency free -- Before we get into specifics, here are a few examples to whet your appetite... --- # Why learn data.table? (cont.) ### 1) Concise These two code chunks do the same thing: ```r # library(dplyr) ## Already loaded # data(starwars, package = "dplyr") ## Optional to bring the DF into the global env starwars %>% filter(species=="Human") %>% group_by(homeworld) %>% summarise(mean_height=mean(height)) ``` vs ```r # library(data.table) ## Already loaded starwars_dt = as.data.table(starwars) starwars_dt[species=="Human", mean(height), by=homeworld] ``` --- name:fast # Why learn data.table? (cont.) ### 2) Insanely fast .small90[ ```r collapse_dplyr = function() { storms %>% group_by(name, year, month, day) %>% summarize(wind = mean(wind), pressure = mean(pressure), category = dplyr::first(category)) } storms_dt = as.data.table(storms) collapse_dt = function() { storms_dt[, .(wind = mean(wind), pressure = mean(pressure), category = first(category)), by = .(name, year, month, day)] } microbenchmark(collapse_dplyr(), collapse_dt(), times = 10) ``` ``` ## Warning in microbenchmark(collapse_dplyr(), collapse_dt(), times = 10): less ## accurate nanosecond times to avoid potential integer overflows ``` ``` ## Unit: milliseconds ## expr min lq mean median uq max ## collapse_dplyr() 95.429058 98.75891 100.249998 99.647384 101.337281 107.746729 ## collapse_dt() 3.880773 3.97618 4.717259 4.021444 4.111193 9.047347 ## neval ## 10 ## 10 ``` ] -- .small90[ **Result:** data.table is roughly 25x faster on this run (compare the medians above)! (Thanks to [Keith Head](https://twitter.com/ckhead) for this example.) ] --- # Why learn data.table? (cont.)
### 3) Memory efficient Measuring and comparing memory use [gets complicated](https://stackoverflow.com/a/61376971). But see [here](https://jangorecki.gitlab.io/r-talks/2019-06-18_Poznan_why-data.table/why-data.table.pdf) (esp. from slide 12) for a thorough walkthrough of data.table's memory use and efficiency. ### 4) Features and 5) No dependencies I'll lump these together, since they really have to do with the stability of your code over time. Just to emphasise the point about [dependencies](http://www.tinyverse.org/), though: ```r tools::package_dependencies("data.table", recursive = TRUE)[[1]] ``` ``` ## [1] "methods" ``` ```r tools::package_dependencies("dplyr", recursive = TRUE)[[1]] ``` ``` ## [1] "ellipsis" "assertthat" "glue" "magrittr" "methods" "pkgconfig" ## [7] "R6" "Rcpp" "rlang" "tibble" "tidyselect" "utils" ## [13] "BH" "plogr" "tools" "cli" "crayon" "fansi" ## [19] "lifecycle" "pillar" "vctrs" "purrr" "grDevices" "utf8" ## [25] "digest" ``` --- # Before we continue... The purpose of this lecture is *not* to convince you that data.table is superior to the tidyverse. (Or vice versa.) For sure, people have strong opinions on the matter and you may find yourself pulling strongly in one direction or the other. And that's okay, but... My goal is simply to show you another powerful tool that you can use to tackle big (or small!) data problems efficiently in R. -- FWIW, I'm a huge fan of both the tidyverse and data.table, and use them about equally in my own work. - Knowing how to use both of them and how they complement each other has, I believe, made me a much more effective R user/empirical economist/data scientist/etc. -- We'll get back to the point about complementarity at the end of the lecture. 
--- class: inverse, center, middle name: basics # data.table basics <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # The data.table object We've already seen that the tidyverse provides its own enhanced version of a data.frame in the form of tibbles. -- The same is true for data.table. In fact, data.table functions only work on objects that have been converted to data.tables first. - Beyond simple visual enhancements (similar to tibbles), the specialised internal structure of data.table objects is a key reason why the package is so fast. (More [here](https://rdatatable.gitlab.io/data.table/articles/datatable-intro.html#what-is-datatable-1a) and [here](https://twitter.com/matloff/status/1131372631372918784).) -- To create a data.table, we have a couple of options: - `fread('mydata.csv')` reads a CSV into R as a data.table (and is very fast).<sup>1</sup> - `data.table(x = 1:10)` creates a new data.table from scratch - `as.data.table(df)` coerces an existing data frame (here: `df`) to a data.table. - `setDT(df)` coerces an existing data frame to a data.table *by reference*; i.e. we don't have to (re)assign it. .footnote[<sup>1</sup> We'll cover `fread()` in more depth in the next lecture on data I/O.] --- # data.table Basics * The [introduction to `data.table` vignette](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html) is a very good place to start. -- * So, a `data.table` is an *enhanced data.frame*: ```r DT = data.table(a = 1:10) class(DT) ``` ``` ## [1] "data.table" "data.frame" ``` *enhanced* means that you can do *more* with it than just store and access data (as with a `data.frame`). And much more efficiently as well. -- * A `data.table` is also a `list`: ```r is.list(DT) ``` ``` ## [1] TRUE ``` -- * Great. How do I use it?!
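
---

# The data.table object (cont.)

To make those four creation options concrete, here is a minimal sketch. (The toy data frame and the inline CSV string are illustrative only; `fread()` happily accepts literal text containing newlines as well as a file path.)

```r
library(data.table)

dt1 = fread("x,y\n1,a\n2,b")  ## fread() on inline CSV text (same idea as a file)
dt2 = data.table(x = 1:10)    ## a new data.table from scratch
df  = data.frame(x = 1:10)
dt3 = as.data.table(df)       ## coerces a *copy*; df itself is unchanged
setDT(df)                     ## coerces df by reference; no (re)assignment needed
class(df)
```

Note the difference between the last two: after `setDT(df)`, `df` *is* a data.table, without our ever having written `df = ...`.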
--- # data.table syntax All data.tables accept the same basic syntax: .center[ .large2[DT[<span style="color: #66C2A5;">i</span>, <span style="color: #FC8D62;">j</span>, <span style="color: #8DA0CB;">by</span>]] ] ![:col_row <span style='color: #66C2A5;'>On which rows?</span>, <span style='color: #FC8D62;'>What to do?</span>, <span style='color: #8DA0CB;'>Grouped by what?</span>] -- .center[dplyr "equivalents":] ![:col_list <span style='color: #66C2A5;'>filter(); slice(); arrange()</span>, <span style='color: #FC8D62;'>select(); mutate()</span>, <span style='color: #8DA0CB;'>group_by()</span>] -- While the tidyverse tends to break up operations step-by-step, data.table aims to do everything in one concise expression. - We can execute complex data wrangling commands as a single, fluid thought. - Although, as we'll see in a bit, you can certainly chain (pipe) multiple operations together too. --- # A Quick Example We'll dive into the details (and quirks) of data.table shortly. But first, a quick side-by-side comparison with dplyr, since that will help to orientate us for the rest of the lecture. Using our starwars dataset, say we want to know: > What is the average height of the human characters by gender? -- .pull-left[ ### dplyr ```r data(starwars, package = "dplyr") starwars %>% filter(species=="Human") %>% group_by(gender) %>% summarise(mean(height, na.rm=T)) ``` ] .pull-right[ ### data.table ```r starwars_dt = as.data.table(starwars) starwars_dt[ species=="Human", mean(height, na.rm=T), by = gender] ``` ] --- # A Quick Example We'll dive into the details (and quirks) of data.table shortly. But first, a quick side-by-side comparison with dplyr, since that will help to orientate us for the rest of the lecture. Using our starwars dataset, say we want to know: > What is the average height of the human characters by gender? 
.pull-left[ ### dplyr ```r data(starwars, package = "dplyr") starwars %>% * filter(species=="Human") %>% group_by(gender) %>% summarise(mean(height, na.rm=T)) ``` ] .pull-right[ ### data.table ```r starwars_dt = as.data.table(starwars) starwars_dt[ * species=="Human", ## i mean(height, na.rm=T), by = gender] ``` ] --- # A Quick Example We'll dive into the details (and quirks) of data.table shortly. But first, a quick side-by-side comparison with dplyr, since that will help to orientate us for the rest of the lecture. Using our starwars dataset, say we want to know: > What is the average height of the human characters by gender? .pull-left[ ### dplyr ```r data(starwars, package = "dplyr") starwars %>% filter(species=="Human") %>% group_by(gender) %>% * summarise(mean(height, na.rm=T)) ``` ] .pull-right[ ### data.table ```r starwars_dt = as.data.table(starwars) starwars_dt[ species=="Human", * mean(height, na.rm=T), ## j by = gender] ``` ] --- # A Quick Example We'll dive into the details (and quirks) of data.table shortly. But first, a quick side-by-side comparison with dplyr, since that will help to orientate us for the rest of the lecture. Using our starwars dataset, say we want to know: > What is the average height of the human characters by gender? .pull-left[ ### dplyr ```r data(starwars, package = "dplyr") starwars %>% filter(species=="Human") %>% * group_by(gender) %>% summarise(mean(height, na.rm=T)) ``` ] .pull-right[ ### data.table ```r starwars_dt = as.data.table(starwars) starwars_dt[ species=="Human", mean(height, na.rm=T), * by = gender] ## by ``` ] --- # A Quick Example We'll dive into the details (and quirks) of data.table shortly. But first, a quick side-by-side comparison with dplyr, since that will help to orientate us for the rest of the lecture. Using our starwars dataset, say we want to know: > What is the average height of the human characters by gender? 
.pull-left[ ### dplyr ```r data(starwars, package = "dplyr") starwars %>% filter(species=="Human") %>% group_by(gender) %>% summarise(mean(height, na.rm=T)) ``` ``` ## # A tibble: 2 × 2 ## gender `mean(height, na.rm = T)` ## <chr> <dbl> ## 1 feminine 160. ## 2 masculine 182. ``` ] .pull-right[ ### data.table ```r starwars_dt = as.data.table(starwars) starwars_dt[ species=="Human", mean(height, na.rm=T), by = gender] ``` ``` ## gender V1 ## 1: masculine 182.3478 ## 2: feminine 160.2500 ``` ] --- class: inverse, center, middle name: i # Working with rows: DT[i, ] <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Subset by rows (filter) Subsetting by rows is very straightforward in data.table. Everything works pretty much the same as you'd expect if you're coming from dplyr. - `DT[x == "string", ]`: Subset to rows where variable x equals "string" - `DT[y > 5, ]`: Subset to rows where variable y is greater than 5 - `DT[1:10, ]`: Subset to the first 10 rows -- Multiple filters/conditions are fine too: - `DT[x=="string" & y>5, ]`: Subset to rows where x is "string" AND y is greater than 5 -- Note that we don't actually need commas when we're only subsetting on `i` (i.e. no `j` or `by` components). - `DT[x=="string"]` is equivalent to `DT[x=="string", ]` - `DT[1:10]` is equivalent to `DT[1:10, ]` - etc. --- # Subset by rows (filter) cont. Just to emphasise the point, here's an example of subsetting by rows using our starwars data.table from earlier. 
```r starwars_dt[height>190 & species=='Human'] ``` ``` ## name height mass hair_color skin_color eye_color birth_year ## 1: Darth Vader 202 136 none white yellow 41.9 ## 2: Qui-Gon Jinn 193 89 brown fair blue 92.0 ## 3: Dooku 193 80 white fair brown 102.0 ## 4: Bail Prestor Organa 191 NA black tan brown 67.0 ## sex gender homeworld species ## 1: male masculine Tatooine Human ## 2: male masculine <NA> Human ## 3: male masculine Serenno Human ## 4: male masculine Alderaan Human ## films ## 1: The Empire Strikes Back,Revenge of the Sith,Return of the Jedi,A New Hope ## 2: The Phantom Menace ## 3: Attack of the Clones,Revenge of the Sith ## 4: Attack of the Clones,Revenge of the Sith ## vehicles starships ## 1: TIE Advanced x1 ## 2: Tribubble bongo ## 3: Flitknot speeder ## 4: ``` --- # Order by rows (arrange) ```r starwars_dt[order(birth_year)] ## (temporarily) sort by youngest to oldest starwars_dt[order(-birth_year)] ## (temporarily) sort by oldest to youngest ``` -- While ordering as per the above is very straightforward, data.table also provides an optimised `setorder()` function for reordering *by reference*. 
-- ```r setorder(starwars_dt, birth_year, na.last = TRUE) starwars_dt[1:5, name:birth_year] ## Only print subset to stay on the slide ``` ``` ## name height mass hair_color skin_color eye_color birth_year ## 1: Wicket Systri Warrick 88 20 brown brown brown 8 ## 2: IG-88 200 140 none metal red 15 ## 3: Luke Skywalker 172 77 blond fair blue 19 ## 4: Leia Organa 150 49 brown light brown 19 ## 5: Wedge Antilles 170 77 brown fair hazel 21 ``` --- # Helper function: `%like%` * We have handy helper functions `%like%`, `%ilike%`, `%flike%` * Say we want to get all `name`s with a *Sky* in them: ```r starwars_dt[name %like% "Sky",.(name,height,mass)] ``` ``` ## name height mass ## 1: Luke Skywalker 172 77 ## 2: Anakin Skywalker 188 84 ## 3: Shmi Skywalker 163 NA ``` --- class: inverse, center, middle name: j # Manipulating columns: DT[, j] <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # j: One slot to rule them all Recall some of the dplyr verbs that we used to manipulate our variables in different ways: - `select()` - `mutate()` - `summarise()` - `count()` -- data.table recognizes that all of these verbs are just different versions of telling R... >"*Do something to this variable in my dataset*" ... and it lets you do all of those operations in one place: the `j` slot. -- However, this concision requires a few syntax tweaks w.r.t. how we change and assign variables in our dataset. - Some people find this off-putting (or, at least, weird) when they first come to data.table. - I hope to convince you that these syntax tweaks aren't actually that difficult to grok and give us a *lot* of power in return. --- # Modifying columns `:=` To add, delete, or change columns in data.table, we use the **`:=`** operator. - Known as the *walrus* operator (geddit??)
-- For example, - `DT[, xsq := x^2]`: Create a new column (`xsq`) from an existing one (`x`) - `DT[, x := as.character(x)]`: Change an existing column -- **Important:** `:=` is *modifying by reference*, i.e. in place. So we don't have to (re)assign the object to save these changes. -- However, we also won't see these changes printed to screen unless we ask R explicitly. ```r DT = data.table(x = 1:2) # DT[, xsq := x^2] ## Modifies in place but doesn't print the result DT[, x_sq := x^2][] ## Adding [] prints the result. ``` ``` ## x x_sq ## 1: 1 1 ## 2: 2 4 ``` --- # Modifying columns `:=` (cont.) As I keep saying, *modifying by reference* has important implications for data manipulation. Consider what happens if we copy our data.table and then remove a column. ```r DT_copy = DT DT_copy[, x_sq := NULL] ``` -- Clearly, "x_sq" has been removed from `DT_copy`. But what of the original `DT` table? -- ```r DT ``` ``` ## x ## 1: 1 ## 2: 2 ``` -- Uh-oh! It too has been removed... exactly as modifying by reference demands. To avoid this behaviour, use the explicit [`data.table::copy()`](https://rdatatable.gitlab.io/data.table/reference/copy.html) function. Run this next chunk yourself: ```r DT[, x_sq := x^2] DT_copy = copy(DT) DT_copy[, x_sq := NULL] DT ## x_sq is still there (run and confirm for yourself) ``` --- # Modifying columns `:=` (cont.) ### Sub-assign by reference One really cool implication of `:=` is data.table's [sub-assign by reference](https://rdatatable.gitlab.io/data.table/articles/datatable-reference-semantics.html#ref-i-j) functionality. As a simple example, consider another fake dataset. ```r DT2 = data.table(a = -2:2, b = LETTERS[1:5]) ``` -- Now, imagine we want to locate all rows where "a" is negative and replace the corresponding "b" cell with NA. -- - In dplyr you'd have to do something like `...mutate(b = ifelse(a < 0, NA, b))`. -- - In data.table, simply specify which rows to target (`i`) and then sub-assign (`j`) directly. 
```r DT2[a < 0, b := NA][] ## Again, just adding the second [] to print to screen ``` ``` ## a b ## 1: -2 <NA> ## 2: -1 <NA> ## 3: 0 C ## 4: 1 D ## 5: 2 E ``` --- # Modifying columns `:=` (cont.) ### Sub-assign by reference * the recently added `fcase` function is similar to dplyr's `case_when()`. * Makes it easy (and efficient) to add new categories by reference ```r starwars_dt[ , eye_cat := fcase( eye_color %chin% c("black","blue-gray","brown","dark","hazel"), "dark", eye_color %chin% c("gold","green, yellow","orange","pink", "red","red, blue"), "funky", eye_color %chin% c("blue","white","yellow"), "light" )] starwars_dt[,.(eye_color, eye_cat, name)] ``` ``` ## eye_color eye_cat name ## 1: brown dark Wicket Systri Warrick ## 2: red funky IG-88 ## 3: blue light Luke Skywalker ## 4: brown dark Leia Organa ## 5: hazel dark Wedge Antilles ## 6: black dark Plo Koon ## 7: brown dark Biggs Darklighter ## 8: brown dark Han Solo ## 9: brown dark Lando Calrissian ## 10: brown dark Boba Fett ## 11: red funky R2-D2 ## 12: blue light Lobot ## 13: blue light Barriss Offee ## 14: orange funky Ackbar ## 15: yellow light Darth Vader ## 16: blue light Anakin Skywalker ## 17: black dark Greedo ## 18: brown dark Padmé Amidala ## 19: blue light Beru Whitesun lars ## 20: blue light Mon Mothma ## 21: hazel dark Ayla Secura ## 22: blue light Owen Lars ## 23: orange funky Jar Jar Binks ## 24: red funky Bossk ## 25: yellow light Darth Maul ## 26: blue-gray dark Obi-Wan Kenobi ## 27: blue light Luminara Unduli ## 28: brown dark Quarsh Panaka ## 29: blue light Wilhuff Tarkin ## 30: brown dark Jango Fett ## 31: brown dark Bail Prestor Organa ## 32: brown dark Shmi Skywalker ## 33: brown dark Mace Windu ## 34: yellow light Palpatine ## 35: blue light Cliegg Lars ## 36: blue light Finis Valorum ## 37: blue light Qui-Gon Jinn ## 38: yellow light Ki-Adi-Mundi ## 39: brown dark Dooku ## 40: yellow light C-3PO ## 41: blue light Chewbacca ## 42: orange funky Jabba Desilijic Tiure ## 43: brown dark
Yoda ## 44: red funky R5-D4 ## 45: blue light Jek Tono Porkins ## 46: brown dark Arvel Crynyd ## 47: black dark Nien Nunb ## 48: red funky Nute Gunray ## 49: orange funky Roos Tarpals ## 50: orange funky Rugor Nass ## 51: blue light Ric Olié ## 52: yellow light Watto ## 53: orange funky Sebulba ## 54: pink funky Bib Fortuna ## 55: yellow light Dud Bolt ## 56: black dark Gasgano ## 57: orange funky Ben Quadinaros ## 58: black dark Kit Fisto ## 59: brown dark Eeth Koth ## 60: blue light Adi Gallia ## 61: orange funky Saesee Tiin ## 62: yellow light Yarael Poof ## 63: blue light Mas Amedda ## 64: brown dark Gregar Typho ## 65: brown dark Cordé ## 66: yellow light Poggle the Lesser ## 67: brown dark Dormé ## 68: yellow light Zam Wesell ## 69: yellow light Dexter Jettster ## 70: black dark Lama Su ## 71: black dark Taun We ## 72: blue light Jocasta Nu ## 73: unknown <NA> Ratts Tyerell ## 74: red, blue funky R4-P17 ## 75: unknown <NA> Wat Tambor ## 76: gold funky San Hill ## 77: black dark Shaak Ti ## 78: green, yellow funky Grievous ## 79: blue light Tarfful ## 80: brown dark Raymus Antilles ## 81: white light Sly Moore ## 82: black dark Tion Medon ## 83: dark dark Finn ## 84: hazel dark Rey ## 85: brown dark Poe Dameron ## 86: black dark BB8 ## 87: unknown <NA> Captain Phasma ## eye_color eye_cat name ``` --- # Modifying columns `:=` (cont.) 
### Sub-assign by reference on Subset * The same works on a subset of the data only: ```r starwars_dt[ sex %in% c("male","none") , eye_cat_m := fcase( eye_color %chin% c("black","blue-gray","brown","dark","hazel"), "dark", eye_color %chin% c("gold","green, yellow","orange","pink", "red","red, blue"), "funky", eye_color %chin% c("blue","white","yellow"), "light" )] head(starwars_dt[,.(eye_color, eye_cat, eye_cat_m,name)],n=6) ``` ``` ## eye_color eye_cat eye_cat_m name ## 1: brown dark dark Wicket Systri Warrick ## 2: red funky funky IG-88 ## 3: blue light light Luke Skywalker ## 4: brown dark <NA> Leia Organa ## 5: hazel dark dark Wedge Antilles ## 6: black dark dark Plo Koon ``` --- # Modifying columns: Benchmark ```r mod_dplyr <- function(){ starwars %>% mutate(eye_cat_m = case_when( sex %in% c("male","none") ~ ifelse( eye_color %in% c("black","blue-gray","brown","dark","hazel"),"dark", ifelse(eye_color %in% c("gold","green, yellow","orange","pink", "red","red, blue"), "funky", ifelse(eye_color %in% c("blue","white","yellow"), "light", NA) ) ), sex == "female" ~ NA_character_) ) } mod_dt <- function(){ starwars_dt[ sex %in% c("male","none") , eye_cat_m := fcase( eye_color %chin% c("black","blue-gray","brown","dark","hazel"), "dark", eye_color %chin% c("gold","green, yellow","orange","pink", "red","red, blue"), "funky", eye_color %chin% c("blue","white","yellow"), "light" )] } ``` --- # Modifying columns: Benchmark ```r microbenchmark(mod_dplyr(), mod_dt(), times = 10) ``` ``` ## Warning in microbenchmark(mod_dplyr(), mod_dt(), times = 10): less accurate ## nanosecond times to avoid potential integer overflows ``` ``` ## Unit: microseconds ## expr min lq mean median uq max neval ## mod_dplyr() 664.446 713.318 1495.3151 733.9820 826.97 7041.955 10 ## mod_dt() 582.364 622.954 967.5139 644.7455 755.22 3651.132 10 ``` --- # Modifying columns := (cont.) To modify multiple columns simultaneously, we have two options. 1. LHS `:=` RHS form: `DT[, c("var1", "var2") := .(val1, val2)]` 2.
Functional form: `DT[, ':=' (var1=val1, var2=val2)]` -- Personally, I *much* prefer the first form and so that's what I'll use going forward. E.g. ```r DT[, c("y", "y_name") := list(3:4 , c("three", "four"))] DT ## Another way to print the results instead of appending [] ``` ``` ## x x_sq y y_name ## 1: 1 1 3 three ## 2: 2 4 4 four ``` -- Note, however, that dynamically assigning dependent columns in a single step (like we did with dplyr::mutate) doesn't work. ```r DT[, ':=' (z = 5:6, z_sq = z^2)][] ``` ``` ## Error in eval(jsub, SDenv, parent.frame()): object 'z' not found ``` --- # Aside: Chaining data.table operations That last example provides as good a time as any to mention that you can chain multiple data.table operations together. -- The native data.table way is simply to append consecutive `[]` terms. ```r DT[, z := 5:6][, z_sq := z^2][] ``` ``` ## x x_sq y y_name z z_sq ## 1: 1 1 3 three 5 25 ## 2: 2 4 4 four 6 36 ``` -- But if you prefer the **magrittr** pipe, then that's also possible. Just prefix each step with `.`: ```r # library(magrittr) ## Not needed since we've already loaded %>% via dplyr DT %>% .[, xyz := x+y+z] %>% .[, xyz_sq := xyz^2] %>% .[] ``` ``` ## x x_sq y y_name z z_sq xyz xyz_sq ## 1: 1 1 3 three 5 25 9 81 ## 2: 2 4 4 four 6 36 12 144 ``` --- # Modifying columns := (cont.) To remove a column from your dataset, set it to NULL. ```r DT[, y_name := NULL] DT ``` ``` ## x x_sq y z z_sq xyz xyz_sq ## 1: 1 1 3 5 25 9 81 ## 2: 2 4 4 6 36 12 144 ``` --- # Subsetting on columns (select) We can also use the `j` slot to subset our data on columns. I'll return to the starwars dataset for these examples... 
-- Subset by column position: ```r starwars_dt[1:2, c(1:3, 10)] ``` ``` ## name height mass homeworld ## 1: Wicket Systri Warrick 88 20 Endor ## 2: IG-88 200 140 <NA> ``` -- Or by name: ```r # starwars_dt[, c("name", "height", "mass", "homeworld")] ## Also works # starwars_dt[, list(name, height, mass, homeworld)] ## So does this starwars_dt[1:2, .(name, height, mass, homeworld)] ``` ``` ## name height mass homeworld ## 1: Wicket Systri Warrick 88 20 Endor ## 2: IG-88 200 140 <NA> ``` --- # Aside: What's with the .()? We've now seen `.()` in a couple places, e.g. the previous slide and [this slide](#fast) from earlier if you were paying close attention. - `.()` is just a data.table shortcut for `list()`. We'll be using `.()` quite liberally once we start subsetting and/or grouping by multiple variables at a time. You can think of it as one of data.table's syntactical quirks. But, really, it's just there to give you more options. You can often — if not always — use these three forms interchangeably in data.table: - `.(var1, var2, ...)` - `list(var1, var2, ...)` - `c("var1", "var2", ...)` -- I like the `.()` syntax best — less typing! — but each to their own. -- Okay, back to subsetting on columns... --- # Subsetting on columns (cont.) You can also exclude columns through negation. Try this next code chunk yourself: ```r starwars_dt[, !c("name", "height")] ``` -- ### Renaming columns You can rename (set) a column by reference. Again, run this yourself: ```r setnames(starwars_dt, old = c("name", "homeworld"), new = c("alias", "crib"))[] ## Better change it back, in case we use "name" or "homeworld" on a later slide setnames(starwars_dt, old = c("alias", "crib"), new = c("name", "homeworld")) ``` -- While the `setnames()` approach offers performance benefits, I often find it convenient to dynamically (and/or temporarily) rename columns when subsetting them.
For example: ```r starwars_dt[1:2, .(alias = name, crib = homeworld)] ``` ``` ## alias crib ## 1: Wicket Systri Warrick Endor ## 2: IG-88 <NA> ``` --- # Subsetting on columns (cont.) One last thing I'll mention w.r.t. subsetting columns is that you can also use dplyr verbs on data.tables if you prefer. For example run the following code chunk for yourself. (You'll get a warning about efficiency loss, but this will be very minor for a case like this.) ```r starwars_dt[1:5, ] %>% select(crib = homeworld, everything()) ``` -- I don't want to preempt myself, though. I'll get back to dplyr+data.table functionality at the end of the lecture.... --- # Aggregating Finally, we can do aggregating manipulations in `j`. ```r starwars_dt[, mean(height, na.rm=T)] ``` ``` ## [1] 174.358 ``` -- Note that we don't keep anything unless we assign the result to a new object. If you want to add the new aggregated column to your original dataset, use `:=`. ```r starwars_dt[, mean_height := mean(height, na.rm=T)] %>% ## Add mean height as column .[1:5, .(name, height, mean_height)] ## Just to keep everything on the slide ``` ``` ## name height mean_height ## 1: Luke Skywalker 172 174.358 ## 2: C-3PO 167 174.358 ## 3: R2-D2 96 174.358 ## 4: Darth Vader 202 174.358 ## 5: Leia Organa 150 174.358 ``` --- # Aggregating (cont.) data.table also provides [special convenience symbols](https://rdatatable.gitlab.io/data.table/reference/special-symbols.html) for common aggregation tasks in `j`. For example, we can quickly count the number of observations using `.N`. ```r starwars_dt[, .N] ``` ``` ## [1] 87 ``` -- Of course, this is a pretty silly example since it's just going to give us the total number of rows in the dataset. Like most forms of aggregation, `.N` is much more interesting when it is applied by group. - This provides a nice segue to our next section...
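
---

# Aggregating (cont.)

As a small preview of that segue, here is a minimal sketch of `.N` computed by group. (The toy table `DT_toy` is made up purely for illustration; the starwars version appears in the next section.)

```r
library(data.table)

## Toy data: three characters, two species
DT_toy = data.table(species = c("Human", "Human", "Droid"),
                    height  = c(172, 150, 96))

## .N in j, grouped by species = number of rows per group
counts = DT_toy[, .N, by = species]
counts
```

With an unkeyed `by`, groups appear in order of first appearance, so "Human" (2 rows) is listed before "Droid" (1 row).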
--- name: by class: inverse, center, middle # Group by: DT[, , by] <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # by data.table's `by` argument functions very similarly to the `dplyr::group_by` equivalent. Try these next few examples in your own R console: - `starwars_dt[, mean(height, na.rm=T), by = species]`: Collapse by single variable - `starwars_dt[, .(species_height = mean(height, na.rm=T)), by = species]`: As above, but explicitly name the summary variable - `starwars_dt[, mean(mass, na.rm=T), by = height>190]`: Conditionals work too. - `starwars_dt[, species_n := .N, by = species][]`: Add an aggregated column to the data (here: number of observations by species group) -- To perform aggregations by multiple variables, we'll use the `.()` syntax again. ```r starwars_dt[, .(mean_height = mean(height, na.rm=T)), by = .(species, homeworld)] %>% head(4) ## Just to keep everything on the slide ``` ``` ## species homeworld mean_height ## 1: Human Tatooine 179.2500 ## 2: Droid Tatooine 132.0000 ## 3: Droid Naboo 96.0000 ## 4: Human Alderaan 176.3333 ``` --- # Efficient subsetting with .SD We've seen how to group by multiple variables. But what if we want to *summarise* multiple variables, regardless of how we are grouping? One solution is to again use `.()` and write everything out, e.g. ```r ## Run yourself if you'd like to check the output starwars_dt[, .(mean(height, na.rm=T), mean(mass, na.rm=T), mean(birth_year, na.rm=T)), by = species] ``` -- But this soon becomes tedious. Imagine we have even more variables. Do we really have to write out `mean(..., na.rm=T)` for each one? -- Fortunately, the answer is "no". data.table provides a special `.SD` symbol for the **S**ubset of **D**ata. In truth, .SD can do a [lot more](https://rdatatable.gitlab.io/data.table/articles/datatable-sd-usage.html) than what I'm about to show you, but here's how it would work in the present case...
.right[*See next slide.*] --- # Efficient subsetting with .SD (cont.) ```r starwars_dt[, lapply(.SD, mean, na.rm=T), .SDcols = c("height", "mass", "birth_year"), by = species] %>% head(2) ## Just keep everything on the slide ``` ``` ## species height mass birth_year ## 1: Ewok 88.0 20.00 8.00000 ## 2: Droid 131.2 69.75 53.33333 ``` --- count: false # Efficient subsetting with .SD (cont.) ```r starwars_dt[, * lapply(.SD, mean, na.rm=T), .SDcols = c("height", "mass", "birth_year"), by = species] %>% head(2) ## Just keep everything on the slide ``` ``` ## species height mass birth_year ## 1: Ewok 88.0 20.00 8.00000 ## 2: Droid 131.2 69.75 53.33333 ``` First, we specify what we want to *do* on our data subset (i.e. `.SD`). In this case, we want the mean for each element, which we obtain by iterating over with the base `lapply()` function.<sup>1</sup> .footnote[ <sup>1</sup> We'll learn more about iteration once we get to the programming section of the course. ] --- count: false # Efficient subsetting with .SD (cont.) ```r starwars_dt[, lapply(.SD, mean, na.rm=T), * .SDcols = c("height", "mass", "birth_year"), by = species] %>% head(2) ## Just keep everything on the slide ``` ``` ## species height mass birth_year ## 1: Ewok 88.0 20.00 8.00000 ## 2: Droid 131.2 69.75 53.33333 ``` First, we specify what we want to *do* on our data subset (i.e. `.SD`). In this case, we want the mean for each element, which we obtain by iterating over with the base `lapply()` function.<sup>1</sup> Then, we specify *which columns* to subset with the `.SDcols` argument. .footnote[ <sup>1</sup> We'll learn more about iteration once we get to the programming section of the course. ] --- count: false # Efficient subsetting with .SD (cont.) 
```r starwars_dt[, lapply(.SD, mean, na.rm=T), .SDcols = c("height", "mass", "birth_year"), by = species] %>% head(2) ## Just keep everything on the slide ``` ``` ## species height mass birth_year ## 1: Ewok 88.0 20.00 8.00000 ## 2: Droid 131.2 69.75 53.33333 ``` First, we specify what we want to *do* on our data subset (i.e. `.SD`). In this case, we want the mean for each element, which we obtain by iterating over it with the base `lapply()` function.<sup>1</sup> Then, we specify *which columns* to subset with the `.SDcols` argument. P.S. One annoyance I have is that the `.()` syntax doesn't work for `.SDcols`. However, we can at least feed it consecutive columns without quotes, e.g. `.SDcols = height:mass`. See [here](https://rdatatable.gitlab.io/data.table/articles/datatable-intro.html#how-can-we-specify-just-the-columns-we-would-like-to-compute-the-mean-on). .footnote[ <sup>1</sup> We'll learn more about iteration once we get to the programming section of the course. ] --- # Efficient subsetting with .SD (cont.) Just to add: We need only specify `.SDcols` if we want to subset specific parts of the data. (You can also use shortcuts like `.SDcols = is.numeric` or `.SDcols = patterns('abc')`.) If we instead want to apply the same function on *all* the variables in our dataset, then `.SD` by itself will suffice. -- As a quick example, recall our earlier DT object that contains only numeric variables. ```r DT ``` ``` ## x x_sq y z z_sq xyz xyz_sq ## 1: 1 1 3 5 25 9 81 ## 2: 2 4 4 6 36 12 144 ``` -- We can obtain the mean for each variable as follows. ```r DT[, lapply(.SD, mean)] ``` ``` ## x x_sq y z z_sq xyz xyz_sq ## 1: 1.5 2.5 3.5 5.5 30.5 10.5 112.5 ``` --- # keyby The last thing I want to mention w.r.t. `by` is its close relative: `keyby`. The `keyby` argument works exactly like `by` — you can use it as a drop-in replacement — except that it orders the results and creates a **key**. - Setting a key for a data.table will allow for various (and often astonishing) performance gains.<sup>1</sup> - Keys are important enough that I want to save them for their own section, though... .footnote[<sup>1</sup> Note that you won't see an immediate performance gain with `keyby`, but subsequent operations will certainly benefit. (Of course, you can get an immediate boost by setting the key ahead of time, but I'll explain all that on the next slide...)] --- name: keys class: inverse, center, middle # Keys <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # What are keys?
Keys are a way of ordering the data that allows for *extremely* fast subsetting. The data.table [vignette](https://rdatatable.gitlab.io/data.table/articles/datatable-keys-fast-subset.html) describes them as "supercharged rownames". I know that might sound a bit abstract, but here's the idea in a nutshell... -- Imagine that we want to filter a dataset based on a particular value (e.g. find all the human characters in our starwars dataset). - Normally, we'd have to search through the whole dataset to identify matching cases. - But, if we've set an appropriate key, then the data are already ordered in such a way that we (i.e. our computer) only have to search through a much smaller subset. -- **Analogy:** Think of the way a filing cabinet might divide items by alphabetical order: Files starting "ABC" in the top drawer, "DEF" in the second drawer, etc. To find *Alice's* file, you'd only have to search the top drawer. For *Fred*, the second drawer, and so on. -- Not only is this much quicker, but the same idea also carries over to *all other* forms of data manipulation that rely on subsetting (aggregation by group, joins, etc.) -- P.S. We'll get there later in the course, but keys are also the secret sauce in databases. --- # How do I set a key? You can set a key when you first create a data.table. E.g. - `DT = data.table(x = 1:10, y = LETTERS[1:10], key = "x")` - `DT = as.data.table(DF, key = "x")` - `setDT(DF, key = "x")` -- Or, you can set keys on an existing data.table with the `setkey()` function. - `setkey(DT, x)`: Note that the key doesn't have to be quoted this time -- **Important:** Since keys just describe a particular ordering of the data, you can set a key on *multiple* columns. (More [here](https://rdatatable.gitlab.io/data.table/articles/datatable-keys-fast-subset.html#key-properties).) E.g. - `DT = as.data.table(DF, key = c("x", "y"))` - `setkey(DT, x, y)`: Again, no quotes needed -- </br> P.S.
Use the `key()` function to see what keys are currently set for your data.table. You can only ever have one key per table at a time, but it's very easy to change them using one of the above commands. --- # Example Recall the [speed benchmark](#fast) that we saw at the very beginning of the lecture: data.table ended up being 75x faster than dplyr for a fairly standard summarising task. -- Let's redo the benchmark, but this time include a version where we pre-assign a key. For optimal performance, the key should match the same variables that we're grouping/subsetting on. - Again, a key can be set on multiple variables, although the lead grouping variable (in the below case: "name") is the most important. -- .small90[ ```r ## First create a keyed version of the storms data.table. ## Note that key variables match the 'by' grouping variables below. storms_dt_key = as.data.table(storms, key = c("name", "year", "month", "day")) ## Collapse function for this keyed data.table. Everything else stays the same. collapse_dt_key = function() { storms_dt_key[, .(wind = mean(wind), pressure = mean(pressure), category = first(category)), by = .(name, year, month, day)] } ## Run the benchmark on all three functions. microbenchmark(collapse_dplyr(), collapse_dt(), collapse_dt_key(), times = 10) ``` ] .right[*See next slide for results*] --- # Example (cont.) ``` ## Unit: milliseconds ## expr min lq mean median uq max ## collapse_dplyr() 93.860931 96.373862 97.394487 96.890442 98.191269 102.544731 ## collapse_dt() 3.850228 3.896230 4.636899 4.004285 4.053588 10.432450 ## collapse_dt_key() 1.321922 1.338691 1.614014 1.435512 1.475959 3.505582 ## neval ## 10 ## 10 ## 10 ``` -- The keyed data.table version is now **67x** (!!!) faster than dplyr. -- - That thing you feel... is your face melting. -- It's not just this toy example. In working with real-life data, my experience is that setting keys almost always leads to huge speed-ups...
and those gains tend to scale as the datasets increase in size. -- **Bottom line:** data.table is already plenty fast. But use keys if you're really serious about performance. --- name: merge class: inverse, center, middle # Merging datasets <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Merge (aka join) options data.table provides two ways to merge datasets. - `DT1[DT2, on = "id"]` - `merge(DT1, DT2, by = "id")` -- I prefer the latter because it offers extra functionality (see `?merge.data.table`), but each to their own.<sup>1</sup> .footnote[<sup>1</sup> For a really good summary of the different join options (left, right, full, anti, etc.) using these two methods, as well as their dplyr equivalents, see [here](https://atrebas.github.io/post/2019-03-03-datatable-dplyr/#joinbind-data-sets).] -- I'm going to keep things brief by simply showing you how to repeat the same left join that we [practiced](https://raw.githack.com/uo-ec607/lectures/master/05-tidyverse/05-tidyverse.html#54) with dplyr in the last lecture, using data from the **nycflights13** package. ```r # library(nycflights13) ## Already loaded flights_dt = as.data.table(flights) planes_dt = as.data.table(planes) ``` --- # Left join example (cont.) Here's a comparison with the dplyr equivalents from last week. I'll let you run and compare these yourself. (Note that the row orders will be different.) .pull-left[ **dplyr** ```r left_join( flights, planes, by = "tailnum" ) ``` ] .pull-right[ **data.table** ```r merge( flights_dt, planes_dt, all.x = TRUE, ## omit for inner join by = "tailnum") ``` ] -- If you run these, you'll see that both methods handle the ambiguous "year" columns by creating "year.x" and "year.y" variants. We avoided this in dplyr by using `rename()`. How might you avoid the same thing in data.table? -- <b>Possible answer:</b> Use `setnames()`. 
```r merge( * setnames(flights_dt, old = "year", new = "year_built"), planes_dt, all.x = TRUE, by = "tailnum") ``` --- # Use keys for lightning fast joins The only other thing I'll point out is that setting [keys](#keys) can lead to dramatic speed-ups for merging data.tables. I'll demonstrate using an inner join this time. ```r merge_dt = function() merge(flights_dt, planes_dt, by = "tailnum") flights_dt_key = as.data.table(flights, key = "tailnum") planes_dt_key = as.data.table(planes, key = "tailnum") merge_dt_key = function() merge(flights_dt_key, planes_dt_key, by = "tailnum") microbenchmark(merge_dt(), merge_dt_key(), times = 10) ``` ``` ## Unit: milliseconds ## expr min lq mean median uq max neval ## merge_dt() 50.55919 55.03049 74.53085 62.11094 107.83455 118.65215 10 ## merge_dt_key() 29.05051 31.41285 46.12807 38.70507 41.30672 89.99369 10 ``` -- So, we get about a 2x speed-up for this simple case. Trust me, though: For really big datasets with complicated joins, ordering your data by setting keys will make a huge difference. (FWIW, the same is true for dplyr. See [here](https://tysonbarrett.com//jekyll/update/2019/10/11/speed_of_joins/).) --- name: reshape class: inverse, center, middle # Reshaping data <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Reshaping options with data.table In the tidyverse lecture, we saw how to reshape data using the tidyr::pivot* functions. data.table offers its own functions for flexibly reshaping data: - `melt()`: convert wide data to long data - `dcast()`: convert long data to wide data -- However, I also want to flag the [**tidyfast**](https://tysonbarrett.com/tidyfast/index.html) package by Tyson Barrett, which implements data.table versions of the tidyr::pivot* functions (among other things). 
- `tidyfast::dt_pivot_longer()`: wide to long - `tidyfast::dt_pivot_wider()`: long to wide -- I'll demonstrate reshaping with both options on the same fake "stocks" data that we created last time: ```r stocks = data.table(time = as.Date('2009-01-01') + 0:1, X = rnorm(2, 0, 1), Y = rnorm(2, 0, 2), Z = rnorm(2, 0, 4)) ``` --- # Reshape from wide to long Our data are currently in wide format. ```r stocks ``` ``` ## time X Y Z ## 1: 2009-01-01 -0.4105827 3.5158984 -2.460965 ## 2: 2009-01-02 -0.4205304 -0.5169784 -1.913041 ``` To convert this into long format, we could do either of the following: .pull-left[.small90[ ```r # See ?melt.data.table for options melt(stocks, id.vars ="time") ``` ``` ## time variable value ## 1: 2009-01-01 X -0.4105827 ## 2: 2009-01-02 X -0.4205304 ## 3: 2009-01-01 Y 3.5158984 ## 4: 2009-01-02 Y -0.5169784 ## 5: 2009-01-01 Z -2.4609647 ## 6: 2009-01-02 Z -1.9130410 ``` ]] .pull-right[.small90[ ```r stocks %>% dt_pivot_longer(X:Z, names_to="stock", values_to="price") ``` ``` ## time stock price ## 1: 2009-01-01 X -0.4105827 ## 2: 2009-01-02 X -0.4205304 ## 3: 2009-01-01 Y 3.5158984 ## 4: 2009-01-02 Y -0.5169784 ## 5: 2009-01-01 Z -2.4609647 ## 6: 2009-01-02 Z -1.9130410 ``` ]] --- # Reshape from long to wide Let's quickly save the long-format stocks data.table. I'll use the `melt()` approach and also throw in some extra column-naming options, just so you can see those in action. 
```r stocks_long = melt(stocks, id.vars ="time", variable.name = "stock", value.name = "price") stocks_long ``` ``` ## time stock price ## 1: 2009-01-01 X -0.4105827 ## 2: 2009-01-02 X -0.4205304 ## 3: 2009-01-01 Y 3.5158984 ## 4: 2009-01-02 Y -0.5169784 ## 5: 2009-01-01 Z -2.4609647 ## 6: 2009-01-02 Z -1.9130410 ``` -- .pull-left[.small90[ ```r dcast(stocks_long, time ~ stock, value.var = "price") ``` ``` ## time X Y Z ## 1: 2009-01-01 -0.4105827 3.5158984 -2.460965 ## 2: 2009-01-02 -0.4205304 -0.5169784 -1.913041 ``` ]] .pull-right[.small90[ ```r stocks_long %>% dt_pivot_wider(names_from=stock, values_from=price) ``` ``` ## time X Y Z ## 1: 2009-01-01 -0.4105827 3.5158984 -2.460965 ## 2: 2009-01-02 -0.4205304 -0.5169784 -1.913041 ``` ]] --- # Reshaping multiple columns * In this [vignette](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html) you'll find a bunch more examples of efficient reshaping with data.table. * One of my favourites is the ability to reshape *several* columns at once.
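To give you a self-contained taste first, here's a minimal sketch with made-up data. Pass `patterns()` (or a list of column-name vectors) to `measure.vars`, and `melt()` will stack each group of columns into its own value column:

```r
library(data.table)

## Fake wide data: one row per family, two value columns per child
fam = data.table(family_id     = 1:2,
                 dob_child1    = c("1998-11-26", "1996-06-22"),
                 dob_child2    = c("2000-01-29", NA),
                 gender_child1 = c(1L, 2L),
                 gender_child2 = c(2L, NA))

## Melt the "dob_*" and "gender_*" column groups simultaneously
melt(fam, measure.vars = patterns("^dob", "^gender"),
     value.name = c("dob", "gender"))
```

Note that `fam` and its columns are invented purely for illustration; the vignette's own family dataset (shown next) works exactly the same way.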
.small90[ ```r DT <- fread(s2) ## 's2' holds the raw family data used in the data.table reshaping vignette DT ``` ``` ## family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 ## 1: 1 30 1998-11-26 2000-01-29 <NA> 1 ## 2: 2 27 1996-06-22 <NA> <NA> 2 ## 3: 3 26 2002-07-11 2004-04-05 2007-09-02 2 ## 4: 4 32 2004-10-10 2009-08-27 2012-07-21 1 ## 5: 5 29 2000-12-05 2005-02-28 <NA> 2 ## gender_child2 gender_child3 ## 1: 2 NA ## 2: NA NA ## 3: 2 1 ## 4: 1 1 ## 5: 1 NA ``` ```r # melt colA = paste("dob_child", 1:3, sep = "") colB = paste("gender_child", 1:3, sep = "") DT.m2 = melt(DT, measure = list(colA, colB), value.name = c("dob", "gender")) ``` ] --- # Reshaping multiple columns (cont.) The same melt can be written more succinctly with `patterns()`: ```r DT.m2 = melt(DT, measure = patterns("^dob", "^gender"), value.name = c("dob", "gender")) DT.m2 ``` ``` ## family_id age_mother variable dob gender ## 1: 1 30 1 1998-11-26 1 ## 2: 2 27 1 1996-06-22 2 ## 3: 3 26 1 2002-07-11 2 ## 4: 4 32 1 2004-10-10 1 ## 5: 5 29 1 2000-12-05 2 ## 6: 1 30 2 2000-01-29 2 ## 7: 2 27 2 <NA> NA ## 8: 3 26 2 2004-04-05 2 ## 9: 4 32 2 2009-08-27 1 ## 10: 5 29 2 2005-02-28 1 ## 11: 1 30 3 <NA> NA ## 12: 2 27 3 <NA> NA ## 13: 3 26 3 2007-09-02 1 ## 14: 4 32 3 2012-07-21 1 ## 15: 5 29 3 <NA> NA ``` --- name: tidyverse class: inverse, center, middle # data.table + tidyverse workflows <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Choosing a workflow that works for you When it comes to data work in R, we truly are spoilt for choice. We have two incredible data wrangling ecosystems to choose from. - **tidyverse** (esp. **dplyr** and **tidyr**) - **data.table** Over the last two lectures, we've explored some of the key features of each. It's only natural that people might find themselves gravitating to one or the other. - Some people love the expressiveness and modularity of the tidyverse. - Others love the concision and power of data.table. -- And that's cool.
But I'll repeat a point I made earlier: I use both ecosystems about equally in my own work and honestly believe that this has been to my great benefit. These next few slides offer a few additional thoughts on how data.table and the tidyverse can be combined profitably in your own workflow. --- # Pick and choose The first point is perhaps the most obvious one: The tidyverse consists of *multiple* packages. Just because you prefer to use data.table instead of dplyr+tidyr for your data wrangling needs, doesn't mean that other tidyverse packages are off limits too. Case in point: Almost every hardcore data.table user I know is a hardcore **ggplot2** user too. .pull-left[ .small75[ ```r ## library(ggplot2) # already loaded storms_dt[, .(wind = mean(wind), pressure = mean(pressure), category = first(category)), by = .(name, year, month, day)] %>% ggplot(aes(x = pressure, y = wind, col=category)) + geom_point(alpha = 0.3) + theme_minimal() ``` ]] .pull-right[ <img src="03-datatable_files/figure-html/storms_ggplot-1.png" style="display: block; margin: auto;" /> ] --- # Don't be a fanatic Closely related to the first point: Don't try to shoehorn every problem into a tidyverse or data.table framework. - Recall my [admonition](https://raw.githack.com/uo-ec607/lectures/master/05-tidyverse/05-tidyverse.html#11) from last time: "A combination of tidyverse and base R is often the best solution to a problem." You can add data.table to that list. -- Having worked extensively with both packages, I think it's fair to say that there are things the tidyverse (dplyr+tidyr) does better, and there are things that data.table does better. - If you find a great solution on StackOverflow that uses the "other" package... use it. -- Plus, as I hinted earlier, you *can* use tidyverse verbs on data.tables.<sup>1</sup>
Try yourself: ```r starwars_dt %>% group_by(homeworld) %>% summarise(height = mean(height, na.rm=T)) ``` .footnote[<sup>1</sup> This does [incur](https://stackoverflow.com/a/27513921) a performance penalty. Luckily there's a better solution coming on the next slide...] -- **Bottom line:** Don't be a fanatic. Insisting on ecosystem purity is rarely worth it. --- # dtplyr Do you love dplyr's syntax, but want data.table's performance? -- Well, you're in luck! -- Hadley Wickham's **dtplyr** package provides a data.table "back-end" for dplyr. - Basically, write your code as if you were using dplyr and then it gets automatically translated to (and evaluated as) data.table code. -- If this sounds appealing to you (and it should), I strongly encourage you to check out the [package website](https://dtplyr.tidyverse.org/) for details. But here's a quick example, using our benchmark from earlier. .font80[ ```r # library(dtplyr) ## Already loaded storms_dtplyr = lazy_dt(storms) ## dtplyr requires objects to be set as "lazy" data.tables collapse_dtplyr = function() { storms_dtplyr %>% group_by(name, year, month, day) %>% summarize(wind = mean(wind), pressure = mean(pressure), category = first(category)) %>% as_tibble() } ## Just compare dtplyr with normal dplyr and data.table versions (i.e. no keys) microbenchmark::microbenchmark(collapse_dplyr(), collapse_dt(), collapse_dtplyr(), times = 10) ``` ] .right[*See next slide for results*] --- # dtplyr (cont.) ``` ## Unit: milliseconds ## expr min lq mean median uq ## collapse_dplyr() 97.656834 100.076982 101.764062 100.921357 103.430495 ## collapse_dt() 3.941740 4.043338 4.117860 4.114494 4.137638 ## collapse_dtplyr() 4.508975 4.688842 5.349897 4.808849 5.806420 ## max neval ## 107.331645 10 ## 4.362974 10 ## 7.711854 10 ``` -- Not quite as fast as the native data.table method, but a ~20x speed-up over dplyr for free? I'd take it! -- **Aside:** dtplyr automatically prints its data.table translation to screen.
This can be super helpful when you first come over to data.table from the tidyverse. ```r lazy_dt(starwars) %>% filter(species=="Human") %>% group_by(gender) %>% summarise(height = mean(height, na.rm=TRUE)) ``` ``` ## Source: local data table [2 x 2] ## Call: `_DT2`[species == "Human"][, .(height = mean(height, na.rm = TRUE)), ## keyby = .(gender)] ## ## gender height ## <chr> <dbl> ## 1 feminine 160. ## 2 masculine 182. ## ## # Use as.data.table()/as.data.frame()/as_tibble() to access results ``` --- name: summary class: inverse, center, middle # Summary <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Summary data.table is a powerful data wrangling package that combines concise syntax with incredible performance. It is also very lightweight, despite being feature-rich. The basic syntax is `DT[i, j, by]` - `i` On which rows? - `j` What to do? - `by` Grouped by what? data.table (re)introduces some new ideas like modify by reference (e.g. `:=`), as well as syntax (e.g. `.()`, `.SD`, `.SDcols`, etc.). - All of these ideas support data.table's core goals: Maximise performance and flexibility, whilst maintaining a concise and consistent syntax. They are worth learning. Pro tip: Use keys to order your data and yield dramatic speed-ups. The tidyverse and data.table are often viewed as substitutes, but you can profitably combine both into your workflow... even if you favour one for (most of) your data wrangling needs. --- # Further resources As hard as it may be to believe, there's a ton of data.table features that we didn't cover today. Some, we'll get to in later lectures (e.g. the insanely fast `fread()` and `fwrite()` CSV I/O functions). Others, we won't, but hopefully I've given you enough of a grounding to continue exploring on your own. Here are some recommended further resources: - http://r-datatable.com (Official website. See the vignettes, especially.) - https://github.com/Rdatatable/data.table#cheatsheets (Cheatsheet.)
- https://atrebas.github.io/post/2019-03-03-datatable-dplyr (Really nice, side-by-side comparisons of common data.table and dplyr operations.) - https://brooksandrew.github.io/simpleblog/articles/advanced-data-table/ (Some cool advanced data.table tricks.) And related packages: - https://tysonbarrett.com/tidyfast - https://dtplyr.tidyverse.org --- # PS — Assignment Your next assignment is up. Impress me with your data wrangling skills using either the tidyverse or data.table (or both!). --- class: inverse, center, middle # Next lecture: Webscraping (1) <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>