class: center, middle, inverse, title-slide .title[ # MKT615: Data Storytelling for Marketers ] .subtitle[ ## Lecture 5.2: data.table ] .author[ ### Davide Proserpio ] .institute[ ### Marshall School of Business ] --- <style type="text/css"> @media print { .has-continuation { display: block !important; } } .large4 { font-size: 400% } .large2 { font-size: 200% } .small90 { font-size: 90% } .small75 { font-size: 75% } </style> # Table of contents 1. [Checklist](#check) 2. [Introduction](#intro) 3. [data.table basics](#basics) 4. [Working with rows](#rows) 5. [Working with columns](#columns) 6. [More useful things](#useful) --- # Checklist name:check We'll be using the following packages in today's lecture: - Already installed: **dplyr**, **ggplot2**, **nycflights13** - New: **data.table**, **tidyfast**, **dtplyr**, **microbenchmark**, **gapminder** -- The following code chunk will install (if necessary) and load everything for you. ```r if (!require(pacman)) install.packages('pacman', repos = 'https://cran.rstudio.com') pacman::p_load(dplyr, data.table, dtplyr, tidyfast, microbenchmark, ggplot2, nycflights13) options(dplyr.summarise.inform = FALSE) ## Turn off annoying dplyr group_by messages (optional) ``` --- class: inverse, center, middle name: intro # Introduction --- # Why learn data.table? The **tidyverse** is great. As I keep hinting, it will also provide a bridge to many of the big data tools that we'll encounter later in the course (SQL databases, etc.) -- So why bother learning another data wrangling package/syntax? -- When it comes to **data.table**, I can think of at least five reasons: 1. Concise 2. Insanely fast 3. Memory efficient 4. Feature rich 5. Dependency free -- Before we get into specifics, here are a few examples... --- # Why learn data.table? (cont.) ### 1) Concise These two code chunks do the same thing: ```r # library(dplyr) ## Already loaded # data(starwars, package = "dplyr") ## Optional to bring the DF into the global env starwars %>% filter(species=="Human") %>% group_by(homeworld) %>% summarise(mean_height=mean(height)) ``` vs ```r # library(data.table) ## Already loaded starwars_dt = as.data.table(starwars) starwars_dt[species=="Human", mean(height), by=homeworld] ``` --- name:fast # Why learn data.table? (cont.) .small90[ ```r collapse_dplyr = function() { storms %>% group_by(name, year, month, day) %>% summarize(wind = mean(wind), pressure = mean(pressure), category = dplyr::first(category)) } storms_dt = as.data.table(storms) collapse_dt = function() { storms_dt[, .(wind = mean(wind), pressure = mean(pressure), category = first(category)), by = .(name, year, month, day)] } microbenchmark(collapse_dplyr(), collapse_dt(), times = 1) ``` --- ``` ## Warning in microbenchmark(collapse_dplyr(), collapse_dt(), times = 1): less ## accurate nanosecond times to avoid potential integer overflows ``` ``` ## Unit: milliseconds ## expr min lq mean median uq max ## collapse_dplyr() 87.222826 87.222826 87.222826 87.222826 87.222826 87.222826 ## collapse_dt() 7.113664 7.113664 7.113664 7.113664 7.113664 7.113664 ## neval ## 1 ## 1 ``` ] -- .small90[ **Result:** data.table is ~12x faster (median)! (Thanks to [Keith Head](https://twitter.com/ckhead) for this example.) ] --- class: inverse, center, middle name: basics # data.table basics <!-- <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --> --- # The data.table object data.table functions only work on objects that have been converted to data.tables first. - Beyond simple visual enhancements (similar to tibbles), the specialised internal structure of data.table objects is a key reason why the package is so fast. (More [here](https://rdatatable.gitlab.io/data.table/articles/datatable-intro.html#what-is-datatable-1a) and [here](https://twitter.com/matloff/status/1131372631372918784).) -- To create a data.table, we have a couple of options: - `fread('mydata.csv')` reads a CSV into R as a data.table (and is very fast). - `data.table(x = 1:10)` creates a new data.table from scratch - `as.data.table(d)` or `data.table(d)` coerces an existing data frame (`d`) to a data.table. - `setDT(d)` coerces an existing data frame to a data.table *by reference*; i.e. we don't have to (re)assign it. --- # What does "modify by reference" mean? That last bullet leads us to an important concept that underlies much of data.table's awesomeness: It tries, as much as possible, to *modify by reference*. -- What does this mean? -- We don't have time to go into details here, but the very (very) short version is that there are basically two ways of changing or assigning objects in R. 1. **Copy-on-modify:** Creates a copy of your data. Implies extra computational overhead.<sup>*</sup> 2. **Modify-in-place:** Avoids creating copies and simply changes the data where it sits in memory. .footnote[<sup>*</sup> In truth, we need to distinguish between <i>shallow</i> and <i>deep copies</i>. But that's more than I want you to worry about here.] -- When we say that data.table "modifies by reference", that essentially means it modifies objects in place. This translates to lower memory overhead and faster computation time! -- P.S. Further reading if this stuff interests you: (a) [Reference semantics](https://rdatatable.gitlab.io/data.table/articles/datatable-reference-semantics.html) data.table vignette, (b) [Names and Values](https://adv-r.hadley.nz/names-values.html) section of *Advanced R* (Hadley Wickham), (c) Nice [blog post](https://tysonbarrett.com//jekyll/update/2019/07/12/datatable/) by Tyson Barrett that's accessible to beginners. --- # data.table syntax All data.tables accept the same basic syntax: <center> <h1> DT[<span style='color: #66C2A5;'>i</span>, <span style='color: #FC8D62;'>j</span>, <span style='color: #8DA0CB;'>by</span>] </h1> </center> <center> <h2> <span style='color: #66C2A5;'>On which rows?</span> <span style='color: #FC8D62;'>What to do?</span> <span style='color: #8DA0CB;'>Grouped by what?</span> </h2> </center> -- dplyr "equivalents": <center> <h2> <span style='color: #66C2A5;'>filter(); slice(); arrange()</span> <span style='color: #FC8D62;'>select(); mutate()</span> <span style='color: #8DA0CB;'>group_by()</span> </h2> </center> -- While the tidyverse tends to break up operations step-by-step, data.table aims to do everything in one concise expression. --- # An Example Let's see a quick side-by-side comparison with dplyr, since that will help to orientate us for the rest of the lecture. Using our starwars dataset, say we want to know: .center[**What is the average height of the human characters by gender?**] -- .pull-left[ ### dplyr ```r data(starwars, package = "dplyr") starwars %>% filter(species=="Human") %>% group_by(gender) %>% summarise(mean(height, na.rm=T)) ``` ``` ## # A tibble: 2 × 2 ## gender `mean(height, na.rm = T)` ## <chr> <dbl> ## 1 feminine 164. ## 2 masculine 182. ``` ] .pull-right[ ### data.table ```r starwars_dt = as.data.table(starwars) starwars_dt[ species=="Human", mean(height, na.rm=T), by = gender] ``` ``` ## gender V1 ## <char> <num> ## 1: masculine 182.3913 ## 2: feminine 163.5714 ``` ] --- class: inverse, center, middle name: rows # Working with rows: DT[i, ] <!-- <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --> --- # Subset by rows (filter) Subsetting by rows is very straightforward in data.table. Everything works pretty much the same as you'd expect if you're coming from dplyr. - `DT[x == "string", ]`: Subset to rows where variable x equals "string" - `DT[y > 5, ]`: Subset to rows where variable y is greater than 5 - `DT[1:10, ]`: Subset to the first 10 rows -- Multiple filters/conditions are fine too: - `DT[x=="string" & y>5, ]`: Subset to rows where x is "string" AND y is greater than 5 -- **Note that we don't actually need commas when we're only subsetting on `i` (i.e., no `j` or `by` components).** - `DT[x=="string"]` is equivalent to `DT[x=="string", ]` - `DT[1:10]` is equivalent to `DT[1:10, ]` - etc. --- # Subset by rows (filter) cont. Just to emphasise the point, here's an example of subsetting by rows using our starwars data.table from earlier. ```r starwars_dt[height>190 & species=='Human'] ``` ``` ## name height mass hair_color skin_color eye_color birth_year ## <char> <int> <num> <char> <char> <char> <num> ## 1: Darth Vader 202 136 none white yellow 41.9 ## 2: Qui-Gon Jinn 193 89 brown fair blue 92.0 ## 3: Dooku 193 80 white fair brown 102.0 ## 4: Bail Prestor Organa 191 NA black tan brown 67.0 ## sex gender homeworld species ## <char> <char> <char> <char> ## 1: male masculine Tatooine Human ## 2: male masculine <NA> Human ## 3: male masculine Serenno Human ## 4: male masculine Alderaan Human ## films ## <list> ## 1: A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith ## 2: The Phantom Menace ## 3: Attack of the Clones,Revenge of the Sith ## 4: Attack of the Clones,Revenge of the Sith ## vehicles starships ## <list> <list> ## 1: TIE Advanced x1 ## 2: Tribubble bongo ## 3: Flitknot speeder ## 4: ``` --- # Order by rows (arrange) ```r starwars_dt[order(birth_year)] ## (temporarily) sort by youngest to oldest starwars_dt[order(-birth_year)] ## (temporarily) sort by oldest to youngest ``` -- While ordering as per the above is very straightforward, data.table also provides an optimized `setorder()` function for reordering *by reference*. -- ```r setorder(starwars_dt, birth_year, na.last = TRUE) starwars_dt[1:5, name:birth_year] ## Only print subset to stay on the slide ``` ``` ## name height mass hair_color skin_color eye_color ## <char> <int> <num> <char> <char> <char> ## 1: Wicket Systri Warrick 88 20 brown brown brown ## 2: IG-88 200 140 none metal red ## 3: Luke Skywalker 172 77 blond fair blue ## 4: Leia Organa 150 49 brown light brown ## 5: Wedge Antilles 170 77 brown fair hazel ## birth_year ## <num> ## 1: 8 ## 2: 15 ## 3: 19 ## 4: 19 ## 5: 21 ``` --- class: inverse, center, middle name: cols # Manipulating columns: DT[, j] <!-- <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --> --- # j: One slot to rule them all Recall some of the dplyr verbs that we used to manipulate our variables in different ways: - `select()` - `mutate()` - `summarise()` - `count()` -- data.table recognizes that all of these verbs are just different versions of telling R... >"*Do something to this variable in my dataset*" ... and it let's you do all of those operations in one place: the `j` slot. -- However, this requires a few syntax tweaks w.r.t. how we change and assign variables in our dataset. --- # Modifying columns := To add, delete, or change columns in data.table, we use the **`:=`** operator. -- For example, - `DT[, xsq := x^2]`: Create a new column (`xsq`) from an existing one (`x`) - `DT[, x := as.character(x)]`: Change an existing column -- **Important:** `:=` is *modifying by reference*, i.e. in place. So we don't have to (re)assign the object to save these changes. -- However, we also won't see these changes printed to screen unless we ask R explicitly. ```r DT = data.table(x = 1:2) # DT[, xsq := x^2] ## Modifies in place but doesn't print the result DT[, x_sq := x^2][] ## Adding [] prints the result. ``` ``` ## x x_sq ## <int> <num> ## 1: 1 1 ## 2: 2 4 ``` --- # Modifying columns := (cont.) As I keep saying, *modifying by reference* has important implications for data manipulation. Consider what happens if we copy our data.table and then remove a column. ```r DT_copy = DT DT_copy[, x_sq := NULL] ``` -- Clearly, "x_sq" has been removed from `DT_copy`. But what of the original `DT` table? -- ```r DT ``` ``` ## x ## <int> ## 1: 1 ## 2: 2 ``` -- It has been removed from DT too... exactly as modifying by reference demands. To avoid this behavior, use the explicit [`data.table::copy()`](https://rdatatable.gitlab.io/data.table/reference/copy.html) function. Run this next chunk yourself: ```r DT[, x_sq := x^2] DT_copy = copy(DT) DT_copy[, x_sq := NULL] DT ## x_sq is still there (run and confirm for yourself) ``` --- # Modifying columns := (cont.) ### Sub-assign by reference One really cool implication of `:=` is data.table's [sub-assign by reference](https://rdatatable.gitlab.io/data.table/articles/datatable-reference-semantics.html#ref-i-j) functionality. As a simple example, consider another fake dataset. ```r DT2 = data.table(a = -2:2, b = LETTERS[1:5]) ``` -- Now, imagine we want to locate all rows where "a" is negative and replace the corresponding "b" cell with NA. -- - In dplyr you'd have to do something like `...mutate(b = ifelse(a < 0, NA, b))`. -- - In data.table, simply specify which rows to target (`i`) and then sub-assign (`j`) directly. ```r DT2[a < 0, b := NA][] ## Again, just adding the second [] to print to screen ``` ``` ## a b ## <int> <char> ## 1: -2 <NA> ## 2: -1 <NA> ## 3: 0 C ## 4: 1 D ## 5: 2 E ``` --- # Modifying columns := (cont.) To modify multiple columns simultaneously, we have two options. 1. LHS `:=` RHS form: `DT[, c("var1", "var2") := .(val1, val2)]` 2. Functional form: `DT[, ':=' (var1=val1, var2=val2)]` ```r DT[, ':=' (y = 3:4, y_name = c("three", "four"))] DT ## Another way to print the results instead of appending [] ``` ``` ## x x_sq y y_name ## <int> <num> <int> <char> ## 1: 1 1 3 three ## 2: 2 4 4 four ``` -- Note, however, that dynamically assigning dependent columns in a single step (like we did with dplyr::mutate) doesn't work. ```r DT[, ':=' (z = 5:6, z_sq = z^2)][] ``` ``` ## Error in eval(jsub, SDenv, parent.frame()): object 'z' not found ``` --- # Aside: Piping data.table operations (%>% in tidyverse) However, with data.table you can chain multiple data.table operations together. -- The native data.table way is simply to append consecutive `[]` terms. ```r DT[, z := 5:6][, z_sq := z^2][] ``` ``` ## x x_sq y y_name z z_sq ## <int> <num> <int> <char> <int> <num> ## 1: 1 1 3 three 5 25 ## 2: 2 4 4 four 6 36 ``` -- Alternatively, using `%>%`: ```r DT[, z := 5:6] %>% .[, z_sq := z^2] ``` --- # Subsetting on columns (select) We can also use the `j` slot to subset our data on columns. Let's the starwars dataset for these examples... -- Subset by column position: ```r starwars_dt[1:2, c(1:3, 10)] ``` ``` ## name height mass homeworld ## <char> <int> <num> <char> ## 1: Wicket Systri Warrick 88 20 Endor ## 2: IG-88 200 140 <NA> ``` -- Or by name: ```r # starwars_dt[, c("name", "height", "mass", "homeworld")] ## Also works # starwars_dt[, list(name, height, mass, homeworld)] ## So does this starwars_dt[1:2, .(name, height, mass, homeworld)] ``` ``` ## name height mass homeworld ## <char> <int> <num> <char> ## 1: Wicket Systri Warrick 88 20 Endor ## 2: IG-88 200 140 <NA> ``` --- # Aside: What's with the .()? We've now seen `.()` in a couple places, e.g the previous slide and [this slide](#fast) from earlier if you were paying close attention. - `.()` is just a data.table shortcut for `list()`. We'll be using `.()` quite liberally once we start working subsetting and/or grouping by multiple variables at a time. You can think of it as one of data.table's syntactical quirks. But, really, it's just there to give you more options. You can often — but not always — use these three forms interchangeably in data.table: - `.(var1, var2, ...)` - `list(var1, var2, ...)` - `c("var1", "var2", ...)` --- # Subsetting on columns (cont.) You can also exclude columns through negation. Try this next code chunk yourself: ```r starwars_dt[, !c("name", "height")] ``` --- # Subsetting on columns (cont.) ### Renaming columns You can rename (set) a column by reference. Again, run this yourself: ```r setnames(starwars_dt, old = c("name", "homeworld"), new = c("alias", "crib"))[] ## Better change it back, in case we use "name" or "homeworld" on a later slide setnames(starwars_dt, old = c("alias", "crib"), new = c("name", "homeworld")) ``` --- class: inverse, center, middle name: useful # More useful things --- # A few useful data.table operators - Remove a column from your dataset: `DT[, y_name := NULL]` - Count observations: `DT[, y_name := .N]` - Create an index 1 to N: `DT[, y_name := 1:.N]` - Subset rows between two values: `DT[col %between% c(x, y)]` - Subset rows in values (`%in%` in base R): `DT[col %chin% c(x, y)]` - Calculate summary statistics for a larger list of variables: `mydata[, lapply(.SD, mean), .SDcols = c("arr_delay", "dep_delay")]` (w/o `SDcols` .SD takes all continuous variables) - Remove duplicates based on all variables: ```r setkey(mydata, NULL) unique(mydata) ``` set index to a column(s) to removed duplicates based on this column(s) --- # Creating a new data.table from a data.table Before I showed you that you can do: ```r starwars_dt[ species=="Human", mean(height, na.rm=T), by = gender] ``` ``` ## gender V1 ## <char> <num> ## 1: masculine 182.3913 ## 2: feminine 163.5714 ``` but what if you want to assign a name to **V1**? -- ```r starwars_dt[ species=="Human", list(avg_height = mean(height, na.rm=T)), by = gender] ``` ``` ## gender avg_height ## <char> <num> ## 1: masculine 182.3913 ## 2: feminine 163.5714 ``` --- # $ operator Works in the same way as with data frames, so you can use it to select a column, e.g. ```r starwars_dt$species ``` ``` ## [1] "Ewok" "Droid" "Human" "Human" ## [5] "Human" "Kel Dor" "Human" "Human" ## [9] "Human" "Human" "Droid" "Human" ## [13] "Mirialan" "Mon Calamari" "Human" "Human" ## [17] "Rodian" "Human" "Human" "Human" ## [21] "Twi'lek" "Human" "Gungan" "Trandoshan" ## [25] "Zabrak" "Human" "Mirialan" "Human" ## [29] "Human" "Human" "Human" "Human" ## [33] "Human" "Human" "Human" "Human" ## [37] "Human" "Cerean" "Human" "Droid" ## [41] "Wookiee" "Hutt" "Yoda's species" "Droid" ## [45] NA "Human" "Sullustan" "Neimodian" ## [49] "Gungan" "Gungan" "Human" "Toydarian" ## [53] "Dug" "Twi'lek" "Aleena" "Vulptereen" ## [57] "Xexto" "Toong" "Nautolan" "Zabrak" ## [61] "Tholothian" "Iktotchi" "Quermian" "Chagrian" ## [65] NA NA "Geonosian" "Human" ## [69] "Clawdite" "Besalisk" "Kaminoan" "Kaminoan" ## [73] "Human" "Droid" "Skakoan" "Muun" ## [77] "Togruta" "Kaleesh" "Wookiee" "Human" ## [81] NA "Pau'an" "Human" "Human" ## [85] "Human" "Droid" "Human" ``` --- # Some exercises with the starwars dataset Recall the data.table version is called starwars_dt 1. Count the number of male and female characters 2. Select rows with missing `hair_color` 3. Select all non-human rows which `name` includes the string `R2` 4. Sort the data by `name` and `height` --- # Joins with data.table: Merge Sintax: ```r merge(dt1, dt2, by (or by.x and by.y when column names do not match), all=F (default, inner join, T is for full join), all.x=T (left join), all.y=T (right join)) ``` -- Alternatively, you can perform inner joins in this way: `X[Y]` but you need to set the `key` (i.e., the variable to join on) Let's see an example using the flights and planes data.tables: ```r flights_dt = data.table(flights) planes_dt = data.table(planes) setkey(flights_dt, "tailnum") setkey(planes_dt, "tailnum") test = flights_dt[planes_dt] ``` --- # Join exercises Remember to convert the dataframes to data.tables 1. Count flights that don't have matching planes <!-- planes_dt = data.table(planes) --> <!-- flights_dt = data.table(flights) --> <!-- out = merge(flights_dt, planes_dt, by = "tailnum", all.x=T) --> <!-- nrow(out[is.na(model)]) --> 2. Create a table containing flights number, origin, origin lat, origin lon <!-- airports_dt = data.table(airports) --> <!-- out = merge(flights_dt, airports_dt, by.x = 'origin', by.y = 'faa') --> <!-- out[, c("flight", "origin", "lat", "lon")] --> 3. Compute average temperature by flight origin <!-- weather_dt= data.table(weather) --> <!-- out = merge(flights_dt, weather_dt, by = c("origin", "year", "month", "day", "hour")) --> <!-- out[, list(avg_temp = mean(temp, na.rm=T)), by = c("origin")] --> 4. Find coldest Airport <!-- weather_dt= data.table(weather) --> <!-- out = merge(flights_dt, weather_dt, by = c("origin", "year", "month", "day", "hour")) --> <!-- out[, list(avg_temp = mean(temp, na.rm=T)), by = c("origin")] --> <!-- out[, list(avg_temp = mean(temp, na.rm=T)), by = c("origin")][order(avg_temp)] --> --- # Melt (wide to long) ```r stocks = data.table( time = as.Date('2009-01-01') + 0:1, X = rnorm(2, 0, 1), Y = rnorm(2, 0, 2), Z = rnorm(2, 0, 4) ) stocks ``` ``` ## time X Y Z ## <Date> <num> <num> <num> ## 1: 2009-01-01 0.870211 -2.2472386 4.117511 ## 2: 2009-01-02 0.407793 0.6812328 2.927533 ``` -- ```r melt(stocks, id = "time", measure = c("X", "Y", "Z"), value.name = "price") ``` ``` ## time variable price ## <Date> <fctr> <num> ## 1: 2009-01-01 X 0.8702110 ## 2: 2009-01-02 X 0.4077930 ## 3: 2009-01-01 Y -2.2472386 ## 4: 2009-01-02 Y 0.6812328 ## 5: 2009-01-01 Z 4.1175113 ## 6: 2009-01-02 Z 2.9275330 ``` --- # Dcast (long to wide) ```r stocks_long = melt(stocks, id = "time", measure = c("X", "Y", "Z"), value.name = "price") ``` ```r dcast(stocks_long, time ~ variable, value.var = "price") ``` ``` ## Key: <time> ## time X Y Z ## <Date> <num> <num> <num> ## 1: 2009-01-01 0.870211 -2.2472386 4.117511 ## 2: 2009-01-02 0.407793 0.6812328 2.927533 ``` --- # Some more exercises Let's use the gapminder dataset part of the gapminder package (it contains panel data on life expectancy, population size, and GDP per capita for 142 countries since the 1950s.) ```r library(gapminder) gp_dt = data.table(gapminder) ``` 1. How many different countries are covered by the data? (hint: look how to use `uniqueN`) <!-- gp[, uniqueN(country)] --> 2. Create a lag variable for GDP per capita. (hint: look how to use `shift`) <!-- gp_dt[, gdpPercap_l1 := shift(gdpPercap), by = country] --> <!-- head(gp_dt) --> 3. Calculate the GDP per capita growth from 2002 to 2007 for each country. Extract the one with the highest value for each continent. <!-- gp_dt[year == 2007, list(gdp_growth = (gdpPercap/gdpPercap_l1)-1), by = list(country, continent)] %>% --> <!-- .[, max_gdp_growth:=max(gdp_growth, na.rm =T), by = continent] %>% --> <!-- .[max_gdp_growth == gdp_growth] %>% --> <!-- .[order(-gdp_growth)] --> 4. Calculate the number of years since the country reached $8k in GDP per capita at each relevant observation as accurately as the data allows (set years before gdp is greater or equal to 8k to NA) (hint: use the `which` function) <!-- gp_dt = [order(country, year)] --> <!-- # Years from first 8k --> <!-- tt = gp_dt[, years_from8k := year - year[which(gdpPercap >= 8000)[1]], country][years_from8k < 0, years_from8k := NA] --> <!-- tt -->