MKT615: Data Storytelling for Marketers

class: center, middle, inverse, title-slide

.title[
# MKT615: Data Storytelling for Marketers
]
.subtitle[
## Lecture 5.2: data.table
]
.author[
### Davide Proserpio
]
.institute[
### Marshall School of Business
]

---

# Table of contents

1. [Checklist](#check)
2. [Introduction](#intro)
3. [data.table basics](#basics)
4. [Working with rows](#rows)
5. [Working with columns](#columns)
6. [More useful things](#useful)

---

# Checklist

name:check

We'll be using the following packages in today's lecture:
- Already installed: **dplyr**, **ggplot2**, **nycflights13**
- New: **data.table**, **tidyfast**, **dtplyr**, **microbenchmark**, **gapminder**

The following code chunk will install (if necessary) and load everything for you.

```r
if (!require(pacman)) install.packages('pacman', repos = 'https://cran.rstudio.com')
pacman::p_load(dplyr, data.table, dtplyr, tidyfast, microbenchmark, ggplot2, nycflights13)
options(dplyr.summarise.inform = FALSE) ## Turn off annoying dplyr group_by messages (optional)
```

---
class: inverse, center, middle
name: intro

# Introduction

---

# Why learn data.table?

The **tidyverse** is great. As I keep hinting, it will also provide a bridge to many of the big data tools that we'll encounter later in the course (SQL databases, etc.)

So why bother learning another data wrangling package/syntax?

When it comes to **data.table**, I can think of at least five reasons:

1. Concise
2. Insanely fast
3. Memory efficient
4. Feature rich
5. Dependency free

Before we get into specifics, here are a few examples...

---

# Why learn data.table? (cont.)

### 1) Concise

These two code chunks do the same thing:

```r
# library(dplyr) ## Already loaded
# data(starwars, package = "dplyr") ## Optional to bring the DF into the global env
starwars %>% 
  filter(species=="Human") %>% 
  group_by(homeworld) %>% 
  summarise(mean_height=mean(height)) 
```
vs
```r
# library(data.table) ## Already loaded
starwars_dt = as.data.table(starwars)
starwars_dt[species=="Human", mean(height), by=homeworld]
```

---
name:fast

# Why learn data.table? (cont.)

.small90[

```r
collapse_dplyr = function() {
  storms %>%
    group_by(name, year, month, day) %>% 
    summarize(wind = mean(wind), pressure = mean(pressure), category = dplyr::first(category))
}

storms_dt = as.data.table(storms)
collapse_dt = function() {
  storms_dt[, .(wind = mean(wind), pressure = mean(pressure), category = first(category)),
            by = .(name, year, month, day)]
  }
microbenchmark(collapse_dplyr(), collapse_dt(), times = 1)
```
---

```
## Warning in microbenchmark(collapse_dplyr(), collapse_dt(), times = 1): less
## accurate nanosecond times to avoid potential integer overflows
```

```
## Unit: milliseconds
##              expr       min        lq      mean    median        uq       max
##  collapse_dplyr() 87.222826 87.222826 87.222826 87.222826 87.222826 87.222826
##     collapse_dt()  7.113664  7.113664  7.113664  7.113664  7.113664  7.113664
##  neval
##      1
##      1
```
]

.small90[
**Result:** data.table is ~12x faster (median)! (Thanks to [Keith Head](https://twitter.com/ckhead) for this example.)
]

---
class: inverse, center, middle
name: basics

# data.table basics

---

# The data.table object

data.table functions only work on objects that have been converted to data.tables first.
- Beyond simple visual enhancements (similar to tibbles), the specialised internal structure of data.table objects is a key reason why the package is so fast. (More [here](https://rdatatable.gitlab.io/data.table/articles/datatable-intro.html#what-is-datatable-1a) and [here](https://twitter.com/matloff/status/1131372631372918784).)

To create a data.table, we have a couple of options:
- `fread('mydata.csv')` reads a CSV into R as a data.table (and is very fast).
- `data.table(x = 1:10)` creates a new data.table from scratch
- `as.data.table(d)` or `data.table(d)` coerces an existing data frame (`d`) to a data.table.
- `setDT(d)` coerces an existing data frame to a data.table *by reference*; i.e. we don't have to (re)assign it.

---

# What does "modify by reference" mean?

That last bullet leads us to an important concept that underlies much of data.table's awesomeness: It tries, as much as possible, to *modify by reference*.

What does this mean?

--
We don't have time to go into details here, but the very (very) short version is that there are basically two ways of changing or assigning objects in R.
1. **Copy-on-modify:** Creates a copy of your data. Implies extra computational overhead.<sup>*</sup>
2. **Modify-in-place:** Avoids creating copies and simply changes the data where it sits in memory.

.footnote[<sup>*</sup> In truth, we need to distinguish between <i>shallow</i> and <i>deep copies</i>. But that's more than I want you to worry about here.]

When we say that data.table "modifies by reference", that essentially means it modifies objects in place. This translates to lower memory overhead and faster computation time!

P.S. Further reading if this stuff interests you: (a) [Reference semantics](https://rdatatable.gitlab.io/data.table/articles/datatable-reference-semantics.html) data.table vignette, (b) [Names and Values](https://adv-r.hadley.nz/names-values.html) section of *Advanced R* (Hadley Wickham), (c) Nice [blog post](https://tysonbarrett.com//jekyll/update/2019/07/12/datatable/) by Tyson Barrett that's accessible to beginners.

---

# data.table syntax

All data.tables accept the same basic syntax:

<center>
<h2>
<span style='color: #66C2A5;'>On which rows?</span>&emsp;<span style='color: #FC8D62;'>What to do?</span>&emsp;<span style='color: #8DA0CB;'>Grouped by what?</span>
</h2>
</center>

dplyr "equivalents":

<center>
<h2>
<span style='color: #66C2A5;'>filter(); slice(); arrange()</span>&emsp;<span style='color: #FC8D62;'>select(); mutate()</span>&emsp;<span style='color: #8DA0CB;'>group_by()</span>
</h2>
</center>

While the tidyverse tends to break up operations step-by-step, data.table aims to do everything in one concise expression.

---

# An Example

Let's see a quick side-by-side comparison with dplyr, since that will help to orientate us for the rest of the lecture. Using our starwars dataset, say we want to know:

.center[**What is the average height of the human characters by gender?**]
--

.pull-left[
### dplyr

```r
data(starwars, package = "dplyr")
starwars %>%
  filter(species=="Human") %>%
  group_by(gender) %>%
  summarise(mean(height, na.rm=T))
```

```
## # A tibble: 2 × 2
##   gender    `mean(height, na.rm = T)`
##   <chr>                         <dbl>
## 1 feminine                       164.
## 2 masculine                      182.
```

]

.pull-right[
### data.table

```r
starwars_dt = as.data.table(starwars)
starwars_dt[
  species=="Human", 
  mean(height, na.rm=T), 
  by = gender]
```

```
##       gender       V1
##       <char>    <num>
## 1: masculine 182.3913
## 2:  feminine 163.5714
```
]

---
class: inverse, center, middle

name: rows

# Working with rows: DT[i, ]

---

# Subset by rows (filter)

Subsetting by rows is very straightforward in data.table. Everything works pretty much the same as you'd expect if you're coming from dplyr.

- `DT[x == "string", ]`: Subset to rows where variable x equals "string"

- `DT[y > 5, ]`: Subset to rows where variable y is greater than 5

- `DT[1:10, ]`: Subset to the first 10 rows

Multiple filters/conditions are fine too:

- `DT[x=="string" & y>5, ]`: Subset to rows where x is "string" AND y is greater than 5

**Note that we don't actually need commas when we're only subsetting on `i` (i.e., no `j` or `by` components).**
- `DT[x=="string"]` is equivalent to `DT[x=="string", ]`
- `DT[1:10]` is equivalent to `DT[1:10, ]`
- etc.

---

# Subset by rows (filter) cont.

Just to emphasise the point, here's an example of subsetting by rows using our starwars data.table from earlier.

```r
starwars_dt[height>190 & species=='Human']
```

```
##                   name height  mass hair_color skin_color eye_color birth_year
##                 <char>  <int> <num>     <char>     <char>    <char>      <num>
## 1:         Darth Vader    202   136       none      white    yellow       41.9
## 2:        Qui-Gon Jinn    193    89      brown       fair      blue       92.0
## 3:               Dooku    193    80      white       fair     brown      102.0
## 4: Bail Prestor Organa    191    NA      black        tan     brown       67.0
##       sex    gender homeworld species
##    <char>    <char>    <char>  <char>
## 1:   male masculine  Tatooine   Human
## 2:   male masculine      <NA>   Human
## 3:   male masculine   Serenno   Human
## 4:   male masculine  Alderaan   Human
##                                                                        films
##                                                                       <list>
## 1: A New Hope,The Empire Strikes Back,Return of the Jedi,Revenge of the Sith
## 2:                                                        The Phantom Menace
## 3:                                  Attack of the Clones,Revenge of the Sith
## 4:                                  Attack of the Clones,Revenge of the Sith
##            vehicles       starships
##              <list>          <list>
## 1:                  TIE Advanced x1
## 2:  Tribubble bongo                
## 3: Flitknot speeder                
## 4:
```

---

# Order by rows (arrange)

```r
starwars_dt[order(birth_year)]  ## (temporarily) sort by youngest to oldest
starwars_dt[order(-birth_year)] ## (temporarily) sort by oldest to youngest
```

While ordering as per the above is very straightforward, data.table also provides an optimized `setorder()` function for reordering *by reference*.

```r
setorder(starwars_dt, birth_year, na.last = TRUE)
starwars_dt[1:5, name:birth_year] ## Only print subset to stay on the slide
```

```
##                     name height  mass hair_color skin_color eye_color
##                   <char>  <int> <num>     <char>     <char>    <char>
## 1: Wicket Systri Warrick     88    20      brown      brown     brown
## 2:                 IG-88    200   140       none      metal       red
## 3:        Luke Skywalker    172    77      blond       fair      blue
## 4:           Leia Organa    150    49      brown      light     brown
## 5:        Wedge Antilles    170    77      brown       fair     hazel
##    birth_year
##         <num>
## 1:          8
## 2:         15
## 3:         19
## 4:         19
## 5:         21
```

---
class: inverse, center, middle
name: cols

# Manipulating columns: DT[, j]

---

# j: One slot to rule them all

Recall some of the dplyr verbs that we used to manipulate our variables in different ways:

- `select()`
- `mutate()`
- `summarise()`
- `count()`

data.table recognizes that all of these verbs are just different versions of telling R... 
>"*Do something to this variable in my dataset*"
... and it let's you do all of those operations in one place: the `j` slot.

However, this requires a few syntax tweaks w.r.t. how we change and assign variables in our dataset.

---

# Modifying columns :=

To add, delete, or change columns in data.table, we use the **`:=`** operator.

For example,

- `DT[, xsq := x^2]`: Create a new column (`xsq`) from an existing one (`x`)
- `DT[, x := as.character(x)]`: Change an existing column

**Important:** `:=` is *modifying by reference*, i.e. in place. So we don't have to (re)assign the object to save these changes.

However, we also won't see these changes printed to screen unless we ask R explicitly.

```r
DT = data.table(x = 1:2)
# DT[, xsq := x^2] ## Modifies in place but doesn't print the result
DT[, x_sq := x^2][] ## Adding [] prints the result.
```

```
##        x  x_sq
##    <int> <num>
## 1:     1     1
## 2:     2     4
```

---

# Modifying columns := (cont.)

As I keep saying, *modifying by reference* has important implications for data manipulation. Consider what happens if we copy our data.table and then remove a column.

```r
DT_copy = DT
DT_copy[, x_sq := NULL]
```

Clearly, "x_sq" has been removed from `DT_copy`. But what of the original `DT` table?

```r
DT
```

```
##        x
##    <int>
## 1:     1
## 2:     2
```

It has been removed from DT too... exactly as modifying by reference demands. To avoid this behavior, use the explicit [`data.table::copy()`](https://rdatatable.gitlab.io/data.table/reference/copy.html) function. Run this next chunk yourself:

```r
DT[, x_sq := x^2]     
DT_copy = copy(DT)  
DT_copy[, x_sq := NULL]
DT ## x_sq is still there (run and confirm for yourself)
```

---

# Modifying columns := (cont.)

### Sub-assign by reference

One really cool implication of `:=` is data.table's [sub-assign by reference](https://rdatatable.gitlab.io/data.table/articles/datatable-reference-semantics.html#ref-i-j) functionality. As a simple example, consider another fake dataset.

```r
DT2 = data.table(a = -2:2, b = LETTERS[1:5])
```

Now, imagine we want to locate all rows where "a" is negative and replace the corresponding "b" cell with NA.

- In dplyr you'd have to do something like `...mutate(b = ifelse(a < 0, NA, b))`.
--

- In data.table, simply specify which rows to target (`i`) and then sub-assign (`j`) directly.

```r
DT2[a < 0, b := NA][] ## Again, just adding the second [] to print to screen
```

```
##        a      b
##    <int> <char>
## 1:    -2   <NA>
## 2:    -1   <NA>
## 3:     0      C
## 4:     1      D
## 5:     2      E
```

---

# Modifying columns := (cont.)

To modify multiple columns simultaneously, we have two options.

1. LHS `:=` RHS form: `DT[, c("var1", "var2") := .(val1, val2)]`

2. Functional form: `DT[, ':=' (var1=val1, var2=val2)]`

```r
DT[, ':=' (y = 3:4, y_name = c("three", "four"))]
DT ## Another way to print the results instead of appending []
```

```
##        x  x_sq     y y_name
##    <int> <num> <int> <char>
## 1:     1     1     3  three
## 2:     2     4     4   four
```

Note, however, that dynamically assigning dependent columns in a single step (like we did with dplyr::mutate) doesn't work.

```r
DT[, ':=' (z = 5:6, z_sq = z^2)][]
```

```
## Error in eval(jsub, SDenv, parent.frame()): object 'z' not found
```

---

# Aside: Piping data.table operations (%>% in tidyverse)

However, with data.table you can chain multiple data.table operations together.

The native data.table way is simply to append consecutive `[]` terms.

```r
DT[, z := 5:6][, z_sq := z^2][] 
```

```
##        x  x_sq     y y_name     z  z_sq
##    <int> <num> <int> <char> <int> <num>
## 1:     1     1     3  three     5    25
## 2:     2     4     4   four     6    36
```

--
Alternatively, using `%>%`:

```r
DT[, z := 5:6] %>%
  .[, z_sq := z^2] 
```

---

# Subsetting on columns (select)

We can also use the `j` slot to subset our data on columns. Let's the starwars dataset for these examples...

Subset by column position:

```r
starwars_dt[1:2, c(1:3, 10)]
```

```
##                     name height  mass homeworld
##                   <char>  <int> <num>    <char>
## 1: Wicket Systri Warrick     88    20     Endor
## 2:                 IG-88    200   140      <NA>
```

--
Or by name:

```r
# starwars_dt[, c("name", "height", "mass", "homeworld")] ## Also works
# starwars_dt[, list(name, height, mass, homeworld)] ## So does this
starwars_dt[1:2, .(name, height, mass, homeworld)]
```

---

# Aside: What's with the .()?

We've now seen `.()` in a couple places, e.g the previous slide and [this slide](#fast) from earlier if you were paying close attention.
- `.()` is just a data.table shortcut for `list()`.

We'll be using `.()` quite liberally once we start working subsetting and/or grouping by multiple variables at a time.

You can think of it as one of data.table's syntactical quirks. But, really, it's just there to give you more options. You can often &mdash; but not always &mdash; use these three forms interchangeably in data.table:
- `.(var1, var2, ...)`
- `list(var1, var2, ...)`
- `c("var1", "var2", ...)`

---

# Subsetting on columns (cont.)

You can also exclude columns through negation. Try this next code chunk yourself:

```r
starwars_dt[, !c("name", "height")]
```

---
# Subsetting on columns (cont.)

### Renaming columns

You can rename (set) a column by reference. Again, run this yourself:

```r
setnames(starwars_dt, old = c("name", "homeworld"), new = c("alias", "crib"))[]
## Better change it back, in case we use "name" or "homeworld" on a later slide
setnames(starwars_dt, old = c("alias", "crib"), new = c("name", "homeworld"))
```

---
class: inverse, center, middle
name: useful

# More useful things

---

# A few useful data.table operators

- Remove a column from your dataset:    `DT[, y_name := NULL]`

- Count observations: `DT[, y_name := .N]`

- Create an index 1 to N: `DT[, y_name := 1:.N]`

- Subset rows between two values: `DT[col %between% c(x, y)]`

- Subset rows in values (`%in%` in base R): `DT[col %chin% c(x, y)]`

- Calculate summary statistics for a larger list of variables: `mydata[, lapply(.SD, mean), .SDcols = c("arr_delay", "dep_delay")]`
  
  (w/o `SDcols` .SD takes all continuous variables)
  
- Remove duplicates based on all variables:
  
  ```r
  setkey(mydata, NULL)
  unique(mydata)
  ```
  set index to a column(s) to removed duplicates based on this column(s)
---

# Creating a new data.table from a data.table

Before I showed you that you can do:

```r
starwars_dt[
  species=="Human", 
  mean(height, na.rm=T), 
  by = gender]
```

```
##       gender       V1
##       <char>    <num>
## 1: masculine 182.3913
## 2:  feminine 163.5714
```

but what if you want to assign a name to **V1**?
--

```r
starwars_dt[
  species=="Human", 
  list(avg_height = mean(height, na.rm=T)), 
  by = gender]
```

```
##       gender avg_height
##       <char>      <num>
## 1: masculine   182.3913
## 2:  feminine   163.5714
```

---
# $ operator

Works in the same way as with data frames, so you can use it to select a column, e.g.

```r
starwars_dt$species
```

```
##  [1] "Ewok"           "Droid"          "Human"          "Human"         
##  [5] "Human"          "Kel Dor"        "Human"          "Human"         
##  [9] "Human"          "Human"          "Droid"          "Human"         
## [13] "Mirialan"       "Mon Calamari"   "Human"          "Human"         
## [17] "Rodian"         "Human"          "Human"          "Human"         
## [21] "Twi'lek"        "Human"          "Gungan"         "Trandoshan"    
## [25] "Zabrak"         "Human"          "Mirialan"       "Human"         
## [29] "Human"          "Human"          "Human"          "Human"         
## [33] "Human"          "Human"          "Human"          "Human"         
## [37] "Human"          "Cerean"         "Human"          "Droid"         
## [41] "Wookiee"        "Hutt"           "Yoda's species" "Droid"         
## [45] NA               "Human"          "Sullustan"      "Neimodian"     
## [49] "Gungan"         "Gungan"         "Human"          "Toydarian"     
## [53] "Dug"            "Twi'lek"        "Aleena"         "Vulptereen"    
## [57] "Xexto"          "Toong"          "Nautolan"       "Zabrak"        
## [61] "Tholothian"     "Iktotchi"       "Quermian"       "Chagrian"      
## [65] NA               NA               "Geonosian"      "Human"         
## [69] "Clawdite"       "Besalisk"       "Kaminoan"       "Kaminoan"      
## [73] "Human"          "Droid"          "Skakoan"        "Muun"          
## [77] "Togruta"        "Kaleesh"        "Wookiee"        "Human"         
## [81] NA               "Pau'an"         "Human"          "Human"         
## [85] "Human"          "Droid"          "Human"
```

---

# Some exercises with the starwars dataset

Recall the data.table version is called starwars_dt

1. Count the number of male and female characters

2. Select rows with missing `hair_color`

3. Select all non-human rows which `name` includes the string `R2`

4. Sort the data by `name` and `height`

---

# Joins with data.table: Merge

Sintax: 
  
  ```r
  merge(dt1, dt2,
              by (or by.x and by.y when column names do not match),
              all=F (default, inner join, T is for full join), all.x=T (left join), all.y=T (right join))
  ```
--

Alternatively, you can perform inner joins in this way:
  `X[Y]` but  you need to set the `key` (i.e., the variable to join on)

Let's see an example using the flights and planes data.tables:

```r
flights_dt = data.table(flights)
planes_dt = data.table(planes)
setkey(flights_dt, "tailnum")
setkey(planes_dt, "tailnum")
test = flights_dt[planes_dt]
```

---
# Join exercises

Remember to convert the dataframes to data.tables

1. Count flights that don't have matching planes




2. Create a table containing flights number, origin, origin lat, origin lon

3. Compute average temperature by flight origin

4. Find coldest Airport

---

# Melt (wide to long)

```r
stocks = data.table(
  time = as.Date('2009-01-01') + 0:1,
  X = rnorm(2, 0, 1),
  Y = rnorm(2, 0, 2),
  Z = rnorm(2, 0, 4)
  )
stocks
```

```
##          time        X          Y        Z
##        <Date>    <num>      <num>    <num>
## 1: 2009-01-01 0.870211 -2.2472386 4.117511
## 2: 2009-01-02 0.407793  0.6812328 2.927533
```
--

```r
melt(stocks, id = "time", measure = c("X", "Y", "Z"), value.name = "price")
```

```
##          time variable      price
##        <Date>   <fctr>      <num>
## 1: 2009-01-01        X  0.8702110
## 2: 2009-01-02        X  0.4077930
## 3: 2009-01-01        Y -2.2472386
## 4: 2009-01-02        Y  0.6812328
## 5: 2009-01-01        Z  4.1175113
## 6: 2009-01-02        Z  2.9275330
```

---

# Dcast (long to wide)

```r
stocks_long = melt(stocks, id = "time", measure = c("X", "Y", "Z"), value.name = "price")
```

```r
dcast(stocks_long, time ~ variable, value.var = "price")
```

```
## Key: <time>
##          time        X          Y        Z
##        <Date>    <num>      <num>    <num>
## 1: 2009-01-01 0.870211 -2.2472386 4.117511
## 2: 2009-01-02 0.407793  0.6812328 2.927533
```

---

# Some more exercises

Let's use the gapminder dataset part of the gapminder package (it contains panel data on life expectancy, population size, and GDP per capita for 142 countries since the 1950s.)

```r
  library(gapminder)
  gp_dt = data.table(gapminder)
```

1. How many different countries are covered by the data? (hint: look how to use `uniqueN`)

2. Create a lag variable for GDP per capita. (hint: look how to use `shift`)

3. Calculate the GDP per capita growth from 2002 to 2007 for each country. Extract the one with the highest value for each continent.

4. Calculate the number of years since the country reached $8k in GDP per capita at each relevant observation as accurately as the data allows (set years before gdp is greater or equal to 8k to NA) (hint: use the `which` function)