Blazing fast binary search in parallel with data.table

2019-02-27

Introduction

About me: Justin M. Shea

Almost two decades of experience in Derivatives markets
Visiting Professor of Finance Roosevelt University
Organizer of The Chicago R User Group
Committee member R/Finance 2019
Author of the wooldridge R package, the official R package for “Introductory Econometrics” by Jeffrey M. Wooldridge.
Author of the neverhpfilter R package, implementing James Hamilton’s alternative to the Hodrick-Prescott filter.
Google summer of code mentor 2019, R and Finance applications?

Why data.table

library(data.table)

concise syntax: fast to type, fast to read
fast speed
memory efficient
careful API lifecycle management
community
feature rich

https://github.com/Rdatatable/data.table

Benchmarks

Why I need data.table

Problem: Model every single equity traded in the United States since 1926.

34000+ securities
Daily observations from 1926-to-present
Survivorship Bias free

Import Data

## Import Data Actual
library(data.table)

CRSP_2659 <- fread("/media/jshea/data/CRSPdailyValue-1926-1959.csv")
CRSP_6080 <- fread("/media/jshea/data/CRSPdailyValue-1960-1979.csv")
CRSP_8000 <- fread("/media/jshea/data/CRSPdailyValue-1980-1999.csv")
CRSP_0019 <- fread("/media/jshea/data/CRSPdailyValue-2000-2019.csv")

CRSP <- rbind(CRSP_2659, CRSP_6080, CRSP_8000, CRSP_0019)
rm(CRSP_2659, CRSP_6080, CRSP_8000, CRSP_0019)

## Format Data types 

CRSP$date <- as.Date(CRSP$date, "%Y/%m/%d")
CRSP$RET <- as.numeric(CRSP$RET)

# Remove missing values
sum(is.na(CRSP))
CRSP <- na.omit(CRSP)

A 1 year sample: 100 seconds!

CRSP_Betas <- matrix(c(9999, 0, 1, 0, 1, 100, 1, 23625), nrow = 1)

 for(permno in unique(CRSP$PERMNO)) {
    
      factor4 <- glm(RET ~ 0 + Q1 + Q2 + Q3 + Q4 + RmQ1 + RmQ2 + RmQ3 + RmQ4, 
                     data = CRSP[permno])
    
    betarow <- matrix(c(permno, alpha, beta, summary(factor4)$sigma, 
                        summary(factor4)$adj.r.squared, 
                        summary(factor4)$fstatistic), nrow = 1)
          
          if(dim(betarow)[2] == 8) CRSP_Betas <- rbind(CRSP_Betas, betarow)
      
            }

colnames(CRSP_Betas) <- c("PERMNO", "Alpha", "Beta", "sigma", "Adj.R2", "Fstat", 
                          "numeratorDF", "denominatorDF")

What to do???

From setindex documentation

binary search and joins are faster when they detect they can use an existing key
grouping by a leading subset of the key columns is faster because the groups are already gathered contiguously in RAM
simpler shorter syntax; e.g. DT[“id”,] finds the group “id” in the first column of DT’s key using binary search. It may be helpful to think of a key as super-charged rownames: multi-column and multi-type rownames

setindex doesn’t physically reorder the entire data.table in RAM. Instead, it only computes the order for the set of columns provided and stores that order vector in an additional attribute called index.

94 years: 121 seconds!

With one line of code and a bracket!

setindex(CRSP, PERMNO)

CRSP_Betas <- matrix(c(9999, 0, 1, 0, 1, 100, 1, 23625), nrow = 1)

 for(permno in unique(CRSP$PERMNO)) {
    
      factor4 <- glm(RET ~ 0 + Q1 + Q2 + Q3 + Q4 + RmQ1 + RmQ2 + RmQ3 + RmQ4, 
                     data = CRSP[.(permno)])
    
    betarow <- matrix(c(permno, alpha, beta, summary(factor4)$sigma, 
                        summary(factor4)$adj.r.squared, 
                        summary(factor4)$fstatistic), nrow = 1)
          
          if(dim(betarow)[2] == 8) CRSP_Betas <- rbind(CRSP_Betas, betarow)
      
            }

colnames(CRSP_Betas) <- c("PERMNO", "Alpha", "Beta", "sigma", "Adj.R2", "Fstat", 
                          "numeratorDF", "denominatorDF")