2019-02-27

Introduction

About me: Justin M. Shea

  • Almost two decades of experience in Derivatives markets

  • Visiting Professor of Finance Roosevelt University

  • Organizer of The Chicago R User Group

  • Committee member R/Finance 2019

  • Author of the wooldridge R package, the official R package for “Introductory Econometrics” by Jeffrey M. Wooldridge.

  • Author of the neverhpfilter R package, implementing James Hamilton’s alternative to the Hodrick-Prescott filter.

  • Google summer of code mentor 2019, R and Finance applications?

Why data.table

Benchmarks

Why I need data.table

Problem: Model every single equity traded in the United States since 1926.

  • 34000+ securities

  • Daily observations from 1926-to-present

  • Survivorship Bias free

Import Data

## Import Data Actual
library(data.table)

CRSP_2659 <- fread("/media/jshea/data/CRSPdailyValue-1926-1959.csv")
CRSP_6080 <- fread("/media/jshea/data/CRSPdailyValue-1960-1979.csv")
CRSP_8000 <- fread("/media/jshea/data/CRSPdailyValue-1980-1999.csv")
CRSP_0019 <- fread("/media/jshea/data/CRSPdailyValue-2000-2019.csv")

CRSP <- rbind(CRSP_2659, CRSP_6080, CRSP_8000, CRSP_0019)
rm(CRSP_2659, CRSP_6080, CRSP_8000, CRSP_0019)

## Format Data types 

CRSP$date <- as.Date(CRSP$date, "%Y/%m/%d")
CRSP$RET <- as.numeric(CRSP$RET)

# Remove missing values
sum(is.na(CRSP))
CRSP <- na.omit(CRSP)

A 1 year sample: 100 seconds!

CRSP_Betas <- matrix(c(9999, 0, 1, 0, 1, 100, 1, 23625), nrow = 1)

 for(permno in unique(CRSP$PERMNO)) {
    
      factor4 <- glm(RET ~ 0 + Q1 + Q2 + Q3 + Q4 + RmQ1 + RmQ2 + RmQ3 + RmQ4, 
                     data = CRSP[permno])
    
    betarow <- matrix(c(permno, alpha, beta, summary(factor4)$sigma, 
                        summary(factor4)$adj.r.squared, 
                        summary(factor4)$fstatistic), nrow = 1)
          
          if(dim(betarow)[2] == 8) CRSP_Betas <- rbind(CRSP_Betas, betarow)
      
            }

colnames(CRSP_Betas) <- c("PERMNO", "Alpha", "Beta", "sigma", "Adj.R2", "Fstat", 
                          "numeratorDF", "denominatorDF")

What to do???

From setindex documentation

  1. binary search and joins are faster when they detect they can use an existing key

  2. grouping by a leading subset of the key columns is faster because the groups are already gathered contiguously in RAM

  3. simpler shorter syntax; e.g. DT[“id”,] finds the group “id” in the first column of DT’s key using binary search. It may be helpful to think of a key as super-charged rownames: multi-column and multi-type rownames

setindex doesn’t physically reorder the entire data.table in RAM. Instead, it only computes the order for the set of columns provided and stores that order vector in an additional attribute called index.

94 years: 121 seconds!

With one line of code and a bracket!

setindex(CRSP, PERMNO)

CRSP_Betas <- matrix(c(9999, 0, 1, 0, 1, 100, 1, 23625), nrow = 1)

 for(permno in unique(CRSP$PERMNO)) {
    
      factor4 <- glm(RET ~ 0 + Q1 + Q2 + Q3 + Q4 + RmQ1 + RmQ2 + RmQ3 + RmQ4, 
                     data = CRSP[.(permno)])
    
    betarow <- matrix(c(permno, alpha, beta, summary(factor4)$sigma, 
                        summary(factor4)$adj.r.squared, 
                        summary(factor4)$fstatistic), nrow = 1)
          
          if(dim(betarow)[2] == 8) CRSP_Betas <- rbind(CRSP_Betas, betarow)
      
            }

colnames(CRSP_Betas) <- c("PERMNO", "Alpha", "Beta", "sigma", "Adj.R2", "Fstat", 
                          "numeratorDF", "denominatorDF")

Thank You for listening!