Please also check our work, released earlier and coming soon, as part of our Nova Scotia COVID-19 Research Coalition Project on our Blog.
We now attempt to train our ensemble models to see whether the forecasting accuracy of epi curves improves when we add predictors related to mobility and weather. We use the individual-level data from Ontario containing Episode Date, the date when the first symptoms appeared. The mobility data come from Facebook and include two indices: all_day_bing_tiles_visited_relative_change, the positive or negative change in movement relative to baseline, and all_day_ratio_single_tile_users, the positive proportion of users staying put within a single location. We will use the first one and call it Mob1. We will use only four neighboring regions: Toronto, Peel, York, and Durham, the hotspots in Ontario. We will also incorporate the panel structure of the data in all learning algorithms.
setwd("~/Dropbox/RNS2")
rm(list = ls())
# Libraries used throughout this section
library(readxl)        # read_excel()
library(forecast)      # auto.arima(), accuracy(), ndiffs()
library(randomForest)  # randomForest(), importance(), partialPlot()
library(doParallel)    # parallel backend for the ARIMA grid search
library(magrittr)      # %>%
Indoor <- read_excel("~/Dropbox/RNS2/DataON/Rinda/Indoor.xlsx")
load("~/Dropbox/RNS2/Toronto")
dt <- data_v1_Toronto_aggregated[-c(267:268), c(2,3,5,6,7,13)]
names(dt)[c(2,4,5)] <- c("mob", "delay", "male")
data <- cbind(dt, Indoor[,1:2])
names(data)[c(7,8)] <- c("temp", "hum")
data <- data[, c(1,3,2,4:8)]
head(data,10)
## Episode Date cases mob delay male age temp hum
## 1 2020-03-01 4 -0.01721 36.750000 0.7500000 55.00000 -4.20 65.5
## 2 2020-03-02 6 -0.03201 8.500000 1.0000000 45.00000 3.80 84.0
## 3 2020-03-03 10 -0.01188 15.000000 0.7000000 54.00000 2.30 90.0
## 4 2020-03-04 7 0.01861 25.714286 0.2857143 50.00000 3.35 71.0
## 5 2020-03-05 7 0.02228 21.000000 0.4285714 48.57143 1.20 63.5
## 6 2020-03-06 10 -0.00626 13.100000 0.5000000 36.00000 0.04 75.0
## 7 2020-03-07 8 0.02606 10.375000 0.5000000 46.25000 -1.65 54.0
## 8 2020-03-08 10 0.02729 11.600000 0.9000000 50.00000 6.30 56.0
## 9 2020-03-09 18 -0.01577 8.888889 0.6111111 35.55556 12.50 55.0
## 10 2020-03-10 29 -0.05208 9.689655 0.4482759 41.72414 5.15 79.0
We will set up the data as time series and convert them to ts objects.
tor <- data[,1:2]
tor <- cbind(tor, "week" = as.numeric(format(data$Episode, format = "%w")))
#ts
tor_ts <- as.ts(tor[,2:3])
#Train split - last week we want to predict
tor_train <- window(tor_ts, end=259)
#first diff
tor_ts_train <- cbind(diff(tor_train[,1]), tor_train[-1,2])
Embedding for direct forecasting (see this post)
lag_order <- 14 # the desired number of lags (2 weeks)
horizon <- 7 # the forecast horizon
tor_ts_mbd <- embed(tor_ts_train, lag_order + 1) # embedding
y_train <- tor_ts_mbd[, 1] # the target
X_train <- tor_ts_mbd[, -1] # everything but the target
y_test <- window(tor_ts, start = c(260), end = c(266)) # Last week
#X_test <- tor_ts_mbd[nrow(tor_ts_mbd), c(1:lag_order)] # Same for all, last 2 week
X_test <- X_train[nrow(X_train), ] # Same for all, last 2 week
Now training Random Forest.
forecasts_rf <- numeric(horizon)
for (i in 1:horizon){
  set.seed(2019)
  fit_rf <- randomForest(X_train, y_train)
  forecasts_rf[i] <- predict(fit_rf, X_test)
  # drop the earliest target and the latest feature row so that, in the next
  # iteration, the model is trained to predict one step further ahead (direct forecasting)
  y_train <- y_train[-1]
  X_train <- X_train[-nrow(X_train), ]
}
Getting the predicted values back to the original scale is simple: the usual approach is to cumulatively add the differenced forecasts to the last observation of the original series. If \(z\) denotes the differenced data and \(y\) the original series, then:
\[ \begin{array}{l} \hat{y}_{t+1}=y_{t}+\hat{z}_{t+1} \\ \hat{y}_{t+2}=\hat{y}_{t+1}+\hat{z}_{t+2}=y_{t}+\hat{z}_{t+1}+\hat{z}_{t+2} \end{array} \]
# calculate the cumulative sums
cumterm <- cumsum(forecasts_rf)
# extract the last observation from the time series (y_t)
last_observation <- as.vector(tail(tor_train[,1], 1))
# calculate the final predictions
backtransformed_forecasts <- last_observation + cumterm
# convert to ts format
y_pred <- ts(backtransformed_forecasts, start = 260)
# Plot
plot(y_test[,1], type = "l", col = "blue", ylim = c(150, 400))
lines(y_pred, type = "l", col = "red")
Accuracy measures
accuracy(y_pred, y_test[,1])
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -20.4929 70.38674 62.1692 -12.70611 24.78715 0.575273 1.908075
First, we will use auto.arima() to search for the best seasonal ARIMA model.
torr_ts <- ts(data[1:259,"cases"], start = 1, end = 259, frequency = 7)
#Train split - last week we want to predict
#tor_train <- window(torr_ts, end=259)
fit <- auto.arima(torr_ts, seasonal = TRUE, stepwise = FALSE,
approximation = FALSE, allowdrift = TRUE,
allowmean = TRUE)
checkresiduals(fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,3)(1,0,0)[7] with non-zero mean
## Q* = 115.36, df = 8, p-value < 2.2e-16
##
## Model df: 6. Total lags used: 14
y_pred <- fit %>% forecast::forecast(h=horizon)
ypred <- ts(y_pred$mean, start = 260)
plot(y_test[,1], type = "l", col = "blue", ylim = c(150, 410))
lines(ypred, type = "l", col = "red")
accuracy(ypred, y_test[,1])
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -72.80606 80.54758 72.80606 -28.99111 28.99111 0.1045337 1.872844
The residuals are not white noise: the Ljung-Box test strongly rejects the absence of autocorrelation. Moreover, the univariate random forest is much better than ARIMA, indicating that the relationship may be captured better by nonparametric models. We will also apply our own grid search to find the best ARIMA with well-behaved residuals.
yp <- data$cases
hr <- 7 # horizon
# Grid (in line with auto.arima())
p <- c(0:5)
q <- c(0:5)
P <- 0:3
D <- 0:1
Q <- 0:3
S <- c(0, 7)
#Random grid-search: n is chosen so that, with (approximately) probability conf_lev,
#the random sample contains at least one of the best num_max combinations:
#n = log(1 - conf_lev) / log(1 - num_max / N)
fcomb <- as.matrix(expand.grid(p,q,P,D,Q,S))
conf_lev <- 0.95
num_max <- 5
n <- round(log(1-conf_lev)/log(1-num_max/nrow(fcomb)))
trials <- 1:n
# Multi-core
numCores <- detectCores()
ypart <- yp[1:259]
aic_vec <- c()
#Random grid-search
set.seed(123)
ind <- sample(nrow(fcomb), n, replace = FALSE)
comb <- fcomb[ind, ]
registerDoParallel(numCores)
lst <- foreach(k=trials, .combine=c, .errorhandling = 'remove') %dopar% {
fit <- forecast::Arima(ypart, c(comb[k,1], 1, comb[k,2]),
seasonal=list(order=c(comb[k,3], comb[k,4], comb[k,5]),
period = comb[k,6]))
p <- checkresiduals(fit, plot = FALSE)$p.value
list(c(k, fit$aicc, mean(abs(fit$residuals)), p))
}
stopImplicitCluster()
unlst <- do.call(rbind, lst)
result <- cbind(comb[unlst[,1],], unlst)
sorted <- result[order(result[,8]), -7]   # sort by AICc; drop the trial index
colnames(sorted) <- c("p", "q", "P", "D", "Q", "S", "AICc", "MAE", "p_val")
sorted_f <- sorted[sorted[,9] > 0.15,]    # keep models whose Ljung-Box p-value > 0.15
fit_best <- Arima(ypart, c(sorted_f[1,1], 1, sorted_f[1,2]),
seasonal=list(order=c(sorted_f[1,3], sorted_f[1,4],
sorted_f[1,5]),
period = sorted_f[1,6]))
y_pred <- fit_best %>% forecast::forecast(h=horizon)
ypred <- ts(y_pred$mean, start = 260)
plot(y_test[,1], type = "l", col = "blue", ylim = c(150, 410))
lines(ypred, type = "l", col = "red")
accuracy(ypred, y_test[,1])
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -83.79027 103.3556 84.25591 -35.32335 35.47453 0.2989114 2.784119
#Residual checks
checkresiduals(fit_best)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,1)(0,1,1)[7]
## Q* = 7.6929, df = 6, p-value = 0.2615
##
## Model df: 4. Total lags used: 10
Prophet
Here is the description on CRAN: "Implements a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well." And here is an excerpt from Forecasting in R with Prophet: "The intent behind Prophet is to 'make it easier for experts and non-experts to make high-quality forecasts that keep up with demand.' Prophet is able to produce reliable and robust forecasts (often performing better than other common forecasting techniques) with very little manual effort while allowing for the application of domain knowledge via easily-interpretable parameters."
Our implementation:
library(prophet)
prop <- data[,1:2]
names(prop) <- c("ds", "y")
m <- prophet::prophet(prop[1:259,])
future <- make_future_dataframe(m, periods = 7)
#tail(future)
fore <- predict(m, future)
tail(fore[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])
## ds yhat yhat_lower yhat_upper
## 261 2020-11-16 456.4060 412.3223 499.7811
## 262 2020-11-17 452.3519 410.0777 493.3377
## 263 2020-11-18 460.4144 416.2707 501.7125
## 264 2020-11-19 455.6957 411.0563 495.6399
## 265 2020-11-20 465.6138 423.5631 508.2360
## 266 2020-11-21 451.8994 409.0695 495.7227
plot(m, fore)
plot(tor[260:266,2], col = "blue", type = "l", ylim = c(195, 470))
lines(fore[260:266,"yhat"], col = "red", type ="l")
prophet_plot_components(m, fore)
We will not calculate the accuracy measures, as it's obvious from the graph. However, the weekly pattern is interesting and confirms the 7-day frequency we found in our data.
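Should the reader want the numbers anyway, the same accuracy() comparison used for the other models applies to Prophet's forecast; a minimal sketch, assuming fore and y_test from above are still in scope:
#Accuracy of the Prophet 7-day forecast (sketch)
accuracy(ts(fore[260:266, "yhat"], start = 260), y_test[,1])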
We first check whether all the series are stationary:
apply(data[,2:8], 2, function(x) ndiffs(x))
## cases mob delay male age temp hum
## 1 1 1 0 1 1 1
#Adding day of the week
tor <- cbind(data[,2:8], "week" = as.numeric(format(data$Episode, format = "%w")))
tor_ts <- as.ts(tor)
head(tor_ts)
## Time Series:
## Start = 1
## End = 6
## Frequency = 1
## cases mob delay male age temp hum week
## 1 4 -0.01721 36.75000 0.7500000 55.00000 -4.20 65.5 0
## 2 6 -0.03201 8.50000 1.0000000 45.00000 3.80 84.0 1
## 3 10 -0.01188 15.00000 0.7000000 54.00000 2.30 90.0 2
## 4 7 0.01861 25.71429 0.2857143 50.00000 3.35 71.0 3
## 5 7 0.02228 21.00000 0.4285714 48.57143 1.20 63.5 4
## 6 10 -0.00626 13.10000 0.5000000 36.00000 0.04 75.0 5
#Train split - last week we want to predict
tor_train <- window(tor_ts, end = 259)
#First Differencing
tor_ts_train <- diff(tor_train[, c(1:4, 6:7)])
age <- tor_train[2:nrow(tor_train), "age"]
week <- tor_train[2:nrow(tor_train), "week"]
tor_ts_train <- cbind(tor_ts_train, age, week)
colnames(tor_ts_train) <- c("cases", "mob", "delay", "male", "temp", "hum", "age", "week")
Except for age (and the day-of-week indicator), all variables are treated as \(\textbf{I}(1)\) and differenced; age and the day-of-week indicator are kept in levels. Now the embedding:
lag_order <- 14
horizon <- 7
tor_ts_mbd <- embed(tor_ts_train, lag_order + 1) # embedding
# Label the embedded columns: variable name + lag index (1 = current period)
nm <- c()
for (i in 1:15) {
  nm <- c(nm, paste0(c("cases", "mob", "delay", "male", "temp", "hum", "age", "week"), i))
}
colnames(tor_ts_mbd) <- nm
#Setting the train/test sets
y_train <- tor_ts_mbd[, 1]
X_train <- tor_ts_mbd[, -1]
y_test <- window(tor_ts[,1], start = c(260), end = c(266)) # Last week
X_test <- X_train[nrow(X_train), ]
Training Random Forest:
forecasts_rf <- numeric(horizon)
fit_list <- vector(mode = "list", length = horizon)    # the fitted forest for each h
ytrain_list <- vector(mode = "list", length = horizon) # training data stored for the explainers below
xtrain_list <- vector(mode = "list", length = horizon)
for (i in 1:horizon){
  set.seed(2019)
  fit_rf <- randomForest(X_train, y_train, localImp = TRUE)
  fit_list[[i]] <- fit_rf
  forecasts_rf[i] <- predict(fit_rf, X_test)
  # shift the target and trim the feature matrix so the next fit predicts one step further ahead
  y_train <- y_train[-1]
  X_train <- X_train[-nrow(X_train), ]
  ytrain_list[[i]] <- y_train[-1]
  xtrain_list[[i]] <- X_train[-nrow(X_train), ]
}
Getting the predicted values back to the original scale, and plotting the last 7 days: actual case numbers (blue) vs. predicted case numbers (red):
cumterm <- cumsum(forecasts_rf)
last_observation <- as.vector(tail(tor_train[,1], 1))
backtransformed_forecasts <- last_observation + cumterm
y_pred <- ts(backtransformed_forecasts, start = 260)
plot(y_test, type = "l", col = "blue", ylim = c(150, 410))
lines(y_pred, type = "l", col = "red")
Accuracy measures
accuracy(y_pred, y_test)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -14.7304 62.84593 56.89757 -10.01043 22.27738 0.5790216 1.662361
As shown here, adding mobility, temperature, age, humidity, delay, and day of the week substantially increases the prediction accuracy. We will add more variables in a panel setting in the following sections. We will also make \(h\) our tuning parameter. A simple variable importance measure, the percentage change in MSE for the forecast of the \(7^{th}\) day, is given below. Later, we will report this for every day in the 7-day forecast.
Here is the list of important variables ordered by %IncMSE, and the bar plot:
library(randomForestExplainer)
vi <- fit_rf$importance
vdec <- vi[order(vi[,1], decreasing = TRUE),]
vdecm <- as.matrix(vdec)
head(vdecm, 30)
## %IncMSE IncNodePurity
## week10 93.317583 11158.5437
## cases2 73.896888 16833.0114
## cases9 71.544854 17470.3593
## week3 61.655459 6944.3122
## week4 45.557414 4790.6833
## week11 42.874972 5930.2226
## cases5 29.882628 9931.9787
## cases7 19.289114 5451.1893
## mob12 12.357707 2775.8786
## cases8 10.854425 8157.6929
## mob2 10.734101 3225.8986
## cases6 10.694762 4451.4257
## cases10 7.884482 3611.0435
## mob14 7.470376 1920.0504
## age11 7.019565 3672.1899
## cases14 6.709210 2950.4793
## mob5 6.037247 2002.4994
## temp6 4.711864 2286.5428
## age15 4.100878 1118.9643
## cases11 3.830025 4498.2605
## cases12 3.438944 2433.4530
## mob1 3.349710 1811.6750
## delay9 3.346623 2036.2750
## age12 3.247851 1309.1921
## male14 3.182159 665.1152
## age7 3.036401 832.5355
## delay11 2.981384 703.5696
## age9 2.957037 1605.0307
## age10 2.810812 1091.6714
## male12 2.641851 1195.0995
barplot(vdec[1:20,1]/100, horiz = TRUE, col = "pink",
cex.names = 0.5, cex.axis = 0.8, main = "Variable Importance for h=7")
We will now dig deeper with the randomForestExplainer and DALEX packages. Assuming that the observations form a representative sample from a general population, dataset-level explainers can provide information about the quality of predictions for that population.
First, we will look at measures of variable importance calculated from the average increase in node purity caused by splits on a variable: variables whose splits produce larger increases in node purity are more important, and the first split typically causes the largest increase. The minimal depth records the first time a variable is used to split a tree, so more important variables have lower minimal depth values; the splits that cause the larger increases in purity happen early, and so the important variables are split on early.
The table below allows you to sort the variables by different metrics:
library(randomForestExplainer)
library(DT)  # datatable()
min_depth_frame <- min_depth_distribution(fit_rf)
importance_frame <- measure_importance(fit_rf)
tabl <- cbind(importance_frame[,1], round(importance_frame[,2:8], 4))
datatable(tabl, rownames = FALSE, filter = "top",
          options = list(pageLength = 10, scrollX = TRUE))
A more intuitive measure is mse_increase, the increase in MSE due to shuffling the values of variable \(j\). %IncMSE is the most robust and informative measure, as it shows the increase in the MSE of predictions (estimated with out-of-bag CV) when variable \(j\) is permuted (its values randomly shuffled). Below we show only two importance plots, \(h=7\) and \(h=1\): since the algorithm above trains Random Forest separately for each \(h\), the trained model and the important variables also differ for each \(h\). The p-value shows whether the observed number of successes (the number of nodes in which \(X_j\) was used for splitting) exceeds the theoretical number of successes if they were random.
plot_multi_way_importance(importance_frame, x_measure = "mean_min_depth",
y_measure = "node_purity_increase",
size_measure = "p_value", no_of_labels = 12)
plot_multi_way_importance(importance_frame, x_measure = "mse_increase",
y_measure = "node_purity_increase",
size_measure = "p_value", no_of_labels = 12,
main = "Multi-way importance plot for h = 7")
importance_frame2 <- measure_importance(fit_list[[1]])
plot_multi_way_importance(importance_frame2, x_measure = "mse_increase",
y_measure = "node_purity_increase",
size_measure = "p_value", no_of_labels = 12,
main = "Multi-way importance plot for h = 1")
Now we will look at the minimal depth of each variable and at the interactions:
plot_min_depth_distribution(min_depth_frame, mean_sample = "all_trees", k =20,
main = "Distribution of minimal depth and its mean h =7")
Note that the \(x\)-axis ranges from zero trees to the maximum number of trees in which any variable was used for splitting. After seeing a set of most important variables we can investigate interactions with respect to them, i.e. splits appearing in maximal subtrees with respect to one of the variables selected.
vars <- important_variables(importance_frame, k = 15, measures = c("mean_min_depth", "no_of_trees"))
vars
## [1] "cases9" "cases2" "cases5" "cases8" "cases7" "cases6" "temp6"
## [8] "cases11" "cases4" "cases10" "hum7" "cases3" "temp10" "cases12"
## [15] "cases14"
interactions_frame <- min_depth_interactions(fit_rf, vars)
head(interactions_frame[order(interactions_frame$occurrences, decreasing = TRUE), ])
## variable root_variable mean_min_depth occurrences interaction
## 372 cases5 cases9 4.269231 182 cases9:cases5
## 387 cases6 cases9 6.085043 166 cases9:cases6
## 1512 temp6 cases9 6.949710 162 cases9:temp6
## 342 cases3 cases9 7.841850 158 cases9:cases3
## 402 cases7 cases9 7.827397 154 cases9:cases7
## 432 cases9 cases9 6.889717 153 cases9:cases9
## uncond_mean_min_depth
## 372 5.783793
## 387 9.635547
## 1512 10.285557
## 342 11.344975
## 402 9.364542
## 432 4.177340
plot_min_depth_interactions(interactions_frame)
Interactions are ordered by decreasing number of occurrences; the most frequent one, cases9:cases5, is also the one with the minimal mean conditional minimal depth. Remarkably, the unconditional mean minimal depth of cases5 in the forest is almost equal to its mean minimal depth across maximal subtrees with cases9 as the root variable.
Finally, we will look at partial effects with partial dependence plots (PDP). A partial dependence plot can show whether the relationship between the target and a feature is linear, monotonic, or more complex. However, the assumption of independence is the biggest issue with PD plots: it is assumed that the features for which the partial dependence is computed are not correlated with the other features. When that holds, the PDPs perfectly represent how a feature influences the prediction on average.
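Formally (a sketch in standard PDP notation, where \(x_{-j}^{(i)}\) denotes the observed values of all features other than \(j\) for observation \(i\)), the partial dependence of the prediction function \(f\) on feature \(j\) is estimated by
\[ \hat{f}_{j}(x)=\frac{1}{n} \sum_{i=1}^{n} f\left(x, x_{-j}^{(i)}\right), \]
which is the average that partialPlot() computes over the training data in the regression case.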
The following plots include the top three mobility features for each day of the 7-day horizon.
# Partial-dependence plots of the top three mobility lags for each horizon h = 1,...,7.
# Each horizon uses its own fitted forest and stored training set.
for (h in 1:horizon) {
  imp <- importance(fit_list[[h]])
  impvar <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]
  mobvar <- impvar[grep("mob", impvar)]  # mobility lags, most important first
  op <- par(mfrow = c(1, 3))
  for (i in 1:3) {
    partialPlot(fit_list[[h]], xtrain_list[[h]], mobvar[i], xlab = mobvar[i],
                main = paste("P.Dep. on", mobvar[i], "for h =", h),
                col = colors()[20 + i * i], lwd = 3)
  }
  par(op)
}
Although mobility would be affected by weather, we do not see any fundamental feature that may be correlated with mobility. This shows how mobility can have a nonlinear and varying relationship with the case numbers. The most important aspect of this relationship is the reverse causality, which explains the negative relations in each prediction. We will return to this issue later.
Instance-level exploration methods help us understand how a model yields a prediction for a particular single observation. Remember, we use the last 2 weeks of information (lag order = 14) to predict the next week (7 days). Hence, our instance is the last two weeks, but we have 7 different predictions, since \(h = 7\).
Now we will use more detailed plots with DALEX, for \(h = 1\):
explain_rf <- DALEX::explain(model = fit_list[[1]],
data = xtrain_list[[1]],
y = ytrain_list[[1]],
label = "Random Forest")
## Preparation of a new explainer is initiated
## -> model label : Random Forest
## -> data : 242 rows 119 cols
## -> data : rownames to data was added ( from 1 to 242 )
## -> target variable : 242 values
## -> predict function : yhat.randomForest will be used ( default )
## -> predicted values : numerical, min = -55.8924 , mean = 1.196471 , max = 118.4095
## -> model_info : package randomForest , ver. 4.6.14 , task regression ( default )
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -125.5139 , mean = -0.1882065 , max = 208.2969
## A new explainer has been created!
The plot shows “variable attributions”, i.e., the decomposition of the model’s prediction into contributions that can be attributed to different explanatory variables.
#Prediction increments for h = 1
xtestrf <- matrix(X_test, 1, 119)
colnames(xtestrf) <- names(X_test)
bd_rf <- predict_parts(explainer = explain_rf,
new_observation = xtestrf,
type = "break_down")
#bd_rf
plot(bd_rf)
Next, we fix the order of the contributions, first by the mob lags and then by the temp lags:
ordr <- paste0("mob", 1:14)
ordr2 <- paste0("temp", 1:14)
bd_rf2 <- predict_parts(explainer = explain_rf,
new_observation = xtestrf,
type = "break_down",
order = ordr)
bd_rf3 <- predict_parts(explainer = explain_rf,
new_observation = xtestrf,
type = "break_down",
order = ordr2)
plot(bd_rf2)
plot(bd_rf3)
bd_rf3 <- predict_parts(explainer = explain_rf,
new_observation = xtestrf,
type = "break_down_interactions")
plot(bd_rf3)
In the presence of interactions, the computed value of the attribution depends on the order of explanatory covariates used in the BD plots. Here we use another approach to address the ordering issue: averaging the value of a variable's attribution over all (or a large number of) possible orderings, i.e., Shapley values (SHAP).
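Schematically (a sketch in break-down notation, where \(\Delta_{j \mid \pi}(x_{*})\) is the attribution of variable \(j\) for instance \(x_{*}\) under ordering \(\pi\), and \(\Pi\) is the set of sampled orderings):
\[ \phi_{j}(x_{*})=\frac{1}{|\Pi|} \sum_{\pi \in \Pi} \Delta_{j \mid \pi}(x_{*}). \]
This average is what predict_parts() with type = "shap" approximates by sampling orderings.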
bd_rf4 <- predict_parts(explainer = explain_rf,
new_observation = xtestrf,
type = "shap")
plot(bd_rf4)
The ceteris-paribus (CP) profile is a method that evaluates the effect of a selected explanatory variable in terms of changes in the model's prediction induced by changes in the variable's values. We will look only at \(h=1\) and the mobility and temperature variables.
temp lags 7 to 9:
bd_rf5 <- predict_profile(explainer = explain_rf,
                          new_observation = xtestrf)
for (i in 7:9){
  print(plot(bd_rf5, variables = ordr2[i]))
}
temp lags 10 to 14:
for (i in 10:14){
  print(plot(bd_rf5, variables = ordr2[i]))
}
mob lags 7 to 9:
bd_rf5 <- predict_profile(explainer = explain_rf,
                          new_observation = xtestrf)
for (i in 7:9){
  print(plot(bd_rf5, variables = ordr[i]))
}
mob lags 10 to 14:
bd_rf6 <- predict_profile(explainer = explain_rf,
                          new_observation = xtestrf)
for (i in 10:14){
  print(plot(bd_rf6, variables = ordr[i]))
}
To assign importance to CP profiles, we can use the concept of profile oscillations. The larger the influence of an explanatory variable on the prediction for a particular instance, the larger the fluctuations of the corresponding CP profile. For a variable that exercises little or no influence on the model's prediction, the profile will be flat or will barely change; its values will be close to the model's prediction for that instance, so the sum of differences between the profile and the prediction, taken across all possible values of the explanatory variable, will be close to zero. The sum can be depicted graphically as the area between the profile and the horizontal line representing the single-instance prediction. For an explanatory variable with a large influence on the prediction, on the other hand, this area will be large.
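In notation (a sketch, assuming the profile is evaluated at the \(k\) unique empirical values \(z_{1}, \ldots, z_{k}\) of variable \(j\)):
\[ vip_{j}^{osc}(x_{*})=\frac{1}{k} \sum_{l=1}^{k}\left|f\left(x_{*}^{j \mid z_{l}}\right)-f\left(x_{*}\right)\right|, \]
where \(x_{*}^{j \mid z_{l}}\) is the instance \(x_{*}\) with variable \(j\) set to \(z_{l}\). This is the quantity computed by type = "oscillations_emp" below.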
bd_rf5 <- predict_parts(explainer = explain_rf,
new_observation = xtestrf,
type = "oscillations_emp")
library(ggplot2)  # ggtitle()
plot(bd_rf5[1:25,]) +
  ggtitle("Ceteris-paribus Oscillations for h = 1",
          "Expectation over empirical distribution (unique values)")
Graphs and tables of COVID-19 data on the number of tests, the percentage of positive results by age and location, and the time taken for test processing are taken from https://covid-19.ontario.ca/data/testing-volumes-and-results#byPHU. Since the data are not downloadable, we did some web scraping to get the data for Toronto. The total test numbers reflect 7-day averages of how many tests were processed by the lab each day. The available information (as of 02-04-2021) starts from May 1, 2020. Here is the original 7-day average data:
library(Matrix)      # band()
library(matrixcalc)  # svd.inverse()
library(readr)       # read_csv()
df <- data.frame(read_csv("~/Dropbox/RNS2/DataON/Rinda/Toronto_test_data.csv"))
test <- matrix(df[,2])
pr <- matrix(df[,3])
# Smoothing with loess()
t = 1:266
loepr <- loess(pr~t, degree=2, span = 0.02)
fitpr <- predict(loepr, t)
loecs <- loess(test~t, degree=2, span = 0.02)
fitcs <- predict(loecs, t)
#Plots
plot(test, type = "h", col = "red", main = "Tests performed (7-day average)",
xlab = "Days")
lines(t, fitcs/1.5, col = "darkgreen", lwd = 2)
plot(pr, type = "h", col = "blue", main = "7-day average PR",
xlab = "Days")
lines(t, fitpr/1.5, col = "darkgreen", lwd = 2)
In order to extract the daily tests, positivity rate (PR), and positive results (dtest, dpr, and dpos, respectively) from the 7-day moving averages, we applied some simple linear algebra.
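In matrix form, with \(y\) the published 7-day averages, \(x\) the unknown daily series, and \(A\) the band matrix with seven ones in each row,
\[ y=\frac{1}{7} A x \quad \Longrightarrow \quad x=7 A^{+} y, \]
where \(A^{+}\) is a generalized (SVD-based) inverse, since \(A\) is not square. The code below implements exactly this.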
# See: https://stats.stackexchange.com/questions/67907/extract-data-points-from-moving-average
# y = (1/7)Ax: y is the 7-day moving-average test data, A is the band matrix, x is what we need
# dimensions
#dim(test)
A <- as.matrix(band(matrix(1, nrow=272, ncol=272), -3, 3)[-c(1:3, 270:272),])
# a bandwidth of 3 on each side of the diagonal, plus the diagonal itself, gives the 7-day window
#head(A)[,1:10]
#tail(A)[, 260:272]
#dim(A)
# A is not square, so we use the SVD-based generalized inverse
Ai <- svd.inverse(A)
#dim(Ai)
# Hence the solution: x = 7 * Ai %*% y
mytest <- 7 * Ai %*% test
mypr <- 7 * Ai %*% pr
# The same date range
dailytest <- ceiling(mytest[-c(1:6), ])
#length(dailytest)
dailypr <- mypr[-c(1:6), ]
dailypr <- ifelse(dailypr < 0, 0.01, dailypr)
#length(dailypr)
dff <- cbind(df, dtest = dailytest, dpr = dailypr)
dff$dpos <- round(dff$dtest*dff$dpr/100, 0)
head(dff, 20)
## Date Total.tests.processed Positive.results dtest dpr dpos
## 1 May 01 2020 2442 9.5 2805 12.539700 352
## 2 May 02 2020 2450 9.4 3524 6.654067 234
## 3 May 03 2020 2460 9.6 3437 8.538682 293
## 4 May 04 2020 2423 9.7 1942 8.413041 163
## 5 May 05 2020 2408 9.6 1607 9.938682 160
## 6 May 06 2020 2561 8.8 2192 7.102785 156
## 7 May 07 2020 2575 8.9 2522 9.113041 230
## 8 May 08 2020 2616 8.7 3092 11.139700 344
## 9 May 09 2020 2664 8.5 3860 5.254067 203
## 10 May 10 2020 2671 8.3 3486 7.138682 249
## 11 May 11 2020 2853 8.1 3216 7.013041 226
## 12 May 12 2020 2776 8.2 1068 10.638682 114
## 13 May 13 2020 2782 8.4 2234 8.502785 190
## 14 May 14 2020 2737 8.2 2207 7.713041 170
## 15 May 15 2020 2705 8.5 2868 13.239700 380
## 16 May 16 2020 2639 9.0 3398 8.754067 297
## 17 May 17 2020 2627 9.2 3402 8.538682 290
## 18 May 18 2020 2240 10.0 507 12.613041 64
## 19 May 19 2020 2317 10.1 1607 11.338682 182
## 20 May 20 2020 2212 10.4 1499 10.602785 159
Here are the same graphs with daily numbers:
test <- dff[,4]
pr <- dff[,5]
# Smoothing with loess()
t = 1:266
loepr <- loess(pr~t, degree=2, span = 0.02)
fitpr <- predict(loepr, t)
loecs <- loess(test~t, degree=2, span = 0.02)
fitcs <- predict(loecs, t)
#Plots
plot(test, type = "h", col = "red", main = "Daily tests performed",
xlab = "Days")
lines(t, fitcs/1.5, col = "darkgreen", lwd = 2)
plot(pr, type = "h", col = "blue", main = "Daily PR",
xlab = "Days")
lines(t, fitpr/1.5, col = "darkgreen", lwd = 2)
While it was not obvious when the data were smoothed with 7-day moving averages, the daily test data now show clear drops on Victoria Day (May 18), Thanksgiving (Oct 12), Christmas (Dec 25), and New Year's Day (Jan 1, 2021). We know that lower test numbers translate into lower case numbers, and this sort of bias in the case numbers is well reported in the literature. That's why using PR is important. The positive test results calculated as (tests x PR) are very different from the officially reported case numbers for Toronto by the province. The difference could be the number of false positives; that is, the numbers here reflect positive results rather than confirmed cases. Here is the relationship between positive results and test numbers in three stages: all days, days 1 to 120, and days 121 to 266.
# All
cor(dff[,c(4,6)])
## dtest dpos
## dtest 1.0000000 0.7339496
## dpos 0.7339496 1.0000000
# First
cor(dff[1:120,c(4,6)])
## dtest dpos
## dtest 1.0000000 -0.5074416
## dpos -0.5074416 1.0000000
# Last part
cor(dff[121:266,c(4,6)])
## dtest dpos
## dtest 1.0000000 0.7040209
## dpos 0.7040209 1.0000000
In the first period, the province tested only those with strong symptoms. The more the testing pool was restricted to the most susceptible people, the higher the likelihood of finding positives; hence the higher PR rates at the beginning.
In the next section we will use both metrics, cases and PR, in a panel setting.
We know that boosting gives a “voice” to weak predictors. This is more important when we have autoregressive time series data. Coming soon…
A work by ML-Portal