Please also check our work, released earlier and coming soon, as part of our Nova Scotia COVID-19 Research Coalition Project on our Blog.
We now attempt to train our ensemble models to see whether the forecasting accuracy of epi curves improves when we add predictors related to mobility and weather. We use the individual-level data from Ontario containing Episode Date, the date when the first symptoms appeared. The mobility data come from Facebook and include two indices: all_day_bing_tiles_visited_relative_change, the positive or negative change in movement relative to baseline, and all_day_ratio_single_tile_users, the positive proportion of users staying put within a single location. We will use the first one and call it Mob1. We will use only four neighboring regions: Toronto, Peel, York, and Durham, the hotspots in Ontario. We will also incorporate the panel structure of the data in all learning algorithms.
setwd("~/Dropbox/RNS2")
rm(list = ls())
# Libraries used throughout this section
library(readxl)        # read_excel()
library(forecast)      # auto.arima(), accuracy(), ndiffs()
library(randomForest)  # randomForest(), importance(), partialPlot()
library(doParallel)    # parallel backend for the ARIMA grid search
library(magrittr)      # %>%
Indoor <- read_excel("~/Dropbox/RNS2/DataON/Rinda/Indoor.xlsx")
load("~/Dropbox/RNS2/Toronto")
dt <- data_v1_Toronto_aggregated[-c(267:268), c(2,3,5,6,7,13)]
names(dt)[c(2,4,5)] <- c("mob", "delay", "male")
data <- cbind(dt, Indoor[,1:2])
names(data)[c(7,8)] <- c("temp", "hum")
data <- data[, c(1,3,2,4:8)]
head(data,10)
## Episode Date cases mob delay male age temp hum
## 1 2020-03-01 4 -0.01721 36.750000 0.7500000 55.00000 -4.20 65.5
## 2 2020-03-02 6 -0.03201 8.500000 1.0000000 45.00000 3.80 84.0
## 3 2020-03-03 10 -0.01188 15.000000 0.7000000 54.00000 2.30 90.0
## 4 2020-03-04 7 0.01861 25.714286 0.2857143 50.00000 3.35 71.0
## 5 2020-03-05 7 0.02228 21.000000 0.4285714 48.57143 1.20 63.5
## 6 2020-03-06 10 -0.00626 13.100000 0.5000000 36.00000 0.04 75.0
## 7 2020-03-07 8 0.02606 10.375000 0.5000000 46.25000 -1.65 54.0
## 8 2020-03-08 10 0.02729 11.600000 0.9000000 50.00000 6.30 56.0
## 9 2020-03-09 18 -0.01577 8.888889 0.6111111 35.55556 12.50 55.0
## 10 2020-03-10 29 -0.05208 9.689655 0.4482759 41.72414 5.15 79.0
We will set up the data as time series and convert them to ts objects.
tor <- data[,1:2]
tor <- cbind(tor, "week" = as.numeric(format(data$Episode, format = "%w")))
#ts
tor_ts <- as.ts(tor[,2:3])
#Train split - last week we want to predict
tor_train <- window(tor_ts, end=259)
#first diff
tor_ts_train <- cbind(diff(tor_train[,1]), tor_train[-1,2])
Embedding for direct forecasting (see this post)
lag_order <- 14 # the desired number of lags (2 weeks)
horizon <- 7 # the forecast horizon
tor_ts_mbd <- embed(tor_ts_train, lag_order + 1) # embedding
y_train <- tor_ts_mbd[, 1] # the target
X_train <- tor_ts_mbd[, -1] # everything but the target
y_test <- window(tor_ts, start = c(260), end = c(266)) # Last week
#X_test <- tor_ts_mbd[nrow(tor_ts_mbd), c(1:lag_order)] # Same for all, last 2 week
X_test <- X_train[nrow(X_train), ] # Same for all, last 2 week
Now training Random Forest.
forecasts_rf <- numeric(horizon)
for (i in 1:horizon){
  set.seed(2019)
  fit_rf <- randomForest(X_train, y_train)
  forecasts_rf[i] <- predict(fit_rf, X_test)
  # drop the earliest target and the latest feature row so that, in the next
  # iteration, the model is trained to predict one step further ahead (direct forecasting)
  y_train <- y_train[-1]
  X_train <- X_train[-nrow(X_train), ]
}
Getting the predicted values back to the original scale is simple: the usual approach is to cumulatively add the differenced forecasts to the last observation of the original series. If \(z\) denotes the differenced data and \(y\) the original series, then:
\[ \begin{array}{l} \hat{y}_{t+1}=y_{t}+\hat{z}_{t+1} \\ \hat{y}_{t+2}=\hat{y}_{t+1}+\hat{z}_{t+2}=y_{t}+\hat{z}_{t+1}+\hat{z}_{t+2} \end{array} \]
# calculate the cumulative sums
cumterm <- cumsum(forecasts_rf)
# extract the last observation from the time series (y_t)
last_observation <- as.vector(tail(tor_train[,1], 1))
# calculate the final predictions
backtransformed_forecasts <- last_observation + cumterm
# convert to ts format
y_pred <- ts(backtransformed_forecasts, start = 260)
# Plot
plot(y_test[,1], type = "l", col = "blue", ylim = c(150, 400))
lines(y_pred, type = "l", col = "red")
Accuracy measures
accuracy(y_pred, y_test[,1])
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -20.4929 70.38674 62.1692 -12.70611 24.78715 0.575273 1.908075
First, we will use auto.arima() to search for the best seasonal ARIMA model.
torr_ts <- ts(data[1:259,"cases"], start = 1, end = 259, frequency = 7)
#Train split - last week we want to predict
#tor_train <- window(torr_ts, end=259)
fit <- auto.arima(torr_ts, seasonal = TRUE, stepwise = FALSE,
approximation = FALSE, allowdrift = TRUE,
allowmean = TRUE)
checkresiduals(fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,3)(1,0,0)[7] with non-zero mean
## Q* = 115.36, df = 8, p-value < 2.2e-16
##
## Model df: 6. Total lags used: 14
y_pred <- fit %>% forecast::forecast(h=horizon)
ypred <- ts(y_pred$mean, start = 260)
plot(y_test[,1], type = "l", col = "blue", ylim = c(150, 410))
lines(ypred, type = "l", col = "red")
accuracy(ypred, y_test[,1])
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -72.80606 80.54758 72.80606 -28.99111 28.99111 0.1045337 1.872844
The residuals are not white noise: the Ljung-Box test strongly rejects the absence of autocorrelation. Moreover, the univariate random forest is much better than ARIMA, indicating that the relationship may be captured better by nonparametric models. We will also apply our own grid search to find the best ARIMA with well-behaved residuals.
yp <- data$cases
hr <- 7 # horizon
# Grid (in line with auto.arima())
p <- c(0:5)
q <- c(0:5)
P <- 0:3
D <- 0:1
Q <- 0:3
S <- c(0, 7)
#Random grid-search: n is chosen so that, with (approximately) probability conf_lev,
#the random sample contains at least one of the best num_max combinations:
#n = log(1 - conf_lev) / log(1 - num_max / N)
fcomb <- as.matrix(expand.grid(p,q,P,D,Q,S))
conf_lev <- 0.95
num_max <- 5
n <- round(log(1-conf_lev)/log(1-num_max/nrow(fcomb)))
trials <- 1:n
# Multi-core
numCores <- detectCores()
ypart <- yp[1:259]
aic_vec <- c()
#Random grid-search
set.seed(123)
ind <- sample(nrow(fcomb), n, replace = FALSE)
comb <- fcomb[ind, ]
registerDoParallel(numCores)
lst <- foreach(k=trials, .combine=c, .errorhandling = 'remove') %dopar% {
fit <- forecast::Arima(ypart, c(comb[k,1], 1, comb[k,2]),
seasonal=list(order=c(comb[k,3], comb[k,4], comb[k,5]),
period = comb[k,6]))
p <- checkresiduals(fit, plot = FALSE)$p.value
list(c(k, fit$aicc, mean(abs(fit$residuals)), p))
}
stopImplicitCluster()
unlst <- do.call(rbind, lst)
result <- cbind(comb[unlst[,1],], unlst)
sorted <- result[order(result[,8]), -7]   # sort by AICc; drop the trial index
colnames(sorted) <- c("p", "q", "P", "D", "Q", "S", "AICc", "MAE", "p_val")
sorted_f <- sorted[sorted[,9] > 0.15,]    # keep models whose Ljung-Box p-value > 0.15
fit_best <- Arima(ypart, c(sorted_f[1,1], 1, sorted_f[1,2]),
seasonal=list(order=c(sorted_f[1,3], sorted_f[1,4],
sorted_f[1,5]),
period = sorted_f[1,6]))
y_pred <- fit_best %>% forecast::forecast(h=horizon)
ypred <- ts(y_pred$mean, start = 260)
plot(y_test[,1], type = "l", col = "blue", ylim = c(150, 410))
lines(ypred, type = "l", col = "red")
accuracy(ypred, y_test[,1])
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -83.79027 103.3556 84.25591 -35.32335 35.47453 0.2989114 2.784119
#Residual checks
checkresiduals(fit_best)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,1)(0,1,1)[7]
## Q* = 7.6929, df = 6, p-value = 0.2615
##
## Model df: 4. Total lags used: 10
Prophet
Here is the description on CRAN: "Implements a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well." And here is an excerpt from Forecasting in R with Prophet: "The intent behind Prophet is to 'make it easier for experts and non-experts to make high-quality forecasts that keep up with demand.' Prophet is able to produce reliable and robust forecasts (often performing better than other common forecasting techniques) with very little manual effort while allowing for the application of domain knowledge via easily-interpretable parameters."
Our implementation:
library(prophet)
prop <- data[,1:2]
names(prop) <- c("ds", "y")
m <- prophet::prophet(prop[1:259,])
future <- make_future_dataframe(m, periods = 7)
#tail(future)
fore <- predict(m, future)
tail(fore[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])
## ds yhat yhat_lower yhat_upper
## 261 2020-11-16 456.4060 412.3223 499.7811
## 262 2020-11-17 452.3519 410.0777 493.3377
## 263 2020-11-18 460.4144 416.2707 501.7125
## 264 2020-11-19 455.6957 411.0563 495.6399
## 265 2020-11-20 465.6138 423.5631 508.2360
## 266 2020-11-21 451.8994 409.0695 495.7227
plot(m, fore)
plot(tor[260:266,2], col = "blue", type = "l", ylim = c(195, 470))
lines(fore[260:266,"yhat"], col = "red", type ="l")
prophet_plot_components(m, fore)
We will not calculate the accuracy measures, as it's obvious from the graph. However, the weekly pattern is interesting and confirms the 7-day frequency we found in our data.
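Should the reader want the numbers anyway, the same accuracy() comparison used for the other models applies to Prophet's forecast; a minimal sketch, assuming fore and y_test from above are still in scope:
#Accuracy of the Prophet 7-day forecast (sketch)
accuracy(ts(fore[260:266, "yhat"], start = 260), y_test[,1])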
We first check whether all the series are stationary:
apply(data[,2:8], 2, function(x) ndiffs(x))
## cases mob delay male age temp hum
## 1 1 1 0 1 1 1
#Adding day of the week
tor <- cbind(data[,2:8], "week" = as.numeric(format(data$Episode, format = "%w")))
tor_ts <- as.ts(tor)
head(tor_ts)
## Time Series:
## Start = 1
## End = 6
## Frequency = 1
## cases mob delay male age temp hum week
## 1 4 -0.01721 36.75000 0.7500000 55.00000 -4.20 65.5 0
## 2 6 -0.03201 8.50000 1.0000000 45.00000 3.80 84.0 1
## 3 10 -0.01188 15.00000 0.7000000 54.00000 2.30 90.0 2
## 4 7 0.01861 25.71429 0.2857143 50.00000 3.35 71.0 3
## 5 7 0.02228 21.00000 0.4285714 48.57143 1.20 63.5 4
## 6 10 -0.00626 13.10000 0.5000000 36.00000 0.04 75.0 5
#Train split - last week we want to predict
tor_train <- window(tor_ts, end = 259)
#First Differencing
tor_ts_train <- diff(tor_train[, c(1:4, 6:7)])
age <- tor_train[2:nrow(tor_train), "age"]
week <- tor_train[2:nrow(tor_train), "week"]
tor_ts_train <- cbind(tor_ts_train, age, week)
colnames(tor_ts_train) <- c("cases", "mob", "delay", "male", "temp", "hum", "age", "week")
Except for age (and the day-of-week indicator), all variables are treated as \(\textbf{I}(1)\) and differenced; age and the day-of-week indicator are kept in levels. Now the embedding:
lag_order <- 14
horizon <- 7
tor_ts_mbd <- embed(tor_ts_train, lag_order + 1) # embedding
# Label the embedded columns: variable name + lag index (1 = current period)
nm <- c()
for (i in 1:15) {
  nm <- c(nm, paste0(c("cases", "mob", "delay", "male", "temp", "hum", "age", "week"), i))
}
colnames(tor_ts_mbd) <- nm
#Setting the train/test sets
y_train <- tor_ts_mbd[, 1]
X_train <- tor_ts_mbd[, -1]
y_test <- window(tor_ts[,1], start = c(260), end = c(266)) # Last week
X_test <- X_train[nrow(X_train), ]
Training Random Forest:
forecasts_rf <- numeric(horizon)
fit_list <- vector(mode = "list", length = horizon)    # the fitted forest for each h
ytrain_list <- vector(mode = "list", length = horizon) # training data stored for the explainers below
xtrain_list <- vector(mode = "list", length = horizon)
for (i in 1:horizon){
  set.seed(2019)
  fit_rf <- randomForest(X_train, y_train, localImp = TRUE)
  fit_list[[i]] <- fit_rf
  forecasts_rf[i] <- predict(fit_rf, X_test)
  # shift the target and trim the feature matrix so the next fit predicts one step further ahead
  y_train <- y_train[-1]
  X_train <- X_train[-nrow(X_train), ]
  ytrain_list[[i]] <- y_train[-1]
  xtrain_list[[i]] <- X_train[-nrow(X_train), ]
}
Getting the predicted values back to the original scale, and plotting the last 7 days: actual case numbers (blue) vs. predicted case numbers (red):
cumterm <- cumsum(forecasts_rf)
last_observation <- as.vector(tail(tor_train[,1], 1))
backtransformed_forecasts <- last_observation + cumterm
y_pred <- ts(backtransformed_forecasts, start = 260)
plot(y_test, type = "l", col = "blue", ylim = c(150, 410))
lines(y_pred, type = "l", col = "red")
Accuracy measures
accuracy(y_pred, y_test)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -14.7304 62.84593 56.89757 -10.01043 22.27738 0.5790216 1.662361
As shown here, adding mobility, temperature, age, humidity, delay, and day of the week substantially increases the prediction accuracy. We will add more variables in a panel setting in the following sections. We will also make \(h\) our tuning parameter. A simple variable importance measure, the percentage change in MSE for the forecast of the \(7^{th}\) day, is given below. Later, we will report this for every day in the 7-day forecast.
Here is the list of important variables ordered by %IncMSE, and the bar plot:
library(randomForestExplainer)
vi <- fit_rf$importance
vdec <- vi[order(vi[,1], decreasing = TRUE),]
vdecm <- as.matrix(vdec)
head(vdecm, 30)
## %IncMSE IncNodePurity
## week10 93.317583 11158.5437
## cases2 73.896888 16833.0114
## cases9 71.544854 17470.3593
## week3 61.655459 6944.3122
## week4 45.557414 4790.6833
## week11 42.874972 5930.2226
## cases5 29.882628 9931.9787
## cases7 19.289114 5451.1893
## mob12 12.357707 2775.8786
## cases8 10.854425 8157.6929
## mob2 10.734101 3225.8986
## cases6 10.694762 4451.4257
## cases10 7.884482 3611.0435
## mob14 7.470376 1920.0504
## age11 7.019565 3672.1899
## cases14 6.709210 2950.4793
## mob5 6.037247 2002.4994
## temp6 4.711864 2286.5428
## age15 4.100878 1118.9643
## cases11 3.830025 4498.2605
## cases12 3.438944 2433.4530
## mob1 3.349710 1811.6750
## delay9 3.346623 2036.2750
## age12 3.247851 1309.1921
## male14 3.182159 665.1152
## age7 3.036401 832.5355
## delay11 2.981384 703.5696
## age9 2.957037 1605.0307
## age10 2.810812 1091.6714
## male12 2.641851 1195.0995
barplot(vdec[1:20,1]/100, horiz = TRUE, col = "pink",
cex.names = 0.5, cex.axis = 0.8, main = "Variable Importance for h=7")
We will now dig deeper with the randomForestExplainer and DALEX packages. Assuming that the observations form a representative sample from a general population, dataset-level explainers can provide information about the quality of predictions for that population.
First, we will look at measures of variable importance calculated from the average increase in node purity caused by splits on a variable: variables whose splits produce larger increases in node purity are more important, and the first split typically causes the largest increase. The minimal depth records the first time a variable is used to split a tree, so more important variables have lower minimal depth values; the splits that cause the larger increases in purity happen early, and so the important variables are split on early.
The table below allows you to sort the variables by different metrics:
library(randomForestExplainer)
library(DT)  # datatable()
min_depth_frame <- min_depth_distribution(fit_rf)
importance_frame <- measure_importance(fit_rf)
tabl <- cbind(importance_frame[,1], round(importance_frame[,2:8], 4))
datatable(tabl, rownames = FALSE, filter = "top",
          options = list(pageLength = 10, scrollX = TRUE))
A more intuitive measure is mse_increase, the increase in MSE due to shuffling the values of variable \(j\). %IncMSE is the most robust and informative measure, as it shows the increase in the MSE of predictions (estimated with out-of-bag CV) when variable \(j\) is permuted (its values randomly shuffled). Below we show only two importance plots, \(h=7\) and \(h=1\): since the algorithm above trains Random Forest separately for each \(h\), the trained model and the important variables also differ for each \(h\). The p-value shows whether the observed number of successes (the number of nodes in which \(X_j\) was used for splitting) exceeds the theoretical number of successes if they were random.
plot_multi_way_importance(importance_frame, x_measure = "mean_min_depth",
y_measure = "node_purity_increase",
size_measure = "p_value", no_of_labels = 12)
plot_multi_way_importance(importance_frame, x_measure = "mse_increase",
y_measure = "node_purity_increase",
size_measure = "p_value", no_of_labels = 12,
main = "Multi-way importance plot for h = 7")
importance_frame2 <- measure_importance(fit_list[[1]])
plot_multi_way_importance(importance_frame2, x_measure = "mse_increase",
y_measure = "node_purity_increase",
size_measure = "p_value", no_of_labels = 12,
main = "Multi-way importance plot for h = 1")
Now we will look at the minimal depth of each variable and at the interactions:
plot_min_depth_distribution(min_depth_frame, mean_sample = "all_trees", k =20,
main = "Distribution of minimal depth and its mean h =7")
Note that the \(x\)-axis ranges from zero trees to the maximum number of trees in which any variable was used for splitting. After seeing a set of most important variables we can investigate interactions with respect to them, i.e. splits appearing in maximal subtrees with respect to one of the variables selected.
vars <- important_variables(importance_frame, k = 15, measures = c("mean_min_depth", "no_of_trees"))
vars
## [1] "cases9" "cases2" "cases5" "cases8" "cases7" "cases6" "temp6"
## [8] "cases11" "cases4" "cases10" "hum7" "cases3" "temp10" "cases12"
## [15] "cases14"
interactions_frame <- min_depth_interactions(fit_rf, vars)
head(interactions_frame[order(interactions_frame$occurrences, decreasing = TRUE), ])
## variable root_variable mean_min_depth occurrences interaction
## 372 cases5 cases9 4.269231 182 cases9:cases5
## 387 cases6 cases9 6.085043 166 cases9:cases6
## 1512 temp6 cases9 6.949710 162 cases9:temp6
## 342 cases3 cases9 7.841850 158 cases9:cases3
## 402 cases7 cases9 7.827397 154 cases9:cases7
## 432 cases9 cases9 6.889717 153 cases9:cases9
## uncond_mean_min_depth
## 372 5.783793
## 387 9.635547
## 1512 10.285557
## 342 11.344975
## 402 9.364542
## 432 4.177340
plot_min_depth_interactions(interactions_frame)
Interactions are ordered by decreasing number of occurrences; the most frequent one, cases9:cases5, is also the one with the minimal mean conditional minimal depth. Remarkably, the unconditional mean minimal depth of cases5 in the forest is almost equal to its mean minimal depth across maximal subtrees with cases9 as the root variable.
Finally, we will look at partial effects with partial dependence plots (PDP). A partial dependence plot can show whether the relationship between the target and a feature is linear, monotonic, or more complex. However, the assumption of independence is the biggest issue with PD plots: it is assumed that the features for which the partial dependence is computed are not correlated with the other features. When that holds, the PDPs perfectly represent how a feature influences the prediction on average.
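Formally (a sketch in standard PDP notation, where \(x_{-j}^{(i)}\) denotes the observed values of all features other than \(j\) for observation \(i\)), the partial dependence of the prediction function \(f\) on feature \(j\) is estimated by
\[ \hat{f}_{j}(x)=\frac{1}{n} \sum_{i=1}^{n} f\left(x, x_{-j}^{(i)}\right), \]
which is the average that partialPlot() computes over the training data in the regression case.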
The following plots include the top three mobility features for each day of the 7-day horizon.
# Partial-dependence plots of the top three mobility lags for each horizon h = 1,...,7.
# Each horizon uses its own fitted forest and stored training set.
for (h in 1:horizon) {
  imp <- importance(fit_list[[h]])
  impvar <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]
  mobvar <- impvar[grep("mob", impvar)]  # mobility lags, most important first
  op <- par(mfrow = c(1, 3))
  for (i in 1:3) {
    partialPlot(fit_list[[h]], xtrain_list[[h]], mobvar[i], xlab = mobvar[i],
                main = paste("P.Dep. on", mobvar[i], "for h =", h),
                col = colors()[20 + i * i], lwd = 3)
  }
  par(op)
}
Although mobility would be affected by weather, we do not see any fundamental feature that may be correlated with mobility. This shows how mobility can have a nonlinear and varying relationship with the case numbers. The most important aspect of this relationship is the reverse causality, which explains the negative relations in each prediction. We will return to this issue later.
Instance-level exploration methods help us understand how a model yields a prediction for a particular single observation. Remember, we use the last 2 weeks of information (lag order = 14) to predict the next week (7 days). Hence, our instance is the last two weeks, but we have 7 different predictions, since \(h = 7\).
Now we will use more detailed plots with DALEX, for \(h = 1\):
explain_rf <- DALEX::explain(model = fit_list[[1]],
data = xtrain_list[[1]],
y = ytrain_list[[1]],
label = "Random Forest")
## Preparation of a new explainer is initiated
## -> model label : Random Forest
## -> data : 242 rows 119 cols
## -> data : rownames to data was added ( from 1 to 242 )
## -> target variable : 242 values
## -> predict function : yhat.randomForest will be used ( default )
## -> predicted values : numerical, min = -55.8924 , mean = 1.196471 , max = 118.4095
## -> model_info : package randomForest , ver. 4.6.14 , task regression ( default )
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -125.5139 , mean = -0.1882065 , max = 208.2969
## A new explainer has been created!
The plot shows “variable attributions”, i.e., the decomposition of the model’s prediction into contributions that can be attributed to different explanatory variables.
#Prediction increments for h = 1
xtestrf <- matrix(X_test, 1, 119)
colnames(xtestrf) <- names(X_test)
bd_rf <- predict_parts(explainer = explain_rf,
new_observation = xtestrf,
type = "break_down")
#bd_rf
plot(bd_rf)
Next, we fix the order of the contributions, first by the mob lags and then by the temp lags:
ordr <- paste0("mob", 1:14)
ordr2 <- paste0("temp", 1:14)
bd_rf2 <- predict_parts(explainer = explain_rf,
new_observation = xtestrf,
type = "break_down",
order = ordr)
bd_rf3 <- predict_parts(explainer = explain_rf,
new_observation = xtestrf,
type = "break_down",
order = ordr2)
plot(bd_rf2)
plot(bd_rf3)
bd_rf3 <- predict_parts(explainer = explain_rf,
new_observation = xtestrf,
type = "break_down_interactions")
plot(bd_rf3)
In the presence of interactions, the computed value of the attribution depends on the order of explanatory covariates used in the BD plots. Here we use another approach to address the ordering issue: averaging the value of a variable's attribution over all (or a large number of) possible orderings, i.e., Shapley values (SHAP).
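Schematically (a sketch in break-down notation, where \(\Delta_{j \mid \pi}(x_{*})\) is the attribution of variable \(j\) for instance \(x_{*}\) under ordering \(\pi\), and \(\Pi\) is the set of sampled orderings):
\[ \phi_{j}(x_{*})=\frac{1}{|\Pi|} \sum_{\pi \in \Pi} \Delta_{j \mid \pi}(x_{*}). \]
This average is what predict_parts() with type = "shap" approximates by sampling orderings.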
bd_rf4 <- predict_parts(explainer = explain_rf,
new_observation = xtestrf,
type = "shap")
plot(bd_rf4)
The ceteris-paribus (CP) profile is a method that evaluates the effect of a selected explanatory variable in terms of changes in the model's prediction induced by changes in the variable's values. We will look only at \(h=1\) and the mobility and temperature variables.
temp lags 7 to 9:
bd_rf5 <- predict_profile(explainer = explain_rf,
                          new_observation = xtestrf)
for (i in 7:9){
  print(plot(bd_rf5, variables = ordr2[i]))
}
temp lags 10 to 14:
for (i in 10:14){
  print(plot(bd_rf5, variables = ordr2[i]))
}
mob lags 7 to 9:
bd_rf5 <- predict_profile(explainer = explain_rf,
                          new_observation = xtestrf)
for (i in 7:9){
  print(plot(bd_rf5, variables = ordr[i]))
}
mob lags 10 to 14:
bd_rf6 <- predict_profile(explainer = explain_rf,
                          new_observation = xtestrf)
for (i in 10:14){
  print(plot(bd_rf6, variables = ordr[i]))
}
To assign importance to CP profiles, we can use the concept of profile oscillations. The larger the influence of an explanatory variable on the prediction for a particular instance, the larger the fluctuations of the corresponding CP profile. For a variable that exercises little or no influence on the model's prediction, the profile will be flat or will barely change; its values will be close to the model's prediction for that instance, so the sum of differences between the profile and the prediction, taken across all possible values of the explanatory variable, will be close to zero. The sum can be depicted graphically as the area between the profile and the horizontal line representing the single-instance prediction. For an explanatory variable with a large influence on the prediction, on the other hand, this area will be large.
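In notation (a sketch, assuming the profile is evaluated at the \(k\) unique empirical values \(z_{1}, \ldots, z_{k}\) of variable \(j\)):
\[ vip_{j}^{osc}(x_{*})=\frac{1}{k} \sum_{l=1}^{k}\left|f\left(x_{*}^{j \mid z_{l}}\right)-f\left(x_{*}\right)\right|, \]
where \(x_{*}^{j \mid z_{l}}\) is the instance \(x_{*}\) with variable \(j\) set to \(z_{l}\). This is the quantity computed by type = "oscillations_emp" below.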
bd_rf5 <- predict_parts(explainer = explain_rf,
new_observation = xtestrf,
type = "oscillations_emp")
library(ggplot2)  # ggtitle()
plot(bd_rf5[1:25,]) +
  ggtitle("Ceteris-paribus Oscillations for h = 1",
          "Expectation over empirical distribution (unique values)")
Graphs and tables of COVID-19 data on the number of tests, the percentage of positive results by age and location, and the time taken for test processing are taken from https://covid-19.ontario.ca/data/testing-volumes-and-results#byPHU. Since the data are not downloadable, we did some web scraping to get the data for Toronto. The total test numbers reflect 7-day averages of how many tests were processed by the lab each day. The available information (as of 02-04-2021) starts from May 1, 2020. Here is the original 7-day average data:
library(Matrix)      # band()
library(matrixcalc)  # svd.inverse()
library(readr)       # read_csv()
df <- data.frame(read_csv("~/Dropbox/RNS2/DataON/Rinda/Toronto_test_data.csv"))
test <- matrix(df[,2])
pr <- matrix(df[,3])
# Smoothing with loess()
t = 1:266
loepr <- loess(pr~t, degree=2, span = 0.02)
fitpr <- predict(loepr, t)
loecs <- loess(test~t, degree=2, span = 0.02)
fitcs <- predict(loecs, t)
#Plots
plot(test, type = "h", col = "red", main = "Tests performed (7-day average)",
xlab = "Days")
lines(t, fitcs/1.5, col = "darkgreen", lwd = 2)
plot(pr, type = "h", col = "blue", main = "7-day average PR",
xlab = "Days")
lines(t, fitpr/1.5, col = "darkgreen", lwd = 2)
In order to extract the daily tests, positivity rate (PR), and positive results (dtest, dpr, and dpos, respectively) from the 7-day moving averages, we applied some simple linear algebra.
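In matrix form, with \(y\) the published 7-day averages, \(x\) the unknown daily series, and \(A\) the band matrix with seven ones in each row,
\[ y=\frac{1}{7} A x \quad \Longrightarrow \quad x=7 A^{+} y, \]
where \(A^{+}\) is a generalized (SVD-based) inverse, since \(A\) is not square. The code below implements exactly this.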
# See: https://stats.stackexchange.com/questions/67907/extract-data-points-from-moving-average
# y = (1/7)Ax: y is the 7-day moving-average test data, A is the band matrix, x is what we need
# dimensions
#dim(test)
A <- as.matrix(band(matrix(1, nrow=272, ncol=272), -3, 3)[-c(1:3, 270:272),])
# a bandwidth of 3 on each side of the diagonal, plus the diagonal itself, gives the 7-day window
#head(A)[,1:10]
#tail(A)[, 260:272]
#dim(A)
# A is not square, so we use the SVD-based generalized inverse
Ai <- svd.inverse(A)
#dim(Ai)
# Hence the solution: x = 7 * Ai %*% y
mytest <- 7 * Ai %*% test
mypr <- 7 * Ai %*% pr
# The same date range
dailytest <- ceiling(mytest[-c(1:6), ])
#length(dailytest)
dailypr <- mypr[-c(1:6), ]
dailypr <- ifelse(dailypr < 0, 0.01, dailypr)
#length(dailypr)
dff <- cbind(df, dtest = dailytest, dpr = dailypr)
dff$dpos <- round(dff$dtest*dff$dpr/100, 0)
head(dff, 20)
## Date Total.tests.processed Positive.results dtest dpr dpos
## 1 May 01 2020 2442 9.5 2805 12.539700 352
## 2 May 02 2020 2450 9.4 3524 6.654067 234
## 3 May 03 2020 2460 9.6 3437 8.538682 293
## 4 May 04 2020 2423 9.7 1942 8.413041 163
## 5 May 05 2020 2408 9.6 1607 9.938682 160
## 6 May 06 2020 2561 8.8 2192 7.102785 156
## 7 May 07 2020 2575 8.9 2522 9.113041 230
## 8 May 08 2020 2616 8.7 3092 11.139700 344
## 9 May 09 2020 2664 8.5 3860 5.254067 203
## 10 May 10 2020 2671 8.3 3486 7.138682 249
## 11 May 11 2020 2853 8.1 3216 7.013041 226
## 12 May 12 2020 2776 8.2 1068 10.638682 114
## 13 May 13 2020 2782 8.4 2234 8.502785 190
## 14 May 14 2020 2737 8.2 2207 7.713041 170
## 15 May 15 2020 2705 8.5 2868 13.239700 380
## 16 May 16 2020 2639 9.0 3398 8.754067 297
## 17 May 17 2020 2627 9.2 3402 8.538682 290
## 18 May 18 2020 2240 10.0 507 12.613041 64
## 19 May 19 2020 2317 10.1 1607 11.338682 182
## 20 May 20 2020 2212 10.4 1499 10.602785 159
Here are the same graphs with daily numbers:
test <- dff[,4]
pr <- dff[,5]
# Smoothing with loess()
t = 1:266
loepr <- loess(pr~t, degree=2, span = 0.02)
fitpr <- predict(loepr, t)
loecs <- loess(test~t, degree=2, span = 0.02)
fitcs <- predict(loecs, t)
#Plots
plot(test, type = "h", col = "red", main = "Daily tests performed",
xlab = "Days")
lines(t, fitcs/1.5, col = "darkgreen", lwd = 2)
plot(pr, type = "h", col = "blue", main = "Daily PR",
xlab = "Days")
lines(t, fitpr/1.5, col = "darkgreen", lwd = 2)
While it was not obvious when the data were smoothed with 7-day moving averages, the daily test data now show clear drops on Victoria Day (May 18), Thanksgiving (Oct 12), Christmas (Dec 25), and New Year's Day (Jan 1, 2021). We know that lower test numbers translate into lower case numbers, and this sort of bias in the case numbers is well reported in the literature. That's why using PR is important. The positive test results calculated as (tests x PR) are very different from the officially reported case numbers for Toronto by the province. The difference could be the number of false positives; that is, the numbers here reflect positive results rather than confirmed cases. Here is the relationship between positive results and test numbers in three stages: all days, days 1 to 120, and days 121 to 266.
# All
cor(dff[,c(4,6)])
## dtest dpos
## dtest 1.0000000 0.7339496
## dpos 0.7339496 1.0000000
# First
cor(dff[1:120,c(4,6)])
## dtest dpos
## dtest 1.0000000 -0.5074416
## dpos -0.5074416 1.0000000
# Last part
cor(dff[121:266,c(4,6)])
## dtest dpos
## dtest 1.0000000 0.7040209
## dpos 0.7040209 1.0000000
In the first period, the province tested only those with strong symptoms. The more the testing pool was restricted to the most susceptible people, the higher the likelihood of finding positives; hence the higher PR rates at the beginning.
In the next section we will use both metrics, cases and PR, in a panel setting.
We know that boosting gives a “voice” to weak predictors. This is more important when we have autoregressive time series data. Coming soon…
A work by ML-Portal