This Rmd file uses several packages. If you want to replicate the results, please install them first. NOTE: Please click on the Code button to see the code.
Let’s load the packages and the data. We drop a couple of variables that still have a few NA’s, as before.
rm(list = ls())
library(tidyverse)
library(tidyquant)
library(recipes)
library(rsample)
library(knitr)
library(maps)
# Data Cleaning
library(janitor)
# EDA
library(skimr)
library(DataExplorer)
library(DT)
library(corrplot)
library(parameters)
library(smooth)
# ggplot2 Helpers
library(scales)
theme_set(theme_tq())
library(raster)
library(spdplyr)
library(dplyr)
library(RColorBrewer)
# Missing data
library(mice)
library(VIM)
library(lattice)
# MPA/Models
library(tsmp)
library(pracma)
library(betareg)
library(blorr)
library(olsrr)
library(ROCR)
library(dynamac)
library(urca)
setwd("~/Dropbox/RNS2")
load("~/Dropbox/RNS2/data_v3.RData")
data <- data_v3
# Dropping these but later we'll impute
data1 <- data[, -c(16, 35:42, 48)]
# skim(data1)[,1:12]
From Wikipedia: In many applications, including econometrics and biostatistics, a fixed effects model refers to a regression model in which the group means are fixed (non-random), as opposed to a random effects model in which the group means are a random sample from a population. Generally, data can be grouped according to several observed factors. The group means could be modeled as fixed or random effects for each grouping. In a fixed effects model, each group mean is a group-specific fixed quantity.
In panel data analysis the term fixed effects estimator (also known as the within estimator) is used to refer to an estimator for the coefficients in the regression model including those fixed effects (one time-invariant intercept for each subject). Such models assist in controlling for omitted variable bias due to unobserved heterogeneity when this heterogeneity is constant over time.
Regional COVID-19 case time series are the product of regional factors, such as population densities, age distributions, income levels, ethnic minority densities, and so on. Therefore, without controlling for these factors, comparing case data across regions would not reflect the real spatial relationship between neighboring regions.
One conventional method that removes fixed (time-invariant) regional differences is to demean the time series for each region. Consider the linear model for N regions and T time periods:
\[ y_{i t}=X_{i t} \beta+\alpha_{i}+u_{i t} \text { for } t=1, \ldots, T \text { and } i=1, \ldots, N \]
Here \(y_{it}\) denotes the time series \(y_t\) for region \(i\). \(X_{it}\) is the time-variant \(1 \times k\) feature vector (where \(k\) is the number of factors), and \(\alpha_{i}\) is the (observed or unobserved) time-invariant regional effect, such as historical and institutional factors, the local government’s management quality, or the age distribution of the population. Demeaning the variables with the within transformation eliminates the time-invariant factors:
\[ y_{i t}-\bar{y}_{i}=\left(X_{i t}-\bar{X}_{i}\right) \beta+\left(\alpha_{i}-\bar{\alpha}_{i}\right)+\left(u_{i t}-\bar{u}_{i}\right) \Longrightarrow \ddot{y}_{i t}=\ddot{X}_{i t} \beta+\ddot{u}_{i t} \] where,
\[ \bar{y}_{i}=\frac{1}{T} \sum_{t=1}^{T} y_{i t}, \bar{X}_{i}=\frac{1}{T} \sum_{t=1}^{T} X_{i t}, \text { and } \bar{u}_{i}=\frac{1}{T} \sum_{t=1}^{T} u_{i t} \]
Since \(\alpha_{i}\) is constant, \(\bar{\alpha}_{i}=\alpha_{i}\). Hence, the observed and unobserved time-invariant effects are eliminated. This method produces the same results as the Least Squares Dummy Variable (LSDV) model, in which a separate intercept is defined for each region by including a binary variable representing it. The following methods use both versions of the fixed effects estimator.
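As a concrete illustration of the within transformation, here is a minimal sketch in R. It assumes a hypothetical long-format panel data frame called panel with a region identifier CD, an outcome cases, and a single feature x (our data have many more features); it is not the code used below.
library(dplyr)

# Within (demeaning) transformation: subtract each region's mean, which
# wipes out any time-invariant regional effect alpha_i.
panel_demeaned <- panel %>%
  group_by(CD) %>%
  mutate(cases_within = cases - mean(cases, na.rm = TRUE),
         x_within     = x - mean(x, na.rm = TRUE)) %>%
  ungroup()

# The within estimator: the slope on x_within equals the slope on x in
# the LSDV model lm(cases ~ x + factor(CD), data = panel).
fe_within <- lm(cases_within ~ x_within - 1, data = panel_demeaned)
summary(fe_within)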
The significance of each variable on the epi curves with L = 10, after controlling for the Episode date and the regional differences by CD.
data1 <- data1 %>%
dplyr::group_by(CD) %>%
mutate(Lcases = lag(cases,10))
data2 <- data1[, c(48,1:4, 6:(ncol(data1)-1))]
coef <- c()
sign <- c()
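# Regress Lcases on each feature one at a time, together with the 28 CD dummies
# (no intercept) and Episode; row 30 of the coefficient table is that feature.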
for (i in 4:ncol(data2)) {
model <- glm(Lcases~.-1, family = gaussian(link = "identity"),
data = data2[, c(1:3, i)])
results <- summary(model)
coef[i] <- results$coefficients[30, 1]
sign[i] <- results$coefficients[30, 3]
}
tbl <- data.frame(variables = names(data2)[4:ncol(data2)],
Coefficients = coef[4:ncol(data2)],
t = sign[4:ncol(data2)])
tbl$Above2.1 <- ifelse(abs(tbl$t) > 2.1, TRUE, FALSE)
tbl
## variables Coefficients t Above2.1
## 1 mob1 -31.097913011 -15.01588540 TRUE
## 2 mbo2 61.889814016 15.86499564 TRUE
## 3 maleperc -0.905092105 -1.18203654 FALSE
## 4 delay_mean -0.241576853 -4.09741001 TRUE
## 5 CC 0.292555216 0.35040417 FALSE
## 6 NoEpi -3.965002459 -4.38563306 TRUE
## 7 NoInfo -0.042528417 -0.02018483 FALSE
## 8 Unknown -16.327591320 -2.07908352 FALSE
## 9 OB 8.523110350 10.43889347 TRUE
## 10 Travel -14.811752138 -10.40201055 TRUE
## 11 age 0.041153790 3.27523058 TRUE
## 12 max_temperature -0.165630344 -3.43416110 TRUE
## 13 avg_temperature -0.270433628 -4.80596296 TRUE
## 14 min_temperature -0.304251011 -5.50259709 TRUE
## 15 max_humidex -0.168169928 -7.30522993 TRUE
## 16 min_windchill 0.168330425 2.27826684 TRUE
## 17 max_relative_humidity -0.117542026 -4.60738191 TRUE
## 18 avg_hourly_relative_humidity -0.106239141 -5.79965917 TRUE
## 19 avg_relative_humidity -0.133484370 -6.08696399 TRUE
## 20 min_relative_humidity -0.087650477 -5.72081320 TRUE
## 21 max_dew_point -0.263560183 -5.94017142 TRUE
## 22 avg_hourly_dew_point -0.304791228 -6.82444906 TRUE
## 23 avg_dew_point -0.319666008 -7.10103636 TRUE
## 24 min_dew_point -0.321040666 -7.64157383 TRUE
## 25 max_wind_speed 0.085095716 2.89204503 TRUE
## 26 avg_hourly_wind_speed 0.125046507 2.75027731 TRUE
## 27 avg_wind_speed 0.118256161 2.62516141 TRUE
## 28 min_wind_speed 0.071359971 1.14194642 FALSE
## 29 max_wind_gust 0.023679664 2.00787553 FALSE
## 30 wind_gust_dir_10s 0.003646354 0.19374774 FALSE
## 31 heatdegdays 0.054887348 0.84384593 FALSE
## 32 cooldegdays -1.187080426 -9.34472867 TRUE
## 33 growdegdays_5 -0.518973301 -8.00424644 TRUE
## 34 growdegdays_7 -0.621278061 -9.06739270 TRUE
## 35 growdegdays_10 -0.747063066 -9.97795120 TRUE
## 36 daylight 5.723171708 13.41803551 TRUE
## 37 sunrise_f -10.199709000 -14.74224831 TRUE
## 38 sunset_f 12.173790810 11.15140780 TRUE
## 39 min_uv_forecast -0.184825760 -1.36379489 FALSE
## 40 max_uv_forecast -0.070568061 -0.47800386 FALSE
## 41 min_high_temperature_forecast -0.231219100 -4.78405831 TRUE
## 42 max_high_temperature_forecast -0.213141326 -4.42981042 TRUE
## 43 min_low_temperature_forecast -0.256147115 -5.08635019 TRUE
## 44 max_low_temperature_forecast -0.275588159 -5.50191182 TRUE
#tbl[abs(tbl$t) > 2.1, ]
Note that this method, p-hacking, is highly criticized in practice, so a more proper method is shown below.
Using backward stepwise selection, we look for the variables best associated with the outcome, Lcases. Note that, instead of following the demeaning procedure described above, we use the Least Squares Dummy Variable (LSDV) model here.
# dmdata <- demean(data2, select = c(names(data2)[-c(2,3)]), group = "CD")
# dmdata <- dmdata[, 46:90]
# names(dmdata)<- strsplit(names(dmdata),"_within")
# dmdata$CD <- data2$CD
model1 <- lm(Lcases~.-1, data = data2)
lst <- ols_step_backward_p(model1)
lst
##
##
## Elimination Summary
## --------------------------------------------------------------------------------------------------
## Variable Adj.
## Step Removed R-Square R-Square C(p) AIC RMSE
## --------------------------------------------------------------------------------------------------
## 1 max_wind_gust 0.7019 0.6965 17.0000 31385.0433 13.5092
## 2 maleperc 0.7019 0.6966 15.0005 31383.0438 13.5075
## 3 max_uv_forecast 0.7019 0.6966 13.0042 31381.0476 13.5057
## 4 min_low_temperature_forecast 0.7019 0.6967 11.0287 31379.0726 13.5040
## 5 growdegdays_7 0.7019 0.6968 9.0535 31377.0978 13.5023
## 6 max_high_temperature_forecast 0.7019 0.6969 7.0866 31375.1315 13.5006
## 7 sunset_f 0.7018 0.6969 5.1260 31373.1716 13.4989
## 8 min_high_temperature_forecast 0.7018 0.697 3.1694 31371.2158 13.4972
## 9 avg_hourly_dew_point 0.7018 0.6971 1.2118 31369.2590 13.4955
## 10 cooldegdays 0.7018 0.6972 -0.7247 31367.3237 13.4938
## 11 min_windchill 0.7018 0.6972 -2.4652 31365.5880 13.4925
## 12 min_uv_forecast 0.7018 0.6973 -4.2075 31363.8504 13.4912
## 13 max_temperature 0.7018 0.6973 -5.8740 31362.1900 13.4901
## 14 max_dew_point 0.7017 0.6974 -7.3783 31360.6948 13.4892
## 15 avg_hourly_wind_speed 0.7017 0.6974 -8.8211 31359.2620 13.4884
## 16 NoInfo 0.7016 0.6974 -10.1933 31357.9011 13.4878
## 17 max_low_temperature_forecast 0.7016 0.6975 -11.3948 31356.7137 13.4874
## 18 mob1 0.7015 0.6975 -12.5980 31355.5245 13.4871
## --------------------------------------------------------------------------------------------------
We now fit a within-effect regression model with the set of candidate predictors selected above based on the Akaike Information Criterion (see the elimination summary).
# In-sample
rmv <- lst$removed
ind <- which(names(data2) %in% rmv)
final_data <- data2[, -c(ind)]
model2 <- lm(Lcases~.-1, data = final_data)
summary(model2)
##
## Call:
## lm(formula = Lcases ~ . - 1, data = final_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.009 -4.236 -0.359 3.332 178.281
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## CDBrant -675.37018 398.40991 -1.695 0.090125 .
## CDChatham-Kent -666.33690 397.80085 -1.675 0.094006 .
## CDCochrane -669.18305 397.69079 -1.683 0.092520 .
## CDDurham -671.94026 398.83485 -1.685 0.092117 .
## CDElgin -671.82666 398.11198 -1.688 0.091582 .
## CDEssex -651.42214 397.52741 -1.639 0.101361
## CDFrontenac -692.92032 399.62340 -1.734 0.083010 .
## CDGrey -672.56628 398.09945 -1.689 0.091216 .
## CDHaldimand-Norfolk -674.51581 398.41089 -1.693 0.090534 .
## CDHalton -674.70126 398.57364 -1.693 0.090577 .
## CDHamilton -671.95799 398.52080 -1.686 0.091852 .
## CDLambton -666.30004 397.71311 -1.675 0.093951 .
## CDLeeds and Grenville -694.66638 399.87440 -1.737 0.082430 .
## CDMiddlesex -668.54087 398.11677 -1.679 0.093183 .
## CDNiagara -677.00109 398.81108 -1.698 0.089674 .
## CDNorthumberland -685.45345 399.04672 -1.718 0.085927 .
## CDOttawa -683.13509 399.91878 -1.708 0.087683 .
## CDPeel -634.22782 398.58868 -1.591 0.111651
## CDPerth -671.90726 398.12515 -1.688 0.091554 .
## CDPeterborough -684.94149 398.99075 -1.717 0.086117 .
## CDSimcoe -674.99598 398.57760 -1.694 0.090439 .
## CDStormont, Dundas and Glengarry -699.41282 400.19184 -1.748 0.080597 .
## CDSudbury -672.30994 397.89663 -1.690 0.091174 .
## CDThunder Bay -634.43012 395.04568 -1.606 0.108364
## CDToronto -581.82164 398.64061 -1.460 0.144505
## CDWaterloo -666.29134 398.37365 -1.673 0.094502 .
## CDWellington -673.59747 398.41497 -1.691 0.090977 .
## CDYork -658.77580 398.62575 -1.653 0.098491 .
## Episode 0.08786 0.02670 3.291 0.001009 **
## mbo2 39.23039 4.88493 8.031 1.28e-15 ***
## delay_mean 0.18072 0.06658 2.714 0.006668 **
## CC -3.68752 1.13311 -3.254 0.001146 **
## NoEpi -6.15014 1.23746 -4.970 6.99e-07 ***
## Unknown -18.22557 7.45308 -2.445 0.014515 *
## OB -1.87320 1.27386 -1.470 0.141510
## Travel -5.49966 1.86355 -2.951 0.003185 **
## age 0.08063 0.01932 4.174 3.06e-05 ***
## avg_temperature -1.08085 0.36508 -2.961 0.003089 **
## min_temperature 0.23463 0.16749 1.401 0.161325
## max_humidex 0.16758 0.05014 3.342 0.000839 ***
## max_relative_humidity -0.04097 0.04423 -0.926 0.354399
## avg_hourly_relative_humidity -0.10450 0.06793 -1.538 0.124056
## avg_relative_humidity -0.01199 0.08599 -0.139 0.889146
## min_relative_humidity NA NA NA NA
## avg_dew_point 0.55365 0.28960 1.912 0.055977 .
## min_dew_point -0.26610 0.16366 -1.626 0.104049
## max_wind_speed 0.14063 0.09647 1.458 0.144982
## avg_wind_speed -0.27743 0.14879 -1.865 0.062317 .
## min_wind_speed NA NA NA NA
## wind_gust_dir_10s -0.03035 0.02049 -1.481 0.138649
## heatdegdays -0.59275 0.25441 -2.330 0.019862 *
## growdegdays_5 0.30817 0.28970 1.064 0.287504
## growdegdays_10 -0.50916 0.38065 -1.338 0.181108
## daylight -36.04203 5.52394 -6.525 7.70e-11 ***
## sunrise_f -66.23327 9.15408 -7.235 5.58e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.49 on 3839 degrees of freedom
## (280 observations deleted due to missingness)
## Multiple R-squared: 0.742, Adjusted R-squared: 0.7384
## F-statistic: 208.3 on 53 and 3839 DF, p-value: < 2.2e-16
You can see that the fit is not bad (adjusted R-squared of about 0.74). In particular, max_humidex confirms the findings of an earlier study. However, one wonders what happens if we run only a simple model without any frills.
# In-sample
model3 <- lm(Lcases~CD+Episode+mbo2-1, data = final_data)
summary(model3)
##
## Call:
## lm(formula = Lcases ~ CD + Episode + mbo2 - 1, data = final_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.632 -2.929 -0.125 2.184 182.161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## CDBrant -6.134e+02 1.303e+02 -4.706 2.61e-06 ***
## CDChatham-Kent -6.127e+02 1.303e+02 -4.701 2.67e-06 ***
## CDCochrane -6.134e+02 1.303e+02 -4.707 2.60e-06 ***
## CDDurham -6.041e+02 1.304e+02 -4.632 3.74e-06 ***
## CDElgin -6.136e+02 1.303e+02 -4.708 2.59e-06 ***
## CDEssex -6.009e+02 1.304e+02 -4.609 4.17e-06 ***
## CDFrontenac -6.146e+02 1.304e+02 -4.714 2.51e-06 ***
## CDGrey -6.135e+02 1.303e+02 -4.708 2.60e-06 ***
## CDHaldimand-Norfolk -6.109e+02 1.303e+02 -4.688 2.86e-06 ***
## CDHalton -6.110e+02 1.304e+02 -4.685 2.90e-06 ***
## CDHamilton -6.095e+02 1.304e+02 -4.675 3.04e-06 ***
## CDLambton -6.123e+02 1.303e+02 -4.698 2.72e-06 ***
## CDLeeds and Grenville -6.115e+02 1.303e+02 -4.692 2.79e-06 ***
## CDMiddlesex -6.112e+02 1.304e+02 -4.687 2.86e-06 ***
## CDNiagara -6.092e+02 1.304e+02 -4.673 3.07e-06 ***
## CDNorthumberland -6.138e+02 1.304e+02 -4.709 2.58e-06 ***
## CDOttawa -6.010e+02 1.305e+02 -4.607 4.21e-06 ***
## CDPeel -5.702e+02 1.304e+02 -4.373 1.26e-05 ***
## CDPerth -6.130e+02 1.303e+02 -4.705 2.63e-06 ***
## CDPeterborough -6.152e+02 1.304e+02 -4.718 2.46e-06 ***
## CDSimcoe -6.106e+02 1.304e+02 -4.684 2.91e-06 ***
## CDStormont, Dundas and Glengarry -6.128e+02 1.303e+02 -4.702 2.66e-06 ***
## CDSudbury -6.158e+02 1.304e+02 -4.723 2.41e-06 ***
## CDThunder Bay -6.135e+02 1.303e+02 -4.708 2.60e-06 ***
## CDToronto -5.169e+02 1.304e+02 -3.963 7.54e-05 ***
## CDWaterloo -6.061e+02 1.304e+02 -4.649 3.45e-06 ***
## CDWellington -6.114e+02 1.304e+02 -4.690 2.83e-06 ***
## CDYork -5.935e+02 1.304e+02 -4.551 5.50e-06 ***
## Episode 3.242e-02 7.045e-03 4.602 4.32e-06 ***
## mbo2 6.189e+01 3.901e+00 15.865 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.96 on 3862 degrees of freedom
## (280 observations deleted due to missingness)
## Multiple R-squared: 0.7218, Adjusted R-squared: 0.7196
## F-statistic: 334 on 30 and 3862 DF, p-value: < 2.2e-16
Interesting. We stop here, as going further would become a useless data-mining exercise within the limits of parametric models.
We would like to shuffle and split the data into train (80%) and test (20%) sets and run the FEM 100 times, but this type of bootstrapping requires removing the temporal relationship in the data with proper differencing. Demeaning alone would not remove possible autocorrelation or non-stationarity in the data.
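As a quick diagnostic, sketched below on the Toronto series from data1 above, the autocorrelation function and an Augmented Dickey-Fuller test (ur.df() from urca, loaded earlier) would expose this serial dependence and a possible unit root; the lag choices here are illustrative only.
# Daily cases for a single region; na.omit() is only a precaution
y_tor <- na.omit(data1$cases[data1$CD == "Toronto"])

# A slowly decaying ACF indicates strong serial dependence, so the
# observations cannot be shuffled and resampled freely.
acf(y_tor, lag.max = 30)

# ADF test with a drift term; failing to reject the unit root would
# call for differencing before any resampling scheme.
summary(ur.df(y_tor, type = "drift", lags = 7))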
Instead, we use a different method, population-informed nested cross-validation, and assume independence between different regions’ epi curves. This allows us to break the strict temporal ordering, at least across regions. The idea is simple: since we removed the regional fixed differences, we can fit (train) the model on 27 regions and use the remaining region as the out-of-sample (test) set.
df <- data2[complete.cases(data2), ]
dmdata <- demean(df, select = c(names(df)[-c(2,3)]), group = "CD")
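# demean() returns *_between and *_within versions of each selected variable;
# keep only the within (demeaned) columns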
dmdata <- dmdata[, 46:90]
names(dmdata)<- strsplit(names(dmdata),"_within")
dmdata$Episode <- df$Episode
dmdata$CD <- df$CD
cd <- as.character(unique(df$CD))
rmspe <- c()
mae <- c()
mape <- c()
for (i in 1:length(cd)) {
test <- dmdata[dmdata$CD==cd[i], -c(47)]
train <- dmdata[dmdata$CD!=cd[i], -c(47)]
model <- lm(Lcases~., data = train)
lst <- ols_step_backward_p(model)
rmv <- lst$removed
ind_tr <- which(names(train) %in% rmv)
final_tr <- train[, -c(ind_tr)]
ind_tt <- which(names(test) %in% rmv)
final_tt <- test[, -c(ind_tt)]
model <- lm(Lcases~., data = final_tr)
yhat <- predict(model, newdata = final_tt)
rmspe[i] <- sqrt(mean((final_tt$Lcases - yhat)^2))
mae[i] <- mean(abs(final_tt$Lcases - yhat))
mape[i] <- mean(abs(100*(final_tt$Lcases - yhat)/final_tt$Lcases))
}
mean(rmspe)
## [1] 8.721739
mean(mae)
## [1] 6.82111
mean(mape)
## [1] 512.1652
We can now look at these performance metrics for each region:
tbl1 <- aggregate(df$Lcases, by=list(df$CD), FUN=mean)
tbl2 <- aggregate(df$Lcases, by=list(df$CD), FUN=sd)
lst <- tibble("Region" = cd, "RMSPE" = rmspe, "MAE" = mae, "MAPE" = mape, "Avr_Lcases" = tbl1[,2],"sd" = tbl2[,2] )
lst
## # A tibble: 28 x 6
## Region RMSPE MAE MAPE Avr_Lcases sd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Brant 5.44 4.55 2092. 1.06 1.28
## 2 Chatham-Kent 5.21 4.08 431. 1.45 2.85
## 3 Cochrane 4.87 3.95 713. 0.504 0.863
## 4 Durham 7.38 5.38 233. 12.8 10.6
## 5 Elgin 5.02 4.19 744. 0.683 1.08
## 6 Essex 14.8 8.61 228. 14.5 13.8
## 7 Frontenac 5.51 4.60 732. 0.763 1.44
## 8 Grey 4.62 3.74 751. 0.784 1.49
## 9 Haldimand-Norfolk 8.66 5.35 520. 3.12 7.10
## 10 Halton 5.46 4.24 394. 5.83 4.88
## # … with 18 more rows
Small regions with many zero-case days might be an issue. One solution could be to group high-case regions together. Using the same algorithm, we look only at York, Toronto, Peel, Ottawa, Essex, and Durham to see if we can improve the results:
df <- data2[complete.cases(data2), ]
dmdata <- demean(df, select = c(names(df)[-c(2,3)]), group = "CD")
dmdata <- dmdata[, 46:90]
names(dmdata)<- strsplit(names(dmdata),"_within")
dmdata$Episode <- df$Episode
dmdata$CD <- df$CD
cd_high <- c("York", "Toronto", "Peel", "Ottawa", "Essex", "Durham")
ind <- which(dmdata$CD %in% cd_high)
dmdata <- dmdata[ind, ]
cd <- as.character(unique(dmdata$CD))
rmspe <- c()
mae <- c()
mape <- c()
for (i in 1:length(cd)) {
test <- dmdata[dmdata$CD==cd[i], -c(47)]
train <- dmdata[dmdata$CD!=cd[i], -c(47)]
model <- lm(Lcases~., data = train)
lst <- ols_step_backward_p(model)
rmv <- lst$removed
ind_tr <- which(names(train) %in% rmv)
final_tr <- train[, -c(ind_tr)]
ind_tt <- which(names(test) %in% rmv)
final_tt <- test[, -c(ind_tt)]
model <- lm(Lcases~., data = final_tr)
yhat <- predict(model, newdata = final_tt)
rmspe[i] <- sqrt(mean((final_tt$Lcases - yhat)^2))
mae[i] <- mean(abs(final_tt$Lcases - yhat))
mape[i] <- mean(abs(100*(final_tt$Lcases - yhat)/final_tt$Lcases))
}
mean(rmspe)
## [1] 24.3446
mean(mae)
## [1] 19.80646
mean(mape)
## [1] 709.8797
The table:
tbl1 <- aggregate(dmdata$Lcases, by=list(dmdata$CD), FUN=mean)
tbl2 <- aggregate(dmdata$Lcases, by=list(dmdata$CD), FUN=sd)
lst <- tibble("Region" = cd, "RMSPE" = rmspe, "MAE" = mae, "MAPE" = mape, "Avr_Lcases" = tbl1[,2],"sd" = tbl2[,2] )
lst
## # A tibble: 6 x 6
## Region RMSPE MAE MAPE Avr_Lcases sd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Durham 14.4 11.5 770. -4.86e-17 10.6
## 2 Essex 27.6 21.3 682. 1.02e-16 13.8
## 3 Ottawa 16.3 13.4 395. 1.53e-15 14.5
## 4 Peel 16.6 13.0 138. -2.76e-15 25.4
## 5 Toronto 56.0 47.5 95.9 -4.29e-15 64.7
## 6 York 15.2 12.1 2178. 9.47e-16 14.6
It seems that Toronto and Peel improve slightly, but the others get worse. Although we control for a linear trend by including the date in the regressions, all of these applications ignore the temporal ordering within each region.
There are more questions to answer in this parametric framework:
ARDL (autoregressive distributed lag) models are an integral part of estimating these temporal dynamics.
\[ y_{t}=\alpha_{0}+\phi_{1} y_{t-1}+\theta_{1,0} x_{1, t}+\theta_{1,1} x_{1, t-1}+\cdots+\theta_{k, 0} x_{k, t}+\theta_{k, 1} x_{k, t-1}+\beta * T+\epsilon_{t} \]
where \(T\) is the trend. Our earlier examples arbitrarily use cases with L = 10. Although we can make the lag structure in the ARDL a hyperparameter to tune with cross-validation, degrees of freedom would be a problem even with two lags, given that we have more than 45 variables and 149 observations in each region. We will return to this issue later when we apply shrinkage methods in Lasso-type estimations.
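For reference, an ARDL(1,1) with a single regressor and a trend can be estimated for one region with plain lm() and dplyr lags. This is only a sketch with hypothetical names (one_region for a single region’s data and x1 for one feature); it is not the specification estimated elsewhere in this document.
library(dplyr)

# ARDL(1,1): regress y_t on y_{t-1}, x_{1,t}, x_{1,t-1}, and a linear trend.
ardl_df <- one_region %>%
  arrange(Episode) %>%                  # ensure rows are in date order
  mutate(y_lag1  = lag(cases, 1),       # phi_1 * y_{t-1}
         x1_lag1 = lag(x1, 1),          # theta_{1,1} * x_{1,t-1}
         trend   = row_number())        # beta * T

ardl_11 <- lm(cases ~ y_lag1 + x1 + x1_lag1 + trend, data = ardl_df)
summary(ardl_11)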
In general, these changes require more flexible nonparametric models based on self-learning algorithms.
In order to control, at least partly, for the temporal structure of the data, we will use a binary outcome. We transform the daily number of cases into a binary outcome representing “higher” or “lower” case days by comparing the number of cases on any given day with a threshold such as a 3-day moving average of past cases. Hence, the binary outcome on any given day preserves the temporal ordering by carrying information from the past values of cases. This transformation also reduces the issue of zero-case observations.
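In code, this transformation is just a grouped moving average followed by a comparison. Here is a minimal sketch with an illustrative 3-day window, mirroring the chunks below (movavg() is from pracma, loaded above); binary_df is a hypothetical name.
# Binary outcome: 1 if the day's cases exceed the region's 3-day simple
# moving average of cases, 0 otherwise.
binary_df <- data1 %>%
  dplyr::group_by(CD) %>%
  mutate(ccs    = movavg(cases, 3, "s"),
         updown = ifelse(cases > ccs, 1, 0)) %>%
  ungroup()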
We use the demeaned data and apply population-informed nested cross-validation with the identity link (a linear probability model, LPM) instead of the logistic link, which is known for convergence issues in higher dimensions. The algorithm below has three loops: one over the moving-average thresholds (2 to 10 days) used to define the binary outcome (updown), one over the lags of that outcome (2 to 12 days), and one over the 28 regions for the nested cross-validation. The two tables below show the average RMSPE and AUC across all 28 regions, respectively. The columns are the lag lengths (2:12), reflecting the delay in the impact of the features, and the rows are the moving-average thresholds (2:10) used in calculating the binary outcome, “higher” or “lower”.
rm(list = ls())
load("~/Dropbox/RNS2/data_v3.RData")
data <- data_v3
data1 <- data[, -c(16, 35:42, 48)]
df <- data1[complete.cases(data1), ]
dmdata <- demean(df, select = c(names(df)[-c(1,2)]), group = "CD")
dmdata <- dmdata[, 46:90]
names(dmdata)<- strsplit(names(dmdata),"_within")
dmdata$Episode <- df$Episode
dmdata$CD <- df$CD
cd <- as.character(unique(df$CD))
L <- 2:12
th <- 2:10
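# 3-D result arrays: moving-average threshold x lag x region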
rmspe <- array(0, dim = c(length(th), length(L), length(cd)))
auc <- array(0, dim = c(length(th), length(L), length(cd)))
phatt <- vector(mode = "list", length = length(cd))
for (g in 1:length(th)) {
data2 <- dmdata %>%
dplyr::group_by(CD) %>%
mutate(ccs = movavg(cases, th[g], "s")) %>%
mutate(updown = ifelse(cases > ccs, 1, 0)) %>%
ungroup()
data2 <- data2[, c(49,1,2,4:47)]
names(data2)[1] <- "updown"
for (i in 1:length(L)) {
data3 <- data2 %>%
dplyr::group_by(CD) %>%
mutate(Lcases = lag(updown,L[i])) %>%
ungroup()
dm <- data3[, c(48, 2:(ncol(data2)))]
dm <- dm[complete.cases(dm), ]
for (j in 1:length(cd)) {
test <- dm[dm$CD == cd[j], -c(47)]
train <- dm[dm$CD != cd[j], -c(47)]
model <- lm(Lcases~., data = train)
lst <- ols_step_backward_p(model)
rmv <- lst$removed
ind_tr <- which(names(train) %in% rmv)
final_tr <- train[, -c(ind_tr)]
ind_tt <- which(names(test) %in% rmv)
final_tt <- test[, -c(ind_tt)]
model <- lm(Lcases~., data = final_tr)
yhat <- predict(model, newdata = final_tt)
rmspe[g,i,j] <- sqrt(mean((final_tt$Lcases - yhat)^2))
#AUC
phat <- rescale(yhat, to = c(0, 1),
from = range(yhat, na.rm = TRUE, finite = TRUE))
phatt[[j]] <- phat
final_tt$Lcases <- as.factor(final_tt$Lcases)
pred_rocr <- prediction(phat, final_tt$Lcases)
auc_ROCR <- performance(pred_rocr, measure = "auc")
auc[g,i,j] <- auc_ROCR@y.values[[1]]
}
}
}
apply(rmspe, c(1,2), mean)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.4682240 0.4709323 0.4703193 0.4700835 0.4698089 0.4711440 0.4710007
## [2,] 0.4722933 0.4742418 0.4749785 0.4756441 0.4760400 0.4735963 0.4741240
## [3,] 0.4748892 0.4766434 0.4779671 0.4789133 0.4805583 0.4778534 0.4772796
## [4,] 0.4732048 0.4752383 0.4764926 0.4783440 0.4773594 0.4761430 0.4768071
## [5,] 0.4717477 0.4732924 0.4745162 0.4755444 0.4762719 0.4742910 0.4740048
## [6,] 0.4711055 0.4734736 0.4733106 0.4743040 0.4742714 0.4724791 0.4734876
## [7,] 0.4699191 0.4710549 0.4742402 0.4736760 0.4730010 0.4724961 0.4727392
## [8,] 0.4686641 0.4705513 0.4716291 0.4724631 0.4724973 0.4728210 0.4717215
## [9,] 0.4674746 0.4700038 0.4708553 0.4705048 0.4709208 0.4715389 0.4706430
## [,8] [,9] [,10] [,11]
## [1,] 0.4713788 0.4736232 0.4712888 0.4732521
## [2,] 0.4770715 0.4797079 0.4774783 0.4791801
## [3,] 0.4799913 0.4817548 0.4809400 0.4819120
## [4,] 0.4794575 0.4794073 0.4788816 0.4803674
## [5,] 0.4784570 0.4793299 0.4780075 0.4801720
## [6,] 0.4765846 0.4779030 0.4756931 0.4795699
## [7,] 0.4761530 0.4761459 0.4748530 0.4782480
## [8,] 0.4748603 0.4770412 0.4752151 0.4775535
## [9,] 0.4735703 0.4755578 0.4731407 0.4769946
apply(auc, c(1,2), mean)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.5967591 0.5785727 0.5854778 0.5936768 0.5890975 0.5796719 0.5821880
## [2,] 0.6024668 0.5901346 0.5859261 0.5829437 0.5759581 0.6030509 0.5971795
## [3,] 0.6085598 0.5934690 0.5874633 0.5792877 0.5675101 0.5937972 0.5951245
## [4,] 0.6264259 0.6083358 0.6056145 0.5846023 0.5896532 0.6143652 0.6044428
## [5,] 0.6375015 0.6237740 0.6231649 0.6125851 0.6090355 0.6306310 0.6265243
## [6,] 0.6391461 0.6232214 0.6288686 0.6202509 0.6245416 0.6426002 0.6301978
## [7,] 0.6429597 0.6353104 0.6309244 0.6269067 0.6313385 0.6433604 0.6336460
## [8,] 0.6492089 0.6349513 0.6429202 0.6337613 0.6317745 0.6377334 0.6374576
## [9,] 0.6546867 0.6398631 0.6496245 0.6436606 0.6416356 0.6461089 0.6460223
## [,8] [,9] [,10] [,11]
## [1,] 0.5799774 0.5624412 0.5826161 0.5636027
## [2,] 0.5769669 0.5649006 0.5750536 0.5661074
## [3,] 0.5774746 0.5700230 0.5749558 0.5707374
## [4,] 0.5849559 0.5881230 0.5892225 0.5859425
## [5,] 0.5959054 0.5980193 0.6012773 0.5922324
## [6,] 0.6051109 0.6086204 0.6162124 0.6014687
## [7,] 0.6073724 0.6168784 0.6211139 0.6070396
## [8,] 0.6124012 0.6130945 0.6172655 0.6087519
## [9,] 0.6254062 0.6239118 0.6298693 0.6146243
The results of this parametric estimation can be used as a benchmark to evaluate more complex, nonparametric, and trainable algorithms that we will try later.
Let’s see the maximum AUC values obtained from predicting each region.
muac <- c()
for (k in 1: length(cd)) {
muac[k] <- max(auc[,,k])
}
data.frame("CD"= cd, "AUC" = muac)
## CD AUC
## 1 Brant 0.6143158
## 2 Chatham-Kent 0.6435281
## 3 Cochrane 0.7894466
## 4 Durham 0.7110733
## 5 Elgin 0.6604167
## 6 Essex 0.6753694
## 7 Frontenac 0.7111818
## 8 Grey 0.6853685
## 9 Haldimand-Norfolk 0.7046703
## 10 Halton 0.6213786
## 11 Hamilton 0.6546154
## 12 Lambton 0.6847662
## 13 Leeds and Grenville 0.7900268
## 14 Middlesex 0.6654135
## 15 Niagara 0.6724266
## 16 Northumberland 0.6660018
## 17 Ottawa 0.6571484
## 18 Peel 0.7676056
## 19 Perth 0.6836945
## 20 Peterborough 0.7379791
## 21 Simcoe 0.6700721
## 22 Stormont, Dundas and Glengarry 0.6365990
## 23 Sudbury 0.7751065
## 24 Thunder Bay 0.6943115
## 25 Toronto 0.8217409
## 26 Waterloo 0.7505962
## 27 Wellington 0.6667314
## 28 York 0.7634615
Let’s look at AUC for Toronto in detail:
auc[,,25]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.5118606 0.5541784 0.5464762 0.5555556 0.5466118 0.5625994 0.5958048
## [2,] 0.6170686 0.6734962 0.6384439 0.6258937 0.6688210 0.6728935 0.7037112
## [3,] 0.6443414 0.6945489 0.7198703 0.6844444 0.6535448 0.6707075 0.7222670
## [4,] 0.6518023 0.7353661 0.7253629 0.6635062 0.7129412 0.7558209 0.7230738
## [5,] 0.7203704 0.7747748 0.7268748 0.7248698 0.7511737 0.7642857 0.7551308
## [6,] 0.7456498 0.7890786 0.7437215 0.7314815 0.7748435 0.8049990 0.7645875
## [7,] 0.7550000 0.8038664 0.7341073 0.7383755 0.8080986 0.8026185 0.7712274
## [8,] 0.7575251 0.8051948 0.7740642 0.7517415 0.8123529 0.8217409 0.7874698
## [9,] 0.7848101 0.8075038 0.7801757 0.7757353 0.8115686 0.8120032 0.7921031
## [,8] [,9] [,10] [,11]
## [1,] 0.5344510 0.5124533 0.5147306 0.5093897
## [2,] 0.6013085 0.6214760 0.5915493 0.5897655
## [3,] 0.5853609 0.6183665 0.6344335 0.6840085
## [4,] 0.6141893 0.6927861 0.6745848 0.6810235
## [5,] 0.6569388 0.7287785 0.6930672 0.6953092
## [6,] 0.6844254 0.7539354 0.7126340 0.7274861
## [7,] 0.6885079 0.7315659 0.7262981 0.7445583
## [8,] 0.7252859 0.7435791 0.7331933 0.7495737
## [9,] 0.7410131 0.7665700 0.7573529 0.7559676
It seems that L = 7 and th = 9 (both grids start at 2) are the optimal parameters. Using them, we get the ROC curve and the confusion matrix:
L <- 7
th <- 9
data2 <- dmdata %>%
dplyr::group_by(CD) %>%
mutate(ccs = movavg(cases, th, "s")) %>%
mutate(updown = ifelse(cases > ccs, 1, 0)) %>%
ungroup()
data2 <- data2[, c(49,1,2,4:47)]
names(data2)[1] <- "updown"
data3 <- data2 %>%
dplyr::group_by(CD) %>%
mutate(Lcases = lag(updown,L)) %>%
ungroup()
dm <- data3[, c(48, 2:(ncol(data2)))]
dm <- dm[complete.cases(dm), ]
test <- dm[dm$CD == "Toronto", -c(47)]
train <- dm[dm$CD != "Toronto", -c(47)]
model <- lm(Lcases~., data = train)
lst <- ols_step_backward_p(model)
rmv <- lst$removed
ind_tr <- which(names(train) %in% rmv)
final_tr <- train[, -c(ind_tr)]
ind_tt <- which(names(test) %in% rmv)
final_tt <- test[, -c(ind_tt)]
model <- lm(Lcases~., data = final_tr)
yhat <- predict(model, newdata = final_tt)
#AUC
phat <- rescale(yhat, to = c(0, 1),
from = range(yhat, na.rm = TRUE, finite = TRUE))
final_tt$Lcases <- as.factor(final_tt$Lcases)
pred_rocr <- prediction(phat, final_tt$Lcases)
auc_ROCR <- performance(pred_rocr, measure = "auc")
auc <- auc_ROCR@y.values[[1]]
auc
## [1] 0.8217409
# ROC: treat each fitted probability as a cutoff and record TPR and FPR
TPR <- c()
FPR <- c()
phato <- phat[order(phat)]
for (i in 1:length(phato)) {
yHat <- phat > phato[i]
conf_table <- table(yHat, test$Lcases)
ct <- as.matrix(conf_table) #No need to change the table for calculations
if(sum(dim(ct))>3){ #here we ignore the thresholds 0 and 1
TPR[i] <- ct[2,2]/(ct[2,2]+ct[1,2])
FPR[i] <- ct[2,1]/(ct[1,1]+ct[2,1])
}
}
plot(FPR, TPR, col= "blue", type = "l", main = "ROC", lwd = 2)
abline(a = 0, b = 1, col="red")
The best discriminating threshold:
# Youden's J statistic (TPR - FPR)
J <- TPR - FPR
# The best discriminating threshold
phat[which.max(J)]
## 59
## 0.6513276
The confusion matrix:
yHat <- phat > 0.5
conf_table <- table(yHat, test$Lcases)
ctt <- as.matrix(conf_table)
ct <- matrix(0, 2, 2)
ct[1,1] <- ctt[2,2]
ct[2,2] <- ctt[1,1]
ct[1,2] <- ctt[2,1]
ct[2,1] <- ctt[1,2]
rownames(ct) <- c("Yhat = 1", "Yhat = 0")
colnames(ct) <- c("Y = 1", "Y = 0")
ct
## Y = 1 Y = 0
## Yhat = 1 59 26
## Yhat = 0 9 48
Note that this is a spatial forecast, not a temporal one. We did not forecast the future epi curves by looking at their past values. We could also incorporate temporal prediction by using, for example, only the first 80 days in 27 regions to fit a model and then using it to predict the remaining future days for the holdout region. We could even make it an h-step-ahead cross-validation. However, the objective of the parametric models estimated here is to reveal the underlying relationship between weather conditions and epi curves after controlling for the regional fixed differences, rather than to forecast.
The script above is very primitive and takes a very long time to run. We will use trainable nonparametric models for forecasting with proper cross-validation procedures later.
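For reference, here is a sketch of the temporal split described above, applied to the dm data frame built earlier, holding out Toronto and using an illustrative 80-day cutoff; it assumes the rows are in date order within each region.
library(dplyr)

# Day counter within each region (rows assumed to be in date order)
dm_t <- dm %>%
  group_by(CD) %>%
  mutate(day = row_number()) %>%
  ungroup()

# Fit on the first 80 days of the 27 training regions ...
train_t <- dm_t[dm_t$CD != "Toronto" & dm_t$day <= 80, ]
# ... and predict the days after day 80 for the held-out region.
test_t  <- dm_t[dm_t$CD == "Toronto" & dm_t$day > 80, ]

model_t <- lm(Lcases ~ ., data = train_t[, !(names(train_t) %in% c("CD", "day"))])
yhat_t  <- predict(model_t, newdata = test_t)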
Here is the spatial model for Toronto:
summary(model)
##
## Call:
## lm(formula = Lcases ~ ., data = final_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7782 -0.3624 -0.2558 0.5121 0.9249
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.2414530 17.4486546 1.332 0.18294
## mob1 -0.7085440 0.0869812 -8.146 5.05e-16 ***
## CC 0.1178983 0.0393861 2.993 0.00278 **
## NoEpi 0.0929027 0.0432939 2.146 0.03195 *
## NoInfo 0.2330864 0.0766623 3.040 0.00238 **
## OB 0.1462883 0.0459324 3.185 0.00146 **
## Travel 0.1005931 0.0602388 1.670 0.09502 .
## age -0.0010515 0.0007159 -1.469 0.14194
## max_temperature 2.2397828 1.0626811 2.108 0.03512 *
## avg_temperature -4.4693621 2.1254503 -2.103 0.03555 *
## min_temperature 2.2445663 1.0626368 2.112 0.03473 *
## min_windchill -0.0054772 0.0049233 -1.112 0.26600
## max_relative_humidity -0.0018308 0.0015804 -1.158 0.24677
## avg_hourly_relative_humidity 0.0059636 0.0028395 2.100 0.03577 *
## avg_relative_humidity -0.0007753 0.0030205 -0.257 0.79743
## min_relative_humidity NA NA NA NA
## max_dew_point 0.0169388 0.0064897 2.610 0.00909 **
## avg_hourly_dew_point -0.0370487 0.0129050 -2.871 0.00412 **
## min_dew_point 0.0059994 0.0056853 1.055 0.29138
## max_wind_speed -0.0026939 0.0034249 -0.787 0.43159
## avg_hourly_wind_speed -0.0131564 0.0045095 -2.917 0.00355 **
## avg_wind_speed 0.0136806 0.0069246 1.976 0.04827 *
## min_wind_speed NA NA NA NA
## wind_gust_dir_10s -0.0013233 0.0007389 -1.791 0.07337 .
## cooldegdays -0.0322999 0.0086138 -3.750 0.00018 ***
## growdegdays_10 0.0154970 0.0088212 1.757 0.07903 .
## daylight -0.5290311 0.1329438 -3.979 7.04e-05 ***
## sunset_f 1.0771552 0.3298217 3.266 0.00110 **
## max_uv_forecast 0.0265086 0.0065461 4.050 5.23e-05 ***
## min_low_temperature_forecast -0.0033548 0.0029160 -1.150 0.25002
## Episode -0.0012456 0.0009491 -1.312 0.18944
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4695 on 3805 degrees of freedom
## Multiple R-squared: 0.06966, Adjusted R-squared: 0.06282
## F-statistic: 10.18 on 28 and 3805 DF, p-value: < 2.2e-16
Although temporal prediction (forecasting) requires a more careful data setup that preserves the temporal order, we will try the same algorithm with some changes: we will use only 11 high-case regions (Durham, Essex, Halton, Hamilton, Niagara, Ottawa, Peel, Simcoe, Waterloo, Toronto, and York) with a 10-day sliding window to predict the next 2 days. We also look only at the first half of the epi curves, until the decline is observed. There should be a better segmentation of epi curves than this, but for now we look at the first half of the curve, where it rises. The other issue is that the data are not stationary. Although our forecast horizon is very short (2 days), we should apply all the conventional steps before forecasting. We will take care of this later.
rm(list = ls())
load("~/Dropbox/RNS2/data_v3.RData")
data <- data_v3
data1 <- data[, -c(16, 35:42, 48)]
df <- data1[complete.cases(data1), ]
# dmdata <- demean(df, select = c(names(df)[-c(1,2)]), group = "CD")
# dmdata <- dmdata[, 46:90]
# names(dmdata)<- strsplit(names(dmdata),"_within")
#
# dmdata$Episode <- df$Episode
# dmdata$CD <- df$CD
cd <- as.character(unique(df$CD))
cds <- as.character(unique(df$CD))[c(4,6,10,11,15,17,18,21,25,26,28)]
L <- 2:12
th <- 2:10
mrmspe <- matrix(0, length(th), length(L))
mauc <- matrix(0, length(th), length(L))
for (g in 1:length(th)) {
data2 <- data1 %>%
dplyr::group_by(CD) %>%
mutate(ccs = movavg(cases, th[g], "s")) %>%
mutate(updown = ifelse(cases > ccs, 1, 0)) %>%
ungroup()
data2 <- data2[, c(49,1:4,6:47)]
names(data2)[1] <- "updown"
for (i in 1:length(L)) {
data3 <- data2 %>%
dplyr::group_by(CD) %>%
mutate(Lcases = lag(updown,L[i])) %>%
ungroup()
dm <- data3[, c(48, 2:(ncol(data2)))]
dm <- dm[complete.cases(dm), ]
rmspe <- c()
auc <- c()
#sliding window
step = 10
# Number of sliding-window starts per region (observations per region minus 60)
n = (nrow(dm)/length(cd))-60
for(k in 1:n) {
train <- matrix(0, 0, ncol(dm))
test <- matrix(0, 0, ncol(dm))
for (j in 1:length(cds)) {
arr <- dm[dm$CD == cds[j], ]
s <- k - 1 + step
tr <- arr[k:s,]
train <- rbind(train, tr)
tt <- arr[c((s+1),(s+2)),]
test <- rbind(test, tt)
}
model <- lm(Lcases~.-1, data = train)
lst <- ols_step_backward_p(model)
rmv <- lst$removed
ind_tr <- which(names(train) %in% rmv)
final_tr <- train[, -c(ind_tr)]
ind_tt <- which(names(test) %in% rmv)
final_tt <- test[, -c(ind_tt)]
model <- lm(Lcases~.-1, data = final_tr)
yhat <- predict(model, newdata = final_tt)
rmspe[k] <- sqrt(mean((final_tt$Lcases - yhat)^2))
#AUC
phat <- rescale(yhat, to = c(0, 1),
from = range(yhat, na.rm = TRUE, finite = TRUE))
final_tt$Lcases <- as.factor(final_tt$Lcases)
pred_rocr <- prediction(phat, final_tt$Lcases)
auc_ROCR <- performance(pred_rocr, measure = "auc")
auc[k] <- auc_ROCR@y.values[[1]]
}
mrmspe[g, i] <- mean(rmspe)
mauc[g, i] <- mean(auc)
}
}
mauc
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.5118872 0.5396695 0.4944077 0.5529247 0.5178809 0.5012637 0.5163914
## [2,] 0.5299799 0.4963428 0.4993872 0.5299924 0.4934876 0.5244277 0.4961056
## [3,] 0.5206529 0.4952314 0.4896386 0.5032291 0.4871756 0.5134877 0.4882870
## [4,] 0.5290461 0.5044342 0.5115684 0.5203528 0.4947340 0.5086228 0.5223935
## [5,] 0.5063811 0.5091636 0.5124110 0.5234754 0.5119877 0.5151747 0.5114066
## [6,] 0.5269921 0.5112450 0.5202445 0.5298808 0.5152335 0.5232409 0.5140756
## [7,] 0.5316238 0.5012354 0.5231277 0.5458470 0.5181580 0.5232384 0.5236800
## [8,] 0.5434355 0.5119887 0.5268718 0.5161558 0.5117887 0.5085543 0.5307354
## [9,] 0.5573029 0.4985612 0.5280653 0.5457150 0.5417617 0.5355803 0.5186650
## [,8] [,9] [,10] [,11]
## [1,] 0.5111290 0.5406291 0.5205796 0.4834508
## [2,] 0.4986134 0.4980452 0.5221568 0.5066070
## [3,] 0.4982964 0.4977495 0.5301191 0.5141146
## [4,] 0.5043868 0.5010629 0.5136011 0.5173384
## [5,] 0.5064905 0.5073739 0.5004918 0.5250408
## [6,] 0.5141413 0.5319692 0.5225467 0.5357547
## [7,] 0.5090316 0.5096228 0.5110844 0.5264995
## [8,] 0.5093734 0.5073274 0.5336256 0.5366277
## [9,] 0.5162832 0.5278893 0.5413386 0.5471153
This table of AUC measures (rows: th, columns: L) confirms that temporal prediction is nearly impossible, at least with simple parametric models. Although the main issue may be the use of case numbers, which cannot reveal the true number of infected people, more appropriate and trainable models based on a nonparametric structure and a proper time series segmentation may capture the underlying relationship between case numbers and environmental conditions. Here is a quote from Forecasting COVID-19 by Hyndman:
(. . .) While we have a good understanding of how it works in terms of person-to-person infections, we have limited and misleading data. The current numbers of confirmed cases are known to be vastly underestimated due to the limited testing available. There are almost certainly many more cases of COVID-19 that have not been diagnosed than those that have. Also, the level of under-estimation varies enormously between countries. In a country like South Korea with a lot of testing, the numbers of confirmed cases are going to be closer to the numbers of actual cases than in the US where there has been much less testing. So we simply cannot easily model the spread of the pandemic using the data that is available (. . .)
That’s why we asked NSHA for the test (811) numbers in addition to the case numbers!