This Rmd file uses several packages. If you want to replicate the results, please install them first. NOTE: Please click on the Code button to see the code.
Let’s load the packages and the data. We drop a couple of variables that still have a few NA’s, as before.
rm(list = ls())
library(tidyverse)
library(tidyquant)
library(recipes)
library(rsample)
library(knitr)
library(maps)
# Data Cleaning
library(janitor)
# EDA
library(skimr)
library(DataExplorer)
library(DT)
library(corrplot)
library(parameters)
library(smooth)
# ggplot2 Helpers
library(scales)
theme_set(theme_tq())
library(raster)
library(spdplyr)
library(dplyr)
library(RColorBrewer)
# Missing data
library(mice)
library(VIM)
library(lattice)
# MPA/Models
library(tsmp)
library(pracma)
library(betareg)
library(blorr)
library(olsrr)
library(ROCR)
library(dynamac)
library(urca)
setwd("~/Dropbox/RNS2")
load("~/Dropbox/RNS2/data_v3.RData")
data <- data_v3
# Dropping these but later we'll impute
data1 <- data[, -c(16, 35:42, 48)]
# skim(data1)[,1:12]
From Wikipedia: In many applications, including econometrics and biostatistics, a fixed effects model refers to a regression model in which the group means are fixed (non-random), as opposed to a random effects model in which the group means are a random sample from a population. Generally, data can be grouped according to several observed factors. The group means could be modeled as fixed or random effects for each grouping. In a fixed effects model, each group mean is a group-specific fixed quantity.
In panel data analysis the term fixed effects estimator (also known as the within estimator) is used to refer to an estimator for the coefficients in the regression model including those fixed effects (one time-invariant intercept for each subject). Such models assist in controlling for omitted variable bias due to unobserved heterogeneity when this heterogeneity is constant over time.
Regional COVID-19 case time series are the product of regional factors, such as population densities, age distributions, income levels, ethnic minority densities, and so on. Therefore, without controlling for these factors, comparing case data across regions would not reflect the real spatial relationship between neighboring regions.
One conventional method that removes fixed (time-invariant) regional differences is to demean the time series for each region. Consider the linear model for N regions and T time periods:
\[ y_{i t}=X_{i t} \beta+\alpha_{i}+u_{i t} \text { for } t=1, \ldots, T \text { and } i=1, \ldots, N \]
Here \(y_{it}\) denotes the time series \(y_t\) for region \(i\). \(X_{it}\) is the time-variant \(1 \times k\) feature vector (where \(k\) is the number of factors), and \(\alpha_{i}\) is the (observed or unobserved) time-invariant regional effect, such as historical and institutional factors, the local government’s management quality, or the age distribution of the population. Demeaning the variables with the within transformation eliminates the time-invariant factors:
\[ y_{i t}-\bar{y}_{i}=\left(X_{i t}-\bar{X}_{i}\right) \beta+\left(\alpha_{i}-\bar{\alpha}_{i}\right)+\left(u_{i t}-\bar{u}_{i}\right) \Longrightarrow \ddot{y}_{i t}=\ddot{X}_{i t} \beta+\ddot{u}_{i t} \] where,
\[ \bar{y}_{i}=\frac{1}{T} \sum_{t=1}^{T} y_{i t}, \bar{X}_{i}=\frac{1}{T} \sum_{t=1}^{T} X_{i t}, \text { and } \bar{u}_{i}=\frac{1}{T} \sum_{t=1}^{T} u_{i t} \]
Since \(\alpha_{i}\) is constant, \(\bar{\alpha}_{i}=\alpha_{i}\). Hence, the observed and unobserved time-invariant effects are eliminated. This method produces the same results as the Least Squares Dummy Variable (LSDV) model, in which a separate intercept is defined for each region by including a binary variable representing it. The following methods use both versions of the fixed effects estimator.
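As a concrete illustration of the within transformation, here is a minimal sketch in R. It assumes a hypothetical long-format panel data frame called panel with a region identifier CD, an outcome cases, and a single feature x (our data have many more features); it is not the code used below.
library(dplyr)

# Within (demeaning) transformation: subtract each region's mean, which
# wipes out any time-invariant regional effect alpha_i.
panel_demeaned <- panel %>%
  group_by(CD) %>%
  mutate(cases_within = cases - mean(cases, na.rm = TRUE),
         x_within     = x - mean(x, na.rm = TRUE)) %>%
  ungroup()

# The within estimator: the slope on x_within equals the slope on x in
# the LSDV model lm(cases ~ x + factor(CD), data = panel).
fe_within <- lm(cases_within ~ x_within - 1, data = panel_demeaned)
summary(fe_within)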
The significance of each variable on the epi curves with L = 10, after controlling for the Episode date and the regional differences by CD.
data1 <- data1 %>%
dplyr::group_by(CD) %>%
mutate(Lcases = lag(cases,10))
data2 <- data1[, c(48,1:4, 6:(ncol(data1)-1))]
coef <- c()
sign <- c()
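# Regress Lcases on each feature one at a time, together with the 28 CD dummies
# (no intercept) and Episode; row 30 of the coefficient table is that feature.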
for (i in 4:ncol(data2)) {
model <- glm(Lcases~.-1, family = gaussian(link = "identity"),
data = data2[, c(1:3, i)])
results <- summary(model)
coef[i] <- results$coefficients[30, 1]
sign[i] <- results$coefficients[30, 3]
}
tbl <- data.frame(variables = names(data2)[4:ncol(data2)],
Coefficients = coef[4:ncol(data2)],
t = sign[4:ncol(data2)])
tbl$Above2.1 <- ifelse(abs(tbl$t) > 2.1, TRUE, FALSE)
tbl
## variables Coefficients t Above2.1
## 1 mob1 -31.097913011 -15.01588540 TRUE
## 2 mbo2 61.889814016 15.86499564 TRUE
## 3 maleperc -0.905092105 -1.18203654 FALSE
## 4 delay_mean -0.241576853 -4.09741001 TRUE
## 5 CC 0.292555216 0.35040417 FALSE
## 6 NoEpi -3.965002459 -4.38563306 TRUE
## 7 NoInfo -0.042528417 -0.02018483 FALSE
## 8 Unknown -16.327591320 -2.07908352 FALSE
## 9 OB 8.523110350 10.43889347 TRUE
## 10 Travel -14.811752138 -10.40201055 TRUE
## 11 age 0.041153790 3.27523058 TRUE
## 12 max_temperature -0.165630344 -3.43416110 TRUE
## 13 avg_temperature -0.270433628 -4.80596296 TRUE
## 14 min_temperature -0.304251011 -5.50259709 TRUE
## 15 max_humidex -0.168169928 -7.30522993 TRUE
## 16 min_windchill 0.168330425 2.27826684 TRUE
## 17 max_relative_humidity -0.117542026 -4.60738191 TRUE
## 18 avg_hourly_relative_humidity -0.106239141 -5.79965917 TRUE
## 19 avg_relative_humidity -0.133484370 -6.08696399 TRUE
## 20 min_relative_humidity -0.087650477 -5.72081320 TRUE
## 21 max_dew_point -0.263560183 -5.94017142 TRUE
## 22 avg_hourly_dew_point -0.304791228 -6.82444906 TRUE
## 23 avg_dew_point -0.319666008 -7.10103636 TRUE
## 24 min_dew_point -0.321040666 -7.64157383 TRUE
## 25 max_wind_speed 0.085095716 2.89204503 TRUE
## 26 avg_hourly_wind_speed 0.125046507 2.75027731 TRUE
## 27 avg_wind_speed 0.118256161 2.62516141 TRUE
## 28 min_wind_speed 0.071359971 1.14194642 FALSE
## 29 max_wind_gust 0.023679664 2.00787553 FALSE
## 30 wind_gust_dir_10s 0.003646354 0.19374774 FALSE
## 31 heatdegdays 0.054887348 0.84384593 FALSE
## 32 cooldegdays -1.187080426 -9.34472867 TRUE
## 33 growdegdays_5 -0.518973301 -8.00424644 TRUE
## 34 growdegdays_7 -0.621278061 -9.06739270 TRUE
## 35 growdegdays_10 -0.747063066 -9.97795120 TRUE
## 36 daylight 5.723171708 13.41803551 TRUE
## 37 sunrise_f -10.199709000 -14.74224831 TRUE
## 38 sunset_f 12.173790810 11.15140780 TRUE
## 39 min_uv_forecast -0.184825760 -1.36379489 FALSE
## 40 max_uv_forecast -0.070568061 -0.47800386 FALSE
## 41 min_high_temperature_forecast -0.231219100 -4.78405831 TRUE
## 42 max_high_temperature_forecast -0.213141326 -4.42981042 TRUE
## 43 min_low_temperature_forecast -0.256147115 -5.08635019 TRUE
## 44 max_low_temperature_forecast -0.275588159 -5.50191182 TRUE
#tbl[abs(tbl$t) > 2.1, ]
Note that this method, p-hacking, is highly criticized in practice, so a more proper method is shown below.
Using backward stepwise selection, we look for the variables best associated with the outcome, Lcases. Note that, instead of following the demeaning procedure described above, we use the Least Squares Dummy Variable (LSDV) model here.
# dmdata <- demean(data2, select = c(names(data2)[-c(2,3)]), group = "CD")
# dmdata <- dmdata[, 46:90]
# names(dmdata)<- strsplit(names(dmdata),"_within")
# dmdata$CD <- data2$CD
model1 <- lm(Lcases~.-1, data = data2)
lst <- ols_step_backward_p(model1)
lst
##
##
## Elimination Summary
## --------------------------------------------------------------------------------------------------
## Variable Adj.
## Step Removed R-Square R-Square C(p) AIC RMSE
## --------------------------------------------------------------------------------------------------
## 1 max_wind_gust 0.7019 0.6965 17.0000 31385.0433 13.5092
## 2 maleperc 0.7019 0.6966 15.0005 31383.0438 13.5075
## 3 max_uv_forecast 0.7019 0.6966 13.0042 31381.0476 13.5057
## 4 min_low_temperature_forecast 0.7019 0.6967 11.0287 31379.0726 13.5040
## 5 growdegdays_7 0.7019 0.6968 9.0535 31377.0978 13.5023
## 6 max_high_temperature_forecast 0.7019 0.6969 7.0866 31375.1315 13.5006
## 7 sunset_f 0.7018 0.6969 5.1260 31373.1716 13.4989
## 8 min_high_temperature_forecast 0.7018 0.697 3.1694 31371.2158 13.4972
## 9 avg_hourly_dew_point 0.7018 0.6971 1.2118 31369.2590 13.4955
## 10 cooldegdays 0.7018 0.6972 -0.7247 31367.3237 13.4938
## 11 min_windchill 0.7018 0.6972 -2.4652 31365.5880 13.4925
## 12 min_uv_forecast 0.7018 0.6973 -4.2075 31363.8504 13.4912
## 13 max_temperature 0.7018 0.6973 -5.8740 31362.1900 13.4901
## 14 max_dew_point 0.7017 0.6974 -7.3783 31360.6948 13.4892
## 15 avg_hourly_wind_speed 0.7017 0.6974 -8.8211 31359.2620 13.4884
## 16 NoInfo 0.7016 0.6974 -10.1933 31357.9011 13.4878
## 17 max_low_temperature_forecast 0.7016 0.6975 -11.3948 31356.7137 13.4874
## 18 mob1 0.7015 0.6975 -12.5980 31355.5245 13.4871
## --------------------------------------------------------------------------------------------------
We now fit a within-effect regression model with the set of candidate predictors selected above based on the Akaike Information Criterion (see the elimination summary).
# In-sample
rmv <- lst$removed
ind <- which(names(data2) %in% rmv)
final_data <- data2[, -c(ind)]
model2 <- lm(Lcases~.-1, data = final_data)
summary(model2)
##
## Call:
## lm(formula = Lcases ~ . - 1, data = final_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.009 -4.236 -0.359 3.332 178.281
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## CDBrant -675.37018 398.40991 -1.695 0.090125 .
## CDChatham-Kent -666.33690 397.80085 -1.675 0.094006 .
## CDCochrane -669.18305 397.69079 -1.683 0.092520 .
## CDDurham -671.94026 398.83485 -1.685 0.092117 .
## CDElgin -671.82666 398.11198 -1.688 0.091582 .
## CDEssex -651.42214 397.52741 -1.639 0.101361
## CDFrontenac -692.92032 399.62340 -1.734 0.083010 .
## CDGrey -672.56628 398.09945 -1.689 0.091216 .
## CDHaldimand-Norfolk -674.51581 398.41089 -1.693 0.090534 .
## CDHalton -674.70126 398.57364 -1.693 0.090577 .
## CDHamilton -671.95799 398.52080 -1.686 0.091852 .
## CDLambton -666.30004 397.71311 -1.675 0.093951 .
## CDLeeds and Grenville -694.66638 399.87440 -1.737 0.082430 .
## CDMiddlesex -668.54087 398.11677 -1.679 0.093183 .
## CDNiagara -677.00109 398.81108 -1.698 0.089674 .
## CDNorthumberland -685.45345 399.04672 -1.718 0.085927 .
## CDOttawa -683.13509 399.91878 -1.708 0.087683 .
## CDPeel -634.22782 398.58868 -1.591 0.111651
## CDPerth -671.90726 398.12515 -1.688 0.091554 .
## CDPeterborough -684.94149 398.99075 -1.717 0.086117 .
## CDSimcoe -674.99598 398.57760 -1.694 0.090439 .
## CDStormont, Dundas and Glengarry -699.41282 400.19184 -1.748 0.080597 .
## CDSudbury -672.30994 397.89663 -1.690 0.091174 .
## CDThunder Bay -634.43012 395.04568 -1.606 0.108364
## CDToronto -581.82164 398.64061 -1.460 0.144505
## CDWaterloo -666.29134 398.37365 -1.673 0.094502 .
## CDWellington -673.59747 398.41497 -1.691 0.090977 .
## CDYork -658.77580 398.62575 -1.653 0.098491 .
## Episode 0.08786 0.02670 3.291 0.001009 **
## mbo2 39.23039 4.88493 8.031 1.28e-15 ***
## delay_mean 0.18072 0.06658 2.714 0.006668 **
## CC -3.68752 1.13311 -3.254 0.001146 **
## NoEpi -6.15014 1.23746 -4.970 6.99e-07 ***
## Unknown -18.22557 7.45308 -2.445 0.014515 *
## OB -1.87320 1.27386 -1.470 0.141510
## Travel -5.49966 1.86355 -2.951 0.003185 **
## age 0.08063 0.01932 4.174 3.06e-05 ***
## avg_temperature -1.08085 0.36508 -2.961 0.003089 **
## min_temperature 0.23463 0.16749 1.401 0.161325
## max_humidex 0.16758 0.05014 3.342 0.000839 ***
## max_relative_humidity -0.04097 0.04423 -0.926 0.354399
## avg_hourly_relative_humidity -0.10450 0.06793 -1.538 0.124056
## avg_relative_humidity -0.01199 0.08599 -0.139 0.889146
## min_relative_humidity NA NA NA NA
## avg_dew_point 0.55365 0.28960 1.912 0.055977 .
## min_dew_point -0.26610 0.16366 -1.626 0.104049
## max_wind_speed 0.14063 0.09647 1.458 0.144982
## avg_wind_speed -0.27743 0.14879 -1.865 0.062317 .
## min_wind_speed NA NA NA NA
## wind_gust_dir_10s -0.03035 0.02049 -1.481 0.138649
## heatdegdays -0.59275 0.25441 -2.330 0.019862 *
## growdegdays_5 0.30817 0.28970 1.064 0.287504
## growdegdays_10 -0.50916 0.38065 -1.338 0.181108
## daylight -36.04203 5.52394 -6.525 7.70e-11 ***
## sunrise_f -66.23327 9.15408 -7.235 5.58e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.49 on 3839 degrees of freedom
## (280 observations deleted due to missingness)
## Multiple R-squared: 0.742, Adjusted R-squared: 0.7384
## F-statistic: 208.3 on 53 and 3839 DF, p-value: < 2.2e-16
You can see that the fit is not bad (adjusted R-squared of about 0.74). In particular, max_humidex confirms the findings of an earlier study. However, one wonders what happens if we run only a simple model without any frills.
# In-sample
model3 <- lm(Lcases~CD+Episode+mbo2-1, data = final_data)
summary(model3)
##
## Call:
## lm(formula = Lcases ~ CD + Episode + mbo2 - 1, data = final_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.632 -2.929 -0.125 2.184 182.161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## CDBrant -6.134e+02 1.303e+02 -4.706 2.61e-06 ***
## CDChatham-Kent -6.127e+02 1.303e+02 -4.701 2.67e-06 ***
## CDCochrane -6.134e+02 1.303e+02 -4.707 2.60e-06 ***
## CDDurham -6.041e+02 1.304e+02 -4.632 3.74e-06 ***
## CDElgin -6.136e+02 1.303e+02 -4.708 2.59e-06 ***
## CDEssex -6.009e+02 1.304e+02 -4.609 4.17e-06 ***
## CDFrontenac -6.146e+02 1.304e+02 -4.714 2.51e-06 ***
## CDGrey -6.135e+02 1.303e+02 -4.708 2.60e-06 ***
## CDHaldimand-Norfolk -6.109e+02 1.303e+02 -4.688 2.86e-06 ***
## CDHalton -6.110e+02 1.304e+02 -4.685 2.90e-06 ***
## CDHamilton -6.095e+02 1.304e+02 -4.675 3.04e-06 ***
## CDLambton -6.123e+02 1.303e+02 -4.698 2.72e-06 ***
## CDLeeds and Grenville -6.115e+02 1.303e+02 -4.692 2.79e-06 ***
## CDMiddlesex -6.112e+02 1.304e+02 -4.687 2.86e-06 ***
## CDNiagara -6.092e+02 1.304e+02 -4.673 3.07e-06 ***
## CDNorthumberland -6.138e+02 1.304e+02 -4.709 2.58e-06 ***
## CDOttawa -6.010e+02 1.305e+02 -4.607 4.21e-06 ***
## CDPeel -5.702e+02 1.304e+02 -4.373 1.26e-05 ***
## CDPerth -6.130e+02 1.303e+02 -4.705 2.63e-06 ***
## CDPeterborough -6.152e+02 1.304e+02 -4.718 2.46e-06 ***
## CDSimcoe -6.106e+02 1.304e+02 -4.684 2.91e-06 ***
## CDStormont, Dundas and Glengarry -6.128e+02 1.303e+02 -4.702 2.66e-06 ***
## CDSudbury -6.158e+02 1.304e+02 -4.723 2.41e-06 ***
## CDThunder Bay -6.135e+02 1.303e+02 -4.708 2.60e-06 ***
## CDToronto -5.169e+02 1.304e+02 -3.963 7.54e-05 ***
## CDWaterloo -6.061e+02 1.304e+02 -4.649 3.45e-06 ***
## CDWellington -6.114e+02 1.304e+02 -4.690 2.83e-06 ***
## CDYork -5.935e+02 1.304e+02 -4.551 5.50e-06 ***
## Episode 3.242e-02 7.045e-03 4.602 4.32e-06 ***
## mbo2 6.189e+01 3.901e+00 15.865 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.96 on 3862 degrees of freedom
## (280 observations deleted due to missingness)
## Multiple R-squared: 0.7218, Adjusted R-squared: 0.7196
## F-statistic: 334 on 30 and 3862 DF, p-value: < 2.2e-16
Interesting. We stop here, as going further would become a useless data-mining exercise within the limits of parametric models.
We would like to shuffle and split the data into train (80%) and test (20%) sets and run the FEM 100 times, but this type of bootstrapping requires removing the temporal relationship in the data with proper differencing. Demeaning alone would not remove possible autocorrelation or non-stationarity in the data.
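As a quick diagnostic, sketched below on the Toronto series from data1 above, the autocorrelation function and an Augmented Dickey-Fuller test (ur.df() from urca, loaded earlier) would expose this serial dependence and a possible unit root; the lag choices here are illustrative only.
# Daily cases for a single region; na.omit() is only a precaution
y_tor <- na.omit(data1$cases[data1$CD == "Toronto"])

# A slowly decaying ACF indicates strong serial dependence, so the
# observations cannot be shuffled and resampled freely.
acf(y_tor, lag.max = 30)

# ADF test with a drift term; failing to reject the unit root would
# call for differencing before any resampling scheme.
summary(ur.df(y_tor, type = "drift", lags = 7))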
Instead, we use a different method, population-informed nested cross-validation, and assume independence between different regions’ epi curves. This allows us to break the strict temporal ordering, at least across regions. The idea is simple: since we removed the regional fixed differences, we can fit (train) the model on 27 regions and use the remaining region as the out-of-sample (test) set.
df <- data2[complete.cases(data2), ]
dmdata <- demean(df, select = c(names(df)[-c(2,3)]), group = "CD")
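# demean() returns *_between and *_within versions of each selected variable;
# keep only the within (demeaned) columns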
dmdata <- dmdata[, 46:90]
names(dmdata)<- strsplit(names(dmdata),"_within")
dmdata$Episode <- df$Episode
dmdata$CD <- df$CD
cd <- as.character(unique(df$CD))
rmspe <- c()
mae <- c()
mape <- c()
for (i in 1:length(cd)) {
test <- dmdata[dmdata$CD==cd[i], -c(47)]
train <- dmdata[dmdata$CD!=cd[i], -c(47)]
model <- lm(Lcases~., data = train)
lst <- ols_step_backward_p(model)
rmv <- lst$removed
ind_tr <- which(names(train) %in% rmv)
final_tr <- train[, -c(ind_tr)]
ind_tt <- which(names(test) %in% rmv)
final_tt <- test[, -c(ind_tt)]
model <- lm(Lcases~., data = final_tr)
yhat <- predict(model, newdata = final_tt)
rmspe[i] <- sqrt(mean((final_tt$Lcases - yhat)^2))
mae[i] <- mean(abs(final_tt$Lcases - yhat))
mape[i] <- mean(abs(100*(final_tt$Lcases - yhat)/final_tt$Lcases))
}
mean(rmspe)
## [1] 8.721739
mean(mae)
## [1] 6.82111
mean(mape)
## [1] 512.1652
We can now look at these performance metrics for each region:
tbl1 <- aggregate(df$Lcases, by=list(df$CD), FUN=mean)
tbl2 <- aggregate(df$Lcases, by=list(df$CD), FUN=sd)
lst <- tibble("Region" = cd, "RMSPE" = rmspe, "MAE" = mae, "MAPE" = mape, "Avr_Lcases" = tbl1[,2],"sd" = tbl2[,2] )
lst
## # A tibble: 28 x 6
## Region RMSPE MAE MAPE Avr_Lcases sd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Brant 5.44 4.55 2092. 1.06 1.28
## 2 Chatham-Kent 5.21 4.08 431. 1.45 2.85
## 3 Cochrane 4.87 3.95 713. 0.504 0.863
## 4 Durham 7.38 5.38 233. 12.8 10.6
## 5 Elgin 5.02 4.19 744. 0.683 1.08
## 6 Essex 14.8 8.61 228. 14.5 13.8
## 7 Frontenac 5.51 4.60 732. 0.763 1.44
## 8 Grey 4.62 3.74 751. 0.784 1.49
## 9 Haldimand-Norfolk 8.66 5.35 520. 3.12 7.10
## 10 Halton 5.46 4.24 394. 5.83 4.88
## # … with 18 more rows
Small regions with many zero-case days might be an issue. One solution could be to group high-case regions together. Using the same algorithm, we look only at York, Toronto, Peel, Ottawa, Essex, and Durham to see if we can improve the results:
df <- data2[complete.cases(data2), ]
dmdata <- demean(df, select = c(names(df)[-c(2,3)]), group = "CD")
dmdata <- dmdata[, 46:90]
names(dmdata)<- strsplit(names(dmdata),"_within")
dmdata$Episode <- df$Episode
dmdata$CD <- df$CD
cd_high <- c("York", "Toronto", "Peel", "Ottawa", "Essex", "Durham")
ind <- which(dmdata$CD %in% cd_high)
dmdata <- dmdata[ind, ]
cd <- as.character(unique(dmdata$CD))
rmspe <- c()
mae <- c()
mape <- c()
for (i in 1:length(cd)) {
test <- dmdata[dmdata$CD==cd[i], -c(47)]
train <- dmdata[dmdata$CD!=cd[i], -c(47)]
model <- lm(Lcases~., data = train)
lst <- ols_step_backward_p(model)
rmv <- lst$removed
ind_tr <- which(names(train) %in% rmv)
final_tr <- train[, -c(ind_tr)]
ind_tt <- which(names(test) %in% rmv)
final_tt <- test[, -c(ind_tt)]
model <- lm(Lcases~., data = final_tr)
yhat <- predict(model, newdata = final_tt)
rmspe[i] <- sqrt(mean((final_tt$Lcases - yhat)^2))
mae[i] <- mean(abs(final_tt$Lcases - yhat))
mape[i] <- mean(abs(100*(final_tt$Lcases - yhat)/final_tt$Lcases))
}
mean(rmspe)
## [1] 24.3446
mean(mae)
## [1] 19.80646
mean(mape)
## [1] 709.8797
The table:
tbl1 <- aggregate(dmdata$Lcases, by=list(dmdata$CD), FUN=mean)
tbl2 <- aggregate(dmdata$Lcases, by=list(dmdata$CD), FUN=sd)
lst <- tibble("Region" = cd, "RMSPE" = rmspe, "MAE" = mae, "MAPE" = mape, "Avr_Lcases" = tbl1[,2],"sd" = tbl2[,2] )
lst
## # A tibble: 6 x 6
## Region RMSPE MAE MAPE Avr_Lcases sd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Durham 14.4 11.5 770. -4.86e-17 10.6
## 2 Essex 27.6 21.3 682. 1.02e-16 13.8
## 3 Ottawa 16.3 13.4 395. 1.53e-15 14.5
## 4 Peel 16.6 13.0 138. -2.76e-15 25.4
## 5 Toronto 56.0 47.5 95.9 -4.29e-15 64.7
## 6 York 15.2 12.1 2178. 9.47e-16 14.6
It seems that Toronto and Peel improve slightly, but the others get worse. Although we control for a linear trend by including the date in the regressions, all of these applications ignore the temporal ordering within each region.
There are more questions to answer in this parametric framework:
ARDL (autoregressive distributed lag) models are an integral part of estimating these temporal dynamics.
\[ y_{t}=\alpha_{0}+\phi_{1} y_{t-1}+\theta_{1,0} x_{1, t}+\theta_{1,1} x_{1, t-1}+\cdots+\theta_{k, 0} x_{k, t}+\theta_{k, 1} x_{k, t-1}+\beta * T+\epsilon_{t} \]
where \(T\) is the trend. Our earlier examples arbitrarily use cases with L = 10. Although we can make the lag structure in the ARDL a hyperparameter to tune with cross-validation, degrees of freedom would be a problem even with two lags, given that we have more than 45 variables and 149 observations in each region. We will return to this issue later when we apply shrinkage methods in Lasso-type estimations.
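For reference, an ARDL(1,1) with a single regressor and a trend can be estimated for one region with plain lm() and dplyr lags. This is only a sketch with hypothetical names (one_region for a single region’s data and x1 for one feature); it is not the specification estimated elsewhere in this document.
library(dplyr)

# ARDL(1,1): regress y_t on y_{t-1}, x_{1,t}, x_{1,t-1}, and a linear trend.
ardl_df <- one_region %>%
  arrange(Episode) %>%                  # ensure rows are in date order
  mutate(y_lag1  = lag(cases, 1),       # phi_1 * y_{t-1}
         x1_lag1 = lag(x1, 1),          # theta_{1,1} * x_{1,t-1}
         trend   = row_number())        # beta * T

ardl_11 <- lm(cases ~ y_lag1 + x1 + x1_lag1 + trend, data = ardl_df)
summary(ardl_11)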
In general, these changes require more flexible nonparametric models based on self-learning algorithms.
In order to control, at least partly, for the temporal structure of the data, we will use a binary outcome. We transform the daily number of cases into a binary outcome representing “higher” or “lower” case days by comparing the number of cases on any given day with a threshold such as a 3-day moving average of past cases. Hence, the binary outcome on any given day preserves the temporal ordering by carrying information from the past values of cases. This transformation also reduces the issue of zero-case observations.
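In code, this transformation is just a grouped moving average followed by a comparison. Here is a minimal sketch with an illustrative 3-day window, mirroring the chunks below (movavg() is from pracma, loaded above); binary_df is a hypothetical name.
# Binary outcome: 1 if the day's cases exceed the region's 3-day simple
# moving average of cases, 0 otherwise.
binary_df <- data1 %>%
  dplyr::group_by(CD) %>%
  mutate(ccs    = movavg(cases, 3, "s"),
         updown = ifelse(cases > ccs, 1, 0)) %>%
  ungroup()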
We use the demeaned data and apply population-informed nested cross-validation with the identity link (a linear probability model, LPM) instead of the logistic link, which is known for convergence issues in higher dimensions. The algorithm below has three loops: one over the moving-average thresholds (2 to 10 days) used to define the binary outcome (updown), one over the lags of that outcome (2 to 12 days), and one over the 28 regions for the nested cross-validation. The two tables below show the average RMSPE and AUC across all 28 regions, respectively. The columns are the lag lengths (2:12), reflecting the delay in the impact of the features, and the rows are the moving-average thresholds (2:10) used in calculating the binary outcome, “higher” or “lower”.
rm(list = ls())
load("~/Dropbox/RNS2/data_v3.RData")
data <- data_v3
data1 <- data[, -c(16, 35:42, 48)]
df <- data1[complete.cases(data1), ]
dmdata <- demean(df, select = c(names(df)[-c(1,2)]), group = "CD")
dmdata <- dmdata[, 46:90]
names(dmdata)<- strsplit(names(dmdata),"_within")
dmdata$Episode <- df$Episode
dmdata$CD <- df$CD
cd <- as.character(unique(df$CD))
L <- 2:12
th <- 2:10
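# 3-D result arrays: moving-average threshold x lag x region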
rmspe <- array(0, dim = c(length(th), length(L), length(cd)))
auc <- array(0, dim = c(length(th), length(L), length(cd)))
phatt <- vector(mode = "list", length = length(cd))
for (g in 1:length(th)) {
data2 <- dmdata %>%
dplyr::group_by(CD) %>%
mutate(ccs = movavg(cases, th[g], "s")) %>%
mutate(updown = ifelse(cases > ccs, 1, 0)) %>%
ungroup()
data2 <- data2[, c(49,1,2,4:47)]
names(data2)[1] <- "updown"
for (i in 1:length(L)) {
data3 <- data2 %>%
dplyr::group_by(CD) %>%
mutate(Lcases = lag(updown,L[i])) %>%
ungroup()
dm <- data3[, c(48, 2:(ncol(data2)))]
dm <- dm[complete.cases(dm), ]
for (j in 1:length(cd)) {
test <- dm[dm$CD == cd[j], -c(47)]
train <- dm[dm$CD != cd[j], -c(47)]
model <- lm(Lcases~., data = train)
lst <- ols_step_backward_p(model)
rmv <- lst$removed
ind_tr <- which(names(train) %in% rmv)
final_tr <- train[, -c(ind_tr)]
ind_tt <- which(names(test) %in% rmv)
final_tt <- test[, -c(ind_tt)]
model <- lm(Lcases~., data = final_tr)
yhat <- predict(model, newdata = final_tt)
rmspe[g,i,j] <- sqrt(mean((final_tt$Lcases - yhat)^2))
#AUC
phat <- rescale(yhat, to = c(0, 1),
from = range(yhat, na.rm = TRUE, finite = TRUE))
phatt[[j]] <- phat
final_tt$Lcases <- as.factor(final_tt$Lcases)
pred_rocr <- prediction(phat, final_tt$Lcases)
auc_ROCR <- performance(pred_rocr, measure = "auc")
auc[g,i,j] <- auc_ROCR@y.values[[1]]
}
}
}
apply(rmspe, c(1,2), mean)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.4682240 0.4709323 0.4703193 0.4700835 0.4698089 0.4711440 0.4710007
## [2,] 0.4722933 0.4742418 0.4749785 0.4756441 0.4760400 0.4735963 0.4741240
## [3,] 0.4748892 0.4766434 0.4779671 0.4789133 0.4805583 0.4778534 0.4772796
## [4,] 0.4732048 0.4752383 0.4764926 0.4783440 0.4773594 0.4761430 0.4768071
## [5,] 0.4717477 0.4732924 0.4745162 0.4755444 0.4762719 0.4742910 0.4740048
## [6,] 0.4711055 0.4734736 0.4733106 0.4743040 0.4742714 0.4724791 0.4734876
## [7,] 0.4699191 0.4710549 0.4742402 0.4736760 0.4730010 0.4724961 0.4727392
## [8,] 0.4686641 0.4705513 0.4716291 0.4724631 0.4724973 0.4728210 0.4717215
## [9,] 0.4674746 0.4700038 0.4708553 0.4705048 0.4709208 0.4715389 0.4706430
## [,8] [,9] [,10] [,11]
## [1,] 0.4713788 0.4736232 0.4712888 0.4732521
## [2,] 0.4770715 0.4797079 0.4774783 0.4791801
## [3,] 0.4799913 0.4817548 0.4809400 0.4819120
## [4,] 0.4794575 0.4794073 0.4788816 0.4803674
## [5,] 0.4784570 0.4793299 0.4780075 0.4801720
## [6,] 0.4765846 0.4779030 0.4756931 0.4795699
## [7,] 0.4761530 0.4761459 0.4748530 0.4782480
## [8,] 0.4748603 0.4770412 0.4752151 0.4775535
## [9,] 0.4735703 0.4755578 0.4731407 0.4769946
apply(auc, c(1,2), mean)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.5967591 0.5785727 0.5854778 0.5936768 0.5890975 0.5796719 0.5821880
## [2,] 0.6024668 0.5901346 0.5859261 0.5829437 0.5759581 0.6030509 0.5971795
## [3,] 0.6085598 0.5934690 0.5874633 0.5792877 0.5675101 0.5937972 0.5951245
## [4,] 0.6264259 0.6083358 0.6056145 0.5846023 0.5896532 0.6143652 0.6044428
## [5,] 0.6375015 0.6237740 0.6231649 0.6125851 0.6090355 0.6306310 0.6265243
## [6,] 0.6391461 0.6232214 0.6288686 0.6202509 0.6245416 0.6426002 0.6301978
## [7,] 0.6429597 0.6353104 0.6309244 0.6269067 0.6313385 0.6433604 0.6336460
## [8,] 0.6492089 0.6349513 0.6429202 0.6337613 0.6317745 0.6377334 0.6374576
## [9,] 0.6546867 0.6398631 0.6496245 0.6436606 0.6416356 0.6461089 0.6460223
## [,8] [,9] [,10] [,11]
## [1,] 0.5799774 0.5624412 0.5826161 0.5636027
## [2,] 0.5769669 0.5649006 0.5750536 0.5661074
## [3,] 0.5774746 0.5700230 0.5749558 0.5707374
## [4,] 0.5849559 0.5881230 0.5892225 0.5859425
## [5,] 0.5959054 0.5980193 0.6012773 0.5922324
## [6,] 0.6051109 0.6086204 0.6162124 0.6014687
## [7,] 0.6073724 0.6168784 0.6211139 0.6070396
## [8,] 0.6124012 0.6130945 0.6172655 0.6087519
## [9,] 0.6254062 0.6239118 0.6298693 0.6146243
The results of this parametric estimation can be used as a benchmark to evaluate more complex, nonparametric, and trainable algorithms that we will try later.
Let’s see the maximum AUC values obtained from predicting each region.
muac <- c()
for (k in 1: length(cd)) {
muac[k] <- max(auc[,,k])
}
data.frame("CD"= cd, "AUC" = muac)
## CD AUC
## 1 Brant 0.6143158
## 2 Chatham-Kent 0.6435281
## 3 Cochrane 0.7894466
## 4 Durham 0.7110733
## 5 Elgin 0.6604167
## 6 Essex 0.6753694
## 7 Frontenac 0.7111818
## 8 Grey 0.6853685
## 9 Haldimand-Norfolk 0.7046703
## 10 Halton 0.6213786
## 11 Hamilton 0.6546154
## 12 Lambton 0.6847662
## 13 Leeds and Grenville 0.7900268
## 14 Middlesex 0.6654135
## 15 Niagara 0.6724266
## 16 Northumberland 0.6660018
## 17 Ottawa 0.6571484
## 18 Peel 0.7676056
## 19 Perth 0.6836945
## 20 Peterborough 0.7379791
## 21 Simcoe 0.6700721
## 22 Stormont, Dundas and Glengarry 0.6365990
## 23 Sudbury 0.7751065
## 24 Thunder Bay 0.6943115
## 25 Toronto 0.8217409
## 26 Waterloo 0.7505962
## 27 Wellington 0.6667314
## 28 York 0.7634615
Let’s look at AUC for Toronto in detail:
auc[,,25]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.5118606 0.5541784 0.5464762 0.5555556 0.5466118 0.5625994 0.5958048
## [2,] 0.6170686 0.6734962 0.6384439 0.6258937 0.6688210 0.6728935 0.7037112
## [3,] 0.6443414 0.6945489 0.7198703 0.6844444 0.6535448 0.6707075 0.7222670
## [4,] 0.6518023 0.7353661 0.7253629 0.6635062 0.7129412 0.7558209 0.7230738
## [5,] 0.7203704 0.7747748 0.7268748 0.7248698 0.7511737 0.7642857 0.7551308
## [6,] 0.7456498 0.7890786 0.7437215 0.7314815 0.7748435 0.8049990 0.7645875
## [7,] 0.7550000 0.8038664 0.7341073 0.7383755 0.8080986 0.8026185 0.7712274
## [8,] 0.7575251 0.8051948 0.7740642 0.7517415 0.8123529 0.8217409 0.7874698
## [9,] 0.7848101 0.8075038 0.7801757 0.7757353 0.8115686 0.8120032 0.7921031
## [,8] [,9] [,10] [,11]
## [1,] 0.5344510 0.5124533 0.5147306 0.5093897
## [2,] 0.6013085 0.6214760 0.5915493 0.5897655
## [3,] 0.5853609 0.6183665 0.6344335 0.6840085
## [4,] 0.6141893 0.6927861 0.6745848 0.6810235
## [5,] 0.6569388 0.7287785 0.6930672 0.6953092
## [6,] 0.6844254 0.7539354 0.7126340 0.7274861
## [7,] 0.6885079 0.7315659 0.7262981 0.7445583
## [8,] 0.7252859 0.7435791 0.7331933 0.7495737
## [9,] 0.7410131 0.7665700 0.7573529 0.7559676
It seems that L = 7 and th = 9 (both grids start at 2) are the optimal parameters. Using them, we get the ROC curve and the confusion matrix:
L <- 7
th <- 9
data2 <- dmdata %>%
dplyr::group_by(CD) %>%
mutate(ccs = movavg(cases, th, "s")) %>%
mutate(updown = ifelse(cases > ccs, 1, 0)) %>%
ungroup()
data2 <- data2[, c(49,1,2,4:47)]
names(data2)[1] <- "updown"
data3 <- data2 %>%
dplyr::group_by(CD) %>%
mutate(Lcases = lag(updown,L)) %>%
ungroup()
dm <- data3[, c(48, 2:(ncol(data2)))]
dm <- dm[complete.cases(dm), ]
test <- dm[dm$CD == "Toronto", -c(47)]
train <- dm[dm$CD != "Toronto", -c(47)]
model <- lm(Lcases~., data = train)
lst <- ols_step_backward_p(model)
rmv <- lst$removed
ind_tr <- which(names(train) %in% rmv)
final_tr <- train[, -c(ind_tr)]
ind_tt <- which(names(test) %in% rmv)
final_tt <- test[, -c(ind_tt)]
model <- lm(Lcases~., data = final_tr)
yhat <- predict(model, newdata = final_tt)
#AUC
phat <- rescale(yhat, to = c(0, 1),
from = range(yhat, na.rm = TRUE, finite = TRUE))
final_tt$Lcases <- as.factor(final_tt$Lcases)
pred_rocr <- prediction(phat, final_tt$Lcases)
auc_ROCR <- performance(pred_rocr, measure = "auc")
auc <- auc_ROCR@y.values[[1]]
auc
## [1] 0.8217409
# ROC: treat each fitted probability as a cutoff and record TPR and FPR
TPR <- c()
FPR <- c()
phato <- phat[order(phat)]
for (i in 1:length(phato)) {
yHat <- phat > phato[i]
conf_table <- table(yHat, test$Lcases)
ct <- as.matrix(conf_table) #No need to change the table for calculations
if(sum(dim(ct))>3){ #here we ignore the thresholds 0 and 1
TPR[i] <- ct[2,2]/(ct[2,2]+ct[1,2])
FPR[i] <- ct[2,1]/(ct[1,1]+ct[2,1])
}
}
plot(FPR, TPR, col= "blue", type = "l", main = "ROC", lwd = 2)
abline(a = 0, b = 1, col="red")
The best discriminating threshold:
# Youden's J statistic (TPR - FPR)
J <- TPR - FPR
# The best discriminating threshold
phat[which.max(J)]
## 59
## 0.6513276
The confusion matrix:
yHat <- phat > 0.5
conf_table <- table(yHat, test$Lcases)
ctt <- as.matrix(conf_table)
ct <- matrix(0, 2, 2)
ct[1,1] <- ctt[2,2]
ct[2,2] <- ctt[1,1]
ct[1,2] <- ctt[2,1]
ct[2,1] <- ctt[1,2]
rownames(ct) <- c("Yhat = 1", "Yhat = 0")
colnames(ct) <- c("Y = 1", "Y = 0")
ct
## Y = 1 Y = 0
## Yhat = 1 59 26
## Yhat = 0 9 48
Note that this is a spatial forecast, not a temporal one. We did not forecast the future epi curves by looking at their past values. We could also incorporate temporal prediction by using, for example, only the first 80 days in 27 regions to fit a model and then using it to predict the remaining future days for the holdout region. We could even make it an h-step-ahead cross-validation. However, the objective of the parametric models estimated here is to reveal the underlying relationship between weather conditions and epi curves after controlling for the regional fixed differences, rather than to forecast.
The script above is very primitive and takes a very long time to run. We will use trainable nonparametric models for forecasting with proper cross-validation procedures later.
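For reference, here is a sketch of the temporal split described above, applied to the dm data frame built earlier, holding out Toronto and using an illustrative 80-day cutoff; it assumes the rows are in date order within each region.
library(dplyr)

# Day counter within each region (rows assumed to be in date order)
dm_t <- dm %>%
  group_by(CD) %>%
  mutate(day = row_number()) %>%
  ungroup()

# Fit on the first 80 days of the 27 training regions ...
train_t <- dm_t[dm_t$CD != "Toronto" & dm_t$day <= 80, ]
# ... and predict the days after day 80 for the held-out region.
test_t  <- dm_t[dm_t$CD == "Toronto" & dm_t$day > 80, ]

model_t <- lm(Lcases ~ ., data = train_t[, !(names(train_t) %in% c("CD", "day"))])
yhat_t  <- predict(model_t, newdata = test_t)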
Here is the spatial model for Toronto:
summary(model)
##
## Call:
## lm(formula = Lcases ~ ., data = final_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7782 -0.3624 -0.2558 0.5121 0.9249
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.2414530 17.4486546 1.332 0.18294
## mob1 -0.7085440 0.0869812 -8.146 5.05e-16 ***
## CC 0.1178983 0.0393861 2.993 0.00278 **
## NoEpi 0.0929027 0.0432939 2.146 0.03195 *
## NoInfo 0.2330864 0.0766623 3.040 0.00238 **
## OB 0.1462883 0.0459324 3.185 0.00146 **
## Travel 0.1005931 0.0602388 1.670 0.09502 .
## age -0.0010515 0.0007159 -1.469 0.14194
## max_temperature 2.2397828 1.0626811 2.108 0.03512 *
## avg_temperature -4.4693621 2.1254503 -2.103 0.03555 *
## min_temperature 2.2445663 1.0626368 2.112 0.03473 *
## min_windchill -0.0054772 0.0049233 -1.112 0.26600
## max_relative_humidity -0.0018308 0.0015804 -1.158 0.24677
## avg_hourly_relative_humidity 0.0059636 0.0028395 2.100 0.03577 *
## avg_relative_humidity -0.0007753 0.0030205 -0.257 0.79743
## min_relative_humidity NA NA NA NA
## max_dew_point 0.0169388 0.0064897 2.610 0.00909 **
## avg_hourly_dew_point -0.0370487 0.0129050 -2.871 0.00412 **
## min_dew_point 0.0059994 0.0056853 1.055 0.29138
## max_wind_speed -0.0026939 0.0034249 -0.787 0.43159
## avg_hourly_wind_speed -0.0131564 0.0045095 -2.917 0.00355 **
## avg_wind_speed 0.0136806 0.0069246 1.976 0.04827 *
## min_wind_speed NA NA NA NA
## wind_gust_dir_10s -0.0013233 0.0007389 -1.791 0.07337 .
## cooldegdays -0.0322999 0.0086138 -3.750 0.00018 ***
## growdegdays_10 0.0154970 0.0088212 1.757 0.07903 .
## daylight -0.5290311 0.1329438 -3.979 7.04e-05 ***
## sunset_f 1.0771552 0.3298217 3.266 0.00110 **
## max_uv_forecast 0.0265086 0.0065461 4.050 5.23e-05 ***
## min_low_temperature_forecast -0.0033548 0.0029160 -1.150 0.25002
## Episode -0.0012456 0.0009491 -1.312 0.18944
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4695 on 3805 degrees of freedom
## Multiple R-squared: 0.06966, Adjusted R-squared: 0.06282
## F-statistic: 10.18 on 28 and 3805 DF, p-value: < 2.2e-16
Although temporal prediction (forecasting) requires a more careful data setup that preserves the temporal order, we will try the same algorithm with some changes: we will use only 11 high-case regions (Durham, Essex, Halton, Hamilton, Niagara, Ottawa, Peel, Simcoe, Waterloo, Toronto, and York) with a 10-day sliding window to predict the next 2 days. We also look only at the first half of the epi curves, until the decline is observed. There should be a better segmentation of epi curves than this, but for now we look at the first half of the curve, where it rises. The other issue is that the data are not stationary. Although our forecast horizon is very short (2 days), we should apply all the conventional steps before forecasting. We will take care of this later.
rm(list = ls())
load("~/Dropbox/RNS2/data_v3.RData")
data <- data_v3
data1 <- data[, -c(16, 35:42, 48)]
df <- data1[complete.cases(data1), ]
# dmdata <- demean(df, select = c(names(df)[-c(1,2)]), group = "CD")
# dmdata <- dmdata[, 46:90]
# names(dmdata)<- strsplit(names(dmdata),"_within")
#
# dmdata$Episode <- df$Episode
# dmdata$CD <- df$CD
cd <- as.character(unique(df$CD))
cds <- as.character(unique(df$CD))[c(4,6,10,11,15,17,18,21,25,26,28)]
L <- 2:12
th <- 2:10
mrmspe <- matrix(0, length(th), length(L))
mauc <- matrix(0, length(th), length(L))
for (g in 1:length(th)) {
data2 <- data1 %>%
dplyr::group_by(CD) %>%
mutate(ccs = movavg(cases, th[g], "s")) %>%
mutate(updown = ifelse(cases > ccs, 1, 0)) %>%
ungroup()
data2 <- data2[, c(49,1:4,6:47)]
names(data2)[1] <- "updown"
for (i in 1:length(L)) {
data3 <- data2 %>%
dplyr::group_by(CD) %>%
mutate(Lcases = lag(updown,L[i])) %>%
ungroup()
dm <- data3[, c(48, 2:(ncol(data2)))]
dm <- dm[complete.cases(dm), ]
rmspe <- c()
auc <- c()
#sliding window
step = 10
# Number of sliding-window starts per region (observations per region minus 60)
n = (nrow(dm)/length(cd))-60
for(k in 1:n) {
train <- matrix(0, 0, ncol(dm))
test <- matrix(0, 0, ncol(dm))
for (j in 1:length(cds)) {
arr <- dm[dm$CD == cds[j], ]
s <- k - 1 + step
tr <- arr[k:s,]
train <- rbind(train, tr)
tt <- arr[c((s+1),(s+2)),]
test <- rbind(test, tt)
}
model <- lm(Lcases~.-1, data = train)
lst <- ols_step_backward_p(model)
rmv <- lst$removed
ind_tr <- which(names(train) %in% rmv)
final_tr <- train[, -c(ind_tr)]
ind_tt <- which(names(test) %in% rmv)
final_tt <- test[, -c(ind_tt)]
model <- lm(Lcases~.-1, data = final_tr)
yhat <- predict(model, newdata = final_tt)
rmspe[k] <- sqrt(mean((final_tt$Lcases - yhat)^2))
#AUC
phat <- rescale(yhat, to = c(0, 1),
from = range(yhat, na.rm = TRUE, finite = TRUE))
final_tt$Lcases <- as.factor(final_tt$Lcases)
pred_rocr <- prediction(phat, final_tt$Lcases)
auc_ROCR <- performance(pred_rocr, measure = "auc")
auc[k] <- auc_ROCR@y.values[[1]]
}
mrmspe[g, i] <- mean(rmspe)
mauc[g, i] <- mean(auc)
}
}
mauc
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.5118872 0.5396695 0.4944077 0.5529247 0.5178809 0.5012637 0.5163914
## [2,] 0.5299799 0.4963428 0.4993872 0.5299924 0.4934876 0.5244277 0.4961056
## [3,] 0.5206529 0.4952314 0.4896386 0.5032291 0.4871756 0.5134877 0.4882870
## [4,] 0.5290461 0.5044342 0.5115684 0.5203528 0.4947340 0.5086228 0.5223935
## [5,] 0.5063811 0.5091636 0.5124110 0.5234754 0.5119877 0.5151747 0.5114066
## [6,] 0.5269921 0.5112450 0.5202445 0.5298808 0.5152335 0.5232409 0.5140756
## [7,] 0.5316238 0.5012354 0.5231277 0.5458470 0.5181580 0.5232384 0.5236800
## [8,] 0.5434355 0.5119887 0.5268718 0.5161558 0.5117887 0.5085543 0.5307354
## [9,] 0.5573029 0.4985612 0.5280653 0.5457150 0.5417617 0.5355803 0.5186650
## [,8] [,9] [,10] [,11]
## [1,] 0.5111290 0.5406291 0.5205796 0.4834508
## [2,] 0.4986134 0.4980452 0.5221568 0.5066070
## [3,] 0.4982964 0.4977495 0.5301191 0.5141146
## [4,] 0.5043868 0.5010629 0.5136011 0.5173384
## [5,] 0.5064905 0.5073739 0.5004918 0.5250408
## [6,] 0.5141413 0.5319692 0.5225467 0.5357547
## [7,] 0.5090316 0.5096228 0.5110844 0.5264995
## [8,] 0.5093734 0.5073274 0.5336256 0.5366277
## [9,] 0.5162832 0.5278893 0.5413386 0.5471153
This table of AUC measures (rows: th, columns: L) confirms that temporal prediction is nearly impossible, at least with simple parametric models. Although the main issue may be the use of case numbers, which cannot reveal the true number of infected people, more appropriate and trainable models based on a nonparametric structure and a proper time series segmentation may capture the underlying relationship between case numbers and environmental conditions. Here is a quote from Forecasting COVID-19 by Hyndman:
(. . .) While we have a good understanding of how it works in terms of person-to-person infections, we have limited and misleading data. The current numbers of confirmed cases are known to be vastly underestimated due to the limited testing available. There are almost certainly many more cases of COVID-19 that have not been diagnosed than those that have. Also, the level of under-estimation varies enormously between countries. In a country like South Korea with a lot of testing, the numbers of confirmed cases are going to be closer to the numbers of actual cases than in the US where there has been much less testing. So we simply cannot easily model the spread of the pandemic using the data that is available (. . .)
That’s why we asked NSHA for the test (811) numbers in addition to the case numbers!