NOTE: If you want to replicate the results, please click on the Code button to see the code, load the packages, and download the data. As before, we drop a couple of variables that still have a few NA's.

Please also check our earlier and forthcoming work on Ontario epi curves, part of our Nova Scotia COVID-19 Research Coalition Project:

  1. Ontario Covid-19 Exploratory Data Analysis,
  2. Ontario Covid-19 Matrix Profile Analysis,
  3. Trainable Fixed Effect Models for Ontario COVID-19 Epi Curves,
  4. Conventional Time Series EDA and ETS Forecasting for Ontario COVID-19 Epi Curves - Current analysis,
  5. Shrinkage Methods for Panel Epi Curves for Model Identification - Coming Soon,
  6. Panel Predictions with AdaBoost for COVID-19 Epi Curves - Coming Soon,
  7. Neural Net Applications with GeoTemporal Epi Curves - Coming Soon.
rm(list = ls())

library(tidyverse)
library(tidyquant)
library(recipes)
library(rsample)
library(knitr)

# Data Cleaning
library(janitor)

# EDA
library(skimr)
library(DataExplorer)
library(DT)
library(corrplot)
library(parameters)
library(smooth)
library(kableExtra)

# ggplot2 Helpers
library(scales)
theme_set(theme_tq())
library(raster)
library(spdplyr)
library(dplyr)
library(RColorBrewer)

# Missing data
library(mice)
library(VIM)
library(lattice)

#TS 
library(forecast)
library(fpp2)
library(fpp3)#data library for `forecast`
library(lubridate)
library(tsibble)
library(ggplot2)
setwd("~/Dropbox/RNS2")
rm(list = ls())

load("~/Dropbox/RNS2/data_v3.RData")
data <- data_v3
df <- data[, -c(16, 35:42, 48)]
# Adding month, day, and weekday columns
# (note: format "%w" returns the day of the week, 0 = Sunday;
#  the column is named "week" and is used as the weekday axis below)
df <- cbind(df, "month" = as.numeric(format(df$Episode, format = "%m")),
                "day" = as.numeric(format(df$Episode, format = "%d")),
                "week" = as.numeric(format(df$Episode, format = "%w")))

1. Time Series EDA

1.1 Initial Epi Curves

All regions:

# Epi Curves - All
ggplot(df) +
 aes(x = Episode, y = cases, fill = CD, colour = CD) +
 geom_line(size = 0.5) +
 scale_fill_hue() +
 scale_color_hue() +
 theme_minimal() +
 theme(legend.position = "none") +
 facet_wrap(vars(CD), scales = "free_y", ncol = 5L)

Toronto and its three neighboring regions: York, Peel, and Durham.

# Epi Curves - Hotspots
cd_high <- c("York", "Toronto", "Peel", "Durham")
ind <- which(df$CD %in% cd_high)
dmdata <- df[ind, ]

ggplot(dmdata) +
 aes(x = Episode, y = cases, colour = CD) +
 geom_line(size = 1L) +
 scale_color_hue() +
 theme_minimal() +
 facet_grid(vars(CD), vars(), scales = "free_y")

1.2 Calendar Epi Curves

The graph below shows Toronto's epi curve for each month (day of month on the x-axis), with weekend days coloured differently. As expected, each month is different, but Mondays almost always show a spike in case numbers.

# Level
dmdata[dmdata$CD == "Toronto", ] %>%
  mutate(
    Day = lubridate::wday(Episode, label = TRUE),
    Weekend = (Day %in% c("Sun", "Sat"))
  ) %>%
  ggplot(aes(x = day, y = cases, group = month)) +
    geom_line(aes(col = Weekend)) +
    facet_grid(CD ~ .)

This can be seen more clearly with a weekly split, shown here for Peel: case numbers trend up on Mondays and Tuesdays. Given that we use Episode dates, which record when the person's symptoms started rather than when test results were officially reported, the pattern should reflect either a reporting bias or the effect of higher weekend mobility on transmission.

dmdata[dmdata$CD == "Peel", ] %>%
  mutate(
    Day = lubridate::wday(Episode, label = TRUE),
    Week = lubridate::week(Episode),
    Weekend = (Day %in% c("Sun", "Sat"))
  ) %>%
  ggplot(aes(x = week, y = cases, group = Week)) +
    geom_line(aes(col = Weekend)) +
    facet_grid(CD ~ .)

1.3 Correlations across Epi Curves for Hotspots

We asked the same question in our earlier Matrix Profile Analysis (MPA) and found no spatial correlation between the neighboring regions close to Toronto. Before de-trending, the correlation coefficients are high:

### Level
tr <- df[df$CD=="Toronto", 5]
pl <- df[df$CD=="Peel", 5]
dr <- df[df$CD=="Durham", 5]
yr <- df[df$CD=="York", 5]

cases <- data.frame(tr, pl, dr, yr)
colnames(cases) <- c("Toronto", "Peel", "Durham", "York")

corr <- round(cor(cases, use = "complete.obs"),5)
corr %>%
  kbl(caption = "Correlations across epi curves - with levels") %>%
   kable_material(c("striped", "hover"))
Correlations across epi curves - with levels

          Toronto     Peel   Durham     York
Toronto   1.00000  0.89766  0.83306  0.80817
Peel      0.89766  1.00000  0.77069  0.78102
Durham    0.83306  0.77069  1.00000  0.80114
York      0.80817  0.78102  0.80114  1.00000

After de-trending by first-differencing, the correlations become very low, as we also found in our earlier MPA work:

### After de-trending
cases <- as.matrix(cases)
dcases <- diff(cases, lag = 1)
corr <- round(cor(dcases, use = "complete.obs"),5)
corr %>%
  kbl(caption = "Correlations across epi curves - after detrending") %>%
   kable_material(c("striped", "hover"))
Correlations across epi curves - after detrending

          Toronto     Peel   Durham     York
Toronto   1.00000  0.28027  0.12042  0.21791
Peel      0.28027  1.00000  0.20956  0.33268
Durham    0.12042  0.20956  1.00000  0.10807
York      0.21791  0.33268  0.10807  1.00000

We will now look at the correlation after we remove the trend with and without seasonality.

2. STL Decomposition

Time series data can exhibit a variety of patterns, and it is often helpful to split a time series into several components, each representing an underlying pattern category. Although there are many time series decomposition methods, STL is a versatile and robust one. STL is an acronym for Seasonal and Trend decomposition using Loess, where Loess (LOcal regrESSion) is a method for estimating nonlinear relationships. The STL method was developed by Cleveland, Cleveland, McRae, & Terpenning (1990).
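With an additive decomposition, the observed series at time \(t\) splits into

\[ y_{t}=S_{t}+T_{t}+R_{t} \]

where \(S_{t}\) is the seasonal component, \(T_{t}\) the trend-cycle, and \(R_{t}\) the remainder; the detrending step further below follows directly from this identity.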

The graph below, using Toronto as an example, shows the Loess smoothing (with a 7-day trend window) used to extract the trend (dashed red line):

fts <- ts(df[df$CD=="Toronto", ], start = c(2020, as.numeric(format(df$Episode[1], "%j"))),
           frequency = 52)
fit <-  stl(fts[, "cases"], t.window=7, s.window="periodic", robust=TRUE) 
plot.ts(fts[, "cases"], col = 'gray', ylab = "Cases",
        xaxt = "n", xlab = "")
lines(fit$time.series[,2], 
      col = "red", lwd = 2, lty = 2)

Next, we detrend our data by subtracting the trend from the original data.

\[ y_{t}^{*}=y_{t}-T_{t} \]

detrended <- fts[, "cases"] - fit$time.series[,2]
plot.ts(detrended, 
        ylab = "Cases", 
        main = 'Seasonal variability of cases in Toronto',
        cex.main = 0.85, col = "blue", lwd = 2,
        xaxt = "n", xlab = "")

As expected, the detrended series does not exhibit an apparent weekly seasonal pattern.

Further, let us investigate the remainder and check whether the residuals are normally distributed, using a Q-Q plot produced with the qqnorm() and qqline() functions. Although the very small p-value from the Shapiro-Wilk test does not support normality, the Q-Q plot shows that the residuals are approximately normal around the peak of the pandemic, suggesting that the decomposition did a reasonable job of extracting the trend and seasonal components and that the remainder is mostly noise.

plot(fit$time.series[,3], 
     col = "darkgreen", main = 'Remainder', ylab = "",
     lwd = 2, xlab = "", xaxt = "n")

qqnorm(fit$time.series[,3])
qqline(fit$time.series[,3])

shapiro.test(fit$time.series[,3])
## 
##  Shapiro-Wilk normality test
## 
## data:  fit$time.series[, 3]
## W = 0.75826, p-value = 2.194e-14

Loess decomposition is intended to smooth the series by applying local averaging so that it collapses into components, such as the trend and seasonal, that are of interest for the analysis. It is not, however, a formal test for the presence of seasonality.
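As a rough heuristic (a sketch only, not part of the original analysis and still not a formal hypothesis test), one could ask forecast::nsdiffs() how many seasonal differences it considers necessary; its "seas" option relies on an STL-based seasonal strength measure.

# Heuristic seasonality check on the Toronto series (illustrative only)
nsdiffs(fts[, "cases"], test = "seas")
# A result of 0 would be consistent with the weak weekly seasonality seen above.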

Although our decomposition did not return a smoothed pattern of seasonal periodicity, we can still compare the variance of each component with respect to the variance of the original series, which reveals their individual contributions to the dynamics of the data.

apply(fit$time.series, 2, var) / var(fts[, "cases"])
##   seasonal      trend  remainder 
## 0.10643067 0.89371379 0.07504542

The relative variances indicate that trend is the main component explaining the dynamics of the series.

fck <- as_tsibble(dmdata, key = CD)
## Using `Episode` as index variable.
C <- fck %>% features(cases, feat_stl)
C %>% kbl(caption = "Seasonality") %>%
      kable_paper() %>%
      kable_material(c("striped", "hover")) %>%
      row_spec(0, angle = -45)
Seasonality

CD       trend_strength  seasonal_strength_week  seasonal_peak_week  seasonal_trough_week  spikiness  linearity   curvature   stl_e_acf1  stl_e_acf10
Durham   0.8611606       0.1541656               6                   0                     0.0490268  -52.18190   -69.20813   -0.2278144  0.1751434
Peel     0.9158026       0.2373876               6                   1                     0.5418180  -54.47934   -243.69358  -0.1533988  0.2376020
Toronto  0.9549695       0.4630639               2                   1                     4.0455005  -241.87120  -605.48768  -0.1032323  0.0764386
York     0.8752620       0.2488219               2                   0                     0.1665330  -80.25322   -95.80548   -0.1289709  0.2995586
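For interpretation, the trend and seasonal strength measures reported above are defined (following Hyndman & Athanasopoulos) from the STL components as

\[ F_{T}=\max\left(0,\,1-\frac{\operatorname{Var}(R_{t})}{\operatorname{Var}(T_{t}+R_{t})}\right), \qquad F_{S}=\max\left(0,\,1-\frac{\operatorname{Var}(R_{t})}{\operatorname{Var}(S_{t}+R_{t})}\right) \]

so values close to 1 indicate a strong trend (or strong seasonality) relative to the remainder.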

Although there appears to be no significant seasonality in the epi curves of these regions, we will now examine whether these hotspots have correlated seasonal patterns, to identify any hidden regional dependence in their epi curves.

# Toronto
ftsT <- ts(df[df$CD=="Toronto", ], start = c(2020, as.numeric(format(df$Episode[1], "%j"))),
           frequency = 52)
fitT <-  stl(ftsT[, "cases"], t.window=7, s.window="periodic", robust=TRUE) 
ST <- ftsT[, "cases"] - fitT$time.series[,2]

# Peel
ftsP <- ts(df[df$CD=="Peel", ], start = c(2020, as.numeric(format(df$Episode[1], "%j"))),
           frequency = 52)
fitP <-  stl(ftsP[, "cases"], t.window=7, s.window="periodic", robust=TRUE) 
SP <- ftsP[, "cases"] - fitP$time.series[,2]

# York
ftsY <- ts(df[df$CD=="York", ], start = c(2020, as.numeric(format(df$Episode[1], "%j"))),
           frequency = 52)
fitY <-  stl(ftsY[, "cases"], t.window=7, s.window="periodic", robust=TRUE) 
SY <- ftsY[, "cases"] - fitY$time.series[,2]

# Durham
ftsD <- ts(df[df$CD=="Durham", ], start = c(2020, as.numeric(format(df$Episode[1], "%j"))),
           frequency = 52)
fitD <-  stl(ftsD[, "cases"], t.window=7, s.window="periodic", robust=TRUE)
SD <- ftsD[, "cases"] - fitD$time.series[,2]

season <- data.frame(ST, SP, SY, SD)
colnames(season) <- c("Toronto", "Peel", "York", "Durham")

corr <- round(cor(season, use = "complete.obs"),5)
corr %>%
  kbl(caption = "Correlation across Neighboring Hotspots - Seasonal Patterns") %>%
   kable_material(c("striped", "hover"))
Correlation across Neighboring Hotspots - Seasonal Patterns

          Toronto     Peel     York   Durham
Toronto   1.00000  0.52587  0.43161  0.40290
Peel      0.52587  1.00000  0.44503  0.30987
York      0.43161  0.44503  1.00000  0.22371
Durham    0.40290  0.30987  0.22371  1.00000

The results also confirm the same weak geotemporal dependence across epi curves for the neighboring hotspot regions.

3. Complex Seasonality

So far, we have considered relatively simple weekly seasonal patterns. However, higher frequency time series often exhibit more complicated seasonal patterns. For example, daily data may have a weekly pattern as well as a monthly pattern. To deal with such series, we will define multiple seasonality in our epi curves. This allows us to specify all of the frequencies that might be relevant.

dff <- msts(df[df$CD=="Toronto", ], seasonal.periods = c(7,365.25), start = decimal_date(as.Date("2020-03-01")))
po <- dff[, "cases"]
tbat <- tbats(po)
plot(tbat, main="Multiple Season Decomposition for Toronto")

ddf <- df[df$CD=="Toronto", ]
ddf <- dff[15:80, ]  # rows 15-80 of the Toronto multi-seasonal series defined above
ah <- msts(ddf, seasonal.periods = c(7,365.25), start = decimal_date(as.Date("2020-03-15")))
poo <- ah[, "cases"]
tbats <- tbats(poo)

It seems that every week/weekend shows a similar (though not identical) pattern when we relax the fixed-seasonality assumption. Unlike techniques such as Holt-Winters, TBATS can forecast while accommodating multiple seasonal periods. The univariate TBATS results can be seen below.

sp <- predict(tbats, h=7)
plot(sp, main = "TBATS Forecast", include=7)

summary(sp)
## 
## Forecast method: BATS(0.188, {0,0}, -, -)
## 
## Model Information:
## BATS(0.188, {0,0}, -, -)
## 
## Call: tbats(y = poo)
## 
## Parameters
##   Lambda: 0.188114
##   Alpha: 0.5059235
## 
## Seed States:
##          [,1]
## [1,] 6.816144
## attr(,"lambda")
## [1] 0.1881144
## 
## Sigma: 0.5367092
## AIC: 724.3438
## 
## Error measures:
##                    ME     RMSE      MAE        MPE     MAPE MASE        ACF1
## Training set 3.907603 31.59504 24.91002 -0.3406509 17.70827  NaN -0.01717625
## 
## Forecasts:
##           Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## 2020.3830       168.4537 128.7365 217.5884 111.02112 247.9593
## 2020.3857       168.4537 124.5085 224.2316 105.33145 259.3246
## 2020.3885       168.4537 120.7910 230.3662 100.40298 269.9391
## 2020.3912       168.4537 117.4568 236.1194  96.04337 279.9952
## 2020.3940       168.4537 114.4239 241.5733  92.12864 289.6170
## 2020.3967       168.4537 111.6355 246.7845  88.57339 298.8897
## 2020.3994       168.4537 109.0507 251.7941  85.31594 307.8752

As the results indicate, the smoothing forecast performs very poorly.
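To quantify this, one could hold out the last week of observations and score the TBATS forecast against them with accuracy(); the split below is only an illustrative sketch, not part of the original analysis.

# Illustrative holdout check of the smoothing forecast (a sketch only)
n <- length(poo)
train <- ts(poo[1:(n - 7)], frequency = 7)   # keep the weekly period
test <- as.numeric(poo[(n - 6):n])           # last 7 observed values
fc <- forecast(forecast::tbats(train), h = 7)
accuracy(fc, test)                           # test-set RMSE/MAE quantify the poor fit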

4. ACF

Just as correlation measures the extent of a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series. The sample ACF at lag \(k\) is given below, and the table that follows reports ACF-based features for each hotspot epi curve.
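With \(\bar{y}\) denoting the sample mean and \(T\) the series length, the sample autocorrelation at lag \(k\) is

\[ r_{k}=\frac{\sum_{t=k+1}^{T}\left(y_{t}-\bar{y}\right)\left(y_{t-k}-\bar{y}\right)}{\sum_{t=1}^{T}\left(y_{t}-\bar{y}\right)^{2}} \]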

B <- fck %>% features(cases, feat_acf)
B %>% kbl(caption = "ACF TABLE") %>%
      kable_paper() %>%
      kable_material(c("striped", "hover"))
ACF TABLE

CD       acf1       acf10     diff1_acf1  diff1_acf10  diff2_acf1  diff2_acf10  season_acf1
Durham   0.7891266  4.648250  -0.5133877  0.2947698    -0.6838462  0.5276317    0.6443945
Peel     0.8548508  4.841692  -0.4211286  0.3066436    -0.6196727  0.5642036    0.6349811
Toronto  0.8973274  6.004804  -0.3871947  0.4022273    -0.6034202  0.6414196    0.7609879
York     0.7950619  4.025464  -0.4463045  0.2907238    -0.6505234  0.5535499    0.5595870

When data have a trend, the autocorrelations for small lags tend to be large and positive because observations nearby in time are also nearby in size. So the ACF of a trended time series tends to have positive values that slowly decrease as the lags increase. When data are seasonal, the autocorrelations will be larger for the seasonal lags (at multiples of the seasonal frequency) than for other lags. When data are both trended and seasonal, we see a combination of these effects. As shown below, we do not see a strong seasonal pattern, which was also the case in our MPA:

fck %>%
  ACF(cases) %>%
  autoplot()

5. Forecasting with other ETS smoothing methods

The ETS models are a family of time series models with an underlying state space model consisting of a level component, a trend component (T), a seasonal component (S), and an error term (E).
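For reference, the ETS(M,A,N) form selected below - Holt's linear trend with multiplicative errors - can be written in state-space form with level \(\ell_{t}\), trend \(b_{t}\), smoothing parameters \(\alpha\) and \(\beta\), and error \(\varepsilon_{t}\):

\[ y_{t}=(\ell_{t-1}+b_{t-1})(1+\varepsilon_{t}) \]
\[ \ell_{t}=(\ell_{t-1}+b_{t-1})(1+\alpha\varepsilon_{t}) \]
\[ b_{t}=b_{t-1}+\beta(\ell_{t-1}+b_{t-1})\varepsilon_{t} \]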

Most forecasts of univariate epi curves based on conventional time series smoothing methods are basically useless for any type of policy analysis. Below, we apply the automatically selected ETS model, ETS(M,A,N) - Holt's linear method with multiplicative errors - to forecast the epi curve of Toronto (7-day and 14-day horizons, respectively):

fit1 <- forecast(poo,h = 7)
poo %>% ets() %>% forecast(h=7) %>% autoplot()

fit1 <- forecast(poo,h = 14)
poo %>% ets() %>% forecast(h=14) %>% autoplot()

summary(fit1)
## 
## Forecast method: ETS(M,A,N)
## 
## Model Information:
## ETS(M,A,N) 
## 
## Call:
##  ets(y = object, model = "ZZN", lambda = lambda, biasadj = biasadj,  
## 
##  Call:
##      allow.multiplicative.trend = allow.multiplicative.trend) 
## 
##   Smoothing parameters:
##     alpha = 0.4305 
##     beta  = 1e-04 
## 
##   Initial states:
##     l = 58.1507 
##     b = 3.0485 
## 
##   sigma:  0.2007
## 
##      AIC     AICc      BIC 
## 719.8509 720.8509 730.7992 
## 
## Error measures:
##                     ME     RMSE      MAE       MPE     MAPE MASE       ACF1
## Training set -3.086879 31.25318 24.08791 -5.314455 17.61659  NaN 0.06656339
## 
## Forecasts:
##           Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## 2020.3830       174.4721 129.5975 219.3468 105.84236 243.1019
## 2020.3857       177.5003 127.7741 227.2264 101.45071 253.5498
## 2020.3885       180.5284 126.2187 234.8381  97.46889 263.5879
## 2020.3912       183.5565 124.8653 242.2477  93.79604 273.3170
## 2020.3940       186.5846 123.6688 249.5005  90.36311 282.8061
## 2020.3967       189.6127 122.5966 256.6289  87.12046 292.1050
## 2020.3994       192.6409 121.6248 263.6570  84.03109 301.2506
## 2020.4022       195.6690 120.7345 270.6035  81.06659 310.2714
## 2020.4049       198.6971 119.9113 277.4829  78.20463 319.1896
## 2020.4077       201.7252 119.1435 284.3070  75.42730 328.0232
## 2020.4104       204.7533 118.4214 291.0853  72.72001 336.7867
## 2020.4131       207.7815 117.7372 297.8257  70.07070 345.4922
## 2020.4159       210.8096 117.0844 304.5348  67.46926 354.1499
## 2020.4186       213.8377 116.4573 311.2182  64.90716 362.7683

Since univariate ARIMA models will have the same poor forecasting performance, we will not try them here.
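For completeness, a minimal sketch (not run in the original analysis) of how one could verify this with an automatically selected ARIMA model on the same Toronto sub-series:

# Illustrative only: auto.arima for comparison with the ETS/TBATS results above
fit_arima <- auto.arima(poo)
fit_arima %>% forecast(h = 14) %>% autoplot()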

6. Time Series Feature Management

We will compute different features on regional epi curves and use them to explore the properties of the series.

Here is an example of how we summarize the data weekly by CD:

### tsibble()
fck <- as_tsibble(dmdata, key = CD)
## Using `Episode` as index variable.
A <- fck %>%
  group_by_key() %>%
  index_by(date = ~ yearweek(.)) %>% 
  summarise(
    cases_high = max(cases, na.rm = TRUE),
    cases_low = min(cases, na.rm = TRUE)
  )
A %>% kbl(caption = "Weekly Max - Min Cases") %>%
      kable_paper() %>%
      scroll_box(width = "100%", height = "200px")
Weekly Max - Min Cases
CD date cases_high cases_low
Durham 2020 W09 0 0
Durham 2020 W10 2 0
Durham 2020 W11 14 1
Durham 2020 W12 22 10
Durham 2020 W13 17 9
Durham 2020 W14 28 11
Durham 2020 W15 52 13
Durham 2020 W16 49 21
Durham 2020 W17 35 15
Durham 2020 W18 35 13
Durham 2020 W19 23 6
Durham 2020 W20 21 11
Durham 2020 W21 20 8
Durham 2020 W22 16 9
Durham 2020 W23 21 3
Durham 2020 W24 13 2
Durham 2020 W25 9 1
Durham 2020 W26 7 0
Durham 2020 W27 5 3
Durham 2020 W28 10 1
Durham 2020 W29 5 1
Durham 2020 W30 3 0
Durham 2020 W31 0 0
Peel 2020 W09 1 1
Peel 2020 W10 5 1
Peel 2020 W11 17 7
Peel 2020 W12 54 23
Peel 2020 W13 45 30
Peel 2020 W14 71 39
Peel 2020 W15 84 59
Peel 2020 W16 120 77
Peel 2020 W17 80 51
Peel 2020 W18 94 43
Peel 2020 W19 69 55
Peel 2020 W20 77 49
Peel 2020 W21 92 62
Peel 2020 W22 81 42
Peel 2020 W23 63 26
Peel 2020 W24 35 26
Peel 2020 W25 51 33
Peel 2020 W26 51 22
Peel 2020 W27 37 20
Peel 2020 W28 33 20
Peel 2020 W29 35 14
Peel 2020 W30 21 0
Peel 2020 W31 0 0
Toronto 2020 W09 5 5
Toronto 2020 W10 10 6
Toronto 2020 W11 49 18
Toronto 2020 W12 79 44
Toronto 2020 W13 90 75
Toronto 2020 W14 156 94
Toronto 2020 W15 185 141
Toronto 2020 W16 286 144
Toronto 2020 W17 216 120
Toronto 2020 W18 197 102
Toronto 2020 W19 170 112
Toronto 2020 W20 189 135
Toronto 2020 W21 186 117
Toronto 2020 W22 168 79
Toronto 2020 W23 171 65
Toronto 2020 W24 87 37
Toronto 2020 W25 71 45
Toronto 2020 W26 55 38
Toronto 2020 W27 64 22
Toronto 2020 W28 62 21
Toronto 2020 W29 40 17
Toronto 2020 W30 24 0
Toronto 2020 W31 0 0
York 2020 W09 1 1
York 2020 W10 6 0
York 2020 W11 21 5
York 2020 W12 41 26
York 2020 W13 45 19
York 2020 W14 43 27
York 2020 W15 51 32
York 2020 W16 71 36
York 2020 W17 39 19
York 2020 W18 45 20
York 2020 W19 26 17
York 2020 W20 43 14
York 2020 W21 57 12
York 2020 W22 32 14
York 2020 W23 33 11
York 2020 W24 17 6
York 2020 W25 17 12
York 2020 W26 17 8
York 2020 W27 18 7
York 2020 W28 18 4
York 2020 W29 8 3
York 2020 W30 6 0
York 2020 W31 0 0

7. Univariate to Multivariate TS models - ARDL

We cannot use a pure ARDL (Autoregressive Distributed Lag) model with all features and their lags, because the sample size is not large enough to preserve adequate degrees of freedom. Instead, we will apply LASSO algorithms to the ARDL specification while keeping the panel structure, to identify the effective predictors of epi curves in Ontario.
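The sketch below illustrates the idea on a single region with a purely autoregressive lag block; the lag count and the use of glmnet here are illustrative assumptions, and the actual analysis keeps the panel structure and adds exogenous features and their lags.

# Single-region sketch of LASSO on an ARDL-style design (illustrative only)
library(glmnet)

tor <- df[df$CD == "Toronto", ]
lags <- 1:7  # assumed number of distributed lags

# Lagged copies of the outcome as predictors (an AR block);
# exogenous features and their lags would be appended the same way.
X <- sapply(lags, function(k) dplyr::lag(tor$cases, k))
colnames(X) <- paste0("cases_lag", lags)

keep <- complete.cases(X) & !is.na(tor$cases)
cvfit <- cv.glmnet(X[keep, ], tor$cases[keep], alpha = 1)  # alpha = 1 is LASSO
coef(cvfit, s = "lambda.min")  # non-zero coefficients = selected lags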

The results here confirm that temporal prediction is nearly impossible, at least with simple parametric models. Although the main issue is the use of case numbers, which cannot reveal the true number of infected people, more appropriate and trainable models built on a proper panel structure may still capture the underlying relationship between case numbers and environmental conditions. We have shown that our trainable fixed-effect models can identify the important predictors from weather and mobility features. Here, again, is the quote from Forecasting COVID-19 by Hyndman:

(. . .) While we have a good understanding of how it works in terms of person-to-person infections, we have limited and misleading data. The current numbers of confirmed cases are known to be vastly underestimated due to the limited testing available. There are almost certainly many more cases of COVID-19 that have not been diagnosed than those that have. Also, the level of under-estimation varies enormously between countries. In a country like South Korea with a lot of testing, the numbers of confirmed cases are going to be closer to the numbers of actual cases than in the US where there has been much less testing. So we simply cannot easily model the spread of the pandemic using the data that is available (. . .)

That's why we asked the NSHA for the test (811) numbers in addition to the case numbers!

Next, we will fit penalized regressions with newly engineered features, followed by tree models that utilize the geotemporal structure of the data, as we did with the trainable fixed-effect models.


A work by ML-Portal