NOTE: If you want to replicate the results, please click on the Code button to see the code, load the packages, and download the data. As before, we drop a couple of variables that still have a few NA's.

Please also check our earlier and forthcoming work on Ontario epi curves, part of our Nova Scotia COVID-19 Research Coalition Project:

  1. Ontario Covid-19 Exploratory Data Analysis,
  2. Ontario Covid-19 Matrix Profile Analysis,
  3. Trainable Fixed Effect Models for Ontario COVID-19 Epi Curves,
  4. Conventional Time Series EDA and ETS Forecasting for Ontario COVID-19 Epi Curves - Current analysis,
  5. Shrinkage Methods for Panel Epi Curves for Model Identification - Coming Soon,
  6. Panel Predictions with AdaBoost for COVID-19 Epi Curves - Coming Soon,
  7. Neural Net Applications with GeoTemporal Epi Curves - Coming Soon.
rm(list = ls())

library(tidyverse)
library(tidyquant)
library(recipes)
library(rsample)
library(knitr)

# Data Cleaning
library(janitor)

# EDA
library(skimr)
library(DataExplorer)
library(DT)
library(corrplot)
library(parameters)
library(smooth)
library(kableExtra)

# ggplot2 Helpers
library(scales)
theme_set(theme_tq())
library(raster)
library(spdplyr)
library(dplyr)
library(RColorBrewer)

# Missing data
library(mice)
library(VIM)
library(lattice)

#TS 
library(forecast)
library(fpp2)
library(fpp3)#data library for `forecast`
library(lubridate)
library(tsibble)
library(ggplot2)
setwd("~/Dropbox/RNS2")
rm(list = ls())

load("~/Dropbox/RNS2/data_v3.RData")
data <- data_v3
df <- data[, -c(16, 35:42, 48)]
# Adding month, day, and weekday columns
# (note: format "%w" returns the day of the week, 0 = Sunday;
#  the column is named "week" and is used as the weekday axis below)
df <- cbind(df, "month" = as.numeric(format(df$Episode, format = "%m")),
                "day" = as.numeric(format(df$Episode, format = "%d")),
                "week" = as.numeric(format(df$Episode, format = "%w")))

1. Time Series EDA

1.1 Initial Epi Curves

All regions:

# Epi Curves - All
ggplot(df) +
 aes(x = Episode, y = cases, fill = CD, colour = CD) +
 geom_line(size = 0.5) +
 scale_fill_hue() +
 scale_color_hue() +
 theme_minimal() +
 theme(legend.position = "none") +
 facet_wrap(vars(CD), scales = "free_y", ncol = 5L)

Toronto and its three neighboring regions: York, Peel, and Durham.

# Epi Curves - Hotspots
cd_high <- c("York", "Toronto", "Peel", "Durham")
ind <- which(df$CD %in% cd_high)
dmdata <- df[ind, ]

ggplot(dmdata) +
 aes(x = Episode, y = cases, colour = CD) +
 geom_line(size = 1L) +
 scale_color_hue() +
 theme_minimal() +
 facet_grid(vars(CD), vars(), scales = "free_y")

1.2 Calendar Epi Curves

The graph below shows Toronto's epi curve for each month (day of month on the x-axis), with weekend days coloured differently. As expected, each month is different, but Mondays almost always show a spike in case numbers.

# Level
dmdata[dmdata$CD == "Toronto", ] %>%
  mutate(
    Day = lubridate::wday(Episode, label = TRUE),
    Weekend = (Day %in% c("Sun", "Sat"))
  ) %>%
  ggplot(aes(x = day, y = cases, group = month)) +
    geom_line(aes(col = Weekend)) +
    facet_grid(CD ~ .)

This can be seen more clearly with a weekly split, shown here for Peel: case numbers trend up on Mondays and Tuesdays. Given that we use Episode dates, which record when the person's symptoms started rather than when test results were officially reported, the pattern should reflect either a reporting bias or the effect of higher weekend mobility on transmission.

dmdata[dmdata$CD == "Peel", ] %>%
  mutate(
    Day = lubridate::wday(Episode, label = TRUE),
    Week = lubridate::week(Episode),
    Weekend = (Day %in% c("Sun", "Sat"))
  ) %>%
  ggplot(aes(x = week, y = cases, group = Week)) +
    geom_line(aes(col = Weekend)) +
    facet_grid(CD ~ .)

1.3 Correlations across Epi Curves for Hotspots

We asked the same question in our earlier Matrix Profile Analysis (MPA) and found no spatial correlation between the neighboring regions close to Toronto. Before de-trending, the correlation coefficients are high:

### Level
tr <- df[df$CD=="Toronto", 5]
pl <- df[df$CD=="Peel", 5]
dr <- df[df$CD=="Durham", 5]
yr <- df[df$CD=="York", 5]

cases <- data.frame(tr, pl, dr, yr)
colnames(cases) <- c("Toronto", "Peel", "Durham", "York")

corr <- round(cor(cases, use = "complete.obs"),5)
corr %>%
  kbl(caption = "Correlations across epi curves - with levels") %>%
   kable_material(c("striped", "hover"))
Correlations across epi curves - with levels

          Toronto     Peel   Durham     York
Toronto   1.00000  0.89766  0.83306  0.80817
Peel      0.89766  1.00000  0.77069  0.78102
Durham    0.83306  0.77069  1.00000  0.80114
York      0.80817  0.78102  0.80114  1.00000

After de-trending by first-differencing, the correlations become very low, as we also found in our earlier MPA work:

### After de-trending
cases <- as.matrix(cases)
dcases <- diff(cases, lag = 1)
corr <- round(cor(dcases, use = "complete.obs"),5)
corr %>%
  kbl(caption = "Correlations across epi curves - after detrending") %>%
   kable_material(c("striped", "hover"))
Correlations across epi curves - after detrending

          Toronto     Peel   Durham     York
Toronto   1.00000  0.28027  0.12042  0.21791
Peel      0.28027  1.00000  0.20956  0.33268
Durham    0.12042  0.20956  1.00000  0.10807
York      0.21791  0.33268  0.10807  1.00000

We will now look at the correlation after we remove the trend with and without seasonality.

2. STL Decomposition

Time series data can exhibit a variety of patterns, and it is often helpful to split a time series into several components, each representing an underlying pattern category. Although there are many time series decomposition methods, STL is a versatile and robust one. STL is an acronym for Seasonal and Trend decomposition using Loess, where Loess (LOcal regrESSion) is a method for estimating nonlinear relationships. The STL method was developed by Cleveland, Cleveland, McRae, & Terpenning (1990).
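With an additive decomposition, the observed series at time \(t\) splits into

\[ y_{t}=S_{t}+T_{t}+R_{t} \]

where \(S_{t}\) is the seasonal component, \(T_{t}\) the trend-cycle, and \(R_{t}\) the remainder; the detrending step further below follows directly from this identity.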

The graph below, using Toronto as an example, shows the Loess smoothing (with a 7-day trend window) used to extract the trend (dashed red line):

fts <- ts(df[df$CD=="Toronto", ], start = c(2020, as.numeric(format(df$Episode[1], "%j"))),
           frequency = 52)
fit <-  stl(fts[, "cases"], t.window=7, s.window="periodic", robust=TRUE) 
plot.ts(fts[, "cases"], col = 'gray', ylab = "Cases",
        xaxt = "n", xlab = "")
lines(fit$time.series[,2], 
      col = "red", lwd = 2, lty = 2)

Next, we detrend our data by subtracting the trend from the original data.

\[ y_{t}^{*}=y_{t}-T_{t} \]

detrended <- fts[, "cases"] - fit$time.series[,2]
plot.ts(detrended, 
        ylab = "Cases", 
        main = 'Seasonal variability of cases in Toronto',
        cex.main = 0.85, col = "blue", lwd = 2,
        xaxt = "n", xlab = "")

As expected, the detrended series does not exhibit an apparent weekly seasonal pattern.

Further, let us investigate the remainder and check whether the residuals are normally distributed, using a Q-Q plot produced with the qqnorm() and qqline() functions. Although the very small p-value from the Shapiro-Wilk test does not support normality, the Q-Q plot shows that the residuals are approximately normal around the peak of the pandemic, suggesting that the decomposition did a reasonable job of extracting the trend and seasonal components and that the remainder is mostly noise.

plot(fit$time.series[,3], 
     col = "darkgreen", main = 'Remainder', ylab = "",
     lwd = 2, xlab = "", xaxt = "n")

qqnorm(fit$time.series[,3])
qqline(fit$time.series[,3])

shapiro.test(fit$time.series[,3])
## 
##  Shapiro-Wilk normality test
## 
## data:  fit$time.series[, 3]
## W = 0.75826, p-value = 2.194e-14

Loess decomposition is intended to smooth the series by applying local averaging so that it collapses into components, such as the trend and seasonal, that are of interest for the analysis. It is not, however, a formal test for the presence of seasonality.
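As a rough heuristic (a sketch only, not part of the original analysis and still not a formal hypothesis test), one could ask forecast::nsdiffs() how many seasonal differences it considers necessary; its "seas" option relies on an STL-based seasonal strength measure.

# Heuristic seasonality check on the Toronto series (illustrative only)
nsdiffs(fts[, "cases"], test = "seas")
# A result of 0 would be consistent with the weak weekly seasonality seen above.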

Although our decomposition did not return a smoothed pattern of seasonal periodicity, we can still compare the variance of each component with respect to the variance of the original series, which reveals their individual contributions to the dynamics of the data.

apply(fit$time.series, 2, var) / var(fts[, "cases"])
##   seasonal      trend  remainder 
## 0.10643067 0.89371379 0.07504542

The relative variances indicate that trend is the main component explaining the dynamics of the series.

fck <- as_tsibble(dmdata, key = CD)
## Using `Episode` as index variable.
C <- fck %>% features(cases, feat_stl)
C %>% kbl(caption = "Seasonality") %>%
      kable_paper() %>%
      kable_material(c("striped", "hover")) %>%
      row_spec(0, angle = -45)
Seasonality

CD       trend_strength  seasonal_strength_week  seasonal_peak_week  seasonal_trough_week  spikiness  linearity   curvature   stl_e_acf1  stl_e_acf10
Durham   0.8611606       0.1541656               6                   0                     0.0490268  -52.18190   -69.20813   -0.2278144  0.1751434
Peel     0.9158026       0.2373876               6                   1                     0.5418180  -54.47934   -243.69358  -0.1533988  0.2376020
Toronto  0.9549695       0.4630639               2                   1                     4.0455005  -241.87120  -605.48768  -0.1032323  0.0764386
York     0.8752620       0.2488219               2                   0                     0.1665330  -80.25322   -95.80548   -0.1289709  0.2995586
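For interpretation, the trend and seasonal strength measures reported above are defined (following Hyndman & Athanasopoulos) from the STL components as

\[ F_{T}=\max\left(0,\,1-\frac{\operatorname{Var}(R_{t})}{\operatorname{Var}(T_{t}+R_{t})}\right), \qquad F_{S}=\max\left(0,\,1-\frac{\operatorname{Var}(R_{t})}{\operatorname{Var}(S_{t}+R_{t})}\right) \]

so values close to 1 indicate a strong trend (or strong seasonality) relative to the remainder.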

Although there appears to be no significant seasonality in the epi curves of these regions, we will now examine whether these hotspots have correlated seasonal patterns, to identify any hidden regional dependence in their epi curves.

# Toronto
ftsT <- ts(df[df$CD=="Toronto", ], start = c(2020, as.numeric(format(df$Episode[1], "%j"))),
           frequency = 52)
fitT <-  stl(ftsT[, "cases"], t.window=7, s.window="periodic", robust=TRUE) 
ST <- ftsT[, "cases"] - fitT$time.series[,2]

# Peel
ftsP <- ts(df[df$CD=="Peel", ], start = c(2020, as.numeric(format(df$Episode[1], "%j"))),
           frequency = 52)
fitP <-  stl(ftsP[, "cases"], t.window=7, s.window="periodic", robust=TRUE) 
SP <- ftsP[, "cases"] - fitP$time.series[,2]

# York
ftsY <- ts(df[df$CD=="York", ], start = c(2020, as.numeric(format(df$Episode[1], "%j"))),
           frequency = 52)
fitY <-  stl(ftsY[, "cases"], t.window=7, s.window="periodic", robust=TRUE) 
SY <- ftsY[, "cases"] - fitY$time.series[,2]

# Durham
ftsD <- ts(df[df$CD=="Durham", ], start = c(2020, as.numeric(format(df$Episode[1], "%j"))),
           frequency = 52)
fitD <-  stl(ftsD[, "cases"], t.window=7, s.window="periodic", robust=TRUE)
SD <- ftsD[, "cases"] - fitD$time.series[,2]

season <- data.frame(ST, SP, SY, SD)
colnames(season) <- c("Toronto", "Peel", "York", "Durham")

corr <- round(cor(season, use = "complete.obs"),5)
corr %>%
  kbl(caption = "Correlation across Neighboring Hotspots - Seasonal Patterns") %>%
   kable_material(c("striped", "hover"))
Correlation across Neighboring Hotspots - Seasonal Patterns

          Toronto     Peel     York   Durham
Toronto   1.00000  0.52587  0.43161  0.40290
Peel      0.52587  1.00000  0.44503  0.30987
York      0.43161  0.44503  1.00000  0.22371
Durham    0.40290  0.30987  0.22371  1.00000

The results also confirm the same weak geotemporal dependence across epi curves for the neighboring hotspot regions.

3. Complex Seasonality

So far, we have considered relatively simple weekly seasonal patterns. However, higher frequency time series often exhibit more complicated seasonal patterns. For example, daily data may have a weekly pattern as well as a monthly pattern. To deal with such series, we will define multiple seasonality in our epi curves. This allows us to specify all of the frequencies that might be relevant.

dff <- msts(df[df$CD=="Toronto", ], seasonal.periods = c(7,365.25), start = decimal_date(as.Date("2020-03-01")))
po <- dff[, "cases"]
tbat <- tbats(po)
plot(tbat, main="Multiple Season Decomposition for Toronto")

ddf <- df[df$CD=="Toronto", ]
ddf <- dff[15:80, ]  # rows 15-80 of the Toronto multi-seasonal series defined above
ah <- msts(ddf, seasonal.periods = c(7,365.25), start = decimal_date(as.Date("2020-03-15")))
poo <- ah[, "cases"]
tbats <- tbats(poo)

It seems that every week/weekend shows a similar (though not identical) pattern when we relax the fixed-seasonality assumption. Unlike techniques such as Holt-Winters, TBATS can forecast while accommodating multiple seasonal periods. The univariate TBATS results can be seen below.

sp <- predict(tbats, h=7)
plot(sp, main = "TBATS Forecast", include=7)

summary(sp)
## 
## Forecast method: BATS(0.188, {0,0}, -, -)
## 
## Model Information:
## BATS(0.188, {0,0}, -, -)
## 
## Call: tbats(y = poo)
## 
## Parameters
##   Lambda: 0.188114
##   Alpha: 0.5059235
## 
## Seed States:
##          [,1]
## [1,] 6.816144
## attr(,"lambda")
## [1] 0.1881144
## 
## Sigma: 0.5367092
## AIC: 724.3438
## 
## Error measures:
##                    ME     RMSE      MAE        MPE     MAPE MASE        ACF1
## Training set 3.907603 31.59504 24.91002 -0.3406509 17.70827  NaN -0.01717625
## 
## Forecasts:
##           Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## 2020.3830       168.4537 128.7365 217.5884 111.02112 247.9593
## 2020.3857       168.4537 124.5085 224.2316 105.33145 259.3246
## 2020.3885       168.4537 120.7910 230.3662 100.40298 269.9391
## 2020.3912       168.4537 117.4568 236.1194  96.04337 279.9952
## 2020.3940       168.4537 114.4239 241.5733  92.12864 289.6170
## 2020.3967       168.4537 111.6355 246.7845  88.57339 298.8897
## 2020.3994       168.4537 109.0507 251.7941  85.31594 307.8752

As the results indicate, the smoothing forecast performs very poorly.
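To quantify this, one could hold out the last week of observations and score the TBATS forecast against them with accuracy(); the split below is only an illustrative sketch, not part of the original analysis.

# Illustrative holdout check of the smoothing forecast (a sketch only)
n <- length(poo)
train <- ts(poo[1:(n - 7)], frequency = 7)   # keep the weekly period
test <- as.numeric(poo[(n - 6):n])           # last 7 observed values
fc <- forecast(forecast::tbats(train), h = 7)
accuracy(fc, test)                           # test-set RMSE/MAE quantify the poor fit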

4. ACF

Just as correlation measures the extent of a linear relationship between two variables, autocorrelation measures the linear relationship between lagged values of a time series. The sample ACF at lag \(k\) is given below, and the table that follows reports ACF-based features for each hotspot epi curve.
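With \(\bar{y}\) denoting the sample mean and \(T\) the series length, the sample autocorrelation at lag \(k\) is

\[ r_{k}=\frac{\sum_{t=k+1}^{T}\left(y_{t}-\bar{y}\right)\left(y_{t-k}-\bar{y}\right)}{\sum_{t=1}^{T}\left(y_{t}-\bar{y}\right)^{2}} \]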

B <- fck %>% features(cases, feat_acf)
B %>% kbl(caption = "ACF TABLE") %>%
      kable_paper() %>%
      kable_material(c("striped", "hover"))
ACF TABLE

CD       acf1       acf10     diff1_acf1  diff1_acf10  diff2_acf1  diff2_acf10  season_acf1
Durham   0.7891266  4.648250  -0.5133877  0.2947698    -0.6838462  0.5276317    0.6443945
Peel     0.8548508  4.841692  -0.4211286  0.3066436    -0.6196727  0.5642036    0.6349811
Toronto  0.8973274  6.004804  -0.3871947  0.4022273    -0.6034202  0.6414196    0.7609879
York     0.7950619  4.025464  -0.4463045  0.2907238    -0.6505234  0.5535499    0.5595870

When data have a trend, the autocorrelations for small lags tend to be large and positive because observations nearby in time are also nearby in size. So the ACF of a trended time series tends to have positive values that slowly decrease as the lags increase. When data are seasonal, the autocorrelations will be larger for the seasonal lags (at multiples of the seasonal frequency) than for other lags. When data are both trended and seasonal, we see a combination of these effects. As shown below, we do not see a strong seasonal pattern, which was also the case in our MPA:

fck %>%
  ACF(cases) %>%
  autoplot()

5. Forecasting with other ETS smoothing methods

The ETS models are a family of time series models with an underlying state space model consisting of a level component, a trend component (T), a seasonal component (S), and an error term (E).
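For reference, the ETS(M,A,N) form selected below - Holt's linear trend with multiplicative errors - can be written in state-space form with level \(\ell_{t}\), trend \(b_{t}\), smoothing parameters \(\alpha\) and \(\beta\), and error \(\varepsilon_{t}\):

\[ y_{t}=(\ell_{t-1}+b_{t-1})(1+\varepsilon_{t}) \]
\[ \ell_{t}=(\ell_{t-1}+b_{t-1})(1+\alpha\varepsilon_{t}) \]
\[ b_{t}=b_{t-1}+\beta(\ell_{t-1}+b_{t-1})\varepsilon_{t} \]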

Most forecasts of univariate epi curves based on conventional time series smoothing methods are basically useless for any type of policy analysis. Below, we apply the automatically selected ETS model, ETS(M,A,N) - Holt's linear method with multiplicative errors - to forecast the epi curve of Toronto (7-day and 14-day horizons, respectively):

fit1 <- forecast(poo,h = 7)
poo %>% ets() %>% forecast(h=7) %>% autoplot()

fit1 <- forecast(poo,h = 14)
poo %>% ets() %>% forecast(h=14) %>% autoplot()

summary(fit1)
## 
## Forecast method: ETS(M,A,N)
## 
## Model Information:
## ETS(M,A,N) 
## 
## Call:
##  ets(y = object, model = "ZZN", lambda = lambda, biasadj = biasadj,  
## 
##  Call:
##      allow.multiplicative.trend = allow.multiplicative.trend) 
## 
##   Smoothing parameters:
##     alpha = 0.4305 
##     beta  = 1e-04 
## 
##   Initial states:
##     l = 58.1507 
##     b = 3.0485 
## 
##   sigma:  0.2007
## 
##      AIC     AICc      BIC 
## 719.8509 720.8509 730.7992 
## 
## Error measures:
##                     ME     RMSE      MAE       MPE     MAPE MASE       ACF1
## Training set -3.086879 31.25318 24.08791 -5.314455 17.61659  NaN 0.06656339
## 
## Forecasts:
##           Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## 2020.3830       174.4721 129.5975 219.3468 105.84236 243.1019
## 2020.3857       177.5003 127.7741 227.2264 101.45071 253.5498
## 2020.3885       180.5284 126.2187 234.8381  97.46889 263.5879
## 2020.3912       183.5565 124.8653 242.2477  93.79604 273.3170
## 2020.3940       186.5846 123.6688 249.5005  90.36311 282.8061
## 2020.3967       189.6127 122.5966 256.6289  87.12046 292.1050
## 2020.3994       192.6409 121.6248 263.6570  84.03109 301.2506
## 2020.4022       195.6690 120.7345 270.6035  81.06659 310.2714
## 2020.4049       198.6971 119.9113 277.4829  78.20463 319.1896
## 2020.4077       201.7252 119.1435 284.3070  75.42730 328.0232
## 2020.4104       204.7533 118.4214 291.0853  72.72001 336.7867
## 2020.4131       207.7815 117.7372 297.8257  70.07070 345.4922
## 2020.4159       210.8096 117.0844 304.5348  67.46926 354.1499
## 2020.4186       213.8377 116.4573 311.2182  64.90716 362.7683

Since univariate ARIMA models will have the same poor forecasting performance, we will not try them here.
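For completeness, a minimal sketch (not run in the original analysis) of how one could verify this with an automatically selected ARIMA model on the same Toronto sub-series:

# Illustrative only: auto.arima for comparison with the ETS/TBATS results above
fit_arima <- auto.arima(poo)
fit_arima %>% forecast(h = 14) %>% autoplot()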

6. Time Series Feature Management

We will compute different features on regional epi curves and use them to explore the properties of the series.

Here is an example of how we summarize the data weekly by CD:

### tsibble()
fck <- as_tsibble(dmdata, key = CD)
## Using `Episode` as index variable.
A <- fck %>%
  group_by_key() %>%
  index_by(date = ~ yearweek(.)) %>% 
  summarise(
    cases_high = max(cases, na.rm = TRUE),
    cases_low = min(cases, na.rm = TRUE)
  )
A %>% kbl(caption = "Weekly Max - Min Cases") %>%
      kable_paper() %>%
      scroll_box(width = "100%", height = "200px")
Weekly Max - Min Cases
CD date cases_high cases_low
Durham 2020 W09 0 0
Durham 2020 W10 2 0
Durham 2020 W11 14 1
Durham 2020 W12 22 10
Durham 2020 W13 17 9
Durham 2020 W14 28 11
Durham 2020 W15 52 13
Durham 2020 W16 49 21
Durham 2020 W17 35 15
Durham 2020 W18 35 13
Durham 2020 W19 23 6
Durham 2020 W20 21 11
Durham 2020 W21 20 8
Durham 2020 W22 16 9
Durham 2020 W23 21 3
Durham 2020 W24 13 2
Durham 2020 W25 9 1
Durham 2020 W26 7 0
Durham 2020 W27 5 3
Durham 2020 W28 10 1
Durham 2020 W29 5 1
Durham 2020 W30 3 0
Durham 2020 W31 0 0
Peel 2020 W09 1 1
Peel 2020 W10 5 1
Peel 2020 W11 17 7
Peel 2020 W12 54 23
Peel 2020 W13 45 30
Peel 2020 W14 71 39
Peel 2020 W15 84 59
Peel 2020 W16 120 77
Peel 2020 W17 80 51
Peel 2020 W18 94 43
Peel 2020 W19 69 55
Peel 2020 W20 77 49
Peel 2020 W21 92 62
Peel 2020 W22 81 42
Peel 2020 W23 63 26
Peel 2020 W24 35 26
Peel 2020 W25 51 33
Peel 2020 W26 51 22
Peel 2020 W27 37 20
Peel 2020 W28 33 20
Peel 2020 W29 35 14
Peel 2020 W30 21 0
Peel 2020 W31 0 0
Toronto 2020 W09 5 5
Toronto 2020 W10 10 6
Toronto 2020 W11 49 18
Toronto 2020 W12 79 44
Toronto 2020 W13 90 75
Toronto 2020 W14 156 94
Toronto 2020 W15 185 141
Toronto 2020 W16 286 144
Toronto 2020 W17 216 120
Toronto 2020 W18 197 102
Toronto 2020 W19 170 112
Toronto 2020 W20 189 135
Toronto 2020 W21 186 117
Toronto 2020 W22 168 79
Toronto 2020 W23 171 65
Toronto 2020 W24 87 37
Toronto 2020 W25 71 45
Toronto 2020 W26 55 38
Toronto 2020 W27 64 22
Toronto 2020 W28 62 21
Toronto 2020 W29 40 17
Toronto 2020 W30 24 0
Toronto 2020 W31 0 0
York 2020 W09 1 1
York 2020 W10 6 0
York 2020 W11 21 5
York 2020 W12 41 26
York 2020 W13 45 19
York 2020 W14 43 27
York 2020 W15 51 32
York 2020 W16 71 36
York 2020 W17 39 19
York 2020 W18 45 20
York 2020 W19 26 17
York 2020 W20 43 14
York 2020 W21 57 12
York 2020 W22 32 14
York 2020 W23 33 11
York 2020 W24 17 6
York 2020 W25 17 12
York 2020 W26 17 8
York 2020 W27 18 7
York 2020 W28 18 4
York 2020 W29 8 3
York 2020 W30 6 0
York 2020 W31 0 0

7. Univariate to Multivariate TS models - ARDL

We cannot use a pure ARDL (Autoregressive Distributed Lag) model with all features and their lags, because the sample size is not large enough to preserve adequate degrees of freedom. Instead, we will apply LASSO algorithms to the ARDL specification while keeping the panel structure, to identify the effective predictors of epi curves in Ontario.
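The sketch below illustrates the idea on a single region with a purely autoregressive lag block; the lag count and the use of glmnet here are illustrative assumptions, and the actual analysis keeps the panel structure and adds exogenous features and their lags.

# Single-region sketch of LASSO on an ARDL-style design (illustrative only)
library(glmnet)

tor <- df[df$CD == "Toronto", ]
lags <- 1:7  # assumed number of distributed lags

# Lagged copies of the outcome as predictors (an AR block);
# exogenous features and their lags would be appended the same way.
X <- sapply(lags, function(k) dplyr::lag(tor$cases, k))
colnames(X) <- paste0("cases_lag", lags)

keep <- complete.cases(X) & !is.na(tor$cases)
cvfit <- cv.glmnet(X[keep, ], tor$cases[keep], alpha = 1)  # alpha = 1 is LASSO
coef(cvfit, s = "lambda.min")  # non-zero coefficients = selected lags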

The results here confirm that temporal prediction is nearly impossible, at least with simple parametric models. Although the main issue is the use of case numbers, which cannot reveal the true number of infected people, more appropriate and trainable models built on a proper panel structure may still capture the underlying relationship between case numbers and environmental conditions. We have shown that our trainable fixed-effect models can identify the important predictors from weather and mobility features. Here, again, is the quote from Forecasting COVID-19 by Hyndman:

(. . .) While we have a good understanding of how it works in terms of person-to-person infections, we have limited and misleading data. The current numbers of confirmed cases are known to be vastly underestimated due to the limited testing available. There are almost certainly many more cases of COVID-19 that have not been diagnosed than those that have. Also, the level of under-estimation varies enormously between countries. In a country like South Korea with a lot of testing, the numbers of confirmed cases are going to be closer to the numbers of actual cases than in the US where there has been much less testing. So we simply cannot easily model the spread of the pandemic using the data that is available (. . .)

That's why we asked the NSHA for the test (811) numbers in addition to the case numbers!

Next, we will fit penalized regressions with newly engineered features, followed by tree models that utilize the geotemporal structure of the data, as we did with the trainable fixed-effect models.


A work by ML-Portal