EDA with Matrix Profile

This rmd file shows several exploratory data experiments with matrix profile analysis.
NOTE: Please click on the Code button to see the codes.

Here are some of the questions we ask in this document:

Are there any “motifs” and “discords” in Covid-19 Epidemic Curves for any region, specifically for Toronto?
Is there any common patterns in epidemic curves between Toronto and neighboring regions?
Are the segments (“regime” changes) of epidemic curves in Toronto similar in neighboring regions?

Let’s download the data and load the packages. (This rmd file uses several packages. If you want to replicate the same results, please install them first.)

setwd("~/Dropbox/RNS2")
load("~/Dropbox/RNS2/data_v3.RData")
data1 <- data_v3

library(tidyverse)
library(tidyquant)
library(recipes)
library(rsample)
library(knitr)
library(maps)

# Data Cleaning
library(janitor)

# EDA
library(skimr)
library(DataExplorer)
library(DT)
library(corrplot)

# ggplot2 Helpers
library(scales)
theme_set(theme_tq())
library(raster)
library(spdplyr)
library(dplyr)
library(RColorBrewer)

# Missing data
library(mice)
library(VIM)
library(lattice)

# MPA
library(tsmp)
library(pracma)

We are ready! An epidemic curve (or epi curve) is a histogram (bar chart) that shows the distribution of cases over time. The time intervals are displayed on the x axis (the horizontal axis), and case counts are displayed on the y axis (the vertical axis). However, we will here use a line instead of a histogram to take advantage of MPA.

1. Time Series Patterns & Epi Curve

Repeated Pattern (Motifs): The subsequences having very high similarity to each other. Given a length, the most similar/least distant pair of non-overlapping subsequences.

Anomalous Pattern (Discords): The subsequence with maximal distance to its nearest neighbor. The highest values correspond to the time series discords

The size of an epidemic curve can help validate whether we are in an outbreak by showing an increased number of cases over baseline visually. The overall shape of the epi curve may help identify the mode of transmission of the outbreak. There are several well-described types of epi curves: point source, continuous common source, propagated, and intermittent. Here how Outbreak Toolkit describes them:

Point source – Persons are exposed to the same common source over a brief period of time, such as through a single meal or event attended by all cases; number of cases rise rapidly to a peak and falls off gradually; majority of cases occur within one incubation period.
Continuous common source – Exposure is not confined to one point in time (prolonged over a period of days, weeks or longer); as such, cases are spread over a greater period of time depending on how long the exposure persists; lasts more than one incubation period
Propagated source – does not have a common source but instead caused by spread of pathogen from one susceptible person to another; transmission may occur directly (person-to-person) or via an intermediate host; tends to have a series of irregular peaks reflecting the number of generations of infection; multiple peaks separated by approx. one incubation period; e.g., person-to-person spread of shigellosis
Intermittent - similar to continuous but exposure is intermittent; multiple peaks – length: no relation to the incubation period (reflects intermittent times of exposure) e.g., contaminated food product sold over period of time

If we connect the tip of the boxes we would have a better idea about their patterns as a curve.

2. Matrix Profile Analysis

In time series analysis, one is typically interested in two things; anomalies and trends. One method to find anomalies and trends within a time series is to perform a similarity join. Essentially, you compare snippets of the time series against itself by computing the distance between each pair of snippets. Although it would be easy with a naive algorithm that runs over loops, the Matrix Profile (MP) algorithms would be better and quicker.

The Matrix Profile is a relatively new, introduced in 2016, data structure for time series analysis developed by Eamonn Keogh at the University of California Riverside and Abdullah Mueen at the University of New Mexico. Some of the advantages of using the Matrix Profile is that it is domain agnostic, fast, provides an exact solution (approximate when desired) and only requires a single parameter. See Matrix Profile Foundation for more details and sources.

We can create a companion “time series”, called a Matrix Profile or MP.

Here is the intuition:
(Figures are taken from http://www.cs.ucr.edu/~eamonn/MatrixProfile.html)

Given a time series, T and a desired subsequence length, m;
We can use sliding window of length m to extract all subsequences of length m
We can then compute the pairwise distance among these subsequences and store them to a matrix

This matrix would be OK for short TS, but could be too big for long TS.
Moreover, most part of the matrix contains information that is repeated or irrelevant for tasks in time series data mining.
What kind of information should we extract from the matrix?
Who is each subsequence’s nearest neighbor?
After finding the nearest neighbor (kNN where k=1) for each subsequence, we calculate the distance: how far away from their corresponding nearest neighbor?

The distance to the corresponding nearest neighbor of each subsequence can be stored in a vector called matrix profile.

The following applications use the tsmp package. Fore more details: https://indico.in2p3.fr/event/13934/contributions/15841/attachments/13238/16227/Mueen_presentation_TIMECLEAN_workshop.pdf

The following example shows how we can identify if any given subsequence can be part of another time series by calculating distance profiles.

# Toronto cases
tr <- unlist(data1[data1$CD == "Toronto", "cases"])
plot(tr, main = "Toronto", type = "l", ylab = "", xlab = "time",
     col = "darkgreen", lwd = 2)

# Query
query <- tr[71:85] # pretend you didn't see this line.

# Distance Profile by tsmp
res <- dist_profile(tr, query)
str(res)

## List of 3
##  $ distance_profile: num [1:135] 40.4 41.5 42 40.2 36.7 ...
##  $ last_product    : num [1:135] 44881 52970 63206 72037 81388 ...
##  $ par             :List of 9
##   ..$ window_size: int 15
##   ..$ data_fft   : cplx [1:256] 14079+0i 64-10705i -4466-1347i ...
##   ..$ data_size  : int 149
##   ..$ data_mean  : num [1:135] 21 24.9 29.7 33.5 37.3 ...
##   ..$ data_sd    : num [1:135] 16 18.5 22.2 23.3 23.1 ...
##   ..$ query_mean : num 149
##   ..$ query_sd   : num 24.3
##   ..$ data       : num [1:149] 5 6 10 6 7 10 8 10 18 28 ...
##   ..$ k          : NULL

What this vector represents is the euclidean distance of our query (the pattern that we want to look for) related to the tr. This means that the minimum value contains the place that is most similar to our query:

first <- which.min(res$distance_profile)
first

## [1] 71

Yes! There is a match! But there are more similar patterns to this query?

# this will order the results and give us the position of them
st <- sort(res$distance_profile, index.return = TRUE)  
w <- 30

plot(tr, main = "Toronto", type = "l", ylab = "", xlab = "time",
     col = "dark green")

for (i in 1:3) {
    lines(st$ix[i]:(st$ix[i] + w - 1), tr[st$ix[i]:(st$ix[i] + w - 1)], col = i + 1, lwd = 2)
    text(st$ix[i], -2, label = paste("match", i), adj = 0, col = i + 1, cex = 0.9)
}

Here, the red “match1” is our query. Let’s smooth the data and see if it will still match?

trdf <- as.data.frame(tr)
trsmth <- trdf %>%
 mutate("3dma" = (tr+
                lag(tr,1)+
                lag(tr,2))/3)

plot(trsmth[,2], type = "l", col = "green", lwd = 2)

res2 <- dist_profile(trsmth[3:149,2], query)
which.min(res2$distance_profile)

## [1] 70

st <- sort(res2$distance_profile, index.return = TRUE)  
w <- 30

plot(trsmth[,2], main = "Toronto", type = "l", ylab = "", xlab = "time",
     col = "dark green")
lines(st$ix[1]:(st$ix[1] + w - 1), tr[st$ix[1]:(st$ix[1] + w - 1)], col = 1 + 1, lwd = 2)
    text(st$ix[1], -2, label = paste("match", 1), adj = 0, col = 1 + 1, cex = 0.9)

3. Motifs & Discords - Toronto

3.1 Level data (Epi Curve)

Are there any repeated patterns in Toronto cases?

tr_obj <- tsmp(tr, window_size = 10, exclusion_zone = 1, verbose = 0)
str(tr_obj, 1)

## List of 9
##  $ mp  : num [1:140, 1] 1.21 1.05 1.01 1.36 1.16 ...
##  $ pi  : num [1:140, 1] 23 24 25 25 37 37 35 36 36 38 ...
##  $ rmp : num [1:140, 1] 1.21 1.05 1.01 1.36 1.16 ...
##  $ rpi : num [1:140, 1] 23 24 25 25 37 37 35 36 36 38 ...
##  $ lmp : num [1:140, 1] Inf Inf Inf Inf Inf ...
##  $ lpi : num [1:140, 1] -Inf -Inf -Inf -Inf -Inf ...
##  $ w   : num 10
##  $ ez  : num 1
##  $ data:List of 1
##  - attr(*, "class")= chr "MatrixProfile"
##  - attr(*, "join")= logi FALSE
##  - attr(*, "origin")=List of 8

Unlike a distance profile, we now have a matrix profile that is the minimum value of all distance profiles of all possible rolling window in this data. Together, we have the profile index that points to the most similar pattern in this data.

tr_obj$rmp <- NULL  # we don't need that 
tr_obj$lmp <- NULL  # we don't need that
plot(tr_obj, data = TRUE, col = "red")

The corresponding subsequence in the raw data at the location where the distance is very low must have at least one similar subsequence somewhere. We have almost 4 points where the distance is closed to zero.

The Matrix Profile gets lower values if pattern repeats, and higher values when a different pattern arises.

Where you see relatively low values, you know that the subsequence in the original time series must have (at least one) relatively similar subsequence elsewhere in the data (such regions are “motifs” or reoccurring patterns).
Where you see relatively high values, you know that the subsequence in the original time series must be unique in its shape (such areas are “discords” or anomalies).

For example, in genetics, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance.

compute(tr, 10) %>% motifs() %>% visualize()

However, one immediate problem is that the choice of subsequence length will change both the number and location of motifs! We set the window size to 10 before.

When we don’t specify any information regarding our subsequence length, we can use the pan-matrix profile (or PMP - see http://bigdataieee.org/BigData2017/files/Tutorial4.pdf more details) to generate insights that will help us evaluate different subsequence lengths. In the first graph, it is a global calculation of all possible subsequence lengths condensed into a single visual summary. The X-axis is the index of the matrix profile, and the Y-axis is the corresponding subsequence length. The darker the shade, the lower the Euclidean distance at that point. We can use the “peaks” of those shades to find the “big” motifs visually present our time series.

The graphs after PMP are created by analyze showing the top three motifs and top three discords, along with the corresponding window size and position within the Matrix Profile (and, by extension, the time series data).

analyze(tr)

Hence, given PMP, we set w = 6.

analyze(tr, 6)

See Francisco Bishoff] for more details on tsmp. This link would also help.

Dashed lines shows similar motifs (discords), and solid lines (bold and thin) show the postilion of each type of motifs (discords). Up to this point, we see that Covid-19 cases in Toronto discords are more prevalent than motifs.

3.2 Smoothed Epi Curve

n = 3
ctr <- c(0,cumsum(tr))
rsum <- (ctr[(n+1):length(ctr)] - ctr[1:(length(ctr) - n)]) / n
plot(rsum, type = "l", col = "orange", lwd = 2)

analyze(rsum)

analyze(rsum, 10)

4. Common Motifs in Neighboring Regions

This is also called Common Subsequence Mining with matrix profile. Given two (or more) time series collected under different conditions or treatments, a data analyst may wish to know what patterns (if any) are conserved between the two time series. We can use MP to discover common subsequences in two time series.

Given two time series \(T_A\) and \(T_B\), matrix profile can be computed by finding the nearest neighbor of \(T_A\)’s subsequences in \(T_B\) or vise verse. We can think of the MP as a type of similarity self-join. For every subsequence in \(T_A\), we join it with its nearest (non- trivial) neighbor in \(T_A\), or \(J_{T_AT_A}\) or \(T_{A} \bowtie\) 1nn \(T_{A}\). We can generalize it to an AB-similarity join. For every subsequence in A, we join can it with its nearest neighbor in B, or \(J_{T_AT_B}\) or \(T_{A} \bowtie\) 1nn \(T_{B}\).

Is there any pattern that is common to COVID19 cases across regions neighboring to Toronto? We will look at Toronto, York, Peel, and Durham.

cases <- aggregate(data1[, 5], list(data1$CD), function(x) sum(x))
cases

##                           Group.1 cases
## 1                           Brant   149
## 2                    Chatham-Kent   225
## 3                        Cochrane    70
## 4                          Durham  1793
## 5                           Elgin   105
## 6                           Essex  2210
## 7                       Frontenac   109
## 8                            Grey   113
## 9               Haldimand-Norfolk   437
## 10                         Halton   824
## 11                       Hamilton   878
## 12                        Lambton   295
## 13            Leeds and Grenville   355
## 14                      Middlesex   663
## 15                        Niagara   826
## 16                 Northumberland   213
## 17                         Ottawa  2438
## 18                           Peel  6495
## 19                          Perth    66
## 20                   Peterborough    94
## 21                         Simcoe   657
## 22 Stormont, Dundas and Glengarry   175
## 23                        Sudbury    71
## 24                    Thunder Bay    92
## 25                        Toronto 14079
## 26                       Waterloo  1364
## 27                     Wellington   525
## 28                           York  3222

4.1 `simple_fast()`

We will first use SiMPle - Similarity Matrix Profile (see: Fast Similarity Matrix Profile for Music Analysis and Exploration)

# Peel is the 2nd worst in total cases with other neighboring regions
pl <- unlist(data1[data1$CD == "Peel", "cases"])
yk <- unlist(data1[data1$CD == "York", "cases"])
dm <- unlist(data1[data1$CD == "Durham", "cases"])

# Graph cases for Toronto, Peel, Durham and York
def_par <- par(no.readonly = TRUE)
par(mfrow = c(2, 1), cex = 0.8)
plot(tr, main = "Toronto", lwd = 1, type = "l", ylab = "",
     col = "red")
plot(pl, main = "Peel", lwd = 1, type = "l", ylab = "",
     col = "green")

par(def_par)

def_par <- par(no.readonly = TRUE)  
par(mfrow = c(2, 1), cex = 0.8)
plot(dm, main = "Durham", lwd = 1, type = "l", ylab = "",
     col = "brown")
plot(yk, main = "York", lwd = 1, type = "l", ylab = "",
     col = "blue")

par(def_par)

Here is the matrix profile for all 4 regions:

mp_obj <- simple_fast(cbind(tr, pl, yk, dm), window_size = 6,
                      exclusion_zone = 1, verbose = 1)

plot(mp_obj$mp, main = "Matrix Profile - Toronto/Peel/York/Durham", lwd = 2,
     type = "l", ylab = "", col = "Blue")

How can we read these matrix profiles? To answer this question informally, let’s see what the MP would look like for two common time series? Sometimes a musician “borrows” section of other music to generate new music. The borrowed section is the common subsequence in both music and can be discovered using matrix profile:

baseurl <- "https://raw.githubusercontent.com/matrix-profile-foundation/mpf-datasets/102b8b0451cc9436879c025d21d5709cb4c8a601/"
dataset_queen <- read.csv(paste0(baseurl, "real/mfcc_queen.csv"),
                          header = FALSE)$V1
dataset_vanilla <- read.csv(paste0(baseurl, "real/mfcc_vanilla_ice.csv"),
                            header = FALSE)$V1
dataset_queen <- dataset_queen[1:length(dataset_vanilla)]
mp_obj <- simple_fast(cbind(dataset_queen, dataset_vanilla),
                      window_size = 300, exclusion_zone = 1, verbose = 0)
plot(dataset_queen, main = "Queen Song", lwd = 1, type = "l",
     ylab = "", xlim = c(0, 25000), col = "pink")

plot(dataset_vanilla, main = "Vanilla Ice Song", lwd = 1,
     type = "l", ylab = "", xlim = c(0, 25000), col = "red")

plot(mp_obj$mp, main = "Matrix Profile", lwd = 1,
     type = "l", ylab = "", xlim = c(0, 25000), col="orange")

This application is taken from the latest post by Francisco Bischoff: You can see that the lower parts of the Matrix Profile align well with that famous bassline from “Under Pressure” by Queen which was plagiarized by Vanilla Ice.

In our case, it is interesting to see that, when the transmission rate is rising in all regions, the level of dissimilarity in the transmission rate across the neighboring regions to Toronto is also increasing.

4.2 `stomp()`

After talking with Francisco Bischoff (the developer of the R package of tsmp) about simple_fast(), he asked me to use stomp(): I’m actually not entirely happy with simple_fast…later I’ll change the way it handles the time series. Making this post actually got me aware of the limitations. What you can try to do is to use STOMP and use c(region1, region2).

STOMP: Scalable Time series Ordered Matrix Profile, for Toronto and Peel:

# Matrix Profile
mp1 <- stomp(c(pl, tr), window_size = 10, exclusion_zone = 1, verbose = 0)
mp1$rmp <- NULL
mp1$lmp <- NULL
plot(mp1, col = "blue")

We can see that other than the very beginning of the period, there is no identifiable motif in this joint time series. We can think that the spatial transmission from one region to another would take time. Since MP uses sliding windows, it is a perfect tool to search for the existence of any pair of subsequences that are similar to each other in the whole period.

And following his recommendations, I use find_motif() play with the exclusion_zone and radius. First I try to use n_motifs = 1 and n_neigbors = 20. As he stated, This will find the best motif and its repetitions (use larger radius to find farther matches).

# Method2
data <- c(pl, tr)
mp2 <- tsmp(data, window_size = 6, verbose = 0)
# Default radius=3
mp2 <- find_motif(mp2, n_motifs = 1, n_neighbors = 20, radius = 50,
                  exclusion_zone = 1)
plot(mp2)

It is clear from this that the best (closest 1NN) motif is inside the Toronto time series. There is no shared motif between Torornto and Peel. However, we can force to find a joint motif in both series at the same location and see how similar they are:

# Method3
data2 <- cbind(pl, tr)
mp3 <- tsmp(data2, window_size = 6, mode = "mstomp", verbose = 0)
mp3 <- find_motif(mp3, n_motifs = 1, n_neighbors = 40, radius = 50, exclusion_zone = 1)
plot(mp3)

4.3 Smoothed Epi Curve with `stomp()`

n = 3
ctr <- c(0,cumsum(tr))
rsum <- (ctr[(n+1):length(ctr)] - ctr[1:(length(ctr) - n)]) / n
cpl <- c(0,cumsum(pl))
rsum2 <- (cpl[(n+1):length(cpl)] - cpl[1:(length(cpl) - n)]) / n

# Method2
data6 <- c(rsum2, rsum)
mp6 <- tsmp(data6, window_size = 6, verbose = 0)
# Default radius=3
mp6 <- find_motif(mp6, n_motifs = 1, n_neighbors = 20, radius = 50,
                  exclusion_zone = 1)
plot(mp6)

4.4 After de-trending & Smoothing

# Toronto
temp <- as.data.frame(tr)
rate <- temp %>%
   dplyr::mutate(Rate_percent = (tr-lag(tr))/tr * 100)
grtr <- rate[2:146,2]
# Peel
temp <- as.data.frame(pl)
rate <- temp %>%
   dplyr::mutate(Rate_percent = (pl-lag(pl))/pl * 100)
grpl <- rate[2:146,2]

#smoothing
n = 3
ctr <- c(0,cumsum(grtr))
rsum <- (ctr[(n+1):length(ctr)] - ctr[1:(length(ctr) - n)]) / n
cpl <- c(0,cumsum(grpl))
rsum2 <- (cpl[(n+1):length(cpl)] - cpl[1:(length(cpl) - n)]) / n

# Matrix Profile without smoothing
mp <- stomp(c(grpl, grtr), window_size = 10, exclusion_zone = 1, verbose = 0)
mp$rmp <- NULL
mp$lmp <- NULL
plot(mp, col = "blue")

# Matrix Profile with smoothing
mp <- stomp(c(rsum2, rsum), window_size = 10, exclusion_zone = 1, verbose = 0)
mp$rmp <- NULL
mp$lmp <- NULL
plot(mp, col = "red")

# Without smoothing
data <- c(grpl, grtr)
mp <- tsmp(data, window_size = 10, verbose = 0)
# Default radius=3
mp <- find_motif(mp, n_motifs = 1, n_neighbors = 20, radius = 50,
                  exclusion_zone = 1)
plot(mp)

# With smoothing
data <- c(rsum2, rsum)
mp <- tsmp(data, window_size = 10, verbose = 0)
# Default radius=3
mp <- find_motif(mp, n_motifs = 1, n_neighbors = 20, radius = 50,
                  exclusion_zone = 1)
plot(mp)

5. Common Motifs in Distant Regions

Windsor-Essex. Let’s see how the matrix profile is for these two distant regions?

# Windsor
wr <- unlist(data1[data1$CD == "Essex", "cases"])

# Graph Toronto/Peel
def_par <- par(no.readonly = TRUE)  # always save the previous configuration
par(mfrow = c(2, 1), cex = 0.8)
plot(tr, main = "Toronto", lwd = 1, type = "l", ylab = "",
     col = "red")
plot(wr, main = "Essex", lwd = 1, type = "l", ylab = "",
     col = "blue")

par(def_par)

#MP
mp4 <- stomp(c(wr, tr), window_size = 10, exclusion_zone = 1, verbose = 0)
mp4$rmp <- NULL
mp4$lmp <- NULL
plot(mp4, col = "darkgreen")

#motif
data5 <- c(wr, tr)
mp5 <- tsmp(data5, window_size = 6, verbose = 0)
# Default radius=3
mp5 <- find_motif(mp5, n_motifs = 3, n_neighbors = 20, radius = 50,
                  exclusion_zone = 1)
plot(mp5)

This time, we find two motifs that are common in both regions. In fact, Motif1 is almost identical in bith regions.

6. Semantic Segmentation in Epi Curve

Sometime the system we are monitoring changes regimes, can we detect such changes? Here is the intuition in a nutshell:

Recall that the Matrix Profile Index has pointers (arrows, arcs) that point to the nearest neighbor of each subsequence.
If we have a change of behavior, say from walk to run, we should expect very few arrows to span that change of behavior: Most walk subsequences will point to another walk subsequence and most run subsequences will point to another run subsequence.
Rarely, if ever, will a run point to a walk.
So, if we slide across the Matrix Profile Index and count how many arrows cross each particular point, we expect to find few that span the change of behavior.

Here is the excerpt from Matrix Profile VIII: Domain Agnostic Online Semantic Segmentation at Superhuman Performance Levels:

Change point detection is a method for detecting various changes in the statistical properties of time series, such as the mean, variance, or spectral density. (. . .) In contrast to change point detection, we are interested in regimens that are defined by changes in the shapes of the time series subsequences, which can change without any obvious effect on the statistical properties.
Another interpretation of segmentation refers to Piecewise Linear Approximation (PLA). The goal is to approximate a time series T with a more compact representation by fitting k piecewise polynomials using linear interpolation or linear regression, while minimizing the error with respect to the original T. Success here is measured in terms of root-mean-squared-error, and it does not indicate any semantic meaning of the solution.

6.1 FLUSS - Toronto

FLUSS (Fast Low-cost Unipotent Semantic Segmentation): The task of FLUSS is to produce a companion time series called the Arc Curve (AC), which annotates the raw time series with information about the likelihood of a regime change at each location. We also need to provide an algorithm to examine this Arc Curve and decide how many (if any) regimes exist. We consider that issue separately in Section III-B.

tr_obj <- tsmp(tr, window_size = 6, exclusion_zone = 1, verbose = 0)
sgtr <- fluss(tr_obj, num_segments = 2, exclusion_zone = 1)
sgtr

## Matrix Profile
## --------------
## Profile size = 144 
## Window size = 6 
## Exclusion zone = 6 
## Contains 1 set of data with 149 observations and 1 dimension 
## 
## Arc Count
## ---------
## Profile size = 144 
## Minimum normalized count = 0.63 at index 59 
## 
## Fluss
## -----
## Segments = 2 
## Segmentation indexes = 59 42

plot(sgtr)

6.2 FLUSS - Peel

Are they similar?

pl_obj <- tsmp(pl, window_size = 6, exclusion_zone = 1, verbose = 0)
sgpl <- fluss(pl_obj, num_segments = 2, exclusion_zone = 1)
sgpl

## Matrix Profile
## --------------
## Profile size = 144 
## Window size = 6 
## Exclusion zone = 6 
## Contains 1 set of data with 149 observations and 1 dimension 
## 
## Arc Count
## ---------
## Profile size = 144 
## Minimum normalized count = 0.77 at index 135 
## 
## Fluss
## -----
## Segments = 2 
## Segmentation indexes = 135 25

plot(sgpl)

The segments in transmission (“regime” changes) are also different in Peel and Toronto.

7. Snippets

Time Series snippets try to solve mainly the common problem of summarization “show me some representative/typical data”. The main difference between motifs/discords and snippets are explained in A Novel Time Series Distance Measure to Allow Data Mining in More Challenging Scenarios. In summary, snippets are representative patterns present in the data, not rare events like discords, neither (almost) perfectly similar as motifs. Remember, we try the compare two times series data: epi curves of Toronto and Peel by trying to find “common” motifs. As explained above, these motifs are found by euclidean distance. Now, snippets use Matrix Profile distances.

Let’s start “summarizing” the epi curves for Toronto and Peel before and after smooting:

Toronto

mp_obj <- find_snippet(tr, 8, 2)
mp_obj %T>% plot

## 
## Snippet
## -------
## Snippets found = 2 
## Snippets indexes = 17 73 
## Snippets fractions = 62.76% 37.24% 
## Snippet size = 8

#smoothing
n = 3
ctr <- c(0,cumsum(tr))
rsum <- (ctr[(n+1):length(ctr)] - ctr[1:(length(ctr) - n)]) / n
mp_obj <- find_snippet(rsum, 8, 2)
mp_obj %T>% plot

## 
## Snippet
## -------
## Snippets found = 2 
## Snippets indexes = 17 121 
## Snippets fractions = 60.00% 40.00% 
## Snippet size = 8

Peel

mp_obj <- find_snippet(pl, 8, 2)
mp_obj %T>% plot

## 
## Snippet
## -------
## Snippets found = 2 
## Snippets indexes = 105 41 
## Snippets fractions = 61.38% 38.62% 
## Snippet size = 8

#smoothing
n = 3
cpl <- c(0,cumsum(pl))
rsum <- (cpl[(n+1):length(cpl)] - cpl[1:(length(cpl) - n)]) / n
mp_obj <- find_snippet(rsum, 8, 2)
mp_obj %T>% plot

## 
## Snippet
## -------
## Snippets found = 2 
## Snippets indexes = 113 1 
## Snippets fractions = 49.66% 50.34% 
## Snippet size = 8

It seems that none of the snippets are similar in both epi curves.

Parametric estimations!

A work by Yigit Aydede

yigit.aydede@smu.ca

EDA with Matrix Profile

Covid-19 Epidemic Curves in Toronto and neighboring regions

2020-09-29

1. Time Series Patterns & Epi Curve

2. Matrix Profile Analysis

3. Motifs & Discords - Toronto

3.1 Level data (Epi Curve)

3.2 Smoothed Epi Curve

4. Common Motifs in Neighboring Regions

4.1 `simple_fast()`

4.2 `stomp()`

4.3 Smoothed Epi Curve with `stomp()`

5. Common Motifs in Distant Regions

6. Semantic Segmentation in Epi Curve

6.1 FLUSS - Toronto

6.2 FLUSS - Peel

7. Snippets

Toronto

Peel

Next

EDA with Matrix Profile

Covid-19 Epidemic Curves in Toronto and neighboring regions

2020-09-29

1. Time Series Patterns & Epi Curve

2. Matrix Profile Analysis

3. Motifs & Discords - Toronto

3.1 Level data (Epi Curve)

3.2 Smoothed Epi Curve

4. Common Motifs in Neighboring Regions

4.1 simple_fast()

4.2 stomp()

4.3 Smoothed Epi Curve with stomp()

4.4 After de-trending & Smoothing

5. Common Motifs in Distant Regions

6. Semantic Segmentation in Epi Curve

6.1 FLUSS - Toronto

6.2 FLUSS - Peel

7. Snippets

Toronto

Peel

Next

4.1 `simple_fast()`

4.2 `stomp()`

4.3 Smoothed Epi Curve with `stomp()`