This rmd file shows several exploratory data experiments with matrix profile analysis.
NOTE: Please click on the Code
button to see the codes.
Here are some of the questions we ask in this document:
Let’s download the data and load the packages. (This rmd file uses several packages. If you want to replicate the same results, please install them first.)
setwd("~/Dropbox/RNS2")
load("~/Dropbox/RNS2/data_v3.RData")
data1 <- data_v3
library(tidyverse)
library(tidyquant)
library(recipes)
library(rsample)
library(knitr)
library(maps)
# Data Cleaning
library(janitor)
# EDA
library(skimr)
library(DataExplorer)
library(DT)
library(corrplot)
# ggplot2 Helpers
library(scales)
theme_set(theme_tq())
library(raster)
library(spdplyr)
library(dplyr)
library(RColorBrewer)
# Missing data
library(mice)
library(VIM)
library(lattice)
# MPA
library(tsmp)
library(pracma)
We are ready! An epidemic curve (or epi curve) is a histogram (bar chart) that shows the distribution of cases over time. The time intervals are displayed on the x axis (the horizontal axis), and case counts are displayed on the y axis (the vertical axis). However, we will here use a line instead of a histogram to take advantage of MPA.
Repeated Pattern (Motifs): The subsequences having very high similarity to each other. Given a length, the most similar/least distant pair of non-overlapping subsequences.
Anomalous Pattern (Discords): The subsequence with maximal distance to its nearest neighbor. The highest values correspond to the time series discords
The size of an epidemic curve can help validate whether we are in an outbreak by showing an increased number of cases over baseline visually. The overall shape of the epi curve may help identify the mode of transmission of the outbreak. There are several well-described types of epi curves: point source, continuous common source, propagated, and intermittent. Here how Outbreak Toolkit describes them:
If we connect the tip of the boxes we would have a better idea about their patterns as a curve.
In time series analysis, one is typically interested in two things; anomalies and trends. One method to find anomalies and trends within a time series is to perform a similarity join. Essentially, you compare snippets of the time series against itself by computing the distance between each pair of snippets. Although it would be easy with a naive algorithm that runs over loops, the Matrix Profile (MP) algorithms would be better and quicker.
The Matrix Profile is a relatively new, introduced in 2016, data structure for time series analysis developed by Eamonn Keogh at the University of California Riverside and Abdullah Mueen at the University of New Mexico. Some of the advantages of using the Matrix Profile is that it is domain agnostic, fast, provides an exact solution (approximate when desired) and only requires a single parameter. See Matrix Profile Foundation for more details and sources.
We can create a companion “time series”, called a Matrix Profile or MP.
Here is the intuition:
(Figures are taken from http://www.cs.ucr.edu/~eamonn/MatrixProfile.html)
The distance to the corresponding nearest neighbor of each subsequence can be stored in a vector called matrix profile.
The following applications use the tsmp
package. Fore more details: https://indico.in2p3.fr/event/13934/contributions/15841/attachments/13238/16227/Mueen_presentation_TIMECLEAN_workshop.pdf
The following example shows how we can identify if any given subsequence can be part of another time series by calculating distance profiles.
# Toronto cases
tr <- unlist(data1[data1$CD == "Toronto", "cases"])
plot(tr, main = "Toronto", type = "l", ylab = "", xlab = "time",
col = "darkgreen", lwd = 2)
# Query
query <- tr[71:85] # pretend you didn't see this line.
# Distance Profile by tsmp
res <- dist_profile(tr, query)
str(res)
## List of 3
## $ distance_profile: num [1:135] 40.4 41.5 42 40.2 36.7 ...
## $ last_product : num [1:135] 44881 52970 63206 72037 81388 ...
## $ par :List of 9
## ..$ window_size: int 15
## ..$ data_fft : cplx [1:256] 14079+0i 64-10705i -4466-1347i ...
## ..$ data_size : int 149
## ..$ data_mean : num [1:135] 21 24.9 29.7 33.5 37.3 ...
## ..$ data_sd : num [1:135] 16 18.5 22.2 23.3 23.1 ...
## ..$ query_mean : num 149
## ..$ query_sd : num 24.3
## ..$ data : num [1:149] 5 6 10 6 7 10 8 10 18 28 ...
## ..$ k : NULL
What this vector represents is the euclidean distance of our query (the pattern that we want to look for) related to the tr
. This means that the minimum value contains the place that is most similar to our query:
first <- which.min(res$distance_profile)
first
## [1] 71
Yes! There is a match! But there are more similar patterns to this query?
# this will order the results and give us the position of them
st <- sort(res$distance_profile, index.return = TRUE)
w <- 30
plot(tr, main = "Toronto", type = "l", ylab = "", xlab = "time",
col = "dark green")
for (i in 1:3) {
lines(st$ix[i]:(st$ix[i] + w - 1), tr[st$ix[i]:(st$ix[i] + w - 1)], col = i + 1, lwd = 2)
text(st$ix[i], -2, label = paste("match", i), adj = 0, col = i + 1, cex = 0.9)
}
Here, the red “match1” is our query. Let’s smooth the data and see if it will still match?
trdf <- as.data.frame(tr)
trsmth <- trdf %>%
mutate("3dma" = (tr+
lag(tr,1)+
lag(tr,2))/3)
plot(trsmth[,2], type = "l", col = "green", lwd = 2)
res2 <- dist_profile(trsmth[3:149,2], query)
which.min(res2$distance_profile)
## [1] 70
st <- sort(res2$distance_profile, index.return = TRUE)
w <- 30
plot(trsmth[,2], main = "Toronto", type = "l", ylab = "", xlab = "time",
col = "dark green")
lines(st$ix[1]:(st$ix[1] + w - 1), tr[st$ix[1]:(st$ix[1] + w - 1)], col = 1 + 1, lwd = 2)
text(st$ix[1], -2, label = paste("match", 1), adj = 0, col = 1 + 1, cex = 0.9)
Are there any repeated patterns in Toronto cases?
tr_obj <- tsmp(tr, window_size = 10, exclusion_zone = 1, verbose = 0)
str(tr_obj, 1)
## List of 9
## $ mp : num [1:140, 1] 1.21 1.05 1.01 1.36 1.16 ...
## $ pi : num [1:140, 1] 23 24 25 25 37 37 35 36 36 38 ...
## $ rmp : num [1:140, 1] 1.21 1.05 1.01 1.36 1.16 ...
## $ rpi : num [1:140, 1] 23 24 25 25 37 37 35 36 36 38 ...
## $ lmp : num [1:140, 1] Inf Inf Inf Inf Inf ...
## $ lpi : num [1:140, 1] -Inf -Inf -Inf -Inf -Inf ...
## $ w : num 10
## $ ez : num 1
## $ data:List of 1
## - attr(*, "class")= chr "MatrixProfile"
## - attr(*, "join")= logi FALSE
## - attr(*, "origin")=List of 8
Unlike a distance profile, we now have a matrix profile that is the minimum value of all distance profiles of all possible rolling window in this data. Together, we have the profile index that points to the most similar pattern in this data.
tr_obj$rmp <- NULL # we don't need that
tr_obj$lmp <- NULL # we don't need that
plot(tr_obj, data = TRUE, col = "red")
The corresponding subsequence in the raw data at the location where the distance is very low must have at least one similar subsequence somewhere. We have almost 4 points where the distance is closed to zero.
The Matrix Profile gets lower values if pattern repeats, and higher values when a different pattern arises.
For example, in genetics, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance.
compute(tr, 10) %>% motifs() %>% visualize()
However, one immediate problem is that the choice of subsequence length will change both the number and location of motifs! We set the window size to 10 before.
When we don’t specify any information regarding our subsequence length, we can use the pan-matrix profile (or PMP - see http://bigdataieee.org/BigData2017/files/Tutorial4.pdf more details) to generate insights that will help us evaluate different subsequence lengths. In the first graph, it is a global calculation of all possible subsequence lengths condensed into a single visual summary. The X-axis is the index of the matrix profile, and the Y-axis is the corresponding subsequence length. The darker the shade, the lower the Euclidean distance at that point. We can use the “peaks” of those shades to find the “big” motifs visually present our time series.
The graphs after PMP are created by analyze
showing the top three motifs and top three discords, along with the corresponding window size and position within the Matrix Profile (and, by extension, the time series data).
analyze(tr)
Hence, given PMP, we set w = 6.
analyze(tr, 6)
See Francisco Bishoff] for more details on tsmp
. This link would also help.
Dashed lines shows similar motifs (discords), and solid lines (bold and thin) show the postilion of each type of motifs (discords). Up to this point, we see that Covid-19 cases in Toronto discords are more prevalent than motifs.
n = 3
ctr <- c(0,cumsum(tr))
rsum <- (ctr[(n+1):length(ctr)] - ctr[1:(length(ctr) - n)]) / n
plot(rsum, type = "l", col = "orange", lwd = 2)
analyze(rsum)
analyze(rsum, 10)
This is also called Common Subsequence Mining with matrix profile. Given two (or more) time series collected under different conditions or treatments, a data analyst may wish to know what patterns (if any) are conserved between the two time series. We can use MP to discover common subsequences in two time series.
Given two time series \(T_A\) and \(T_B\), matrix profile can be computed by finding the nearest neighbor of \(T_A\)’s subsequences in \(T_B\) or vise verse. We can think of the MP as a type of similarity self-join. For every subsequence in \(T_A\), we join it with its nearest (non- trivial) neighbor in \(T_A\), or \(J_{T_AT_A}\) or \(T_{A} \bowtie\) 1nn \(T_{A}\). We can generalize it to an AB-similarity join. For every subsequence in A, we join can it with its nearest neighbor in B, or \(J_{T_AT_B}\) or \(T_{A} \bowtie\) 1nn \(T_{B}\).
Is there any pattern that is common to COVID19 cases across regions neighboring to Toronto? We will look at Toronto, York, Peel, and Durham.
cases <- aggregate(data1[, 5], list(data1$CD), function(x) sum(x))
cases
## Group.1 cases
## 1 Brant 149
## 2 Chatham-Kent 225
## 3 Cochrane 70
## 4 Durham 1793
## 5 Elgin 105
## 6 Essex 2210
## 7 Frontenac 109
## 8 Grey 113
## 9 Haldimand-Norfolk 437
## 10 Halton 824
## 11 Hamilton 878
## 12 Lambton 295
## 13 Leeds and Grenville 355
## 14 Middlesex 663
## 15 Niagara 826
## 16 Northumberland 213
## 17 Ottawa 2438
## 18 Peel 6495
## 19 Perth 66
## 20 Peterborough 94
## 21 Simcoe 657
## 22 Stormont, Dundas and Glengarry 175
## 23 Sudbury 71
## 24 Thunder Bay 92
## 25 Toronto 14079
## 26 Waterloo 1364
## 27 Wellington 525
## 28 York 3222
simple_fast()
We will first use SiMPle - Similarity Matrix Profile (see: Fast Similarity Matrix Profile for Music Analysis and Exploration)
# Peel is the 2nd worst in total cases with other neighboring regions
pl <- unlist(data1[data1$CD == "Peel", "cases"])
yk <- unlist(data1[data1$CD == "York", "cases"])
dm <- unlist(data1[data1$CD == "Durham", "cases"])
# Graph cases for Toronto, Peel, Durham and York
def_par <- par(no.readonly = TRUE)
par(mfrow = c(2, 1), cex = 0.8)
plot(tr, main = "Toronto", lwd = 1, type = "l", ylab = "",
col = "red")
plot(pl, main = "Peel", lwd = 1, type = "l", ylab = "",
col = "green")
par(def_par)
def_par <- par(no.readonly = TRUE)
par(mfrow = c(2, 1), cex = 0.8)
plot(dm, main = "Durham", lwd = 1, type = "l", ylab = "",
col = "brown")
plot(yk, main = "York", lwd = 1, type = "l", ylab = "",
col = "blue")
par(def_par)
Here is the matrix profile for all 4 regions:
mp_obj <- simple_fast(cbind(tr, pl, yk, dm), window_size = 6,
exclusion_zone = 1, verbose = 1)
plot(mp_obj$mp, main = "Matrix Profile - Toronto/Peel/York/Durham", lwd = 2,
type = "l", ylab = "", col = "Blue")
How can we read these matrix profiles? To answer this question informally, let’s see what the MP would look like for two common time series? Sometimes a musician “borrows” section of other music to generate new music. The borrowed section is the common subsequence in both music and can be discovered using matrix profile:
baseurl <- "https://raw.githubusercontent.com/matrix-profile-foundation/mpf-datasets/102b8b0451cc9436879c025d21d5709cb4c8a601/"
dataset_queen <- read.csv(paste0(baseurl, "real/mfcc_queen.csv"),
header = FALSE)$V1
dataset_vanilla <- read.csv(paste0(baseurl, "real/mfcc_vanilla_ice.csv"),
header = FALSE)$V1
dataset_queen <- dataset_queen[1:length(dataset_vanilla)]
mp_obj <- simple_fast(cbind(dataset_queen, dataset_vanilla),
window_size = 300, exclusion_zone = 1, verbose = 0)
plot(dataset_queen, main = "Queen Song", lwd = 1, type = "l",
ylab = "", xlim = c(0, 25000), col = "pink")
plot(dataset_vanilla, main = "Vanilla Ice Song", lwd = 1,
type = "l", ylab = "", xlim = c(0, 25000), col = "red")
plot(mp_obj$mp, main = "Matrix Profile", lwd = 1,
type = "l", ylab = "", xlim = c(0, 25000), col="orange")
This application is taken from the latest post by Francisco Bischoff: You can see that the lower parts of the Matrix Profile align well with that famous bassline from “Under Pressure” by Queen which was plagiarized by Vanilla Ice.
In our case, it is interesting to see that, when the transmission rate is rising in all regions, the level of dissimilarity in the transmission rate across the neighboring regions to Toronto is also increasing.
stomp()
After talking with Francisco Bischoff (the developer of the R package of tsmp
) about simple_fast()
, he asked me to use stomp()
: I’m actually not entirely happy with simple_fast…later I’ll change the way it handles the time series. Making this post actually got me aware of the limitations. What you can try to do is to use STOMP and use c(region1, region2).
STOMP: Scalable Time series Ordered Matrix Profile, for Toronto and Peel:
# Matrix Profile
mp1 <- stomp(c(pl, tr), window_size = 10, exclusion_zone = 1, verbose = 0)
mp1$rmp <- NULL
mp1$lmp <- NULL
plot(mp1, col = "blue")
We can see that other than the very beginning of the period, there is no identifiable motif in this joint time series. We can think that the spatial transmission from one region to another would take time. Since MP uses sliding windows, it is a perfect tool to search for the existence of any pair of subsequences that are similar to each other in the whole period.
And following his recommendations, I use find_motif()
play with the exclusion_zone
and radius
. First I try to use n_motifs = 1
and n_neigbors = 20
. As he stated, This will find the best motif and its repetitions (use larger radius to find farther matches).
# Method2
data <- c(pl, tr)
mp2 <- tsmp(data, window_size = 6, verbose = 0)
# Default radius=3
mp2 <- find_motif(mp2, n_motifs = 1, n_neighbors = 20, radius = 50,
exclusion_zone = 1)
plot(mp2)
It is clear from this that the best (closest 1NN) motif is inside the Toronto time series. There is no shared motif between Torornto and Peel. However, we can force to find a joint motif in both series at the same location and see how similar they are:
# Method3
data2 <- cbind(pl, tr)
mp3 <- tsmp(data2, window_size = 6, mode = "mstomp", verbose = 0)
mp3 <- find_motif(mp3, n_motifs = 1, n_neighbors = 40, radius = 50, exclusion_zone = 1)
plot(mp3)
stomp()
n = 3
ctr <- c(0,cumsum(tr))
rsum <- (ctr[(n+1):length(ctr)] - ctr[1:(length(ctr) - n)]) / n
cpl <- c(0,cumsum(pl))
rsum2 <- (cpl[(n+1):length(cpl)] - cpl[1:(length(cpl) - n)]) / n
# Method2
data6 <- c(rsum2, rsum)
mp6 <- tsmp(data6, window_size = 6, verbose = 0)
# Default radius=3
mp6 <- find_motif(mp6, n_motifs = 1, n_neighbors = 20, radius = 50,
exclusion_zone = 1)
plot(mp6)
# Toronto
temp <- as.data.frame(tr)
rate <- temp %>%
dplyr::mutate(Rate_percent = (tr-lag(tr))/tr * 100)
grtr <- rate[2:146,2]
# Peel
temp <- as.data.frame(pl)
rate <- temp %>%
dplyr::mutate(Rate_percent = (pl-lag(pl))/pl * 100)
grpl <- rate[2:146,2]
#smoothing
n = 3
ctr <- c(0,cumsum(grtr))
rsum <- (ctr[(n+1):length(ctr)] - ctr[1:(length(ctr) - n)]) / n
cpl <- c(0,cumsum(grpl))
rsum2 <- (cpl[(n+1):length(cpl)] - cpl[1:(length(cpl) - n)]) / n
# Matrix Profile without smoothing
mp <- stomp(c(grpl, grtr), window_size = 10, exclusion_zone = 1, verbose = 0)
mp$rmp <- NULL
mp$lmp <- NULL
plot(mp, col = "blue")
# Matrix Profile with smoothing
mp <- stomp(c(rsum2, rsum), window_size = 10, exclusion_zone = 1, verbose = 0)
mp$rmp <- NULL
mp$lmp <- NULL
plot(mp, col = "red")
# Without smoothing
data <- c(grpl, grtr)
mp <- tsmp(data, window_size = 10, verbose = 0)
# Default radius=3
mp <- find_motif(mp, n_motifs = 1, n_neighbors = 20, radius = 50,
exclusion_zone = 1)
plot(mp)
# With smoothing
data <- c(rsum2, rsum)
mp <- tsmp(data, window_size = 10, verbose = 0)
# Default radius=3
mp <- find_motif(mp, n_motifs = 1, n_neighbors = 20, radius = 50,
exclusion_zone = 1)
plot(mp)
Windsor-Essex. Let’s see how the matrix profile is for these two distant regions?
# Windsor
wr <- unlist(data1[data1$CD == "Essex", "cases"])
# Graph Toronto/Peel
def_par <- par(no.readonly = TRUE) # always save the previous configuration
par(mfrow = c(2, 1), cex = 0.8)
plot(tr, main = "Toronto", lwd = 1, type = "l", ylab = "",
col = "red")
plot(wr, main = "Essex", lwd = 1, type = "l", ylab = "",
col = "blue")
par(def_par)
#MP
mp4 <- stomp(c(wr, tr), window_size = 10, exclusion_zone = 1, verbose = 0)
mp4$rmp <- NULL
mp4$lmp <- NULL
plot(mp4, col = "darkgreen")
#motif
data5 <- c(wr, tr)
mp5 <- tsmp(data5, window_size = 6, verbose = 0)
# Default radius=3
mp5 <- find_motif(mp5, n_motifs = 3, n_neighbors = 20, radius = 50,
exclusion_zone = 1)
plot(mp5)
This time, we find two motifs that are common in both regions. In fact, Motif1 is almost identical in bith regions.
Sometime the system we are monitoring changes regimes, can we detect such changes? Here is the intuition in a nutshell:
Here is the excerpt from Matrix Profile VIII: Domain Agnostic Online Semantic Segmentation at Superhuman Performance Levels:
FLUSS (Fast Low-cost Unipotent Semantic Segmentation): The task of FLUSS is to produce a companion time series called the Arc Curve (AC), which annotates the raw time series with information about the likelihood of a regime change at each location. We also need to provide an algorithm to examine this Arc Curve and decide how many (if any) regimes exist. We consider that issue separately in Section III-B.
tr_obj <- tsmp(tr, window_size = 6, exclusion_zone = 1, verbose = 0)
sgtr <- fluss(tr_obj, num_segments = 2, exclusion_zone = 1)
sgtr
## Matrix Profile
## --------------
## Profile size = 144
## Window size = 6
## Exclusion zone = 6
## Contains 1 set of data with 149 observations and 1 dimension
##
## Arc Count
## ---------
## Profile size = 144
## Minimum normalized count = 0.63 at index 59
##
## Fluss
## -----
## Segments = 2
## Segmentation indexes = 59 42
plot(sgtr)
Are they similar?
pl_obj <- tsmp(pl, window_size = 6, exclusion_zone = 1, verbose = 0)
sgpl <- fluss(pl_obj, num_segments = 2, exclusion_zone = 1)
sgpl
## Matrix Profile
## --------------
## Profile size = 144
## Window size = 6
## Exclusion zone = 6
## Contains 1 set of data with 149 observations and 1 dimension
##
## Arc Count
## ---------
## Profile size = 144
## Minimum normalized count = 0.77 at index 135
##
## Fluss
## -----
## Segments = 2
## Segmentation indexes = 135 25
plot(sgpl)
The segments in transmission (“regime” changes) are also different in Peel and Toronto.
Time Series snippets try to solve mainly the common problem of summarization “show me some representative/typical data”. The main difference between motifs/discords and snippets are explained in A Novel Time Series Distance Measure to Allow Data Mining in More Challenging Scenarios. In summary, snippets are representative patterns present in the data, not rare events like discords, neither (almost) perfectly similar as motifs. Remember, we try the compare two times series data: epi curves of Toronto and Peel by trying to find “common” motifs. As explained above, these motifs are found by euclidean distance. Now, snippets use Matrix Profile distances.
Let’s start “summarizing” the epi curves for Toronto and Peel before and after smooting:
mp_obj <- find_snippet(tr, 8, 2)
mp_obj %T>% plot
##
## Snippet
## -------
## Snippets found = 2
## Snippets indexes = 17 73
## Snippets fractions = 62.76% 37.24%
## Snippet size = 8
#smoothing
n = 3
ctr <- c(0,cumsum(tr))
rsum <- (ctr[(n+1):length(ctr)] - ctr[1:(length(ctr) - n)]) / n
mp_obj <- find_snippet(rsum, 8, 2)
mp_obj %T>% plot
##
## Snippet
## -------
## Snippets found = 2
## Snippets indexes = 17 121
## Snippets fractions = 60.00% 40.00%
## Snippet size = 8
mp_obj <- find_snippet(pl, 8, 2)
mp_obj %T>% plot
##
## Snippet
## -------
## Snippets found = 2
## Snippets indexes = 105 41
## Snippets fractions = 61.38% 38.62%
## Snippet size = 8
#smoothing
n = 3
cpl <- c(0,cumsum(pl))
rsum <- (cpl[(n+1):length(cpl)] - cpl[1:(length(cpl) - n)]) / n
mp_obj <- find_snippet(rsum, 8, 2)
mp_obj %T>% plot
##
## Snippet
## -------
## Snippets found = 2
## Snippets indexes = 113 1
## Snippets fractions = 49.66% 50.34%
## Snippet size = 8
It seems that none of the snippets are similar in both epi curves.