Temporal Analysis and Visualization: Leveraging Time Series Capabilities in HVT (Hierarchical Voronoi Tessellation)

Zubin Dowlaty, Chepuri Gopi Krishna, Siddharth Shorya, Pon Anureka Seenivasan, Vishwavani

Created Date: 2023-10-26
Modified Date: 2026-01-22

1. Background

The HVT package offers a suite of R functions designed to construct topology-preserving maps for in-depth analysis of multivariate data. It is particularly well suited for datasets with numerous records. The package organizes the typical workflow into several key stages:

  1. Data Compression: Long datasets are compressed using Hierarchical Vector Quantization (HVQ) to achieve the desired level of data reduction.

  2. Data Projection: Compressed cells are projected into one and two dimensions using dimensionality reduction algorithms, producing embeddings that preserve the original topology. This allows for intuitive visualization of complex data structures.

  3. Tessellation: Voronoi tessellation partitions the projected space into distinct cells, supporting hierarchical visualizations. Heatmaps and interactive plots facilitate exploration and insights into the underlying data patterns.

  4. Scoring: A test dataset is evaluated against previously generated maps, enabling placement of its points within the existing structure. Sequential scoring across multiple maps is supported if required.

What’s New?

Below are the new functions and their brief descriptions:

2. Experimental setup

The Lorenz attractor is a three-dimensional figure generated by a set of differential equations that model a simple chaotic dynamical system of convective flow. It arises from a simplified set of equations describing a system of three variables, which represent the state of the system at any given time and are typically denoted (x, y, z). The equations are as follows:

\[ \frac{dx}{dt} = \sigma (y - x) \] \[ \frac{dy}{dt} = x (r - z) - y \] \[ \frac{dz}{dt} = x y - \beta z \]

where dx/dt, dy/dt, and dz/dt represent the rates of change of x, y, and z over time (t). σ, r, and β are constant parameters of the system: σ (= 10) controls the rate of convection, r (= 28) controls the difference in temperature between the convective and stable regions, and β (= 8/3) represents the ratio of the width to the height of the convective layer. When these equations are plotted in three-dimensional space, they produce a chaotic trajectory that never repeats. The Lorenz attractor exhibits sensitive dependence on initial conditions: even small differences in the initial state lead to drastically different trajectories over time, a defining characteristic of chaotic systems.
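As a quick illustration, the trajectory can be integrated numerically. The sketch below uses a simple forward-Euler step in base R with the parameter values above (σ = 10, r = 28, β = 8/3) and the initial state (0, 1, 20); the step size `dt` and number of steps are illustrative choices, not taken from the code that generated the dataset.

```r
# Forward-Euler integration of the Lorenz system (illustrative sketch;
# dt and n_steps are arbitrary choices, not the dataset's generation settings)
lorenz_euler <- function(n_steps = 10000, dt = 0.00025,
                         sigma = 10, r = 28, beta = 8 / 3,
                         init = c(x = 0, y = 1, z = 20)) {
  traj <- matrix(NA_real_, nrow = n_steps, ncol = 3,
                 dimnames = list(NULL, c("X", "Y", "Z")))
  s <- init
  for (i in seq_len(n_steps)) {
    traj[i, ] <- s
    dx <- sigma * (s[2] - s[1])
    dy <- s[1] * (r - s[3]) - s[2]
    dz <- s[1] * s[2] - beta * s[3]
    s <- s + dt * c(dx, dy, dz)
  }
  as.data.frame(traj)
}

traj <- lorenz_euler()
head(traj, 3)
```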

In this notebook, we will use the Lorenz Attractor Dataset, which contains 200,000 observations and 5 columns. The dataset can be downloaded from here.

The dataset includes the following columns:

3. Notebook Requirements

This chunk checks whether all packages needed to run this vignette are installed, installs any that are missing, and attaches them all to the session environment.

# Packages required by this vignette
list.of.packages <- c("dplyr", "kableExtra", "plotly", "purrr", "data.table", "gridExtra", "grid", "reactable",
                    "reshape", "tidyr", "stringr", "DT", "knitr", "feather", "gganimate", "gifski", "HVT")

# Install any packages not yet present, then attach everything
new.packages <-
  list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages))
  install.packages(new.packages, dependencies = TRUE, verbose = FALSE, repos = 'https://cloud.r-project.org/')
invisible(lapply(list.of.packages, library, character.only = TRUE))

4. Data Understanding

Here, we load the data. Let’s explore the Lorenz Attractor Dataset. For the sake of brevity, we are displaying only the first ten rows.

file_path <- "./sample_dataset/lorenz_attractor.feather"
dataset <- read_feather(file_path) %>% as.data.frame()
dataset <- dataset %>% select(X, Y, Z, U, t)
dataset$t <- round(dataset$t, 5)
displayTable(head(dataset, 10))
| X      | Y      | Z       | U      | t      |
|--------|--------|---------|--------|--------|
| 0.0000 | 1.0000 | 20.0000 | 0.0000 | 0.0000 |
| 0.0025 | 0.9998 | 19.9867 | 0.0005 | 0.0003 |
| 0.0050 | 0.9995 | 19.9734 | 0.0010 | 0.0005 |
| 0.0075 | 0.9993 | 19.9601 | 0.0015 | 0.0008 |
| 0.0099 | 0.9990 | 19.9468 | 0.0020 | 0.0010 |
| 0.0124 | 0.9988 | 19.9335 | 0.0025 | 0.0013 |
| 0.0149 | 0.9986 | 19.9202 | 0.0030 | 0.0015 |
| 0.0173 | 0.9984 | 19.9069 | 0.0035 | 0.0018 |
| 0.0198 | 0.9982 | 19.8937 | 0.0040 | 0.0020 |
| 0.0222 | 0.9980 | 19.8804 | 0.0045 | 0.0022 |

Now, let’s try to visualize the Lorenz attractor (overlapping spirals) in 3D Space.

data_3d <- dataset[sample(1:nrow(dataset), 1000), ]  # random sample of 1,000 points
plot_ly(data_3d, x = ~X, y = ~Y, z = ~Z) %>%
  add_markers(marker = list(
    size = 2,
    symbol = "circle",
    color = ~Z,
    colorscale = "Bluered",
    colorbar = list(title = "Z")))

Figure 1: Lorenz attractor in 3D space

Now, let’s have a look at the structure of the Lorenz Attractor dataset.

str(dataset)
## 'data.frame':    200000 obs. of  5 variables:
##  $ X: num  0 0.0025 0.00499 0.00747 0.00995 ...
##  $ Y: num  1 1 1 0.999 0.999 ...
##  $ Z: num  20 20 20 20 19.9 ...
##  $ U: num  0 0.0005 0.001 0.0015 0.002 ...
##  $ t: num  0 0.00025 0.0005 0.00075 0.001 0.00125 0.0015 0.00175 0.002 0.00225 ...

EDA Plots

This section displays five objects.

Variable Histograms: The histogram distribution of all the features in the dataset.

Box Plots: Box plots for all the features in the dataset. These plots display the median and interquartile range of each column at a panel level.

Correlation Matrix: This calculates the Pearson correlation, a bivariate measure of the linear correlation between two numeric columns. The output plot is shown as a matrix.

Summary EDA: The table provides descriptive statistics for all the features in the dataset.

Time Series Plots: Plots of all features (including time) against the time column.

It uses an inbuilt function called edaPlots to display the above-mentioned five objects.

NOTE: The input dataset should be a data frame object and all of its columns must be numeric.
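To see what the correlation output represents, here is a minimal base-R sketch on synthetic data (the data frame and its column names are invented for illustration); edaPlots renders the same kind of Pearson matrix graphically.

```r
set.seed(1)
df <- data.frame(a = rnorm(100))          # toy numeric data
df$b <- 2 * df$a + rnorm(100, sd = 0.1)   # strongly correlated with a
df$c <- rnorm(100)                        # independent noise

M <- cor(df, method = "pearson")          # pairwise Pearson correlation matrix
round(M, 2)
```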

Summary Table

edaPlots(dataset)

Histogram

edaPlots(dataset, output_type = 'histogram')

Boxplots

edaPlots(dataset, output_type = 'boxplot')

Correlation Plot

edaPlots(dataset, output_type = 'correlation')

Time Series Plot

edaPlots(dataset, time_column = "t", output_type = "timeseries")


5. Model Construction and Visualization

With the dataset prepped, we can construct the HVT model, the first and most prominent step. Model training applies Hierarchical Vector Quantization (HVQ) to iteratively compress and project data into a hierarchy of cells. The process uses a quantization error threshold to determine the number of cells and levels in the hierarchy. The compressed data is then projected onto a 2D space, and the resulting tessellation provides a visual representation of the data distribution, enabling insights into the underlying patterns and relationships.
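The compression idea can be sketched with plain kmeans in base R: cells correspond to centroids, and each cell's quantization error summarizes how far its members sit from the centroid. This is a conceptual sketch only; the threshold value and the max-of-L1 error formula mirror the `quant.err`, `distance_metric`, and `error_metric` arguments used below, but the code is not the HVT internals.

```r
set.seed(42)
X  <- cbind(x = rnorm(1000), y = rnorm(1000), z = rnorm(1000))
Xs <- scale(X)                               # normalize, as normalize = TRUE would
km <- kmeans(Xs, centers = 10, nstart = 5)   # 10 "cells"

# Per-cell quantization error: max L1 distance of members to their centroid
# (mirroring distance_metric = "L1_Norm", error_metric = "max")
qe <- sapply(seq_len(10), function(k) {
  members <- Xs[km$cluster == k, , drop = FALSE]
  max(rowSums(abs(sweep(members, 2, km$centers[k, ]))))
})

mean(qe < 0.1)   # fraction of cells below an (illustrative) threshold of 0.1
```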

Model Parameters

NOTE: The compression takes place only for the X, Y, and Z coordinates, not for U (velocity) and t (timestamp). After training and scoring, we merge the U and t columns back into the dataset.

hvt.results <- trainHVT(
  dataset[,-c(4:5)],
  n_cells = 100,
  depth = 1,
  quant.err = 0.1,
  normalize = TRUE,
  distance_metric = "L1_Norm",
  error_metric = "max",
  quant_method = "kmeans",
  dim_reduction_method = "sammon")

Let’s check out the compression summary.

summary(hvt.results)
| segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
|---|---|---|---|---|
| 1 | 100 | 2 | 0.02 | n_cells: 100, quant.err: 0.1, distance_metric: L1_Norm, error_metric: max, quant_method: kmeans |

NOTE: The table shows that the ‘percentOfCellsBelowQuantizationErrorThreshold’ value is 0.02, indicating that only 2% of the 100 cells fall below the quantization error threshold. Typically, we would keep increasing the number of cells until at least 80% of cells fall below the threshold. We do not do so in this vignette because the plots generated by the temporal analysis functions would become cluttered and complex, making the explanations less clear.

Now, Let’s plot the Voronoi tessellation for 100 cells.

plotHVT(
  hvt.results,
  centroid.size = c(0.6),
  plot.type = '2Dhvt',
  cell_id = FALSE)
Figure 2: The Voronoi tessellation for layer 1 shown for the 100 cells in the dataset ’Lorenz attractor’

To understand how cell IDs are distributed across the map, we again plot Voronoi tessellation with cell_id = TRUE.

plotHVT(
  hvt.results,
  centroid.size = c(0.6),
  plot.type = '2Dhvt',
  cell_id = TRUE)
Figure 3: The Voronoi tessellation for layer 1 shown for the 100 cells in the dataset ’Lorenz attractor’ with Cell ID

6. Scoring

Once the model is constructed, the next step is scoring the data points using the trained HVT model. Scoring assigns each data point to a specific cell based on the trained model, mapping every point directly to its correct hierarchical cell without the need for forecasting.
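Conceptually, scoring is nearest-centroid assignment. The toy sketch below (centroids, points, and the helper name are invented for illustration) assigns each point to the cell with the smallest L1 distance, matching the distance_metric = "L1_Norm" used in training; it is not the scoreHVT implementation.

```r
# Toy centroids for 3 "cells" and 3 points to score (illustrative values)
centroids <- matrix(c( 0, 0,
                       1, 1,
                      -1, 2), ncol = 2, byrow = TRUE)
points <- matrix(c( 0.1, 0.2,
                    0.9, 1.1,
                   -0.8, 1.9), ncol = 2, byrow = TRUE)

# Assign each point to the centroid with the smallest L1 (Manhattan) distance
assign_cell <- function(p, C) which.min(rowSums(abs(sweep(C, 2, p))))
cells <- apply(points, 1, assign_cell, C = centroids)
cells   # cell index for each point
```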

scoring <- scoreHVT(dataset,
                    hvt.results,
                    child.level = 1)

Here we display only the first 20 rows of the scored dataset.

displayTable(head(scoring$scoredPredictedData, 20))
| Row_Number | Segment.Level | Segment.Parent | Segment.Child | n | Cell.ID | Quant.Error | centroidRadius | diff | anomalyFlag | X | Y | Z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 88 | 1 | 57 | 0.07 | 0.3618 | 0.2918 | 0 | -0.0905 | 0.0338 | -0.3663 |
| 2 | 1 | 1 | 88 | 1 | 57 | 0.0694 | 0.3618 | 0.2924 | 0 | -0.0902 | 0.0338 | -0.3678 |
| 3 | 1 | 1 | 88 | 1 | 57 | 0.0688 | 0.3618 | 0.2930 | 0 | -0.0899 | 0.0337 | -0.3693 |
| 4 | 1 | 1 | 88 | 1 | 57 | 0.0682 | 0.3618 | 0.2936 | 0 | -0.0896 | 0.0337 | -0.3708 |
| 5 | 1 | 1 | 88 | 1 | 57 | 0.0676 | 0.3618 | 0.2942 | 0 | -0.0893 | 0.0337 | -0.3723 |
| 6 | 1 | 1 | 88 | 1 | 57 | 0.0669 | 0.3618 | 0.2949 | 0 | -0.0889 | 0.0336 | -0.3738 |
| 7 | 1 | 1 | 88 | 1 | 57 | 0.0663 | 0.3618 | 0.2955 | 0 | -0.0886 | 0.0336 | -0.3753 |
| 8 | 1 | 1 | 88 | 1 | 57 | 0.0657 | 0.3618 | 0.2961 | 0 | -0.0883 | 0.0336 | -0.3768 |
| 9 | 1 | 1 | 88 | 1 | 57 | 0.0651 | 0.3618 | 0.2967 | 0 | -0.0880 | 0.0336 | -0.3783 |
| 10 | 1 | 1 | 88 | 1 | 57 | 0.0645 | 0.3618 | 0.2973 | 0 | -0.0877 | 0.0336 | -0.3798 |
| 11 | 1 | 1 | 88 | 1 | 57 | 0.0639 | 0.3618 | 0.2979 | 0 | -0.0874 | 0.0335 | -0.3813 |
| 12 | 1 | 1 | 88 | 1 | 57 | 0.0633 | 0.3618 | 0.2985 | 0 | -0.0871 | 0.0335 | -0.3828 |
| 13 | 1 | 1 | 88 | 1 | 57 | 0.0627 | 0.3618 | 0.2991 | 0 | -0.0867 | 0.0335 | -0.3843 |
| 14 | 1 | 1 | 88 | 1 | 57 | 0.0621 | 0.3618 | 0.2997 | 0 | -0.0864 | 0.0335 | -0.3858 |
| 15 | 1 | 1 | 88 | 1 | 57 | 0.0615 | 0.3618 | 0.3003 | 0 | -0.0861 | 0.0334 | -0.3872 |
| 16 | 1 | 1 | 88 | 1 | 57 | 0.0608 | 0.3618 | 0.3009 | 0 | -0.0858 | 0.0334 | -0.3887 |
| 17 | 1 | 1 | 88 | 1 | 57 | 0.0602 | 0.3618 | 0.3016 | 0 | -0.0855 | 0.0334 | -0.3902 |
| 18 | 1 | 1 | 88 | 1 | 57 | 0.0596 | 0.3618 | 0.3022 | 0 | -0.0852 | 0.0334 | -0.3917 |
| 19 | 1 | 1 | 88 | 1 | 57 | 0.059 | 0.3618 | 0.3028 | 0 | -0.0849 | 0.0334 | -0.3932 |
| 20 | 1 | 1 | 88 | 1 | 57 | 0.0584 | 0.3618 | 0.3034 | 0 | -0.0846 | 0.0334 | -0.3947 |

Below is a brief explanation of each column in the table above.

Let’s look at the scored model summary from scoreHVT().

scoring$model_info$scored_model_summary
## $input_dataset
## [1] "200000 Rows & 5 Columns"
## 
## $scored_qe_range
## [1] "5e-04 to 0.3433"
## 
## $mad.threshold
## [1] 0.2
## 
## $no_of_anomaly_datapoints
## [1] 1706
## 
## $no_of_anomaly_cells
## [1] 18

The model info displays five attributes, which are explained below:

  1. The dimensions of the dataset used in scoring.
  2. The range of quantization error (minimum to maximum) across all scored data points.
  3. The Mean Absolute Deviation (MAD) threshold value.
  4. The number of anomalous data points. A data point is flagged as an anomaly if its quantization error exceeds ‘mad.threshold’.
  5. The number of cells containing anomalies. A cell is considered an anomaly cell if it contains even one anomalous data point.
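Points 4 and 5 can be sketched directly in base R. The quantization errors and cell assignments below are synthetic, just to show how the two anomaly counts are derived from ‘mad.threshold’:

```r
set.seed(7)
qe   <- c(runif(95, 0, 0.15), runif(5, 0.25, 0.35))  # toy quantization errors
cell <- sample(1:10, 100, replace = TRUE)            # toy cell assignments
mad.threshold <- 0.2

anomalyFlag <- as.integer(qe > mad.threshold)        # 1 if QE exceeds the threshold
no_of_anomaly_datapoints <- sum(anomalyFlag)
no_of_anomaly_cells <- length(unique(cell[anomalyFlag == 1]))

c(points = no_of_anomaly_datapoints, cells = no_of_anomaly_cells)
```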

Now, let’s merge the U and t columns back into the scoring output, keeping only the Cell.ID and t columns needed by the ‘Temporal Analysis’ functions.

temporal_data <- cbind(scoring$scoredPredictedData, dataset[,c(4,5)]) %>% select(Cell.ID,t)


7. Time Series Plot with State Transitions

The first new function is plotStateTransition, which creates a time series plotly object displaying the movement of data points across cells over time. Below are the function signature and its arguments.

plotStateTransition(
       df,
       sample_size,
       line_plot,
       cellid_column,
       time_column,
       v_intercept,
       time_periods)

plotStateTransition(df = temporal_data,
                    cellid_column = "Cell.ID",
                    time_column = "t",
                    sample_size = 0.2,
                    line_plot = TRUE)