The HVT package offers a suite of R functions designed to construct topology preserving maps for in-depth analysis of multivariate data. It is particularly well-suited for datasets with numerous records. The package organizes the typical workflow into several key stages:
Data Compression: Long datasets are compressed using Hierarchical Vector Quantization (HVQ) to achieve the desired level of data reduction.
Data Projection: Compressed cells are projected into one and two dimensions using dimensionality reduction algorithms, producing embeddings that preserve the original topology. This allows for intuitive visualization of complex data structures.
Tessellation: Voronoi tessellation partitions the projected space into distinct cells, supporting hierarchical visualizations. Heatmaps and interactive plots facilitate exploration and insights into the underlying data patterns.
Scoring: Test dataset is evaluated against previously generated maps, enabling their placement within the existing structure. Sequential application across multiple maps is supported if required.
Compression is a technique used to reduce the data size while preserving its essential information, allowing for efficient storage and decompression to reconstruct the original data. While Vector quantization (VQ) is a technique used in data compression to represent a set of data points with a smaller number of representative vectors. It achieves compression by exploiting redundancies or patterns in the data and replacing similar data points with representative vectors.
This package offers several advantages for performing data
compression as it is designed to handle
high-dimensional data more efficiently. It provides a
hierarchical compression approach, allowing
multi-resolution representation of the data. The hierarchical structure
enables efficient compression and storage of
the data while preserving different levels of detail. HVT aims to
preserve the topological structure of the data during compression.
Spatial data with irregular shapes and complex
structures in high-dimensional data can contain valuable information
about relationships and patterns. HVT seeks to capture and retain these
topological characteristics, enabling meaningful analysis and
visualization.This package employs tessellation to divide the compressed
data space into distinct cells or regions while preserving the topology
of the original data. This means that the relationships and connectivity
between data points are maintained in the compressed representation.
This package can perform vector quantization using the following algorithms-
The second and third steps are iterated until a predefined number of iterations is reached or the clusters converge. The runtime for the algorithm is O(n).
The second and third steps are iterated until a predefined number of iterations is reached or the clusters converge. The runtime for the algorithm is O(k * (n-k)^2).
These algorithm divides the dataset recursively into cells using \(k-means\) or \(k-medoids\) algorithm. The maximum number of subsets are decided by setting \(n_cells\) to, say five, in order to divide the dataset into maximum of five subsets. These five subsets are further divided into five subsets(or less), resulting in a total of twenty five (5*5) subsets. The recursion terminates when the cells either contain less than three data point or a stop criterion is reached. In this case, the stop criterion is set to when the cell error exceeds the quantization threshold.
The steps for this method are as follows:
The stop criterion is when the quantization error of a cell satisfies one of the below conditions:
The quantization error for a cell is defined as follows:
\[QE = \max_i(||A-F_i||_{p})\]
where
Let us try to understand quantization error with an example.
Figure 1: The Voronoi tessellation for level 1 shown for the 5 cells with the points overlayed
An example of a 2 dimensional VQ is shown above.
In the above image, we can see 5 cells with each cell containing a certain number of points. The centroid for each cell is shown in blue. These centroids are also known as codewords since they represent all the points in that cell. The set of all codewords is called a codebook.
Now we want to calculate quantization error for each cell. For the
sake of simplicity, let’s consider only one cell having centroid
A and m data points \(F_i\) for calculating quantization
error.
For each point, we calculate the distance between the point and the centroid.
\[ d = ||A - F_i||_{p} \]
In the above equation, p = 1 means L1_Norm distance
whereas p = 2 means L2_Norm distance. In the package, the
L1_Norm distance is chosen by default. The user can pass
either L1_Norm, L2_Norm or a custom function
to calculate the distance between two points in n dimensions.
\[QE = \max_i(||A-F_i||_{p})\]
Now, we take the maximum calculated distance of all m points. This
gives us the furthest distance of a point in the cell from the centroid,
which we refer to as Quantization Error. If the
Quantization Error is higher than the given threshold, the centroid/
codevector is not a good representation for the points in the cell. Now
we can perform further Vector Quantization on these points and repeat
the above steps.
Please note that the user can select mean, max or any custom function
to calculate the Quantization Error. The custom function takes a vector
of m value (where each value is a distance between point in
n dimensions and centroids) and returns a single value
which is the Quantization Error for the cell.
If we select mean as the error metric, the above
Quantization Error equation will look like this:
\[QE = \frac{1}{m}\sum_{i=1}^m||A-F_i||_{p}\]
Projection mainly involves converting data from its original form to a different space or coordinate system while preserving certain properties of it. By projecting data into a common coordinate system, spatial relationships, distances, areas, and other spatial attributes can be accurately measured and compared.
HVT performs projection as part of its workflow to visualize and explore high-dimensional data. The projection step in HVT involves mapping the compressed data, represented by the hierarchical structure of cells, onto a lower-dimensional space for visualization purposes, as human perception is more suited to interpreting information in lower-dimensional spaces.Users can zoom in/out, rotate, and explore different regions of the projected space to gain insights and understand the data from different perspectives.
Sammon’s projection is an algorithm used in this package to map a high-dimensional space to a space of lower dimensionality while attempting to preserve the structure of inter-point distances in the projection. It is particularly suited for use in exploratory data analysis and is usually considered a non-linear approach since the mapping cannot be represented as a linear combination of the original variables. The centroids are plotted in 2D after performing Sammon’s projection at every level of the tessellation.
Denoting the distance between \(i^{th}\) and \(j^{th}\) objects in the original space by \(d_{ij}^*\), and the distance between their projections by \(d_{ij}\). Sammon’s mapping aims to minimize the below error function, which is often referred to as Sammon’s stress or Sammon’s error.
\[E=\frac{1}{\sum_{i<j} d_{ij}^*}\sum_{i<j}\frac{(d_{ij}^*-d_{ij})^2}{d_{ij}^*}\]
The minimization of this can be performed either by gradient descent, as proposed initially, or by other means, usually involving iterative methods. The number of iterations need to be experimentally determined and convergent solutions are not always guaranteed. Many implementations prefer to use the first Principal Components as a starting configuration.
A Voronoi diagram is a way of dividing space into a number of
regions. A set of points (called seeds, sites, or generators) is
specified beforehand and for each seed, there will be a corresponding
region consisting of all points within proximity of that seed. These
regions are called Voronoi cells. It is complementary to
Delaunay triangulation is a geometrical algorithm used to
create a triangulated mesh from a set of points in a plane which has the
property that no data point lies within the circumcircle of any triangle
in the triangulation. This property guarantees that the resulting cells
in the tessellation do not overlap with each other.
By using Delaunay triangulation, HVT can achieve a
partitioning of the data space into distinct and non-overlapping
regions, which is crucial for accurately representing and analyzing the
compressed data.Additionally, the use of Delaunay triangulation for
tessellation ensures that the resulting cells have well-defined shapes,
typically triangles in two dimensions or tetrahedra in three
dimensions.
The hierarchical structure resulting from tessellation preserves the inherent structure and relationships within the data. It captures clusters, subclusters, and other patterns in the data, allowing for a more organized and interpretable representation. The hierarchical structure reduces redundancy and enables more compact representations.
Tessellate: Constructing Voronoi Tesselation
In this package, we use sammons from the package
MASS to project higher dimensional data to a 2D space. The
function hvq called from the trainHVT function
returns hierarchical quantized data which will be the input for
construction of the tessellations. The data is then represented in 2D
coordinates and the tessellations are plotted using these coordinates as
centroids. We use the package deldir for this purpose. The
deldir package computes the Delaunay triangulation (and
hence the Dirichlet or Voronoi tessellation) of a planar point set
according to the second (iterative) algorithm of Lee and Schacter. For
subsequent levels, transformation is performed on the 2D coordinates to
get all the points within its parent tile. Tessellations are plotted
using these transformed points as centroids. The lines in the
tessellations are chopped in places so that they do not protrude outside
the parent polygon. This is done for all the subsequent levels.
Scoring basically refers to the process of chalking up or estimating future values or outcomes based on existing data patterns.In training process, a model is developed based on historical data or a training dataset, and this model is then used to score new, unseen data. The model captures the underlying patterns, trends, and relationships present in the training data, allowing it to pin point the cell of the similar or related data points.
In this package, we use scoreHVT function to score each
point in the testing dataset.
Scoring Algorithm
The Scoring algorithm recursively calculates the distance between each point in the testing dataset and the cell centroids for each level. The following steps explain the scoring method for a single point in the testing dataset:
This chunk verifies the installation of all the necessary packages to successfully run this vignette, if not, installs them and attach all the packages in the session environment.
list.of.packages <- c("dplyr", "kableExtra", "geozoo", "plotly", "purrr",
"data.table", "gridExtra","plyr","DT","HVT")
new.packages <-list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages))
install.packages(new.packages, dependencies = TRUE, repos='https://cloud.r-project.org/')
invisible(lapply(list.of.packages, library, character.only = TRUE))In this section we explore the capacity of the package to visualize multidimensional data by projecting them to two dimensions using Sammon’s projection and further used for Scoring.
Data Understanding
First of all, let us see how to generate data for torus. We are using
a library geozoo for this purpose. Geo Zoo (stands for
Geometric Zoo) is a compilation of geometric objects ranging from three
to ten dimensions. Geo Zoo contains regular or well-known objects, eg
cube and sphere, and some abstract objects, e.g. Boy’s surface, Torus
and Hyper-Torus.
Here, we will generate a 3D torus (a torus is a surface of revolution generated by revolving a circle in three-dimensional space one full revolution about an axis that is coplanar with the circle) with 12000 points.
Torus Dataset
The torus dataset includes the following columns:
Lets, explore the torus dataset containing 12000 points. For the sake of brevity we are displaying first 6 rows.
set.seed(240)
# Here p represents dimension of object, n represents number of points
torus <- geozoo::torus(p = 3,n = 12000)
torus_df <- data.frame(torus$points)
colnames(torus_df) <- c("x","y","z")
torus_df <- torus_df %>% round(4)
displayTable(head(torus_df))| x | y | z |
|---|---|---|
| -2.6282 | 0.5656 | -0.7253 |
| -1.4179 | -0.8903 | 0.9455 |
| -1.0308 | 1.1066 | -0.8731 |
| 1.8847 | 0.1895 | 0.9944 |
| -1.9506 | -2.2507 | 0.2071 |
| -1.4824 | 0.9229 | 0.9672 |
Let’s visualize the torus (donut) in 3D Space.
plot_ly(x = torus_df$x, y = torus_df$y, z = torus_df$z, type = 'scatter3d',mode = 'markers',
marker = list(color = torus_df$z,colorscale = c('red', 'blue'),showscale = TRUE,size = 3,colorbar = list(title = 'z'))) %>%
layout(scene = list(xaxis = list(title = 'x'),yaxis = list(title = 'y'),zaxis = list(title = 'z'),
aspectratio = list(x = 1, y = 1, z = 0.4)))