trainHVT() Function: Parameters and Hyperparameters for Dimensionality Reduction Methods
The HVT package offers a suite of R functions designed to construct topology-preserving maps for in-depth analysis of multivariate data. It is particularly well suited to datasets with a large number of records. The package organizes the typical workflow into several key stages:
Data Compression: Long datasets are compressed using Hierarchical Vector Quantization (HVQ) to achieve the desired level of data reduction.
Data Projection: Compressed cells are projected into one and two dimensions using dimensionality reduction algorithms, producing embeddings that preserve the original topology. This allows for intuitive visualization of complex data structures.
Tessellation: Voronoi tessellation partitions the projected space into distinct cells, supporting hierarchical visualizations. Heatmaps and interactive plots facilitate exploration and insights into the underlying data patterns.
Scoring: A test dataset is scored against previously generated maps, placing each test record within the existing structure. Sequential application across multiple maps is supported if required.
Temporal Analysis and Visualization: Functions in this stage examine time-series data to identify patterns, estimate transition probabilities, and visualize data flow over time.
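The first two stages above can be sketched in base R. This is a minimal illustration under stated assumptions, not the HVT implementation: the toy data is made up, flat k-means stands in for hierarchical vector quantization, and classical MDS (`stats::cmdscale`) stands in for Sammon, t-SNE, or UMAP.

```r
# Illustrative stand-in for the compression and projection stages.
# Assumptions: toy data, flat k-means instead of HVQ, classical MDS
# instead of the projection methods offered by trainHVT.
set.seed(42)

# Toy "long" dataset: 2000 records with 5 numeric features (made up).
toy_data <- matrix(rnorm(2000 * 5), ncol = 5)

# Stage 1 (compression): quantize the records into 20 cells.
n_cells <- 20
vq <- kmeans(toy_data, centers = n_cells, nstart = 5, iter.max = 50)
centroids <- vq$centers                  # one row per cell

# Stage 2 (projection): embed the cell centroids in two dimensions.
embedding <- cmdscale(dist(centroids), k = 2)

# Each record inherits the 2D position of the cell it was assigned to.
record_xy <- embedding[vq$cluster, , drop = FALSE]

dim(embedding)   # one row per cell, two columns
dim(record_xy)   # one row per record, two columns
```

Projecting 20 centroids instead of 2000 records is what keeps the projection step cheap, whichever projection method is used.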
What’s New?
This notebook showcases the enhancements made to the
trainHVT function through the integration of dimensionality
reduction techniques and comprehensive evaluation metrics. These
enhancements aim to improve the visualization, analysis, and
interpretability of high-dimensional data within the HVT framework.
1. Integration of Advanced Dimensionality Reduction Techniques:
The trainHVT function now includes dimensionality
reduction techniques like t-SNE and UMAP, alongside the previously
implemented Sammon’s method. This integration enhances the function’s
capacity to explore and apply various dimensionality reduction
approaches.
t-Distributed Stochastic Neighbor Embedding
(t-SNE): Integrating t-SNE into the trainHVT
function enables non-linear dimensionality reduction that preserves
local structure, which helps in visualizing intricate data
structures. Because the projection is applied to the compressed cells
rather than the raw records, the computational overhead stays modest.
Uniform Manifold Approximation and Projection
(UMAP): Integrating UMAP into the trainHVT
function helps preserve both local and global data structures. UMAP
maintains local relationships between data points while also
retaining much of the broader global structure, which helps reveal more
meaningful clusters and patterns in complex datasets.
2. Integration of Evaluation Metrics:
Dimensionality reduction evaluation metrics help to determine the quality and effectiveness of the dimensionality reduction process by evaluating aspects such as data point proximity, cluster separation, and overall fidelity of the reduced representation.
t-SNE is a widely recognized technique for visualizing high-dimensional data in a low-dimensional space, typically two or three dimensions. Developed by Laurens van der Maaten and Geoffrey Hinton, t-SNE is particularly effective at preserving the local structure of the data, ensuring that similar data points are positioned close to one another in the reduced dimensional space.
Advantages of t-SNE
The key advantage of using t-SNE for dimensionality reduction lies in its probabilistic approach to measuring the similarities between data points.
t-SNE focuses on preserving the relative distances between data points, rather than just the absolute distances. This results in visually intuitive maps where similar data points form dense clusters, while dissimilar points are more spread out.
This property makes the resulting visualizations not only accurate in terms of capturing the underlying data structure, but also highly interpretable, even for users without extensive statistical expertise.
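The probabilistic similarity measure mentioned above can be sketched in a few lines of base R. This is a simplified illustration with made-up data and a single fixed bandwidth `sigma`. Actual t-SNE tunes a separate bandwidth for each point to match a target perplexity. The helper name `conditional_similarity` is hypothetical.

```r
# Sketch of Gaussian conditional similarities p_{j|i} as used by t-SNE.
# Assumption: one fixed sigma for all points (real t-SNE tunes it per point).
set.seed(1)
X <- rbind(matrix(rnorm(10, mean = 0, sd = 0.3), ncol = 2),   # tight group A (rows 1-5)
           matrix(rnorm(10, mean = 5, sd = 0.3), ncol = 2))   # tight group B (rows 6-10)

conditional_similarity <- function(X, sigma = 1) {
  d2 <- as.matrix(dist(X))^2              # squared Euclidean distances
  w  <- exp(-d2 / (2 * sigma^2))          # Gaussian kernel weights
  diag(w) <- 0                            # a point is not its own neighbour
  w / rowSums(w)                          # row i holds p_{j|i}; each row sums to 1
}

P <- conditional_similarity(X)
within_A <- sum(P[1, 2:5])    # probability mass point 1 assigns to its own group
between  <- sum(P[1, 6:10])   # mass it assigns to the distant group
```

Because each row is normalized, what matters is how close a neighbour is relative to the other neighbours of the same point. That is why nearby points dominate the similarity and distant groups receive almost no mass.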
UMAP is a cutting-edge technique for dimension reduction and data visualization, known for its speed, scalability, and ability to maintain both global and local data structure. Developed by Leland McInnes, John Healy, and James Melville, UMAP has quickly become a favorite among data scientists for its versatility and robust performance across a wide range of applications.
Advantages of UMAP
The key advantage of using UMAP as a dimensionality reduction technique is its ability to simultaneously preserve the global structure of the data while also highlighting the local relationships between data points.
This dual focus results in visualizations that accurately represent the underlying clusters and patterns within the dataset, providing insights that may be missed by other dimensionality reduction methods.
UMAP delivers high-quality, interpretable visualizations that significantly enhance our understanding of complex, multi-dimensional datasets.
Dimensionality reduction evaluation metrics are measures used to assess the effectiveness of dimensionality reduction techniques. They help evaluate how well these techniques preserve the structure, relationships, and quality of the data when reducing its dimensions.
These six metrics are organized into five categories, listed
below with a brief overview of each metric included in the trainHVT
function:
Structure Preservation Metrics
Distance Preservation Metrics
Human Centered Metrics
Ground Truth: We applied the dimensionality reduction techniques to torus data. The underlying structure of this data is a torus, a surface shaped like a doughnut, so a faithful two-dimensional projection should resemble an annulus (the ring between two concentric circles).
Interpretive Quality Metrics
Computational Efficiency Metrics
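To make the idea of a distance preservation check and a structure preservation check concrete, here is a base R sketch. The two scores are generic illustrations chosen for this example and are not necessarily the metrics that trainHVT reports. The data is made up, classical MDS stands in for the embedding, and the helper `knn_index` is hypothetical.

```r
# Two illustrative quality checks for a 2D embedding (base R only).
# Assumptions: made-up data, classical MDS as the stand-in embedding.
set.seed(7)
X <- matrix(rnorm(200 * 6), ncol = 6)     # 200 points, 6 features
Y <- cmdscale(dist(X), k = 2)             # stand-in 2D embedding

d_high <- as.matrix(dist(X))
d_low  <- as.matrix(dist(Y))

# Distance preservation: rank correlation between original and embedded distances.
upper <- upper.tri(d_high)
distance_cor <- cor(d_high[upper], d_low[upper], method = "spearman")

# Structure preservation: share of each point's k nearest neighbours that survive.
knn_index <- function(d, k) {
  # order(row)[1] is the point itself (distance 0), so skip it
  t(apply(d, 1, function(row) order(row)[2:(k + 1)]))
}
k <- 10
nn_high <- knn_index(d_high, k)
nn_low  <- knn_index(d_low, k)
knn_retention <- mean(sapply(seq_len(nrow(X)), function(i) {
  length(intersect(nn_high[i, ], nn_low[i, ])) / k
}))
```

Both scores lie between 0 and 1 for a reasonable embedding, with higher values indicating that more of the original geometry survived the reduction.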
This chunk checks that all the packages needed to run this vignette are installed, installs any that are missing, and attaches them all to the session environment.
# Packages required to run this vignette (geozoo is used to generate the torus)
list.of.packages <- c("DT", "plotly", "magrittr", "data.table", "tidyverse", "crosstalk",
                      "kableExtra", "cowplot", "gdata", "ggplot2", "gridExtra", "tibble",
                      "geozoo", "HVT")
# Install any packages that are missing, then attach them all
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages)) {install.packages(new.packages, repos = 'https://cloud.r-project.org/')}
invisible(lapply(list.of.packages, library, character.only = TRUE))
First, let us see how to generate data for a torus. We use the geozoo library for this purpose. geozoo (short for Geometric Zoo) is a compilation of geometric objects ranging from three to ten dimensions. It contains regular or well-known objects, e.g., the cube and sphere, and some abstract objects, e.g., Boy’s surface, the torus, and the hyper-torus.
Here, we will generate a 3D torus (a torus is a surface of revolution generated by revolving a circle in three-dimensional space one full revolution about an axis that is coplanar with the circle) with 12000 points.
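The definition above corresponds to a simple parametric form, sketched here in base R. The radii `R = 2` and `r = 1` are assumptions chosen to be consistent with the coordinate ranges in the displayed sample. The sampling scheme is also an assumption, so these points will not match the output of `geozoo::torus`.

```r
# Base R sketch of a 3D torus as a surface of revolution.
# Assumptions: R = 2 and r = 1, uniform sampling of both angles.
set.seed(124)
n <- 12000
R <- 2   # distance from the torus centre to the centre of the tube
r <- 1   # radius of the revolved circle (the tube)

theta <- runif(n, 0, 2 * pi)   # angle around the tube
phi   <- runif(n, 0, 2 * pi)   # angle around the axis of revolution

torus_sketch <- data.frame(
  x = (R + r * cos(theta)) * cos(phi),
  y = (R + r * cos(theta)) * sin(phi),
  z = r * sin(theta)
)

# Every point satisfies the implicit equation (sqrt(x^2 + y^2) - R)^2 + z^2 = r^2.
residual <- with(torus_sketch, (sqrt(x^2 + y^2) - R)^2 + z^2 - r^2)
max(abs(residual))   # numerically zero

# Viewed from above, the points fall in an annulus with radii R - r and R + r.
rho <- with(torus_sketch, sqrt(x^2 + y^2))
range(rho)
```

The last check is the reason an annulus serves as the visual ground truth for a two-dimensional projection of this data.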
The torus dataset includes the following columns: x, y, and z, the Cartesian coordinates of each point.
Let’s explore the raw torus dataset containing 12000 points. For the sake of brevity, we display only 20 rows.
set.seed(124)
# Generate 12000 points on a 3D torus and label the coordinates
torus <- geozoo::torus(p = 3, n = 12000)
torus_df <- data.frame(torus$points)
colnames(torus_df) <- c("x", "y", "z")
# Round to four decimals for display; column names carry over from torus_df
torus_df1 <- torus_df %>% round(4)
displayTable(torus_df1)

| x | y | z |
|---|---|---|
| 1.0055 | 0.5779 | 0.5422 |
| -1.1971 | -0.1153 | 0.6035 |
| 0.2963 | 1.7116 | 0.9648 |
| -0.8651 | -0.5048 | 0.0571 |
| 1.6057 | -0.8437 | 0.9825 |
| 0.3565 | -2.5977 | -0.7830 |
| 0.1319 | -2.5860 | -0.8079 |
| -2.4760 | 1.5867 | 0.3388 |
| -1.7364 | -0.9281 | -0.9995 |
| 2.2525 | -1.9531 | 0.1922 |
| -0.8521 | -0.8509 | -0.6056 |
| 1.0110 | 0.5333 | 0.5153 |
| -1.2629 | 1.9172 | 0.9553 |
| 1.0199 | -1.6021 | 0.9949 |
| 0.6109 | -1.0920 | 0.6628 |
| 0.7072 | -1.4855 | 0.9350 |
| -2.4762 | 0.9804 | -0.7484 |
| 2.2901 | 1.9104 | -0.1875 |
| 1.0356 | 2.7392 | -0.3715 |
| 0.9449 | 1.2767 | 0.9113 |
Now, let’s try to visualize the torus dataset in 3D.
# Interactive 3D scatter plot of the torus, coloured by height (z)
plot_ly(x = torus_df1$x, y = torus_df1$y, z = torus_df1$z,
        type = 'scatter3d', mode = 'markers',
        marker = list(color = torus_df1$z, colorscale = c('#F50000', '#000FFF'),
                      showscale = TRUE, size = 3, colorbar = list(title = 'z'))) %>%
  layout(scene = list(xaxis = list(title = 'x'), yaxis = list(title = 'y'),
                      zaxis = list(title = 'z'),
                      aspectratio = list(x = 1, y = 1, z = 0.5)))