The HVT package offers a suite of R functions designed to construct topology preserving maps for in-depth analysis of multivariate data. It is particularly well-suited for datasets with numerous records. The package organizes the typical workflow into several key stages:
Data Compression: Long datasets are compressed using Hierarchical Vector Quantization (HVQ) to achieve the desired level of data reduction.
Data Projection: Compressed cells are projected into one and two dimensions using dimensionality reduction algorithms, producing embeddings that preserve the original topology. This allows for intuitive visualization of complex data structures.
Tessellation: Voronoi tessellation partitions the projected space into distinct cells, supporting hierarchical visualizations. Heatmaps and interactive plots facilitate exploration and insights into the underlying data patterns.
Scoring: A test dataset is evaluated against previously generated maps, placing its records within the existing structure. Sequential application across multiple maps is supported when required.
Temporal Analysis and Visualization: Functions in this stage examine time-series data to identify patterns, estimate transition probabilities, and visualize data flow over time.
What’s New?
HVT – Version 25.2.4
This notebook introduces a new feature, MSM (Monte Carlo Simulation of Markov Chains), in the HVT package. MSM is designed for dynamic forecasting of time series data using a transition probability matrix to forecast n states ahead.
The workflow supports both ex-post and ex-ante forecasting.
The notebook provides a step-by-step walkthrough covering data preparation, model setup, and forecast generation. It also highlights challenges arising from transition probability issues in certain states, outlines mechanisms to handle such problematic states, and evaluates forecast performance using appropriate accuracy metrics.
HVT – Version 25.2.8
This latest version of the notebook introduces an enhancement to dynamic forecasting: raw series ex-ante forecasting. This approach uses a simple 12-month direct lookback, in which raw historical values are combined with the forecasted year-over-year (YoY) changes to generate raw-level forecasts for each feature over the ex-ante period.
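Conceptually, each raw-level forecast multiplies the raw value from 12 months earlier by the exponentiated forecasted log YoY change (undoing the log-difference transform described later in this notebook). A hypothetical one-step illustration, with made-up values:

```r
# Hypothetical illustration of the 12-month direct lookback:
# raw forecast = raw value 12 months back * exp(forecasted YoY log change)
raw_lag12    <- 1229.0   # illustrative raw value 12 months before the target month
yoy_forecast <- 0.08     # illustrative forecasted 12-month log difference
raw_forecast <- raw_lag12 * exp(yoy_forecast)
round(raw_forecast, 2)   # ~1331.36
```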
This chunk verifies that all packages necessary to run this vignette are installed; any missing packages are installed and then attached to the session environment.
list.of.packages <- c("dplyr", "tidyr","patchwork", "feather","ggplot2","kableExtra", "htmltools",
"plotly","tibble","purrr", "gganimate", "DT","readr", "NbClust", "HVT")
new.packages <-list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages))
install.packages(new.packages, dependencies = TRUE, verbose = FALSE, repos='https://cloud.r-project.org/')
invisible(lapply(list.of.packages, library, character.only = TRUE))

Below is the function for a more dynamic drop-down display in the data tables.
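The helper `calculate_dynamic_length_menu` used below is not defined in this chunk. A minimal sketch consistent with how it is called (it returns a vector of page-length choices for `DT::datatable`'s `lengthMenu` option, scaled to the row count) might look like:

```r
# Hypothetical implementation: build a lengthMenu vector for DT::datatable,
# stepping through standard page sizes up to the full dataset size.
calculate_dynamic_length_menu <- function(n_rows) {
  steps <- c(10, 25, 50, 100, 250, 500, 1000)
  c(steps[steps < n_rows], n_rows)
}
```

For example, `calculate_dynamic_length_menu(322)` would return `c(10, 25, 50, 100, 250, 322)`, letting the user page through or show the entire table at once.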
Let’s start by importing the dataset. The code below reads and displays it.
csv_path <- "sample_dataset/macro_eco_data_2025.csv"
entire_dataset_raw <- read.csv(csv_path)
entire_dataset_raw <- entire_dataset_raw %>% mutate(across(where(is.numeric), ~ round(., 4)))
dynamic_length_menu <- calculate_dynamic_length_menu(nrow(entire_dataset_raw))
datatable(entire_dataset_raw, options = list(pageLength = 10, scrollX = TRUE, lengthMenu = dynamic_length_menu), rownames = FALSE)

This dataset includes a collection of key economic and financial indicators. These indicators are essential for monitoring macroeconomic performance, analyzing market trends, and assessing financial stability. The data ranges from December 1998 to September 2025.
Before proceeding, it is crucial to examine the structure of the dataset. This involves verifying the data types of the columns and resolving any inconsistencies. Make sure all data types are accurate and suitable for the intended functions.
## 'data.frame': 322 obs. of 18 variables:
## $ t : chr "1998/12" "1999/01" "1999/02" "1999/03" ...
## $ CPI_Food : num 175 176 176 176 176 ...
## $ COIN : num 80.1 80.3 80.8 80.8 80.9 81.2 81.4 81.7 82 81.9 ...
## $ Copper_ETF : num 0.663 0.641 0.624 0.622 0.717 ...
## $ SnP500_ETF : num 1229 1280 1238 1286 1335 ...
## $ Spot_Oil : num 11.3 12.5 12 14.7 17.3 ...
## $ USD_Index : num 94.2 96.1 98.7 100.1 101 ...
## $ Unemp_Rate : num 4.4 4.3 4.4 4.2 4.3 4.2 4.3 4.3 4.2 4.2 ...
## $ Y10_Note : num 4.65 4.72 5 5.23 5.18 5.54 5.9 5.79 5.94 5.92 ...
## $ Y2_Note : num 4.51 4.62 4.88 5.05 4.98 5.25 5.62 5.55 5.68 5.66 ...
## $ Yield_Curve: num 10.5 10.4 10.7 10.6 10.2 ...
## $ XLY : num 19.4 20.3 20.2 21.2 21.8 ...
## $ XLP : num 14.6 14.4 14.3 14.3 13.8 ...
## $ XLE : num 11.8 11 10.9 12.5 14.3 ...
## $ XLF : num 11.4 11.6 11.7 12.1 13 ...
## $ XLV : num 17.7 18.5 18.5 19 19.7 ...
## $ XLI : num 15.4 15.2 15.4 15.7 18 ...
## $ XLB : num 12.2 11.8 11.9 12.2 15.2 ...
Since the time column is in character format, we convert it to datetime (POSIXct) format.
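One way to do this, assuming the `"YYYY/MM"` format shown in the structure output above (a day must be appended before parsing):

```r
# Parse the "YYYY/MM" time column into POSIXct by appending a day-of-month.
entire_dataset_raw$t <- as.POSIXct(paste0(entire_dataset_raw$t, "/01"),
                                   format = "%Y/%m/%d", tz = "UTC")
```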
We transform the data to compute the 12-month rate of change, which standardizes the features and brings them to a comparable scale, simplifying the analysis of their relative changes. Log difference reduces variability and removes trends, stabilizing the data for more accurate forecasting.
The rate of change is calculated as follows:
\[ \text{Rate of Change} = \log(\text{Current Value}) - \log(\text{12-Month Lag Value}) \]
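As a quick sanity check, the 12-month log difference closely approximates the percentage change for small moves. A toy example on a single series:

```r
# A series flat at 100 for 12 months, then 110: a 10% year-over-year rise.
x <- c(rep(100, 12), 110)
roc <- diff(log(x), lag = 12)  # 12-month log difference
round(roc, 4)                  # 0.0953, close to the 10% YoY change
```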
entire_dataset <- entire_dataset_raw
features_data <- entire_dataset %>% select(-t) %>% colnames()
entire_dataset[features_data] <- lapply(entire_dataset[features_data], function(x) as.numeric(as.character(x)))
invisible(lapply(features_data, function(col) {
entire_dataset[[col]] <<- entire_dataset[[col]] %>% log()
entire_dataset[[col]] <<- c(rep(NA, 12), round(diff(entire_dataset[[col]], 12),4))}))
entire_dataset <- entire_dataset %>% na.omit() %>% data.frame()
rownames(entire_dataset) <- NULL

After taking the 12-month log difference, the dataset ranges from December 1999 to September 2025. Below is a table displaying the transformed dataset.
entire_dataset$t <- as.character(entire_dataset$t)
dynamic_length_menu <- calculate_dynamic_length_menu(nrow(entire_dataset))
datatable(entire_dataset,options = list(pageLength = 10,scrollX = TRUE,lengthMenu = dynamic_length_menu,columnDefs = list(list(width = '150px', targets = 0))),
class = "nowrap display", rownames = FALSE)

The ultimate goal of this notebook is to dynamically forecast the HVT states in both ex-post and ex-ante scenarios. To structure the analysis, we define the timelines based on the main topics of interest:
Constructing HVT Model: 1999-12-01 to 2024-09-01
Scoring Using HVT Model: 1999-12-01 to 2025-09-01 (actual states are needed for ex-post actuals and studentized residuals)
Dynamic Forecasting
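The timelines above can be carved out with simple date filters. A sketch, in which `scoring_data` and `t_posix` are assumed names (the `t` column was coerced back to character for display earlier, so it is converted here for comparison):

```r
# Hypothetical split of the transformed dataset into the modeling
# and scoring windows defined above.
t_posix <- as.POSIXct(entire_dataset$t)
trainHVT_data <- entire_dataset[t_posix <= as.POSIXct("2024-09-01"), ]
scoring_data  <- entire_dataset[t_posix <= as.POSIXct("2025-09-01"), ]
```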
For the Exploratory Data Analysis (EDA), we will create a statistic table and series of plots to visualize the dataset’s distribution, trends, and relationships. These plots provide insights into the data structure, enabling a better understanding of the underlying patterns and correlations.
Dataset used for EDA: 1999-12-01 to 2025-09-01
recession_periods <- list(c("2001-03-01", "2001-11-01"), c("2007-12-01", "2009-06-01"), c("2020-02-01", "2020-04-01"))
recession_periods <- lapply(recession_periods, function(x) as.POSIXct(x))
edaPlots(entire_dataset, time_column = "t", output_type = "timeseries", grey_bars = recession_periods)

The dataset is now prepped and ready for constructing the HVT model, which is the first and most prominent step. Model training applies Hierarchical Vector Quantization (HVQ) to iteratively compress and project data into a hierarchy of cells. The process uses a quantization error threshold to determine the number of cells and levels in the hierarchy. The compressed data is then projected onto a 2D space, and the resulting tessellation provides a visual representation of the data distribution, enabling insights into the underlying patterns and relationships.
We use the trainHVT function to compress the dataset; the timestamp column is excluded, as it is not needed for this process.
Input Parameters
hvt.results <- trainHVT(
trainHVT_data[,-1],
n_cells = 75,
depth = 1,
quant.err = 0.25,
normalize = TRUE,
distance_metric = "L1_Norm",
error_metric = "max",
quant_method = "kmeans",
dim_reduction_method = "sammon")

| segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
|---|---|---|---|---|
| 1 | 75 | 56 | 0.75 | n_cells: 75 quant.err: 0.25 distance_metric: L1_Norm error_metric: max quant_method: kmeans |
The value of percentOfCellsBelowQuantizationErrorThreshold is crucial for evaluating the model’s performance and the quality of the compression. It is recommended to construct a model where at least 80% of the cells fall below the quantization error threshold.
The value of 0.75 indicates that 75% compression has been achieved. Typically, the number of cells is increased until at least 80% compression is reached. However, for this vignette we proceed with the current level of compression, as this specific combination of cells and dataset intentionally induces problematic transition states, which are useful for demonstrating the proposed solution.
As a visual stability and aesthetic check, we plot and compare the current model against tessellations built with up to three cells above and below the current cell count of 75. This comparison helps verify the model’s structural integrity across a similar range of cells and reveals any significant sudden structural changes.
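One way to generate the comparison models is to retrain over a small window of cell counts, reusing the parameters from the chunk above. A sketch, in which `stability_models` is an assumed name:

```r
# Train HVT models for cell counts 72 through 78 to compare tessellation stability.
cell_range <- 72:78
stability_models <- lapply(cell_range, function(k) {
  trainHVT(trainHVT_data[,-1],
           n_cells = k, depth = 1, quant.err = 0.25, normalize = TRUE,
           distance_metric = "L1_Norm", error_metric = "max",
           quant_method = "kmeans", dim_reduction_method = "sammon")
})
names(stability_models) <- paste0("n_cells_", cell_range)
```

The fitted models can then be plotted side by side to inspect whether the tessellation layout shifts abruptly between neighboring cell counts.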