The HVT package offers a suite of R functions designed to construct topology preserving maps for in-depth analysis of multivariate data. It is particularly well-suited for datasets with numerous records. The package organizes the typical workflow into several key stages:
Data Compression: Long datasets are compressed using Hierarchical Vector Quantization (HVQ) to achieve the desired level of data reduction.
Data Projection: Compressed cells are projected into one and two dimensions using dimensionality reduction algorithms, producing embeddings that preserve the original topology. This allows for intuitive visualization of complex data structures.
Tessellation: Voronoi tessellation partitions the projected space into distinct cells, supporting hierarchical visualizations. Heatmaps and interactive plots facilitate exploration and insights into the underlying data patterns.
Scoring: A test dataset is evaluated against previously generated maps, placing its records within the existing structure. Sequential application across multiple maps is supported when required.
Temporal Analysis and Visualization: Functions in this stage examine time-series data to identify patterns, estimate transition probabilities, and visualize data flow over time.
Dynamic Forecasting: Monte Carlo simulations of a Markov chain provide forecasting for both ex-post and ex-ante scenarios, carefully handling problematic states when they are found.
Objective
This vignette is designed to systematically explore different combinations of parameters: specifically, the number of cells in the trainHVT function, and the number of clusters (k) and the number of nearest neighbors (nn) in the msm function. The goal is to evaluate the MAE (Mean Absolute Error) across all variables for each combination and identify the configuration that yields the lowest MAE, thereby selecting the most accurate model.
Number of Cells:
This value determines how many distinct cells should be formed at each level of the hierarchy during the vector quantization process in the trainHVT function.
Number of Clusters (k):
This value defines how many clusters will be formed. A range of 2 to 15 is recommended, as higher values may produce overly granular clusters that reduce interpretability and practical significance.
Number of Nearest Neighbors (nn):
This value specifies how many nearest neighbors are considered within each cluster to determine the next state for problematic states during simulation. The maximum value here is based on the cluster structure.
This chunk verifies that all the packages necessary to run this vignette are installed; if any are missing, the code installs them and attaches them to the session environment.
list.of.packages <- c("dplyr", "xts")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages) > 0) {
  install.packages(new.packages, dependencies = TRUE, verbose = FALSE, repos = "https://cloud.r-project.org/")
}
invisible(lapply(list.of.packages, library, character.only = TRUE))
Since the new functions for the hyperparameter tuning approach are not on CRAN yet, we source the scripts from a local directory.
script_dir <- "../R"
r_files <- list.files(script_dir, pattern = "\\.R$", full.names = TRUE)
invisible(lapply(r_files, function(file) source(file, echo = FALSE)))
Below is the helper function, calculate_dynamic_length_menu, used throughout this vignette to build a dynamic drop-down length menu for the data tables.
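Since the function body is sourced from the scripts above rather than shown here, the following is a minimal sketch of what it might look like; the step size and exact structure are assumptions, not the sourced implementation.

# hypothetical sketch of calculate_dynamic_length_menu; the real version is
# sourced from ../R. It returns page-size options for DT::datatable's lengthMenu.
calculate_dynamic_length_menu <- function(n_rows, step = 25) {
  sizes <- seq(step, max(step, n_rows), by = step)  # 25, 50, 75, ...
  unique(c(sizes, n_rows))                          # end with an "all rows" option
}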
In this section, we will explore the dataset used for demonstrating hyperparameter tuning, perform the necessary pre-processing steps, and prepare it for subsequent analysis.
Let’s begin by importing the dataset. The following code loads the dataset from the sample_dataset folder and displays it in a tabular format. The dataset ranges from December 1998 to September 2025.
#importing the dataset
entire_dataset <- read.csv("./sample_dataset/macro_raw_model_inputs_v5.csv")
#rounding to at most 4 decimal places
entire_dataset <- entire_dataset %>% mutate(across(where(is.numeric), ~ round(., 4)))
#displaying the table
dynamic_length_menu <- calculate_dynamic_length_menu(nrow(entire_dataset))
DT::datatable(entire_dataset,options = list(pageLength = 25,scrollX = TRUE, lengthMenu = dynamic_length_menu), rownames = FALSE)
Before proceeding with the analysis, it is crucial to examine the structure of the dataset: verifying the data types of the columns, resolving any inconsistencies, and making sure the dataset is suitable for use in the new functions.
## 'data.frame': 322 obs. of 18 variables:
## $ Date : chr "12/1/98" "1/1/99" "2/1/99" "3/1/99" ...
## $ CPI_U_All_Items_Less_Food_Energy : num 175 176 176 176 176 ...
## $ Coincident_Index : num 80.1 80.3 80.8 80.8 80.9 81.2 81.4 81.7 82 81.9 ...
## $ Copper_ETF : num 0.663 0.641 0.624 0.622 0.717 ...
## $ SnP500_ETF : num 1229 1280 1238 1286 1335 ...
## $ Spot_Oil_Price_West_Texas_Intermediate: num 11.3 12.5 12 14.7 17.3 ...
## $ US_Dollar_Index : num 94.2 96.1 98.7 100.1 101 ...
## $ Unemployment_Rate : num 4.4 4.3 4.4 4.2 4.3 4.2 4.3 4.3 4.2 4.2 ...
## $ X10_Year_Treasury_Note_Yield : num 4.65 4.72 5 5.23 5.18 5.54 5.9 5.79 5.94 5.92 ...
## $ X2_Year_Treasury_Note_Yield : num 4.51 4.62 4.88 5.05 4.98 5.25 5.62 5.55 5.68 5.66 ...
## $ High_Yield_Effective_Yield : num 10.5 10.4 10.7 10.6 10.2 ...
## $ XLY : num 19.4 20.3 20.2 21.2 21.8 ...
## $ XLP : num 14.6 14.4 14.3 14.3 13.8 ...
## $ XLE : num 11.8 11 10.9 12.5 14.3 ...
## $ XLF : num 11.4 11.6 11.7 12.1 13 ...
## $ XLV : num 17.7 18.5 18.5 19 19.7 ...
## $ XLI : num 15.4 15.2 15.4 15.7 18 ...
## $ XLB : num 12.2 11.8 11.9 12.2 15.2 ...
Since the time column is currently in character format, let’s convert it to a datetime (POSIXct) object and adjust its format from “%m/%d/%y” to “%Y-%m-%d” for readability and consistency.
Note: This conversion is re-applied after most chunks because the DT package does not display POSIXct values in the %Y-%m-%d format; they appear in UTC instead. The workaround is to first convert the timestamps to character format for display, and then convert them back to POSIXct.
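A minimal sketch of the conversion described above:

#convert the character dates ("%m/%d/%y") to POSIXct, rendered as "%Y-%m-%d"
entire_dataset$Date <- as.POSIXct(entire_dataset$Date, format = "%m/%d/%y")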
In this section, we’ll transform the data into a 12-month rate of change, standardizing the features and bringing them onto a comparable scale. Since each variable is measured in different units, this transformation allows us to analyze their relative movements more effectively. By applying a log difference, we further reduce variability and eliminate long-term trends, resulting in a more stable dataset that enhances the accuracy of subsequent forecasting.
The Rate of change is calculated as follows:
\[ \text{Rate of Change} = \log(\text{Current Value}) - \log(\text{12-Month Lag Value}) \]
Since the rate of change applies only to the features and not the timestamp, we first remove the timestamp column. The dataset is then converted to an xts object, where we compute the 12-month log difference, remove the resulting NA rows, and finally convert the data back to its original format.
#changing to xts object
entire_dataset_xts <- xts(entire_dataset[,-1], order.by = as.Date(entire_dataset$Date))
#12-month log difference (rate of change), dropping the resulting NA rows
entire_dataset_xts <- na.omit(diff(log(entire_dataset_xts), lag = 12))
#convert the xts object back to a data frame
entire_dataset <- data.frame(Date = index(entire_dataset_xts), round(coredata(entire_dataset_xts), 4))
After the log difference of 12 months, the dataset now ranges from December 1999 to September 2025. Below is the table displaying the transformed dataset.
In this section, we split the complete dataset into train and test sets, using the last 12 timestamps as the test and the remaining data as the train.
train_dataset <- entire_dataset %>% slice_head(n = nrow(entire_dataset) - 12)
test_dataset <- entire_dataset %>% slice_tail(n = 12)
The train dataset is from 1999-12-01 to 2024-09-01. Below is the table displaying the train data.
The test dataset is from 2024-10-01 to 2025-09-01. Below is the table displaying the test data.
test_dataset$Date <- as.character(test_dataset$Date)
DT::datatable(test_dataset,options = list(pageLength = 25,scrollX = TRUE,lengthMenu = calculate_dynamic_length_menu(nrow(test_dataset)),
columnDefs = list(list(width = '150px', targets = 0))),class = "nowrap display",rownames = FALSE)
It is important to understand how train and test data are utilized in each stage of the workflow. Using only the test data for scoring captures the temporal flow solely within the test period, while relying exclusively on the training data risks overfitting. Therefore, it is advisable to perform scoring on the full dataset, both train and test data, to ensure a seamless and accurate workflow. The following example demonstrates this approach.
trainHVT() = train dataset from 1999-12-01 to 2024-09-01

Calling trainHVT with default parameters, except for the number of cells: 13 (a random selection made for this demo).
# trainHVT
hvt.results <- trainHVT(
train_dataset[,-1],
n_cells = 13,
depth = 1,
quant.err = 0.2,
normalize = FALSE,
distance_metric = "L1_Norm",
error_metric = "max",
dim_reduction_method = "sammon")
scoreHVT() = train + test dataset from 1999-12-01 to 2025-09-01

# scoreHVT
scoring <- scoreHVT(entire_dataset,hvt.results)
scoring$scoredPredictedData$Date <- format(entire_dataset$Date, "%Y-%m-%d")
scoring$scoredPredictedData <- scoring$scoredPredictedData %>% dplyr::select(c(Date, everything()))
Let’s create a temporal dataset consisting of two columns: one for timestamps and another for cell IDs. This dataset will be used in dynamic forecasting to track the movement of cells over time and to construct the transition probability matrix.
# Temporal Data
temporal_data <- scoring$scoredPredictedData %>% select(Date, Cell.ID)
temporal_data$Date <- as.POSIXct(temporal_data$Date, format = "%Y-%m-%d")
Below is the display of the temporal data obtained from model scoring.
temporal_data$Date <- as.character(temporal_data$Date)
DT::datatable(temporal_data,options = list(pageLength = 25,scrollX = TRUE,lengthMenu = calculate_dynamic_length_menu(nrow(temporal_data)),
columnDefs = list(list(width = '150px', targets = 0))),class = "nowrap display",rownames = FALSE)
ids <- sort(unique(temporal_data$Cell.ID))
cat("The Unique Cell IDs in Temporal Data:", paste(ids, collapse = ", "), "\n")
## The Unique Cell IDs in Temporal Data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
Observation:
The printed Cell IDs, 1 through 13, show that none were lost during the temporal data creation process, ensuring that the resulting transition probability matrix will be accurate and support precise forecasts.
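A quick sanity check along these lines confirms the same (the range 1:13 matches the n_cells value used above):

#Cell IDs missing from the temporal data; integer(0) means none were lost
setdiff(1:13, unique(temporal_data$Cell.ID))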
This section provides a detailed overview of hyperparameter tuning: what it is, why it is important, and how it is performed.
Hyperparameter tuning is performed using a grid search over a defined range of parameters, such as the number of cells (n_cell), the number of clusters (k), and the number of nearest neighbors (nn). For each combination, the trainHVT, scoreHVT, and msm functions are executed, and the Mean Absolute Error (MAE) is computed for the forecasting period. The configuration with the lowest MAE is selected as the Champion model.
The parameters k and nn are applied only when problematic states are detected in a cell configuration. This enables accurate identification of the Champion model while minimizing manual testing and improving efficiency.
Below is a detailed explanation of each hyperparameter:
Hyperparameter 1 – Number of Cells in trainHVT: determines how many distinct cells are formed at each level of the hierarchy during vector quantization, as described above.
Hyperparameter 2 – Number of Clusters (k): used in the msm function when problematic states are identified during the simulation process; the cell centroids are grouped into k clusters based on centroid distance in 2D space.
Hyperparameter 3 – Number of Nearest Neighbors (nn): used in the msm function when problematic states are identified during the simulation process; the next state is chosen from the nearest neighbors within a cluster.
Below is the workflow for the hyperparameter tuning:
HVT Model Training: For each value in the ncell range, the HVT model is trained on the training dataset using trainHVT. Each training run is executed sequentially, one after the other.
HVT Model Scoring, Temporal Data Preparation, and Transition Probability Computation: Once the HVT models are trained, they are scored using scoreHVT. The resulting Cell.ID assignments are merged with the entire dataset’s time column to generate temporal data representing state transitions. This temporal data is then used to construct a transition probability matrix from the training period only. From this point onward, the process executes in parallel.
Identify Problematic States and Run MSM Simulations: For each ncell value:
If problematic states are not detected, the baseline simulations (without the k and nn requirement) are run for that cell configuration.
If problematic states are detected, the cell centroids are clustered using the specified range of k values, and the next state is decided from the nearest-neighbor value. The status and configuration of these iterations are stored.
NOTE - What are problematic states, exactly? Certain edge cases can cause a simulation to run as a straight line (stuck in a single state) or as a dead state-transition cycle (bouncing back and forth between two states only), without any variability or exploration of other states. Such states are called problematic states, and they are caused by Absent Transitions, Self-State-Only Transitions, and Cyclic Transitions.
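For intuition, here is a tiny, hypothetical transition matrix in which states 1 and 2 form a dead cycle; any simulation entering either state ping-pongs between them forever:

#toy example: state 1 always moves to 2, and state 2 always moves back to 1
P <- matrix(c(0, 1,
              1, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(from = c("1", "2"), to = c("1", "2")))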
Please refer to this vignette to learn more about problematic states and how they can be resolved.
Calculate MAE: MSM simulations are run over the forecast period (expost_forecasting). MAE is calculated using any or all of three metrics: Mean, Median, and Mode. For each variable, MAE is calculated as:
\[\text{MAE} = \frac{1}{T}\sum_{t=1}^{T} \left|\text{Actual}_{t} - \text{Predicted}_{t}\right|\] where,
T = Total number of forecast periods (the forecasting horizon)
t = Individual time period counter (goes from 1 to T)
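In R, this per-variable MAE reduces to a one-line helper (a sketch, not the package's internal code):

#mean absolute error over the T forecast periods
mae <- function(actual, predicted) mean(abs(actual - predicted))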
For the states: the predicted state at each timestamp is the mean, median, or mode (per the chosen mae_metric) of all simulated values at that timestamp, and it is compared against the actual state obtained from the scored temporal data.
For the variables: the predicted states from the simulations are translated into their corresponding feature centroids from trainHVT (the HVQ object) and, if the data were normalized during model training, scaled back to raw dataset values by multiplying by the standard deviation of the raw feature and adding the mean of the raw feature. The actual values come from the primary dataset, which was log-differenced in section 3.3.
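A hedged sketch of that translation, where centroids (one row per Cell.ID, one column per feature), feature_sds, and feature_means are hypothetical names:

#map each predicted Cell.ID to its feature centroid
predicted_features <- centroids[predicted_states, , drop = FALSE]
#un-normalize: multiply by the raw sds and add the raw means (only if normalize = TRUE)
predicted_raw <- sweep(sweep(predicted_features, 2, feature_sds, `*`),
                       2, feature_means, `+`)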
Identify and Aggregate Optimal Parameters: MAE results are recorded for every combination of ncell, k, and nearest-neighbor values across all iterations. Comprehensive results are displayed in tables, including a table of all results and a separate table highlighting the top three performing combinations.
Visualize Results: An MAE plot is generated, displaying the number of cells on the X-axis and MAE on the Y-axis to visualize all the iterations.
Note: The expost_forecasting period is fixed for all iterations.
The HVTMSMoptimization function automates the entire workflow described above. Rather than executing trainHVT, scoreHVT, and msm separately, it encapsulates all these steps into a single call. This function systematically explores the parameter space to identify the champion combination of clustering granularity and temporal modeling parameters for superior forecasting performance.
It accepts the following parameters:
entire_dataset – The train dataset containing timestamps and features; used for HVT model training, scoring, and MSM simulations.
expost_forecasting – The test dataset containing timestamps and features; used for scoring and to calculate the Mean Absolute Error (MAE) of MSM predictions.
time_column – Column name of the timestamps; not included in HVT model training but utilized for building temporal sequences and MSM analysis.
ncell_range – Possible numbers of cells for tessellation in trainHVT; min = 3.
k_range – Range of clusters in msm when problematic states are found; min = 2.
nn_range – Range of nearest-neighbor counts within a cluster to select as the next state for problematic states; min = 1. The maximum value is the number of states in the cluster minus the number of problematic states in that cluster.
num_simulations – Number of simulations in the MSM function (more simulations take more time).
mae_metric – Method for MAE calculation ("mean", "median", "mode", or "all").
depth – Integer specifying the hierarchy depth for vector quantization.
quant.err – Numeric quantization error threshold for vector quantization.
normalize – Logical value indicating whether to normalize the input dataset.
distance_metric – Distance measure used (e.g., "L1_Norm", "L2_Norm").
error_metric – Error calculation method for HVT (e.g., "max", "mean").
dim_reduction_method – Dimensionality reduction technique (e.g., "sammon", "tsne", "umap").
parallel – Enables parallel processing across cores to speed up optimization (default: TRUE).
HVTMSMoptimization
Input Parameters for the HVTMSMoptimization function:
entire_dataset: 1999-12-01 to 2024-09-01
expost_forecasting: 2024-10-01 to 2025-09-01
num_simulations: 100
mae_metric: median
hvt_params with the default values of the trainHVT function
In this vignette, we are using the following hyperparameter ranges:
ncell_range: 10 to the maximum number of cells (i.e., 298/2 = 149)
k_range: 2:9
nn_range: 2:7
max_cells <- round((nrow(train_dataset)/2))
msm_model <- HVTMSMoptimization(
entire_dataset = train_dataset,
expost_forecasting = test_dataset,
time_column = "Date",
ncell_range = 10:max_cells,
k_range = 2:9,
nn_range = 2:7,
num_simulations = 100,
mae_metric = "median",
hvt_params = list(
depth = 1,
quant.err = 0.2,
normalize = FALSE, #since the data is already log differenced
distance_metric = "L1_Norm",
error_metric = "max",
dim_reduction_method = "sammon"),
parallel = TRUE)
The table below displays all tested combinations of parameters to find the optimal model for the given dataset.
The status column in the table displays various messages. The following guide explains the meaning of each message:
Successful Iteration by Handling Problematic state: The simulation ran successfully after resolving problematic states through clustering and selecting the next state. The MAE was then computed for the corresponding combination of n_cells, k, and nn.
Successful Iteration without Problematic State: No problematic states were detected, allowing the baseline simulation to execute smoothly. The MAE was computed without the need for specifying k or nn.
Isolated Problematic States: Some clusters may contain problematic states that are either solitary or grouped exclusively with other problematic states, leaving no non-problematic states available for neighbor selection.
Limited Availability of Neighbors (Req = n, Pre = n): A problematic state may have a number of valid neighbors that does not match the requested count; the neighbors present may be fewer or greater than the requested number.
Mathematical Bound Reached (message): Certain configurations of k or nn may violate inherent mathematical constraints, such as k exceeding the total number of cells or nn exceeding (cells - k). Example 1: k = 6 is invalid for n_cells = 5. Example 2: for n_cells = 10 and k = 3, the maximum permissible nn is 7, so nn = 8 is invalid. A sketch of this bound check follows the list below.
ERROR: (message): The simulation was halted due to a system, data, or processing failure. Further details are provided in the accompanying message.
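A hedged sketch of the bound check described above (not the package's internal code):

#returns TRUE when an (n_cells, k, nn) combination respects both bounds
is_valid_combo <- function(n_cells, k, nn) {
  k <= n_cells && nn <= (n_cells - k)
}
is_valid_combo(5, 6, 1)   # FALSE: k exceeds n_cells
is_valid_combo(10, 3, 8)  # FALSE: nn exceeds 10 - 3 = 7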
Legend for color strips in the table below:
The table below displays the top three lowest MAE values obtained from HVTMSMoptimization, calculated across all combinations of the number of cells, clusters (k), and nearest neighbors (nn).
The color legend is as follows:
Champion : GREEN
Challenger 1: SILVER
Challenger 2: PLATINUM
This section presents an interactive plot that visualizes MAE on the Y-axis and the number of cells on the X-axis. Hovering over a point reveals its configuration: the number of cells, k, nn, and MAE. The red dot indicates the highest MAE, while the green dot marks the lowest.
Note: Each datapoint represents the lowest MAE obtained for that specific cell.
From the above plot, we can see that the 27-cell configuration has the lowest MAE, 0.0717, with the hyperparameters k and nn both set to NA (no problematic states required handling).
We will now execute the msm function using the champion model parameters obtained from HVTMSMoptimization. This process proceeds step by step:
Step 1: Train the HVT model on the train dataset using trainHVT, specifying the number of cells from the champion model.
Step 2: Score the model on the entire dataset (train + test) using scoreHVT.
Step 3: Create a temporal dataset containing only timestamps and cell IDs, which will be used to construct the transition probability matrix.
Step 4: Run the msm function to perform Ex-Post dynamic forecasting using the champion parameters from HVTMSMoptimization.
The temporal dataset is generated from the output of scoreHVT, creating a dataframe that is then sliced to include only the training period, excluding the test period.
# Temporal Data
temporal_data <- scoring$scoredPredictedData %>% select(Date, Cell.ID)
temporal_data$Date <- as.POSIXct(temporal_data$Date, format = "%Y-%m-%d")
train_end_timestamp <- max(as.POSIXct(train_dataset$Date), na.rm = TRUE)
temporal_train <- temporal_data %>% dplyr::filter(Date <= train_end_timestamp) %>% dplyr::arrange(Date)
Below is the display of temporal data for train dataset.
temporal_train$Date <- as.character(temporal_train$Date)
DT::datatable(temporal_train,options = list(pageLength = 25,scrollX = TRUE,lengthMenu = calculate_dynamic_length_menu(nrow(temporal_train)),
columnDefs = list(list(width = '150px', targets = 0))),class = "nowrap display",rownames = FALSE)
Using the above train-only temporal dataset, the transition probability matrix is calculated and displayed.
# Transition matrix
prob_trans_matx <- getTransitionProbability(df = temporal_train,cellid_column = "Cell.ID",time_column = "Date")
DT::datatable(prob_trans_matx,options = list(pageLength = 25,scrollX = TRUE,lengthMenu = calculate_dynamic_length_menu(nrow(prob_trans_matx)),
columnDefs = list(list(width = '150px', targets = 0))),class = "nowrap display",rownames = FALSE)
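For intuition, the same matrix can be approximated by hand with row-normalized transition counts; the following is a sketch equivalent in spirit to getTransitionProbability, not its actual implementation:

#empirical transition probabilities: for each current state, the share of
#times each next state followed it in the training sequence
states <- temporal_train$Cell.ID
manual_probs <- prop.table(table(from = head(states, -1), to = tail(states, -1)), margin = 1)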
Next, the initial state for forecasting is set to the final state of the train period, which acts as t−1 for the first timestamp (t0) of the test period.
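A minimal sketch of how this initial state can be pulled from the train-only temporal data (defining the initial_state variable used in the msm call below):

#the last scored Cell.ID of the training period acts as t-1 for the forecast
initial_state <- tail(temporal_train$Cell.ID, 1)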
Ex-Post forecast time period:
Transition Probability Matrix: 1999-12-01 to 2024-09-01
Initial timestamp: 2024-09-01
Initial state: 16
Residuals: Actual - Predicted Median
MAE Metric: Median
NOTE: The following line of code verifies whether the champion combination was executed without specifying k and nn (i.e., Successful Iteration without Problematic State). Based on this check, it sets handle_problematic_states in msm to FALSE if no values were provided, and TRUE otherwise.
lowest_mae_model <- !(is.na(best_config$k) || best_config$k == "-" ||
is.na(best_config$`Number of nearest neighbors`) || best_config$`Number of nearest neighbors` == "-")
Let’s run msm for ex-post forecasting using the Champion model parameters.
ex_post <- msm(
state_time_data = temporal_data,
forecast_type = "ex-post",
transition_probability_matrix = prob_trans_matx,
initial_state = initial_state,
num_simulations = 100,
scoreHVT_results = scoring,
trainHVT_results = hvt.results,
actual_data = test_dataset,
raw_dataset = entire_dataset,
handle_problematic_states = lowest_mae_model,
k = if(lowest_mae_model) as.numeric(best_config$k) else NULL,
n_nearest_neighbor = if(lowest_mae_model) as.numeric(best_config$`Number of nearest neighbors`) else NULL,
mae_metric = "median",
show_simulation = FALSE,
time_column = "Date")
Let’s view the ex-post forecasting plots as interactive Plotly visuals with timestamps on the X-axis and states on the Y-axis, showing all variables with hover details for timestamp and value. The studentized residuals plot displays standardized prediction errors with blue dashed lines at ±1σ and a black dashed line at 0, highlighting significant deviations between predicted and actual values.
The Ex-post forecasting will range from 2024-10-01 to 2025-09-01 as per the selected test period.
The table below displays the MAE values for each feature and state, where the Mean MAE is computed only from the feature variables, excluding the states.
mae_values_centroid <- sapply(ex_post[["plots"]][[1]], function(x) x[["mae"]])
mae_value_states <- ex_post[["plots"]][[2]][["mae"]]
mean_mae <- round(mean(mae_values_centroid),4)
mae_values <- c(mae_value_states,mae_values_centroid,mean_mae)
plot_labels <- c(
"States", "CPI_U_All_Items_Less_Food_Energy","Coincident_Index" ,"Copper_ETF" , "SnP500_ETF","Spot_Oil_Price_West_Texas_Intermediate","US_Dollar_Index" , "Unemployment_Rate","X10_Year_Treasury_Note_Yield" ,"X2_Year_Treasury_Note_Yield", "High_Yield_Effective_Yield" ,"XLY" ,"XLP","XLE" , "XLF", "XLV","XLI", "XLB" , "Mean of Variables' MAE")
data_1 <- data.frame(Plot = plot_labels,MAE = mae_values)
DT::datatable(data_1, rownames = FALSE, options = list(pageLength = 25))
Result Analysis:
The analysis indicates that the Mean of Variables' MAE is 0.0867, closely aligning with the MAE obtained from the HVTMSMoptimization process (0.0717).
This close correspondence validates the consistency of the modeling approach. The minor deviation, within an expected error range of 0.01 to 0.1, can be attributed to the inherent stochasticity of the MSM simulation, which relies on random number generation for state transitions.
Automated Optimization Framework - The implementation of the workflow in HVTMSMoptimization is highly valuable, as it automates the complex sequence of training, scoring, simulation, and MAE calculation. This consolidation significantly streamlines the computationally intensive tuning process by allowing systematic parameter exploration and automatically handling issues like problematic states in a single call.
Hyperparameter Tuning Methodology - Hyperparameter selection was performed via grid search across three key parameters: Number of Cells (ncell), Number of Clusters (k), and Number of Nearest Neighbors (nn). The k and nn parameters are applied only for problematic states. For each combination, the Mean Absolute Error (MAE) against test data was calculated, with the configuration achieving the lowest MAE chosen as the Champion model.
Handling Problematic States - Problematic states are edge cases where simulations get stuck in a single state or enter dead transition cycles, caused by absent, self-state-only, or cyclic transitions. When detected, the framework automatically applies clustering (k) to segment the cell centroids and uses nearest neighbors (nn) to select a valid next state. This strategic use of k and nn ensures robust state-switching dynamics and mitigates simulation issues.
Threshold-based Validation - The validation compares the Mean Absolute Error (MAE) from the Ex-Post dynamic forecast against the global minimum MAE achieved during optimization. The model’s consistency is validated if the deviation between these two MAE values falls within an expected tolerance range of 0.01 to 0.1. This tolerance accounts for the inherent stochasticity of the MSM simulations, which uses random number generation for state transitions.