Dynamic Forecasting of Macroeconomic Time Series Dataset using HVT

Zubin Dowlaty, Chepuri Gopi Krishna, Siddharth Shorya, Vishwavani

Created Date: 2024-11-12
Modified Date: 2026-01-22

1. Background

The HVT package offers a suite of R functions designed to construct topology preserving maps for in-depth analysis of multivariate data. It is particularly well-suited for datasets with numerous records. The package organizes the typical workflow into several key stages:

  1. Data Compression: Long datasets are compressed using Hierarchical Vector Quantization (HVQ) to achieve the desired level of data reduction.

  2. Data Projection: Compressed cells are projected into one and two dimensions using dimensionality reduction algorithms, producing embeddings that preserve the original topology. This allows for intuitive visualization of complex data structures.

  3. Tessellation: Voronoi tessellation partitions the projected space into distinct cells, supporting hierarchical visualizations. Heatmaps and interactive plots facilitate exploration and insights into the underlying data patterns.

  4. Scoring: The test dataset is evaluated against the previously generated maps, placing new observations within the existing structure. Sequential application across multiple maps is supported if required.

  5. Temporal Analysis and Visualization: Functions in this stage examine time-series data to identify patterns, estimate transition probabilities, and visualize data flow over time.

What’s New?

HVT – Version 25.2.4

This notebook introduces a new feature, MSM (Monte Carlo Simulation of Markov Chains), in the HVT package. MSM is designed for dynamic forecasting of time series data, using a transition probability matrix to forecast n states ahead.

The workflow supports both ex-post and ex-ante forecasting.

The notebook provides a step-by-step walkthrough covering data preparation, model setup, and forecast generation. It also highlights challenges arising from transition probability issues in certain states, outlines mechanisms for handling such problematic states, and evaluates forecast performance using appropriate accuracy metrics.
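Conceptually, the forecasting engine is a Monte Carlo simulation over a state transition probability matrix. Below is a minimal base-R sketch of that idea; the 3-state matrix and all values are invented for illustration, and this is not the HVT package's own implementation:

```r
set.seed(42)

# Toy 3-state transition probability matrix (rows sum to 1); values invented
P <- matrix(c(0.7, 0.2, 0.1,
              0.3, 0.4, 0.3,
              0.2, 0.3, 0.5),
            nrow = 3, byrow = TRUE)

# Simulate one Markov-chain path of n_ahead states from a starting state
simulate_chain <- function(P, start_state, n_ahead) {
  path <- integer(n_ahead)
  current <- start_state
  for (i in seq_len(n_ahead)) {
    current <- sample(seq_len(nrow(P)), size = 1, prob = P[current, ])
    path[i] <- current
  }
  path
}

# Monte Carlo: repeat many simulated paths and summarize the terminal state
paths <- replicate(500, simulate_chain(P, start_state = 1, n_ahead = 6))
prop.table(table(paths[6, ]))  # empirical distribution of the 6-step-ahead state
```

Averaging many such simulated paths is what turns a single transition matrix into an n-step-ahead forecast distribution.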

HVT – Version 25.2.8

This version of the notebook introduces an enhancement to dynamic forecasting: raw series ex-ante forecasting. In this approach, a simple 12-month direct lookback method is used, where raw historical values are combined with the forecasted year-over-year (YoY) changes to generate raw-level forecasts for each feature over the ex-ante period.
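Assuming the forecasted YoY changes are on the log-difference scale defined in Section 3.3, the direct lookback inverts that transform: the raw forecast is the raw value 12 months earlier scaled by exp of the forecasted YoY change. A toy sketch with invented numbers:

```r
# Raw value observed 12 months before the forecast month (invented)
raw_lag12 <- 100

# Forecasted YoY change for the forecast month, on the log-difference scale
yoy_forecast <- log(1.10)   # a forecasted 10% year-over-year increase

# Invert the log difference: raw_t = raw_{t-12} * exp(yoy_t)
raw_forecast <- raw_lag12 * exp(yoy_forecast)
raw_forecast  # 110
```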


2. Notebook Requirements

This chunk verifies that all packages required to run this vignette are installed; any missing packages are installed and then attached to the session environment.

list.of.packages <- c("dplyr", "tidyr","patchwork", "feather","ggplot2","kableExtra", "htmltools",
                      "plotly","tibble","purrr", "gganimate", "DT","readr", "NbClust", "HVT")

new.packages <-list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages))
  install.packages(new.packages, dependencies = TRUE, verbose = FALSE, repos='https://cloud.r-project.org/')
invisible(lapply(list.of.packages, library, character.only = TRUE))

Below is a helper function that builds a dynamic page-length drop-down menu for the data tables.

calculate_dynamic_length_menu <- function(total_entries, base_step = 100) {
  # Round the row count up to the nearest multiple of base_step (minimum 100)
  max_option <- ceiling(total_entries / base_step) * base_step
  max_option <- max(max_option, 100)
  # Page-length choices in steps of base_step, prefixed with a 10-row default
  options <- seq(base_step, by = base_step, length.out = max_option / base_step)
  options <- c(10, options)
  return(options)
}
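For example, the 322-row dataset loaded in the next section yields page-length options of 10 plus multiples of 100 up to 400 (the helper is repeated here so the snippet is self-contained):

```r
calculate_dynamic_length_menu <- function(total_entries, base_step = 100) {
  max_option <- ceiling(total_entries / base_step) * base_step
  max_option <- max(max_option, 100)
  options <- seq(base_step, by = base_step, length.out = max_option / base_step)
  options <- c(10, options)
  return(options)
}

# 322 rows round up to 400, giving steps of 100 plus the 10-row default
calculate_dynamic_length_menu(322)  # 10 100 200 300 400
```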

3. Dataset Preparation and Exploration

3.1 Dataset Loading

Let’s start by importing the dataset. The code below reads and displays it.

csv_path <- "sample_dataset/macro_eco_data_2025.csv"

entire_dataset_raw <- read.csv(csv_path)
entire_dataset_raw <-  entire_dataset_raw %>% mutate(across(where(is.numeric), ~ round(., 4)))
dynamic_length_menu <- calculate_dynamic_length_menu(nrow(entire_dataset_raw))
datatable(entire_dataset_raw,options = list(pageLength = 10,scrollX = TRUE, lengthMenu = dynamic_length_menu), rownames = FALSE)

This dataset includes a collection of key economic and financial indicators. These indicators are essential for monitoring macroeconomic performance, analyzing market trends, and assessing financial stability. The data ranges from December 1998 to September 2025.

3.2 Dataset Preprocessing

Before proceeding, it is crucial to examine the structure of the dataset. This involves verifying the data types of the columns and resolving any inconsistencies. Make sure all data types are accurate and suitable for the intended functions.

str(entire_dataset_raw)
## 'data.frame':    322 obs. of  18 variables:
##  $ t          : chr  "1998/12" "1999/01" "1999/02" "1999/03" ...
##  $ CPI_Food   : num  175 176 176 176 176 ...
##  $ COIN       : num  80.1 80.3 80.8 80.8 80.9 81.2 81.4 81.7 82 81.9 ...
##  $ Copper_ETF : num  0.663 0.641 0.624 0.622 0.717 ...
##  $ SnP500_ETF : num  1229 1280 1238 1286 1335 ...
##  $ Spot_Oil   : num  11.3 12.5 12 14.7 17.3 ...
##  $ USD_Index  : num  94.2 96.1 98.7 100.1 101 ...
##  $ Unemp_Rate : num  4.4 4.3 4.4 4.2 4.3 4.2 4.3 4.3 4.2 4.2 ...
##  $ Y10_Note   : num  4.65 4.72 5 5.23 5.18 5.54 5.9 5.79 5.94 5.92 ...
##  $ Y2_Note    : num  4.51 4.62 4.88 5.05 4.98 5.25 5.62 5.55 5.68 5.66 ...
##  $ Yield_Curve: num  10.5 10.4 10.7 10.6 10.2 ...
##  $ XLY        : num  19.4 20.3 20.2 21.2 21.8 ...
##  $ XLP        : num  14.6 14.4 14.3 14.3 13.8 ...
##  $ XLE        : num  11.8 11 10.9 12.5 14.3 ...
##  $ XLF        : num  11.4 11.6 11.7 12.1 13 ...
##  $ XLV        : num  17.7 18.5 18.5 19 19.7 ...
##  $ XLI        : num  15.4 15.2 15.4 15.7 18 ...
##  $ XLB        : num  12.2 11.8 11.9 12.2 15.2 ...

Since the time column is in character format, we convert it to datetime (POSIXct) format.

entire_dataset_raw$t <- as.POSIXct(paste0(entire_dataset_raw$t, "/01"), format = "%Y/%m/%d")

3.3 Dataset Transformation

We transform the data to compute the 12-month rate of change, which standardizes the features and brings them to a comparable scale, simplifying the analysis of their relative changes. Log difference reduces variability and removes trends, stabilizing the data for more accurate forecasting.

The Rate of change is calculated as follows:

\[ \text{Rate of Change} = \log(\text{Current Value}) - \log(\text{12-Month Lag Value}) \]
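In R, this is simply a lagged difference of the logged series. A toy example with invented values:

```r
# Toy monthly series: 13 values, so exactly one 12-month change exists
x <- c(100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 110)

# 12-month rate of change as a log difference
roc <- diff(log(x), lag = 12)
round(roc, 4)  # 0.0953, i.e. log(110) - log(100)
```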

entire_dataset <- entire_dataset_raw
features_data <- entire_dataset %>% select(-t) %>% colnames()
entire_dataset[features_data] <- lapply(entire_dataset[features_data], function(x) as.numeric(as.character(x)))

invisible(lapply(features_data, function(col) {
  entire_dataset[[col]] <<- entire_dataset[[col]] %>% log()
  entire_dataset[[col]] <<- c(rep(NA, 12), round(diff(entire_dataset[[col]], 12),4))}))
entire_dataset <- entire_dataset %>% na.omit() %>% data.frame()
rownames(entire_dataset) <- NULL

After taking the 12-month log difference, the first 12 rows (months) are dropped, so the dataset ranges from December 1999 to September 2025. Below is the table displaying the transformed dataset.

entire_dataset$t <- as.character(entire_dataset$t)
dynamic_length_menu <- calculate_dynamic_length_menu(nrow(entire_dataset))
datatable(entire_dataset,options = list(pageLength = 10,scrollX = TRUE,lengthMenu = dynamic_length_menu,columnDefs = list(list(width = '150px', targets = 0))),
class = "nowrap display",rownames = FALSE)
entire_dataset$t <- as.POSIXct(entire_dataset$t, format = "%Y-%m-%d")

3.4 Data split

The ultimate goal of this notebook is to dynamically forecast the HVT states in both ex-post and ex-ante scenarios. To structure the analysis, we define the timelines for each part: a training window (December 1999 to September 2024) and an ex-post forecasting window (October 2024 to September 2025):

trainHVT_data <- entire_dataset[-tail(seq_len(nrow(entire_dataset)), 12), ]
expost_forecasting <- entire_dataset[entire_dataset$t > "2024-09-01" & entire_dataset$t <= "2025-09-01", ]
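The split logic above can be checked on a toy monthly data frame (dates and values invented; same pattern of dropping the last 12 rows for training and filtering the last 12 months for ex-post evaluation):

```r
set.seed(1)

# Toy monthly data from Jan 2023 to Dec 2024 (24 rows, invented values)
toy <- data.frame(
  t = seq(as.POSIXct("2023-01-01"), by = "month", length.out = 24),
  x = rnorm(24)
)

# Hold out the last 12 months for ex-post evaluation; train on the rest
train  <- toy[-tail(seq_len(nrow(toy)), 12), ]
expost <- toy[toy$t > "2023-12-01" & toy$t <= "2024-12-01", ]

nrow(train)   # 12
nrow(expost)  # 12
```

Note that the two windows partition the series with no overlap: the training window ends exactly where the ex-post window begins.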

3.5 EDA Plots

For the Exploratory Data Analysis (EDA), we will create a summary statistics table and a series of plots to visualize the dataset’s distribution, trends, and relationships. These plots provide insights into the data structure, enabling a better understanding of the underlying patterns and correlations.

Dataset used for EDA: 1999-12-01 to 2025-09-01


Summary Table

edaPlots(entire_dataset)

Histograms

edaPlots(entire_dataset, output_type = 'histogram')


Boxplots

edaPlots(entire_dataset, output_type = 'boxplot')


Correlation Plot

edaPlots(entire_dataset, output_type = 'correlation')


Time Series Plots

recession_periods <- list(c("2001-03-01", "2001-11-01"),c("2007-12-01", "2009-06-01"),c("2020-02-01", "2020-04-01"))
recession_periods <- lapply(recession_periods, function(x) as.POSIXct(x))
edaPlots(entire_dataset, time_column = "t", output_type = "timeseries", grey_bars = recession_periods)



4. Constructing and Visualizing the HVT Model

The dataset is prepped and ready for constructing the HVT model, which is the first and most prominent step. Model Training involves applying Hierarchical Vector Quantization (HVQ) to iteratively compress and project data into a hierarchy of cells. The process uses a quantization error threshold to determine the number of cells and levels in the hierarchy. The compressed data is then projected onto a 2D space, and the resulting tessellation provides a visual representation of the data distribution, enabling insights into the underlying patterns and relationships.

We use the trainHVT function to compress the dataset; the timestamp feature is not needed for this process and is excluded.

Input Parameters

hvt.results <- trainHVT(
  trainHVT_data[,-1],
  n_cells = 75,
  depth = 1,
  quant.err = 0.25,
  normalize = TRUE,
  distance_metric = "L1_Norm",
  error_metric = "max",
  quant_method = "kmeans",
  dim_reduction_method = "sammon")
summary(hvt.results)
##   segmentLevel noOfCells noOfCellsBelowQuantizationError percentOfCellsBelowQuantizationErrorThreshold
## 1            1        75                              56                                          0.75
## parameters: n_cells: 75, quant.err: 0.25, distance_metric: L1_Norm, error_metric: max, quant_method: kmeans

The value of percentOfCellsBelowQuantizationErrorThreshold is crucial for evaluating the model’s performance and the quality of the compression. It is recommended to construct a model in which at least 80% of the cells fall below the quantization error threshold.

The value of 0.75 indicates that 75% compression has been achieved. Typically, the number of cells is increased until at least 80% compression is reached. However, for this vignette we proceed with the current level of compression, as this specific combination of cells and dataset intentionally induces problematic transition states, which are useful for demonstrating the proposed solution.
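The 0.75 follows directly from the summary above: 56 of 75 cells fall below the threshold. Reaching the recommended 80% would require 60 cells below it:

```r
n_cells <- 75
n_below <- 56   # from summary(hvt.results) above

# Share of cells below the quantization error threshold
round(n_below / n_cells, 2)      # 0.75

# Cells needed below the threshold to hit the recommended 80%
ceiling(n_cells * 80 / 100)      # 60
```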

Visual Stability & Aesthetics

For the visual stability and aesthetics check, we plot and compare our current model with tessellations from up to three cells above our current cell count of 75. This comparison helps ensure the model’s structural integrity across a similar range of cells and allows for the identification of any significant sudden structural changes.

78 Cells

plotHVT(hvt.results_2,plot.type = '2Dhvt', cell_id = TRUE)

77 Cells

plotHVT(hvt.results_3,plot.type = '2Dhvt', cell_id = TRUE)

76 Cells

plotHVT(hvt.results_4,plot.type = '2Dhvt', cell_id = TRUE)

75 Cells

plotHVT(hvt.results,plot.type = '2Dhvt', cell_id = TRUE)