Overview

In this tutorial, we work through a complete conformal workflow using Coracle.

Coracle is a conformal prediction framework for multimodal AI with continuous outcomes that accommodates early, late, and intermediate fusion. It provides marginal coverage guarantees that are valid in finite samples, without relying on distributional assumptions.

In this tutorial, we apply Coracle to a pregnancy dataset to produce finite-sample-valid prediction intervals and compare its coverage behavior against prediction intervals from BART and BayesCOOP. The results illustrate how Coracle provides robust, distribution-free uncertainty quantification across fusion paradigms, particularly in settings where competing methods yield overly narrow intervals.

1. Installation and Package Setup

We install Coracle from GitHub if it is not already available in our R library. We recommend doing this from a fresh R session so we are not accidentally masking a function from an older development build.

if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("himelmallick/Coracle")
## Using GitHub PAT from the git credential store.
## Skipping install of 'Coracle' from a github remote, the SHA1 (479acc0e) has not changed since last install.
##   Use `force = TRUE` to force installation
library(Coracle)

Next we load the packages used in the rest of the tutorial: Coracle (main method), dplyr (data pre-processing), bartMachine, SuperLearner, IntegratedLearner, and BayesCOOP (base learners and fusion frameworks), and ggplot2 (visualization).
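
Assuming all of these packages are installed, a minimal load sequence is:

```r
library(Coracle)           # main method: conformal prediction across fusion choices
library(dplyr)             # data pre-processing
library(bartMachine)       # BART base learner
library(SuperLearner)      # ensemble wrappers used by IntegratedLearner
library(IntegratedLearner) # early and late fusion
library(BayesCOOP)         # intermediate fusion
library(ggplot2)           # visualization
```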

We also give Java-based models (e.g., BART via bartMachine) sufficient heap memory and smoother garbage collection, which prevents crashes and long freezes during computation.
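
One common way to do this (the heap size below is an illustrative assumption; tune it to your machine) is to raise the Java heap via `java.parameters`. This option only takes effect before rJava, a dependency of bartMachine, is first attached, so in practice the line belongs at the very top of the script:

```r
# Must run BEFORE rJava/bartMachine are attached, ideally first in the script.
# "-Xmx8g" (8 GB heap) is an illustrative value; adjust to your hardware.
options(java.parameters = "-Xmx8g")
```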

2. Load the StelzerDOS Dataset

We load the StelzerDOS dataset from the IntegratedLearner GitHub repository.
The dataset is divided into training and testing sets, each provided as a list containing three components:

  • feature_table – a matrix of features (e.g., microbial taxa, metabolites) with samples as columns and features as rows.
  • feature_metadata – annotations describing each feature, such as modality or biological type.
  • sample_metadata – accompanying sample-level information, including phenotype, covariates, and grouping variables.
# loading the required datasets
data_train <- get(load(url("https://raw.githubusercontent.com/himelmallick/IntegratedLearner/master/data/StelzerDOS.RData")))
data_test <- get(load(url("https://raw.githubusercontent.com/himelmallick/IntegratedLearner/master/data/StelzerDOS_valid.RData")))
rm(pcl)  # drop the raw object left behind in the workspace by load()
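
Before preprocessing, it can help to sanity-check the loaded objects (exact dimensions depend on the dataset version, so no output is shown here):

```r
# feature_table: features x samples; sample_metadata: one row per sample
dim(data_train$feature_table)
table(data_train$feature_metadata$featureType)  # modalities present
nrow(data_train$sample_metadata)
```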

3. Preprocessing the Data

We remove metabolomics features since they are not present in the testing set. Then we realign the feature tables to the filtered metadata so that only valid rows and columns remain. Finally, we keep baseline samples only—training samples labeled "A" and test samples labeled "G1"—ensuring that features and sample metadata stay synchronized across datasets.

# extract info from train data
feature_table <- data_train$feature_table
sample_metadata <- data_train$sample_metadata
feature_metadata <- data_train$feature_metadata

# Remove metabolomics to match with validation
feature_metadata <- feature_metadata %>% dplyr::filter(featureType!='Metabolomics')
feature_table <- feature_table[rownames(feature_metadata),]

# consider only baseline observations
positions <- grep("A", colnames(feature_table), ignore.case = TRUE)
feature_table <- feature_table[, positions]
sample_metadata <- sample_metadata[positions, ]
rm(positions)

# extract info from test data
feature_table_valid <- data_test$feature_table
sample_metadata_valid <- data_test$sample_metadata
feature_metadata_valid <- data_test$feature_metadata

# extracting baseline measurements
positions <- grep("G1", colnames(feature_table_valid))
feature_table_valid <- feature_table_valid[, positions]
sample_metadata_valid <- sample_metadata_valid[positions, ]
rm(positions)

4. Dividing the Training Set into Proper Training Set and Calibration Set

Next we divide the training set into a proper training set and a calibration set in a 3:1 ratio.

# construct the proper training set
set.seed(1234)
n_samples <- ncol(feature_table)
train_elements <- sample(seq_len(n_samples), floor(0.75 * n_samples), replace = FALSE)
feature_table_train <- feature_table[, train_elements]
sample_metadata_train <- sample_metadata[train_elements, ]
feature_metadata_train <- feature_metadata

# construct the calibration set
feature_table_calib <- feature_table[, -train_elements]
sample_metadata_calib <- sample_metadata[-train_elements, ]
feature_metadata_calib <- feature_metadata

We filter low-prevalence features and ensure common features across training, calibration, and validation sets.

# Filtering of features and selecting common features across proper train, calib and validation sets

## Filtering of features from proper training set
data_train <- list(feature_table = feature_table_train, sample_metadata = sample_metadata_train, feature_metadata = feature_metadata_train)
dl_train <- Coracle:::gen_datalist(data_train)
xList_train <- dl_train$xList; y_train <- dl_train$y
xList_train <- lapply(xList_train, function(x) Coracle:::filter_features(x, abd_thresh = 0, prev_thresh = 0.1))

## Filtering of features from calibration set
data_calib <- list(feature_table = feature_table_calib, sample_metadata = sample_metadata_calib, feature_metadata = feature_metadata_calib)
dl_calib <- Coracle:::gen_datalist(data_calib)
xList_calib <- dl_calib$xList; y_calib <- dl_calib$y
xList_calib <- lapply(xList_calib, function(x) Coracle:::filter_features(x, abd_thresh = 0, prev_thresh = 0.1))

## Filtering of features from validation set
data_valid <- list(feature_table = feature_table_valid, sample_metadata = sample_metadata_valid, feature_metadata = feature_metadata_valid)
dl_valid <- Coracle:::gen_datalist(data_valid)
xList_valid <- dl_valid$xList; y_valid <- dl_valid$y
xList_valid <- lapply(xList_valid, function(x) Coracle:::filter_features(x, abd_thresh = 0, prev_thresh = 0.1))

# Selecting common features across proper training, calibration and validation set 
keep_features <- vector("list", length = length(xList_train))
for (i in seq_along(keep_features)) {
  vec_list <- list(
    colnames(xList_train[[i]]),
    colnames(xList_calib[[i]]),
    colnames(xList_valid[[i]])
  )
    
  keep_features[[i]] <- Reduce(intersect, vec_list)
  xList_train[[i]] <- as.matrix(xList_train[[i]][, keep_features[[i]], drop = FALSE])
  xList_calib[[i]] <- as.matrix(xList_calib[[i]][, keep_features[[i]], drop = FALSE])
  xList_valid[[i]] <- as.matrix(xList_valid[[i]][, keep_features[[i]], drop = FALSE])
}
  
# Reconstruct data into original format
data_train <- Coracle:::reconstruct_data(xList_train, y_train)
data_calib <- Coracle:::reconstruct_data(xList_calib, y_calib)
data_valid <- Coracle:::reconstruct_data(xList_valid, y_valid)

5. Running Base Learner Models for Different Fusion Choices

We fit base learner models under three distinct fusion strategies—early, late, and intermediate—to assess how the point of multimodal integration affects predictive performance and uncertainty quantification. Specifically, IntegratedLearner is used to implement early and late fusion, while BayesCOOP is employed for intermediate fusion.
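
The fitting calls themselves are omitted from this excerpt; downstream code assumes objects named `fit` (the IntegratedLearner fit used for early and late fusion) and `fit_bcoop` (the BayesCOOP fit for intermediate fusion). As a sketch only—argument names follow IntegratedLearner's documented interface but may differ across package versions—the early/late-fusion fit looks like:

```r
# Sketch only; verify argument names against your installed version.
fit <- IntegratedLearner(
  feature_table    = data_train$feature_table,
  sample_metadata  = data_train$sample_metadata,
  feature_metadata = data_train$feature_metadata,
  base_learner     = "SL.BART"   # BART via its SuperLearner wrapper
)
```

The intermediate-fusion object `fit_bcoop` is produced by BayesCOOP's fitting function on the same training data; consult that package's documentation for the exact call.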

6. Run Coracle

Here, we run Coracle for different fusion choices and store the output in a dataframe.

# Running Coracle
conformal_pred = Coracle(fit = fit,
                    fit_coop = fit_bcoop,
                    data_calib,
                    data_valid,
                    fusion_choice = c("late", "early", "intermediate"),
                    conf_level = 0.95)
## Prediction using posterior samples...
## Prediction using posterior samples...
df = conformal_pred$df

7. Performance Comparison

We compare the performance of Coracle with prediction intervals derived from BART under early and late fusion strategies, as well as from BayesCOOP under intermediate fusion. To ensure a fair and consistent comparison across methods, we use the same training dataset (data_train) for model fitting and the same independent validation dataset (data_valid) for evaluating the resulting prediction intervals.

# Construct the BART prediction interval for early fusion
# (post_samples_concat: posterior predictive draws from the early-fusion fit in Section 5)
BART_early_interval = t(apply(post_samples_concat, 1, quantile, probs = c(0.025, 0.975)))
BART_early_lower = as.vector(BART_early_interval[, 1])
BART_early_upper = as.vector(BART_early_interval[, 2])

# Construct the BART prediction interval for late fusion
# (post_samples_stacked: posterior predictive draws from the late-fusion fit in Section 5)
BART_late_interval = t(apply(post_samples_stacked, 1, quantile, probs = c(0.025, 0.975)))
BART_late_lower = as.vector(BART_late_interval[, 1])
BART_late_upper = as.vector(BART_late_interval[, 2])

# Construct BayesCOOP prediction interval for intermediate fusion
predict_bcoop <- predict(fit_bcoop, newdata = data_valid)
y_samples_shifted <- predict_bcoop$y_samples + mean(predict_bcoop$y_valid)  # re-center posterior draws
BayesCOOP_lower = apply(y_samples_shifted, 2, quantile, probs = 0.025)
BayesCOOP_upper = apply(y_samples_shifted, 2, quantile, probs = 0.975)
# Obtaining the prediction intervals for Coracle
base_learner = "SL.BART"
base_method = sub(".*\\.", "", base_learner)

Coracle_early_lower = sapply(df[["Coracle(early)"]], function(x) Coracle:::parse_interval(x)[1])
Coracle_early_upper = sapply(df[["Coracle(early)"]], function(x) Coracle:::parse_interval(x)[2])
Coracle_late_lower = sapply(df[["Coracle(late)"]], function(x) Coracle:::parse_interval(x)[1])
Coracle_late_upper = sapply(df[["Coracle(late)"]], function(x) Coracle:::parse_interval(x)[2])
Coracle_intermediate_lower = sapply(df[["Coracle(intermediate)"]], function(x) Coracle:::parse_interval(x)[1])
Coracle_intermediate_upper = sapply(df[["Coracle(intermediate)"]], function(x) Coracle:::parse_interval(x)[2])
# Comparison of prediction intervals
df_new <- data.frame(Subjects = df$subjectID, TrueY = data_valid$sample_metadata$Y, Coracle_early_lower = Coracle_early_lower, Coracle_early_upper = Coracle_early_upper,
                     Coracle_intermediate_lower = Coracle_intermediate_lower, Coracle_intermediate_upper = Coracle_intermediate_upper,
                     Coracle_late_lower = Coracle_late_lower, Coracle_late_upper = Coracle_late_upper,
                     BayesCOOP_lower = BayesCOOP_lower, BayesCOOP_upper = BayesCOOP_upper,
                     BART_early_lower = BART_early_lower, BART_early_upper = BART_early_upper,
                     BART_late_lower = BART_late_lower, BART_late_upper = BART_late_upper)

rownames(df_new) <- 1:length(data_valid$sample_metadata$Y)

inside_Coracle_early <- with(df_new, TrueY >= Coracle_early_lower & TrueY <= Coracle_early_upper)
inside_Coracle_intermediate  <- with(df_new, TrueY >= Coracle_intermediate_lower & TrueY <= Coracle_intermediate_upper)
inside_Coracle_late  <- with(df_new, TrueY >= Coracle_late_lower & TrueY <= Coracle_late_upper)
inside_BART_early  <- with(df_new, TrueY >= BART_early_lower & TrueY <= BART_early_upper)
inside_BayesCOOP <- with(df_new, TrueY >= BayesCOOP_lower & TrueY <= BayesCOOP_upper)
inside_BART_late  <- with(df_new, TrueY >= BART_late_lower & TrueY <= BART_late_upper)
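
With the indicator vectors above, the empirical coverage of each method—the fraction of validation subjects whose true value falls inside its interval—can be compared directly against the nominal 95% level:

```r
# Empirical coverage of each method's 95% prediction intervals
coverage <- c(
  Coracle_early        = mean(inside_Coracle_early),
  Coracle_intermediate = mean(inside_Coracle_intermediate),
  Coracle_late         = mean(inside_Coracle_late),
  BART_early           = mean(inside_BART_early),
  BART_late            = mean(inside_BART_late),
  BayesCOOP            = mean(inside_BayesCOOP)
)
round(coverage, 3)  # Coracle's entries should sit near or above 0.95, up to finite-sample error
```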

# Find points which are within the predicted intervals of at least one Coracle method but are not within the predicted intervals of any of the compared methods (BART, BayesCOOP)
df_new$mark <- (
  (inside_Coracle_early | inside_Coracle_intermediate | inside_Coracle_late) &
    (!inside_BART_early & !inside_BART_late & !inside_BayesCOOP)
)

# Find points which are within the predicted intervals of all the methods
df_new$all <- (
  inside_Coracle_early & inside_Coracle_intermediate & inside_Coracle_late &
    inside_BART_early & inside_BART_late & inside_BayesCOOP
)

## True Value Status 
df_new$true_status <- "True Value"
df_new$true_status[df_new$all]  <- "Consensus"
df_new$true_status[df_new$mark] <- "Coracle-only"

df_new$true_status <- factor(
  df_new$true_status,
  levels = c("True Value", "Coracle-only", "Consensus")
)

8. Visualization

## Plot
p <- ggplot(df_new, aes(x = factor(Subjects))) +

  ## ---- INTERVALS (COLOR LEGEND) ----
  geom_errorbar(aes(ymin = Coracle_early_lower, ymax = Coracle_early_upper, color = "Coracle (Early)"),
                width = 0.2, position = position_nudge(x = -0.2)) +
  geom_errorbar(aes(ymin = Coracle_late_lower, ymax = Coracle_late_upper, color = "Coracle (Late)"),
                width = 0.2, position = position_nudge(x = -0.1)) +
  geom_errorbar(aes(ymin = Coracle_intermediate_lower, ymax = Coracle_intermediate_upper, color = "Coracle (Intermediate)"),
                width = 0.2, position = position_nudge(x = 0)) +
  geom_errorbar(aes(ymin = BART_early_lower, ymax = BART_early_upper, color = "BART (Early)"),
                width = 0.2, position = position_nudge(x = 0.1)) +
  geom_errorbar(aes(ymin = BayesCOOP_lower, ymax = BayesCOOP_upper,
                    color = "BayesCOOP (Intermediate)"),
                width = 0.2, position = position_nudge(x = 0.2)) +
  geom_errorbar(aes(ymin = BART_late_lower, ymax = BART_late_upper, color = "BART (Late)"),
                width = 0.2, position = position_nudge(x = 0.3)) +

  ## ---- TRUE VALUES (FILL LEGEND) ----
  geom_point(
    aes(y = TrueY, fill = true_status),
    shape = 22,           # filled square
    size  = 3.5,
    color = "black",
    stroke = 0.3
  ) +

  ## ---- LEGENDS ----
  scale_color_manual(
    name = "Intervals",
    values = c(
      "Coracle (Early)"          = "#0072B2",
      "Coracle (Late)"           = "#009E73",
      "Coracle (Intermediate)"   = "#D55E00",
      "BART (Early)"             = "#E69F00",
      "BART (Late)"              = "#56B4E9",
      "BayesCOOP (Intermediate)" = "#CC79A7"
    )
  ) +
  
  scale_fill_manual(
    name = "",
    values = c(
      "True Value"   = "#999999",
      "Consensus"    = "red",
      "Coracle-only" = "blue"
    ),
    # Name the labels so each one maps to the intended level regardless of order
    labels = c(
      "True Value"   = "True Value",
      "Coracle-only" = "Coracle-only (only Coracle covers the true value)",
      "Consensus"    = "Consensus (all methods cover the true value)"
    )
  ) +
  
  ## ---- LABELS & THEME ----
  labs(
    title = "Comparison of Prediction Intervals for StelzerDOS",
    x = "Subjects",
    y = "Predicted Labor Onset"
  ) +
  theme_classic() +
  theme(
    axis.text.x  = element_blank(),
    axis.ticks.x = element_blank(),
    legend.position = "right",
    plot.title = element_text(hjust = 0.5, size = 12)
  ) +
  scale_x_discrete(expand = expansion(mult = c(0.05, 0.05)))

print(p)

The figure compares 95% prediction intervals for labor onset across multiple fusion strategies and learning paradigms. Overall, the Coracle intervals (early, intermediate, and late fusion) are consistently wider and more stable than those from BART and BayesCOOP, reflecting their distribution-free construction and explicit calibration for coverage. Several subjects exhibit consensus points (red squares), where the true labor onset lies within all competing intervals, indicating agreement across methods. In contrast, Coracle-only points (blue squares) highlight subjects for whom the true value is covered exclusively by Coracle and missed by BART and BayesCOOP. These cases illustrate that Coracle can provide additional robustness in uncertainty quantification, particularly for difficult or outlying subjects where model-based Bayesian or ensemble methods produce intervals that are too narrow. Taken together, the figure suggests that while different fusion strategies yield comparable central predictions, Coracle offers more reliable coverage across subjects, especially in regimes where competing methods fail to adequately capture uncertainty.
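
The width claim above can be checked numerically from `df_new`:

```r
# Average interval width per method, in the same units as the outcome
widths <- with(df_new, c(
  Coracle_early = mean(Coracle_early_upper - Coracle_early_lower),
  Coracle_late  = mean(Coracle_late_upper  - Coracle_late_lower),
  BART_early    = mean(BART_early_upper    - BART_early_lower),
  BART_late     = mean(BART_late_upper     - BART_late_lower),
  BayesCOOP     = mean(BayesCOOP_upper     - BayesCOOP_lower)
))
round(widths, 2)
```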

9. Citation

10. Session Info

sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Asia/Kolkata
## tzcode source: internal
## 
## attached base packages:
## [1] splines   stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggplot2_4.0.1           bartMachine_1.3.4.1     missForest_1.5         
##  [4] randomForest_4.7-1.2    bartMachineJARs_1.2.1   rJava_1.0-11           
##  [7] SuperLearner_2.0-40     gam_1.22-7              foreach_1.5.2          
## [10] nnls_1.6                BayesCOOP_0.1.0         IntegratedLearner_0.0.2
## [13] dplyr_1.1.4             glmnet_4.1-10           Matrix_1.7-4           
## [16] Coracle_0.1.0          
## 
## loaded via a namespace (and not attached):
##   [1] pROC_1.19.0.1        remotes_2.5.0        rlang_1.1.6         
##   [4] magrittr_2.0.4       prediction_0.3.18    compiler_4.5.1      
##   [7] vctrs_0.6.5          reshape2_1.4.5       rmutil_1.1.10       
##  [10] quantreg_6.1         quadprog_1.5-8       stringr_1.6.0       
##  [13] pkgconfig_2.0.3      shape_1.4.6.1        fastmap_1.2.0       
##  [16] mcmc_0.9-8           ellipsis_0.3.2       labeling_0.4.3      
##  [19] rmarkdown_2.30       prodlim_2025.04.28   sessioninfo_1.2.3   
##  [22] nloptr_2.2.1         itertools_0.1-3      glmnetUtils_1.1.9   
##  [25] MatrixModels_0.5-4   purrr_1.2.0          xfun_0.53           
##  [28] cachem_1.1.0         jsonlite_2.0.0       recipes_1.3.1       
##  [31] parallel_4.5.1       R6_2.6.1             bslib_0.9.0         
##  [34] stringi_1.8.7        RColorBrewer_1.1-3   parallelly_1.46.0   
##  [37] pkgload_1.4.1        rpart_4.1.24         lubridate_1.9.4     
##  [40] jquerylib_0.1.4      Rcpp_1.1.0           iterators_1.0.14    
##  [43] knitr_1.50           future.apply_1.20.1  usethis_3.2.1       
##  [46] bayesplot_1.14.0     nnet_7.3-20          timechange_0.3.0    
##  [49] tidyselect_1.2.1     rstudioapi_0.17.1    yaml_2.3.10         
##  [52] timeDate_4051.111    codetools_0.2-20     curl_7.0.0          
##  [55] listenv_0.10.0       pkgbuild_1.4.8       doRNG_1.8.6.2       
##  [58] lattice_0.22-7       tibble_3.3.0         plyr_1.8.9          
##  [61] withr_3.0.2          S7_0.2.1             coda_0.19-4.1       
##  [64] evaluate_1.0.5       future_1.68.0        survival_3.8-3      
##  [67] pillar_1.11.1        rngtools_1.5.2       stats4_4.5.1        
##  [70] insight_1.4.2        generics_0.1.4       truncnorm_1.0-9     
##  [73] scales_1.4.0         globals_0.18.0       class_7.3-23        
##  [76] glue_1.8.0           tools_4.5.1          data.table_1.17.8   
##  [79] SparseM_1.84-2       ModelMetrics_1.2.2.2 gower_1.0.2         
##  [82] fs_1.6.6             cowplot_1.2.0        grid_4.5.1          
##  [85] MCMCpack_1.7-1       tidyr_1.3.2          ipred_0.9-15        
##  [88] devtools_2.4.6       nlme_3.1-168         performance_0.15.2  
##  [91] cli_3.6.5            lava_1.8.2           gtable_0.3.6        
##  [94] sass_0.4.10          digest_0.6.39        caret_7.0-1         
##  [97] farver_2.1.2         memoise_2.0.1        htmltools_0.5.8.1   
## [100] lifecycle_1.0.4      hardhat_1.4.2        MASS_7.3-65