In this tutorial, we work through a complete conformal workflow using Coracle.
Coracle is a conformalized framework for multimodal AI with continuous outcomes that adapts to early, late, and intermediate fusion. It provides marginal coverage guarantees that hold in finite samples, without relying on distributional assumptions.
Here we apply Coracle to a pregnancy dataset to produce finite-sample-valid prediction intervals and compare its coverage behavior against prediction intervals from BART and BayesCOOP. The results illustrate how Coracle provides robust, distribution-free uncertainty quantification across fusion paradigms, particularly in settings where competing methods yield overly narrow intervals.
We install Coracle from GitHub if it is not already available in our R library. We recommend doing this from a fresh R session so we are not accidentally masking a function from an older development build.
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("himelmallick/Coracle")
## Using GitHub PAT from the git credential store.
## Skipping install of 'Coracle' from a github remote, the SHA1 (479acc0e) has not changed since last install.
## Use `force = TRUE` to force installation
library(Coracle)
Now we load the packages we are going to use in the rest of the tutorial: Coracle (main method), dplyr (data pre-processing), bartMachine, SuperLearner, IntegratedLearner, and BayesCOOP (base learners), and ggplot2 (visualization).
Next, we make sure Java-based models (e.g., BART via bartMachine) get enough JVM heap memory, preventing out-of-memory crashes and long garbage-collection pauses during computation.
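One way to do both steps in a single chunk is sketched below. The JVM options must be set before any Java-backed package is loaded, since they cannot be changed once the JVM has started; the 8 GB heap size is an assumed value to adjust to your machine, and packages that are not installed are skipped so the chunk also runs on a partial setup.

```r
# Set JVM options BEFORE any Java-backed package (rJava, bartMachine) is loaded;
# once the JVM starts, these flags can no longer be changed.
options(java.parameters = "-Xmx8g")   # assumed 8 GB heap; adjust to your machine

# Load the packages used in this tutorial, skipping any that are missing.
pkgs <- c("Coracle", "dplyr", "bartMachine", "SuperLearner",
          "IntegratedLearner", "BayesCOOP", "ggplot2")
for (p in pkgs) {
  if (requireNamespace(p, quietly = TRUE)) library(p, character.only = TRUE)
}
```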
We load the StelzerDOS dataset from the IntegratedLearner GitHub repository. The dataset is divided into training and testing sets, each provided as a list containing three components:

feature_table – a matrix of features (e.g., microbial taxa, metabolites) with samples as columns and features as rows.
feature_metadata – annotations describing each feature, such as modality or biological type.
sample_metadata – accompanying sample-level information, including phenotype, covariates, and grouping variables.

# loading the required datasets
data_train <- get(load(url("https://raw.githubusercontent.com/himelmallick/IntegratedLearner/master/data/StelzerDOS.RData")))
data_test <- get(load(url("https://raw.githubusercontent.com/himelmallick/IntegratedLearner/master/data/StelzerDOS_valid.RData")))
rm(pcl) # remove the raw object created by load()
We remove metabolomics features since they are not present in the testing set. Then we realign the feature tables to the filtered metadata so that only valid rows and columns remain. Finally, we keep baseline samples only (training samples labeled "A" and test samples labeled "G1"), ensuring that both features and sample metadata stay synchronized across datasets.
# extract info from train data
feature_table <- data_train$feature_table
sample_metadata <- data_train$sample_metadata
feature_metadata <- data_train$feature_metadata
# Remove metabolomics to match with validation
feature_metadata <- feature_metadata %>% dplyr::filter(featureType!='Metabolomics')
feature_table <- feature_table[rownames(feature_metadata),]
# consider only baseline observations
positions <- grep("A", colnames(feature_table), ignore.case = TRUE)
feature_table <- feature_table[, positions]
sample_metadata <- sample_metadata[positions, ]
rm(positions)
# extract info from test data
feature_table_valid <- data_test$feature_table
sample_metadata_valid <- data_test$sample_metadata
feature_metadata_valid <- data_test$feature_metadata
# extracting baseline measurements
positions <- grep("G1", colnames(feature_table_valid))
feature_table_valid <- feature_table_valid[, positions]
sample_metadata_valid <- sample_metadata_valid[positions, ]
rm(positions)
Next we divide the training set into a proper training set and a calibration set in a 3:1 ratio.
# construct the proper training set
set.seed(1234)
train_elements <- sample(seq_len(ncol(feature_table)), floor(0.75 * ncol(feature_table)), replace = FALSE)
feature_table_train <- t(t(feature_table)[train_elements, ])
sample_metadata_train <- sample_metadata[train_elements, ]
feature_metadata_train <- feature_metadata
# construct the calibration set
feature_table_calib <- t(t(feature_table)[-train_elements, ])
sample_metadata_calib <- sample_metadata[-train_elements, ]
feature_metadata_calib <- feature_metadata
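For intuition on why the calibration set matters, the generic split-conformal recipe that motivates this split can be illustrated on synthetic data with base R only; `lm()` stands in here for the multimodal base learner, and the data are made up for the sketch.

```r
# Toy split-conformal interval: fit on the proper training set, compute
# conformity scores on the calibration set, use their finite-sample-corrected
# quantile as the interval half-width.
set.seed(1)
n <- 200
toy <- data.frame(x = runif(n))
toy$y <- 2 * toy$x + rnorm(n, sd = 0.3)
idx <- sample(n, floor(0.75 * n))                  # 3:1 proper-train / calibration split
fit_toy <- lm(y ~ x, data = toy[idx, ])
calib <- toy[-idx, ]
scores <- abs(calib$y - predict(fit_toy, calib))   # absolute-residual conformity scores
alpha <- 0.05
k <- ceiling((1 - alpha) * (length(scores) + 1))   # finite-sample-corrected rank
qhat <- sort(scores)[k]
new_pt <- data.frame(x = 0.5)
interval <- predict(fit_toy, new_pt) + c(-1, 1) * qhat   # 95% prediction interval
```

The `(1 - alpha) * (n_calib + 1)` rank (rather than a plain 95th percentile) is what yields the finite-sample coverage guarantee.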
We filter low-prevalence features and ensure common features across training, calibration, and validation sets.
# Filtering of features and selecting common features across proper train, calib and validation sets
## Filtering of features from proper training set
data_train <- list(feature_table = feature_table_train, sample_metadata = sample_metadata_train, feature_metadata = feature_metadata_train)
dl_train <- Coracle:::gen_datalist(data_train); xList_train <- dl_train$xList; y_train <- dl_train$y
xList_train <- lapply(xList_train, function(foo) Coracle:::filter_features(foo, abd_thresh = 0, prev_thresh = 0.1))
## Filtering of features from calibration set
data_calib <- list(feature_table = feature_table_calib, sample_metadata = sample_metadata_calib, feature_metadata = feature_metadata_calib)
dl_calib <- Coracle:::gen_datalist(data_calib); xList_calib <- dl_calib$xList; y_calib <- dl_calib$y
xList_calib <- lapply(xList_calib, function(foo) Coracle:::filter_features(foo, abd_thresh = 0, prev_thresh = 0.1))
## Filtering of features from validation set
data_valid <- list(feature_table = feature_table_valid, sample_metadata = sample_metadata_valid, feature_metadata = feature_metadata_valid)
dl_valid <- Coracle:::gen_datalist(data_valid); xList_valid <- dl_valid$xList; y_valid <- dl_valid$y
xList_valid <- lapply(xList_valid, function(foo) Coracle:::filter_features(foo, abd_thresh = 0, prev_thresh = 0.1))
# Selecting common features across proper training, calibration and validation set
keep_features <- vector("list", length = length(xList_train))
for(i in 1:length(keep_features)) {
vec_list <- list(
colnames(xList_train[[i]]),
colnames(xList_calib[[i]]),
colnames(xList_valid[[i]])
)
keep_features[[i]] <- Reduce(intersect, vec_list)
xList_train[[i]] <- as.matrix(xList_train[[i]][, keep_features[[i]], drop = FALSE])
xList_calib[[i]] <- as.matrix(xList_calib[[i]][, keep_features[[i]], drop = FALSE])
xList_valid[[i]] <- as.matrix(xList_valid[[i]][, keep_features[[i]], drop = FALSE])
}
# Reconstruct data into original format
data_train <- Coracle:::reconstruct_data(xList_train, y_train)
data_calib <- Coracle:::reconstruct_data(xList_calib, y_calib)
data_valid <- Coracle:::reconstruct_data(xList_valid, y_valid)
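For intuition, `prev_thresh = 0.1` keeps features that are non-zero in more than 10% of samples. A base-R sketch of that idea on synthetic data is shown below; `filter_features()` itself is internal to Coracle, so this is an illustrative stand-in, not its exact implementation.

```r
# Prevalence filtering: drop features observed (non-zero) in too few samples.
set.seed(42)
X <- matrix(rbinom(200, 1, 0.08) * rexp(200), nrow = 20)  # 20 samples x 10 features
prevalence <- colMeans(X > 0)               # fraction of samples with a non-zero value
X_kept <- X[, prevalence > 0.1, drop = FALSE]   # keep features above the threshold
```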
We fit base learner models under three distinct fusion strategies—early, late, and intermediate—to assess how the point of multimodal integration affects predictive performance and uncertainty quantification. Specifically, IntegratedLearner is used to implement early and late fusion, while BayesCOOP is employed for intermediate fusion.
Now we run Coracle for the different fusion choices, passing the fitted IntegratedLearner object (fit) and the fitted BayesCOOP object (fit_bcoop) from the previous step, and store the output in a data frame.
# Running Coracle
conformal_pred = Coracle(fit = fit,
fit_coop = fit_bcoop,
data_calib,
data_valid,
fusion_choice = c("late", "early", "intermediate"),
conf_level = 0.95)
## Prediction using posterior samples...
## Prediction using posterior samples...
df = conformal_pred$df
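The fusion-specific intervals in df are stored as strings and parsed further below with Coracle:::parse_interval(). As an illustrative stand-in, a minimal parser might look like this; the "[lower, upper]" string layout is an assumption for the sketch, not a documented Coracle format.

```r
# Illustrative stand-in for an interval-string parser: extract numeric
# bounds from a string like "[12.3, 45.6]".
parse_interval_demo <- function(s) {
  as.numeric(strsplit(gsub("\\[|\\]", "", s), ",")[[1]])
}
parse_interval_demo("[12.3, 45.6]")   # c(12.3, 45.6)
```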
We compare the performance of Coracle with prediction intervals
derived from BART under early and late fusion strategies, as well as
from BayesCOOP under intermediate fusion. To ensure a fair and
consistent comparison across methods, we use the same training dataset
(data_train) for model fitting and the same independent
validation dataset (data_valid) for evaluating the
resulting prediction intervals.
# Construct the BART prediction interval for early fusion
# (post_samples_concat holds posterior predictive draws from the early-fusion
# BART fit, one row per validation subject)
BART_early_interval = t(apply(post_samples_concat, 1, quantile, probs = c(0.025, 0.975)))
BART_early_lower= as.vector(BART_early_interval[, 1])
BART_early_upper = as.vector(BART_early_interval[, 2])
# Construct the BART prediction interval for late fusion
# (post_samples_stacked holds posterior predictive draws from the stacked
# late-fusion fit, one row per validation subject)
BART_late_interval = t(apply(post_samples_stacked, 1, quantile, probs = c(0.025, 0.975)))
BART_late_lower = as.vector(BART_late_interval[, 1])
BART_late_upper = as.vector(BART_late_interval[, 2])
# Construct the BayesCOOP prediction interval for intermediate fusion
predict_bcoop <- predict(fit_bcoop, newdata = data_valid)
BayesCOOP_lower = apply((predict_bcoop$y_samples + mean(predict_bcoop$y_valid)), 2, quantile, probs = 0.025)
BayesCOOP_upper = apply((predict_bcoop$y_samples + mean(predict_bcoop$y_valid)), 2, quantile, probs = 0.975)
# Obtaining the prediction intervals for Coracle
base_learner = "SL.BART"
base_method = sub(".*\\.", "", base_learner)
Coracle_early_lower = sapply(df[["Coracle(early)"]], function(x) Coracle:::parse_interval(x)[1])
Coracle_early_upper = sapply(df[["Coracle(early)"]], function(x) Coracle:::parse_interval(x)[2])
Coracle_late_lower = sapply(df[["Coracle(late)"]], function(x) Coracle:::parse_interval(x)[1])
Coracle_late_upper = sapply(df[["Coracle(late)"]], function(x) Coracle:::parse_interval(x)[2])
Coracle_intermediate_lower = sapply(df[["Coracle(intermediate)"]], function(x) Coracle:::parse_interval(x)[1])
Coracle_intermediate_upper = sapply(df[["Coracle(intermediate)"]], function(x) Coracle:::parse_interval(x)[2])
# Comparison of prediction intervals
df_new <- data.frame(Subjects = df$subjectID, TrueY = data_valid$sample_metadata$Y, Coracle_early_lower = Coracle_early_lower, Coracle_early_upper = Coracle_early_upper,
Coracle_intermediate_lower = Coracle_intermediate_lower, Coracle_intermediate_upper = Coracle_intermediate_upper,
Coracle_late_lower = Coracle_late_lower, Coracle_late_upper = Coracle_late_upper,
BayesCOOP_lower = BayesCOOP_lower, BayesCOOP_upper = BayesCOOP_upper,
BART_early_lower = BART_early_lower, BART_early_upper = BART_early_upper,
BART_late_lower = BART_late_lower, BART_late_upper = BART_late_upper)
rownames(df_new) <- 1:length(data_valid$sample_metadata$Y)
inside_Coracle_early <- with(df_new, TrueY >= Coracle_early_lower & TrueY <= Coracle_early_upper)
inside_Coracle_intermediate <- with(df_new, TrueY >= Coracle_intermediate_lower & TrueY <= Coracle_intermediate_upper)
inside_Coracle_late <- with(df_new, TrueY >= Coracle_late_lower & TrueY <= Coracle_late_upper)
inside_BART_early <- with(df_new, TrueY >= BART_early_lower & TrueY <= BART_early_upper)
inside_BayesCOOP <- with(df_new, TrueY >= BayesCOOP_lower & TrueY <= BayesCOOP_upper)
inside_BART_late <- with(df_new, TrueY >= BART_late_lower & TrueY <= BART_late_upper)
# Find points which are within the predicted intervals of at least one Coracle method but are not within the predicted intervals of any of the compared methods (BART, BayesCOOP)
df_new$mark <- (
(inside_Coracle_early | inside_Coracle_intermediate | inside_Coracle_late) &
(!inside_BART_early & !inside_BART_late & !inside_BayesCOOP)
)
# Find points which are within the predicted intervals of all the methods
df_new$all <- (
inside_Coracle_early & inside_Coracle_intermediate & inside_Coracle_late &
inside_BART_early & inside_BART_late & inside_BayesCOOP
)
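The indicator vectors above summarize directly into empirical coverage rates, the quantity the conformal guarantee constrains. A minimal sketch, using synthetic stand-ins for the truth and one method's interval bounds:

```r
# Empirical coverage: fraction of subjects whose true outcome falls inside
# the interval, to be compared against the nominal level (here 0.95).
set.seed(7)
truth <- rnorm(50)
lower <- truth - runif(50, -0.5, 1.5)   # some intervals will miss the truth
upper <- truth + runif(50, -0.5, 1.5)
inside <- truth >= lower & truth <= upper
coverage <- mean(inside)
```

In the real comparison, the same `mean()` would be applied to each of the `inside_*` vectors computed above.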
## True Value Status
df_new$true_status <- "True Value"
df_new$true_status[df_new$all] <- "Consensus"
df_new$true_status[df_new$mark] <- "Coracle-only"
df_new$true_status <- factor(
df_new$true_status,
levels = c("True Value", "Coracle-only", "Consensus")
)
## Plot
p <- ggplot(df_new, aes(x = factor(Subjects))) +
## ---- INTERVALS (COLOR LEGEND) ----
geom_errorbar(aes(ymin = Coracle_early_lower, ymax = Coracle_early_upper, color = "Coracle (Early)"),
width = 0.2, position = position_nudge(x = -0.2)) +
geom_errorbar(aes(ymin = Coracle_late_lower, ymax = Coracle_late_upper, color = "Coracle (Late)"),
width = 0.2, position = position_nudge(x = -0.1)) +
geom_errorbar(aes(ymin = Coracle_intermediate_lower, ymax = Coracle_intermediate_upper, color = "Coracle (Intermediate)"),
width = 0.2, position = position_nudge(x = 0)) +
geom_errorbar(aes(ymin = BART_early_lower, ymax = BART_early_upper, color = "BART (Early)"),
width = 0.2, position = position_nudge(x = 0.1)) +
geom_errorbar(aes(ymin = BayesCOOP_lower, ymax = BayesCOOP_upper,
color = "BayesCOOP (Intermediate)"),
width = 0.2, position = position_nudge(x = 0.2)) +
geom_errorbar(aes(ymin = BART_late_lower, ymax = BART_late_upper, color = "BART (Late)"),
width = 0.2, position = position_nudge(x = 0.3)) +
## ---- TRUE VALUES (FILL LEGEND) ----
geom_point(
aes(y = TrueY, fill = true_status),
shape = 22, # filled square
size = 3.5,
color = "black",
stroke = 0.3
) +
## ---- LEGENDS ----
scale_color_manual(
name = "Intervals",
values = c(
"Coracle (Early)" = "#0072B2",
"Coracle (Late)" = "#009E73",
"Coracle (Intermediate)" = "#D55E00",
"BART (Early)" = "#E69F00",
"BART (Late)" = "#56B4E9",
"BayesCOOP (Intermediate)" = "#CC79A7"
)
) +
scale_fill_manual(
name = "",
values = c(
"True Value" = "#999999",
"Consensus" = "red",
"Coracle-only" = "blue"
),
labels = c(
"True Value",
"Coracle-only (only Coracle covers the true value)",
"Consensus (all methods cover the true value)"
) +
## ---- LABELS & THEME ----
labs(
title = "Comparison of Prediction Intervals for StelzerDOS",
x = "Subjects",
y = "Predicted Labor Onset"
) +
theme_classic() +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
legend.position = "right",
plot.title = element_text(hjust = 0.5, size = 12)
) +
scale_x_discrete(expand = expansion(mult = c(0.05, 0.05)))
print(p)
The figure compares 95% prediction intervals for labor onset across multiple fusion strategies and learning paradigms. Overall, the Coracle-based intervals (early, intermediate, and late fusion) are consistently wider and more stable than those from BART and BayesCOOP, reflecting their distribution-free construction and explicit calibration for coverage. Several subjects exhibit consensus points (red squares), where the true labor onset lies within all competing intervals, indicating agreement across methods. In contrast, Coracle-only points (blue squares) highlight subjects for whom the true value is covered exclusively by Coracle, while being missed by BART and BayesCOOP. These cases illustrate that Coracle can provide additional robustness in uncertainty quantification, particularly for difficult or outlying subjects where model-based Bayesian or ensemble methods produce intervals that are too narrow. Taken together, the figure suggests that while different fusion strategies yield comparable central predictions, Coracle offers more reliable coverage guarantees across subjects, especially in regimes where competing methods fail to adequately capture uncertainty.
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Asia/Kolkata
## tzcode source: internal
##
## attached base packages:
## [1] splines stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggplot2_4.0.1 bartMachine_1.3.4.1 missForest_1.5
## [4] randomForest_4.7-1.2 bartMachineJARs_1.2.1 rJava_1.0-11
## [7] SuperLearner_2.0-40 gam_1.22-7 foreach_1.5.2
## [10] nnls_1.6 BayesCOOP_0.1.0 IntegratedLearner_0.0.2
## [13] dplyr_1.1.4 glmnet_4.1-10 Matrix_1.7-4
## [16] Coracle_0.1.0
##
## loaded via a namespace (and not attached):
## [1] pROC_1.19.0.1 remotes_2.5.0 rlang_1.1.6
## [4] magrittr_2.0.4 prediction_0.3.18 compiler_4.5.1
## [7] vctrs_0.6.5 reshape2_1.4.5 rmutil_1.1.10
## [10] quantreg_6.1 quadprog_1.5-8 stringr_1.6.0
## [13] pkgconfig_2.0.3 shape_1.4.6.1 fastmap_1.2.0
## [16] mcmc_0.9-8 ellipsis_0.3.2 labeling_0.4.3
## [19] rmarkdown_2.30 prodlim_2025.04.28 sessioninfo_1.2.3
## [22] nloptr_2.2.1 itertools_0.1-3 glmnetUtils_1.1.9
## [25] MatrixModels_0.5-4 purrr_1.2.0 xfun_0.53
## [28] cachem_1.1.0 jsonlite_2.0.0 recipes_1.3.1
## [31] parallel_4.5.1 R6_2.6.1 bslib_0.9.0
## [34] stringi_1.8.7 RColorBrewer_1.1-3 parallelly_1.46.0
## [37] pkgload_1.4.1 rpart_4.1.24 lubridate_1.9.4
## [40] jquerylib_0.1.4 Rcpp_1.1.0 iterators_1.0.14
## [43] knitr_1.50 future.apply_1.20.1 usethis_3.2.1
## [46] bayesplot_1.14.0 nnet_7.3-20 timechange_0.3.0
## [49] tidyselect_1.2.1 rstudioapi_0.17.1 yaml_2.3.10
## [52] timeDate_4051.111 codetools_0.2-20 curl_7.0.0
## [55] listenv_0.10.0 pkgbuild_1.4.8 doRNG_1.8.6.2
## [58] lattice_0.22-7 tibble_3.3.0 plyr_1.8.9
## [61] withr_3.0.2 S7_0.2.1 coda_0.19-4.1
## [64] evaluate_1.0.5 future_1.68.0 survival_3.8-3
## [67] pillar_1.11.1 rngtools_1.5.2 stats4_4.5.1
## [70] insight_1.4.2 generics_0.1.4 truncnorm_1.0-9
## [73] scales_1.4.0 globals_0.18.0 class_7.3-23
## [76] glue_1.8.0 tools_4.5.1 data.table_1.17.8
## [79] SparseM_1.84-2 ModelMetrics_1.2.2.2 gower_1.0.2
## [82] fs_1.6.6 cowplot_1.2.0 grid_4.5.1
## [85] MCMCpack_1.7-1 tidyr_1.3.2 ipred_0.9-15
## [88] devtools_2.4.6 nlme_3.1-168 performance_0.15.2
## [91] cli_3.6.5 lava_1.8.2 gtable_0.3.6
## [94] sass_0.4.10 digest_0.6.39 caret_7.0-1
## [97] farver_2.1.2 memoise_2.0.1 htmltools_0.5.8.1
## [100] lifecycle_1.0.4 hardhat_1.4.2 MASS_7.3-65