In general, MEFISTO can be used with a different likelihood model for each view, depending on the nature of each data modality: Gaussian (for continuous data), Poisson (for count data) and Bernoulli (for binary data), in the same manner as MOFA2. The implementation of the non-Gaussian likelihood models relies on Gaussian approximations to enable fast variational inference in non-conjugate models, following Seeger & Bouchard.
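As a minimal sketch of how a likelihood can be chosen per view, the snippet below builds a small object from simulated data and sets a Poisson likelihood for a count view. The view names, dimensions and simulated matrices are hypothetical. The sketch assumes that the MOFA2 R package is installed and uses its `create_mofa()` and `get_default_model_options()` functions; if the package is not available, the model part is skipped.

```r
# Hypothetical toy data: two views measured on the same 20 samples
# (features in rows, samples in columns)
set.seed(1)
samples <- paste0("sample_", 1:20)
views <- list(
  expression = matrix(rnorm(50 * 20), nrow = 50,
                      dimnames = list(paste0("gene_", 1:50), samples)),
  counts = matrix(rpois(30 * 20, lambda = 5), nrow = 30,
                  dimnames = list(paste0("peak_", 1:30), samples))
)

if (requireNamespace("MOFA2", quietly = TRUE)) {
  mofa <- MOFA2::create_mofa(views)
  model_opts <- MOFA2::get_default_model_options(mofa)
  # One likelihood per view: "gaussian", "poisson" or "bernoulli"
  model_opts$likelihoods["counts"] <- "poisson"
  print(model_opts$likelihoods)
}
```

The modified options would then be passed on when preparing the model for training.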
In many applications, however, data-specific preprocessing is preferable, both to take the characteristics of each view into account and to make the use of the Gaussian likelihood model (and its underlying homoscedasticity assumption) appropriate. In particular, for sequencing count data, data-specific preprocessing should be applied in most cases to correct for technical factors, such as library size, and to remove mean-variance relationships in the data. We provide here an overview of some possible approaches for this task, but more recent proposals or normalization methods tailored to the data type at hand might exist and could be useful instead.
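To illustrate why a mean-variance relationship conflicts with a homoscedastic Gaussian model, the following sketch simulates Poisson counts at three expression levels. The Poisson model and the chosen means are assumptions made purely for illustration (real sequencing data are usually overdispersed). The variance of the raw counts grows with the mean, whereas a simple square-root transformation makes it roughly constant.

```r
set.seed(1)
means <- c(low = 5, medium = 50, high = 500)
# 10,000 simulated Poisson counts per expression level
counts <- sapply(means, function(m) rpois(10000, lambda = m))

raw_var  <- apply(counts, 2, var)        # grows with the mean
sqrt_var <- apply(sqrt(counts), 2, var)  # roughly constant across levels

print(round(rbind(raw = raw_var, sqrt = sqrt_var), 2))
```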
The examples below illustrate some generally useful pre-processing steps for count data before applying MOFA2 or MEFISTO.
Library size correction and variance stabilization are a common way to prepare count data for computational methods that are based on a normality assumption. For this purpose, DESeq2 can be used with one of the following transformations. (See also the corresponding tutorial for details.)
First, we do a library size correction.
library(DESeq2)
dds <- makeExampleDESeqDataSet()
dds <- estimateSizeFactors(dds)
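For intuition, the default size-factor estimate in DESeq2 follows the median-of-ratios idea. The base R sketch below reproduces that idea by hand on simulated counts; the sequencing depths and expression levels are made up for illustration, and in practice `estimateSizeFactors()` should be used as shown above.

```r
set.seed(42)
n_genes <- 200
true_depth <- c(0.5, 1, 2, 4)   # hypothetical relative sequencing depths
base_expr <- rgamma(n_genes, shape = 2, scale = 50)
counts <- sapply(true_depth, function(d) rpois(n_genes, lambda = base_expr * d))

# Per-gene log geometric mean across samples (-Inf if any count is zero)
log_geo_means <- rowMeans(log(counts))

# Size factor per sample: median ratio of its counts to the geometric means,
# using only genes with a finite geometric mean
size_factors <- apply(counts, 2, function(cnts) {
  keep <- is.finite(log_geo_means) & cnts > 0
  exp(median((log(cnts) - log_geo_means)[keep]))
})

# Library-size-corrected counts
norm_counts <- sweep(counts, 2, size_factors, "/")
print(round(size_factors, 2))
```

The estimated size factors should be roughly proportional to the simulated depths.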
Second, we remove the mean-variance trend that is present in the data using one of the following procedures:
varianceStabilizingTransformation
# option 1: variance stabilizing transformation
vsd <- varianceStabilizingTransformation(dds)
mefisto_input <- assay(vsd)
rlog
# option 2: 'regularized log' transformation
rld <- rlog(dds)
mefisto_input <- assay(rld)
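One simple way to check whether a transformation has removed the trend is to compare the per-feature mean with the per-feature standard deviation. The sketch below does this on simulated Poisson counts, with a square-root transformation standing in for the transformations above. The helper `mean_sd_trend()` is a hypothetical name introduced here, not part of DESeq2; the same function could be applied to a matrix such as `assay(vsd)` or `assay(rld)`.

```r
# Hypothetical helper: Spearman correlation between per-feature mean and
# per-feature standard deviation (features in rows, samples in columns)
mean_sd_trend <- function(mat) {
  cor(rowMeans(mat), apply(mat, 1, sd), method = "spearman")
}

set.seed(7)
n_features <- 500
n_samples <- 20
# Simulated feature means spread between 10 and 1000 on a log scale
feature_means <- exp(runif(n_features, log(10), log(1000)))
# lambda is recycled down the columns, so row i has mean feature_means[i]
counts <- matrix(rpois(n_features * n_samples, lambda = feature_means),
                 nrow = n_features)

trend_raw <- mean_sd_trend(counts)                # strong positive trend
trend_transformed <- mean_sd_trend(sqrt(counts))  # trend largely removed
print(c(raw = trend_raw, transformed = trend_transformed))
```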
# see ?NormalizeData
library(Seurat)
data("pbmc_small")
mefisto_input <- NormalizeData(pbmc_small, normalization.method = "LogNormalize", scale.factor = 10000)
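For intuition, the "LogNormalize" method divides the counts of each cell by the total counts of that cell, multiplies by the scale factor and applies a log(1 + x) transformation. The base R sketch below spells out this computation on a hypothetical toy count matrix; it illustrates the arithmetic only and is not a replacement for `NormalizeData()`.

```r
set.seed(3)
# Hypothetical toy count matrix: 100 genes (rows) x 6 cells (columns)
counts <- matrix(rpois(100 * 6, lambda = 3), nrow = 100,
                 dimnames = list(paste0("gene_", 1:100),
                                 paste0("cell_", 1:6)))

scale_factor <- 10000
cell_totals <- colSums(counts)

# Divide each cell by its total count, rescale, then apply log(1 + x)
log_norm <- log1p(sweep(counts, 2, cell_totals, "/") * scale_factor)
```

Undoing the log transformation with `expm1(log_norm)` gives columns that each sum to the scale factor.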
For Python users, similar pre-processing steps are implemented, for example, in scanpy.
# see ?SCTransform
library(Seurat)
data("pbmc_small")
mefisto_input <- SCTransform(object = pbmc_small)
Once the mean-variance relationship has been removed from the data, further filtering steps can be useful, such as
A detailed example of the pre-processing used in the MEFISTO application of the manuscript can be found here:
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS: /usr/lib64/libblas.so.3.4.2
## LAPACK: /usr/lib64/liblapack.so.3.4.2
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] SeuratObject_4.0.0 Seurat_4.0.0
## [3] DESeq2_1.28.1 SummarizedExperiment_1.18.1
## [5] DelayedArray_0.14.0 matrixStats_0.56.0
## [7] Biobase_2.48.0 GenomicRanges_1.40.0
## [9] GenomeInfoDb_1.24.2 IRanges_2.22.2
## [11] S4Vectors_0.26.1 BiocGenerics_0.34.0
## [13] BiocStyle_2.16.0
##
## loaded via a namespace (and not attached):
## [1] Rtsne_0.15 colorspace_2.0-0 deldir_0.1-25
## [4] ellipsis_0.3.1 ggridges_0.5.2 XVector_0.28.0
## [7] spatstat.data_1.4-3 leiden_0.3.7 listenv_0.8.0
## [10] ggrepel_0.8.2 bit64_0.9-7 AnnotationDbi_1.50.1
## [13] codetools_0.2-16 splines_4.0.0 geneplotter_1.66.0
## [16] knitr_1.29 polyclip_1.10-0 jsonlite_1.7.0
## [19] annotate_1.66.0 ica_1.0-2 cluster_2.1.0
## [22] png_0.1-7 uwot_0.1.10 shiny_1.5.0
## [25] sctransform_0.3.2 BiocManager_1.30.10 compiler_4.0.0
## [28] httr_1.4.1 lazyeval_0.2.2 Matrix_1.2-18
## [31] fastmap_1.0.1 later_1.1.0.1 htmltools_0.5.0
## [34] tools_4.0.0 igraph_1.2.6 gtable_0.3.0
## [37] glue_1.4.2 GenomeInfoDbData_1.2.3 RANN_2.6.1
## [40] reshape2_1.4.4 dplyr_1.0.0 spatstat_1.64-1
## [43] Rcpp_1.0.5 scattermore_0.7 vctrs_0.3.1
## [46] nlme_3.1-148 lmtest_0.9-37 xfun_0.15
## [49] stringr_1.4.0 globals_0.12.5 mime_0.9
## [52] miniUI_0.1.1.1 lifecycle_0.2.0 irlba_2.3.3
## [55] goftest_1.2-2 XML_3.99-0.4 future_1.17.0
## [58] zlibbioc_1.34.0 MASS_7.3-51.6 zoo_1.8-8
## [61] scales_1.1.1 spatstat.utils_1.17-0 promises_1.1.1
## [64] RColorBrewer_1.1-2 yaml_2.2.1 memoise_1.1.0
## [67] reticulate_1.16 pbapply_1.4-2 gridExtra_2.3
## [70] ggplot2_3.3.2 rpart_4.1-15 stringi_1.5.3
## [73] RSQLite_2.2.0 genefilter_1.70.0 BiocParallel_1.22.0
## [76] rlang_0.4.9 pkgconfig_2.0.3 bitops_1.0-6
## [79] evaluate_0.14 lattice_0.20-41 tensor_1.5
## [82] ROCR_1.0-11 purrr_0.3.4 htmlwidgets_1.5.1
## [85] patchwork_1.0.1 cowplot_1.0.0 bit_1.1-15.2
## [88] tidyselect_1.1.0 RcppAnnoy_0.0.18 plyr_1.8.6
## [91] magrittr_2.0.1 bookdown_0.20 R6_2.5.0
## [94] generics_0.1.0 DBI_1.1.0 mgcv_1.8-31
## [97] pillar_1.4.4 fitdistrplus_1.1-1 abind_1.4-5
## [100] survival_3.2-3 RCurl_1.98-1.2 tibble_3.0.2
## [103] future.apply_1.6.0 crayon_1.3.4 KernSmooth_2.23-17
## [106] plotly_4.9.2.1 rmarkdown_2.7 locfit_1.5-9.4
## [109] grid_4.0.0 data.table_1.12.8 blob_1.2.1
## [112] digest_0.6.27 xtable_1.8-4 tidyr_1.1.0
## [115] httpuv_1.5.4 munsell_0.5.0 viridisLite_0.3.0