1 Introduction

In general, MEFISTO can be used with a different likelihood model for each view, depending on the nature of the data modality: Gaussian (for continuous data), Poisson (for count data) or Bernoulli (for binary data), in the same manner as MOFA2. The implementation of the non-Gaussian likelihood models relies on Gaussian approximations to enable fast variational inference in non-conjugate models, following Seeger & Bouchard.

In many applications, however, data-specific preprocessing is preferable in order to take the characteristics of each view into account and to make the use of the Gaussian likelihood model (and its underlying homoscedasticity assumption) appropriate. In particular, for sequencing count data, data-specific preprocessing should be applied in most cases to correct for technical factors, such as library size, and to remove mean-variance relationships in the data. We here provide an overview of some possible approaches for this task, but more recent proposals or normalization methods tailored to the data type at hand might exist and could be useful instead.

2 General pre-processing

The examples below illustrate some generally useful pre-processing steps for count data before applying MOFA2 or MEFISTO.

2.1 Example 1: Bulk RNA-seq data

Library size correction and variance stabilization are common steps to prepare count data for computational methods that are based on a normality assumption. For this purpose, DESeq2 can be used with one of the following transformations (see also the corresponding tutorial for details).

First, we do a library size correction.

library(DESeq2)
dds <- makeExampleDESeqDataSet()
dds <- estimateSizeFactors(dds)
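
To illustrate what library size correction does, the following base R sketch computes median-of-ratios size factors on a small toy count matrix. The matrix `counts` and the helper `median_ratio_size_factors` are hypothetical names introduced here for illustration only; for real data the DESeq2 function above should be used.

```r
# Toy count matrix: rows are genes, columns are samples (hypothetical data).
# Sample 2 and sample 3 are sequenced 2x and 4x as deeply as sample 1.
counts <- matrix(
  c(10, 20, 40,
    100, 200, 400,
    5, 10, 20,
    50, 100, 200),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Hypothetical helper: median-of-ratios size factors.
median_ratio_size_factors <- function(counts) {
  # Per-gene log geometric mean across samples
  log_geo_means <- rowMeans(log(counts))
  # Genes with a zero count in any sample give -Inf here and are skipped
  keep <- is.finite(log_geo_means)
  apply(counts, 2, function(sample_counts) {
    exp(median(log(sample_counts[keep]) - log_geo_means[keep]))
  })
}

size_factors <- median_ratio_size_factors(counts)

# Divide each sample (column) by its size factor
normalized_counts <- sweep(counts, 2, size_factors, "/")
```

After dividing by the size factors, the three toy samples have identical normalized counts, because they differ only in sequencing depth.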

Second, we remove the mean variance trend that is present in the data by one of the following procedures:

Variance stabilization using varianceStabilizingTransformation

# option 1: variance stabilizing transformation
vsd <- varianceStabilizingTransformation(dds)
mefisto_input <- assay(vsd)

Regularized logarithmic transformation using rlog

# option 2: 'regularized log' transformation
rld <- rlog(dds)
mefisto_input <- assay(rld)
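
The idea behind removing the mean-variance trend can be illustrated with a small simulation. The sketch below is a simplified stand-in and not the procedure used by DESeq2, whose transformations are based on a fitted negative binomial dispersion-mean trend: it assumes Poisson-distributed counts, for which a square-root transformation is approximately variance stabilizing. All variable names are hypothetical.

```r
set.seed(1)

# Simulate Poisson counts for three hypothetical genes with increasing means.
means <- c(10, 100, 1000)
n_samples <- 5000
counts <- sapply(means, function(mu) rpois(n_samples, lambda = mu))

# On the raw scale the variance grows with the mean ...
raw_var <- apply(counts, 2, var)

# ... whereas after a square-root transformation it is roughly constant
# (close to 1/4 for Poisson data), largely independent of the mean.
stabilized_var <- apply(sqrt(counts), 2, var)

print(round(rbind(mean = means,
                  raw_var = raw_var,
                  stabilized_var = stabilized_var), 2))
```

A Gaussian likelihood with a single noise variance per feature is a much more reasonable assumption on the transformed scale than on the raw scale.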

2.2 Example 2: Sparse single cell count data

The simplest (and hence widely applied) pre-processing uses total sum scaling for size factor correction followed by a shifted logarithmic transformation, \(x_{ij} = \log(K_{ij}/\sum_{j'}K_{ij'} + a)\). This is, for example, the default procedure in Seurat:

# see ?NormalizeData
library(Seurat)
data("pbmc_small")
# returns a Seurat object that holds the log-normalized data
mefisto_input <- NormalizeData(pbmc_small, normalization.method = "LogNormalize", scale.factor = 10000)

For Python users, similar pre-processing steps are implemented for example in scanpy.
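
The shifted logarithm from the formula above can be written out in a few lines of base R. This is a minimal sketch on a toy matrix, assuming that rows are cells and columns are features to match the index order of the formula (Seurat itself stores features in rows). The helper `shifted_log` and its `scale_factor` argument are hypothetical and introduced only for illustration; `scale_factor = 1` gives the formula as written, while `a = 1` with `scale_factor = 10000` corresponds to the behaviour documented for "LogNormalize".

```r
# Toy count matrix: rows i are cells, columns j are features (hypothetical data).
K <- matrix(
  c(0, 3, 7,
    2, 0, 18,
    5, 5, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("cell", 1:3), paste0("feature", 1:3))
)

# Hypothetical helper: total sum scaling followed by a shifted logarithm.
shifted_log <- function(K, a = 1, scale_factor = 1) {
  # Dividing a matrix by a vector of length nrow(K) divides each row
  # by the corresponding cell total.
  proportions <- K / rowSums(K)
  log(scale_factor * proportions + a)
}

x <- shifted_log(K, a = 1, scale_factor = 10000)
```

Zero counts are mapped to zero when `a = 1`, which keeps the transformed matrix sparse.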

Pearson residuals from regularized negative binomial regression as proposed in sctransform

# see ?SCTransform
library(Seurat)
data("pbmc_small")
mefisto_input <- SCTransform(object = pbmc_small)

Deviance or Pearson residuals under a multinomial model as proposed by Townes et al.
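
As a rough illustration of residual-based transformations, the following base R sketch computes Pearson residuals under a null model in which all cells share the same feature proportions, using a binomial approximation to the multinomial. It is a simplified illustration under these assumptions and not the reference implementation by Townes et al.; the toy matrix and the helper `null_pearson_residuals` are hypothetical.

```r
# Toy UMI count matrix: rows i are cells, columns j are features (hypothetical data).
K <- matrix(
  c(4, 0, 6,
    10, 2, 8,
    1, 3, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("cell", 1:3), paste0("feature", 1:3))
)

# Hypothetical helper: Pearson residuals under a null model with
# constant feature proportions across cells (binomial approximation).
null_pearson_residuals <- function(K) {
  n <- rowSums(K)             # total counts per cell
  p <- colSums(K) / sum(K)    # pooled feature proportions
  mu <- outer(n, p)           # expected counts n_i * p_j
  # Binomial variance n_i * p_j * (1 - p_j), written as mu - mu^2 / n_i
  (K - mu) / sqrt(mu - mu^2 / n)
}

residuals_null <- null_pearson_residuals(K)
```

Features whose counts deviate strongly from what sequencing depth alone would predict receive large absolute residuals, and the residual matrix can then be used as Gaussian input.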

2.3 Example 3: Microbiome data

In principle, similar pre-processing can be used as for RNA-seq data. A detailed tutorial can be found here. Alternatively, a centered log-ratio (CLR) transformation can be applied, for example as implemented in gemelli.
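
For illustration, a centered log-ratio transformation can be sketched in base R as follows. This is a minimal version that assumes a pseudocount is used to handle zeros; the robust variant in gemelli treats zeros differently and is not reproduced here. The toy matrix and the helper `clr_transform` are hypothetical.

```r
# Toy count matrix: rows are samples, columns are taxa (hypothetical data).
counts <- matrix(
  c(10, 0, 90,
    5, 15, 30,
    1, 1, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("sample", 1:3), paste0("taxon", 1:3))
)

# Hypothetical helper: centered log-ratio with a pseudocount.
clr_transform <- function(counts, pseudocount = 1) {
  log_counts <- log(counts + pseudocount)
  # Subtracting the per-sample mean of the logs is equivalent to dividing
  # by the per-sample geometric mean before taking the log.
  log_counts - rowMeans(log_counts)
}

clr_values <- clr_transform(counts)
```

By construction the transformed values of each sample sum to zero, so the result no longer depends on the total count of the sample.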

3 Additional optional steps in the preprocessing

Once the mean-variance relationship has been removed from the data, further filtering steps can be useful, such as:

filtering to the top N most variable features (can reduce memory requirements and computation time)
filtering to temporally or spatially variable features (e.g. using maSigPro or spatialDE) (useful if mainly smooth sources of variation or alignment are of interest)
removal of known covariates, e.g. by using the residuals from a regression model as input to MEFISTO
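
Two of the steps listed above, filtering to the most variable features and regressing out a known covariate, can be sketched in base R as follows. This is a minimal illustration on simulated data; the matrix `Y`, the covariate `batch` and the choice of `top_n` are hypothetical.

```r
set.seed(1)

# Hypothetical transformed data: rows are features, columns are samples.
n_features <- 50
n_samples <- 20
batch <- rep(c(0, 1), each = n_samples / 2)   # hypothetical known covariate
Y <- matrix(rnorm(n_features * n_samples), nrow = n_features,
            dimnames = list(paste0("feature", seq_len(n_features)),
                            paste0("sample", seq_len(n_samples))))
# Add a batch effect to the first 10 features
Y[1:10, batch == 1] <- Y[1:10, batch == 1] + 3

# 1) Keep the top N most variable features.
top_n <- 10
feature_var <- apply(Y, 1, var)
top_features <- names(sort(feature_var, decreasing = TRUE))[seq_len(top_n)]
Y_top <- Y[top_features, , drop = FALSE]

# 2) Regress out the known covariate. lm() fits one model per column of a
#    matrix response, so features are moved to the columns and back again.
fit <- lm(t(Y_top) ~ batch)
Y_resid <- t(residuals(fit))
```

The residual matrix `Y_resid` has the same layout as `Y_top` and is uncorrelated with `batch`, so it could be passed on as input in place of the original values.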

4 Detailed use case

Detailed examples of the pre-processing used in the MEFISTO applications of the manuscript can be found here:

Log-transformation for spatial transcriptomics data is integrated in our R tutorial on spatial transcriptomics data or in the corresponding Python tutorial
Variance stabilization for RNA-seq data in the evodevo atlas
RCLR transformation for microbiome data

5 SessionInfo

sessionInfo()

## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS:   /usr/lib64/libblas.so.3.4.2
## LAPACK: /usr/lib64/liblapack.so.3.4.2
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] SeuratObject_4.0.0          Seurat_4.0.0               
##  [3] DESeq2_1.28.1               SummarizedExperiment_1.18.1
##  [5] DelayedArray_0.14.0         matrixStats_0.56.0         
##  [7] Biobase_2.48.0              GenomicRanges_1.40.0       
##  [9] GenomeInfoDb_1.24.2         IRanges_2.22.2             
## [11] S4Vectors_0.26.1            BiocGenerics_0.34.0        
## [13] BiocStyle_2.16.0           
## 
## loaded via a namespace (and not attached):
##   [1] Rtsne_0.15             colorspace_2.0-0       deldir_0.1-25         
##   [4] ellipsis_0.3.1         ggridges_0.5.2         XVector_0.28.0        
##   [7] spatstat.data_1.4-3    leiden_0.3.7           listenv_0.8.0         
##  [10] ggrepel_0.8.2          bit64_0.9-7            AnnotationDbi_1.50.1  
##  [13] codetools_0.2-16       splines_4.0.0          geneplotter_1.66.0    
##  [16] knitr_1.29             polyclip_1.10-0        jsonlite_1.7.0        
##  [19] annotate_1.66.0        ica_1.0-2              cluster_2.1.0         
##  [22] png_0.1-7              uwot_0.1.10            shiny_1.5.0           
##  [25] sctransform_0.3.2      BiocManager_1.30.10    compiler_4.0.0        
##  [28] httr_1.4.1             lazyeval_0.2.2         Matrix_1.2-18         
##  [31] fastmap_1.0.1          later_1.1.0.1          htmltools_0.5.0       
##  [34] tools_4.0.0            igraph_1.2.6           gtable_0.3.0          
##  [37] glue_1.4.2             GenomeInfoDbData_1.2.3 RANN_2.6.1            
##  [40] reshape2_1.4.4         dplyr_1.0.0            spatstat_1.64-1       
##  [43] Rcpp_1.0.5             scattermore_0.7        vctrs_0.3.1           
##  [46] nlme_3.1-148           lmtest_0.9-37          xfun_0.15             
##  [49] stringr_1.4.0          globals_0.12.5         mime_0.9              
##  [52] miniUI_0.1.1.1         lifecycle_0.2.0        irlba_2.3.3           
##  [55] goftest_1.2-2          XML_3.99-0.4           future_1.17.0         
##  [58] zlibbioc_1.34.0        MASS_7.3-51.6          zoo_1.8-8             
##  [61] scales_1.1.1           spatstat.utils_1.17-0  promises_1.1.1        
##  [64] RColorBrewer_1.1-2     yaml_2.2.1             memoise_1.1.0         
##  [67] reticulate_1.16        pbapply_1.4-2          gridExtra_2.3         
##  [70] ggplot2_3.3.2          rpart_4.1-15           stringi_1.5.3         
##  [73] RSQLite_2.2.0          genefilter_1.70.0      BiocParallel_1.22.0   
##  [76] rlang_0.4.9            pkgconfig_2.0.3        bitops_1.0-6          
##  [79] evaluate_0.14          lattice_0.20-41        tensor_1.5            
##  [82] ROCR_1.0-11            purrr_0.3.4            htmlwidgets_1.5.1     
##  [85] patchwork_1.0.1        cowplot_1.0.0          bit_1.1-15.2          
##  [88] tidyselect_1.1.0       RcppAnnoy_0.0.18       plyr_1.8.6            
##  [91] magrittr_2.0.1         bookdown_0.20          R6_2.5.0              
##  [94] generics_0.1.0         DBI_1.1.0              mgcv_1.8-31           
##  [97] pillar_1.4.4           fitdistrplus_1.1-1     abind_1.4-5           
## [100] survival_3.2-3         RCurl_1.98-1.2         tibble_3.0.2          
## [103] future.apply_1.6.0     crayon_1.3.4           KernSmooth_2.23-17    
## [106] plotly_4.9.2.1         rmarkdown_2.7          locfit_1.5-9.4        
## [109] grid_4.0.0             data.table_1.12.8      blob_1.2.1            
## [112] digest_0.6.27          xtable_1.8-4           tidyr_1.1.0           
## [115] httpuv_1.5.4           munsell_0.5.0          viridisLite_0.3.0

Notes on data pre-processing for the use of MEFISTO

2021-04-19