Contents

First we need to download the data. We provide a subset of the full raw data. It is about 2.6 GB in size. Please set your working directory to where you stored this vignette and make sure that enough disk space is available.

workdir <- "/home/max/code/single_cell_course/" # Change to directory where this vignette is stored
knitr::opts_knit$set(root.dir = workdir)
#system("curl ...")
#system("unzip gastrulation_data.zip -d gastrulation_data")

Load all necessary libraries.

library(data.table)
library(dplyr)
library(Seurat)
library(ggplot2)
library(MOFA2)
# Define stage colors
stage_colors_acc <- c(
  "E4.5"="#eff3ff",
  "E5.5"="#9ecae1",
  "E6.5"="#3182bd",
  "E7.5"="#08519c"
)
stage_colors_met <- c(
  "E4.5"="#FDCC8A",
  "E5.5"="#FC8D59",
  "E6.5"="#E34A33",
  "E7.5"="#600707"
)

stage_colors_rna <- c(
  E4.5="#B2E2E2", 
  E5.5="#66C2A4", 
  E6.5="#2CA25F", 
  E7.5="#006D2C"
)

1 Introduction

As you have heard, scNMT-seq allows the simultaneous profiling of transcriptome, methylome and chromatin accessibility in a single cell. We will now see how such a dataset can be analyzed in practice.

In particular we will be working with the data from the following study: Multi-omics profiling of mouse gastrulation at single-cell resolution. This study sequenced mouse embryonic stem cells as they undergo the process of gastrulation, during which the three primary germ layers are formed.

Later on we will use MOFA to analyze the three layers together, but in this tutorial we just want to get an overview of each layer separately. This should make it clear how the data of each layer looks like and how each of the layers changes over the course of mouse development. We will also reproduce some of the panels in Figure 1 of the paper. Note that the plots will not exactly match the figures since we are using slightly different tools and preprocessing strategies.

2 Data analysis

2.1 Metadata

The first step in every single-cell experiment is to get the data into a usable format and to do quality control. Since these steps can often be quite resource intensive, we already provide preprocessed data for you. If you are interested in the preprocessing you can have a look at the full analysis folder of the paper. You can also find links to the full data there. Here we provide you with a metadata file that will give an overview of the quality control and links samples across the different layers. Please have a look at the metadata table.

metadata <- fread("gastrulation_data/sample_metadata_filtered_clean.txt")
metadata
##                        sample pass_rnaQC pass_metQC pass_accQC stage
##    1: E4.5-5.5_new_Plate3_E09       TRUE       TRUE       TRUE  E4.5
##    2: E4.5-5.5_new_Plate3_G09       TRUE      FALSE      FALSE  E4.5
##    3: E4.5-5.5_new_Plate3_H09       TRUE       TRUE       TRUE  E4.5
##    4: E4.5-5.5_new_Plate3_B09       TRUE      FALSE      FALSE  E4.5
##    5: E4.5-5.5_new_Plate4_G01       TRUE       TRUE       TRUE  E4.5
##   ---                                                               
## 2174:        PS_VE_Plate6_D10       TRUE         NA         NA  E7.5
## 2175:        PS_VE_Plate6_F10       TRUE         NA         NA  E7.5
## 2176:        PS_VE_Plate6_A10       TRUE         NA         NA  E7.5
## 2177:        PS_VE_Plate6_B10       TRUE         NA         NA  E7.5
## 2178:        PS_VE_Plate6_H10       TRUE         NA         NA  E7.5
##                lineage         stage_lineage
##    1:         Epiblast         E4.5_Epiblast
##    2:         Epiblast         E4.5_Epiblast
##    3:         Epiblast         E4.5_Epiblast
##    4:         Epiblast         E4.5_Epiblast
##    5:         Epiblast         E4.5_Epiblast
##   ---                                       
## 2174:         Endoderm         E7.5_Endoderm
## 2175: Primitive_Streak E7.5_Primitive_Streak
## 2176:         Mesoderm         E7.5_Mesoderm
## 2177: Primitive_Streak E7.5_Primitive_Streak
## 2178:         Mesoderm         E7.5_Mesoderm

Note that every cell already has annotations for the lineage it belongs to and a stage column. The stage column corresponds to the embryonic day the cells were sequenced, whereas the lineage labels come from a mapping to an extensive (100.000 cells) 10X single-cell RNA-seq atlas that annotated these celltypes. For more informations on the mapping please have a look in the methods section of the paper and also check out the mouse gastrulation atlas here.

2.2 RNA layer

2.2.1 Preprocessing

The RNA data is stored in a seurat object that we will use for further analysis. This section will be a quick recap of things you learned yesterday.

rna <- readRDS("gastrulation_data/rna/seurat_object.rds")
rna
## An object of class Seurat 
## 18345 features across 2147 samples within 1 assay 
## Active assay: RNA (18345 features, 0 variable features)

We use Seurat to identify highly variable genes.

Let’s make sure all our cells pass the

# Add the metadata table to the seurat object
rna <- AddMetaData(rna, metadata = data.frame(metadata, row.names = metadata$sample))
# Make sure only cells that pass the qc are used
rna <- rna[, rna$pass_rnaQC == TRUE]
# Log t
rna <- NormalizeData(rna)
rna <- FindVariableFeatures(rna, nfeatures = 1000)

VariableFeaturePlot(rna)
## Warning: Transformation introduced infinite values in continuous x-axis

We will also center and scale the data to remove library size effects.

all.genes <- rownames(rna)
rna <- ScaleData(rna, features = all.genes)
## Centering and scaling data matrix

2.2.2 Dimensionality reduction

Now we can perform dimensionality reduction by PCA followed by UMAP.

rna <- RunPCA(rna, features = VariableFeatures(object = rna))

DimPlot(rna, reduction = "pca", group.by = "stage")

The first 2 principal components seem to capture some variation in the latest stage. But there is still a lot of variation not explained by the first 2 components.

Question: How would you determine how many PC’s to use for further analysis?

Answer There are multiple ways to determine the dimensionality of a dataset. The quickest one is to plot the percentage of variance explained by each PC.

ElbowPlot(rna)

Let’s run umap to vizualize the dataset.

rna <- RunUMAP(rna, dims = 1:10, n.neighbors = 20, min.dist = 0.7)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 19:10:25 UMAP embedding parameters a = 0.3208 b = 1.563
## 19:10:25 Read 2147 rows and found 10 numeric columns
## 19:10:25 Using Annoy for neighbor search, n_neighbors = 20
## 19:10:25 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 19:10:26 Writing NN index file to temp file /tmp/RtmpKrjUjJ/file238bf46c7ce983
## 19:10:26 Searching Annoy index using 1 thread, search_k = 2000
## 19:10:27 Annoy recall = 100%
## 19:10:27 Commencing smooth kNN distance calibration using 1 thread
## 19:10:28 Initializing from normalized Laplacian + noise
## 19:10:29 Commencing optimization for 500 epochs, with 55212 positive edges
## 19:10:34 Optimization finished
DimPlot(rna, reduction = "umap", group.by = "stage") + scale_color_manual(values = stage_colors_rna)

The rna expression clearly separates the stages. Since we used different analysis tools than the authors of the paper our plot looks slightly different to Figure 1b.

Question: Can you visualize the mapped lineages from the 10X atlas on the UMAP? Are there any preliminary insights you can draw?

Answer

ElbowPlot(rna)

DimPlot(rna, reduction = "umap", group.by = "stage", shape.by = "lineage")+ 
  scale_color_manual(values = stage_colors_rna)

2.3 DNA methylation and accessibility layers

The DNA methylation and accessibility data looks quite a bit different to the count matrices we see for RNA expression data. Again we will not perform the preprocessing here, but to give you an idea we will describe how the different steps work. The sequencing technique used is called Bisulfite sequencing. Briefly, bisulfite sequencing exploits the fact that bisulfite converts cytosine residues in the DNA to uracil, but only if they are unmethylated. This converts the epigenetic mark into a sequencable readout. Additionally, a technique called NoMe seq is used to gain accessibility information. Here an external Methyltransferase is introduced into cells, methylating accessible GpC sites. This signal can then also be picked up by Bisulfite sequencing.

Since the alignment to the genome is complicated by the fact that some cytosines have to be converted, a special aligner (in this case Bismarck) has to be used. The tool then calls methylation at CpG and GpC sites in all cells. Here is an example of a typical raw dataset of a single cell (The accessibility data has the same format).

2.3.1 Preprocessing

met_raw <- fread("gastrulation_data/met/cpg_level/E4.5-5.5_new_Plate1_A03.tsv.gz")
met_raw
##         chr      pos met_reads nonmet_reads rate
##      1:   1  3020972         2            0    1
##      2:   1  3020988         2            0    1
##      3:   1  3022017         1            0    1
##      4:   1  3022571         1            0    1
##      5:   1  3022581         1            0    1
##     ---                                         
## 467734:   Y 90810708         0            3    0
## 467735:   Y 90810761         0            1    0
## 467736:   Y 90811542         2            0    1
## 467737:   Y 90812328         1            0    1
## 467738:   Y 90812376         1            0    1

In the paper a descision was taken to binarize the data (i.e. representing each CpG/GpC site as methylated or unmethylated).

Question: Take a look at the distribution of positive and negative reads. Can you see why the rate was binarized?

Answer

A single cell should only have one methylation state at each CpG site (different methylation on the 2 alleles is rare). Thus any intermediate state likely comes from sequencing errors. In the histrogram we see that intermediate states are very rare.

hist(met_raw[, met_reads/(met_reads + nonmet_reads)])

Instead of our usual gene x cells count matrix, we are now working with a binary cell x CpG/GpC matrix for accessibility and methylation respectively. Additionally this matrix is extremely sparse. Have a look at Supplementary Figure 1 from the paper. On average less than 1% of CpG sites are covered in any given cell. This means that it is very tough to make accurate assessments for single sites. For this reason it is often more useful to aggregate methylation and accessibility information over known regulatory regions. You can find some aggregated files in gastrulation_data/met/feature_level/. Let’s have a look at how this looks for gene promoters. We provide a dataset of promoter regions that extend 2000 bp to each side of the transcription start sites of genes.

# Read promoter methylation
met_prom <- fread("gastrulation_data/met/feature_level/prom_2000_2000_clean.tsv.gz")
head(met_prom)
##                     sample                 id  N rate
## 1: E4.5-5.5_new_Plate1_A02 ENSMUSG00000000001  3   33
## 2: E4.5-5.5_new_Plate1_A02 ENSMUSG00000000028 29    7
## 3: E4.5-5.5_new_Plate1_A02 ENSMUSG00000000037  1  100
## 4: E4.5-5.5_new_Plate1_A02 ENSMUSG00000000049 22    5
## 5: E4.5-5.5_new_Plate1_A02 ENSMUSG00000000058  5    0
## 6: E4.5-5.5_new_Plate1_A02 ENSMUSG00000000078  6   67

In this table sample corresponds to the cell_id, id is the identifier for the genomic region, N is the number of CpG sites that the signal is based on and rate is the mean methylation rate. Let’s see how aggregating the individual signals affects the sparsity of the signal.

hist(met_prom$N)

It is always important to check the distribution of our data. Let’s see how the aggregated rates are distributed.

hist(met_prom$rate)

This data is definitely not normally distributed. Analogous to the log-transfromation for the RNA expression data we should find a transformation that brings our data close to normal.

Here we will be working with M-values, which are calculated as log2(((mean_rate/100)+0.01)/(1-(mean_rate/100)+0.01)).

Let’s plot the distribution of the rate value and the m-value for the DNAse hypersensitivity sites. Can you see why this transformation is used? For more information see here.

met_prom[,m:=log2(((rate/100)+0.01)/(1-(rate/100)+0.01))]

hist(met_prom$m)

While this is still not normally distributed, it is much closer than before. In an ideal scenario we would use tools that can directly model the binomial nature of the data, but normalizing is common practice in many workflows.

2.3.2 Promoter methylation and gene expression

Promoter methylation is known to silence the expression of it’s gene. Let’s check if that is also the case in this dataset. We will calculate the correlation between gene expression and promoter methylation for each gene.

# Make sure cells pass QC
cells <- metadata[pass_metQC == TRUE, sample]
met_prom <- met_prom[sample %in% cells]

# Filter by coverage
met_prom <- met_prom %>% .[,N:=.N, by="id"] %>% .[N>=50] %>% .[,N:=NULL]

# Get rna data in long format
rna_dt <- GetAssayData(rna) %>%
  as.data.table(keep.rownames = "id") %>%
  melt(id.vars = "id", variable.name = "sample", value.name = "expr")

# Merge met and rna information
metrna_dt <- merge(met_prom, rna_dt, by = c("id", "sample"))

# Compute correlations and test for significance
cor_met <- metrna_dt[, .(V1 = unlist(cor.test(m, expr)[c("estimate", "statistic", "p.value")])), by = "id"] %>%
  .[, para := rep(c("r","t","p"), .N/3)] %>% data.table::dcast(id ~ para, value.var = "V1") %>%
  .[, c("padj_fdr", "padj_bonf") := list(p.adjust(p, method="fdr"), p.adjust(p, method="bonferroni"))] %>%
  .[, c("log_padj_fdr","log_padj_bonf") := list(-log10(padj_fdr), -log10(padj_bonf))] %>%
  .[, sig := padj_fdr <= 0.1] %>%  setorder(padj_fdr) %>% .[!is.na(p)]

p1 <- hist(cor_met$p, breaks = 50)

p2 <- plot(cor_met$r, cor_met$log_padj_fdr)

Question: How do you interpret this p-value histogram? Would you say this is well-behaved? What would you expect if none of the genes were anticorrelated to promoter methylation? What if almost all of the genes were?

Answer

P-values are uniformly distributed between 0 and 1 if the null hypothesis is true. In a dataset where some tests come from the null and some from the alternative hypothesis we will observe a mixture of a uniform distribution and p-values close to 0. We can clearly see that most of our correlations are not significant, while a subset is.

2.3.3 Promoter acessibility and gene expression

We also have accessibility information for promoters. We will perform the same analysis, with a few modifications: - We are using windows of 200 bp around the TSS. This gives better results since our coverage is higher for acessibility. - We require 5 GpC sites to be detected in each cell and window. Again, we can do this since we have higher coverage.

# Read promoter accessibility
acc_prom <- fread("gastrulation_data/acc/feature_level/prom_200_200_clean.tsv.gz")

# Calculate m values
acc_prom[,m:=log2(((rate/100)+0.01)/(1-(rate/100)+0.01))]

cells <- metadata[pass_accQC == TRUE, sample]
acc_prom <- acc_prom[sample %in% cells]

# Filter for minimum number of GpC sites
acc_prom <- acc_prom[N >= 5]

# Filter by coverage
acc_prom <- acc_prom %>% .[,N:=.N, by="id"] %>% .[N>=50] %>% .[,N:=NULL]

accrna_dt <- merge(acc_prom, rna_dt, by = c("id", "sample"))

cor_acc <- accrna_dt[, .(V1 = unlist(cor.test(m, expr)[c("estimate", "statistic", "p.value")])), by = "id"] %>%
  .[, para := rep(c("r","t","p"), .N/3)] %>% data.table::dcast(id ~ para, value.var = "V1") %>%
  .[, c("padj_fdr", "padj_bonf") := list(p.adjust(p, method="fdr"), p.adjust(p, method="bonferroni"))] %>%
  .[, c("log_padj_fdr","log_padj_bonf") := list(-log10(padj_fdr), -log10(padj_bonf))] %>%
  .[, sig := padj_fdr <= 0.1] %>%  setorder(padj_fdr) %>% .[!is.na(p)]

p1 <- hist(cor_acc$p, breaks = 50)

p2 <- plot(cor_acc$r, cor_acc$log_padj_fdr)

Now we can reproduce Panel g of Figure 1.

plot_dt <- merge(cor_met, cor_acc, by = "id", suffixes = c("_met", "_acc"))
ggplot(plot_dt, aes(x = r_met, y = r_acc, color = sig_met & sig_acc)) +
  geom_point(size = 1, aes(alpha = sig_met & sig_acc)) + 
  geom_hline(yintercept = 0, color = "orange", linewidth = 1) +
  geom_vline(xintercept = 0, color = "orange", linewidth = 1) +
  scale_color_manual(values = c("black", "red")) +
  theme_classic()
## Warning: Ignoring unknown parameters: linewidth

## Warning: Ignoring unknown parameters: linewidth
## Warning: Using alpha for a discrete variable is not advised.

2.3.4 Visualization of individual genes

We will now look at a specific example gene in more detail. Specifically we will explore the promoter epigenetics of Dppa4 in relation to its expression.

gene_name <- "ENSMUSG00000058550"

# Merge together all 3 layers and metadata
all_layers <- rna_dt[id == gene_name] %>%
  #.[, rna_expression := logcounts(sce)[gene_name,][id_rna]] %>%
  merge(acc_prom[id == gene_name, c("sample", "rate")], by = c("sample"), all.x = T) %>%
  merge(met_prom[id == gene_name, c("sample", "rate")], by = c("sample"), all.x = T, suffixes = c("_acc","_met")) %>%
  merge(metadata, by = "sample")
  
plot_dt <- melt(all_layers, id.vars = c("sample", "stage"),
                measure.vars = c("expr", "rate_acc", "rate_met"),
                variable.name = "Layer")
## Warning in melt.data.table(all_layers, id.vars = c("sample", "stage"),
## measure.vars = c("expr", : 'measure.vars' [expr, rate_acc, rate_met] are not all
## of the same type. By order of hierarchy, the molten data value column will be of
## type 'double'. All measure variables not of type 'double' will be coerced too.
## Check DETAILS in ?melt.data.table for more on coercion.
plot_dt %>%
  ggplot(aes(x = stage, y = value, color = Layer, fill = Layer)) +
  geom_jitter() +
  geom_violin(alpha = 0.3, color = "black") +
  geom_boxplot(alpha = 0.3, width = 0.1, color = "black") +
  facet_grid(Layer~., scales = "free_y") +
  theme_classic() +
  theme(legend.position = "none")
## Warning: Removed 3913 rows containing non-finite values (stat_ydensity).
## Warning: Removed 3913 rows containing non-finite values (stat_boxplot).
## Warning: Removed 3913 rows containing missing values (geom_point).

We approximately reproduced panel 1h from the paper. Feel free to explore more genes on your own!

2.3.5 Dimensionality reduction of methylome

Now we will attempt to do dimensionality reduction with the methylome layer, to see if we get similar results to the UMAP of the RNA layer. As in the paper we will use DNAse hypersensitivity sites for this.

met_dnase <- fread("gastrulation_data/met/feature_level/ESC_DHS_clean.tsv.gz")
# Keep only cells that pass the QC
cells <- metadata[pass_metQC == TRUE, sample]
met_dnase <- met_dnase[sample %in% cells]
hist(met_dnase$N)

Since the Signal is so sparse, we need a robust way of doing dimensionality reduction that can handle missing values. Linear Bayesian Factor analysis (which is the basis for MOFA) will work well in these scenarios. However some preprocessing of the data has to be done first. This is similar to the preprocessing of RNAseq data.

# Calculate M value from Beta value 
met_dnase[,m:=log2(((rate/100)+0.01)/(1-(rate/100)+0.01))]

The remaining preprocessing steps are: - We transform the rate value into m values to make them approximately normally distributed. - We filter out DNAse hypersensitivity sites that are covered by less than 10% of cells - We select only the 5000 most highly variable sites

min_coverage = 0.10
n_hv = 5000

# Filter features by coverage
nsamples <- length(unique(met_dnase$sample))
met_dnase <- met_dnase %>% .[,cov:=.N/nsamples,by=id] %>%
  .[cov>=min_coverage] %>% .[,c("cov"):=NULL]

# Keep only highly variable sites
keep_hv_sites <- met_dnase %>%
  .[,.(var = var(rate)), by="id"] %>% 
  .[var>0] %>% setorder(-var) %>%
  head(n = n_hv) %>% .$id

met_dnase <- met_dnase[id %in% keep_hv_sites]

Now we can transform the data into matrix format to run dimensionality reduction.

met_mat <- dcast(met_dnase, formula = sample~id, value.var = "m") %>%
  tibble::column_to_rownames("sample") %>%
  as.matrix %>% t

met_mat[1:5, 1:5]
##                E4.5-5.5_new_Plate1_A02 E4.5-5.5_new_Plate1_A03
## ESC_DHS_100007                      NA                      NA
## ESC_DHS_100069                      NA               -6.658211
## ESC_DHS_10008                -6.658211                      NA
## ESC_DHS_100123                6.658211                      NA
## ESC_DHS_100129                      NA                      NA
##                E4.5-5.5_new_Plate1_A04 E4.5-5.5_new_Plate1_A07
## ESC_DHS_100007                      NA                      NA
## ESC_DHS_100069                      NA                      NA
## ESC_DHS_10008                       NA                      NA
## ESC_DHS_100123                      NA                      NA
## ESC_DHS_100129                      NA                6.658211
##                E4.5-5.5_new_Plate1_A08
## ESC_DHS_100007                      NA
## ESC_DHS_100069                      NA
## ESC_DHS_10008                       NA
## ESC_DHS_100123                6.658211
## ESC_DHS_100129                      NA

We will use MOFA for this Factor analysis. Note that this is not yet doing any multiomics integration. It is just a convenient way of performing Factor analysis.

MOFAobject <- create_mofa(list(met_mat))
## Creating MOFA object from a list of matrices (features as rows, sample as columns)...
## No groups provided as argument... we assume that all samples are coming from the same group.
## View names are not specified in the data, using default: view_1
# Set options
ModelOptions <- get_default_model_options(MOFAobject)
ModelOptions$num_factors <- 2
 
TrainOptions <- get_default_training_options(MOFAobject)
TrainOptions$seed <- 42

# Prepare
MOFAobject <- prepare_mofa(MOFAobject,
  model_options = ModelOptions, 
  training_options = TrainOptions
)
## Checking data options...
## No data options specified, using default...
## Checking training options...
## Checking model options...
# Train the model
model <- run_mofa(MOFAobject)
## Warning in run_mofa(MOFAobject): No output filename provided. Using /tmp/mofa_20200603-191206.hdf5 to store the trained model.
## Warning in quality_control(object, verbose = verbose): Factor(s) 1 are strongly correlated with the total number of expressed features for at least one of your omics. Such factors appear when there are differences in the total 'levels' between your samples, *sometimes* because of poor normalisation in the preprocessing steps.

Let’s visualize the Factors.

metadata_plot <- metadata %>% setkey(sample) %>%
  .[samples_names(model)]
p <- plot_factors(
  model, 
  factors = c(1,2),
  color_by = metadata_plot$stage
  )
p + scale_color_manual(values = stage_colors_met)

Great! The methylation rate within DNAse hypersensitive sites is clearly distinguishing cells of different stages. This means there is some regulation that we can further explore. We were also able to get similar results to Figure 1c from the paper.

2.3.6 Dimensionality reduction of methylome

acc_dnase <- fread("gastrulation_data/acc/feature_level/ESC_DHS_clean.tsv.gz")
# Keep only cells that pass the QC
cells <- metadata[pass_accQC == TRUE, sample]
acc_dnase <- acc_dnase[sample %in% cells]
head(acc_dnase)
##                     sample             id N rate
## 1: E4.5-5.5_new_Plate1_A02     ESC_DHS_10 8   75
## 2: E4.5-5.5_new_Plate1_A02    ESC_DHS_100 6  100
## 3: E4.5-5.5_new_Plate1_A02 ESC_DHS_100003 8   88
## 4: E4.5-5.5_new_Plate1_A02 ESC_DHS_100012 2   50
## 5: E4.5-5.5_new_Plate1_A02 ESC_DHS_100013 7   29
## 6: E4.5-5.5_new_Plate1_A02 ESC_DHS_100014 4  100

We run the same preprocessing with slightly different parameters that were empirically found to work well. Feel free to explore how changing these influences the result.

min_coverage = 0.10
n_hv = 10000
min_GpC = 5

# Filter for minimum number of GpC sites
acc_dnase <- acc_dnase[N >= min_GpC]

# Calculate M value from Beta value 
acc_dnase[,m:=log2(((rate/100)+0.01)/(1-(rate/100)+0.01))]

# Filter features by coverage
nsamples <- length(unique(acc_dnase$sample))
acc_dnase <- acc_dnase[,cov:=.N/nsamples,by=id] %>%
  .[cov>=min_coverage] %>% .[,c("cov"):=NULL]

# Keep only highly variable sites
keep_hv_sites <- acc_dnase %>%
  .[,.(var = var(rate)), by="id"] %>% 
  .[var>0] %>% setorder(-var) %>%
  head(n = n_hv) %>% .$id

acc_dnase <- acc_dnase[id %in% keep_hv_sites]
acc_dnase
##                           sample             id  N rate          m
##       1: E4.5-5.5_new_Plate1_A02 ESC_DHS_100055  5   60  0.5731853
##       2: E4.5-5.5_new_Plate1_A02 ESC_DHS_100153 10   30 -1.1955508
##       3: E4.5-5.5_new_Plate1_A02 ESC_DHS_100312  5    0 -6.6582115
##       4: E4.5-5.5_new_Plate1_A02 ESC_DHS_100349  6  100  6.6582115
##       5: E4.5-5.5_new_Plate1_A02 ESC_DHS_100366  8   88  2.7752937
##      ---                                                          
## 1257541:        PS_VE_Plate9_H08  ESC_DHS_99172  8    0 -6.6582115
## 1257542:        PS_VE_Plate9_H08  ESC_DHS_99278  9    0 -6.6582115
## 1257543:        PS_VE_Plate9_H08  ESC_DHS_99281  6   17 -2.2223924
## 1257544:        PS_VE_Plate9_H08  ESC_DHS_99692  9   89  2.9068906
## 1257545:        PS_VE_Plate9_H08  ESC_DHS_99940  5  100  6.6582115
acc_mat <- dcast(acc_dnase, formula = sample~id, value.var = "m") %>%
  tibble::column_to_rownames("sample") %>%
  as.matrix %>% t

acc_mat[1:5, 1:5]
##                E4.5-5.5_new_Plate1_A02 E4.5-5.5_new_Plate1_A04
## ESC_DHS_1                           NA                      NA
## ESC_DHS_100040                      NA                      NA
## ESC_DHS_100055               0.5731853                      NA
## ESC_DHS_100098                      NA               6.6582115
## ESC_DHS_100110                      NA              -0.5731853
##                E4.5-5.5_new_Plate1_A07 E4.5-5.5_new_Plate1_A08
## ESC_DHS_1                           NA               6.6582115
## ESC_DHS_100040               -2.775294                      NA
## ESC_DHS_100055                      NA               0.6918777
## ESC_DHS_100098                      NA               2.5360529
## ESC_DHS_100110                6.658211               6.6582115
##                E4.5-5.5_new_Plate1_A12
## ESC_DHS_1                           NA
## ESC_DHS_100040                1.947533
## ESC_DHS_100055                      NA
## ESC_DHS_100098                      NA
## ESC_DHS_100110                      NA

Now we can create a mofa object and run the model.

MOFAobject <- create_mofa(list(acc_mat))
## Creating MOFA object from a list of matrices (features as rows, sample as columns)...
## No groups provided as argument... we assume that all samples are coming from the same group.
## View names are not specified in the data, using default: view_1
# Set options
ModelOptions <- get_default_model_options(MOFAobject)
ModelOptions$num_factors <- 2
 
TrainOptions <- get_default_training_options(MOFAobject)
TrainOptions$seed <- 42

# Prepare
MOFAobject <- prepare_mofa(MOFAobject,
  model_options = ModelOptions, 
  training_options = TrainOptions
)
## Checking data options...
## No data options specified, using default...
## Checking training options...
## Checking model options...
# Train the model
model_acc <- run_mofa(MOFAobject)
## Warning in run_mofa(MOFAobject): No output filename provided. Using /tmp/mofa_20200603-191426.hdf5 to store the trained model.
## Warning in quality_control(object, verbose = verbose): Factor(s) 1 are strongly correlated with the total number of expressed features for at least one of your omics. Such factors appear when there are differences in the total 'levels' between your samples, *sometimes* because of poor normalisation in the preprocessing steps.

Let’s visualize the Factors

metadata_plot <- metadata %>% setkey(sample) %>%
  .[samples_names(model_acc)]
p <- plot_factors(
  model_acc, 
  factors = c(1,2),
  color_by = metadata_plot$stage
  )
p + scale_color_manual(values = stage_colors_acc)

This plot is similar to Figure 1d in the paper.

Extra Question: Can you produce similar results with other dimensionality reduction methods that were introduced so far? (Note: you might have to impute the missing values for some of them to work)