About

This tutorial demonstrates the R package cpam for the analysis of time series omics data. It serves an introduction to the package and reproduces the results for the first case study presented in the accompanying manuscript by Yates et al. (2024). The second case study using Arabidopsis thaliana time series data is available here.

Data

This tutorial uses time-series RNA-seq data tracking human embryo development across multiple stages. The data are publicly available for download here. The data consist of transcript counts for the human preimplantation embryo and are described in original data manuscript by Torre et al. (2023). The experiment captures transcriptional changes across six early developmental stages: zygote, 2-cell, 4-cell, 8-cell, morula, and blastocyst. These time points span a major developmental transition - embryonic genome activation (EGA) that occurs after the 4-cell stage – and captures major transcriptional transitions in development.

Installation

You can install cpam from GitHub using:

devtools::install_github("l-a-yates/cpam")

Getting started

Load packages

library(cpam)  
library(dplyr)
library(readr)
library(tidyr)
library(stringr)
library(ggplot2)
library(purrr)

directory_path <- "torre/data/"

Load data and transcript-to-gene mapping

We load the transcript counts and the transcript-to-gene mapping. The transcript counts are stored in a matrix where the rows are the transcript IDs and the columns are the samples.

sample_names <- readr::read_tsv(
  paste0(directory_path, "human_embryo-transcript_counts.tsv"),
  n_max = 1, 
  col_names = F
) %>%
  t %>% as.vector()

counts <- readr::read_tsv(
  paste0(directory_path, "human_embryo-transcript_counts.tsv"),
  skip = 1,
  col_names = c("target_id", sample_names)
)

count_matrix <- counts %>% dplyr::select(-target_id) %>% as.matrix
rownames(count_matrix) <- counts %>% pull(target_id)

# convert to tibble (for display only)
as.data.frame(count_matrix) %>% head %>% tibble::rownames_to_column("target_id")

The transcript-to-gene mapping is a data frame with two columns, target_id and gene_id.

transcriptome <- readr::read_tsv(paste0(directory_path,"Human_embryo_isoform_transcriptome_GRCh38-isoform_classification.tsv"))

t2g <- transcriptome %>%
  select(target_id = transcript_id, gene_id) %>%
  filter(target_id %in% rownames(count_matrix))

t2g

Experimental design

The experimental design is a tibble containing at least the following columns: time and sample. The stage column is optional and used here to match sample names with the time points (here expressed in hours).


stage_to_time <- tibble(stage = c("2PN","2C","4C","8C","morula","blastocyst"),
                        time = c(0.71,1.5,2,3,4.5,5.5))

ed <-
  tibble(sample = colnames(count_matrix)) %>%
  mutate(stage = str_match(sample,"_(.*?)_")[,2]) %>%
  left_join(stage_to_time, by = "stage")

ed

Fitting the models

To fit the models, we first prepare the cpam object, then compute p-values, estimate the changepoint, and select the shape for each transcript. For this analysis, the count data are supplied as a matrix (see this case study for importing data directly from common quantification software). We keep all transcripts that have more than 20 reads in at least 3/5 of the samples.

cpo <- prepare_cpam(
  exp_design = ed,
  count_matrix = count_matrix,
  model_type = "case-only",
  filter_fun_args = list(min_reads = 20, min_prop = 3/5),
  t2g = t2g,
  num_cores = 4
)
cpo <- compute_p_values(cpo) # 00:03:37 hms
cpo <- estimate_changepoint(cpo) # 00:23:26 hms
cpo <- select_shape(cpo) #  00:37:28 hms
    

For larger transcriptomes, such as this human transciptome with many additional transcripts beyond the standard reference, the model fitting can take a long time. Here, using 4 cores, we were able to estimate \(p\)-values in 3 minutes, changepoints in 23 minutes, and shape selections in 37 minutes. The latter stages, although slower to compute, are worth the wait for the inferences they provide.

We can look at a summary of the fitted cpam object

cpo
## cpam object
## -----------
## case-only time series
## 73 samples
## 6 time points

If you run the code on your own computer, you can launch the Shiny app to visualise the results interactively using visualise(cpo).

Results

The results of the analysis are summarised using the results function. Here we use the default settings for log-fold change and p-value thresholds.



res <- results(
  cpo,
  min_lfc = 0,
  p_threshold = 0.05,
  add_counts = T,
  add_lfc = F
)

res

Using the results table, we can plot the number of differentially expressed genes at each time point.

res %>%
  pull(cp) %>%
  table %>%
  {tibble(cp = names(.), n = as.numeric(.))} %>%
  ggplot(aes(cp, n)) +
  geom_point() +
  geom_line(aes(group = 1)) +
  theme_classic()

There is an obvious peak in differentially expressed genes at the 4C stage (t = 2.0 hours), corresponding to the embryonic genome activation (EGA).

Plotting

To plot the fitted models for a specific gene, we can use the plot_cpam function.

g <- "ENSG00000141378"
plot_cpam(
  cpo,
  gene_id = g,
  facet = T, # plot each transcript on a separate facet
  remove_null = T, # do show transcripts with null trajectories
  common_y_scale = F # do not use a common y scale
) 

The facet argument can be set to FALSE to plot all the fits on the same plot.

g <- "ENSG00000141378"
plot_cpam(cpo,
          gene_id = g,
          facet = F, # plot each transcript in a common plot
          remove_null = T)

Customising the plot

If you require more customisation than the plot_cpam function allows, you can extract the model fits and plot their predictions manually. Here we make a logged version of the above plot with some extra customisations such as changepoint ticks.


# set gene
g <- "ENSG00000141378"

# set logged
logged <- T
if(logged) fun <- function(x) log(x + 0.5) else fun <- identity
scls <- if_else(logged, "free_x", "free")

# get data
obs <-
  cpo$data_long %>% 
  filter(gene_id == g) %>%
  transmute(target_id, time, counts = fun(counts_raw/norm_factor + 0.5))
cps <- cpo$changepoints %>% filter(target_id %in% unique(obs$target_id))

# set colours
tx_cols <- c(RColorBrewer::brewer.pal(7, name = "Dark2"))[-4]

# plot
plot_cpam(cpo, gene_id = g, return_fits_only = T) %>%
  map_dfr( ~ cpam:::predict_cpgam(.x, logged = logged), .id = "target_id") %>%
  ggplot(aes(time, counts)) +
  scale_x_continuous(breaks = round(sort(cpo$times), 1),
                     labels = c("0.7", "", "2.0", "3.0", "4.5", "5.5")) +
                     geom_line(aes(col = target_id)) +
                       geom_ribbon(aes(ymin = q_lo, ymax = q_hi, fill = target_id), alpha = 0.15) +
                       facet_wrap( ~ target_id, scales = scls) +
                       geom_point(data = obs, aes(time, counts, col = target_id)) +
                       scale_fill_manual(values = tx_cols, aesthetics = c("fill", "color")) + 
                       theme_classic() +
                       theme(
                         legend.position = "bottom",
                         plot.subtitle = element_text(
                           hjust = 0,
                           face = "plain",
                           margin = margin(b = 10)
                         ),
                         strip.background = element_blank(),
                         strip.text = element_blank(),
                         axis.line.x = element_line(color = "black"),
                         axis.ticks.x = element_line(),
                         panel.spacing.y = unit(1, "lines"),
                         panel.spacing.x = unit(1, "lines"),
                         axis.text.x = element_text(angle = 0, hjust = 0.5)
                       ) +
                       labs(
                         y = "logged counts",
                         x = "time (days)",
                         col = NULL,
                         fill = NULL,
                         subtitle = paste0(g)
                       ) +
                       guides(color = guide_legend(byrow = TRUE), fill = guide_legend(byrow = TRUE)) +
                       geom_point( # add ticks for changepoints
                         data = cps,
                         aes(x = cp_1se, y = -Inf),
                         shape = 17,
                         color = "grey20",
                         size = 2.5
                       )

Clusters

We can identify clusters of targets that have similar expression profiles. In the first instance, we can use the results table with the function plot_cluster

# filter to keep number of targets reasonable
res <- results(cpo, min_lfc = 2, min_count = 1000)

plot_cluster(cpo, res, changepoints = 2.0, shapes = "micv")

While the default clustering by changepoint and shape identifies meaningful patterns, this cluster contains over 4000 targets. Such a large number of targets can make it difficult to visualise differences in expression dynamics. For these cases, it is useful to be able to define custom clusters based on additional criteria.

In this example, we subset the cluster based on the speed of response by calculating the ratio of log-fold changes between the 8-cell (3.0 days) and morula (4.5 days) stages. This ratio captures how quickly genes reach their maximum expression after activation. To manage the scale of the plot, we first filter out a small number of extreme values (log-fold changes above 10), then assign the remaining targets into four sub-clusters using specific thresholds (0.6, 0.75, and 0.87). This flexible clustering approach reveals distinct expression profiles within a cluster that share the same changepoint and shape.


res_cluster <-
  res %>% 
  filter(lfc.5.5<10) %>% # remove extreme values
  filter(cp %in% 2, shape %in% "micv") %>% # select the 4C stage and the 'micv' shape
  mutate(rate = lfc.3/lfc.4.5) %>% # calculate the rate/speed
  mutate(cluster = cut(rate,breaks = c(0, 0.6, 0.75, 0.87,10), labels = paste0("cluster ",1:4)))

res_cluster

This yields 1282 targets across four clusters. Next we generate the plot data which takes a minute to compute.

cluster_plot_data <-
  res_cluster %>%
  pull(target_id) %>%
  set_names() %>%
  map_dfr(~plot_cpam(cpo, target_id = .x, return_fits_only = T) %>%
            cpam:::predict_lfc(),.id = "target_id")

Finally, we plot the clusters.


cluster_plot_data %>%
  left_join(res_cluster %>% select(target_id,speed), by = "target_id") %>%
  ggplot(aes(x = time,y = lfc, group = target_id, col = speed)) +
  geom_line(alpha = 0.2) +
  scale_x_continuous(breaks = round(sort(cpo$times),1),
                     labels = c("0.7","","2.0","3.0","4.5","5.5")) +
  scale_color_manual(values = blues9[5:8]) +
  facet_wrap(~ speed, nrow = 1) +
  theme_classic() +
  theme(legend.position = "none",
        strip.background = element_blank(),
        panel.spacing.x = unit(1.5, "lines"),
        strip.text = element_text(size = 10, family = "sans")) +
  labs(title = NULL, subtitle = NULL,
       x = "time (days)", y = "log-fold change")

For further, such as GO term enrichment, we can easily extract tables of clustered targets. To look at the numbers of targets in each cluster.

res_cluster %>% pull(cluster) %>% table
## .
## cluster 1 cluster 2 cluster 3 cluster 4 
##       318       257       237       469

To match targets with their rate and cluster.

res_cluster %>% select(target_id,rate,cluster)

To look at the targets in cluster 1.

res_cluster %>% 
  select(target_id,cluster,rate) %>%
  filter(cluster == "cluster 1") %>% 
  pull(target_id) %>% 
  head() # print first 5 targets
## [1] "ENST00000031146" "ENST00000184266" "ENST00000216756" "ENST00000225298" "ENST00000225430" "ENST00000228825"

References

Torre, Denis, Nancy J. Francoeur, Yael Kalma, Ilana Gross Carmel, Betsaida S. Melo, Gintaras Deikus, Kimaada Allette, et al. 2023. “Isoform-Resolved Transcriptome of the Human Preimplantation Embryo.” Nature Communications 14 (1). https://doi.org/10.1038/s41467-023-42558-y.
Yates, Luke A., Jazmine L. Humphreys, Michael A. Charleston, and Steven M. Smith. 2024. “Shape-Constrained, Changepoint Additive Models for Time Series Omics Data with Cpam.” BioRxiv, December. https://doi.org/10.1101/2024.12.22.630003.