This tutorial demonstrates the R package cpam
for the
analysis of time series omics data. It serves an introduction to the
package and reproduces the results for the first case study presented in
the accompanying manuscript by Yates et al. (2024). The second case study using
Arabidopsis thaliana time series data is available here.
This tutorial uses time-series RNA-seq data tracking human embryo development across multiple stages. The data are publicly available for download here. The data consist of transcript counts for the human preimplantation embryo and are described in original data manuscript by Torre et al. (2023). The experiment captures transcriptional changes across six early developmental stages: zygote, 2-cell, 4-cell, 8-cell, morula, and blastocyst. These time points span a major developmental transition - embryonic genome activation (EGA) that occurs after the 4-cell stage – and captures major transcriptional transitions in development.
We load the transcript counts and the transcript-to-gene mapping. The transcript counts are stored in a matrix where the rows are the transcript IDs and the columns are the samples.
sample_names <- readr::read_tsv(
paste0(directory_path, "human_embryo-transcript_counts.tsv"),
n_max = 1,
col_names = F
) %>%
t %>% as.vector()
counts <- readr::read_tsv(
paste0(directory_path, "human_embryo-transcript_counts.tsv"),
skip = 1,
col_names = c("target_id", sample_names)
)
count_matrix <- counts %>% dplyr::select(-target_id) %>% as.matrix
rownames(count_matrix) <- counts %>% pull(target_id)
# convert to tibble (for display only)
as.data.frame(count_matrix) %>% head %>% tibble::rownames_to_column("target_id")
The transcript-to-gene mapping is a data frame with two columns, target_id and gene_id.
transcriptome <- readr::read_tsv(paste0(directory_path,"Human_embryo_isoform_transcriptome_GRCh38-isoform_classification.tsv"))
t2g <- transcriptome %>%
select(target_id = transcript_id, gene_id) %>%
filter(target_id %in% rownames(count_matrix))
t2g
The experimental design is a tibble containing at least the following columns: time and sample. The stage column is optional and used here to match sample names with the time points (here expressed in hours).
stage_to_time <- tibble(stage = c("2PN","2C","4C","8C","morula","blastocyst"),
time = c(0.71,1.5,2,3,4.5,5.5))
ed <-
tibble(sample = colnames(count_matrix)) %>%
mutate(stage = str_match(sample,"_(.*?)_")[,2]) %>%
left_join(stage_to_time, by = "stage")
ed
To fit the models, we first prepare the cpam
object,
then compute p-values, estimate the changepoint, and select the shape
for each transcript. For this analysis, the count data are supplied as a
matrix (see this case
study for importing data directly from common quantification
software). We keep all transcripts that have more than 20 reads in at
least 3/5 of the samples.
cpo <- prepare_cpam(
exp_design = ed,
count_matrix = count_matrix,
model_type = "case-only",
filter_fun_args = list(min_reads = 20, min_prop = 3/5),
t2g = t2g,
num_cores = 4
)
cpo <- compute_p_values(cpo) # 00:03:37 hms
cpo <- estimate_changepoint(cpo) # 00:23:26 hms
cpo <- select_shape(cpo) # 00:37:28 hms
For larger transcriptomes, such as this human transciptome with many additional transcripts beyond the standard reference, the model fitting can take a long time. Here, using 4 cores, we were able to estimate \(p\)-values in 3 minutes, changepoints in 23 minutes, and shape selections in 37 minutes. The latter stages, although slower to compute, are worth the wait for the inferences they provide.
We can look at a summary of the fitted cpam object
If you run the code on your own computer, you can launch the Shiny
app to visualise the results interactively using
visualise(cpo)
.
The results of the analysis are summarised using the
results
function. Here we use the default settings for
log-fold change and p-value thresholds.
Using the results table, we can plot the number of differentially expressed genes at each time point.
res %>%
pull(cp) %>%
table %>%
{tibble(cp = names(.), n = as.numeric(.))} %>%
ggplot(aes(cp, n)) +
geom_point() +
geom_line(aes(group = 1)) +
theme_classic()
There is an obvious peak in differentially expressed genes at the 4C
stage (t = 2.0 hours), corresponding to the embryonic genome activation
(EGA).
To plot the fitted models for a specific gene, we can use the
plot_cpam
function.
g <- "ENSG00000141378"
plot_cpam(
cpo,
gene_id = g,
facet = T, # plot each transcript on a separate facet
remove_null = T, # do show transcripts with null trajectories
common_y_scale = F # do not use a common y scale
)
The
facet
argument can be set to FALSE to plot all the fits
on the same plot.
g <- "ENSG00000141378"
plot_cpam(cpo,
gene_id = g,
facet = F, # plot each transcript in a common plot
remove_null = T)
If you require more customisation than the plot_cpam
function allows, you can extract the model fits and plot their
predictions manually. Here we make a logged version of the above plot
with some extra customisations such as changepoint ticks.
# set gene
g <- "ENSG00000141378"
# set logged
logged <- T
if(logged) fun <- function(x) log(x + 0.5) else fun <- identity
scls <- if_else(logged, "free_x", "free")
# get data
obs <-
cpo$data_long %>%
filter(gene_id == g) %>%
transmute(target_id, time, counts = fun(counts_raw/norm_factor + 0.5))
cps <- cpo$changepoints %>% filter(target_id %in% unique(obs$target_id))
# set colours
tx_cols <- c(RColorBrewer::brewer.pal(7, name = "Dark2"))[-4]
# plot
plot_cpam(cpo, gene_id = g, return_fits_only = T) %>%
map_dfr( ~ cpam:::predict_cpgam(.x, logged = logged), .id = "target_id") %>%
ggplot(aes(time, counts)) +
scale_x_continuous(breaks = round(sort(cpo$times), 1),
labels = c("0.7", "", "2.0", "3.0", "4.5", "5.5")) +
geom_line(aes(col = target_id)) +
geom_ribbon(aes(ymin = q_lo, ymax = q_hi, fill = target_id), alpha = 0.15) +
facet_wrap( ~ target_id, scales = scls) +
geom_point(data = obs, aes(time, counts, col = target_id)) +
scale_fill_manual(values = tx_cols, aesthetics = c("fill", "color")) +
theme_classic() +
theme(
legend.position = "bottom",
plot.subtitle = element_text(
hjust = 0,
face = "plain",
margin = margin(b = 10)
),
strip.background = element_blank(),
strip.text = element_blank(),
axis.line.x = element_line(color = "black"),
axis.ticks.x = element_line(),
panel.spacing.y = unit(1, "lines"),
panel.spacing.x = unit(1, "lines"),
axis.text.x = element_text(angle = 0, hjust = 0.5)
) +
labs(
y = "logged counts",
x = "time (days)",
col = NULL,
fill = NULL,
subtitle = paste0(g)
) +
guides(color = guide_legend(byrow = TRUE), fill = guide_legend(byrow = TRUE)) +
geom_point( # add ticks for changepoints
data = cps,
aes(x = cp_1se, y = -Inf),
shape = 17,
color = "grey20",
size = 2.5
)
We can identify clusters of targets that have similar expression
profiles. In the first instance, we can use the results table with the
function plot_cluster
# filter to keep number of targets reasonable
res <- results(cpo, min_lfc = 2, min_count = 1000)
plot_cluster(cpo, res, changepoints = 2.0, shapes = "micv")
While the default clustering by changepoint and shape identifies meaningful patterns, this cluster contains over 4000 targets. Such a large number of targets can make it difficult to visualise differences in expression dynamics. For these cases, it is useful to be able to define custom clusters based on additional criteria.
In this example, we subset the cluster based on the speed of response by calculating the ratio of log-fold changes between the 8-cell (3.0 days) and morula (4.5 days) stages. This ratio captures how quickly genes reach their maximum expression after activation. To manage the scale of the plot, we first filter out a small number of extreme values (log-fold changes above 10), then assign the remaining targets into four sub-clusters using specific thresholds (0.6, 0.75, and 0.87). This flexible clustering approach reveals distinct expression profiles within a cluster that share the same changepoint and shape.
res_cluster <-
res %>%
filter(lfc.5.5<10) %>% # remove extreme values
filter(cp %in% 2, shape %in% "micv") %>% # select the 4C stage and the 'micv' shape
mutate(rate = lfc.3/lfc.4.5) %>% # calculate the rate/speed
mutate(cluster = cut(rate,breaks = c(0, 0.6, 0.75, 0.87,10), labels = paste0("cluster ",1:4)))
res_cluster
This yields 1282 targets across four clusters. Next we generate the plot data which takes a minute to compute.
cluster_plot_data <-
res_cluster %>%
pull(target_id) %>%
set_names() %>%
map_dfr(~plot_cpam(cpo, target_id = .x, return_fits_only = T) %>%
cpam:::predict_lfc(),.id = "target_id")
Finally, we plot the clusters.
cluster_plot_data %>%
left_join(res_cluster %>% select(target_id,speed), by = "target_id") %>%
ggplot(aes(x = time,y = lfc, group = target_id, col = speed)) +
geom_line(alpha = 0.2) +
scale_x_continuous(breaks = round(sort(cpo$times),1),
labels = c("0.7","","2.0","3.0","4.5","5.5")) +
scale_color_manual(values = blues9[5:8]) +
facet_wrap(~ speed, nrow = 1) +
theme_classic() +
theme(legend.position = "none",
strip.background = element_blank(),
panel.spacing.x = unit(1.5, "lines"),
strip.text = element_text(size = 10, family = "sans")) +
labs(title = NULL, subtitle = NULL,
x = "time (days)", y = "log-fold change")
For further, such as GO term enrichment, we can easily extract tables of clustered targets. To look at the numbers of targets in each cluster.
res_cluster %>% pull(cluster) %>% table
## .
## cluster 1 cluster 2 cluster 3 cluster 4
## 318 257 237 469
To match targets with their rate and cluster.
To look at the targets in cluster 1.