This tutorial demonstrates the R package cpam
for the
analysis of time series omics data. It serves an introduction to the
package and reproduces the results for the first case study presented in
the accompanying manuscript by Yates et al. (2024). The first case study using
human embryo time series data is available here.
The data are publicly available …
First we create the experimental design which must have at least the following columns: time, sample, and path. The rep column is optional and is used here to generate the sample names. We have already run kallisto to quantify transcript abundances (with 100 bootstraps), and here the path column contains the path to the abundance file for each sample.
ed <-
expand_grid(time = c(0,30,60,67.5,75,90,120), rep = 1:3) %>%
mutate(sample = paste0("pc_t",time,"_r",rep),
path = paste0("case_studies/crisp/data/kallisto/",sample,"/abundance.h5"))
ed
The transcript-to-gene mapping for Arabidopsis thaliana can be downloaded from TAIR (https://www.arabidopsis.org/). We have downloaded it already and we load it now
This file should have two columns, target_id and gene_id.
To fit the models, we first prepare the cpam
object,
then compute p-values, estimate the changepoint, and select the shape
for each transcript. The last step takes the longest (here just under 13
minutes) but it is worth the wait to be able to visualise and cluster
the transcripts by shape.
cpo <- prepare_cpam(exp_design = ed,
model_type = "case-only",
t2g = t2g,
import_type = "kallisto",
num_cores = 4)
cpo <- compute_p_values(cpo) # 1:52
cpo <- estimate_changepoint(cpo) # 6:32 secs
cpo <- select_shape(cpo) # 12:54 secs
We can look at a summary of the fitted cpam object
cpo
## cpam object
## -----------
## case-only time series
## 21 samples
## 7 time points
## Overdispersion estimated using 100 inferential replicates
## Counts rescaled by estimated overdispersion
Although it cannot be launched here within the online tutorial, you
can launch the Shiny app on your own computer to visualise the results
interactively using visualise(cpo)
.
The results of the analysis are summarised using the
results
function.
The generated results can be filtered by specifying minimum counts, minimum log-fold changes, and maximum p-values. For example, to return only the transcripts with a log-fold change greater than 2, at least 50 counts, and a p-value less than 0.01, we can run
A single gene can be plotted using the plot_cpam
function. Here we plot the gene AT1G64140
The
subtitle shows
(0,tp)
indicating a changepoint at time
point 0 (i.e., no changepoint) and an unconstrained ‘tp’ (thinplate)
shape. This selection of ‘tp’ suggests that the trend for this gene does
not conform to one of the simpler shape types that cpam
uses. To force the cpam
to choose among the simpler forms,
we set shape_type = "shape2"
in the plot_cpam
function. For example:
Here
a concave shape (‘cv’) is chosen, and we can see this trend deviates
from the data substantially more that the unconstrained shape. See the
manuscript for more details on the shape types available in
cpam
.
Next we plot a gene with multiple transcripts.
The first transcript has a changepoint at 30 mins and the second at 0.
Both have an unconstrained shape. The transcripts can plotted separately
by setting
facet =T
in the plot_cpam
function.
There are may other settings that can be adjusted in the
plot_cpam
function, see the documentation for more details.
Changepoints, shapes and other results from the fitted models can also
be extracted manually from the cpam
object. For example, to
extract the shape of the transcripts
Lastly, we plot the two remaining genes shown in the manuscript, AT4G34590 and AT3G23280.
The results function can be used to generate clusters according to
selected filters. Here we generate clusters of transcripts with at least
100 counts, a log-fold change greater than 1, and a p-value less than
0.01. The plot_cluster
function can then be used to
visualise the clusters which we do here for targets with changepoints at
30 mins and the ‘micv’ (montonic increasing concave) shape.
res <- results(cpo, min_count = 100, min_lfc = 1, p_threshold = 0.01)
plot_cluster(cpo, res, changepoints = 30, shapes = "micv")
Clustering can be further refined based on, for example, the rate at
which the above transcripts attain their maximum values. We illustrate
advanced refinements such as this in our other case study here.
When estimating the changepoint, cpam
applies the
one-standard error to mitigate overfitting by taking into account model
selection uncertainty. To visualise the rule, we plot the pointwise
score differences and their standard errors for all changepoints.
First we extract the score table which contains the pointwise cross-validation scores for each model (changepoint).
score_table <-
cpo$changepoints %>%
filter(target_id == "AT3G23280.1") %>%
pull(score_table) %>%
.[[1]]
score_table
Next, we determine the minimum scoring model, and compute the pointwise score differences.
m.min <- score_table %>% purrr::map_dbl(mean) %>% which.min() %>% names; m.min
## [1] "30"
score_table_diff <- score_table - score_table[[m.min]]
score_table_diff
Finally, we compute and plot the mean score differences and their standard errors. The OSE-selected changepoint is latest changepoint whose score is within one standard error of the minimum score difference.
tibble(cp = names(score_table_diff) %>% {factor(.,.)},
score_diff = score_table_diff %>% purrr::map_dbl(mean),
se_diff = score_table_diff %>% purrr::map_dbl(~sd(.x)/sqrt(nrow(score_table_diff)))) %>%
ggplot(aes(x = cp)) +
geom_linerange(aes(ymin = score_diff - se_diff,
ymax = score_diff + se_diff)) +
geom_hline(yintercept = 0, lty = "dashed") +
geom_point(aes(y = score_diff, colour = "min"), shape = 1, size = 5,
data = ~ filter(.x, score_diff == min(score_diff))) +
geom_point(aes(y = score_diff, colour = "1se"), shape = 1, size = 5,
data = ~ filter(.x, se_diff >= score_diff) %>%
filter(as.numeric(cp) == max(as.numeric(cp)))) +
geom_point(aes(y = score_diff), size = 2) +
scale_colour_manual(values = c(`1se` = "#CB181D", min = "#08519C"),
breaks = c("min","1se")) +
labs(col = NULL, y = expression(Delta*"P"), x = "Changepoint",
subtitle = "One-standard-error rule") +
theme_classic() +
theme(legend.position = "bottom")
For
the target AT3G23280.1, the OSE rule selects the changepoint at 67.5
mins, although we can see that the subsequent point at 75 mins is a
close contender (compare the OSE plot above with the earlier plot of the
data and fitted model). For comparison, we plot below the fitted model
using the minimum scoring model instead of the OSE-selected model.
plot_cpam(cpo, target_id = "AT3G23280.1", cp_type = "cp_min", bs = "tp") +
labs(subtitle = "Minimum scoring model")
In
general, the minimum scoring model captures a more complex trend than
the OSE-selected model, although the OSE-selected model is likely to
generalise better to new data. See the
cpam
manuscript for a
simulation study comparing the two selection rules.
Here we show how to reproduce the isoform structure plots for the gene AT3G23280.
First, download the the gtf file for Arabidopsis thaliana from
Ensembl Plants. Using the rtracklayer
package, import the
gtf file and convert it to a tibble.
# Download the file
download.file(
url = "https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-55/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.55.gtf.gz",
destfile = "Arabidopsis_thaliana.TAIR10.55.gtf.gz",
method = "auto"
)
gtf <- rtracklayer::import("Arabidopsis_thaliana.TAIR10.55.gtf.gz") %>% dplyr::as_tibble()
This next part is a bit more involved and requires some manual adjustments to achieve the desired result.
library(ggtranscript)
g_annotation <- gtf %>% filter(gene_id == "AT3G23280")
g_annotation_exons <- g_annotation %>% dplyr::filter(type == "exon")
g_annotation_cds <- g_annotation %>% dplyr::filter(type == "CDS")
tx_ids <- unique(g_annotation_exons$transcript_id)
g_annotation_exons_1 <- g_annotation_exons %>% filter(transcript_id == tx_ids[1])
g_annotation_exons_2 <- g_annotation_exons %>% filter(transcript_id == tx_ids[2])
g_annotation_cds_1 <- g_annotation_cds %>% filter(transcript_id == tx_ids[1])
g_annotation_cds_2 <- g_annotation_cds %>% filter(transcript_id == tx_ids[2])
tx_cols <- c("#CB181D","#08519C")
p_exon <-
g_annotation_cds_1 %>%
ggplot(aes(
xstart = start,
xend = end,
y = "y"
)) +
geom_half_range(
aes(fill = transcript_id, col = transcript_id),
height = 0.125,
range.orientation = "top",
alpha = 0.1,
linewidth = 0.5,
position = position_nudge(y = 0.001)
) +
geom_half_range(
aes(fill = transcript_id, col = transcript_id),
data = g_annotation_cds_2,
height = 0.125,
alpha = 0.1,
linewidth = 0.5,
position = position_nudge(y = -0.001)
) +
geom_intron(
data = to_intron(g_annotation_exons_1, "transcript_id"),
col = "grey20",
arrow = grid::arrow(ends = "last", length = grid::unit(2, "mm")),
linewidth = 0.5,
arrow.min.intron.length = 150,
strand = g_annotation_cds_1$strand[1]
) +
scale_fill_manual(values = tx_cols, aesthetics = c("color","fill")) +
theme_void() +
theme(
plot.margin = unit(c(-1.5, 1, -1.5, 1), "cm"),
legend.position = "none"
) +
annotate("segment", x = min(g_annotation_cds_2$start), xend = max(g_annotation_cds_2$end),
y = "y", yend = "y", col = "grey20", linewidth = 0.8) +
xlim(min(g_annotation_cds_2$start) - 1200,NA) +
annotate("text", x = min(g_annotation_cds_2$start) - 50, y = 1 + 0.125/2, label = tx_ids[1],
hjust = 1, size = 3.5, col = tx_cols[1]) +
annotate("text", x = min(g_annotation_cds_2$start) - 50, y = 1 - 0.125/2, label = tx_ids[2],
hjust = 1, size = 3.5, col = tx_cols[2]);p_exon
If you require more customisation than the plot_cpam
function allows, you can extract the model fits and plot their
predictions manually. Here we show how to extract the model fits, model
predictions, and observed data for a given target.
# extract the model fit(s)
fit <- plot_cpam(cpo, target_id = "AT3G23280.1",return_fits_only = T)
# generate model prediction data
cpam:::predict_cpgam(fit, logged = F, length.out = 200, ci_prob = "se")
The count data can be extract the cpam
object
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Pop!_OS 22.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-openmp/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Australia/Hobart
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggtranscript_1.0.0 ggplot2_3.5.1 stringr_1.5.1 tidyr_1.3.1 readr_2.1.5 dplyr_1.1.4
## [7] cpam_0.0.0.9000
##
## loaded via a namespace (and not attached):
## [1] SummarizedExperiment_1.36.0 gtable_0.3.6 rjson_0.2.23 xfun_0.50
## [5] bslib_0.9.0 ggrepel_0.9.6 Biobase_2.66.0 lattice_0.22-6
## [9] tzdb_0.4.0 vctrs_0.6.5 tools_4.4.1 bitops_1.0-9
## [13] generics_0.1.3 parallel_4.4.1 stats4_4.4.1 curl_6.2.0
## [17] tibble_3.2.1 pkgconfig_2.0.3 Matrix_1.7-2 RColorBrewer_1.1-3
## [21] S4Vectors_0.44.0 lifecycle_1.0.4 GenomeInfoDbData_1.2.13 scam_1.2-18
## [25] compiler_4.4.1 farver_2.1.2 Rsamtools_2.22.0 Biostrings_2.74.1
## [29] munsell_0.5.1 codetools_0.2-20 GenomeInfoDb_1.42.3 htmltools_0.5.8.1
## [33] sass_0.4.9 RCurl_1.98-1.16 yaml_2.3.10 crayon_1.5.3
## [37] pillar_1.10.1 jquerylib_0.1.4 BiocParallel_1.40.0 DelayedArray_0.32.0
## [41] cachem_1.1.0 abind_1.4-8 nlme_3.1-167 aggregation_1.0.1
## [45] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.4 purrr_1.0.4
## [49] restfulr_0.0.15 labeling_0.4.3 splines_4.4.1 fastmap_1.2.0
## [53] grid_4.4.1 SparseArray_1.6.1 colorspace_2.1-1 cli_3.6.4
## [57] magrittr_2.0.3 S4Arrays_1.6.0 XML_3.99-0.18 withr_3.0.2
## [61] scales_1.3.0 UCSC.utils_1.2.0 rmarkdown_2.29 XVector_0.46.0
## [65] httr_1.4.7 matrixStats_1.5.0 hms_1.1.3 evaluate_1.0.3
## [69] knitr_1.49 GenomicRanges_1.58.0 IRanges_2.40.1 BiocIO_1.16.0
## [73] rtracklayer_1.66.0 mgcv_1.9-1 rlang_1.1.5 Rcpp_1.0.14
## [77] glue_1.8.0 BiocGenerics_0.52.0 rstudioapi_0.17.1 jsonlite_1.8.9
## [81] R6_2.6.1 MatrixGenerics_1.18.1 GenomicAlignments_1.42.0 zlibbioc_1.52.0