About

This tutorial demonstrates the R package cpam for the analysis of time series omics data. It serves an introduction to the package and reproduces the results for the second case study presented in the accompanying manuscript by Yates et al. (2024). A tutorial for first case study using human embryo time series data is available here.

Data

This tutorial uses publicly available time-series RNA-seq data from an Arabidopsis light stress experiment due to Crisp et al. (2017) (NCBI Short Read Archive, BioProject Accession number: PRJNA391262). This experiment investigated transcriptional responses when plants were transferred from low light to excess light for 60 minutes, followed by recovery in low light for 60 minutes. In this example, we use a subset of 21 samples taken at 7 time points (0, 30, 60, 67.5, 70, 90, and 120 minutes) with 3 biological replicates per time point, that capture dynamic transcriptional responses to a rapid environmental change.

The transcript-to-gene mapping for Arabidopsis thaliana can be downloaded from The Arabidopsis Information Resource (TAIR) (https://www.arabidopsis.org/).

Installation

You can install cpam from GitHub using:

devtools::install_github("l-a-yates/cpam")

Getting started

Load packages

library(cpam)  
library(dplyr)
library(readr)
library(tidyr)
library(stringr)
library(ggplot2)

Experimental design and transcript-to-gene mapping

First we create the experimental design tibble which must have at least the following columns: time, sample, and path. The rep column is optional and is used here to generate the sample names. We used the software kallisto to quantify counts from the RNA-seq data (with 100 bootstrap replicates). The path column contains the path to the kallisto abundance file for each sample.

ed <- 
  expand_grid(time = c(0,30,60,67.5,75,90,120), rep = 1:3) %>%
  mutate(sample = paste0("pc_t",time,"_r",rep),
         path = paste0("case_studies/crisp/data/kallisto/",sample,"/abundance.h5"))

ed

We have already downloaded (https://www.arabidopsis.org/) the transcript-to-gene mapping for Arabidopsis thaliana and we load it now.

t2g <- readr::read_csv("t2g.csv")

This file should have two columns, target_id and gene_id.

head(t2g)

Fitting the models

To fit the models, we first prepare the cpam object, then compute p-values, estimate the changepoints, and select the shape for each transcript. The last step takes the longest (here just under 13 minutes) but it is worth the wait to be able to visualise and cluster the transcripts by shape.

  cpo <- prepare_cpam(exp_design = ed,
                      model_type = "case-only",
                      t2g = t2g,
                      import_type = "kallisto",
                      num_cores = 4)
  cpo <- compute_p_values(cpo) # 1:52
  cpo <- estimate_changepoint(cpo) # 6:32 secs
  cpo <- select_shape(cpo) # 12:54 secs
    

We can look at a summary of the fitted cpam object

cpo
## cpam object
## -----------
## case-only time series
## 21 samples
## 7 time points
## Overdispersion estimated using 100 inferential replicates
## Counts rescaled by estimated overdispersion

If you run the code on your own computer, you can launch the Shiny app on your own computer to visualise the results interactively using visualise(cpo).

Result tables

The results of the analysis are summarised using the results function.

results(cpo)

The generated results can be filtered by specifying minimum counts, minimum log-fold changes, and maximum \(p\)-values. For example, to return only the transcripts with a log-fold change greater than 2, at least 10 counts, and a \(p\)-value less than 0.01:

results(cpo, min_count = 10, min_lfc = 1, p_threshold = 0.01)

Plotting genes and transcripts

A single gene can be plotted using the plot_cpam function. Here we plot the gene AT1G64140

plot_cpam(cpo, gene_id = "AT1G64140")

The subtitle shows (0,tp) indicating a changepoint at time point 0 (i.e., no changepoint) and an unconstrained 'tp' (thinplate) shape. This selection of ‘tp’ suggests that the trend for this gene does not conform to one of the simpler shape types that cpam uses. To force the cpam to choose among the simpler forms, we set shape_type = "shape2" in the plot_cpam function. For example:

plot_cpam(cpo, gene_id = "AT1G64140",shape_type = "shape2")

Here a concave shape ('cv') is chosen, and we can see this trend deviates from the data substantially more that the unconstrained shape. See the manuscript for more details on the shape types available in cpam.

Next we plot a gene with multiple transcripts.

plot_cpam(cpo, gene_id = "AT1G28610")

The first transcript has a changepoint at 30 mins and the second at 0. Both have an unconstrained shape. The transcripts can plotted separately by setting facet = T in the plot_cpam function. There are may other settings that can be adjusted in the plot_cpam function, see the documentation for more details. Changepoints, shapes and other results from the fitted models can also be extracted manually from the cpam object. For example, to extract the shape of the transcripts

cpo$shapes %>% filter(str_starts(target_id,"AT1G28610"))

Lastly, we plot the two remaining genes featured in the manuscript, AT4G34590 and AT3G23280.

plot_cpam(cpo, gene_id = "AT4G34590")

plot_cpam(cpo, gene_id = "AT3G23280")

Clusters

The results function can be used to generate clusters according to selected filters. Here we generate clusters of transcripts with at least 100 counts, a log-fold change greater than 1, and a p-value less than 0.01. The plot_cluster function can then be used to visualise the clusters which we do here for targets with changepoints at 30 mins and the ‘micv’ (montonic increasing concave) shape.

res <- results(cpo, min_count = 100, min_lfc = 1, p_threshold = 0.01)
plot_cluster(cpo, res, changepoints = 30, shapes = "micv")

Clustering can be further refined based on, for example, the rate at which the above transcripts attain their maximum values. We illustrate advanced refinements such as this in our other case study here.

Advanced plotting

Isoform structure plots

Here we show how to reproduce the isoform structure plots for the gene AT3G23280.

First, download the the gtf file for Arabidopsis thaliana from Ensembl Plants. Using the rtracklayer package, import the gtf file and convert it to a tibble.

# Download the file
download.file(
  url = "https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-55/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.55.gtf.gz",
  destfile = "Arabidopsis_thaliana.TAIR10.55.gtf.gz",
  method = "auto"
)

gtf <- rtracklayer::import("Arabidopsis_thaliana.TAIR10.55.gtf.gz") %>% dplyr::as_tibble()

This next part is a bit more involved and requires some manual adjustments to achieve the desired result.

library(ggtranscript)

g_annotation <- gtf %>% filter(gene_id == "AT3G23280")
g_annotation_exons <- g_annotation %>% dplyr::filter(type == "exon")
g_annotation_cds <- g_annotation %>% dplyr::filter(type == "CDS")
tx_ids <- unique(g_annotation_exons$transcript_id)

g_annotation_exons_1 <- g_annotation_exons %>% filter(transcript_id == tx_ids[1])
g_annotation_exons_2 <- g_annotation_exons %>% filter(transcript_id == tx_ids[2])
g_annotation_cds_1 <- g_annotation_cds %>% filter(transcript_id == tx_ids[1])
g_annotation_cds_2 <- g_annotation_cds %>% filter(transcript_id == tx_ids[2])

tx_cols <- c("#CB181D","#08519C")

p_exon <-
  g_annotation_cds_1 %>%
  ggplot(aes(
    xstart = start,
    xend = end,
    y = "y"
  )) +
  geom_half_range(
    aes(fill = transcript_id, col = transcript_id),
    height = 0.125,
    range.orientation =  "top",
    alpha = 0.1,
    linewidth = 0.5,
    position = position_nudge(y = 0.001)
  ) +
  geom_half_range(
    aes(fill = transcript_id, col = transcript_id),
    data = g_annotation_cds_2,
    height = 0.125,
    alpha = 0.1,
    linewidth = 0.5,
    position = position_nudge(y = -0.001)
  ) +
  geom_intron(
    data = to_intron(g_annotation_exons_1, "transcript_id"),
    col = "grey20",
    arrow = grid::arrow(ends = "last", length = grid::unit(2, "mm")),
    linewidth = 0.5,
    arrow.min.intron.length = 150,
    strand = g_annotation_cds_1$strand[1]
  ) +
  scale_fill_manual(values = tx_cols, aesthetics = c("color","fill")) +
  theme_void() +
  theme(
    plot.margin = unit(c(-1.5, 1, -1.5, 1), "cm"),
    legend.position = "none"
  ) +
  annotate("segment", x = min(g_annotation_cds_2$start), xend = max(g_annotation_cds_2$end),
           y = "y", yend = "y", col = "grey20", linewidth = 0.8) +
  xlim(min(g_annotation_cds_2$start) - 1200,NA) +
  annotate("text", x = min(g_annotation_cds_2$start) - 50, y = 1 + 0.125/2, label = tx_ids[1],
           hjust = 1, size = 3.5, col = tx_cols[1]) +
  annotate("text", x = min(g_annotation_cds_2$start) - 50, y = 1 - 0.125/2, label = tx_ids[2],
           hjust = 1, size = 3.5, col = tx_cols[2]);p_exon

OSE rule plots

When estimating the changepoint, cpam applies the one-standard error to mitigate overfitting by taking into account model selection uncertainty. To visualise the rule, we plot the pointwise score differences and their standard errors for all changepoints.

First we extract the score table which contains the pointwise cross-validation scores for each model (changepoint).

score_table <- 
  cpo$changepoints %>% 
  filter(target_id == "AT3G23280.1") %>% 
  pull(score_table) %>% 
  .[[1]]

score_table

Next, we determine the minimum scoring model, and compute the pointwise score differences.

m.min <- score_table %>% purrr::map_dbl(mean) %>% which.min() %>% names; m.min
## [1] "30"
score_table_diff <- score_table - score_table[[m.min]]
score_table_diff

Finally, we compute and plot the mean score differences and their standard errors. The OSE-selected changepoint is latest changepoint whose score is within one standard error of the minimum score difference.

  tibble(cp = names(score_table_diff) %>% {factor(.,.)},
       score_diff = score_table_diff %>% purrr::map_dbl(mean),
       se_diff = score_table_diff %>% purrr::map_dbl(~sd(.x)/sqrt(nrow(score_table_diff)))) %>%
  ggplot(aes(x = cp)) +
  geom_linerange(aes(ymin = score_diff - se_diff,
                     ymax = score_diff + se_diff)) +
  geom_hline(yintercept = 0, lty = "dashed") +
  geom_point(aes(y = score_diff, colour = "min"), shape = 1, size = 5,
             data = ~ filter(.x, score_diff == min(score_diff))) +
  geom_point(aes(y = score_diff, colour = "1se"), shape = 1, size = 5,
             data = ~ filter(.x, se_diff >= score_diff) %>% 
               filter(as.numeric(cp) == max(as.numeric(cp)))) +
  geom_point(aes(y = score_diff), size = 2) +
  scale_colour_manual(values = c(`1se` = "#CB181D", min = "#08519C"),
                      breaks = c("min","1se")) +
  labs(col = NULL, y = expression(Delta*"P"), x = "Changepoint",
       subtitle = "One-standard-error rule") +
  theme_classic() +
  theme(legend.position = "bottom") 

For the target AT3G23280.1, the OSE rule selects the changepoint at 67.5 mins, although we can see that the subsequent point at 75 mins is a close contender (compare the OSE plot above with the earlier plot of the data and fitted model). For comparison, we plot below the fitted model using the minimum scoring model instead of the OSE-selected model.

plot_cpam(cpo, target_id = "AT3G23280.1", cp_type = "cp_min", bs = "tp") +
  labs(subtitle = "Minimum scoring model")

In general, the minimum scoring model captures a more complex trend than the OSE-selected model, although the OSE-selected model is likely to generalise better to new data. See the cpam manuscript for a simulation study comparing the two selection rules.

Extract plot data for full customisation

If you require more customisation than the plot_cpam function allows, you can extract the model fits and plot their predictions manually. Here we show how to extract the model fits, model predictions, and observed data for a given target.

# extract the model fit(s)
fit <- plot_cpam(cpo, target_id = "AT3G23280.1",return_fits_only = T)
# generate model prediction data 
cpam:::predict_cpgam(fit, logged = F, length.out = 200, ci_prob = "se")

The count data can be extract the cpam object

cpo$data_long %>% filter(target_id == "AT3G23280.1")

We demonstrate use of these data for manual plotting in the human embryo case study.

Session Info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Pop!_OS 22.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
##  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8    LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Australia/Hobart
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
## [1] ggtranscript_1.0.0 ggplot2_3.5.1      stringr_1.5.1      tidyr_1.3.1        readr_2.1.5        dplyr_1.1.4       
## [7] cpam_0.0.0.9000   
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1            farver_2.1.2                Biostrings_2.73.1           bitops_1.0-8               
##  [5] fastmap_1.2.0               RCurl_1.98-1.16             GenomicAlignments_1.42.0    XML_3.99-0.17              
##  [9] digest_0.6.37               lifecycle_1.0.4             magrittr_2.0.3              compiler_4.4.2             
## [13] rlang_1.1.5                 sass_0.4.9                  tools_4.4.2                 yaml_2.3.10                
## [17] rtracklayer_1.66.0          knitr_1.48                  S4Arrays_1.5.5              labeling_0.4.3             
## [21] curl_5.2.3                  DelayedArray_0.31.10        RColorBrewer_1.1-3          abind_1.4-8                
## [25] BiocParallel_1.39.0         withr_3.0.2                 purrr_1.0.4                 BiocGenerics_0.51.0        
## [29] grid_4.4.2                  stats4_4.4.2                colorspace_2.1-1            scales_1.3.0               
## [33] SummarizedExperiment_1.35.1 cli_3.6.4                   rmarkdown_2.27              crayon_1.5.3               
## [37] generics_0.1.3              rstudioapi_0.17.0           httr_1.4.7                  tzdb_0.4.0                 
## [41] rjson_0.2.21                cachem_1.1.0                zlibbioc_1.51.1             splines_4.4.2              
## [45] parallel_4.4.2              scam_1.2-18                 XVector_0.45.0              restfulr_0.0.15            
## [49] aggregation_1.0.1           matrixStats_1.5.0           vctrs_0.6.5                 Matrix_1.7-1               
## [53] jsonlite_1.8.9              IRanges_2.39.2              hms_1.1.3                   S4Vectors_0.43.2           
## [57] ggrepel_0.9.5               jquerylib_0.1.4             glue_1.8.0                  codetools_0.2-20           
## [61] stringi_1.8.4               gtable_0.3.6                GenomeInfoDb_1.41.1         GenomicRanges_1.57.1       
## [65] BiocIO_1.16.0               UCSC.utils_1.1.0            munsell_0.5.1               tibble_3.2.1               
## [69] bspm_0.5.7                  pillar_1.10.1               htmltools_0.5.8.1           GenomeInfoDbData_1.2.12    
## [73] R6_2.6.1                    evaluate_0.24.0             lattice_0.22-6              Biobase_2.65.0             
## [77] highr_0.11                  Rsamtools_2.21.0            bslib_0.9.0                 Rcpp_1.0.14                
## [81] SparseArray_1.5.30          nlme_3.1-166                mgcv_1.9-1                  xfun_0.46                  
## [85] MatrixGenerics_1.17.0       pkgconfig_2.0.3
Crisp, Peter A., Diep R. Ganguly, Aaron B. Smith, Kevin D. Murray, Gonzalo M. Estavillo, Iain Searle, Ethan Ford, et al. 2017. “Rapid Recovery Gene Downregulation During Excess-Light Stress and Recovery in Arabidopsis.” The Plant Cell 29 (8): 1836–63. https://doi.org/10.1105/tpc.16.00828.
Yates, Luke A., Jazmine L. Humphreys, Michael A. Charleston, and Steven M. Smith. 2024. “Shape-Constrained, Changepoint Additive Models for Time Series Omics Data with Cpam.” BioRxiv, December. https://doi.org/10.1101/2024.12.22.630003.