About

This tutorial demonstrates the R package cpam for the analysis of time series omics data. It serves an introduction to the package and reproduces the results for the second case study presented in the accompanying manuscript by Yates et al. (2024). A tutorial for first case study using human embryo time series data is available here.

Data

This tutorial uses publicly available time-series RNA-seq data from an Arabidopsis light stress experiment due to Crisp et al. (2017) (NCBI Short Read Archive, BioProject Accession number: PRJNA391262). This experiment investigated transcriptional responses when plants were transferred from low light to excess light for 60 minutes, followed by recovery in low light for 60 minutes. In this example, we use a subset of 21 samples taken at 7 time points (0, 30, 60, 67.5, 70, 90, and 120 minutes) with 3 biological replicates per time point, that capture dynamic transcriptional responses to a rapid environmental change.

The transcript-to-gene mapping for Arabidopsis thaliana can be downloaded from The Arabidopsis Information Resource (TAIR) (https://www.arabidopsis.org/).

Installation

You can install cpam from GitHub using:

devtools::install_github("l-a-yates/cpam")

Getting started

Load packages

library(cpam)  
library(dplyr)
library(readr)
library(tidyr)
library(stringr)
library(ggplot2)

Experimental design and transcript-to-gene mapping

First we create the experimental design tibble which must have at least the following columns: time, sample, and path. The rep column is optional and is used here to generate the sample names. We used the software kallisto to quantify counts from the RNA-seq data (with 100 bootstrap replicates). The path column contains the path to the kallisto abundance file for each sample.

ed <- 
  expand_grid(time = c(0,30,60,67.5,75,90,120), rep = 1:3) %>%
  mutate(sample = paste0("pc_t",time,"_r",rep),
         path = paste0("case_studies/crisp/data/kallisto/",sample,"/abundance.h5"))

ed

We have already downloaded (https://www.arabidopsis.org/) the transcript-to-gene mapping for Arabidopsis thaliana and we load it now.

t2g <- readr::read_csv("t2g.csv")

This file should have two columns, target_id and gene_id.

head(t2g)

Fitting the models

To fit the models, we first prepare the cpam object, then compute p-values, estimate the changepoints, and select the shape for each transcript. The last step takes the longest (here just under 13 minutes) but it is worth the wait to be able to visualise and cluster the transcripts by shape.

  cpo <- prepare_cpam(exp_design = ed,
                      model_type = "case-only",
                      t2g = t2g,
                      import_type = "kallisto",
                      num_cores = 4)
  cpo <- compute_p_values(cpo) # 1:52
  cpo <- estimate_changepoint(cpo) # 6:32 secs
  cpo <- select_shape(cpo) # 12:54 secs
    

We can look at a summary of the fitted cpam object

cpo
## cpam object
## -----------
## case-only time series
## 21 samples
## 7 time points
## Overdispersion estimated using 100 inferential replicates
## Counts rescaled by estimated overdispersion

If you run the code on your own computer, you can launch the Shiny app on your own computer to visualise the results interactively using visualise(cpo).

Result tables

The results of the analysis are summarised using the results function.

results(cpo)

The generated results can be filtered by specifying minimum counts, minimum log-fold changes, and maximum \(p\)-values. For example, to return only the transcripts with a log-fold change greater than 2, at least 10 counts, and a \(p\)-value less than 0.01:

results(cpo, min_count = 10, min_lfc = 1, p_threshold = 0.01)