Data import


Ex. 6.1

Download the Tcell_RNAseq_counts.txt file from this link that contains expression data in the form of raw gene counts from a RNA sequencing (RNA-seq) experiment of CD4+ T cells stimulated with various cytokines (Cano-Gamez A. et al. 2020).

Load the file in R as a data.frame using the read.table function and check its format by printing to screen the first five rows and columns.

Hint: make sure to set the row.names, header, and sep parameters according to your file format.

Download the Tcell_RNAseq_metadata.txt file from this link containing matched metadata about the samples.

Load this file in R as done for the count file.

Check the column names in the metadata to see which information is available for each sample.


Data analysis


Ex. 6.2

From the full metadata data.frame of Ex. 6.1, select only the rows corresponding to CD4 naive T cells (cell_type column) treated with Th1- or Th2-inducing cytokines (cytokine_condition column) for 5 days (stimulation_time column), and save it into a new data.frame called metadata.sel.

Hint: you can identify the intersection of samples (i.e., rows) satisfying all conditions.

Using the table function, check how many samples from each cytokine condition (i.e., “Th1” and “Th2”) are available in metadata.sel.

Set the sample identifiers (sample_id column) as row names of the metadata.sel data.frame.

Save in a vector named selsamples the list of sample identifiers (i.e., row names) of the metadata.sel data.frame.

Use this information to extract from the original count data only the samples (i.e., columns) corresponding to the selected identifiers and save the selected data into a new data.frame called counts.sel.

Remove from counts.sel the genes (i.e., rows) having zero counts in all samples.

Check the dimension of the counts.sel data.frame.

Save metadata.sel and counts.sel in a file called Th1_Th2_RNAseq.RData.


Ex. 6.3

We can use the metadata.sel and counts.sel data.frames generated in Ex. 6.2 (also available as RData at this link) to identify the gene differentially expressed after 5 days in naive CD4+ T cells stimulated with Th1- versus Th2-inducing cytokines.

To run differential gene expression analysis, we will use an R package called DESeq2 (Love M.I., Huber W., Anders S., 2014).

DESeq2 can be installed from Bioconductor using the following command:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DESeq2")

After DESeq2 installation and loading (using the library function), we can perform differential gene expression analysis in three steps.

First, we have to create a DESeq object:

dds <- DESeqDataSetFromMatrix(countData = counts.sel,
                              colData = metadata.sel,
                              design = ~cytokine_condition)
  • countData is a matrix or data.frame of counts with genes on the rows and sample identifiers on the columns (counts.sel in our case).

  • colData is a table of sample information (metadata.sel in our case).

  • design indicates how to model the samples; here, we want to measure the effect of the different cocktail of cytokines used for stimulation (i.e. either Th1- or Th2-inducing, as reported in the cytokine_condition of metadata.sel).

Second, we can use the DESeq function to analyze the created object:

dds <- DESeq(dds)

Third, we can extract the results in a tabular format:

res <- as.data.frame(results(dds))

For more information of the output format, have a look at DESeq2 manual.

Run a differential expression analysis with DESeq2 to compare Th2- versus Th1-induced CD4+ T cells following the 3 steps above.

Save the results of the differential expression analysis (i.e., DEres) in a file called Th1_Th2_DEres.RData.


Ex. 6.4

Sort the DEres data.frame generated in Ex. 6.5 (also available as RData at this link) to have the p-values (pvalue column) in increasing order.

Add a column called DE that indicates whether a gene is differentially expressed. The column should be equal to TRUE if a gene has a p-value adjusted for multiple testing (padj column) lower than 0.01 and an absolute log-fold-change (log2FoldChange column) greater than 2, and FALSE otherwise.

Count how many genes are differentially expressed.

Count how many genes are upregulated in Th2 versus Th1 cells, i.e. they are differentially expressed and have a log-fold-change (log2FoldChange column) greater than 0.

Count how many genes are downregulated in Th2 versus Th1 cells, i.e. they are differentially expressed and have a log-fold-change (log2FoldChange column) lower than 0.


Data visualization

Ex. 6.5

We can use the plot function to visualize the results of our differential gene expression analysis as a Volcano plot.

To do so, we can consider the p-values and log-fold-changes in the DEres file and plot as points (pch=19) the -log10(p-value) of each point (y-axis) versus its log-fold change (x-axis).

The genes with the smallest p-values (i.e., with more significant differences) are represented at the top of the plot. Points located in the rightmost side of the plot represent genes with large, positive log-fold-changes, i.e. upregulated in Th2 vs. Th1 cells. Points located in the leftmost side of the plot represent genes with large, negative log-fold-changes, i.e. downregulated in Th2 vs. Th1 cells.

To facilitate plot interpretation, add two thin lines representing the x=0 and y=0 axes, and add title and axis names with 1.2 magnification.

Set the color of the points corresponding to differentially expressed genes to “organge”, and to “grey” for the remaining ones.

Add a legend for interpreting the colors assigned to differentially expressed (“DE”) and non-differentially expressed (“nonDE”) genes.

Save the plot in a PDF file named Th2_vs_Th2_RNAseq_DE_volcano.