Data import

Ex. 6.1

Download the Tcell_RNAseq_counts.txt file from this link that contains expression data in the form of raw gene counts from a RNA sequencing (RNA-seq) experiment of CD4+ T cells stimulated with various cytokines (Cano-Gamez A. et al. 2020).

Load the file in R as a data.frame using the read.table function and check its format by printing to screen the first five rows and columns.

Hint: make sure to set the row.names, header, and sep parameters according to your file format.

counts <- read.table("../Data/Data/Day6/Tcell_RNAseq_counts.txt",
                     row.names = 1,
                     header = TRUE,
                     sep = "\t")
print(counts[1:5,1:5])

##                 I0712 I0713 I0717 I0721 I0726
## ENSG00000223972     0     0     0     0     0
## ENSG00000227232    55    93   198   131    75
## ENSG00000278267     6     6    28     5     4
## ENSG00000243485     0     0     0     3     1
## ENSG00000237613     0     0     0     0     0

Download the Tcell_RNAseq_metadata.txt file from this link containing matched metadata about the samples.

Load this file in R as done for the count file.

metadata <- read.table("../Data/Data/Day6/Tcell_RNAseq_metadata.txt",
                     row.names = 1,
                     header = TRUE,
                     sep = "\t")

Check the column names in the metadata to see which information is available for each sample.

colnames(metadata)

##  [1] "sample_id"            "cell_type"            "cytokine_condition"  
##  [4] "stimulation_time"     "donor_id"             "sex"                 
##  [7] "age"                  "sequencing_batch"     "cell_culture_batch"  
## [10] "rna_integrity_number"

Data analysis

Ex. 6.2

From the full metadata data.frame of Ex. 6.1, select only the rows corresponding to CD4 naive T cells (cell_type column) treated with Th1- or Th2-inducing cytokines (cytokine_condition column) for 5 days (stimulation_time column), and save it into a new data.frame called metadata.sel.

Hint: you can identify the intersection of samples (i.e., rows) satisfying all conditions.

ind1 <- which(metadata[, "cell_type"] == "CD4_Naive")
ind2 <- which(metadata[, "cytokine_condition"] %in% c("Th1", "Th2"))
ind3 <- which(metadata[, "stimulation_time"] == "5d")
ind <- intersect(ind1, intersect(ind2, ind3))
metadata.sel <- metadata[ind,]

Using the table function, check how many samples from each cytokine condition (i.e., “Th1” and “Th2”) are available in metadata.sel.

table(metadata.sel[, "cytokine_condition"])

## 
## Th1 Th2 
##   3   4

Set the sample identifiers (sample_id column) as row names of the metadata.sel data.frame.

rownames(metadata.sel) <- metadata.sel[, "sample_id"]

Save in a vector named selsamples the list of sample identifiers (i.e., row names) of the metadata.sel data.frame.

selsamples <- rownames(metadata.sel)

Use this information to extract from the original count data only the samples (i.e., columns) corresponding to the selected identifiers and save the selected data into a new data.frame called counts.sel.

counts.sel <- counts[, selsamples]

Remove from counts.sel the genes (i.e., rows) having zero counts in all samples.

rmgenes <- which(apply(counts.sel,1,sum)==0)
counts.sel <- counts.sel[-rmgenes, ]

Check the dimension of the counts.sel data.frame.

dim(counts.sel)

## [1] 38257     7

Save metadata.sel and counts.sel in a file called Th1_Th2_RNAseq.RData.

save(metadata.sel, 
     counts.sel, 
     file = "Th1_Th2_RNAseq.RData")

Ex. 6.3

We can use the metadata.sel and counts.sel data.frames generated in Ex. 6.2 (also available as RData at this link) to identify the gene differentially expressed after 5 days in naive CD4+ T cells stimulated with Th1- versus Th2-inducing cytokines.

To run differential gene expression analysis, we will use an R package called DESeq2 (Love M.I., Huber W., Anders S., 2014).

DESeq2 can be installed from Bioconductor using the following command:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DESeq2")

After DESeq2 installation and loading (using the library function), we can perform differential gene expression analysis in three steps.

First, we have to create a DESeq object:

dds <- DESeqDataSetFromMatrix(countData = counts.sel,
                              colData = metadata.sel,
                              design = ~cytokine_condition)

countData is a matrix or data.frame of counts with genes on the rows and sample identifiers on the columns (counts.sel in our case).
colData is a table of sample information (metadata.sel in our case).
design indicates how to model the samples; here, we want to measure the effect of the different cocktail of cytokines used for stimulation (i.e. either Th1- or Th2-inducing, as reported in the cytokine_condition of metadata.sel).

Second, we can use the DESeq function to analyze the created object:

dds <- DESeq(dds)

Third, we can extract the results in a tabular format:

res <- as.data.frame(results(dds))

For more information of the output format, have a look at DESeq2 manual.

Run a differential expression analysis with DESeq2 to compare Th2- versus Th1-induced CD4+ T cells following the 3 steps above.

library(DESeq2)

dds <- DESeqDataSetFromMatrix(countData = counts.sel,
                              colData = metadata.sel,
                              design = ~cytokine_condition)
dds <- DESeq(dds)
DEres <- as.data.frame(results(dds))

head(DEres)

##                   baseMean log2FoldChange     lfcSE       stat    pvalue
## ENSG00000227232 89.1453519     -0.2796996 0.2181235 -1.2822995 0.1997376
## ENSG00000278267  5.8793927     -0.5644860 0.7199869 -0.7840226 0.4330268
## ENSG00000243485  0.4591155      2.0699388 3.2005691  0.6467409 0.5177997
## ENSG00000238009  0.2655925      1.3239357 3.8044554  0.3479961 0.7278431
## ENSG00000233750  1.3241781     -1.6467622 1.6096188 -1.0230759 0.3062720
## ENSG00000268903 39.9906629     -0.3799833 0.5096202 -0.7456206 0.4558966
##                      padj
## ENSG00000227232 0.4835463
## ENSG00000278267 0.7301260
## ENSG00000243485        NA
## ENSG00000238009        NA
## ENSG00000233750        NA
## ENSG00000268903 0.7463638

Save the results of the differential expression analysis (i.e., DEres) in a file called Th1_Th2_DEres.RData.

save(DEres, file = "Th1_Th2_DEres.RData")

Ex. 6.4

Sort the DEres data.frame generated in Ex. 6.5 (also available as RData at this link) to have the p-values (pvalue column) in increasing order.

DEres <- DEres[order(DEres[,"pvalue"]),]

Add a column called DE that indicates whether a gene is differentially expressed. The column should be equal to TRUE if a gene has a p-value adjusted for multiple testing (padj column) lower than 0.01 and an absolute log-fold-change (log2FoldChange column) greater than 2, and FALSE otherwise.

alpha <- 0.01
minLFC <- 2
DEres[,"DE"] <- FALSE
DEgenes <- intersect(which(DEres[,"padj"] < alpha),
                     which(abs(DEres[,"log2FoldChange"]) > minLFC))
DEres[DEgenes,"DE"] <- TRUE

Count how many genes are differentially expressed.

sum(DEres[,"DE"])

## [1] 400

Count how many genes are upregulated in Th2 versus Th1 cells, i.e. they are differentially expressed and have a log-fold-change (log2FoldChange column) greater than 0.

genesUP <- intersect(which(DEres[,"DE"]),
                     which(DEres[,"log2FoldChange"]>0))
length(genesUP)

## [1] 212

Count how many genes are downregulated in Th2 versus Th1 cells, i.e. they are differentially expressed and have a log-fold-change (log2FoldChange column) lower than 0.

genesDN <- intersect(which(DEres[,"DE"]),
                     which(DEres[,"log2FoldChange"]<0))
length(genesDN)

## [1] 188

Data visualization

Ex. 6.5

We can use the plot function to visualize the results of our differential gene expression analysis as a Volcano plot.

To do so, we can consider the p-values and log-fold-changes in the DEres file and plot as points (pch=19) the -log10(p-value) of each point (y-axis) versus its log-fold change (x-axis).

plot(DEres[,"log2FoldChange"],
     -log10(DEres[,"pvalue"]),
     pch = 19)

The genes with the smallest p-values (i.e., with more significant differences) are represented at the top of the plot. Points located in the rightmost side of the plot represent genes with large, positive log-fold-changes, i.e. upregulated in Th2 vs. Th1 cells. Points located in the leftmost side of the plot represent genes with large, negative log-fold-changes, i.e. downregulated in Th2 vs. Th1 cells.

To facilitate plot interpretation, add two thin lines representing the x=0 and y=0 axes, and add title and axis names with 1.2 magnification.

plot(DEres[,"log2FoldChange"],
     -log10(DEres[,"pvalue"]),
     pch = 19,
     main = "Th2 vs. Th1 DE analysis",
     xlab = "log2-fold-change",
     ylab = "-log10(p-value)",
     cex.lab = 1.2,
     cex.axis = 1.2,
     cex.main = 1.2)
abline(h=0, col="grey4", lwd=0.8, lty=2)
abline(v=0, col="grey4", lwd=0.8, lty=2)

Set the color of the points corresponding to differentially expressed genes to “organge”, and to “grey” for the remaining ones.

col <- rep("grey", nrow(DEres))
col[which(DEres[,"DE"])] <- "orange"

plot(DEres[,"log2FoldChange"],
     -log10(DEres[,"pvalue"]),
     pch = 19,
     main = "Th2 vs. Th1 DE analysis",
     xlab = "log2-fold-change",
     ylab = "-log10(p-value)",
     cex.lab = 1.2,
     cex.axis = 1.2,
     cex.main = 1.2,
     col = col)
abline(h=0, col="grey4", lwd=0.8, lty=2)
abline(v=0, col="grey4", lwd=0.8, lty=2)

Add a legend for interpreting the colors assigned to differentially expressed (“DE”) and non-differentially expressed (“nonDE”) genes.

plot(DEres[,"log2FoldChange"],
     -log10(DEres[,"pvalue"]),
     pch = 19,
     main = "Th2 vs. Th1 DE analysis",
     xlab = "log2-fold-change",
     ylab = "-log10(p-value)",
     cex.lab = 1.2,
     cex.axis = 1.2,
     cex.main = 1.2,
     col = col)
abline(h=0, col="grey4", lwd=0.8, lty=2)
abline(v=0, col="grey4", lwd=0.8, lty=2)

legend("topleft",
     pch = 19,
     bty = "n",
     col = c("grey", "orange"),
     legend = c("nonDE", "DE"))

Save the plot in a PDF file named Th2_vs_Th2_RNAseq_DE_volcano.

pdf("Th2_vs_Th2_RNAseq_DE_volcano.pdf")

plot(DEres[,"log2FoldChange"],
     -log10(DEres[,"pvalue"]),
     pch = 19,
     main = "Th2 vs. Th1 DE analysis",
     xlab = "log2-fold-change",
     ylab = "-log10(p-value)",
     cex.lab = 1.2,
     cex.axis = 1.2,
     cex.main = 1.2,
     col = col)
abline(h=0, col="grey4", lwd=0.8, lty=2)
abline(v=0, col="grey4", lwd=0.8, lty=2)

legend("topleft",
     pch = 19,
     bty = "n",
     col = c("grey", "orange"),
     legend = c("nonDE", "DE"))

dev.off()

## quartz_off_screen 
##                 2

Introduction to R programming for data science
Day 6 exercises

Dr. Francesca Finotello

Medical University of Innsbruck, Austria

Data import

Ex. 6.1

Data analysis

Ex. 6.2

Ex. 6.3

Ex. 6.4

Data visualization

Ex. 6.5

Introduction to R programming for data science Day 6 exercises

Dr. Francesca Finotello

Medical University of Innsbruck, Austria

Data import

Ex. 6.1

Data analysis

Ex. 6.2

Ex. 6.3

Ex. 6.4

Data visualization

Ex. 6.5

Introduction to R programming for data science
Day 6 exercises