Download the Tcell_RNAseq_counts.txt file from this link that contains expression data in the form of raw gene counts from a RNA sequencing (RNA-seq) experiment of CD4+ T cells stimulated with various cytokines (Cano-Gamez A. et al. 2020).
Load the file in R as a data.frame using the read.table function and check its format by printing to screen the first five rows and columns.
Hint: make sure to set the row.names, header, and sep parameters according to your file format.
counts <- read.table("../Data/Data/Day6/Tcell_RNAseq_counts.txt",
row.names = 1,
header = TRUE,
sep = "\t")
print(counts[1:5,1:5])
## I0712 I0713 I0717 I0721 I0726
## ENSG00000223972 0 0 0 0 0
## ENSG00000227232 55 93 198 131 75
## ENSG00000278267 6 6 28 5 4
## ENSG00000243485 0 0 0 3 1
## ENSG00000237613 0 0 0 0 0
Download the Tcell_RNAseq_metadata.txt file from this link containing matched metadata about the samples.
Load this file in R as done for the count file.
metadata <- read.table("../Data/Data/Day6/Tcell_RNAseq_metadata.txt",
row.names = 1,
header = TRUE,
sep = "\t")
Check the column names in the metadata to see which information is available for each sample.
colnames(metadata)
## [1] "sample_id" "cell_type" "cytokine_condition"
## [4] "stimulation_time" "donor_id" "sex"
## [7] "age" "sequencing_batch" "cell_culture_batch"
## [10] "rna_integrity_number"
From the full metadata data.frame of Ex. 6.1, select only the rows corresponding to CD4 naive T cells (cell_type column) treated with Th1- or Th2-inducing cytokines (cytokine_condition column) for 5 days (stimulation_time column), and save it into a new data.frame called metadata.sel.
Hint: you can identify the intersection of samples (i.e., rows) satisfying all conditions.
ind1 <- which(metadata[, "cell_type"] == "CD4_Naive")
ind2 <- which(metadata[, "cytokine_condition"] %in% c("Th1", "Th2"))
ind3 <- which(metadata[, "stimulation_time"] == "5d")
ind <- intersect(ind1, intersect(ind2, ind3))
metadata.sel <- metadata[ind,]
Using the table function, check how many samples from each cytokine condition (i.e., “Th1” and “Th2”) are available in metadata.sel.
table(metadata.sel[, "cytokine_condition"])
##
## Th1 Th2
## 3 4
Set the sample identifiers (sample_id column) as row names of the metadata.sel data.frame.
rownames(metadata.sel) <- metadata.sel[, "sample_id"]
Save in a vector named selsamples the list of sample identifiers (i.e., row names) of the metadata.sel data.frame.
selsamples <- rownames(metadata.sel)
Use this information to extract from the original count data only the samples (i.e., columns) corresponding to the selected identifiers and save the selected data into a new data.frame called counts.sel.
counts.sel <- counts[, selsamples]
Remove from counts.sel the genes (i.e., rows) having zero counts in all samples.
rmgenes <- which(apply(counts.sel,1,sum)==0)
counts.sel <- counts.sel[-rmgenes, ]
Check the dimension of the counts.sel data.frame.
dim(counts.sel)
## [1] 38257 7
Save metadata.sel and counts.sel in a file called Th1_Th2_RNAseq.RData.
save(metadata.sel,
counts.sel,
file = "Th1_Th2_RNAseq.RData")
We can use the metadata.sel and counts.sel data.frames generated in Ex. 6.2 (also available as RData at this link) to identify the gene differentially expressed after 5 days in naive CD4+ T cells stimulated with Th1- versus Th2-inducing cytokines.
To run differential gene expression analysis, we will use an R package called DESeq2 (Love M.I., Huber W., Anders S., 2014).
DESeq2 can be installed from Bioconductor using the following command:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2")
After DESeq2 installation and loading (using the library function), we can perform differential gene expression analysis in three steps.
First, we have to create a DESeq object:
dds <- DESeqDataSetFromMatrix(countData = counts.sel,
colData = metadata.sel,
design = ~cytokine_condition)
countData is a matrix or data.frame of counts with genes on the rows and sample identifiers on the columns (counts.sel in our case).
colData is a table of sample information (metadata.sel in our case).
design indicates how to model the samples; here, we want to measure the effect of the different cocktail of cytokines used for stimulation (i.e. either Th1- or Th2-inducing, as reported in the cytokine_condition of metadata.sel).
Second, we can use the DESeq function to analyze the created object:
dds <- DESeq(dds)
Third, we can extract the results in a tabular format:
res <- as.data.frame(results(dds))
For more information of the output format, have a look at DESeq2 manual.
Run a differential expression analysis with DESeq2 to compare Th2- versus Th1-induced CD4+ T cells following the 3 steps above.
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = counts.sel,
colData = metadata.sel,
design = ~cytokine_condition)
dds <- DESeq(dds)
DEres <- as.data.frame(results(dds))
head(DEres)
## baseMean log2FoldChange lfcSE stat pvalue
## ENSG00000227232 89.1453519 -0.2796996 0.2181235 -1.2822995 0.1997376
## ENSG00000278267 5.8793927 -0.5644860 0.7199869 -0.7840226 0.4330268
## ENSG00000243485 0.4591155 2.0699388 3.2005691 0.6467409 0.5177997
## ENSG00000238009 0.2655925 1.3239357 3.8044554 0.3479961 0.7278431
## ENSG00000233750 1.3241781 -1.6467622 1.6096188 -1.0230759 0.3062720
## ENSG00000268903 39.9906629 -0.3799833 0.5096202 -0.7456206 0.4558966
## padj
## ENSG00000227232 0.4835463
## ENSG00000278267 0.7301260
## ENSG00000243485 NA
## ENSG00000238009 NA
## ENSG00000233750 NA
## ENSG00000268903 0.7463638
Save the results of the differential expression analysis (i.e., DEres) in a file called Th1_Th2_DEres.RData.
save(DEres, file = "Th1_Th2_DEres.RData")
Sort the DEres data.frame generated in Ex. 6.5 (also available as RData at this link) to have the p-values (pvalue column) in increasing order.
DEres <- DEres[order(DEres[,"pvalue"]),]
Add a column called DE that indicates whether a gene is differentially expressed. The column should be equal to TRUE if a gene has a p-value adjusted for multiple testing (padj column) lower than 0.01 and an absolute log-fold-change (log2FoldChange column) greater than 2, and FALSE otherwise.
alpha <- 0.01
minLFC <- 2
DEres[,"DE"] <- FALSE
DEgenes <- intersect(which(DEres[,"padj"] < alpha),
which(abs(DEres[,"log2FoldChange"]) > minLFC))
DEres[DEgenes,"DE"] <- TRUE
Count how many genes are differentially expressed.
sum(DEres[,"DE"])
## [1] 400
Count how many genes are upregulated in Th2 versus Th1 cells, i.e. they are differentially expressed and have a log-fold-change (log2FoldChange column) greater than 0.
genesUP <- intersect(which(DEres[,"DE"]),
which(DEres[,"log2FoldChange"]>0))
length(genesUP)
## [1] 212
Count how many genes are downregulated in Th2 versus Th1 cells, i.e. they are differentially expressed and have a log-fold-change (log2FoldChange column) lower than 0.
genesDN <- intersect(which(DEres[,"DE"]),
which(DEres[,"log2FoldChange"]<0))
length(genesDN)
## [1] 188
We can use the plot function to visualize the results of our differential gene expression analysis as a Volcano plot.
To do so, we can consider the p-values and log-fold-changes in the DEres file and plot as points (pch=19) the -log10(p-value) of each point (y-axis) versus its log-fold change (x-axis).
plot(DEres[,"log2FoldChange"],
-log10(DEres[,"pvalue"]),
pch = 19)
The genes with the smallest p-values (i.e., with more significant differences) are represented at the top of the plot. Points located in the rightmost side of the plot represent genes with large, positive log-fold-changes, i.e. upregulated in Th2 vs. Th1 cells. Points located in the leftmost side of the plot represent genes with large, negative log-fold-changes, i.e. downregulated in Th2 vs. Th1 cells.
To facilitate plot interpretation, add two thin lines representing the x=0 and y=0 axes, and add title and axis names with 1.2 magnification.
plot(DEres[,"log2FoldChange"],
-log10(DEres[,"pvalue"]),
pch = 19,
main = "Th2 vs. Th1 DE analysis",
xlab = "log2-fold-change",
ylab = "-log10(p-value)",
cex.lab = 1.2,
cex.axis = 1.2,
cex.main = 1.2)
abline(h=0, col="grey4", lwd=0.8, lty=2)
abline(v=0, col="grey4", lwd=0.8, lty=2)
Set the color of the points corresponding to differentially expressed genes to “organge”, and to “grey” for the remaining ones.
col <- rep("grey", nrow(DEres))
col[which(DEres[,"DE"])] <- "orange"
plot(DEres[,"log2FoldChange"],
-log10(DEres[,"pvalue"]),
pch = 19,
main = "Th2 vs. Th1 DE analysis",
xlab = "log2-fold-change",
ylab = "-log10(p-value)",
cex.lab = 1.2,
cex.axis = 1.2,
cex.main = 1.2,
col = col)
abline(h=0, col="grey4", lwd=0.8, lty=2)
abline(v=0, col="grey4", lwd=0.8, lty=2)
Add a legend for interpreting the colors assigned to differentially expressed (“DE”) and non-differentially expressed (“nonDE”) genes.
plot(DEres[,"log2FoldChange"],
-log10(DEres[,"pvalue"]),
pch = 19,
main = "Th2 vs. Th1 DE analysis",
xlab = "log2-fold-change",
ylab = "-log10(p-value)",
cex.lab = 1.2,
cex.axis = 1.2,
cex.main = 1.2,
col = col)
abline(h=0, col="grey4", lwd=0.8, lty=2)
abline(v=0, col="grey4", lwd=0.8, lty=2)
legend("topleft",
pch = 19,
bty = "n",
col = c("grey", "orange"),
legend = c("nonDE", "DE"))
Save the plot in a PDF file named Th2_vs_Th2_RNAseq_DE_volcano.
pdf("Th2_vs_Th2_RNAseq_DE_volcano.pdf")
plot(DEres[,"log2FoldChange"],
-log10(DEres[,"pvalue"]),
pch = 19,
main = "Th2 vs. Th1 DE analysis",
xlab = "log2-fold-change",
ylab = "-log10(p-value)",
cex.lab = 1.2,
cex.axis = 1.2,
cex.main = 1.2,
col = col)
abline(h=0, col="grey4", lwd=0.8, lty=2)
abline(v=0, col="grey4", lwd=0.8, lty=2)
legend("topleft",
pch = 19,
bty = "n",
col = c("grey", "orange"),
legend = c("nonDE", "DE"))
dev.off()
## quartz_off_screen
## 2