This package is used to process peptide quantification data obtained from DIA-NN (https://github.com/vdemichev/DiaNN) ‘report.tsv’ file and peptide binding affinity to MHC class I molecule prediction from NetMHCpan 4.1 .xls file with BA ranks.
R packages:
tidyverse
diann
ComplexHeatmap
scrime
DEP
The starting file is ‘report.tsv’ generated by DIA-NN software.
maxlfq()
- for peptide without modificationsThis function applies diann::maxlfq()
for peptide
without modifications MaxLFQ algorithm and maintains important columns
from the original report, such as “Protein.Ids”, “Genes”,
“Stripped.Sequence” and “Modified.Sequence”. Algorithm’s cut-offs are
Q.Value <= 0.01 & PG.Q.Value <= 0.01.
library(smartPep)
<- maxlfq("example_data/report.tsv") df_quantified
merge2reps()
- merge together technical duplicatesMake sure columns are arranged so the technical duplicates of the same sample are grouped together, one after another. Merging is done by averaging two measurements, or imputing one from another in case of NA’s. Samples are grouped as they were in sequence and tagged as Sample1, Sample2 etc.
<- merge2reps(df_quantified)
df_merged head(df_merged)
Protein.Ids | Genes | Stripped.Sequence | Modified.Sequence | Sample1 | Sample2 | Sample3 | Sample4 |
---|---|---|---|---|---|---|---|
O00299 | CLIC1 | AEEQPQVEL | (UniMod:1)AEEQPQVEL | NA | 762203.0 | NA | 896099.5 |
O00299 | CLIC1 | AEEQPQVELF | (UniMod:1)AEEQPQVELF | NA | NA | NA | NA |
O00629 | KPNA4 | ADNEKLDNQR | (UniMod:1)ADNEKLDNQR | NA | 536036.5 | NA | NA |
O00629 | KPNA4 | ADNEKLDNQRLKN | (UniMod:1)ADNEKLDNQRLKN | 121522 | NA | NA | NA |
O15173-2 | PGRMC2 | AAGDGDVKL | (UniMod:1)AAGDGDVKL | NA | 1438815.2 | 329632.2 | 289326.8 |
O43681 | GET3 | AAGVAGWGVEA | (UniMod:1)AAGVAGWGVEA | NA | NA | 112700.0 | NA |
<- mutate(df_merged, log2(df_merged[5:8]))
df_merged
<- pivot_longer(df_merged,
long_df cols = 5:8,
names_to = "Sample",
values_to = "Intensity",
values_drop_na = TRUE)
head(long_df)
Protein.Ids | Genes | Stripped.Sequence | Modified.Sequence | Sample | Intensity |
---|---|---|---|---|---|
O00299 | CLIC1 | AEEQPQVEL | (UniMod:1)AEEQPQVEL | Sample2 | 19.53982 |
O00299 | CLIC1 | AEEQPQVEL | (UniMod:1)AEEQPQVEL | Sample4 | 19.77330 |
O00629 | KPNA4 | ADNEKLDNQR | (UniMod:1)ADNEKLDNQR | Sample2 | 19.03197 |
O00629 | KPNA4 | ADNEKLDNQRLKN | (UniMod:1)ADNEKLDNQRLKN | Sample1 | 16.89086 |
O15173-2 | PGRMC2 | AAGDGDVKL | (UniMod:1)AAGDGDVKL | Sample2 | 20.45645 |
O15173-2 | PGRMC2 | AAGDGDVKL | (UniMod:1)AAGDGDVKL | Sample3 | 18.33050 |
sum_stats()
- compute basic statistics for each
samplesum_stats(long_df)
Sample | IDs | Mean | Median | SD | Var |
---|---|---|---|---|---|
Sample1 | 51 | 18.31008 | 18.60418 | 0.9616296 | 0.9247314 |
Sample2 | 38 | 18.38269 | 18.38587 | 1.0726181 | 1.1505095 |
Sample3 | 36 | 17.91953 | 17.97004 | 1.0729246 | 1.1511672 |
Sample4 | 26 | 18.48509 | 18.55270 | 1.1474953 | 1.3167455 |
plot_length_distr()
- analyze peptide length
distribution per sampleplot_length_distr(long_df)
pull_ids()
- retrieve peptide sequences or their genes
of origin<- pull_ids(long_df, sample_col = "Sample", sequence_col = "Stripped.Sequence", save = FALSE)
peptide_ids head(peptide_ids$Sample3)
#> [1] "AAGDGDVKL" "AAGVAGWGVEA" "AAGSITTL" "ACGLVAS"
#> [5] "ADQAPFDTDVNTL" "ADDFGF"
hm_all()
- plot a standardized global heatmap of
peptide intensityhm_all(df_merged, 5:8)
hm_small()
- plot a standardized heatmap of peptide
intensity based on a gene pattern of interestPeptide indexes are generated next to a gene symbol.
hm_small(df_merged, genePattern = "LGALS", cols = 5:8)
cleanup_ranks()
- extract BA ranks values for each HLA
allele from an .xlsx file NetMHCpan 4.1 binding affinity resultsStart by downloading .xls file from NetMHCpan 4.1 results with BA
predictions. Open in Excel and save as .xlsx or other format. Then
import it to your environment with the column names disabled. This is
important, as there are two rows of column names and the
cleanup_ranks()
function extracts the important info from
both rows.
<- readxl::read_excel("example_data/49497_NetMHCpan.xlsx", col_names = FALSE)
ba
<- cleanup_ranks(ba)
ba head(ba)
Peptide | HLA-A01:01 | HLA-A02:01 | HLA-A03:01 | HLA-B15:01 | HLA-B39:01 |
---|---|---|---|---|---|
ASFNPHGIST | 25.040700000000001 | 27.346699999999998 | 14.604900000000001 | 15.2117 | 37.842500000000001 |
ITAVGPAHF | 2.6301000000000001 | 31.979700000000001 | 18.332899999999999 | 0.71730000000000005 | 31.594899999999999 |
PSVNDITAVGPAHF | 29.412400000000002 | 70.384699999999995 | 62.4131 | 13.575799999999999 | 80.647800000000004 |
VMAPRTLVL | 6.1005000000000003 | 1.8270999999999999 | 4.1405000000000003 | 0.25119999999999998 | 0.4955 |
AAGAGLPESV | 36.256300000000003 | 10.8019 | 55.550600000000003 | 32.315100000000001 | 33.286499999999997 |
AAGAGLPESVIW | 51.688899999999997 | 65.208100000000002 | 73.395499999999998 | 45.195 | 75.871799999999993 |
binding_category()
- categorize NetMHCpan 4.1 BA_rank
for each HLA alleleCategorize extracted BA_ranks to “Strong binder” (BA rank < 0.5), “Weak binder” (BA rank < 2) and “Non-binder” or “Binder” in case of assessing any allele.
<- binding_category(ba)
ba head(ba)
Peptide | HLA-A01:01 | HLA-A02:01 | HLA-A03:01 | HLA-B15:01 | HLA-B39:01 | Any HLA allele |
---|---|---|---|---|---|---|
ASFNPHGIST | Non-binder | Non-binder | Weak binder | Weak binder | Non-binder | Binder |
ITAVGPAHF | Non-binder | Non-binder | Weak binder | Weak binder | Non-binder | Binder |
PSVNDITAVGPAHF | Non-binder | Non-binder | Non-binder | Weak binder | Non-binder | Binder |
VMAPRTLVL | Non-binder | Weak binder | Non-binder | Strong binder | Strong binder | Binder |
AAGAGLPESV | Non-binder | Weak binder | Non-binder | Non-binder | Non-binder | Binder |
AAGAGLPESVIW | Non-binder | Non-binder | Non-binder | Non-binder | Non-binder | Non-binder |