smartPep R package vignette

Martyna Muszczek

2023-09-04

Introduction

This package is used to process peptide quantification data obtained from DIA-NN (https://github.com/vdemichev/DiaNN) ‘report.tsv’ file and peptide binding affinity to MHC class I molecule prediction from NetMHCpan 4.1 .xls file with BA ranks.

Dependencies

R packages:

Usage

The starting file is ‘report.tsv’ generated by DIA-NN software.

maxlfq() - for peptide without modifications

This function applies diann::maxlfq() for peptide without modifications MaxLFQ algorithm and maintains important columns from the original report, such as “Protein.Ids”, “Genes”, “Stripped.Sequence” and “Modified.Sequence”. Algorithm’s cut-offs are Q.Value <= 0.01 & PG.Q.Value <= 0.01.

library(smartPep)
df_quantified <- maxlfq("example_data/report.tsv")

merge2reps() - merge together technical duplicates

Make sure columns are arranged so the technical duplicates of the same sample are grouped together, one after another. Merging is done by averaging two measurements, or imputing one from another in case of NA’s. Samples are grouped as they were in sequence and tagged as Sample1, Sample2 etc.

df_merged <- merge2reps(df_quantified)
head(df_merged)
Protein.Ids Genes Stripped.Sequence Modified.Sequence Sample1 Sample2 Sample3 Sample4
O00299 CLIC1 AEEQPQVEL (UniMod:1)AEEQPQVEL NA 762203.0 NA 896099.5
O00299 CLIC1 AEEQPQVELF (UniMod:1)AEEQPQVELF NA NA NA NA
O00629 KPNA4 ADNEKLDNQR (UniMod:1)ADNEKLDNQR NA 536036.5 NA NA
O00629 KPNA4 ADNEKLDNQRLKN (UniMod:1)ADNEKLDNQRLKN 121522 NA NA NA
O15173-2 PGRMC2 AAGDGDVKL (UniMod:1)AAGDGDVKL NA 1438815.2 329632.2 289326.8
O43681 GET3 AAGVAGWGVEA (UniMod:1)AAGVAGWGVEA NA NA 112700.0 NA

Pivot to long format with NA rows exclusion after log2 transformation

df_merged <- mutate(df_merged, log2(df_merged[5:8]))

long_df <- pivot_longer(df_merged, 
                        cols = 5:8, 
                        names_to = "Sample", 
                        values_to = "Intensity", 
                        values_drop_na = TRUE)
head(long_df)
Protein.Ids Genes Stripped.Sequence Modified.Sequence Sample Intensity
O00299 CLIC1 AEEQPQVEL (UniMod:1)AEEQPQVEL Sample2 19.53982
O00299 CLIC1 AEEQPQVEL (UniMod:1)AEEQPQVEL Sample4 19.77330
O00629 KPNA4 ADNEKLDNQR (UniMod:1)ADNEKLDNQR Sample2 19.03197
O00629 KPNA4 ADNEKLDNQRLKN (UniMod:1)ADNEKLDNQRLKN Sample1 16.89086
O15173-2 PGRMC2 AAGDGDVKL (UniMod:1)AAGDGDVKL Sample2 20.45645
O15173-2 PGRMC2 AAGDGDVKL (UniMod:1)AAGDGDVKL Sample3 18.33050

sum_stats() - compute basic statistics for each sample

sum_stats(long_df)
Sample IDs Mean Median SD Var
Sample1 51 18.31008 18.60418 0.9616296 0.9247314
Sample2 38 18.38269 18.38587 1.0726181 1.1505095
Sample3 36 17.91953 17.97004 1.0729246 1.1511672
Sample4 26 18.48509 18.55270 1.1474953 1.3167455

plot_length_distr() - analyze peptide length distribution per sample

plot_length_distr(long_df)

pull_ids() - retrieve peptide sequences or their genes of origin

peptide_ids <- pull_ids(long_df, sample_col = "Sample", sequence_col = "Stripped.Sequence", save = FALSE)
head(peptide_ids$Sample3)
#> [1] "AAGDGDVKL"     "AAGVAGWGVEA"   "AAGSITTL"      "ACGLVAS"      
#> [5] "ADQAPFDTDVNTL" "ADDFGF"

hm_all() - plot a standardized global heatmap of peptide intensity

hm_all(df_merged, 5:8)

hm_small() - plot a standardized heatmap of peptide intensity based on a gene pattern of interest

Peptide indexes are generated next to a gene symbol.

hm_small(df_merged, genePattern = "LGALS", cols = 5:8)

cleanup_ranks() - extract BA ranks values for each HLA allele from an .xlsx file NetMHCpan 4.1 binding affinity results

Start by downloading .xls file from NetMHCpan 4.1 results with BA predictions. Open in Excel and save as .xlsx or other format. Then import it to your environment with the column names disabled. This is important, as there are two rows of column names and the cleanup_ranks() function extracts the important info from both rows.

ba <- readxl::read_excel("example_data/49497_NetMHCpan.xlsx", col_names = FALSE)

ba <- cleanup_ranks(ba)
head(ba)
Peptide HLA-A01:01 HLA-A02:01 HLA-A03:01 HLA-B15:01 HLA-B39:01
ASFNPHGIST 25.040700000000001 27.346699999999998 14.604900000000001 15.2117 37.842500000000001
ITAVGPAHF 2.6301000000000001 31.979700000000001 18.332899999999999 0.71730000000000005 31.594899999999999
PSVNDITAVGPAHF 29.412400000000002 70.384699999999995 62.4131 13.575799999999999 80.647800000000004
VMAPRTLVL 6.1005000000000003 1.8270999999999999 4.1405000000000003 0.25119999999999998 0.4955
AAGAGLPESV 36.256300000000003 10.8019 55.550600000000003 32.315100000000001 33.286499999999997
AAGAGLPESVIW 51.688899999999997 65.208100000000002 73.395499999999998 45.195 75.871799999999993

binding_category() - categorize NetMHCpan 4.1 BA_rank for each HLA allele

Categorize extracted BA_ranks to “Strong binder” (BA rank < 0.5), “Weak binder” (BA rank < 2) and “Non-binder” or “Binder” in case of assessing any allele.

ba <- binding_category(ba)
head(ba)
Peptide HLA-A01:01 HLA-A02:01 HLA-A03:01 HLA-B15:01 HLA-B39:01 Any HLA allele
ASFNPHGIST Non-binder Non-binder Weak binder Weak binder Non-binder Binder
ITAVGPAHF Non-binder Non-binder Weak binder Weak binder Non-binder Binder
PSVNDITAVGPAHF Non-binder Non-binder Non-binder Weak binder Non-binder Binder
VMAPRTLVL Non-binder Weak binder Non-binder Strong binder Strong binder Binder
AAGAGLPESV Non-binder Weak binder Non-binder Non-binder Non-binder Binder
AAGAGLPESVIW Non-binder Non-binder Non-binder Non-binder Non-binder Non-binder

plot_binder_shares() - plot shares of binders for each HLA allele and globally

ba_long <- pivot_longer(ba, 2:7,
                        values_to = "Binding category", 
                        names_to = "HLA supertype")

plot_binder_shares(ba_long, 
                   MHC_col = "HLA supertype", 
                   category_col = "Binding category")