smartPep R package vignette

Usage

The starting file is ‘report.tsv’ generated by DIA-NN software.

`maxlfq()` - for peptide without modifications

This function applies diann::maxlfq() for peptide without modifications MaxLFQ algorithm and maintains important columns from the original report, such as “Protein.Ids”, “Genes”, “Stripped.Sequence” and “Modified.Sequence”. Algorithm’s cut-offs are Q.Value <= 0.01 & PG.Q.Value <= 0.01.

library(smartPep)
df_quantified <- maxlfq("example_data/report.tsv")

`merge2reps()` - merge together technical duplicates

Make sure columns are arranged so the technical duplicates of the same sample are grouped together, one after another. Merging is done by averaging two measurements, or imputing one from another in case of NA’s. Samples are grouped as they were in sequence and tagged as Sample1, Sample2 etc.

df_merged <- merge2reps(df_quantified)
head(df_merged)

Protein.Ids	Genes	Stripped.Sequence	Modified.Sequence	Sample1	Sample2	Sample3	Sample4
O00299	CLIC1	AEEQPQVEL	(UniMod:1)AEEQPQVEL	NA	762203.0	NA	896099.5
O00299	CLIC1	AEEQPQVELF	(UniMod:1)AEEQPQVELF	NA	NA	NA	NA
O00629	KPNA4	ADNEKLDNQR	(UniMod:1)ADNEKLDNQR	NA	536036.5	NA	NA
O00629	KPNA4	ADNEKLDNQRLKN	(UniMod:1)ADNEKLDNQRLKN	121522	NA	NA	NA
O15173-2	PGRMC2	AAGDGDVKL	(UniMod:1)AAGDGDVKL	NA	1438815.2	329632.2	289326.8
O43681	GET3	AAGVAGWGVEA	(UniMod:1)AAGVAGWGVEA	NA	NA	112700.0	NA

Pivot to long format with NA rows exclusion after log2 transformation

df_merged <- mutate(df_merged, log2(df_merged[5:8]))

long_df <- pivot_longer(df_merged, 
                        cols = 5:8, 
                        names_to = "Sample", 
                        values_to = "Intensity", 
                        values_drop_na = TRUE)
head(long_df)

Protein.Ids	Genes	Stripped.Sequence	Modified.Sequence	Sample	Intensity
O00299	CLIC1	AEEQPQVEL	(UniMod:1)AEEQPQVEL	Sample2	19.53982
O00299	CLIC1	AEEQPQVEL	(UniMod:1)AEEQPQVEL	Sample4	19.77330
O00629	KPNA4	ADNEKLDNQR	(UniMod:1)ADNEKLDNQR	Sample2	19.03197
O00629	KPNA4	ADNEKLDNQRLKN	(UniMod:1)ADNEKLDNQRLKN	Sample1	16.89086
O15173-2	PGRMC2	AAGDGDVKL	(UniMod:1)AAGDGDVKL	Sample2	20.45645
O15173-2	PGRMC2	AAGDGDVKL	(UniMod:1)AAGDGDVKL	Sample3	18.33050

`sum_stats()` - compute basic statistics for each sample

sum_stats(long_df)

Sample	IDs	Mean	Median	SD	Var
Sample1	51	18.31008	18.60418	0.9616296	0.9247314
Sample2	38	18.38269	18.38587	1.0726181	1.1505095
Sample3	36	17.91953	17.97004	1.0729246	1.1511672
Sample4	26	18.48509	18.55270	1.1474953	1.3167455

`plot_length_distr()` - analyze peptide length distribution per sample

plot_length_distr(long_df)

`pull_ids()` - retrieve peptide sequences or their genes of origin

peptide_ids <- pull_ids(long_df, sample_col = "Sample", sequence_col = "Stripped.Sequence", save = FALSE)
head(peptide_ids$Sample3)
#> [1] "AAGDGDVKL"     "AAGVAGWGVEA"   "AAGSITTL"      "ACGLVAS"      
#> [5] "ADQAPFDTDVNTL" "ADDFGF"

`hm_all()` - plot a standardized global heatmap of peptide intensity

hm_all(df_merged, 5:8)

`hm_small()` - plot a standardized heatmap of peptide intensity based on a gene pattern of interest

Peptide indexes are generated next to a gene symbol.

hm_small(df_merged, genePattern = "LGALS", cols = 5:8)

`cleanup_ranks()` - extract BA ranks values for each HLA allele from an .xlsx file NetMHCpan 4.1 binding affinity results

Start by downloading .xls file from NetMHCpan 4.1 results with BA predictions. Open in Excel and save as .xlsx or other format. Then import it to your environment with the column names disabled. This is important, as there are two rows of column names and the cleanup_ranks() function extracts the important info from both rows.

ba <- readxl::read_excel("example_data/49497_NetMHCpan.xlsx", col_names = FALSE)

ba <- cleanup_ranks(ba)
head(ba)

Peptide	HLA-A01:01	HLA-A02:01	HLA-A03:01	HLA-B15:01	HLA-B39:01
ASFNPHGIST	25.040700000000001	27.346699999999998	14.604900000000001	15.2117	37.842500000000001
ITAVGPAHF	2.6301000000000001	31.979700000000001	18.332899999999999	0.71730000000000005	31.594899999999999
PSVNDITAVGPAHF	29.412400000000002	70.384699999999995	62.4131	13.575799999999999	80.647800000000004
VMAPRTLVL	6.1005000000000003	1.8270999999999999	4.1405000000000003	0.25119999999999998	0.4955
AAGAGLPESV	36.256300000000003	10.8019	55.550600000000003	32.315100000000001	33.286499999999997
AAGAGLPESVIW	51.688899999999997	65.208100000000002	73.395499999999998	45.195	75.871799999999993

`binding_category()` - categorize NetMHCpan 4.1 BA_rank for each HLA allele

Categorize extracted BA_ranks to “Strong binder” (BA rank < 0.5), “Weak binder” (BA rank < 2) and “Non-binder” or “Binder” in case of assessing any allele.

ba <- binding_category(ba)
head(ba)

Peptide	HLA-A01:01	HLA-A02:01	HLA-A03:01	HLA-B15:01	HLA-B39:01	Any HLA allele
ASFNPHGIST	Non-binder	Non-binder	Weak binder	Weak binder	Non-binder	Binder
ITAVGPAHF	Non-binder	Non-binder	Weak binder	Weak binder	Non-binder	Binder
PSVNDITAVGPAHF	Non-binder	Non-binder	Non-binder	Weak binder	Non-binder	Binder
VMAPRTLVL	Non-binder	Weak binder	Non-binder	Strong binder	Strong binder	Binder
AAGAGLPESV	Non-binder	Weak binder	Non-binder	Non-binder	Non-binder	Binder
AAGAGLPESVIW	Non-binder	Non-binder	Non-binder	Non-binder	Non-binder	Non-binder

`plot_binder_shares()` - plot shares of binders for each HLA allele and globally

ba_long <- pivot_longer(ba, 2:7,
                        values_to = "Binding category", 
                        names_to = "HLA supertype")

plot_binder_shares(ba_long, 
                   MHC_col = "HLA supertype", 
                   category_col = "Binding category")

smartPep R package vignette

Martyna Muszczek

2023-09-04

Introduction

Dependencies

Usage

`maxlfq()` - for peptide without modifications

`merge2reps()` - merge together technical duplicates

Pivot to long format with NA rows exclusion after log2 transformation

`sum_stats()` - compute basic statistics for each sample

`plot_length_distr()` - analyze peptide length distribution per sample

`pull_ids()` - retrieve peptide sequences or their genes of origin

`hm_all()` - plot a standardized global heatmap of peptide intensity

`hm_small()` - plot a standardized heatmap of peptide intensity based on a gene pattern of interest

`cleanup_ranks()` - extract BA ranks values for each HLA allele from an .xlsx file NetMHCpan 4.1 binding affinity results

`binding_category()` - categorize NetMHCpan 4.1 BA_rank for each HLA allele

`plot_binder_shares()` - plot shares of binders for each HLA allele and globally

smartPep R package vignette

Martyna Muszczek

2023-09-04

Introduction

Dependencies

Usage

maxlfq() - for peptide without modifications

merge2reps() - merge together technical duplicates

Pivot to long format with NA rows exclusion after log2 transformation

sum_stats() - compute basic statistics for each sample

plot_length_distr() - analyze peptide length distribution per sample

pull_ids() - retrieve peptide sequences or their genes of origin

hm_all() - plot a standardized global heatmap of peptide intensity

hm_small() - plot a standardized heatmap of peptide intensity based on a gene pattern of interest

cleanup_ranks() - extract BA ranks values for each HLA allele from an .xlsx file NetMHCpan 4.1 binding affinity results

binding_category() - categorize NetMHCpan 4.1 BA_rank for each HLA allele

plot_binder_shares() - plot shares of binders for each HLA allele and globally

`maxlfq()` - for peptide without modifications

`merge2reps()` - merge together technical duplicates

`sum_stats()` - compute basic statistics for each sample

`plot_length_distr()` - analyze peptide length distribution per sample

`pull_ids()` - retrieve peptide sequences or their genes of origin

`hm_all()` - plot a standardized global heatmap of peptide intensity

`hm_small()` - plot a standardized heatmap of peptide intensity based on a gene pattern of interest

`cleanup_ranks()` - extract BA ranks values for each HLA allele from an .xlsx file NetMHCpan 4.1 binding affinity results

`binding_category()` - categorize NetMHCpan 4.1 BA_rank for each HLA allele

`plot_binder_shares()` - plot shares of binders for each HLA allele and globally