Recap Lab 6

Leon Eyrich Jessen

Prologue: Process vs. Product

  • This course is all about the process

  • Bio Data Science is the process by which you create analysis results (products)

  • Therefore, yes you can do, e.g.:

my_data$new_var <- some_calculation()
my_data[my_data == "N/A"] <- NA
  • And it will create the right product, but in the context of this course, it is wrong, it is the wrong process!

  • For your project, regardless of your product, the use of the wrong process, will get you points deducted, so you might as well get used to it!

  • This course is on tidyverse R, not base R (different dialect)

I understand you are used to thinking in products, if helpful consider in this course, that the process IS the product!

Joins

X <- tibble(
  id = c("blue_1", "blue_2", "red_3", "red_4"),
  var_x = c(27, 42, 33, 64))
X
# A tibble: 4 × 2
  id     var_x
  <chr>  <dbl>
1 blue_1    27
2 blue_2    42
3 red_3     33
4 red_4     64
Y <- tibble(
  id = c("blue_1", "blue_2", "green_5", "green_6"),
  var_x = c(63, 17, 8, 95))
Y
# A tibble: 4 × 2
  id      var_x
  <chr>   <dbl>
1 blue_1     63
2 blue_2     17
3 green_5     8
4 green_6    95

Joins

full_join()

full_join(X, Y, by = "id")
# A tibble: 6 × 3
  id      var_x.x var_x.y
  <chr>     <dbl>   <dbl>
1 blue_1       27      63
2 blue_2       42      17
3 red_3        33      NA
4 red_4        64      NA
5 green_5      NA       8
6 green_6      NA      95

Joins

inner_join()

inner_join(X, Y, by = "id")
# A tibble: 2 × 3
  id     var_x.x var_x.y
  <chr>    <dbl>   <dbl>
1 blue_1      27      63
2 blue_2      42      17

Joins

left_join()

left_join(X, Y, by = "id")
# A tibble: 4 × 3
  id     var_x.x var_x.y
  <chr>    <dbl>   <dbl>
1 blue_1      27      63
2 blue_2      42      17
3 red_3       33      NA
4 red_4       64      NA

Joins

right_join()

right_join(X, Y, by = "id")
# A tibble: 4 × 3
  id      var_x.x var_x.y
  <chr>     <dbl>   <dbl>
1 blue_1       27      63
2 blue_2       42      17
3 green_5      NA       8
4 green_6      NA      95

Joins

anti_join()

anti_join(X, Y, by = "id")
# A tibble: 2 × 2
  id    var_x
  <chr> <dbl>
1 red_3    33
2 red_4    64

Pivotting

my_data <- tibble(
  id = c(1, 2),
  X = c("a", "d"),
  Y = c("b", "e"),
  Z = c("c", "f"))
my_data
# A tibble: 2 × 4
     id X     Y     Z    
  <dbl> <chr> <chr> <chr>
1     1 a     b     c    
2     2 d     e     f    

Pivotting

my_data_long <- my_data |> 
  pivot_longer(
    cols = c(X, Y, Z),    # Equiv.: -id
    names_to = "names",   # Quoted: New variable
    values_to = "values") # Quoted: New variable
my_data_long
# A tibble: 6 × 3
     id names values
  <dbl> <chr> <chr> 
1     1 X     a     
2     1 Y     b     
3     1 Z     c     
4     2 X     d     
5     2 Y     e     
6     2 Z     f     

Pivotting

my_data_wide <- my_data_long |> 
  pivot_wider(
    id_cols = id,
    names_from = names,   # Un-quoted: Existing variable
    values_from = values) # Un-quoted: Existing variable
my_data_wide
# A tibble: 2 × 4
     id X     Y     Z    
  <dbl> <chr> <chr> <chr>
1     1 a     b     c    
2     2 d     e     f    

Lab 5 Exercises

Load libraries

library("tidyverse")

Lab 5 Exercises

Load meta data

meta_data <- read_csv(file = "data/subject-metadata.csv")
meta_data
# A tibble: 144 × 30
   Experiment Subject `Cell Type` `Target Type` Cohort        Age   Gender Race 
   <chr>        <dbl> <chr>       <chr>         <chr>         <chr> <chr>  <chr>
 1 eAM13          844 PBMC        C19_cI        COVID-19-Con… 34    F      White
 2 eAM23         5422 PBMC        C19_cI        COVID-19-Con… 48    M      N/A  
 3 eAV100        1995 PBMC        C19_cII       COVID-19-Con… 29    F      N/A  
 4 eAV105        1995 PBMC        C19_cII       COVID-19-Con… 29    F      N/A  
 5 eAV88        19830 naive_CD8   C19_cI        Healthy (No … 24    M      White
 6 eAV91        19855 naive_CD8   C19_cI        Healthy (No … 31    M      White
 7 eAV93        20655 naive_CD8   C19_cI        Healthy (No … 41    M      White
 8 eDH105        3533 PBMC        C19_cI        COVID-19-Con… 32    F      N/A  
 9 eDH107        1684 PBMC        C19_cI        COVID-19-Con… 72    F      N/A  
10 eDH113     1327414 PBMC        C19_cI        Healthy (No … 56    N/A    N/A  
# ℹ 134 more rows
# ℹ 22 more variables: `HLA-A...9` <chr>, `HLA-A...10` <chr>,
#   `HLA-B...11` <chr>, `HLA-B...12` <chr>, `HLA-C...13` <chr>,
#   `HLA-C...14` <chr>, DPA1...15 <chr>, DPA1...16 <chr>, DPB1...17 <chr>,
#   DPB1...18 <chr>, DQA1...19 <chr>, DQA1...20 <chr>, DQB1...21 <chr>,
#   DQB1...22 <chr>, DRB1...23 <chr>, DRB1...24 <chr>, DRB3...25 <chr>,
#   DRB3...26 <chr>, DRB4...27 <chr>, DRB4...28 <chr>, DRB5...29 <chr>, …

Lab 5 Exercises

Load meta data

meta_data <- read_csv(file = "data/subject-metadata.csv",
                      na = "N/A")
meta_data
# A tibble: 144 × 30
   Experiment Subject `Cell Type` `Target Type` Cohort          Age Gender Race 
   <chr>        <dbl> <chr>       <chr>         <chr>         <dbl> <chr>  <chr>
 1 eAM13          844 PBMC        C19_cI        COVID-19-Con…    34 F      White
 2 eAM23         5422 PBMC        C19_cI        COVID-19-Con…    48 M      <NA> 
 3 eAV100        1995 PBMC        C19_cII       COVID-19-Con…    29 F      <NA> 
 4 eAV105        1995 PBMC        C19_cII       COVID-19-Con…    29 F      <NA> 
 5 eAV88        19830 naive_CD8   C19_cI        Healthy (No …    24 M      White
 6 eAV91        19855 naive_CD8   C19_cI        Healthy (No …    31 M      White
 7 eAV93        20655 naive_CD8   C19_cI        Healthy (No …    41 M      White
 8 eDH105        3533 PBMC        C19_cI        COVID-19-Con…    32 F      <NA> 
 9 eDH107        1684 PBMC        C19_cI        COVID-19-Con…    72 F      <NA> 
10 eDH113     1327414 PBMC        C19_cI        Healthy (No …    56 <NA>   <NA> 
# ℹ 134 more rows
# ℹ 22 more variables: `HLA-A...9` <chr>, `HLA-A...10` <chr>,
#   `HLA-B...11` <chr>, `HLA-B...12` <chr>, `HLA-C...13` <chr>,
#   `HLA-C...14` <chr>, DPA1...15 <chr>, DPA1...16 <chr>, DPB1...17 <chr>,
#   DPB1...18 <chr>, DQA1...19 <chr>, DQA1...20 <chr>, DQB1...21 <chr>,
#   DQB1...22 <chr>, DRB1...23 <chr>, DRB1...24 <chr>, DRB3...25 <chr>,
#   DRB3...26 <chr>, DRB4...27 <chr>, DRB4...28 <chr>, DRB5...29 <chr>, …
  • Important to fix NAs
  • Can be: -99, N/A, -11, _, "" etc.
  • Data specific - Check!!!
  • If you’re lucky: Data documentation?

Lab 5 Exercises

Load peptide data

peptide_data <- read_csv(file = "data/peptide-detail-ci.csv.gz")
peptide_data
# A tibble: 154,320 × 7
   `TCR BioIdentity`            TCR Nucleotide Seque…¹ Experiment `ORF Coverage`
   <chr>                        <chr>                  <chr>      <chr>         
 1 CASSAQGTGDRGYTF+TCRBV27-01+… GAGTCGCCCAGCCCCAACCAG… eAV93      ORF1ab,surfac…
 2 CASSLVATGNTGELFF+TCRBV07-09… CGCACAGAGCAGGGGGACTCG… eOX56      ORF1ab,surfac…
 3 CASSKGTVSGLSG+TCRBV21-01+TC… GAGATCCAGTCCACGGAGTCA… eAV93      ORF1ab,surfac…
 4 CALKVGADTQYF+TCRBV30-01+TCR… CTGAGTTCTAAGAAGCTCCTT… eQD124     ORF1ab,surfac…
 5 CASSLWASGRGGTGELFF+TCRBV27-… AGCCCCAACCAGACCTCTCTG… eAV93      ORF1ab,surfac…
 6 CASSLLGWEQLDEQFF+TCRBV27-01… TCGCCCAGCCCCAACCAGACC… eMR16      ORF1ab,surfac…
 7 CASSSGTGVYGYTF+TCRBV12-X+TC… ATCCAGCCCTCAGAACCCAGG… eAV93      ORF1ab,surfac…
 8 CASSPLEWEGVTEAFF+TCRBV06-X+… TCGGCTGCTCCCTCCCAGACA… eQD123     ORF1ab,surfac…
 9 CASSFWSSGRGGTDTQYF+TCRBV27-… AGCCCCAACCAGACCTCTCTG… eAV93      ORF1ab,surfac…
10 CASSAGQGASDEQFF+TCRBV07-09+… CAGCGCACAGAGCAGGGGGAC… eMR15      ORF1ab,surfac…
# ℹ 154,310 more rows
# ℹ abbreviated name: ¹​`TCR Nucleotide Sequence`
# ℹ 3 more variables: `Amino Acids` <chr>, `Start Index in Genome` <dbl>,
#   `End Index in Genome` <dbl>

Lab 5 Exercises

  • Q1: How many observations of how many variables are in the data?
meta_data |>
  dim()
[1] 144  30

Lab 5 Exercises

  • Q2: Are there groupings in the variables, i.e. do certain variables “go together” somehow?
meta_data |> 
  colnames()
 [1] "Experiment"  "Subject"     "Cell Type"   "Target Type" "Cohort"     
 [6] "Age"         "Gender"      "Race"        "HLA-A...9"   "HLA-A...10" 
[11] "HLA-B...11"  "HLA-B...12"  "HLA-C...13"  "HLA-C...14"  "DPA1...15"  
[16] "DPA1...16"   "DPB1...17"   "DPB1...18"   "DQA1...19"   "DQA1...20"  
[21] "DQB1...21"   "DQB1...22"   "DRB1...23"   "DRB1...24"   "DRB3...25"  
[26] "DRB3...26"   "DRB4...27"   "DRB4...28"   "DRB5...29"   "DRB5...30"  

Lab 5 Exercises

  • T1: Re-create this plot
pl <- meta_data |> 
  drop_na(Cohort, Gender) |> 
  count(Cohort, Gender) |> 
  ggplot(aes(x = Cohort,
             y = n,
             fill = Gender)) +
  geom_col(position = position_dodge(
    preserve = "single"),
    colour = "black",
    alpha = 0.5) +
  geom_hline(yintercept = 0) +
  theme_minimal(base_size = 20) +
  theme(legend.position = "bottom",
        panel.grid.major.x = element_blank(),
        axis.text.x = element_text(angle = 10,
                                   hjust = 1,
                                   vjust = 1.5))

Lab 5 Exercises

  • T2: Re-create this plot
pl <- meta_data |> 
  mutate(Age_group = cut(
    x = Age,
    breaks = seq(from = 0,
                 to = 100,
                 by = 10))) |> 
  drop_na(Gender, Age_group) |> 
  count(Gender, Age_group) |> 
  ggplot(aes(x = Age_group,
             y = n,
             fill = Gender)) +
  geom_col(position = position_dodge(
    preserve = "single"),
    colour = "black",
    alpha = 0.5) +
  geom_hline(yintercept = 0) +
  theme_minimal(base_size = 20) +
  theme(legend.position = "bottom",
        panel.grid.major.x = element_blank(),
        axis.text.x = element_text(vjust = 5))

Lab 5 Exercises

  • T3: Look at the data and create yet another plot as you see fit. Also skip the redundant variables Subject, Cell Type and Target Type
pl <- meta_data |>
  ggplot(aes(x = Cohort,
             y = Age,
             fill = Gender)) +
  geom_boxplot(position = position_dodge(
    preserve = "single")) +
  theme_minimal(base_size = 20) +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 10,
                                   hjust = 1,
                                   vjust = 1.5))
meta_data <- meta_data |> 
  select(-Subject, -`Cell Type`, -`Target Type`)

Note the backticks in selecting variables, which contain spaces!

Lab 5 Exercises

library("table1") # <= Yes, this should normally go at the beginning!
meta_data |>
  mutate(Gender = factor(Gender),
         Cohort = factor(Cohort)) |>
  table1(x = formula(~ Gender + Age + Race | Cohort),
         data = _)
COVID-19-Acute
(N=4)
COVID-19-B-Non-Acute
(N=8)
COVID-19-Convalescent
(N=90)
COVID-19-Exposed
(N=3)
Healthy (No known exposure)
(N=39)
Overall
(N=144)
Gender
F 1 (25.0%) 4 (50.0%) 33 (36.7%) 1 (33.3%) 17 (43.6%) 56 (38.9%)
M 2 (50.0%) 3 (37.5%) 36 (40.0%) 0 (0%) 21 (53.8%) 62 (43.1%)
Missing 1 (25.0%) 1 (12.5%) 21 (23.3%) 2 (66.7%) 1 (2.6%) 26 (18.1%)
Age
Mean (SD) 50.7 (17.0) 43.7 (7.74) 51.5 (15.3) 35.0 (NA) 33.3 (9.93) 44.9 (15.7)
Median [Min, Max] 52.0 [33.0, 67.0] 42.0 [33.0, 53.0] 53.0 [21.0, 79.0] 35.0 [35.0, 35.0] 31.0 [21.0, 62.0] 42.0 [21.0, 79.0]
Missing 1 (25.0%) 1 (12.5%) 21 (23.3%) 2 (66.7%) 0 (0%) 25 (17.4%)
Race
African American 1 (25.0%) 0 (0%) 0 (0%) 0 (0%) 1 (2.6%) 2 (1.4%)
White 2 (50.0%) 7 (87.5%) 13 (14.4%) 0 (0%) 28 (71.8%) 50 (34.7%)
Asian 0 (0%) 0 (0%) 3 (3.3%) 0 (0%) 2 (5.1%) 5 (3.5%)
Hispanic or Latino/a 0 (0%) 0 (0%) 1 (1.1%) 0 (0%) 0 (0%) 1 (0.7%)
Native Hawaiian or Other Pacific Islander 0 (0%) 0 (0%) 0 (0%) 1 (33.3%) 0 (0%) 1 (0.7%)
Black or African American 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3 (7.7%) 3 (2.1%)
Mixed Race 0 (0%) 0 (0%) 0 (0%) 0 (0%) 1 (2.6%) 1 (0.7%)
Missing 1 (25.0%) 1 (12.5%) 73 (81.1%) 2 (66.7%) 4 (10.3%) 81 (56.3%)

Super nice way to quickly get an overview of the data!

Lab 5 Exercises

  • T4: Create a new version of the meta_data, which with respect to allele-data only contains information on class I and also fix the odd naming, e.g. HLA-A...9 becomes A1 oand HLA-A...10 becomes A2 and so on for B1, B2, C1 and C2
meta_data <- meta_data |>
  select(-matches(match = "D[PQR]")) |> # Regex: D and-then-any-of P or Q or R
  rename("A1" = `HLA-A...9`,
         "A2" = `HLA-A...10`,
         "B1" = `HLA-B...11`,
         "B2" = `HLA-B...12`,
         "C1" = `HLA-C...13`,
         "C2" = `HLA-C...14`)

Lab 5 Exercises

The Peptide Details Data

  • Q3: How many observations of how many variables are in the data?
peptide_data |> 
  dim()
[1] 154320      7

Lab 5 Exercises

  • T5: As before, let’s immediately subset the peptide_data to the variables of interest: TCR BioIdentity, Experiment and Amino Acids
peptide_data <- peptide_data |>
  select(Experiment,
         `TCR BioIdentity`,
         `Amino Acids`)

Lab 5 Exercises

  • Q4: Is this tidy data? Why/why not?
peptide_data
# A tibble: 154,320 × 3
   Experiment `TCR BioIdentity`                        `Amino Acids`            
   <chr>      <chr>                                    <chr>                    
 1 eAV93      CASSAQGTGDRGYTF+TCRBV27-01+TCRBJ01-02    ADAGFIKQY,AELEGIQY,LADAG…
 2 eOX56      CASSLVATGNTGELFF+TCRBV07-09+TCRBJ02-02   ADAGFIKQY,AELEGIQY,LADAG…
 3 eAV93      CASSKGTVSGLSG+TCRBV21-01+TCRBJ02-07      ADAGFIKQY,AELEGIQY,LADAG…
 4 eQD124     CALKVGADTQYF+TCRBV30-01+TCRBJ02-03       ADAGFIKQY,AELEGIQY,LADAG…
 5 eAV93      CASSLWASGRGGTGELFF+TCRBV27-01+TCRBJ02-02 ADAGFIKQY,AELEGIQY,LADAG…
 6 eMR16      CASSLLGWEQLDEQFF+TCRBV27-01+TCRBJ02-01   ADAGFIKQY,AELEGIQY,LADAG…
 7 eAV93      CASSSGTGVYGYTF+TCRBV12-X+TCRBJ01-02      ADAGFIKQY,AELEGIQY,LADAG…
 8 eQD123     CASSPLEWEGVTEAFF+TCRBV06-X+TCRBJ01-01    ADAGFIKQY,AELEGIQY,LADAG…
 9 eAV93      CASSFWSSGRGGTDTQYF+TCRBV27-01+TCRBJ02-03 ADAGFIKQY,AELEGIQY,LADAG…
10 eMR15      CASSAGQGASDEQFF+TCRBV07-09+TCRBJ02-01    ADAGFIKQY,AELEGIQY,LADAG…
# ℹ 154,310 more rows

Lab 5 Exercises

  • T6: See if you can find a way to create the below data, from the above
peptide_data <- peptide_data |> 
  separate(col = `TCR BioIdentity`,
           into = c("CDR3b", "V_gene", "J_gene"),
           sep = "\\+") # Note the escape
peptide_data
# A tibble: 154,320 × 5
   Experiment CDR3b              V_gene     J_gene     `Amino Acids`            
   <chr>      <chr>              <chr>      <chr>      <chr>                    
 1 eAV93      CASSAQGTGDRGYTF    TCRBV27-01 TCRBJ01-02 ADAGFIKQY,AELEGIQY,LADAG…
 2 eOX56      CASSLVATGNTGELFF   TCRBV07-09 TCRBJ02-02 ADAGFIKQY,AELEGIQY,LADAG…
 3 eAV93      CASSKGTVSGLSG      TCRBV21-01 TCRBJ02-07 ADAGFIKQY,AELEGIQY,LADAG…
 4 eQD124     CALKVGADTQYF       TCRBV30-01 TCRBJ02-03 ADAGFIKQY,AELEGIQY,LADAG…
 5 eAV93      CASSLWASGRGGTGELFF TCRBV27-01 TCRBJ02-02 ADAGFIKQY,AELEGIQY,LADAG…
 6 eMR16      CASSLLGWEQLDEQFF   TCRBV27-01 TCRBJ02-01 ADAGFIKQY,AELEGIQY,LADAG…
 7 eAV93      CASSSGTGVYGYTF     TCRBV12-X  TCRBJ01-02 ADAGFIKQY,AELEGIQY,LADAG…
 8 eQD123     CASSPLEWEGVTEAFF   TCRBV06-X  TCRBJ01-01 ADAGFIKQY,AELEGIQY,LADAG…
 9 eAV93      CASSFWSSGRGGTDTQYF TCRBV27-01 TCRBJ02-03 ADAGFIKQY,AELEGIQY,LADAG…
10 eMR15      CASSAGQGASDEQFF    TCRBV07-09 TCRBJ02-01 ADAGFIKQY,AELEGIQY,LADAG…
# ℹ 154,310 more rows

Lab 5 Exercises

  • T7: Add a variable, which counts how many peptides are in each observation of Amino Acids
peptide_data <- peptide_data |> 
  mutate(n_peptides = str_count(`Amino Acids`,
                                pattern = ",") + 1)
peptide_data
# A tibble: 154,320 × 6
   Experiment CDR3b              V_gene     J_gene     `Amino Acids`  n_peptides
   <chr>      <chr>              <chr>      <chr>      <chr>               <dbl>
 1 eAV93      CASSAQGTGDRGYTF    TCRBV27-01 TCRBJ01-02 ADAGFIKQY,AEL…          4
 2 eOX56      CASSLVATGNTGELFF   TCRBV07-09 TCRBJ02-02 ADAGFIKQY,AEL…          4
 3 eAV93      CASSKGTVSGLSG      TCRBV21-01 TCRBJ02-07 ADAGFIKQY,AEL…          4
 4 eQD124     CALKVGADTQYF       TCRBV30-01 TCRBJ02-03 ADAGFIKQY,AEL…          4
 5 eAV93      CASSLWASGRGGTGELFF TCRBV27-01 TCRBJ02-02 ADAGFIKQY,AEL…          4
 6 eMR16      CASSLLGWEQLDEQFF   TCRBV27-01 TCRBJ02-01 ADAGFIKQY,AEL…          4
 7 eAV93      CASSSGTGVYGYTF     TCRBV12-X  TCRBJ01-02 ADAGFIKQY,AEL…          4
 8 eQD123     CASSPLEWEGVTEAFF   TCRBV06-X  TCRBJ01-01 ADAGFIKQY,AEL…          4
 9 eAV93      CASSFWSSGRGGTDTQYF TCRBV27-01 TCRBJ02-03 ADAGFIKQY,AEL…          4
10 eMR15      CASSAGQGASDEQFF    TCRBV07-09 TCRBJ02-01 ADAGFIKQY,AEL…          4
# ℹ 154,310 more rows

Lab 5 Exercises

  • T8: Re-create the following plot
pl <- peptide_data |> 
  ggplot(aes(x = n_peptides)) +
  geom_histogram(binwidth = 1,
                 colour = "black",
                 alpha = 0.5) + 
  geom_hline(yintercept = 0) +
  scale_x_continuous(breaks = 1:13) +
  theme_minimal(base_size = 20) +
  theme(panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank(),
        axis.text.x = element_text(vjust = 5)) +
  labs(x = "Number of peptides per observation",
       y = "Counts")

Lab 5 Exercises

  • Q4: What is the maximum number of peptides assigned to one observation?
max_n_peptides <- peptide_data |>
  pull(n_peptides) |>
  max()
max_n_peptides
[1] 13

Lab 5 Exercises

  • T9: Using the str_c- and the seq-functions, re-create the below
str_c("peptide_", seq(from = 1, to = 5, by = 1)) # 1:5
[1] "peptide_1" "peptide_2" "peptide_3" "peptide_4" "peptide_5"

Lab 5 Exercises

  • T10: Use, what you learned about separating in T6 and the vector-of-strings you created in T9 adjusted to the number from Q4 to create the below data
peptide_data <- peptide_data |> 
  separate(col = `Amino Acids`,
           into = str_c("peptide_", 1:max_n_peptides),
           sep = ",")
peptide_data |> 
  sample_n(10)
# A tibble: 10 × 18
   Experiment CDR3b        V_gene J_gene peptide_1 peptide_2 peptide_3 peptide_4
   <chr>      <chr>        <chr>  <chr>  <chr>     <chr>     <chr>     <chr>    
 1 eXL31      CASSEWGDAYE… TCRBV… TCRBJ… FLWLLWPVT FLWLLWPV… LWLLWPVTL LWPVTLACF
 2 eJL151     CSVVGRSSYEQ… TCRBV… TCRBJ… AGAAAYYV… GAAAYYVGY WTAGAAAYY WTAGAAAY…
 3 eEE224     CASSLVGRTLY… TCRBV… TCRBJ… FVDGVPFVV <NA>      <NA>      <NA>     
 4 eOX52      CASSLGAWGQP… TCRBV… TCRBJ… GMEVTPSG… MEVTPSGT… TPSGTWLTY VTPSGTWL…
 5 eMR12      CASSLTPPREE… TCRBV… TCRBJ… HTTDPSFL… <NA>      <NA>      <NA>     
 6 eOX52      CASSEWLPPYE… TCRBV… TCRBJ… NYLYRLFRK NYNYLYRLF <NA>      <NA>     
 7 eXL30      CASAPGAQGGF… TCRBV… TCRBJ… FPNITNLC… QPTESIVRF RFPNITNL… TESIVRFP…
 8 eHO128     CASSLSGEETQ… TCRBV… TCRBJ… HTTDPSFL… <NA>      <NA>      <NA>     
 9 eMR16      CATGGGGRETQ… TCRBV… TCRBJ… AFPFTIYSL GYINVFAF… INVFAFPF… MGYINVFAF
10 eXL32      CASSRDRGRLE… TCRBV… TCRBJ… AFLLFLVLI FLAFLLFLV FYLCFLAFL FYLCFLAF…
# ℹ 10 more variables: peptide_5 <chr>, peptide_6 <chr>, peptide_7 <chr>,
#   peptide_8 <chr>, peptide_9 <chr>, peptide_10 <chr>, peptide_11 <chr>,
#   peptide_12 <chr>, peptide_13 <chr>, n_peptides <dbl>

Lab 5 Exercises

  • Q5: Now, presumable you got a warning, discuss in your group why that is?

Not all have 13, so R lets you know this! Thank you R 🙏

Lab 5 Exercises

  • Q6: With respect to peptide_n, discuss in your group, if this is wide- or long-data?
peptide_data
# A tibble: 154,320 × 18
   Experiment CDR3b        V_gene J_gene peptide_1 peptide_2 peptide_3 peptide_4
   <chr>      <chr>        <chr>  <chr>  <chr>     <chr>     <chr>     <chr>    
 1 eAV93      CASSAQGTGDR… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
 2 eOX56      CASSLVATGNT… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
 3 eAV93      CASSKGTVSGL… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
 4 eQD124     CALKVGADTQYF TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
 5 eAV93      CASSLWASGRG… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
 6 eMR16      CASSLLGWEQL… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
 7 eAV93      CASSSGTGVYG… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
 8 eQD123     CASSPLEWEGV… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
 9 eAV93      CASSFWSSGRG… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
10 eMR15      CASSAGQGASD… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
# ℹ 154,310 more rows
# ℹ 10 more variables: peptide_5 <chr>, peptide_6 <chr>, peptide_7 <chr>,
#   peptide_8 <chr>, peptide_9 <chr>, peptide_10 <chr>, peptide_11 <chr>,
#   peptide_12 <chr>, peptide_13 <chr>, n_peptides <dbl>

Lab 5 Exercises

  • T12: From the peptide_data data above, with peptide_1, peptide_2, etc. create this data set using one of the data-pivotting functions
peptide_data <- peptide_data |> 
  pivot_longer(cols = contains("peptide_"),
               names_to = "peptide_n",
               values_to = "peptide")
peptide_data
# A tibble: 2,006,160 × 7
   Experiment CDR3b           V_gene     J_gene     n_peptides peptide_n peptide
   <chr>      <chr>           <chr>      <chr>           <dbl> <chr>     <chr>  
 1 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_1 ADAGFI…
 2 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_2 AELEGI…
 3 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_3 LADAGF…
 4 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_4 TLADAG…
 5 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_5 <NA>   
 6 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_6 <NA>   
 7 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_7 <NA>   
 8 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_8 <NA>   
 9 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_9 <NA>   
10 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_… <NA>   
# ℹ 2,006,150 more rows

Lab 5 Exercises

  • T13: Now, loose the redundant variables n_peptides and peptide_n and also get rid of the NAs in the peptide-column and make sure, that we only have unique observations, i.e. there are no repeated rows/observations
peptide_data <- peptide_data |> 
  select(-n_peptides, -peptide_n) |> 
  drop_na(peptide) |> 
  distinct()
peptide_data
# A tibble: 583,309 × 5
   Experiment CDR3b            V_gene     J_gene     peptide   
   <chr>      <chr>            <chr>      <chr>      <chr>     
 1 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 ADAGFIKQY 
 2 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 AELEGIQY  
 3 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 LADAGFIKQY
 4 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 TLADAGFIK 
 5 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY 
 6 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 AELEGIQY  
 7 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 LADAGFIKQY
 8 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 TLADAGFIK 
 9 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 ADAGFIKQY 
10 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 AELEGIQY  
# ℹ 583,299 more rows

Lab 5 Exercises

  • Q8: Now how many rows and columns and is this data tidy? Discuss in your group why/why not?

Super duper tidy!

  • Each row is an observation
  • Each column is a variable
  • Each cell is a value

Lab 5 Exercises

  • T14: Use the str_detect()-function to filter the CDR3b and peptide variables using a pattern of [^ARNDCQEGHILKMFPSTWYV] and then play with the negate-parameter so see what happens
peptide_data <- peptide_data |> 
  filter(str_detect(CDR3b, pattern = "[^ARNDCQEGHILKMFPSTWYV]", negate = TRUE),
         str_detect(peptide, pattern = "[^ARNDCQEGHILKMFPSTWYV]", negate = TRUE))
peptide_data
# A tibble: 578,357 × 5
   Experiment CDR3b            V_gene     J_gene     peptide   
   <chr>      <chr>            <chr>      <chr>      <chr>     
 1 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 ADAGFIKQY 
 2 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 AELEGIQY  
 3 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 LADAGFIKQY
 4 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 TLADAGFIK 
 5 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY 
 6 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 AELEGIQY  
 7 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 LADAGFIKQY
 8 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 TLADAGFIK 
 9 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 ADAGFIKQY 
10 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 AELEGIQY  
# ℹ 578,347 more rows

Lab 5 Exercises

  • T14: Use the str_detect()-function to filter the CDR3b and peptide variables
clean_amino_acids <- function(x){
  # str_detect(x, pattern = "^[ARNDCQEGHILKMFPSTWYV]+$") will also work
  return( str_detect(x, pattern = "[^ARNDCQEGHILKMFPSTWYV]", negate = TRUE) )
}
peptide_data <- peptide_data |> 
  filter(clean_amino_acids(CDR3b),
         clean_amino_acids(peptide))
peptide_data
# A tibble: 578,357 × 5
   Experiment CDR3b            V_gene     J_gene     peptide   
   <chr>      <chr>            <chr>      <chr>      <chr>     
 1 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 ADAGFIKQY 
 2 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 AELEGIQY  
 3 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 LADAGFIKQY
 4 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 TLADAGFIK 
 5 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY 
 6 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 AELEGIQY  
 7 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 LADAGFIKQY
 8 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 TLADAGFIK 
 9 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 ADAGFIKQY 
10 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 AELEGIQY  
# ℹ 578,347 more rows

DRY

Lab 5 Exercises

  • T15: Add two new variables to the data, k_CDR3b and k_peptide each signifying the length of the respective sequences
peptide_data <- peptide_data |> 
  mutate(k_CDR3b = str_length(CDR3b),
         k_peptide = str_length(peptide))
peptide_data
# A tibble: 578,357 × 7
   Experiment CDR3b            V_gene     J_gene     peptide   k_CDR3b k_peptide
   <chr>      <chr>            <chr>      <chr>      <chr>       <int>     <int>
 1 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 ADAGFIKQY      15         9
 2 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 AELEGIQY       15         8
 3 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 LADAGFIK…      15        10
 4 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 TLADAGFIK      15         9
 5 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY      16         9
 6 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 AELEGIQY       16         8
 7 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 LADAGFIK…      16        10
 8 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 TLADAGFIK      16         9
 9 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 ADAGFIKQY      13         9
10 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 AELEGIQY       13         8
# ℹ 578,347 more rows

Lab 5 Exercises

  • T16: Re-create this plot
pl <- peptide_data |> 
  ggplot(aes(x = k_CDR3b)) +
  geom_histogram(binwidth = 1,
                 colour = "black",
                 alpha = 0.5) +
  geom_hline(yintercept = 0) +
  theme_minimal(base_size = 20)

Lab 5 Exercises

  • T17: Re-create this plot
pl <- peptide_data |> 
  ggplot(aes(x = k_peptide)) +
  geom_histogram(binwidth = 1,
                 colour = "black",
                 alpha = 0.5) +
  geom_hline(yintercept = 0) +
  theme_minimal(base_size = 20)

Lab 5 Exercises

  • Q9: What is the most predominant length of the CDR3b-sequences?

  • Q10: What is the most predominant length of the peptide-sequences?

  • Q11: Discuss in your group, if this data set is tidy or not?

Lab 5 Exercises

Creating one data set from two data sets

Before we move onto using the family of *_join-functions you prepared for today, we will just take a quick peek at the meta data again:

meta_data |> 
  sample_n(10)
# A tibble: 10 × 11
   Experiment Cohort        Age Gender Race  A1    A2    B1    B2    C1    C2   
   <chr>      <chr>       <dbl> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1 eLH48      COVID-19-C…    28 M      White "A*0… "A*2… "B*0… "B*1… "C*0… "C*0…
 2 eJL147     Healthy (N…    40 M      Mixe… "A*0… "A*1… "B*0… "B*3… "C*0… "C*0…
 3 eAV91      Healthy (N…    31 M      White "A*0… "A*6… "B*0… "B*5… "C*0… "C*0…
 4 eQD115     COVID-19-C…    48 M      <NA>  "A*0… "A*0… "B*0… "B*4… "C*0… "C*0…
 5 eMR12      COVID-19-C…    NA <NA>   <NA>  "A*0… "A*0… "B*4… "B*5… "C*0… "C*1…
 6 eDH107     COVID-19-C…    72 F      <NA>  "A*0… "A*0… "B*1… "B*3… "C*0… "C*0…
 7 eHH169     Healthy (N…    24 F      Blac… "A*0… "A*7… "B*3… "B*3… "C*0… "C*0…
 8 eNL192     COVID-19-C…    NA <NA>   <NA>  ""    ""    ""    ""    ""    ""   
 9 eHO130     Healthy (N…    28 F      White "A*0… "A*0… "B*0… "B*0… "C*0… "C*0…
10 eMR20      COVID-19-B…    37 M      White "A*0… "A*2… "B*1… "B*1… "C*0… "C*0…
  • Q12: Discuss in your group, if this data with respect to the A1-, A2-, B1-, B2-, C1- and C2-variables is a wide- or a long-data format?

Lab 5 Exercises

  • T18: use either the pivot_wider- or pivot_longer-function to create the following data:
meta_data <- meta_data |> 
  pivot_longer(cols = c("A1", "A2", "B1", "B2", "C1", "C2"),
               names_to = "Gene",
               values_to = "Allele")
meta_data
# A tibble: 864 × 7
   Experiment Cohort                  Age Gender Race  Gene  Allele    
   <chr>      <chr>                 <dbl> <chr>  <chr> <chr> <chr>     
 1 eAM13      COVID-19-Convalescent    34 F      White A1    A*02:01:01
 2 eAM13      COVID-19-Convalescent    34 F      White A2    A*02:05:01
 3 eAM13      COVID-19-Convalescent    34 F      White B1    B*08:01:01
 4 eAM13      COVID-19-Convalescent    34 F      White B2    B*41:01:01
 5 eAM13      COVID-19-Convalescent    34 F      White C1    C*07:01:01
 6 eAM13      COVID-19-Convalescent    34 F      White C2    C*07:01:01
 7 eAM23      COVID-19-Convalescent    48 M      <NA>  A1    A*11:01:01
 8 eAM23      COVID-19-Convalescent    48 M      <NA>  A2    A*24:02:01
 9 eAM23      COVID-19-Convalescent    48 M      <NA>  B1    B*15:01:01
10 eAM23      COVID-19-Convalescent    48 M      <NA>  B2    B*52:01:01
# ℹ 854 more rows

Lab 5 Exercises

  • Q13: Discuss in your group, which variable(s?) define the same observations between the peptide_data and the meta_data?
# A tibble: 864 × 7
   Experiment Cohort                  Age Gender Race  Gene  Allele    
   <chr>      <chr>                 <dbl> <chr>  <chr> <chr> <chr>     
 1 eAM13      COVID-19-Convalescent    34 F      White A1    A*02:01:01
 2 eAM13      COVID-19-Convalescent    34 F      White A2    A*02:05:01
 3 eAM13      COVID-19-Convalescent    34 F      White B1    B*08:01:01
 4 eAM13      COVID-19-Convalescent    34 F      White B2    B*41:01:01
 5 eAM13      COVID-19-Convalescent    34 F      White C1    C*07:01:01
 6 eAM13      COVID-19-Convalescent    34 F      White C2    C*07:01:01
 7 eAM23      COVID-19-Convalescent    48 M      <NA>  A1    A*11:01:01
 8 eAM23      COVID-19-Convalescent    48 M      <NA>  A2    A*24:02:01
 9 eAM23      COVID-19-Convalescent    48 M      <NA>  B1    B*15:01:01
10 eAM23      COVID-19-Convalescent    48 M      <NA>  B2    B*52:01:01
# ℹ 854 more rows
# A tibble: 578,357 × 7
   Experiment CDR3b            V_gene     J_gene     peptide   k_CDR3b k_peptide
   <chr>      <chr>            <chr>      <chr>      <chr>       <int>     <int>
 1 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 ADAGFIKQY      15         9
 2 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 AELEGIQY       15         8
 3 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 LADAGFIK…      15        10
 4 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 TLADAGFIK      15         9
 5 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY      16         9
 6 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 AELEGIQY       16         8
 7 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 LADAGFIK…      16        10
 8 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 TLADAGFIK      16         9
 9 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 ADAGFIKQY      13         9
10 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 AELEGIQY       13         8
# ℹ 578,347 more rows
intersect(colnames(meta_data), colnames(peptide_data))
[1] "Experiment"

Lab 5 Exercises

  • Once you have agreed upon Experiment, then use that knowledge to subset the meta_data to the variables-of-interest:
meta_data <- meta_data |> 
  select(Experiment, Allele) |> 
  distinct()
meta_data
# A tibble: 761 × 2
   Experiment Allele    
   <chr>      <chr>     
 1 eAM13      A*02:01:01
 2 eAM13      A*02:05:01
 3 eAM13      B*08:01:01
 4 eAM13      B*41:01:01
 5 eAM13      C*07:01:01
 6 eAM23      A*11:01:01
 7 eAM23      A*24:02:01
 8 eAM23      B*15:01:01
 9 eAM23      B*52:01:01
10 eAM23      C*04:01:01
# ℹ 751 more rows

Here, we are interested in capturing the Allele-information

Lab 5 Exercises

  • Some alleles are e.g. A*11:01, whereas others are B*51:01:02, let’s snatch the first two fields:

  • T19: Create the following data, according to specifications above:

meta_data <- meta_data |>
  mutate(Allele_F_1_2 = str_extract(Allele,
                                    # Regex: You're NOT expected to know this!
                                    pattern = "^[ABC]\\*\\d+\\:\\d+")) |>
  filter(Allele_F_1_2 != "") |> 
  drop_na()
meta_data
# A tibble: 748 × 3
   Experiment Allele     Allele_F_1_2
   <chr>      <chr>      <chr>       
 1 eAM13      A*02:01:01 A*02:01     
 2 eAM13      A*02:05:01 A*02:05     
 3 eAM13      B*08:01:01 B*08:01     
 4 eAM13      B*41:01:01 B*41:01     
 5 eAM13      C*07:01:01 C*07:01     
 6 eAM23      A*11:01:01 A*11:01     
 7 eAM23      A*24:02:01 A*24:02     
 8 eAM23      B*15:01:01 B*15:01     
 9 eAM23      B*52:01:01 B*52:01     
10 eAM23      C*04:01:01 C*04:01     
# ℹ 738 more rows

Lab 5 Exercises

  • Some alleles are e.g. A*11:01, whereas others are B*51:01:02, let’s snatch the first two fields:

  • T19: Create the following data, according to specifications above:

meta_data <- meta_data |>
  # This will work as well, but think about caveats!
  mutate(Allele_F_1_2 = str_sub(Allele, start = 1, end = 7)) |>
  filter(Allele_F_1_2 != "") |> 
  drop_na()
meta_data
# A tibble: 748 × 3
   Experiment Allele     Allele_F_1_2
   <chr>      <chr>      <chr>       
 1 eAM13      A*02:01:01 A*02:01     
 2 eAM13      A*02:05:01 A*02:05     
 3 eAM13      B*08:01:01 B*08:01     
 4 eAM13      B*41:01:01 B*41:01     
 5 eAM13      C*07:01:01 C*07:01     
 6 eAM23      A*11:01:01 A*11:01     
 7 eAM23      A*24:02:01 A*24:02     
 8 eAM23      B*15:01:01 B*15:01     
 9 eAM23      B*52:01:01 B*52:01     
10 eAM23      C*04:01:01 C*04:01     
# ℹ 738 more rows

Lab 5 Exercises

The asterix, i.e. * is a rather annoying character because of ambiguity, so:

  • T20: Clean the data a bit more, by removing the asterix and redundant variables:
meta_data <- meta_data |> 
  mutate(Allele_F_1_2 = str_remove(Allele_F_1_2, "\\*")) |> 
  select(-Allele) |>
  rename(Allele = Allele_F_1_2) |> 
  distinct()
meta_data
# A tibble: 748 × 2
   Experiment Allele
   <chr>      <chr> 
 1 eAM13      A02:01
 2 eAM13      A02:05
 3 eAM13      B08:01
 4 eAM13      B41:01
 5 eAM13      C07:01
 6 eAM23      A11:01
 7 eAM23      A24:02
 8 eAM23      B15:01
 9 eAM23      B52:01
10 eAM23      C04:01
# ℹ 738 more rows

Lab 5 Exercises

peptide_data
# A tibble: 578,357 × 7
   Experiment CDR3b            V_gene     J_gene     peptide   k_CDR3b k_peptide
   <chr>      <chr>            <chr>      <chr>      <chr>       <int>     <int>
 1 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 ADAGFIKQY      15         9
 2 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 AELEGIQY       15         8
 3 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 LADAGFIK…      15        10
 4 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 TLADAGFIK      15         9
 5 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY      16         9
 6 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 AELEGIQY       16         8
 7 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 LADAGFIK…      16        10
 8 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 TLADAGFIK      16         9
 9 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 ADAGFIKQY      13         9
10 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 AELEGIQY       13         8
# ℹ 578,347 more rows
meta_data
# A tibble: 748 × 2
   Experiment Allele
   <chr>      <chr> 
 1 eAM13      A02:01
 2 eAM13      A02:05
 3 eAM13      B08:01
 4 eAM13      B41:01
 5 eAM13      C07:01
 6 eAM23      A11:01
 7 eAM23      A24:02
 8 eAM23      B15:01
 9 eAM23      B52:01
10 eAM23      C04:01
# ℹ 738 more rows

Lab 5 Exercises

  • T21: Create a dplyr-pipeline, starting with the peptide_data, which joins it with the meta_data and remember to make sure that you get only unqiue observations of rows. Save this data into a new variable names peptide_meta_data (If you get a warning, discuss in your group what it means?)
peptide_meta_data <- peptide_data |> 
  full_join(meta_data,
            by = "Experiment",
            relationship = "many-to-many") |> 
  distinct()
peptide_meta_data |>
  sample_n(10)
# A tibble: 10 × 8
   Experiment CDR3b             V_gene   J_gene peptide k_CDR3b k_peptide Allele
   <chr>      <chr>             <chr>    <chr>  <chr>     <int>     <int> <chr> 
 1 eXL30      CASSLEASAYNEQFF   TCRBV11… TCRBJ… LLFLVL…      15         9 B35:02
 2 eXL31      CSVVPAAGGGRYNEQFF TCRBV29… TCRBJ… WPVTLA…      17        10 A29:02
 3 eOX54      CASSQGFTSTDTQYF   TCRBV03… TCRBJ… LITGRL…      15         9 B15:03
 4 eXL27      CASSFRSSGMNEQFF   TCRBV28… TCRBJ… FLWLLW…      15         9 C07:04
 5 eEE240     CASSLTSGMTYEQYF   TCRBV05… TCRBJ… SLIDFY…      15        10 C06:02
 6 eXL31      CASSLWGNSGELFF    TCRBV04… TCRBJ… FYLCFL…      14         9 A29:02
 7 eJL147     CASWGEGSYNEQFF    TCRBV06… TCRBJ… SASAFF…      14        10 A11:01
 8 eOX54      CASSLHGSGNTIYF    TCRBV28… TCRBJ… LWPVTL…      14         9 A02:01
 9 eXL31      CASSPSPGLAVQETQYF TCRBV04… TCRBJ… SLIDFY…      17        10 B44:03
10 eEE226     CASSSQPIQLYEQYF   TCRBV27… TCRBJ… MIELSL…      15        10 B35:02

Lab 5 Exercises

Analysis

Now, that we have the data in a prepared and ready-to-analyse format, let us return to the two burning questions we had:

  1. What characterises the peptides binding to the HLAs?
  2. What characterises T-cell Receptors binding to the pMHC-complexes?

Lab 5 Exercises

Peptides binding to HLA

  • T22: Subset the final peptide_meta_data-data to A02:01 and unique observations of peptides of length 9 and re-create the below sequence logo
library("ggseqlogo")
pl <- peptide_meta_data |>
  filter(Allele == "A02:01",
         str_length(peptide) == 9) |>
  select(peptide) |>
  distinct() |>
  pull(peptide) |>
  ggseqlogo()

Lab 5 Exercises

Peptides binding to HLA - Rinse-and-Repeat!

  • T23: Repeat for e.g. B07:02 or another of your favorite alleles
pl <- peptide_meta_data |>
  filter(Allele == "B07:02",
         str_length(peptide) == 9) |>
  select(peptide) |>
  distinct() |>
  pull(peptide) |>
  ggseqlogo()

The reason, that the logo is actually not correct, is that we’re missing something…

  • Q14: Which positions in the peptide determines binding to HLA?

Lab 5 Exercises

CDR3b-sequences binding to pMHC

  • T24: Subset the peptide_meta_data, such that the length of the CDR3b is 15, the allele is A02:01 and the peptide is LLFLVLIML and re-create the below sequence logo of the CDR3b sequences:
pl <- peptide_meta_data |>
  filter(k_CDR3b == 15,
         Allele == "A02:01",
         peptide == "LLFLVLIML") |> 
  pull(CDR3b) |> 
  ggseqlogo()

Epilogue

  • I know this overwhelming now, but commit to it and you WILL be plenty rewarded!
  • I hope this was at least a glimpse into the flexibility and capabilities of using tidyverse for applied Bio Data Science
  • …also, noticed something?
  • We spend maybe 80% of the time here on dealing with data-wrangling and then once we’re good to go, the analysis wasn’t that time consuming!

That’s often the way it ends up going, you’ll spend a lot of time on data handling and getting the tidyverse toolbox in your toolbelt, will allow you to be so much more effecient in your data wrangling, so you can get to the fun part as quick as possible!

Secret slide: How to succeed in this course!

  1. Prepare assigned materials for class
  2. Show up for class
  3. Pay attention during lectures
  4. Do the group exercises
  5. Do the assignments
  6. Use the assignment feedback
  7. …and \(\rightarrow\)