Recap Lab 6

Leon Eyrich Jessen

Prologue: Process vs. Product

This course is all about the process
Bio Data Science is the process by which you create analysis results (products)
Therefore, yes you can do, e.g.:

my_data$new_var <- some_calculation()
my_data[my_data == "N/A"] <- NA

And it will create the right product, but in the context of this course, it is wrong, it is the wrong process!
For your project, regardless of your product, the use of the wrong process, will get you points deducted, so you might as well get used to it!
This course is on tidyverse R, not base R (different dialect)

I understand you are used to thinking in products, if helpful consider in this course, that the process IS the product!

Joins

X <- tibble(
  id = c("blue_1", "blue_2", "red_3", "red_4"),
  var_x = c(27, 42, 33, 64))
X

# A tibble: 4 × 2
  id     var_x
  <chr>  <dbl>
1 blue_1    27
2 blue_2    42
3 red_3     33
4 red_4     64

Y <- tibble(
  id = c("blue_1", "blue_2", "green_5", "green_6"),
  var_x = c(63, 17, 8, 95))
Y

# A tibble: 4 × 2
  id      var_x
  <chr>   <dbl>
1 blue_1     63
2 blue_2     17
3 green_5     8
4 green_6    95

Joins

`full_join()`

full_join(X, Y, by = "id")

# A tibble: 6 × 3
  id      var_x.x var_x.y
  <chr>     <dbl>   <dbl>
1 blue_1       27      63
2 blue_2       42      17
3 red_3        33      NA
4 red_4        64      NA
5 green_5      NA       8
6 green_6      NA      95

Joins

`inner_join()`

inner_join(X, Y, by = "id")

# A tibble: 2 × 3
  id     var_x.x var_x.y
  <chr>    <dbl>   <dbl>
1 blue_1      27      63
2 blue_2      42      17

Joins

`left_join()`

left_join(X, Y, by = "id")

# A tibble: 4 × 3
  id     var_x.x var_x.y
  <chr>    <dbl>   <dbl>
1 blue_1      27      63
2 blue_2      42      17
3 red_3       33      NA
4 red_4       64      NA

Joins

`right_join()`

right_join(X, Y, by = "id")

# A tibble: 4 × 3
  id      var_x.x var_x.y
  <chr>     <dbl>   <dbl>
1 blue_1       27      63
2 blue_2       42      17
3 green_5      NA       8
4 green_6      NA      95

Joins

`anti_join()`

anti_join(X, Y, by = "id")

# A tibble: 2 × 2
  id    var_x
  <chr> <dbl>
1 red_3    33
2 red_4    64

Pivotting

my_data <- tibble(
  id = c(1, 2),
  X = c("a", "d"),
  Y = c("b", "e"),
  Z = c("c", "f"))
my_data

# A tibble: 2 × 4
     id X     Y     Z    
  <dbl> <chr> <chr> <chr>
1     1 a     b     c    
2     2 d     e     f

Pivotting

my_data_long <- my_data |> 
  pivot_longer(
    cols = c(X, Y, Z),    # Equiv.: -id
    names_to = "names",   # Quoted: New variable
    values_to = "values") # Quoted: New variable
my_data_long

# A tibble: 6 × 3
     id names values
  <dbl> <chr> <chr> 
1     1 X     a     
2     1 Y     b     
3     1 Z     c     
4     2 X     d     
5     2 Y     e     
6     2 Z     f

Pivotting

my_data_wide <- my_data_long |> 
  pivot_wider(
    id_cols = id,
    names_from = names,   # Un-quoted: Existing variable
    values_from = values) # Un-quoted: Existing variable
my_data_wide

# A tibble: 2 × 4
     id X     Y     Z    
  <dbl> <chr> <chr> <chr>
1     1 a     b     c    
2     2 d     e     f

Lab 5 Exercises

Load libraries

library("tidyverse")

Lab 5 Exercises

Load `meta` data

meta_data <- read_csv(file = "data/subject-metadata.csv")
meta_data

# A tibble: 144 × 30
   Experiment Subject `Cell Type` `Target Type` Cohort        Age   Gender Race 
   <chr>        <dbl> <chr>       <chr>         <chr>         <chr> <chr>  <chr>
 1 eAM13          844 PBMC        C19_cI        COVID-19-Con… 34    F      White
 2 eAM23         5422 PBMC        C19_cI        COVID-19-Con… 48    M      N/A  
 3 eAV100        1995 PBMC        C19_cII       COVID-19-Con… 29    F      N/A  
 4 eAV105        1995 PBMC        C19_cII       COVID-19-Con… 29    F      N/A  
 5 eAV88        19830 naive_CD8   C19_cI        Healthy (No … 24    M      White
 6 eAV91        19855 naive_CD8   C19_cI        Healthy (No … 31    M      White
 7 eAV93        20655 naive_CD8   C19_cI        Healthy (No … 41    M      White
 8 eDH105        3533 PBMC        C19_cI        COVID-19-Con… 32    F      N/A  
 9 eDH107        1684 PBMC        C19_cI        COVID-19-Con… 72    F      N/A  
10 eDH113     1327414 PBMC        C19_cI        Healthy (No … 56    N/A    N/A  
# ℹ 134 more rows
# ℹ 22 more variables: `HLA-A...9` <chr>, `HLA-A...10` <chr>,
#   `HLA-B...11` <chr>, `HLA-B...12` <chr>, `HLA-C...13` <chr>,
#   `HLA-C...14` <chr>, DPA1...15 <chr>, DPA1...16 <chr>, DPB1...17 <chr>,
#   DPB1...18 <chr>, DQA1...19 <chr>, DQA1...20 <chr>, DQB1...21 <chr>,
#   DQB1...22 <chr>, DRB1...23 <chr>, DRB1...24 <chr>, DRB3...25 <chr>,
#   DRB3...26 <chr>, DRB4...27 <chr>, DRB4...28 <chr>, DRB5...29 <chr>, …

Lab 5 Exercises

Load `meta` data

meta_data <- read_csv(file = "data/subject-metadata.csv",
                      na = "N/A")
meta_data

# A tibble: 144 × 30
   Experiment Subject `Cell Type` `Target Type` Cohort          Age Gender Race 
   <chr>        <dbl> <chr>       <chr>         <chr>         <dbl> <chr>  <chr>
 1 eAM13          844 PBMC        C19_cI        COVID-19-Con…    34 F      White
 2 eAM23         5422 PBMC        C19_cI        COVID-19-Con…    48 M      <NA> 
 3 eAV100        1995 PBMC        C19_cII       COVID-19-Con…    29 F      <NA> 
 4 eAV105        1995 PBMC        C19_cII       COVID-19-Con…    29 F      <NA> 
 5 eAV88        19830 naive_CD8   C19_cI        Healthy (No …    24 M      White
 6 eAV91        19855 naive_CD8   C19_cI        Healthy (No …    31 M      White
 7 eAV93        20655 naive_CD8   C19_cI        Healthy (No …    41 M      White
 8 eDH105        3533 PBMC        C19_cI        COVID-19-Con…    32 F      <NA> 
 9 eDH107        1684 PBMC        C19_cI        COVID-19-Con…    72 F      <NA> 
10 eDH113     1327414 PBMC        C19_cI        Healthy (No …    56 <NA>   <NA> 
# ℹ 134 more rows
# ℹ 22 more variables: `HLA-A...9` <chr>, `HLA-A...10` <chr>,
#   `HLA-B...11` <chr>, `HLA-B...12` <chr>, `HLA-C...13` <chr>,
#   `HLA-C...14` <chr>, DPA1...15 <chr>, DPA1...16 <chr>, DPB1...17 <chr>,
#   DPB1...18 <chr>, DQA1...19 <chr>, DQA1...20 <chr>, DQB1...21 <chr>,
#   DQB1...22 <chr>, DRB1...23 <chr>, DRB1...24 <chr>, DRB3...25 <chr>,
#   DRB3...26 <chr>, DRB4...27 <chr>, DRB4...28 <chr>, DRB5...29 <chr>, …

Important to fix NAs
Can be: -99, N/A, -11, _, "" etc.
Data specific - Check!!!
If you’re lucky: Data documentation?

Lab 5 Exercises

Load `peptide` data

peptide_data <- read_csv(file = "data/peptide-detail-ci.csv.gz")
peptide_data

# A tibble: 154,320 × 7
   `TCR BioIdentity`            TCR Nucleotide Seque…¹ Experiment `ORF Coverage`
   <chr>                        <chr>                  <chr>      <chr>         
 1 CASSAQGTGDRGYTF+TCRBV27-01+… GAGTCGCCCAGCCCCAACCAG… eAV93      ORF1ab,surfac…
 2 CASSLVATGNTGELFF+TCRBV07-09… CGCACAGAGCAGGGGGACTCG… eOX56      ORF1ab,surfac…
 3 CASSKGTVSGLSG+TCRBV21-01+TC… GAGATCCAGTCCACGGAGTCA… eAV93      ORF1ab,surfac…
 4 CALKVGADTQYF+TCRBV30-01+TCR… CTGAGTTCTAAGAAGCTCCTT… eQD124     ORF1ab,surfac…
 5 CASSLWASGRGGTGELFF+TCRBV27-… AGCCCCAACCAGACCTCTCTG… eAV93      ORF1ab,surfac…
 6 CASSLLGWEQLDEQFF+TCRBV27-01… TCGCCCAGCCCCAACCAGACC… eMR16      ORF1ab,surfac…
 7 CASSSGTGVYGYTF+TCRBV12-X+TC… ATCCAGCCCTCAGAACCCAGG… eAV93      ORF1ab,surfac…
 8 CASSPLEWEGVTEAFF+TCRBV06-X+… TCGGCTGCTCCCTCCCAGACA… eQD123     ORF1ab,surfac…
 9 CASSFWSSGRGGTDTQYF+TCRBV27-… AGCCCCAACCAGACCTCTCTG… eAV93      ORF1ab,surfac…
10 CASSAGQGASDEQFF+TCRBV07-09+… CAGCGCACAGAGCAGGGGGAC… eMR15      ORF1ab,surfac…
# ℹ 154,310 more rows
# ℹ abbreviated name: ¹`TCR Nucleotide Sequence`
# ℹ 3 more variables: `Amino Acids` <chr>, `Start Index in Genome` <dbl>,
#   `End Index in Genome` <dbl>

Lab 5 Exercises

Q1: How many observations of how many variables are in the data?

meta_data |>
  dim()

[1] 144  30

Lab 5 Exercises

Q2: Are there groupings in the variables, i.e. do certain variables “go together” somehow?

meta_data |> 
  colnames()

 [1] "Experiment"  "Subject"     "Cell Type"   "Target Type" "Cohort"     
 [6] "Age"         "Gender"      "Race"        "HLA-A...9"   "HLA-A...10" 
[11] "HLA-B...11"  "HLA-B...12"  "HLA-C...13"  "HLA-C...14"  "DPA1...15"  
[16] "DPA1...16"   "DPB1...17"   "DPB1...18"   "DQA1...19"   "DQA1...20"  
[21] "DQB1...21"   "DQB1...22"   "DRB1...23"   "DRB1...24"   "DRB3...25"  
[26] "DRB3...26"   "DRB4...27"   "DRB4...28"   "DRB5...29"   "DRB5...30"

Lab 5 Exercises

T1: Re-create this plot

pl <- meta_data |> 
  drop_na(Cohort, Gender) |> 
  count(Cohort, Gender) |> 
  ggplot(aes(x = Cohort,
             y = n,
             fill = Gender)) +
  geom_col(position = position_dodge(
    preserve = "single"),
    colour = "black",
    alpha = 0.5) +
  geom_hline(yintercept = 0) +
  theme_minimal(base_size = 20) +
  theme(legend.position = "bottom",
        panel.grid.major.x = element_blank(),
        axis.text.x = element_text(angle = 10,
                                   hjust = 1,
                                   vjust = 1.5))

Lab 5 Exercises

T2: Re-create this plot

pl <- meta_data |> 
  mutate(Age_group = cut(
    x = Age,
    breaks = seq(from = 0,
                 to = 100,
                 by = 10))) |> 
  drop_na(Gender, Age_group) |> 
  count(Gender, Age_group) |> 
  ggplot(aes(x = Age_group,
             y = n,
             fill = Gender)) +
  geom_col(position = position_dodge(
    preserve = "single"),
    colour = "black",
    alpha = 0.5) +
  geom_hline(yintercept = 0) +
  theme_minimal(base_size = 20) +
  theme(legend.position = "bottom",
        panel.grid.major.x = element_blank(),
        axis.text.x = element_text(vjust = 5))

Lab 5 Exercises

T3: Look at the data and create yet another plot as you see fit. Also skip the redundant variables Subject, Cell Type and Target Type

pl <- meta_data |>
  ggplot(aes(x = Cohort,
             y = Age,
             fill = Gender)) +
  geom_boxplot(position = position_dodge(
    preserve = "single")) +
  theme_minimal(base_size = 20) +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 10,
                                   hjust = 1,
                                   vjust = 1.5))
meta_data <- meta_data |> 
  select(-Subject, -`Cell Type`, -`Target Type`)

Note the backticks in selecting variables, which contain spaces!

Lab 5 Exercises

library("table1") # <= Yes, this should normally go at the beginning!
meta_data |>
  mutate(Gender = factor(Gender),
         Cohort = factor(Cohort)) |>
  table1(x = formula(~ Gender + Age + Race | Cohort),
         data = _)

	COVID-19-Acute (N=4)	COVID-19-B-Non-Acute (N=8)	COVID-19-Convalescent (N=90)	COVID-19-Exposed (N=3)	Healthy (No known exposure) (N=39)	Overall (N=144)
Gender
F	1 (25.0%)	4 (50.0%)	33 (36.7%)	1 (33.3%)	17 (43.6%)	56 (38.9%)
M	2 (50.0%)	3 (37.5%)	36 (40.0%)	0 (0%)	21 (53.8%)	62 (43.1%)
Missing	1 (25.0%)	1 (12.5%)	21 (23.3%)	2 (66.7%)	1 (2.6%)	26 (18.1%)
Age
Mean (SD)	50.7 (17.0)	43.7 (7.74)	51.5 (15.3)	35.0 (NA)	33.3 (9.93)	44.9 (15.7)
Median [Min, Max]	52.0 [33.0, 67.0]	42.0 [33.0, 53.0]	53.0 [21.0, 79.0]	35.0 [35.0, 35.0]	31.0 [21.0, 62.0]	42.0 [21.0, 79.0]
Missing	1 (25.0%)	1 (12.5%)	21 (23.3%)	2 (66.7%)	0 (0%)	25 (17.4%)
Race
African American	1 (25.0%)	0 (0%)	0 (0%)	0 (0%)	1 (2.6%)	2 (1.4%)
White	2 (50.0%)	7 (87.5%)	13 (14.4%)	0 (0%)	28 (71.8%)	50 (34.7%)
Asian	0 (0%)	0 (0%)	3 (3.3%)	0 (0%)	2 (5.1%)	5 (3.5%)
Hispanic or Latino/a	0 (0%)	0 (0%)	1 (1.1%)	0 (0%)	0 (0%)	1 (0.7%)
Native Hawaiian or Other Pacific Islander	0 (0%)	0 (0%)	0 (0%)	1 (33.3%)	0 (0%)	1 (0.7%)
Black or African American	0 (0%)	0 (0%)	0 (0%)	0 (0%)	3 (7.7%)	3 (2.1%)
Mixed Race	0 (0%)	0 (0%)	0 (0%)	0 (0%)	1 (2.6%)	1 (0.7%)
Missing	1 (25.0%)	1 (12.5%)	73 (81.1%)	2 (66.7%)	4 (10.3%)	81 (56.3%)

Super nice way to quickly get an overview of the data!

Lab 5 Exercises

T4: Create a new version of the meta_data, which with respect to allele-data only contains information on class I and also fix the odd naming, e.g. HLA-A...9 becomes A1 oand HLA-A...10 becomes A2 and so on for B1, B2, C1 and C2

meta_data <- meta_data |>
  select(-matches(match = "D[PQR]")) |> # Regex: D and-then-any-of P or Q or R
  rename("A1" = `HLA-A...9`,
         "A2" = `HLA-A...10`,
         "B1" = `HLA-B...11`,
         "B2" = `HLA-B...12`,
         "C1" = `HLA-C...13`,
         "C2" = `HLA-C...14`)

Lab 5 Exercises

The Peptide Details Data

Q3: How many observations of how many variables are in the data?

peptide_data |> 
  dim()

[1] 154320      7

Lab 5 Exercises

T5: As before, let’s immediately subset the peptide_data to the variables of interest: TCR BioIdentity, Experiment and Amino Acids

peptide_data <- peptide_data |>
  select(Experiment,
         `TCR BioIdentity`,
         `Amino Acids`)

Lab 5 Exercises

Q4: Is this tidy data? Why/why not?

peptide_data

# A tibble: 154,320 × 3
   Experiment `TCR BioIdentity`                        `Amino Acids`            
   <chr>      <chr>                                    <chr>                    
 1 eAV93      CASSAQGTGDRGYTF+TCRBV27-01+TCRBJ01-02    ADAGFIKQY,AELEGIQY,LADAG…
 2 eOX56      CASSLVATGNTGELFF+TCRBV07-09+TCRBJ02-02   ADAGFIKQY,AELEGIQY,LADAG…
 3 eAV93      CASSKGTVSGLSG+TCRBV21-01+TCRBJ02-07      ADAGFIKQY,AELEGIQY,LADAG…
 4 eQD124     CALKVGADTQYF+TCRBV30-01+TCRBJ02-03       ADAGFIKQY,AELEGIQY,LADAG…
 5 eAV93      CASSLWASGRGGTGELFF+TCRBV27-01+TCRBJ02-02 ADAGFIKQY,AELEGIQY,LADAG…
 6 eMR16      CASSLLGWEQLDEQFF+TCRBV27-01+TCRBJ02-01   ADAGFIKQY,AELEGIQY,LADAG…
 7 eAV93      CASSSGTGVYGYTF+TCRBV12-X+TCRBJ01-02      ADAGFIKQY,AELEGIQY,LADAG…
 8 eQD123     CASSPLEWEGVTEAFF+TCRBV06-X+TCRBJ01-01    ADAGFIKQY,AELEGIQY,LADAG…
 9 eAV93      CASSFWSSGRGGTDTQYF+TCRBV27-01+TCRBJ02-03 ADAGFIKQY,AELEGIQY,LADAG…
10 eMR15      CASSAGQGASDEQFF+TCRBV07-09+TCRBJ02-01    ADAGFIKQY,AELEGIQY,LADAG…
# ℹ 154,310 more rows

Lab 5 Exercises

T6: See if you can find a way to create the below data, from the above

peptide_data <- peptide_data |> 
  separate(col = `TCR BioIdentity`,
           into = c("CDR3b", "V_gene", "J_gene"),
           sep = "\\+") # Note the escape
peptide_data

# A tibble: 154,320 × 5
   Experiment CDR3b              V_gene     J_gene     `Amino Acids`            
   <chr>      <chr>              <chr>      <chr>      <chr>                    
 1 eAV93      CASSAQGTGDRGYTF    TCRBV27-01 TCRBJ01-02 ADAGFIKQY,AELEGIQY,LADAG…
 2 eOX56      CASSLVATGNTGELFF   TCRBV07-09 TCRBJ02-02 ADAGFIKQY,AELEGIQY,LADAG…
 3 eAV93      CASSKGTVSGLSG      TCRBV21-01 TCRBJ02-07 ADAGFIKQY,AELEGIQY,LADAG…
 4 eQD124     CALKVGADTQYF       TCRBV30-01 TCRBJ02-03 ADAGFIKQY,AELEGIQY,LADAG…
 5 eAV93      CASSLWASGRGGTGELFF TCRBV27-01 TCRBJ02-02 ADAGFIKQY,AELEGIQY,LADAG…
 6 eMR16      CASSLLGWEQLDEQFF   TCRBV27-01 TCRBJ02-01 ADAGFIKQY,AELEGIQY,LADAG…
 7 eAV93      CASSSGTGVYGYTF     TCRBV12-X  TCRBJ01-02 ADAGFIKQY,AELEGIQY,LADAG…
 8 eQD123     CASSPLEWEGVTEAFF   TCRBV06-X  TCRBJ01-01 ADAGFIKQY,AELEGIQY,LADAG…
 9 eAV93      CASSFWSSGRGGTDTQYF TCRBV27-01 TCRBJ02-03 ADAGFIKQY,AELEGIQY,LADAG…
10 eMR15      CASSAGQGASDEQFF    TCRBV07-09 TCRBJ02-01 ADAGFIKQY,AELEGIQY,LADAG…
# ℹ 154,310 more rows

Lab 5 Exercises

T7: Add a variable, which counts how many peptides are in each observation of Amino Acids

peptide_data <- peptide_data |> 
  mutate(n_peptides = str_count(`Amino Acids`,
                                pattern = ",") + 1)
peptide_data

# A tibble: 154,320 × 6
   Experiment CDR3b              V_gene     J_gene     `Amino Acids`  n_peptides
   <chr>      <chr>              <chr>      <chr>      <chr>               <dbl>
 1 eAV93      CASSAQGTGDRGYTF    TCRBV27-01 TCRBJ01-02 ADAGFIKQY,AEL…          4
 2 eOX56      CASSLVATGNTGELFF   TCRBV07-09 TCRBJ02-02 ADAGFIKQY,AEL…          4
 3 eAV93      CASSKGTVSGLSG      TCRBV21-01 TCRBJ02-07 ADAGFIKQY,AEL…          4
 4 eQD124     CALKVGADTQYF       TCRBV30-01 TCRBJ02-03 ADAGFIKQY,AEL…          4
 5 eAV93      CASSLWASGRGGTGELFF TCRBV27-01 TCRBJ02-02 ADAGFIKQY,AEL…          4
 6 eMR16      CASSLLGWEQLDEQFF   TCRBV27-01 TCRBJ02-01 ADAGFIKQY,AEL…          4
 7 eAV93      CASSSGTGVYGYTF     TCRBV12-X  TCRBJ01-02 ADAGFIKQY,AEL…          4
 8 eQD123     CASSPLEWEGVTEAFF   TCRBV06-X  TCRBJ01-01 ADAGFIKQY,AEL…          4
 9 eAV93      CASSFWSSGRGGTDTQYF TCRBV27-01 TCRBJ02-03 ADAGFIKQY,AEL…          4
10 eMR15      CASSAGQGASDEQFF    TCRBV07-09 TCRBJ02-01 ADAGFIKQY,AEL…          4
# ℹ 154,310 more rows

Lab 5 Exercises

T8: Re-create the following plot

pl <- peptide_data |> 
  ggplot(aes(x = n_peptides)) +
  geom_histogram(binwidth = 1,
                 colour = "black",
                 alpha = 0.5) + 
  geom_hline(yintercept = 0) +
  scale_x_continuous(breaks = 1:13) +
  theme_minimal(base_size = 20) +
  theme(panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank(),
        axis.text.x = element_text(vjust = 5)) +
  labs(x = "Number of peptides per observation",
       y = "Counts")

Lab 5 Exercises

Q4: What is the maximum number of peptides assigned to one observation?

max_n_peptides <- peptide_data |>
  pull(n_peptides) |>
  max()
max_n_peptides

[1] 13

Lab 5 Exercises

T9: Using the str_c- and the seq-functions, re-create the below

str_c("peptide_", seq(from = 1, to = 5, by = 1)) # 1:5

[1] "peptide_1" "peptide_2" "peptide_3" "peptide_4" "peptide_5"

Lab 5 Exercises

T10: Use, what you learned about separating in T6 and the vector-of-strings you created in T9 adjusted to the number from Q4 to create the below data

peptide_data <- peptide_data |> 
  separate(col = `Amino Acids`,
           into = str_c("peptide_", 1:max_n_peptides),
           sep = ",")
peptide_data |> 
  sample_n(10)

# A tibble: 10 × 18
   Experiment CDR3b        V_gene J_gene peptide_1 peptide_2 peptide_3 peptide_4
   <chr>      <chr>        <chr>  <chr>  <chr>     <chr>     <chr>     <chr>    
 1 eXL31      CASSEWGDAYE… TCRBV… TCRBJ… FLWLLWPVT FLWLLWPV… LWLLWPVTL LWPVTLACF
 2 eJL151     CSVVGRSSYEQ… TCRBV… TCRBJ… AGAAAYYV… GAAAYYVGY WTAGAAAYY WTAGAAAY…
 3 eEE224     CASSLVGRTLY… TCRBV… TCRBJ… FVDGVPFVV <NA>      <NA>      <NA>     
 4 eOX52      CASSLGAWGQP… TCRBV… TCRBJ… GMEVTPSG… MEVTPSGT… TPSGTWLTY VTPSGTWL…
 5 eMR12      CASSLTPPREE… TCRBV… TCRBJ… HTTDPSFL… <NA>      <NA>      <NA>     
 6 eOX52      CASSEWLPPYE… TCRBV… TCRBJ… NYLYRLFRK NYNYLYRLF <NA>      <NA>     
 7 eXL30      CASAPGAQGGF… TCRBV… TCRBJ… FPNITNLC… QPTESIVRF RFPNITNL… TESIVRFP…
 8 eHO128     CASSLSGEETQ… TCRBV… TCRBJ… HTTDPSFL… <NA>      <NA>      <NA>     
 9 eMR16      CATGGGGRETQ… TCRBV… TCRBJ… AFPFTIYSL GYINVFAF… INVFAFPF… MGYINVFAF
10 eXL32      CASSRDRGRLE… TCRBV… TCRBJ… AFLLFLVLI FLAFLLFLV FYLCFLAFL FYLCFLAF…
# ℹ 10 more variables: peptide_5 <chr>, peptide_6 <chr>, peptide_7 <chr>,
#   peptide_8 <chr>, peptide_9 <chr>, peptide_10 <chr>, peptide_11 <chr>,
#   peptide_12 <chr>, peptide_13 <chr>, n_peptides <dbl>

Lab 5 Exercises

Q5: Now, presumable you got a warning, discuss in your group why that is?

Not all have 13, so R lets you know this! Thank you R 🙏

Lab 5 Exercises

Q6: With respect to peptide_n, discuss in your group, if this is wide- or long-data?

peptide_data

# A tibble: 154,320 × 18
   Experiment CDR3b        V_gene J_gene peptide_1 peptide_2 peptide_3 peptide_4
   <chr>      <chr>        <chr>  <chr>  <chr>     <chr>     <chr>     <chr>    
 1 eAV93      CASSAQGTGDR… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
 2 eOX56      CASSLVATGNT… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
 3 eAV93      CASSKGTVSGL… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
 4 eQD124     CALKVGADTQYF TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
 5 eAV93      CASSLWASGRG… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
 6 eMR16      CASSLLGWEQL… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
 7 eAV93      CASSSGTGVYG… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
 8 eQD123     CASSPLEWEGV… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
 9 eAV93      CASSFWSSGRG… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
10 eMR15      CASSAGQGASD… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY  LADAGFIK… TLADAGFIK
# ℹ 154,310 more rows
# ℹ 10 more variables: peptide_5 <chr>, peptide_6 <chr>, peptide_7 <chr>,
#   peptide_8 <chr>, peptide_9 <chr>, peptide_10 <chr>, peptide_11 <chr>,
#   peptide_12 <chr>, peptide_13 <chr>, n_peptides <dbl>

Lab 5 Exercises

T12: From the peptide_data data above, with peptide_1, peptide_2, etc. create this data set using one of the data-pivotting functions

peptide_data <- peptide_data |> 
  pivot_longer(cols = contains("peptide_"),
               names_to = "peptide_n",
               values_to = "peptide")
peptide_data

# A tibble: 2,006,160 × 7
   Experiment CDR3b           V_gene     J_gene     n_peptides peptide_n peptide
   <chr>      <chr>           <chr>      <chr>           <dbl> <chr>     <chr>  
 1 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_1 ADAGFI…
 2 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_2 AELEGI…
 3 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_3 LADAGF…
 4 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_4 TLADAG…
 5 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_5 <NA>   
 6 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_6 <NA>   
 7 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_7 <NA>   
 8 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_8 <NA>   
 9 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_9 <NA>   
10 eAV93      CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02          4 peptide_… <NA>   
# ℹ 2,006,150 more rows

Lab 5 Exercises

T13: Now, loose the redundant variables n_peptides and peptide_n and also get rid of the NAs in the peptide-column and make sure, that we only have unique observations, i.e. there are no repeated rows/observations

peptide_data <- peptide_data |> 
  select(-n_peptides, -peptide_n) |> 
  drop_na(peptide) |> 
  distinct()
peptide_data

# A tibble: 583,309 × 5
   Experiment CDR3b            V_gene     J_gene     peptide   
   <chr>      <chr>            <chr>      <chr>      <chr>     
 1 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 ADAGFIKQY 
 2 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 AELEGIQY  
 3 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 LADAGFIKQY
 4 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 TLADAGFIK 
 5 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY 
 6 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 AELEGIQY  
 7 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 LADAGFIKQY
 8 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 TLADAGFIK 
 9 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 ADAGFIKQY 
10 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 AELEGIQY  
# ℹ 583,299 more rows

Lab 5 Exercises

Q8: Now how many rows and columns and is this data tidy? Discuss in your group why/why not?

Super duper tidy!

Each row is an observation
Each column is a variable
Each cell is a value

Lab 5 Exercises

T14: Use the str_detect()-function to filter the CDR3b and peptide variables using a pattern of [^ARNDCQEGHILKMFPSTWYV] and then play with the negate-parameter so see what happens

peptide_data <- peptide_data |> 
  filter(str_detect(CDR3b, pattern = "[^ARNDCQEGHILKMFPSTWYV]", negate = TRUE),
         str_detect(peptide, pattern = "[^ARNDCQEGHILKMFPSTWYV]", negate = TRUE))
peptide_data

# A tibble: 578,357 × 5
   Experiment CDR3b            V_gene     J_gene     peptide   
   <chr>      <chr>            <chr>      <chr>      <chr>     
 1 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 ADAGFIKQY 
 2 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 AELEGIQY  
 3 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 LADAGFIKQY
 4 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 TLADAGFIK 
 5 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY 
 6 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 AELEGIQY  
 7 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 LADAGFIKQY
 8 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 TLADAGFIK 
 9 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 ADAGFIKQY 
10 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 AELEGIQY  
# ℹ 578,347 more rows

Lab 5 Exercises

T14: Use the str_detect()-function to filter the CDR3b and peptide variables

clean_amino_acids <- function(x){
  # str_detect(x, pattern = "^[ARNDCQEGHILKMFPSTWYV]+$") will also work
  return( str_detect(x, pattern = "[^ARNDCQEGHILKMFPSTWYV]", negate = TRUE) )
}
peptide_data <- peptide_data |> 
  filter(clean_amino_acids(CDR3b),
         clean_amino_acids(peptide))
peptide_data

# A tibble: 578,357 × 5
   Experiment CDR3b            V_gene     J_gene     peptide   
   <chr>      <chr>            <chr>      <chr>      <chr>     
 1 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 ADAGFIKQY 
 2 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 AELEGIQY  
 3 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 LADAGFIKQY
 4 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 TLADAGFIK 
 5 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY 
 6 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 AELEGIQY  
 7 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 LADAGFIKQY
 8 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 TLADAGFIK 
 9 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 ADAGFIKQY 
10 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 AELEGIQY  
# ℹ 578,347 more rows

DRY

Lab 5 Exercises

T15: Add two new variables to the data, k_CDR3b and k_peptide each signifying the length of the respective sequences

peptide_data <- peptide_data |> 
  mutate(k_CDR3b = str_length(CDR3b),
         k_peptide = str_length(peptide))
peptide_data

# A tibble: 578,357 × 7
   Experiment CDR3b            V_gene     J_gene     peptide   k_CDR3b k_peptide
   <chr>      <chr>            <chr>      <chr>      <chr>       <int>     <int>
 1 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 ADAGFIKQY      15         9
 2 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 AELEGIQY       15         8
 3 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 LADAGFIK…      15        10
 4 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 TLADAGFIK      15         9
 5 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY      16         9
 6 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 AELEGIQY       16         8
 7 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 LADAGFIK…      16        10
 8 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 TLADAGFIK      16         9
 9 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 ADAGFIKQY      13         9
10 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 AELEGIQY       13         8
# ℹ 578,347 more rows

Lab 5 Exercises

T16: Re-create this plot

pl <- peptide_data |> 
  ggplot(aes(x = k_CDR3b)) +
  geom_histogram(binwidth = 1,
                 colour = "black",
                 alpha = 0.5) +
  geom_hline(yintercept = 0) +
  theme_minimal(base_size = 20)

Lab 5 Exercises

T17: Re-create this plot

pl <- peptide_data |> 
  ggplot(aes(x = k_peptide)) +
  geom_histogram(binwidth = 1,
                 colour = "black",
                 alpha = 0.5) +
  geom_hline(yintercept = 0) +
  theme_minimal(base_size = 20)

Lab 5 Exercises

Q9: What is the most predominant length of the CDR3b-sequences?
Q10: What is the most predominant length of the peptide-sequences?
Q11: Discuss in your group, if this data set is tidy or not?

Lab 5 Exercises

Creating one data set from two data sets

Before we move onto using the family of *_join-functions you prepared for today, we will just take a quick peek at the meta data again:

meta_data |> 
  sample_n(10)

# A tibble: 10 × 11
   Experiment Cohort        Age Gender Race  A1    A2    B1    B2    C1    C2   
   <chr>      <chr>       <dbl> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1 eLH48      COVID-19-C…    28 M      White "A*0… "A*2… "B*0… "B*1… "C*0… "C*0…
 2 eJL147     Healthy (N…    40 M      Mixe… "A*0… "A*1… "B*0… "B*3… "C*0… "C*0…
 3 eAV91      Healthy (N…    31 M      White "A*0… "A*6… "B*0… "B*5… "C*0… "C*0…
 4 eQD115     COVID-19-C…    48 M      <NA>  "A*0… "A*0… "B*0… "B*4… "C*0… "C*0…
 5 eMR12      COVID-19-C…    NA <NA>   <NA>  "A*0… "A*0… "B*4… "B*5… "C*0… "C*1…
 6 eDH107     COVID-19-C…    72 F      <NA>  "A*0… "A*0… "B*1… "B*3… "C*0… "C*0…
 7 eHH169     Healthy (N…    24 F      Blac… "A*0… "A*7… "B*3… "B*3… "C*0… "C*0…
 8 eNL192     COVID-19-C…    NA <NA>   <NA>  ""    ""    ""    ""    ""    ""   
 9 eHO130     Healthy (N…    28 F      White "A*0… "A*0… "B*0… "B*0… "C*0… "C*0…
10 eMR20      COVID-19-B…    37 M      White "A*0… "A*2… "B*1… "B*1… "C*0… "C*0…

Q12: Discuss in your group, if this data with respect to the A1-, A2-, B1-, B2-, C1- and C2-variables is a wide- or a long-data format?

Lab 5 Exercises

T18: use either the pivot_wider- or pivot_longer-function to create the following data:

meta_data <- meta_data |> 
  pivot_longer(cols = c("A1", "A2", "B1", "B2", "C1", "C2"),
               names_to = "Gene",
               values_to = "Allele")
meta_data

# A tibble: 864 × 7
   Experiment Cohort                  Age Gender Race  Gene  Allele    
   <chr>      <chr>                 <dbl> <chr>  <chr> <chr> <chr>     
 1 eAM13      COVID-19-Convalescent    34 F      White A1    A*02:01:01
 2 eAM13      COVID-19-Convalescent    34 F      White A2    A*02:05:01
 3 eAM13      COVID-19-Convalescent    34 F      White B1    B*08:01:01
 4 eAM13      COVID-19-Convalescent    34 F      White B2    B*41:01:01
 5 eAM13      COVID-19-Convalescent    34 F      White C1    C*07:01:01
 6 eAM13      COVID-19-Convalescent    34 F      White C2    C*07:01:01
 7 eAM23      COVID-19-Convalescent    48 M      <NA>  A1    A*11:01:01
 8 eAM23      COVID-19-Convalescent    48 M      <NA>  A2    A*24:02:01
 9 eAM23      COVID-19-Convalescent    48 M      <NA>  B1    B*15:01:01
10 eAM23      COVID-19-Convalescent    48 M      <NA>  B2    B*52:01:01
# ℹ 854 more rows

Lab 5 Exercises

Q13: Discuss in your group, which variable(s?) define the same observations between the peptide_data and the meta_data?

# A tibble: 864 × 7
   Experiment Cohort                  Age Gender Race  Gene  Allele    
   <chr>      <chr>                 <dbl> <chr>  <chr> <chr> <chr>     
 1 eAM13      COVID-19-Convalescent    34 F      White A1    A*02:01:01
 2 eAM13      COVID-19-Convalescent    34 F      White A2    A*02:05:01
 3 eAM13      COVID-19-Convalescent    34 F      White B1    B*08:01:01
 4 eAM13      COVID-19-Convalescent    34 F      White B2    B*41:01:01
 5 eAM13      COVID-19-Convalescent    34 F      White C1    C*07:01:01
 6 eAM13      COVID-19-Convalescent    34 F      White C2    C*07:01:01
 7 eAM23      COVID-19-Convalescent    48 M      <NA>  A1    A*11:01:01
 8 eAM23      COVID-19-Convalescent    48 M      <NA>  A2    A*24:02:01
 9 eAM23      COVID-19-Convalescent    48 M      <NA>  B1    B*15:01:01
10 eAM23      COVID-19-Convalescent    48 M      <NA>  B2    B*52:01:01
# ℹ 854 more rows

# A tibble: 578,357 × 7
   Experiment CDR3b            V_gene     J_gene     peptide   k_CDR3b k_peptide
   <chr>      <chr>            <chr>      <chr>      <chr>       <int>     <int>
 1 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 ADAGFIKQY      15         9
 2 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 AELEGIQY       15         8
 3 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 LADAGFIK…      15        10
 4 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 TLADAGFIK      15         9
 5 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY      16         9
 6 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 AELEGIQY       16         8
 7 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 LADAGFIK…      16        10
 8 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 TLADAGFIK      16         9
 9 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 ADAGFIKQY      13         9
10 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 AELEGIQY       13         8
# ℹ 578,347 more rows

intersect(colnames(meta_data), colnames(peptide_data))

[1] "Experiment"

Lab 5 Exercises

Once you have agreed upon Experiment, then use that knowledge to subset the meta_data to the variables-of-interest:

meta_data <- meta_data |> 
  select(Experiment, Allele) |> 
  distinct()
meta_data

# A tibble: 761 × 2
   Experiment Allele    
   <chr>      <chr>     
 1 eAM13      A*02:01:01
 2 eAM13      A*02:05:01
 3 eAM13      B*08:01:01
 4 eAM13      B*41:01:01
 5 eAM13      C*07:01:01
 6 eAM23      A*11:01:01
 7 eAM23      A*24:02:01
 8 eAM23      B*15:01:01
 9 eAM23      B*52:01:01
10 eAM23      C*04:01:01
# ℹ 751 more rows

Here, we are interested in capturing the Allele-information

Lab 5 Exercises

Some alleles are e.g. A*11:01, whereas others are B*51:01:02, let’s snatch the first two fields:
T19: Create the following data, according to specifications above:

meta_data <- meta_data |>
  mutate(Allele_F_1_2 = str_extract(Allele,
                                    # Regex: You're NOT expected to know this!
                                    pattern = "^[ABC]\\*\\d+\\:\\d+")) |>
  filter(Allele_F_1_2 != "") |> 
  drop_na()
meta_data

# A tibble: 748 × 3
   Experiment Allele     Allele_F_1_2
   <chr>      <chr>      <chr>       
 1 eAM13      A*02:01:01 A*02:01     
 2 eAM13      A*02:05:01 A*02:05     
 3 eAM13      B*08:01:01 B*08:01     
 4 eAM13      B*41:01:01 B*41:01     
 5 eAM13      C*07:01:01 C*07:01     
 6 eAM23      A*11:01:01 A*11:01     
 7 eAM23      A*24:02:01 A*24:02     
 8 eAM23      B*15:01:01 B*15:01     
 9 eAM23      B*52:01:01 B*52:01     
10 eAM23      C*04:01:01 C*04:01     
# ℹ 738 more rows

Lab 5 Exercises

Some alleles are e.g. A*11:01, whereas others are B*51:01:02, let’s snatch the first two fields:
T19: Create the following data, according to specifications above:

meta_data <- meta_data |>
  # This will work as well, but think about caveats!
  mutate(Allele_F_1_2 = str_sub(Allele, start = 1, end = 7)) |>
  filter(Allele_F_1_2 != "") |> 
  drop_na()
meta_data

# A tibble: 748 × 3
   Experiment Allele     Allele_F_1_2
   <chr>      <chr>      <chr>       
 1 eAM13      A*02:01:01 A*02:01     
 2 eAM13      A*02:05:01 A*02:05     
 3 eAM13      B*08:01:01 B*08:01     
 4 eAM13      B*41:01:01 B*41:01     
 5 eAM13      C*07:01:01 C*07:01     
 6 eAM23      A*11:01:01 A*11:01     
 7 eAM23      A*24:02:01 A*24:02     
 8 eAM23      B*15:01:01 B*15:01     
 9 eAM23      B*52:01:01 B*52:01     
10 eAM23      C*04:01:01 C*04:01     
# ℹ 738 more rows

Lab 5 Exercises

The asterix, i.e. * is a rather annoying character because of ambiguity, so:

T20: Clean the data a bit more, by removing the asterix and redundant variables:

meta_data <- meta_data |> 
  mutate(Allele_F_1_2 = str_remove(Allele_F_1_2, "\\*")) |> 
  select(-Allele) |>
  rename(Allele = Allele_F_1_2) |> 
  distinct()
meta_data

# A tibble: 748 × 2
   Experiment Allele
   <chr>      <chr> 
 1 eAM13      A02:01
 2 eAM13      A02:05
 3 eAM13      B08:01
 4 eAM13      B41:01
 5 eAM13      C07:01
 6 eAM23      A11:01
 7 eAM23      A24:02
 8 eAM23      B15:01
 9 eAM23      B52:01
10 eAM23      C04:01
# ℹ 738 more rows

Lab 5 Exercises

peptide_data

# A tibble: 578,357 × 7
   Experiment CDR3b            V_gene     J_gene     peptide   k_CDR3b k_peptide
   <chr>      <chr>            <chr>      <chr>      <chr>       <int>     <int>
 1 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 ADAGFIKQY      15         9
 2 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 AELEGIQY       15         8
 3 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 LADAGFIK…      15        10
 4 eAV93      CASSAQGTGDRGYTF  TCRBV27-01 TCRBJ01-02 TLADAGFIK      15         9
 5 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY      16         9
 6 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 AELEGIQY       16         8
 7 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 LADAGFIK…      16        10
 8 eOX56      CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 TLADAGFIK      16         9
 9 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 ADAGFIKQY      13         9
10 eAV93      CASSKGTVSGLSG    TCRBV21-01 TCRBJ02-07 AELEGIQY       13         8
# ℹ 578,347 more rows

meta_data

# A tibble: 748 × 2
   Experiment Allele
   <chr>      <chr> 
 1 eAM13      A02:01
 2 eAM13      A02:05
 3 eAM13      B08:01
 4 eAM13      B41:01
 5 eAM13      C07:01
 6 eAM23      A11:01
 7 eAM23      A24:02
 8 eAM23      B15:01
 9 eAM23      B52:01
10 eAM23      C04:01
# ℹ 738 more rows

Lab 5 Exercises

T21: Create a dplyr-pipeline, starting with the peptide_data, which joins it with the meta_data and remember to make sure that you get only unqiue observations of rows. Save this data into a new variable names peptide_meta_data (If you get a warning, discuss in your group what it means?)

peptide_meta_data <- peptide_data |> 
  full_join(meta_data,
            by = "Experiment",
            relationship = "many-to-many") |> 
  distinct()
peptide_meta_data |>
  sample_n(10)

# A tibble: 10 × 8
   Experiment CDR3b             V_gene   J_gene peptide k_CDR3b k_peptide Allele
   <chr>      <chr>             <chr>    <chr>  <chr>     <int>     <int> <chr> 
 1 eXL30      CASSLEASAYNEQFF   TCRBV11… TCRBJ… LLFLVL…      15         9 B35:02
 2 eXL31      CSVVPAAGGGRYNEQFF TCRBV29… TCRBJ… WPVTLA…      17        10 A29:02
 3 eOX54      CASSQGFTSTDTQYF   TCRBV03… TCRBJ… LITGRL…      15         9 B15:03
 4 eXL27      CASSFRSSGMNEQFF   TCRBV28… TCRBJ… FLWLLW…      15         9 C07:04
 5 eEE240     CASSLTSGMTYEQYF   TCRBV05… TCRBJ… SLIDFY…      15        10 C06:02
 6 eXL31      CASSLWGNSGELFF    TCRBV04… TCRBJ… FYLCFL…      14         9 A29:02
 7 eJL147     CASWGEGSYNEQFF    TCRBV06… TCRBJ… SASAFF…      14        10 A11:01
 8 eOX54      CASSLHGSGNTIYF    TCRBV28… TCRBJ… LWPVTL…      14         9 A02:01
 9 eXL31      CASSPSPGLAVQETQYF TCRBV04… TCRBJ… SLIDFY…      17        10 B44:03
10 eEE226     CASSSQPIQLYEQYF   TCRBV27… TCRBJ… MIELSL…      15        10 B35:02

Lab 5 Exercises

Analysis

Now, that we have the data in a prepared and ready-to-analyse format, let us return to the two burning questions we had:

What characterises the peptides binding to the HLAs?
What characterises T-cell Receptors binding to the pMHC-complexes?

Lab 5 Exercises

Peptides binding to HLA

T22: Subset the final peptide_meta_data-data to A02:01 and unique observations of peptides of length 9 and re-create the below sequence logo

library("ggseqlogo")
pl <- peptide_meta_data |>
  filter(Allele == "A02:01",
         str_length(peptide) == 9) |>
  select(peptide) |>
  distinct() |>
  pull(peptide) |>
  ggseqlogo()

Lab 5 Exercises

Peptides binding to HLA - Rinse-and-Repeat!

T23: Repeat for e.g. B07:02 or another of your favorite alleles

pl <- peptide_meta_data |>
  filter(Allele == "B07:02",
         str_length(peptide) == 9) |>
  select(peptide) |>
  distinct() |>
  pull(peptide) |>
  ggseqlogo()

The reason, that the logo is actually not correct, is that we’re missing something…

Q14: Which positions in the peptide determines binding to HLA?

Lab 5 Exercises

CDR3b-sequences binding to pMHC

T24: Subset the peptide_meta_data, such that the length of the CDR3b is 15, the allele is A02:01 and the peptide is LLFLVLIML and re-create the below sequence logo of the CDR3b sequences:

pl <- peptide_meta_data |>
  filter(k_CDR3b == 15,
         Allele == "A02:01",
         peptide == "LLFLVLIML") |> 
  pull(CDR3b) |> 
  ggseqlogo()

Epilogue

I know this overwhelming now, but commit to it and you WILL be plenty rewarded!
I hope this was at least a glimpse into the flexibility and capabilities of using tidyverse for applied Bio Data Science
…also, noticed something?
We spend maybe 80% of the time here on dealing with data-wrangling and then once we’re good to go, the analysis wasn’t that time consuming!

That’s often the way it ends up going, you’ll spend a lot of time on data handling and getting the tidyverse toolbox in your toolbelt, will allow you to be so much more effecient in your data wrangling, so you can get to the fun part as quick as possible!

Secret slide: How to succeed in this course!

Prepare assigned materials for class
Show up for class
Pay attention during lectures
Do the group exercises
Do the assignments
Use the assignment feedback
…and \(\rightarrow\)

Recap Lab 6

Prologue: Process vs. Product

Joins

Joins

full_join()

Joins

inner_join()

Joins

left_join()

Joins

right_join()

Joins

anti_join()

Pivotting

Pivotting

Pivotting

Lab 5 Exercises

Load libraries

Lab 5 Exercises

Load meta data

Lab 5 Exercises

Load meta data

Lab 5 Exercises

Load peptide data

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

The Peptide Details Data

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

DRY

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Creating one data set from two data sets

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Lab 5 Exercises

Analysis

Lab 5 Exercises

Peptides binding to HLA

Lab 5 Exercises

Peptides binding to HLA - Rinse-and-Repeat!

Lab 5 Exercises

CDR3b-sequences binding to pMHC

Epilogue

Secret slide: How to succeed in this course!

`full_join()`

`inner_join()`

`left_join()`

`right_join()`

`anti_join()`

Load `meta` data

Load `meta` data

Load `peptide` data