This course is all about
the process
Bio Data Science is the process
by which you create analysis results (products)
Therefore, yes you can do, e.g.:
And it will create the right product
, but in the context of this
course, it is wrong
, it is the wrong process
!
For your project, regardless of your product
, the use of the wrong process
, will get you points deducted
, so you might as well get used to it!
This course is on tidyverse R
, not base R
(different dialect)
I understand you are used to thinking in products, if helpful consider in this course, that the process IS the product!
# A tibble: 4 × 2
id var_x
<chr> <dbl>
1 blue_1 27
2 blue_2 42
3 red_3 33
4 red_4 64
meta
data# A tibble: 144 × 30
Experiment Subject `Cell Type` `Target Type` Cohort Age Gender Race
<chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 eAM13 844 PBMC C19_cI COVID-19-Con… 34 F White
2 eAM23 5422 PBMC C19_cI COVID-19-Con… 48 M N/A
3 eAV100 1995 PBMC C19_cII COVID-19-Con… 29 F N/A
4 eAV105 1995 PBMC C19_cII COVID-19-Con… 29 F N/A
5 eAV88 19830 naive_CD8 C19_cI Healthy (No … 24 M White
6 eAV91 19855 naive_CD8 C19_cI Healthy (No … 31 M White
7 eAV93 20655 naive_CD8 C19_cI Healthy (No … 41 M White
8 eDH105 3533 PBMC C19_cI COVID-19-Con… 32 F N/A
9 eDH107 1684 PBMC C19_cI COVID-19-Con… 72 F N/A
10 eDH113 1327414 PBMC C19_cI Healthy (No … 56 N/A N/A
# ℹ 134 more rows
# ℹ 22 more variables: `HLA-A...9` <chr>, `HLA-A...10` <chr>,
# `HLA-B...11` <chr>, `HLA-B...12` <chr>, `HLA-C...13` <chr>,
# `HLA-C...14` <chr>, DPA1...15 <chr>, DPA1...16 <chr>, DPB1...17 <chr>,
# DPB1...18 <chr>, DQA1...19 <chr>, DQA1...20 <chr>, DQB1...21 <chr>,
# DQB1...22 <chr>, DRB1...23 <chr>, DRB1...24 <chr>, DRB3...25 <chr>,
# DRB3...26 <chr>, DRB4...27 <chr>, DRB4...28 <chr>, DRB5...29 <chr>, …
meta
data# A tibble: 144 × 30
Experiment Subject `Cell Type` `Target Type` Cohort Age Gender Race
<chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 eAM13 844 PBMC C19_cI COVID-19-Con… 34 F White
2 eAM23 5422 PBMC C19_cI COVID-19-Con… 48 M <NA>
3 eAV100 1995 PBMC C19_cII COVID-19-Con… 29 F <NA>
4 eAV105 1995 PBMC C19_cII COVID-19-Con… 29 F <NA>
5 eAV88 19830 naive_CD8 C19_cI Healthy (No … 24 M White
6 eAV91 19855 naive_CD8 C19_cI Healthy (No … 31 M White
7 eAV93 20655 naive_CD8 C19_cI Healthy (No … 41 M White
8 eDH105 3533 PBMC C19_cI COVID-19-Con… 32 F <NA>
9 eDH107 1684 PBMC C19_cI COVID-19-Con… 72 F <NA>
10 eDH113 1327414 PBMC C19_cI Healthy (No … 56 <NA> <NA>
# ℹ 134 more rows
# ℹ 22 more variables: `HLA-A...9` <chr>, `HLA-A...10` <chr>,
# `HLA-B...11` <chr>, `HLA-B...12` <chr>, `HLA-C...13` <chr>,
# `HLA-C...14` <chr>, DPA1...15 <chr>, DPA1...16 <chr>, DPB1...17 <chr>,
# DPB1...18 <chr>, DQA1...19 <chr>, DQA1...20 <chr>, DQB1...21 <chr>,
# DQB1...22 <chr>, DRB1...23 <chr>, DRB1...24 <chr>, DRB3...25 <chr>,
# DRB3...26 <chr>, DRB4...27 <chr>, DRB4...28 <chr>, DRB5...29 <chr>, …
NA
s-99
, N/A
, -11
, _
, ""
etc.peptide
data# A tibble: 154,320 × 7
`TCR BioIdentity` TCR Nucleotide Seque…¹ Experiment `ORF Coverage`
<chr> <chr> <chr> <chr>
1 CASSAQGTGDRGYTF+TCRBV27-01+… GAGTCGCCCAGCCCCAACCAG… eAV93 ORF1ab,surfac…
2 CASSLVATGNTGELFF+TCRBV07-09… CGCACAGAGCAGGGGGACTCG… eOX56 ORF1ab,surfac…
3 CASSKGTVSGLSG+TCRBV21-01+TC… GAGATCCAGTCCACGGAGTCA… eAV93 ORF1ab,surfac…
4 CALKVGADTQYF+TCRBV30-01+TCR… CTGAGTTCTAAGAAGCTCCTT… eQD124 ORF1ab,surfac…
5 CASSLWASGRGGTGELFF+TCRBV27-… AGCCCCAACCAGACCTCTCTG… eAV93 ORF1ab,surfac…
6 CASSLLGWEQLDEQFF+TCRBV27-01… TCGCCCAGCCCCAACCAGACC… eMR16 ORF1ab,surfac…
7 CASSSGTGVYGYTF+TCRBV12-X+TC… ATCCAGCCCTCAGAACCCAGG… eAV93 ORF1ab,surfac…
8 CASSPLEWEGVTEAFF+TCRBV06-X+… TCGGCTGCTCCCTCCCAGACA… eQD123 ORF1ab,surfac…
9 CASSFWSSGRGGTDTQYF+TCRBV27-… AGCCCCAACCAGACCTCTCTG… eAV93 ORF1ab,surfac…
10 CASSAGQGASDEQFF+TCRBV07-09+… CAGCGCACAGAGCAGGGGGAC… eMR15 ORF1ab,surfac…
# ℹ 154,310 more rows
# ℹ abbreviated name: ¹`TCR Nucleotide Sequence`
# ℹ 3 more variables: `Amino Acids` <chr>, `Start Index in Genome` <dbl>,
# `End Index in Genome` <dbl>
[1] "Experiment" "Subject" "Cell Type" "Target Type" "Cohort"
[6] "Age" "Gender" "Race" "HLA-A...9" "HLA-A...10"
[11] "HLA-B...11" "HLA-B...12" "HLA-C...13" "HLA-C...14" "DPA1...15"
[16] "DPA1...16" "DPB1...17" "DPB1...18" "DQA1...19" "DQA1...20"
[21] "DQB1...21" "DQB1...22" "DRB1...23" "DRB1...24" "DRB3...25"
[26] "DRB3...26" "DRB4...27" "DRB4...28" "DRB5...29" "DRB5...30"
pl <- meta_data |>
drop_na(Cohort, Gender) |>
count(Cohort, Gender) |>
ggplot(aes(x = Cohort,
y = n,
fill = Gender)) +
geom_col(position = position_dodge(
preserve = "single"),
colour = "black",
alpha = 0.5) +
geom_hline(yintercept = 0) +
theme_minimal(base_size = 20) +
theme(legend.position = "bottom",
panel.grid.major.x = element_blank(),
axis.text.x = element_text(angle = 10,
hjust = 1,
vjust = 1.5))
pl <- meta_data |>
mutate(Age_group = cut(
x = Age,
breaks = seq(from = 0,
to = 100,
by = 10))) |>
drop_na(Gender, Age_group) |>
count(Gender, Age_group) |>
ggplot(aes(x = Age_group,
y = n,
fill = Gender)) +
geom_col(position = position_dodge(
preserve = "single"),
colour = "black",
alpha = 0.5) +
geom_hline(yintercept = 0) +
theme_minimal(base_size = 20) +
theme(legend.position = "bottom",
panel.grid.major.x = element_blank(),
axis.text.x = element_text(vjust = 5))
Subject
, Cell Type
and Target Type
pl <- meta_data |>
ggplot(aes(x = Cohort,
y = Age,
fill = Gender)) +
geom_boxplot(position = position_dodge(
preserve = "single")) +
theme_minimal(base_size = 20) +
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 10,
hjust = 1,
vjust = 1.5))
meta_data <- meta_data |>
select(-Subject, -`Cell Type`, -`Target Type`)
Note the backticks in selecting variables, which contain spaces!
library("table1") # <= Yes, this should normally go at the beginning!
meta_data |>
mutate(Gender = factor(Gender),
Cohort = factor(Cohort)) |>
table1(x = formula(~ Gender + Age + Race | Cohort),
data = _)
COVID-19-Acute (N=4) |
COVID-19-B-Non-Acute (N=8) |
COVID-19-Convalescent (N=90) |
COVID-19-Exposed (N=3) |
Healthy (No known exposure) (N=39) |
Overall (N=144) |
|
---|---|---|---|---|---|---|
Gender | ||||||
F | 1 (25.0%) | 4 (50.0%) | 33 (36.7%) | 1 (33.3%) | 17 (43.6%) | 56 (38.9%) |
M | 2 (50.0%) | 3 (37.5%) | 36 (40.0%) | 0 (0%) | 21 (53.8%) | 62 (43.1%) |
Missing | 1 (25.0%) | 1 (12.5%) | 21 (23.3%) | 2 (66.7%) | 1 (2.6%) | 26 (18.1%) |
Age | ||||||
Mean (SD) | 50.7 (17.0) | 43.7 (7.74) | 51.5 (15.3) | 35.0 (NA) | 33.3 (9.93) | 44.9 (15.7) |
Median [Min, Max] | 52.0 [33.0, 67.0] | 42.0 [33.0, 53.0] | 53.0 [21.0, 79.0] | 35.0 [35.0, 35.0] | 31.0 [21.0, 62.0] | 42.0 [21.0, 79.0] |
Missing | 1 (25.0%) | 1 (12.5%) | 21 (23.3%) | 2 (66.7%) | 0 (0%) | 25 (17.4%) |
Race | ||||||
African American | 1 (25.0%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (2.6%) | 2 (1.4%) |
White | 2 (50.0%) | 7 (87.5%) | 13 (14.4%) | 0 (0%) | 28 (71.8%) | 50 (34.7%) |
Asian | 0 (0%) | 0 (0%) | 3 (3.3%) | 0 (0%) | 2 (5.1%) | 5 (3.5%) |
Hispanic or Latino/a | 0 (0%) | 0 (0%) | 1 (1.1%) | 0 (0%) | 0 (0%) | 1 (0.7%) |
Native Hawaiian or Other Pacific Islander | 0 (0%) | 0 (0%) | 0 (0%) | 1 (33.3%) | 0 (0%) | 1 (0.7%) |
Black or African American | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 3 (7.7%) | 3 (2.1%) |
Mixed Race | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (2.6%) | 1 (0.7%) |
Missing | 1 (25.0%) | 1 (12.5%) | 73 (81.1%) | 2 (66.7%) | 4 (10.3%) | 81 (56.3%) |
Super nice way to quickly get an overview of the data!
meta_data
, which with respect to allele-data only contains information on class I and also fix the odd naming, e.g. HLA-A...9
becomes A1
oand HLA-A...10
becomes A2
and so on for B1
, B2
, C1
and C2
peptide_data
to the variables of interest: TCR BioIdentity
, Experiment
and Amino Acids
# A tibble: 154,320 × 3
Experiment `TCR BioIdentity` `Amino Acids`
<chr> <chr> <chr>
1 eAV93 CASSAQGTGDRGYTF+TCRBV27-01+TCRBJ01-02 ADAGFIKQY,AELEGIQY,LADAG…
2 eOX56 CASSLVATGNTGELFF+TCRBV07-09+TCRBJ02-02 ADAGFIKQY,AELEGIQY,LADAG…
3 eAV93 CASSKGTVSGLSG+TCRBV21-01+TCRBJ02-07 ADAGFIKQY,AELEGIQY,LADAG…
4 eQD124 CALKVGADTQYF+TCRBV30-01+TCRBJ02-03 ADAGFIKQY,AELEGIQY,LADAG…
5 eAV93 CASSLWASGRGGTGELFF+TCRBV27-01+TCRBJ02-02 ADAGFIKQY,AELEGIQY,LADAG…
6 eMR16 CASSLLGWEQLDEQFF+TCRBV27-01+TCRBJ02-01 ADAGFIKQY,AELEGIQY,LADAG…
7 eAV93 CASSSGTGVYGYTF+TCRBV12-X+TCRBJ01-02 ADAGFIKQY,AELEGIQY,LADAG…
8 eQD123 CASSPLEWEGVTEAFF+TCRBV06-X+TCRBJ01-01 ADAGFIKQY,AELEGIQY,LADAG…
9 eAV93 CASSFWSSGRGGTDTQYF+TCRBV27-01+TCRBJ02-03 ADAGFIKQY,AELEGIQY,LADAG…
10 eMR15 CASSAGQGASDEQFF+TCRBV07-09+TCRBJ02-01 ADAGFIKQY,AELEGIQY,LADAG…
# ℹ 154,310 more rows
peptide_data <- peptide_data |>
separate(col = `TCR BioIdentity`,
into = c("CDR3b", "V_gene", "J_gene"),
sep = "\\+") # Note the escape
peptide_data
# A tibble: 154,320 × 5
Experiment CDR3b V_gene J_gene `Amino Acids`
<chr> <chr> <chr> <chr> <chr>
1 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 ADAGFIKQY,AELEGIQY,LADAG…
2 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY,AELEGIQY,LADAG…
3 eAV93 CASSKGTVSGLSG TCRBV21-01 TCRBJ02-07 ADAGFIKQY,AELEGIQY,LADAG…
4 eQD124 CALKVGADTQYF TCRBV30-01 TCRBJ02-03 ADAGFIKQY,AELEGIQY,LADAG…
5 eAV93 CASSLWASGRGGTGELFF TCRBV27-01 TCRBJ02-02 ADAGFIKQY,AELEGIQY,LADAG…
6 eMR16 CASSLLGWEQLDEQFF TCRBV27-01 TCRBJ02-01 ADAGFIKQY,AELEGIQY,LADAG…
7 eAV93 CASSSGTGVYGYTF TCRBV12-X TCRBJ01-02 ADAGFIKQY,AELEGIQY,LADAG…
8 eQD123 CASSPLEWEGVTEAFF TCRBV06-X TCRBJ01-01 ADAGFIKQY,AELEGIQY,LADAG…
9 eAV93 CASSFWSSGRGGTDTQYF TCRBV27-01 TCRBJ02-03 ADAGFIKQY,AELEGIQY,LADAG…
10 eMR15 CASSAGQGASDEQFF TCRBV07-09 TCRBJ02-01 ADAGFIKQY,AELEGIQY,LADAG…
# ℹ 154,310 more rows
Amino Acids
peptide_data <- peptide_data |>
mutate(n_peptides = str_count(`Amino Acids`,
pattern = ",") + 1)
peptide_data
# A tibble: 154,320 × 6
Experiment CDR3b V_gene J_gene `Amino Acids` n_peptides
<chr> <chr> <chr> <chr> <chr> <dbl>
1 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 ADAGFIKQY,AEL… 4
2 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY,AEL… 4
3 eAV93 CASSKGTVSGLSG TCRBV21-01 TCRBJ02-07 ADAGFIKQY,AEL… 4
4 eQD124 CALKVGADTQYF TCRBV30-01 TCRBJ02-03 ADAGFIKQY,AEL… 4
5 eAV93 CASSLWASGRGGTGELFF TCRBV27-01 TCRBJ02-02 ADAGFIKQY,AEL… 4
6 eMR16 CASSLLGWEQLDEQFF TCRBV27-01 TCRBJ02-01 ADAGFIKQY,AEL… 4
7 eAV93 CASSSGTGVYGYTF TCRBV12-X TCRBJ01-02 ADAGFIKQY,AEL… 4
8 eQD123 CASSPLEWEGVTEAFF TCRBV06-X TCRBJ01-01 ADAGFIKQY,AEL… 4
9 eAV93 CASSFWSSGRGGTDTQYF TCRBV27-01 TCRBJ02-03 ADAGFIKQY,AEL… 4
10 eMR15 CASSAGQGASDEQFF TCRBV07-09 TCRBJ02-01 ADAGFIKQY,AEL… 4
# ℹ 154,310 more rows
pl <- peptide_data |>
ggplot(aes(x = n_peptides)) +
geom_histogram(binwidth = 1,
colour = "black",
alpha = 0.5) +
geom_hline(yintercept = 0) +
scale_x_continuous(breaks = 1:13) +
theme_minimal(base_size = 20) +
theme(panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
axis.text.x = element_text(vjust = 5)) +
labs(x = "Number of peptides per observation",
y = "Counts")
str_c
- and the seq
-functions, re-create the belowpeptide_data <- peptide_data |>
separate(col = `Amino Acids`,
into = str_c("peptide_", 1:max_n_peptides),
sep = ",")
peptide_data |>
sample_n(10)
# A tibble: 10 × 18
Experiment CDR3b V_gene J_gene peptide_1 peptide_2 peptide_3 peptide_4
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 eXL31 CASSEWGDAYE… TCRBV… TCRBJ… FLWLLWPVT FLWLLWPV… LWLLWPVTL LWPVTLACF
2 eJL151 CSVVGRSSYEQ… TCRBV… TCRBJ… AGAAAYYV… GAAAYYVGY WTAGAAAYY WTAGAAAY…
3 eEE224 CASSLVGRTLY… TCRBV… TCRBJ… FVDGVPFVV <NA> <NA> <NA>
4 eOX52 CASSLGAWGQP… TCRBV… TCRBJ… GMEVTPSG… MEVTPSGT… TPSGTWLTY VTPSGTWL…
5 eMR12 CASSLTPPREE… TCRBV… TCRBJ… HTTDPSFL… <NA> <NA> <NA>
6 eOX52 CASSEWLPPYE… TCRBV… TCRBJ… NYLYRLFRK NYNYLYRLF <NA> <NA>
7 eXL30 CASAPGAQGGF… TCRBV… TCRBJ… FPNITNLC… QPTESIVRF RFPNITNL… TESIVRFP…
8 eHO128 CASSLSGEETQ… TCRBV… TCRBJ… HTTDPSFL… <NA> <NA> <NA>
9 eMR16 CATGGGGRETQ… TCRBV… TCRBJ… AFPFTIYSL GYINVFAF… INVFAFPF… MGYINVFAF
10 eXL32 CASSRDRGRLE… TCRBV… TCRBJ… AFLLFLVLI FLAFLLFLV FYLCFLAFL FYLCFLAF…
# ℹ 10 more variables: peptide_5 <chr>, peptide_6 <chr>, peptide_7 <chr>,
# peptide_8 <chr>, peptide_9 <chr>, peptide_10 <chr>, peptide_11 <chr>,
# peptide_12 <chr>, peptide_13 <chr>, n_peptides <dbl>
Not all have 13, so R
lets you know this! Thank you R
🙏
peptide_n
, discuss in your group, if this is wide- or long-data?# A tibble: 154,320 × 18
Experiment CDR3b V_gene J_gene peptide_1 peptide_2 peptide_3 peptide_4
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 eAV93 CASSAQGTGDR… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY LADAGFIK… TLADAGFIK
2 eOX56 CASSLVATGNT… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY LADAGFIK… TLADAGFIK
3 eAV93 CASSKGTVSGL… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY LADAGFIK… TLADAGFIK
4 eQD124 CALKVGADTQYF TCRBV… TCRBJ… ADAGFIKQY AELEGIQY LADAGFIK… TLADAGFIK
5 eAV93 CASSLWASGRG… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY LADAGFIK… TLADAGFIK
6 eMR16 CASSLLGWEQL… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY LADAGFIK… TLADAGFIK
7 eAV93 CASSSGTGVYG… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY LADAGFIK… TLADAGFIK
8 eQD123 CASSPLEWEGV… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY LADAGFIK… TLADAGFIK
9 eAV93 CASSFWSSGRG… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY LADAGFIK… TLADAGFIK
10 eMR15 CASSAGQGASD… TCRBV… TCRBJ… ADAGFIKQY AELEGIQY LADAGFIK… TLADAGFIK
# ℹ 154,310 more rows
# ℹ 10 more variables: peptide_5 <chr>, peptide_6 <chr>, peptide_7 <chr>,
# peptide_8 <chr>, peptide_9 <chr>, peptide_10 <chr>, peptide_11 <chr>,
# peptide_12 <chr>, peptide_13 <chr>, n_peptides <dbl>
peptide_data
data above, with peptide_1, peptide_2, etc. create this data set using one of the data-pivotting functionspeptide_data <- peptide_data |>
pivot_longer(cols = contains("peptide_"),
names_to = "peptide_n",
values_to = "peptide")
peptide_data
# A tibble: 2,006,160 × 7
Experiment CDR3b V_gene J_gene n_peptides peptide_n peptide
<chr> <chr> <chr> <chr> <dbl> <chr> <chr>
1 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 4 peptide_1 ADAGFI…
2 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 4 peptide_2 AELEGI…
3 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 4 peptide_3 LADAGF…
4 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 4 peptide_4 TLADAG…
5 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 4 peptide_5 <NA>
6 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 4 peptide_6 <NA>
7 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 4 peptide_7 <NA>
8 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 4 peptide_8 <NA>
9 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 4 peptide_9 <NA>
10 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 4 peptide_… <NA>
# ℹ 2,006,150 more rows
n_peptides
and peptide_n
and also get rid of the NA
s in the peptide
-column and make sure, that we only have unique observations, i.e. there are no repeated rows/observationspeptide_data <- peptide_data |>
select(-n_peptides, -peptide_n) |>
drop_na(peptide) |>
distinct()
peptide_data
# A tibble: 583,309 × 5
Experiment CDR3b V_gene J_gene peptide
<chr> <chr> <chr> <chr> <chr>
1 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 ADAGFIKQY
2 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 AELEGIQY
3 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 LADAGFIKQY
4 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 TLADAGFIK
5 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY
6 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 AELEGIQY
7 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 LADAGFIKQY
8 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 TLADAGFIK
9 eAV93 CASSKGTVSGLSG TCRBV21-01 TCRBJ02-07 ADAGFIKQY
10 eAV93 CASSKGTVSGLSG TCRBV21-01 TCRBJ02-07 AELEGIQY
# ℹ 583,299 more rows
Super duper tidy!
str_detect()
-function to filter
the CDR3b
and peptide
variables using a pattern
of [^ARNDCQEGHILKMFPSTWYV]
and then play with the negate
-parameter so see what happenspeptide_data <- peptide_data |>
filter(str_detect(CDR3b, pattern = "[^ARNDCQEGHILKMFPSTWYV]", negate = TRUE),
str_detect(peptide, pattern = "[^ARNDCQEGHILKMFPSTWYV]", negate = TRUE))
peptide_data
# A tibble: 578,357 × 5
Experiment CDR3b V_gene J_gene peptide
<chr> <chr> <chr> <chr> <chr>
1 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 ADAGFIKQY
2 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 AELEGIQY
3 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 LADAGFIKQY
4 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 TLADAGFIK
5 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY
6 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 AELEGIQY
7 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 LADAGFIKQY
8 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 TLADAGFIK
9 eAV93 CASSKGTVSGLSG TCRBV21-01 TCRBJ02-07 ADAGFIKQY
10 eAV93 CASSKGTVSGLSG TCRBV21-01 TCRBJ02-07 AELEGIQY
# ℹ 578,347 more rows
str_detect()
-function to filter
the CDR3b
and peptide
variablesclean_amino_acids <- function(x){
# str_detect(x, pattern = "^[ARNDCQEGHILKMFPSTWYV]+$") will also work
return( str_detect(x, pattern = "[^ARNDCQEGHILKMFPSTWYV]", negate = TRUE) )
}
peptide_data <- peptide_data |>
filter(clean_amino_acids(CDR3b),
clean_amino_acids(peptide))
peptide_data
# A tibble: 578,357 × 5
Experiment CDR3b V_gene J_gene peptide
<chr> <chr> <chr> <chr> <chr>
1 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 ADAGFIKQY
2 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 AELEGIQY
3 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 LADAGFIKQY
4 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 TLADAGFIK
5 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY
6 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 AELEGIQY
7 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 LADAGFIKQY
8 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 TLADAGFIK
9 eAV93 CASSKGTVSGLSG TCRBV21-01 TCRBJ02-07 ADAGFIKQY
10 eAV93 CASSKGTVSGLSG TCRBV21-01 TCRBJ02-07 AELEGIQY
# ℹ 578,347 more rows
k_CDR3b
and k_peptide
each signifying the length of the respective sequencespeptide_data <- peptide_data |>
mutate(k_CDR3b = str_length(CDR3b),
k_peptide = str_length(peptide))
peptide_data
# A tibble: 578,357 × 7
Experiment CDR3b V_gene J_gene peptide k_CDR3b k_peptide
<chr> <chr> <chr> <chr> <chr> <int> <int>
1 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 ADAGFIKQY 15 9
2 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 AELEGIQY 15 8
3 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 LADAGFIK… 15 10
4 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 TLADAGFIK 15 9
5 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY 16 9
6 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 AELEGIQY 16 8
7 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 LADAGFIK… 16 10
8 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 TLADAGFIK 16 9
9 eAV93 CASSKGTVSGLSG TCRBV21-01 TCRBJ02-07 ADAGFIKQY 13 9
10 eAV93 CASSKGTVSGLSG TCRBV21-01 TCRBJ02-07 AELEGIQY 13 8
# ℹ 578,347 more rows
Q9: What is the most predominant length of the CDR3b-sequences?
Q10: What is the most predominant length of the peptide-sequences?
Q11: Discuss in your group, if this data set is tidy or not?
Before we move onto using the family of *_join
-functions you prepared for today, we will just take a quick peek at the meta data again:
# A tibble: 10 × 11
Experiment Cohort Age Gender Race A1 A2 B1 B2 C1 C2
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 eLH48 COVID-19-C… 28 M White "A*0… "A*2… "B*0… "B*1… "C*0… "C*0…
2 eJL147 Healthy (N… 40 M Mixe… "A*0… "A*1… "B*0… "B*3… "C*0… "C*0…
3 eAV91 Healthy (N… 31 M White "A*0… "A*6… "B*0… "B*5… "C*0… "C*0…
4 eQD115 COVID-19-C… 48 M <NA> "A*0… "A*0… "B*0… "B*4… "C*0… "C*0…
5 eMR12 COVID-19-C… NA <NA> <NA> "A*0… "A*0… "B*4… "B*5… "C*0… "C*1…
6 eDH107 COVID-19-C… 72 F <NA> "A*0… "A*0… "B*1… "B*3… "C*0… "C*0…
7 eHH169 Healthy (N… 24 F Blac… "A*0… "A*7… "B*3… "B*3… "C*0… "C*0…
8 eNL192 COVID-19-C… NA <NA> <NA> "" "" "" "" "" ""
9 eHO130 Healthy (N… 28 F White "A*0… "A*0… "B*0… "B*0… "C*0… "C*0…
10 eMR20 COVID-19-B… 37 M White "A*0… "A*2… "B*1… "B*1… "C*0… "C*0…
A1
-, A2
-, B1
-, B2
-, C1-
and C2
-variables is a wide- or a long-data format?pivot_wider
- or pivot_longer
-function to create the following data:meta_data <- meta_data |>
pivot_longer(cols = c("A1", "A2", "B1", "B2", "C1", "C2"),
names_to = "Gene",
values_to = "Allele")
meta_data
# A tibble: 864 × 7
Experiment Cohort Age Gender Race Gene Allele
<chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 eAM13 COVID-19-Convalescent 34 F White A1 A*02:01:01
2 eAM13 COVID-19-Convalescent 34 F White A2 A*02:05:01
3 eAM13 COVID-19-Convalescent 34 F White B1 B*08:01:01
4 eAM13 COVID-19-Convalescent 34 F White B2 B*41:01:01
5 eAM13 COVID-19-Convalescent 34 F White C1 C*07:01:01
6 eAM13 COVID-19-Convalescent 34 F White C2 C*07:01:01
7 eAM23 COVID-19-Convalescent 48 M <NA> A1 A*11:01:01
8 eAM23 COVID-19-Convalescent 48 M <NA> A2 A*24:02:01
9 eAM23 COVID-19-Convalescent 48 M <NA> B1 B*15:01:01
10 eAM23 COVID-19-Convalescent 48 M <NA> B2 B*52:01:01
# ℹ 854 more rows
peptide_data
and the meta_data
?# A tibble: 864 × 7
Experiment Cohort Age Gender Race Gene Allele
<chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 eAM13 COVID-19-Convalescent 34 F White A1 A*02:01:01
2 eAM13 COVID-19-Convalescent 34 F White A2 A*02:05:01
3 eAM13 COVID-19-Convalescent 34 F White B1 B*08:01:01
4 eAM13 COVID-19-Convalescent 34 F White B2 B*41:01:01
5 eAM13 COVID-19-Convalescent 34 F White C1 C*07:01:01
6 eAM13 COVID-19-Convalescent 34 F White C2 C*07:01:01
7 eAM23 COVID-19-Convalescent 48 M <NA> A1 A*11:01:01
8 eAM23 COVID-19-Convalescent 48 M <NA> A2 A*24:02:01
9 eAM23 COVID-19-Convalescent 48 M <NA> B1 B*15:01:01
10 eAM23 COVID-19-Convalescent 48 M <NA> B2 B*52:01:01
# ℹ 854 more rows
# A tibble: 578,357 × 7
Experiment CDR3b V_gene J_gene peptide k_CDR3b k_peptide
<chr> <chr> <chr> <chr> <chr> <int> <int>
1 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 ADAGFIKQY 15 9
2 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 AELEGIQY 15 8
3 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 LADAGFIK… 15 10
4 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 TLADAGFIK 15 9
5 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY 16 9
6 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 AELEGIQY 16 8
7 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 LADAGFIK… 16 10
8 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 TLADAGFIK 16 9
9 eAV93 CASSKGTVSGLSG TCRBV21-01 TCRBJ02-07 ADAGFIKQY 13 9
10 eAV93 CASSKGTVSGLSG TCRBV21-01 TCRBJ02-07 AELEGIQY 13 8
# ℹ 578,347 more rows
Experiment
, then use that knowledge to subset the meta_data
to the variables-of-interest:# A tibble: 761 × 2
Experiment Allele
<chr> <chr>
1 eAM13 A*02:01:01
2 eAM13 A*02:05:01
3 eAM13 B*08:01:01
4 eAM13 B*41:01:01
5 eAM13 C*07:01:01
6 eAM23 A*11:01:01
7 eAM23 A*24:02:01
8 eAM23 B*15:01:01
9 eAM23 B*52:01:01
10 eAM23 C*04:01:01
# ℹ 751 more rows
Here, we are interested in capturing the Allele
-information
Some alleles are e.g. A*11:01
, whereas others are B*51:01:02
, let’s snatch the first two fields:
T19: Create the following data, according to specifications above:
meta_data <- meta_data |>
mutate(Allele_F_1_2 = str_extract(Allele,
# Regex: You're NOT expected to know this!
pattern = "^[ABC]\\*\\d+\\:\\d+")) |>
filter(Allele_F_1_2 != "") |>
drop_na()
meta_data
# A tibble: 748 × 3
Experiment Allele Allele_F_1_2
<chr> <chr> <chr>
1 eAM13 A*02:01:01 A*02:01
2 eAM13 A*02:05:01 A*02:05
3 eAM13 B*08:01:01 B*08:01
4 eAM13 B*41:01:01 B*41:01
5 eAM13 C*07:01:01 C*07:01
6 eAM23 A*11:01:01 A*11:01
7 eAM23 A*24:02:01 A*24:02
8 eAM23 B*15:01:01 B*15:01
9 eAM23 B*52:01:01 B*52:01
10 eAM23 C*04:01:01 C*04:01
# ℹ 738 more rows
Some alleles are e.g. A*11:01
, whereas others are B*51:01:02
, let’s snatch the first two fields:
T19: Create the following data, according to specifications above:
meta_data <- meta_data |>
# This will work as well, but think about caveats!
mutate(Allele_F_1_2 = str_sub(Allele, start = 1, end = 7)) |>
filter(Allele_F_1_2 != "") |>
drop_na()
meta_data
# A tibble: 748 × 3
Experiment Allele Allele_F_1_2
<chr> <chr> <chr>
1 eAM13 A*02:01:01 A*02:01
2 eAM13 A*02:05:01 A*02:05
3 eAM13 B*08:01:01 B*08:01
4 eAM13 B*41:01:01 B*41:01
5 eAM13 C*07:01:01 C*07:01
6 eAM23 A*11:01:01 A*11:01
7 eAM23 A*24:02:01 A*24:02
8 eAM23 B*15:01:01 B*15:01
9 eAM23 B*52:01:01 B*52:01
10 eAM23 C*04:01:01 C*04:01
# ℹ 738 more rows
The asterix, i.e. *
is a rather annoying character because of ambiguity, so:
meta_data <- meta_data |>
mutate(Allele_F_1_2 = str_remove(Allele_F_1_2, "\\*")) |>
select(-Allele) |>
rename(Allele = Allele_F_1_2) |>
distinct()
meta_data
# A tibble: 748 × 2
Experiment Allele
<chr> <chr>
1 eAM13 A02:01
2 eAM13 A02:05
3 eAM13 B08:01
4 eAM13 B41:01
5 eAM13 C07:01
6 eAM23 A11:01
7 eAM23 A24:02
8 eAM23 B15:01
9 eAM23 B52:01
10 eAM23 C04:01
# ℹ 738 more rows
# A tibble: 578,357 × 7
Experiment CDR3b V_gene J_gene peptide k_CDR3b k_peptide
<chr> <chr> <chr> <chr> <chr> <int> <int>
1 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 ADAGFIKQY 15 9
2 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 AELEGIQY 15 8
3 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 LADAGFIK… 15 10
4 eAV93 CASSAQGTGDRGYTF TCRBV27-01 TCRBJ01-02 TLADAGFIK 15 9
5 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 ADAGFIKQY 16 9
6 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 AELEGIQY 16 8
7 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 LADAGFIK… 16 10
8 eOX56 CASSLVATGNTGELFF TCRBV07-09 TCRBJ02-02 TLADAGFIK 16 9
9 eAV93 CASSKGTVSGLSG TCRBV21-01 TCRBJ02-07 ADAGFIKQY 13 9
10 eAV93 CASSKGTVSGLSG TCRBV21-01 TCRBJ02-07 AELEGIQY 13 8
# ℹ 578,347 more rows
dplyr
-pipeline, starting with the peptide_data
, which joins it with the meta_data
and remember to make sure that you get only unqiue observations of rows. Save this data into a new variable names peptide_meta_data
(If you get a warning, discuss in your group what it means?)peptide_meta_data <- peptide_data |>
full_join(meta_data,
by = "Experiment",
relationship = "many-to-many") |>
distinct()
peptide_meta_data |>
sample_n(10)
# A tibble: 10 × 8
Experiment CDR3b V_gene J_gene peptide k_CDR3b k_peptide Allele
<chr> <chr> <chr> <chr> <chr> <int> <int> <chr>
1 eXL30 CASSLEASAYNEQFF TCRBV11… TCRBJ… LLFLVL… 15 9 B35:02
2 eXL31 CSVVPAAGGGRYNEQFF TCRBV29… TCRBJ… WPVTLA… 17 10 A29:02
3 eOX54 CASSQGFTSTDTQYF TCRBV03… TCRBJ… LITGRL… 15 9 B15:03
4 eXL27 CASSFRSSGMNEQFF TCRBV28… TCRBJ… FLWLLW… 15 9 C07:04
5 eEE240 CASSLTSGMTYEQYF TCRBV05… TCRBJ… SLIDFY… 15 10 C06:02
6 eXL31 CASSLWGNSGELFF TCRBV04… TCRBJ… FYLCFL… 14 9 A29:02
7 eJL147 CASWGEGSYNEQFF TCRBV06… TCRBJ… SASAFF… 14 10 A11:01
8 eOX54 CASSLHGSGNTIYF TCRBV28… TCRBJ… LWPVTL… 14 9 A02:01
9 eXL31 CASSPSPGLAVQETQYF TCRBV04… TCRBJ… SLIDFY… 17 10 B44:03
10 eEE226 CASSSQPIQLYEQYF TCRBV27… TCRBJ… MIELSL… 15 10 B35:02
Now, that we have the data in a prepared and ready-to-analyse format, let us return to the two burning questions we had:
peptide_meta_data
-data to A02:01
and unique observations of peptides of length 9 and re-create the below sequence logoB07:02
or another of your favorite allelesThe reason, that the logo is actually not correct, is that we’re missing something…
peptide_meta_data
, such that the length of the CDR3b is 15, the allele is A02:01 and the peptide is LLFLVLIML and re-create the below sequence logo of the CDR3b sequences:tidyverse
for applied Bio Data ScienceThat’s often the way it ends up going, you’ll spend a lot of time on data handling and getting the tidyverse toolbox in your toolbelt, will allow you to be so much more effecient in your data wrangling, so you can get to the fun part as quick as possible!
R for Bio Data Science