Time flies when you’re having fun!
Up until now, we have been focusing on data visualisation, data wrangling and reproducible reporting.
In the second half, we will switch gears and use what we have learned so far to deal efficiently with more complicated data situations.
dplyr-pipeline… You are stitching functions together
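library("tidyverse")
set.seed(470812)
# `make_random_dna()` is not defined in this section (assumed to be a course helper)
example_data <- tibble(
  dna = make_random_dna(n = 10))
example_data |>
  mutate(dna_length = str_length(dna)) # <- `mutate` is a function!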
# A tibble: 10 × 2
   dna                  dna_length
   <chr>                     <int>
 1 aggatatcgagttcca             16
 2 tagcgctcga                   10
 3 gattcccgaattagtaggg          19
 4 gtggcttcccgccctcttta         20
 5 atcccctgcaa                  11
 6 gcaagtggcctatga              15
 7 ctcagcgatccaagt              15
 8 tacggtttcag                  11
 9 ggcgataccg                   10
10 ctatgcttatgc                 12
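The helper `make_random_dna()` is used above but not defined in this section. To run the examples yourself, a minimal stand-in could look like the sketch below. Note that this is an assumption, not the course's own helper: the parameters `min_len` and `max_len` are made up here, and the sequences it produces will differ from the ones printed above, even with the same seed.

```r
library("tibble")
library("dplyr")
library("stringr")

# Hypothetical stand-in for the course helper `make_random_dna()`.
# Assumption: each sequence gets a random length in min_len..max_len.
make_random_dna <- function(n, min_len = 10, max_len = 20) {
  # Draw one length per sequence
  lengths <- min_len - 1 + sample.int(n = max_len - min_len + 1,
                                      size = n, replace = TRUE)
  # Build each sequence by sampling nucleotides and gluing them together
  vapply(
    X = lengths,
    FUN = function(len) {
      str_c(sample(x = c("a", "c", "g", "t"), size = len, replace = TRUE),
            collapse = "")
    },
    FUN.VALUE = character(1))
}

set.seed(470812)
example_data <- tibble(dna = make_random_dna(n = 10)) |>
  mutate(dna_length = str_length(dna))
example_data
```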
dplyr-pipeline… You are stitching functions together
set.seed(470812)
example_data <- tibble(
  sample_id = str_c("s_", rep(1:2, each = 5)),
  dna = make_random_dna(n = 10))
example_data |>
  mutate(dna_length = str_length(dna)) |> # <- `mutate` is a function!
  filter( # <- `filter` is a function!
    str_detect(string = dna, pattern = "aag")) # <- `str_detect` is a function!
# A tibble: 2 × 3
  sample_id dna             dna_length
  <chr>     <chr>                <int>
1 s_2       gcaagtggcctatga         15
2 s_2       ctcagcgatccaagt         15
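To make the "stitching" concrete, the sketch below writes the same pipeline twice: once with the pipe and once as ordinary nested function calls. The three sequences are copied from the output above and typed in by hand, so the example does not depend on `make_random_dna()`.

```r
library("tibble")
library("dplyr")
library("stringr")

# A few sequences copied from the output above
example_data <- tibble(
  sample_id = c("s_1", "s_2", "s_2"),
  dna = c("tagcgctcga", "gcaagtggcctatga", "tacggtttcag"))

# Piped version: read top to bottom
piped <- example_data |>
  mutate(dna_length = str_length(dna)) |>
  filter(str_detect(string = dna, pattern = "aag"))

# Nested version: the same function calls, read inside out
nested <- filter(
  mutate(example_data, dna_length = str_length(dna)),
  str_detect(string = dna, pattern = "aag"))

# Both versions should give the same tibble
all.equal(piped, nested)
```

The pipe only changes how the code reads: `x |> f(y)` is another way of writing `f(x, y)`.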
What does the group_by()-function actually do?
# A tibble: 10 × 3
# Groups:   sample_id [2]
   sample_id dna                  dna_length
   <chr>     <chr>                     <int>
 1 s_1       aggatatcgagttcca             16
 2 s_1       tagcgctcga                   10
 3 s_1       gattcccgaattagtaggg          19
 4 s_1       gtggcttcccgccctcttta         20
 5 s_1       atcccctgcaa                  11
 6 s_2       gcaagtggcctatga              15
 7 s_2       ctcagcgatccaagt              15
 8 s_2       tacggtttcag                  11
 9 s_2       ggcgataccg                   10
10 s_2       ctatgcttatgc                 12
Note the Groups line at the top of the output: this is stored as an attribute. It tells functions that the data is grouped, which they then respect.
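The grouping information can be inspected directly. The sketch below uses a small hand-typed tibble (lengths taken from the output above) and shows that `group_by()` leaves the rows and columns alone and only attaches metadata, which a function such as `summarise()` then respects.

```r
library("tibble")
library("dplyr")

# Small hand-typed example, two rows per sample
example_data <- tibble(
  sample_id = c("s_1", "s_1", "s_2", "s_2"),
  dna_length = c(16L, 10L, 15L, 11L))

grouped_data <- example_data |>
  group_by(sample_id)

# The data itself is unchanged, only metadata has been added
group_vars(grouped_data)      # name(s) of the grouping variable(s)
attr(grouped_data, "groups")  # tibble mapping each group to its row indices

# Functions that respect the grouping now compute per group
group_means <- grouped_data |>
  summarise(mean_length = mean(dna_length))
group_means
```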
The map-family of functions in the purrr-package
Use map to apply a function to each group in your data
# A tibble: 10 × 3
   sample_id dna                  dna_length
   <chr>     <chr>                     <int>
 1 s_1       aggatatcgagttcca             16
 2 s_1       tagcgctcga                   10
 3 s_1       gattcccgaattagtaggg          19
 4 s_1       gtggcttcccgccctcttta         20
 5 s_1       atcccctgcaa                  11
 6 s_2       gcaagtggcctatga              15
 7 s_2       ctcagcgatccaagt              15
 8 s_2       tacggtttcag                  11
 9 s_2       ggcgataccg                   10
10 s_2       ctatgcttatgc                 12
example_data |>
  mutate(dna_length = str_length(dna)) |>
  group_by(sample_id) |>
  nest() |>
  mutate(mean_length = map_dbl(.x = data,
                               .f = ~mean(pull(.x, dna_length))))
# A tibble: 2 × 3
# Groups:   sample_id [2]
  sample_id data             mean_length
  <chr>     <list>                 <dbl>
1 s_1       <tibble [5 × 2]>        15.2
2 s_2       <tibble [5 × 2]>        12.6
We keep the nested data, while ALSO computing our desired value
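The members of the map-family differ mainly in what they return. As a minimal sketch on a plain list (the lengths are copied from the two samples above), `map()` returns a list, while `map_dbl()` and `map_int()` return atomic vectors, which is why `map_dbl()` fits a numeric column inside `mutate()`.

```r
library("purrr")

# dna_length values per sample, copied from the output above
dna_lengths <- list(
  s_1 = c(16, 10, 19, 20, 11),
  s_2 = c(15, 15, 11, 10, 12))

# map() always returns a list, one element per input element
means_list <- map(.x = dna_lengths, .f = mean)

# map_dbl() returns a double vector, one number per input element
means_dbl <- map_dbl(.x = dna_lengths, .f = mean)

# map_int() returns an integer vector
n_sequences <- map_int(.x = dna_lengths, .f = length)

# The function can also be written as a formula or an anonymous function
means_formula <- map_dbl(.x = dna_lengths, .f = ~mean(.x))
means_lambda <- map_dbl(.x = dna_lengths, .f = \(x) mean(x))

means_dbl
```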
example_data |>
  mutate(dna_length = str_length(dna)) |>
  group_by(sample_id) |>
  nest() |>
  mutate(mean_length = map_dbl(
    .x = data,
    .f = ~mean(pull(.x, dna_length)))) |>
  unnest(data)
# A tibble: 10 × 4
# Groups:   sample_id [2]
   sample_id dna                  dna_length mean_length
   <chr>     <chr>                     <int>       <dbl>
 1 s_1       aggatatcgagttcca             16        15.2
 2 s_1       tagcgctcga                   10        15.2
 3 s_1       gattcccgaattagtaggg          19        15.2
 4 s_1       gtggcttcccgccctcttta         20        15.2
 5 s_1       atcccctgcaa                  11        15.2
 6 s_2       gcaagtggcctatga              15        12.6
 7 s_2       ctcagcgatccaagt              15        12.6
 8 s_2       tacggtttcag                  11        12.6
 9 s_2       ggcgataccg                   10        12.6
10 s_2       ctatgcttatgc                 12        12.6
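As a sanity check on the nest-map-unnest pattern, the sketch below compares it with a plain grouped `mutate()` on a small hand-typed tibble. For a simple summary like the mean, both routes should give the same `mean_length` column. The comparison is only an illustration and not part of the original slides.

```r
library("tibble")
library("dplyr")
library("tidyr")
library("purrr")

# Small hand-typed example, two rows per sample
example_data <- tibble(
  sample_id = rep(c("s_1", "s_2"), each = 2),
  dna = c("aggatatcgagttcca", "tagcgctcga", "gcaagtggcctatga", "tacggtttcag"),
  dna_length = c(16L, 10L, 15L, 11L))

# Route 1: nest each group, map over the nested tibbles, then unnest
via_nest <- example_data |>
  group_by(sample_id) |>
  nest() |>
  mutate(mean_length = map_dbl(
    .x = data,
    .f = ~mean(pull(.x, dna_length)))) |>
  unnest(data) |>
  ungroup()

# Route 2: let mutate() respect the grouping directly
via_group <- example_data |>
  group_by(sample_id) |>
  mutate(mean_length = mean(dna_length)) |>
  ungroup()

identical(via_nest$mean_length, via_group$mean_length)
```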
Use the map()-function inside mutate() when you work on nested data in tibbles!
R for Bio Data Science