Lecture Lab 6

Leon Eyrich Jessen

That was the first half of the course!

  • Time flies when you’re having fun!

  • Up until now, we have been focusing on data visualisation, data wrangling and reproducible reporting

  • In the second half, we will switch gears and build on what we have learned so far to deal efficiently with more complicated data situations

Functional Programming in R using purrr

  • Functional programming is a paradigm in which we call functions on objects and each call returns a result

  • We “catch” the result by assigning it to a variable

double_the_value <- function(value){
  return( 2*value )
}

x <- 2
double_the_value(value = x)
[1] 4
y <- double_the_value(value = x)
y
[1] 4
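
Because functions in R are themselves objects, they can be passed on to other functions. The following is a minimal sketch of that idea, assuming the purrr package is installed. The vector `values` is made up for illustration and is not part of the lecture data.

```r
library("purrr")

double_the_value <- function(value){
  return( 2*value )
}

# Hypothetical input vector, for illustration only
values <- c(1, 2, 3)

# map_dbl() calls .f once per element of .x and returns a numeric vector
doubled <- map_dbl(.x = values, .f = double_the_value)
doubled
# [1] 2 4 6
```

Note that no loop is written: the function is simply handed to `map_dbl()` as an argument.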

When you are creating a dplyr-pipeline…

You are stitching functions together

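library("tidyverse") # <- provides tibble(), mutate(), str_length(), etc.
# Note: make_random_dna() is not a tidyverse function, it is assumed to be defined elsewhere
set.seed(470812)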
example_data <- tibble(
  dna = make_random_dna(n = 10))
example_data
# A tibble: 10 × 1
   dna                 
   <chr>               
 1 aggatatcgagttcca    
 2 tagcgctcga          
 3 gattcccgaattagtaggg 
 4 gtggcttcccgccctcttta
 5 atcccctgcaa         
 6 gcaagtggcctatga     
 7 ctcagcgatccaagt     
 8 tacggtttcag         
 9 ggcgataccg          
10 ctatgcttatgc        

When you are creating a dplyr-pipeline…

You are stitching functions together

set.seed(470812)
example_data <- tibble(
  dna = make_random_dna(n = 10))
example_data |> 
  mutate(dna_length = str_length(dna))    # <- `mutate` is a function!
# A tibble: 10 × 2
   dna                  dna_length
   <chr>                     <int>
 1 aggatatcgagttcca             16
 2 tagcgctcga                   10
 3 gattcccgaattagtaggg          19
 4 gtggcttcccgccctcttta         20
 5 atcccctgcaa                  11
 6 gcaagtggcctatga              15
 7 ctcagcgatccaagt              15
 8 tacggtttcag                  11
 9 ggcgataccg                   10
10 ctatgcttatgc                 12

When you are creating a dplyr-pipeline…

You are stitching functions together

set.seed(470812)
example_data <- tibble(
  sample_id = str_c("s_", rep(1:2, each = 5)),
  dna = make_random_dna(n = 10))
example_data |> 
  mutate(dna_length = str_length(dna)) |> # <- `mutate` is a function!
  filter( # <- `filter` is a function!
    str_detect(string = dna, pattern = "aag")) # <- `str_detect` is a function!
# A tibble: 2 × 3
  sample_id dna             dna_length
  <chr>     <chr>                <int>
1 s_2       gcaagtggcctatga         15
2 s_2       ctcagcgatccaagt         15
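
To make the "stitching" concrete, the sketch below writes the same pipeline twice: once with the pipe and once as ordinary nested function calls. It assumes the dplyr, stringr and tibble packages are installed. Since make_random_dna() is not available here, three of the sequences printed above are typed in by hand.

```r
library("dplyr")
library("stringr")
library("tibble")

# Three of the sequences printed above, typed in by hand
example_data <- tibble(
  dna = c("tagcgctcga", "gcaagtggcctatga", "ctcagcgatccaagt"))

# Piped version: read from top to bottom
piped <- example_data |>
  mutate(dna_length = str_length(dna)) |>
  filter(str_detect(string = dna, pattern = "aag"))

# Nested version: the same calls, read from the inside out
nested <- filter(
  mutate(example_data, dna_length = str_length(dna)),
  str_detect(string = dna, pattern = "aag"))

# Compare the two results
all.equal(piped, nested)
```

The pipe only changes how the calls are written, which is why the pipeline is easier to read than the nested version.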

What does the group_by()-function actually do?

example_data |> 
  mutate(dna_length = str_length(dna)) |>  
  group_by(sample_id)
# A tibble: 10 × 3
# Groups:   sample_id [2]
   sample_id dna                  dna_length
   <chr>     <chr>                     <int>
 1 s_1       aggatatcgagttcca             16
 2 s_1       tagcgctcga                   10
 3 s_1       gattcccgaattagtaggg          19
 4 s_1       gtggcttcccgccctcttta         20
 5 s_1       atcccctgcaa                  11
 6 s_2       gcaagtggcctatga              15
 7 s_2       ctcagcgatccaagt              15
 8 s_2       tacggtttcag                  11
 9 s_2       ggcgataccg                   10
10 s_2       ctatgcttatgc                 12
example_data |> 
  mutate(dna_length = str_length(dna)) |> 
  group_by(sample_id) |>
  summarise(mean_length = mean(dna_length))
# A tibble: 2 × 2
  sample_id mean_length
  <chr>           <dbl>
1 s_1              15.2
2 s_2              12.6

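Note the "# Groups:" line at the top of the first output. The grouping is stored as an attribute of the tibble. It tells functions such as summarise() that the data is grouped, and they then respect the grouping by operating on each group separately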
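
The grouping information can also be inspected directly. This is a small sketch, assuming the dplyr and tibble packages are installed, using a hand-typed miniature of the data above (three of the rows) rather than the full example.

```r
library("dplyr")
library("tibble")

# Miniature of the data above, typed in by hand for illustration
example_data <- tibble(
  sample_id = c("s_1", "s_1", "s_2"),
  dna_length = c(16L, 10L, 15L))

grouped_data <- example_data |>
  group_by(sample_id)

# The names of the grouping variables recorded on the tibble
group_vars(grouped_data)

# TRUE for the grouped tibble, FALSE for the plain one
is_grouped_df(grouped_data)
is_grouped_df(example_data)

# ungroup() removes the grouping again
ungrouped_data <- grouped_data |>
  ungroup()
is_grouped_df(ungrouped_data)
```

The rows and columns are the same before and after group_by(), only the grouping information is added.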

The map-family of functions in the purrr-package

  • The aim is to avoid writing loops and instead work on grouped data!
  • You use map() to apply a function to each element of a list, here each group of your nested data
example_data |>
  mutate(dna_length = str_length(dna))
# A tibble: 10 × 3
   sample_id dna                  dna_length
   <chr>     <chr>                     <int>
 1 s_1       aggatatcgagttcca             16
 2 s_1       tagcgctcga                   10
 3 s_1       gattcccgaattagtaggg          19
 4 s_1       gtggcttcccgccctcttta         20
 5 s_1       atcccctgcaa                  11
 6 s_2       gcaagtggcctatga              15
 7 s_2       ctcagcgatccaagt              15
 8 s_2       tacggtttcag                  11
 9 s_2       ggcgataccg                   10
10 s_2       ctatgcttatgc                 12
example_data |>
  mutate(dna_length = str_length(dna)) |> 
  group_by(sample_id) |>
  nest()
# A tibble: 2 × 2
# Groups:   sample_id [2]
  sample_id data            
  <chr>     <list>          
1 s_1       <tibble [5 × 2]>
2 s_2       <tibble [5 × 2]>
  • Note the dimensions of the nested data: each nested tibble holds 5 observations of 2 variables
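
To see what is stored inside the nested column, one of the nested tibbles can be pulled out. The sketch below assumes the dplyr, tidyr, purrr, stringr and tibble packages are installed, and uses four of the sequences printed above, typed in by hand.

```r
library("dplyr")
library("tidyr")
library("purrr")
library("stringr")
library("tibble")

# Four of the sequences printed above, typed in by hand
example_data <- tibble(
  sample_id = c("s_1", "s_1", "s_2", "s_2"),
  dna = c("aggatatcgagttcca", "tagcgctcga",
          "gcaagtggcctatga", "ctcagcgatccaagt"))

nested_data <- example_data |>
  mutate(dna_length = str_length(dna)) |>
  group_by(sample_id) |>
  nest()

# The data column is a list with one tibble per group,
# here we extract the tibble belonging to the first group
first_group <- nested_data |>
  pull(data) |>
  pluck(1)
first_group
```

The extracted tibble holds the columns that were not used for grouping, i.e. dna and dna_length.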

We avoid looping by applying functions to nested data

example_data |>
  mutate(dna_length = str_length(dna)) |> 
  group_by(sample_id) |>
  nest()
# A tibble: 2 × 2
# Groups:   sample_id [2]
  sample_id data            
  <chr>     <list>          
1 s_1       <tibble [5 × 2]>
2 s_2       <tibble [5 × 2]>
example_data |>
  mutate(dna_length = str_length(dna)) |>
  group_by(sample_id) |>
  nest() |>
  mutate(mean_length = map_dbl(
    .x = data,
    .f = ~mean(pull(.x, dna_length))))
# A tibble: 2 × 3
# Groups:   sample_id [2]
  sample_id data             mean_length
  <chr>     <list>                 <dbl>
1 s_1       <tibble [5 × 2]>        15.2
2 s_2       <tibble [5 × 2]>        12.6
  • Note how we retain the data in the column data, while ALSO computing our desired value in the new column mean_length
  • This is a toy example. You will see in the exercises why this is immensely useful!
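
For comparison, the sketch below computes the same group means twice: once with an explicit loop and once with map_dbl(). It assumes the dplyr, tidyr, purrr and tibble packages are installed. The sequence lengths are copied by hand from the output above, and the loop version is only one possible way of writing it.

```r
library("dplyr")
library("tidyr")
library("purrr")
library("tibble")

# Sequence lengths copied by hand from the output above
nested_data <- tibble(
  sample_id = rep(c("s_1", "s_2"), each = 5),
  dna_length = c(16L, 10L, 19L, 20L, 11L, 15L, 15L, 11L, 10L, 12L)) |>
  group_by(sample_id) |>
  nest()

# Loop version: pre-allocate a vector and fill it group by group
mean_lengths <- numeric(length = nrow(nested_data))
for (i in seq_len(nrow(nested_data))) {
  group_i <- pull(nested_data, data)[[i]]
  mean_lengths[i] <- mean(pull(group_i, dna_length))
}
mean_lengths

# map version: the result is stored next to the data it came from
mapped <- nested_data |>
  mutate(mean_length = map_dbl(
    .x = data,
    .f = ~mean(pull(.x, dna_length))))
mapped
```

Both versions give the same numbers, but the loop leaves the result in a separate vector, whereas the map version keeps it in the same tibble as the data.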

What do you mean “retain the data”?

example_data |>
  mutate(dna_length = str_length(dna)) |>
  group_by(sample_id) |>
  nest() |>
  mutate(mean_length = map_dbl(
    .x = data,
    .f = ~mean(pull(.x, dna_length))))
# A tibble: 2 × 3
# Groups:   sample_id [2]
  sample_id data             mean_length
  <chr>     <list>                 <dbl>
1 s_1       <tibble [5 × 2]>        15.2
2 s_2       <tibble [5 × 2]>        12.6
example_data |>
  mutate(dna_length = str_length(dna)) |>
  group_by(sample_id) |>
  nest() |>
  mutate(mean_length = map_dbl(
    .x = data,
    .f = ~mean(pull(.x, dna_length)))) |>
  unnest(data)
# A tibble: 10 × 4
# Groups:   sample_id [2]
   sample_id dna                  dna_length mean_length
   <chr>     <chr>                     <int>       <dbl>
 1 s_1       aggatatcgagttcca             16        15.2
 2 s_1       tagcgctcga                   10        15.2
 3 s_1       gattcccgaattagtaggg          19        15.2
 4 s_1       gtggcttcccgccctcttta         20        15.2
 5 s_1       atcccctgcaa                  11        15.2
 6 s_2       gcaagtggcctatga              15        12.6
 7 s_2       ctcagcgatccaagt              15        12.6
 8 s_2       tacggtttcag                  11        12.6
 9 s_2       ggcgataccg                   10        12.6
10 s_2       ctatgcttatgc                 12        12.6
  • By retaining the data and using objects nested inside objects, we get one point of reference for our analysis work
  • As opposed to 1,000 different objects with names we forget and huge redundancy eating our RAM
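
The "one point of reference" idea can be sketched by storing several per-group results next to the data they were computed from. The example assumes the dplyr, tidyr, purrr and tibble packages are installed. The sequence lengths are copied by hand from the output above, and the column names `n_seqs` and `longest` are made up for illustration.

```r
library("dplyr")
library("tidyr")
library("purrr")
library("tibble")

# Sequence lengths copied by hand from the output above
nested_data <- tibble(
  sample_id = rep(c("s_1", "s_2"), each = 5),
  dna_length = c(16L, 10L, 19L, 20L, 11L, 15L, 15L, 11L, 10L, 12L)) |>
  group_by(sample_id) |>
  nest()

# Several results, all kept in one object together with the data
results <- nested_data |>
  mutate(
    n_seqs = map_int(.x = data, .f = nrow),
    longest = map_int(.x = data, .f = ~max(pull(.x, dna_length))))
results
```

Everything belonging to a sample now sits in a single row of a single object, instead of being spread across separately named variables.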

Functional Programming in R using purrr

  • And remember to call the map()-function inside mutate() when you work on nested data in tibbles!