Lecture Lab 6

Leon Eyrich Jessen

That was the first half of the course!

Time flies when you’re having fun!
Up until now, we have focused on data visualisation, data wrangling and reproducible reporting.
In the second half, we will switch gears and build on what we have learned so far to deal efficiently with more complicated data situations.

Functional Programming in R using Purrr

Functional programming is a paradigm in which we call functions on objects and each call returns a result
We “catch” the result by assigning it to a variable

double_the_value <- function(value){
  return( 2*value )
}

x <- 2
double_the_value(value = x)

[1] 4

y <- double_the_value(value = x)
y

[1] 4
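
A function can also be handed to another function as an argument, which is the idea the `map`-functions later in this lecture rely on. A minimal sketch, assuming the `purrr` package is installed; the vector `values` is made up for illustration and is not part of the lecture data:

```r
library("purrr")

double_the_value <- function(value){
  return( 2*value )
}

# Hypothetical input values, not from the lecture data
values <- c(1, 2, 3)

# Apply double_the_value() to each element of values;
# map_dbl() collects the results in a numeric vector
doubled <- map_dbl(.x = values, .f = double_the_value)
doubled
# [1] 2 4 6
```

Here `double_the_value` is passed by name, without parentheses, so it is the function itself that is handed over, not the result of calling it.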

When you are creating a `dplyr`-pipeline…

You are stitching functions together

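library("tidyverse")                      # provides tibble(), mutate(), str_length(), etc.
set.seed(470812)
example_data <- tibble(
  dna = make_random_dna(n = 10))          # make_random_dna() is not defined in this section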
example_data

# A tibble: 10 × 1
   dna                 
   <chr>               
 1 aggatatcgagttcca    
 2 tagcgctcga          
 3 gattcccgaattagtaggg 
 4 gtggcttcccgccctcttta
 5 atcccctgcaa         
 6 gcaagtggcctatga     
 7 ctcagcgatccaagt     
 8 tacggtttcag         
 9 ggcgataccg          
10 ctatgcttatgc
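
The helper `make_random_dna()` is used above without being defined in these slides. To run the examples, a stand-in could look roughly like the sketch below. It is an assumption, not the course's implementation: the arguments `min_length` and `max_length` are hypothetical, and with the same seed it will most likely not reproduce the exact sequences printed above.

```r
# Hypothetical stand-in for make_random_dna(); base R only
make_random_dna <- function(n, min_length = 10, max_length = 20){
  # Draw one sequence length per sequence, between min_length and max_length
  dna_lengths <- min_length - 1 + sample.int(
    n = max_length - min_length + 1,
    size = n,
    replace = TRUE)
  # Build each sequence by sampling nucleotides with replacement
  vapply(
    X = dna_lengths,
    FUN = function(dna_length){
      paste(
        sample(c("a", "c", "g", "t"), size = dna_length, replace = TRUE),
        collapse = "")
    },
    FUN.VALUE = character(1))
}

set.seed(470812)
random_dna <- make_random_dna(n = 10)
random_dna
```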

When you are creating a `dplyr`-pipeline…

You are stitching functions together

set.seed(470812)
example_data <- tibble(
  dna = make_random_dna(n = 10))
example_data |> 
  mutate(dna_length = str_length(dna))    # <- `mutate` is a function!

# A tibble: 10 × 2
   dna                  dna_length
   <chr>                     <int>
 1 aggatatcgagttcca             16
 2 tagcgctcga                   10
 3 gattcccgaattagtaggg          19
 4 gtggcttcccgccctcttta         20
 5 atcccctgcaa                  11
 6 gcaagtggcctatga              15
 7 ctcagcgatccaagt              15
 8 tacggtttcag                  11
 9 ggcgataccg                   10
10 ctatgcttatgc                 12

When you are creating a `dplyr`-pipeline…

You are stitching functions together

set.seed(470812)
example_data <- tibble(
  sample_id = str_c("s_", rep(1:2, each = 5)),
  dna = make_random_dna(n = 10))
example_data |> 
  mutate(dna_length = str_length(dna)) |> # <- `mutate` is a function!
  filter( # <- `filter` is a function!
    str_detect(string = dna, pattern = "aag")) # <- `str_detect` is a function!

# A tibble: 2 × 3
  sample_id dna             dna_length
  <chr>     <chr>                <int>
1 s_2       gcaagtggcctatga         15
2 s_2       ctcagcgatccaagt         15

What does the `group_by()`-function actually do?

example_data |> 
  mutate(dna_length = str_length(dna)) |>  
  group_by(sample_id)

# A tibble: 10 × 3
# Groups:   sample_id [2]
   sample_id dna                  dna_length
   <chr>     <chr>                     <int>
 1 s_1       aggatatcgagttcca             16
 2 s_1       tagcgctcga                   10
 3 s_1       gattcccgaattagtaggg          19
 4 s_1       gtggcttcccgccctcttta         20
 5 s_1       atcccctgcaa                  11
 6 s_2       gcaagtggcctatga              15
 7 s_2       ctcagcgatccaagt              15
 8 s_2       tacggtttcag                  11
 9 s_2       ggcgataccg                   10
10 s_2       ctatgcttatgc                 12

example_data |> 
  mutate(dna_length = str_length(dna)) |> 
  group_by(sample_id) |>
  summarise(mean_length = mean(dna_length))

# A tibble: 2 × 2
  sample_id mean_length
  <chr>           <dbl>
1 s_1              15.2
2 s_2              12.6

Note the `# Groups:` line at the top left of the printed, grouped tibble. The grouping is stored as an attribute, which tells functions such as `summarise()` that the data is grouped, so they operate on each group separately
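
The grouping attribute can be inspected directly. A minimal sketch, assuming the `dplyr` package is installed; `toy_data` is a made-up tibble, not the lecture's `example_data`:

```r
library("dplyr")

# Hypothetical toy data, not the lecture's example_data
toy_data <- tibble(
  sample_id = c("s_1", "s_1", "s_2"),
  dna_length = c(16, 10, 15))

toy_grouped <- toy_data |>
  group_by(sample_id)

# Is the tibble grouped, and by which variable(s)?
is_grouped_df(toy_data)
is_grouped_df(toy_grouped)
group_vars(toy_grouped)

# The grouping itself is stored in the "groups" attribute
attr(toy_grouped, "groups")

# ungroup() removes the grouping again
toy_ungrouped <- toy_grouped |>
  ungroup()
is_grouped_df(toy_ungrouped)
```

Note that `group_by()` does not change the rows or columns of the data, it only adds the grouping information.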

The `map`-family of functions in the `purrr`-package

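The aim is to avoid explicit loops and instead work on grouped, nested data!
The `map`-functions apply a function to each element of a list or vector; once the data is nested, each element is the tibble holding one group of your data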

example_data |>
  mutate(dna_length = str_length(dna))

# A tibble: 10 × 3
   sample_id dna                  dna_length
   <chr>     <chr>                     <int>
 1 s_1       aggatatcgagttcca             16
 2 s_1       tagcgctcga                   10
 3 s_1       gattcccgaattagtaggg          19
 4 s_1       gtggcttcccgccctcttta         20
 5 s_1       atcccctgcaa                  11
 6 s_2       gcaagtggcctatga              15
 7 s_2       ctcagcgatccaagt              15
 8 s_2       tacggtttcag                  11
 9 s_2       ggcgataccg                   10
10 s_2       ctatgcttatgc                 12

example_data |>
  mutate(dna_length = str_length(dna)) |> 
  group_by(sample_id) |>
  nest()

# A tibble: 2 × 2
# Groups:   sample_id [2]
  sample_id data            
  <chr>     <list>          
1 s_1       <tibble [5 × 2]>
2 s_2       <tibble [5 × 2]>

Note the dimensions of the nested data: each nested tibble holds 5 observations of 2 variables (`dna` and `dna_length`), while `sample_id` stays outside as the group key
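
To see what the `data` column actually contains, a nested tibble can be pulled out and looked at. A minimal sketch, assuming `dplyr`, `tidyr` and `stringr` are installed; `toy_data` is made up for illustration:

```r
library("dplyr")
library("tidyr")
library("stringr")

# Hypothetical toy data, not the lecture's example_data
toy_data <- tibble(
  sample_id = c("s_1", "s_1", "s_2", "s_2"),
  dna = c("acgt", "acgtac", "gg", "ttaa"))

toy_nested <- toy_data |>
  mutate(dna_length = str_length(dna)) |>
  group_by(sample_id) |>
  nest()

# The data column is a list with one tibble per group
nested_column <- pull(toy_nested, data)
is.list(nested_column)
length(nested_column)

# Extract the tibble belonging to the first group
first_group <- nested_column[[1]]
first_group
```

The extracted tibble only has the columns `dna` and `dna_length`, because `sample_id` was moved out as the group key.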

We avoid looping by applying functions to nested data

example_data |>
  mutate(dna_length = str_length(dna)) |> 
  group_by(sample_id) |>
  nest()

# A tibble: 2 × 2
# Groups:   sample_id [2]
  sample_id data            
  <chr>     <list>          
1 s_1       <tibble [5 × 2]>
2 s_2       <tibble [5 × 2]>

example_data |>
  mutate(dna_length = str_length(dna)) |>
  group_by(sample_id) |>
  nest() |>
  mutate(mean_length = map_dbl(.x = data,
                               .f = ~mean(pull(.x, dna_length))))

# A tibble: 2 × 3
# Groups:   sample_id [2]
  sample_id data             mean_length
  <chr>     <list>                 <dbl>
1 s_1       <tibble [5 × 2]>        15.2
2 s_2       <tibble [5 × 2]>        12.6

Note how we retain the data in the list-column `data`, while ALSO computing our desired value in `mean_length`
This is a toy example; you will see in the exercises why this is immensely useful!
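
For comparison, the loop that `map_dbl()` replaces could look roughly like the sketch below. It assumes `dplyr`, `tidyr` and `purrr` are installed; `toy_nested` is made-up data, and the loop is only one of several ways such a loop could be written:

```r
library("dplyr")
library("tidyr")
library("purrr")

# Hypothetical toy data, not the lecture's example_data
toy_nested <- tibble(
  sample_id = c("s_1", "s_1", "s_2", "s_2"),
  dna_length = c(4, 6, 2, 4)) |>
  group_by(sample_id) |>
  nest()

# Loop version: pre-allocate the result, then fill it one group at a time
mean_length_loop <- numeric(nrow(toy_nested))
for( i in seq_len(nrow(toy_nested)) ){
  group_data <- pull(toy_nested, data)[[i]]
  mean_length_loop[i] <- mean(pull(group_data, dna_length))
}
mean_length_loop

# map version: same computation, result stored next to the nested data
toy_mapped <- toy_nested |>
  mutate(mean_length = map_dbl(.x = data,
                               .f = ~mean(pull(.x, dna_length))))
toy_mapped
```

Both versions give the same means, but the `map_dbl()` version needs no index variable and keeps the result in the same tibble as the data it was computed from.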

What do you mean “retain the data”?

example_data |>
  mutate(dna_length = str_length(dna)) |>
  group_by(sample_id) |>
  nest() |>
  mutate(mean_length = map_dbl(
    .x = data,
    .f = ~mean(pull(.x, dna_length))))

# A tibble: 2 × 3
# Groups:   sample_id [2]
  sample_id data             mean_length
  <chr>     <list>                 <dbl>
1 s_1       <tibble [5 × 2]>        15.2
2 s_2       <tibble [5 × 2]>        12.6

example_data |>
  mutate(dna_length = str_length(dna)) |>
  group_by(sample_id) |>
  nest() |>
  mutate(mean_length = map_dbl(
    .x = data,
    .f = ~mean(pull(.x, dna_length)))) |>
  unnest(data)

# A tibble: 10 × 4
# Groups:   sample_id [2]
   sample_id dna                  dna_length mean_length
   <chr>     <chr>                     <int>       <dbl>
 1 s_1       aggatatcgagttcca             16        15.2
 2 s_1       tagcgctcga                   10        15.2
 3 s_1       gattcccgaattagtaggg          19        15.2
 4 s_1       gtggcttcccgccctcttta         20        15.2
 5 s_1       atcccctgcaa                  11        15.2
 6 s_2       gcaagtggcctatga              15        12.6
 7 s_2       ctcagcgatccaagt              15        12.6
 8 s_2       tacggtttcag                  11        12.6
 9 s_2       ggcgataccg                   10        12.6
10 s_2       ctatgcttatgc                 12        12.6

By retaining the data and using objects nested inside objects, we get one point of reference for our analysis work
As opposed to 1,000 different objects with names we forget and huge redundancy eating our RAM

Functional Programming in R using Purrr

Remember to call the `map()`-function inside `mutate()` when you work on nested data in tibbles!
