References

The primary references for today’s discussion can be found here and here. The latter reference provides a more advanced walkthrough on functions than we are exploring here. It is the natural next place to go to learn more about functions.

Once again, most of the material that we will cover for the next few weeks is heavily influenced by the amazing lecture notes by Grant McDermott. Some of the portions have been directly copied from his notes (with his permission). You can visit his Github page for more information.

Packages

Run this piece of code to ensure that you have the appropriate packages installed for the session to follow.

R> if (!require("pacman")) install.packages("pacman")
R> pacman::p_load(pbapply, tidyverse, gapminder, assertthat)

Introduction

We have covered a lot of work with the shell and Julia. Now it is time to get back to using R. We will be covering some topics that you have encountered with Nico and some that overlap with the Julia lectures. You can see this lecture as a bit of revision, to refocus your attention on R.

The ideas behind functions and functional programming were covered with Nico in the first part of the course. This section is an extension (revision) of the functionality that Nico introduced and also a precursor to the parallel programming lecture that follows this one. We focus our attention on how to write functions and the related ideas in functional programming (especially as it applies to iteration).

This section also serves as a review of user-built functions, an important topic in its own right. Powerful packages such as dplyr, purrr, and the apply famlily of functions are ready and waiting to apply your purpose-built functions to various components of your data. It is important that you are able to express yourself through functions, since these tools will provide great power.

Functions

You have already dealt with several functions in R. Some of them come pre-packaged with base R (e.g. mean()), while others are from external packages (e.g. dplyr::filter()). Functions in R all have the same basic syntax:

function_name(ARGUMENTS)

Most of the time we use functions written by other people. However, as you know by now, you can write your own functions too. This is easy to do with the generic function() function. The syntax will again look familiar to you:

function(ARGUMENTS) {
  OPERATIONS
  return(VALUE)
}

While we can write anonymous functions like the one above, we typically write functions because we want to reuse code. For this typical use-case it makes sense to name our functions. You can use either = or <- when assigning the function name. Try to give functions names that make sense to other people reading your code.

my_func =
  function(ARGUMENTS) {
    OPERATIONS
    return(VALUE)
  }

This function can be written across multiple lines as above, or it can be stated in a single line as follows,

my_short_func = function(ARGUMENTS) OPERATION

We will illustrate with some examples below.

Example: Gapminder data

We can write functions to more easily work with data. We might have some interactive code that we want to wrap in a function. My general advice is that you always want to wrap your code in functions if you can.

Say that we have the following interactive code that we input into our console,

R> min(gapminder$lifeExp)
## [1] 23.599
R> max(gapminder$lifeExp)
## [1] 82.603

We want to calculate the difference between the maximum and minimum life expectancy in the Gapminder dataset.

R> max(gapminder$lifeExp) - min(gapminder$lifeExp)
## [1] 59.004

One nice thing to do is to wrap this in a function, so that we can perhaps use this difference operator for other variables as well. We can write the function on one line as follows,

R> max_minus_min <- function(x) max(x) - min(x)

We can test this function on our Gapminder data, but also on other inputs.

R> max_minus_min(gapminder$lifeExp)
## [1] 59.004
R> max_minus_min(gapminder$gdpPercap)
## [1] 113282
R> max_minus_min(1:10)
## [1] 9
R> max_minus_min(runif(1000))
## [1] 0.9961785

This function will only work for certain types of inputs where max() and min() make sense. If we were to try the following it wouldn’t work,

R> max_minus_min(gapminder$country) # What is the max() and min() of the country variable?
## [1] Error in Summary.factor(structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, : 'max' not meaningful for factors

The function provides us the power to apply a general method to many different inputs. Remember, computer scientists are a lazy bunch, they don’t want to write out the same piece of code multiple times! We don’t want to repeat ourselves and create more effort than necessary. So when you are writing code, always think about wrapping it in a function. There are obviously many other benefits to writing in terms of functions, such as retaining local scope instead of working with global variables. I am sure that Nico covered some of the basic ideas of why function are so important.

In R it is often useful to write a function in its own script. This is generally good programming practice. In other languages like Julia this is not necessarily the same. If we have time I can go over how efficient Julia code looks.

Warning: Type stability

There are going to be some cases where the function works, but it isn’t what we intended. This means that we have to be careful to consider potential fringe cases and the types of variables that can be input into the function.

R> max_minus_min(c(TRUE, TRUE, FALSE, TRUE, TRUE))
## [1] 1

Note: This is where Julia really shines, it’s type annotation system allows you to finely control which types of variables are acceptable as inputs into the function.

As far as I am aware, since R is a dynamically typed language (the type is inferred by the compiler), you can get some weird results sometimes when writing functions. For example, what does the following function do?

R> do_magic <- 
+   function(a, b) {
+   a + b
+ }

If we are summing two numbers then this makes sense. However, this will also add two Booleans! Is this what we intended the function to do? In Julia you could write a function as follows,

function do_magic(a::Number, b::Number)
  return a + b
end

In this case the function will not work with something other than a Number, since we declared that the inputs into the function should be of a certain abstract type. This is perhaps a bit more technical of a topic, but writing type stable code is important in production. In the case of R your code should be well documented and should make people aware of the possibility of potential output conflicts for specific inputs. I am not aware of an elegant solution to resolve the issues with a dynamically typed language such as R, but maybe someone who is more of an expert in the language such as Nico might have a good answer.

One solution that I have found is to call a stopifnot() function inside of your function

R> do_magic_type_stable <- 
+   function(a, b) {
+   stopifnot(is.numeric(a))
+   stopifnot(is.numeric(b))
+   a + b
+ }
R> do_magic(1, 2)
## [1] 3
R> do_magic(TRUE, FALSE)
## [1] 1
R> do_magic_type_stable(1, 2)
## [1] 3
do_magic_type_stable(TRUE, FALSE)
## [1] Error in do_magic_type_stable(TRUE, FALSE) : is.numeric(a) is not TRUE

If you want a clearer error message you can use the assertthat package instead of stopifnot().

R> do_magic_assert <- 
+   function(a, b) {
+   assert_that(is.numeric(a))
+   assert_that(is.numeric(b))
+   a + b
+ }
do_magic_assert(TRUE, FALSE)
## [1] Error: a is not a numeric or integer vector

However, this is not a natural solution and is not as flexible as Julia when it comes to type specification. If you want to know more about how to protect the functions you wrote from providing nonsense output I recommend you read the section on defensive programming by Hadley Wickham in his textbook called Advanced R. You can find a link to the chapter here.

Example: The cube

You should already have written some of your own functions, but as a quick reminder. Let us write a function that returns the cube of some input number.

R> cube =              ## Our function name
+   function(x) {     ## The argument(s) that our function takes as an input
+     x^3             ## The operation(s) that our function performs
+   }

We can test whether this function works. Fingers crossed!

R> cube(3)
## [1] 27

If my math is correct, it should be the case that \(3^3\) is equal to \(27\). So the function seems to be doing its job. In general we wouldn’t write functions as simple as these, since they most likely already exist in base R (just look at the arithmetic functions in R) and will be more efficiently coded. However, in some cases we will need to write our own functions.

In our example above we didn’t specify the return value in the code block. This is because R, like many other programming languages, automatically returns the final object in that you created with the function. However, I believe it is good practice to always assign return objects (as opposed to what Hadley Wickham says about return functions here).

R> cube =              
+   function(x) {     
+     x_cub = x^3     ## Create an intermediary object (that will be returned)
+     return(x_cub)   ## The value(s) or object(s) that we want returned.    
+   }

Specifying a return value is also helpful when we want to return more than one object. For example, let’s say that we want to remind our user what variable they used as an argument in our function:

R> cube = 
+   function(x) {     ## The argument(s) that our function takes as an input
+     x_cub = x^3     ## The operation(s) that our function performs
+     return(list(value = x, value_cubed = x_cub)) ## The list of object(s) that we want returned.
+   }

If we have multiple return objects we have to combine them in a list. Remember that many objects in R `contain multiple elements (vectors, data frames, and lists are all good examples of this). So we can also specify one of these “array”-type objects within the function itself if that provides a more convenient form of output. For example, we could combine the input and output values into a data frame:

R> cube = 
+   function(x) { 
+     x_cub = x^3 
+     df = tibble(value = x, value_cubed = x_cub) ## Bundle up our input and output values into a convenient dataframe.
+     return(df)
+   }

Default argument values

Another thing worth noting about R functions is that you can assign default argument values.

R> cube = 
+   function(x = 1) { ## Setting the default argument value 
+     x_cub = x^3 
+     df = tibble(value=x, value_cubed = x_cub)
+     return(df)
+   }
R> cube() ## Will take the deafult value of 1
## # A tibble: 1 × 2
##   value value_cubed
##   <dbl>       <dbl>
## 1     1           1
R> cube(2) ## Now takes the explicit value that we give it.
## # A tibble: 1 × 2
##   value value_cubed
##   <dbl>       <dbl>
## 1     2           8

Lexical scoping

An important concept to understand in the discussion of functions is the idea of lexical scoping. If you enter the code for this session into the console you will see that the functions will have entered the global enironment. However, the variables such as x inside of the function haven’t creeped into the global environment. These variables can only be accessed within the function itself (locallly).

You can think of a function as its own environment. They only return objects to the global environment when they are forced to, normally with a return() command. The function will also only look outside of its own scope for a variable if it isn’t defined in the function body. We will talk about this a bit more in the section on functional programming. I am sure Nico has mentioned this before, I just want to reiterate the point here.

Control flow

Now that we’ve got a good sense of the basic function syntax, it’s time to learn control flow. That is, we want to control the order (or “flow”) of statements and operations that our functions evaluate.

We’ve already encountered conditional statements like if() and ifelse() numerous times in the course thus far. However, let’s see how they can work in our own bespoke functions by slightly modifying our previous cube function. This time, instead of specifying a default input value of 1 in the function argument itself, we’ll specify a value of NULL. Then we’ll use an if() statement to reassign this default to one.

R> cube = 
+   function(x = NULL) {  ## Default value of NULL
+     if (is.null(x)) x=1 ## Re-assign default to 1
+     x_cub = x^3 
+     df = tibble(value = x, value_cubed = x_cub)
+     return(df)
+   }
R> cube()
## # A tibble: 1 × 2
##   value value_cubed
##   <dbl>       <dbl>
## 1     1           1

Why go through the rigmarole of specifying a NULL default inpute if we’re going to change it to 1 anyway? Admittedly, this is a pretty silly thing to do in the above example. However, consider what it buys us in the next code chunk:

R> cube = 
+   function(x = NULL) {
+     if (is.null(x)) { ## Start multiline if statement with `{`
+       x = 1
+       message("No input value provided. Using default value of 1.") ## Message to users
+       } ## Close multiline if statement with `}`
+     x_cub = x^3 
+     df = tibble(value = x, value_cubed = x_cub)
+     return(df)
+   }
R> cube()
## No input value provided. Using default value of 1.
## # A tibble: 1 × 2
##   value value_cubed
##   <dbl>       <dbl>
## 1     1           1

This time, by specifying NULL in the argument — alongside the expanded if() statement — our function now both takes a default value and generates a helpful message. Note too the use of curly brackets for conditional operations that span multiple lines after an if() statement. This provides a nice segue to ifelse() statements. As we’ve already seen , these be written as a single conditional call where the format is:

ifelse(CONDITION, DO IF TRUE, DO IF FALSE)

Within our own functions, though we’re more likely to write them over several lines. Consider, for example a new function that evaluates whether our cube() function is doing its job properly.

R> eval_cube =
+   function(x) {
+     if (cube(x)$value_cubed == x*x*x) { ## condition
+       ## What to do if the condition is TRUE 
+       message("Nailed it.")
+     } else {
+       ## What to do if the condition is FALSE
+       message("Dude, your function sucks.")
+     }
+   }
R> eval_cube(8)
## Nailed it.

As you may have guessed, it’s certainly possible to write nested ifelse() statements. For example,

ifelse(CONDITION1, DO IF TRUE, ifelse(CONDITION2, DO IF TRUE, ifelse(...)))

However, these nested statements quickly become difficult to read and troubleshoot. A better solution was originally developed in SQL with the CASE WHEN statement. We see that dplyr provides a case_when() implementation, as seen below,

R> x = 1:10
R> ## dplyr::case_when()
R> case_when(
+   x <= 3 ~ "small",
+   x <= 7 ~ "medium",
+   TRUE ~ "big" ## Default value. Could also write `x > 7 ~ "big"` here.
+   )
##  [1] "small"  "small"  "small"  "medium" "medium" "medium" "medium" "big"   
##  [9] "big"    "big"

Not to belabour the point, but you can easily use these case when implementations inside of data frames/tables too.

R> ## dplyr::case_when()
R> tibble(x = 1:10) %>%
+     mutate(grp = case_when(x <= 3 ~ "small",
+                            x <= 7 ~ "medium",
+                            TRUE ~ "big"))
## # A tibble: 10 × 2
##        x grp   
##    <int> <chr> 
##  1     1 small 
##  2     2 small 
##  3     3 small 
##  4     4 medium
##  5     5 medium
##  6     6 medium
##  7     7 medium
##  8     8 big   
##  9     9 big   
## 10    10 big

Iteration

Alongside control flow, the most important early programming skill to master is iteration. In particular, we want to write functions that can iterate — or map — over a set of inputs. By far the most common way to iterate across different programming languages is for loops. Indeed, we already saw some examples of for loops back in the shell lecture. However, while R certainly accepts standard for loops, I’m going to advocate that you adopt what is known as a “functional programming” approach to writing loops. Let’s dive into the reasons why and how these approaches differ.

Vectorisation

The first question you need to ask is: “Do I need to iterate at all?” Here we will discuss the concept of vectorisation, which is to say that you can apply a function to every element of a vector at once, rather than one at a time. Let’s demonstrate this property with our cube function:

R> cube(1:5)
## # A tibble: 5 × 2
##   value value_cubed
##   <int>       <dbl>
## 1     1           1
## 2     2           8
## 3     3          27
## 4     4          64
## 5     5         125
R> cube(c(2, 4))
## # A tibble: 2 × 2
##   value value_cubed
##   <dbl>       <dbl>
## 1     2           8
## 2     4          64

So you may not need to worry about explicit iteration at all. That being said, there are certainly cases where you will need to worry about it. Let’s explore with some simple examples (some of which are already vectorised) that provide a mental springboard for thinking about more complex cases.

For loops

In R, standard for loops take a pretty intuitive form. For example:

R> for(i in 1:10) print(LETTERS[i])
## [1] "A"
## [1] "B"
## [1] "C"
## [1] "D"
## [1] "E"
## [1] "F"
## [1] "G"
## [1] "H"
## [1] "I"
## [1] "J"

Note that in cases where we want to “grow” an object via a for loop, we first have to create an empty (or NULL) object.

R> kelvin = 300:305
R> fahrenheit = NULL
R> # fahrenheit = vector("double", length(kelvin)) ## Better than the above. Why?
R> for(k in 1:length(kelvin)) {
+   fahrenheit[k] = kelvin[k] * 9/5 - 459.67
+ }
R> fahrenheit
## [1] 80.33 82.13 83.93 85.73 87.53 89.33

Unfortunately, basic for loops in R also come with some downsides. Historically, they used to be significantly slower and memory consumptive than alternative methods. This has largely been resolved, but I’ve still run into cases where an inconspicuous for loop has brought an entire analysis crashing to its knees. The bigger problem with for loops, however, is that they deviate from the norms and best practices of functional programming.

Functional programming

The concept of functional programming (FP) is arguably the most important thing that you can take away from today’s lecture. In his excellent book, Advanced R, Hadley Wickham explains the core idea as follows.

R, at its heart, is a functional programming (FP) language. This means that it provides many tools for the creation and manipulation of functions. In particular, R has what’s known as first class functions. You can do anything with functions that you can do with vectors: you can assign them to variables, store them in lists, pass them as arguments to other functions, create them inside functions, and even return them as the result of a function.

The problem is that for loops tend to emphasise the objects that we’re working with (say, a vector of numbers) rather than the operations that we want to apply to them (say, get the mean or median or whatever). This is inefficient because it requires us to continually write out the for loops by hand rather than getting an R function to create the for-loop for us.

As a corollary, for loops also pollute our global environment with the variables that are used as counting variables. Take a look at your “Environment” pane in RStudio. What do you see? In addition to the kelvin and fahrenheit vectors that we created, we also see two variables i and k (equal to the last value of their respective loops). Creating these auxiliary variables is almost certainly not an intended outcome when your write a for loop. More worringly, they can cause programming errors when we inadvertently refer to a similarly-named variable elsewhere in our script. So we best remove them manually as soon as we’re finished with a loop.

Another annoyance arrived in cases where we want to “grow” an object as we iterate over it (e.g. the fahrenheit object in our second example). In order to do this with a for loop, we had to go through the rigmarole of creating an empty object first.

FP allows to avoid the explicit use of loop constructs and its associated downsides. In practice, there are two ways to implement FP in R:

  1. The *apply family of functions in base R.
  2. The map*() family of functions from the purrr.

Let’s explore these in more depth.

lapply and friends

Base R contains a very useful family of *apply functions. I won’t go through all of these here — see ?apply — but they all follow a similar philosophy and syntax. The good news is that this syntax very closely mimics the syntax of basic for loops. For example, consider the code below, which is analgous to our first for loop above, but now invokes a base::lapply() call instead.

R> # for(i in 1:10) print(LETTERS[i]) ## Our original for loop (for comparison)
R> lapply(1:10, function(i) LETTERS[i])
## [[1]]
## [1] "A"
## 
## [[2]]
## [1] "B"
## 
## [[3]]
## [1] "C"
## 
## [[4]]
## [1] "D"
## 
## [[5]]
## [1] "E"
## 
## [[6]]
## [1] "F"
## 
## [[7]]
## [1] "G"
## 
## [[8]]
## [1] "H"
## 
## [[9]]
## [1] "I"
## 
## [[10]]
## [1] "J"

There are s couple of things to notice.

First, check your “Environment” pane in RStudio. Do you see an object called “i” in the Global Environment? (The answer should be “no”.) Again, this is because of R’s lexical scoping rules, which mean that any object created and invoked by a function is evaluated in a sandboxed environment outside of your global environment.

Second, notice how little the basic syntax changed when switching over from for() to lapply(). Yes, there are some differences, but the essential structure remains the same: We first provide the iteration list (1:10) and then specify the desired function or operation (LETTERS[i]).

Third, notice that the returned object is a list. The lapply() function can take various input types as arguments — vectors, data frames, lists — but always returns a list, where each element of the returned list is the result from one iteration of the loop. (So now you know where the “l” in “lapply” comes from.)

Okay, but what if you don’t want the output in list form? There several options here. However, the method that I use most commonly is to bind the different list elements into a single data frame with dplyr::bind_rows(). For example, here’s a a slightly modified version of our function that now yields a data frame:

R> lapply(1:10, function(i) {
+   df = tibble(num = i, let = LETTERS[i])
+   return(df)
+   }) %>%
+   bind_rows()
## # A tibble: 10 × 2
##      num let  
##    <int> <chr>
##  1     1 A    
##  2     2 B    
##  3     3 C    
##  4     4 D    
##  5     5 E    
##  6     6 F    
##  7     7 G    
##  8     8 H    
##  9     9 I    
## 10    10 J

Taking a step back, while the default list-return behaviour may not sound ideal at first, I’ve found that I use lapply() more frequently than any of the other apply family members. A key reason is that my functions normally return multiple objects of different type (which makes lists the only sensible format) or a single data frame (which is where dplyr::bind_rows() come in).

purrr package

The tidyverse offers its own enhanced implementation of the base *apply() functions through the purrr package. The key function to remember here is purrr::map(). And, indeed, the syntax and output of this command are effectively identical to base::lapply():

R> map(1:10, function(i) { ## only need to swap `lapply` for `map`
+   df = tibble(num = i, let = LETTERS[i])
+   return(df)
+   })
## [[1]]
## # A tibble: 1 × 2
##     num let  
##   <int> <chr>
## 1     1 A    
## 
## [[2]]
## # A tibble: 1 × 2
##     num let  
##   <int> <chr>
## 1     2 B    
## 
## [[3]]
## # A tibble: 1 × 2
##     num let  
##   <int> <chr>
## 1     3 C    
## 
## [[4]]
## # A tibble: 1 × 2
##     num let  
##   <int> <chr>
## 1     4 D    
## 
## [[5]]
## # A tibble: 1 × 2
##     num let  
##   <int> <chr>
## 1     5 E    
## 
## [[6]]
## # A tibble: 1 × 2
##     num let  
##   <int> <chr>
## 1     6 F    
## 
## [[7]]
## # A tibble: 1 × 2
##     num let  
##   <int> <chr>
## 1     7 G    
## 
## [[8]]
## # A tibble: 1 × 2
##     num let  
##   <int> <chr>
## 1     8 H    
## 
## [[9]]
## # A tibble: 1 × 2
##     num let  
##   <int> <chr>
## 1     9 I    
## 
## [[10]]
## # A tibble: 1 × 2
##     num let  
##   <int> <chr>
## 1    10 J

Given these similarities, I won’t spend much time on purrr. Although, I do think it will be the optimal entry point for many you when it comes to programming and iteration. You have already learned the syntax, so it should be very easy to switch over. However, one additional thing I wanted to flag for today is that map() also comes with its own variants, which are useful for returning objects of a desired type. For example, we can use purrr::map_df() to return a data frame.

R> map_df(1:10, function(i) { ## don't need bind_rows with `map_df`
+   df = tibble(num = i, let = LETTERS[i])
+   return(df)
+   })
## # A tibble: 10 × 2
##      num let  
##    <int> <chr>
##  1     1 A    
##  2     2 B    
##  3     3 C    
##  4     4 D    
##  5     5 E    
##  6     6 F    
##  7     7 G    
##  8     8 H    
##  9     9 I    
## 10    10 J

Note that this is more efficient (i.e. involves less typing) than the lapply() version, since we don’t have to go through the extra step of binding rows at the end.

Iterate over multiple inputs

Thus far, we have only been working with functions that take a single input when iterating. For example, we feed them a single vector (even though that vector contains many elements that drive the iteration process). But what if we want to iterate over multiple inputs? Consider the following function, which takes two separate variables x and y as inputs, combines them in a data frame, and then uses them to create a third variable z.

R> ## Create a named function
R> multi_func = 
+   function(x, y) {
+   df = 
+     tibble(x = x, y = y) %>%
+     mutate(z = (x + y)/sqrt(x))
+   return(df)
+   }

Before continuing, quickly test that it works using non-iterated inputs.

R> multi_func(1, 6)
## # A tibble: 1 × 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     6     7

Great, it works. Now let’s imagine that we want to iterate over various levels of both x and y. There are two basics approaches that we can follow to achieve this, use base::mapply() or purrr::pmap().

I’ll quickly review both approaches, continuing with the multi_func() function that we just created above.

Use mapply() or pmap()

Both base R — through mapply() — and purrr — through pmap — can handle multiple input cases for iteration. The latter is easier to work with in my opinion, since the syntax is closer (nearly identical) to the single input case. Still, I’ll demonstrate using both versions below.

First, base::mapply():

R> ## Note that the inputs are now moved to the *end* of the call. 
R> ## Also, mapply() is based on sapply(), so we also have to tell it not to 
R> ## simplify if we want to keep the list structure.
R> mapply(
+   multi_func,
+   x = 1:5,         ## Our "x" vector input
+   y = 6:10,        ## Our "y" vector input
+   SIMPLIFY = FALSE ## Tell it not to simplify to keep the list structure
+   ) %>%
+   bind_rows()
## # A tibble: 5 × 3
##       x     y     z
##   <int> <int> <dbl>
## 1     1     6  7   
## 2     2     7  6.36
## 3     3     8  6.35
## 4     4     9  6.5 
## 5     5    10  6.71

Second, purrr::pmap():

R> ## Note that the inputs are combined in a list.
R> pmap_df(list(x=1:5, y=6:10), multi_func)
## # A tibble: 5 × 3
##       x     y     z
##   <int> <int> <dbl>
## 1     1     6  7   
## 2     2     7  6.36
## 3     3     8  6.35
## 4     4     9  6.5 
## 5     5    10  6.71

There is a third method, which you can view here. Most of the notes from the this section are directly copied from the reading.

There are some more notes advanced notes on functions that we will not have time to cover. These notes can be found here. Please take the time to try and read these notes. They provide some really useful software engineering constructions.