Note: This is the second of three lectures on programming. Please take a look at the first lecture if you haven’t gone through it yet. Today, we’ll build on that foundation by tackling some more advanced issues that arise when writing and working with functions in R. In particular, I’m going to focus on three issues: 1) function debugging, 2) catching user errors, and 3) caching results.
Install (if necessary) and load these packages now:
if (!require("pacman")) install.packages("pacman")
pacman::p_load(memoise, tidyverse, here)
We will also be working with our simple square()
function from the previous lecture. Let’s create it again quickly.
square = function(x = 1) {
  x_sq = x^2
  d = tibble(value=x, value_squared=x_sq)
  return(d)
}
Functions are incredibly powerful and helpful programming tools. They also tend to go wrong a lot. Sometimes this is because we (or someone else) made a mistake in writing up the code. Other times it is because of user error (e.g. invalid inputs). Regardless, this is where we have to begin debugging the code to figure out where and why things went wrong.
R and the RStudio IDE provide a number of excellent tools for debugging. We’ll walk through these together in a series of live coding examples in a minute. But, first here is a visual summary, cropped directly from the RStudio IDE Cheat Sheet.1
Note: This section should be run interactively and I’ve set most of the below code chunks to eval=FALSE
in the R Markdown document. So don’t be surprised if the knitted HTML document doesn’t contain output corresponding to what I’m talking about in the text. My advice is to run the code chunks yourself and then follow along with the text.
As the figure above suggests, there are various ways to enter so-called debug mode. This is where you “step inside” a function and evaluate objects within that function environment. In other words, it allows you to pause time and interact with internal function objects that are normally hidden from your global environment.2 Let’s practice with an example.
Suppose we feed some deliberately invalid input — i.e. a character string — to our square()
function:
square("one")
## Error in x^2: non-numeric argument to binary operator
Now, of course, we already knew that this function call would fail. (D’uh, you can’t square a string.) In this case, R also produced an informative error message. However, notice that we don’t actually get to see the point of failure — i.e. the line of code where our function tried to square the value “one”. For this we need to enter debug mode.
While there are several ways to trigger the debugger, I recommend the debugonce()
function. As the name suggests, running debugonce(square)
will cause us to enter debug mode the next time we call square()
, but only that one time.3 Let’s try it in a live session:
## Run this next chunk yourself in a live session
debugonce(square)
square("one")
Here is a screenshot from my computer after I ran the above code chunk and explored a little.
Note the following changes to my RStudio IDE while I’m in debug mode:
- My Environment pane now contains the function argument x, which has been assigned the value of “one”.
- The console prompt has changed from > to Browse[2]>, indicating that I am now in the R environment browser. The debugger console is fully functional and I have started exploring:
  - I typed x into the console prompt and confirmed that R returns a value of “one”.
  - I then tried to run the first line of the function (x_sq = x^2) and was met with an error message (“non-numeric argument to binary operator”). This is the exact point where the function actually fails.

Now, again, in this particular case the problem was obvious and known ahead of time. But I think this simple example is useful for illustrating some general debugging principles and the power of being able to step inside functions via debugger mode. It gives you a chance to see what the function “sees” and then work through to the root cause of the problem in a systematic way. Once you have identified the problem, you can set about correcting your function (or inputs). Ultimately, you want to write robust functions that recognise errors early, fail with some acceptable level of tolerance, and provide helpful feedback to users. This will be the subject of the next section.
First, an aside: We don’t always have to manually invoke the debugger (say, via debugonce()) when a function fails. In fact, RStudio will often prompt you to “Rerun with Debug” automatically if something goes wrong. You will see something like the below screenshot in your console when this happens. (Ignore the specific function and error message and rather focus on the blue icons on the right.)
The automatic RStudio prompt will arise whenever there is something wrong with the way the R code in your function is being evaluated (i.e. it yields an R error). In contrast, you won’t receive the automatic prompt if there is something wrong with your code logic (e.g. your function runs without complaint but returns the wrong values).
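To make the distinction concrete, here is a minimal sketch of a logic error. (The cube_instead() function below is purely hypothetical and not part of the lecture code.) The code evaluates without any R error, so no debug prompt appears, yet the result is wrong.
## Hypothetical example of a logic error: evaluates cleanly, so no debug prompt...
cube_instead = function(x = 1) {
  x_sq = x^3 ## ...but this should obviously be x^2
  tibble(value = x, value_squared = x_sq)
}
cube_instead(2) ## Quietly returns 8 instead of 4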
In the previous lecture, we implicitly assumed that the user knows exactly how to use our function. However, this isn’t always true. A related, but more complicated, scenario is when we mistakenly feed the wrong type of argument to a function. Let’s return to our earlier example, where we “accidentally” entered a string into our square()
function.
square("one")
## Error in x^2: non-numeric argument to binary operator
This may just seem like a case of particularly dumb user error. However — trust me — it’s very easy to run into this category of problem when you have a complex analysis that consists of, say, a series of nested functions. (One function calling another, calling another…) For whatever reason, a single function or iteration may produce slightly different output than expected and this can bring your entire analysis crashing to its knees, because the output can’t be used in the next part of the chain. This is especially frustrating when you are running a multicore process (e.g. a parallel Monte Carlo simulation), since the program will first complete the entire run — perhaps taking several hours — before informing you right at the end that there was an error somewhere and no results (even for the valid iterations!) have been retained.
Luckily, there are several approaches to guarding against these kinds of mistakes. I’ll briefly run through what I see as the three main options below: 1) function-specific control flow, 2) base::tryCatch(), and 3) purrr::safely() and family.

We covered the basics of control flow in the previous lecture (here). Let’s put the same concepts to use for debugging. In this particular function, for example, we know that our function requires a numeric input. So we can check whether the input argument is a numeric and use an ifelse statement to produce a warning/error message if it fails this test. Consider, then, a slightly modified version of our function, which I’ll call square_ifelse.
square_ifelse = function(x = 1) {
  if (is.numeric(x)) { ## Check that this is a valid argument to our function.
    x_sq = x^2
    d = tibble(value=x, value_squared=x_sq)
    return(d)
  } else { ## Return a warning message if not.
    message("Sorry, you need to provide a numeric input variable.")
  }
}
Test it.
square_ifelse("one") ## Will trigger our warning message.
## Sorry, you need to provide a numeric input variable.
square_ifelse(1) ## Works.
## # A tibble: 1 × 2
## value value_squared
## <dbl> <dbl>
## 1 1 1
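As a brief aside (my own variation, not part of the original example), we could swap message() for base R's warning() if we wanted the failure to be flagged as a formal warning condition rather than a plain console message. A minimal sketch:
## Variant of square_ifelse() that raises a formal warning instead of a message
square_warn = function(x = 1) {
  if (is.numeric(x)) {
    x_sq = x^2
    d = tibble(value = x, value_squared = x_sq)
    return(d)
  } else {
    warning("Sorry, you need to provide a numeric input variable.")
    invisible(NULL) ## Explicitly return nothing useful
  }
}
square_warn("one") ## Issues a warning and returns NULL invisibly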
We can achieve a very similar result, but with less code, using the generic stop()
function.
square_stop = function(x = 1) {
  if (!is.numeric(x)) stop("Sorry, you need to provide a numeric input variable.")
  x_sq = x^2
  d = tibble(value=x, value_squared=x_sq)
  return(d)
}
square_stop("one") ## Triggers a stop error and warning
## Error in square_stop("one"): Sorry, you need to provide a numeric input variable.
square_stop(1) ## Works
## # A tibble: 1 × 2
## value value_squared
## <dbl> <dbl>
## 1 1 1
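A related base R option, not covered in the lecture itself, is stopifnot(), which condenses the validity check into a single line at the cost of a less customisable error message. A minimal sketch:
## Terser input check with base R's stopifnot()
square_stopifnot = function(x = 1) {
  stopifnot(is.numeric(x)) ## Errors with "is.numeric(x) is not TRUE" for invalid input
  tibble(value = x, value_squared = x^2)
}
square_stopifnot(2)        ## Works
# square_stopifnot("one")  ## Would trigger the stopifnot error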
base::tryCatch()
Another, more general option is to use the base::tryCatch()
function for handling errors and warnings. Let me demonstrate its usefulness with two separate examples.
tryCatch() around an entire function

The first simply wraps a generic tryCatch statement around our existing square function. Note the invocation of R’s in-built “error” class, which in turn is passed to another in-built function called message. Basically, we are telling R to produce a particular message whenever it recognizes that an error (any error!) has occurred while executing our bespoke function.
tryCatch(
square("three"),
error = function(e) message("Sorry, something went wrong. Did you try to square a string instead of a number?")
)
## Sorry, something went wrong. Did you try to square a string instead of a number?
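For completeness, tryCatch() can also intercept warnings and run clean-up code via its finally argument. Here is a hedged sketch extending the same wrapper (the handler messages are my own wording); conditionMessage() extracts the text of the underlying condition.
tryCatch(
  square("three"),
  error   = function(e) message("Sorry, something went wrong: ", conditionMessage(e)),
  warning = function(w) message("Heads up, a warning was raised: ", conditionMessage(w)),
  finally = message("This runs regardless of whether an error occurred.")
)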
This first example works well, but it has the downside of throwing out everything that went into the function in favour of a single error message. Not only that, but it could throw out potentially valid input-output because of a single error. To see this more clearly, let’s feed our function a vector of inputs, where only one input is invalid.
tryCatch(
square(c(1,2,"three")),
error = function(e) message("Sorry, something went wrong. Did you try to square a string instead of a number?")
)
## Sorry, something went wrong. Did you try to square a string instead of a number?
So we simply get an error message, even though some (most) of our inputs were valid. In an ideal world, we would have retained the input-output from the valid parameters (i.e. 1 and 2) and only received an error message for the single invalid case (i.e. “three”). This leads us to our second example…
tryCatch() inside a function

The second example avoids the above problem by invoking tryCatch()
inside our user-defined function. The principle is very much the same as before: We’re going to tell R what to give us whenever it encounters an error. However, we are going to be more explicit about where we expect that error to occur. Moreover, instead of simply producing an error message, this time we’ll instruct R to return an explicit, alternative value (i.e. NA
).
square_trycatch = function(x = 1) {
  x_sq = tryCatch(x^2, error = function(e) NA_real_) ## tryCatch goes here now. Produce an NA value if we can't square the input.
  d = tibble(value=x, value_squared=x_sq)
  return(d)
}
Let’s see that it works on our previous input vector, where only one input was invalid.
square_trycatch(c(1,2,"three"))
## # A tibble: 3 × 2
## value value_squared
## <chr> <dbl>
## 1 1 NA
## 2 2 NA
## 3 three NA
Huh? Looks like it half worked. We get the input values, but now all of the squared values have been converted to NA
. Why do you think that is? Challenge: See if you can figure out the problem on your own using debugonce(square_trycatch)
before continuing…
Let’s take a deeper look at our input vector:
str(c(1,2,"three"))
## chr [1:3] "1" "2" "three"
Ah-ha. R has coerced every element in the input vector to a character string. (Remember: Vectors can only contain elements of the same type.) The solution is to use an input array that allows different element types — i.e. a list. This, in turn, requires modifying the way that we invoke the function by putting it in a base::lapply()
or purrr::map()
call. As you’ll hopefully remember from the last lecture, these two functions are syntactically identical, so I’ll just use the latter:
map(list(1,2,"three"), square_trycatch)
## [[1]]
## # A tibble: 1 × 2
## value value_squared
## <dbl> <dbl>
## 1 1 1
##
## [[2]]
## # A tibble: 1 × 2
## value value_squared
## <dbl> <dbl>
## 1 2 4
##
## [[3]]
## # A tibble: 1 × 2
## value value_squared
## <chr> <dbl>
## 1 three NA
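And, as noted above, the base::lapply() equivalent is syntactically identical and returns the same list of data frames.
## Base R equivalent of the purrr::map() call above
lapply(list(1, 2, "three"), square_trycatch)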
As we practiced last lecture, we may wish to bind the resulting list of data frames into a single data frame using dplyr::bind_rows()
or, more simply, purrr::map_df()
. However, that actually produces errors of its own, because all of the columns need to be of the same type before they can be combined.
map_df(list(1,2,"three"), square_trycatch)
## Error: Can't combine `..1$value` <double> and `..3$value` <character>.
The somewhat pedantic solution is to make sure that the offending input is coerced to a numeric within the function itself. Note that this will introduce coercion warnings of its own, but at least it won’t fail.
square_trycatch2 = function(x = 1) {
  x_sq = tryCatch(x^2, error = function(e) NA_real_)
  d = tibble(value=as.numeric(x), value_squared=x_sq) ## Convert input to numeric
  return(d)
}
map_df(list(1,2,"three"), square_trycatch2)
## Warning in eval_tidy(xs[[j]], mask): NAs introduced by coercion
## # A tibble: 3 × 2
## value value_squared
## <dbl> <dbl>
## 1 1 1
## 2 2 4
## 3 NA NA
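If that lingering coercion warning bothers you, one option (my own suggestion rather than part of the lecture) is to wrap the coercion in base R's suppressWarnings(). A minimal sketch:
## Same as square_trycatch2(), but silencing the expected coercion warning
square_trycatch3 = function(x = 1) {
  x_sq = tryCatch(x^2, error = function(e) NA_real_)
  d = tibble(value = suppressWarnings(as.numeric(x)), value_squared = x_sq)
  return(d)
}
map_df(list(1, 2, "three"), square_trycatch3) ## No warning this time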
purrr::safely() and family

Finally, for those of you who prefer a tidyverse equivalent of tryCatch()
, you can use purrr::safely()
and its related functions (including purrr::possibly()
and other variants). I won’t go through the entire rigmarole again, so here’s a simple flavour of how they work:
square_simple = function(x = 1) {
  x_sq = x^2
}

square_safely = safely(square_simple)
square_safely("three")
## $result
## NULL
##
## $error
## <simpleError in x^2: non-numeric argument to binary operator>
You can also specify default behaviour in case of an error:
square_safely = safely(square_simple, otherwise = NA_real_)
square_safely("three")
## $result
## [1] NA
##
## $error
## <simpleError in x^2: non-numeric argument to binary operator>
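Since purrr::possibly() was mentioned above, here is a quick sketch of how it compares. It behaves much like safely(), except that it skips the result/error list structure and simply returns the otherwise value whenever an error occurs.
## possibly() returns the "otherwise" value directly on error
square_possibly = possibly(square_simple, otherwise = NA_real_)
square_possibly("three") ## Returns NA instead of a result/error list
square_possibly(3)       ## Returns 9 as normal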
We’ve already experienced the benefits (and occasional frustrations) of caching with R Markdown documents.4 Caching can also be extremely useful for regular R scripts and analyses. For example, we may wish to save some computationally-expensive output so that we don’t have to run it again. On a related but more sinister note, programs and functions can crash midway through completion. This can happen for a variety of reasons: invalid function arguments buried in iterated input, computer malfunction, memory limits, power outages, timeouts, etc. Regardless, it can be a fairly soul-splintering experience to lose all of your work if you’re working on a particularly lengthy simulation or computation problem.
We’ll get to parallel computation in the next lecture, but the problem is even worse there. What typically happens with a parallelized program is that the entire run will complete (potentially taking many hours or days) and only reveal an error right at the end… with no saved output!
Fortunately, R has our back with several caching tools. Here I’m going to focus on the memoise package (link). Note that memoisation/memoization refers to a particular form of caching where we save (i.e. “remember”) the results of expensive function calls, so that we don’t have to repeat them in the future.
Let’s start by creating a “slow” version of our simple square function — that waits for two seconds before doing anything — which I’ll creatively call slow_square()
. Of course, this is just meant to emulate a computationally-expensive operation, but the basic principles will carry through intact.
## Emulate slow function
slow_square = function(x) {
  Sys.sleep(2)
  square(x)
}
Enabling caching (i.e. memoisation) of our slow function is a simple matter of feeding it to memoise::memoise()
.
# library(memoise) ## Already loaded
mem_square = memoise(slow_square)
Note: I’ve assigned the memoised version as its own function here (i.e. mem_square()
). However, it’s no problem to recycle the original function name if you prefer (i.e. slow_square <- memoise(slow_square)
).
The first time we execute our memoised mem_square()
function, it won’t be able to draw on any saved results. This means that it will have to run through all of the underlying computation. In the process of doing so, however, it will save both the inputs and results for immediate retrieval later on.
Let’s run some examples and compare actual timings. For the first run, I’ll iterate over our function using the numbers 1 through 10 and save the resulting data frame to an object called m1
.
system.time({
  m1 = map_df(1:10, mem_square)
})
## user system elapsed
## 0.045 0.000 20.062
As expected, this took about 20 seconds because of the enforced two-second wait during each iteration. Now, we try calling the function a second time — iterating over the exact same inputs and saving to a new m2
object — to see if caching makes a difference…
system.time({
  m2 = map_df(1:10, mem_square)
})
## user system elapsed
## 0.001 0.000 0.002
And does it ever! We’re down to a fraction of a second, since we didn’t need to run the underlying computation again. Rather, we simply recalled the previously saved (i.e. memoised) results. And just to prove that we’re really saving meaningful output, here is a comparison of the two data frames, as well as the printed output of m2
.
all.equal(m1, m2)
## [1] TRUE
m2
## # A tibble: 10 × 2
## value value_squared
## <int> <dbl>
## 1 1 1
## 2 2 4
## 3 3 9
## 4 4 16
## 5 5 25
## 6 6 36
## 7 7 49
## 8 8 64
## 9 9 81
## 10 10 100
Finally, note that our caching function is smart enough to distinguish between previously cached and non-cached results. For example, consider what happens if I include five more numbers in the x
input vector.
system.time({
  m3 = map_df(1:15, mem_square)
})
## user system elapsed
## 0.015 0.003 10.026
As expected, this only took (5 × 2 =) 10 seconds to generate the new results from scratch, with the previous results being called up from the cache. You can think of the preceding example as approximating a real-life scenario, where your program crashes or halts midway through its run, yet you don’t need to restart all the way at the beginning. These kinds of interruptions happen more frequently than you might expect, especially if you’re working with complex analyses and high-performance computing tools (e.g. preemptible nodes or VM instances). Being smart about caching has saved me many lost hours and it could do the same for you.
The previous paragraph elides an important caveat: The default memoise()
cache is only valid for the current R session. You can see this more clearly by exploring the help documentation of the function, where you will note the internal cache = cache_memory()
argument. To enable caching that persists across sessions — including when your computer crashes — you need to specify a dedicated cache directory with cache = cache_filesystem(PATH)
. This directory can be located anywhere on your system (or, indeed, on a linked cloud storage service) and you can even have multiple cache directories for different projects. My only modest recommendation is that you use a .rcache/
naming pattern to keep things orderly.
For example, we can specify a new, persistent memoise cache location for our slow_square()
function within this lecture sub-directory as follows.
## Cache directory path
cache_dir = here("11-funcs-adv/.rcache")
## Create this directory if it doesn't yet exist
if (!dir.exists(cache_dir)) dir.create(cache_dir)
## (Re-)memoise our function with the persistent cache location
mem_square_persistent = memoise(slow_square, cache = cache_filesystem(cache_dir))
Run our new memoised function and check that it saved the cached output to the specified directory.
m4 = map_df(1:7, mem_square_persistent)

cached_files = list.files(cache_dir)
cached_files
## [1] "01c0e8630e7f14d5" "1b4b51ffbcca5468" "5083c4a086fa5548" "6508857a8ff8ac3b"
## [5] "c33f179b82a88fea" "df6e6c2cd4fd862a" "e1779b8cdb7378d7"
Bottom line: Specify a dedicated cache directory for complex or time-consuming analyses that you want to be able to access across R sessions.
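One last practical tip of my own: memoise also provides a forget() function for flushing a memoised function’s cache, which is handy when the underlying code or data changes and you want to force a fresh run.
## Flush the cache associated with mem_square(), forcing re-computation next time
forget(mem_square)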
It’s possible (and often very helpful) to add verbose prompts to our memoised functions. Consider the code below, which wraps our mem_square_persistent() function in a new mem_square_verbose() function comprised of two sections:
mem_square_verbose = function(x) {
  ## 1. Load cached data if already generated
  if (has_cache(mem_square_persistent)(x)) {
    cat("Loading cached data for x =", x, "\n")
    my_data = mem_square_persistent(x)
    return(my_data)
  }
  ## 2. Generate new data if cache not available
  cat("Generating data from scratch for x =", x, "...")
  my_data = mem_square_persistent(x)
  cat("ok\n")
  return(my_data)
}
And here’s an example of the verbose function in action. The output is probably less impressive in a knitted R Markdown document, but I find the real-time feedback to be very informative in a live session. (Try it yourself.)
system.time({
  m5 = map_df(1:10, mem_square_verbose)
})
## Loading cached data for x = 1
## Loading cached data for x = 2
## Loading cached data for x = 3
## Loading cached data for x = 4
## Loading cached data for x = 5
## Loading cached data for x = 6
## Loading cached data for x = 7
## Generating data from scratch for x = 8 ...ok
## Generating data from scratch for x = 9 ...ok
## Generating data from scratch for x = 10 ...ok
## user system elapsed
## 0.012 0.003 6.020
Finally, though it’s probably unnecessary, we can also prove to ourselves that we’ve added the three new cases (i.e. for 8:10
) to our cache directory by comparing to what we had previously.
setdiff(list.files(cache_dir), cached_files)
## [1] "4cecd4ddcc3e39e2" "e8f1fd81cf9238e8" "f87e79515d7f3630"
I mentioned this in an earlier lecture, but RStudio hosts a bunch of excellent cheat sheets that are all worth checking out.↩︎
Remember from our last lecture: Functions operate in semi-sandboxed environments determined by R’s lexical scoping rules.↩︎
The generic debug()
function works well too. However, there are some edge cases where running it inside the RStudio IDE can result in a recursive debugging loop (i.e. hell). So I advise avoiding debug()
in favour of debugonce()
unless you are running R (or calling R scripts) directly from the terminal.↩︎
Like all of my notes for the course, this lecture is written in R Markdown and then “knitted” to HTML. If you look at the .Rmd source file, you’ll see a code chunk with knitr::opts_chunk$set(echo=TRUE, cache=TRUE, dpi=300)
right at the top. The cache=TRUE
bit causes each subsequent code chunk to be cached, so that they don’t have to be re-run every time I recompile the document (unless something changes).↩︎