Welcome back everyone! Today we will have a look at
When you write code, things will inevitably go wrong at some point. You can professionalize the way of how to - fix unanticipated problems (debugging) - let functions communicate problems and take actions based on those communications (condition handling) - learn how to avoid common problems before they occur (defensive programming)
Potential problems are communicated via “conditions”: errors, warnings, and messages. For example - fatal errors are raised by stop() and force all execution to terminate - warnings are generated by warning() and display potential problems - messages are generated by message() and can provide informative output on the way
In the lecture, we saw the following function that calculates the geodesic distance between two points specified by radian latitude/longitude using the Haversine formula (hf); taken from here, as an example. We inserted some bugs here… 🐛
<- function(lat1, lon1, lat2, lon2, earth.radius = 6371) {
geod_dist # from degrees to radians
<- function(deg) return(deg*pi/180)
deg2rad <- deg2rad(lon1)
lon1 <- deg2rad(lat1)
lat1 <- deg2rad(long2)
lon2 <- deg2rad(lat2)
lat2 # calculation
<- (lon2 - lon1)
delta.long <- (lat2 - lat1)
delta.lat <- sin(delta.lat/2)^2 + cos(lat1) * cos(lat2) * sing(delta.long/2)^2
a <- 2 * asin(min(1,sqrt(a)))
c = earth_radius * c
d return(d)
}
geod_dist(lat1 = 49.5, lon1 = 8.4, lat2 = 52.5, lon2 = 13.4)
That is, if you see the error right away, try and fix it. You have lots of experience doing that. 🤓
You basically turn the arguments of the function into global objects (objects you can see in your environment, otherwise the objects are only available within your function). Then you can step through the code line by line to locate the bug.
# make the objects that are otherwise entered as input parameters to your function global
= 49.5; lon1 = 8.4; lat2 = 52.5; lon2 = 13.4 lat1
# now, execute line by line
<- function(deg) return(deg*pi/180)
deg2rad <- deg2rad(lon1)
lon1 <- deg2rad(lat1)
lat1 <- deg2rad(long2)
lon2 <- deg2rad(lat2)
lat2 <- (lon2 - lon1)
delta.long <- (lat2 - lat1)
delta.lat <- sin(delta.lat/2)^2 + cos(lat1) * cos(lat2) * sing(delta.long/2)^2
a <- 2 * asin(min(1,sqrt(a)))
c = earth_radius * c
d return(d)
Problem: This creates global objects that match arguments names, which can become confusing and cause problems that become obvious when the function is called in a different environment. ⚠️ In case you choose this option, it is a good idea to clean your environment afterwards, or simply to remove all the new global objects using rm()
.
Side note If you haven’t done so already, as a general best practice advise, change the settings in your global options to “never” save the workspace as this can cause similar issues to the example described above.
geod_dist(lat1 = 49.5, lon1 = 8.4, lat2 = 52.5, lon2 = 13.4)
traceback()
This shows you where the error occurred (but not why). Read from bottom to top (e.g. 1. called function X, 2. called function Y, error occurred in line #6 of function Y)
Basically, add browser()
into your function somewhere before you expect the expected error.
<- function(lat1, lon1, lat2, lon2, earth.radius = 6371) {
geod_dist # from degrees to radians
browser()
<- function(deg) return(deg*pi/180)
deg2rad <- deg2rad(lon1)
lon1 <- deg2rad(lat1)
lat1 <- deg2rad(lon2)
lon2 <- deg2rad(lat2)
lat2 # calculation
<- (lon2 - lon1)
delta.long <- (lat2 - lat1)
delta.lat <- sin(delta.lat/2)^2 + cos(lat1) * cos(lat2) * sin(delta.long/2)^2
a <- 2 * asin(min(1,sqrt(a)))
c = earth_radius * c
d return(d)
}geod_dist(lat1 = 49.5, lon1 = 8.4, lat2 = 52.5, lon2 = 13.4)
You can then interactively work through the function line by line by hitting enter in the console or send additional lines of code.
Note: Other helpful tools to debug functions in R that you have not written yourself are debug()
, that automatically opens a debugger at the start of a function call and trace()
, which allows temporary code modifications inside functions that you don’t have easy access to (e.g. ggplot()).
Sometimes errors come expected, and you want to handle them automatically, e.g.: - model fails to converge - download of files fails - stack processing of lists
Useful functions to deal with such cases: try()
and tryCatch()
<- function(x) {
f1 log(x)
10
} f1("x")
Ignore error:
<- function(x) {
f1 try(log(x))
10
} f1("x")
## Error in log(x) : non-numeric argument to mathematical function
## [1] 10
Suppress error message:
<- function(x) {
f1 try(log(x), silent = TRUE)
10
} f1("x")
## [1] 10
Pass block of code to try():
try({
<- 1
a <- "x"
b + b
a })
## Error in a + b : non-numeric argument to binary operator
Capture the output of try():
<- try(1 + 2)
success <- try("a" + "b") failure
## Error in "a" + "b" : non-numeric argument to binary operator
class(success)
## [1] "numeric"
class(failure)
## [1] "try-error"
Use possibly()
, a function similar to try()
from the purrr
package when applying a function to multiple elements in a list. You can also provide a default value (here: NA) in case execution fails.
library(purrr)
<-list(1,2,3,"f")
elements <- map(elements, log)
results <- map(elements, possibly(log, NA)) results
React to conditions, such as errors, warnings, messages, or interruptions, with certain actions “handlers” (functions that are called with the condition as an input) are mapped to condition handler functions can do anything but typically will return a value or create a more informative error message:
<- function(code) {
show_condition tryCatch(code,
error = function(c) "error",
warning = function(c) "warning",
message = function(c) "message" )
}
show_condition(stop("!"))
## [1] "error"
show_condition(warning("?!"))
## [1] "warning"
show_condition(message("?"))
## [1] "message"
If no condition is captured, tryCatch returns the value of the input:
show_condition(10+5)
## [1] 15
A more real-life example of how to use tryCatch() is this one:
<- function(data, formula1, formula2){
model_selection
tryCatch(lm(formula1, data), error = function(e) lm(formula2, data))
}
You try to fit formula1 to the data, however, maybe it is a model with very strict requirements. You might also have a more robust (but maybe less interesting) formula2 that you might fit if the requirements are not met and the modeling process throws an error.
Make more informative error messages
<- function(file, ...) {
read.csv_new tryCatch(read.csv(file, ...), error = function(c) {
$message <- paste0(c$message, " (in ", file, ")")
cstop(c)
})
}
read.csv("code/dummy.csv")
read.csv_new("code/dummy.csv")
Here is a piece of code that comes with quite some flaws. As an optional take-home exercise, please identify the bugs, remove them and report what you have done using comments.
# load packages
library(tidyverse)
library(LegislatoR)
# get political data on German legislators
<-
political_df left_join(x = filter(get_political(legislature = "ger"), as.numeric("session") == 18),
y = get_core(legislature = "ger"), by = "pageid")
# wiki traffic data
<-
traffic_df get_traffic(legislature = "ger") %>%
filter(date >= "2013-10-22" & date <= "2017-10-24") %>%
group_by(pageid) %>%
summarize(traffic_mean = mean(traffic, na.rm = TRUE),
traffic_max = max(traffic, na.rm = TRUE))
# sessions served
<-
sessions_served_df get_political(legislature = "deu") %%
group_by(pageid) %>%
::summarize(sessions_served = n())
dplyr
# merge
<-
legislator_df left_join(political_df, sessions_served_df, by = "pageid") %>%
left_join(traffic_df, by = "pageid")
# compute age
<- function(birth, date_at) {
get_age <- date_at
date_at_fmt <- birth
birth_fmt <- difftime(lubridate::ymd(birth_fmt), lubridate::ymd(date_at_fmt))
diff <- time_length(diff, "years")
diff_years
diff_years
}$age_in_years <- round(get_age(legislator_df$birth, "2017-10-24"), 0)
legislator_df
# plot top 10 pageviews
<- arrange(legislator_df, desc(traffic_mean))
legislator_df $rank <- 1:nrow(legislator_df)
legislator_df<- dplyr::select(rank, name, traffic_mean, traffic_max)
legislator_df_table names(legislator_df_table) <- c("Rank", "Representative", "Mean", "Maximum")
<- head(legislator_df_table, 10)
legislator_df_table
ggplot(legislator_df_table, aes(y = Mean, x = -Rank)) +
xlab("Rank") + ylab("Avg. daily page views") +
labs(title = "Top 10 representatives by average daily page views") +
geom_bar(stats = "identity") +
scale_x_continuous(breaks = -nrow(legislator_df_table):-1, labels = rev(1:nrow(legislator_df_table)))
geom_text(aes(y = 10, label = Representative), hjust = 0, color = "white", size = 2) +
coord_flip() +
theme_minimal()
# run model of page views as a function of sessions served, party, sex, and age in years
$traffic_log <- log(legislator_df$traffic_mean)
legislator_df
<- c("sessions_served", "party", "sex", "age_in_years")
covars <- paste("traffic_log", paste(covars, collapse = " - "), sep = " ~ ")
fmla summary(log_traffic_model <- lm(fmla, legislator_df))
# plot table
::tab_model(log_traffic_model) sjPlot
You can find the solutions here.
First of all, there is a very nice cheat sheet on package development.
Remember the overall workflow of creating a package:
Source: Simo Goshev & Steve Worthinton
We now run through the development of a small toy package.
Note: In this minimal example, we won’t sync the package to GitHub as not to overload information and won’t do version control via Git. In principle, it is strongly recommended to use Git for version control during the process.
Load the devtools
package, which is the public face of a set of packages that support various aspects of package development. Also load the roxygen2
package, which provides helper functions to document your package.
library(devtools)
library(roxygen2)
You are going to create a directory with the bare minimum folders of R packages. We are going to make a very tiny package providing only a single function as an illustration.
Call create_package()
from the usethis
package, which is loaded when you load devtools
, to initialize a new package in a directory on your computer (and create the directory, if necessary). Make a deliberate choice about where to create this package on your computer. It should probably be somewhere within your home directory, alongside your other R projects. It should not be nested inside another RStudio Project, R package, or Git repo. Nor should it be in an R package library, which holds packages that have already been built and installed. The conversion of the source package we are creating here into an installed package is part of what devtools facilitates. Don’t try to do devtools’ job for it!
Substitute your chosen path into a create_package()
call like this:
create_package("./awesomepackage", open = FALSE)
# create_package("~/Documents/R/awesomepackage", open = FALSE)
Also, let’s navigate into that directory, which will allow us to use some fancy functions down the line.
setwd("./awesomepackage")
Now it’s time to add the functions you want to create the package for. Let’s just work with a toy function here:
Some years ago, there used to be an issue when we catenating two factors (see this stackoverflow discussion). The problem has already been solved, for example within the forcats
package, but let’s pretend it had not.
For whatever reason, the result of catenating two factors used to be an integer vector consisting of the numbers 1, 2, 3, and 4. A possible solution would have been to coerce each factor to character, catenate, then re-convert to factor.
<- factor(c("character", "hits", "your", "eyeballs"))
a <- factor(c("but", "integer", "where it", "counts"))
b
c(a, b) # used to cause troubles in the past
factor(c(as.character(a), as.character(b))) # was a common workaround (no longer necessary)
Let’s drop that logic into the body of a function called fbind()
:
<- function(a, b) {
fbind factor(c(as.character(a), as.character(b)))
}
Save this function as an R file, fbind.R
, to the R directory in your package folder.
Alternatively, you can use the helper function use_r()
from the usethis
package, which automatically creates and/or opens a script below R/
:
use_r("fbind")
While not strictly necessary, it’s useful for Future-You and anybody else if you provide some documentation of your function. Luckily, there’s a tool again that helps you doing that. All you have to do is to add a comment like the following at the beginning of your function file, just like:
#' Bind two factors
#'
#' Create a new factor from two existing factors, where the new factor's levels
#' are the union of the levels of the input factors.
#'
#' @param a factor
#' @param b factor
#'
#' @return factor
#' @export
#' @examples
#' fbind(iris$Species[c(1, 51, 101)], PlantGrowth$group[c(1, 11, 21)])
Then, you can run document()
from roxygen2 to automatically create the documentation for you:
document()
You should now be able to preview your help file like so:
?fbind
Now it is as simple as installing the package! You need to run this from the parent working directory that contains the package folder.
setwd("..")
install("awesomepackage")
Now you have a real, live, functioning R package. For example, try typing ?fbind. You should see the standard help page pop up!
setwd("awesomepackage")
check()
The field of (software) licensing is complex. Check out this resource for a start. In terms of open source licenses, there are two major types: - permissive licences: easy-going; can be freely copied, modified, and published. examples: MIT, Apache - copyleft licenses: can be freely copied and modified or personal use, but publishing may be restricted. example: GPL
The usethis package comes with helper functions to add various licenses with minimal effort.
Things can become more complicated when you use other people’s code and want to bundle it with yours. The licenses may not be compatible… see more here.
More detailed information on R package licenses can be found here for more information.
use_mit_license("Your Name")
# use_gpl_license("Your Name")
# use_proprietary_license("Your Name") # don't make it open sources; cannot be published on CRAN
In case you’re having a nice idea for an actual package, please take a look at some much more in depth explanations, e.g. by Hadley Wickham and Jenny Bryan: R packages, Chapter 2 “The whole game” or Karl Broman: R package primer. An overview of next steps to take are:
As we do not have an assignment to this session, here is an optional take-home exercise to self-test your packaging skills.
Build your own package and publish it in a public GitHub repository. Here are the requirements:
All you have to provide to users interested in your package would be this line of code:
::install_github("<your-github-account-name>/<your-package-name>") devtools
We won’t cover automation in the lab today, but here is a summary of helpful resources in case you are planning to:
1. Organize the workflow of a complex project
2. Schedule reoccurring tasks
schtasks.exe
The debugging tutorial is inspired by Sean C. Anderson’s work on debugging R functions.
The packaging tutorial draws heavily on resources by Hilary Parker: Writing an R package from scratch, Hadley Wickham and Jenny Bryan: R packages, Chapter 2 “The whole game” and Karl Broman: R package primer
A work by Lisa Oswald & Tom Arend
Prepared for Intro to Data Science, taught by Simon Munzert