class: center, middle, inverse, title-slide .title[ # Introduction to Data Science ] .subtitle[ ## Session 3: Programming II ] .author[ ### Simon Munzert ] .institute[ ### Hertie School |
GRAD-C11/E1339
] --- <style type="text/css"> @media print { # print out incremental slides; see https://stackoverflow.com/questions/56373198/get-xaringan-incremental-animations-to-print-to-pdf/56374619#56374619 .has-continuation { display: block !important; } } </style> # Table of contents <br> 1. [Iteration](#iteration) 2. [Automation and scripting](#automation) 3. [Scheduling](#scheduling) 4. [Debugging](#debugging) <!-- ############################################ --> --- class: inverse, center, middle name: iteration # Iteration <html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html> --- background-image: url("pics/iteration-ford-assembly-line-1913.jpg") background-size: contain background-color: #000000 # Iteration --- # Iteration ### The ubiquity of iteration - Often we have to run the same task over and over again, with minor variations. Examples: - Standardize values of a variable - Recode all numeric variables in a dataset - Run multiple models with varying covariate sets - A benefit of scripting languages in data science (as opposed to point-and-click solutions) is that we can easily automate the process of iteration. -- ### Ways to iterate - A simple approach is to copy-and-paste code with minor modifications (→ "[duplicate code](https://en.wikipedia.org/wiki/Duplicate_code)", → "[copy-and-paste programming](https://en.wikipedia.org/wiki/Copy-and-paste_programming)"). This is lazy, error-prone, not very efficient, and violates the "[Don't repeat yourself](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)" (DRY) principle. - In R, [vectorization](https://adv-r.hadley.nz/perf-improve.html#vectorise), that is applying a function to every element of a vector at once, already does a good share of iteration for us. - `for()` [loops](https://r4ds.hadley.nz/iteration.html) are intuitive and straightforward to build, but sometimes not very efficient. - Finally, we learned about functions. 
Now, we learn how to unleash their power by applying them to anything we interact with in R at scale. --- # A simple example .pull-left[ ## Task Say we want to double each element in a numeric vector, `x = c(1, 2, 3, 4, 5)`. Here are some different approaches to achieve this: ### 1. Manually (sometimes: copying and pasting code) ``` r R> x <- c(1, 2, 3, 4, 5) R> x_doubled <- c(2, 4, 6, 8, 10) ``` ### 2. Vectorization ``` r R> x <- c(1, 2, 3, 4, 5) R> x_doubled <- x * 2 R> x_doubled ``` ``` [1] 2 4 6 8 10 ``` ] -- .pull-right[ ### 3. `for()` loop ``` r R> x <- c(1, 2, 3, 4, 5) R> x_doubled <- numeric(length(x)) R> for (i in seq_along(x)) { + x_doubled[i] <- x[i] * 2 + } R> x_doubled ``` ``` [1] 2 4 6 8 10 ``` ### 4. Using `purrr` ``` r R> library(purrr) R> x <- c(1, 2, 3, 4, 5) R> x_doubled <- map_dbl(x, ~ .x * 2) R> x_doubled ``` ``` [1] 2 4 6 8 10 ``` ] --- # Iteration with purrr .pull-left-wide[ ### The tidyverse way to iterate - For *real* functional programming in base R, we can use the `*apply()` family of functions (`lapply()`, `sapply()`, etc.). See [here](https://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/) for an excellent summary. - In the tidyverse, this functionality comes with the `purrr` package. - At its core is the `map*()` family of functions. ### How `purrr` works - The idea is always to **apply** a function to **x**, where x can be a list, vector, data.frame, or something more complex. - The output is then returned as output of a pre-defined type (e.g., a list). ] .pull-right-small-center[ <div align="center"> <br> <img src="pics/purrr.png" height=250> </div> ] --- # Iteration with purrr: map() The `map*()` functions all follow a similar syntax: <div align="center"> `map(.x, .f, ...)` </div> We use it to apply a function `.f` to each piece in `.x`. Additional arguments to `.f` can be passed on in `...`. 
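A minimal sketch of how `...` is handed on, assuming a made-up toy list `scores` (not part of the course data): `na.rm = TRUE` is not an argument of `map()` itself, so it is passed through to `mean()` for every element.

```r
library(purrr)

# Hypothetical toy data: three numeric vectors, one with a missing value
scores <- list(a = c(1, 2, 3), b = c(4, NA, 6), c = c(7, 8, 9))

# na.rm = TRUE travels through `...` to mean() for each element of scores
means <- map(scores, mean, na.rm = TRUE)
means
```

An anonymous function such as `map(scores, function(v) mean(v, na.rm = TRUE))` should give the same result and makes it explicit where the extra argument ends up.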
-- For instance, if we want to identify the object class of every column of a data.frame, we can write: ``` r R> map(starwars, class) ``` ``` $name [1] "character" $height [1] "integer" $mass [1] "numeric" $hair_color [1] "character" $skin_color [1] "character" $eye_color [1] "character" $birth_year [1] "numeric" $sex [1] "character" $gender [1] "character" $homeworld [1] "character" $species [1] "character" $films [1] "list" $vehicles [1] "list" $starships [1] "list" ``` --- # Iteration with purrr: map() *cont.* By default, `map()` returns a list. But we can also use other `map*()` functions to give us an atomic vector of an indicated type (e.g., `map_int()` to return an integer vector, or `map_vec()` to return a vector that is the simplest common type). Going back to the previous example, we can also use `map_chr()`, which returns a character vector: ``` r R> map_chr(starwars, class) ``` ``` name height mass hair_color skin_color eye_color "character" "integer" "numeric" "character" "character" "character" birth_year sex gender homeworld species films "numeric" "character" "character" "character" "character" "list" vehicles starships "list" "list" ``` -- The `purrr` function set is quite comprehensive. Be sure to check out the [cheat sheet](https://github.com/rstudio/cheatsheets/blob/master/purrr.pdf) and the [tutorials](https://jennybc.github.io/purrr-tutorial/index.html). You'll survive without `purrr` but you probably don't want to live without it. Together with `dplyr` it's easily the most powerful package for data wrangling in the tidyverse. If you master it, it will save you a lot of time and headaches. 
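To make the difference between the list-returning and the typed variants concrete, here is a small sketch on a made-up toy list `words` (an assumption for illustration, not course data):

```r
library(purrr)

# Hypothetical toy list of character vectors with different lengths
words <- list(short = c("a", "b"), long = c("c", "d", "e", "f"))

# map() always returns a list, one element per input element
lens_list <- map(words, length)

# map_int() returns a named integer vector instead
lens_int <- map_int(words, length)
lens_int
```

The typed variants are strict: if `.f` returns something that does not fit the requested type (say, a string inside `map_int()`), the call is expected to stop with an error rather than silently convert the result.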
--- # Iteration with purrr: map() *cont.* <div align="center"> <img src="pics/purrr-cheatsheet.png" height=550> </div> --- # Iteration with purrr: map() *cont.* <div align="center"> <img src="pics/purrr-cheatsheet-2.png" height=550> </div> --- # Another example .pull-left[ ## Task Let's say we want to calculate the mean and standard deviation of height and mass for each species in the `starwars` dataset. ``` r R> # Load the starwars dataset R> data(starwars) R> R> # Custom function for calculations R> calc_stats <- function(df) { + df %>% + summarise( + height_mean = mean(height, na.rm = TRUE), + height_sd = sd(height, na.rm = TRUE), + mass_mean = mean(mass, na.rm = TRUE), + mass_sd = sd(mass, na.rm = TRUE) + ) + } ``` ] -- .pull-right[ ``` r R> # Group by species and apply the custom function R> species_stats <- starwars %>% + group_by(species) %>% + nest() # Nesting the data R> species_stats ``` ``` # A tibble: 38 × 2 # Groups: species [38] species data <chr> <list> 1 Human <tibble [35 × 13]> 2 Droid <tibble [6 × 13]> 3 Wookiee <tibble [2 × 13]> 4 Rodian <tibble [1 × 13]> 5 Hutt <tibble [1 × 13]> 6 <NA> <tibble [4 × 13]> 7 Yoda's species <tibble [1 × 13]> 8 Trandoshan <tibble [1 × 13]> 9 Mon Calamari <tibble [1 × 13]> 10 Ewok <tibble [1 × 13]> # ℹ 28 more rows ``` ] --- # Another example .pull-left[ ## Task Let's say we want to calculate the mean and standard deviation of height and mass for each species in the `starwars` dataset. 
``` r R> # Load the starwars dataset R> data(starwars) R> R> # Custom function for calculations R> calc_stats <- function(df) { + df %>% + summarise( + height_mean = mean(height, na.rm = TRUE), + height_sd = sd(height, na.rm = TRUE), + mass_mean = mean(mass, na.rm = TRUE), + mass_sd = sd(mass, na.rm = TRUE) + ) + } ``` ] .pull-right[ ``` r R> # Group by species and apply the custom function R> species_stats <- starwars %>% + group_by(species) %>% + nest() %>% # Nesting the data + # purrr magic + mutate(stats = map(data, calc_stats)) R> species_stats ``` ``` # A tibble: 38 × 3 # Groups: species [38] species data stats <chr> <list> <list> 1 Human <tibble [35 × 13]> <tibble [1 × 4]> 2 Droid <tibble [6 × 13]> <tibble [1 × 4]> 3 Wookiee <tibble [2 × 13]> <tibble [1 × 4]> 4 Rodian <tibble [1 × 13]> <tibble [1 × 4]> 5 Hutt <tibble [1 × 13]> <tibble [1 × 4]> 6 <NA> <tibble [4 × 13]> <tibble [1 × 4]> 7 Yoda's species <tibble [1 × 13]> <tibble [1 × 4]> 8 Trandoshan <tibble [1 × 13]> <tibble [1 × 4]> 9 Mon Calamari <tibble [1 × 13]> <tibble [1 × 4]> 10 Ewok <tibble [1 × 13]> <tibble [1 × 4]> # ℹ 28 more rows ``` ] --- # Another example .pull-left[ ## Task Let's say we want to calculate the mean and standard deviation of height and mass for each species in the `starwars` dataset. 
``` r R> # Load the starwars dataset R> data(starwars) R> R> # Custom function for calculations R> calc_stats <- function(df) { + df %>% + summarise( + height_mean = mean(height, na.rm = TRUE), + height_sd = sd(height, na.rm = TRUE), + mass_mean = mean(mass, na.rm = TRUE), + mass_sd = sd(mass, na.rm = TRUE) + ) + } ``` ] .pull-right[ ``` r R> # Group by species and apply the custom function R> species_stats <- starwars %>% + group_by(species) %>% + nest() %>% # Nesting the data + # purrr magic + mutate(stats = map(data, calc_stats)) %>% + select(species, stats) %>% # Select columns + unnest(stats) # Unnest the data R> species_stats ``` ``` # A tibble: 38 × 5 # Groups: species [38] species height_mean height_sd mass_mean mass_sd <chr> <dbl> <dbl> <dbl> <dbl> 1 Human 178 12.0 81.3 19.3 2 Droid 131. 49.1 69.8 51.0 3 Wookiee 231 4.24 124 17.0 4 Rodian 173 NA 74 NA 5 Hutt 175 NA 1358 NA 6 <NA> 175 12.4 81 31.2 7 Yoda's species 66 NA 17 NA 8 Trandoshan 190 NA 113 NA 9 Mon Calamari 180 NA 83 NA 10 Ewok 88 NA 20 NA # ℹ 28 more rows ``` ] <!-- ############################################ --> --- class: inverse, center, middle name: automation # Automation and scripting <html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html> --- background-image: url("pics/automation-press-room.jpg") background-size: contain background-color: #000000 # Automation --- # Automation <div align="center"> <img src="pics/xkcd-automation.png" width="500"/> <br> <tt>Credit</tt> <a href="https://xkcd.com/1319/">Randall Munroe/xkcd 1319</a> </div> --- # Automation .pull-left-wide2[ ### Motivation - We spend [too much time](https://itchronicles.com/technology/repetitive-tasks-cost-5-trillion-annually/) on repetitive tasks. - We're already automating using scripts that bundle multiple commands! Next step: The pipeline as a series of scripts and commands. - Good pipelines are modular. But you don't want to trigger 10 scripts sequentially by hand. 
- Some tasks are to be repeated on a regular basis (schedule). ### When automation makes sense - The input is variable but the process of turning input into output is highly standardized. - You use a diverse set of software to produce the output. - Others (humans, machines) are supposed to run the analyses. - Time saved by automation >> Time needed to automate. ] .pull-right-small2[ ### Different ways of doing it We will consider automation - using **R**, - using the **Shell** and **Rscript**, - using **make**, and - using dedicated **scheduling tools**. <div align="center"> <img src="pics/automation-giphy.gif" width="400"/> </div> ] --- # Thinking in pipelines .pull-left[ ### Key characteristics - Pipelines make complex projects easier to handle because they break up a monolithic script into **discrete, manageable chunks**. - If properly done, each stage of the pipeline defines its inputs and its outputs. - Pipeline modules **do not modify their inputs** (*idempotence*). Rerunning one module produces the same results as the previous run. ### Key advantages - When you modify one stage of the pipeline, you only have to rerun the downstream, dependent stages. - Division of labor is straightforward. - Modules tend to be a lot easier to debug. ] .pull-right[ <br> <div align="center"> <img src="pics/berlin-pink-pipes.jpeg" width="450"/> </div> ] --- # A data science pipeline is a graph .pull-left-small2[ ### Wait what - Scripts and data files are vertices of the graph. - Dependencies between stages are edges of the graph. - Pipelines are not necessarily DAGs. Recursive routines are imaginable (but to be avoided?). - Also, scripts are not necessarily hierarchical (e.g., multiple different modeling approaches of the same data in different scripts). - An automation script gives *one* order in which you can successfully run the pipeline. 
] .pull-right-wide2[ <br> <div align="center"> <img src="pics/lotr-pipeline.png" width="600"/> </div> ] --- # An example pipeline .pull-left-small[ In the following, we will work with this toy pipeline:<sup>1</sup> .footnote[<sup>1</sup>Courtesy of [Jenny Bryan](https://github.com/STAT545-UBC/STAT545-UBC-original-website).] ] .pull-right-wide[ ] --- # An example pipeline .pull-left-small[ In the following, we will work with this toy pipeline: - `00-packages.R` loads the packages necessary for analysis, ] .pull-right-wide[ `00-packages.R`: ``` r R> # install packages from CRAN R> p_needed <- c("tidyverse" # tidyverse packages + ) R> packages <- rownames(installed.packages()) R> p_to_install <- p_needed[!(p_needed %in% packages)] R> if (length(p_to_install) > 0) { + install.packages(p_to_install) + } R> lapply(p_needed, require, character.only = TRUE) ``` ] --- # An example pipeline .pull-left-small[ In the following, we will work with this toy pipeline: - `00-packages.R` loads the packages necessary for analysis, - `01-download-data.R` downloads a spreadsheet, which is stored as `lotr_raw.tsv`, ] .pull-right-wide[ `01-download-data.R`: ``` r R> ## download raw data R> download.file(url = "http://bit.ly/lotr_raw-tsv", + destfile = "lotr_raw.tsv") ``` ] --- # An example pipeline .pull-left-small[ In the following, we will work with this toy pipeline: - `00-packages.R` loads the packages necessary for analysis, - `01-download-data.R` downloads a spreadsheet, which is stored as `lotr_raw.tsv`, - `02-process-data.R` imports and processes the data and exports a clean spreadsheet as `lotr_clean.tsv`, and ] .pull-right-wide[ `02-process-data.R`: ``` r R> ## import raw data R> lotr_dat <- read_tsv("lotr_raw.tsv") R> R> ## reorder Film factor levels based on story R> old_levels <- levels(as.factor(lotr_dat$Film)) R> j_order <- sapply(c("Fellowship", "Towers", "Return"), + function(x) grep(x, old_levels)) R> new_levels <- old_levels[j_order] R> R> ## process data set R> 
lotr_dat <- lotr_dat %>% + # apply new factor levels to Film + mutate(Film = factor(as.character(Film), new_levels), + # revalue Race + Race = recode(Race, `Ainur` = "Wizard", `Men` = "Man")) R> ## <skipping some steps here to avoid slide overflow> R> R> ## write data to file R> write_tsv(lotr_dat, "lotr_clean.tsv") ``` ] --- # An example pipeline .pull-left-small[ In the following, we will work with this toy pipeline: - `00-packages.R` loads the packages necessary for analysis, - `01-download-data.R` downloads a spreadsheet, which is stored as `lotr_raw.tsv`, - `02-process-data.R` imports and processes the data and exports a clean spreadsheet as `lotr_clean.tsv`, and - `03-plot.R` imports the clean dataset, produces a figure and exports it as `barchart-words-by-race.png`. ] .pull-right-wide[ `03-plot.R`: ``` r R> ## import clean data R> lotr_dat <- read_tsv("lotr_clean.tsv") %>% + # reorder Race based on words spoken + mutate(Race = reorder(Race, Words, sum)) R> R> ## make a plot R> p <- ggplot(lotr_dat, aes(x = Race, weight = Words)) + geom_bar() R> ggsave("barchart-words-by-race.png", p) ``` ] --- # An example pipeline ``` r R> slice_sample(lotr_dat, n = 10) ``` ``` # A tibble: 10 × 5 Film Chapter Character Race Words <chr> <chr> <chr> <chr> <dbl> 1 The Return Of The King 64: The Mouth Of Sauron Aragorn Man 23 2 The Fellowship Of The Ring 36: The Bridge Of Khazad-dûm Frodo Hobb… 4 3 The Two Towers 36: Isengard Unleashed Saruman Wiza… 50 4 The Fellowship Of The Ring 42: The Great River Sam Hobb… 37 5 The Return Of The King 42: Breaking The Gate Of Go… Gandalf Wiza… 21 6 The Two Towers 45: The Glittering Caves Legolas Elf 36 7 The Two Towers 35: Helm's Deep Rohan Wa… Man 22 8 The Fellowship Of The Ring 33: Moria Aragorn Man 31 9 The Fellowship Of The Ring 43: Parth Galen Aragorn Man 79 10 The Return Of The King 24: Courage Is The Best Def… Gothmog Orc 4 ``` --- # An example pipeline ``` r R> p <- ggplot(lotr_dat, aes(x = Race, weight = Words)) + + geom_bar() + 
theme_minimal() ``` <div align="center"> <br> <img src="examples/01-automation-just-r/barchart-words-by-race.png" width="400"/> </div> --- # Automation using pipelines in R .pull-left[ ### Motivation and usage - The `source()` function reads and parses R code from a file or connection. - We can build a pipeline by sourcing scripts sequentially. - This pipeline is usually stored in a "master/main" script. - The removal of previous work is optional and maybe redundant. Often the data is overwritten by default. - It is recommended that the individual scripts are (partial) standalones, i.e., that they import all data they need by default (loading the packages could be considered an exception). - Note that as long as the environment is not reset, it remains intact across scripts, which is a potential source of error and confusion. ] -- .pull-right[ ### Example The master script `master.R`: ``` r R> ## clean out any previous work R> outputs <- c("lotr_raw.tsv", + "lotr_clean.tsv", + list.files(pattern = "\\.png$")) R> file.remove(outputs) R> R> ## run scripts R> source("00-packages.R") R> source("01-download-data.R") R> source("02-process-data.R") R> source("03-plot.R") ``` ] --- # Automation using the Shell and Rscript .pull-left[ ### Motivation and usage - As an alternative to an R master script, we can also run the pipeline from the command line. - Note that here, the environments don't carry over across `Rscript` calls. The scripts definitely have to run in a standalone fashion (i.e., load packages, import all necessary data, etc.). - The working directory should be set either in the script(s) or in the shell with `cd`. 
] -- .pull-right[ ### Example The master script `master.sh`: ```bash #!/bin/sh cd /Users/simonmunzert/github/examples/02-automation-shell-rscript set -eux Rscript 01-download-data.R Rscript 02-process-data.R Rscript 03-plot.R ``` The `set` command allows us to adjust some base shell parameters: - `-e`: Stop at first error - `-u`: Undefined variables are an error - `-x`: Print each command as it is run For more information on `set`, see [here](http://linuxcommand.org/lc3_man_pages/seth.html). ] --- # Automation using the Shell and Rscript .pull-left[ ### Motivation and usage - As an alternative to an R master script, we can also run the pipeline from the command line. - Note that here, the environments don't carry over across `Rscript` calls. The scripts definitely have to run in a standalone fashion (i.e., load packages, import all necessary data, etc.). - The working directory should be set either in the script(s) or in the shell with `cd`. - One advantage of this approach is that it can be easily coupled with other command line tools, building a **polyglot pipeline**. ] .pull-right[ ### Example The master script `master.sh`: ```bash #!/bin/sh cd /Users/simonmunzert/github/examples/02-automation-shell-rscript set -eux *curl -L http://bit.ly/lotr_raw-tsv > lotr_raw.tsv Rscript 02-process-data.R Rscript 03-plot.R ``` The `set` command allows us to adjust some base shell parameters: - `-e`: Stop at first error - `-u`: Undefined variables are an error - `-x`: Print each command as it is run For more information on `set`, see [here](http://linuxcommand.org/lc3_man_pages/seth.html). ] --- # Automation using Make .pull-left-vwide[ ### Motivation and usage - [Make](https://en.wikipedia.org/wiki/Make_%28software%29) is an automation tool that allows us to specify and manage build processes. - It is commonly run via the shell. - At the heart of a make operation is the `makefile` (or `Makefile`, `GNUmakefile`), a script which serves as a recipe for the building process. 
- A `makefile` is written following a particular syntax and in a declarative fashion. - Conceptually, the recipe describes which files are built, how, and from what input. ### Advantages of Make - It looks at which files you have and automatically figures out how to create the files that you want. For complex pipelines this "automation of the automation process" can be very helpful. - While shell scripts give *one* order in which you can successfully run the pipeline, Make will figure out the parts of the pipeline (and their order) that are needed to build a desired target. ] .pull-right-vsmall[ <div align="center"> <br><br><br><br> <img src="pics/gnu-make.png" width="250"/> </div> ] --- # Automation using Make (cont.) .pull-left-small[ ### Basic syntax Each batch of lines indicates - a file to be created (the target), - the files it depends on (the prerequisites), and - the set of commands needed to construct the target from the dependent files. Dependencies propagate. - To create any of the `png` figures, we need `lotr_clean.tsv`. - If this file changes, the `png`s change as well when they're built. ] .pull-right-wide[ ### Example `makefile` ```bash all: lotr_clean.tsv barchart-words-by-race.png words-histogram.png lotr_raw.tsv: curl -L http://bit.ly/lotr_raw-tsv > lotr_raw.tsv lotr_clean.tsv: lotr_raw.tsv 02-process-data.R Rscript 02-process-data.R barchart-words-by-race.png: lotr_clean.tsv 03-plot.R Rscript 03-plot.R words-histogram.png: lotr_clean.tsv Rscript -e 'library(ggplot2); qplot(Words, data = read.delim("$<"), geom = "histogram"); ggsave("$@")' rm Rplots.pdf clean: rm -f lotr_raw.tsv lotr_clean.tsv *.png ``` ] --- # Automation using Make (cont.) .pull-left-small[ ### Getting Make to run - Using the command line, go into the directory for your project. - Create the `Makefile` file. - The most basic Make commands are `make all` and `make clean`, which build (or delete) all output as specified in the script. 
] .pull-right-wide[ ### Example `makefile` ```bash *all: lotr_clean.tsv barchart-words-by-race.png words-histogram.png lotr_raw.tsv: curl -L http://bit.ly/lotr_raw-tsv > lotr_raw.tsv lotr_clean.tsv: lotr_raw.tsv 02-process-data.R Rscript 02-process-data.R barchart-words-by-race.png: lotr_clean.tsv 03-plot.R Rscript 03-plot.R words-histogram.png: lotr_clean.tsv Rscript -e 'library(ggplot2); qplot(Words, data = read.delim("$<"), geom = "histogram"); ggsave("$@")' rm Rplots.pdf *clean: * rm -f lotr_raw.tsv lotr_clean.tsv *.png ``` ] --- # Automation using Make - FAQ ### Does it work on Windows? To install and run `make` on Windows, check out [these instructions](https://stat545.com/make-windows.html). ### Where can I learn more? If you consider working with Make, check out the [official manual](https://www.gnu.org/software/make/manual/make.html), [this helpful tutorial](https://makefiletutorial.com/), Karl Broman's [excellent minimal make introduction](https://kbroman.org/minimal_make/), or [this Stat545 piece](https://stat545.com/automation-overview.html). .pull-left-vwide[ ### This is dusty technology. Are there alternatives? In the context of data science with R, the `targets` package is an interesting option. It provides R functionality to define a Make-style pipeline. Check out the [overview](https://docs.ropensci.org/targets/) and [manual](https://books.ropensci.org/targets/). 
] .pull-right-vsmall[ <div align="center"> <br> <img src="pics/targets-hex.png" width="150"/> </div> ] <!-- ############################################ --> --- class: inverse, center, middle name: scheduling # Scheduling <html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html> --- background-image: url("pics/scheduling-clocks-louis-delphin.jpg") background-size: contain background-color: #000000 # Scheduling --- # Scheduling <div align="center"> <img src="pics/xkcd-time.png" width="600"/> <br> <tt>Credit</tt> <a href="https://xkcd.com/1205/">Randall Munroe/xkcd 1205</a> </div> --- # Scheduling scripts and processes .pull-left-wide[ ### Motivation - So far, we have automated data science pipelines. - But the execution of these pipelines still needs to be triggered. - In some cases, it is desirable to also **automate the initialization** of R scripts (or any processes for that matter) **on a regular basis**, e.g. weekly, daily, on logon, etc. - This makes particular sense when you have moving parts in your pipeline (most likely: data). ### Common scenarios for scheduling 1. You fetch data from the web on a regular basis (e.g., via scraping scripts or APIs). 2. You generate daily/weekly/monthly reports/tweets based on changing data. 3. You build an alert control system informing you about anomalies in a database. ] .pull-right-small[ <br><br><br><br> <div align="center"> <img src="pics/robot-clock-giphy.gif" width="400"/> </div> `Credit` [Simone Giertz](https://www.youtube.com/watch?v=Lh2-iJj3dI0) ] --- # Scheduling scripts and processes on Windows .pull-left-small2[ ### Scheduling options - Schedule tasks on Windows with [Windows Task Scheduler](https://en.wikipedia.org/wiki/Windows_Task_Scheduler). - Manage them via a GUI (→ Control Panel) or the command line using `schtasks.exe`. 
- The R package [`taskscheduleR`](https://cran.r-project.org/web/packages/taskscheduleR/vignettes/taskscheduleR.html) provides a programmable R interface to the WTS. <div align="center"> <img src="pics/windows-task-scheduler.png" width="400"/> </div> ] .pull-right-wide2[ ### `taskscheduleR` example ``` r R> library(taskscheduleR) R> myscript <- "examples/scrape-wiki.R" R> ## Run every 5 minutes, starting from 10:40 R> taskscheduler_create( + taskname = "WikiScraperR_5min", rscript = myscript, + schedule = "MINUTE", starttime = "10:40", modifier = 5) R> R> ## Run every week on Saturday and Sunday at 09:10 R> taskscheduler_create( + taskname = "WikiScraperR_SatSun", rscript = myscript, + schedule = "WEEKLY", starttime = "09:10", + days = c('SAT', 'SUN')) R> R> ## Delete task R> taskscheduler_delete("WikiScraperR_SatSun") R> R> ## Get a data.frame of all tasks R> tasks <- taskscheduler_ls() R> str(tasks) ``` ] --- # Scheduling scripts and processes on a Mac .footnote[<sup>1</sup>For more resources on scheduling with `launchd`, check out [this](https://babichmorrowc.github.io/post/launchd-jobs/) and [this](https://towardsdatascience.com/a-step-by-step-guide-to-scheduling-tasks-for-your-data-science-project-d7df4531fc41).] .pull-left-small3[ ### Scheduling options - On macOS you can schedule background jobs using [`cron`](https://en.wikipedia.org/wiki/Cron) and [`launchd`](https://en.wikipedia.org/wiki/Launchd). - `launchd`<sup>1</sup> was created by Apple as a replacement for the popular Linux utility `cron` ([deprecated](https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPSystemStartup/Chapters/ScheduledJobs.html) but still usable). - The R package [`cronR`](https://cran.r-project.org/web/packages/cronR/index.html) provides a programmable R interface. 
- `cron` syntax for more complex scheduling: <div align="center"> <img src="pics/cron-syntax.png" width="380"/> </div> ] .pull-right-wide3[ ### `cronR` example ``` r R> library(cronR) R> myscript <- "examples/scrape-wiki.R" R> # Create bash code for crontab to execute R script R> cmd <- cron_rscript(myscript) R> R> ## Run every minute R> cron_add(command = cmd, frequency = 'minutely', + id = 'ScraperR_1min', description = 'Every 1min') R> R> ## Run every 15 minutes (using cron syntax) R> cron_add(cmd, frequency = '*/15 * * * *', + id = 'ScraperR_15min', description = 'Every 15 mins') R> R> ## Check number of running cronR jobs R> cron_njobs() R> R> ## Delete task R> cron_rm("ScraperR_1min", ask = TRUE) ``` ] <!-- ############################################ --> --- class: inverse, center, middle name: debugging # Strategies for debugging <html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html> --- background-image: url("pics/debugging-female-welders.jpg") background-size: contain background-color: #000000 # Debugging --- # What's debugging? .pull-left[ ### Straight from [Wikipedia](https://en.wikipedia.org/wiki/Debugging) "Debugging is the process of finding and resolving bugs (defects or problems that prevent correct operation) within computer programs, software, or systems." ### A famous (yet not the first) bug: The term "bug" was used in an account by computer pioneer [Grace Hopper](https://en.wikipedia.org/wiki/Grace_Hopper) (pictured on the right). While she was working on a [Mark II](https://en.wikipedia.org/wiki/Harvard_Mark_II) computer at Harvard University, her associates discovered a moth stuck in a relay and thereby impeding operation, whereupon she remarked that they were "debugging" the system. This bug was carefully removed and taped to the log book (see on the right). 
] .pull-right-center[ <div align="center"> <br> <img src="pics/grace-hopper.jpg" width="200"/> <br> <i>Above:</i> Grace Hopper, <i>Below:</i> The bug <br> <img src="pics/computer-bug-hopper.jpeg" width="300"/> </div> ] --- # Why debugging matters .pull-left-wide[ The Wikipedia [list of software bugs](https://en.wikipedia.org/wiki/List_of_software_bugs) with significant consequences is growing and you don't want to be on it. NASA software engineers are [famous for producing bug-free code](https://www.bugsplat.com/blog/less-serious/why-nasa-code-doesnt-crash/). This was learned the hard and costly way though. Some highlights from space: - 1962: A booster went off course during launch, resulting in the [destruction of NASA Mariner 1](https://www.youtube.com/watch?v=CkOOazEJcUc). This was the result of the failure of a transcriber to notice an overbar in a handwritten specification for the guidance program, resulting in an incorrect formula in the FORTRAN code. - 1999: [NASA's Mars Climate Orbiter was destroyed](https://www.youtube.com/watch?v=lcYkOh4nweE), due to software on the ground generating commands based on parameters in pound-force (lbf) rather than newtons (N). - 2004: [NASA's Spirit rover became unresponsive](https://www.youtube.com/watch?v=7V54LRRJaGk) on January 21, 2004, a few weeks after landing on Mars. Engineers found that too many files had accumulated in the rover's flash memory. The problem could be fixed by deleting unnecessary files, and the rover lived happily ever after, until it [froze to death in 2011](https://en.wikipedia.org/wiki/Spirit_(rover)). ] .pull-right-small[ <div align="center"> <br><br> <img src="pics/nasa-coding-error.png" width="400"/> <img src="pics/nasa-coding-error-2.png" width="400"/> </div> ] --- # Why debugging matters (cont.) <div align="center"> <br> <img src="pics/excel-error-paper-2.jpeg" height="500"/> <img src="pics/excel-error-paper.png" height="500"/> </div> --- # Why debugging matters (cont.) 
.pull-left-center[ <div align="center"> <br> <img src="pics/fb-socsci1-1.png" width="550"/> <img src="pics/fb-socsci1-2.png" width="550"/> <img src="pics/fb-socsci1-3.png" width="550"/> </div> `Source` [Washington Post](https://www.washingtonpost.com/technology/2021/09/10/facebook-error-data-social-scientists/) ] .pull-right-center[ <div align="center"> <br> <img src="pics/fb-socsci1-4.png" width="450"/> <img src="pics/fb-socsci1-5.png" width="450"/> </div> `Source` [Solomon Messing / Twitter](https://twitter.com/solomonmg/status/1436742352039669760) ] --- # A general strategy for debugging .pull-left-vsmall[] .pull-right-wide[ <br> <br> <br> ## 1. Google ## 2. Reset ## 3. Debug ## 4. Deter ] --- # Google .footnote[<sup>1</sup>Do you get an error message you don't understand? That's good news actually, because the really nasty bugs come without errors. ] .pull-left-wide2[ According to [this analysis](https://github.com/noamross/zero-dependency-problems/blob/master/misc/stack-overflow-common-r-errors.md), the most common error types in R are:<sup>1</sup> 1. `Could not find function` errors, usually caused by typos or not loading a required package. 2. `Error in if` errors, caused by non-logical data or missing values passed to R's `if` conditional statement. 3. `Error in eval` errors, caused by references to objects that don't exist. 4. `Cannot open` errors, caused by attempts to read a file that doesn't exist or can't be accessed. 5. `no applicable method` errors, caused by using an object-oriented function on a data type it doesn't support. 6. `subscript out of bounds` errors, caused by trying to access an element or dimension that doesn't exist 7. Package errors caused by being unable to install, compile or load a package. ] -- .pull-right-small2[ Whenever you see an error message, start by [googling](https://lmgtfy.app/?q=Error+in+interpretative_method+%3A+%20+%20could+not+find+function+%22interpretative_method%22&iie=1) it. 
Improve your chances of a good match by removing any variable names or values that are specific to your problem. Also, look for [Stack Overflow](https://stackoverflow.com/questions/tagged/r) posts and their answers. <div align="center"> <img src="pics/google-programming.jpeg" height="250"/> </div> ] --- # Reset .pull-left[ - If at first you don't succeed, try exactly the same thing again. - Have you tried turning it off and on again? - Do you use `rm(list = ls())`? Don't. Packages remain loaded, options and environment variables remain set, ... all possible sources of error! - A fresh start clears the workspace and resets options, environment variables, and the path. - While we're at it, check out James Wade's advice ["How I set up RStudio for Efficient Coding" (YouTube)](https://www.youtube.com/watch?v=p-r-AWR3-Es). ] .pull-right[ <div align="center"> <img src="pics/restart-r-1.png" width="350"/><br> <img src="pics/restart-r-2.png" width="350"/> </div> ] --- # Debug ### Make the error repeatable. - Execute the code many times as you consider and reject hypotheses. To make that iteration as quick as possible, it’s worth some upfront investment to make the problem both easy and fast to reproduce. - Work with reproducible and minimal examples by removing innocuous code and simplifying data. - Consider automated testing. Add some nearby tests to ensure that existing good behaviour is preserved. ### Track the error down. - Execute code step by step and inspect intermediate outputs. - Adopt the scientific method: Generate hypotheses, design experiments to test them, and record your results. ### Once found, fix the error and test it. - Ensure you haven’t introduced any new bugs in the process. - Make sure to carefully record the correct output, and check against the inputs that previously failed. - Reset and run again to make sure everything still works. --- # Deter .pull-left-wide2[ ### Defensive programming - **Pay attention.** Do results make sense? 
Do they look different from previous results? Why? - **Know what you're doing**, and what you're expecting. - Avoid functions that return different types of output depending on their input, e.g., `[]` and `sapply()`. - Be strict about what you accept (e.g., only scalars). - Avoid functions that use non-standard evaluation (e.g., `with()`). - **Fail fast**. - As soon as something wrong is discovered, signal an error. - Add tests (e.g., with the `testthat` package). - Practice good condition/exception handling, e.g., with `try()` and `tryCatch()`. - Write error messages for humans. ] .pull-right-small3[ ### Transparency - Collaborate! [Pair programming](https://en.wikipedia.org/wiki/Pair_programming) is an established software development technique that increases code robustness. It also works [remotely](https://ivelasq.rbind.io/blog/vscode-live-share/). - Be transparent! Let others access your code and comment on it. <div align="center"> <img src="pics/pair-programming-its-not-for-everyone.jpeg" width="350"/> </div> ] --- # Debugging R: What you get <br> <br> <tt> Error : .onLoad failed in loadNamespace() for 'rJava', details: <br> call: dyn.load(file, DLLpath = DLLpath, ...) <br> error: unable to load shared object '/Users/janedoe/Library/R/3.6/library/rJava/libs/rJava.so': <br> libjvm.so: cannot open shared object file: No such file or directory <br> Error: loading failed <br> Execution halted <br> ERROR: loading failed <br> * removing '/Users/janedoe/Library/R/3.6/library/rJava/' <br> Warning in install.packages : <br> installation of package 'rJava' had non-zero exit status </tt> <br> `Credit` [Jenny Bryan](https://github.com/jennybc/debugging) --- # Debugging R: What you see <br> <br> <tt> <span style = "color:red">Error</span> : blah <span style = "color:red">failed</span> blah blah() blah 'blah', blah: <br> call: blah.blah(blah, blah = blah, ...) 
<br> <span style = "color:red">error</span>: <span style = "color:red">unable</span> to blah blah blah '/blah/blah/blah/blah/blah/blah/blah/blah/blah.so': <br> blah.so: <span style = "color:red">cannot</span> open blah blah blah: <span style = "color:red">No</span> blah blah blah blah <br> <span style = "color:red">Error</span>: blah <span style = "color:red">failed</span> <br> blah blah <br> <span style = "color:red">ERROR</span>: blah <span style = "color:red">failed</span> <br> * removing '/blah/blah/blah/blah/blah/blah/blah/' <br> <span style = "color:red">Warning</span> in blah.blah : <br> blah of blah 'blah' blah blah-blah blah blah </tt> <br> `Credit` [Jenny Bryan](https://github.com/jennybc/debugging) --- # Strategies to debug your R code Sometimes the mistake in your code is hard to diagnose, and googling doesn't help. Here are some strategies to debug your code: - Use `traceback()` to determine where a given error is occurring. - Output diagnostic information in code with `print()`, `cat()` or `message()` statements. - Use `browser()` to open an interactive debugger before the error. - Use `debug()` to automatically open a debugger at the start of a function call. - Use `trace()` to make temporary code modifications inside a function that you don't have easy access to. --- # Locating errors with traceback() .pull-left[ ### Motivation and usage - When an error comes with a cryptic message, or with a familiar message whose source you cannot locate, the `traceback()` function comes in handy. - The `traceback()` function prints the sequence of calls that led to an uncaught error. - The `traceback()` output reads from bottom to top. - Note that errors caught via `try()` or `tryCatch()` do not generate a traceback! - If you’re calling code that you `source()`d into R, the traceback will also display the location of the function, in the form `filename.r#linenumber`. 
] -- .pull-right[ ### Example In the call sequence below, the execution of `g()` triggers an error: ``` r R> f <- function(x) x + 1 R> g <- function(x) f(x) R> g("a") ``` ```r #> Error in x + 1 : non-numeric argument to binary operator ``` The traceback reveals that the function call `f(x)` is what led to the error: ``` r R> traceback() ``` ```r #> 2: f(x) at #1 #> 1: g("a") ``` ] --- # Interactive debugging with browser() .pull-left[ ### Motivation and usage - Sometimes, you need more information than the precise location of an error in a function to fix it. - The interactive debugger lets you pause the run of a function and interactively explore its state. - Two options to enter the interactive debugger: 1. Use RStudio's "Rerun with Debug" tool, shown to the right of an error message. 2. Insert a call to `browser()` into the function at the point where you want to pause, and re-run the function. - In either case, you’ll end up in an interactive environment inside the function where you can run arbitrary R code to explore the current state. You’ll know when you’re in the interactive debugger because you get a special prompt, `Browse[1]>`. ] -- .pull-right[ ### Example ``` r R> h <- function(x) x + 3 R> g <- function(b) { *+ browser() + h(b) + } R> g(10) ``` Some useful things to do are: 1. Use `ls()` to determine what objects are available in the current environment. 2. Use `str()`, `print()` etc. to examine the objects. 3. Use `n` to evaluate the next statement. 4. Use `s`: like `n` but also step into function calls. 5. Use `where` to print a stack trace (→ traceback). 6. Use `c` to exit the debugger and continue execution. 7. Use `Q` to exit the debugger and return to the R prompt. ] --- # Debugging other people's code .pull-left[ ### Motivation - Sometimes the error is outside your code, in a package you're using, and you still want to be able to debug it. - Two options: 1. Get a local version of the package code and debug as if it were your own. 2. 
Use functions that allow you to start a browser in existing functions, including `recover()` and `debug()`. ] .pull-right[ ] --- # Debugging other people's code (cont.) .pull-left-small[ ### Motivation - `recover()` serves as an alternative error handler which you activate by calling `options(error = recover)`. - You can then select from a list of current calls to browse. - `options(error = NULL)` turns off this debugging mode again. - A simpler alternative is `options(error = browser)`, but this only allows you to browse the call where the error occurred. ] -- .pull-right-wide[ ### Example - Activate debugging mode; then execute the (flawed) function call: ``` r *R> options(error = recover) R> lm(mpg ~ wt, data = "mtcars") ``` ```r Error in model.frame.default(formula = mpg ~ wt, data = "mtcars", drop.unused.levels = TRUE) 'data' must be a data.frame, environment, or list Enter a frame number, or 0 to exit 1: lm(mpg ~ wt, data = "mtcars") 2: eval(mf, parent.frame()) 3: eval(mf, parent.frame()) Selection: ``` - Deactivate debugging mode: ``` r *R> options(error = NULL) ``` ] --- # Debugging other people's code (cont.) .pull-left[ ### Motivation - `debug()` activates the debugger on any function, including those in packages (see the example on the right). `undebug()` deactivates the debugger again. - Some functions in another package are easier to find than others. There are - *exported* functions, which are available outside of a package, and - *internal* functions, which are only available within a package. - To find (and debug) exported functions, use the `::` syntax, as in `ggplot2::ggplot`. - To find un-exported functions, use the `:::` syntax, as in `ggplot2:::check_required_aesthetics`. 
] -- .pull-right[ ### Example - Activate debugging mode for the `lm()` function; then execute it: ``` r *R> debug(stats::lm) R> lm(mpg ~ weight, data = mtcars) ``` - Interactive debugging mode for `lm()` is entered; use the common `browser()` functionality to navigate: ```r debugging in: lm(mpg ~ weight, data = mtcars) debug: { ret.x <- x ... Browse[2]> ``` - Deactivate debugging mode: ``` r *R> undebug(stats::lm) ``` ] --- # Debugging in RStudio <div align="center"> <img src="pics/rstudio-debug-mode.png" width="1100"/> </div> --- # More on debugging R .pull-left[ <br><br><br> ### Further reading - [12-minute video](https://vimeo.com/99375765) on debugging in R - Jenny Bryan's [talk on debugging](https://github.com/jennybc/debugging) at rstudio::conf 2020 - Jenny Bryan and Jim Hester's "What They Forgot to Teach You About R", Chapter 11: [Debugging R code](https://rstats.wtf/debugging-r) - Jonathan McPherson's [Debugging with RStudio](https://support.rstudio.com/hc/en-us/articles/205612627-Debugging-with-RStudio) ] .pull-right[ <br> <div align="center"> <img src="pics/classy-bear-debugging.jpeg" width="450"/> </div> ] --- # Next steps <br> ### Assignment 1 and Quiz 1 Don't forget to submit your solutions for Assignment 1. Also, Quiz 1 is online! ### Next lecture We're going to dig into the world wide web... <b>Important:</b> The lecture is going to take place on Wednesday, 8-10am, in the Forum! If your regular lab is scheduled for that slot, please attend the lab session in the regular lecture slot instead (Mon, 10-12h). The lecture will be recorded and made available in case you cannot attend.
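
---

# Appendix: Failing fast, a sketch

The "fail fast" and condition-handling advice from the Deter slide can be illustrated in a few lines of base R. This is only a minimal sketch under stated assumptions: the helper names `standardize()` and `safe_standardize()` are hypothetical and not part of the course material, and the input check shown is just one reasonable choice.

``` r
# Hypothetical helper: standardize a numeric vector, failing fast on bad input.
standardize <- function(x) {
  if (!is.numeric(x)) {
    # Error message written for humans: what was expected, what arrived.
    stop("`x` must be numeric, not ", class(x)[1], ".", call. = FALSE)
  }
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Hypothetical wrapper: handle the error with tryCatch() instead of halting.
safe_standardize <- function(x) {
  tryCatch(
    standardize(x),
    error = function(e) {
      # Report the problem and return NULL so that a loop can continue.
      message("Skipping input: ", conditionMessage(e))
      NULL
    }
  )
}

z <- standardize(c(1, 2, 3))   # valid input: returns -1, 0, 1
res <- safe_standardize("a")   # invalid input: prints a message, returns NULL
```

Signalling the error inside `standardize()` keeps the failure close to its cause, while the wrapper decides how to recover. Which of the two behaviours is appropriate depends on whether a bad input should stop the whole script or only be skipped.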