
class: center, middle, inverse, title-slide

.title[
# Introduction to Data Science
]
.subtitle[
## Session 3: Programming II
]
.author[
### Simon Munzert
]
.institute[
### Hertie School | <a href="https://github.com/intro-to-data-science-25">GRAD-C11/E1339</a>
]

---

# Table of contents

<br>

1. [Iteration](#iteration)

2. [Automation and scripting](#automation)

3. [Scheduling](#scheduling)

4. [Debugging](#debugging)

---
class: inverse, center, middle
name: iteration

# Iteration

---
background-image: url("pics/iteration-ford-assembly-line-1913.jpg")
background-size: contain
background-color: #000000

# Iteration

---

# Iteration

### The ubiquity of iteration

- Often we have to run the same task over and over again, with minor variations. Examples:
  - Standardize values of a variable
  - Recode all numeric variables in a dataset
  - Run multiple models with varying covariate sets
- A benefit of scripting languages in data science (as opposed to point-and-click solutions) is that we can easily automate the process of iteration

### Ways to iterate

- A simple approach is to copy-and-paste code with minor modifications (→ "[duplicate code](https://en.wikipedia.org/wiki/Duplicate_code)", → "[copy-and-paste programming](https://en.wikipedia.org/wiki/Copy-and-paste_programming)"). This is lazy, error-prone, not very efficient, and violates the "[Don't repeat yourself](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)" (DRY) principle. 
- In R, [vectorization](https://adv-r.hadley.nz/perf-improve.html#vectorise), that is applying a function to every element of a vector at once, already does a good share of iteration for us.
- `for()` [loops](https://r4ds.hadley.nz/iteration.html) are intuitive and straightforward to build, but sometimes not very efficient.
- Finally, we learned about functions. Now, we learn how to unleash their power by applying them to anything we interact with in R at scale.

---
# A simple example

.pull-left[

## Task

Say we want to double each element in a numeric vector, `x = c(1, 2, 3, 4, 5)`. Here are some different approaches to achieve this:

### 1. Manually (sometimes: copying and pasting code)

``` r
R> x <- c(1, 2, 3, 4, 5)
R> x_doubled <- c(2, 4, 6, 8, 10)
```

### 2. Vectorization

``` r
R> x <- c(1, 2, 3, 4, 5)
R> x_doubled <- x * 2
R> x_doubled
```

```
   [1]  2  4  6  8 10
```
]

.pull-right[
### 3. `for()` loop

``` r
R> x <- c(1, 2, 3, 4, 5)
R> x_doubled <- numeric(length(x))
R> for (i in seq_along(x)) {
+   x_doubled[i] <- x[i] * 2
+ }
R> x_doubled
```

```
   [1]  2  4  6  8 10
```

### 4. Using `purrr`

``` r
R> library(purrr)
R> x <- c(1, 2, 3, 4, 5)
R> x_doubled <- map_dbl(x, ~ .x * 2)
R> x_doubled
```

```
   [1]  2  4  6  8 10
```
]

---

# Iteration with purrr

.pull-left-wide[

### The tidyverse way to iterate

- For *real* functional programming in base R, we can use the `*apply()` family of functions (`lapply()`, `sapply()`, etc.). See [here](https://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/) for an excellent summary.
- In the tidyverse, this functionality comes with the `purrr` package.
- At its core is the `map*()` family of functions.

### How `purrr` works

- The idea is always to **apply** a function to **x**, where x can be a list, vector, data.frame, or something more complex.
- The result is then returned as an object of a pre-defined type (e.g., a list).
]

.pull-right-small-center[
<div align="center">
<br>
<img src="pics/purrr.png" height=250>
</div>
]
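For comparison with the base R route mentioned above, here is a minimal sketch of three `*apply()` variants on a toy list. The list `x` is made up for illustration; the point is only how the return types differ.

```r
# Toy input: a named list of integer vectors (illustrative only)
x <- list(a = 1:3, b = 4:6)

# lapply() always returns a list
sums_list <- lapply(x, sum)

# sapply() simplifies the result when it can (here: a named integer vector)
sums_simplified <- sapply(x, sum)

# vapply() additionally checks that every result is a single integer
sums_checked <- vapply(x, sum, FUN.VALUE = integer(1))
```

`vapply()` is the strictest of the three: it stops with an error if a result does not match `FUN.VALUE`, which is close in spirit to the typed `map_*()` functions in `purrr`.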

---

# Iteration with purrr: map()

The `map*()` functions all follow a similar syntax: `map(.x, .f, ...)`.

We use them to apply a function `.f` to each piece in `.x`. Additional arguments to `.f` can be passed on in `...`.

For instance, if we want to identify the object class of every column of a data.frame, we can write:

``` r
R> map(starwars, class)
```

```
   $name
   [1] "character"
   
   $height
   [1] "integer"
   
   $mass
   [1] "numeric"
   
   $hair_color
   [1] "character"
   
   $skin_color
   [1] "character"
   
   $eye_color
   [1] "character"
   
   $birth_year
   [1] "numeric"
   
   $sex
   [1] "character"
   
   $gender
   [1] "character"
   
   $homeworld
   [1] "character"
   
   $species
   [1] "character"
   
   $films
   [1] "list"
   
   $vehicles
   [1] "list"
   
   $starships
   [1] "list"
```
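To illustrate how additional arguments travel through `...`, here is a small sketch on made-up data (the list `measurements` is an assumption for illustration). `na.rm = TRUE` is not an argument of `map()` itself; it is handed on to `mean()`.

```r
library(purrr)

# Toy input with a missing value (illustrative only)
measurements <- list(c(1, NA, 3), c(4, 5))

# na.rm = TRUE is passed on to mean() via ...
means <- map(measurements, mean, na.rm = TRUE)

# Equivalent call with an explicit anonymous function
means_anon <- map(measurements, function(v) mean(v, na.rm = TRUE))
```

Both calls return the same list. The anonymous-function form is a bit longer, but it makes visible which function receives which argument.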

---

# Iteration with purrr: map() *cont.*

By default, `map()` returns a list. But we can also use other `map*()` functions to give us an atomic vector of an indicated type (e.g., `map_int()` to return an integer vector, or `map_vec()` to return a vector that is the simplest common type).

Going back to the previous example, we can also use `map_chr()`, which returns a character vector:

``` r
R> map_chr(starwars, class)
```

```
          name      height        mass  hair_color  skin_color   eye_color 
   "character"   "integer"   "numeric" "character" "character" "character" 
    birth_year         sex      gender   homeworld     species       films 
     "numeric" "character" "character" "character" "character"      "list" 
      vehicles   starships 
        "list"      "list"
```
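A short sketch of two more typed variants, using a made-up list `words` (an assumption for illustration). It also shows what happens when the function returns a type that does not match the chosen variant.

```r
library(purrr)

# Toy input (illustrative only)
words <- list(a = "iteration", b = "automation", c = "debugging")

n_chars <- map_int(words, nchar)            # named integer vector
is_long <- map_lgl(words, ~ nchar(.x) > 9)  # named logical vector

# toupper() returns character, so map_int() signals an error instead of
# silently converting; we catch it here only to keep the script running
type_error <- tryCatch(map_int(words, toupper), error = function(e) "error")
```

The strictness is the feature: a typed `map_*()` call either returns the type you asked for or fails.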

The `purrr` function set is quite comprehensive. Be sure to check out the [cheat sheet](https://github.com/rstudio/cheatsheets/blob/master/purrr.pdf) and the [tutorials](https://jennybc.github.io/purrr-tutorial/index.html). You'll survive without `purrr` but you probably don't want to live without it. Together with `dplyr` it's easily the most powerful package for data wrangling in the tidyverse. If you master it, it will save you a lot of time and headaches.

---
# Another example

.pull-left[
## Task

Let's say we want to calculate the mean and standard deviation of height and mass for each species in the `starwars` dataset.

``` r
R> # Load the tidyverse (dplyr, tidyr, purrr) and the starwars dataset
R> library(tidyverse)
R> data(starwars)
R> 
R> # Custom function for calculations
R> calc_stats <- function(df) {
+   df %>%
+     summarise(
+       height_mean = mean(height, na.rm = TRUE),
+       height_sd = sd(height, na.rm = TRUE),
+       mass_mean = mean(mass, na.rm = TRUE),
+       mass_sd = sd(mass, na.rm = TRUE)
+     )
+ }
```
]

.pull-right[

``` r
R> # Group by species and apply the custom function
R> species_stats <- starwars %>%
+   group_by(species) %>%
+   nest() # Nesting the data
R> species_stats
```

```
   # A tibble: 38 × 2
   # Groups:   species [38]
      species        data              
      <chr>          <list>            
    1 Human          <tibble [35 × 13]>
    2 Droid          <tibble [6 × 13]> 
    3 Wookiee        <tibble [2 × 13]> 
    4 Rodian         <tibble [1 × 13]> 
    5 Hutt           <tibble [1 × 13]> 
    6 <NA>           <tibble [4 × 13]> 
    7 Yoda's species <tibble [1 × 13]> 
    8 Trandoshan     <tibble [1 × 13]> 
    9 Mon Calamari   <tibble [1 × 13]> 
   10 Ewok           <tibble [1 × 13]> 
   # ℹ 28 more rows
```
]

---
# Another example

.pull-left[
## Task

Let's say we want to calculate the mean and standard deviation of height and mass for each species in the `starwars` dataset.
]

.pull-right[

``` r
R> # Group by species and apply the custom function
R> species_stats <- starwars %>%
+   group_by(species) %>%
+   nest() %>% # Nesting the data
+   # purrr magic
+   mutate(stats = map(data, calc_stats))
R> species_stats
```

```
   # A tibble: 38 × 3
   # Groups:   species [38]
      species        data               stats           
      <chr>          <list>             <list>          
    1 Human          <tibble [35 × 13]> <tibble [1 × 4]>
    2 Droid          <tibble [6 × 13]>  <tibble [1 × 4]>
    3 Wookiee        <tibble [2 × 13]>  <tibble [1 × 4]>
    4 Rodian         <tibble [1 × 13]>  <tibble [1 × 4]>
    5 Hutt           <tibble [1 × 13]>  <tibble [1 × 4]>
    6 <NA>           <tibble [4 × 13]>  <tibble [1 × 4]>
    7 Yoda's species <tibble [1 × 13]>  <tibble [1 × 4]>
    8 Trandoshan     <tibble [1 × 13]>  <tibble [1 × 4]>
    9 Mon Calamari   <tibble [1 × 13]>  <tibble [1 × 4]>
   10 Ewok           <tibble [1 × 13]>  <tibble [1 × 4]>
   # ℹ 28 more rows
```
]

---
# Another example

.pull-left[
## Task

Let's say we want to calculate the mean and standard deviation of height and mass for each species in the `starwars` dataset.
]

.pull-right[

```
   # A tibble: 38 × 5
   # Groups:   species [38]
      species        height_mean height_sd mass_mean mass_sd
      <chr>                <dbl>     <dbl>     <dbl>   <dbl>
    1 Human                 178      12.0       81.3    19.3
    2 Droid                 131.     49.1       69.8    51.0
    3 Wookiee               231       4.24     124      17.0
    4 Rodian                173      NA         74      NA  
    5 Hutt                  175      NA       1358      NA  
    6 <NA>                  175      12.4       81      31.2
    7 Yoda's species         66      NA         17      NA  
    8 Trandoshan            190      NA        113      NA  
    9 Mon Calamari          180      NA         83      NA  
   10 Ewok                   88      NA         20      NA  
   # ℹ 28 more rows
```
]
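The output above has one row per species with the four statistics as regular columns. The call that produced it is not shown on the slide; one way to get there, sketched under the assumption that the list column `stats` is unpacked with `tidyr::unnest()`, is:

```r
library(dplyr)
library(tidyr)
library(purrr)

# Same helper as on the earlier slide: one summary row per data frame
calc_stats <- function(df) {
  df %>%
    summarise(
      height_mean = mean(height, na.rm = TRUE),
      height_sd = sd(height, na.rm = TRUE),
      mass_mean = mean(mass, na.rm = TRUE),
      mass_sd = sd(mass, na.rm = TRUE)
    )
}

species_stats <- starwars %>%
  group_by(species) %>%
  nest() %>%                                # one row per species, data in a list column
  mutate(stats = map(data, calc_stats)) %>% # apply the helper to every nested tibble
  select(species, stats) %>%                # drop the nested raw data
  unnest(stats)                             # turn the list column into regular columns
```

Species with a single observation get `NA` for the standard deviations, because `sd()` is undefined for one value.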

---
class: inverse, center, middle
name: automation

# Automation and scripting
<html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html>

---
background-image: url("pics/automation-press-room.jpg")
background-size: contain
background-color: #000000

# Automation

---
# Automation

<div align="center">
<img src="pics/xkcd-automation.png" width="500"/>
<br>
<tt>Credit</tt> <a href="https://xkcd.com/1319/">Randall Munroe/xkcd 1319</a>
</div>

---
# Automation

.pull-left-wide2[
### Motivation

- We spend [too much time](https://itchronicles.com/technology/repetitive-tasks-cost-5-trillion-annually/) on repetitive tasks.
- We're already automating using scripts that bundle multiple commands! Next step: The pipeline as a series of scripts and commands.
- Good pipelines are modular. But you don't want to trigger 10 scripts sequentially by hand.
- Some tasks are to be repeated on a regular basis (schedule).

### When automation makes sense

- The input is variable but the process of turning input into output is highly standardized.
- You use a diverse set of software to produce the output.
- Others (humans, machines) are supposed to run the analyses.
- Time saved by automation >> Time needed to automate.
]

.pull-right-small2[
### Different ways of doing it

We will consider automation

- using **R**,
- using the **Shell** and **RScript**,
- using **make**, and
- using dedicated **scheduling tools**.

<div align="center">
<img src="pics/automation-giphy.gif" width="400"/>
</div>
]

---
# Thinking in pipelines

.pull-left[
### Key characteristics
- Pipelines make complex projects easier to handle because they break up a monolithic script into **discrete, manageable chunks**.
- If properly done, each stage of the pipeline defines its input and its outputs.
- Pipeline modules **do not modify their inputs** (*idempotence*). Rerunning one module produces the same results as the previous run.

### Key advantages
- When you modify one stage of the pipeline, you only have to rerun the downstream, dependent stages.
- Division of labor is straightforward.
- Modules tend to be a lot easier to debug.
]

.pull-right[
<br>
<div align="center">
<img src="pics/berlin-pink-pipes.jpeg" width="450"/>
</div>
]
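As a rough sketch of what such a module can look like in R, the hypothetical stage below declares its input and output as arguments, reads one file, writes another, and leaves the input untouched. The function name `clean_stage()`, the column `value`, and the toy data are assumptions for illustration.

```r
# A hypothetical pipeline stage: input and output are explicit arguments
clean_stage <- function(input_path, output_path) {
  raw <- read.csv(input_path)
  clean <- raw[!is.na(raw$value), , drop = FALSE]  # drop rows with missing values
  write.csv(clean, output_path, row.names = FALSE)
  invisible(output_path)
}

# Toy input written to a temporary file (illustrative only)
input_path <- tempfile(fileext = ".csv")
output_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(10, NA, 30)),
          input_path, row.names = FALSE)

input_before <- readLines(input_path)

clean_stage(input_path, output_path)
first_run <- readLines(output_path)

clean_stage(input_path, output_path)  # rerun with the same input
second_run <- readLines(output_path)
```

Because the stage only reads `input_path` and only writes `output_path`, rerunning it gives the same output file and the input stays as it was.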

---
# A data science pipeline is a graph

.pull-left-small2[
### Wait what
- Scripts and data files are vertices of the graph.
- Dependencies between stages are edges of the graph.
- Pipelines are not necessarily DAGs (directed acyclic graphs). Recursive routines are imaginable (but to be avoided?).
- Also, scripts are not necessarily hierarchical (e.g., multiple different modeling approaches of the same data in different scripts).
- An automation script gives *one* order in which you can successfully run the pipeline.
]

.pull-right-wide2[
<br>
<div align="center">
<img src="pics/lotr-pipeline.png" width="600"/>
</div>
]
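To make the graph idea concrete, here is a minimal sketch that stores dependencies in a named list and derives one valid run order for a target. The file names are those of the toy pipeline used in this session; the helper `run_order()` is hypothetical and assumes the graph has no cycles.

```r
# Each element lists the vertices that the named vertex depends on
deps <- list(
  "lotr_raw.tsv"               = "01-download-data.R",
  "lotr_clean.tsv"             = c("lotr_raw.tsv", "02-process-data.R"),
  "barchart-words-by-race.png" = c("lotr_clean.tsv", "03-plot.R")
)

# Depth-first walk: list all prerequisites of a target before the target itself
# (no cycle detection, so this assumes a DAG)
run_order <- function(target, deps, visited = character(0)) {
  for (d in deps[[target]]) {  # deps[[target]] is NULL for vertices without prerequisites
    if (!d %in% visited) visited <- run_order(d, deps, visited)
  }
  c(visited, target)
}

order_for_plot <- run_order("barchart-words-by-race.png", deps)
```

The result is one order in which scripts and files have to exist, which is the same information a hand-written automation script encodes implicitly.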

---
# An example pipeline

.pull-left-small[
In the following, we will work with this toy pipeline:<sup>1</sup>

.footnote[<sup>1</sup>Courtesy of [Jenny Bryan](https://github.com/STAT545-UBC/STAT545-UBC-original-website).]
]

.pull-right-wide[
]

---
# An example pipeline

.pull-left-small[
In the following, we will work with this toy pipeline:

- `00-packages.R` loads the packages necessary for analysis,
]

.pull-right-wide[
`00-packages.R`:

``` r
R> # install packages from CRAN
R> p_needed <- c("tidyverse" # tidyverse packages
+ )
R> packages <- rownames(installed.packages())
R> p_to_install <- p_needed[!(p_needed %in% packages)]
R> if (length(p_to_install) > 0) {
+   install.packages(p_to_install)
+ }
R> lapply(p_needed, require, character.only = TRUE)
```
]

---
# An example pipeline

.pull-left-small[
In the following, we will work with this toy pipeline:

- `00-packages.R` loads the packages necessary for analysis,
- `01-download-data.R` downloads a spreadsheet, which is stored as `lotr_raw.tsv`,
]

.pull-right-wide[
`01-download-data.R`:

``` r
R> ## download raw data
R> download.file(url = "http://bit.ly/lotr_raw-tsv", 
+                destfile = "lotr_raw.tsv")
```
]

---
# An example pipeline

.pull-left-small[
In the following, we will work with this toy pipeline:

- `00-packages.R` loads the packages necessary for analysis,
- `01-download-data.R` downloads a spreadsheet, which is stored as `lotr_raw.tsv`,
- `02-process-data.R` imports and processes the data and exports a clean spreadsheet as `lotr_clean.tsv`, and
]

.pull-right-wide[
`02-process-data.R`:

``` r
R> ## import raw data
R> lotr_dat <- read_tsv("lotr_raw.tsv")
R> 
R> ## reorder Film factor levels based on story
R> old_levels <- levels(as.factor(lotr_dat$Film))
R> j_order <- sapply(c("Fellowship", "Towers", "Return"),
+ 					function(x) grep(x, old_levels))
R> new_levels <- old_levels[j_order]
R> 
R> ## process data set 
R> lotr_dat <- lotr_dat %>%
+   # apply new factor levels to Film
+   mutate(Film = factor(as.character(Film), new_levels),
+          # revalue Race
+          Race = recode(Race, `Ainur` = "Wizard", `Men` = "Man"))
R> ## <skipping some steps here to avoid slide overflow>
R> 
R> ## write data to file
R> write_tsv(lotr_dat, "lotr_clean.tsv")
```
]

---
# An example pipeline

.pull-left-small[
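In the following, we will work with this toy pipeline:

- `00-packages.R` loads the packages necessary for analysis,
- `01-download-data.R` downloads a spreadsheet, which is stored as `lotr_raw.tsv`,
- `02-process-data.R` imports and processes the data and exports a clean spreadsheet as `lotr_clean.tsv`, and
- `03-plot.R` imports the clean data and saves a bar chart as `barchart-words-by-race.png`.
]

.pull-right-wide[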
`03-plot.R`:

``` r
R> ## import clean data
R> lotr_dat <- read_tsv("lotr_clean.tsv") %>% 
+ # reorder Race based on words spoken
+ mutate(Race = reorder(Race, Words, sum))
R> 
R> ## make a plot
R> p <- ggplot(lotr_dat, aes(x = Race, weight = Words)) + geom_bar()
R> ggsave("barchart-words-by-race.png", p)
```
]

---
# An example pipeline

``` r
R> slice_sample(lotr_dat, n = 10)
```

```
   # A tibble: 10 × 5
      Film                       Chapter                      Character Race  Words
      <chr>                      <chr>                        <chr>     <chr> <dbl>
    1 The Return Of The King     64: The Mouth Of Sauron      Aragorn   Man      23
    2 The Fellowship Of The Ring 36: The Bridge Of Khazad-dûm Frodo     Hobb…     4
    3 The Two Towers             36: Isengard Unleashed       Saruman   Wiza…    50
    4 The Fellowship Of The Ring 42: The Great River          Sam       Hobb…    37
    5 The Return Of The King     42: Breaking The Gate Of Go… Gandalf   Wiza…    21
    6 The Two Towers             45: The Glittering Caves     Legolas   Elf      36
    7 The Two Towers             35: Helm's Deep              Rohan Wa… Man      22
    8 The Fellowship Of The Ring 33: Moria                    Aragorn   Man      31
    9 The Fellowship Of The Ring 43: Parth Galen              Aragorn   Man      79
   10 The Return Of The King     24: Courage Is The Best Def… Gothmog   Orc       4
```

---
# An example pipeline

``` r
R> p <- ggplot(lotr_dat, aes(x = Race, weight = Words)) + 
+   geom_bar() + theme_minimal()
```

---
# Automation using pipelines in R

.pull-left[
### Motivation and usage
- The `source()` function reads and parses R code from a file or connection. 
- We can build a pipeline by sourcing scripts sequentially.
- This pipeline is usually stored in a "master/main" script.
- The removal of previous work is optional and maybe redundant. Often the data is overwritten by default.
- It is recommended that the individual scripts are (partially) standalone, i.e., that they import all the data they need themselves (loading the packages could be considered an exception).
- Note that as long as the environment is not reset, it remains intact across scripts, which is a potential source of error and confusion.
]

.pull-right[
### Example

The master script `master.R`:

``` r
R> ## clean out any previous work
R> outputs <- c("lotr_raw.tsv",
+              "lotr_clean.tsv",
+              list.files(pattern = "\\.png$"))
R> file.remove(outputs[file.exists(outputs)])
R> 
R> ## run scripts
R> source("00-packages.R")
R> source("01-download-data.R")
R> source("02-process-data.R")
R> source("03-plot.R")
```
]
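One way to limit the shared-environment problem described above is to evaluate each script in its own environment via the `local` argument of `source()`. The sketch below uses a temporary one-line script as a stand-in for a pipeline stage; the names `stage_env` and `tmp_result` are made up for illustration.

```r
# Stand-in for a pipeline script, written to a temporary file
script <- tempfile(fileext = ".R")
writeLines("tmp_result <- 1 + 1", script)

# Evaluate the script in its own environment instead of the global one
stage_env <- new.env()
source(script, local = stage_env)

# The object created by the script lives in stage_env
created_objects <- ls(stage_env)
result <- get("tmp_result", envir = stage_env)
```

This keeps objects from one stage out of the next, but loaded packages and options are still shared across scripts within the same R session.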

---
# Automation using the Shell and Rscript

.pull-left[
### Motivation and usage
- As an alternative to an R master script, we can also run the pipeline from the command line.
- Note that here, the environments don't carry over across `Rscript` calls. The scripts definitely have to run in a standalone fashion (i.e., load packages, import all necessary data, etc.).
- The working directory should be set either in the script(s) or in the shell with `cd`.
]

.pull-right[
### Example

The master script `master.sh`:

```bash
#!/bin/sh
set -eux
cd /Users/simonmunzert/github/examples/02-automation-shell-rscript
Rscript 01-download-data.R
Rscript 02-process-data.R
Rscript 03-plot.R
```

The `set` command allows us to adjust some base shell parameters:
- `-e`: Stop at first error
- `-u`: Undefined variables are an error
- `-x`: Print each command as it is run

For more information on `set`, see [here](http://linuxcommand.org/lc3_man_pages/seth.html).
]
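Since every `Rscript` call starts from a clean session, a script can only rely on what it loads or is told. As a hedged sketch (not part of the original pipeline), a standalone script could take its file names from the command line with `commandArgs()` and fall back to the names used in this session:

```r
# Read optional command-line arguments, e.g. from
#   Rscript 02-process-data.R lotr_raw.tsv lotr_clean.tsv
args <- commandArgs(trailingOnly = TRUE)

# Fall back to the file names used in this pipeline if none are given
input_file  <- if (length(args) >= 1) args[1] else "lotr_raw.tsv"
output_file <- if (length(args) >= 2) args[2] else "lotr_clean.tsv"

message("Reading ", input_file, ", writing ", output_file)
```

The same script then works unchanged when called from the shell, from Make, or from a scheduler.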

---
# Automation using the Shell and Rscript

.pull-right[
### Example

The master script `master.sh`:

```bash
#!/bin/sh
set -eux
cd /Users/simonmunzert/github/examples/02-automation-shell-rscript
*curl -L http://bit.ly/lotr_raw-tsv > lotr_raw.tsv
Rscript 02-process-data.R
Rscript 03-plot.R
```

The `set` command allows us to adjust some base shell parameters:
- `-e`: Stop at first error
- `-u`: Undefined variables are an error
- `-x`: Print each command as it is run

For more information on `set`, see [here](http://linuxcommand.org/lc3_man_pages/seth.html).
]

---
# Automation using Make

.pull-left-vwide[
### Motivation and usage
- [Make](https://en.wikipedia.org/wiki/Make_%28software%29) is an automation tool that allows us to specify and manage build processes.
- It is commonly run via the shell.
- At the heart of a make operation is the `makefile` (or `Makefile`, `GNUmakefile`), a script which serves as a recipe for the building process.
- A `makefile` is written following a particular syntax and in a declarative fashion.
- Conceptually, the recipe describes which files are built how and using what input.

### Advantages of Make
- It looks at which files you have and automatically figures out how to create the files that you want. For complex pipelines this "automation of the automation process" can be very helpful.
- While shell scripts give *one* order in which you can successfully run the pipeline, Make will figure out the parts of the pipeline (and their order) that are needed to build a desired target.
]

.pull-right-vsmall[
<div align="center">
<br><br><br><br>
<img src="pics/gnu-make.png" width="250"/>
</div>
]

---
# Automation using Make (cont.)

.pull-left-small[
### Basic syntax

Each batch of lines indicates 
- a file to be created (the target),
- the files it depends on (the prerequisites), and 
- the set of commands needed to construct the target from the dependent files.

Dependencies propagate.
- To create any of the `png` figures, we need `lotr_clean.tsv`.
- If this file changes, the `png`s change as well when they're built.
]

.pull-right-wide[

### Example `makefile`

```bash
all: lotr_clean.tsv barchart-words-by-race.png words-histogram.png

lotr_raw.tsv:
	curl -L http://bit.ly/lotr_raw-tsv > lotr_raw.tsv

lotr_clean.tsv: lotr_raw.tsv 02-process-data.R
	Rscript 02-process-data.R

barchart-words-by-race.png: lotr_clean.tsv 03-plot.R
	Rscript 03-plot.R

words-histogram.png: lotr_clean.tsv
	Rscript -e 'library(ggplot2); qplot(Words, data = read.delim("$<"), geom = "histogram"); ggsave("$@")'
	rm -f Rplots.pdf

clean:
	rm -f lotr_raw.tsv lotr_clean.tsv *.png

```
]
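The rule Make applies to decide whether a target has to be rebuilt can be sketched in a few lines of R: a target is out of date if it does not exist or if any prerequisite was modified more recently. The helper `needs_rebuild()` and the temporary files are assumptions for illustration; this is a simplified picture of the timestamp rule, not how Make is implemented.

```r
# TRUE if the target is missing or older than any of its prerequisites
needs_rebuild <- function(target, prerequisites) {
  if (!file.exists(target)) return(TRUE)
  any(file.mtime(prerequisites) > file.mtime(target))
}

# Toy files in a temporary directory (illustrative only)
demo_dir <- tempfile("make-demo-")
dir.create(demo_dir)
raw_file   <- file.path(demo_dir, "lotr_raw.tsv")
clean_file <- file.path(demo_dir, "lotr_clean.tsv")

file.create(raw_file)
missing_target <- needs_rebuild(clean_file, raw_file)  # target does not exist yet

file.create(clean_file)
now <- Sys.time()
Sys.setFileTime(raw_file, now - 60)                    # prerequisite is older
Sys.setFileTime(clean_file, now)
up_to_date <- needs_rebuild(clean_file, raw_file)

Sys.setFileTime(raw_file, now + 60)                    # prerequisite changed afterwards
stale_target <- needs_rebuild(clean_file, raw_file)
```

Applied recursively along the dependency graph, this check is why changing `lotr_clean.tsv` causes the figures to be rebuilt while an untouched `lotr_raw.tsv` is not downloaded again.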

---
# Automation using Make (cont.)

.pull-left-small[

### Getting Make to run

- Using the command line, go into the directory for your project.
- Create the `Makefile` file.
- The most basic Make commands are `make all` and `make clean`, which build (or delete) all outputs as specified in the `makefile`.
]

.pull-right-wide[

### Example `makefile`

```bash
*all: lotr_clean.tsv barchart-words-by-race.png words-histogram.png

lotr_raw.tsv:
	curl -L http://bit.ly/lotr_raw-tsv > lotr_raw.tsv

lotr_clean.tsv: lotr_raw.tsv 02-process-data.R
	Rscript 02-process-data.R

barchart-words-by-race.png: lotr_clean.tsv 03-plot.R
	Rscript 03-plot.R

words-histogram.png: lotr_clean.tsv
	Rscript -e 'library(ggplot2); qplot(Words, data = read.delim("$<"), geom = "histogram"); ggsave("$@")'
	rm -f Rplots.pdf

*clean:
*   rm -f lotr_raw.tsv lotr_clean.tsv *.png
```
]

---
# Automation using Make - FAQ

### Does it work on Windows?

To install and run `make` on Windows, check out [these instructions](https://stat545.com/make-windows.html).

### Where can I learn more?

If you consider working with Make, check out the [official manual](https://www.gnu.org/software/make/manual/make.html), [this helpful tutorial](https://makefiletutorial.com/), Karl Broman's [excellent minimal make introduction](https://kbroman.org/minimal_make/), or [this Stat545 piece](https://stat545.com/automation-overview.html).

.pull-left-vwide[
### This is dusty technology. Are there alternatives?

In the context of data science with R, the `targets` package is an interesting option. It provides R functionality to define a Make-style pipeline. Check out the [overview](https://docs.ropensci.org/targets/) and [manual](https://books.ropensci.org/targets/).
]

.pull-right-vsmall[
<div align="center">
<br>
<img src="pics/targets-hex.png" width="150"/>
</div>
]
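For a first impression, here is a hedged sketch of how the toy pipeline might be declared with `targets`. The file `_targets.R` and the target names (`raw_file`, `lotr_raw`, `lotr_clean`, `words_plot`) are assumptions for illustration and not part of the original example; the cleaning step is reduced to the `recode()` call from `02-process-data.R`.

```r
# _targets.R (hypothetical pipeline definition for the toy example)
library(targets)

# Packages that the target commands need
tar_option_set(packages = c("readr", "dplyr", "ggplot2"))

list(
  # Track the raw file itself, so that changes to it invalidate downstream targets
  tar_target(raw_file, "lotr_raw.tsv", format = "file"),
  # Import the raw data
  tar_target(lotr_raw, read_tsv(raw_file)),
  # Simplified cleaning step
  tar_target(
    lotr_clean,
    mutate(lotr_raw, Race = recode(Race, `Ainur` = "Wizard", `Men` = "Man"))
  ),
  # Plot object built from the clean data
  tar_target(
    words_plot,
    ggplot(lotr_clean, aes(x = Race, weight = Words)) + geom_bar()
  )
)
```

With such a file in the project directory, calling `targets::tar_make()` runs the targets that are outdated and skips those that are still up to date.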

---
class: inverse, center, middle
name: scheduling

# Scheduling
<html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html>

---
background-image: url("pics/scheduling-clocks-louis-delphin.jpg")
background-size: contain
background-color: #000000

# Scheduling

---
# Scheduling

<div align="center">
<img src="pics/xkcd-time.png" width="600"/>
<br>
<tt>Credit</tt> <a href="https://xkcd.com/1205/">Randall Munroe/xkcd 1205</a>
</div>

---
# Scheduling scripts and processes

.pull-left-wide[
### Motivation
- So far, we have automated data science pipelines.
- But the execution of these pipelines still needs to be triggered.
- In some cases, it is desirable to also **automate the initialization** of R scripts (or any processes for that matter) **on a regular basis**, e.g. weekly, daily, on logon, etc.
- This makes particular sense when you have moving parts in your pipeline (most likely: data).

### Common scenarios for scheduling

1. You fetch data from the web on a regular basis (e.g., via scraping scripts or APIs).
2. You generate daily/weekly/monthly reports/tweets based on changing data.
3. You build an alert control system informing you about anomalies in a database.

]

.pull-right-small[
<br><br><br><br>
<div align="center">
<img src="pics/robot-clock-giphy.gif" width="400"/>
</div>
`Credit` [Simone Giertz](https://www.youtube.com/watch?v=Lh2-iJj3dI0)
]

---
# Scheduling scripts and processes on Windows

.pull-left-small2[
### Scheduling options
- Schedule tasks on Windows with [Windows Task Scheduler](https://en.wikipedia.org/wiki/Windows_Task_Scheduler).
- Manage them via a GUI (→ Control Panel) or the command line using `schtasks.exe`.
- The R package [`taskscheduleR`](https://cran.r-project.org/web/packages/taskscheduleR/vignettes/taskscheduleR.html) provides a programmable R interface to the WTS.

<div align="center">
<img src="pics/windows-task-scheduler.png" width="400"/>
</div>
]

.pull-right-wide2[
### `taskscheduleR` example

``` r
R> library(taskscheduleR)
R> myscript <- "examples/scrape-wiki.R"
R> ## Run every 5 minutes, starting from 10:40
R> taskscheduler_create(
+   taskname = "WikiScraperR_5min", rscript = myscript,
+   schedule = "MINUTE", starttime = "10:40", modifier = 5)
R> 
R> ## Run every week on Saturday and Sunday at 09:10
R> taskscheduler_create(
+   taskname = "WikiScraperR_SatSun", rscript = myscript, 
+   schedule = "WEEKLY", starttime = "09:10", 
+   days = c('SAT', 'SUN'))
R> 
R> ## Delete task
R> taskscheduler_delete("WikiScraperR_SatSun")
R> 
R> ## Get a data.frame of all tasks
R> tasks <- taskscheduler_ls()
R> str(tasks)
```
]

---
# Scheduling scripts and processes on a Mac

.footnote[<sup>1</sup>For more resources on scheduling with `launchd`, check out [this](https://babichmorrowc.github.io/post/launchd-jobs/) and [this](https://towardsdatascience.com/a-step-by-step-guide-to-scheduling-tasks-for-your-data-science-project-d7df4531fc41).]

.pull-left-small3[
### Scheduling options
- On macOS you can schedule background jobs using [`cron`](https://en.wikipedia.org/wiki/Cron) and [`launchd`](https://en.wikipedia.org/wiki/Launchd).
- `launchd`<sup>1</sup> was created by Apple as a replacement for the popular Linux utility `cron` (which is [deprecated](https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPSystemStartup/Chapters/ScheduledJobs.html) on macOS but still usable).
- The R package [`cronR`](https://cran.r-project.org/web/packages/cronR/index.html) provides a programmable R interface.
- `cron` syntax for more complex scheduling:

<div align="center">
<img src="pics/cron-syntax.png" width="380"/>
</div>
]

.pull-right-wide3[
### `cronR` example

``` r
R> library(cronR)
R> myscript <- "examples/scrape-wiki.R"
R> # Create bash code for crontab to execute R script
R> cmd <- cron_rscript(myscript)
R> 
R> ## Run every minute
R> cron_add(command = cmd, frequency = 'minutely',
+           id = 'ScraperR_1min', description = 'Every 1min')
R> 
R> ## Run every 15 minutes (using cron syntax)
R> cron_add(cmd, frequency = '*/15 * * * *', 
+           id = 'ScraperR_15min', description = 'Every 15 mins') 
R> 
R> ## Check number of running cronR jobs
R> cron_njobs()
R> 
R> ## Delete task
R> cron_rm("ScraperR_1min", ask = TRUE)
```
]

---
class: inverse, center, middle
name: debugging

# Strategies for debugging
<html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html>

---
background-image: url("pics/debugging-female-welders.jpg")
background-size: contain
background-color: #000000

# Debugging

---
# What's debugging?

.pull-left[
### Straight from the [Wikipedia](https://en.wikipedia.org/wiki/Debugging)

"Debugging is the process of finding and resolving bugs (defects or problems that prevent correct operation) within computer programs, software, or systems."

### A famous (yet not the first) bug:

The term "bug" was used in an account by computer pioneer [Grace Hopper](https://en.wikipedia.org/wiki/Grace_Hopper) (pictured on the right). While she was working on a [Mark II](https://en.wikipedia.org/wiki/Harvard_Mark_II) computer at Harvard University, her associates discovered a moth stuck in a relay, impeding operation, whereupon she remarked that they were "debugging" the system. The bug was carefully removed and taped to the log book (also shown on the right).
]

.pull-right-center[
<div align="center">
<br>
<img src="pics/grace-hopper.jpg" width="200"/>
<br>
<i>Above:</i> Grace Hopper, <i>Below:</i> The bug
<br>
<img src="pics/computer-bug-hopper.jpeg" width="300"/> 
</div>
]

---
# Why debugging matters

.pull-left-wide[
The Wikipedia [list of software bugs](https://en.wikipedia.org/wiki/List_of_software_bugs) with significant consequences is growing and you don't want to be on it.

NASA software engineers are [famous for producing bug-free code](https://www.bugsplat.com/blog/less-serious/why-nasa-code-doesnt-crash/). This was learned the hard and costly way though. Some highlights from space:

- 1962: A booster went off course during launch, resulting in the [destruction of NASA Mariner 1](https://www.youtube.com/watch?v=CkOOazEJcUc). This was the result of the failure of a transcriber to notice an overbar in a handwritten specification for the guidance program, resulting in an incorrect formula in the FORTRAN code.
- 1999: [NASA's Mars Climate Orbiter was destroyed](https://www.youtube.com/watch?v=lcYkOh4nweE), due to software on the ground generating commands based on parameters in pound-force (lbf) rather than newtons (N).
- 2004: [NASA's Spirit rover became unresponsive](https://www.youtube.com/watch?v=7V54LRRJaGk) on January 21, 2004, a few weeks after landing on Mars. Engineers found that too many files had accumulated in the rover's flash memory. The problem could be fixed by deleting unnecessary files, and the rover lived happily ever after, until it [froze to death in 2011](https://en.wikipedia.org/wiki/Spirit_(rover)).
]

.pull-right-small[
<div align="center">
<br><br>
<img src="pics/nasa-coding-error.png" width="400"/>
<img src="pics/nasa-coding-error-2.png" width="400"/>
</div>
]

---
# Why debugging matters (cont.)

.pull-left-center[
<div align="center">
<br>
<img src="pics/fb-socsci1-1.png" width="550"/>
<img src="pics/fb-socsci1-2.png" width="550"/>
<img src="pics/fb-socsci1-3.png" width="550"/>
</div>
`Source` [Washington Post](https://www.washingtonpost.com/technology/2021/09/10/facebook-error-data-social-scientists/)
]

.pull-right-center[
<div align="center">
<br>
<img src="pics/fb-socsci1-4.png" width="450"/>
<img src="pics/fb-socsci1-5.png" width="450"/>
</div>
`Source` [Solomon Messing / Twitter](https://twitter.com/solomonmg/status/1436742352039669760)
]

---
# A general strategy for debugging

.pull-left-vsmall[]

.pull-right-wide[
<br>
<br>
<br>

## 1. Google

## 2. Reset

## 3. Debug

## 4. Deter
]

---
# Google

.footnote[<sup>1</sup>Do you get an error message you don't understand? That's good news actually, because the really nasty bugs come without errors. ]

.pull-left-wide2[
According to [this analysis](https://github.com/noamross/zero-dependency-problems/blob/master/misc/stack-overflow-common-r-errors.md), the most common error types in R are:<sup>1</sup>

1. `Could not find function` errors, usually caused by typos or not loading a required package.
2. `Error in if` errors, caused by non-logical data or missing values passed to R's `if` conditional statement.
3. `Error in eval` errors, caused by references to objects that don't exist.
4. `Cannot open` errors, caused by attempts to read a file that doesn't exist or can't be accessed.
5. `no applicable method` errors, caused by using an object-oriented function on a data type it doesn't support.
6. `subscript out of bounds` errors, caused by trying to access an element or dimension that doesn't exist.
7. Package errors caused by being unable to install, compile or load a package.
]

.pull-right-small2[
Whenever you see an error message, start by [googling](https://lmgtfy.app/?q=Error+in+interpretative_method+%3A+%20+%20could+not+find+function+%22interpretative_method%22&iie=1) it. Improve your chances of a good match by removing any variable names or values that are specific to your problem. Also, look for [Stack Overflow](https://stackoverflow.com/questions/tagged/r) posts and their answers.

]
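To get at the searchable part of an error message, it can help to capture it programmatically. The sketch below provokes two of the error types listed above and stores their messages; `interpretative_method()` is deliberately undefined (the name is taken from the search link on this slide), and the wording of the messages depends on your R version and language settings.

```r
# Calling an undefined function triggers a "could not find function" type error
msg_function <- tryCatch(
  interpretative_method(),
  error = function(e) conditionMessage(e)
)

# A missing value in an if() condition triggers an "Error in if" type error
msg_if <- tryCatch(
  if (NA) "yes" else "no",
  error = function(e) conditionMessage(e)
)

# The captured messages, without the "Error in ..." prefix
msg_function
msg_if
```

`conditionMessage()` returns only the message text, which is usually the part worth pasting into a search engine after stripping out names specific to your own code.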

---
# Reset

.pull-left[
- If at first you don't succeed, try exactly the same thing again.
- Have you tried turning it off and on again?
- Do you use `rm(list = ls())`? Don't. Packages remain loaded, options and environment variables set, ... all possible sources of error!
- A fresh start clears the workspace, resets options, environment variables, and the path.
- While we're at it, check out James Wade's advice ["How I set up RStudio for Efficient Coding" (YouTube)](https://www.youtube.com/watch?v=p-r-AWR3-Es).
]

.pull-right[
<div align="center">
<img src="pics/restart-r-1.png" width="350"/><br>
<img src="pics/restart-r-2.png" width="350"/>
</div>
]
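
---
# Reset (cont.)

To see why `rm(list = ls())` is not a fresh start, consider this small sketch. It assumes a session in which the `tools` package (shipped with R) is not yet attached; the object name `my_value` is made up for the example.

```r
library(tools)          # attach a package
options(digits = 4)     # change a global option
my_value <- pi          # create an object in the workspace

rm(list = ls())         # removes objects from the workspace, nothing else

object_gone    <- !exists("my_value")              # the object is gone ...
still_attached <- "package:tools" %in% search()    # ... but the package is still attached
digits_now     <- getOption("digits")              # ... and the option is still changed

print(c(object_gone = object_gone, still_attached = still_attached))
print(digits_now)

options(digits = 7)     # put the option back to R's default value
```

Restarting the R session resets all three at once, which is why it is the more reliable choice.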

---
# Debug

### Make the error repeatable.

- Execute the code many times as you consider and reject hypotheses. To make that iteration as quick as possible, it’s worth some upfront investment to make the problem both easy and fast to reproduce.
- Work with reproducible and minimal examples by removing innocuous code and simplifying data.
- Consider automated testing. Add some nearby tests to ensure that existing good behaviour is preserved.

### Track the error down.

- Execute code step by step and inspect intermediate outputs. 
- Adopt the scientific method: Generate hypotheses, design experiments to test them, and record your results.

### Once found, fix the error and test it.

- Ensure you haven’t introduced any new bugs in the process. 
- Make sure to carefully record the correct output, and check against the inputs that previously failed.
- Reset and run again to make sure everything still works.
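
---
# Debug (cont.)

The steps above can be sketched with a toy bug. The functions `standardize_buggy()` and `standardize()` are hypothetical, and base R's `stopifnot()` stands in for a proper testing framework.

```r
# Hypothetical function with a bug: one missing value turns every result into NA
standardize_buggy <- function(x) (x - mean(x)) / sd(x)

# Minimal input that reproduces the problem quickly and reliably
x_min <- c(1, 2, NA)
buggy_result <- standardize_buggy(x_min)
print(buggy_result)

# Candidate fix: ignore missing values when computing mean and standard deviation
standardize <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Nearby tests: existing good behaviour is preserved, the failing input now works
stopifnot(
  isTRUE(all.equal(standardize(c(1, 2, 3)), c(-1, 0, 1))),
  !anyNA(standardize(x_min)[1:2]),
  is.na(standardize(x_min)[3])
)
```

Keeping the minimal input around as a test means the same bug cannot silently return later.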

---
# Deter

.pull-left-wide2[

### Defensive programming

- **Pay attention.** Do results make sense? Do they look different from previous results? Why?
- **Know what you're doing**, and what you're expecting.
  - Avoid functions that return different types of output depending on their input, e.g., `[]` and `sapply()`.
  - Be strict about what you accept (e.g., only scalars). 
  - Avoid functions that use non-standard evaluation (e.g., `with()`).
- **Fail fast**.
  - As soon as something wrong is discovered, signal an error. 
  - Add tests (e.g., with the `testthat` package).
  - Practice good condition/exception handling, e.g., with `try()` and `tryCatch()`. 
  - Write error messages for humans.

]

.pull-right-small3[
### Transparency

- Collaborate! [Pair programming](https://en.wikipedia.org/wiki/Pair_programming) is an established software development technique that increases code robustness. It also works [remotely](https://ivelasq.rbind.io/blog/vscode-live-share/).
- Be transparent! Let others access your code and comment on it.

<div align="center">
<img src="pics/pair-programming-its-not-for-everyone.jpeg" width="350"/>
</div>
]
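
---
# Deter (cont.)

A few of the defensive programming ideas can be illustrated in base R. This is a sketch only; `double_it()` and `safe_log()` are hypothetical functions written for this example.

```r
double_it <- function(i) i * 2

# sapply() changes its return type depending on the input ...
type_regular <- class(sapply(1:3, double_it))           # a numeric vector
type_empty   <- class(sapply(integer(0), double_it))    # an empty list

# ... while vapply() returns the declared type or fails
type_strict  <- class(vapply(integer(0), double_it, numeric(1)))

print(c(type_regular, type_empty, type_strict))

# Be strict about inputs, fail fast, and write the error message for humans
safe_log <- function(x) {
  if (!is.numeric(x) || length(x) != 1) {
    stop("`x` must be a single number, not ", class(x)[1],
         " of length ", length(x), ".", call. = FALSE)
  }
  log(x)
}

# Handle the condition instead of letting the script crash
result <- tryCatch(safe_log("a"), error = function(e) conditionMessage(e))
print(result)
```

The `vapply()` call documents the expected output type in the code itself, so a surprise surfaces as an error rather than as a silently different object.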

---
# Debugging R: What you get

<br>
<br>
<tt>
Error : .onLoad failed in loadNamespace() for 'rJava', details: <br>
call: dyn.load(file, DLLpath = DLLpath, ...) <br>
error: unable to load shared object '/Users/janedoe/Library/R/3.6/library/rJava/libs/rJava.so':  <br>
libjvm.so: cannot open shared object file: No such file or directory  <br>
Error: loading failed  <br>
Execution halted  <br>
ERROR: loading failed  <br>
* removing '/Users/janedoe/Library/R/3.6/library/rJava/' <br>
Warning in install.packages : <br>
installation of package 'rJava' had non-zero exit status
</tt>

<br>
`Credit` [Jenny Bryan](https://github.com/jennybc/debugging)

---
# Debugging R: What you see

<br>
<br>
<tt>
<span style = "color:red">Error</span> : blah <span style = "color:red">failed</span> blah blah() blah 'blah', blah: <br>
call: blah.blah(blah, blah = blah, ...) <br>
<span style = "color:red">error</span>: <span style = "color:red">unable</span> to blah blah blah '/blah/blah/blah/blah/blah/blah/blah/blah/blah.so':  <br>
blah.so: <span style = "color:red">cannot</span> open blah blah blah: <span style = "color:red">No</span> blah blah blah blah  <br>
<span style = "color:red">Error</span>: blah <span style = "color:red">failed</span>  <br>
blah blah  <br>
<span style = "color:red">ERROR</span>: blah <span style = "color:red">failed</span>  <br>
* removing '/blah/blah/blah/blah/blah/blah/blah/' <br>
<span style = "color:red">Warning</span> in blah.blah : <br>
blah of blah 'blah' blah blah-blah blah blah
</tt>

<br>
`Credit` [Jenny Bryan](https://github.com/jennybc/debugging)

---
# Strategies to debug your R code

Sometimes the mistake in your code is hard to diagnose, and googling doesn't help. Here are a couple of strategies to debug your code:

- Use `traceback()` to determine where a given error is occurring.

- Output diagnostic information in code with `print()`, `cat()` or `message()` statements.

- Use `browser()` to open an interactive debugger before the error.

- Use `debug()` to automatically open a debugger at the start of a function call.

- Use `trace()` to make temporary code modifications inside a function that you don't have easy access to.
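
---
# Strategies to debug your R code (cont.)

Diagnostic output is often the quickest of these strategies. Below is a minimal sketch using `message()`; the function `column_means()` and the data frame `toy` are hypothetical.

```r
# Hypothetical helper: compute the mean of every column in a data frame
column_means <- function(df) {
  means <- numeric(0)
  for (name in names(df)) {
    # message() writes to stderr, so it does not mix with the function's return value
    message("Column '", name, "' has class ", class(df[[name]])[1])
    means[name] <- mean(df[[name]])
  }
  means
}

# Column b looks numeric but is stored as text
toy <- data.frame(a = 1:3, b = c("4", "5", "6"))

# mean() warns and returns NA for column b; the diagnostic message shows why
result <- column_means(toy)
print(result)
```

Unlike `print()` and `cat()`, messages can be silenced later with `suppressMessages()` without touching the function body.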

---
# Locating errors with traceback()

.pull-left[

### Motivation and usage
- When an error message is cryptic, or familiar but hard to trace back to its source, the `traceback()` function comes in handy.
- The `traceback()` function prints the sequence of calls that led to an uncaught error.
- The `traceback()` output reads from bottom to top.
- Note that errors caught via `try()` or `tryCatch()` do not generate a traceback!
- If you’re calling code that you `source()`d into R, the traceback will also display the location of the function, in the form `filename.r#linenumber`. 
]

.pull-right[
### Example

In the call sequence below, the execution of `g()` triggers an error:

``` r
R> f <- function(x) x + 1
R> g <- function(x) f(x)
R> g("a")
```

```r 
#> Error in x + 1 : non-numeric argument to binary operator
```

Running the traceback reveals that the function call `f(x)` is what led to the error:

``` r
R> traceback()
```

```r 
#> 2: f(x) at #1
#> 1: g("a")
```
]

---
# Interactive debugging with browser()

.pull-left[
### Motivation and usage
- Sometimes, you need more information than the precise location of an error in a function to fix it. 
- The interactive debugger lets you pause the run of a function and interactively explore its state.
- Two options to enter the interactive debugger: 
  1. Through RStudio's "Rerun with Debug" tool, shown to the right of an error message.
  2. You can insert a call to `browser()` into the function at the stage where you want to pause, and re-run the function.
- In either case, you’ll end up in an interactive environment inside the function where you can run arbitrary R code to explore the current state. You’ll know when you’re in the interactive debugger because you get a special prompt, `Browse[1]>`. 
]

.pull-right[
### Example

``` r
R> h <- function(x) x + 3
R> g <- function(b) {
*+   browser()
+   h(b)
+ }
R> g(10)
```

Some useful things to do are:

1. Use `ls()` to determine what objects are available in the current
   environment.
2. Use `str()`, `print()` etc. to examine the objects.
3. Use `n` to evaluate the next statement.
4. Use `s`: like `n` but also step into function calls.
5. Use `where` to print a stack trace (→ traceback).
6. Use `c` to exit debugger and continue execution.
7. Use `Q` to exit debugger and return to the R prompt.
]

---
# Debugging other people's code

.pull-left[
### Motivation
- Sometimes the error is outside your code, in a package you're using, but you might still want to be able to debug it.
- Two options:
  1. Get a local version of the package code and debug as if it were your own.
  2. Use functions which allow you to start a browser in existing functions, including `recover()` and `debug()`.
]

.pull-right[
]

---
# Debugging other people's code (cont.)

.pull-left-small[
### Motivation
- `recover()` serves as an alternative error handler which you activate by calling `options(error = recover)`.
- You can then select from a list of current calls to browse.
- `options(error = NULL)` turns off this debugging mode again.
- A simpler alternative is `options(error = browser)`, but this only allows you to browse the call where the error occurred.
]

.pull-right-wide[
### Example

- Activate debugging mode; then execute (flawed) function:

``` r
*R> options(error = recover)
R> lm(mpg ~ wt, data = "mtcars")
```

```r 
Error in model.frame.default(formula = mpg ~ wt, data = "mtcars", drop.unused.levels = TRUE) 
 'data' must be a data.frame, environment, or list
 
Enter a frame number, or 0 to exit

1: lm(mpg ~ wt, data = "mtcars")
2: eval(mf, parent.frame())
3: eval(mf, parent.frame())

Selection: 
```

- Deactivate debugging mode:

``` r
*R> options(error = NULL)
```
]

---
# Debugging other people's code (cont.)

.pull-left[
### Motivation
- `debug()` activates the debugger on any function, including those in packages (see on the right). `undebug()` deactivates the debugger again.
- Some functions in another package are easier to find than others. There are
  - *exported* functions which are available outside of a package and
  - *internal* functions which are only available within a package.
- To find (and debug) exported functions, use the `::` syntax, as in `ggplot2::ggplot`.
- To find un-exported functions, use the `:::` syntax, as in `ggplot2:::check_required_aesthetics`.
]

.pull-right[
### Example

- Activate debugging mode for `lm()` function; then execute function:

``` r
*R> debug(stats::lm)
R> lm(mpg ~ weight, data = mtcars)
```

- Interactive debugging mode for `lm()` is entered; use the common `browser()` functionality to navigate:

```r 
debugging in: lm(mpg ~ weight, data = mtcars)
debug: {
    ret.x <- x
    ...
Browse[2]> 
```

- Deactivate debugging mode:

``` r
*R> undebug(stats::lm)
```
]
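
---
# Debugging other people's code (cont.)

The distinction between exported and internal functions can be inspected programmatically. The sketch below uses the `stats` package because it ships with R; the variable names are chosen for this example. It sets and clears the debug flag without calling `lm()`, so no interactive session is opened.

```r
stats_ns <- asNamespace("stats")

# Exported functions are reachable with stats::name
exported <- getNamespaceExports(stats_ns)

# Everything else in the namespace is internal and needs stats:::name
internal <- setdiff(ls(stats_ns), exported)

print("lm" %in% exported)
print(head(sort(internal)))

# debug() flags the function; the debugger opens on the next call
debug(stats::lm)
flag_on <- isdebugged(stats::lm)

# undebug() removes the flag again
undebug(stats::lm)
flag_off <- isdebugged(stats::lm)

print(c(flag_on = flag_on, flag_off = flag_off))
```

Checking `isdebugged()` is a quick way to confirm that no function is left flagged before you move on.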

---
# Debugging in RStudio

---
# More on debugging R

.pull-left[
<br><br><br>

### Further reading

- [12-minute video](https://vimeo.com/99375765) on debugging in R
- Jenny Bryan's [talk on debugging](https://github.com/jennybc/debugging) at rstudio::conf 2020
- Jenny Bryan and Jim Hester's "What They Forgot to Teach You About R", Chapter 11: [Debugging R code](https://rstats.wtf/debugging-r)
- Jonathan McPherson's [Debugging with RStudio](https://support.rstudio.com/hc/en-us/articles/205612627-Debugging-with-RStudio)
]

.pull-right[
<br>
<div align="center">
<img src="pics/classy-bear-debugging.jpeg" width="450"/>
</div>
]

---
# Next steps

<br>

### Assignment 1 and Quiz 1

Don't forget to submit your solutions for Assignment 1. Also, Quiz 1 is online!

### Next lecture

We're going to dig into the world wide web...

<b>Important:</b> The lecture is going to take place on Wednesday, 8-10am, in the Forum! If your regular lab is scheduled for that slot, please visit the lab session at the ordinary lecture slot instead (Mon, 10-12h). The lecture will be recorded and made available in case you cannot attend.