Topic 2: Programming

class: center, middle, inverse, title-slide

.title[
# Topic 2: Programming
]
.author[
### Nick Hagerty* <br> ECNS 460/560 Fall 2024 <br> Montana State University
]
.date[
### .small[<br> *Parts of these slides are adapted from <a href="http://rafalab.dfci.harvard.edu/dsbook/programming-basics.html">“Introduction to Data Science”</a> by Rafael A. Irizarry, used under <a href="https://creativecommons.org/licenses/by-nc-sa/4.0">CC BY-NC-SA 4.0</a>, and <a href="https://raw.githack.com/uo-ec607/lectures/master/12-parallel/12-parallel.html">“Data Science for Economists”</a> by Grant R. McDermott, used under the <a href="https://github.com/uo-ec607/lectures/blob/master/LICENSE">MIT License</a>.]
]

---

# Programming

You can do a lot of data analysis in R without knowing much programming.

But knowing some basic concepts of programming can make you a better data analyst.

</br>

1. [If/else statements](#if-else)

1. [For-loops](#for-loops)

1. [Functions](#functions)

1. [Vectorization](#vectorization)

1. [Parallelization](#parallelization)

---
class: inverse, middle
name: if-else

# If/else statements

---

# If/else statements

If/else statements are a type of **conditional expression**.

**Example:** Print the reciprocal of `a`, unless `a` is 0.

```r
a = 0

if(a != 0) {
  
  print(1 / a)
  
} else {
  
  print("Reciprocal does not exist.")
  
}
```

```
## [1] "Reciprocal does not exist."
```

Statements like this are used for **control flow** of your code.
* They are used all the time in software development.
* You probably won't use them much in data analysis, until you start writing your own functions and packages.

---

# If/else statements

Here's a related function that you *will* use all the time in data analysis: `ifelse`. Its syntax is:

```r
ifelse(test, yes, no)
```

For example:

```r
a = 0
ifelse(a > 0, 1/a, NA)
```

```
## [1] NA
```

`ifelse` is particularly useful because it is **vectorized**. For example, to change negative numbers to missing:

```r
b = c(0, 1, 2, -3, 4)
ifelse(b < 0, NA, b)
```

```
## [1]  0  1  2 NA  4
```

---
class: inverse, middle
name: for-loops

# For-loops

---

# Abstraction

Very often, you will find yourself copying and pasting your code to do the same thing `$n$` times, with only a small tweak each time. What's wrong with that?
--

It's:
* Annoying (especially if `$n$` is large)
* Hard to change later if needed
* Prone to errors/bugs

Instead, you can **abstract** your code: define it once, and run it multiple times. The rest of this lecture covers tools for abstraction in different situations.

A good rule to aim for is to **never copy-and-paste more than twice.** If you're pasting more than that, abstract it instead!

---

# For-loops

A for-loop is a simple tool that is used for **iteration**. It just repeats the same code for different values of a variable (entries of a vector):

```r
for (i in 1:5) {
  print(i)
}
```

```
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
```

---

# For-loops

```r
library(dslabs)
data(murders)
murders2 = cbind(murders, rate = murders$total / murders$population * 1e5)
```

Suppose we want to calculate the mean of the numeric columns in the `murders` data frame. And we also want the mean murder *rate*. We could copy and paste:

```r
mean(murders2$total)
```

```
## [1] 184.3725
```

```r
mean(murders2$population)
```

```
## [1] 6075769
```

```r
mean(murders$rate)
```

```
## [1] NA
```

Oops! What's the problem?

---

# For-loops

Instead of risking errors with copy-and-paste, we could put this into a for-loop:

```r
numeric_cols = c("total", "population", "rate")
for(var in numeric_cols) {
  print(mean(murders2[[var]]))
}
```

```
## [1] 184.3725
## [1] 6075769
## [1] 2.779125
```

---

# For-loops

Instead of printing the output, we can populate a vector:

```r
numeric_cols = c("total", "population", "rate")
murder_means = vector()
for(var in numeric_cols) {
  murder_means[[var]] = mean(murders2[[var]])
}
murder_means
```

```
##        total   population         rate 
## 1.843725e+02 6.075769e+06 2.779125e+00
```

</br>

There is one technical problem with this code. The vector storing the output "grows" at each iteration, which can make the loop very slow.

---

# For-loops

To make the loop more efficient, give your empty vector the right length before starting the loop:

```r
numeric_cols = c("total", "population", "rate")
murder_numbers = murders2[numeric_cols]
murder_means = vector("numeric", length = length(murder_numbers))
for(i in 1:length(murder_numbers)) {
  murder_means[[i]] = mean(murder_numbers[[i]])
}
names(murder_means) = numeric_cols
murder_means
```

```
##        total   population         rate 
## 1.843725e+02 6.075769e+06 2.779125e+00
```

---

# For-loops

For-loops are actually discouraged in R programming.

* We're covering them because the concepts are foundational.

* But R has nicer ways to iterate, called **vectorization.**

* Before getting to vectorization, we have to cover **functions.**

---
class: inverse, middle
name: functions

# Functions

---

# Functions

Suppose I want to **standardize**, or calculate the Z-score of, a numeric variable.

```r
z_population = (murders$population - mean(murders$population)) / sd(murders$population)
```

This works, but:
* It's hard to read what's actually going on.
* If I ever wanted to standardize a different vector, I would have to retype everything.

---

# Functions

We can abstract this calculation by writing our own **function.**

```r
calculate_z = function(x) {
  z = (x - mean(x)) / sd(x)
  return(z)
}
```

What's going on:
* `calculate_z` is the **name** of the new function.
* `x` is an **argument** of this function (information that changes each time you run it).
  - When using the function, the user will supply the value of `x`
  - Within the function, `x` is now an object you can use for calculations.
* The **stuff the function does** is contained within curly brackets `{}`.
* The `return()` function defines what the **result** of the function is.
* Objects defined within a function, like `z`, exist only within the function.

---

# Functions

Now we can use this function:

```r
calculate_z = function(x) {
  z = (x - mean(x)) / sd(x)
  return(z)
}
z_population = calculate_z(murders$population)
head(z_population)
```

```
## [1] -0.18890769 -0.78207215  0.04609577 -0.46057478  4.54448196 -0.15254681
```
Isn't this better? Compare readability.

Also, we now have the function available to use in different contexts:

```r
z_population = calculate_z(murders$population)

z_rate = calculate_z(murders2$rate)

z_seq = calculate_z(seq(1:100))
```

---

# Functions

When might we prefer to use a function vs. a for-loop?

---

# Challenge

1. **Write a function `average(x)` that returns the mean of an input vector `x`.**

2. **Write a function `average(x, median)` that calculates a different kind of average depending on the value of the argument `median`:**
* If `median` is `FALSE`, the function returns the **mean** of `x`.
* If `median` is `TRUE`, the function returns the **median** of `x`.

Test it to make sure it works, for example, with `c(1, 9, 10)`.

---

# Challenge

1. **Write a function `average(x)` that returns the mean of an input vector `x`.**

Test it to make sure it works, for example, with `c(1, 9, 10)`.

---

# Challenge

1. **Write a function `average(x)` that returns the mean of an input vector `x`.**

Test it to make sure it works, for example, with `c(1, 9, 10)`.

---
class: inverse, middle
name: vectorization

# Vectorization and functionals

---

# Vectorization and functionals

Another way to iterate uses the `apply` family. These functions apply a function to all elements of an object. For example:

```r
nums = data.frame(a = 1:5, b = 11:15, c = 21:25)
sapply(X = nums, FUN = sqrt)
```

```
##             a        b        c
## [1,] 1.000000 3.316625 4.582576
## [2,] 1.414214 3.464102 4.690416
## [3,] 1.732051 3.605551 4.795832
## [4,] 2.000000 3.741657 4.898979
## [5,] 2.236068 3.872983 5.000000
```
* `X` is the vector of objects that you want to iterate over.
* `FUN` is the function you want to apply to each element of `X`.
* `sapply` returns an object of the same type and size as `X`.

---

# Vectorization and functionals

Say we want to apply our function `calculate_z` to all columns of `murder_numbers`. We wrote `calculate_z` to work on vectors, but not lists of vectors, so this doesn't work:

```r
calculate_z(murder_numbers)
```

```
## Error in is.data.frame(x): 'list' object cannot be coerced to type 'double'
```

What can we do? First, we could copy and paste:

```r
z_total      = calculate_z(murder_numbers$total)
z_population = calculate_z(murder_numbers$population)
z_rate       = calculate_z(murder_numbers$rate)
murders_z = data.frame(z_total, z_population, z_rate)
head(murders_z, n = 3)
```

```
##      z_total z_population      z_rate
## 1 -0.2090939  -0.18890769  0.01844305
## 2 -0.7003568  -0.78207215 -0.04231860
## 3  0.2017034   0.04609577  0.34623812
```

This is OK, but what if `murder_numbers` had 50 columns? It would get unwieldy.

---

# Vectorization and functionals

Another option is to put our function inside of a for-loop:

```r
murders_z = list()
for(col in names(murder_numbers)) {
  murders_z[[col]] = calculate_z(murder_numbers[[col]])
}
head(as.data.frame(murders_z))
```

```
##        total  population        rate
## 1 -0.2090939 -0.18890769  0.01844305
## 2 -0.7003568 -0.78207215 -0.04231860
## 3  0.2017034  0.04609577  0.34623812
## 4 -0.3869650 -0.46057478  0.16703781
## 5  4.5426034  4.54448196  0.24225740
## 6 -0.5055457 -0.15254681 -0.60529343
```

But this is hard to read (and to write). It puts more emphasis on the structure -- the objects and the iteration -- than the action.

---

# Vectorization and functionals

Instead, we can do the same thing in 1 line with `sapply`:

```r
murders_z = sapply(murder_numbers, calculate_z)
head(as.data.frame(murders_z))
```

So refreshing! Now you can start to see why for-loops are discouraged in R.

So what is a **functional?** A functional is a function that takes another function as an argument. The `sapply` function is an example of a functional.

---
class: inverse, middle
name: parallelization

# Parallelization

---

# Parallelization

Run these preliminaries:

```r
library(future)
library(future.apply)
library(tictoc)
plan(multisession)
```

Let's make a deliberately slow function:

```r
slow_square = function(x = 1) {
  Sys.sleep(1/3)  # Just wait a third of a second
  return(x^2)
}
```

---

# Parallelization

How long does it take to run this for 24 iterations?

```r
tic()
sapply(1:24, slow_square)
```

```
##  [1]   1   4   9  16  25  36  49  64  81 100 121 144 169 196 225 256 289 324 361
## [20] 400 441 484 529 576
```

```r
toc()
```

```
## 8 sec elapsed
```

The iteration is executed in **serial,** so it should have taken about `$24 * 1/3 = 8$` seconds.

But we can speed this up. Modern CPUs are made up of multiple **cores,** or processing units, so they can actually run iterations in **parallel.** How many cores does your computer have?

```r
availableCores()
```

```
## system 
##      8
```

---

# Parallelization

A really easy way to **parallelize** your code uses the `future.apply` package. Let's try it:

```r
tic()
future_sapply(1:24, slow_square)
```

```
##  [1]   1   4   9  16  25  36  49  64  81 100 121 144 169 196 225 256 289 324 361
## [20] 400 441 484 529 576
```

```r
toc()
```

```
## 2.14 sec elapsed
```

Much less time! We've engaged not just one core, but all of them.

What did we have to do to make it parallel?
* Declare your intent to use parallel processing: `plan(multisession)`
* Use `future_sapply` instead of `sapply`
* That's it!

---

# Parallelization

There are other packages that implement parallel processing in R, but `future` is uniquely simple, powerful, and consistent across operating systems.
* It makes it so that you don't really need to know about hardware or what's going on behind the scenes.
* But there is a lot more you *can* learn about if you're interested.

When is parallelization useful?
* When your code **takes a long time** and is **easy to break up into chunks.**
* No point in using it for code that's already fast -- parallelization involves some overhead.
* The chunks must be independent -- they can't need to communicate with each other.
* Examples: Bootstrapping, simulation, many iteration problems.

---

# Other functions & packages

Other functions for vectorization:

* Base: `lapply()` returns a list instead of a vector, so it's a bit more general. (You can always `unlist` the result.)

```r
sapply(1:2, sqrt)
```

```
## [1] 1.000000 1.414214
```

```r
lapply(1:2, sqrt)
```

```
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
```

```r
unlist(lapply(1:2, sqrt))
```

```
## [1] 1.000000 1.414214
```

---

# Other functions & packages

Other functions for vectorization:

* Base: `lapply()` returns a list instead of a vector, so it's a bit more general. (You can always `unlist` the result.)

* Base: `mapply()` is a multivariate version of `sapply` -- it vectorizes over more than one dimension.

* Package `purrr`: `map()` and `map2()` are similar to `lapply` and `mapply` but interact better with other functions in the Tidyverse.

* Package `furrr`: Parallelized Tidyverse-friendly vectorization. Combines `purrr` with `future`.

---

# Programming

1. [If/else statements](#if-else)

1. [For-loops](#for-loops)

1. [Functions](#functions)

1. [Vectorization](#vectorization)

1. [Parallelization](#parallelization)