class: center, middle, inverse, title-slide .title[ # Topic 2: Programming ] .author[ ### Nick Hagerty*
ECNS 460/560 Fall 2024
Montana State University ] .date[ ### .small[
*Parts of these slides are adapted from
“Introduction to Data Science”
by Rafael A. Irizarry, used under
CC BY-NC-SA 4.0
, and
“Data Science for Economists”
by Grant R. McDermott, used under the
MIT License
.] ] --- <style type="text/css"> # CSS for including pauses in printed PDF output (see bottom of lecture) @media print { .has-continuation { display: block !important; } } .remark-code-line { font-size: 95%; } .small { font-size: 75%; } .scroll-output-full { height: 90%; overflow-y: scroll; } .scroll-output-75 { height: 75%; overflow-y: scroll; } </style> # Programming You can do a lot of data analysis in R without knowing much programming. But knowing some basic concepts of programming can make you a better data analyst. </br> 1. [If/else statements](#if-else) 1. [For-loops](#for-loops) 1. [Functions](#functions) 1. [Vectorization](#vectorization) 1. [Parallelization](#parallelization) --- class: inverse, middle name: if-else # If/else statements --- # If/else statements If/else statements are a type of **conditional expression**. **Example:** Print the reciprocal of `a`, unless `a` is 0. ```r a = 0 if(a != 0) { print(1 / a) } else { print("Reciprocal does not exist.") } ``` ``` ## [1] "Reciprocal does not exist." ``` -- Statements like this are used for **control flow** of your code. * They are used all the time in software development. * You probably won't use them much in data analysis, until you start writing your own functions and packages. --- # If/else statements Here's a related function that you *will* use all the time in data analysis: `ifelse`. Its syntax is: ```r ifelse(test, yes, no) ``` For example: ```r a = 0 ifelse(a > 0, 1/a, NA) ``` ``` ## [1] NA ``` -- `ifelse` is particularly useful because it is **vectorized**. For example, to change negative numbers to missing: ```r b = c(0, 1, 2, -3, 4) ifelse(b < 0, NA, b) ``` ``` ## [1] 0 1 2 NA 4 ``` --- class: inverse, middle name: for-loops # For-loops --- # Abstraction Very often, you will find yourself copying and pasting your code to do the same thing `\(n\)` times, with only a small tweak each time. What's wrong with that? -- It's: * Annoying (especially if `\(n\)` is large) * Hard to change later if needed * Prone to errors/bugs Instead, you can **abstract** your code: define it once, and run it multiple times. The rest of this lecture covers tools for abstraction in different situations. A good rule to aim for is to **never copy-and-paste more than twice.** If you're pasting more than that, abstract it instead! --- # For-loops A for-loop is a simple tool that is used for **iteration**. It just repeats the same code for different values of a variable (entries of a vector): ```r for (i in 1:5) { print(i) } ``` ``` ## [1] 1 ## [1] 2 ## [1] 3 ## [1] 4 ## [1] 5 ``` --- # For-loops ```r library(dslabs) data(murders) murders2 = cbind(murders, rate = murders$total / murders$population * 1e5) ``` Suppose we want to calculate the mean of the numeric columns in the `murders` data frame. And we also want the mean murder *rate*. We could copy and paste: ```r mean(murders2$total) ``` ``` ## [1] 184.3725 ``` ```r mean(murders2$population) ``` ``` ## [1] 6075769 ``` ```r mean(murders$rate) ``` ``` ## [1] NA ``` -- Oops! What's the problem? --- # For-loops Instead of risking errors with copy-and-paste, we could put this into a for-loop: ```r numeric_cols = c("total", "population", "rate") for(var in numeric_cols) { print(mean(murders2[[var]])) } ``` ``` ## [1] 184.3725 ## [1] 6075769 ## [1] 2.779125 ``` --- # For-loops Instead of printing the output, we can populate a vector: ```r numeric_cols = c("total", "population", "rate") murder_means = vector() for(var in numeric_cols) { murder_means[[var]] = mean(murders2[[var]]) } murder_means ``` ``` ## total population rate ## 1.843725e+02 6.075769e+06 2.779125e+00 ``` -- </br> There is one technical problem with this code. The vector storing the output "grows" at each iteration, which can make the loop very slow. --- # For-loops To make the loop more efficient, give your empty vector the right length before starting the loop: ```r numeric_cols = c("total", "population", "rate") murder_numbers = murders2[numeric_cols] murder_means = vector("numeric", length = length(murder_numbers)) for(i in 1:length(murder_numbers)) { murder_means[[i]] = mean(murder_numbers[[i]]) } names(murder_means) = numeric_cols murder_means ``` ``` ## total population rate ## 1.843725e+02 6.075769e+06 2.779125e+00 ``` --- # For-loops For-loops are actually discouraged in R programming. * We're covering them because the concepts are foundational. * But R has nicer ways to iterate, called **vectorization.** * Before getting to vectorization, we have to cover **functions.** --- class: inverse, middle name: functions # Functions --- # Functions Suppose I want to **standardize**, or calculate the Z-score of, a numeric variable. ```r z_population = (murders$population - mean(murders$population)) / sd(murders$population) ``` -- This works, but: * It's hard to read what's actually going on. * If I ever wanted to standardize a different vector, I would have to retype everything. --- # Functions We can abstract this calculation by writing our own **function.** ```r calculate_z = function(x) { z = (x - mean(x)) / sd(x) return(z) } ``` What's going on: * `calculate_z` is the **name** of the new function. * `x` is an **argument** of this function (information that changes each time you run it). - When using the function, the user will supply the value of `x` - Within the function, `x` is now an object you can use for calculations. * The **stuff the function does** is contained within curly brackets `{}`. * The `return()` function defines what the **result** of the function is. * Objects defined within a function, like `z`, exist only within the function. --- # Functions Now we can use this function: ```r calculate_z = function(x) { z = (x - mean(x)) / sd(x) return(z) } z_population = calculate_z(murders$population) head(z_population) ``` ``` ## [1] -0.18890769 -0.78207215 0.04609577 -0.46057478 4.54448196 -0.15254681 ``` Isn't this better? Compare readability. -- Also, we now have the function available to use in different contexts: ```r z_population = calculate_z(murders$population) z_rate = calculate_z(murders2$rate) z_seq = calculate_z(seq(1:100)) ``` --- # Functions When might we prefer to use a function vs. a for-loop? --- # Challenge 1. **Write a function `average(x)` that returns the mean of an input vector `x`.** 2. **Write a function `average(x, median)` that calculates a different kind of average depending on the value of the argument `median`:** * If `median` is `FALSE`, the function returns the **mean** of `x`. * If `median` is `TRUE`, the function returns the **median** of `x`. Test it to make sure it works, for example, with `c(1, 9, 10)`. --- # Challenge 1. **Write a function `average(x)` that returns the mean of an input vector `x`.** 2. **Write a function `average(x, median)` that calculates a different kind of average depending on the value of the argument `median`:** * If `median` is `FALSE`, the function returns the **mean** of `x`. * If `median` is `TRUE`, the function returns the **median** of `x`. Test it to make sure it works, for example, with `c(1, 9, 10)`. --- # Challenge 1. **Write a function `average(x)` that returns the mean of an input vector `x`.** 2. **Write a function `average(x, median)` that calculates a different kind of average depending on the value of the argument `median`:** * If `median` is `FALSE`, the function returns the **mean** of `x`. * If `median` is `TRUE`, the function returns the **median** of `x`. Test it to make sure it works, for example, with `c(1, 9, 10)`. --- class: inverse, middle name: vectorization # Vectorization and functionals --- # Vectorization and functionals Another way to iterate uses the `apply` family. These functions apply a function to all elements of an object. For example: ```r nums = data.frame(a = 1:5, b = 11:15, c = 21:25) sapply(X = nums, FUN = sqrt) ``` ``` ## a b c ## [1,] 1.000000 3.316625 4.582576 ## [2,] 1.414214 3.464102 4.690416 ## [3,] 1.732051 3.605551 4.795832 ## [4,] 2.000000 3.741657 4.898979 ## [5,] 2.236068 3.872983 5.000000 ``` * `X` is the vector of objects that you want to iterate over. * `FUN` is the function you want to apply to each element of `X`. * `sapply` returns an object of the same type and size as `X`. --- # Vectorization and functionals Say we want to apply our function `calculate_z` to all columns of `murder_numbers`. We wrote `calculate_z` to work on vectors, but not lists of vectors, so this doesn't work: ```r calculate_z(murder_numbers) ``` ``` ## Error in is.data.frame(x): 'list' object cannot be coerced to type 'double' ``` -- What can we do? First, we could copy and paste: ```r z_total = calculate_z(murder_numbers$total) z_population = calculate_z(murder_numbers$population) z_rate = calculate_z(murder_numbers$rate) murders_z = data.frame(z_total, z_population, z_rate) head(murders_z, n = 3) ``` ``` ## z_total z_population z_rate ## 1 -0.2090939 -0.18890769 0.01844305 ## 2 -0.7003568 -0.78207215 -0.04231860 ## 3 0.2017034 0.04609577 0.34623812 ``` -- This is OK, but what if `murder_numbers` had 50 columns? It would get unwieldy. --- # Vectorization and functionals Another option is to put our function inside of a for-loop: ```r murders_z = list() for(col in names(murder_numbers)) { murders_z[[col]] = calculate_z(murder_numbers[[col]]) } head(as.data.frame(murders_z)) ``` ``` ## total population rate ## 1 -0.2090939 -0.18890769 0.01844305 ## 2 -0.7003568 -0.78207215 -0.04231860 ## 3 0.2017034 0.04609577 0.34623812 ## 4 -0.3869650 -0.46057478 0.16703781 ## 5 4.5426034 4.54448196 0.24225740 ## 6 -0.5055457 -0.15254681 -0.60529343 ``` -- But this is hard to read (and to write). It puts more emphasis on the structure -- the objects and the iteration -- than the action. --- # Vectorization and functionals Instead, we can do the same thing in 1 line with `sapply`: ```r murders_z = sapply(murder_numbers, calculate_z) head(as.data.frame(murders_z)) ``` ``` ## total population rate ## 1 -0.2090939 -0.18890769 0.01844305 ## 2 -0.7003568 -0.78207215 -0.04231860 ## 3 0.2017034 0.04609577 0.34623812 ## 4 -0.3869650 -0.46057478 0.16703781 ## 5 4.5426034 4.54448196 0.24225740 ## 6 -0.5055457 -0.15254681 -0.60529343 ``` So refreshing! Now you can start to see why for-loops are discouraged in R. -- So what is a **functional?** A functional is a function that takes another function as an argument. The `sapply` function is an example of a functional. --- class: inverse, middle name: parallelization # Parallelization --- # Parallelization Run these preliminaries: ```r library(future) library(future.apply) library(tictoc) plan(multisession) ``` Let's make a deliberately slow function: ```r slow_square = function(x = 1) { Sys.sleep(1/3) # Just wait a third of a second return(x^2) } ``` --- # Parallelization How long does it take to run this for 24 iterations? ```r tic() sapply(1:24, slow_square) ``` ``` ## [1] 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 289 324 361 ## [20] 400 441 484 529 576 ``` ```r toc() ``` ``` ## 8 sec elapsed ``` The iteration is executed in **serial,** so it should have taken about `\(24 * 1/3 = 8\)` seconds. But we can speed this up. Modern CPUs are made up of multiple **cores,** or processing units, so they can actually run iterations in **parallel.** How many cores does your computer have? ```r availableCores() ``` ``` ## system ## 8 ``` --- # Parallelization A really easy way to **parallelize** your code uses the `future.apply` package. Let's try it: ```r tic() future_sapply(1:24, slow_square) ``` ``` ## [1] 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 289 324 361 ## [20] 400 441 484 529 576 ``` ```r toc() ``` ``` ## 2.14 sec elapsed ``` Much less time! We've engaged not just one core, but all of them. -- What did we have to do to make it parallel? * Declare your intent to use parallel processing: `plan(multisession)` * Use `future_sapply` instead of `sapply` * That's it! --- # Parallelization There are other packages that implement parallel processing in R, but `future` is uniquely simple, powerful, and consistent across operating systems. * It makes it so that you don't really need to know about hardware or what's going on behind the scenes. * But there is a lot more you *can* learn about if you're interested. When is parallelization useful? * When your code **takes a long time** and is **easy to break up into chunks.** * No point in using it for code that's already fast -- parallelization involves some overhead. * The chunks must be independent -- they can't need to communicate with each other. * Examples: Bootstrapping, simulation, many iteration problems. --- # Other functions & packages Other functions for vectorization: * Base: `lapply()` returns a list instead of a vector, so it's a bit more general. (You can always `unlist` the result.) ```r sapply(1:2, sqrt) ``` ``` ## [1] 1.000000 1.414214 ``` ```r lapply(1:2, sqrt) ``` ``` ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 1.414214 ``` ```r unlist(lapply(1:2, sqrt)) ``` ``` ## [1] 1.000000 1.414214 ``` --- # Other functions & packages Other functions for vectorization: * Base: `lapply()` returns a list instead of a vector, so it's a bit more general. (You can always `unlist` the result.) * Base: `mapply()` is a multivariate version of `sapply` -- it vectorizes over more than one dimension. * Package `purrr`: `map()` and `map2()` are similar to `lapply` and `mapply` but interact better with other functions in the Tidyverse. * Package `furrr`: Parallelized Tidyverse-friendly vectorization. Combines `purrr` with `future`. --- # Programming 1. [If/else statements](#if-else) 1. [For-loops](#for-loops) 1. [Functions](#functions) 1. [Vectorization](#vectorization) 1. [Parallelization](#parallelization)