class: center, middle, inverse, title-slide # Data in/and .mono[R] ## EC 425/525, Lab 2 ### Edward Rubin ### 12 April 2019 --- class: inverse, middle # Prologue --- name: schedule # Schedule ## Last time Getting to know .mono[R]—objects, functions, *etc.* ## Today Working with data in .mono[R]. - The `data.frame` class - The `dplyr` package ## Upcoming .hi[Due Monday] Step 1 of our research-project proposal. --- layout: true # Matrices --- name: review ## Quick review 1. `mat <- matrix(data = 1:10, ncol = 2)` creates a 5×2 `matrix` object containing the numbers 1 through 10 (filled by column). 1. `mat[1,]` grabs the first row of our matrix `mat`. 1. `mat[3,2] <- NA` assigns `NA` to row-3 column-2 element of `mat`. 1. `head(mat, 3)` returns up to the first three rows of `mat`. 1. `matrix(data = rnorm(100), ncol = 10)` creates a 10×10 matrix filled with random draws from `\(N(\mu=0,\sigma^2=1)\)`. 1. `mat[3,2] <- "Carrots"` assigns the `character` object `"Carrots"` to the `[3,2]` element of `mat`, forcing all elements of `mat` to `character`. --- ## Next steps Matrices are convenient two-dimensional arrays on which math "works."<sup>.pink[†]</sup> .footnote[.pink[†] At least for `numeric` and `logical` matrices.] *But* matrices also require all elements to be of the same class. .qa[Q] What if we a datasets whose variables (columns) have different classes? -- .qa[A] We need a more flexible table-like object for our data. -- <br>Maybe a `data.table`? -- Or a `data.frame`? -- We'll start with `data.frame`. -- We will spend a good amount of time on data frames, as they make up a huge part of your workflow. --- layout: true # Data frames --- name: df A `data.frame` is .mono[R]'s base, spreadsheet-like object that holds variables. -- .ex[Example] -- ``` #> id first_name fave_num is_tired loves_econ #> 1 1 Karmin 68 TRUE FALSE #> 2 2 Raychelle 57 TRUE TRUE #> 3 3 Jemelle 10 TRUE TRUE #> 4 4 Yusif 90 TRUE TRUE #> 5 5 Catherine 24 TRUE TRUE #> 6 6 Glory 4 TRUE TRUE #> 7 7 Kaelah 33 FALSE TRUE #> 8 8 Lysette 96 TRUE TRUE #> 9 9 Cisco 89 TRUE TRUE #> 10 10 Harman 69 TRUE TRUE #> 11 11 Jennelle 64 TRUE TRUE #> 12 12 Crayton 100 TRUE TRUE ``` --- A `data.frame` is .mono[R]'s base, spreadsheet-like object that holds variables. .ex[Example] ``` #> name height mass gender homeworld species #> 1 Luke Skywalker 172 77 male Tatooine Human #> 2 C-3PO 167 75 <NA> Tatooine Droid #> 3 R2-D2 96 32 <NA> Naboo Droid #> 4 Darth Vader 202 136 male Tatooine Human #> 5 Leia Organa 150 49 female Alderaan Human #> 6 Owen Lars 178 120 male Tatooine Human #> 7 Beru Whitesun lars 165 75 female Tatooine Human #> 8 R5-D4 97 32 <NA> Tatooine Droid #> 9 Biggs Darklighter 183 84 male Tatooine Human #> 10 Obi-Wan Kenobi 182 77 male Stewjon Human #> 11 Anakin Skywalker 188 84 male Tatooine Human #> 12 Wilhuff Tarkin 180 NA male Eriadu Human ``` --- name: creation ## Creation The `data.frame()` function creates... -- `data.frame` objects. -- You'll generally define data frames by passing the function<br>(.hi-slate[1]) column names and (.hi-slate[2]) values for the columns. ```r data.frame(var1 = 1:5, var2 = "apple", var3 = rnorm(5)) ``` -- You can also assign the values using already-existing objects, _e.g._, ```r # An object with value tmp <- rnorm(5) # Creating the data frame data.frame(var1 = 1:5, var2 = "apple", var3 = tmp) ``` --- ## Creation ```r # Creating the data frame data.frame(var1 = 1:5, var2 = "apple", var3 = rnorm(5)) ``` ``` #> var1 var2 var3 #> 1 1 apple -0.6250393 #> 2 2 apple -1.6866933 #> 3 3 apple 0.8377870 #> 4 4 apple 0.1533731 #> 5 5 apple -1.1381369 ``` (What a beauty.) -- Notice that .mono[R] assumes we want to repeat `"apple"` for the entire column. --- ## Creation You can also create data frames from other objects (_e.g._, matrices) using the function `as.data.frame()`<sup>.pink[†]</sup>. .footnote[.pink[†] Or just plain, old `data.frame()`.] However, your data frame's columns will only have names if your matrix's columns had names. --- name: indexing ## Indexing Consider a data frame `our_df <- data.frame(x = 1:3, y = 4:6, z = 7:9)`. .purple[Option 1] Index data frames just as you index matrices in .mono[R]. - `our_df[1,1]` grabs the value in the first row of the first variable. - `our_df[2,]` returns the second row of `our_df` (as a data frame). - `our_df[,3]` returns the third column (variable) of `our_df` (as a vector). -- .purple[Option 2] Reference values/variables using columns' names. - `our_df$x` returns the column named `x` (as a vector). .hi[New:] `$` - `our_df[,"x"]` returns the column named `x` (as a vector). - `our_df["x"]` returns the column named `x` (as a data frame). - `our_df[,c("x","y")]` returns a data frame with variables `"x"` and `"y"`. --- name: names ## Names (of columns) The columns (variables) in your data frame have names.<sup>.pink[†]</sup> .footnote[.pink[†] If you don't name the columns, then .mono[R] will.] .qa[Q] What if you want to see/know those names? -- .qa[A] You've got a few options. -- 1. The `names()` function returns the *names* of an object. -- 2. `head(your_df)` will show you the first 6 rows of `your_df`. <br>*Note:* May provide too much output if you have a lot of columns. -- 3. In .mono[RStudio]: `View(your_df)` or look in your .mono[Environment] tab. --- ## Naming The `names()` function will also help you rename any/all variables. -- Change the names of .b[all variables] (include a name for each variable): ```r # Set new names names(our_df) <- c("name1", "name2", "name3") ``` -- Change the name of .b[the second variable] (only): ```r # Set new names names(our_df)[2] <- "name2" ``` --- name: adding ## Adding variables Just as we referenced .pink[existing] variables using `$var_name`, <br>we can create .purple[new] varirables using `$new_var`, _e.g._, ```r # Add a variable to our_df our_df$new_var <- 1:100 ``` -- If you want to use existing columns to create a new variable ```r # Create interaction: xy = x * y our_df$xy <- our_df$x * our_df$y ``` -- .qa[Q] Isn't there a better/faster/less-typing way? -- .qa[A] Yes. *Enter* `dplyr` -- (also: `data.table`, which we'll leave for the future). --- layout: true # dplyr --- name: dplyr ## Intro It's a package. -- `dplyr` is not installed by default, so you'll need to install it.<sup>.pink[†]</sup> .footnote[.pink[†] or just `p_load(dplyr)` after loading `pacman`.] -- `dplyr` is part of the [`tidyverse`](https://dplyr.tidyverse.org/) (Hadleyverse), and it follows a grammar-based approach to programming/data work. -- - `data` compose the subjects of your stories - `dplyr` provides the *verbs* (action words) :<br> `filter()`, `mutate()`, `select()`, `group_by()`, `summarize()`, `arrange()` -- .hi-slate[*Bonus*] `dplyr` is pretty fast and able to interact with SQL databases. --- name: mutate ## Manipulating variables: `mutate()` `dplyr` streamlines adding/manipulating variables in your data frame. .hi-slate[Function] `mutate(.data, ...)` - .pink[Required argument] `.data`, an existing data frame - .pink[Additional arguments] Names and values of the new variables - .pink[Output] An updated data frame -- .ex[Example] ```r mutate(.data = our_df, new1 = 7, new2 = x * y) ``` --- ## `mutate()` .ex[Example] Take the data frame ```r my_df <- data.frame(x = 1:4, y = 5:8) ``` -- `mutate()` allows us to create many new variables with one call. .pull-left[ ```r mutate(.data = my_df, xy = x * y, x2 = x^2, y2 = y^2, xy2 = xy^2, is_x_max = x == max(x) ) ``` ] -- .pull-right[ ``` #> x y xy x2 y2 xy2 is_x_max #> 1 1 5 5 1 25 25 FALSE #> 2 2 6 12 4 36 144 FALSE #> 3 3 7 21 9 49 441 FALSE #> 4 4 8 32 16 64 1024 TRUE ``` Notice `mutate()` returns the original *and* new columns. ] --- name: transmute ## `mutate()` *vs.* `transmute()` As their names imply, `mutate()` and `transmute()` are very similar functions. - `mutate()` returns the .pink[original] *and* .purple[new] columns (variables). - `transmute()` returns only the .purple[new] columns (variables). -- .slate[*Note*] Both functions return a new object as *output*—they do not update the object in .mono[R]'s memory. (This is the case for all functions in `dplyr`.) --- name: pipes ## Pipes We can't go much deeper into the land of `dplyr` without mentioning pipes. -- A *pipe* in programming allows you to take the output of one function and plug it into another function as an argument/input. -- In `dplyr`, the expression for a pipe is `%>%`. -- .mono[R]'s pipe specifically plugs the returned object to the .pink[left] of the pipe into the first argument of the function on the .purple[right] fo the pipe, _e.g._, -- ```r rnorm(10) %>% mean() ``` ``` #> [1] 0.4854731 ``` --- ## Pipes Pipes help avoid lots of nested functions, prevent excessive writing to your disc, and increase the readability of our .mono[R] scripts. -- .ex[Example] Three ways to draw 100 N(0,1) observations and calculate the interquartile range (IQR: difference between the 75.super[th] and 25.super[th] percentiles). ```r # Save each intermediate step draw <- rnorm(100) end_points <- quantile(draw, probs = c(0.25, 0.75)) diff(end_points) # Lots of nesting diff(quantile(rnorm(100), probs = c(0.25, 0.75))) # Piping 💪 rnorm(100) %>% quantile(probs = c(0.25, 0.75)) %>% diff() ``` --- ## Pipes By default, .mono[R] pipes the output from the LHS of the pipe into<br>the .hi[first] argument of the function on the RHS of the pipe. -- *E.g.*, `a %>% fun(3)` is equivalent to `fun(arg1 = a, arg2 = 3)`. -- If you want to pipe output into a different argument, you use a period (`.`). -- - `b %>% fun(arg1 = 3, .)` is equivalent to `fun(arg1 = 3, arg2 = b)`. - `b %>% fun(3, .)` is also equivalent to `fun(arg1 = 3, arg2 = b)`. -- - `b %>% fun(., .)` is equivalent to `fun(arg1 = b, arg2 = b)`. -- The `magrittr` package contains even more piping power.<sup>.pink[†]</sup> .footnote[.pink[†] `magrittr` = Magritte (of [*this is not a pipe*](https://en.wikipedia.org/wiki/The_Treachery_of_Images) fame) plus .mono[R].] --- ## `%>%` and `dplyr` Each `dplyr` function begins with a `.data` argument so that you can easily pipe in data frames (recall: `mutate(.data, ...)`). -- The common workflow in `dplyr` will look something like `new_df <- old_df %>% mutate(cool stuff here)` which takes `old_df`, does some cool stuff with `mutate()`, and then saves the output of `mutate()` as `new_df`. --- ## `filter()` The `filter()` function does what its name implies: it .b[filters the rows] of your data frame .b[based upon logical conditions]. --- layout: true # dplyr ## `filter()` The `filter()` function does what its name implies: it .b[filters the rows] of your data frame .b[based upon logical conditions]. .ex[Example] .pull-left[ ```r # Create a dataset some_df <- data.frame( x = 1:10, y = 11:20 ) ``` ] --- name: filter count: false -- .pull-right[ ```r # Only keep rows where x is 3 some_df %>% filter(x == 3) ``` ``` #> x y #> 1 3 13 ``` ] --- .pull-right[ ```r # Only keep rows where x > 7 some_df %>% filter(x > 7) ``` ``` #> x y #> 1 8 18 #> 2 9 19 #> 3 10 20 ``` ] --- .pull-right[ ```r # Keep rows where y/x > 3 some_df %>% filter(y/x > 3) ``` ``` #> x y #> 1 1 11 #> 2 2 12 #> 3 3 13 #> 4 4 14 ``` ] --- .pull-right[ ```r # Keep rows where x>7 OR y<12 some_df %>% filter(x > 7 | y < 12) ``` ``` #> x y #> 1 1 11 #> 2 8 18 #> 3 9 19 #> 4 10 20 ``` ] --- .pull-right[ ```r # Keep rows where 15<=y<=18 some_df %>% filter(between(y, 15, 18)) ``` ``` #> x y #> 1 5 15 #> 2 6 16 #> 3 7 17 #> 4 8 18 ``` ] --- .pull-right[ ```r # Keep rows where y > 20 some_df %>% filter(y > 20) ``` ``` #> [1] x y #> <0 rows> (or 0-length row.names) ``` ] If you filter your data frame down to nothing, .mono[R] returns a 0-row data frame with the names/number of columns from the original data frame. --- layout: true # dplyr --- name: select ## `select()` Just as .purple[`filter()`] grabs .purple[row-based subsets] of your data frame, <br>.pink[`select()`] grabs .pink[column-based subsets]. -- You can select columns using their .b[names] <br>.pad-left[`our_df %>% select(var10, var100)`] -- you can select columns using their .b[numbers] <br>.pad-left[`our_df %>% select(10, 100)`] -- or you can select columns using .b[helper fuctions] <br>.pad-left[`our_df %>% select(starts_with("var10"))`] -- `select()` helps you narrow down a dataset to its necessary features. --- name: summarize ## `summarize()` Hopefully you're starting to see that functions' names in `dplyr` tell you what the function does. `summarize()`<sup>.pink[†]</sup> summarizes variables—you choose the variables and the summaries (_e.g._, `mean()` or `min()`). .footnote[.pink[†] or `summarise()` if you ❤️️ 🇬🇧] -- ```r the_df %>% summarize( mean(x), mean(y), mean(z), min(x), max(x), ) ``` would return a 1×5 data frame with the means of `x`, `y`, and `z`; the minimum of `x`; and the maximum of `x`. --- name: group_summarize ## `summarize()` and `group_by()` While sample-wide summarizes are certainly interesting, `dplyr` has one last gem for us: `group_by()`. `group_by()` groups your observations by the variable(s) that you name. -- Specifically, `group_by()` returns a *grouped data frame* that you can then feed to `summarize()`, `mutate()`, or `transmuate` to perform grouped calculations, _e.g._, each group's mean. --- ## Example: Grouped summaries .pull-left[.small[ ```r # Create a new data frame our_df <- data.frame( x = 1:6, y = c(0, 1), grp = rep(c("A", "B"), each = 3) ) ``` ``` #> x y grp #> 1 1 0 A #> 2 2 1 A #> 3 3 0 A #> 4 4 1 B #> 5 5 0 B #> 6 6 1 B ``` ]] -- .pull-right[.small[ ```r # For dataset 'our_df'... our_df %>% # Group by 'grp' group_by(grp) %>% # Take means of 'x' and 'y' summarize(mean(x), mean(y)) ``` ``` #> # A tibble: 2 x 3 #> grp `mean(x)` `mean(y)` #> <fct> <dbl> <dbl> #> 1 A 2 0.333 #> 2 B 5 0.667 ``` ]] --- ## Example: Grouped mutation .pull-left[.small[ ```r # Create a new data frame our_df <- data.frame( x = 1:6, y = c(0, 1), grp = rep(c("A", "B"), each = 3) ) ``` ``` #> x y grp #> 1 1 0 A #> 2 2 1 A #> 3 3 0 A #> 4 4 1 B #> 5 5 0 B #> 6 6 1 B ``` ]] -- .pull-right[.small[ ```r # Add grp means for x and y our_df %>% group_by(grp) %>% mutate( x_m = mean(x), y_m = mean(y) ) ``` ``` #> # A tibble: 6 x 5 #> # Groups: grp [2] #> x y grp x_m y_m #> <int> <dbl> <fct> <dbl> <dbl> #> 1 1 0 A 2 0.333 #> 2 2 1 A 2 0.333 #> 3 3 0 A 2 0.333 #> 4 4 1 B 5 0.667 #> 5 5 0 B 5 0.667 #> 6 6 1 B 5 0.667 ``` ]] --- name: arrange ## `arrange()` `arrange()` will sorts the rows of a data frame using the inputted columns. .mono[R] defaults to starting with the "lowest" (smallest) at the top of the data frame. Use a `-` in front of the variable's name to reverse sort. .pull-left[ ```r # As is our_df ``` ``` #> x y grp #> 1 1 0 A #> 2 2 1 A #> 3 3 0 A #> 4 4 1 B #> 5 5 0 B #> 6 6 1 B ``` ] .pull-right[ ```r # Arrang by y, grp, then -x our_df %>% arrange(y, grp, -x) ``` ``` #> x y grp #> 1 3 0 A #> 2 1 0 A #> 3 5 0 B #> 4 2 1 A #> 5 6 1 B #> 6 4 1 B ``` ] --- layout: false # Table of contents .pull-left[.hi-slate[Data and .mono[R]] .smaller[ 1. [Schedule](#schedule) 1. [Matrix review](#review) 1. [The `data.frame`](#df) - [Basic examples](#df) - [Creating](#creation) - [Indexing](#indexing) - [Names](#names) - [Adding variables](#adding) ]] .pull-right[.hi-slate[`dplyr`] .smaller[ 1. [Intro](#dplyr) 1. [`mutate()`](#mutate) 1. [`transmute()`](#transmute) 1. [Pipes (`%>%`)](#pipes) 1. [`filter()`](#filter) 1. [`select()`](#select) 1. [`summarize`](#summarize) 1. [`summarize()` and `group_by()`](#group_summarize) 1. [`arrange()`](#arrange) ]] --- exclude: true