class: center, middle, inverse, title-slide

# Lab .mono[000]
## Data cleaning and workflow [1/N]
### Edward Rubin
### 07 January 2022

---
exclude: true

---
class: inverse, middle

# Admin

---
name: admin

# Admin

Basic .hi[workflow] (best) practices (_i.e._, _Projects_)

- RStudio and projects
- Naming conventions
- Pipes (`%>%`)
- Data cleaning with `dplyr`

.note[Reminders] .b.slate[Reminder:] Readings for next week

- ISL Ch1–Ch2
- [Prediction Policy Problems](https://www.aeaweb.org/articles?id=10.1257/aer.p20151023) by Kleinberg .it[et al.] (2015)

---
layout: true

# Improving your workflow

---
class: inverse, middle

---
name: workflow-main

Data cleaning, manipulation, and analysis can be grueling, but optimizing your workflow can speed things along and make them less painful..super[.pink[†]]

.footnote[
.pink[†] Notice that I said .pink[*less*] painful.
]

.hi-slate[A few dimensions that can help]

- Understand how to interact with RStudio
- Use .mono[R] projects
- Follow reasonable naming conventions
- `dplyr` and pipes
- .grey-mid[Write your own functions (future lab)]
- .grey-mid[Use loops and parallelization (future lab)]
- .grey-light[Hire an intern/assistant to do your work for you]

---
layout: false
class: clear, middle

## Efficiency

<img src="images/xkcd-efficiency.png" width="80%" style="display: block; margin: auto;" />

.grey-light[Source: [xkcd](https://xkcd.com/1445/)]

---
layout: false
class: inverse, middle

# RStudio

---
name: rstudio
class: clear

Let's recap some of the major features in .mono[RStudio]...

<img src="images/rstudio/rstudio.png" width="4309" style="display: block; margin: auto;" />

---
class: clear

First, you write your .mono[R] scripts (source code) in the .hi[Source] pane.

<img src="images/rstudio/rstudio_source_rec.png" width="4309" style="display: block; margin: auto;" />

---
class: clear

You can use the menubar or .mono[⇧+⌘+N] to create new .mono[R] scripts.
<img src="images/rstudio/rstudio_source_arrow.png" width="4309" style="display: block; margin: auto;" />

---
class: clear

To execute commands from your .mono[R] script, use .mono[⌘+Enter].

<img src="images/rstudio/rstudio_source_ex.png" width="4309" style="display: block; margin: auto;" />

---
class: clear

.mono[RStudio] will execute the command in the console.

<img src="images/rstudio/rstudio_source_ex2.png" width="4309" style="display: block; margin: auto;" />

---
class: clear

You can see our new object in the .hi[Environment] pane.

<img src="images/rstudio/rstudio_source_ex3.png" width="4309" style="display: block; margin: auto;" />

---
class: clear

The .hi-purple[History] tab (next to .hi[Environment]) records your old commands.

<img src="images/rstudio/rstudio_history.png" width="4309" style="display: block; margin: auto;" />

---
class: clear

The .hi[Files] pane is a file explorer.

<img src="images/rstudio/rstudio_files.png" width="4309" style="display: block; margin: auto;" />

---
class: clear

The .hi[Plots] pane/tab shows... plots.

<img src="images/rstudio/rstudio_plots.png" width="4309" style="display: block; margin: auto;" />

---
class: clear

.hi[Packages] shows installed packages

<img src="images/rstudio/rstudio_packages.png" width="4309" style="display: block; margin: auto;" />

---
count: false
class: clear

.hi[Packages] shows installed packages and whether they are .hi-purple[loaded].

<img src="images/rstudio/rstudio_packages2.png" width="4309" style="display: block; margin: auto;" />

---
class: clear

The .hi[Help] tab shows help documentation (also accessible via `?`).

<img src="images/rstudio/rstudio_help.png" width="4309" style="display: block; margin: auto;" />

---
class: clear

Finally, you can customize the actual layout

<img src="images/rstudio/rstudio_layout.png" width="4309" style="display: block; margin: auto;" />

---
count: false
class: clear

Finally, you can customize the actual layout and many other items.
<img src="images/rstudio/rstudio_customize.png" width="4309" style="display: block; margin: auto;" />

---
name: best-practices

# .mono[R] and .mono[RStudio]

.b.slate[Related best practices]

1. Write code in .mono[R] scripts. Troubleshoot in .mono[RStudio]. Then run the scripts.
1. Comment your code. (`# This is a comment`)
1. Name objects/variables/files with intelligible, standardized names.
  - .hi-purple[BAD] `ALLCARS`, `Vl123a8`, `a.fun`, `cens.12931`, `cens.12933`
  - .hi-pink[GOOD] `unique_cars`, `health_df`, `sim_fun`, `is_female`, `age`
1. Write code that is readable (see the point about comments above).
1. Use projects in .mono[RStudio] (next). And organize your projects.

---
layout: true

# Projects

---
class: inverse, middle

---
name: projects

Projects in .mono[R] offer several benefits:

1. Act as an .hi[anchor] for working with files.
1. Make your work (projects) easily .hi[reproducible]..super[.pink[†]]
1. Help you .hi[quickly jump back] into your work.

.footnote[.pink[†] In this class, we're assuming reproducibility is good/desirable.]

---
layout: false
class: clear

To start a new project, hit the .hi[project icon].

<img src="images/rstudio/rstudio_projects.png" width="4309" style="display: block; margin: auto;" />

---
class: clear

You'll then choose the folder/directory where your project lives.

<img src="images/rstudio/rstudio_projects2.png" width="4309" style="display: block; margin: auto;" />

---
class: clear

If you open (double click) a project, .mono[RStudio] opens .mono[R] in that location.

<img src="images/rstudio/rstudio_projects3.png" width="4309" style="display: block; margin: auto;" />

---
count: false
class: clear

.mono[RStudio] will 'load' your previous setup (pane setup, scripts, *etc.*).
<img src="images/rstudio/rstudio_projects3.png" width="4309" style="display: block; margin: auto;" />

---
layout: true

# .mono[R] and .mono[RStudio]

.b.slate[Projects]

---

.hi-purple[Without a project], you will need to define long file paths that you'll need to keep updating as folder names/locations change.

--

`dir_class <- "/Users/edwardarubin/Dropbox/UO/Teaching/EC525S19/"`
<br>`dir_labs <- paste0(dir_class, "NotesLab/")`
<br>`dir_lab03 <- paste0(dir_labs, "03RInput/")`
<br>`sample_df <- read.csv(paste0(dir_lab03, "sample.csv"))`

--

.hi-pink[With a project], .mono[R] automatically references the project's folder.

`sample_df <- read.csv("sample.csv")`

--

.note[Double-plus bonus] The [`here`](https://github.com/r-lib/here) package extends projects' reproducibility.

---
layout: true

# Pipes and dplyr

---
class: inverse, middle

---

.b.slate[Introduction]

1. Pipes (`%>%`) make your life easier..super[.pink[†]]
1. `dplyr` is your data-work friend.

.footnote[.pink[†] Check out `magrittr` for more pipe options, _e.g._, `%<>%`.]

---
name: pipes

.b.slate[What is a pipe?]

--

Pipes are a .pink[simplifying] programming tool; they make your code easier to read.

--

Take the .hi-pink[output] of a function as the .hi-purple[input/argument] of another function.

--

In `dplyr`, the expression for a pipe is `%>%`.

.footnote[.pink[†] .mono[R] also has a native pipe, `|>`, as of .mono[R] 4.1.0.]

--

.mono[R]'s pipe specifically plugs the returned object to the .pink[left] of the pipe into the first argument of the function on the .purple[right] of the pipe, _e.g._,

--

```r
rnorm(10) %>% mean()
```

```
#> [1] -0.5162503
```

---

.b.slate[Pipes]

Pipes avoid nested functions, prevent excessive writing to your disk, and increase the readability of our .mono[R] scripts.

--

.ex[Example] Three ways to draw 100 N(0,1) observations and calculate the interquartile range (IQR: difference between the 75.super[th] and 25.super[th] percentiles).
```r
# Save each intermediate step
draw <- rnorm(100)
end_points <- quantile(draw, probs = c(0.25, 0.75))
diff(end_points)
# Lots of nesting
diff(quantile(rnorm(100), probs = c(0.25, 0.75)))
# Piping 💪
rnorm(100) %>% quantile(probs = c(0.25, 0.75)) %>% diff()
```

--

Think [Russian dolls](https://en.wikipedia.org/wiki/Matryoshka_doll)

---

.b.slate[Pipes]

By default, .mono[R] pipes the output from the LHS of the pipe into<br>the .hi[first] argument of the function on the RHS of the pipe.

--

*E.g.*, `a %>% fun(3)` is equivalent to `fun(arg1 = a, arg2 = 3)`.

--

If you want to pipe output into a different argument, you use a period (`.`).

--

- `b %>% fun(arg1 = 3, .)` is equivalent to `fun(arg1 = 3, arg2 = b)`.
- `b %>% fun(3, .)` is also equivalent to `fun(arg1 = 3, arg2 = b)`.

--

- `b %>% fun(., .)` is equivalent to `fun(arg1 = b, arg2 = b)`.

--

The `magrittr` package contains even more piping power.<sup>.pink[†]</sup>

.footnote[.pink[†] `magrittr` = Magritte (of [*this is not a pipe*](https://en.wikipedia.org/wiki/The_Treachery_of_Images) fame) plus .mono[R].]

---
layout: true

# dplyr

---

# dplyr

.b.slate[Before we begin:]

1. Ensure `tidyverse` is installed: `install.packages('tidyverse')`

--

1. Install `nycflights13` package: `install.packages('nycflights13')`

--

1. Load the packages (one `library()` call per package): `library(tidyverse); library(nycflights13)`

--

1. Test the `flights` dataset: `(flights)`
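
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time |
|------|-------|-----|----------|----------------|-----------|----------|
| 2013 | 1 | 1 | 517 | 515 | 2 | 830 |
| 2013 | 1 | 1 | 533 | 529 | 4 | 850 |
| 2013 | 1 | 1 | 542 | 540 | 2 | 923 |
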
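---

.b.slate[Before we begin: a scripted version]

The setup steps on the previous slide can also sit at the top of a script. What follows is only a minimal sketch, not part of the original lab: the object names `pkgs`, `is_installed`, and `to_install` are made up for illustration, and the install line is commented out because it assumes internet access to a CRAN mirror. Note that `library()` attaches one package per call.

```r
# Packages used in this lab
pkgs <- c("tidyverse", "nycflights13")

# Check which packages are already installed (named TRUE/FALSE vector)
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
to_install <- pkgs[!is_installed]

# Install whatever is missing (uncomment to run; assumes internet access)
# install.packages(to_install)

# library() attaches one package per call, so loop over the installed ones
for (p in pkgs[is_installed]) {
  library(p, character.only = TRUE)
}

# If nycflights13 is available, peek at the first rows of 'flights'
if (is_installed[["nycflights13"]]) {
  print(head(nycflights13::flights, 3))
}
```

Checking before installing keeps the script quick to re-run, since nothing is downloaded when the packages are already there.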
---
class: inverse, middle

---
name: dplyr

.b.slate[Introduction]

It's a package.

--

`dplyr` is not installed by default, so you'll need to install it.<sup>.pink[†]</sup>

.footnote[.pink[†] or just `p_load(dplyr)` after loading `pacman`.]

--

`dplyr` is part of the [`tidyverse`](https://dplyr.tidyverse.org/) (Hadleyverse), and it follows a grammar-based approach to programming/data work.

--

- `data` compose the subjects of your stories
- `dplyr` provides the *verbs* (action words) :<br> `filter()`, `mutate()`, `select()`, `group_by()`, `summarize()`, `arrange()`

--

<br> .hi-slate[*Bonus*] `dplyr` is pretty fast and able to interact with SQL databases.

---
name: mutate

.b-slate[Manipulating variables:] `mutate()`

`dplyr` streamlines adding/manipulating variables in your data frame.

.hi-slate[Function] `mutate(.data, ...)`

- .pink[Required argument] `.data`, an existing data frame
- .pink[Additional arguments] Names and values of the new variables
- .pink[Output] An updated data frame

--

.ex[Example]

```r
mutate(.data = our_df, new1 = 7, new2 = x * y)
```

---

## `mutate()`

.ex[Example] Take the data frame

```r
my_df <- data.frame(x = 1:3, y = 5:7)
```

--

`mutate()` allows us to create many new variables with one call.

.pull-left[
```r
mutate(.data = my_df,
  xy = x * y,
  x2 = x^2,
  xy2 = xy^2,
  is_max = x == max(x)
)
```
]

--

.pull-right[
Notice `mutate()` returns the original *and* new columns.
]

---
name: transmute

## `mutate()` *vs.* `transmute()`

As their names imply, `mutate()` and `transmute()` are very similar functions.

- `mutate()` returns the .pink[original] *and* .purple[new] columns (variables).
- `transmute()` returns only the .purple[new] columns (variables).

--

.slate[*Note*] Both functions return a new object as *output*; they do not update the object in .mono[R]'s memory. (This is the case for all functions in `dplyr`.)

---

## `%>%` and `dplyr`

Each `dplyr` function begins with a `.data` argument so that you can easily pipe in data frames (recall: `mutate(.data, ...)`).

--

The common workflow in `dplyr` will look something like

`new_df <- old_df %>% mutate(cool stuff here)`

which takes `old_df`, does some cool stuff with `mutate()`, and then saves the output of `mutate()` as `new_df`.

---

## `filter()`

The `filter()` function does what its name implies: it .b[filters the rows] of your data frame .b[based upon logical conditions].

---
layout: true

# dplyr

## `filter()`

The `filter()` function does what its name implies: it .b[filters the rows] of your data frame .b[based upon logical conditions].

.ex[Example]

.pull-left[
```r
# Create a dataset
some_df <- data.frame(
  x = 1:10,
  y = 11:20
)
```
]

---
name: filter
count: false

--

.pull-right[
```r
# Only keep rows where x is 3
some_df %>% filter(x == 3)
```
]

---

.pull-right[
```r
# Only keep rows where x > 7
some_df %>% filter(x > 7)
```
]

---

.pull-right[
```r
# Keep rows where y/x > 3
some_df %>% filter(y/x > 3)
```
]

---

.pull-right[
```r
# Keep rows where x>8 OR y<12
some_df %>% filter(x > 8 | y < 12)
```
]

---

.pull-right[
```r
# Keep rows where 16<=y<=18
some_df %>% filter(between(y, 16, 18))
```
]

---

.pull-right[
```r
# Keep rows where y > 20
some_df %>% filter(y > 20)
```
]

If you filter your data frame down to nothing, .mono[R] returns a 0-row data frame with the names/number of columns from the original data frame.

---
layout: true

# dplyr

---
name: select

## `select()`

Just as .purple[`filter()`] grabs .purple[row-based subsets] of your data frame, <br>.pink[`select()`] grabs .pink[column-based subsets].

--

You can select columns using their .b[names] <br>.pad-left[`our_df %>% select(var10, var100)`]

--

you can select columns using their .b[numbers] <br>.pad-left[`our_df %>% select(10, 100)`]

--

or you can select columns using .b[helper functions] <br>.pad-left[`our_df %>% select(starts_with("var10"))`]

--

`select()` helps you narrow down a dataset to its necessary features.

---
name: summarize

## `summarize()`

Hopefully you're starting to see that functions' names in `dplyr` tell you what the function does.

`summarize()`<sup>.pink[†]</sup> summarizes variables: you choose the variables and the summaries (_e.g._, `mean()` or `min()`).

.footnote[.pink[†] or `summarise()` if you ❤️️ 🇬🇧]

--

```r
the_df %>% summarize(
  mean(x), mean(y), mean(z),
  min(x), max(x)
)
```

would return a 1×5 data frame with the means of `x`, `y`, and `z`; the minimum of `x`; and the maximum of `x`.

---
name: group_summarize

## `summarize()` and `group_by()`

While sample-wide summaries are certainly interesting, `dplyr` has one last gem for us: `group_by()`.

`group_by()` groups your observations by the variable(s) that you name.

--

Specifically, `group_by()` returns a *grouped data frame* that you can then feed to `summarize()`, `mutate()`, or `transmute()` to perform grouped calculations, _e.g._, each group's mean.

---

## Example: Grouped summaries

.pull-left[.small[
```r
# Create a new data frame
our_df <- data.frame(
  x = 1:6,
  y = c(0, 1),
  grp = rep(c("A", "B"), each = 3)
)
```
]]

--

.pull-right[.small[
```r
# For dataset 'our_df'...
our_df %>%
  # Group by 'grp'
  group_by(grp) %>%
  # Take means of 'x' and 'y'
  summarize(mean(x), mean(y))
```
]]

---

## Example: Grouped mutation

.pull-left[.small[
```r
# Create a new data frame
our_df <- data.frame(
  x = 1:6,
  y = c(0, 1),
  grp = rep(c("A", "B"), each = 3)
)
```
]]

--

.pull-right[.small[
```r
# Add grp means for x and y
our_df %>%
  group_by(grp) %>%
  mutate(
    x_m = mean(x),
    y_m = mean(y)
  )
```
]]

---
name: arrange

## `arrange()`

`arrange()` sorts the rows of a data frame using the inputted columns.

.mono[R] defaults to starting with the "lowest" (smallest) at the top of the data frame. Use a `-` in front of a numeric variable's name (or wrap any variable in `desc()`) to reverse the sort.

---
layout: false
class: clear, middle

.pull-left[
```r
# As is
our_df
```
]

.pull-right[
```r
# Arrange by y, grp, then -x
our_df %>% arrange(y, grp, -x)
```
]

---
name: tidyverse

# The tidyverse

There's more! `dplyr` and .purple[`tidyr`] offer even more....super[.pink[†]]

.footnote[
.pink[†] And these are only two of the packages in the `tidyverse`.
]

- .note[Viewing data] `glimpse()`, `top_n()`
- .note[Sampling] `sample_n()`, `sample_frac()`
- .note[Summaries] `first()`, `last()`, `nth()`, `n_distinct()`
- .note[Duplicates] `distinct()`
- .note[Missingness] `na_if()`, .purple[`replace_na()`], .purple[`drop_na()`], .purple[`fill()`]

--

The folks at RStudio have put together some great cheatsheets, *e.g.*,

- [`dplyr`](https://raw.githack.com/edrubin/EC524W20/master/resources/cheatsheet-dplyr.pdf)
- [data import](https://raw.githack.com/edrubin/EC524W20/master/resources/cheatsheet-data-import.pdf)
- [data wrangling](https://raw.githack.com/edrubin/EC524W20/master/resources/cheatsheet-data-wrangling.pdf)

---

# Exercises

Some selected exercises from [R for Data Science](https://r4ds.had.co.nz/index.html) by Hadley Wickham

1. Exercise [5.2.4.1](https://r4ds.had.co.nz/transform.html#exercises-8)
1. Exercises [5.3.1.2, 5.3.1.3, 5.3.1.4](https://r4ds.had.co.nz/transform.html#exercises-9)
1. Exercise [5.5.2.4](https://r4ds.had.co.nz/transform.html#exercises-11)
1. Exercise [5.7.1.3](https://r4ds.had.co.nz/transform.html#exercises-13)

---
layout: false

# Table of contents

.col-left[
.small[
#### Admin
- [Today and upcoming](#admin)

#### Workflow
- [General](#workflow-main)
- [RStudio](#rstudio)
- [Related best practices](#best-practices)
- [Projects](#projects)
]]

.col-right[
.small[
#### `dplyr`
- [Pipes](#pipes)
- [`mutate`](#mutate)
- [`transmute()`](#transmute)
- [`arrange`](#arrange)
- [`filter()`](#filter)
- [`select()`](#select)
- [`summarize`](#summarize)
- [`summarize()` and `group_by()`](#group_summarize)
- [The `tidyverse`](#tidyverse)
]
]
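---

# Recap: chaining the verbs

The `dplyr` verbs from this lab can be chained with pipes into a single pipeline. The following is a minimal sketch, assuming `dplyr` is installed; it reuses the `our_df` example from the grouped-summary slides, while the names `recap_df`, `mean_x`, and `n_obs` are invented here for illustration.

```r
library(dplyr)

# Same toy data frame as in the grouped examples
our_df <- data.frame(
  x = 1:6,
  y = c(0, 1),
  grp = rep(c("A", "B"), each = 3)
)

recap_df <- our_df %>%
  # Keep rows where x is greater than 1
  filter(x > 1) %>%
  # Keep only the columns we need
  select(x, grp) %>%
  # One summary row per group
  group_by(grp) %>%
  summarize(mean_x = mean(x), n_obs = n()) %>%
  # Largest group mean first
  arrange(desc(mean_x))

recap_df
```

The result has one row per group, with group B (mean of `x` equal to 5) sorted above group A (mean of `x` equal to 2.5). Using `desc()` instead of `-` also works for non-numeric columns.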