--- title: "Lab .mono[000]" subtitle: "Data cleaning and workflow (1/N)" author: "Edward Rubin" #date: "`r format(Sys.time(), '%d %B %Y')`" date: "10 January 2020" output: xaringan::moon_reader: css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css'] # self_contained: true nature: highlightStyle: github highlightLines: true countIncrementalSlides: false --- exclude: true ```{R, setup, include = F} library(pacman) p_load( broom, tidyverse, ggplot2, ggthemes, ggforce, ggridges, cowplot, latex2exp, viridis, extrafont, gridExtra, kableExtra, snakecase, janitor, DT, data.table, dplyr, lubridate, knitr, future, furrr, estimatr, FNN, parsnip, huxtable, here, magrittr ) # Define colors red_pink = "#e64173" turquoise = "#20B2AA" orange = "#FFA500" red = "#fb6107" blue = "#3b3b9a" green = "#8bb174" grey_light = "grey70" grey_mid = "grey50" grey_dark = "grey20" purple = "#6A5ACD" slate = "#314f4f" # Knitr options opts_chunk$set( comment = "#>", fig.align = "center", fig.height = 7, fig.width = 10.5, warning = F, message = F ) opts_chunk$set(dev = "svg") options(device = function(file, width, height) { svg(tempfile(), width = width, height = height) }) options(knitr.table.format = "html") ``` --- class: inverse, middle # Admin --- name: admin # Admin Basic .hi[workflow] (best) practices (_i.e._, _Projects_) - RStudio and projects - Naming conventions - Pipes (`%>%`) - Data cleaning with `dplyr` .note[Reminder] Readings for next week - ISL Ch1–Ch2 - [Prediction Policy Problems](https://www.aeaweb.org/articles?id=10.1257/aer.p20151023) by Kleinberg .it[et al.] (2015) --- layout: true # Improving your workflow --- class: inverse, middle --- name: workflow-main Data cleaning, manipulation, and analysis can be grueling, but optimizing your workflow can speed things along and make them less painful..super[.pink[†]] .footnote[ .pink[†] Notice that I said .pink[*less*] painful. ] .hi-slate[A few dimensions that can help] - Understand how to interact with RStudio - Use .mono[R] projects - Follow reasonable naming conventions - `dplyr` and pipes - .grey-mid[Write your own functions (future lab)] - .grey-mid[Use loops and parallelization (future lab)] - .grey-light[Hire an intern/assistant to do your work for you] --- layout: false class: clear, middle ## Efficiency ```{R, fig xkcd, echo = F, out.width = '80%'} knitr::include_graphics("images/xkcd-efficiency.png") ``` .grey-light[Source: [xkcd](https://xkcd.com/1445/)] --- layout: false class: inverse, middle # RStudio --- name: rstudio class: clear Let's recap some of the major features in .mono[RStudio]... ```{R, pic_rstudio, echo = F} knitr::include_graphics("images/rstudio/rstudio.png") ``` --- class: clear First, you write your .mono[R] scripts (source code) in the .hi[Source] pane. ```{R, pic_rstudio_source1, echo = F} knitr::include_graphics("images/rstudio/rstudio_source_rec.png") ``` --- class: clear You can use the menubar or .mono[⇧+⌘+N] to create new .mono[R] scripts. ```{R, pic_rstudio_source2, echo = F} knitr::include_graphics("images/rstudio/rstudio_source_arrow.png") ``` --- class: clear To execute commands from your .mono[R] script, use .mono[⌘+Enter]. ```{R, pic_rstudio_source3, echo = F} knitr::include_graphics("images/rstudio/rstudio_source_ex.png") ``` --- class: clear .mono[RStudio] will execute the command in the terminal. ```{R, pic_rstudio_source4, echo = F} knitr::include_graphics("images/rstudio/rstudio_source_ex2.png") ``` --- class: clear You can see our new object in the .hi[Environment] pane. ```{R, pic_rstudio_source5, echo = F} knitr::include_graphics("images/rstudio/rstudio_source_ex3.png") ``` --- class: clear The .hi-purple[History] tab (next to .hi[Environment]) records your old commands. ```{R, pic_rstudio_history, echo = F} knitr::include_graphics("images/rstudio/rstudio_history.png") ``` --- class: clear The .hi[Files] pane is file explorer. ```{R, pic_rstudio_files, echo = F} knitr::include_graphics("images/rstudio/rstudio_files.png") ``` --- class: clear The .hi[Plots] pane/tab shows... plots. ```{R, pic_rstudio_plots, echo = F} knitr::include_graphics("images/rstudio/rstudio_plots.png") ``` --- class: clear .hi[Packages] shows installed packages ```{R, pic_rstudio_packages, echo = F} knitr::include_graphics("images/rstudio/rstudio_packages.png") ``` --- count: false class: clear .hi[Packages] shows installed packages and whether they are .hi-purple[loaded]. ```{R, pic_rstudio_packages2, echo = F} knitr::include_graphics("images/rstudio/rstudio_packages2.png") ``` --- class: clear The .hi[Help] tab shows help documentation (also accessible via `?`). ```{R, pic_rstudio_help, echo = F} knitr::include_graphics("images/rstudio/rstudio_help.png") ``` --- class: clear Finally, you can customize the actual layout ```{R, pic_rstudio_layout, echo = F} knitr::include_graphics("images/rstudio/rstudio_layout.png") ``` --- count: false class: clear Finally, you can customize the actual layout and many other items. ```{R, pic_rstudio_customize, echo = F} knitr::include_graphics("images/rstudio/rstudio_customize.png") ``` --- name: best-practices # .mono[R] and .mono[RStudio] ## Related best practices 1. Write code in .mono[R] scripts. Troubleshoot in .mono[RStudio]. Then run the scripts. 1. Comment your code. (`# This is a comment`) 1. Name objects/variables/files with intelligible, standardized names. - .hi-purple[BAD] `ALLCARS`, `Vl123a8`, `a.fun`, `cens.12931`, `cens.12933` - .hi-pink[GOOD] `unique_cars`, `health_df`, `sim_fun`, `is_female`, `age` 1. Write code that is readable (see comments comment above). 1. Use projects in .mono[RStudio] (next). And organize your projects. --- layout: true # Projects --- class: inverse, middle --- name: projects Projects in .mono[R] offer several benefits 1. Act as an .hi[anchor] for working with files. 1. Make your work (projects) easily .hi[reproducible]..super[.pink[†]] 1. Help you .hi[quickly jump back] into your work. .footnote[.pink[†] In this class, we're assuming reproducibility is good/desirable.] --- layout: false class: clear To start a new project, hit the .hi[project icon]. ```{R, pic_rstudio_projects, echo = F} knitr::include_graphics("images/rstudio/rstudio_projects.png") ``` --- class: clear You'll then choose the folder/directory where your project lives. ```{R, pic_rstudio_projects2, echo = F} knitr::include_graphics("images/rstudio/rstudio_projects2.png") ``` --- class: clear If you open (double click) a project, .mono[RStudio] opens .mono[R] in that location. ```{R, pic_rstudio_projects3, echo = F} knitr::include_graphics("images/rstudio/rstudio_projects3.png") ``` --- count: false class: clear .mono[RStudio] will 'load' your previous setup (pane setup, scripts, *etc.*). ```{R, pic_rstudio_projects3b, echo = F} knitr::include_graphics("images/rstudio/rstudio_projects3.png") ``` --- layout: true # .mono[R] and .mono[RStudio] ## Projects --- .hi-purple[Without a project], you will need to define long file paths that you'll need to keep updating as folder names/locations change. -- `dir_class <- "/Users/edwardarubin/Dropbox/UO/Teaching/EC525S19/"`
`dir_labs <- paste0(dir_class, "NotesLab/")`
`dir_lab03 <- paste0(dir_labs, "03RInput/")`
`sample_df <- read.csv(paste0(dir_lab03, "sample.csv"))` -- .hi-pink[With a project], .mono[R] automatically references the project's folder. `sample_df <- read.csv("sample.csv")` -- .note[Double-plus bonus] The [`here`](https://github.com/r-lib/here) package extends projects' reproducibility. --- layout: true # Pipes and dplyr --- class: inverse, middle --- ## Introduction 1. Pipes (`%>%`) make your life easier..super[.pink[†]] 1. `dplyr` is your data-work friend. .footnote[.pink[†] Check out `magrittr` for more pipe options, _e.g._, `%<>%`.] --- name: pipes ## Pipes We can't go much deeper into the land of `dplyr` without mentioning pipes. -- A *pipe* in programming allows you to take the output of one function and plug it into another function as an argument/input. -- In `dplyr`, the expression for a pipe is `%>%`. -- .mono[R]'s pipe specifically plugs the returned object to the .pink[left] of the pipe into the first argument of the function on the .purple[right] fo the pipe, _e.g._, -- ```{R, ex_pipe_simple} rnorm(10) %>% mean() ``` --- ## Pipes Pipes help avoid lots of nested functions, prevent excessive writing to your disc, and increase the readability of our .mono[R] scripts. -- .ex[Example] Three ways to draw 100 N(0,1) observations and calculate the interquartile range (IQR: difference between the 75.super[th] and 25.super[th] percentiles). ```{R, ex_pipe_iqr, eval = F} # Save each intermediate step draw <- rnorm(100) end_points <- quantile(draw, probs = c(0.25, 0.75)) diff(end_points) # Lots of nesting diff(quantile(rnorm(100), probs = c(0.25, 0.75))) # Piping 💪 rnorm(100) %>% quantile(probs = c(0.25, 0.75)) %>% diff() ``` --- ## Pipes By default, .mono[R] pipes the output from the LHS of the pipe into
the .hi[first] argument of the function on the RHS of the pipe. -- *E.g.*, `a %>% fun(3)` is equivalent to `fun(arg1 = a, arg2 = 3)`. -- If you want to pipe output into a different argument, you use a period (`.`). -- - `b %>% fun(arg1 = 3, .)` is equivalent to `fun(arg1 = 3, arg2 = b)`. - `b %>% fun(3, .)` is also equivalent to `fun(arg1 = 3, arg2 = b)`. -- - `b %>% fun(., .)` is equivalent to `fun(arg1 = b, arg2 = b)`. -- The `magrittr` package contains even more piping power..pink[†] .footnote[.pink[†] `magrittr` = Magritte (of [*this is not a pipe*](https://en.wikipedia.org/wiki/The_Treachery_of_Images) fame) plus .mono[R].] --- layout: true # dplyr --- class: inverse, middle --- name: dplyr ## Intro It's a package. -- `dplyr` is not installed by default, so you'll need to install it..pink[†] .footnote[.pink[†] or just `p_load(dplyr)` after loading `pacman`.] -- `dplyr` is part of the [`tidyverse`](https://dplyr.tidyverse.org/) (Hadleyverse), and it follows a grammar-based approach to programming/data work. -- - `data` compose the subjects of your stories - `dplyr` provides the *verbs* (action words) :
 `filter()`, `mutate()`, `select()`, `group_by()`, `summarize()`, `arrange()` -- .hi-slate[*Bonus*] `dplyr` is pretty fast and able to interact with SQL databases. --- name: mutate ## Manipulating variables: `mutate()` `dplyr` streamlines adding/manipulating variables in your data frame. .hi-slate[Function] `mutate(.data, ...)` - .pink[Required argument] `.data`, an existing data frame - .pink[Additional arguments] Names and values of the new variables - .pink[Output] An updated data frame -- .ex[Example] ```{R, ex_mutate1, eval = F} mutate(.data = our_df, new1 = 7, new2 = x * y) ``` --- ## `mutate()` .ex[Example] Take the data frame ```{R, ex_mutate2_df} my_df <- data.frame(x = 1:3, y = 5:7) ``` -- `mutate()` allows us to create many new variables with one call. .pull-left[ ```{R, ex_mutate2, eval = F} mutate(.data = my_df, xy = x * y, x2 = x^2, xy2 = xy^2, is_max = x == max(x) ) ``` ] -- .pull-right[ ```{R, ex_mutate2_result, echo = F, results = 'asis'} mutate(.data = my_df, xy = x * y, x2 = x^2, xy2 = xy^2, is_max = x == max(x) ) %>% DT::datatable(rownames = F, options = list(dom = 't')) ``` Notice `mutate()` returns the original *and* new columns. ] --- name: transmute ## `mutate()` *vs.* `transmute()` As their names imply, `mutate()` and `transmute()` are very similar functions. - `mutate()` returns the .pink[original] *and* .purple[new] columns (variables). - `transmute()` returns only the .purple[new] columns (variables). -- .slate[*Note*] Both functions return a new object as *output*—they do not update the object in .mono[R]'s memory. (This is the case for all functions in `dplyr`.) --- ## `%>%` and `dplyr` Each `dplyr` function begins with a `.data` argument so that you can easily pipe in data frames (recall: `mutate(.data, ...)`). -- The common workflow in `dplyr` will look something like `new_df <- old_df %>% mutate(cool stuff here)` which takes `old_df`, does some cool stuff with `mutate()`, and then saves the output of `mutate()` as `new_df`. --- ## `filter()` The `filter()` function does what its name implies: it .b[filters the rows] of your data frame .b[based upon logical conditions]. --- layout: true # dplyr ## `filter()` The `filter()` function does what its name implies: it .b[filters the rows] of your data frame .b[based upon logical conditions]. .ex[Example] .pull-left[ ```{R, ex_filter} # Create a dataset some_df <- data.frame( x = 1:10, y = 11:20 ) ``` ] --- name: filter count: false -- .pull-right[ ```{R, ex_filter1, eval = F} # Only keep rows where x is 3 some_df %>% filter(x == 3) ``` ```{R, ex_filter1b, echo = F, out.width = '25%'} # Only keep rows where x is 3 some_df %>% filter(x == 3) %>% datatable(rownames = F,options = list(dom = 't'), width = 200) ``` ] --- .pull-right[ ```{R, ex_filter2, eval = F} # Only keep rows where x > 7 some_df %>% filter(x > 7) ``` ```{R, ex_filter2b, echo = F} # Only keep rows where x > 7 some_df %>% filter(x > 7) %>% datatable(rownames = F,options = list(dom = 't'), width = 200) ``` ] --- .pull-right[ ```{R, ex_filter3, eval = F} # Keep rows where y/x > 3 some_df %>% filter(y/x > 3) ``` ```{R, ex_filter3b, echo = F} # Keep rows where y/x > 3 some_df %>% filter(y/x > 3) %>% datatable(rownames = F,options = list(dom = 't'), width = 200) ``` ] --- .pull-right[ ```{R, ex_filter4, eval = F} # Keep rows where x>8 OR y<12 some_df %>% filter(x > 8 | y < 12) ``` ```{R, ex_filter4b, echo = F} # Keep rows where x>8 OR y<12 some_df %>% filter(x > 8 | y < 12) %>% datatable(rownames = F,options = list(dom = 't'), width = 200) ``` ] --- .pull-right[ ```{R, ex_filter5, eval = F} # Keep rows where 16<=y<=18 some_df %>% filter(between(y, 16, 18)) ``` ```{R, ex_filter5b, echo = F} # Keep rows where 16<=y<=18 some_df %>% filter(between(y, 16, 18)) %>% datatable(rownames = F,options = list(dom = 't'), width = 200) ``` ] --- .pull-right[ ```{R, ex_filter6, eval = F} # Keep rows where y > 20 some_df %>% filter(y > 20) ``` ```{R, ex_filter6b, echo = F} # Keep rows where y > 20 some_df %>% filter(y > 20) %>% datatable(rownames = F,options = list(dom = 't'), width = 200, height = 100) ``` ] If you filter your data frame down to nothing, .mono[R] returns a 0-row data frame with the names/number of columns from the original data frame. --- layout: true # dplyr --- name: select ## `select()` Just as .purple[`filter()`] grabs .purple[row-based subsets] of your data frame,
.pink[`select()`] grabs .pink[column-based subsets]. -- You can select columns using their .b[names]
.pad-left[`our_df %>% select(var10, var100)`] -- you can select columns using their .b[numbers]
.pad-left[`our_df %>% select(10, 100)`] -- or you can select columns using .b[helper fuctions]
.pad-left[`our_df %>% select(starts_with("var10"))`] -- `select()` helps you narrow down a dataset to its necessary features. --- name: summarize ## `summarize()` Hopefully you're starting to see that functions' names in `dplyr` tell you what the function does. `summarize()`.pink[†] summarizes variables—you choose the variables and the summaries (_e.g._, `mean()` or `min()`). .footnote[.pink[†] or `summarise()` if you ❤️️ 🇬🇧] -- ```{R, ex_summarize, eval = F} the_df %>% summarize( mean(x), mean(y), mean(z), min(x), max(x), ) ``` would return a 1×5 data frame with the means of `x`, `y`, and `z`; the minimum of `x`; and the maximum of `x`. --- name: group_summarize ## `summarize()` and `group_by()` While sample-wide summarizes are certainly interesting, `dplyr` has one last gem for us: `group_by()`. `group_by()` groups your observations by the variable(s) that you name. -- Specifically, `group_by()` returns a *grouped data frame* that you can then feed to `summarize()`, `mutate()`, or `transmuate` to perform grouped calculations, _e.g._, each group's mean. --- ## Example: Grouped summaries .pull-left[.small[ ```{R, ex_group1} # Create a new data frame our_df <- data.frame( x = 1:6, y = c(0, 1), grp = rep(c("A", "B"), each = 3) ) ``` ```{R, ex_group2, echo = F} our_df %>% datatable(rownames = F,options = list(dom = 't'), width = '50%', height = 250) ``` ]] -- .pull-right[.small[ ```{R, ex_group3, eval = F} # For dataset 'our_df'... our_df %>% # Group by 'grp' group_by(grp) %>% # Take means of 'x' and 'y' summarize(mean(x), mean(y)) ``` ```{R, ex_group4, echo = F} our_df %>% group_by(grp) %>% summarize(mean(x), mean(y)) %>% datatable(rownames = F,options = list(dom = 't'), width = '100%', height = 110) %>% formatRound(columns=c('mean(x)', 'mean(y)'), digits = 3) ``` ]] --- ## Example: Grouped mutation .pull-left[.small[ ```{R, ex_group5} # Create a new data frame our_df <- data.frame( x = 1:6, y = c(0, 1), grp = rep(c("A", "B"), each = 3) ) ``` ```{R, ex_group6, echo = F} our_df %>% datatable(rownames = F,options = list(dom = 't'), width = '50%', height = 250) ``` ]] -- .pull-right[.small[ ```{R, ex_group7, eval = F} # Add grp means for x and y our_df %>% group_by(grp) %>% mutate( x_m = mean(x), y_m = mean(y) ) ``` ```{R, ex_group8, echo = F} our_df %>% group_by(grp) %>% mutate(x_m = mean(x), y_m = mean(y)) %>% datatable(rownames = F,options = list(dom = 't'), height = 250) %>% formatRound(columns=c('x_m', 'y_m'), digits = 3) ``` ]] --- name: arrange ## `arrange()` `arrange()` will sorts the rows of a data frame using the inputted columns. .mono[R] defaults to starting with the "lowest" (smallest) at the top of the data frame. Use a `-` in front of the variable's name to reverse sort. --- layout: false class: clear, middle .pull-left[ ```{R, ex_arrange1, eval = F} # As is our_df ``` ```{R, ex_arrange1b, echo = F} # As is our_df %>% datatable(rownames = F,options = list(dom = 't'), height = 250) ``` ] .pull-right[ ```{R, ex_arrange2, eval = F} # Arrang by y, grp, then -x our_df %>% arrange(y, grp, -x) ``` ```{R, ex_arrange2b, echo = F} # Arrang by y, grp, then -x our_df %>% arrange(y, grp, -x) %>% datatable(rownames = F,options = list(dom = 't'), height = 250) ``` ] --- name: tidyverse # The tidyverse There's more! `dplyr` and .purple[`tidyr`] offer even more....super[.pink[†]] .footnote[ .pink[†] And these are only two of the packages in the `tidyverse`. ] - .note[Viewing data] `glimpse()`, `top_n()` - .note[Sampling] `sample_n()`, `sample_frac()` - .note[Summaries] `first()`, `last()`, `nth()`, `n_distinct()` - .note[Duplicates] `distinct()` - .note[Missingness] `na_if()`, .purple[`replace_na()`], .purple[`drop_na()`], .purple[`fill()`] -- The folks at RStudio have put together some great cheatsheets, *e.g.*, - [`dplyr`](https://raw.githack.com/edrubin/EC524W20/master/resources/cheatsheet-dplyr.pdf) - [data import](https://raw.githack.com/edrubin/EC524W20/master/resources/cheatsheet-data-import.pdf) - [data wrangling](https://raw.githack.com/edrubin/EC524W20/master/resources/cheatsheet-data-wrangling.pdf) --- layout: false # Table of contents .col-left[ .small[ #### Admin - [Today and upcoming](#admin) #### Workflow - [General](#workflow-main) - [RStudio](#rstudio) - [Related best practies](#best-practices) - [Projects](#projects) ]] .col-right[ .small[ #### `dplyr` - [Pipes](#pipes) - [`mutate`](#mutate) - [`transmute()`](#transmute) - [`arrange`](#arrange) - [`filter()`](#filter) - [`select()`](#select) - [`summarize`](#summarize) - [`summarize()` and `group_by()`](#group_summarize) - [The `tidyverse`](#tidyverse) ] ]