---
title: "Lab .mono[000]"
subtitle: "Data cleaning and workflow [1/N]"
author: "Edward Rubin"
date: "`r format(Sys.time(), '%d %B %Y')`"
#date: "08 January 2021"
output:
xaringan::moon_reader:
css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css']
# self_contained: true
nature:
highlightStyle: github
highlightLines: true
countIncrementalSlides: false
knit: pagedown::chrome_print
---
exclude: true
```{R, setup, include = F}
library(pacman)
p_load(
broom, tidyverse,
ggplot2, ggthemes, ggforce, ggridges, cowplot,
latex2exp, viridis, extrafont, gridExtra,
kableExtra, snakecase, janitor,
DT, data.table, dplyr,
lubridate, knitr, future, furrr,
estimatr, FNN, parsnip,
huxtable, here, magrittr
)
# Define colors
red_pink = "#e64173"
turquoise = "#20B2AA"
orange = "#FFA500"
red = "#fb6107"
blue = "#3b3b9a"
green = "#8bb174"
grey_light = "grey70"
grey_mid = "grey50"
grey_dark = "grey20"
purple = "#6A5ACD"
slate = "#314f4f"
# Knitr options
opts_chunk$set(
comment = "#>",
fig.align = "center",
fig.height = 7,
fig.width = 10.5,
warning = F,
message = F
)
opts_chunk$set(dev = "svg")
options(device = function(file, width, height) {
svg(tempfile(), width = width, height = height)
})
options(knitr.table.format = "html")
```
---
class: inverse, middle
# Admin
---
name: admin
# Admin
Basic .hi[workflow] (best) practices (_i.e._, _Projects_)
- RStudio and projects
- Naming conventions
- Pipes (`%>%`)
- Data cleaning with `dplyr`
.note[Reminders]
.b.slate[Reminder:] Readings for next week
- ISL Ch1–Ch2
- [Prediction Policy Problems](https://www.aeaweb.org/articles?id=10.1257/aer.p20151023) by Kleinberg .it[et al.] (2015)
---
layout: true
# Improving your workflow
---
class: inverse, middle
---
name: workflow-main
Data cleaning, manipulation, and analysis can be grueling, but optimizing your workflow can speed things along and make them less painful..super[.pink[†]]
.footnote[
.pink[†] Notice that I said .pink[*less*] painful.
]
.hi-slate[A few dimensions that can help]
- Understand how to interact with RStudio
- Use .mono[R] projects
- Follow reasonable naming conventions
- `dplyr` and pipes
- .grey-mid[Write your own functions (future lab)]
- .grey-mid[Use loops and parallelization (future lab)]
- .grey-light[Hire an intern/assistant to do your work for you]
---
layout: false
class: clear, middle
## Efficiency
```{R, fig xkcd, echo = F, out.width = '80%'}
knitr::include_graphics("images/xkcd-efficiency.png")
```
.grey-light[Source: [xkcd](https://xkcd.com/1445/)]
---
layout: false
class: inverse, middle
# RStudio
---
name: rstudio
class: clear
Let's recap some of the major features in .mono[RStudio]...
```{R, pic_rstudio, echo = F}
knitr::include_graphics("images/rstudio/rstudio.png")
```
---
class: clear
First, you write your .mono[R] scripts (source code) in the .hi[Source] pane.
```{R, pic_rstudio_source1, echo = F}
knitr::include_graphics("images/rstudio/rstudio_source_rec.png")
```
---
class: clear
You can use the menubar or .mono[⇧+⌘+N] to create new .mono[R] scripts.
```{R, pic_rstudio_source2, echo = F}
knitr::include_graphics("images/rstudio/rstudio_source_arrow.png")
```
---
class: clear
To execute commands from your .mono[R] script, use .mono[⌘+Enter].
```{R, pic_rstudio_source3, echo = F}
knitr::include_graphics("images/rstudio/rstudio_source_ex.png")
```
---
class: clear
.mono[RStudio] will execute the command in the terminal.
```{R, pic_rstudio_source4, echo = F}
knitr::include_graphics("images/rstudio/rstudio_source_ex2.png")
```
---
class: clear
You can see our new object in the .hi[Environment] pane.
```{R, pic_rstudio_source5, echo = F}
knitr::include_graphics("images/rstudio/rstudio_source_ex3.png")
```
---
class: clear
The .hi-purple[History] tab (next to .hi[Environment]) records your old commands.
```{R, pic_rstudio_history, echo = F}
knitr::include_graphics("images/rstudio/rstudio_history.png")
```
---
class: clear
The .hi[Files] pane is file explorer.
```{R, pic_rstudio_files, echo = F}
knitr::include_graphics("images/rstudio/rstudio_files.png")
```
---
class: clear
The .hi[Plots] pane/tab shows... plots.
```{R, pic_rstudio_plots, echo = F}
knitr::include_graphics("images/rstudio/rstudio_plots.png")
```
---
class: clear
.hi[Packages] shows installed packages
```{R, pic_rstudio_packages, echo = F}
knitr::include_graphics("images/rstudio/rstudio_packages.png")
```
---
count: false
class: clear
.hi[Packages] shows installed packages and whether they are .hi-purple[loaded].
```{R, pic_rstudio_packages2, echo = F}
knitr::include_graphics("images/rstudio/rstudio_packages2.png")
```
---
class: clear
The .hi[Help] tab shows help documentation (also accessible via `?`).
```{R, pic_rstudio_help, echo = F}
knitr::include_graphics("images/rstudio/rstudio_help.png")
```
---
class: clear
Finally, you can customize the actual layout
```{R, pic_rstudio_layout, echo = F}
knitr::include_graphics("images/rstudio/rstudio_layout.png")
```
---
count: false
class: clear
Finally, you can customize the actual layout and many other items.
```{R, pic_rstudio_customize, echo = F}
knitr::include_graphics("images/rstudio/rstudio_customize.png")
```
---
name: best-practices
# .mono[R] and .mono[RStudio]
.b.slate[Related best practices]
1. Write code in .mono[R] scripts. Troubleshoot in .mono[RStudio]. Then run the scripts.
1. Comment your code. (`# This is a comment`)
1. Name objects/variables/files with intelligible, standardized names.
- .hi-purple[BAD] `ALLCARS`, `Vl123a8`, `a.fun`, `cens.12931`, `cens.12933`
- .hi-pink[GOOD] `unique_cars`, `health_df`, `sim_fun`, `is_female`, `age`
1. Write code that is readable (see comments comment above).
1. Use projects in .mono[RStudio] (next). And organize your projects.
---
layout: true
# Projects
---
class: inverse, middle
---
name: projects
Projects in .mono[R] offer several benefits
1. Act as an .hi[anchor] for working with files.
1. Make your work (projects) easily .hi[reproducible]..super[.pink[†]]
1. Help you .hi[quickly jump back] into your work.
.footnote[.pink[†] In this class, we're assuming reproducibility is good/desirable.]
---
layout: false
class: clear
To start a new project, hit the .hi[project icon].
```{R, pic_rstudio_projects, echo = F}
knitr::include_graphics("images/rstudio/rstudio_projects.png")
```
---
class: clear
You'll then choose the folder/directory where your project lives.
```{R, pic_rstudio_projects2, echo = F}
knitr::include_graphics("images/rstudio/rstudio_projects2.png")
```
---
class: clear
If you open (double click) a project, .mono[RStudio] opens .mono[R] in that location.
```{R, pic_rstudio_projects3, echo = F}
knitr::include_graphics("images/rstudio/rstudio_projects3.png")
```
---
count: false
class: clear
.mono[RStudio] will 'load' your previous setup (pane setup, scripts, *etc.*).
```{R, pic_rstudio_projects3b, echo = F}
knitr::include_graphics("images/rstudio/rstudio_projects3.png")
```
---
layout: true
# .mono[R] and .mono[RStudio]
.b.slate[Projects]
---
.hi-purple[Without a project], you will need to define long file paths that you'll need to keep updating as folder names/locations change.
--
`dir_class <- "/Users/edwardarubin/Dropbox/UO/Teaching/EC525S19/"`
`dir_labs <- paste0(dir_class, "NotesLab/")`
`dir_lab03 <- paste0(dir_labs, "03RInput/")`
`sample_df <- read.csv(paste0(dir_lab03, "sample.csv"))`
--
.hi-pink[With a project], .mono[R] automatically references the project's folder.
`sample_df <- read.csv("sample.csv")`
--
.note[Double-plus bonus] The [`here`](https://github.com/r-lib/here) package extends projects' reproducibility.
---
layout: true
# Pipes and dplyr
---
class: inverse, middle
---
.b.slate[Introduction]
1. Pipes (`%>%`) make your life easier..super[.pink[†]]
1. `dplyr` is your data-work friend.
.footnote[.pink[†] Check out `magrittr` for more pipe options, _e.g._, `%<>%`.]
---
name: pipes
.b.slate[What is a pipe?]
--
Pipes are a .pink[simplifying] programming tool; make your code easier to read
--
Take the .hi-pink[output] of a function as the .hi-purple[input/argument] of another function
--
In `dplyr`, the expression for a pipe is `%>%`
.footnote[.pink[†] `|>` native pipe as of .mono[R] 4.1.0
]
--
.mono[R]'s pipe specifically plugs the returned object to the .pink[left] of the pipe into the first argument of the function on the .purple[right] fo the pipe, _e.g._,
--
```{R, ex_pipe_simple}
rnorm(10) %>% mean()
```
---
.b.slate[Pipes]
Pipes avoid nested functions, prevent excessive writing to your disc, and increase the readability of our .mono[R] scripts
--
.ex[Example] Three ways to draw 100 N(0,1) observations and calculate the interquartile range (IQR: difference between the 75.super[th] and 25.super[th] percentiles).
```{R, ex_pipe_iqr, eval = F}
# Save each intermediate step
draw <- rnorm(100)
end_points <- quantile(draw, probs = c(0.25, 0.75))
diff(end_points)
# Lots of nesting
diff(quantile(rnorm(100), probs = c(0.25, 0.75)))
# Piping 💪
rnorm(100) %>% quantile(probs = c(0.25, 0.75)) %>% diff()
```
--
Think [russian dolls](https://en.wikipedia.org/wiki/Matryoshka_doll)
---
.b.slate[Pipes]
By default, .mono[R] pipes the output from the LHS of the pipe into
the .hi[first] argument of the function on the RHS of the pipe.
--
*E.g.*, `a %>% fun(3)` is equivalent to `fun(arg1 = a, arg2 = 3)`.
--
If you want to pipe output into a different argument, you use a period (`.`).
--
- `b %>% fun(arg1 = 3, .)` is equivalent to `fun(arg1 = 3, arg2 = b)`.
- `b %>% fun(3, .)` is also equivalent to `fun(arg1 = 3, arg2 = b)`.
--
- `b %>% fun(., .)` is equivalent to `fun(arg1 = b, arg2 = b)`.
--
The `magrittr` package contains even more piping power..pink[†]
.footnote[.pink[†] `magrittr` = Magritte (of [*this is not a pipe*](https://en.wikipedia.org/wiki/The_Treachery_of_Images) fame) plus .mono[R].]
---
layout: true
# dplyr
---
# dplyr
.b.slate[Before we begin:]
1. Ensure `tidyverse` is installed: `install.packages('tidyverse')`
--
1. Install `nycflights13` package: `install.packages('nycflights13')`
--
1. Load package libraries: `library(tidyverse, nycflights13)`
--
1. Test the `flights` dataset: `(flights)`
```{r, echo = FALSE}
pacman::p_load(nycflights13)
flights |> select(year, month, day, dep_time, sched_dep_time, dep_delay, arr_time) |>
head(3)
```
---
class: inverse, middle
---
name: dplyr
.b.slate[Introduction]
It's a package.
--
`dplyr` is not installed by default, so you'll need to install it..pink[†]
.footnote[.pink[†] or just `p_load(dplyr)` after loading `pacman`.]
--
`dplyr` is part of the [`tidyverse`](https://dplyr.tidyverse.org/) (Hadleyverse), and it follows a grammar-based approach to programming/data work.
--
- `data` compose the subjects of your stories
- `dplyr` provides the *verbs* (action words)
:
`filter()`, `mutate()`, `select()`, `group_by()`, `summarize()`, `arrange()`
--
.hi-slate[*Bonus*] `dplyr` is pretty fast and able to interact with SQL databases.
---
name: mutate
.b-slate[Manipulating variables:] `mutate()`
`dplyr` streamlines adding/manipulating variables in your data frame.
.hi-slate[Function] `mutate(.data, ...)`
- .pink[Required argument] `.data`, an existing data frame
- .pink[Additional arguments] Names and values of the new variables
- .pink[Output] An updated data frame
--
.ex[Example]
```{R, ex_mutate1, eval = F}
mutate(.data = our_df, new1 = 7, new2 = x * y)
```
---
## `mutate()`
.ex[Example] Take the data frame
```{R, ex_mutate2_df}
my_df <- data.frame(x = 1:3, y = 5:7)
```
--
`mutate()` allows us to create many new variables with one call.
.pull-left[
```{R, ex_mutate2, eval = F}
mutate(.data = my_df,
xy = x * y,
x2 = x^2,
xy2 = xy^2,
is_max = x == max(x)
)
```
]
--
.pull-right[
```{R, ex_mutate2_result, echo = F, results = 'asis'}
mutate(.data = my_df,
xy = x * y,
x2 = x^2,
xy2 = xy^2,
is_max = x == max(x)
) %>% DT::datatable(rownames = F, options = list(dom = 't'))
```
Notice `mutate()` returns the original *and* new columns.
]
---
name: transmute
## `mutate()` *vs.* `transmute()`
As their names imply, `mutate()` and `transmute()` are very similar functions.
- `mutate()` returns the .pink[original] *and* .purple[new] columns (variables).
- `transmute()` returns only the .purple[new] columns (variables).
--
.slate[*Note*] Both functions return a new object as *output*—they do not update the object in .mono[R]'s memory. (This is the case for all functions in `dplyr`.)
---
## `%>%` and `dplyr`
Each `dplyr` function begins with a `.data` argument so that you can easily pipe in data frames (recall: `mutate(.data, ...)`).
--
The common workflow in `dplyr` will look something like
`new_df <- old_df %>% mutate(cool stuff here)`
which takes `old_df`, does some cool stuff with `mutate()`, and then saves the output of `mutate()` as `new_df`.
---
## `filter()`
The `filter()` function does what its name implies: it .b[filters the rows] of your data frame .b[based upon logical conditions].
---
layout: true
# dplyr
## `filter()`
The `filter()` function does what its name implies: it .b[filters the rows] of your data frame .b[based upon logical conditions].
.ex[Example]
.pull-left[
```{R, ex_filter}
# Create a dataset
some_df <- data.frame(
x = 1:10,
y = 11:20
)
```
]
---
name: filter
count: false
--
.pull-right[
```{R, ex_filter1, eval = F}
# Only keep rows where x is 3
some_df %>% filter(x == 3)
```
```{R, ex_filter1b, echo = F, out.width = '25%'}
# Only keep rows where x is 3
some_df %>% filter(x == 3) %>%
datatable(rownames = F,options = list(dom = 't'), width = 200)
```
]
---
.pull-right[
```{R, ex_filter2, eval = F}
# Only keep rows where x > 7
some_df %>% filter(x > 7)
```
```{R, ex_filter2b, echo = F}
# Only keep rows where x > 7
some_df %>% filter(x > 7) %>%
datatable(rownames = F,options = list(dom = 't'), width = 200)
```
]
---
.pull-right[
```{R, ex_filter3, eval = F}
# Keep rows where y/x > 3
some_df %>% filter(y/x > 3)
```
```{R, ex_filter3b, echo = F}
# Keep rows where y/x > 3
some_df %>% filter(y/x > 3) %>%
datatable(rownames = F,options = list(dom = 't'), width = 200)
```
]
---
.pull-right[
```{R, ex_filter4, eval = F}
# Keep rows where x>8 OR y<12
some_df %>%
filter(x > 8 | y < 12)
```
```{R, ex_filter4b, echo = F}
# Keep rows where x>8 OR y<12
some_df %>%
filter(x > 8 | y < 12) %>%
datatable(rownames = F,options = list(dom = 't'), width = 200)
```
]
---
.pull-right[
```{R, ex_filter5, eval = F}
# Keep rows where 16<=y<=18
some_df %>%
filter(between(y, 16, 18))
```
```{R, ex_filter5b, echo = F}
# Keep rows where 16<=y<=18
some_df %>%
filter(between(y, 16, 18)) %>%
datatable(rownames = F,options = list(dom = 't'), width = 200)
```
]
---
.pull-right[
```{R, ex_filter6, eval = F}
# Keep rows where y > 20
some_df %>% filter(y > 20)
```
```{R, ex_filter6b, echo = F}
# Keep rows where y > 20
some_df %>% filter(y > 20) %>%
datatable(rownames = F,options = list(dom = 't'), width = 200, height = 100)
```
]
If you filter your data frame down to nothing, .mono[R] returns a 0-row data frame with the names/number of columns from the original data frame.
---
layout: true
# dplyr
---
name: select
## `select()`
Just as .purple[`filter()`] grabs .purple[row-based subsets] of your data frame,
.pink[`select()`] grabs .pink[column-based subsets].
--
You can select columns using their .b[names]
.pad-left[`our_df %>% select(var10, var100)`]
--
you can select columns using their .b[numbers]
.pad-left[`our_df %>% select(10, 100)`]
--
or you can select columns using .b[helper fuctions]
.pad-left[`our_df %>% select(starts_with("var10"))`]
--
`select()` helps you narrow down a dataset to its necessary features.
---
name: summarize
## `summarize()`
Hopefully you're starting to see that functions' names in `dplyr` tell you what the function does.
`summarize()`.pink[†] summarizes variables—you choose the variables and the summaries (_e.g._, `mean()` or `min()`).
.footnote[.pink[†] or `summarise()` if you ❤️️ 🇬🇧]
--
```{R, ex_summarize, eval = F}
the_df %>% summarize(
mean(x), mean(y), mean(z),
min(x), max(x),
)
```
would return a 1×5 data frame with the means of `x`, `y`, and `z`; the minimum of `x`; and the maximum of `x`.
---
name: group_summarize
## `summarize()` and `group_by()`
While sample-wide summarizes are certainly interesting, `dplyr` has one last gem for us: `group_by()`.
`group_by()` groups your observations by the variable(s) that you name.
--
Specifically, `group_by()` returns a *grouped data frame* that you can then feed to `summarize()`, `mutate()`, or `transmuate` to perform grouped calculations, _e.g._, each group's mean.
---
## Example: Grouped summaries
.pull-left[.small[
```{R, ex_group1}
# Create a new data frame
our_df <- data.frame(
x = 1:6,
y = c(0, 1),
grp = rep(c("A", "B"), each = 3)
)
```
```{R, ex_group2, echo = F}
our_df %>% datatable(rownames = F,options = list(dom = 't'), width = '50%', height = 250)
```
]]
--
.pull-right[.small[
```{R, ex_group3, eval = F}
# For dataset 'our_df'...
our_df %>%
# Group by 'grp'
group_by(grp) %>%
# Take means of 'x' and 'y'
summarize(mean(x), mean(y))
```
```{R, ex_group4, echo = F}
our_df %>%
group_by(grp) %>%
summarize(mean(x), mean(y)) %>%
datatable(rownames = F,options = list(dom = 't'), width = '100%', height = 110) %>%
formatRound(columns=c('mean(x)', 'mean(y)'), digits = 3)
```
]]
---
## Example: Grouped mutation
.pull-left[.small[
```{R, ex_group5}
# Create a new data frame
our_df <- data.frame(
x = 1:6,
y = c(0, 1),
grp = rep(c("A", "B"), each = 3)
)
```
```{R, ex_group6, echo = F}
our_df %>% datatable(rownames = F,options = list(dom = 't'), width = '50%', height = 250)
```
]]
--
.pull-right[.small[
```{R, ex_group7, eval = F}
# Add grp means for x and y
our_df %>%
group_by(grp) %>%
mutate(
x_m = mean(x), y_m = mean(y)
)
```
```{R, ex_group8, echo = F}
our_df %>%
group_by(grp) %>%
mutate(x_m = mean(x), y_m = mean(y)) %>%
datatable(rownames = F,options = list(dom = 't'), height = 250) %>%
formatRound(columns=c('x_m', 'y_m'), digits = 3)
```
]]
---
name: arrange
## `arrange()`
`arrange()` will sorts the rows of a data frame using the inputted columns.
.mono[R] defaults to starting with the "lowest" (smallest) at the top of the data frame. Use a `-` in front of the variable's name to reverse sort.
---
layout: false
class: clear, middle
.pull-left[
```{R, ex_arrange1, eval = F}
# As is
our_df
```
```{R, ex_arrange1b, echo = F}
# As is
our_df %>%
datatable(rownames = F,options = list(dom = 't'), height = 250)
```
]
.pull-right[
```{R, ex_arrange2, eval = F}
# Arrang by y, grp, then -x
our_df %>% arrange(y, grp, -x)
```
```{R, ex_arrange2b, echo = F}
# Arrang by y, grp, then -x
our_df %>% arrange(y, grp, -x) %>%
datatable(rownames = F,options = list(dom = 't'), height = 250)
```
]
---
name: tidyverse
# The tidyverse
There's more! `dplyr` and .purple[`tidyr`] offer even more....super[.pink[†]]
.footnote[
.pink[†] And these are only two of the packages in the `tidyverse`.
]
- .note[Viewing data] `glimpse()`, `top_n()`
- .note[Sampling] `sample_n()`, `sample_frac()`
- .note[Summaries] `first()`, `last()`, `nth()`, `n_distinct()`
- .note[Duplicates] `distinct()`
- .note[Missingness] `na_if()`, .purple[`replace_na()`], .purple[`drop_na()`], .purple[`fill()`]
--
The folks at RStudio have put together some great cheatsheets, *e.g.*,
- [`dplyr`](https://raw.githack.com/edrubin/EC524W20/master/resources/cheatsheet-dplyr.pdf)
- [data import](https://raw.githack.com/edrubin/EC524W20/master/resources/cheatsheet-data-import.pdf)
- [data wrangling](https://raw.githack.com/edrubin/EC524W20/master/resources/cheatsheet-data-wrangling.pdf)
---
# Exercises
Some selected exercises from [R for Data Science](https://r4ds.had.co.nz/index.html) by Hadley Wickham
1. Exercise [5.2.4.1](https://r4ds.had.co.nz/transform.html#exercises-8)
1. Exercise [5.3.1.2, 5.3.1.3, 5.3.1.4](https://r4ds.had.co.nz/transform.html#exercises-9)
1. Exercise [5.5.2.4](https://r4ds.had.co.nz/transform.html#exercises-11)
1. Exercise [5.7.1.3](https://r4ds.had.co.nz/transform.html#exercises-13)
---
layout: false
# Table of contents
.col-left[
.small[
#### Admin
- [Today and upcoming](#admin)
#### Workflow
- [General](#workflow-main)
- [RStudio](#rstudio)
- [Related best practies](#best-practices)
- [Projects](#projects)
]]
.col-right[
.small[
#### `dplyr`
- [Pipes](#pipes)
- [`mutate`](#mutate)
- [`transmute()`](#transmute)
- [`arrange`](#arrange)
- [`filter()`](#filter)
- [`select()`](#select)
- [`summarize`](#summarize)
- [`summarize()` and `group_by()`](#group_summarize)
- [The `tidyverse`](#tidyverse)
]
]