Lab .mono[000]

class: center, middle, inverse, title-slide

# Lab .mono[000]
## Data cleaning and workflow [1/N]
### Edward Rubin
### 07 January 2022

---

exclude: true

---
class: inverse, middle
# Admin
---
name: admin
# Admin

Basic .hi[workflow] (best) practices (_i.e._, _Projects_)
- RStudio and projects
- Naming conventions
- Pipes (`%>%`)
- Data cleaning with `dplyr`

.note[Reminders]

.b.slate[Reminder:] Readings for next week
- ISL Ch1–Ch2
- [Prediction Policy Problems](https://www.aeaweb.org/articles?id=10.1257/aer.p20151023) by Kleinberg .it[et al.] (2015)
---
layout: true
# Improving your workflow

---
class: inverse, middle
---
name: workflow-main

Data cleaning, manipulation, and analysis can be grueling, but optimizing your workflow can speed things along and make them less painful..super[.pink[†]]

.footnote[
.pink[†] Notice that I said .pink[*less*] painful.
]

.hi-slate[A few dimensions that can help]

- Understand how to interact with RStudio
- Use .mono[R] projects
- Follow reasonable naming conventions
- `dplyr` and pipes
- .grey-mid[Write your own functions (future lab)]
- .grey-mid[Use loops and parallelization (future lab)]
- .grey-light[Hire an intern/assistant to do your work for you]
---
layout: false
class: clear, middle

## Efficiency

.grey-light[Source: [xkcd](https://xkcd.com/1445/)]
---
layout: false
class: inverse, middle
# RStudio

---
name: rstudio
class: clear

Let's recap some of the major features in .mono[RStudio]...

<img src="images/rstudio/rstudio.png" width="4309" style="display: block; margin: auto;" />
---
class: clear

First, you write your .mono[R] scripts (source code) in the .hi[Source] pane.

<img src="images/rstudio/rstudio_source_rec.png" width="4309" style="display: block; margin: auto;" />
---
class: clear

You can use the menubar or .mono[⇧+⌘+N] to create new .mono[R] scripts.

<img src="images/rstudio/rstudio_source_arrow.png" width="4309" style="display: block; margin: auto;" />
---
class: clear

To execute commands from your .mono[R] script, use .mono[⌘+Enter].

<img src="images/rstudio/rstudio_source_ex.png" width="4309" style="display: block; margin: auto;" />
---
class: clear

.mono[RStudio] will execute the command in the terminal.

<img src="images/rstudio/rstudio_source_ex2.png" width="4309" style="display: block; margin: auto;" />
---
class: clear

You can see our new object in the .hi[Environment] pane.

<img src="images/rstudio/rstudio_source_ex3.png" width="4309" style="display: block; margin: auto;" />
---
class: clear

The .hi-purple[History] tab (next to .hi[Environment]) records your old commands.

<img src="images/rstudio/rstudio_history.png" width="4309" style="display: block; margin: auto;" />
---
class: clear

The .hi[Files] pane is file explorer.

<img src="images/rstudio/rstudio_files.png" width="4309" style="display: block; margin: auto;" />
---
class: clear

The .hi[Plots] pane/tab shows... plots.

<img src="images/rstudio/rstudio_plots.png" width="4309" style="display: block; margin: auto;" />
---
class: clear

.hi[Packages] shows installed packages

<img src="images/rstudio/rstudio_packages.png" width="4309" style="display: block; margin: auto;" />
---
count: false
class: clear

.hi[Packages] shows installed packages and whether they are .hi-purple[loaded].

<img src="images/rstudio/rstudio_packages2.png" width="4309" style="display: block; margin: auto;" />
---
class: clear

The .hi[Help] tab shows help documentation (also accessible via `?`).

<img src="images/rstudio/rstudio_help.png" width="4309" style="display: block; margin: auto;" />
---
class: clear

Finally, you can customize the actual layout

<img src="images/rstudio/rstudio_layout.png" width="4309" style="display: block; margin: auto;" />
---
count: false
class: clear

Finally, you can customize the actual layout and many other items.

<img src="images/rstudio/rstudio_customize.png" width="4309" style="display: block; margin: auto;" />
---
name: best-practices
# .mono[R] and .mono[RStudio]

.b.slate[Related best practices]

1. Write code in .mono[R] scripts. Troubleshoot in .mono[RStudio]. Then run the scripts.

1. Comment your code. (`# This is a comment`)

1. Name objects/variables/files with intelligible, standardized names.
  - .hi-purple[BAD] `ALLCARS`, `Vl123a8`, `a.fun`, `cens.12931`, `cens.12933`
  - .hi-pink[GOOD] `unique_cars`, `health_df`, `sim_fun`, `is_female`, `age`

1. Write code that is readable (see comments comment above).

1. Use projects in .mono[RStudio] (next). And organize your projects.

---
layout: true
# Projects

---
class: inverse, middle

---
name: projects

Projects in .mono[R] offer several benefits

1. Act as an .hi[anchor] for working with files.

1. Make your work (projects) easily .hi[reproducible]..super[.pink[†]]

1. Help you .hi[quickly jump back] into your work.

.footnote[.pink[†] In this class, we're assuming reproducibility is good/desirable.]

---
layout: false
class: clear

To start a new project, hit the .hi[project icon].

<img src="images/rstudio/rstudio_projects.png" width="4309" style="display: block; margin: auto;" />
---
class: clear

You'll then choose the folder/directory where your project lives.

<img src="images/rstudio/rstudio_projects2.png" width="4309" style="display: block; margin: auto;" />
---
class: clear

If you open (double click) a project, .mono[RStudio] opens .mono[R] in that location.

<img src="images/rstudio/rstudio_projects3.png" width="4309" style="display: block; margin: auto;" />
---
count: false
class: clear

.mono[RStudio] will 'load' your previous setup (pane setup, scripts, *etc.*).

<img src="images/rstudio/rstudio_projects3.png" width="4309" style="display: block; margin: auto;" />
---
layout: true
# .mono[R] and .mono[RStudio]

.b.slate[Projects]

---

.hi-purple[Without a project], you will need to define long file paths that you'll need to keep updating as folder names/locations change.

`dir_class <- "/Users/edwardarubin/Dropbox/UO/Teaching/EC525S19/"`
<br>`dir_labs <- paste0(dir_class, "NotesLab/")`
<br>`dir_lab03 <- paste0(dir_labs, "03RInput/")`
<br>`sample_df <- read.csv(paste0(dir_lab03, "sample.csv"))`

.hi-pink[With a project], .mono[R] automatically references the project's folder.

`sample_df <- read.csv("sample.csv")`

.note[Double-plus bonus] The [`here`](https://github.com/r-lib/here) package extends projects' reproducibility.
---
layout: true
# Pipes and dplyr

---
class: inverse, middle

---

.b.slate[Introduction]

1. Pipes (`%>%`) make your life easier..super[.pink[†]]

1. `dplyr` is your data-work friend.

.footnote[.pink[†] Check out `magrittr` for more pipe options, _e.g._, `%<>%`.]
---
name: pipes

.b.slate[What is a pipe?]

Pipes are a .pink[simplifying] programming tool; make your code easier to read

Take the .hi-pink[output] of a function as the .hi-purple[input/argument] of another function

In `dplyr`, the expression for a pipe is `%>%`

.footnote[.pink[†] `|>` native pipe as of .mono[R] 4.1.0
]

.mono[R]'s pipe specifically plugs the returned object to the .pink[left] of the pipe into the first argument of the function on the .purple[right] fo the pipe, _e.g._,

```r
rnorm(10) %>% mean()
```

```
#> [1] -0.5162503
```
---

.b.slate[Pipes]

Pipes avoid nested functions, prevent excessive writing to your disc, and increase the readability of our .mono[R] scripts

.ex[Example] Three ways to draw 100 N(0,1) observations and calculate the interquartile range (IQR: difference between the 75.super[th] and 25.super[th] percentiles).

```r
# Save each intermediate step
draw <- rnorm(100)
end_points <- quantile(draw, probs = c(0.25, 0.75))
diff(end_points)
# Lots of nesting
diff(quantile(rnorm(100), probs = c(0.25, 0.75)))
# Piping 💪
rnorm(100) %>% quantile(probs = c(0.25, 0.75)) %>% diff()
```

Think [russian dolls](https://en.wikipedia.org/wiki/Matryoshka_doll)
---

.b.slate[Pipes]

By default, .mono[R] pipes the output from the LHS of the pipe into<br>the .hi[first] argument of the function on the RHS of the pipe.

*E.g.*, `a %>% fun(3)` is equivalent to `fun(arg1 = a, arg2 = 3)`.

If you want to pipe output into a different argument, you use a period (`.`).

- `b %>% fun(arg1 = 3, .)` is equivalent to `fun(arg1 = 3, arg2 = b)`.
- `b %>% fun(3, .)` is also equivalent to `fun(arg1 = 3, arg2 = b)`.

--
- `b %>% fun(., .)` is equivalent to `fun(arg1 = b, arg2 = b)`.

The `magrittr` package contains even more piping power.<sup>.pink[†]</sup>

.footnote[.pink[†] `magrittr` = Magritte (of [*this is not a pipe*](https://en.wikipedia.org/wiki/The_Treachery_of_Images) fame) plus .mono[R].]

---
layout: true
# dplyr

---
# dplyr

.b.slate[Before we begin:]

1. Ensure `tidyverse` is installed: `install.packages('tidyverse')`

1. Install `nycflights13` package: `install.packages('nycflights13')`

1. Load package libraries: `library(tidyverse, nycflights13)`

1. Test the `flights` dataset: `(flights)`

<table class="huxtable" style="border-collapse: collapse; border: 0px; margin-bottom: 2em; margin-top: 2em; ; margin-left: auto; margin-right: auto;  " id="tab:unnamed-chunk-1">
<col><col><col><col><col><col><col><tr>
<th style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0.4pt 0pt 0.4pt 0.4pt;    padding: 6pt 6pt 6pt 6pt; font-weight: bold;">year</th><th style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0.4pt 0pt 0.4pt 0pt;    padding: 6pt 6pt 6pt 6pt; font-weight: bold;">month</th><th style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0.4pt 0pt 0.4pt 0pt;    padding: 6pt 6pt 6pt 6pt; font-weight: bold;">day</th><th style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0.4pt 0pt 0.4pt 0pt;    padding: 6pt 6pt 6pt 6pt; font-weight: bold;">dep_time</th><th style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0.4pt 0pt 0.4pt 0pt;    padding: 6pt 6pt 6pt 6pt; font-weight: bold;">sched_dep_time</th><th style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0.4pt 0pt 0.4pt 0pt;    padding: 6pt 6pt 6pt 6pt; font-weight: bold;">dep_delay</th><th style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0.4pt 0.4pt 0.4pt 0pt;    padding: 6pt 6pt 6pt 6pt; font-weight: bold;">arr_time</th></tr>
<tr>
<td style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0.4pt 0pt 0pt 0.4pt;    padding: 6pt 6pt 6pt 6pt; background-color: rgb(242, 242, 242); font-weight: normal;">2013</td><td style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0.4pt 0pt 0pt 0pt;    padding: 6pt 6pt 6pt 6pt; background-color: rgb(242, 242, 242); font-weight: normal;">1</td><td style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0.4pt 0pt 0pt 0pt;    padding: 6pt 6pt 6pt 6pt; background-color: rgb(242, 242, 242); font-weight: normal;">1</td><td style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0.4pt 0pt 0pt 0pt;    padding: 6pt 6pt 6pt 6pt; background-color: rgb(242, 242, 242); font-weight: normal;">517</td><td style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0.4pt 0pt 0pt 0pt;    padding: 6pt 6pt 6pt 6pt; background-color: rgb(242, 242, 242); font-weight: normal;">515</td><td style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0.4pt 0pt 0pt 0pt;    padding: 6pt 6pt 6pt 6pt; background-color: rgb(242, 242, 242); font-weight: normal;">2</td><td style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0.4pt 0.4pt 0pt 0pt;    padding: 6pt 6pt 6pt 6pt; background-color: rgb(242, 242, 242); font-weight: normal;">830</td></tr>
<tr>
<td style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0pt 0pt 0pt 0.4pt;    padding: 6pt 6pt 6pt 6pt; font-weight: normal;">2013</td><td style="vertical-align: top; text-align: right; white-space: normal; padding: 6pt 6pt 6pt 6pt; font-weight: normal;">1</td><td style="vertical-align: top; text-align: right; white-space: normal; padding: 6pt 6pt 6pt 6pt; font-weight: normal;">1</td><td style="vertical-align: top; text-align: right; white-space: normal; padding: 6pt 6pt 6pt 6pt; font-weight: normal;">533</td><td style="vertical-align: top; text-align: right; white-space: normal; padding: 6pt 6pt 6pt 6pt; font-weight: normal;">529</td><td style="vertical-align: top; text-align: right; white-space: normal; padding: 6pt 6pt 6pt 6pt; font-weight: normal;">4</td><td style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0pt 0.4pt 0pt 0pt;    padding: 6pt 6pt 6pt 6pt; font-weight: normal;">850</td></tr>
<tr>
<td style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0pt 0pt 0.4pt 0.4pt;    padding: 6pt 6pt 6pt 6pt; background-color: rgb(242, 242, 242); font-weight: normal;">2013</td><td style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0pt 0pt 0.4pt 0pt;    padding: 6pt 6pt 6pt 6pt; background-color: rgb(242, 242, 242); font-weight: normal;">1</td><td style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0pt 0pt 0.4pt 0pt;    padding: 6pt 6pt 6pt 6pt; background-color: rgb(242, 242, 242); font-weight: normal;">1</td><td style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0pt 0pt 0.4pt 0pt;    padding: 6pt 6pt 6pt 6pt; background-color: rgb(242, 242, 242); font-weight: normal;">542</td><td style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0pt 0pt 0.4pt 0pt;    padding: 6pt 6pt 6pt 6pt; background-color: rgb(242, 242, 242); font-weight: normal;">540</td><td style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0pt 0pt 0.4pt 0pt;    padding: 6pt 6pt 6pt 6pt; background-color: rgb(242, 242, 242); font-weight: normal;">2</td><td style="vertical-align: top; text-align: right; white-space: normal; border-style: solid solid solid solid; border-width: 0pt 0.4pt 0.4pt 0pt;    padding: 6pt 6pt 6pt 6pt; background-color: rgb(242, 242, 242); font-weight: normal;">923</td></tr>
</table>

---
class: inverse, middle
---
name: dplyr

.b.slate[Introduction]

It's a package.
--
 `dplyr` is not installed by default, so you'll need to install it.<sup>.pink[†]</sup>

.footnote[.pink[†] or just `p_load(dplyr)` after loading `pacman`.]

`dplyr` is part of the [`tidyverse`](https://dplyr.tidyverse.org/) (Hadleyverse), and it follows a grammar-based approach to programming/data work.

- `data` compose the subjects of your stories

- `dplyr` provides the *verbs* (action words)
:<br> `filter()`, `mutate()`, `select()`, `group_by()`, `summarize()`, `arrange()`

<br>

.hi-slate[*Bonus*] `dplyr` is pretty fast and able to interact with SQL databases.

---
name: mutate

.b-slate[Manipulating variables:] `mutate()`

`dplyr` streamlines adding/manipulating variables in your data frame.

.hi-slate[Function] `mutate(.data, ...)`

- .pink[Required argument] `.data`, an existing data frame

- .pink[Additional arguments] Names and values of the new variables

- .pink[Output] An updated data frame

.ex[Example]

```r
mutate(.data = our_df, new1 = 7, new2 = x * y)
```

---

## `mutate()`

.ex[Example] Take the data frame

```r
my_df <- data.frame(x = 1:3, y = 5:7)
```

`mutate()` allows us to create many new variables with one call.

.pull-left[

```r
mutate(.data = my_df,
  xy = x * y,
  x2 = x^2,
  xy2 = xy^2,
  is_max = x == max(x)
)
```

]
--
.pull-right[

<div id="htmlwidget-cc895535cca1f12d590c" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-cc895535cca1f12d590c">{"x":{"filter":"none","vertical":false,"data":[[1,2,3],[5,6,7],[5,12,21],[1,4,9],[25,144,441],[false,false,true]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th>x<\/th>\n      <th>y<\/th>\n      <th>xy<\/th>\n      <th>x2<\/th>\n      <th>xy2<\/th>\n      <th>is_max<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"t","columnDefs":[{"className":"dt-right","targets":[0,1,2,3,4]}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

Notice `mutate()` returns the original *and* new columns.

]
---
name: transmute

## `mutate()` *vs.* `transmute()`

As their names imply, `mutate()` and `transmute()` are very similar functions.

- `mutate()` returns the .pink[original] *and* .purple[new] columns (variables).

- `transmute()` returns only the .purple[new] columns (variables).

.slate[*Note*] Both functions return a new object as *output*—they do not update the object in .mono[R]'s memory. (This is the case for all functions in `dplyr`.)
---

## `%>%` and `dplyr`

Each `dplyr` function begins with a `.data` argument so that you can easily pipe in data frames (recall: `mutate(.data, ...)`).

The common workflow in `dplyr` will look something like

`new_df <- old_df %>% mutate(cool stuff here)`

which takes `old_df`, does some cool stuff with `mutate()`, and then saves the output of `mutate()` as `new_df`.

---

## `filter()`

The `filter()` function does what its name implies: it .b[filters the rows] of your data frame .b[based upon logical conditions].

---
layout: true

# dplyr
## `filter()`

The `filter()` function does what its name implies: it .b[filters the rows] of your data frame .b[based upon logical conditions].

.ex[Example]

.pull-left[

```r
# Create a dataset
some_df <- data.frame(
  x = 1:10,
  y = 11:20
)
```
]

---
name: filter
count: false

.pull-right[

```r
# Only keep rows where x is 3
some_df %>% filter(x == 3)
```
<div id="htmlwidget-49d6dc8269a8f2dd644c" style="width:200px;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-49d6dc8269a8f2dd644c">{"x":{"filter":"none","vertical":false,"data":[[3],[13]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th>x<\/th>\n      <th>y<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"t","columnDefs":[{"className":"dt-right","targets":[0,1]}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
]
---

.pull-right[

```r
# Only keep rows where x > 7
some_df %>% filter(x > 7)
```
<div id="htmlwidget-c5955e3909f6b2e2ca9c" style="width:200px;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-c5955e3909f6b2e2ca9c">{"x":{"filter":"none","vertical":false,"data":[[8,9,10],[18,19,20]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th>x<\/th>\n      <th>y<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"t","columnDefs":[{"className":"dt-right","targets":[0,1]}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
]
---

.pull-right[

```r
# Keep rows where y/x > 3
some_df %>% filter(y/x > 3)
```
<div id="htmlwidget-43980592c1d749964e44" style="width:200px;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-43980592c1d749964e44">{"x":{"filter":"none","vertical":false,"data":[[1,2,3,4],[11,12,13,14]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th>x<\/th>\n      <th>y<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"t","columnDefs":[{"className":"dt-right","targets":[0,1]}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
]
---

.pull-right[

```r
# Keep rows where x>8 OR y<12
some_df %>%
  filter(x > 8 | y < 12)
```
<div id="htmlwidget-9bcc45640b6ebe189935" style="width:200px;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-9bcc45640b6ebe189935">{"x":{"filter":"none","vertical":false,"data":[[1,9,10],[11,19,20]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th>x<\/th>\n      <th>y<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"t","columnDefs":[{"className":"dt-right","targets":[0,1]}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
]
---

.pull-right[

```r
# Keep rows where 16<=y<=18
some_df %>%
  filter(between(y, 16, 18))
```
<div id="htmlwidget-b3c21e141b1d09648933" style="width:200px;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-b3c21e141b1d09648933">{"x":{"filter":"none","vertical":false,"data":[[6,7,8],[16,17,18]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th>x<\/th>\n      <th>y<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"t","columnDefs":[{"className":"dt-right","targets":[0,1]}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
]
---

.pull-right[

```r
# Keep rows where y > 20
some_df %>% filter(y > 20)
```
<div id="htmlwidget-c55f00d0f1ef2f79f547" style="width:200px;height:100px;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-c55f00d0f1ef2f79f547">{"x":{"filter":"none","vertical":false,"data":[[],[]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th>x<\/th>\n      <th>y<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"t","columnDefs":[{"className":"dt-right","targets":[0,1]}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
]

If you filter your data frame down to nothing, .mono[R] returns a 0-row data frame with the names/number of columns from the original data frame.
---
layout: true
# dplyr
---
name: select

## `select()`

Just as .purple[`filter()`] grabs .purple[row-based subsets] of your data frame,
<br>.pink[`select()`] grabs .pink[column-based subsets].

You can select columns using their .b[names]
<br>.pad-left[`our_df %>% select(var10, var100)`]

you can select columns using their .b[numbers]
<br>.pad-left[`our_df %>% select(10, 100)`]

or you can select columns using .b[helper fuctions]
<br>.pad-left[`our_df %>% select(starts_with("var10"))`]

`select()` helps you narrow down a dataset to its necessary features.
---
name: summarize

## `summarize()`

Hopefully you're starting to see that functions' names in `dplyr` tell you what the function does.

`summarize()`<sup>.pink[†]</sup> summarizes variables—you choose the variables and the summaries (_e.g._, `mean()` or `min()`).

.footnote[.pink[†] or `summarise()` if you ❤️️ 🇬🇧]

```r
the_df %>% summarize(
  mean(x), mean(y), mean(z),
  min(x), max(x),
)
```
would return a 1×5 data frame with the means of `x`, `y`, and `z`; the minimum of `x`; and the maximum of `x`.
---
name: group_summarize

## `summarize()` and `group_by()`

While sample-wide summarizes are certainly interesting, `dplyr` has one last gem for us: `group_by()`.

`group_by()` groups your observations by the variable(s) that you name.

Specifically, `group_by()` returns a *grouped data frame* that you can then feed to `summarize()`, `mutate()`, or `transmuate` to perform grouped calculations, _e.g._, each group's mean.

---

## Example: Grouped summaries

.pull-left[.small[

```r
# Create a new data frame
our_df <- data.frame(
  x = 1:6,
  y = c(0, 1),
  grp = rep(c("A", "B"), each = 3)
)
```

<div id="htmlwidget-faabd3dacc294d5e9475" style="width:50%;height:250px;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-faabd3dacc294d5e9475">{"x":{"filter":"none","vertical":false,"data":[[1,2,3,4,5,6],[0,1,0,1,0,1],["A","A","A","B","B","B"]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th>x<\/th>\n      <th>y<\/th>\n      <th>grp<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"t","columnDefs":[{"className":"dt-right","targets":[0,1]}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
]]

.pull-right[.small[

```r
# For dataset 'our_df'...
our_df %>%
  # Group by 'grp'
  group_by(grp) %>%
  # Take means of 'x' and 'y'
  summarize(mean(x), mean(y))
```

<div id="htmlwidget-6e9bdbe704639b735f8e" style="width:100%;height:110px;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-6e9bdbe704639b735f8e">{"x":{"filter":"none","vertical":false,"data":[["A","B"],[2,5],[0.333333333333333,0.666666666666667]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th>grp<\/th>\n      <th>mean(x)<\/th>\n      <th>mean(y)<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"t","columnDefs":[{"targets":1,"render":"function(data, type, row, meta) {\n    return type !== 'display' ? data : DTWidget.formatRound(data, 3, 3, \",\", \".\");\n  }"},{"targets":2,"render":"function(data, type, row, meta) {\n    return type !== 'display' ? data : DTWidget.formatRound(data, 3, 3, \",\", \".\");\n  }"},{"className":"dt-right","targets":[1,2]}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":["options.columnDefs.0.render","options.columnDefs.1.render"],"jsHooks":[]}</script>
]]
---

## Example: Grouped mutation

.pull-left[.small[

```r
# Create a new data frame
our_df <- data.frame(
  x = 1:6,
  y = c(0, 1),
  grp = rep(c("A", "B"), each = 3)
)
```

<div id="htmlwidget-77b321f2d44dfdbd0273" style="width:50%;height:250px;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-77b321f2d44dfdbd0273">{"x":{"filter":"none","vertical":false,"data":[[1,2,3,4,5,6],[0,1,0,1,0,1],["A","A","A","B","B","B"]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th>x<\/th>\n      <th>y<\/th>\n      <th>grp<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"t","columnDefs":[{"className":"dt-right","targets":[0,1]}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
]]

.pull-right[.small[

```r
# Add grp means for x and y
our_df %>%
  group_by(grp) %>%
  mutate(
    x_m = mean(x), y_m = mean(y)
  )
```

<div id="htmlwidget-1adb18983df03fa0ec2e" style="width:100%;height:250px;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-1adb18983df03fa0ec2e">{"x":{"filter":"none","vertical":false,"data":[[1,2,3,4,5,6],[0,1,0,1,0,1],["A","A","A","B","B","B"],[2,2,2,5,5,5],[0.333333333333333,0.333333333333333,0.333333333333333,0.666666666666667,0.666666666666667,0.666666666666667]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th>x<\/th>\n      <th>y<\/th>\n      <th>grp<\/th>\n      <th>x_m<\/th>\n      <th>y_m<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"t","columnDefs":[{"targets":3,"render":"function(data, type, row, meta) {\n    return type !== 'display' ? data : DTWidget.formatRound(data, 3, 3, \",\", \".\");\n  }"},{"targets":4,"render":"function(data, type, row, meta) {\n    return type !== 'display' ? data : DTWidget.formatRound(data, 3, 3, \",\", \".\");\n  }"},{"className":"dt-right","targets":[0,1,3,4]}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":["options.columnDefs.0.render","options.columnDefs.1.render"],"jsHooks":[]}</script>
]]

---
name: arrange

## `arrange()`

`arrange()` will sorts the rows of a data frame using the inputted columns.

.mono[R] defaults to starting with the "lowest" (smallest) at the top of the data frame. Use a `-` in front of the variable's name to reverse sort.

---
layout: false
class: clear, middle

.pull-left[

```r
# As is
our_df
```
<div id="htmlwidget-ccc5ec7e94192418b907" style="width:100%;height:250px;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-ccc5ec7e94192418b907">{"x":{"filter":"none","vertical":false,"data":[[1,2,3,4,5,6],[0,1,0,1,0,1],["A","A","A","B","B","B"]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th>x<\/th>\n      <th>y<\/th>\n      <th>grp<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"t","columnDefs":[{"className":"dt-right","targets":[0,1]}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
]

.pull-right[

```r
# Arrang by y, grp, then -x
our_df %>% arrange(y, grp, -x)
```
<div id="htmlwidget-2f185194195b1af0f45d" style="width:100%;height:250px;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-2f185194195b1af0f45d">{"x":{"filter":"none","vertical":false,"data":[[3,1,5,2,6,4],[0,0,0,1,1,1],["A","A","B","A","B","B"]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th>x<\/th>\n      <th>y<\/th>\n      <th>grp<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"t","columnDefs":[{"className":"dt-right","targets":[0,1]}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
]

---
name: tidyverse
# The tidyverse

There's more! `dplyr` and .purple[`tidyr`] offer even more....super[.pink[†]]

.footnote[
.pink[†] And these are only two of the packages in the `tidyverse`.
]

- .note[Viewing data] `glimpse()`, `top_n()`
- .note[Sampling] `sample_n()`, `sample_frac()`
- .note[Summaries] `first()`, `last()`, `nth()`, `n_distinct()`
- .note[Duplicates] `distinct()`
- .note[Missingness] `na_if()`, .purple[`replace_na()`], .purple[`drop_na()`], .purple[`fill()`]

The folks at RStudio have put together some great cheatsheets, *e.g.*,

- [`dplyr`](https://raw.githack.com/edrubin/EC524W20/master/resources/cheatsheet-dplyr.pdf)
- [data import](https://raw.githack.com/edrubin/EC524W20/master/resources/cheatsheet-data-import.pdf)
- [data wrangling](https://raw.githack.com/edrubin/EC524W20/master/resources/cheatsheet-data-wrangling.pdf)

---
# Exercises

Some selected exercises from [R for Data Science](https://r4ds.had.co.nz/index.html) by Hadley Wickham

1. Exercise [5.2.4.1](https://r4ds.had.co.nz/transform.html#exercises-8)

1. Exercise [5.3.1.2, 5.3.1.3, 5.3.1.4](https://r4ds.had.co.nz/transform.html#exercises-9)

1. Exercise [5.5.2.4](https://r4ds.had.co.nz/transform.html#exercises-11)

1. Exercise [5.7.1.3](https://r4ds.had.co.nz/transform.html#exercises-13)

---
layout: false
# Table of contents

.col-left[
.small[
#### Admin
- [Today and upcoming](#admin)

#### Workflow
- [General](#workflow-main)
- [RStudio](#rstudio)
- [Related best practies](#best-practices)
- [Projects](#projects)
]]

.col-right[
.small[
#### `dplyr`
- [Pipes](#pipes)
- [`mutate`](#mutate)
- [`transmute()`](#transmute)
- [`arrange`](#arrange)
- [`filter()`](#filter)
- [`select()`](#select)
- [`summarize`](#summarize)
- [`summarize()` and `group_by()`](#group_summarize)
- [The `tidyverse`](#tidyverse)
]
]