---
title: "Data in/and .mono[R]"
subtitle: "EC 425/525, Lab 2"
author: "Edward Rubin"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  xaringan::moon_reader:
    css: ['default', 'metropolis', 'metropolis-fonts', 'my-css.css']
    # self_contained: true
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
class: inverse, middle
```{R, setup, include = F}
# devtools::install_github("dill/emoGG")
library(pacman)
p_load(
broom, tidyverse,
latex2exp, ggplot2, ggthemes, ggforce, viridis, extrafont, gridExtra,
kableExtra, snakecase, janitor,
data.table, dplyr, estimatr,
lubridate, knitr, parallel,
lfe,
here, magrittr
)
# Define colors
red_pink <- "#e64173"
turquoise <- "#20B2AA"
orange <- "#FFA500"
red <- "#fb6107"
blue <- "#3b3b9a"
green <- "#8bb174"
grey_light <- "grey70"
grey_mid <- "grey50"
grey_dark <- "grey20"
purple <- "#6A5ACD"
slate <- "#314f4f"
# Dark slate grey: #314f4f
# Knitr options
opts_chunk$set(
comment = "#>",
fig.align = "center",
fig.height = 7,
fig.width = 10.5,
warning = F,
message = F
)
opts_chunk$set(dev = "svg")
options(device = function(file, width, height) {
svg(tempfile(), width = width, height = height)
})
options(knitr.table.format = "html")
```
# Prologue
---
name: schedule
# Schedule
## Last time
Getting to know .mono[R]—objects, functions, *etc.*
## Today
Working with data in .mono[R].
- The `data.frame` class
- The `dplyr` package
## Upcoming
.hi[Due Monday] Step 1 of our research-project proposal.
---
layout: true
# Matrices
---
name: review
## Quick review
1. `mat <- matrix(data = 1:10, ncol = 2)` creates a 5×2 `matrix` object containing the numbers 1 through 10 (filled by column).
1. `mat[1,]` grabs the first row of our matrix `mat`.
1. `mat[3,2] <- NA` assigns `NA` to the row-3, column-2 element of `mat`.
1. `head(mat, 3)` returns up to the first three rows of `mat`.
1. `matrix(data = rnorm(100), ncol = 10)` creates a 10×10 matrix filled with random draws from $N(\mu=0,\sigma^2=1)$.
1. `mat[3,2] <- "Carrots"` assigns the `character` object `"Carrots"` to the `[3,2]` element of `mat`, forcing all elements of `mat` to `character`.
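---
## Quick review in code

The review items can be run in sequence. This is only a sketch that reuses the `mat` object from the list; the final line checks the coercion to `character`.

```{R, ex_review_sketch}
# Build the 5x2 matrix from the review (filled column by column)
mat <- matrix(data = 1:10, ncol = 2)
# First row of mat
mat[1, ]
# Replace the row-3, column-2 element with NA
mat[3, 2] <- NA
# Up to the first three rows
head(mat, 3)
# A character value coerces every element of mat to character
mat[3, 2] <- "Carrots"
class(mat[1, 1])
```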
---
## Next steps
Matrices are convenient two-dimensional arrays on which math "works.".pink[†]
.footnote[.pink[†] At least for `numeric` and `logical` matrices.]
*But* matrices also require all elements to be of the same class.
.qa[Q] What if we have a dataset whose variables (columns) have different classes?
--
.qa[A] We need a more flexible table-like object for our data.
--
Maybe a `data.table`?
--
Or a `data.frame`?
--
We'll start with `data.frame`.
--
We will spend a good amount of time on data frames, as they make up a huge part of your workflow.
---
layout: true
# Data frames
---
name: df
A `data.frame` is .mono[R]'s base, spreadsheet-like object that holds variables.
--
.ex[Example]
--
```{R, ex_df, echo = F}
p_load(babynames)
set.seed(123)
n <- 12
data.frame(
id = 1:n,
first_name = sample(filter(babynames, between(year, 1980, 2000))$name, size = n, replace = T),
fave_num = sample(0:1e2, size = n, replace = T),
is_tired = sample(c(T,F), size = n, prob = c(0.95,0.05), replace = T),
loves_econ = sample(c(T,F), size = n, replace = T)
)
```
---
A `data.frame` is .mono[R]'s base, spreadsheet-like object that holds variables.
.ex[Example]
```{R, df_starwars, echo = F}
starwars[1:12,c(1:3, 8:10)] %>% data.frame()
```
---
name: creation
## Creation
The `data.frame()` function creates...
--
`data.frame` objects.
--
You'll generally define data frames by passing the function
(.hi-slate[1]) column names and (.hi-slate[2]) values for the columns.
```{R, ex_creat1, eval = F}
data.frame(var1 = 1:5, var2 = "apple", var3 = rnorm(5))
```
--
You can also assign the values using already-existing objects, _e.g._,
```{R, ex_create2, eval = F}
# An object holding the values
tmp <- rnorm(5)
# Creating the data frame
data.frame(var1 = 1:5, var2 = "apple", var3 = tmp)
```
---
## Creation
```{R, ex_create3}
# Creating the data frame
data.frame(var1 = 1:5, var2 = "apple", var3 = rnorm(5))
```
(What a beauty.)
--
Notice that .mono[R] assumes we want to repeat `"apple"` for the entire column.
---
## Creation
You can also create data frames from other objects (_e.g._, matrices) using the function `as.data.frame()`.pink[†].
.footnote[.pink[†] Or just plain, old `data.frame()`.]
However, your data frame's columns will only have meaningful names if your matrix's columns had names. Otherwise, .mono[R] assigns defaults (`V1`, `V2`, ... with `as.data.frame()`).
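---
## Creation

A small sketch of converting a matrix, with and without column names. The object `m` and the names `"x"` and `"y"` are made up for illustration.

```{R, ex_as_df_sketch}
# A matrix without column names
m <- matrix(data = 1:6, ncol = 2)
# as.data.frame() supplies default names
names(as.data.frame(m))
# Name the matrix's columns first, and the names carry over
colnames(m) <- c("x", "y")
names(as.data.frame(m))
```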
---
name: indexing
## Indexing
Consider a data frame `our_df <- data.frame(x = 1:3, y = 4:6, z = 7:9)`.
.purple[Option 1] Index data frames just as you index matrices in .mono[R].
- `our_df[1,1]` grabs the value in the first row of the first variable.
- `our_df[2,]` returns the second row of `our_df` (as a data frame).
- `our_df[,3]` returns the third column (variable) of `our_df` (as a vector).
--
.purple[Option 2] Reference values/variables using columns' names.
- `our_df$x` returns the column named `x` (as a vector). .hi[New:] `$`
- `our_df[,"x"]` returns the column named `x` (as a vector).
- `our_df["x"]` returns the column named `x` (as a data frame).
- `our_df[,c("x","y")]` returns a data frame with variables `"x"` and `"y"`.
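---
## Indexing

Both options can be tried on the `our_df` defined above. The comments note what each call should return; treat this as a sketch to run yourself.

```{R, ex_index_sketch}
# The data frame from the indexing slide
our_df <- data.frame(x = 1:3, y = 4:6, z = 7:9)
# Option 1: matrix-style indexing
our_df[1, 1]           # a single value
our_df[2, ]            # second row (a data frame)
our_df[, 3]            # third column (a vector)
# Option 2: name-based indexing
our_df$x               # a vector
our_df["x"]            # a one-column data frame
our_df[, c("x", "y")]  # a two-column data frame
```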
---
name: names
## Names (of columns)
The columns (variables) in your data frame have names..pink[†]
.footnote[.pink[†] If you don't name the columns, then .mono[R] will.]
.qa[Q] What if you want to see/know those names?
--
.qa[A] You've got a few options.
--
1. The `names()` function returns the *names* of an object.
--
2. `head(your_df)` will show you the first 6 rows of `your_df`.
*Note:* May provide too much output if you have a lot of columns.
--
3. In .mono[RStudio]: `View(your_df)` or look in your .mono[Environment] tab.
---
## Naming
The `names()` function will also help you rename any/all variables.
--
Change the names of .b[all variables] (include a name for each variable):
```{R, change_names_all, eval = F}
# Set new names
names(our_df) <- c("name1", "name2", "name3")
```
--
Change the name of .b[the second variable] (only):
```{R, change_names_one, eval = F}
# Set new names
names(our_df)[2] <- "name2"
```
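---
## Naming

A short sketch that checks names before and after renaming. The data frame `nm_df` and the new names are made up for this example.

```{R, ex_names_sketch}
# A small data frame (nm_df is a made-up name for this sketch)
nm_df <- data.frame(x = 1:3, y = 4:6, z = 7:9)
# See the current names
names(nm_df)
# Rename only the second column
names(nm_df)[2] <- "why"
# Rename every column at once (one name per column)
names(nm_df) <- toupper(names(nm_df))
names(nm_df)
```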
---
name: adding
## Adding variables
Just as we referenced .pink[existing] variables using `$var_name`,
we can create .purple[new] variables using `$new_var`, _e.g._,
```{R, create_var, eval = F}
# Add a variable to our_df (one value per row; our_df has 3 rows)
our_df$new_var <- 1:3
```
--
To create a new variable from existing columns, reference those columns with `$`:
```{R, create_var2, eval = F}
# Create interaction: xy = x * y
our_df$xy <- our_df$x * our_df$y
```
--
.qa[Q] Isn't there a better/faster/less-typing way?
--
.qa[A] Yes. *Enter* `dplyr`
--
(also: `data.table`, which we'll leave for the future).
---
layout: true
# dplyr
---
name: dplyr
## Intro
It's a package.
--
`dplyr` is not installed by default, so you'll need to install it..pink[†]
.footnote[.pink[†] or just `p_load(dplyr)` after loading `pacman`.]
--
`dplyr` is part of the [`tidyverse`](https://dplyr.tidyverse.org/) (Hadleyverse), and it follows a grammar-based approach to programming/data work.
--
- `data` compose the subjects of your stories
- `dplyr` provides the *verbs* (action words):
`filter()`, `mutate()`, `select()`, `group_by()`, `summarize()`, `arrange()`
--
.hi-slate[*Bonus*] `dplyr` is pretty fast and able to interact with SQL databases.
---
name: mutate
## Manipulating variables: `mutate()`
`dplyr` streamlines adding/manipulating variables in your data frame.
.hi-slate[Function] `mutate(.data, ...)`
- .pink[Required argument] `.data`, an existing data frame
- .pink[Additional arguments] Names and values of the new variables
- .pink[Output] An updated data frame
--
.ex[Example]
```{R, ex_mutate1, eval = F}
mutate(.data = our_df, new1 = 7, new2 = x * y)
```
---
## `mutate()`
.ex[Example] Take the data frame
```{R, ex_mutate2_df}
my_df <- data.frame(x = 1:4, y = 5:8)
```
--
`mutate()` allows us to create many new variables with one call.
.pull-left[
```{R, ex_mutate2, eval = F}
mutate(.data = my_df,
xy = x * y,
x2 = x^2,
y2 = y^2,
xy2 = xy^2,
is_x_max = x == max(x)
)
```
]
--
.pull-right[
```{R, ex_mutate2_result, echo = F}
mutate(.data = my_df,
xy = x * y,
x2 = x^2,
y2 = y^2,
xy2 = xy^2,
is_x_max = x == max(x)
)
```
Notice `mutate()` returns the original *and* new columns.
]
---
name: transmute
## `mutate()` *vs.* `transmute()`
As their names imply, `mutate()` and `transmute()` are very similar functions.
- `mutate()` returns the .pink[original] *and* .purple[new] columns (variables).
- `transmute()` returns only the .purple[new] columns (variables).
--
.slate[*Note*] Both functions return a new object as *output*—they do not update the object in .mono[R]'s memory. (This is the case for all functions in `dplyr`.)
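---
## `mutate()` *vs.* `transmute()`

A minimal sketch of the difference, and of the point that neither function changes its input. The data frame `tm_df` is a made-up name for this example.

```{R, ex_transmute_sketch}
library(dplyr)
tm_df <- data.frame(x = 1:4, y = 5:8)
# mutate() keeps x and y and adds xy
mutate(.data = tm_df, xy = x * y)
# transmute() returns only xy
transmute(.data = tm_df, xy = x * y)
# Neither call changed tm_df
names(tm_df)
# Assign the output if you want to keep it
tm_df <- mutate(.data = tm_df, xy = x * y)
names(tm_df)
```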
---
name: pipes
## Pipes
We can't go much deeper into the land of `dplyr` without mentioning pipes.
--
A *pipe* in programming allows you to take the output of one function and plug it into another function as an argument/input.
--
In `dplyr`, the expression for a pipe is `%>%`.
--
.mono[R]'s pipe specifically plugs the returned object to the .pink[left] of the pipe into the first argument of the function on the .purple[right] of the pipe, _e.g._,
--
```{R, ex_pipe_simple}
rnorm(10) %>% mean()
```
---
## Pipes
Pipes help avoid lots of nested functions, cut down on intermediate objects saved along the way, and increase the readability of our .mono[R] scripts.
--
.ex[Example] Three ways to draw 100 N(0,1) observations and calculate the interquartile range (IQR: difference between the 75.super[th] and 25.super[th] percentiles).
```{R, ex_pipe_iqr, eval = F}
# Save each intermediate step
draw <- rnorm(100)
end_points <- quantile(draw, probs = c(0.25, 0.75))
diff(end_points)
# Lots of nesting
diff(quantile(rnorm(100), probs = c(0.25, 0.75)))
# Piping 💪
rnorm(100) %>% quantile(probs = c(0.25, 0.75)) %>% diff()
```
---
## Pipes
By default, .mono[R] pipes the output from the LHS of the pipe into
the .hi[first] argument of the function on the RHS of the pipe.
--
*E.g.*, `a %>% fun(3)` is equivalent to `fun(arg1 = a, arg2 = 3)`.
--
If you want to pipe output into a different argument, you use a period (`.`).
--
- `b %>% fun(arg1 = 3, .)` is equivalent to `fun(arg1 = 3, arg2 = b)`.
- `b %>% fun(3, .)` is also equivalent to `fun(arg1 = 3, arg2 = b)`.
--
- `b %>% fun(., .)` is equivalent to `fun(arg1 = b, arg2 = b)`.
--
The `magrittr` package contains even more piping power..pink[†]
.footnote[.pink[†] `magrittr` = Magritte (of [*this is not a pipe*](https://en.wikipedia.org/wiki/The_Treachery_of_Images) fame) plus .mono[R].]
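---
## Pipes

The `fun()` on the previous slide is a placeholder. As a sketch, here are the same three patterns with the base functions `seq()` and `rep()` (chosen here only for illustration).

```{R, ex_pipe_dot_sketch}
library(magrittr)
# Default: the LHS becomes the first argument
# Same as seq(1, 9, by = 2)
1 %>% seq(9, by = 2)
# With a period, the LHS goes where the period is
# Same as seq(1, 9, by = 2)
2 %>% seq(1, 9, by = .)
# The period can appear more than once
# Same as rep(3, 3)
3 %>% rep(., .)
```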
---
## `%>%` and `dplyr`
Each of the `dplyr` verbs above takes `.data` as its first argument so that you can easily pipe in data frames (recall: `mutate(.data, ...)`).
--
The common workflow in `dplyr` will look something like
`new_df <- old_df %>% mutate(cool stuff here)`
which takes `old_df`, does some cool stuff with `mutate()`, and then saves the output of `mutate()` as `new_df`.
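---
## `%>%` and `dplyr`

A concrete version of that workflow, as a sketch. The contents of `old_df` and the new variable `xy` are made up for illustration.

```{R, ex_workflow_sketch}
library(dplyr)
# Some existing data
old_df <- data.frame(x = 1:4, y = 5:8)
# Pipe old_df into mutate() and save the result
new_df <- old_df %>% mutate(xy = x * y)
# new_df has the extra column; old_df is unchanged
names(new_df)
names(old_df)
```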
---
## `filter()`
The `filter()` function does what its name implies: it .b[filters the rows] of your data frame .b[based upon logical conditions].
---
layout: true
# dplyr
## `filter()`
The `filter()` function does what its name implies: it .b[filters the rows] of your data frame .b[based upon logical conditions].
.ex[Example]
.pull-left[
```{R, ex_filter}
# Create a dataset
some_df <- data.frame(
x = 1:10,
y = 11:20
)
```
]
---
name: filter
count: false
--
.pull-right[
```{R, ex_filter1}
# Only keep rows where x is 3
some_df %>% filter(x == 3)
```
]
---
.pull-right[
```{R, ex_filter2}
# Only keep rows where x > 7
some_df %>% filter(x > 7)
```
]
---
.pull-right[
```{R, ex_filter3}
# Keep rows where y/x > 3
some_df %>% filter(y/x > 3)
```
]
---
.pull-right[
```{R, ex_filter4}
# Keep rows where x>7 OR y<12
some_df %>%
filter(x > 7 | y < 12)
```
]
---
.pull-right[
```{R, ex_filter5}
# Keep rows where 15<=y<=18
some_df %>%
filter(between(y, 15, 18))
```
]
---
.pull-right[
```{R, ex_filter6}
# Keep rows where y > 20
some_df %>% filter(y > 20)
```
]
If you filter your data frame down to nothing, .mono[R] returns a 0-row data frame with the names/number of columns from the original data frame.
---
layout: true
# dplyr
---
name: select
## `select()`
Just as .purple[`filter()`] grabs .purple[row-based subsets] of your data frame,
.pink[`select()`] grabs .pink[column-based subsets].
--
You can select columns using their .b[names]
.pad-left[`our_df %>% select(var10, var100)`]
--
you can select columns using their .b[numbers]
.pad-left[`our_df %>% select(10, 100)`]
--
or you can select columns using .b[helper functions]
.pad-left[`our_df %>% select(starts_with("var10"))`]
--
`select()` helps you narrow down a dataset to its necessary features.
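---
## `select()`

A runnable sketch of the three approaches on a small data frame. The data frame `sel_df` and its column names are made up for this example.

```{R, ex_select_sketch}
library(dplyr)
sel_df <- data.frame(var1 = 1:3, var2 = 4:6, other = 7:9)
# By name
sel_df %>% select(var1, other)
# By number (position)
sel_df %>% select(1, 3)
# With a helper function
sel_df %>% select(starts_with("var"))
```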
---
name: summarize
## `summarize()`
Hopefully you're starting to see that functions' names in `dplyr` tell you what the function does.
`summarize()`.pink[†] summarizes variables—you choose the variables and the summaries (_e.g._, `mean()` or `min()`).
.footnote[.pink[†] or `summarise()` if you ❤️️ 🇬🇧]
--
```{R, ex_summarize, eval = F}
the_df %>% summarize(
mean(x), mean(y), mean(z),
min(x), max(x)
)
```
would return a 1×5 data frame with the means of `x`, `y`, and `z`; the minimum of `x`; and the maximum of `x`.
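---
## `summarize()`

A runnable sketch of the same five summaries. Naming each summary (_e.g._, `mean_x`) is an optional addition here, and `sum_df` is a made-up data frame.

```{R, ex_summarize_sketch}
library(dplyr)
sum_df <- data.frame(x = 1:4, y = 5:8, z = 9:12)
# Five summaries, each with its own name
sum_out <- sum_df %>% summarize(
  mean_x = mean(x), mean_y = mean(y), mean_z = mean(z),
  min_x = min(x), max_x = max(x)
)
# One row, five columns
sum_out
```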
---
name: group_summarize
## `summarize()` and `group_by()`
While sample-wide summaries are certainly interesting, `dplyr` has one last gem for us: `group_by()`.
`group_by()` groups your observations by the variable(s) that you name.
--
Specifically, `group_by()` returns a *grouped data frame* that you can then feed to `summarize()`, `mutate()`, or `transmute()` to perform grouped calculations, _e.g._, each group's mean.
---
## Example: Grouped summaries
.pull-left[.small[
```{R, ex_group1}
# Create a new data frame
our_df <- data.frame(
x = 1:6,
y = c(0, 1),
grp = rep(c("A", "B"), each = 3)
)
```
```{R, ex_group2, echo = F}
our_df
```
]]
--
.pull-right[.small[
```{R, ex_group3, eval = F}
# For dataset 'our_df'...
our_df %>%
# Group by 'grp'
group_by(grp) %>%
# Take means of 'x' and 'y'
summarize(mean(x), mean(y))
```
```{R, ex_group4, echo = F}
our_df %>%
group_by(grp) %>%
summarize(mean(x), mean(y))
```
]]
---
## Example: Grouped mutation
.pull-left[.small[
```{R, ex_group5}
# Create a new data frame
our_df <- data.frame(
x = 1:6,
y = c(0, 1),
grp = rep(c("A", "B"), each = 3)
)
```
```{R, ex_group6, echo = F}
our_df
```
]]
--
.pull-right[.small[
```{R, ex_group7, eval = F}
# Add grp means for x and y
our_df %>%
group_by(grp) %>%
mutate(
x_m = mean(x), y_m = mean(y)
)
```
```{R, ex_group8, echo = F}
our_df %>%
group_by(grp) %>%
mutate(x_m = mean(x), y_m = mean(y))
```
]]
---
name: arrange
## `arrange()`
`arrange()` sorts the rows of a data frame using the inputted columns.
.mono[R] defaults to starting with the "lowest" (smallest) at the top of the data frame. Use a `-` in front of the variable's name to reverse sort.
.pull-left[
```{R, ex_arrange1}
# As is
our_df
```
]
.pull-right[
```{R, ex_arrange2}
# Arrange by y, grp, then -x
our_df %>% arrange(y, grp, -x)
```
]
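---
## `arrange()`

A sketch of reverse sorting. The minus sign relies on the column being numeric; `dplyr`'s `desc()` is an alternative that also handles `character` columns. The data frame `arr_df` is made up for this example.

```{R, ex_arrange_sketch}
library(dplyr)
arr_df <- data.frame(x = c(2, 3, 1), grp = c("b", "a", "c"))
# Ascending by x (the default)
arr_df %>% arrange(x)
# Descending by x: a minus sign works for numeric columns
arr_df %>% arrange(-x)
# desc() also reverses the sort and works for character columns
arr_df %>% arrange(desc(grp))
```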
---
layout: false
# Table of contents
.pull-left[.hi-slate[Data and .mono[R]]
.smaller[
1. [Schedule](#schedule)
1. [Matrix review](#review)
1. [The `data.frame`](#df)
    - [Basic examples](#df)
    - [Creating](#creation)
    - [Indexing](#indexing)
    - [Names](#names)
    - [Adding variables](#adding)
]]
.pull-right[.hi-slate[`dplyr`]
.smaller[
1. [Intro](#dplyr)
1. [`mutate()`](#mutate)
1. [`transmute()`](#transmute)
1. [Pipes (`%>%`)](#pipes)
1. [`filter()`](#filter)
1. [`select()`](#select)
1. [`summarize()`](#summarize)
1. [`summarize()` and `group_by()`](#group_summarize)
1. [`arrange()`](#arrange)
]]
---
exclude: true
```{R, generate pdfs, include = F, eval = T}
source("../../ScriptsR/unpause.R")
unpause("02RData.Rmd", ".", T, T)
```