--- title: "Big Data and Economics" # subtitle: "

" subtitle: "Lecture 5: Data cleaning & wrangling: (1) Tidyverse" author: "Grant McDermott, adapted by Kyle Coombs" date: "Bates College | [EC/DCS 368](https://github.com/big-data-and-economics)" #"`r format(Sys.time(), '%d %B %Y')`" output: xaringan::moon_reader: css: [default, metropolis, metropolis-fonts] lib_dir: libs nature: highlightStyle: github highlightLines: true highlightSpans: true countIncrementalSlides: false --- name: toc ```{css, echo=FALSE} @media print { .has-continuation { display: block !important; } } ``` ```{r setup, include=FALSE} options(htmltools.dir.version = FALSE) library(knitr) opts_chunk$set( fig.align="center", #fig.width=6, fig.height=4.5, # out.width="748px", #out.length="520.75px", dpi=300, #fig.path='Figs/', cache=T#, echo=F, warning=F, message=F ) ``` # Table of contents 1. [Prologue](#prologue) 2. [Tidyverse basics](#basics) 3. [Data wrangling with dplyr](#dplyr) - [filter](#filter) - [arrange](#arrange) - [select](#select) - [mutate](#mutate) - [summarise](#summarise) - [joins](#joins) 4. [Data tidying with tidyr](#tidyr) - [pivot_longer](#pivot_longer) / [pivot_wider](#pivot_wider) - [separate](#separate) - [unite](#unite) 5. [Summary](#summary) --- class: inverse, center, middle name: prologue # Prologue

---
# Housekeeping

- Full version of these slides available on [Grant McDermott's website](https://github.com/uo-ec607/lectures)
- Participate using GitHub Codespaces created from the course materials repository (not your fork)
  - That will ensure we're all working with the same versions of the data and packages
- If you want to use your own fork or the clone in your local environment, first `sync` your fork with the parent repository
  - Then, `pull` the latest changes from the parent repository onto your local machine or create a GitHub Codespace

---
# Checklist

### R packages you'll need for this lecture

☑ [**tidyverse**](https://www.tidyverse.org/)
- This is a meta-package that loads a suite of other packages, including **dplyr** and **tidyr**. The **dplyr** package also ships with the `starwars` dataset that we'll use for practice.

☑ [**nycflights13**](https://github.com/hadley/nycflights13)

--
The following code chunk will install (if necessary) and load everything for you.

```{r libs, cache=FALSE, message=FALSE}
if (!require(pacman)) install.packages('pacman', repos = 'https://cran.rstudio.com')
pacman::p_load(tidyverse, ggplot2, nycflights13)
```

---
# For the problem set

On the problem set you'll be using the National Longitudinal Survey of Youth, 1997 cohort (**NLSY 1997**).

- This is a long-running survey of a representative sample of U.S. youth born between 1980 and 1984.
- The data is available from the [NLS Investigator](https://www.nlsinfo.org/investigator/pages/search) website.
- I've provided a zipped folder with the data in the `data/raw` folder, which you can access from the problem set
  - You may not recognize all the file extensions, but the [NLSY documentation](https://www.nlsinfo.org/content/access-data-investigator/investigator-user-guide/zip-file-contents) can help
- The key file to note is `nlsy1997-ps2.NLSY97`, a tagset file that you can upload to the NLS Investigator to get an exact copy of the data to use for the problem set
- Further details are provided on the problem set
- **Replication** alert: tagsets are one way to ensure that your data is reproducible.

---
# What is "tidy" data?

### Resources:
- [Vignettes](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) (from the **tidyr** package)
- [Original paper](https://vita.had.co.nz/papers/tidy-data.pdf) (Hadley Wickham, 2014 JSS)

--
### Key points:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

--
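As a quick illustration, here is a minimal sketch with made-up numbers (the `messy` and `tidy` data frames below are purely illustrative): the same information stored in an untidy, wide layout and in a tidy, long layout.

```{r tidy_example}
## Hypothetical toy data, for illustration only
messy = data.frame(country = c("A", "B"), gdp_2019 = c(1.1, 2.2), gdp_2020 = c(1.0, 2.1))
tidy  = data.frame(
  country = rep(c("A", "B"), each = 2),
  year    = rep(c(2019, 2020), times = 2),
  gdp     = c(1.1, 1.0, 2.2, 2.1)
  )
messy ## One row per country, with years spread across columns
tidy  ## Each variable is a column; each observation (a country-year) is a row
```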
Basically, tidy data is more likely to be [long (i.e. narrow) format](https://en.wikipedia.org/wiki/Wide_and_narrow_data) than wide format.

---
# Relational Database Management with R

- Remember Relational Database Management from our work on [Empirical Organization](https://raw.githack.com/big-data-and-economics/big-data-class-materials/main/lectures/02-empirical-workflow/02-empirical-workflow.html#1)?
- Today, we'll learn how to implement it using packages in the `tidyverse`
- We'll cover:
  - Subsetting data
  - Variable creation, renaming, selection
  - Grouping and summarizing data
  - Joining and appending datasets

---
class: inverse, center, middle
name: basics

# Tidyverse basics

---
# Tidyverse vs. base R

(Almost) everything you can do in the tidyverse can be done in base R without loading new packages.

I won't delve into the debate, because I think the answer is [clear](http://varianceexplained.org/r/teach-tidyverse/): We should teach the tidyverse first (or, at least, early).
- The documentation and community support are outstanding.
- Having a consistent philosophy and syntax makes it easier to learn.
- Provides a convenient "front-end" to big data tools.
- For data cleaning, wrangling, and plotting, the tidyverse really is a no-brainer.1

--

**But**... this certainly shouldn't put you off learning base R alternatives.
- Base R is extremely flexible and powerful (and stable).
- There are some things that you'll have to venture outside of the tidyverse for.
- A combination of tidyverse and base R is often the best solution to a problem.
- Excellent base R data manipulation tutorials: [here](https://www.rspatial.org/intr/index.html) and [here](https://github.com/matloff/fasteR).

.footnote[
1 I'm also a huge fan of [**data.table**](http://r-datatable.com/). This package will be the subject of our next lecture.
]

---
# Tidyverse vs. base R (cont.)

One point of convenience is that there is often a direct correspondence between a tidyverse command and its base R equivalent.

These generally follow a `tidyverse::snake_case` vs `base::period.case` rule. E.g. Compare:

| tidyverse | base |
|---|---|
| `?readr::read_csv` | `?utils::read.csv` |
| `?dplyr::if_else` | `?base::ifelse` |
| `?tibble::tibble` | `?base::data.frame` |

Etcetera.

If you call up the above examples, you'll see that the tidyverse alternative typically offers some enhancements or other useful options (and sometimes restrictions) over its base counterpart.

--

**Remember:** There are (almost) always multiple ways to achieve a single goal in R.

---
# Tidyverse packages

Let's load the tidyverse meta-package and check the output.

```{r tverse, cache = FALSE}
library(tidyverse)
```

--

We see that we have actually loaded a number of packages (which could also be loaded individually): **ggplot2**, **tibble**, **dplyr**, etc.
- We can also see information about the package versions and some [namespace conflicts](https://raw.githack.com/big-data-and-economics/big-data-class-materials/main/lectures/04-rlang/04-rlang.html#62).

---
# Tidyverse packages (cont.)

The tidyverse actually comes with a lot more packages than those loaded automatically.1

```{r tverse_pkgs}
tidyverse_packages()
```

We'll use most of these packages during the remainder of this course.
- E.g. **lubridate** for dates, **rvest** for webscraping, and **broom** to `tidy()` R objects into tables
- However, bear in mind that these packages will have to be loaded separately with `library()`

.footnote[
1 It also includes a *lot* of dependencies upon installation. This is a matter of some [controversy](http://www.tinyverse.org/).
]

---
# Tidyverse packages (cont.)

We will cover most of the tidyverse packages over the length of this course.

Today, however, I'm only really going to focus on two packages:
1. [**dplyr**](https://dplyr.tidyverse.org/)
2. [**tidyr**](https://tidyr.tidyverse.org/)

These are the workhorse packages for cleaning and wrangling data. They are thus the ones that you will likely make the most use of (alongside **ggplot2**, which we already met back in Lecture 1).
- Data cleaning and wrangling occupies an inordinate amount of time, no matter where you are in your research career.

---
class: inverse, center, middle
name: dplyr

# dplyr

---
# Aside: dplyr grouping messages

*Note:* **dplyr** (since version 1.0.0) notifies you about grouping variables every time you do operations on or with them. YMMV, but personally I find these messages annoying and so prefer to [switch them off](https://twitter.com/MattCowgill/status/1278463099272491008).

```r
options(dplyr.summarise.inform = FALSE) ## Add to .Rprofile to make permanent
```

---
# Key dplyr verbs

There are five key dplyr verbs that you need to learn.

1. `filter`: Filter (i.e. subset) rows based on their values.
2. `arrange`: Arrange (i.e. reorder) rows based on their values.
3. `select`: Select (i.e. subset) columns by their names.
4. `mutate`: Create new columns.
5. `summarise`: Collapse multiple rows into a single summary value.1

.footnote[
1 `summarize` with a "z" works too, but Hadley Wickham is from New Sealand.
]

---
# Learn the verbs

Practice these commands together using the `starwars` data frame that comes pre-packaged with dplyr.

**Stop** when you hit the last `summarise` slide (approx. 33).

```{r see-starwars}
starwars
```

---
name: filter
# 1) dplyr::filter

Filter means "subset" the rows of a data frame based on some condition(s).

```{r filter1}
filter(starwars, species == "Human", height >= 190)
```

---
# 1) dplyr::filter *cont.* (pipes)

We can chain multiple commands with the pipe `%>%` from the **magrittr** package1.

```{r filter_pipe}
starwars %>% filter(species == "Human", height >= 190)
```

.footnote[1 Pipes were invented by Doug McIlroy in 1964 and are widely used in Unix shells (e.g. bash) and other programming languages (e.g. `F#`). They pass the preceding object as the first argument to the following function. In R, they allow you to chain together code in a way that reads from left to right.]

---
# 1) dplyr::filter *cont.*

String operations from the **stringr** package are also auto-loaded with **tidyverse** and work well with `filter` too.

```{r filter2}
starwars %>% filter(str_detect(name, 'Skywalker')) # str_detect is from the stringr package
```

---
# 1) dplyr::filter *cont.*

A very common `filter` use case is identifying (or removing) missing data cases.

```{r filter3}
starwars %>% filter(is.na(height))
```

To remove missing observations, simply use negation: `filter(!is.na(height))`. Try this yourself.

---
name: arrange
# 2) dplyr::arrange

```{r arrange1}
starwars %>% arrange(birth_year)
```

--

*Note:* Arranging on a character-based column (i.e. strings) will sort alphabetically. Try this yourself by arranging according to the "name" column.

---
# 2) dplyr::arrange *cont.*

We can also arrange items in descending order using `arrange(desc())`.

```{r arrange2}
starwars %>% arrange(desc(birth_year))
```

---
name: select
# 3) dplyr::select

Select means subset the columns of a data frame based on their names. Use commas to select multiple columns out of a data frame. (You can also use "first:last" for consecutive columns). Deselect a column with "-".

```{r select1}
starwars %>% select(name:skin_color, species, -height) %>% head()
```

---
# 3) dplyr::select *cont.*

You can also rename some (or all) of your selected variables in place.

```{r select2}
starwars %>% select(alias = name, crib = homeworld, sex = gender) %>% head()
```

--

If you just want to rename columns without subsetting them, you can use `rename`. Try this now by replacing `select(...)` in the above code chunk with `rename(...)`.

---
# 3) dplyr::select *cont.*

The `select(contains(PATTERN))` option provides a nice shortcut in relevant cases.
```{r select3}
starwars %>% select(name, contains("color")) %>% head()
```

---
name: mutate
# 4) dplyr::mutate

You can create new columns from scratch, or (more commonly) as transformations of existing columns.

```{r mutate1}
starwars %>%
  select(name, birth_year) %>%
  mutate(dog_years = birth_year * 7) %>%
  mutate(comment = paste0(name, " is ", dog_years, " in dog years.")) %>%
  head()
```

---
# 4) dplyr::mutate *cont.*

*Note:* `mutate` is order aware. So you can chain multiple mutates in a single call.

```{r mutate2}
starwars %>%
  select(name, birth_year) %>%
  mutate(
    dog_years = birth_year * 7, ## Separate with a comma
    comment = paste0(name, " is ", dog_years, " in dog years.")
    ) %>%
  head()
```

---
# 4) dplyr::mutate *cont.*

Boolean, logical and conditional operators all work well with `mutate` too.

```{r mutate3}
starwars %>%
  select(name, height) %>%
  filter(name %in% c("Luke Skywalker", "Anakin Skywalker")) %>%
  mutate(tall1 = height > 180) %>%
  mutate(tall2 = ifelse(height > 180, "Tall", "Short")) ## Same effect, but can choose labels
```

---
# 4) dplyr::mutate *cont.*

Lastly, combining `mutate` with the `across` feature allows you to easily work on a subset of variables. For example:

```{r, mutate4}
starwars %>%
  select(name:eye_color) %>%
  mutate(across(where(is.character), \(x) toupper(x))) %>% #<<
  head(5)
```

--

Try to intuit what `\(x)` does above!

---
name: summarise
# 5) dplyr::summarise

Particularly useful in combination with the `group_by` command.

```{r summ1}
starwars %>%
  group_by(species, gender) %>%
  summarise(mean_height = mean(height, na.rm = TRUE)) %>%
  head()
```

---
# 5) dplyr::summarise *cont.*

Note that including `na.rm = TRUE` (or its alias `na.rm = T`) is usually a good idea with summarise functions. Otherwise, your output will be missing too.

```{r summ2}
## Probably not what we want
starwars %>% summarise(mean_height = mean(height))
## Much better
starwars %>% summarise(mean_height = mean(height, na.rm = TRUE))
```

---
# 5) dplyr::summarise *cont.*

The same `across`-based workflow that we saw with `mutate` a few slides back also works with `summarise`. For example:

```{r summ4}
starwars %>%
  group_by(species) %>%
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = T))) %>% #<<
  head()
```

---
# Other dplyr goodies

`group_by` and `ungroup`: For (un)grouping.
- Particularly useful with the `summarise` and `mutate` commands, as we've already seen.

--

`slice`: Subset rows by position rather than filtering by values.
- `starwars %>% slice(c(1, 5))`

--

`pull`: Extract a column as a vector or scalar.
- `starwars %>% filter(gender=="female") %>% pull(height)` returns `height` as a vector

--

`count` and `distinct`: Number and isolate unique observations.
- `starwars %>% count(species)`, or `starwars %>% distinct(species)`
- Or use `mutate`, `group_by`, and `n()`, e.g. `starwars %>% group_by(species) %>% mutate(num = n())`.

There are also [window functions](https://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html) for leads and lags, ranks, cumulative aggregation, etc.
- See `vignette("window-functions")`.

---
# Quick quiz

Write me code that tells me the average birth year, unique skin colors, and number of characters by homeworld of the human characters in the `starwars` dataset.
Submit on mentimeter: https://www.mentimeter.com/app/presentation/bleq87wo3evgh3j6ks3wqro6zdfh7nwz/v5wc59b5wfph

Join at https://www.menti.com | use code `97 37 37 3`

---
name: combining
# Combining data frames

The final set of dplyr "goodies" is the family of **append** and **join** operations. However, these are important enough that I want to go over some concepts in a bit more depth...
- We will encounter and practice these many more times as the course progresses.
- Imagine you have two data frames, `df1` and `df2`, that you want to combine.
  - You can **append**: stack the datasets on top of each other and match up the columns
  - You can **merge** (AKA a **join**): match the rows based on a common identifier
- Each of these is possible with base R, **dplyr**, and **data.table**.
- The appropriate choice depends on the task you are trying to accomplish
  - Are you trying to add new observations or new variables?

---
# Visualize the difference
Taken from [Pere A. Taberner](https://www.peretaberner.eu/merging-and-appending-datasets-with-dplyr/).

---
name: append
# Appending

- One way to append in the `tidyverse` is with `bind_rows()`
- Base R has `rbind()`, which requires column names to match
- `data.table` has `rbindlist()`, which requires column names to match unless you specify `fill`

```{r, append}
df1 <- data.frame(x = 1:3, y = 4:6)
df2 <- data.frame(x = 1:4, y = 10:13, z = letters[1:4])

## Append df2 to df1
bind_rows(df1, df2)
```

---
name: joins
# Joins

One of the mainstays of the dplyr package is merging data with the family of [join operations](https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html).
- `inner_join(df1, df2)`
- `left_join(df1, df2)`
- `right_join(df1, df2)`
- `full_join(df1, df2)`
- `semi_join(df1, df2)`
- `anti_join(df1, df2)`

Joins are how you get .hi[Relational Database Management] (RDBM) to work in R. (See visual depictions of the different join operations [here](https://r4ds.had.co.nz/relational-data.html).)

--

For the simple examples that I'm going to show here, we'll need some data sets that come bundled with the [**nycflights13**](http://github.com/hadley/nycflights13) package.
- Load it now and then inspect these data frames in your own console.

```{r flights, echo = F}
library(nycflights13)
```
```{r, eval = F}
library(nycflights13)
flights
planes
```

- Any variable names in common?
- Are the definitions of these variables consistent across data frames?

---
# Joins (cont.)
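To get a feel for how the join types differ before we turn to real data, here is a minimal toy sketch (the `df_x` and `df_y` data frames are hypothetical, made up purely for illustration):

```{r join_toy}
## Two hypothetical toy data frames that share an "id" column
df_x <- data.frame(id = 1:3, x = c("a", "b", "c"))
df_y <- data.frame(id = 2:4, y = c("B", "C", "D"))

inner_join(df_x, df_y, by = "id") ## Only ids present in both data frames
left_join(df_x, df_y, by = "id")  ## All rows of df_x; unmatched ids get NA for y
anti_join(df_x, df_y, by = "id")  ## Rows of df_x with no match in df_y
```

The next few slides apply the same logic to the much larger `flights` and `planes` tables.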
---
# Joins (cont.)

Let's perform a [left join](https://stat545.com/bit001_dplyr-cheatsheet.html#left_joinsuperheroes-publishers) on the flights and planes datasets.
- *Note*: I'm going to subset columns after the join, but only to keep text on the slide.

--

```{r join1}
left_join(flights, planes) %>%
  select(year, month, day, dep_time, arr_time, carrier, flight, tailnum, type, model)
```

---
# Joins (cont.)

(*continued from previous slide*)

Note that dplyr made a reasonable guess about which columns to join on (i.e. columns that share the same name). It also told us its choices:

```
*## Joining, by = c("year", "tailnum")
```

However, there's a problem here: the variable "year" does not have a consistent meaning across our joining datasets!
- In one it refers to the *year of flight*, in the other it refers to *year of construction*.

--

Luckily, there's an easy way to avoid this problem.
- See if you can figure it out before turning to the next slide.
- Try `?dplyr::join`.
- Submit your answer to https://www.menti.com using code `97 37 37 3`

---
# Joins (cont.)

(*continued from previous slide*)

You just need to be more explicit in your join call by using the `by = ` argument.
- You can also rename any ambiguous columns to avoid confusion.

```{r join2}
left_join(
  flights,
  planes %>% rename(year_built = year), ## Not necessary w/ below line, but helpful
  by = "tailnum" ## Be specific about the joining column
  ) %>%
  select(year, month, day, dep_time, arr_time, carrier, flight, tailnum, year_built, type, model) %>%
  head(3) ## Just to save vertical space on the slide
```

---
# Joins (cont.)

(*continued from previous slide*)

Last thing I'll mention for now: note what happens if we again specify the join column... but don't rename the ambiguous "year" column in either of the given data frames.

```{r join3}
left_join(
  flights,
  planes, ## Not renaming "year" to "year_built" this time
  by = "tailnum"
  ) %>%
  select(contains("year"), month, day, dep_time, arr_time, carrier, flight, tailnum, type, model) %>%
  head(3)
```

--

Make sure you know what "year.x" and "year.y" are. Again, it pays to be specific.

---
class: inverse, center, middle
name: tidyr

# tidyr

---
# Key tidyr verbs

1. `pivot_longer`: Pivot wide data into long format.1
2. `pivot_wider`: Pivot long data into wide format.2
3. `separate`: Separate (i.e. split) one column into multiple columns.
4. `unite`: Unite (i.e. combine) multiple columns into one.

.footnote[
1 Updated version of `tidyr::gather`.
2 Updated version of `tidyr::spread`.
]

--
Let's practice these verbs together in class.
- Side question: Which of `pivot_longer` vs `pivot_wider` produces "tidy" data?

---
name: pivot_longer
# 1) tidyr::pivot_longer

```{r pivot_longer1}
stocks = data.frame( ## Could use "tibble" instead of "data.frame" if you prefer
  time = as.Date('2009-01-01') + 0:1,
  X = rnorm(2, 0, 1),
  Y = rnorm(2, 0, 2),
  Z = rnorm(2, 0, 4)
  )
stocks
tidy_stocks = stocks %>% pivot_longer(-time, names_to = "stock", values_to = "price")
tidy_stocks
```

---
name: pivot_wider
# 2) tidyr::pivot_wider

```{r pivot_wider1}
tidy_stocks %>% pivot_wider(names_from = stock, values_from = price)
tidy_stocks %>% pivot_wider(names_from = time, values_from = price)
```

--
Note that the second example, which pivots on the `time` column instead of `stock`, has effectively transposed the data.

---
# 1) tidyr::pivot_longer with prefix

Let's pivot the pre-loaded `billboard` data, which shows weekly Billboard Top 100 song rankings for the year 2000.

```{r billboard}
head(billboard)
```

---
# 1) tidyr::pivot_longer with prefix *cont.*

Wait, why is there 'wk' in the 'week' column?

```{r pivot_longer3}
billboard %>%
  pivot_longer(cols = starts_with('wk'), names_to = "week", values_to = "rank") %>%
  head()
```

Try to fix this with the `names_prefix` argument. Submit your answer to https://www.menti.com using code `97 37 37 3`

---
# 1) tidyr::pivot_longer with prefix *cont.*

That fixed it.

```{r pivot_longer4}
billboard %>%
  pivot_longer(cols = starts_with('wk'), names_to = "week", values_to = "rank", names_prefix = 'wk') %>%
  mutate(week = as.numeric(week)) %>% # Make week a numeric variable
  head()
```

---
# Aside: Remembering the pivot_* syntax

There's a long-running joke about no one being able to remember Stata's "reshape" command. ([Exhibit A](https://twitter.com/helleringer143/status/1117234887902285836).)

It's easy to see this happening with the `pivot_*` functions too. Remember, the documentation is your friend!

```{r pivot_help, eval=FALSE}
?pivot_longer
```

GitHub Copilot, ChatGPT, and other AI tools are also your friends, provided you use precise language about what you want the AI tool to do and you try their suggestions carefully.1

.footnote[1 Back in my day we had to scour StackOverflow for hours to find the right answer. And we liked it!]

---
name: separate
# 3) tidyr::separate

```{r sep1}
economists = data.frame(name = c("Adam.Smith", "Paul.Samuelson", "Milton.Friedman"))
economists

economists %>% separate(name, c("first_name", "last_name"))
```

--
This command is pretty smart at detecting separators. But to avoid ambiguity, you can also specify the separation character explicitly, e.g. `separate(..., sep = "\\.")`. (Note the escape: `sep` is interpreted as a regular expression, so a bare `"."` would match any character.)

---
# 3) tidyr::separate *cont.*

A related function is `separate_rows`, for splitting up cells that contain multiple fields or observations (a frustratingly common occurrence with survey data).

```{r sep2}
jobs = data.frame(
  name = c("Jack", "Jill"),
  occupation = c("Homemaker", "Philosopher, Philanthropist, Troublemaker")
  )
jobs

## Now split out Jill's various occupations into different rows
jobs %>% separate_rows(occupation)
```

---
name: unite
# 4) tidyr::unite

```{r unite1}
gdp = data.frame(
  yr = rep(2016, times = 4),
  mnth = rep(1, times = 4),
  dy = 1:4,
  gdp = rnorm(4, mean = 100, sd = 2)
  )
gdp

## Combine "yr", "mnth", and "dy" into one "date" column
gdp %>% unite(date, c("yr", "mnth", "dy"), sep = "-")
```

---
# 4) tidyr::unite *cont.*

Note that `unite` will automatically create a character variable. You can see this better if we convert it to a tibble.

```{r unite2}
gdp_u = gdp %>% unite(date, c("yr", "mnth", "dy"), sep = "-") %>% as_tibble()
gdp_u
```

--

If you want to convert it to something else (e.g. date or numeric) then you will need to modify it using `mutate`. See the next slide for an example, using the [lubridate](https://lubridate.tidyverse.org/) package's super helpful date conversion functions.

---
# 4) tidyr::unite *cont.*

*(continued from previous slide)*

```{r unite3, message=F}
library(lubridate)
gdp_u %>% mutate(date = ymd(date))
```

---
# Other tidyr goodies

- `fill`: Fill in missing values with the last non-missing value.
  - `fill(df, starts_with("X"))` will fill in missing values in all columns that start with "X".
- `drop_na`: Drop rows with missing values.
- `expand`: Create a complete set of combinations from a set of factors.
- `nest` and `unnest`: Combine columns into lists within a single cell, or split a column of lists into separate rows.
  - Try with the `starwars` data frame: `unnest(starwars, films, names_sep = '')`
- And much, much more

---
class: inverse, center, middle
name: summary

# Summary

---
# Key verbs

### dplyr
1. `filter`
2. `arrange`
3. `select`
4. `mutate`
5. `summarise`

### tidyr
1. `pivot_longer`
2. `pivot_wider`
3. `separate`
4. `unite`

--

Other useful items include: pipes (`%>%`), grouping (`group_by`), joining functions (`left_join`, `inner_join`, etc.).

---
# Start your problem set!

- Go to the course calendar and click on the link for problem set 2
- Fork and clone the repository to your GitHub account
- Follow instructions to get the NLSY downloaded and loaded into your R environment

---
class: inverse, center, middle

# Next lecture: Scraping data!

```{r gen_pdf, include = FALSE, cache = FALSE, eval = TRUE}
infile = list.files(pattern = '.html')
pagedown::chrome_print(input = infile, timeout = 100)
```