Where we've been:
R Basics
Programming
Productivity tools
Where we're going:
Data wrangling in the tidyverse
Data cleaning
Introduce the term project
Data acquisition & webscraping
A whole "universe" of functions within R
Advantages:
Why did I make you learn base R to start with?
Note: Another approach with many fans is data.table, which you can learn about on your own!
There is often a direct correspondence between a tidyverse command and its base R equivalent.
tidyverse | base |
---|---|
?readr::read_csv |
?utils::read.csv |
?dplyr::if_else |
?base::ifelse |
?tibble::tibble |
?base::data.frame |
Notice these generally follow a tidyverse::snake_case
vs base::period.case
rule.
The tidyverse alternative typically offers some enhancements or other useful options (and sometimes restrictions) over its base counterpart.
The two most important properties of tidy data are:
Tidy data is easier to work with, because you have a consistent way of referring to variables and observations. It then becomes easy to manipulate, visualize, and model.
Image is from "R for Data Science" by Hadley Wickham & Garrett Grolemund, used under CC BY-NC-ND 3.0 and is not included under this project's overall CC license.
Both of these data sets display information on heart rate observed in individuals across 3 different time periods:
## name time1 time2 time3## 1 Wilbur 67 56 70## 2 Petunia 80 90 67## 3 Gregory 64 50 101
## name time heartrate## 1 Wilbur 1 67## 2 Petunia 1 80## 3 Gregory 1 64## 4 Wilbur 2 56## 5 Petunia 2 90## 6 Gregory 2 50## 7 Wilbur 3 70## 8 Petunia 3 67## 9 Gregory 3 10
Which dataframe is in tidy format?
Wide data:
Long data:
Tidy data is more likely to be long.
"Tidy datasets are all alike but every messy dataset is messy in its own way."
– Hadley Wickham
We need to install and load a couple of packages. Run these preliminaries:
install.packages('tidyverse')install.packages('nycflights13')library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──## ✔ dplyr 1.1.4 ✔ readr 2.1.5## ✔ forcats 1.0.0 ✔ stringr 1.5.1## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1## ✔ purrr 1.0.2 ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──## ✖ dplyr::filter() masks stats::filter()## ✖ dplyr::lag() masks stats::lag()## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We need to install and load a couple of packages. Run these preliminaries:
install.packages('tidyverse')install.packages('nycflights13')library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──## ✔ dplyr 1.1.4 ✔ readr 2.1.5## ✔ forcats 1.0.0 ✔ stringr 1.5.1## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1## ✔ purrr 1.0.2 ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──## ✖ dplyr::filter() masks stats::filter()## ✖ dplyr::lag() masks stats::lag()## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We see that we have actually loaded a number of packages (which could also be loaded individually): ggplot2, tibble, dplyr, etc.
The tidyverse actually comes with a lot more packages than those that are just loaded automatically.
tidyverse_packages()
## [1] "broom" "cli" "crayon" "dbplyr" ## [5] "dplyr" "dtplyr" "forcats" "googledrive" ## [9] "googlesheets4" "ggplot2" "haven" "hms" ## [13] "httr" "jsonlite" "lubridate" "magrittr" ## [17] "modelr" "pillar" "purrr" "readr" ## [21] "readxl" "reprex" "rlang" "rstudioapi" ## [25] "rvest" "stringr" "tibble" "tidyr" ## [29] "xml2" "tidyverse"
All of these are super useful
This lecture will focus on two that are automatically loaded: dplyr and tidyr.
Pipes take the output of one function and feed it into the first argument of the next (which you then skip).
dataframe |> filter(condition)
is equivalent to filter(dataframe, condition)
.
Note: |>
on these slides is generated by the two characters | >
, without the space.
Pipes take the output of one function and feed it into the first argument of the next (which you then skip).
dataframe |> filter(condition)
is equivalent to filter(dataframe, condition)
.
Note: |>
on these slides is generated by the two characters | >
, without the space.
Older version of the pipe: %>%
magrittr
package loaded with the tidyverse|>
in most situations.
Keyboard shortcut: Ctl/Cmd + Shift + M
|>
the defaultPipes can dramatically improve the experience of reading and writing code. Compare:
## These next two lines of code do exactly the same thing.mpg |> filter(manufacturer=="audi") |> group_by(model) |> summarize(hwy_mean = mean(hwy))summarize(group_by(filter(mpg, manufacturer=="audi"), model), hwy_mean = mean(hwy))
The first line reads from left to right, exactly how you think about the operations.
The second line totally inverts this logical order (the final operation comes first!)
Best practice is to put each function on its own line and indent. Look how much more readable this is:
mpg |> filter(manufacturer == "audi") |> group_by(model) |> summarize(hwy_mean = mean(hwy))
## # A tibble: 3 × 2## model hwy_mean## <chr> <dbl>## 1 a4 28.3## 2 a4 quattro 25.8## 3 a6 quattro 24
Vertical space costs nothing and makes for much more readable/writable code than cramming things horizontally.
All together, this multi-line line of code is called a pipeline.
There are five key dplyr verbs that you need to learn.
filter
: Filter (i.e. subset) rows based on their values.
arrange
: Arrange (i.e. reorder) rows based on their values.
select
: Select (i.e. subset) columns by their names:
mutate
: Create new columns.
summarize
: Collapse multiple rows into a single summary value.1
1 Many sources spell summarise
with an "s" -- either one works!
There are five key dplyr verbs that you need to learn.
filter
: Filter (i.e. subset) rows based on their values.
arrange
: Arrange (i.e. reorder) rows based on their values.
select
: Select (i.e. subset) columns by their names:
mutate
: Create new columns.
summarize
: Collapse multiple rows into a single summary value.1
1 Many sources spell summarise
with an "s" -- either one works!
Let's practice these functions together using the starwars
data frame that comes pre-packaged with dplyr.
We can chain multiple filter commands with the pipe (|>
), or just separate them within a single filter command using commas.
starwars |> filter( species == "Human", height >= 190 )
## # A tibble: 4 × 14## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> ## 1 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…## 2 Qui-Gon Jinn 193 89 brown fair blue 92 male mascu… <NA> ## 3 Dooku 193 80 white fair brown 102 male mascu… Serenno## 4 Bail Presto… 191 NA black tan brown 67 male mascu… Aldera…## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,## # starships <list>, and abbreviated variable names ¹hair_color, ²skin_color,## # ³eye_color, ⁴birth_year, ⁵homeworld## # ℹ Use colnames() to see all variable names
Regular expressions work well too.
starwars |> filter(str_detect(name, "Skywalker"))
## # A tibble: 3 × 14## name height mass hair_color skin_color eye_color birth_year sex gender## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 Luke Sky… 172 77 blond fair blue 19 male mascu…## 2 Anakin S… 188 84 blond fair blue 41.9 male mascu…## 3 Shmi Sky… 163 NA black fair brown 72 fema… femin…## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,## # vehicles <list>, starships <list>
A very common filter
use case is identifying (or removing) missing data cases.
starwars |> filter(is.na(height))
## # A tibble: 6 × 14## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> ## 1 Arvel Crynyd NA NA brown fair brown NA male mascu… <NA> ## 2 Finn NA NA black dark dark NA male mascu… <NA> ## 3 Rey NA NA brown light hazel NA fema… femin… <NA> ## 4 Poe Dameron NA NA brown light brown NA male mascu… <NA> ## 5 BB8 NA NA none none black NA none mascu… <NA> ## 6 Captain Pha… NA NA unknown unknown unknown NA <NA> <NA> <NA> ## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,## # starships <list>, and abbreviated variable names ¹hair_color, ²skin_color,## # ³eye_color, ⁴birth_year, ⁵homeworld## # ℹ Use colnames() to see all variable names
A very common filter
use case is identifying (or removing) missing data cases.
starwars |> filter(is.na(height))
## # A tibble: 6 × 14## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> ## 1 Arvel Crynyd NA NA brown fair brown NA male mascu… <NA> ## 2 Finn NA NA black dark dark NA male mascu… <NA> ## 3 Rey NA NA brown light hazel NA fema… femin… <NA> ## 4 Poe Dameron NA NA brown light brown NA male mascu… <NA> ## 5 BB8 NA NA none none black NA none mascu… <NA> ## 6 Captain Pha… NA NA unknown unknown unknown NA <NA> <NA> <NA> ## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,## # starships <list>, and abbreviated variable names ¹hair_color, ²skin_color,## # ³eye_color, ⁴birth_year, ⁵homeworld## # ℹ Use colnames() to see all variable names
To remove missing observations, simply use negation: filter(!is.na(height))
.
Try this yourself.
arrange
sorts your data frame by a particular column (numerically, or alphabetically)
starwars |> arrange(birth_year)
## # A tibble: 87 × 14## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> ## 1 Wicket Sys… 88 20 brown brown brown 8 male mascu… Endor ## 2 IG-88 200 140 none metal red 15 none mascu… <NA> ## 3 Luke Skywa… 172 77 blond fair blue 19 male mascu… Tatooi…## 4 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…## 5 Wedge Anti… 170 77 brown fair hazel 21 male mascu… Corell…## 6 Plo Koon 188 80 none orange black 22 male mascu… Dorin ## 7 Biggs Dark… 183 84 black light brown 24 male mascu… Tatooi…## 8 Han Solo 180 80 brown fair brown 29 male mascu… Corell…## 9 Lando Calr… 177 79 black dark brown 31 male mascu… Socorro## 10 Boba Fett 183 78.2 black fair brown 31.5 male mascu… Kamino ## # … with 77 more rows, 4 more variables: species <chr>, films <list>,## # vehicles <list>, starships <list>, and abbreviated variable names## # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld## # ℹ Use print(n = ...) to see more rows, and colnames() to see all variable names
We can also arrange items in descending order using arrange(desc())
.
starwars |> arrange(desc(birth_year))
## # A tibble: 87 × 14## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> ## 1 Yoda 66 17 white green brown 896 male mascu… <NA> ## 2 Jabba Desi… 175 1358 <NA> green-… orange 600 herm… mascu… Nal Hu…## 3 Chewbacca 228 112 brown unknown blue 200 male mascu… Kashyy…## 4 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…## 5 Dooku 193 80 white fair brown 102 male mascu… Serenno## 6 Qui-Gon Ji… 193 89 brown fair blue 92 male mascu… <NA> ## 7 Ki-Adi-Mun… 198 82 white pale yellow 92 male mascu… Cerea ## 8 Finis Valo… 170 NA blond fair blue 91 male mascu… Corusc…## 9 Palpatine 170 75 grey pale yellow 82 male mascu… Naboo ## 10 Cliegg Lars 183 NA brown fair blue 82 male mascu… Tatooi…## # … with 77 more rows, 4 more variables: species <chr>, films <list>,## # vehicles <list>, starships <list>, and abbreviated variable names## # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld## # ℹ Use print(n = ...) to see more rows, and colnames() to see all variable names
Use commas to select multiple columns out of a data frame. (You can also use "first:last" for consecutive columns). Deselect a column with "-".
starwars |> select(name:skin_color, species, -height)
## # A tibble: 87 × 5## name mass hair_color skin_color species## <chr> <dbl> <chr> <chr> <chr> ## 1 Luke Skywalker 77 blond fair Human ## 2 C-3PO 75 <NA> gold Droid ## 3 R2-D2 32 <NA> white, blue Droid ## 4 Darth Vader 136 none white Human ## 5 Leia Organa 49 brown light Human ## 6 Owen Lars 120 brown, grey light Human ## 7 Beru Whitesun lars 75 brown light Human ## 8 R5-D4 32 <NA> white, red Droid ## 9 Biggs Darklighter 84 black light Human ## 10 Obi-Wan Kenobi 77 auburn, white fair Human ## # … with 77 more rows## # ℹ Use print(n = ...) to see more rows
You can also rename some (or all) of your selected variables in place.
starwars |> select(alias=name, planet=homeworld)
## # A tibble: 87 × 2## alias planet ## <chr> <chr> ## 1 Luke Skywalker Tatooine## 2 C-3PO Tatooine## 3 R2-D2 Naboo ## 4 Darth Vader Tatooine## 5 Leia Organa Alderaan## 6 Owen Lars Tatooine## 7 Beru Whitesun lars Tatooine## 8 R5-D4 Tatooine## 9 Biggs Darklighter Tatooine## 10 Obi-Wan Kenobi Stewjon ## # ℹ 77 more rows
You can also rename some (or all) of your selected variables in place.
starwars |> select(alias=name, planet=homeworld)
## # A tibble: 87 × 2## alias planet ## <chr> <chr> ## 1 Luke Skywalker Tatooine## 2 C-3PO Tatooine## 3 R2-D2 Naboo ## 4 Darth Vader Tatooine## 5 Leia Organa Alderaan## 6 Owen Lars Tatooine## 7 Beru Whitesun lars Tatooine## 8 R5-D4 Tatooine## 9 Biggs Darklighter Tatooine## 10 Obi-Wan Kenobi Stewjon ## # ℹ 77 more rows
If you just want to rename columns without subsetting them, you can use rename
. Try this!
The select(contains(PATTERN))
option provides a nice shortcut in relevant cases.
starwars |> select(name, contains("color"))
## # A tibble: 87 × 4## name hair_color skin_color eye_color## <chr> <chr> <chr> <chr> ## 1 Luke Skywalker blond fair blue ## 2 C-3PO <NA> gold yellow ## 3 R2-D2 <NA> white, blue red ## 4 Darth Vader none white yellow ## 5 Leia Organa brown light brown ## 6 Owen Lars brown, grey light blue ## 7 Beru Whitesun lars brown light blue ## 8 R5-D4 <NA> white, red red ## 9 Biggs Darklighter black light brown ## 10 Obi-Wan Kenobi auburn, white fair blue-gray## # … with 77 more rows## # ℹ Use print(n = ...) to see more rows
Some other selection helpers: starts_with()
, ends_with()
, all_of(c("name1", "name2"))
, matches()
.
You can create new columns from scratch, or (more commonly) as transformations of existing columns.
starwars |> select(name, birth_year) |> mutate(dog_years = birth_year * 7) |> mutate(comment = paste0(name, " is ", dog_years, " in dog years."))
## # A tibble: 87 × 4## name birth_year dog_years comment ## <chr> <dbl> <dbl> <chr> ## 1 Luke Skywalker 19 133 Luke Skywalker is 133 in dog years. ## 2 C-3PO 112 784 C-3PO is 784 in dog years. ## 3 R2-D2 33 231 R2-D2 is 231 in dog years. ## 4 Darth Vader 41.9 293. Darth Vader is 293.3 in dog years. ## 5 Leia Organa 19 133 Leia Organa is 133 in dog years. ## 6 Owen Lars 52 364 Owen Lars is 364 in dog years. ## 7 Beru Whitesun lars 47 329 Beru Whitesun lars is 329 in dog yea…## 8 R5-D4 NA NA R5-D4 is NA in dog years. ## 9 Biggs Darklighter 24 168 Biggs Darklighter is 168 in dog year…## 10 Obi-Wan Kenobi 57 399 Obi-Wan Kenobi is 399 in dog years. ## # … with 77 more rows## # ℹ Use print(n = ...) to see more rows
Note: mutate
is order aware. So you can chain multiple mutates in a single call.
starwars |> select(name, birth_year) |> mutate( dog_years = birth_year * 7, # Separate with a comma comment = paste0(name, " is ", dog_years, " in dog years.") )
## # A tibble: 87 × 4## name birth_year dog_years comment ## <chr> <dbl> <dbl> <chr> ## 1 Luke Skywalker 19 133 Luke Skywalker is 133 in dog years. ## 2 C-3PO 112 784 C-3PO is 784 in dog years. ## 3 R2-D2 33 231 R2-D2 is 231 in dog years. ## 4 Darth Vader 41.9 293. Darth Vader is 293.3 in dog years. ## 5 Leia Organa 19 133 Leia Organa is 133 in dog years. ## 6 Owen Lars 52 364 Owen Lars is 364 in dog years. ## 7 Beru Whitesun lars 47 329 Beru Whitesun lars is 329 in dog yea…## 8 R5-D4 NA NA R5-D4 is NA in dog years. ## 9 Biggs Darklighter 24 168 Biggs Darklighter is 168 in dog year…## 10 Obi-Wan Kenobi 57 399 Obi-Wan Kenobi is 399 in dog years. ## # … with 77 more rows## # ℹ Use print(n = ...) to see more rows
Boolean, logical and conditional operators all work well with mutate
too.
starwars |> select(name, height) |> filter(name %in% c("Luke Skywalker", "Anakin Skywalker")) |> mutate(tall1 = height > 180) |> mutate(tall2 = if_else(height > 180, "Tall", "Short"))
## # A tibble: 2 × 4## name height tall1 tall2## <chr> <int> <lgl> <chr>## 1 Luke Skywalker 172 FALSE Short## 2 Anakin Skywalker 188 TRUE Tall
Lastly, combining mutate
with across
allows you to easily perform the same operation on a subset of variables.
starwars |> select(name:eye_color) |> mutate(across(where(is.character), toupper))
## # A tibble: 87 × 6## name height mass hair_color skin_color eye_color## <chr> <int> <dbl> <chr> <chr> <chr> ## 1 LUKE SKYWALKER 172 77 BLOND FAIR BLUE ## 2 C-3PO 167 75 <NA> GOLD YELLOW ## 3 R2-D2 96 32 <NA> WHITE, BLUE RED ## 4 DARTH VADER 202 136 NONE WHITE YELLOW ## 5 LEIA ORGANA 150 49 BROWN LIGHT BROWN ## 6 OWEN LARS 178 120 BROWN, GREY LIGHT BLUE ## 7 BERU WHITESUN LARS 165 75 BROWN LIGHT BLUE ## 8 R5-D4 97 32 <NA> WHITE, RED RED ## 9 BIGGS DARKLIGHTER 183 84 BLACK LIGHT BROWN ## 10 OBI-WAN KENOBI 182 77 AUBURN, WHITE FAIR BLUE-GRAY## # … with 77 more rows## # ℹ Use print(n = ...) to see more rows
Particularly useful in combination with the group_by
command.
starwars |> group_by(species) |> summarize(mean_height = mean(height))
## # A tibble: 38 × 2## species mean_height## <chr> <dbl>## 1 Aleena 79 ## 2 Besalisk 198 ## 3 Cerean 198 ## 4 Chagrian 196 ## 5 Clawdite 168 ## 6 Droid NA ## 7 Dug 112 ## 8 Ewok 88 ## 9 Geonosian 183 ## 10 Gungan 209.## # … with 28 more rows## # ℹ Use print(n = ...) to see more rows
Notice that some of these summarized values are missing. If we want to ignore missing values, use na.rm = T
:
## Much betterstarwars |> group_by(species) |> summarize(mean_height = mean(height, na.rm = T))
## # A tibble: 38 × 2## species mean_height## <chr> <dbl>## 1 Aleena 79 ## 2 Besalisk 198 ## 3 Cerean 198 ## 4 Chagrian 196 ## 5 Clawdite 168 ## 6 Droid 131.## 7 Dug 112 ## 8 Ewok 88 ## 9 Geonosian 183 ## 10 Gungan 209.## # … with 28 more rows## # ℹ Use print(n = ...) to see more rows
The same across
-based workflow that we saw with mutate
a few slides back also works with summarize
.
starwars |> group_by(species) |> summarize(across(where(is.numeric), mean))
## # A tibble: 38 × 4## species height mass birth_year## <chr> <dbl> <dbl> <dbl>## 1 Aleena 79 15 NA## 2 Besalisk 198 102 NA## 3 Cerean 198 82 92## 4 Chagrian 196 NA NA## 5 Clawdite 168 55 NA## 6 Droid NA NA NA## 7 Dug 112 40 NA## 8 Ewok 88 20 8## 9 Geonosian 183 80 NA## 10 Gungan 209. NA NA## # ℹ 28 more rows
The same across
-based workflow that we saw with mutate
a few slides back also works with summarize
. Though to add arguments, we have to use an anonymous function:
starwars |> group_by(species) |> summarize(across(where(is.numeric), ~ mean(.x, na.rm=T)))
## # A tibble: 38 × 4## species height mass birth_year## <chr> <dbl> <dbl> <dbl>## 1 Aleena 79 15 NaN ## 2 Besalisk 198 102 NaN ## 3 Cerean 198 82 92 ## 4 Chagrian 196 NaN NaN ## 5 Clawdite 168 55 NaN ## 6 Droid 131. 69.8 53.3## 7 Dug 112 40 NaN ## 8 Ewok 88 20 8 ## 9 Geonosian 183 80 NaN ## 10 Gungan 209. 74 52 ## # ℹ 28 more rows
ungroup
: For ungrouping after using group_by
.
summarize
or mutate
operation, or everything else you do will be super slow.ungroup
: For ungrouping after using group_by
.
summarize
or mutate
operation, or everything else you do will be super slow.slice
: Subset rows by position rather than filtering by values.
starwars |> slice(1:10)
ungroup
: For ungrouping after using group_by
.
summarize
or mutate
operation, or everything else you do will be super slow.slice
: Subset rows by position rather than filtering by values.
starwars |> slice(1:10)
pull
: Extract a column from as a data frame as a vector or scalar.
starwars |> filter(sex=="female") |> pull(height)
ungroup
: For ungrouping after using group_by
.
summarize
or mutate
operation, or everything else you do will be super slow.slice
: Subset rows by position rather than filtering by values.
starwars |> slice(1:10)
pull
: Extract a column from as a data frame as a vector or scalar.
starwars |> filter(sex=="female") |> pull(height)
distinct
and count
: List unique values, with or without their number of appearances.
starwars |> distinct(species)
, or starwars |> count(species)
count
is equivalent to group_by
and summarize
with n()
:starwars |> group_by(species) |> summarize(n = n())
Take a few minutes to try this on your own:
List the most common eye colors among female Star Wars characters in descending order of frequency.
So far we haven't been saving the dataframes that result from our code in memory. Usually, we will want to use them for the next task.
Solution: Create a new object each time you write a pipeline.
women = starwars |> filter(sex == "female")brown_eyed_women = women |> filter(eye_color == "brown")
Resist the temptation to use the same object name. This is called clobbering and it ruins your ability to easily look back at a previous version of your object.
# DON'T do thisstarwars = starwars |> filter(sex == "female")
So far we haven't been saving the dataframes that result from our code in memory. Usually, we will want to use them for the next task.
Solution: Create a new object each time you write a pipeline.
women = starwars |> filter(sex == "female")brown_eyed_women = women |> filter(eye_color == "brown")
Resist the temptation to use the same object name. This is called clobbering and it ruins your ability to easily look back at a previous version of your object.
# DON'T do thisstarwars = starwars |> filter(sex == "female")
By keeping multiple copies of very similar dataframes, will you waste your computer's memory? Usually, no -- R is smart and stores only the changes between objects.
Most data analysis requires combining information from multiple data tables. To combine tables, we use joins (similar to base::merge()
).
Most data analysis requires combining information from multiple data tables. To combine tables, we use joins (similar to base::merge()
).
There are multiple types of joins.
An inner join returns a dataframe of all rows that appear in both x and y.
Image is from "R for Data Science" by Hadley Wickham & Garrett Grolemund, used under CC BY-NC-ND 3.0 and not included under this resource's overall CC license.
A left join keeps all rows in x.
A right join keeps all rows in y.
A full join keeps all rows in x and y.
Image is from "R for Data Science" by Wickham & Grolemund, used under CC BY-NC-ND 3.0 and not included under this resource's overall CC license.
Left joins are safer than inner joins
Use the left join unless you have a strong reason to prefer another.
Left joins are safer than inner joins
Use the left join unless you have a strong reason to prefer another.
For the examples that I'm going to show here, we'll need some data sets that come bundled with the nycflights13 package.
year
column means in each data frame.library(nycflights13)data(flights)data(planes)
Let's perform a left join on the flights and planes datasets.
flights |> left_join(planes)
## Joining, by = c("year", "tailnum")
## # A tibble: 336,776 × 10## year month day dep_time arr_time carrier flight tailnum type model## <int> <int> <int> <int> <int> <chr> <int> <chr> <chr> <chr>## 1 2013 1 1 517 830 UA 1545 N14228 <NA> <NA> ## 2 2013 1 1 533 850 UA 1714 N24211 <NA> <NA> ## 3 2013 1 1 542 923 AA 1141 N619AA <NA> <NA> ## 4 2013 1 1 544 1004 B6 725 N804JB <NA> <NA> ## 5 2013 1 1 554 812 DL 461 N668DN <NA> <NA> ## 6 2013 1 1 554 740 UA 1696 N39463 <NA> <NA> ## 7 2013 1 1 555 913 B6 507 N516JB <NA> <NA> ## 8 2013 1 1 557 709 EV 5708 N829AS <NA> <NA> ## 9 2013 1 1 557 838 B6 79 N593JB <NA> <NA> ## 10 2013 1 1 558 753 AA 301 N3ALAA <NA> <NA> ## # … with 336,766 more rows## # ℹ Use print(n = ...) to see more rows
(continued from previous slide)
Note that dplyr made a reasonable guess about which columns to join on (i.e. columns that share the same name). It also told us its choices:
## Joining with by = join_by(year, tailnum)
What's the problem here?
(continued from previous slide)
Note that dplyr made a reasonable guess about which columns to join on (i.e. columns that share the same name). It also told us its choices:
## Joining with by = join_by(year, tailnum)
What's the problem here?
The variable "year" does not have a consistent meaning across our joining datasets!
(continued from previous slide)
Note that dplyr made a reasonable guess about which columns to join on (i.e. columns that share the same name). It also told us its choices:
## Joining with by = join_by(year, tailnum)
What's the problem here?
The variable "year" does not have a consistent meaning across our joining datasets!
Luckily, there's an easy way to avoid this problem.
?dplyr::join
.(continued from previous slide)
You just need to be more explicit in your join call by using the by =
argument.
flights |> left_join( planes |> rename(year_built = year), by = "tailnum" )
## # A tibble: 3 × 11## year month day dep_time arr_time carrier flight tailnum year_…¹ type model## <int> <int> <int> <int> <int> <chr> <int> <chr> <int> <chr> <chr>## 1 2013 1 1 517 830 UA 1545 N14228 1999 Fixe… 737-…## 2 2013 1 1 533 850 UA 1714 N24211 1998 Fixe… 737-…## 3 2013 1 1 542 923 AA 1141 N619AA 1990 Fixe… 757-…## # … with abbreviated variable name ¹year_built
pivot_longer
: Pivot wide data into long format.
pivot_wider
: Pivot long data into wide format.
separate
: Separate (i.e. split) one column into multiple columns.
unite
: Unite (i.e. combine) multiple columns into one.
pivot_longer
: Pivot wide data into long format.
pivot_wider
: Pivot long data into wide format.
separate
: Separate (i.e. split) one column into multiple columns.
unite
: Unite (i.e. combine) multiple columns into one.
Which of pivot_longer
vs pivot_wider
produces "tidy" data?
stocks = data.frame( ## Could use "tibble" instead of "data.frame" if you prefer time = as.Date('2009-01-01') + 0:1, X = rnorm(2, 10, 1), Y = rnorm(2, 10, 2), Z = rnorm(2, 10, 5) )stocks
## time X Y Z## 1 2009-01-01 9.353794 10.892745 9.648994## 2 2009-01-02 9.657906 7.384239 19.472134
tidy_stocks = stocks |> pivot_longer(cols=X:Z, names_to="stock", values_to="price")tidy_stocks
## # A tibble: 6 × 3## time stock price## <date> <chr> <dbl>## 1 2009-01-01 X 9.35## 2 2009-01-01 Y 10.9 ## 3 2009-01-01 Z 9.65## 4 2009-01-02 X 9.66## 5 2009-01-02 Y 7.38## 6 2009-01-02 Z 19.5
Now we can use pivot_wider to go back to the original dataframe:
tidy_stocks |> pivot_wider(names_from=stock, values_from=price)
## # A tibble: 2 × 4## time X Y Z## <date> <dbl> <dbl> <dbl>## 1 2009-01-01 9.35 10.9 9.65## 2 2009-01-02 9.66 7.38 19.5
Or, we can put it into a new ("transposed") format, in which the observations are stocks and the columns are dates:
tidy_stocks |> pivot_wider(names_from=time, values_from=price)
## # A tibble: 3 × 3## stock 2009-01-01 2009-01-02## <chr> <dbl> <dbl>## 1 X 9.35 9.66## 2 Y 10.9 7.38## 3 Z 9.65 19.5
separate
helps when you have more than one value in a single column:
economists = data.frame(name = c("Adam_Smith", "Paul_Samuelson", "Milton_Friedman"))economists
## name## 1 Adam_Smith## 2 Paul_Samuelson## 3 Milton_Friedman
economists |> separate(name, c("first_name", "last_name"))
## first_name last_name## 1 Adam Smith## 2 Paul Samuelson## 3 Milton Friedman
separate
helps when you have more than one value in a single column:
economists = data.frame(name = c("Adam_Smith", "Paul_Samuelson", "Milton_Friedman"))economists
## name## 1 Adam_Smith## 2 Paul_Samuelson## 3 Milton_Friedman
economists |> separate(name, c("first_name", "last_name"))
## first_name last_name## 1 Adam Smith## 2 Paul Samuelson## 3 Milton Friedman
This command is pretty smart. But to avoid ambiguity, you can also specify the separation character with the sep
argument:
economists |> separate(name, c("first_name", "last_name"), sep = "_")
Related is separate_rows
, for splitting cells with multiple values into multiple rows:
jobs = data.frame( name = c("Joe", "Jill"), occupation = c("President", "First Lady, Professor, Grandmother") ) jobs
## name occupation## 1 Joe President## 2 Jill First Lady, Professor, Grandmother
# Now split out Jill's various occupations into different rowsjobs |> separate_rows(occupation)
## # A tibble: 5 × 2## name occupation ## <chr> <chr> ## 1 Joe President ## 2 Jill First ## 3 Jill Lady ## 4 Jill Professor ## 5 Jill Grandmother
Related is separate_rows
, for splitting cells with multiple values into multiple rows:
jobs = data.frame( name = c("Joe", "Jill"), occupation = c("President", "First Lady, Professor, Grandmother") ) jobs
## name occupation## 1 Joe President## 2 Jill First Lady, Professor, Grandmother
# Now split out Jill's various occupations into different rowsjobs |> separate_rows(occupation, sep = ", ")
## # A tibble: 4 × 2## name occupation ## <chr> <chr> ## 1 Joe President ## 2 Jill First Lady ## 3 Jill Professor ## 4 Jill Grandmother
gdp = data.frame( yr = rep(2016, times = 4), mnth = rep(1, times = 4), dy = 1:4, gdp = rnorm(4, mean = 100, sd = 2) )gdp
## yr mnth dy gdp## 1 2016 1 1 98.44216## 2 2016 1 2 98.98793## 3 2016 1 3 99.91618## 4 2016 1 4 100.97192
## Combine "yr", "mnth", and "dy" into one "date" columngdp |> unite(date, c("yr", "mnth", "dy"), sep = "-")
## date gdp## 1 2016-1-1 98.44216## 2 2016-1-2 98.98793## 3 2016-1-3 99.91618## 4 2016-1-4 100.97192
Note that unite
will automatically create a character variable.
If you want to convert it to something else (e.g. date or numeric) then you will need to modify it using mutate
. This example uses the lubridate package's super helpful date conversion functions.
library(lubridate)gdp_u |> mutate(date = ymd(date))
## # A tibble: 4 × 2## date gdp## <date> <dbl>## 1 2016-01-01 98.4## 2 2016-01-02 99.0## 3 2016-01-03 99.9## 4 2016-01-04 101.
Using nycflights13
, create a table of average arrival delay (in minutes) by day (in rows) and carrier (in columns).
Using nycflights13
, create a table of average arrival delay (in minutes) by day (in rows) and carrier (in columns).
Hint: Recall that you can tabulate summary statistics using group_by
and summarize
:
flights |> group_by(carrier) |> summarize(avg_late = mean(arr_delay, na.rm=T))
## # A tibble: 16 × 2## carrier avg_late## <chr> <dbl>## 1 9E 7.38 ## 2 AA 0.364## 3 AS -9.93 ## 4 B6 9.46 ## 5 DL 1.64 ## 6 EV 15.8 ## 7 F9 21.9 ## 8 FL 20.1 ## 9 HA -6.92 ## 10 MQ 10.8 ## 11 OO 11.9 ## 12 UA 3.56 ## 13 US 2.13 ## 14 VX 1.76 ## 15 WN 9.65 ## 16 YV 15.6
Most data does not come nicely in R packages! You will download it from somewhere and load it in R.
Go here: https://github.com/nytimes/covid-19-data/tree/master/colleges
and click on the Raw CSV link.
In your browser, save this file somewhere sensible on your computer (perhaps a folder you already created to put your work for this class).
Your computer stores all its programs, files, and information in a filesystem consisting of a set of nested folders or directories.
The working directory is your "current location" in the filesystem. What's your current working directory?
getwd()
## [1] "C:/git/covid-example"
C:/
on Windows or /
on Mac.Relative paths are defined relative to the full path of the working directory.
"C:/git/ecns-560/"
"C:/git/ecns-560/assignment-3/assignment-3.Rmd"
"assignment-3/assignment-3.Rmd"
It's always best to use relative paths, though you can also use an absolute path in any given situation.
You can change your working directory with setwd(dir)
. Now:
Find the location you saved your CSV file and copy the filepath.
Use the console in R to set your working directory to that location.
For example:
setwd("C:/git/covid-example")
Do not use setwd()
in scripts. Setting the working directory is one of the few exceptions to the rule that you should always write all your code in scripts.
The tidyverse package readr
provides lots of options to read data into R. Read in the college Covid data like this:
dat = read_csv("colleges.csv")
View
this data to take a look at it. It automatically formatted everything as a nice and neat data frame!
str(dat)
## spc_tbl_ [1,948 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)## $ date : Date[1:1948], format: "2021-05-26" "2021-05-26" ...## $ state : chr [1:1948] "Alabama" "Alabama" "Alabama" "Alabama" ...## $ county : chr [1:1948] "Madison" "Montgomery" "Limestone" "Lee" ...## $ city : chr [1:1948] "Huntsville" "Montgomery" "Athens" "Auburn" ...## $ ipeds_id : chr [1:1948] "100654" "100724" "100812" "100858" ...## $ college : chr [1:1948] "Alabama A&M University" "Alabama State University" "Athens State University" "Auburn University" ...## $ cases : num [1:1948] 41 2 45 2742 220 ...## $ cases_2021: num [1:1948] NA NA 10 567 80 NA 49 53 10 35 ...## $ notes : chr [1:1948] NA NA NA NA ...## - attr(*, "spec")=## .. cols(## .. date = col_date(format = ""),## .. state = col_character(),## .. county = col_character(),## .. city = col_character(),## .. ipeds_id = col_character(),## .. college = col_character(),## .. cases = col_double(),## .. cases_2021 = col_double(),## .. notes = col_character()## .. )## - attr(*, "problems")=<externalptr>
Which state had the least total reported Covid-19 cases at colleges and universities?
Is it Montana?
Many things can go wrong with CSV files. When this happens, read_csv
has many options -- look at the help file.
You can read files directly from the internet without downloading them first (but you usually shouldn't!):
url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/colleges/colleges.csv"dat = read_csv(url)
write_csv
outputs a dataframe as a CSV to your computer.
readxl::read_excel
reads directly from Excel spreadsheets.
googlesheets4::read_sheet
reads directly from Google Sheets.
haven::read_dta
reads directly from Stata files! (Stata can't read R files....)
filter
arrange
select
mutate
summarize
pivot_longer
pivot_wider
separate
unite
|>
)group_by
)left_join
)getwd
& setwd
read_csv
Where we've been:
R Basics
Programming
Productivity tools
Where we're going:
Data wrangling in the tidyverse
Data cleaning
Introduce the term project
Data acquisition & webscraping
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |