tidyverse
The tidyverse package itself includes multiple packages that are useful for different types of data analysis. For example, it includes ggplot2, dplyr, tidyr, readr, purrr, etc. See the tidyverse website to find out more about the package. Notice that whether you use p_load(tidyverse) or library(tidyverse), after running the line you will see something like the following on your console, indicating that it is also attaching several packages that the tidyverse package contains (ggplot2, dplyr, tidyr, …).
# ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──
# ✓ ggplot2 3.3.5 ✓ dplyr 1.0.7
# ✓ tidyr 1.1.4 ✓ stringr 1.4.0
# ✓ readr 2.1.0 ✓ forcats 0.5.1
# ✓ purrr 0.3.4
# ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
# x dplyr::filter() masks stats::filter()
# x dplyr::lag() masks stats::lag()
Now let’s create our own data set containing some information about characters in minions.
library(tidyverse)
minion_d <- tibble(
  # You can also add comments here.
  # Because this line starts with #, R does not run it.
  # This is useful because plain text would otherwise be run as code and throw an error.
  name = c("Bob", "Kevin", "Stuart", "Jerry", "Jorge"),
  # Below is weight information
  weight = c(12, 25, 18, 21, 35),
  # Below is height information
  height = c(80, 120, 94, 105, 95),
  # Below is number of eyes information
  num_eyes = c(2, 2, 1, 2, 2)
)
minion_d
## # A tibble: 5 × 4
## name weight height num_eyes
## <chr> <dbl> <dbl> <dbl>
## 1 Bob 12 80 2
## 2 Kevin 25 120 2
## 3 Stuart 18 94 1
## 4 Jerry 21 105 2
## 5 Jorge 35 95 2
%>%
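The pipe operator, %>%, is the most frequently used function in tidyverse. It passes the result on its left-hand side to the function on its right-hand side as the first argument. The operator comes from the magrittr package and becomes available when you load tidyverse, so make sure to run library(tidyverse) before using it.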
Example 1
## [1] 1
## [1] 1
## use pipe operator and you could also explicitly use `.` as a placeholder
## to indicate the output of the previous statement
pi %>% sin(.) %>% cos(.)
## [1] 1
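The first two results above appear without their code. A minimal sketch of what they plausibly correspond to, assuming the nested call and the pipe without an explicit placeholder (the variable names nested and piped are only illustrative):

```r
library(tidyverse)

# Nested call: read from the inside out (sin first, then cos)
nested <- cos(sin(pi))

# Piped call without a placeholder: read from left to right
piped <- pi %>% sin() %>% cos()

nested
piped
```

Both forms evaluate the same expression; the pipe only changes the order in which you read it.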
Example 2
x <- c(0.110, 0.333, 0.323, 0.944, 0.345, 0.042, 0.127, 0.729, 0.997)
# Below three lines return the same output.
round(exp(diff(log(x))), 1)
## [1] 3.0 1.0 2.9 0.4 0.1 3.0 5.7 1.4
## [1] 3.0 1.0 2.9 0.4 0.1 3.0 5.7 1.4
## [1] 3.0 1.0 2.9 0.4 0.1 3.0 5.7 1.4
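Only the nested version of the three equivalent lines is shown above. The two piped versions below are a sketch of what the other two lines could look like; they are an assumption based on the identical outputs, not the original code.

```r
library(tidyverse)

x <- c(0.110, 0.333, 0.323, 0.944, 0.345, 0.042, 0.127, 0.729, 0.997)

# Nested version, as shown in the text
nested <- round(exp(diff(log(x))), 1)

# Piped version: each step receives the previous result as its first argument
piped <- x %>% log() %>% diff() %>% exp() %>% round(1)

# Piped version with the explicit `.` placeholder
piped_dot <- x %>% log(.) %>% diff(.) %>% exp(.) %>% round(., 1)

nested
piped
piped_dot
```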
Example 3
## [1] "hello"
## [1] "hello"
# Consider creating the following function,
# which concatenates its two input arguments and prints the result.
print_fun <- function(banana, apple) {
  print(paste0(banana, " ", apple))
}
Depending on where you place the placeholder ., the function returns different output.
## [1] "world hello"
## [1] "hello world"
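The calls that produced the two outputs above are not shown. A sketch of calls that would produce them, assuming the string "hello" is piped into print_fun() with the placeholder in the second or the first position (world_first and hello_first are illustrative names):

```r
library(tidyverse)

# Same function as in the text
print_fun <- function(banana, apple) {
  print(paste0(banana, " ", apple))
}

# Placeholder in the second position: banana = "world", apple = "hello"
world_first <- "hello" %>% print_fun("world", .)

# Placeholder in the first position: banana = "hello", apple = "world"
hello_first <- "hello" %>% print_fun(., "world")
```

When . appears as a direct argument, the pipe puts the left-hand side there instead of in the first argument.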
A logical condition is an expression that returns either TRUE or FALSE depending on whether the stated condition is satisfied. For example, suppose you want to filter the tibble so that you only keep the information about Bob. When you run filter(name=="Bob"), under the hood it checks each observation in the tibble and keeps only those that satisfy the condition, that is, those for which the condition returns TRUE. For more details, run ?filter.
## Help on topic 'filter' was found in the following packages:
##
## Package Library
## stats /Library/Frameworks/R.framework/Versions/4.0/Resources/library
## dplyr /Library/Frameworks/R.framework/Versions/4.0/Resources/library
##
##
## Using the first match ...
## # A tibble: 1 × 4
## name weight height num_eyes
## <chr> <dbl> <dbl> <dbl>
## 1 Bob 12 80 2
## # A tibble: 1 × 4
## name weight height num_eyes
## <chr> <dbl> <dbl> <dbl>
## 1 Jorge 35 95 2
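The code for the two results above is missing. The sketch below shows the logical vector that filter() works with and two calls that return those rows. The condition used for Jorge (weight > 30) is an assumption; any condition that only Jorge satisfies would give the same output.

```r
library(tidyverse)

# Recreate minion_d from the text so this block runs on its own
minion_d <- tibble(
  name     = c("Bob", "Kevin", "Stuart", "Jerry", "Jorge"),
  weight   = c(12, 25, 18, 21, 35),
  height   = c(80, 120, 94, 105, 95),
  num_eyes = c(2, 2, 1, 2, 2)
)

# The condition is evaluated for every row and gives one TRUE/FALSE per row
is_bob <- minion_d$name == "Bob"
is_bob

# filter() keeps only the rows where the condition is TRUE
bob <- minion_d %>% filter(name == "Bob")
bob

# Assumed condition for the second output: only Jorge weighs more than 30
heavy <- minion_d %>% filter(weight > 30)
heavy
```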
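# Select observations whose height is over 50 AND weight is over 30
# Use the vectorised & here: && does not compare row by row
# (it only looks at a single value, and errors on longer vectors in recent R versions)
minion_d %>% filter(height > 50 & weight > 30)
## # A tibble: 1 × 4
## name weight height num_eyes
## <chr> <dbl> <dbl> <dbl>
## 1 Jorge 35 95 2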
# Select observations whose height is over 50 OR weight is over 30
minion_d %>% filter(height>50 | weight > 30)
## # A tibble: 5 × 4
## name weight height num_eyes
## <chr> <dbl> <dbl> <dbl>
## 1 Bob 12 80 2
## 2 Kevin 25 120 2
## 3 Stuart 18 94 1
## 4 Jerry 21 105 2
## 5 Jorge 35 95 2
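Because every minion is taller than 50, the OR example above returns all five rows. The sketch below uses a stricter, purely illustrative height threshold of 100 to make the effect of | visible, and also shows that separating conditions with a comma inside filter() behaves like the vectorised &.

```r
library(tidyverse)

# Recreate minion_d from the text so this block runs on its own
minion_d <- tibble(
  name     = c("Bob", "Kevin", "Stuart", "Jerry", "Jorge"),
  weight   = c(12, 25, 18, 21, 35),
  height   = c(80, 120, 94, 105, 95),
  num_eyes = c(2, 2, 1, 2, 2)
)

# AND: both conditions must hold for a row to be kept
both_amp <- minion_d %>% filter(height > 50 & weight > 30)

# A comma between conditions is also treated as AND
both_comma <- minion_d %>% filter(height > 50, weight > 30)

# OR with an illustrative threshold of 100: at least one condition must hold
either <- minion_d %>% filter(height > 100 | weight > 30)

both_amp
both_comma
either
```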
# For more details,
?select
# Select weight and num_eyes columns
minion_d %>% select(weight, num_eyes)
## # A tibble: 5 × 2
## weight num_eyes
## <dbl> <dbl>
## 1 12 2
## 2 25 2
## 3 18 1
## 4 21 2
## 5 35 2
For more details,
?mutate
# The code below creates a new variable called `fav_food` whose value is "banana" in every row.
minion_d %>% mutate(fav_food=rep("banana", 5))
## # A tibble: 5 × 5
## name weight height num_eyes fav_food
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Bob 12 80 2 banana
## 2 Kevin 25 120 2 banana
## 3 Stuart 18 94 1 banana
## 4 Jerry 21 105 2 banana
## 5 Jorge 35 95 2 banana
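mutate() can also build a new variable from existing columns. A small sketch, where the column name weight_per_height is hypothetical and chosen only for illustration:

```r
library(tidyverse)

# Recreate minion_d from the text so this block runs on its own
minion_d <- tibble(
  name     = c("Bob", "Kevin", "Stuart", "Jerry", "Jorge"),
  weight   = c(12, 25, 18, 21, 35),
  height   = c(80, 120, 94, 105, 95),
  num_eyes = c(2, 2, 1, 2, 2)
)

# The new column is computed row by row from weight and height
with_ratio <- minion_d %>% mutate(weight_per_height = weight / height)
with_ratio
```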
For more details, run ?arrange.
The code below sorts minion_d by num_eyes in ascending order. In other words, it orders the rows of the dataset from the minimum num_eyes to the maximum num_eyes.
## # A tibble: 5 × 4
## name weight height num_eyes
## <chr> <dbl> <dbl> <dbl>
## 1 Stuart 18 94 1
## 2 Bob 12 80 2
## 3 Kevin 25 120 2
## 4 Jerry 21 105 2
## 5 Jorge 35 95 2
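The call that produced the output above is not shown. A minimal sketch, assuming it was dplyr's arrange(), which sorts in ascending order by default:

```r
library(tidyverse)

# Recreate minion_d from the text so this block runs on its own
minion_d <- tibble(
  name     = c("Bob", "Kevin", "Stuart", "Jerry", "Jorge"),
  weight   = c(12, 25, 18, 21, 35),
  height   = c(80, 120, 94, 105, 95),
  num_eyes = c(2, 2, 1, 2, 2)
)

# Sort rows by num_eyes, smallest first; tied rows keep their original order
ascending <- minion_d %>% arrange(num_eyes)
ascending
```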
The code below sorts minion_d by num_eyes in descending order. In other words, it orders the rows of the dataset from the maximum num_eyes to the minimum num_eyes.
## # A tibble: 5 × 4
## name weight height num_eyes
## <chr> <dbl> <dbl> <dbl>
## 1 Bob 12 80 2
## 2 Kevin 25 120 2
## 3 Jerry 21 105 2
## 4 Jorge 35 95 2
## 5 Stuart 18 94 1
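For the descending sort, a sketch assuming arrange() combined with desc(), which reverses the sort order of a column:

```r
library(tidyverse)

# Recreate minion_d from the text so this block runs on its own
minion_d <- tibble(
  name     = c("Bob", "Kevin", "Stuart", "Jerry", "Jorge"),
  weight   = c(12, 25, 18, 21, 35),
  height   = c(80, 120, 94, 105, 95),
  num_eyes = c(2, 2, 1, 2, 2)
)

# desc() flips the order, so the largest num_eyes comes first
descending <- minion_d %>% arrange(desc(num_eyes))
descending
```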
group_by() is often used together with other functions, such as summarise(), count(), etc.
summarise(): Creates a new data frame that summarizes observations
count(): Counts the number of observations
nrow(): Counts the number of rows
For more details, run ?group_by.
The code below groups minion_d by num_eyes, then counts the number of observations in each group.
## # A tibble: 2 × 2
## # Groups: num_eyes [2]
## num_eyes n
## <dbl> <int>
## 1 1 1
## 2 2 4
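The code for this step is missing above. A sketch that matches the printed result (a grouped tibble with a column n), assuming group_by() followed by count():

```r
library(tidyverse)

# Recreate minion_d from the text so this block runs on its own
minion_d <- tibble(
  name     = c("Bob", "Kevin", "Stuart", "Jerry", "Jorge"),
  weight   = c(12, 25, 18, 21, 35),
  height   = c(80, 120, 94, 105, 95),
  num_eyes = c(2, 2, 1, 2, 2)
)

# Group by num_eyes, then count the rows in each group
eye_counts <- minion_d %>%
  group_by(num_eyes) %>%
  count()
eye_counts
```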
This is equivalent to 1) filtering the dataset on each value of num_eyes, and 2) then counting the number of rows in each subset. Since num_eyes in minion_d only takes the value 1 or 2, group_by(num_eyes) groups the data into minions with one eye and minions with two eyes, and the number of observations is then calculated for each group. In other words, this is equivalent to the following.
## # A tibble: 1 × 1
## n
## <int>
## 1 1
## # A tibble: 1 × 1
## n
## <int>
## 1 4
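The two single-row results above can be reproduced with a filter-then-count sketch like the following (assumed, since the original calls are not shown):

```r
library(tidyverse)

# Recreate minion_d from the text so this block runs on its own
minion_d <- tibble(
  name     = c("Bob", "Kevin", "Stuart", "Jerry", "Jorge"),
  weight   = c(12, 25, 18, 21, 35),
  height   = c(80, 120, 94, 105, 95),
  num_eyes = c(2, 2, 1, 2, 2)
)

# Keep only the one-eyed minions, then count the rows
one_eye <- minion_d %>% filter(num_eyes == 1) %>% count()
one_eye

# Keep only the two-eyed minions, then count the rows
two_eyes <- minion_d %>% filter(num_eyes == 2) %>% count()
two_eyes
```

Doing this by hand needs one filter per value, which is the work group_by() saves you.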
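Now you could use the summarize() function together with n() to return the same counts as minion_d %>% group_by(num_eyes) %>% count(). Note that nrow() ignores the grouping: minion_d %>% group_by(num_eyes) %>% nrow() simply returns 5, the total number of rows.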
## # A tibble: 2 × 2
## num_eyes `n()`
## <dbl> <int>
## 1 1 1
## 2 2 4
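A sketch of a call that gives the output above, assuming summarize() with n(), which returns the number of rows in the current group. Naming the summary (here n, an illustrative choice) avoids the awkward `n()` column name.

```r
library(tidyverse)

# Recreate minion_d from the text so this block runs on its own
minion_d <- tibble(
  name     = c("Bob", "Kevin", "Stuart", "Jerry", "Jorge"),
  weight   = c(12, 25, 18, 21, 35),
  height   = c(80, 120, 94, 105, 95),
  num_eyes = c(2, 2, 1, 2, 2)
)

# Unnamed summary: the column is called `n()`
unnamed <- minion_d %>%
  group_by(num_eyes) %>%
  summarize(n())
unnamed

# Named summary: same counts, with a cleaner column name
named <- minion_d %>%
  group_by(num_eyes) %>%
  summarize(n = n())
named
```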
The code below groups minion_d by num_eyes, then returns a summary of height, i.e., the mean, median, maximum and minimum height in each group.
minion_d %>%
  group_by(num_eyes) %>%
  summarize(mean_h = mean(height),
            median_h = median(height),
            max_h = max(height),
            min_h = min(height))
## # A tibble: 2 × 5
## num_eyes mean_h median_h max_h min_h
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 94 94 94 94
## 2 2 100 100 120 80
Please open up 03-Exercise.R and fill in your answer for each question.