This R Markdown file contains examples of how to use various functions from the dplyr package. I used the flights data set from the nycflights13 package and the built-in mtcars data set to demonstrate the usage of the different functions.
Load the dplyr and nycflights13 packages.
library(dplyr)
library(nycflights13)
Summary of the flights data set.
summary(flights)
## year month day dep_time sched_dep_time
## Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1 Min. : 106
## 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907 1st Qu.: 906
## Median :2013 Median : 7.000 Median :16.00 Median :1401 Median :1359
## Mean :2013 Mean : 6.549 Mean :15.71 Mean :1349 Mean :1344
## 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744 3rd Qu.:1729
## Max. :2013 Max. :12.000 Max. :31.00 Max. :2400 Max. :2359
## NA's :8255
## dep_delay arr_time sched_arr_time arr_delay
## Min. : -43.00 Min. : 1 Min. : 1 Min. : -86.000
## 1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1124 1st Qu.: -17.000
## Median : -2.00 Median :1535 Median :1556 Median : -5.000
## Mean : 12.64 Mean :1502 Mean :1536 Mean : 6.895
## 3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1945 3rd Qu.: 14.000
## Max. :1301.00 Max. :2400 Max. :2359 Max. :1272.000
## NA's :8255 NA's :8713 NA's :9430
## carrier flight tailnum origin
## Length:336776 Min. : 1 Length:336776 Length:336776
## Class :character 1st Qu.: 553 Class :character Class :character
## Mode :character Median :1496 Mode :character Mode :character
## Mean :1972
## 3rd Qu.:3465
## Max. :8500
##
## dest air_time distance hour
## Length:336776 Min. : 20.0 Min. : 17 Min. : 1.00
## Class :character 1st Qu.: 82.0 1st Qu.: 502 1st Qu.: 9.00
## Mode :character Median :129.0 Median : 872 Median :13.00
## Mean :150.7 Mean :1040 Mean :13.18
## 3rd Qu.:192.0 3rd Qu.:1389 3rd Qu.:17.00
## Max. :695.0 Max. :4983 Max. :23.00
## NA's :9430
## minute time_hour
## Min. : 0.00 Min. :2013-01-01 05:00:00
## 1st Qu.: 8.00 1st Qu.:2013-04-04 13:00:00
## Median :29.00 Median :2013-07-03 10:00:00
## Mean :26.23 Mean :2013-07-03 05:22:54
## 3rd Qu.:44.00 3rd Qu.:2013-10-01 07:00:00
## Max. :59.00 Max. :2013-12-31 23:00:00
##
Summary of the mtcars data set.
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
In the following section, you can find examples of how to use each function, with a brief explanation of the function and the requirements for its inputs.
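filter: select a subset of rows in a data frame. Usage: filter(data, condition_1, condition_2, ...), with the conditions separated by commas. A row is kept only if it satisfies every condition.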
Example 1: select flights where month = 11, day = 3, and the carrier name is 'AA'
fl <- flights %>%
  filter(month == 11, day == 3, carrier == 'AA')
head(fl)
Example 2: select cars where mpg > 20 and cyl = 6
mtcars %>%
  filter(mpg > 20, cyl == 6)
Example 3: use %in% to match a column against several values at once
flights %>%
  filter(month %in% c(1, 2, 5), carrier == 'AA') %>%
  distinct(month)
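The filter examples above combine conditions with commas. The sketch below, which uses only mtcars plus a made-up data frame called toy (an assumption for illustration), shows that commas behave like &, how %in% relates to a chain of | comparisons, and that rows where a condition evaluates to NA are dropped.

```r
library(dplyr)

# Comma-separated conditions are combined with AND, so these two calls match.
by_comma <- mtcars %>% filter(mpg > 20, cyl == 6)
by_and   <- mtcars %>% filter(mpg > 20 & cyl == 6)
identical(by_comma, by_and)

# %in% is shorthand for several == comparisons joined with | (OR).
by_in <- mtcars %>% filter(cyl %in% c(4, 6))
by_or <- mtcars %>% filter(cyl == 4 | cyl == 6)
identical(by_in, by_or)

# filter() keeps only rows where the condition is TRUE, so NA rows are dropped.
# `toy` is a hypothetical data frame used only for this illustration.
toy <- data.frame(x = c(1, NA, 3))
kept <- toy %>% filter(x > 1)
kept
```

If rows with missing values should be kept, the condition can be widened explicitly, for example filter(is.na(x) | x > 1).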
slice: select rows by position. Usage: slice(data, start_row:end_row)
Example 1: select the first 10 rows from flights
flights %>% slice(1:10)
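Beyond a simple range, slice() also accepts n() and negative positions. The following is a small sketch on mtcars; the variable names are chosen here for illustration only.

```r
library(dplyr)

first_three <- mtcars %>% slice(1:3)     # rows 1 to 3
last_row    <- mtcars %>% slice(n())     # n() is the number of rows, so this is the last row
without_top <- mtcars %>% slice(-(1:5))  # negative positions drop those rows

nrow(first_three)
nrow(without_top)
```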
arrange: reorder the rows. Usage: arrange(data, year, month, day, arr_time) orders by year, then month, then day, then arr_time. Use desc(variable_name) if descending order is wanted.
Example 1: arrange the data by cyl in ascending order
arr_1 <- mtcars %>% arrange(cyl)
head(arr_1)
Example 2: arrange the data first by year, then by month, then by day, and lastly by arr_time (in descending order)
arr <- flights %>% arrange(year, month, day, desc(arr_time))
head(arr)
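The sketch below mixes ascending and descending keys on mtcars and also shows how arrange() treats missing values. The data frame toy is made up for the illustration.

```r
library(dplyr)

# Sort by cyl ascending, then by mpg descending within each cyl value.
sorted <- mtcars %>% arrange(cyl, desc(mpg))
head(sorted, 3)

# Missing values are placed at the end, even when the column is wrapped in desc().
# `toy` is a hypothetical data frame used only for this illustration.
toy <- data.frame(x = c(2, NA, 5))
toy_sorted <- toy %>% arrange(desc(x))
toy_sorted
```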
select: keep only the columns you want. Usage: select(data, column_names)
Example 1: select columns carrier, month, and day
sele <- flights %>% select(carrier, month, day)
head(sele)
Example 2: select columns mpg and hp
mtcars %>% select(mpg, hp)
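Columns do not have to be listed one by one. As a short sketch on mtcars, select() also accepts ranges, exclusions, and helper functions such as starts_with():

```r
library(dplyr)

by_range  <- mtcars %>% select(mpg:disp)         # adjacent columns from mpg to disp
no_carb   <- mtcars %>% select(-carb)            # everything except carb
d_columns <- mtcars %>% select(starts_with("d")) # columns whose names start with "d"

names(by_range)
names(d_columns)
```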
rename: rename columns. Usage: rename(data, new_column_name = old_column_name)
Example 1: change the column name carrier to airline_carrier. Note: the order is new column name = old column name, not the other way around.
rename_fl <- flights %>%
  rename(airline_carrier = carrier) %>%
  select(airline_carrier)
head(rename_fl)
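As a small sketch on mtcars, several columns can be renamed in one call, and select() accepts the same new = old syntax when only the renamed columns are needed. The new column names below are invented for the illustration.

```r
library(dplyr)

# rename() keeps every column and changes only the names listed.
renamed <- mtcars %>% rename(miles_per_gallon = mpg, horsepower = hp)

# select() with new = old renames and keeps only those columns.
picked <- mtcars %>% select(miles_per_gallon = mpg, horsepower = hp)

names(renamed)
names(picked)
```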
distinct: return the distinct (unique) values in a column. Usage: distinct(select(data, column_name))
Example 1: return all carrier names without repetition.
distinct(select(flights, carrier))
# or
flights %>% distinct(carrier)
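distinct() is not limited to a single column. The sketch below, on mtcars, returns the unique combinations of two columns and then uses the .keep_all argument to keep the first full row for each distinct value.

```r
library(dplyr)

# Unique combinations of cyl and gear.
combos <- mtcars %>% distinct(cyl, gear)

# One row per cyl value, keeping all other columns from the first match.
first_per_cyl <- mtcars %>% distinct(cyl, .keep_all = TRUE)

combos
first_per_cyl
```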
mutate: add new columns that are functions of existing columns. Usage: mutate(data, new_column_name = column_name - another_column_name)
transmute: similar to mutate, but keeps only the newly created columns.
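Example 1: expected total travel time = scheduled arrival time - scheduled departure time. Note: both columns store local clock times as HHMM integers, so the result is a difference of clock readings rather than a duration in minutes.
mutate_fl <- flights %>%
  mutate(total_time = sched_arr_time - sched_dep_time)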
head(mutate_fl)
Same as above, but keeping only the newly created column total_time
transmute_fl <- flights %>%
  transmute(total_time = sched_arr_time - sched_dep_time)
head(transmute_fl)
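Because sched_dep_time and sched_arr_time are stored as HHMM integers, subtracting them directly does not give minutes. The sketch below shows one possible conversion on a made-up two-row data frame. The helper hhmm_to_minutes and the data frame toy are hypothetical names introduced here, and the approach ignores flights that arrive after midnight or cross time zones.

```r
library(dplyr)

# Hypothetical helper: convert an HHMM integer (e.g. 1730) to minutes since midnight.
hhmm_to_minutes <- function(hhmm) (hhmm %/% 100) * 60 + hhmm %% 100

# `toy` is a made-up data frame that mimics the two scheduling columns.
toy <- data.frame(
  sched_dep_time = c(900L, 1730L),
  sched_arr_time = c(1015L, 1945L)
)

toy_times <- toy %>%
  mutate(
    raw_diff      = sched_arr_time - sched_dep_time,  # difference of clock readings
    total_minutes = hhmm_to_minutes(sched_arr_time) -
                    hhmm_to_minutes(sched_dep_time)   # difference in minutes
  )
toy_times
```

For the first row the raw difference is 115, while the actual scheduled duration is 75 minutes.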
Example 2: create a new column Performance by dividing hp by wt.
mutate_car <- mtcars %>%
  mutate(Performance = hp / wt)
head(mutate_car)
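A column created inside mutate() can be used by later expressions in the same call. As a minimal sketch on mtcars (the column names performance and above_average are chosen here for illustration):

```r
library(dplyr)

scored <- mtcars %>%
  mutate(
    performance   = hp / wt,                        # created first
    above_average = performance > mean(performance) # uses the column defined just above
  )

# transmute() evaluates the same expression but returns only the new column.
only_new <- mtcars %>% transmute(performance = hp / wt)

head(scored)
head(only_new)
```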
summarise: collapse a data frame into a single row by applying aggregating functions. Usage: summarise(data, new_column_name = mean(column_name, na.rm = TRUE))
Example 1: calculate the mean air_time with summarise and mean.
Note: avg_air_time is the average air_time computed after the NA values in that column have been dropped (that's what na.rm = TRUE does).
flights %>% summarise(avg_air_time = mean(air_time, na.rm = TRUE))
Example 2: calculate the total air_time with summarise and sum
flights %>% summarise(total_air_time = sum(air_time, na.rm = TRUE))
Example 3: calculate the average mpg across the entire data set.
mtcars %>% summarise(avg_mpg = mean(mpg)) #the mean mpg
Example 4: calculate the standard deviation of the hp values for cars with 6 cylinders.
mtcars %>% filter(cyl == 6) %>%
  summarise(std_hp = sd(hp)) # the standard deviation of hp for cars with 6 cylinders
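To make the effect of na.rm = TRUE concrete, and to show that one summarise() call can compute several statistics, here is a short sketch using a small numeric vector and mtcars. The vector values are invented for the illustration.

```r
library(dplyr)

with_na    <- mean(c(10, NA, 30))               # NA: a single missing value makes the mean NA
without_na <- mean(c(10, NA, 30), na.rm = TRUE) # 20: the NA is dropped before averaging

# Several summaries in one call; n() counts the rows.
stats <- mtcars %>%
  summarise(
    n_cars  = n(),
    avg_mpg = mean(mpg),
    sd_hp   = sd(hp),
    max_wt  = max(wt)
  )
stats
```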
sample_n: select a fixed number of rows at random. Usage: sample_n(data, number_of_rows)
sample_frac: select a fraction of the rows at random. Usage: sample_frac(data, fraction_of_rows)
Example 1: randomly select 10 rows from the data frame
sample_n(flights, 10)
Example 2: randomly select 10% of the rows from the data frame
sample_frac(flights, 0.1)
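Random samples change from run to run unless a seed is set. The sketch below fixes the seed with set.seed() and also uses slice_sample(), which supersedes sample_n() and sample_frac() from dplyr 1.0.0 onward. The sketch assumes that version or later is installed, and the seed value 42 is arbitrary.

```r
library(dplyr)

set.seed(42)  # arbitrary seed so the same rows are drawn on every run

ten_rows    <- sample_n(mtcars, 10)                     # 10 random rows
ten_rows_v2 <- mtcars %>% slice_sample(n = 10)          # newer equivalent
quarter     <- mtcars %>% slice_sample(prop = 0.25)     # 25% of the rows

nrow(ten_rows)
nrow(quarter)
```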
To be continued…