Overview

This R Markdown file contains examples of how to use various functions from the dplyr package. I use the flights data set from the nycflights13 package and the built-in mtcars data set to demonstrate the functions.

Set up

Load the dplyr and nycflights13 packages.

library(dplyr)
library(nycflights13)

Summary of the flights data set.

summary(flights)
##       year          month             day           dep_time    sched_dep_time
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
##                                                  NA's   :8255                 
##    dep_delay          arr_time    sched_arr_time   arr_delay       
##  Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
##  Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
##  Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
##  Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
##  NA's   :8255      NA's   :8713                  NA's   :9430      
##    carrier              flight       tailnum             origin         
##  Length:336776      Min.   :   1   Length:336776      Length:336776     
##  Class :character   1st Qu.: 553   Class :character   Class :character  
##  Mode  :character   Median :1496   Mode  :character   Mode  :character  
##                     Mean   :1972                                        
##                     3rd Qu.:3465                                        
##                     Max.   :8500                                        
##                                                                         
##      dest              air_time        distance         hour      
##  Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
##  Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
##  Mode  :character   Median :129.0   Median : 872   Median :13.00  
##                     Mean   :150.7   Mean   :1040   Mean   :13.18  
##                     3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
##                     Max.   :695.0   Max.   :4983   Max.   :23.00  
##                     NA's   :9430                                  
##      minute        time_hour                  
##  Min.   : 0.00   Min.   :2013-01-01 05:00:00  
##  1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00  
##  Median :29.00   Median :2013-07-03 10:00:00  
##  Mean   :26.23   Mean   :2013-07-03 05:22:54  
##  3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00  
##  Max.   :59.00   Max.   :2013-12-31 23:00:00  
## 

Summary of the mtcars data set.

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Examples

The following section gives examples of how to use each function, with a brief explanation of the function and the inputs it requires.

filter()

filter: select a subset of rows in a data frame. Usage: filter(data, condition_1, condition_2, ...). Conditions separated by commas are combined with AND.

Example 1: select flights where month = 11, day = 3, and carrier name is ‘AA’

fl <- flights %>%
  filter(month == 11, day == 3, carrier == 'AA')
head(fl)

Example 2: select cars where mpg > 20 and cyl = 6

mtcars %>%
  filter(mpg > 20, cyl == 6)

Example 3: use %in% to select multiple matches for an argument

flights %>%
  filter(month %in% c(1,2,5), carrier == 'AA') %>% 
  distinct(month)
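The examples above all combine conditions with commas. As a small sketch of how that relates to the logical operators, the block below uses only mtcars; the variable names and the thresholds in the second part are chosen here for illustration.

```r
library(dplyr)

# Comma-separated conditions behave like a logical AND.
and_comma <- mtcars %>% filter(mpg > 20, cyl == 6)
and_amp   <- mtcars %>% filter(mpg > 20 & cyl == 6)
nrow(and_comma) == nrow(and_amp)

# Use | when a row should be kept if either condition holds.
either <- mtcars %>% filter(mpg > 30 | hp > 300)
either
```

The comma form is usually easier to read when every condition must hold; | is needed as soon as the conditions are alternatives.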

slice()

slice: select rows by position. Usage: slice(data, start_row_# : end_row_#)

Example 1: select the first 10 rows from flights

flights %>% slice(1:10)
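Beyond a simple range, slice() also accepts negative positions and the helper n(). The sketch below illustrates this on mtcars; the variable names are hypothetical.

```r
library(dplyr)

# First three rows, selected by position.
first_three <- mtcars %>% slice(1:3)

# n() is the number of rows, so this keeps only the last row.
last_row <- mtcars %>% slice(n())

# Negative positions drop rows instead of keeping them.
without_first <- mtcars %>% slice(-1)

nrow(first_three)
nrow(without_first)
```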

arrange()

arrange: reorder the rows. Usage: arrange(data, year, month, day, arr_time) orders by year, then by month, day, and arr_time. Use desc(variable_name) for descending order.

Example 1: arrange the data by cyl in ascending order

arr_1 <- mtcars %>% arrange(cyl)
head(arr_1)

Example 2: arrange the data first by year, and then by month, by day, lastly by arr_time (in descending order)

arr <- flights %>% arrange(year, month, day, desc(arr_time))
head(arr)
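The same mix of ascending and descending keys can be tried on mtcars, which is small enough to check by eye. This is a minimal sketch; the choice of cyl and mpg as sort keys is only for illustration.

```r
library(dplyr)

# Sort by cyl ascending; within each cyl value, sort by mpg descending.
sorted <- mtcars %>%
  arrange(cyl, desc(mpg)) %>%
  select(cyl, mpg)

head(sorted)
```

Later keys only break ties in earlier ones, so mpg is compared only among cars with the same cyl.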

select()

select: keep only the columns you want. Usage: select(data, column_names)

Example 1: select columns carrier, month, and day

sele <- flights %>% select(carrier, month, day)
head(sele)

Example 2: select columns mpg, hp

mtcars %>% select(mpg, hp)
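Columns do not have to be listed one by one. The sketch below shows three other ways of picking columns that select() supports: a range, a negative selection, and the starts_with() helper. The variable names are hypothetical.

```r
library(dplyr)

# A range of adjacent columns: mpg, cyl, disp.
by_range <- mtcars %>% select(mpg:disp)

# Everything except carb.
no_carb <- mtcars %>% select(-carb)

# Columns whose names start with "d": disp and drat.
d_cols <- mtcars %>% select(starts_with("d"))

names(by_range)
names(d_cols)
```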

rename()

rename: rename columns. Usage: rename(data, new_column_name = old_column_name)

Example 1: change column name carrier to airline_carrier. Note: the order is new column name = old column name, not the other way around.

rename_fl <- flights %>% rename(airline_carrier = carrier) %>% 
  select(airline_carrier)
head(rename_fl)
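Several columns can be renamed in one call, and all other columns are kept unchanged. A minimal sketch on mtcars follows; the new names miles_per_gallon and horsepower are made up for this example.

```r
library(dplyr)

# Rename two columns at once; the pattern is always new_name = old_name.
renamed <- mtcars %>%
  rename(miles_per_gallon = mpg, horsepower = hp)

names(renamed)
```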

distinct()

distinct: return the unique values in a column. Usage: distinct(select(data, column_name)), or equivalently distinct(data, column_name)

Example 1: return all carrier names without repetition.

distinct(select(flights, carrier))
# or 
flights %>% distinct(carrier)
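distinct() also works on combinations of columns, and its .keep_all argument keeps the remaining columns of the first row found for each combination. The sketch below uses mtcars; the variable names are hypothetical.

```r
library(dplyr)

# Unique combinations of cyl and gear.
combos <- mtcars %>% distinct(cyl, gear)

# One row per cyl value, keeping all other columns of the first match.
first_per_cyl <- mtcars %>% distinct(cyl, .keep_all = TRUE)

combos
first_per_cyl
```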

mutate() & transmute()

mutate: add new columns that are functions of existing columns. Usage: mutate(data, new_column_name = column_name - another_column_name)

transmute: like mutate, but keeps only the newly created columns.

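Example 1: expected total travel time = scheduled arrival time - scheduled departure time. Note: sched_arr_time and sched_dep_time are stored as HHMM integers (for example, 1359 means 13:59), so this difference illustrates mutate but is not a duration in minutes.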

mutate_fl <- flights %>% 
  mutate(total_time = sched_arr_time - sched_dep_time) 
head(mutate_fl)

Same as above, but keeping only the newly created column total_time.

transmute_fl <- flights %>%
  transmute(total_time = sched_arr_time - sched_dep_time) 
head(transmute_fl)

Example 2: create a new column Performance by dividing hp by wt.

mutate_car <- mtcars %>%
  mutate(Performance = hp/wt) 
head(mutate_car)
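Because the scheduled times in flights are encoded as HHMM integers, subtracting them directly does not give minutes. The sketch below shows one possible way to convert them with mutate first. It uses a small made-up data frame (sched, with hypothetical values) rather than flights, and it ignores overnight flights and time zone differences.

```r
library(dplyr)

# Hypothetical schedule with HHMM-encoded times (515 means 05:15).
sched <- data.frame(
  sched_dep_time = c(515, 1359),
  sched_arr_time = c(819, 1505)
)

sched_min <- sched %>%
  mutate(
    # Hours are the part above 100, minutes the remainder.
    dep_min = (sched_dep_time %/% 100) * 60 + sched_dep_time %% 100,
    arr_min = (sched_arr_time %/% 100) * 60 + sched_arr_time %% 100,
    naive_diff = sched_arr_time - sched_dep_time,  # HHMM arithmetic, not minutes
    total_minutes = arr_min - dep_min              # duration in minutes
  )

sched_min
```

New columns can refer to columns created earlier in the same mutate call, which is why total_minutes can use dep_min and arr_min.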

summarise()

summarise: collapse a data frame into a single row by applying aggregate functions such as mean or sum. Usage: summarise(data, new_column_name = mean(column_name, na.rm = TRUE))

Example 1: calculate the mean air_time with summarise and mean.

Note: avg_air_time is the mean of air_time computed after dropping NA values (that is what na.rm = TRUE does).

flights %>% summarise(avg_air_time = mean(air_time, na.rm = TRUE))

Example 2: calculate the total air_time with summarise and sum

flights %>% summarise(total_air_time = sum(air_time, na.rm = TRUE))

Example 3: calculate the average mpg across the entire data set.

mtcars %>% summarise(avg_mpg = mean(mpg)) #the mean mpg

Example 4: calculate the standard deviation of hp for cars with 6 cylinders.

mtcars %>% filter(cyl == 6) %>% 
  summarise(std_hp = sd(hp))  # standard deviation of hp for cars with 6 cylinders
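To see why na.rm = TRUE matters for a column like air_time, the sketch below computes a mean with and without it on a tiny made-up data frame (toy, with hypothetical values). It also shows that one summarise call can return several summary columns.

```r
library(dplyr)

# Hypothetical data with one missing value.
toy <- data.frame(air_time = c(100, 200, NA))

toy_summary <- toy %>%
  summarise(
    mean_with_na    = mean(air_time),                # NA: the missing value propagates
    mean_without_na = mean(air_time, na.rm = TRUE),  # mean of the two known values
    n_missing       = sum(is.na(air_time))           # how many values were missing
  )

toy_summary
```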

sample_n() & sample_frac()

sample_n: select a fixed number of rows at random. Usage: sample_n(data, number_of_rows)

sample_frac: select a fraction of the rows at random. Usage: sample_frac(data, fraction_of_rows)

Example 1: randomly select 10 rows from the data frame

sample_n(flights, 10)

Example 2: randomly select 10% of the rows from the data frame

sample_frac(flights, 0.1)
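Random sampling returns different rows on every run unless the random seed is fixed. The sketch below, on mtcars, shows one way to make a sample reproducible with set.seed(); the seed value 42 is arbitrary. It also shows slice_sample(), which, assuming dplyr 1.0.0 or later is installed, is the newer function that covers both sample_n and sample_frac.

```r
library(dplyr)

# Fixing the seed before sampling makes the result repeatable.
set.seed(42)
a <- sample_n(mtcars, 5)
set.seed(42)
b <- sample_n(mtcars, 5)
identical(a, b)

# slice_sample() (assumes dplyr >= 1.0.0): n = for a count, prop = for a fraction.
five_rows <- slice_sample(mtcars, n = 5)
quarter   <- slice_sample(mtcars, prop = 0.25)
nrow(quarter)
```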

To be continued…