Data Manipulation with dplyr

Masumbuko Semba & Nyamisi Peter

2024-05-09

Learning Agenda

  1. Get familiar with R and Rstudio
  2. Data structure and data types
  3. Reading and writing data in Rstudio
  4. Tidying with tidyverse
  5. Plotting and Visualization
  6. Descriptive Statistics
  7. Data manipulation with tidyverse
  8. Inferential Statistics
  9. Modelling and simulation
  10. Spatial Handling and Analysis

Data Manipulation with dplyr

Loading a package

  • We rely on tidyverse for data manipulation
  • tidyverse is an ecosystem of packages
  • Has dplyr package, which is dedicated for data manipulation
  • tidyverse does not load by default, we need to load it
  • It brings dplyr and other needed packages

Loading tidyverse

require(tidyverse)
Loading required package: tidyverse
Warning: package 'ggplot2' was built under R version 4.3.3
Warning: package 'tidyr' was built under R version 4.3.3
Warning: package 'readr' was built under R version 4.3.3
Warning: package 'purrr' was built under R version 4.3.1
Warning: package 'dplyr' was built under R version 4.3.2
Warning: package 'stringr' was built under R version 4.3.2
Warning: package 'lubridate' was built under R version 4.3.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Import the data

  • Import the chinook dataset into our session
lfq = readxl::read_excel("optimal/posts/data/LFQ_sample_1.xls")
  • Have a glimpse of the loaded datase
lfq |> 
  glimpse()
Rows: 779
Columns: 6
$ Date         <dttm> 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-…
$ Species      <chr> "Siganus sutor", "Siganus sutor", "Siganus sutor", "Sigan…
$ `Size (cm)`  <dbl> 16.5, 12.0, 15.0, 33.0, 11.2, 17.0, 19.0, 10.0, 11.0, 9.0…
$ `Size Class` <dbl> 18, 12, 15, 33, 12, 18, 21, 12, 12, 12, 12, 12, 12, 12, 1…
$ `Gear type`  <chr> "Speargun", "Speargun", "Speargun", "Speargun", "hook and…
$ Landing_site <chr> "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pw…

Tidy Data

Tidy data

Tidy data follows the following three rules:

  1. Each variable has its own column.
  2. Each observation has its own row.
  3. Each value has its own cell.

Common Tidy data format

Example 1 create a dataframe of temperature from 2019 to 2023 for the five regions

Code
temperature = tibble(
  year = 2019:2023,
  dodoma = rnorm(n = 5, mean = 21, sd = .2),
  Dar = rnorm(n = 5, mean = 28, sd = 1.2),
  Mwanza = rnorm(n = 5, mean = 21, sd = .6),
  Tanga = rnorm(n = 5, mean = 26, sd = 1.8),
) |> 
  mutate(across(is.numeric, round, 2))

Wide format

year

dodoma

Dar

Mwanza

Tanga

2,019

21.26

27.92

21.74

25.32

2,020

20.66

29.33

20.94

25.65

2,021

20.45

29.48

21.36

28.02

2,022

20.82

27.20

21.17

23.66

2,023

20.84

29.53

20.06

24.31

Long format

year

regions

temperature_celcius

2,019

dodoma

21.26

2,019

Dar

27.92

2,019

Mwanza

21.74

2,019

Tanga

25.32

2,020

dodoma

20.66

2,020

Dar

29.33

2,020

Mwanza

20.94

2,020

Tanga

25.65

2,021

dodoma

20.45

2,021

Dar

29.48

2,021

Mwanza

21.36

2,021

Tanga

28.02

2,022

dodoma

20.82

2,022

Dar

27.20

2,022

Mwanza

21.17

2,022

Tanga

23.66

2,023

dodoma

20.84

2,023

Dar

29.53

2,023

Mwanza

20.06

2,023

Tanga

24.31

Basic Verbs

  • dplyr has set of functions for data manipulation
  • These functions are called verbs
  • Similar to verbs in English language,
  • dplyr is also called grammar of data manipulation
  • filter
  • slice
  • select
  • mutate
  • arrange
  • summarise
  • group_by

Basic Verbs

  • Each verb does one particular computation
  • Require combination of multiple verbs
  • Chain the process with pipe operator (%>%)
lfq |> 
  filter(Species == "Siganus sutor") |> 
  group_by(`Gear type`) |> 
  summarise(n = n()) |> 
  mutate(percentage = n/sum(n)*100) |> 
  mutate(label = paste0(round(percentage,2), "%")) |> 
  arrange(percentage) |> 
  flextable::flextable() |> 
  flextable::autofit()

Gear type

n

percentage

label

Speargun

13

6.633

6.63%

Gill net

23

11.735

11.73%

hook and line

31

15.816

15.82%

Traps

129

65.816

65.82%

Tidyverse verbs

rename

  • the rename() streamline renaming data frame columns.
  • Efficient for renaming multiple columns
  • The rename() simplifies data wrangling tasks
  • enhance code readability and efficiency
Code
lfq_clean = lfq |> 
  rename(
    date = Date,
    species_fish = Species,
    size_cm = `Size (cm)`,
    size_class = `Size Class`,
    gear_type = `Gear type`,
    landing_site = 6
    )

date

species_fish

size_cm

size_class

gear_type

landing_site

2019-07-11 00:00:00

Siganus sutor

15.5

18

Traps

Pwani

2020-06-15 00:00:00

Lethrinus mahsena

12.4

15

Traps

Bahari

2019-07-11 00:00:00

Calotomus carolinus

16.5

18

Beach seine

Pwani

2019-05-30 00:00:00

Lutjanus fulviflamma

15.8

18

hook and line

Pwani

2019-05-30 00:00:00

Lethrinus mahsena

11.5

12

hook and line

Pwani

2020-06-15 00:00:00

Acanthurus chronixis

12.2

15

Speargun

Bahari

2019-07-12 00:00:00

Lethrinus mahsena

11.0

12

Traps

Pwani

2019-07-11 00:00:00

Calotomus carolinus

16.0

18

Beach seine

Pwani

2019-05-30 00:00:00

Lethrinus mahsena

12.0

12

hook and line

Pwani

2020-04-08 00:00:00

Anampses spp.

17.5

18

Speargun

Bahari

filter

  • filter() used to subset data frames based on conditions.
  • Uses logical (comparison) operations (<, <=, >, >=, &, |, ==)
  • Only records that meet the condition (valid) are displayed
  • If the condition does not meet condition (invalid) throw an error
  • For single variable
lfq_clean |> 
  filter(
    species_fish == "Lethrinus mahsena"
    )
  • Chain for more than one variable
lfq_clean |> 
  filter(
    species_fish == "Lethrinus mahsena" & 
      size_cm > 15 & 
      landing_site == "Pwani"
    )

slice

  • The slice function allows to extract specific rows from a data frame.

  • It offers flexibility for selecting rows by position or logical conditions.

  • Select a row

lfq_clean |> 
  slice(1)
  • Select range of rows
lfq_clean |> 
  slice(40:50)
  • Select specific rows
lfq_clean |> 
  slice(c(5,25,400,605))
  • negate (-) Used to exclude specific row
lfq_clean |> 
  slice(-1)
  • negagte (-) to exclude specific rows
lfq_clean |> 
  slice(-c(5,25,400,605))

select

  • select() function used for selecting specific variables from a data frame.
  • It allows to create new data frames focused on relevant variables.
  • Include column names directly within select().
  • Use numeric vectors to select columns by position (base R style).
  • Combine positive and negative values to include/exclude specific columns.
  • Select columns
lfq_clean |> 
  select(species_fish, size_cm)
  • select and rename
lfq_clean |> 
  select(
    site = landing_site,
    species = species_fish, 
    size = size_cm
    )
  • negate (-) to deselect
lfq_clean |> 
  select(-species_fish, -size_cm)

mutate

  • The mutate() function in {dplyr} adds new columns.
  • It’s not destructive—all our existing data will still be there after you add new columns

  • By default, mutate() sticks the new column on the far right of the dataset
  • We can also control where the new column shows up with either the .before or .after argument:
lfq_clean |> 
  mutate(
    size_mm = size_cm *100, 
    .after = size_cm
    )

date

species_fish

size_cm

size_mm

size_class

gear_type

landing_site

2020-06-15 00:00:00

Lethrinus harak

13.7

1,370

15

Traps

Bahari

2019-05-30 00:00:00

Leptoscarus vaigiensis

13.3

1,330

15

Speargun

Pwani

2019-07-12 00:00:00

Lethrinus harak

10.5

1,050

12

Traps

Pwani

2019-05-30 00:00:00

Leptoscarus vaigiensis

12.8

1,280

15

Speargun

Pwani

2019-05-30 00:00:00

Lutjanus fulviflamma

15.5

1,550

18

hook and line

Pwani

2020-06-15 00:00:00

Lethrinus mahsena

12.5

1,250

15

Traps

Bahari

2019-07-11 00:00:00

Calotomus carolinus

15.0

1,500

15

Beach seine

Pwani

2019-07-12 00:00:00

Siganus sutor

17.5

1,750

18

Traps

Pwani

2019-07-11 00:00:00

Calotomus carolinus

16.5

1,650

18

Beach seine

Pwani

2019-07-11 00:00:00

Calotomus carolinus

17.0

1,700

18

Beach seine

Pwani

arrange

something

something

grouping

  • The group_by() function splits a dataset into smaller subsets based on the variable specified.

  • This splitting happens behind the scenes—you don’t actually ever see the data split up into smaller datasets.
  • To undo the grouping and bring all the rows back together, use ungroup().
lfq_clean |> 
  group_by(species_fish)

Note

group_by() when used alone doesn’t alter the order of the rows in the dataset. The only sign that the data is invisibly grouped is a little Groups: sex [2] note at the top of the output.

grouping

  • If we use mutate() after grouping, new columns are added to each subset separately.

  • In many cases, you won’t notice any difference between using mutate() on an ungrouped or grouped dataset—you’ll get the same values.
lfq_clean |> 
  group_by(species_fish) |> 
  mutate(
    min_size = min(size_cm),
    diff = size_cm - min_size
    
    )

date

species_fish

size_cm

size_class

gear_type

landing_site

min_size

diff

2020-04-08 00:00:00

Acanthurus chronixis

11.0

12

Speargun

Bahari

11.0

0.0

2020-04-08 00:00:00

Acanthurus chronixis

12.6

15

Speargun

Bahari

11.0

1.6

2020-04-08 00:00:00

Acanthurus chronixis

16.4

18

Speargun

Bahari

11.0

5.4

2019-07-12 00:00:00

Acanthurus chronixis

13.5

15

Speargun

Pwani

11.0

2.5

2020-04-08 00:00:00

Anampses spp.

13.0

15

Speargun

Bahari

12.5

0.5

2020-04-08 00:00:00

Anampses spp.

30.0

30

Speargun

Bahari

12.5

17.5

2020-04-08 00:00:00

Anampses spp.

18.5

21

Speargun

Bahari

12.5

6.0

2020-04-08 00:00:00

Anampses spp.

13.5

15

Speargun

Bahari

12.5

1.0

2019-07-11 00:00:00

Calotomus carolinus

17.0

18

Beach seine

Pwani

9.5

7.5

2019-07-11 00:00:00

Calotomus carolinus

17.0

18

Beach seine

Pwani

9.5

7.5

2019-07-11 00:00:00

Calotomus carolinus

16.0

18

Beach seine

Pwani

9.5

6.5

2019-07-11 00:00:00

Calotomus carolinus

17.0

18

Beach seine

Pwani

9.5

7.5

2020-06-15 00:00:00

Cheilinus chlorourus

13.0

15

Speargun

Bahari

13.0

0.0

2020-04-08 00:00:00

Cheilinus chlorourus

14.7

15

Speargun

Bahari

13.0

1.7

2020-06-15 00:00:00

Cheilinus chlorourus

13.5

15

Speargun

Bahari

13.0

0.5

2020-06-15 00:00:00

Cheilinus chlorourus

14.5

15

Speargun

Bahari

13.0

1.5

2020-06-15 00:00:00

Leptoscarus vaigiensis

21.0

21

Speargun

Bahari

11.7

9.3

2019-07-11 00:00:00

Leptoscarus vaigiensis

21.8

24

Speargun

Pwani

11.7

10.1

2020-06-15 00:00:00

Leptoscarus vaigiensis

14.0

15

Speargun

Bahari

11.7

2.3

2019-05-30 00:00:00

Leptoscarus vaigiensis

24.0

24

Speargun

Pwani

11.7

12.3

2020-06-15 00:00:00

Lethrinus harak

12.5

15

Traps

Bahari

6.0

6.5

2019-07-11 00:00:00

Lethrinus harak

16.0

18

Traps

Pwani

6.0

10.0

2020-04-08 00:00:00

Lethrinus harak

11.7

12

Traps

Bahari

6.0

5.7

2019-07-12 00:00:00

Lethrinus harak

6.0

6

Traps

Pwani

6.0

0.0

2019-05-30 00:00:00

Lethrinus mahsena

13.0

15

hook and line

Pwani

8.3

4.7

2019-05-30 00:00:00

Lethrinus mahsena

11.1

12

hook and line

Pwani

8.3

2.8

2019-05-30 00:00:00

Lethrinus mahsena

13.0

15

hook and line

Pwani

8.3

4.7

2019-05-30 00:00:00

Lethrinus mahsena

10.0

12

Traps

Pwani

8.3

1.7

2019-05-30 00:00:00

Lutjanus fulviflamma

16.9

18

hook and line

Pwani

12.0

4.9

2020-06-15 00:00:00

Lutjanus fulviflamma

18.5

21

Speargun

Bahari

12.0

6.5

2020-07-25 00:00:00

Lutjanus fulviflamma

13.0

15

Speargun

Bahari

12.0

1.0

2019-05-30 00:00:00

Lutjanus fulviflamma

17.0

18

hook and line

Pwani

12.0

5.0

2019-07-12 00:00:00

Siganus sutor

20.0

21

Traps

Pwani

9.0

11.0

2019-07-12 00:00:00

Siganus sutor

17.0

18

Traps

Pwani

9.0

8.0

2019-07-12 00:00:00

Siganus sutor

26.0

27

Gill net

Pwani

9.0

17.0

2019-07-12 00:00:00

Siganus sutor

26.0

27

Gill net

Pwani

9.0

17.0

summarize

  • The summarize() function, is destructive.
  • It collapses our dataset into a single value and throws away any columns that we don’t use when summarizing.

  • All other variables are dropped and only the computed ones remains
lfq_clean |> 
  summarise(
    count = n(),
    avg_size = mean(size_cm),
    standard_deviation = sd(size_cm)
  )

count

avg_size

standard_deviation

779

16.16

4.234

Summarizing groups and summarize

  • summarize() is better used with groups.
  • If we use summarize() on a grouped dataset, each subset is collapsed into a single row.
  • This will create different summary values, depending on the groups you use.
  • For instance we want to know the number and average size of species landed
lfq_clean |> 
  group_by(species_fish) |> 
  summarise(
    count = n(),
    avg_size = mean(size_cm)
  )

species_fish

count

avg_size

Acanthurus chronixis

21

13.65

Anampses spp.

17

17.48

Calotomus carolinus

176

16.38

Cheilinus chlorourus

4

13.93

Leptoscarus vaigiensis

146

18.10

Lethrinus harak

37

12.59

Lethrinus mahsena

135

11.91

Lutjanus fulviflamma

47

16.70

Siganus sutor

196

18.19

Summarizing multiple groups

  • We can specify more than one group with group_by()
  • It create summary for each unique combination of values in the groups.
  • For instance, if we want compute the number of observations and average size of fish by both specied of fish and site
lfq_clean |> 
  group_by(species_fish, landing_site) |> 
  summarise(
    count = n(),
    avg_size = mean(size_cm)
  )

species_fish

landing_site

count

avg_size

Acanthurus chronixis

Bahari

20

13.66

Acanthurus chronixis

Pwani

1

13.50

Anampses spp.

Bahari

12

17.25

Anampses spp.

Pwani

5

18.02

Calotomus carolinus

Bahari

13

16.95

Calotomus carolinus

Pwani

163

16.34

Cheilinus chlorourus

Bahari

4

13.93

Leptoscarus vaigiensis

Bahari

63

17.94

Leptoscarus vaigiensis

Pwani

83

18.22

Lethrinus harak

Bahari

24

12.04

Lethrinus harak

Pwani

13

13.59

Lethrinus mahsena

Bahari

36

12.29

Lethrinus mahsena

Pwani

99

11.78

Lutjanus fulviflamma

Bahari

10

16.76

Lutjanus fulviflamma

Pwani

37

16.68

Siganus sutor

Bahari

82

15.28

Siganus sutor

Pwani

114

20.28

Mutating Joins

Mutating Joins…..

  • A mutating join allows you to combine variables from two tables.
  • It first matches observations by their keys, then copies across variables from one table to the other.

Inner Join

  • All rows from x where there are matching values in y, and all columns from x and y.
inner_join(x, y, by = "id")

Left join

  • All rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns.
left_join(x, y, by = "id")

Right Join

  • All rows from y, and all columns from x and y.
  • Rows in y with no match in x will have NA values in the new columns.
right_join(x, y, by = "id")

Full join

  • All rows and all columns from both x and y.
  • Where there are not matching values, returns NA for the one missing.
full_join(x, y, by = "id")

Filtering Joins

Filtering Joins

  • Filtering joins match observations in the same way as mutating joins,
  • affect the observations, not the variables.
  • … Semi-joins are useful for matching filtered summary tables back to the original rows.
  • Anti-joins are useful for diagnosing join mismatches.

Semi Join

  • All rows from x where there are matching values in y, keeping just columns from x.
semi_join(x, y, by = "id")

Anti Join

  • All rows from x where there are not matching values in y, keeping just columns from x.
anti_join(x, y, by = "id")

Set operations

Set operations

  • Set operations are occasionally useful when you want to break a single complex filter into simpler pieces.
  • All these operations work with a complete row, comparing the values of every variable.
  • These expect the x and y inputs to have the same variables, and treat the observations like sets.

Union

All rows from x and y, keeping duplicates.

Intersection

  • Common rows in both x and y, keeping just unique rows.

Set difference

  • All rows from x which are not also rows in y, keeping just unique rows.

Pivoting

pivot_longer() and pivot_wider() are used

Wide format

  • pivot_wider() is the opposite of pivot_longer()
  • it makes a dataset wider by increasing the number of columns and decreasing the number of rows.
  • t’s relatively rare to need pivot_wider() to make tidy data

Long Format

  • pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns.

Thank You for Attending

Acknowledgments{.scrollable}

I am grateful for the insightful comments offered by the anonymous peer reviewers at Books & Texts. The generosity and expertise of one and all have improved this study in innumerable ways and saved me from many errors; those that inevitably remain are entirely my own responsibility.