Data Manipulation with dplyr

Masumbuko Semba &amp; Nyamisi Peter

Data Manipulation with dplyr

Masumbuko Semba & Nyamisi Peter

2024-05-09

Learning Agenda

Get familiar with R and Rstudio
Data structure and data types
Reading and writing data in Rstudio
Tidying with tidyverse
Plotting and Visualization
Descriptive Statistics
Data manipulation with tidyverse
Inferential Statistics
Modelling and simulation
Spatial Handling and Analysis

Data Manipulation with dplyr

Loading a package

We rely on tidyverse for data manipulation
tidyverse is an ecosystem of packages
Has dplyr package, which is dedicated for data manipulation
tidyverse does not load by default, we need to load it
It brings dplyr and other needed packages

Loading tidyverse

require(tidyverse)

Loading required package: tidyverse

Warning: package 'ggplot2' was built under R version 4.3.3

Warning: package 'tidyr' was built under R version 4.3.3

Warning: package 'readr' was built under R version 4.3.3

Warning: package 'purrr' was built under R version 4.3.1

Warning: package 'dplyr' was built under R version 4.3.2

Warning: package 'stringr' was built under R version 4.3.2

Warning: package 'lubridate' was built under R version 4.3.1

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Import the data

Import the chinook dataset into our session

lfq = readxl::read_excel("optimal/posts/data/LFQ_sample_1.xls")

Have a glimpse of the loaded datase

lfq |> 
  glimpse()

Rows: 779
Columns: 6
$ Date         <dttm> 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-30, 2019-05-…
$ Species      <chr> "Siganus sutor", "Siganus sutor", "Siganus sutor", "Sigan…
$ `Size (cm)`  <dbl> 16.5, 12.0, 15.0, 33.0, 11.2, 17.0, 19.0, 10.0, 11.0, 9.0…
$ `Size Class` <dbl> 18, 12, 15, 33, 12, 18, 21, 12, 12, 12, 12, 12, 12, 12, 1…
$ `Gear type`  <chr> "Speargun", "Speargun", "Speargun", "Speargun", "hook and…
$ Landing_site <chr> "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pwani", "Pw…

Tidy Data

Tidy data

Tidy data follows the following three rules:

Each variable has its own column.
Each observation has its own row.
Each value has its own cell.

Common Tidy data format

Example 1 create a dataframe of temperature from 2019 to 2023 for the five regions

Code

temperature = tibble(
  year = 2019:2023,
  dodoma = rnorm(n = 5, mean = 21, sd = .2),
  Dar = rnorm(n = 5, mean = 28, sd = 1.2),
  Mwanza = rnorm(n = 5, mean = 21, sd = .6),
  Tanga = rnorm(n = 5, mean = 26, sd = 1.8),
) |> 
  mutate(across(is.numeric, round, 2))

Wide format

year	dodoma	Dar	Mwanza	Tanga
2,019	21.26	27.92	21.74	25.32
2,020	20.66	29.33	20.94	25.65
2,021	20.45	29.48	21.36	28.02
2,022	20.82	27.20	21.17	23.66
2,023	20.84	29.53	20.06	24.31

Long format

year	regions	temperature_celcius
2,019	dodoma	21.26
2,019	Dar	27.92
2,019	Mwanza	21.74
2,019	Tanga	25.32
2,020	dodoma	20.66
2,020	Dar	29.33
2,020	Mwanza	20.94
2,020	Tanga	25.65
2,021	dodoma	20.45
2,021	Dar	29.48
2,021	Mwanza	21.36
2,021	Tanga	28.02
2,022	dodoma	20.82
2,022	Dar	27.20
2,022	Mwanza	21.17
2,022	Tanga	23.66
2,023	dodoma	20.84
2,023	Dar	29.53
2,023	Mwanza	20.06
2,023	Tanga	24.31

Basic Verbs

dplyr has set of functions for data manipulation
These functions are called verbs
Similar to verbs in English language,
dplyr is also called grammar of data manipulation

filter
slice
select
mutate
arrange
summarise
group_by

Basic Verbs

Each verb does one particular computation
Require combination of multiple verbs
Chain the process with pipe operator (%>%)

lfq |> 
  filter(Species == "Siganus sutor") |> 
  group_by(`Gear type`) |> 
  summarise(n = n()) |> 
  mutate(percentage = n/sum(n)*100) |> 
  mutate(label = paste0(round(percentage,2), "%")) |> 
  arrange(percentage) |> 
  flextable::flextable() |> 
  flextable::autofit()

Gear type	n	percentage	label
Speargun	13	6.633	6.63%
Gill net	23	11.735	11.73%
hook and line	31	15.816	15.82%
Traps	129	65.816	65.82%

Tidyverse verbs

rename

the rename() streamline renaming data frame columns.
Efficient for renaming multiple columns
The rename() simplifies data wrangling tasks
enhance code readability and efficiency

Code

lfq_clean = lfq |> 
  rename(
    date = Date,
    species_fish = Species,
    size_cm = `Size (cm)`,
    size_class = `Size Class`,
    gear_type = `Gear type`,
    landing_site = 6
    )

date	species_fish	size_cm	size_class	gear_type	landing_site
2019-07-11 00:00:00	Siganus sutor	15.5	18	Traps	Pwani
2020-06-15 00:00:00	Lethrinus mahsena	12.4	15	Traps	Bahari
2019-07-11 00:00:00	Calotomus carolinus	16.5	18	Beach seine	Pwani
2019-05-30 00:00:00	Lutjanus fulviflamma	15.8	18	hook and line	Pwani
2019-05-30 00:00:00	Lethrinus mahsena	11.5	12	hook and line	Pwani
2020-06-15 00:00:00	Acanthurus chronixis	12.2	15	Speargun	Bahari
2019-07-12 00:00:00	Lethrinus mahsena	11.0	12	Traps	Pwani
2019-07-11 00:00:00	Calotomus carolinus	16.0	18	Beach seine	Pwani
2019-05-30 00:00:00	Lethrinus mahsena	12.0	12	hook and line	Pwani
2020-04-08 00:00:00	Anampses spp.	17.5	18	Speargun	Bahari

filter

filter() used to subset data frames based on conditions.
Uses logical (comparison) operations (<, <=, >, >=, &, |, ==)
Only records that meet the condition (valid) are displayed
If the condition does not meet condition (invalid) throw an error

For single variable

lfq_clean |> 
  filter(
    species_fish == "Lethrinus mahsena"
    )

Chain for more than one variable

lfq_clean |> 
  filter(
    species_fish == "Lethrinus mahsena" & 
      size_cm > 15 & 
      landing_site == "Pwani"
    )

slice

The slice function allows to extract specific rows from a data frame.
It offers flexibility for selecting rows by position or logical conditions.
Select a row

lfq_clean |> 
  slice(1)

Select range of rows

lfq_clean |> 
  slice(40:50)

Select specific rows

lfq_clean |> 
  slice(c(5,25,400,605))

negate (-) Used to exclude specific row

lfq_clean |> 
  slice(-1)

negagte (-) to exclude specific rows

lfq_clean |> 
  slice(-c(5,25,400,605))

select

select() function used for selecting specific variables from a data frame.
It allows to create new data frames focused on relevant variables.
Include column names directly within select().
Use numeric vectors to select columns by position (base R style).
Combine positive and negative values to include/exclude specific columns.

Select columns

lfq_clean |> 
  select(species_fish, size_cm)

select and rename

lfq_clean |> 
  select(
    site = landing_site,
    species = species_fish, 
    size = size_cm
    )

negate (-) to deselect

lfq_clean |> 
  select(-species_fish, -size_cm)

mutate

The mutate() function in {dplyr} adds new columns.
It’s not destructive—all our existing data will still be there after you add new columns

By default, mutate() sticks the new column on the far right of the dataset
We can also control where the new column shows up with either the .before or .after argument:

lfq_clean |> 
  mutate(
    size_mm = size_cm *100, 
    .after = size_cm
    )

date	species_fish	size_cm	size_mm	size_class	gear_type	landing_site
2020-06-15 00:00:00	Lethrinus harak	13.7	1,370	15	Traps	Bahari
2019-05-30 00:00:00	Leptoscarus vaigiensis	13.3	1,330	15	Speargun	Pwani
2019-07-12 00:00:00	Lethrinus harak	10.5	1,050	12	Traps	Pwani
2019-05-30 00:00:00	Leptoscarus vaigiensis	12.8	1,280	15	Speargun	Pwani
2019-05-30 00:00:00	Lutjanus fulviflamma	15.5	1,550	18	hook and line	Pwani
2020-06-15 00:00:00	Lethrinus mahsena	12.5	1,250	15	Traps	Bahari
2019-07-11 00:00:00	Calotomus carolinus	15.0	1,500	15	Beach seine	Pwani
2019-07-12 00:00:00	Siganus sutor	17.5	1,750	18	Traps	Pwani
2019-07-11 00:00:00	Calotomus carolinus	16.5	1,650	18	Beach seine	Pwani
2019-07-11 00:00:00	Calotomus carolinus	17.0	1,700	18	Beach seine	Pwani

arrange

something

grouping

The group_by() function splits a dataset into smaller subsets based on the variable specified.

This splitting happens behind the scenes—you don’t actually ever see the data split up into smaller datasets.
To undo the grouping and bring all the rows back together, use ungroup().

lfq_clean |> 
  group_by(species_fish)

Note

group_by() when used alone doesn’t alter the order of the rows in the dataset. The only sign that the data is invisibly grouped is a little Groups: sex [2] note at the top of the output.

grouping

If we use mutate() after grouping, new columns are added to each subset separately.

In many cases, you won’t notice any difference between using mutate() on an ungrouped or grouped dataset—you’ll get the same values.

lfq_clean |> 
  group_by(species_fish) |> 
  mutate(
    min_size = min(size_cm),
    diff = size_cm - min_size
    
    )

date	species_fish	size_cm	size_class	gear_type	landing_site	min_size	diff
2020-04-08 00:00:00	Acanthurus chronixis	11.0	12	Speargun	Bahari	11.0	0.0
2020-04-08 00:00:00	Acanthurus chronixis	12.6	15	Speargun	Bahari	11.0	1.6
2020-04-08 00:00:00	Acanthurus chronixis	16.4	18	Speargun	Bahari	11.0	5.4
2019-07-12 00:00:00	Acanthurus chronixis	13.5	15	Speargun	Pwani	11.0	2.5
2020-04-08 00:00:00	Anampses spp.	13.0	15	Speargun	Bahari	12.5	0.5
2020-04-08 00:00:00	Anampses spp.	30.0	30	Speargun	Bahari	12.5	17.5
2020-04-08 00:00:00	Anampses spp.	18.5	21	Speargun	Bahari	12.5	6.0
2020-04-08 00:00:00	Anampses spp.	13.5	15	Speargun	Bahari	12.5	1.0
2019-07-11 00:00:00	Calotomus carolinus	17.0	18	Beach seine	Pwani	9.5	7.5
2019-07-11 00:00:00	Calotomus carolinus	17.0	18	Beach seine	Pwani	9.5	7.5
2019-07-11 00:00:00	Calotomus carolinus	16.0	18	Beach seine	Pwani	9.5	6.5
2019-07-11 00:00:00	Calotomus carolinus	17.0	18	Beach seine	Pwani	9.5	7.5
2020-06-15 00:00:00	Cheilinus chlorourus	13.0	15	Speargun	Bahari	13.0	0.0
2020-04-08 00:00:00	Cheilinus chlorourus	14.7	15	Speargun	Bahari	13.0	1.7
2020-06-15 00:00:00	Cheilinus chlorourus	13.5	15	Speargun	Bahari	13.0	0.5
2020-06-15 00:00:00	Cheilinus chlorourus	14.5	15	Speargun	Bahari	13.0	1.5
2020-06-15 00:00:00	Leptoscarus vaigiensis	21.0	21	Speargun	Bahari	11.7	9.3
2019-07-11 00:00:00	Leptoscarus vaigiensis	21.8	24	Speargun	Pwani	11.7	10.1
2020-06-15 00:00:00	Leptoscarus vaigiensis	14.0	15	Speargun	Bahari	11.7	2.3
2019-05-30 00:00:00	Leptoscarus vaigiensis	24.0	24	Speargun	Pwani	11.7	12.3
2020-06-15 00:00:00	Lethrinus harak	12.5	15	Traps	Bahari	6.0	6.5
2019-07-11 00:00:00	Lethrinus harak	16.0	18	Traps	Pwani	6.0	10.0
2020-04-08 00:00:00	Lethrinus harak	11.7	12	Traps	Bahari	6.0	5.7
2019-07-12 00:00:00	Lethrinus harak	6.0	6	Traps	Pwani	6.0	0.0
2019-05-30 00:00:00	Lethrinus mahsena	13.0	15	hook and line	Pwani	8.3	4.7
2019-05-30 00:00:00	Lethrinus mahsena	11.1	12	hook and line	Pwani	8.3	2.8
2019-05-30 00:00:00	Lethrinus mahsena	13.0	15	hook and line	Pwani	8.3	4.7
2019-05-30 00:00:00	Lethrinus mahsena	10.0	12	Traps	Pwani	8.3	1.7
2019-05-30 00:00:00	Lutjanus fulviflamma	16.9	18	hook and line	Pwani	12.0	4.9
2020-06-15 00:00:00	Lutjanus fulviflamma	18.5	21	Speargun	Bahari	12.0	6.5
2020-07-25 00:00:00	Lutjanus fulviflamma	13.0	15	Speargun	Bahari	12.0	1.0
2019-05-30 00:00:00	Lutjanus fulviflamma	17.0	18	hook and line	Pwani	12.0	5.0
2019-07-12 00:00:00	Siganus sutor	20.0	21	Traps	Pwani	9.0	11.0
2019-07-12 00:00:00	Siganus sutor	17.0	18	Traps	Pwani	9.0	8.0
2019-07-12 00:00:00	Siganus sutor	26.0	27	Gill net	Pwani	9.0	17.0
2019-07-12 00:00:00	Siganus sutor	26.0	27	Gill net	Pwani	9.0	17.0

summarize

The summarize() function, is destructive.
It collapses our dataset into a single value and throws away any columns that we don’t use when summarizing.

All other variables are dropped and only the computed ones remains

lfq_clean |> 
  summarise(
    count = n(),
    avg_size = mean(size_cm),
    standard_deviation = sd(size_cm)
  )

count	avg_size	standard_deviation
779	16.16	4.234

Summarizing groups and summarize

summarize() is better used with groups.
If we use summarize() on a grouped dataset, each subset is collapsed into a single row.
This will create different summary values, depending on the groups you use.
For instance we want to know the number and average size of species landed

lfq_clean |> 
  group_by(species_fish) |> 
  summarise(
    count = n(),
    avg_size = mean(size_cm)
  )

species_fish	count	avg_size
Acanthurus chronixis	21	13.65
Anampses spp.	17	17.48
Calotomus carolinus	176	16.38
Cheilinus chlorourus	4	13.93
Leptoscarus vaigiensis	146	18.10
Lethrinus harak	37	12.59
Lethrinus mahsena	135	11.91
Lutjanus fulviflamma	47	16.70
Siganus sutor	196	18.19

Summarizing multiple groups

We can specify more than one group with group_by()
It create summary for each unique combination of values in the groups.
For instance, if we want compute the number of observations and average size of fish by both specied of fish and site

lfq_clean |> 
  group_by(species_fish, landing_site) |> 
  summarise(
    count = n(),
    avg_size = mean(size_cm)
  )

species_fish	landing_site	count	avg_size
Acanthurus chronixis	Bahari	20	13.66
Acanthurus chronixis	Pwani	1	13.50
Anampses spp.	Bahari	12	17.25
Anampses spp.	Pwani	5	18.02
Calotomus carolinus	Bahari	13	16.95
Calotomus carolinus	Pwani	163	16.34
Cheilinus chlorourus	Bahari	4	13.93
Leptoscarus vaigiensis	Bahari	63	17.94
Leptoscarus vaigiensis	Pwani	83	18.22
Lethrinus harak	Bahari	24	12.04
Lethrinus harak	Pwani	13	13.59
Lethrinus mahsena	Bahari	36	12.29
Lethrinus mahsena	Pwani	99	11.78
Lutjanus fulviflamma	Bahari	10	16.76
Lutjanus fulviflamma	Pwani	37	16.68
Siganus sutor	Bahari	82	15.28
Siganus sutor	Pwani	114	20.28

Mutating Joins

Mutating Joins…..

A mutating join allows you to combine variables from two tables.
It first matches observations by their keys, then copies across variables from one table to the other.

Inner Join

All rows from x where there are matching values in y, and all columns from x and y.

inner_join(x, y, by = "id")

Left join

All rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns.

left_join(x, y, by = "id")

Right Join

All rows from y, and all columns from x and y.
Rows in y with no match in x will have NA values in the new columns.

right_join(x, y, by = "id")

Full join

All rows and all columns from both x and y.
Where there are not matching values, returns NA for the one missing.

full_join(x, y, by = "id")

Filtering Joins

Filtering joins match observations in the same way as mutating joins,
affect the observations, not the variables.
… Semi-joins are useful for matching filtered summary tables back to the original rows.
Anti-joins are useful for diagnosing join mismatches.

Semi Join

All rows from x where there are matching values in y, keeping just columns from x.

semi_join(x, y, by = "id")

Anti Join

All rows from x where there are not matching values in y, keeping just columns from x.

anti_join(x, y, by = "id")

Set operations

Set operations are occasionally useful when you want to break a single complex filter into simpler pieces.
All these operations work with a complete row, comparing the values of every variable.
These expect the x and y inputs to have the same variables, and treat the observations like sets.

Union

All rows from x and y, keeping duplicates.

Intersection

Common rows in both x and y, keeping just unique rows.

Set difference

All rows from x which are not also rows in y, keeping just unique rows.

Pivoting

pivot_longer() and pivot_wider() are used

Wide format

pivot_wider() is the opposite of pivot_longer()
it makes a dataset wider by increasing the number of columns and decreasing the number of rows.
t’s relatively rare to need pivot_wider() to make tidy data

Long Format

pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns.

Thank You for Attending

Acknowledgments{.scrollable}

I am grateful for the insightful comments offered by the anonymous peer reviewers at Books & Texts. The generosity and expertise of one and all have improved this study in innumerable ways and saved me from many errors; those that inevitably remain are entirely my own responsibility.