Tidying and Data manipulation with tidyverse

Bending data with tidyverse to answer your questions

Masumbuko Semba

2024-03-07

The concept

  • The concept is based on five itmes—plan, simulate, acquire, explore and share

Learning Agenda

  1. Get familiar with R and Rstudio
  2. Data structure and data types
  3. Reading and writing data in Rstudio
  4. Tidying and Data manipulation with tidyverse
  5. Plotting and Visualization
  6. Descriptive Statistics
  7. Inferential Statistics
  8. Modelling and simulation
  9. Spatial Handling and Analysis

02:30

Introduction

Introduction to Data Manipulation

  • Manipulating vectors through reordering and subsetting
  • Advanced analyses require the use of tidy data
  • Data frames become the preferred data storage

Importance of Data Frames

  • Data frames facilitate the organization of data
  • R put more Emphasis on working with data frames
  • Data frame combine more than one vector
  • Each vector contain one data type
  • Leading to tidy data

01:30

Tidy data

What is Tidy Data?

  • Tidy data is a specific structure that simplifies data manipulation and analysis.
  • We ought to learn the best practices,
  • We also need to know the benefits of organizing our data in a tidy format

Tidy Data Characteristic

  • Definition of Tidy Data: Each row represents one observation, and columns represent variables for each observation
  • Rows represent single observations
  • Columns represent single Variables
  • Emphasis on clarity and structure in organizing data tables

Tidy data quotes

“Happy families are all alike; every unhappy family is unhappy in its own way.” —- Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” —- Hadley Wickham

Example of tidy data

Benefits of Tidy Data

  • Easier data cleaning and manipulation
  • Efficient analysis
  • Improved collaboration
  • Clearer communication

The Tidyverse

Tidyverse: Your Tidy Data Toolkit

  • Collection of R packages for tidy data
  • Streamlines data tasks
  • Powerful functions for data wrangling
## install first time

install.packages("tidyverse")

## Check if installed otherwise install

if (!require(tidyverse, character.only = TRUE)) {
  install.packages("tidyverse")
}

The Tidyverse …

require(tidyverse)
Loading required package: tidyverse
Warning: package 'ggplot2' was built under R version 4.3.1
Warning: package 'purrr' was built under R version 4.3.1
Warning: package 'dplyr' was built under R version 4.3.2
Warning: package 'stringr' was built under R version 4.3.2
Warning: package 'lubridate' was built under R version 4.3.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

02:30

Tidy

Tidy can mean neat and well-organized, clean and orderly in appearance, methodical and efficient, or sufficient and adequate depending on the context.

Tidy data

  • Most statistical datasets are data frames made up of rows and columns.
  • The columns represent variables
  • The rows represent records or observations
  • Each observation is among the common data types
  • Each variable store one type of data type

Tidy data ….

  • Reports indicate that 80% of data analysis time is spent on cleaning and preparing data.
  • Organizing datasets for easier analysis and plotting.
  • Data arrives with repeated observations in a single row.
  • Memory-saving or inherent structure makes ‘wide’ format appealing.
  • Convert between wide and long formats as needed (Semba and Peter 2020).

Tidy data ….

  • Structuring data frames is a significant challenge in statistical analysis, visualization, and modeling.
  • Tidy data implies organizing data consistently for easy analysis and visualization.
  • Data preparation in R involves thoughtful consideration of the desired data frame structure.
  • Clarifying the purpose of each row and column ensures consistent and clear data manipulation.

tidy data….

There are basically three principles that we can follow to make a tidy dataset.

  • Each variable must have its own a column,
  • Each observation must have its own row, and
  • Each cell must have its own value.

Tip

The tidyr package is designed to structure to make data frame tidy

tidy data….

Many datasets that you receive are untidy and will require some work on your end. There are several reasons why a dataset is messy.

  • Often times the people who created the dataset aren’t familiar with the principles of tidy data.
  • Most datasets are messy is that data is often organized to facilitate something other than analysis.
  • Data entry is perhaps the most common reason for organizing data in untidy format.

tidy data

We are going to look for the function that are regularly used to tidy the data frames. These inlude:

  • Gathering from wide format
  • Spreading from long format
  • Separating columns
  • Uniting columns

01:30

The Pipe operator: %>%

  • The pipe operator%>% allows to combine multiple operations in R into a single sequential chain of actions
  • The %>% operator, improve code readability and simplifying data manipulation and analysis workflows
data %>%
  filter(variable > 10) %>%
  group_by(category) %>%
  summarize(
    avg_value = mean(value)
    )
  • allows for a more natural and readable chaining of operations, making code easier to understand.

01:30

Wide format

Wide format organize data where each observation is represented by a single row, and each variable has its own column.

  • Suitable for small datasets with a limited number of variables.
  • Easy to read and interpret.
  • Commonly used for summary statistics and aggregate data.

year

Jan

Feb

Mar

Apr

May

Jun

Jul

Aug

Sep

Oct

Nov

Dec

1,959

315.42

316.31

316.50

317.56

318.13

318.00

316.39

314.65

313.68

313.18

314.66

315.43

1,960

316.27

316.81

317.42

318.87

319.87

319.43

318.01

315.74

314.00

313.68

314.84

316.03

1,961

316.73

317.54

318.38

319.31

320.42

319.61

318.42

316.63

314.83

315.16

315.94

316.85

1,962

317.78

318.40

319.53

320.42

320.85

320.45

319.45

317.25

316.11

315.27

316.53

317.53

1,963

318.58

318.92

319.70

321.22

322.08

321.31

319.58

317.61

316.05

315.83

316.91

318.20

1,964

319.41

320.07

320.74

321.40

322.06

321.73

320.27

318.54

316.54

316.71

317.53

318.55

1,965

319.27

320.28

320.73

321.97

322.00

321.71

321.05

318.71

317.66

317.14

318.70

319.25

1,966

320.46

321.43

322.23

323.54

323.91

323.59

322.24

320.20

318.48

317.94

319.63

320.87

1,967

322.17

322.34

322.88

324.25

324.83

323.93

322.38

320.76

319.10

319.24

320.56

321.80

1,968

322.40

322.99

323.73

324.86

325.40

325.20

323.98

321.95

320.18

320.09

321.16

322.74

1,969

323.83

324.26

325.47

326.50

327.21

326.54

325.72

323.50

322.22

321.62

322.69

323.95

1,970

324.89

325.82

326.77

327.97

327.91

327.50

326.18

324.53

322.93

322.90

323.85

324.96

1,971

326.01

326.51

327.01

327.62

328.76

328.40

327.20

325.27

323.20

323.40

324.63

325.85

1,972

326.60

327.47

327.58

329.56

329.90

328.92

327.88

326.16

324.68

325.04

326.34

327.39

1,973

328.37

329.40

330.14

331.33

332.31

331.90

330.70

329.15

327.35

327.02

327.99

328.48

1,974

329.18

330.55

331.32

332.48

332.92

332.08

331.01

329.23

327.27

327.21

328.29

329.41

1,975

330.23

331.25

331.87

333.14

333.80

333.43

331.73

329.90

328.40

328.17

329.32

330.59

1,976

331.58

332.39

333.33

334.41

334.71

334.17

332.89

330.77

329.14

328.78

330.14

331.52

1,977

332.75

333.24

334.53

335.90

336.57

336.10

334.76

332.59

331.42

330.98

332.24

333.68

1,978

334.80

335.22

336.47

337.59

337.84

337.72

336.37

334.51

332.60

332.38

333.75

334.78

1,979

336.05

336.59

337.79

338.71

339.30

339.12

337.56

335.92

333.75

333.70

335.12

336.56

1,980

337.84

338.19

339.91

340.60

341.29

341.00

339.39

337.43

335.72

335.84

336.93

338.04

1,981

339.06

340.30

341.21

342.33

342.74

342.08

340.32

338.26

336.52

336.68

338.19

339.44

1,982

340.57

341.44

342.53

343.39

343.96

343.18

341.88

339.65

337.81

337.69

339.09

340.32

1,983

341.20

342.35

342.93

344.77

345.58

345.14

343.81

342.21

339.69

339.82

340.98

342.82

1,984

343.52

344.33

345.11

346.88

347.25

346.62

345.22

343.11

340.90

341.18

342.80

344.04

1,985

344.79

345.82

347.25

348.17

348.74

348.07

346.38

344.51

342.92

342.62

344.06

345.38

1,986

346.11

346.78

347.68

349.37

350.03

349.37

347.76

345.73

344.68

343.99

345.48

346.72

1,987

347.84

348.29

349.23

350.80

351.66

351.07

349.33

347.92

346.27

346.18

347.64

348.78

1,988

350.25

351.54

352.05

353.41

354.04

353.62

352.22

350.27

348.55

348.72

349.91

351.18

1,989

352.60

352.92

353.53

355.26

355.52

354.97

353.75

351.52

349.64

349.83

351.14

352.37

1,990

353.50

354.55

355.23

356.04

357.00

356.07

354.67

352.76

350.82

351.04

352.69

354.07

1,991

354.59

355.63

357.03

358.48

359.22

358.12

356.06

353.92

352.05

352.11

353.64

354.89

1,992

355.88

356.63

357.72

359.07

359.58

359.17

356.94

354.92

352.94

353.23

354.09

355.33

1,993

356.63

357.10

358.32

359.41

360.23

359.55

357.53

355.48

353.67

353.95

355.30

356.78

1,994

358.34

358.89

359.95

361.25

361.67

360.94

359.55

357.49

355.84

356.00

357.59

359.05

1,995

359.98

361.03

361.66

363.48

363.82

363.30

361.94

359.50

358.11

357.80

359.61

360.74

1,996

362.09

363.29

364.06

364.76

365.45

365.01

363.70

361.54

359.51

359.65

360.80

362.38

1,997

363.23

364.06

364.61

366.40

366.84

365.68

364.52

362.57

360.24

360.83

362.49

364.34

Long format

Long format organize data where observations are represented by multiple rows and columns represent variabels.

  • Suitable for large datasets with varying numbers of observations and variables.
  • Facilitates easier data manipulation and analysis.
  • Supports complex hierarchical structures.

day

months

year

co2

15

1

1,959

315.42

15

2

1,959

316.31

15

3

1,959

316.50

15

4

1,959

317.56

15

5

1,959

318.13

15

6

1,959

318.00

15

7

1,959

316.39

15

8

1,959

314.65

15

9

1,959

313.68

15

10

1,959

313.18

15

11

1,959

314.66

15

12

1,959

315.43

15

1

1,960

316.27

15

2

1,960

316.81

15

3

1,960

317.42

15

4

1,960

318.87

15

5

1,960

319.87

15

6

1,960

319.43

15

7

1,960

318.01

15

8

1,960

315.74

15

9

1,960

314.00

15

10

1,960

313.68

15

11

1,960

314.84

15

12

1,960

316.03

15

1

1,961

316.73

15

2

1,961

317.54

15

3

1,961

318.38

15

4

1,961

319.31

15

5

1,961

320.42

15

6

1,961

319.61

15

7

1,961

318.42

15

8

1,961

316.63

15

9

1,961

314.83

15

10

1,961

315.16

15

11

1,961

315.94

15

12

1,961

316.85

15

1

1,962

317.78

15

2

1,962

318.40

15

3

1,962

319.53

15

4

1,962

320.42

15

5

1,962

320.85

15

6

1,962

320.45

15

7

1,962

319.45

15

8

1,962

317.25

15

9

1,962

316.11

15

10

1,962

315.27

15

11

1,962

316.53

15

12

1,962

317.53

15

1

1,963

318.58

15

2

1,963

318.92

15

3

1,963

319.70

15

4

1,963

321.22

15

5

1,963

322.08

15

6

1,963

321.31

15

7

1,963

319.58

15

8

1,963

317.61

15

9

1,963

316.05

15

10

1,963

315.83

15

11

1,963

316.91

15

12

1,963

318.20

15

1

1,964

319.41

15

2

1,964

320.07

15

3

1,964

320.74

15

4

1,964

321.40

15

5

1,964

322.06

15

6

1,964

321.73

15

7

1,964

320.27

15

8

1,964

318.54

15

9

1,964

316.54

15

10

1,964

316.71

15

11

1,964

317.53

15

12

1,964

318.55

15

1

1,965

319.27

15

2

1,965

320.28

15

3

1,965

320.73

15

4

1,965

321.97

15

5

1,965

322.00

15

6

1,965

321.71

15

7

1,965

321.05

15

8

1,965

318.71

15

9

1,965

317.66

15

10

1,965

317.14

15

11

1,965

318.70

15

12

1,965

319.25

15

1

1,966

320.46

15

2

1,966

321.43

15

3

1,966

322.23

15

4

1,966

323.54

15

5

1,966

323.91

15

6

1,966

323.59

15

7

1,966

322.24

15

8

1,966

320.20

15

9

1,966

318.48

15

10

1,966

317.94

15

11

1,966

319.63

15

12

1,966

320.87

15

1

1,967

322.17

15

2

1,967

322.34

15

3

1,967

322.88

15

4

1,967

324.25

15

5

1,967

324.83

15

6

1,967

323.93

15

7

1,967

322.38

15

8

1,967

320.76

15

9

1,967

319.10

15

10

1,967

319.24

15

11

1,967

320.56

15

12

1,967

321.80

15

1

1,968

322.40

15

2

1,968

322.99

15

3

1,968

323.73

15

4

1,968

324.86

15

5

1,968

325.40

15

6

1,968

325.20

15

7

1,968

323.98

15

8

1,968

321.95

15

9

1,968

320.18

15

10

1,968

320.09

15

11

1,968

321.16

15

12

1,968

322.74

15

1

1,969

323.83

15

2

1,969

324.26

15

3

1,969

325.47

15

4

1,969

326.50

15

5

1,969

327.21

15

6

1,969

326.54

15

7

1,969

325.72

15

8

1,969

323.50

15

9

1,969

322.22

15

10

1,969

321.62

15

11

1,969

322.69

15

12

1,969

323.95

15

1

1,970

324.89

15

2

1,970

325.82

15

3

1,970

326.77

15

4

1,970

327.97

15

5

1,970

327.91

15

6

1,970

327.50

15

7

1,970

326.18

15

8

1,970

324.53

15

9

1,970

322.93

15

10

1,970

322.90

15

11

1,970

323.85

15

12

1,970

324.96

15

1

1,971

326.01

15

2

1,971

326.51

15

3

1,971

327.01

15

4

1,971

327.62

15

5

1,971

328.76

15

6

1,971

328.40

15

7

1,971

327.20

15

8

1,971

325.27

15

9

1,971

323.20

15

10

1,971

323.40

15

11

1,971

324.63

15

12

1,971

325.85

15

1

1,972

326.60

15

2

1,972

327.47

15

3

1,972

327.58

15

4

1,972

329.56

15

5

1,972

329.90

15

6

1,972

328.92

15

7

1,972

327.88

15

8

1,972

326.16

15

9

1,972

324.68

15

10

1,972

325.04

15

11

1,972

326.34

15

12

1,972

327.39

15

1

1,973

328.37

15

2

1,973

329.40

15

3

1,973

330.14

15

4

1,973

331.33

15

5

1,973

332.31

15

6

1,973

331.90

15

7

1,973

330.70

15

8

1,973

329.15

15

9

1,973

327.35

15

10

1,973

327.02

15

11

1,973

327.99

15

12

1,973

328.48

15

1

1,974

329.18

15

2

1,974

330.55

15

3

1,974

331.32

15

4

1,974

332.48

15

5

1,974

332.92

15

6

1,974

332.08

15

7

1,974

331.01

15

8

1,974

329.23

15

9

1,974

327.27

15

10

1,974

327.21

15

11

1,974

328.29

15

12

1,974

329.41

15

1

1,975

330.23

15

2

1,975

331.25

15

3

1,975

331.87

15

4

1,975

333.14

15

5

1,975

333.80

15

6

1,975

333.43

15

7

1,975

331.73

15

8

1,975

329.90

15

9

1,975

328.40

15

10

1,975

328.17

15

11

1,975

329.32

15

12

1,975

330.59

15

1

1,976

331.58

15

2

1,976

332.39

15

3

1,976

333.33

15

4

1,976

334.41

15

5

1,976

334.71

15

6

1,976

334.17

15

7

1,976

332.89

15

8

1,976

330.77

15

9

1,976

329.14

15

10

1,976

328.78

15

11

1,976

330.14

15

12

1,976

331.52

15

1

1,977

332.75

15

2

1,977

333.24

15

3

1,977

334.53

15

4

1,977

335.90

15

5

1,977

336.57

15

6

1,977

336.10

15

7

1,977

334.76

15

8

1,977

332.59

15

9

1,977

331.42

15

10

1,977

330.98

15

11

1,977

332.24

15

12

1,977

333.68

15

1

1,978

334.80

15

2

1,978

335.22

15

3

1,978

336.47

15

4

1,978

337.59

15

5

1,978

337.84

15

6

1,978

337.72

15

7

1,978

336.37

15

8

1,978

334.51

15

9

1,978

332.60

15

10

1,978

332.38

15

11

1,978

333.75

15

12

1,978

334.78

15

1

1,979

336.05

15

2

1,979

336.59

15

3

1,979

337.79

15

4

1,979

338.71

15

5

1,979

339.30

15

6

1,979

339.12

15

7

1,979

337.56

15

8

1,979

335.92

15

9

1,979

333.75

15

10

1,979

333.70

15

11

1,979

335.12

15

12

1,979

336.56

15

1

1,980

337.84

15

2

1,980

338.19

15

3

1,980

339.91

15

4

1,980

340.60

15

5

1,980

341.29

15

6

1,980

341.00

15

7

1,980

339.39

15

8

1,980

337.43

15

9

1,980

335.72

15

10

1,980

335.84

15

11

1,980

336.93

15

12

1,980

338.04

15

1

1,981

339.06

15

2

1,981

340.30

15

3

1,981

341.21

15

4

1,981

342.33

15

5

1,981

342.74

15

6

1,981

342.08

15

7

1,981

340.32

15

8

1,981

338.26

15

9

1,981

336.52

15

10

1,981

336.68

15

11

1,981

338.19

15

12

1,981

339.44

15

1

1,982

340.57

15

2

1,982

341.44

15

3

1,982

342.53

15

4

1,982

343.39

15

5

1,982

343.96

15

6

1,982

343.18

15

7

1,982

341.88

15

8

1,982

339.65

15

9

1,982

337.81

15

10

1,982

337.69

15

11

1,982

339.09

15

12

1,982

340.32

15

1

1,983

341.20

15

2

1,983

342.35

15

3

1,983

342.93

15

4

1,983

344.77

15

5

1,983

345.58

15

6

1,983

345.14

15

7

1,983

343.81

15

8

1,983

342.21

15

9

1,983

339.69

15

10

1,983

339.82

15

11

1,983

340.98

15

12

1,983

342.82

15

1

1,984

343.52

15

2

1,984

344.33

15

3

1,984

345.11

15

4

1,984

346.88

15

5

1,984

347.25

15

6

1,984

346.62

15

7

1,984

345.22

15

8

1,984

343.11

15

9

1,984

340.90

15

10

1,984

341.18

15

11

1,984

342.80

15

12

1,984

344.04

15

1

1,985

344.79

15

2

1,985

345.82

15

3

1,985

347.25

15

4

1,985

348.17

15

5

1,985

348.74

15

6

1,985

348.07

15

7

1,985

346.38

15

8

1,985

344.51

15

9

1,985

342.92

15

10

1,985

342.62

15

11

1,985

344.06

15

12

1,985

345.38

15

1

1,986

346.11

15

2

1,986

346.78

15

3

1,986

347.68

15

4

1,986

349.37

15

5

1,986

350.03

15

6

1,986

349.37

15

7

1,986

347.76

15

8

1,986

345.73

15

9

1,986

344.68

15

10

1,986

343.99

15

11

1,986

345.48

15

12

1,986

346.72

15

1

1,987

347.84

15

2

1,987

348.29

15

3

1,987

349.23

15

4

1,987

350.80

15

5

1,987

351.66

15

6

1,987

351.07

15

7

1,987

349.33

15

8

1,987

347.92

15

9

1,987

346.27

15

10

1,987

346.18

15

11

1,987

347.64

15

12

1,987

348.78

15

1

1,988

350.25

15

2

1,988

351.54

15

3

1,988

352.05

15

4

1,988

353.41

15

5

1,988

354.04

15

6

1,988

353.62

15

7

1,988

352.22

15

8

1,988

350.27

15

9

1,988

348.55

15

10

1,988

348.72

15

11

1,988

349.91

15

12

1,988

351.18

15

1

1,989

352.60

15

2

1,989

352.92

15

3

1,989

353.53

15

4

1,989

355.26

15

5

1,989

355.52

15

6

1,989

354.97

15

7

1,989

353.75

15

8

1,989

351.52

15

9

1,989

349.64

15

10

1,989

349.83

15

11

1,989

351.14

15

12

1,989

352.37

15

1

1,990

353.50

15

2

1,990

354.55

15

3

1,990

355.23

15

4

1,990

356.04

15

5

1,990

357.00

15

6

1,990

356.07

15

7

1,990

354.67

15

8

1,990

352.76

15

9

1,990

350.82

15

10

1,990

351.04

15

11

1,990

352.69

15

12

1,990

354.07

15

1

1,991

354.59

15

2

1,991

355.63

15

3

1,991

357.03

15

4

1,991

358.48

15

5

1,991

359.22

15

6

1,991

358.12

15

7

1,991

356.06

15

8

1,991

353.92

15

9

1,991

352.05

15

10

1,991

352.11

15

11

1,991

353.64

15

12

1,991

354.89

15

1

1,992

355.88

15

2

1,992

356.63

15

3

1,992

357.72

15

4

1,992

359.07

15

5

1,992

359.58

15

6

1,992

359.17

15

7

1,992

356.94

15

8

1,992

354.92

15

9

1,992

352.94

15

10

1,992

353.23

15

11

1,992

354.09

15

12

1,992

355.33

15

1

1,993

356.63

15

2

1,993

357.10

15

3

1,993

358.32

15

4

1,993

359.41

15

5

1,993

360.23

15

6

1,993

359.55

15

7

1,993

357.53

15

8

1,993

355.48

15

9

1,993

353.67

15

10

1,993

353.95

15

11

1,993

355.30

15

12

1,993

356.78

15

1

1,994

358.34

15

2

1,994

358.89

15

3

1,994

359.95

15

4

1,994

361.25

15

5

1,994

361.67

15

6

1,994

360.94

15

7

1,994

359.55

15

8

1,994

357.49

15

9

1,994

355.84

15

10

1,994

356.00

15

11

1,994

357.59

15

12

1,994

359.05

15

1

1,995

359.98

15

2

1,995

361.03

15

3

1,995

361.66

15

4

1,995

363.48

15

5

1,995

363.82

15

6

1,995

363.30

15

7

1,995

361.94

15

8

1,995

359.50

15

9

1,995

358.11

15

10

1,995

357.80

15

11

1,995

359.61

15

12

1,995

360.74

15

1

1,996

362.09

15

2

1,996

363.29

15

3

1,996

364.06

15

4

1,996

364.76

15

5

1,996

365.45

15

6

1,996

365.01

15

7

1,996

363.70

15

8

1,996

361.54

15

9

1,996

359.51

15

10

1,996

359.65

15

11

1,996

360.80

15

12

1,996

362.38

15

1

1,997

363.23

15

2

1,997

364.06

15

3

1,997

364.61

15

4

1,997

366.40

15

5

1,997

366.84

15

6

1,997

365.68

15

7

1,997

364.52

15

8

1,997

362.57

15

9

1,997

360.24

15

10

1,997

360.83

15

11

1,997

362.49

15

12

1,997

364.34