Data visualization with ggplot2

Nyamisi Peter & Masumbuko Semba

2024-04-01

Learning Agenda

  1. Get familiar with R and Rstudio
  2. Data structure and data types
  3. Reading and writing data in Rstudio
  4. Tidying with tidyverse
  5. Plotting and Visualization
  6. Data manipulation with tidyverse
  7. Descriptive Statistics
  8. Inferential Statistics
  9. Modelling and simulation
  10. Spatial Handling and Analysis

OVERVIEW

Questions

  1. How do you make plots using R?
  2. How do you customize and modify plots?

Objectives

  1. Produce histograms, scatterplots, boxplots, line plots and barplots using ggplot2
  2. Describe what faceting is and apply faceting in ggplot2
  3. Modify the aesthetics of an existing ggplot2 plot.

Plotting with ggplot2

  • ggplot2 creates complex plots from data in a data frame.

  • ggplot2 plots work best with data in the ‘long’ format, where each row is one observation and each column is one variable.

  • To build a ggplot, we will use the following basic template that can be used for different types of plots:

\[ Plot = Data + Aesthetics + Geom \]
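To make the template concrete, here is a minimal sketch that reshapes a small wide table to long format and then maps each part of the template to code. The data frame `wide`, its column names and its values are made up for illustration and are not part of the course data.

```r
library(ggplot2)
library(tidyr)

# Hypothetical wide table: one column per site (values are made up)
wide <- data.frame(
  fish_id = 1:3,
  site_a  = c(10.2, 11.5, 9.8),
  site_b  = c(12.1, 13.0, 12.6)
)

# Reshape to long format: one row per observation
long <- pivot_longer(
  wide,
  cols      = c(site_a, site_b),
  names_to  = "site",
  values_to = "weight"
)

# Plot = Data + Aesthetics + Geom
p <- ggplot(data = long,                 # Data
            aes(x = site, y = weight)) + # Aesthetics
  geom_point()                           # Geom
p
```

Swapping `geom_point()` for another geom changes the plot type while the data and aesthetics stay the same.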

Tidyverse: Tidy Data Toolkit

  • Collection of R packages to tidy, manipulate & visualize data
## Install the first time

install.packages("tidyverse")

## Check if installed, otherwise install

if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

## Load the packages for this session

library(tidyverse)

Five Key Plots

  1. Histogram
  2. Boxplot
  3. Scatterplot
  4. Barplot
  5. Lineplot

Data used

  • We use the chinook dataset.
  • It has three variables, total length (tl), weight (w) and site (loc), and 112 observations.
  • We can import the dataset with read_csv():
chinook = read_csv("chinook_lw.csv")
  • It is important to check the structure of the dataset after importing it.
  • We can check the structure with glimpse():
chinook |> 
  glimpse()
Rows: 112
Columns: 3
$ tl  <dbl> 120.1, 115.0, 111.2, 110.2, 110.0, 109.7, 105.0, 100.1, 98.0, 92.1…
$ w   <dbl> 17.9, 17.2, 16.8, 15.8, 14.3, 13.8, 12.8, 11.7, 12.8, 14.8, 9.7, 7…
$ loc <chr> "Argentina", "Argentina", "Argentina", "Argentina", "Argentina", "…

Histogram

  • Used to assess the distribution of a single numerical variable.
  • Displays the frequency or density of data within predefined intervals (bins).
  • Requires a numerical variable.
ggplot(data = chinook, aes(x = tl)) +
  geom_histogram()
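Called without arguments, geom_histogram() uses 30 bins and prints a message suggesting you pick a better value. The sketch below sets the bin width explicitly. The data frame `fish` is a made-up stand-in for the chinook data, and the bin width of 5 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the chinook data (values are simulated)
set.seed(1)
fish <- data.frame(tl = rnorm(100, mean = 90, sd = 15))

p_hist <- ggplot(data = fish, aes(x = tl)) +
  geom_histogram(binwidth = 5, colour = "white") + # each bar spans 5 units of tl
  labs(x = "Total length", y = "Count")
p_hist
```

A smaller `binwidth` shows more detail but also more noise, so it is worth trying a few values.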

Scatterplot

  • Used to assess the relationship between two numerical variables.
  • The independent variable goes on the x-axis and the dependent variable on the y-axis.
  • Visualizes patterns and correlations within data.
  • Requires two numerical variables.
ggplot(data = chinook, aes(x = tl, y = w)) +
  geom_point()
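Faceting splits one plot into small panels, one per level of a categorical variable. The sketch below colours the points by site and draws one panel per site with facet_wrap(). The data frame `fish` and the site names "Site B" and "Site C" are made up for illustration; only the column names (tl, w, loc) follow the chinook dataset.

```r
library(ggplot2)

# Hypothetical stand-in for the chinook data (values are made up)
fish <- data.frame(
  tl  = c(120, 110, 100, 95, 88, 80, 75, 70, 62),
  w   = c(17.9, 15.8, 11.7, 10.5, 8.9, 7.2, 6.1, 5.0, 3.9),
  loc = rep(c("Argentina", "Site B", "Site C"), each = 3)
)

p_facet <- ggplot(data = fish, aes(x = tl, y = w, colour = loc)) +
  geom_point() +
  facet_wrap(~ loc) + # one panel per site
  labs(x = "Total length", y = "Weight", colour = "Site")
p_facet
```

By default all panels share the same axes, which makes the sites directly comparable.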

Boxplot

  • Offers a quick snapshot of data distribution.
  • Captures central tendency, variability, and outliers effectively.
  • Combines a categorical (string or factor) variable with a numerical variable.
ggplot(data = chinook, aes(x = loc, y = w)) +
  geom_boxplot()
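An existing plot can be stored in an object and modified later by adding more components with `+`. The sketch below is one possible way to do this: it adds a fill colour, axis labels and a theme to a basic boxplot. The data frame `fish` and the site names other than "Argentina" are made up for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the chinook data (values are made up)
fish <- data.frame(
  w   = c(17.9, 15.8, 11.7, 10.5, 8.9, 7.2, 6.1, 5.0, 3.9),
  loc = rep(c("Argentina", "Site B", "Site C"), each = 3)
)

# Basic plot stored in an object
p_box <- ggplot(data = fish, aes(x = loc, y = w)) +
  geom_boxplot()

# Modify the stored plot: fill by site, relabel axes, change the theme
p_box2 <- p_box +
  aes(fill = loc) +
  labs(x = "Site", y = "Weight") +
  theme_minimal() +
  theme(legend.position = "none") # the x-axis already names the sites
p_box2
```

The original `p_box` is left unchanged, so several variants can be built from the same starting plot.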

Barplot

  • Compares frequencies or proportions of different categories.
  • Visualizes distributions of categorical variables.
  • Useful for presenting survey results or demographic data.
  • Requires a character, factor or logical variable.
ggplot(data = chinook, aes(x = loc)) +
  geom_bar()
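geom_bar() counts the rows in each category for you. When the counts are already in a column, geom_col() plots them as given. The sketch below shows both routes giving the same bar heights. The data frame `fish` and its site names are made up for illustration.

```r
library(ggplot2)
library(dplyr)

# Hypothetical site labels (made up): 4, 3 and 2 fish per site
fish <- data.frame(
  loc = c(rep("Argentina", 4), rep("Site B", 3), rep("Site C", 2))
)

# Route 1: let geom_bar() count the rows
p_bar <- ggplot(data = fish, aes(x = loc)) +
  geom_bar()

# Route 2: count first, then plot the precomputed column n
site_counts <- count(fish, loc)
p_col <- ggplot(data = site_counts, aes(x = loc, y = n)) +
  geom_col()
p_col
```

Route 2 is handy when the summary table is also needed elsewhere, for example in a report.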

Lineplot

  • A simple graphical representation of data points connected by straight lines.
  • Useful to assess trends or patterns over time.
  • Requires two continuous variables.
  • The x variable should represent time (for example, years).
data <- data.frame(
  years = c(2018, 2019, 2020, 2021, 2022),
  income = c(30000, 32000, 35000, 38000, 42000)
)

data |> ggplot(aes(x = years, y = income)) + 
  geom_line()
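With only a few time points, it can help to mark each observation and to label every year on the x-axis. The sketch below reuses the income values from the chunk above. The object name `income_by_year` is our own choice, used to avoid shadowing R's built-in `data()` function.

```r
library(ggplot2)

# Same values as the example above, under a more descriptive name
income_by_year <- data.frame(
  years  = c(2018, 2019, 2020, 2021, 2022),
  income = c(30000, 32000, 35000, 38000, 42000)
)

p_line <- ggplot(data = income_by_year, aes(x = years, y = income)) +
  geom_line() +
  geom_point() +                                     # mark each observation
  scale_x_continuous(breaks = income_by_year$years)  # one tick per year
p_line
```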

Examples

Example 1. Based on the catch landing dataset you have been exploring, load the dataset and calculate the median total catch by country and epoch.

Solution. Group the data by country and epoch, then apply the median() function to the catch variable.
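One way to sketch this calculation with dplyr is shown below. The file name of the landing dataset is not given here, so the tibble `landings` and all its values are made up for illustration. The column names country, epoch and catch are assumed from the wording of the exercise.

```r
library(dplyr)

# Hypothetical landings data (values are made up)
landings <- tibble(
  country = c("A", "A", "A", "B", "B", "B"),
  epoch   = c("early", "early", "late", "early", "late", "late"),
  catch   = c(10, 20, 30, 5, 15, 25)
)

# Median catch for each country and epoch combination
catch_summary <- landings |>
  group_by(country, epoch) |>
  summarise(median_catch = median(catch, na.rm = TRUE), .groups = "drop")

catch_summary
```

With real data, replace the `tibble()` call with a `read_csv()` call pointing at your own file.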


ANOVA

term        df    sumsq        meansq        statistic   p.value
loc         2     32,171.73    16,085.8674   83.38093    0.0000000000000000000001073077
Residuals   109   21,028.31    192.9202
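A table with these column names is what broom::tidy() returns for a one-way ANOVA fitted with aov(). The sketch below shows that workflow on simulated data. The response variable behind the table above is not stated, so the column `y`, the site names other than "Argentina" and all values are assumptions, and the numbers will not match the table.

```r
library(broom)

# Simulated data with three sites (values are made up)
set.seed(42)
fish <- data.frame(
  y   = c(rnorm(10, mean = 100, sd = 10),
          rnorm(10, mean = 80,  sd = 10),
          rnorm(10, mean = 60,  sd = 10)),
  loc = rep(c("Argentina", "Site B", "Site C"), each = 10)
)

# One-way ANOVA: does the mean of y differ among sites?
fit <- aov(y ~ loc, data = fish)

# Tidy table: term, df, sumsq, meansq, statistic, p.value
anova_tbl <- tidy(fit)
anova_tbl
```

A small p.value for the loc term suggests that at least one site mean differs from the others.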

Thank You for Attending