Introduction

Getting you up to speed with R

Welcome! 👋

Welcome to your first lab session of the Introduction to Data Science course! The labs are the practical component of the course and rely largely on the R programming language.

There are two tutors for these labs - Killian Conyngham and Carol Sobral. We are here to talk you through the lab each week and help you with any questions you may have.

As there is a variety of experience levels in the room, we will try to go at a pace that fits our schedule and suits as many of you as possible. If we go too fast or too slow, do let us know!

This week’s tutorial will be divided into three parts.

  • First, we will recap R basics and the RStudio infrastructure setup.
  • Second, we will dive into the Tidyverse, a collection of R packages that are designed to work together to make data science easier and more intuitive.
  • Third, we will discuss some best practices in R.

But first things first: Have you all successfully installed R and RStudio? Otherwise, install R first, then install RStudio - or think about updating R, RStudio, and your installed packages.

Have you all registered a free GitHub account? Otherwise, register here.


Working with R

The RStudio interface

RStudio is an integrated development environment (IDE) for R. Think of RStudio as a front end that allows us to interact with, compile, and render R code in a more intuitive way. The following image shows what the standard RStudio interface looks like:

RStudio Interface
  1. Console: The console provides a means to interact directly with R. You can type some code at the console and when you press ENTER, R will run that code. Depending on what you type, you may see some output in the console or if you make a mistake, you may get a warning or an error message.

  2. Script editor: You will use the script editor to complete your assignments. It is the space where files are displayed. For example, once you download and open the bi-weekly assignment .Rmd template, it will appear here. The editor is where you should place code you care about, since code typed into the console is not saved in a script.

  3. Environment: This area lists the objects you have created with your code. If you run myresult <- 5+3+2, the myresult object will appear there.

  4. Plots and files: This area is where graphical output is displayed. Additionally, if you write a question mark before any function name (e.g. ?mean), its help documentation will be displayed here.


Getting Help 🔍

The key to learning R is: Google! We can give you an overview of basic R functions, but to really learn R you will have to actively use it yourself, troubleshoot, ask questions, and google! Help pages such as http://stackoverflow.com offer a rich archive of questions and answers by the R community. For example, if you google “recode data in r” you will find a variety of useful websites explaining how to do this on the first page of the search results. Don’t be surprised if you find a variety of different ways to execute the same task.

RStudio also has a useful help menu. In addition, you can get information on any function or integrated data set in R through the console, for example:

?plot

In addition, there are many free, comprehensive R guides, such as Quick-R at http://www.statmethods.net or the R cookbook at http://www.cookbook-r.com.

You are also surrounded by a wealth of knowledge in your classmates around you. Feel free to collaborate or ask your neighbours questions to help you!

I recommend you avoid using ChatGPT or any other AI tools for coding help in these labs! The point of these labs is to learn how to code yourself, so make sure you always try each task yourself first and that you understand each line of code. If you have Copilot enabled in your RStudio, please disable it for these labs.

Executing a line of code

To execute a single line of code in RStudio, place the cursor on the line you want R to run and press Command + Return (on macOS) or Ctrl + Enter (on Windows).

To execute multiple lines of code at once, highlight the respective portion of the code and then run it using the operations above.

Objects in R 📦

R stores information as an object. You can name objects whatever you like.

A few things to remember though:

  • Do not use names that are reserved for built-in functions or functions in the packages you use, such as sum, mean, or abs. Most of the time, R will let you use these as names, but it leads to confusion in your code.
  • Do not use spaces or special characters such as $ or %. Common symbols that are used in variable names include . or _.
  • Remember that R is case sensitive.
  • To assign values to objects, we use the assignment operator <-. Sometimes you will also see = as the assignment operator. This is a matter of preference and subject to debate among R programmers. Personally, I use <- to assign values to objects and = within functions.
  • The # symbol is used for commenting and demarcation. Any code following # will not be executed.
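
As a small illustration of the rules above, here is a sketch you can run in the console. The object names (my_result, rounded_pi, penguin_count) are made up for this example.

```r
# Assign a value with <- and print it by typing the object's name
my_result <- 5 + 3 + 2
my_result

# R is case sensitive: my_result and My_Result are two different objects
My_Result <- 100
my_result == My_Result  # FALSE

# Inside a function call, = sets an argument rather than creating an object
rounded_pi <- round(3.14159, digits = 2)
rounded_pi

# Names containing . or _ are fine; names containing spaces or $ are not
penguin_count <- 344
```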

Data Types in R

There are four main variable types you should be familiar with:

  • Numerical: Any number. An integer is a numerical variable without any decimals, while a double is a numerical variable with decimals.
  • Character: This is what Stata (and other programming languages such as Python) calls a string. We typically store any alphanumeric data that is not ordered as a character vector.
  • Logical: A collection of TRUE and FALSE values.
  • Factor: A categorical variable with a fixed set of possible values (levels). Factors are unordered by default, but they can also be ordered, i.e. ordinal.
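
To see these types in practice, you can ask R for the class of an object. The following is a minimal sketch with made-up values; class() and typeof() are base R functions.

```r
whole_number   <- 5L         # the L suffix asks R for an integer
decimal_number <- 5.5        # numbers are stored as doubles by default
some_text      <- "penguin"
is_heavy       <- TRUE
size           <- factor(c("small", "large", "small"))

class(whole_number)     # "integer"
class(decimal_number)   # "numeric"
typeof(decimal_number)  # "double"
class(some_text)        # "character"
class(is_heavy)         # "logical"
class(size)             # "factor"
levels(size)            # "large" "small" (alphabetical unless you set an order)
```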

Data Structures in R

R has many data structures. The most important ones for now are:

  • Vectors: The most basic data structure in R; a sequence of elements of the same type.
  • Factors: Special vectors used to store categorical data with fixed possible values (levels).
  • Lists: A collection that can hold elements of different types, including other data structures.
  • Matrices: A two-dimensional structure where all elements must be of the same type.
  • Data frames: A table-like structure where each column can be of a different type, often used for datasets.
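
The sketch below builds one small example of most of these structures in base R. All values are invented for illustration.

```r
# Vector: all elements share one type
masses <- c(3750, 3800, 3250)

# Mixing types in a vector silently converts everything to the most flexible type
mixed <- c(1, "two", TRUE)
class(mixed)  # "character"

# List: elements can have different types
penguin <- list(species = "Adelie", mass = 3750, tagged = TRUE)

# Matrix: two dimensions, one type
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)  # 2 3

# Data frame: columns can differ in type, but all have the same length
df <- data.frame(
  species     = c("Adelie", "Gentoo"),
  body_mass_g = c(3750, 5000)
)
str(df)
```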

R Packages 📁

For the most part, R Packages are collections of code and functions that leverage R programming to expand on the basic functionalities. There are a plethora of packages in R designed to facilitate the completion of tasks, built by the R community. In later parts of this course, you will also learn how to write packages yourself.

In R you only need to install a package once. After that, you only need to load the package in each new R session. As a good practice I recommend running the code to install packages only in your R console, not in the code editor. You can install a package with the following syntax:

install.packages("name_of_your_package") #note that the quotation marks are mandatory at this stage

Once the package has been installed, you just need to “call it” every time you want to use it in a file by running:

library("name_of_your_package") #either of these lines will load the package
library(name_of_your_package) #library() understands the name with or without quotation marks

It is extremely important that you do not have any lines installing packages in your assignments, because the file will fail to knit.
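
If you want a script to check whether a package is available without installing anything, one possible pattern is sketched below. It uses the base R function requireNamespace(); the package name is only an example.

```r
pkg <- "palmerpenguins"

# requireNamespace() returns TRUE if the package is installed, without attaching it
if (requireNamespace(pkg, quietly = TRUE)) {
  message(pkg, " is installed; load it with library(", pkg, ").")
} else {
  message(pkg, " is missing; run install.packages(\"", pkg, "\") in the console.")
}
```

Because this only prints a message and never calls install.packages(), it does not interfere with knitting.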


Working directory

The working directory is just a file path on your computer that sets the default location of any files you read into R, or save out of R. Normally, when you open RStudio it will have a default directory (a folder on your computer). You can check your directory by running getwd() in your console:

#this is the default in my case
getwd()
#[1] "/Users/l.oswald"

When RStudio is closed and you open a file from Finder on macOS or File Explorer on Windows, the default working directory will be the folder where the file is located.


Setting your working directory

You can set your working directory manually from RStudio: use the menu under Session > Set Working Directory > Choose Directory.

You can also use the setwd() function:

setwd("/path/to/your/directory") #in macOS
setwd("c:/path/to/your/directory") #in windows
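
Once the working directory is set, files can be addressed relative to it. The sketch below uses base R helpers; the data folder and the file name penguins.csv are hypothetical and only serve as an example.

```r
current_dir <- getwd()
dir.exists(current_dir)   # TRUE

# file.path() joins path components with "/" on every operating system
data_file <- file.path("data", "penguins.csv")  # hypothetical relative path
data_file                 # "data/penguins.csv"

# TRUE only if that file really exists below your working directory
file.exists(data_file)
```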

R File types

There are two main file types you will be working with in this course: .R files and .Rmd files. The former are simply scripts of R code, while the latter are R Markdown files that allow you to combine text and code in a single document. You can “knit” R markdown files to produce outputs in a variety of formats, including HTML, PDF, and Word. This whole lab is just a fancy .Rmd file!


Working with R Projects

R Projects are another option that lets you circumvent the folder structure mess to some degree. We highly recommend working in Projects because they are easily integrated with Git and GitHub!


Data Frames

Let’s now go a bit deeper with data frames using data for penguin sizes recorded by Dr. Kristen Gorman and others at several islands in the Palmer Archipelago, Antarctica. Data are originally published in: Gorman KB, Williams TD, Fraser WR (2014) PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081

The 3 species of penguins in this data set are Adelie, Chinstrap and Gentoo. The data set contains 8 variables:

  • species: a factor denoting the penguin species (Adelie, Chinstrap, or Gentoo)
  • island: a factor denoting the island (in Palmer Archipelago, Antarctica) where observed
  • bill_length_mm: a number denoting length of the dorsal ridge of penguin bill (millimeters)
  • bill_depth_mm: a number denoting the depth of the penguin bill (millimeters)
  • flipper_length_mm: an integer denoting penguin flipper length (millimeters)
  • body_mass_g: an integer denoting penguin body mass (grams)
  • sex: a factor denoting penguin sex (female, male)
  • year: an integer denoting the year of the record

Handily, this data is contained in a package called palmerpenguins. To make the palmerpenguins package available for use, install it and then use the library() command to load it. While packages need to be installed only once, the library() command needs to be run every time you want to use a particular package.

# install.packages("palmerpenguins") # Run this in console only
library(palmerpenguins)

Loading the data

Once we have loaded the package we can load the data set by simply typing data(penguins). The data will be stored in a data frame called penguins. (Note: Usually, you would load data from a file on your computer or from a URL as most datasets are not packaged.)

data(penguins)

Data Frame Structure

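Let’s find out what these data look like. First, we can print out the data frame to the console. Typing just the name of the data frame does the same as print(). Because penguins is stored as a tibble, R prints only the first 10 rows and as many columns as fit on the screen; the remaining variables are listed below the output.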

print(penguins)
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

Next we can use the str() function to explore the variable names and which data class they are stored in. Note: int stands for integer and is a special case of the class numeric.

str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Exercise 1

Look at the output from the function above. How many observations (rows) and variables (columns) does the penguins dataset have? What different data types do you see?

If we are only interested in what the variables are called, we can use the names() function.

names(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"

We can alter the names of vectors by using the names() function and indexing. Because data frames are essentially just combinations of vectors, we can do the same for variable names inside data frames. Suppose we want to rename the variable year, the eighth column, to rec_year.

names(penguins)[8] <- "rec_year"
names(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "rec_year"

We can use the summary() function to get a first look at the data.

summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex         rec_year   
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

A data frame has two dimensions: rows and columns.

nrow(penguins) # Number of rows
## [1] 344
ncol(penguins) # Number of columns
## [1] 8
dim(penguins) # Rows first then columns.
## [1] 344   8

Accessing elements of a data frame - Indexing

As a rule, whenever we access a dataframe or another object using two-dimensional indexing in R, the order is: [row, column]. To access the first row of the data frame, we specify the row we want to see and leave the column slot following the comma empty.

penguins[1, ]
## # A tibble: 1 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## # ℹ 2 more variables: sex <fct>, rec_year <int>

We can use the concatenate function c() to access multiple rows (or columns) at once. Below we print out the first and second row of the dataframe.

penguins[c(1, 2), ]
## # A tibble: 2 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## # ℹ 2 more variables: sex <fct>, rec_year <int>

We can also access a range of rows by separating the minimum and maximum value with a :. Below we print out the first five rows of the dataframe.

penguins[1:5,]
## # A tibble: 5 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## # ℹ 2 more variables: sex <fct>, rec_year <int>

If we try to access a data point that is out of bounds, R returns an error.

#penguins[3,10]
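
To practice the [row, column] logic without giving away the exercises below, here is a sketch on a tiny made-up data frame called toy. Note that toy is a base R data frame rather than a tibble, so printed output looks slightly different.

```r
toy <- data.frame(
  id     = 1:4,
  animal = c("cat", "dog", "owl", "fox"),
  weight = c(4.2, 12.5, 0.6, 5.9)
)

toy[2, 3]        # row 2, column 3: 12.5
toy[c(1, 3), ]   # rows 1 and 3, all columns
toy[2:4, 1:2]    # rows 2 to 4, columns 1 and 2
toy[-1, ]        # a negative index drops row 1

# Asking for a column that does not exist raises an error;
# try() catches it so that the rest of the script keeps running
out_of_bounds <- try(toy[3, 10], silent = TRUE)
class(out_of_bounds)  # "try-error"
```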

Exercise 2

Access the species information (column 1) for the 10th penguin in the dataset using indexing.

Exercise 3

Access rows 15 through 20 and columns 1 through 3 of the penguins dataset.

The $ operator

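The $ operator in R is used to specify a variable within a data frame. This is an alternative to indexing. The first command below only looks up the position of the species column. The other three all return the species of every penguin in the dataset: indexing with [ ] returns a one-column tibble, while $ returns the column as a plain factor vector.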

which(colnames(penguins) == "species")
## [1] 1
penguins[,'species']
## # A tibble: 344 × 1
##    species
##    <fct>  
##  1 Adelie 
##  2 Adelie 
##  3 Adelie 
##  4 Adelie 
##  5 Adelie 
##  6 Adelie 
##  7 Adelie 
##  8 Adelie 
##  9 Adelie 
## 10 Adelie 
## # ℹ 334 more rows
penguins[,1]
## # A tibble: 344 × 1
##    species
##    <fct>  
##  1 Adelie 
##  2 Adelie 
##  3 Adelie 
##  4 Adelie 
##  5 Adelie 
##  6 Adelie 
##  7 Adelie 
##  8 Adelie 
##  9 Adelie 
## 10 Adelie 
## # ℹ 334 more rows
penguins$species
##   [1] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##   [8] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [15] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [22] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [29] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [36] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [43] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [50] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [57] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [64] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [71] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [78] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [85] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [92] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
##  [99] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
## [106] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
## [113] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
## [120] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
## [127] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
## [134] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
## [141] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
## [148] Adelie    Adelie    Adelie    Adelie    Adelie    Gentoo    Gentoo   
## [155] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [162] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [169] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [176] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [183] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [190] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [197] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [204] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [211] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [218] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [225] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [232] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [239] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [246] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [253] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [260] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [267] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
## [274] Gentoo    Gentoo    Gentoo    Chinstrap Chinstrap Chinstrap Chinstrap
## [281] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [288] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [295] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [302] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [309] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [316] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [323] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [330] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [337] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [344] Chinstrap
## Levels: Adelie Chinstrap Gentoo
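
The difference between getting a vector and getting a table back matters when you pass a column to functions such as mean(). The sketch below uses a small made-up base R data frame; [[ ]] is a further base R way to extract a single column.

```r
toy <- data.frame(animal = c("cat", "dog", "owl"), weight = c(4.2, 12.5, 0.6))

toy$weight                     # numeric vector: 4.2 12.5 0.6
toy[["weight"]]                # the same vector, with the name given as a string
toy[, "weight", drop = FALSE]  # a one-column data frame

identical(toy$weight, toy[["weight"]])  # TRUE

# Functions that expect a vector work directly on the $ result
mean(toy$weight)
```

Unlike a tibble, a base R data frame simplifies toy[, "weight"] to a vector unless you add drop = FALSE.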

table() function

The table() function can be used to tabulate one or more variables. For example, let’s find out how many observations (i.e. individual penguins) we have per species.

table(penguins$species)
## 
##    Adelie Chinstrap    Gentoo 
##       152        68       124

Using logical operations, we can create more complex tabulations. For example, below, we show how many penguins per species have an above-average body mass.

summary(penguins$body_mass_g)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2700    3550    4050    4202    4750    6300       2
table(penguins$species, penguins$body_mass_g > mean(penguins$body_mass_g, na.rm = TRUE))
##            
##             FALSE TRUE
##   Adelie      126   25
##   Chinstrap    61    7
##   Gentoo        6  117
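
It can help to build such a cross-tabulation step by step. The following sketch uses four invented animals; the comparison first produces a logical vector, which table() then counts per group.

```r
group <- c("A", "A", "B", "B")
mass  <- c(3000, 4500, 5200, 3900)

mean(mass)                      # 4150
above_avg <- mass > mean(mass)  # FALSE TRUE TRUE FALSE
above_avg

# Rows are groups, columns are the FALSE/TRUE values of the comparison
table(group, above_avg)
```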

Exercise 4

Create a table showing the distribution of penguin sex by species. What do you notice about the missing values?

NAs in R

NA is how R denotes missing values. For certain functions, NAs cause problems.

vec <- c(4, 1, 2, NA, 3)
mean(vec) #Result is NA!
## [1] NA
sum(vec) #Result is NA!
## [1] NA

We can tell R to remove the NA and execute the function on the remainder of the data.

# Spell out TRUE: T is an ordinary variable that can be overwritten
mean(vec, na.rm = TRUE)
## [1] 2.5
sum(vec, na.rm = TRUE)
## [1] 10
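
Before removing missing values, it is often useful to find out where they are and how many there are. A minimal sketch with the base R function is.na():

```r
vec <- c(4, 1, 2, NA, 3)

is.na(vec)        # FALSE FALSE FALSE  TRUE FALSE
sum(is.na(vec))   # 1, because TRUE counts as 1
vec[!is.na(vec)]  # 4 1 2 3

# Comparing with == does not work for NA: the result is itself NA
NA == NA          # NA
```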

Adding observations

Let’s add another observation to the data. Suppose we wanted to add a hypothetical observation for a new Adelie penguin. We can use the rbind() function to do so. rbind() stands for “row bind.” Save the output in a new data frame!

new_penguin <- data.frame(
  species = "Adelie",
  island = "Dream", 
  bill_length_mm = 40.5,
  bill_depth_mm = 18.2,
  flipper_length_mm = 195,
  body_mass_g = 3500,
  sex = "male",
  rec_year = 2020
)

penguins_new <- rbind(penguins, new_penguin)
dim(penguins_new)  # Check that we added one row
## [1] 345   8
tail(penguins_new, 2)  # Look at the last two rows
## # A tibble: 2 × 8
##   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>     <fct>           <dbl>         <dbl>             <dbl>       <dbl>
## 1 Chinstrap Dream            50.2          18.7               198        3775
## 2 Adelie    Dream            40.5          18.2               195        3500
## # ℹ 2 more variables: sex <fct>, rec_year <dbl>

Working with R II

Saving data

Suppose we wanted to save a newly created data frame. We have multiple options to do so. If we wanted to save it in R's native .RData format, we would run the following command.

# Make sure you specified the right working directory!
save(penguins_new, file = "penguins_new.RData")

Most of the time, however, we would want to save our data in formats that can be read by other programs as well. .csv is an obvious choice.

write.csv(penguins_new, file = "penguins_new.csv")
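
Saved files can be read back with load() and read.csv(). The sketch below is a self-contained round trip on a made-up data frame; it writes to temporary files created with tempfile() so that nothing in your working directory is touched. The argument row.names = FALSE is an addition here: without it, write.csv() stores the row names as an extra first column.

```r
toy <- data.frame(animal = c("cat", "dog"), weight = c(4.2, 12.5))

csv_path   <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".RData")

# CSV round trip
write.csv(toy, file = csv_path, row.names = FALSE)
toy_from_csv <- read.csv(csv_path)

# .RData round trip: load() restores the object under its original name
save(toy, file = rdata_path)
rm(toy)
load(rdata_path)

all.equal(toy, toy_from_csv)
```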

Dealing with errors in R

Errors in R occur when code is used in a way that was not intended. For example, when you try to add two character strings, you will get the following error:

"hello" + "world"
Error in "hello" + "world": non-numeric argument to binary operator

Normally, when something has gone wrong with your program, R will notify you of an error in the console. Errors prevent the code from running, while warnings let the code run but tell you that something may be off. In the following case, the code will run, but you will notice that the string “three” is turned into an NA.

as.numeric(c("1", "2", "three"))
Warning: NAs introduced by coercion
[1]  1  2 NA

Since we will be using widely used packages and functions over the course of the semester, the errors that you come across while completing your assignments will be familiar to other R users. Most errors occur because of typos. A Google search of the error message can take you a long way as well. Most of the time, the first entry on stackoverflow.com will solve the problem.
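
If you want a script to keep running after an error, base R offers tryCatch(). The following is a minimal sketch reusing the two examples above; the object names safe_add and numbers are made up.

```r
# Catch the error, print its message, and return NA instead of stopping
safe_add <- tryCatch(
  "hello" + "world",
  error = function(e) {
    message("Caught an error: ", conditionMessage(e))
    NA
  }
)
safe_add  # NA

# What was probably intended: joining two strings
paste("hello", "world")  # "hello world"

# A warning can be silenced once you have checked that the NA is expected
numbers <- suppressWarnings(as.numeric(c("1", "2", "three")))
numbers   # 1 2 NA
```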


The double colon operator ::

You will often see functions preceded by their package name and two colons, for example: readr::read_rds(). The double colon operator :: helps us ensure that we select functions from a particular package. We use the operator to explicitly state where the function is coming from. This may become even more important when you are doing data analysis as part of a team later in your careers. Though it is unlikely to be a problem during the course, we can try to employ the convention package_name::function() to ensure that we do not encounter errors when knitting:

dplyr::select()

Let’s look at what happens when we load tidyverse.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

You may notice that R points out some conflicts where some functions are being masked. For the rest of this session, filter() will refer to the version from the dplyr package by default. If you run code that relies on filter() from the stats package without writing stats::filter(), it will probably result in errors.
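
To see that the two filter() functions really do different things, you can call each one with its package prefix. This sketch uses the built-in mtcars data and assumes dplyr is installed; the object names are made up.

```r
# stats::filter() applies a linear filter, here a 3-point moving average
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
moving_avg

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
if (requireNamespace("dplyr", quietly = TRUE)) {
  four_cyl <- dplyr::filter(mtcars, cyl == 4)
  nrow(four_cyl)  # 11
}
```

With the prefix, both calls work regardless of which packages are currently attached.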


Tidyverse

Now, let us leave base R behind and make our lives much easier by introducing you to the tidyverse. ✨ We think for data cleaning, wrangling, and plotting, the tidyverse really is a no-brainer. A few good reasons for teaching the tidyverse are:

  • Outstanding documentation and community support
  • Consistent philosophy and syntax
  • Convenient “front-end” for more advanced methods

Read more on this here if you like.

But… this certainly shouldn’t put you off learning base R alternatives.

  • Base R is extremely flexible and powerful (and stable).
  • There are some things that you’ll have to venture outside of the tidyverse for.
  • A combination of tidyverse and base R is often the best solution to a problem.
  • Excellent base R data manipulation tutorial here.

Tidyverse packages

Why is it called the tidyverse? Well, as we saw above when we loaded the tidyverse package, it actually loaded a number of packages (which could also be loaded individually): ggplot2, tibble, dplyr, etc. We can also see information about the package versions and some namespace conflicts.

The tidyverse actually comes with a lot more packages than those that are just loaded automatically.

tidyverse_packages()
##  [1] "broom"         "conflicted"    "cli"           "dbplyr"       
##  [5] "dplyr"         "dtplyr"        "forcats"       "ggplot2"      
##  [9] "googledrive"   "googlesheets4" "haven"         "hms"          
## [13] "httr"          "jsonlite"      "lubridate"     "magrittr"     
## [17] "modelr"        "pillar"        "purrr"         "ragg"         
## [21] "readr"         "readxl"        "reprex"        "rlang"        
## [25] "rstudioapi"    "rvest"         "stringr"       "tibble"       
## [29] "tidyr"         "xml2"          "tidyverse"

We’ll use several of these additional packages during the remainder of this course.

Underlying these packages are two key ideas, which the next sections introduce: the pipe operator and tidy data.


The pipe %>% operator

You might have seen the pipe operator in the past.

The beauty of pipes

  • The forward-pipe operator %>% pipes the left-hand side values forward into expressions on the right-hand side.
  • It serves the natural way of reading (“do this, then this, then this, …”).
  • We replace f(x) with x %>% f().
  • It avoids nested function calls.
  • It minimizes the need for local variables and function definitions.

The classic way

hertie(
  bvg(
    walk(
      breakfast(
        shower(
          wake_up(
            Alex, 7
          ),
          temp = 38
        ),
        c("coffee", "croissant")
      ),
      step_function()
    ),
    train = "U2",
    destination = "Stadtmitte"
  ),
  course = "Intro to DS"
)

The classic way, nightmare edition

alex_awake <- wake_up(Alex, 7)
alex_showered <- shower(alex_awake, 
                        temp = 38)
alex_replete <- breakfast(alex_showered, 
                          c("coffee", "croissant"))
alex_underway <- walk(alex_replete, 
                      step_function())
alex_on_train <- bvg(alex_underway, 
                     train = "U2", 
                     destination = "Stadtmitte")
alex_hertie <- hertie(alex_on_train, 
                      course = "Intro to DS")

The pipe way

Alex %>%
  wake_up(7) %>%
  shower(temp = 38) %>%
  breakfast(c("coffee", "croissant")) %>%
  walk(step_function()) %>%
  bvg(
    train = "U2",
    destination = "Stadtmitte"
  ) %>%
  hertie(course = "Intro to DS")
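
The functions in the Alex example are fictional, so that code cannot be run. Here is a small runnable sketch of the same idea with real base R functions. It assumes the magrittr package is installed, which provides %>% and is also made available by library(tidyverse).

```r
library(magrittr)  # provides %>%

x <- c(1, 4, 9, 16)

# Nested: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped: read from left to right ("take x, then sqrt, then mean, then round")
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)  # TRUE, both are 2.5
```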

Piping etiquette

  • Pipes are not very handy when you need to manipulate more than one object at a time. Reserve pipes for a sequence of steps applied to one primary object.
  • Don’t use the pipe when there are meaningful intermediate objects that can be given informative names (and that are used later on).
  • %>% should always have a space before it, and should usually be followed by a new line.

The base R pipe: |>

The magrittr pipe has proven so successful and popular that the R core team recently added a “native” pipe operator to base R (version 4.1), denoted |>. Here’s how it works:

mtcars |> subset(cyl == 4) |> head()
mtcars |> subset(cyl == 4) |> (\(x) lm(mpg ~ disp, data = x))()

Now, should we use the magrittr pipe or the native pipe? The native pipe might make more sense in the long term, since it avoids a package dependency and might be more efficient. Check out this Stack Overflow post and this Tidyverse blog post for a discussion of the differences.

You can update your settings, if you’d like RStudio to default to the native pipe operator |>.
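
One practical difference between the two pipes is the placeholder for passing the left-hand side to an argument other than the first. The sketch below assumes R 4.2 or later, where the native pipe accepts _ as a placeholder for a named argument; the object names fit_magrittr and fit_native are made up for illustration.

```r
library(magrittr)  # needed only for %>%

# magrittr pipe: `.` marks where the left-hand side should go
fit_magrittr <- mtcars %>%
  subset(cyl == 4) %>%
  lm(mpg ~ disp, data = .)

# native pipe (assumes R >= 4.2): `_` is the placeholder
# and must be used with a named argument
fit_native <- mtcars |>
  subset(cyl == 4) |>
  lm(mpg ~ disp, data = _)

# Both models have the same coefficients
all.equal(coef(fit_magrittr), coef(fit_native))
```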


Tidy Data

Generally, we will encounter data in a tidy format. Tidy data refers to a way of mapping the structure of a data set. In a tidy data set:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.
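
As a rough illustration of these three rules, the sketch below reshapes a small untidy table, where one variable is spread across column headers, into a tidy one. The table and its counts are made up, and tidyr::pivot_longer() (from the tidyr package, part of the tidyverse) is just one possible way to do the reshaping.

```r
# Made-up untidy table: the year variable is hidden in the column names
untidy <- tibble::tibble(
  species = c("Adelie", "Gentoo"),
  `2007`  = c(10, 12),
  `2008`  = c(11, 9)
)

# Tidy version: one column per variable, one row per observation
tidy <- untidy |>
  tidyr::pivot_longer(
    cols      = c(`2007`, `2008`),
    names_to  = "rec_year",
    values_to = "n_penguins"
  )

tidy
```

After reshaping, each row describes one species in one year, and the year is an ordinary column that can be filtered or grouped on.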


Data manipulation with dplyr

A second fundamental package of the tidyverse is called dplyr. In this section you’ll learn and practice examples using some functions in dplyr to work with data. Those are:

  • dplyr::select(): Select (i.e. subset) columns by their names (keep or exclude some columns)
  • dplyr::filter(): Filter (i.e. subset) rows based on their values (keep rows that satisfy your conditions)
  • dplyr::mutate(): Create new columns or edit existing ones
  • dplyr::group_by(): Define groups within your data set
  • dplyr::summarize(): Collapse multiple rows into a single summary value (summary statistics)
  • dplyr::arrange(): Arrange (i.e. reorder) rows based on their values (reorder rows according to single or multiple variables)

To demonstrate and practice how these verbs (functions) work, we’ll use the penguins dataset we saw earlier.


dplyr::select()

The first verb (function) we will utilize is dplyr::select(). We can employ it to manipulate our data based on columns. If you recall from our initial exploration of the data set there were eight variables attached to every observation. Do you recall them? If you do not, there is no problem. You can utilize names() to retrieve the names of the variables in a data frame.

names(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "rec_year"

Say we are only interested in the species, island, and rec_year variables of these data. We can use the following syntax:

dplyr::select(data, columns)

Exercise 1

The following code chunk would select the variables we need. Can you adapt it, so that we keep the body_mass_g and sex variables as well?

dplyr::select(penguins, species, island, rec_year)

Good to know: To drop variables, use - before the variable name, e.g. select(penguins, -rec_year) to drop the rec_year column (select everything but the rec_year column).
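
Beyond naming columns one by one, select() also understands a few shortcuts. The sketch below uses a made-up three-row stand-in called toy_penguins (not the real data) to show dropping a column, selecting a range of adjacent columns, and selecting by name prefix.

```r
# Made-up stand-in for the penguins data (values are illustrative only)
toy_penguins <- tibble::tibble(
  species        = c("Adelie", "Gentoo", "Chinstrap"),
  island         = c("Torgersen", "Biscoe", "Dream"),
  bill_length_mm = c(39.1, 46.1, 46.5),
  bill_depth_mm  = c(18.7, 13.2, 17.9),
  rec_year       = c(2007L, 2008L, 2009L)
)

# Drop a single column
dropped <- dplyr::select(toy_penguins, -rec_year) |> names()

# Keep a range of adjacent columns
range_cols <- dplyr::select(toy_penguins, species:island) |> names()

# Keep all columns whose name starts with "bill"
bill_cols <- dplyr::select(toy_penguins, dplyr::starts_with("bill")) |> names()

dropped
range_cols
bill_cols
```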


dplyr::filter()

The second verb (function) we will employ is dplyr::filter(). dplyr::filter() lets you use a logical test to extract specific rows from a data frame. To use dplyr::filter(), pass it the data frame followed by one or more logical tests. dplyr::filter() will return every row that passes each logical test.

The most commonly used comparison and logical operators are:

  • ==: Equal to
  • !=: Not equal to
  • >, >=: Greater than, greater than or equal to
  • <, <=: Less than, less than or equal to
  • &, |: And, or

Say we are interested in retrieving the observations from the year 2007. We would do:

dplyr::filter(penguins, rec_year == 2007)

# same as writing
# penguins %>% dplyr::filter(rec_year == 2007)
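
To see how several logical tests combine, here is a small sketch on a made-up four-row table (toy_penguins is a hypothetical stand-in, not the real data). It shows that comma-separated tests act like &, that | keeps rows passing either test, and that rows where a test evaluates to NA are dropped.

```r
# Made-up stand-in for the penguins data (values are illustrative only)
toy_penguins <- tibble::tibble(
  species     = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750, NA, 5000, 3800),
  rec_year    = c(2007L, 2008L, 2008L, 2009L)
)

# Comma-separated tests are combined with "and"
both <- dplyr::filter(toy_penguins, species == "Adelie", rec_year == 2007)

# | keeps rows that pass at least one of the tests
either <- dplyr::filter(toy_penguins, species == "Gentoo" | rec_year == 2009)

# Rows where the test is NA (missing body mass) are not returned
heavy <- dplyr::filter(toy_penguins, body_mass_g > 3700)

nrow(both)
nrow(either)
nrow(heavy)
```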

Exercise 2

We can leverage the pipe operator to sequence our code in a logical manner. Can you adapt the following code chunk with the pipe and conditional logical operators we discussed?

only_2009 <- dplyr::filter(penguins, rec_year == 2009)
only_2009_chinstraps <- dplyr::filter(only_2009, species == "Chinstrap")
only_2009_chinstraps_species_sex_year <- dplyr::select(only_2009_chinstraps, species, sex, rec_year)
final_df <- only_2009_chinstraps_species_sex_year
final_df #to print it in our console
## # A tibble: 24 × 3
##    species   sex    rec_year
##    <fct>     <fct>     <int>
##  1 Chinstrap female     2009
##  2 Chinstrap male       2009
##  3 Chinstrap female     2009
##  4 Chinstrap male       2009
##  5 Chinstrap male       2009
##  6 Chinstrap female     2009
##  7 Chinstrap female     2009
##  8 Chinstrap male       2009
##  9 Chinstrap female     2009
## 10 Chinstrap male       2009
## # ℹ 14 more rows

dplyr::mutate() 🌂 ☂️

dplyr::mutate() lets us create, modify, and delete columns. The most common use for now will be to create new variables based on existing ones. Say we are working with a U.S. American client and they feel more comfortable with assessing the weight of the penguins in pounds. We would utilize mutate() as such:

dplyr::mutate(data, new_var_name = expression_using_old_vars)

penguins |>
  dplyr::mutate(body_mass_lbs = body_mass_g / 453.6)
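
Since mutate() can create, modify, and delete columns, the following sketch shows all three in one call. It runs on a made-up two-row table (toy_penguins and the new column name body_mass_kg are hypothetical).

```r
# Made-up stand-in for the penguins data (values are illustrative only)
toy_penguins <- tibble::tibble(
  species     = c("Adelie", "Gentoo"),
  body_mass_g = c(3750, 5000)
)

toy_changed <- toy_penguins |>
  dplyr::mutate(
    body_mass_kg = body_mass_g / 1000,  # create a new column
    species      = toupper(species),    # modify an existing column
    body_mass_g  = NULL                 # delete a column by setting it to NULL
  )

toy_changed
```

The arguments are evaluated in order, so body_mass_kg can be computed before body_mass_g is removed in the same call.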

dplyr::group_by() and dplyr::summarize()

The two verbs dplyr::group_by() and dplyr::summarize() tend to go together. When combined, dplyr::summarize() will create a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarizing all observations in the input. For example:

# compare this output with the one below
penguins |>
  dplyr::summarize(heaviest_penguin = max(body_mass_g, na.rm = TRUE))
## # A tibble: 1 × 1
##   heaviest_penguin
##              <int>
## 1             6300

penguins |>
  dplyr::group_by(species) |>
  dplyr::summarize(heaviest_penguin = max(body_mass_g, na.rm = TRUE)) |>
  dplyr::ungroup()
## # A tibble: 3 × 2
##   species   heaviest_penguin
##   <fct>                <int>
## 1 Adelie                4775
## 2 Chinstrap             4800
## 3 Gentoo                6300

There is also an alternative approach to calculating grouped summary statistics, called per-operation grouping (available since dplyr 1.1.0). It lets you define groups through the .by argument, passed directly in the summarize() call. These groups don’t persist in the output, whereas groups created with group_by() do. You can read more about both approaches in R for Data Science, 2nd edition.

penguins |>
  dplyr::summarize(heaviest_penguin = max(body_mass_g, na.rm = TRUE), .by = species)
## # A tibble: 3 × 2
##   species   heaviest_penguin
##   <fct>                <int>
## 1 Adelie                4775
## 2 Gentoo                6300
## 3 Chinstrap             4800

penguins |>
  dplyr::summarize(heaviest_penguin = max(body_mass_g, na.rm = TRUE), .by = c(species, sex))
## # A tibble: 8 × 3
##   species   sex    heaviest_penguin
##   <fct>     <fct>             <int>
## 1 Adelie    male               4775
## 2 Adelie    female             3900
## 3 Adelie    <NA>               4250
## 4 Gentoo    female             5200
## 5 Gentoo    male               6300
## 6 Gentoo    <NA>               4875
## 7 Chinstrap female             4150
## 8 Chinstrap male               4800

Notice that we are using dplyr::ungroup() after performing grouped calculations with group_by(). It is a convention we encourage. If you forget to ungroup() your data, later operations will still be carried out per group, which can produce unexpected results or errors downstream. Just to be safe, use dplyr::ungroup() when you’ve finished with your calculations.
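
To illustrate what can go wrong, the sketch below computes the same mean on grouped and on ungrouped data. The table is a made-up four-row example (toy_penguins and mean_mass are hypothetical names).

```r
# Made-up stand-in for the penguins data (values are illustrative only)
toy_penguins <- tibble::tibble(
  species     = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3000, 4000, 5000, 6000)
)

grouped <- toy_penguins |>
  dplyr::group_by(species)

# Grouping is still active: the mean is computed within each species
still_grouped <- grouped |>
  dplyr::mutate(mean_mass = mean(body_mass_g))

# After ungroup(): the mean is computed over all rows
ungrouped <- grouped |>
  dplyr::ungroup() |>
  dplyr::mutate(mean_mass = mean(body_mass_g))

still_grouped$mean_mass  # 3500 3500 5500 5500
ungrouped$mean_mass      # 4500 4500 4500 4500
```

Neither call raises an error, which is exactly why a forgotten grouping is easy to miss.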

Exercise 3

Can you get the weight of the lightest penguin of each species? You can use min(). What happens when, in addition to species, you also group by year, i.e. group_by(species, rec_year)?


dplyr::arrange() 🥚 🐣 🐥

The dplyr::arrange() verb is pretty self-explanatory. dplyr::arrange() orders the rows of a data frame by the values of selected columns, in ascending order by default. You can wrap a column in the desc() function to arrange in descending order. The following chunk arranges the data frame by the length of the penguins’ bill, first in ascending and then in descending order.

penguins |>
  dplyr::arrange(bill_length_mm)
penguins |>
  dplyr::arrange(desc(bill_length_mm))
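
arrange() also accepts several columns, using later ones to break ties in earlier ones. The following sketch, on a made-up four-row table (toy_penguins is a hypothetical stand-in), sorts by species first and then by descending bill length. It also shows that missing values end up last within their group, even inside desc().

```r
# Made-up stand-in for the penguins data (values are illustrative only)
toy_penguins <- tibble::tibble(
  species        = c("Gentoo", "Adelie", "Adelie", "Gentoo"),
  bill_length_mm = c(46.1, 39.1, NA, 50.0)
)

# Sort by species (ascending), then by bill length (descending)
sorted <- toy_penguins |>
  dplyr::arrange(species, dplyr::desc(bill_length_mm))

sorted
```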

Exercise 4

Can you create a data frame, arranged by body_mass_g, of the penguins observed on “Dream” island?


Optional: Other dplyr functions

dplyr::slice(): Subset rows by position rather than filtering by values.

penguins |> dplyr::slice(c(1, 5))

dplyr::pull(): Extract a column from a data frame as a vector or scalar.

penguins |> 
  dplyr::filter(sex == "female") |> 
  dplyr::pull(flipper_length_mm)

dplyr::count() and dplyr::distinct(): Count and isolate unique observations.

penguins |> dplyr::count(species)
penguins |> dplyr::distinct(species)

Note: You could also use a combination of dplyr::group_by(), dplyr::summarize(), and dplyr::n(), e.g. penguins |> dplyr::group_by(species) |> dplyr::summarize(num = dplyr::n()).
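
The equivalence mentioned in the note can be checked directly. This is a minimal sketch on a made-up table (toy_penguins is a hypothetical stand-in); the summary column is named n here so that it matches the default column name produced by count().

```r
# Made-up stand-in for the penguins data (values are illustrative only)
toy_penguins <- tibble::tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo", "Adelie")
)

# Shortcut
via_count <- toy_penguins |>
  dplyr::count(species)

# Long form: group, then count the rows in each group
via_summarize <- toy_penguins |>
  dplyr::group_by(species) |>
  dplyr::summarize(n = dplyr::n())

# Same species and same counts in both results
identical(via_count$species, via_summarize$species)
identical(via_count$n, via_summarize$n)
```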


dplyr::where(): Select the variables for which a function returns true.

penguins |> dplyr::select(dplyr::where(is.numeric)) |> names()

dplyr::across(): Summarize or mutate multiple variables in the same way. More information here.

penguins |> dplyr::mutate(dplyr::across(dplyr::where(is.numeric), scale)) |> head(3)
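
across() works inside summarize() as well. The sketch below, on a made-up three-row table (toy_penguins is a hypothetical stand-in), computes the mean of every numeric column per species. It assumes dplyr 1.1.0 or later for the .by argument.

```r
# Made-up stand-in for the penguins data (values are illustrative only)
toy_penguins <- tibble::tibble(
  species        = c("Adelie", "Adelie", "Gentoo"),
  bill_length_mm = c(39, 41, 47),
  body_mass_g    = c(3000, 4000, 5000)
)

# Apply the same summary function to all numeric columns, per species
species_means <- toy_penguins |>
  dplyr::summarize(
    dplyr::across(dplyr::where(is.numeric), \(x) mean(x, na.rm = TRUE)),
    .by = species
  )

species_means
```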

dplyr::case_when(): Vectorize multiple if_else() (or base R ifelse()) statements.

#multiple conditional statements
penguins |> 
  dplyr::mutate( 
    flipper_length_cat = 
      dplyr::case_when(
        flipper_length_mm < 190 ~ "small",
        flipper_length_mm >= 190 & flipper_length_mm < 210 ~ "medium",
        flipper_length_mm >= 210  ~ "large"
      )
  ) |>
  dplyr::pull(flipper_length_cat) |> table()

Window functions: There is also a whole class of window functions for getting leads and lags, ranking, creating cumulative aggregates, etc. See vignette("window-functions", package = "dplyr").
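
As a brief taste of window functions, the sketch below adds a lag, a cumulative sum, and a rank to a made-up table of daily counts. The table and all column names are hypothetical.

```r
# Made-up daily counts (values are illustrative only)
toy_counts <- tibble::tibble(
  day       = 1:4,
  sightings = c(3, 5, 2, 5)
)

windowed <- toy_counts |>
  dplyr::mutate(
    previous      = dplyr::lag(sightings),                     # value from the row before
    running_total = cumsum(sightings),                         # cumulative aggregate
    rank          = dplyr::min_rank(dplyr::desc(sightings))    # 1 = highest count
  )

windowed
```

Unlike summarize(), each of these functions returns one value per input row, so the number of rows stays the same.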


Best practices

Let us conclude the tutorial with some notes on best practices in R. There are many collections of best practices online; some very useful ones are here.

Script structure

When structuring your scripts, remember a few things:

  • Libraries go first
  • Hard-coded variables (e.g. when reading in data) go second
  • Relative paths over absolute paths (give your friends and colleagues a chance to execute your scripts without trouble)
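
Put together, the top of a script following these three points might look like the sketch below. The folder name, file name, and object names are all hypothetical; the file is only read if it actually exists, so the sketch runs anywhere.

```r
# 1. Libraries go first
library(dplyr)

# 2. Hard-coded variables go second
data_dir    <- "data"                   # hypothetical folder inside the project
survey_file <- "survey_data_2024.csv"   # hypothetical file name

# 3. Relative path (relative to the project folder), not an absolute one
survey_path <- file.path(data_dir, survey_file)

# Read the data only if the hypothetical file is present
if (file.exists(survey_path)) {
  survey <- read.csv(survey_path)
}

survey_path
```

Because the path is relative, the same script works on a colleague’s machine as long as they keep the same project folder structure.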

Naming conventions

Always use easy-to-interpret names and don’t use whitespace in file or variable names! Some good examples:

  • survey_data_2024.R
  • student_ids <- c(101, 102, 103)
  • calculate_gpa()

Spacing

Code is somewhat easier to read if you generally leave a space after a comma:

  • my_function(1:10, c(2, 4))

Repetition

When you start copying and pasting code (creating a lot of redundancy), you might want to consider slimming down your code by creating a function. Functions? Let us leave this as a note of caution and return to functions and efficient code in our next session!

Actually learning R 🎒

Again, the key to learning R is: practice! We can only give you an overview of basic R functions, but to really learn R you will have to actively use it yourself, troubleshoot, ask questions, and google! It is very likely that someone else has had the same or a similar enough issue before and that the R community answered it with 5+ different solutions years ago. 😉

To-do!

For those of you who have yet to create a GitHub account, do so now. We will be using GitHub extensively in this course, so make sure you get familiar with version control and project management with Git and GitHub. Everyone has their own preferences, but to get started either GitHub Desktop or the RStudio Git interface are good options. While we won’t have a full lab session on the topic, we will cover it briefly next week. We have also included materials from a previous version of this course under session-00, which you should go through if you still feel unsure about it.

See you next week! 💻


Acknowledgements

This script was drafted by Tom Arendt and Lisa Oswald, with contributions by Steve Kerr, Hiba Ahmad, Carmen Garro, Sebastian Ramirez-Ruiz, Killian Conyngham and Carol Sobral. It draws heavily on the materials for the Introduction to R workshop within the Data Science Seminar series at Hertie, created by Therese Anders, Statistical Modeling and Causal Inference by Sebastian Ramirez-Ruiz and Lisa Oswald.

Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081