Introduction
Getting you up to speed with R
Welcome! 👋
Welcome to your first lab session of the Introduction to Data Science course! The labs are the practical component of the course and rely largely on the R programming language.
There are two tutors for these labs - Killian Conyngham and Carol Sobral. We are here to talk you through the lab each week and help you with any questions you may have.
As there is a variety of experience levels in the room, we will try to go at a pace that fits our schedule and suits as many of you as possible. If we go too fast or too slow, do let us know!
This week’s tutorial will be divided into three parts.
- First, we will recap R basics and the RStudio infrastructure setup.
- Second, we will dive into the Tidyverse, a collection of R packages that are designed to work together to make data science easier and more intuitive.
- Third, we will discuss some best practices in R.
But first things first: Have you all successfully installed R and RStudio? Otherwise, install R first, then install RStudio - or think about updating R, RStudio, and your installed packages.
Have you all registered a free GitHub account? Otherwise, register here.
Working with R
The RStudio interface
RStudio is an integrated development environment (IDE) for R. Think of RStudio as a front end that allows us to write, run, and render R code in a more intuitive way. The following image shows what the standard RStudio interface looks like:
Console: The console provides a means to interact directly with R. You can type some code at the console and when you press ENTER, R will run that code. Depending on what you type, you may see some output in the console or if you make a mistake, you may get a warning or an error message.
Script editor: You will use the script editor to complete your assignments. The script editor is the space where files are displayed. For example, once you download and open the bi-weekly assignment .Rmd template, it will appear here. The editor is where you should place code you care about, since code typed into the console is not saved in a script.
Environment: This area holds the objects you have created with your code. If you run myresult <- 5+3+2, the myresult object will appear there.
Plots and files: This area is where graphic output is displayed. Additionally, if you type a question mark before any function name (e.g. ?mean), its documentation will be displayed here.
Getting Help 🔍
The key to learning R is: Google! We can give you an overview of basic R functions, but to really learn R you will have to actively use it yourself, troubleshoot, ask questions, and google! Help pages such as http://stackoverflow.com offer a rich archive of questions and answers by the R community. For example, if you google “recode data in r” you will find a variety of useful websites explaining how to do this on the first page of the search results. Don’t be surprised if you find a variety of different ways to execute the same task.
RStudio also has a useful help menu. In addition, you can get information on any function or integrated data set in R through the console, for example:
?plot
In addition, there are many free, comprehensive R guides, such as Quick-R at http://www.statmethods.net or the R cookbook at http://www.cookbook-r.com.
You are also surrounded by a wealth of knowledge in your classmates around you. Feel free to collaborate or ask your neighbours questions to help you!
I recommend you avoid using ChatGPT or any other AI tools for coding help in these labs!! The point of these labs is to learn how to code yourself, so make sure you always try each task yourself first and that you understand each line of code. If you have copilot enabled in your RStudio, please disable it for these labs.
Executing a line of code
To execute a single line of code in RStudio, place the cursor on the line you want R to execute and press Command + Return (on macOS) or Ctrl + Enter (on Windows).
To execute multiple lines of code at once, highlight the respective portion of the code and then run it using the operations above.
Objects in R 📦
R stores information as an object. You can name objects whatever you like. A few things to remember though:
- Do not use names that are reserved for built-in functions or functions in the packages you use, such as sum, mean, or abs. Most of the time, R will let you use these as names, but it leads to confusion in your code.
- Do not use spaces or special characters such as $ or %. Common symbols that are used in variable names include . or _.
- Remember that R is case sensitive.
- To assign values to objects, we use the assignment operator <-. Sometimes you will also see = as the assignment operator. This is a matter of preference and subject to debate among R programmers. Personally, I use <- to assign values to objects and = within functions.
- The # symbol is used for commenting and demarcation. Any code following # on the same line will not be executed.
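The naming and assignment rules above can be tried out with a few lines in the console. This is only a small sketch; all object names and values are made up for illustration.

```r
# Assign with <- : the right-hand side is evaluated, then stored under the name.
my_result <- 5 + 3 + 2
my_result                          # 10

# R is case sensitive: my_result and My_Result are two different objects.
My_Result <- "ten"
identical(my_result, My_Result)    # FALSE

# Dots and underscores are allowed in names; spaces, $ and % are not.
body.mass_kg <- 4.2

# = is conventionally kept for arguments inside function calls.
rounded <- round(x = 3.14159, digits = 2)
rounded                            # 3.14
```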
Data Types in R
There are four main variable types you should be familiar with:
- Numerical: Any number. An integer is a numerical variable without any decimals, while a double is a numerical variable with decimals.
- Character: This is what Stata (and other programming languages such as Python) calls a string. We typically store any alphanumeric data that is not ordered as a character vector.
- Logical: A collection of TRUE and FALSE values.
- Factor: A categorical variable with a fixed set of possible values (levels). A factor can be unordered (nominal) or ordered (ordinal).
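To see these types in practice, class() reports how R stores a value. The following sketch uses invented values; only base R functions are involved.

```r
n_int <- 42L       # integer (the L suffix asks for integer storage)
n_dbl <- 3.75      # double
txt   <- "Adelie"  # character
flag  <- n_dbl > 3 # logical, the result of a comparison
sizes <- factor(c("small", "large", "small"),
                levels = c("small", "large"))  # factor with two levels

class(n_int)       # "integer"
class(n_dbl)       # "numeric"
class(txt)         # "character"
class(flag)        # "logical"
class(sizes)       # "factor"
levels(sizes)      # "small" "large"

# Integers and doubles both count as numeric.
is.numeric(n_int)  # TRUE
```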
Data Structures in R
R has many data structures, the most important ones for now are:
- Vectors: The most basic data structure in R; a sequence of elements of the same type.
- Factors: Special vectors used to store categorical data with fixed possible values (levels).
- Lists: A collection that can hold elements of different types, including other data structures.
- Matrices: A two-dimensional structure where all elements must be of the same type.
- Data frames: A table-like structure where each column can be of a different type, often used for datasets.
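As a rough illustration of the structures listed above, each one can be built with a single base R call. The names and values here are placeholders.

```r
v  <- c(1.5, 2, 3.25)                        # vector: all elements share one type
f  <- factor(c("a", "b", "a"))               # factor: categorical, levels "a" and "b"
l  <- list(id = 1L, name = "x", values = v)  # list: elements of different types
m  <- matrix(1:6, nrow = 2, ncol = 3)        # matrix: 2 rows, 3 columns, one type
df <- data.frame(id = 1:3,
                 label = c("a", "b", "c"),
                 value = v,
                 stringsAsFactors = FALSE)   # data frame: one type per column

# Mixing types in a vector forces everything to the most general type.
c(1, "a", TRUE)     # "1" "a" "TRUE"

length(l)           # 3
dim(m)              # 2 3
sapply(df, class)   # id: "integer", label: "character", value: "numeric"
```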
R Packages 📁
For the most part, R Packages are collections of code and functions that leverage R programming to expand on the basic functionalities. There are a plethora of packages in R designed to facilitate the completion of tasks, built by the R community. In later parts of this course, you will also learn how to write packages yourself.
In R, you only need to install a package once. After that, you only need to load the package in each new session. As a good practice, I recommend running the code to install packages only in your R console, not in the code editor. You can install a package with install.packages("name_of_your_package").
Once the package has been installed, you just need to “call it” every time you want to use it in a file by running:
library("name_of_your_package") # either of these lines will load the package
library(name_of_your_package) # library() understands the name with or without quotation marks
It is extremely important that you do not have any lines installing packages in your assignments, because the file will fail to knit.
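One way to keep installation out of your files is to leave the install line commented out and only check whether a package is available. The sketch below is an assumption about how you might organise this, not a course requirement; is_installed() is a hypothetical helper built on base R's requireNamespace().

```r
# Install once, from the console (kept as a comment so nothing installs on knit):
# install.packages("palmerpenguins")

# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")   # TRUE, because stats ships with every R installation

# Load a package at the top of each script or .Rmd file:
library(stats)
```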
Working directory
The working directory is just a file path on your computer that sets the default location of any files you read into R or save out of R. Normally, when you open RStudio it will have a default directory (a folder on your computer). You can check your directory by running getwd() in your console.
When RStudio is closed and you open a file from Finder on macOS or File Explorer on Windows, the default working directory will be the folder where the file is stored.
Setting your working directory
You can set your directory manually from RStudio: use the menu to change your working directory under Session > Set Working Directory > Choose Directory.
You can also use the setwd() function, passing the path of the folder as a character string.
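A minimal sketch of checking and changing the working directory from code. It uses tempdir() only because that folder is guaranteed to exist on any machine; in practice you would pass the path of your own course folder.

```r
old_wd <- getwd()    # remember where we started
old_wd

setwd(tempdir())     # switch to a folder that always exists
getwd()              # now points at the temporary folder

setwd(old_wd)        # switch back to the original directory
getwd()
```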
R File types
There are two main file types you will be working with in this course: .R files and .Rmd files. The former are simply scripts of R code, while the latter are R Markdown files that allow you to combine text and code in a single document. You can “knit” R Markdown files to produce outputs in a variety of formats, including HTML, PDF, and Word. This whole lab is just a fancy .Rmd file!
Working with R Projects
Another option that allows you to circumvent the folder structure mess to some degree is to use R Projects. We highly recommend working in Projects because they are easily integrated with Git and GitHub!
Data Frames
Let’s now go a bit deeper with data frames using data for penguin sizes recorded by Dr. Kristen Gorman and others at several islands in the Palmer Archipelago, Antarctica. Data are originally published in: Gorman KB, Williams TD, Fraser WR (2014) PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081
The 3 species of penguins in this data set are Adelie, Chinstrap and Gentoo. The data set contains 8 variables:
- species: a factor denoting the penguin species (Adelie, Chinstrap, or Gentoo)
- island: a factor denoting the island (in Palmer Archipelago, Antarctica) where observed
- bill_length_mm: a number denoting length of the dorsal ridge of penguin bill (millimeters)
- bill_depth_mm: a number denoting the depth of the penguin bill (millimeters)
- flipper_length_mm: an integer denoting penguin flipper length (millimeters)
- body_mass_g: an integer denoting penguin body mass (grams)
- sex: a factor denoting penguin sex (female, male)
- year: an integer denoting the year of the record
Handily, these data are contained in a package called palmerpenguins. To make the palmerpenguins package available for use, install it and then use the library() command to load it. While packages need to be installed only once, the library() command needs to be run in every new session in which you want to use a particular package.
Loading the data
Once we have loaded the package we can load the data set by simply typing data(penguins). The data will be stored in a data frame called penguins. (Note: Usually, you would load data from a file on your computer or from a URL, as most datasets are not packaged.)
Data Frame Structure
Let’s find out what these data look like. First, we can print out the data frame to the console. If we just type the name of the data frame, R will print out the first 10 rows and as many columns as fit on the screen, because penguins is stored as a tibble.
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
Next we can use the str() function to explore the variable names and which data class they are stored in. Note: int stands for integer and is a special case of the class numeric.
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Exercise 1
Look at the output from the function above. How many observations (rows) and variables (columns) does the penguins dataset have? What different data types do you see?
If we are only interested in what the variables are called, we can use the names() function.
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
We can alter the names of vectors by using the names() function and indexing. Because data frames are essentially just combinations of vectors, we can do the same for variable names inside data frames. Suppose we want to rename the variable year to rec_year.
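Renaming with names() and indexing might look like the following sketch. It uses a small made-up data frame (toy) rather than penguins, so the column positions are assumptions of the example.

```r
toy <- data.frame(species = c("Adelie", "Gentoo"),
                  year    = c(2007L, 2008L))
names(toy)                      # "species" "year"

# Rename by position: the second name becomes "rec_year".
names(toy)[2] <- "rec_year"
names(toy)                      # "species" "rec_year"

# Renaming by matching the old name is safer if column order might change.
toy2 <- data.frame(year = 2009L, species = "Chinstrap")
names(toy2)[names(toy2) == "year"] <- "rec_year"
names(toy2)                     # "rec_year" "species"
```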
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "rec_year"
We can use the summary() function to get a first look at the data.
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex rec_year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
A data frame has two dimensions: rows and columns.
## [1] 344
## [1] 8
## [1] 344 8
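The three base R functions nrow(), ncol(), and dim() report these dimensions. A small sketch on a made-up data frame:

```r
toy <- data.frame(id     = 1:4,
                  mass_g = c(3750, 3800, 3250, 3450))

nrow(toy)   # 4 rows
ncol(toy)   # 2 columns
dim(toy)    # 4 2 (rows first, then columns)
```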
Accessing elements of a data frame - Indexing
As a rule, whenever we access a data frame or another object using two-dimensional indexing in R, the order is: [row, column]. To access the first row of the data frame, we specify the row we want to see and leave the column slot following the comma empty.
## # A tibble: 1 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## # ℹ 2 more variables: sex <fct>, rec_year <int>
We can use the concatenate function c() to access multiple rows (or columns) at once. Below we print out the first and second row of the data frame.
## # A tibble: 2 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## # ℹ 2 more variables: sex <fct>, rec_year <int>
We can also access a range of rows by separating the minimum and maximum value with a colon (:). Below we print out the first five rows of the data frame.
## # A tibble: 5 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## # ℹ 2 more variables: sex <fct>, rec_year <int>
If we try to access a column that is out of bounds, R returns an error. (An out-of-bounds row index typically returns a row of NAs rather than an error.)
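The indexing patterns from this section, collected in one sketch. It uses a small made-up base R data frame; note that a tibble such as penguins keeps a one-column table where a base data frame returns a plain vector.

```r
toy <- data.frame(
  id      = 1:5,
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  mass_g  = c(3750, 3800, 5000, 5200, 3700)
)

toy[1, ]                          # first row, all columns
toy[c(1, 3), ]                    # rows 1 and 3
toy[2:4, ]                        # rows 2 through 4
toy[, 2]                          # second column (a vector for a base data frame)
toy[3, "mass_g"]                  # 5000: row 3 of the column named mass_g
toy[2:3, c("species", "mass_g")]  # rows 2-3, two named columns

# toy has only 3 columns, so asking for column 4 is an error.
# tryCatch() stores the error message instead of stopping the script.
out_of_bounds <- tryCatch(toy[, 4], error = function(e) conditionMessage(e))
out_of_bounds
```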
Exercise 2
Access the species information (column 1) for the 10th penguin in the dataset using indexing.
Exercise 3
Access rows 15 through 20 and columns 1 through 3 of the penguins dataset.
The $ operator
The $ operator in R is used to specify a variable within a data frame. This is an alternative to indexing. All the following commands will return essentially the same output: the species of all penguins in the dataset.
## [1] 1
## # A tibble: 344 × 1
## species
## <fct>
## 1 Adelie
## 2 Adelie
## 3 Adelie
## 4 Adelie
## 5 Adelie
## 6 Adelie
## 7 Adelie
## 8 Adelie
## 9 Adelie
## 10 Adelie
## # ℹ 334 more rows
## # A tibble: 344 × 1
## species
## <fct>
## 1 Adelie
## 2 Adelie
## 3 Adelie
## 4 Adelie
## 5 Adelie
## 6 Adelie
## 7 Adelie
## 8 Adelie
## 9 Adelie
## 10 Adelie
## # ℹ 334 more rows
## [1] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [8] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [15] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [22] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [29] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [36] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [43] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [50] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [57] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [64] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [71] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [78] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [85] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [92] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [99] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [106] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [113] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [120] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [127] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [134] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [141] Adelie Adelie Adelie Adelie Adelie Adelie Adelie
## [148] Adelie Adelie Adelie Adelie Adelie Gentoo Gentoo
## [155] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [162] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [169] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [176] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [183] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [190] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [197] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [204] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [211] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [218] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [225] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [232] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [239] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [246] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [253] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [260] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [267] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
## [274] Gentoo Gentoo Gentoo Chinstrap Chinstrap Chinstrap Chinstrap
## [281] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [288] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [295] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [302] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [309] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [316] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [323] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [330] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [337] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
## [344] Chinstrap
## Levels: Adelie Chinstrap Gentoo
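For comparison, here are the common ways of pulling one column out of a data frame, shown on a small made-up example. The toy object is a base data frame, so the comments describe base R behaviour.

```r
toy <- data.frame(species = c("Adelie", "Gentoo", "Chinstrap"),
                  mass_g  = c(3750, 5000, 3700))

toy$species        # the column as a vector
toy[["species"]]   # the same vector, selected by name
toy[, "species"]   # the same vector for a base data frame
toy["species"]     # a one-column data frame, not a vector

identical(toy$species, toy[["species"]])   # TRUE

# Because $ returns a vector, it can be passed straight to other functions.
mean(toy$mass_g)   # 4150
```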
table() function
The table() function can be used to tabulate one or more variables. For example, let’s find out how many observations (i.e. individual penguins) we have per species.
##
## Adelie Chinstrap Gentoo
## 152 68 124
Using logical operations, we can create more complex tabularizations. For example, below, we show how many penguins have above average body mass per species.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2700 3550 4050 4202 4750 6300 2
##
## FALSE TRUE
## Adelie 126 25
## Chinstrap 61 7
## Gentoo 6 117
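Both kinds of tabulation can be sketched on two short made-up vectors: a simple count per category, and a cross-tabulation against a logical condition.

```r
species <- c("Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo", "Chinstrap")
mass_g  <- c(3700, 3900, 5000, 5400, NA, 3800)

# One variable: counts per category.
table(species)

# A logical test produces TRUE/FALSE (or NA) for every observation.
# The mean of the five known masses is 4360.
heavy <- mass_g > mean(mass_g, na.rm = TRUE)

# Two variables: species in the rows, the logical result in the columns.
# By default, table() leaves out observations where the result is NA.
table(species, heavy)
```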
Exercise 4
Create a table showing the distribution of penguin sex by species. What do you notice about the missing values?
NAs in R
NA is how R denotes missing values. For certain functions, NAs cause problems.
## [1] NA
## [1] NA
We can tell R to remove the NAs and execute the function on the remainder of the data.
## [1] 2.5
## [1] 10
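The effect of na.rm can be reproduced with a short made-up vector; the values are chosen so the results are easy to check by hand.

```r
x <- c(1, 2, 3, 4, NA)

mean(x)                  # NA: one missing value makes the result missing
sum(x)                   # NA

mean(x, na.rm = TRUE)    # 2.5: computed on 1, 2, 3, 4 only
sum(x, na.rm = TRUE)     # 10

is.na(x)                 # FALSE FALSE FALSE FALSE TRUE
sum(is.na(x))            # 1 missing value
```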
Adding observations
Let’s add another observation to the data. Suppose we wanted to add a hypothetical observation for a new Adelie penguin. We can use the rbind() function to do so. rbind() stands for “row bind.” Save the output in a new data frame!
new_penguin <- data.frame(
  species = "Adelie",
  island = "Dream",
  bill_length_mm = 40.5,
  bill_depth_mm = 18.2,
  flipper_length_mm = 195,
  body_mass_g = 3500,
  sex = "male",
  rec_year = 2020
)
penguins_new <- rbind(penguins, new_penguin)
dim(penguins_new) # Check that we added one row
## [1] 345 8
## # A tibble: 2 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Chinstrap Dream 50.2 18.7 198 3775
## 2 Adelie Dream 40.5 18.2 195 3500
## # ℹ 2 more variables: sex <fct>, rec_year <dbl>
Working with R II
Saving data
Suppose we wanted to save a newly created data frame. We have multiple options to do so. If we wanted to save it in the native .RData format, we would run the following command.
# Make sure you specified the right working directory!
save(penguins_new, file = "penguins_new.RData")
Most of the time, however, we would want to save our data in formats that can be read by other programs as well. .csv is an obvious choice.
write.csv(penguins_new, file = "penguins_new.csv")
Dealing with errors in R
Errors in R occur when code is used in a way that was not intended. For example, when you try to add two character strings, you will get the following error:
Normally, when something has gone wrong with your program, R will notify you of an error in the console. Some errors prevent the code from running, while other problems only produce warning messages. In the following case, the code will run, but you will notice that the string “three” is turned into an NA.
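The difference between an error and a warning can be reproduced in a few lines. This sketch wraps the calls in tryCatch() purely so the script keeps running and the messages can be inspected; the exact message wording depends on your R version and language settings.

```r
# Error: adding two character strings stops the computation.
error_text <- tryCatch("a" + "b",
                       error = function(e) conditionMessage(e))
error_text

# Warning: the conversion runs, but R reports that something looked wrong.
warning_text <- tryCatch(as.numeric(c("1", "2", "three")),
                         warning = function(w) conditionMessage(w))
warning_text

# The result itself: "three" cannot be read as a number and becomes NA.
nums <- suppressWarnings(as.numeric(c("1", "2", "three")))
nums          # 1 2 NA
```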
Since we will be using widely used packages and functions over the course of the semester, the errors that you come across while completing your assignments will be familiar to other R users. Most errors occur because of typos. A Google search of the error message can take you a long way as well. Most of the time, the first entry on stackoverflow.com will solve the problem.
The double colon operator ::
You may have noted in the previous section that the functions were preceded by their package name and two colons, for example: readr::read_rds(). The double colon operator :: helps us ensure that we select functions from a particular package. We utilize the operator to explicitly state where the function is coming from. This may become even more important when you are doing data analysis as part of a team later in your careers. Though it is likely that this will not be a problem during the course, we can try to employ the convention package_name::function() to ensure that we will not encounter errors in our knitting process.
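A minimal sketch of the package_name::function() convention, using only packages that ship with R. The masking function is deliberately artificial and exists only to show why the prefix helps.

```r
# Calling functions with an explicit package prefix.
stats::median(c(3, 1, 2))   # 2
utils::head(letters, 3)     # "a" "b" "c"

# An object with the same name masks the original function...
mean <- function(x) "masked!"   # artificial example; avoid this in real code
mean(c(1, 2, 3))                # "masked!"

# ...but the prefixed call still reaches the version from base.
base::mean(c(1, 2, 3))          # 2

rm(mean)                        # remove the mask again
mean(c(1, 2, 3))                # 2
```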
Let’s look at what happens when we load tidyverse.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
You may notice that R points out some conflicts where some functions are being masked. The default on this machine will become the filter() from the dplyr package during this session. If you were to run some code that is based on the filter() from the stats package, your code will probably result in errors.
Tidyverse
Now, let us leave base R behind and make our lives much easier by introducing you to the tidyverse. ✨ We think for data cleaning, wrangling, and plotting, the tidyverse really is a no-brainer. A few good reasons for teaching the tidyverse are:
- Outstanding documentation and community support
- Consistent philosophy and syntax
- Convenient “front-end” for more advanced methods
Read more on this here if you like.
But… this certainly shouldn’t put you off learning base R alternatives.
- Base R is extremely flexible and powerful (and stable).
- There are some things that you’ll have to venture outside of the tidyverse for.
- A combination of tidyverse and base R is often the best solution to a problem.
- Excellent base R data manipulation tutorial here.
Tidyverse packages
Why is it called the tidyverse? Well, as we saw above when we loaded the tidyverse package, it actually loaded a number of packages (which could also be loaded individually): ggplot2, tibble, dplyr, etc. We can also see information about the package versions and some namespace conflicts.
The tidyverse actually comes with a lot more packages than those that are just loaded automatically.
## [1] "broom" "conflicted" "cli" "dbplyr"
## [5] "dplyr" "dtplyr" "forcats" "ggplot2"
## [9] "googledrive" "googlesheets4" "haven" "hms"
## [13] "httr" "jsonlite" "lubridate" "magrittr"
## [17] "modelr" "pillar" "purrr" "ragg"
## [21] "readr" "readxl" "reprex" "rlang"
## [25] "rstudioapi" "rvest" "stringr" "tibble"
## [29] "tidyr" "xml2" "tidyverse"
We’ll use several of these additional packages during the remainder of this course.
Underlying these packages are two key ideas: the pipe operator and tidy data.
The pipe operator %>%
You might have seen the pipe operator in the past.
The beauty of pipes
- The forward-pipe operator %>% pipes the left-hand side values forward into expressions on the right-hand side.
- It serves the natural way of reading (“do this, then this, then this, …”).
- We replace f(x) with x %>% f().
- It avoids nested function calls.
- It minimizes the need for local variables and function definitions.
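To make the points above concrete, here is the same computation written as nested calls and as a pipeline. This sketch uses the native pipe |> (R 4.1 or later) so that it runs without loading any package; for a simple chain like this one, %>% from magrittr reads the same way.

```r
x <- c(1, 4, 9, 16)

# Nested calls are read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# The pipeline is read left to right:
# take x, then take square roots, then average, then round.
piped <- x |>
  sqrt() |>
  mean() |>
  round(1)

nested   # 2.5
piped    # 2.5
```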
The classic way
The classic way, nightmare edition
alex_awake <- wake_up(Alex, 7)
alex_showered <- shower(alex_awake,
                        temp = 38)
alex_replete <- breakfast(alex_showered,
                          c("coffee", "croissant"))
alex_underway <- walk(alex_replete,
                      step_function())
alex_on_train <- bvg(alex_underway,
                     train = "U2",
                     destination = "Stadtmitte")
alex_hertie <- hertie(alex_on_train,
                      course = "Intro to DS")
Piping etiquette
- Pipes are not very handy when you need to manipulate more than one object at a time. Reserve pipes for a sequence of steps applied to one primary object.
- Don’t use the pipe when there are meaningful intermediate objects that can be given informative names (and that are used later on).
- %>% should always have a space before it, and should usually be followed by a new line.
The base R pipe: |>
The magrittr pipe has proven so successful and popular that the R core team added a “native” pipe operator to base R (version 4.1), denoted |>. Here’s how it works:
mtcars |> subset(cyl == 4) |> head()
mtcars |> subset(cyl == 4) |> (\(x) lm(mpg ~ disp, data = x))()
Now, should we use the magrittr pipe or the native pipe? The native pipe might make more sense in the long term, since it avoids dependencies and might be more efficient. Check out this Stack Overflow post and this Tidyverse blog post for a discussion of differences.
You can update your settings if you’d like RStudio to default to the native pipe operator |>.
Tidy Data
Generally, we will encounter data in a tidy format. Tidy data refers to a way of mapping the structure of a data set. In a tidy data set:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
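As a small illustration of these rules, compare a "wide" table, where the year variable is hidden in the column names, with its tidy counterpart. The counts are invented, and the reshaping is done by hand in base R to keep the sketch free of extra packages; in practice a tidyverse function such as tidyr::pivot_longer() is the usual tool.

```r
# Not tidy: the variable "year" is spread across two column names.
wide <- data.frame(species = c("Adelie", "Gentoo"),
                   y2007   = c(50, 34),
                   y2008   = c(50, 46))

# Tidy: one column per variable (species, year, count),
# one row per observation (a species in a given year).
tidy <- data.frame(
  species = rep(wide$species, times = 2),
  year    = rep(c(2007, 2008), each = nrow(wide)),
  count   = c(wide$y2007, wide$y2008)
)

tidy
dim(tidy)   # 4 3
```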
Data manipulation with dplyr
A second fundamental package of the tidyverse is called dplyr. In this section you’ll learn and practice examples using some functions in dplyr to work with data. Those are:
- dplyr::select(): Select (i.e. subset) columns by their names (keep or exclude some columns)
- dplyr::filter(): Filter (i.e. subset) rows based on their values (keep rows that satisfy your conditions)
- dplyr::mutate(): Create new columns or edit existing ones
- dplyr::group_by(): Define groups within your data set
- dplyr::summarize(): Collapse multiple rows into a single summary value (summary statistics)
- dplyr::arrange(): Arrange (i.e. reorder) rows based on their values (reorder rows according to single or multiple variables)
To demonstrate and practice how these verbs (functions) work, we’ll use the penguins dataset we saw earlier.
dplyr::select()
The first verb (function) we will utilize is dplyr::select(). We can employ it to manipulate our data based on columns. If you recall from our initial exploration of the data set, there were eight variables attached to every observation. Do you recall them? If you do not, there is no problem. You can utilize names() to retrieve the names of the variables in a data frame.
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "rec_year"
Say we are only interested in the species, island, and year variables of these data. We can utilize the following syntax:
dplyr::select(data, columns)
Exercise 1
The following code chunk would select the variables we need. Can you adapt it, so that we keep the body_mass_g and sex variables as well?
Good to know: To drop variables, use - before the variable name, i.e. select(penguins, -rec_year) to drop the rec_year column (select everything but the rec_year column).
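A short sketch of keeping and dropping columns with dplyr::select(). It assumes the dplyr package is installed and uses a small made-up data frame instead of penguins, so the column names are placeholders.

```r
toy <- data.frame(
  species  = c("Adelie", "Gentoo", "Chinstrap"),
  island   = c("Torgersen", "Biscoe", "Dream"),
  mass_g   = c(3750, 5000, 3700),
  rec_year = c(2007L, 2008L, 2009L)
)

# Keep only the named columns, in the order given.
kept <- dplyr::select(toy, species, rec_year)
names(kept)      # "species" "rec_year"

# Drop a column with a minus sign; everything else is kept.
dropped <- dplyr::select(toy, -rec_year)
names(dropped)   # "species" "island" "mass_g"
```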
dplyr::filter()
☕
The second verb (function) we will employ is dplyr::filter(). dplyr::filter() lets you use a logical test to extract specific rows from a data frame. To use dplyr::filter(), pass it the data frame followed by one or more logical tests. dplyr::filter() will return every row that passes each logical test.
The more commonly used logical operators are:
- ==: Equal to
- !=: Not equal to
- >, >=: Greater than, greater than or equal to
- <, <=: Less than, less than or equal to
- &, |: And, or
Say we are interested in retrieving the observations from the year 2007. We would do:
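The logical operators can be tried out on a small made-up data frame. This sketch assumes dplyr is installed; the column names mirror the penguins data but the values are invented.

```r
toy <- data.frame(
  species  = c("Adelie", "Gentoo", "Chinstrap", "Adelie"),
  mass_g   = c(3750, 5000, 3700, 3900),
  rec_year = c(2007L, 2008L, 2009L, 2009L)
)

# One test: rows recorded in 2007.
from_2007 <- dplyr::filter(toy, rec_year == 2007)

# Several tests separated by commas must all be TRUE (same as &).
heavy_non_gentoo <- dplyr::filter(toy, mass_g > 3700, species != "Gentoo")

# | keeps rows where at least one test is TRUE.
gentoo_or_2009 <- dplyr::filter(toy, species == "Gentoo" | rec_year == 2009)

nrow(from_2007)          # 1
nrow(heavy_non_gentoo)   # 2
nrow(gentoo_or_2009)     # 3
```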
Exercise 2
We can leverage the pipe operator to sequence our code in a logical manner. Can you adapt the following code chunk with the pipe and conditional logical operators we discussed?
only_2009 <- dplyr::filter(penguins, rec_year == 2009)
only_2009_chinstraps <- dplyr::filter(only_2009, species == "Chinstrap")
only_2009_chinstraps_species_sex_year <- dplyr::select(only_2009_chinstraps, species, sex, rec_year)
final_df <- only_2009_chinstraps_species_sex_year
final_df #to print it in our console
## # A tibble: 24 × 3
## species sex rec_year
## <fct> <fct> <int>
## 1 Chinstrap female 2009
## 2 Chinstrap male 2009
## 3 Chinstrap female 2009
## 4 Chinstrap male 2009
## 5 Chinstrap male 2009
## 6 Chinstrap female 2009
## 7 Chinstrap female 2009
## 8 Chinstrap male 2009
## 9 Chinstrap female 2009
## 10 Chinstrap male 2009
## # ℹ 14 more rows
dplyr::mutate()
🌂 ☂️
dplyr::mutate() lets us create, modify, and delete columns. The most common use for now will be to create new variables based on existing ones. Say we are working with a U.S. American client and they feel more comfortable with assessing the weight of the penguins in pounds. We would utilize mutate() as such:
dplyr::mutate(data, new_var_name = manipulated old_var(s))
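A sketch of the grams-to-pounds conversion on a small made-up data frame, assuming dplyr is installed. The new column name body_mass_lb is our own choice; one pound is 453.59237 grams.

```r
toy <- data.frame(
  species     = c("Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750, 5000, 3700)
)

# Add a new column computed from an existing one; the original column stays.
toy_lb <- dplyr::mutate(toy, body_mass_lb = body_mass_g / 453.59237)

names(toy_lb)                   # "species" "body_mass_g" "body_mass_lb"
round(toy_lb$body_mass_lb, 2)   # 8.27 11.02 8.16
```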
dplyr::group_by() and dplyr::summarize()
These two verbs, dplyr::group_by() and dplyr::summarize(), tend to go together. When combined, dplyr::summarize() will create a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarizing all observations in the input. For example:
# compare this output with the one below
penguins |>
  dplyr::summarize(heaviest_penguin = max(body_mass_g, na.rm = TRUE))
## # A tibble: 1 × 1
## heaviest_penguin
## <int>
## 1 6300
penguins |>
  dplyr::group_by(species) |>
  dplyr::summarize(heaviest_penguin = max(body_mass_g, na.rm = TRUE)) |>
  dplyr::ungroup()
## # A tibble: 3 × 2
## species heaviest_penguin
## <fct> <int>
## 1 Adelie 4775
## 2 Chinstrap 4800
## 3 Gentoo 6300
There is also an alternate approach to calculating grouped summary statistics called per-operation grouping. This allows you to define groups in a .by argument, passing them directly in the summarize() call. These groups don't persist in the output, whereas the ones created with group_by() can (summarize() only drops the last grouping variable). You can read more about both these approaches in R for Data Science, 2nd edition.
penguins |>
  dplyr::summarise(heaviest_penguin = max(body_mass_g, na.rm = TRUE), .by = species) |>
  dplyr::ungroup()
## # A tibble: 3 × 2
## species heaviest_penguin
## <fct> <int>
## 1 Adelie 4775
## 2 Gentoo 6300
## 3 Chinstrap 4800
penguins |>
  dplyr::summarise(heaviest_penguin = max(body_mass_g, na.rm = TRUE), .by = c(species, sex)) |>
  dplyr::ungroup()
## # A tibble: 8 × 3
## species sex heaviest_penguin
## <fct> <fct> <int>
## 1 Adelie male 4775
## 2 Adelie female 3900
## 3 Adelie <NA> 4250
## 4 Gentoo female 5200
## 5 Gentoo male 6300
## 6 Gentoo <NA> 4875
## 7 Chinstrap female 4150
## 8 Chinstrap male 4800
Notice that we are using dplyr::ungroup() after performing grouped calculations. It is a convention we encourage. If you forget to ungroup() data, leftover groups can produce errors or unexpected results in downstream operations. Just to be safe, use dplyr::ungroup() when you've finished with your calculations.
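To see why leftover groups matter, here is a small sketch, assuming the palmerpenguins data. The column names mass_vs_species_mean and n_penguins are made up for illustration.

```r
library(dplyr)

# Assumption: the penguins data frame comes from palmerpenguins
penguins <- palmerpenguins::penguins

# group_by() followed by mutate() keeps the grouping on the result
grouped <- penguins |>
  dplyr::group_by(species) |>
  dplyr::mutate(mass_vs_species_mean = body_mass_g - mean(body_mass_g, na.rm = TRUE))

dplyr::group_vars(grouped) # still grouped by species

# After ungroup(), later verbs work on the whole data frame again
ungrouped <- grouped |>
  dplyr::ungroup()

dplyr::is_grouped_df(ungrouped) # no grouping left

# The same call behaves differently depending on leftover groups
count_grouped <- grouped |>
  dplyr::summarize(n_penguins = dplyr::n()) # one row per species

count_ungrouped <- ungrouped |>
  dplyr::summarize(n_penguins = dplyr::n()) # a single row
```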
Exercise 3
Can you get the weight of the lightest penguin of each species? You can use min(). What happens when, in addition to species, you also group by year with group_by(species, year)?
dplyr::arrange()
🥚 🐣 🐥
The dplyr::arrange() verb is pretty self-explanatory. dplyr::arrange() orders the rows of a data frame by the values of selected columns in ascending order. You can wrap a column in the desc() helper to arrange in descending order. The following chunk arranges the data frame based on the length of the penguins' bill. The hint tab contains the code for the descending order alternative.
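A sketch of both orderings, assuming the palmerpenguins data and its bill_length_mm column:

```r
library(dplyr)

# Assumption: the penguins data frame comes from palmerpenguins
penguins <- palmerpenguins::penguins

# Ascending order (the default): shortest bills first
by_bill_asc <- penguins |>
  dplyr::arrange(bill_length_mm)

# Descending order: wrap the column in desc(), longest bills first
by_bill_desc <- penguins |>
  dplyr::arrange(dplyr::desc(bill_length_mm))
```

In both cases arrange() places rows with a missing bill length at the end.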
Exercise 4
Can you create a data frame arranged by body_mass_g of the penguins observed in the “Dream” island?
Optional: Other dplyr functions
dplyr::slice(): Subset rows by position rather than filtering by values.
dplyr::pull(): Extract a column from a data frame as a vector or scalar.
dplyr::count() and dplyr::distinct(): Count and isolate unique observations.
Note: You could also use a combination of dplyr::group_by(), dplyr::summarize() or dplyr::mutate(), and n(), e.g. penguins |> dplyr::group_by(species) |> dplyr::summarize(num = n()).
dplyr::where(): Select the variables for which a function returns true.
dplyr::across(): Summarize or mutate multiple variables in the same way. More information here.
dplyr::case_when(): Vectorize multiple if_else() (or base R ifelse()) statements.
# multiple conditional statements
penguins |>
  dplyr::mutate(
    flipper_length_cat =
      dplyr::case_when(
        flipper_length_mm < 190 ~ "small",
        flipper_length_mm >= 190 & flipper_length_mm < 210 ~ "medium",
        flipper_length_mm >= 210 ~ "large"
      )
  ) |>
  dplyr::pull(flipper_length_cat) |>
  table()
Window functions: There is also a whole class of window functions for getting leads and lags, ranking, creating cumulative aggregates, etc. See vignette("window-functions").
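The optional helpers listed above can be sketched together as follows. This assumes the palmerpenguins data and a dplyr version that supports .by (1.1.0 or later); all result names are illustrative.

```r
library(dplyr)

# Assumption: the penguins data frame comes from palmerpenguins
penguins <- palmerpenguins::penguins

# slice(): rows by position (here the first three rows)
first_three <- penguins |> dplyr::slice(1:3)

# pull(): one column as a plain vector
islands <- penguins |> dplyr::pull(island)

# count(): number of rows per value; distinct(): unique combinations
species_counts <- penguins |> dplyr::count(species)
species_islands <- penguins |> dplyr::distinct(species, island)

# across() + where(): the same summary for every numeric column
numeric_means <- penguins |>
  dplyr::summarize(
    dplyr::across(where(is.numeric), \(x) mean(x, na.rm = TRUE)),
    .by = species
  )

# Window functions return one value per row instead of one per group
ranked <- penguins |>
  dplyr::mutate(
    previous_mass = dplyr::lag(body_mass_g),               # value from the row before
    mass_rank = dplyr::min_rank(dplyr::desc(body_mass_g)), # 1 = heaviest in its species
    .by = species
  )
```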
Best practices
Let us conclude the tutorial with some notes on best practices in R. There is a multitude of collections on best practices online, for example, some very useful ones are here.
Script structure
When structuring your scripts, remember a few things:
- Libraries go first
- Hard coded variables (e.g. when reading in data) go second
- Relative paths over absolute paths (give your friends and colleagues a chance to execute your scripts without trouble)
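As a rough sketch, the top of a script following these three points might look like this. The folder and file names are hypothetical.

```r
# 1. Libraries go first
library(dplyr)

# 2. Hard-coded values go second
#    A relative path (hypothetical file), built from the project folder,
#    instead of an absolute path that only exists on your own machine
data_path <- file.path("data", "survey_data_2024.csv")

# 3. The analysis follows and only refers to the names defined above
if (file.exists(data_path)) {
  survey_data <- read.csv(data_path)
}
```

Anyone who opens the project folder can then run the script without editing paths.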
Naming conventions
Always use easy-to-interpret names and don't use whitespace in file or variable names! Some good examples:
survey_data_2024.R
student_ids <- c(101, 102, 103)
calculate_gpa()
Spacing
Code is easier to read if you leave a space after each comma:
my_function(1:10, c(2, 4))
Repetition
When you start copying and pasting code (creating a lot of redundancy), you might want to consider slimming down your code by creating a function. Functions? Let us leave this as a note of caution and return to functions and efficient code in our next session!
Actually learning R 🎒
Again, the key to learning R is: practice! We can only give you an overview of basic R functions, but to really learn R you will have to actively use it yourself, troubleshoot, ask questions, and google! It is very likely that someone else has had the exact same issue, or one similar enough, and that the R community answered it with 5+ different solutions years ago. 😉
To-do!
For those of you who have yet to create a GitHub account, do so now. We will be using GitHub extensively in this course, so make sure you get familiar with version control and project management with Git and GitHub. Everyone has their own preferences, but to get started either GitHub Desktop or the RStudio Git interface are good options. While we won't have a full lab session on the topic, we will cover it briefly next week, and we have included materials from a previous version of this course under session-00, which you should go through if you still feel unsure about it.
See you next week! 💻
Acknowledgements
This script was drafted by Tom Arendt and Lisa Oswald, with contributions by Steve Kerr, Hiba Ahmad, Carmen Garro, Sebastian Ramirez-Ruiz, Killian Conyngham and Carol Sobral. It draws heavily on the materials for the Introduction to R workshop within the Data Science Seminar series at Hertie, created by Therese Anders, and on Statistical Modeling and Causal Inference by Sebastian Ramirez-Ruiz and Lisa Oswald.
Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081