Part 1: The Basics
Introduction to R and RStudio
R (https://www.r-project.org/about.html) is a programming language originally designed for statistical computing and data visualization.
Nowadays, R can do many more things, such as:
Make music?!😮 (tuneR)
Create some simple games
There are multiple ways of doing the same thing in R. As long as you get to the desired result, how you got there is (usually) irrelevant.
R can do just about anything you can think of. In most cases, the only limit is your imagination and your Googling 🔎 skills.
Whereas R is a programming language, RStudio is an integrated development environment (IDE…a what? 😕)
An IDE is software that makes writing code easier. Although RStudio was developed with R in mind, it also supports many other programming languages (e.g., Python, JavaScript, C…).
Likewise, you do not need RStudio to use R. However, RStudio is by far the best IDE for R and it makes the process much more efficient!
Not super important, but just a distinction that I wanted to point out
🤷
You will never have to open R directly, but this is what R looks like compared to RStudio:
The RStudio interface is divided into 4 panes:
* Extra info: You can actually write and run code directly in the console, but code typed there is not saved to a file (and you should always save your code!). When you run your code from the Source pane, RStudio sends it to the console to be interpreted. All computer code is just plain text; to run code written in a given language, you need something that interprets and runs it. The R console is what interprets and runs your code (which is why you need to have R on your computer to use R in RStudio).
R Basics
Before we can do any coding, we need to open a new R script! To do that, navigate to File → New File → R Script.
A tab named “Untitled1” will appear in your source pane.
This is where we are going to write code!
As with any other file, you can later save this file anywhere on your computer. It will have the .R extension.
R can perform just about any mathematical operation. At the same time, let’s see how to run some code:
In RStudio, you can either run one or more lines of code at once, or run the whole R script file at once.
You will see your code with output appear in the console.
Output is indicated by “[n]”, where n is the position (index) of the first value printed on that line of output.
Here each of our inputs (the 3 math operations) produces a single value, so every output line starts with [1], but longer outputs can span more lines.
The # sign represents comments. R will not run commented lines. Comments are good for explaining code to either your future self or to other people reading your code!
R “reads” code until it finds the end of a statement (code that produces output), and then expects the following statement to appear on a new line.
Note: Spacing among elements of a statement is irrelevant, but it is good practice to be reasonable and consistent.
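As a minimal sketch of the points above (the numbers are illustrative, not the ones from the original slide), the lines below can be typed into the script and run one at a time or all at once:

```r
# Three simple math operations; each one is a separate statement on its own line
2 + 3     # addition, prints [1] 5
10 / 4    # division, prints [1] 2.5
3 ^ 2     # exponentiation, prints [1] 9

# Spacing inside a statement does not change the result
identical(4*5, 4 * 5)

# A longer output wraps over several lines in the console;
# each line starts with the index of its first value
1:30
```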
Like many other programming languages, R is object-oriented. You can think of objects as containers where information is stored.
To create an object in R, you use the assignment operator, “<-” (typed as “<” followed by “-”):
The keyboard shortcut for the assignment operator is “Alt” + “-” (Win) or “Option” + “-” (Mac).
Just like there are many different types of containers (boxes, drawers, fridges, etc…), there are many different types of R objects!
The x object that we just created is technically a numeric vector (type of object) of length 1 🤔
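A minimal sketch of creating an object and checking what kind of container it is. The value 5 is only an illustrative choice; the object name x follows the text above:

```r
x <- 5          # store the value 5 in an object named x (illustrative value)
x               # typing the name of an object prints its content
x * 2           # objects can be used in calculations

is.numeric(x)   # TRUE: x is a numeric vector
length(x)       # 1: it holds a single value
```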
A vector is a one-dimensional (dimensions of object) collection of values of the same type, such as numbers. To create a vector we can do the following:
# `c()` is a function (more on functions later), and it stands for "concatenate". The `c()` binds things together. This function comes up a lot and has many different applications.
y <- c(1, 5, 7, 9)
# math operations can be applied to vectors! It turns out that for computers it is much more efficient to do operations as vectors instead of one at a time.
y - 3
[1] -2 2 4 6
So far we have only dealt with numbers, but R also handles characters quite well!
Although it may sound obvious, you cannot apply any math operations to character vectors
Also note that you can create character vectors that have numbers in them, but you will not be able to apply math operations to them:
Sometimes the data that you open will have numbers saved as characters, and you will be stuck wondering why your code does not run. So when code is not working, it is good to check that you are using the right type of object.
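A short sketch of the character-vector situation described above. The object names and values are hypothetical; `class()` and `as.numeric()` are base R functions:

```r
# A character vector (values go inside quotes)
fruits <- c("apple", "banana", "cherry")
class(fruits)

# Numbers saved as characters look like numbers but are not numeric
num_chr <- c("1", "5", "7")
class(num_chr)

# num_chr - 3 would stop with an error, because num_chr is not numeric.
# Converting to numeric first makes math possible again:
num_fixed <- as.numeric(num_chr)
num_fixed - 3
```

Running `class()` on a misbehaving object is usually the quickest way to check its type.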
A function is something that takes one or more objects as input and produces an output.
Functions also have arguments that allow you to tweak what the function does.
# `sort()`, by default, sorts vectors from smallest to largest (or in alphabetical order if you give it a character vector!)
# Here, we use "decreasing = TRUE" to sort from largest to smallest.
sort(x, decreasing = TRUE)
[1] 12 11 10 6 4 2
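For a self-contained version of this example, the sketch below assumes that x holds the six values seen in the output (their original order is a guess) and compares the default behavior with `decreasing = TRUE`:

```r
# Assumed values for x, chosen to match the output shown above
x <- c(4, 11, 2, 12, 6, 10)

sort(x)                      # default: smallest to largest
sort(x, decreasing = TRUE)   # argument changed: largest to smallest

# sort() also works on character vectors (alphabetical order)
sort(c("pear", "apple", "fig"))
```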
R comes with many built-in functions. You can find a list here. However, your best friend for finding the function you need is Google (or ChatGPT for simple coding questions!)
Let’s say I ask Google for an R function that sorts vectors and I find the sort() function!… But how do I know about its arguments? How do I know whether it sorts in ascending or descending order?
There are multiple ways to open the help menu. Try the following:
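A minimal sketch of the usual options, using `sort()` as the example function. All three calls are base R; in RStudio the first two open the Help tab:

```r
?sort          # question mark followed by the function name
help("sort")   # the same help page, written as a function call
args(sort)     # quick look at the arguments without opening the help page
```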
There’s much more going on here, but notice the {base} after the name of the function. That is the package the function comes from 🧐
Usually, the base R functions are not enough for most of the tasks that one needs to accomplish in R. Often people have to create their own custom functions.
A package is simply a collection of functions that other users make for everyone out of the kindness of their heart!
Let us install a package that makes opening data in R very smooth, the rio package:
The install.packages() function installs packages from the Comprehensive R Archive Network (CRAN). Among other things, CRAN maintains a library of packages made by users.
The process to get a package on CRAN is a bit lengthy (and sometimes packages get removed), so some people just upload their packages to GitHub.
To see all of the packages installed on your computer, you can navigate to your viewer pane and select the “Packages” tab.
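The same information is available from code. The sketch below only checks what is installed; the install line is left commented out because it needs an internet connection and only has to be run once per computer. The object names are hypothetical:

```r
# Run once to install rio from CRAN (commented out on purpose):
# install.packages("rio")

# TRUE if rio is already installed on this computer
rio_installed <- requireNamespace("rio", quietly = TRUE)
rio_installed

# Names of all installed packages (same list as the Packages tab)
installed <- rownames(installed.packages())
head(installed)
```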
Let’s open the Titanic_Survival.csv data set with the import() function from the rio package. This takes a few steps:
# to load the functions from a package you need to run the `library(package)` function first
library(rio)
# rio also suggests adding a few extra packages, so also run the line below. Packages often use functions from other packages, which is why rio suggests installing them here
install_formats()
Since the data is a separate file, R needs to know where that file is on your computer. There are two ways (more, actually) of doing that:
Either you use the absolute file path (i.e., the unique address that identifies the location of a file on your computer),
or you save your data in your working directory (WD; the default folder where RStudio saves/looks for files). Your current WD is always displayed at the top of the R console pane next to the R version number.
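A sketch of both options. Because the workshop file may not be on every computer, the runnable part builds a tiny stand-in CSV (hypothetical data and object names) in a temporary folder and reads it through its absolute path. It uses rio's `import()` when rio is installed and falls back to base R's `read.csv()` otherwise:

```r
# Option 2: file saved in the working directory, so the file name is enough
# dat <- import("Titanic_Survival.csv")

getwd()   # shows the current working directory

# Option 1: absolute path. Build a small stand-in file to demonstrate.
csv_path <- file.path(tempdir(), "example.csv")
write.csv(data.frame(id = 1:3, age = c(29, 2, 30)),
          csv_path, row.names = FALSE)

if (requireNamespace("rio", quietly = TRUE)) {
  dat_small <- rio::import(csv_path)   # rio picks the reader from the extension
} else {
  dat_small <- read.csv(csv_path)      # base R fallback
}
dat_small
```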
To look at the full data, you can simply click on the data object that just appeared in your environment. However, there are also functions that can help us get a sense of the data we are dealing with:
# the `str()` function tells us how many rows/columns our data has, what the variables are, and what type of variables we are dealing with (integers, numeric, characters)
str(dat)
'data.frame': 1309 obs. of 11 variables:
$ pclass : int 1 1 1 1 1 1 1 1 1 1 ...
$ survived : int 1 1 0 0 0 1 1 0 1 0 ...
$ name : chr "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ...
$ sex : chr "female" "male" "female" "male" ...
$ age : num 29 0.917 2 30 25 ...
$ sibsp : int 0 1 1 1 1 0 1 0 2 0 ...
$ parch : int 0 2 2 2 2 0 0 0 0 0 ...
$ fare : num 211 152 152 152 152 ...
$ embarked : chr "S" "S" "S" "S" ...
$ boat : chr "2" "11" "" "" ...
$ home.dest: chr "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
Note the “$” operator in front of the variables. When dealing with data.frame objects, you can interact with specific variables this way:
# dat$age means the "age" column in the "dat" data.frame object. The "na.rm = TRUE" lets the `mean()` function know to ignore the missing values (which are represented by "NA" in R) when calculating the mean.
mean(dat$age, na.rm = TRUE)
[1] 29.88113
Some slides back, I mentioned the concept of dimensions. data.frame objects have 2 dimensions, rows and columns.
Knowing the number of dimensions of objects lets us subset objects. You can subset 2D objects by referring to the indices of their dimensions in this way “object_name[row number, column number]”:
# You can select the entire 2nd row of the "dat" object. If you leave a dimension empty when subsetting, it means "all of this dimension".
dat[2,]
pclass survived name sex age sibsp parch fare
2 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 151.55
embarked boat home.dest
2 S 11 Montreal, PQ / Chesterville, ON
# You can remove rows or columns this way
# The "dat_2" object will be "dat" without the first row. `nrow()` counts the rows of an object.
dat_2 <- dat[-1,]
nrow(dat_2)
[1] 1308
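The same `[row, column]` pattern works on any data.frame. As a sketch that runs without the Titanic file, the lines below use the built-in PlantGrowth data (30 rows, columns weight and group); the object names on the left are hypothetical:

```r
dim(PlantGrowth)             # number of rows, then number of columns

PlantGrowth[2, ]             # entire 2nd row
PlantGrowth[1:3, "weight"]   # rows 1 to 3 of the weight column

# A negative index removes rows or columns
no_first_row <- PlantGrowth[-1, ]
nrow(no_first_row)

# drop = FALSE keeps the result a data.frame even with one column left
no_group_col <- PlantGrowth[, -2, drop = FALSE]
names(no_group_col)
```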
You may think that you have not learned much so far. But I see learning R this way:
Give a man a fish, and you feed him for a day; teach a man to fish and you feed him for a lifetime. 🐡
These are things that I would like you to always keep in mind as you work with R:
Open the “Workshop-Activity-1.pdf” file.
Form groups of 3 or more people and try solving the questions together!
It is fine if you can’t solve all of the questions.
I will go over the solutions to each question and also send you a file with those solutions at the end of the workshop!
Part 2: dplyr, ggplot2, and Stats
dplyr package for data manipulation
ggplot2 package for plotting
Besides subsetting, the dplyr package offers many functions for data manipulation.
subsetting: select and manipulate data based on its position or condition within the data structure.
dplyr: provides a coherent set of verbs that help users manipulate data in a clear and efficient way.
dplyr offers:
filter(): picks cases based on their values
select(): picks variables based on their names
arrange(): changes the ordering of the rows
mutate(): adds new variables that are functions of existing variables
summarise(): reduces multiple values down to a single summary
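A minimal sketch of one call per verb, assuming dplyr is installed. It uses the built-in PlantGrowth data so that it runs on its own; the object names and the new column weight_x2 are hypothetical:

```r
library(dplyr)

ctrl_only   <- filter(PlantGrowth, group == "ctrl")          # keep rows of the control group
weight_only <- select(PlantGrowth, weight)                   # keep only the weight column
by_weight   <- arrange(PlantGrowth, desc(weight))            # heaviest plants first
with_double <- mutate(PlantGrowth, weight_x2 = weight * 2)   # add a new column
avg_weight  <- summarise(PlantGrowth, mean_weight = mean(weight))  # one summary row

avg_weight
```

In every verb the first argument is the data and the result is a new data.frame, which is what makes the verbs easy to chain.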
The pipe operator, %>%, is a powerful feature in R (it becomes available when you load dplyr).
It allows us to chain together operations in a linear sequence
%>% is like saying “then”
In the starwars data, filter for droids, and then arrange the filtered dataset by height in descending order.
To access the starwars dataset in R, you need this line:
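A sketch of the whole task, assuming dplyr is installed. The starwars data ships with dplyr, so loading the package is what makes it available; the object name droids is hypothetical:

```r
library(dplyr)   # provides the starwars data and the %>% operator

droids <- starwars %>%
  filter(species == "Droid") %>%   # keep only the droids, then...
  arrange(desc(height))            # ...sort from tallest to shortest

droids
```

Read top to bottom, the code says: take starwars, then filter, then arrange.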
ggplot2 is a powerful and widely used data visualization package in R. It’s part of the tidyverse package, a collection of R packages designed for data science.
+ operator: adds layers to the plot, one after another
ggplot(starwars, aes(x = height, y = mass, color = gender)) +
  geom_point() +
  xlim(0, 250) +
  ylim(0, 250) +
  labs(title = "Height vs. Mass of Star Wars Characters",
       x = "Height (cm)",
       y = "Mass (kg)",
       color = "Gender") +
  theme_minimal() +
  theme(legend.position = "right",
        plot.title = element_text(hjust = .5, size = 14, face = "bold"),
        plot.margin = margin(t = 20, r = 20, b = 20, l = 20, unit = "pt"))
color = gender: it uses the ‘gender’ variable to color-code the points on the plot
labs(title = …, x = …, y = …, color = …): customizes the plot’s main title, labels for the x/y axes, and color legend
theme_minimal(): it applies a minimalistic theme to the plot
theme(legend.position = …): further customizes the appearance of plot components
Draw a bar plot of the eye color variable.
PlantGrowth data (a built-in dataset in R): contains data on the weight of plants grown in three different conditions.
Research question:
Independent-samples t test!
t.test(): is used to perform t-tests in R
PlantGrowth$weight[PlantGrowth$group == "ctrl"]: selects the ‘weight’ values from the PlantGrowth dataset for all observations in the control group
PlantGrowth$weight[PlantGrowth$group == "trt1"]: selects the ‘weight’ values from the PlantGrowth dataset for all observations in the treatment 1 group
alternative = "two.sided": this is used for a two-tailed t-test
var.equal = TRUE: assumes equal variance for the t-test
# There are many other ways to do a t-test, this is just one of them
control <- PlantGrowth$weight[PlantGrowth$group == "ctrl"]
treatment_1 <- PlantGrowth$weight[PlantGrowth$group == "trt1"]
t.test(control,
       treatment_1,
       alternative = "two.sided", var.equal = TRUE)
Two Sample t-test
data: control and treatment_1
t = 1.1913, df = 18, p-value = 0.249
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.2833003 1.0253003
sample estimates:
mean of x mean of y
5.032 4.661
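One of the other ways is the formula interface of `t.test()`. The sketch below keeps only the two groups being compared and should give the same t statistic as the output above; the object names two_groups and t_res are hypothetical:

```r
# Keep the control and treatment 1 rows and drop the unused "trt2" level
two_groups <- droplevels(subset(PlantGrowth, group %in% c("ctrl", "trt1")))

# weight ~ group reads as "weight explained by group"
t_res <- t.test(weight ~ group, data = two_groups,
                alternative = "two.sided", var.equal = TRUE)
t_res
```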
Research question:
H0: the average weights of all groups are equal
H1: the average weight of at least one group differs from the others
One-way ANOVA!
You can run a one-way ANOVA with the aov() function:
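A minimal sketch of the call, using the built-in PlantGrowth data; the object name anova_fit is an illustrative choice:

```r
# Fit the model: weight (DV) explained by group (IV)
anova_fit <- aov(weight ~ group, data = PlantGrowth)

# Print the ANOVA table
summary(anova_fit)
```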
Df Sum Sq Mean Sq F value Pr(>F)
group 2 3.766 1.8832 4.846 0.0159 *
Residuals 27 10.492 0.3886
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
weight ~ group: specifies the model for the ANOVA.
Formula: aov(DV ~ IV, data = dataset)
summary(): used to obtain and print a summary of the results from the ANOVA test
Undergraduate R workshop Spring 2024