Having R and R Studio on your laptop will allow you to work on problem sets and explore the magnificent functionality of R outside the lab. R is the language and R Studio helps us interact with R. It is important that you install R before you install R Studio.
In your web browser, go to r-project.org and then click download R. At this point, you’ll be directed to a page with a list of institutions that host the Comprehensive R Archive Network (CRAN). The idea is to pick an institution near you. Scroll down to USA and click the link for either UC Berkeley or OSU. The choice is arbitrary, but you can maintain a Beaver-free machine by picking Berkeley. Also, the Ducks play Cal this week. Know your enemy.
Windows Instructions: If you have a Windows machine, click Download R for Windows, then install R for the first time, then Download R 3.6.1 for Windows. To complete installation, run the .exe file you downloaded.
Mac Instructions: If you have a Mac, click Download R for (Mac) OS X, then R-3.6.1.pkg under “latest release.” To complete installation, run the .pkg file you downloaded.
Linux Instructions: If you run a Linux distro, note that installation instructions vary by distro. That said, you probably know what you’re doing.
In your web browser, go to rstudio.com/products/rstudio/, scroll down to R Studio Desktop, and then click Download RStudio Desktop under “Open Source Edition.” Scroll down to “Installers for Supported Platforms” and click the link that corresponds with your operating system. To complete installation, run the installer you downloaded.
When you open R Studio for the first time, you should notice three panels. The large panel to the left is the console. This is where you run code that tells R what to do. You can also use the console as a calculator. For example, if you type 5+5*2-1 in the console and hit Enter, then R will return
## [1] 14
in the console.
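The result above follows from R's order of operations: multiplication is evaluated before addition and subtraction. As a small illustration (the variable names here are made up for the example), you can compare the same expression with and without parentheses:

```r
# Multiplication happens first: 5 + 10 - 1
default_order <- 5 + 5 * 2 - 1

# Parentheses force the addition to happen first: 10 * 2 - 1
grouped <- (5 + 5) * 2 - 1

# Exponents are evaluated before multiplication: 2 * 9
with_power <- 2 * 3^2

print(c(default_order, grouped, with_power))
## [1] 14 19 18
```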
The upper-right panel is the global environment. This is where R Studio stores datasets, user-defined functions, and other objects.
To define an object, you use the assignment operator <-. For example, suppose that you want to assign the number 5 to an object called a. In the console, you would type
a <- 5
which reads “a gets 5.” When you execute this code (by hitting Enter), a will show up in the global environment. Hovering your cursor over a in the global environment tells you that a is a numeric object.
There are other kinds of objects, too. For example,
b <- "I Love Metrics"
is a character object, and
mat <- matrix(c(1, 2, 3, 4), nrow = 2)
is a matrix.
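If you would rather check an object's type from the console than hover over it, base R has functions for that. The sketch below reuses the three objects defined above and inspects each one:

```r
a <- 5
b <- "I Love Metrics"
mat <- matrix(c(1, 2, 3, 4), nrow = 2)

class(a)        # "numeric"
class(b)        # "character"
is.matrix(mat)  # TRUE
dim(mat)        # 2 2 (two rows, two columns)

# matrix() fills column by column, so the first row is 1, 3
mat[1, 2]       # 3
```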
One of the nice features of R is that it can store multiple objects at a time. This is especially useful for analyzing data in a data.frame object and then storing the results in another data.frame object. It is also useful for cleaning and merging data. You might think that the ability to store multiple datasets is a trivial feature, but many other statistics packages can’t do this.
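To make this concrete, here is a minimal sketch that keeps three data.frame objects in memory at once. The data are invented for illustration (the county names, poverty rates, and object names are all hypothetical), and it uses only base R functions:

```r
# Hypothetical toy data: four counties in two states
counties <- data.frame(
  county  = c("A", "B", "C", "D"),
  state   = c("IL", "IL", "OH", "OH"),
  poverty = c(10, 14, 8, 12)
)

# A second data.frame with state-level information
states <- data.frame(
  state      = c("IL", "OH"),
  state_name = c("Illinois", "Ohio")
)

# Analyze one data.frame and store the results in another
state_means <- aggregate(poverty ~ state, data = counties, FUN = mean)

# Merge the results with the state lookup table
merged <- merge(state_means, states, by = "state")
print(merged)
```

All of counties, states, state_means, and merged now sit side by side in the global environment.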
R functions come in packages. When you open a fresh R session in R Studio, a number of packages come pre-loaded. These include packages with common math and statistics functions and are known collectively as “base R.” Base R is wonderful, but non-default packages offer a great deal of flexibility and functionality.
Install a package: install.packages("package.name.here")
Replace package.name.here with the name of the package you want to install. Alternatively, you can click on the Packages tab of the bottom-right panel and install the package from there.
Load a package: library(package.name.here)
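Installing only needs to happen once per machine, while loading happens in every session. One way to combine the two steps in base R is sketched below. Note that load_or_install is a hypothetical helper name chosen for this example, not a function that ships with R:

```r
# load_or_install() is a hypothetical helper, not part of base R
load_or_install <- function(pkg) {
  # requireNamespace() returns FALSE (instead of erroring) if pkg is missing
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the package name as a string
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so this call loads it without installing anything
load_or_install("stats")
```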
pacman
We will often need to load several packages in a single session. One way to do this is to execute library(package.1), then library(package.2), then library(package.3), and so forth. A less cumbersome way to load multiple packages is to use the p_load function from the pacman package.
Install pacman.
Load the pacman package with library(pacman).
Load the other packages with p_load(package.1, package.2, package.3).
p_load first checks to see if the packages are installed. If they aren’t, then it will install them for you.
To produce reproducible R code, it is best to use scripts. Open a new R script file with the .R extension by clicking File, then New File, then R Script. We will write our first script to generate a histogram and scatter plot using ggplot2.
ggplot2
Start by writing code to install and load ggplot2.
library(pacman)
p_load(ggplot2)
To run a single line of code, click the line and then click Run at the upper-right corner of your R script. A quicker alternative is to click the line you want to run and then use the keyboard shortcut Ctrl+Enter.
To run the entire script, click Source at the upper-right corner of your R script or use the keyboard shortcut Ctrl+Alt+R.
Aside: It is useful to leave comments in your code to explain to your future self what your code is doing and why. You can leave a comment by typing a hash (#) before the comment text:
# This is a comment. R will ignore it.
Check out the example dataset midwest from ggplot2. You can view the first few rows of the dataset with variable names by using the head function.
head(midwest)
## # A tibble: 6 x 28
##     PID county state  area poptotal popdensity popwhite popblack
##   <int> <chr>  <chr> <dbl>    <int>      <dbl>    <int>    <int>
## 1   561 ADAMS  IL    0.052    66090      1271.    63917     1702
## 2   562 ALEXA~ IL    0.014    10626       759      7054     3496
## 3   563 BOND   IL    0.022    14991       681.    14477      429
## 4   564 BOONE  IL    0.017    30806      1812.    29344      127
## 5   565 BROWN  IL    0.018     5836       324.     5264      547
## 6   566 BUREAU IL    0.05     35688       714.    35157       50
## # ... with 20 more variables: popamerindian <int>, popasian <int>,
## # popother <int>, percwhite <dbl>, percblack <dbl>, percamerindan <dbl>,
## # percasian <dbl>, percother <dbl>, popadults <int>, perchsd <dbl>,
## # percollege <dbl>, percprof <dbl>, poppovertyknown <int>,
## # percpovertyknown <dbl>, percbelowpoverty <dbl>,
## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## # percelderlypoverty <dbl>, inmetro <int>, category <chr>
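Besides head, a few other base R functions give a quick overview of a dataset. The sketch below assumes ggplot2 is installed (it supplies midwest) and shows some common ones:

```r
library(ggplot2)  # provides the midwest dataset

# Number of rows (counties) and columns (variables)
dim(midwest)

# Names of the first five variables
names(midwest)[1:5]

# Five-number summary and mean of county poverty rates
summary(midwest$percbelowpoverty)

# Number of counties in each state
table(midwest$state)
```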
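Next, make a histogram of county poverty rates (measured by the variable percbelowpoverty) using the ggplot function. You will need to tell ggplot which dataset to use (data = midwest), which variable to plot (aes(x = percbelowpoverty)), and what kind of plot to draw (geom_histogram()).
ggplot(data = midwest, aes(x = percbelowpoverty)) +
  geom_histogram()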
How many counties have poverty rates over 40 percent? What is the modal poverty rate?
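Reading exact counts off a histogram can be hard, so you may want to check your answers numerically. The following is one possible sketch, assuming ggplot2 is installed. The object names n_over_40, bins, and modal_bin are made up for this example, and the one-percentage-point bin width is an arbitrary choice that affects which bin comes out as the mode:

```r
library(ggplot2)  # provides the midwest dataset

# Count counties with poverty rates above 40 percent
n_over_40 <- sum(midwest$percbelowpoverty > 40)
n_over_40

# Approximate the mode: group rates into 1-point-wide bins
# and find the bin containing the most counties
bins <- cut(midwest$percbelowpoverty, breaks = seq(0, 100, by = 1))
modal_bin <- names(which.max(table(bins)))
modal_bin

# Redraw the histogram with matching 1-point-wide bins
ggplot(data = midwest, aes(x = percbelowpoverty)) +
  geom_histogram(binwidth = 1)
```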
To visualize relationships between variables, you can make a scatter plot. Do poverty rates appear positively or negatively correlated with the Black share of the county population, as measured by the variable percblack?
ggplot(data = midwest, aes(x = percblack, y = percbelowpoverty)) +
  geom_point()
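To back up what you see in the scatter plot, you could compute the correlation coefficient and overlay a fitted line. This is a sketch rather than a required step. It assumes ggplot2 is installed, and the object name r is arbitrary:

```r
library(ggplot2)  # provides the midwest dataset

# Pearson correlation between the two variables (between -1 and 1)
r <- cor(midwest$percblack, midwest$percbelowpoverty, use = "complete.obs")
r

# Same scatter plot with a least-squares line added
ggplot(data = midwest, aes(x = percblack, y = percbelowpoverty)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

The sign of r tells you the direction of the linear relationship, and the slope of the fitted line should agree with it.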