Installing R

R: To install R, navigate your browser to cran.r-project.org.1 Download the installer for your operating system and run it. You’re ready to go.

RStudio: Most R users interact with R through an (amazing) IDE called “RStudio”. Navigate to https://www.rstudio.com/products/rstudio/ and download the desktop IDE. Now you’re really ready.

Differences between R and Stata

Relative to Stata, R introduces a few new dimensions:

  1. R is free.
  2. R is an object-oriented language, in which objects have types.
  3. R uses packages (a.k.a. libraries).
  4. R tries to guess what you meant.
  5. R easily (and infinitely) parallelizes.
  6. R makes it easy to work with matrices.
  7. R plays nicely with Markdown.

Let’s review these differences in more depth.

R is free

Free to use. Free to update and upgrade. Free to disseminate. Free for your students to install on their own laptops.

You know the old joke about an economist being told there is a $100 bill lying on the sidewalk? (“Impossible! Someone would have picked it up already.”) Now think about the crazy license fees for proprietary econometrics and modelling software. You can see where this is going…

R is an object-oriented language, in which objects have types

You might have heard or read something along the lines of: “In R, everything has a name and everything is an object”. This probably sounds very abstract if you’re coming from a language like Stata. However, the key practical implications of this so-called object-oriented (OO) approach are as follows:

  • You can hold many objects in memory at the same time.
    • This could include multiple data frames, scalars, lists, functions, etc. (Remember: everything in R is an object.)
    • One of the upshots is no more “preserve”, “snapshot”, “restore” Stata-esque hackery if you want to summarise some variables in your dataset, or have multiple datasets that you want to work on at the same time.
  • As a corollary of this, defining or naming objects is a thing:
    • a <- 3 (i.e. the object a has been assigned as a scalar — or single-length vector — equal to 3)
    • b <- matrix(1:4, nrow = 2) (i.e. the object b has been assigned as a 2x2 matrix)
    • Side note: the <- assignment operator is read aloud as “gets”. You can also use a regular old equal sign if you prefer, e.g. a = 3.
  • Object types matter: e.g., a matrix is a bit different from a data frame or a vector.
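
A minimal sketch of these points, using only base R. The object names a, b, and df are illustrative and can be anything you like:

```r
# Several objects can live in memory side by side.
a <- 3                                       # numeric vector of length 1
b <- matrix(1:4, nrow = 2)                   # 2x2 integer matrix
df <- data.frame(x = 1:2, y = c("u", "v"))   # data frame with two columns

# Every object has a type, and you can ask R what it is.
class(a)          # "numeric"
class(df)         # "data.frame"
is.matrix(b)      # TRUE
is.matrix(df)     # FALSE: a data frame is not a matrix
```

Functions often behave differently depending on the type they receive, which is why it pays to check with class() when something unexpected happens.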

All of this might sound simple – and it is! – but one aspect of the OO approach that can trip up new R users (especially those coming from Stata) is that you have to be specific about which object you are referring to.

  • In Stata, because there is typically only one dataset in memory (frames, added in Stata 16, are the exception), there can be no ambiguity about which variable you are referring to (or, more correctly, where Stata should look for it).
  • However, because you can have multiple data frames in memory in R, you typically have to tell it that you want the variable “wage” from, say, dataframe1 and not from dataframe2.
  • There are various ways to do this and it soon becomes second nature.
    • E.g. You could use the $ index operator: dataframe1$wage.
    • E.g. Some functions let you specify the data frame (or parent object) as part of the function call. We’ll see some practical examples of this approach in the next section on regression models.
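
To make this concrete, here is a small sketch with two made-up data frames that both contain a column named wage. The values are invented for illustration only:

```r
# Two hypothetical data frames, each with its own wage column.
dataframe1 <- data.frame(wage = c(10, 20, 30))
dataframe2 <- data.frame(wage = c(100, 200, 300))

# The $ operator tells R which data frame to look in.
mean1 <- mean(dataframe1$wage)          # 20
mean2 <- mean(dataframe2$wage)          # 200

# with() evaluates an expression inside the data frame you name.
mean3 <- with(dataframe2, mean(wage))   # 200
```

Typing mean(wage) on its own would fail here, because no object called wage exists outside the two data frames.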

R uses packages

  • Just as LaTeX uses packages (i.e., \usepackage{foo}), R also draws upon non-default packages (i.e., library(foo)).
  • Note that R automatically loads with a set of default packages called the base installation, which includes the most commonly used packages and functions across all use cases (core probability and statistical operations, linear regression functions, etc.).
  • However, to really become effective in R, you will need to install and use non-default packages too.
    • Seriously, R intends for you to make use of outside packages. Don’t constrain yourself.2

Install a package: install.packages("package.name")

  • Notice that the installed package’s name is in quotes.3
  • You generally only need to install a package once. That is, assuming you use the update.packages() command to update all of your installed packages at once (see below).

Load a package: library(package.name)

  • Notice that you don’t need quotation marks now. Reason: library() reads the unquoted name as the name of the package to load, so library(package.name) and library("package.name") both work.
  • You will need to load any non-base package that you want to use at the start of a new R session.

Update packages: update.packages(ask=FALSE)

  • This command will update all of your installed packages simultaneously. If you want to only update a specific package, you should simply reinstall it (i.e. install.packages("package.name")).
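
The three steps can be strung together as in the sketch below. It uses the parallel package purely as a stand-in, since it ships with R and so the install step is skipped. Swap in whichever package you actually need:

```r
# "parallel" is a placeholder; it comes with R, so nothing is downloaded.
pkg <- "parallel"

# Install only if the package is missing (requires an internet connection).
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load the package for the current session.
library(parallel)

# Update all installed packages without prompting. Left commented out
# because it needs network access and can take a while.
# update.packages(ask = FALSE)

# Confirm that the package is attached to the search path.
pkg_loaded <- "package:parallel" %in% search()
pkg_loaded   # TRUE
```

The requireNamespace() guard is a common convenience for scripts shared with others, so that the install step only runs on machines where the package is absent.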

If you don’t feel like typing these commands manually, one of the many advantages of the RStudio IDE is that it makes installing and updating packages very easy (autocompletion, package search, etc.). Just click on the “Packages” tab of the bottom-right panel.

R tries to guess what you meant

R is friendly and tries to help if you weren’t specific enough. Consider the following hypothetical OLS regression, where lm() is just the workhorse function for linear models in R:

lm(wage ~ education + gender, data = dataframe1)

Here, we could use a string variable like gender (which takes values like "female" and "male") directly in our regression call. R knows what you mean: you want indicator variables for the levels of the variable.4
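
A runnable sketch of this behaviour follows. The data frame is invented for illustration (six made-up observations), so the estimates themselves mean nothing. The point is the coefficient names:

```r
# Hypothetical data; gender is stored as a plain character (string) column.
dataframe1 <- data.frame(
  wage      = c(15, 25, 14, 30, 20, 24),
  education = c(12, 16, 12, 18, 14, 16),
  gender    = c("female", "male", "female", "male", "male", "female"),
  stringsAsFactors = FALSE
)

fit <- lm(wage ~ education + gender, data = dataframe1)

# lm() turned gender into a factor and added an indicator for "male".
# "female", the first level alphabetically, is the omitted baseline.
coef_names <- names(coef(fit))
coef_names
# "(Intercept)" "education" "gendermale"
```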

Mostly, this is a good thing, but sometimes R’s desire to help can hide programming mistakes and idiosyncrasies. So it’s best to be aware, e.g.:

## [1] 2
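
As an illustration (an assumption on our part, not necessarily the example that produced the output above), R silently converts logical values to numbers when you do arithmetic on them, so the following runs without any warning:

```r
# Logical values are coerced to integers in arithmetic: TRUE is 1, FALSE is 0.
x <- TRUE + TRUE
x   # 2

# This is handy for counting, but it also means a typo can go unnoticed.
n_positive <- sum(c(-1, 4, 7) > 0)
n_positive   # 2
```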

R easily (and infinitely) parallelizes

Parallelization in R is easily done thanks to various packages like parallel, pbapply, future, and foreach.

Let’s illustrate by way of a simulation. First we’ll create some data (our_data) and a function (our_reg), which draws a sample of 10,000 observations and runs a regression.
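
One way such a setup might look is sketched below. Only the names our_data and our_reg and the sample size of 10,000 come from the text. The population size, the seed, and the data-generating process are assumptions made for illustration:

```r
set.seed(123)  # arbitrary seed, for reproducibility

# Hypothetical population: y depends linearly on x with a slope of 2.
n_pop <- 1e5
our_data <- data.frame(x = rnorm(n_pop))
our_data$y <- 3 + 2 * our_data$x + rnorm(n_pop)

# Draw 10,000 rows at random, regress y on x, and return the slope estimate.
# The argument i is unused; it only lets the function be called once per
# iteration by lapply()-style functions.
our_reg <- function(i) {
  rows <- sample.int(nrow(our_data), size = 1e4, replace = TRUE)
  fit <- lm(y ~ x, data = our_data[rows, ])
  unname(coef(fit)["x"])
}

est <- our_reg(1)  # a single draw; should land close to 2
```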

With our data and function created, let’s run the simulation without parallelization:

## 73.576 sec elapsed

Now run the simulation with parallelization (12 cores):

## 18.125 sec elapsed

Not only was this about four times faster5, but notice how little the syntax changed to run the parallel version. The only change is the added cl = 12 argument: pblapply(X = 1:1e4, FUN = our_reg, cl = 12).

Here’s another parallel option just to drive home the point. (In R, there are almost always multiple ways to get a particular job done.)

## 17.942 sec elapsed
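
For readers who want to try this without installing anything, the sketch below uses only the parallel package that ships with R. It is not necessarily the approach timed above. The function slope_once, the data sim_data, the two workers, and the 100 iterations are all assumptions, kept small so that it runs quickly:

```r
library(parallel)  # ships with R

set.seed(123)
sim_data <- data.frame(x = rnorm(1e4))
sim_data$y <- 3 + 2 * sim_data$x + rnorm(1e4)

# Hypothetical stand-in for our_reg: resample 1,000 rows, return the slope.
# The data is passed as an argument so that each worker receives a copy.
slope_once <- function(i, data) {
  rows <- sample.int(nrow(data), size = 1000, replace = TRUE)
  unname(coef(lm(y ~ x, data = data[rows, ]))["x"])
}

cl <- makeCluster(2)            # 2 workers; raise this to suit your machine
clusterSetRNGStream(cl, 123)    # reproducible random numbers on the workers
sim_par <- parLapply(cl, 1:100, slope_once, data = sim_data)
stopCluster(cl)                 # always release the workers when done

# The serial equivalent differs only in the function that is called.
sim_serial <- lapply(1:100, slope_once, data = sim_data)

length(sim_par)   # 100
```

With so few iterations the parallel version may well be slower than the serial one, because starting the workers takes time. The gain shows up as the number of iterations grows.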

Further, many packages in R default (or have options) to work in parallel. E.g., the regression package lfe uses the available processing power to estimate fixed-effect models.

Again, all of this extra parallelization functionality comes for free. In contrast, have you looked up the cost of a Stata/MP license recently? (Never mind that you effectively pay per core!)

Note: This parallelization often means that you move away from for loops and toward parallelized replacements (e.g., lapply has many parallelized implementations).6

Working with matrices

Because R began its life as a statistical language/environment, it plays very nicely with matrices.

Create a matrix:

##      [,1] [,2] [,3]
## [1,]    3    5    3
## [2,]    2    9    2
## [3,]    3    4    7

Assign (store) a matrix:

Invert a matrix:

##            [,1]        [,2]  [,3]
## [1,]  0.8088235 -0.33823529 -0.25
## [2,] -0.1176471  0.17647059  0.00
## [3,] -0.2794118  0.04411765  0.25
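
The commands behind the three steps above are not shown, so here is a base R sketch that reproduces the same matrix and its inverse. The name A is an assumption:

```r
# Create a 3x3 matrix; the values fill it column by column.
matrix(c(3, 2, 3, 5, 9, 4, 3, 2, 7), ncol = 3)

# Assign (store) the matrix under a name. Assignment prints nothing.
A <- matrix(c(3, 2, 3, 5, 9, 4, 3, 2, 7), ncol = 3)

# Invert the matrix with solve().
A_inv <- solve(A)
A_inv

# Check: A times its inverse is the identity matrix (up to rounding).
is_identity <- isTRUE(all.equal(A %*% A_inv, diag(3)))
is_identity   # TRUE
```

Note that %*% is matrix multiplication, whereas a plain * multiplies element by element.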

R plays nicely with Markdown

Notebooks, websites, presentations, etc. can all easily include:

code chunks,

## [1] 32

evaluated code,

## [1] TRUE

normal or mathematical text,

\[\left(\text{e.g., }\dfrac{x^2}{3}\right)\]

and even interactive content like leaflet maps.

Yes, Stata 15 has some Markdown support, but the difference in functionality is pretty stark.

What’s next?

Now that you (hopefully) have a better sense of R, let’s head over to the regression intro section to try some hands-on examples.


  1. CRAN stands for Comprehensive R Archive Network. It is the central repository for downloading R itself and (vetted) packages.

  2. If you want to get really meta: the pacman package helps you… manage packages.

  3. R uses both single ('word') and double quotes ("word") to reference characters (strings).

  4. Variables in R that have different qualitative levels are known as “factors”. Behind the scenes, R is converting gender from a string to a factor for you, although you can also do this explicitly yourself.

  5. It’s not a full 12 times faster because of the overhead needed to run this code in parallel (among other things). Since this overhead is largely a fixed cost, the relative speed-up will improve as we increase the number of iterations.

  6. There are also parallelized versions of for loops (e.g., via the foreach package).