Traditional .R scripts (analogous to a .do file in Stata) are a standard way to write pure R code. You could create a new R script yourself now by clicking the “New File” icon at the top-left of your RStudio session. Seriously, go ahead and try it quickly.
What you see in front of you, however, is not a plain R script. It is an R Markdown document (file extension: .Rmd). This is simply a document type that allows you to combine both text (like the sentence you are reading now) and actual R code. It is similar to Stata’s dyndoc and is a very convenient way to integrate text and code in a single document. Think of it as the baby that LaTeX and R had together, only one that is very easy to use (a very good baby?). In fact, the same Rmd file can be “knitted” to multiple formats: html, pdf, rich text, etc.
We don’t want to get too sidetracked by the special features of R Markdown… and should emphasise that it is not the only (or even “standard”) way to do analysis in R. But it is an extremely popular feature of R and is also a great way to teach a session like this. If you are interested in learning more, then the official website is a great place to start.1 Note further that this document is about as vanilla as it gets, but you can get extremely fancy.
Behind the scenes, in the .Rmd document we are typing in “code chunks” that get run by R. These code chunks are fenced in with triple backticks.
For example, this is a code chunk that will evaluate the command sin(3):
```{r}
sin(3)
```
The output of this will appear as follows:
## [1] 0.14112
You can also run calculations inline. For example, `r sin(3)` will evaluate sin(3), with the output rendered like this: 0.14112.
Okay, with that bit of R Markdown prologue out of the way, let’s get down to running actual R code.
An awful lot of work in R gets done through third-party packages that must be installed separately from “base” R. This is similar to how third-party packages in Stata can be installed from ssc. We recommend using RStudio’s package installer to find and install (or update) any external packages. We’ll show you how to do this in the live session, or you can take a look here. But note that you can also install R packages directly from the R console, e.g.
install.packages(c('ggplot2', 'dplyr'))
A key difference between R and Stata, however, is that installed R packages must be loaded into memory if you want to use them in that session. Think of it like an app on your phone. You might have downloaded the app already, but you need to open it every time you use it. Later on we will cover how to manage packages efficiently. But right now, assuming that you have already installed them, let’s load and play around with some R packages.
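The installed-versus-loaded distinction can be checked from the console. The following is a minimal sketch using only base R functions; the object names `pkgs`, `is_installed`, and `is_attached` are illustrative and not part of the original session.

```r
# Packages used later in this document
pkgs <- c("ggplot2", "dplyr")

# Installed? requireNamespace() returns TRUE if the package can be found
# on this machine; it does not attach the package to the search path.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(is_installed)

# Loaded for interactive use? Attached packages show up in search()
# as "package:<name>" after a call to library().
is_attached <- paste0("package:", pkgs) %in% search()
print(is_attached)

# A single function can also be called without attaching its package
# by using the package::function syntax.
stats::median(c(1, 5, 9))
```

In a fresh session you would typically see the packages reported as installed but not yet attached, which is exactly the "downloaded but not opened" state of the phone app analogy.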
Here we are going to load the ggplot2 package, which is an excellent graphics package. The gg here stands for “grammar of graphics”. It is part of the tidyverse suite of packages that you may have heard of. We will also load the dplyr package, which is a popular data wrangling package (and is also part of the tidyverse).
# library(tidyverse) ## Shortcut to load both ggplot2 & dplyr (& several other packages)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Aside: Comments in .R files (or code chunks) are denoted with the # character. A shortcut to comment out a line (or region) in RStudio is Ctrl + Shift + c. This shortcut generally just works correctly for whatever language or script that you open in RStudio.2
Now that these packages are loaded into memory, we can begin to use them. In the code below, we’ll explore the diamonds dataset that comes bundled with ggplot2. Much like Stata’s auto dataset, many packages bundle pre-installed datasets that are useful for tutorials and debugging.
The diamonds dataset is already available to us, since we’ve loaded ggplot2 into memory. But let’s bring it visibly into our global Environment (top right-hand pane in RStudio). This would happen automatically if we read in an external file like a .csv or .dta. We’ll get to external file I/O later, though.
data("diamonds")
We can look at the first six observations (the default) using the head function:
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
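The number of rows that head returns is controlled by its n argument. As a small sketch (the object names `first_five` and `last_three` are only for illustration), you can ask for a different number of rows, look at the end of the data with tail, or check the overall dimensions:

```r
library(ggplot2)  # provides the diamonds dataset

# First five rows instead of the default six
first_five <- head(diamonds, n = 5)
print(first_five)

# Last three rows
last_three <- tail(diamonds, n = 3)
print(last_three)

# Number of rows and columns in the full dataset
dim(diamonds)
```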
We can get a list of column names fairly easily too
names(diamonds)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
Tip: You can bring the dataset into view (similar to Stata’s browse feature) by typing View(diamonds), or just by clicking on it in your Environment pane.
Let’s compute the average price by color. The next code chunk uses dplyr commands (“verbs”) and invokes a “pipe” (the %>% syntax) to write cleaner code.
## summarise(group_by(diamonds, color), mean(price))
diamonds %>% group_by(color) %>% summarise(mean(price))
## # A tibble: 7 x 2
## color `mean(price)`
## <ord> <dbl>
## 1 D 3170.
## 2 E 3077.
## 3 F 3725.
## 4 G 3999.
## 5 H 4487.
## 6 I 5092.
## 7 J 5324.
Here the pipe command %>% takes the output from the command on the left and “pipes” it as input to the command on the right. So above we take the diamonds data, pipe it into group_by, which groups by the unique values in the column color, and then pipe the result into the summarise command (summarize is an equivalent spelling), which calculates the mean price in each group.
Two things to note. First off, you can just move onto a new line without an error. This is neat because it allows us to write cleaner code, and you don’t need a delimiter or the /// you may be used to from Stata. So we can rewrite the above like this.
diamonds %>%
  group_by(color) %>%
  summarize(mean(price))
## # A tibble: 7 x 2
## color `mean(price)`
## <ord> <dbl>
## 1 D 3170.
## 2 E 3077.
## 3 F 3725.
## 4 G 3999.
## 5 H 4487.
## 6 I 5092.
## 7 J 5324.
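To make the equivalence between nested calls and the pipe concrete, here is a short sketch that computes the same summary both ways and compares the results. The column name `mean_price` and the object names `nested` and `piped` are illustrative choices; naming the summary column also avoids the awkward `mean(price)` header in the output.

```r
library(ggplot2)  # diamonds dataset
library(dplyr)    # group_by(), summarise(), and the %>% pipe

# Nested version: read from the inside out
nested <- summarise(group_by(diamonds, color), mean_price = mean(price))

# Piped version: read from top to bottom
piped <- diamonds %>%
  group_by(color) %>%
  summarise(mean_price = mean(price))

# Both approaches return the same table
all.equal(nested, piped)
print(piped)
```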
Second, in more recent versions of R (4.1 and above), you don’t need to have dplyr loaded to pipe (the %>% pipe originates in the magrittr package and is re-exported by dplyr). You can do the same thing using the native pipe |>.
diamonds |>
  group_by(color) |>
  summarize(mean(price))
## # A tibble: 7 x 2
## color `mean(price)`
## <ord> <dbl>
## 1 D 3170.
## 2 E 3077.
## 3 F 3725.
## 4 G 3999.
## 5 H 4487.
## 6 I 5092.
## 7 J 5324.
Note: In this example, %>% and |> are interchangeable, but the two are not interchangeable in all settings. See https://www.r-bloggers.com/2021/05/the-new-r-pipe/ for more details.
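Two of the better-known differences can be sketched as follows. This is an illustration rather than a complete list, and it assumes R 4.2 or later, because the `_` placeholder for the native pipe was added after the pipe itself. The example uses the built-in mtcars data so that it needs only magrittr.

```r
library(magrittr)  # home of the %>% pipe (dplyr re-exports it)

x <- c(1, 4, 9)

# 1. magrittr accepts a bare function name on the right-hand side...
a <- x %>% sqrt
# ...whereas the native pipe requires an explicit function call.
b <- x |> sqrt()

# 2. Placeholders for piping into an argument other than the first.
# magrittr uses a dot:
fit_magrittr <- mtcars %>% lm(mpg ~ wt, data = .)
# The native pipe uses an underscore, which must be passed to a
# named argument (assumes R >= 4.2):
fit_native <- mtcars |> lm(mpg ~ wt, data = _)

# Both fits give the same coefficients
all.equal(coef(fit_magrittr), coef(fit_native))
```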
An annoying, but also powerful, feature of R is that there are 15 ways to accomplish any task. In general, limiting the number of packages you use is a good idea, since you avoid building dependencies on code that could become outdated as base R and other packages are updated. That being said, much of the power of R comes from its fantastic user-written packages. Here we will use the datasummary*() family of functions from the modelsummary package to make fantastic summary statistic tables. Otherwise we’d have to keep exploring the data and building summary tables on our own, as we did above.
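For a sense of what building a summary table by hand involves, here is one possible sketch using dplyr’s across helper. The choice of variables and statistics, and the object name `manual_summary`, are illustrative assumptions rather than the approach used in the rest of this document.

```r
library(ggplot2)  # diamonds dataset
library(dplyr)    # summarise() and across()

# Mean, standard deviation, and median for two numeric columns.
# across() names the results <column>_<statistic>, e.g. price_mean.
manual_summary <- diamonds %>%
  summarise(across(c(carat, price),
                   list(mean = mean, sd = sd, median = median)))

print(manual_summary)
```

This works, but every additional statistic or variable has to be spelled out, and the result still needs reshaping before it looks like a presentable table.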
Aside: In Stata, many ssc modules consist of one primary command, which shares the same name as the module (e.g. ivreg2, reghdfe). R packages tend to have many functions, so you shouldn’t expect these functions to share the exact same name as the package itself.
First, load the modelsummary package:
library(modelsummary)
Now we can use the datasummary_skim function.
datasummary_skim(diamonds)
| | Unique (#) | Missing (%) | Mean | SD | Min | Median | Max |
|---|---|---|---|---|---|---|---|
| carat | 273 | 0 | 0.8 | 0.5 | 0.2 | 0.7 | 5.0 |
| depth | 184 | 0 | 61.7 | 1.4 | 43.0 | 61.8 | 79.0 |
| table | 127 | 0 | 57.5 | 2.2 | 43.0 | 57.0 | 95.0 |
| price | 11602 | 0 | 3932.8 | 3989.4 | 326.0 | 2401.0 | 18823.0 |
| x | 554 | 0 | 5.7 | 1.1 | 0.0 | 5.7 | 10.7 |
| y | 552 | 0 | 5.7 | 1.1 | 0.0 | 5.7 | 58.9 |
| z | 375 | 0 | 3.5 | 0.7 | 0.0 | 3.5 | 31.8 |
But not every variable is numeric. datasummary_skim has this covered with the type = "categorical" argument.
datasummary_skim(diamonds, type = "categorical")
| | | N | % |
|---|---|---|---|
| cut | Fair | 1610 | 3.0 |
| | Good | 4906 | 9.1 |
| | Very Good | 12082 | 22.4 |
| | Premium | 13791 | 25.6 |
| | Ideal | 21551 | 40.0 |
| color | D | 6775 | 12.6 |
| | E | 9797 | 18.2 |
| | F | 9542 | 17.7 |
| | G | 11292 | 20.9 |
| | H | 8304 | 15.4 |
| | I | 5422 | 10.1 |
| | J | 2808 | 5.2 |
| clarity | I1 | 741 | 1.4 |
| | SI2 | 9194 | 17.0 |
| | SI1 | 13065 | 24.2 |
| | VS2 | 12258 | 22.7 |
| | VS1 | 8171 | 15.1 |
| | VVS2 | 5066 | 9.4 |
| | VVS1 | 3655 | 6.8 |
| | IF | 1790 | 3.3 |
Learn more about the datasummary* family of functions here: https://vincentarelbundock.github.io/modelsummary/articles/datasummary.html
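As a taste of the more general datasummary function, the sketch below builds a small custom table from a formula, with variables on the left-hand side (rows) and statistics on the right-hand side (columns). The choice of variables and statistics is illustrative, and output = "data.frame" is used here only so that the result prints as a plain data frame; see the linked documentation for the full set of options.

```r
library(ggplot2)       # diamonds dataset
library(modelsummary)  # datasummary() plus helper statistics such as Mean and SD

# Rows: price and carat. Columns: mean, standard deviation, and median.
custom_table <- datasummary(price + carat ~ Mean + SD + Median,
                            data = diamonds,
                            output = "data.frame")

print(custom_table)
```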
Next we will use the ggplot function. You can learn more about any R function by typing ? before the function name (e.g., ?ggplot).
In the code below, we tell ggplot which dataset to use (diamonds), the y-variable (price), and the x-variable (carat).
ggplot(data = diamonds, aes(y = price, x = carat)) +
  geom_point()
Think of changing graphic features in ggplot as adding layers. You can do this by repeating all of the same code, or by saving the above as an object and then adding to that object.
For example, let’s add a third line where we use a built-in theme called classic that I like.
ggplot(data = diamonds, aes(y = price, x = carat)) +
  geom_point() +
  theme_classic()
We can obtain the same result by saving the above as an object and adding to it.
base_plot = ggplot(data = diamonds, aes(y = price, x = carat)) +
  geom_point()
You can also use the <- arrow instead of the = to assign the plot to the name base_plot.
base_plot +
  theme_classic()
All we need to do is add the theme line theme_classic() to the saved object.
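One payoff of saving a plot as an object is that the same base can be reused for several variants, or written to disk. The sketch below assumes hypothetical object names (`classic_plot`, `minimal_plot`) and an illustrative file name in a temporary directory; ggsave is the ggplot2 function for saving a plot to a file.

```r
library(ggplot2)

# Save the base scatter plot once
base_plot <- ggplot(data = diamonds, aes(y = price, x = carat)) +
  geom_point()

# Reuse it with different themes; base_plot itself is left unchanged
classic_plot <- base_plot + theme_classic()
minimal_plot <- base_plot + theme_minimal()

# Write one version to a PNG file (width and height are in inches)
out_file <- file.path(tempdir(), "diamonds_scatter.png")
ggsave(out_file, plot = classic_plot, width = 6, height = 4)
```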
It’s very simple to make complex plots in R.
We can add some transparency to the points with the alpha argument:
ggplot(data = diamonds, aes(y = price, x = carat)) +
  geom_point(alpha = .33) +
  theme_classic()
We can also customize the labels and increase the size of the text:
ggplot(data = diamonds, aes(y = price, x = carat)) +
  geom_point(alpha = .1) +
  theme_classic() +
  theme(text = element_text(size = 18)) +
  labs(title = "Larger diamonds cost more",
       subtitle = "Price, $",
       y = "",
       x = "Carat")
Recall that the data contain information on diamond color. We can easily create small multiples of the scatter plot for each color.
ggplot(data = diamonds, aes(y = price, x = carat)) +
  geom_point(alpha = .1) +
  facet_wrap(~color) +
  theme_classic() +
  theme(text = element_text(size = 14)) +
  labs(title = "Larger diamonds cost more by diamond color",
       subtitle = "Price, $",
       y = "",
       x = "Carat")
Similarly, we can color the points (or vary their shapes) based on diamond clarity:
ggplot(data = diamonds, aes(y = price, x = carat, color = clarity)) +
  geom_point(alpha = .33) +
  facet_wrap(~color) +
  theme_classic() +
  theme(text = element_text(size = 14)) +
  labs(title = "Larger diamonds cost more by diamond color",
       subtitle = "Price, $",
       y = "",
       x = "Carat")