Before we start the econometrics part of this course, I’ll introduce the data.table R package, which I strongly suggest you learn. The data.table package provides a high-performance version of base R’s data.frame. It is much faster and more memory-efficient, making it ideal for large and complex datasets (i.e. the kind you will probably deal with during your thesis).
This class is based on data.table’s introduction vignette, which is more detailed and should be consulted.
Installation
Install and load the package in order to use it.
# install.packages("data.table")
library(data.table)
Reading CSV files
We use the package’s fread function to load datasets saved as formatted text files. The most important arguments are sep, nrows and encoding. You can always type ?fread in the console to check the function’s help page.
dt <- fread("Data/escolas.csv",
            sep = ",",
            nrows = Inf,
            encoding = "UTF-8")
dt
## V1 escola aluno logico mulher idade
## 1: 89 1 89 -0.6110555 1 10.30411
## 2: 151 1 151 -1.0899623 0 10.64384
## 3: 152 1 152 -0.1321487 1 11.06027
## 4: 235 1 235 0.5063938 0 11.26301
## 5: 281 1 281 1.6238429 1 11.04658
## ---
## 902: 791 30 791 0.6660294 1 10.72603
## 903: 792 30 792 -0.2917843 1 11.37808
## 904: 793 30 793 -0.2917843 1 11.56438
## 905: 794 30 794 0.1871226 1 11.45205
## 906: 865 30 865 0.1871226 1 10.62740
Now dt is a data.table, which also inherits methods from base R’s data.frame.
Data Manipulation
The general form of data.table syntax is DT[i, j, by], where we can read it as: take DT, subset/reorder rows using i, then calculate columns in j, grouped by by.
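As a toy illustration of this general form (the table and column names below are made up, not from our dataset), all three arguments can be combined in a single call:

```r
library(data.table)

# hypothetical toy table, just to illustrate DT[i, j, by]
DT <- data.table(grp = c("a", "a", "b", "b"),
                 x   = c(1, 2, 3, 4))

# take DT, keep rows where x > 1 (i), compute the mean of x (j),
# grouped by grp (by)
DT[x > 1, .(mean_x = mean(x)), by = grp]
##    grp mean_x
## 1:   a    2.0
## 2:   b    3.5
```

Each of the three parts is optional; the following sections cover them one at a time.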
Subsetting rows in i
We can select only a handful of DT’s rows according to some criterion. Let’s choose only the observations of girls aged 10 or more. Notice we do not need to quote column names inside data.table’s brackets when selecting them in either i or j.
dt[mulher == 1 & idade >= 10]
## V1 escola aluno logico mulher idade
## 1: 89 1 89 -0.6110555 1 10.30411
## 2: 152 1 152 -0.1321487 1 11.06027
## 3: 281 1 281 1.6238429 1 11.04658
## 4: 290 1 290 0.1871226 1 10.62466
## 5: 308 1 308 -0.4514198 1 10.43288
## ---
## 415: 791 30 791 0.6660294 1 10.72603
## 416: 792 30 792 -0.2917843 1 11.37808
## 417: 793 30 793 -0.2917843 1 11.56438
## 418: 794 30 794 0.1871226 1 11.45205
## 419: 865 30 865 0.1871226 1 10.62740
You can also select the first n rows with a numeric index. The resulting data.table can then be chained to perform other operations, for example, reordering.
dt[1:5][order(-idade)]
## V1 escola aluno logico mulher idade
## 1: 235 1 235 0.5063938 0 11.26301
## 2: 152 1 152 -0.1321487 1 11.06027
## 3: 281 1 281 1.6238429 1 11.04658
## 4: 151 1 151 -1.0899623 0 10.64384
## 5: 89 1 89 -0.6110555 1 10.30411
Selecting columns in j
If you are not interested in filtering rows, skip i by leaving it blank. Otherwise, subsetting rows and selecting columns can coexist as expected. Here we select only the escola and aluno columns.
dt[, .(escola, aluno)]
## escola aluno
## 1: 1 89
## 2: 1 151
## 3: 1 152
## 4: 1 235
## 5: 1 281
## ---
## 902: 30 791
## 903: 30 792
## 904: 30 793
## 905: 30 794
## 906: 30 865
Whenever j returns a list, each element of the list becomes a column in the resulting data.table. .() is just an alias for list(); therefore, in the example above we are computing a list with two elements, escola and aluno, in j, and the final result is a data.table containing only those two columns.
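To convince yourself that .() and list() are interchangeable in j, compare the two calls directly (a quick sketch on a hypothetical toy table):

```r
library(data.table)

DT <- data.table(escola = 1:3, aluno = 4:6, idade = c(10, 11, 12))

# .(...) is parsed as list(...) inside the brackets,
# so both calls build the same two-column data.table
identical(DT[, .(escola, aluno)], DT[, list(escola, aluno)])
## [1] TRUE
```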
Suppose we want to select columns programmatically. We do not know beforehand the names of the columns we are interested in, only a pattern their names follow. Let’s select all columns ending in ade (i.e. in our case only idade).
cols <- grep("ade$", colnames(dt), value = TRUE)
dt[, ..cols]
## idade
## 1: 10.30411
## 2: 10.64384
## 3: 11.06027
## 4: 11.26301
## 5: 11.04658
## ---
## 902: 10.72603
## 903: 11.37808
## 904: 11.56438
## 905: 11.45205
## 906: 10.62740
Computing over columns
The greatest advantage of storing your data in a tidy way is that it becomes easier to compute new variables from pre-existing ones. In a tidy data structure each column represents a variable and each row an observation (i.e. imagine a cross-section), so new variables are created simply by adding new columns to the data structure. Suppose we want to compute the product of the logical exam score and age and return this product for each observation: just do that in j.
dt[, logico*idade][1:5]
## [1] -6.296382 -11.601380 -1.461600 5.703520 17.937903
What is happening here? Notice that j is not returning a list, but the product of two vectors, which is a vector in R. Therefore, our result is not a data.table but a vector (i.e. a unidimensional array). We can return a data.table just by enclosing the expression in .(). Moreover, we can also name these columns.
dt[, .(product = logico*idade,
       sum = logico + idade)]
## product sum
## 1: -6.296382 9.693054
## 2: -11.601380 9.553873
## 3: -1.461600 10.928125
## 4: 5.703520 11.769408
## 5: 17.937903 12.670418
## ---
## 902: 7.143849 11.392057
## 903: -3.319946 11.086298
## 904: -3.374306 11.272599
## 905: 2.142938 11.639177
## 906: 1.988626 10.814520
Don’t mind just yet that we are not returning all the other columns of the original data.table. We will do that when presenting data.table’s reference semantics.
Aggregations
Aggregations are operations on a vector that result in a single value. Some of the most common aggregations are: sum, mean, min (or max) and first (or last). We can perform these operations on the whole dataset or on subsamples of it defined by the argument by. When using by we are computing the aggregations by group, each group defined by the chosen variable. This framework is also known as split-apply-combine, since we split the data into groups, apply a function (typically an aggregation) to each of these groups and recombine the results back into one data structure.
First let’s introduce the special symbol .N, which holds the number of observations in the current group. It is used to calculate the number of observations in a subsample. How many boys and girls are there in the dataset?
dt[, .N, by = list(mulher)]
## mulher N
## 1: 1 429
## 2: 0 477
When there is only one expression in j, the use of .N will return a data.table. If you have more than one column being computed in j, then you should enclose them in a list. Also, by accepts either a list of unquoted column names or a character vector of column names (ideal when grouping programmatically). If grouping by a single column, a bare name is also accepted for convenience.
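For instance, when the grouping columns are only known at run time, a character vector passed to by behaves like a list of bare names (sketched here on a toy table; group_cols is a made-up variable name):

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b"), x = c(1, 2, 3))

group_cols <- "g"          # column names decided programmatically
DT[, .N, by = group_cols]  # equivalent to DT[, .N, by = g]
##    g N
## 1: a 2
## 2: b 1
```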
Another very useful special symbol is .SD, short for Subset of Data. It is also helpful to think of the “S” as standing for “Selfsame” or “Self-reference”: in its most basic guise, .SD is a reflexive reference to the data.table itself.
Often, we’d like to perform some operation on our data at the group level. When we specify by = (or keyby =), what happens when data.table processes j is that your data.table is split into many component sub-data.tables (.SD), each of which corresponds to a single value of your by variable.
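A minimal sketch of this splitting, on a hypothetical two-column toy table, is to look at .SD’s dimensions per group:

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b"), x = 1:3, y = 4:6)

# for each value of g, .SD holds that group's rows and the
# non-grouping columns (here x and y)
DT[, .(rows_in_SD = nrow(.SD), cols_in_SD = ncol(.SD)), by = g]
##    g rows_in_SD cols_in_SD
## 1: a          2          2
## 2: b          1          2
```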
Different aggregations to different columns
Now let’s do some aggregations. What are the number of children and their average score in each school? Return a data.table with columns named obs and avg.
dt[, .(obs = .N, avg = mean(logico)), by = escola][1:5]
## escola obs avg
## 1: 1 12 0.05409289
## 2: 2 33 0.63216722
## 3: 3 21 -0.12454697
## 4: 4 71 NA
## 5: S/escola 2 0.34675819
Same aggregation to different columns
Now suppose we want to calculate the average exam score AND student’s age by school. We can certainly compute these two columns individually, using the same function, mean, twice. But we can also use the special symbol .SD, which stands for Subset of Data. .SD is itself a data.table that holds the data for the current by group, with the columns given by .SDcols. We can check what we mean by “.SD being a data.table split by groups with the columns selected in .SDcols.”
dt[, print(.SD), by = mulher, .SDcols = c("logico", "idade")]
## logico idade
## 1: -0.6110555 10.30411
## 2: -0.1321487 11.06027
## 3: 1.6238429 11.04658
## 4: 0.1871226 10.62466
## 5: -0.4514198 10.43288
## ---
## 425: 0.6660294 10.72603
## 426: -0.2917843 11.37808
## 427: -0.2917843 11.56438
## 428: 0.1871226 11.45205
## 429: 0.1871226 10.62740
## logico idade
## 1: -1.08996230 10.64384
## 2: 0.50639380 11.26301
## 3: 1.30457182 10.61096
## 4: 0.66602936 10.61096
## 5: 0.18712257 10.84658
## ---
## 473: 0.82566491 11.12877
## 474: -0.13214867 10.94795
## 475: 0.02748688 10.92877
## 476: -0.45141984 10.59178
## 477: 0.98530047 10.52603
## Empty data.table (0 rows and 1 cols): mulher
We have printed two data.tables, one with 429 rows (girls) and the other with 477 (boys). Each data.table contains only the columns selected by .SDcols, logico and idade. The grouping variable mulher is NOT part of .SD. To compute on multiple columns, we can then simply use the base R function lapply(). This function takes the .SD data.table as its first argument and computes the function FUN (its second argument) on each column, returning a list of computed values. Answering our question: what is the average exam score and age by school?
dt[, lapply(.SD, mean), by = escola,
   .SDcols = c("logico", "idade")][1:5]
## escola logico idade
## 1: 1 0.05409289 10.93721
## 2: 2 0.63216722 11.28775
## 3: 3 -0.12454697 11.07763
## 4: 4 NA 11.21408
## 5: S/escola 0.34675819 10.38904
Same aggregation for some columns only
Now suppose you want to know the mean of both exam score and age by school, but also the minimum and maximum ages, again by school. You can combine the previous two sections. Remember, j must return a list in order for the result to be a data.table.
dt[, c(lapply(.SD, mean), min = min(idade), max = max(idade)), by = escola,
   .SDcols = c("logico", "idade")][1:5]
## escola logico idade min max
## 1: 1 0.05409289 10.93721 10.304110 12.65479
## 2: 2 0.63216722 11.28775 9.758904 14.21096
## 3: 3 -0.12454697 11.07763 10.361644 12.57260
## 4: 4 NA 11.21408 9.698630 14.74521
## 5: S/escola 0.34675819 10.38904 10.221918 10.55616
Multiple aggregations to the same columns
If your objective is to apply multiple aggregations to the same column(s), the sapply function is flexible enough that you can define your own custom function to apply to the .SD columns. That way you can define a summary-statistics function that returns, for example, the number of observations, the mean and the standard deviation of the selected columns. Let’s do that and compute descriptive statistics, grouped by school, on the students’ age and exam score.
suffix <- c("count", "mean", "sd")
dt[, by = escola,
   setNames(
     sapply(.SD, function(x){
       list(.N, mean(x), sd(x))
     }),
     c(paste(rep(names(.SD), each = length(suffix)),
             suffix, sep = "_"))
   ),
   .SDcols = c("idade", "logico")][1:5]
## escola idade_count idade_mean idade_sd logico_count logico_mean logico_sd
## 1: 1 12 10.93721 0.6170148 12 0.05409289 0.8659292
## 2: 2 33 11.28775 0.9582739 33 0.63216722 1.3126071
## 3: 3 21 11.07763 0.5454569 21 -0.12454697 0.6513371
## 4: 4 71 11.21408 0.8510487 71 NA NA
## 5: S/escola 2 10.38904 0.2363480 2 0.34675819 0.2257589
Here, the sapply function returns a matrix where the rows contain the functions’ results and the columns are the variables you passed to .SDcols, for each group. data.table will juxtapose all the results for each column, in order, and then return. Mind that the custom function must return a list. The columns of the resulting data.table will have default names and you can always use setnames to rename them. Also notice that j and by can be interchanged inside the brackets. This is useful when j is a long expression and you want to write it over several lines of code.
UPDATE: Notice we have wrapped the sapply expression in the setNames function. That way we can rename the list returned by sapply and the resulting data.table will have the desired column names.
We could also have done the following:
cols <- c("idade", "logico")
suffix <- c("_mean", "_sd")
dt[, by = escola,
   setNames(
     c(.N, lapply(.SD, mean), lapply(.SD, sd)),
     c("obs", paste0(rep(names(.SD), length(suffix)),
                     rep(suffix, each = length(cols))))
   ),
   .SDcols = cols][1:5]
## escola obs idade_mean logico_mean idade_sd logico_sd
## 1: 1 12 10.93721 0.05409289 0.6170148 0.8659292
## 2: 2 33 11.28775 0.63216722 0.9582739 1.3126071
## 3: 3 21 11.07763 -0.12454697 0.5454569 0.6513371
## 4: 4 71 11.21408 NA 0.8510487 NA
## 5: S/escola 2 10.38904 0.34675819 0.2363480 0.2257589
This combines the results of each aggregation with c(). Beware of the column order and of repeated names (if setNames is not used). This is fine for a handful of operations, but it can become quite repetitive if you have many more aggregations to perform, and you will have to rename (and possibly reorder) your columns anyway.
Whenever you perform calculations in a data.table without using reference semantics, you should assign the result to another local variable. For example, up until now, none of our operations has changed the contents of dt! If you want to keep working with an intermediate result like the last one, you should save it to another data.table variable with the <- operator.
Saving an intermediate result is especially useful (and actually needed) when doing grouped computations that change the number of rows. The reference operator := cannot save a data.table with a smaller number of rows, so you need to save such results yourself. Keep in mind: whenever your computation is not using the reference operator :=, you should assign the result to a local variable.
averages <- dt[, lapply(.SD, mean), by = escola,
               .SDcols = c("logico", "idade")][1:5]
averages
## escola logico idade
## 1: 1 0.05409289 10.93721
## 2: 2 0.63216722 11.28775
## 3: 3 -0.12454697 11.07763
## 4: 4 NA 11.21408
## 5: S/escola 0.34675819 10.38904
Reference Semantics
Reference semantics allows us to add/update/delete columns by reference. Performing operations by reference is much faster than copying data to another position in memory and only then modifying it, which is the default in base R and dplyr. Besides the speed, operations by reference, since they do not copy memory contents, use very little RAM, which is ideal when you are running out of memory while handling your dataset.
To add/update/delete columns of a data.table you should use the := operator. Besides the speed and memory considerations mentioned before, operations by reference are done in place. Thus, you don’t need to save the results in another local variable (even if it takes the same name): the same data.table is updated, and you don’t need to keep inventing new names for your variables.
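One way to see this in-place behaviour (a small sketch using data.table’s address() helper, not part of our dataset) is to check that the object’s memory address does not change after a := assignment:

```r
library(data.table)

DT <- data.table(x = 1:3)

before <- address(DT)  # memory address before the update
DT[, y := x * 2]       # add a column by reference
after  <- address(DT)

identical(before, after)  # TRUE: DT was modified in place, no copy
```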
Adding columns by reference
The simplest use case of an operation by reference is to create a single new column from other ones. Let’s create an id column which is the concatenation of the school and student numbers.
dt[, id := paste(escola, aluno, sep = "_")]
dt[1:5]
## V1 escola aluno logico mulher idade id
## 1: 89 1 89 -0.6110555 1 10.30411 1_89
## 2: 151 1 151 -1.0899623 0 10.64384 1_151
## 3: 152 1 152 -0.1321487 1 11.06027 1_152
## 4: 235 1 235 0.5063938 0 11.26301 1_235
## 5: 281 1 281 1.6238429 1 11.04658 1_281
Note that we do not need to quote the new column’s name, id, when only one column is being created/updated. When more than one column is assigned, the functional form of := is preferable.
dt[, id := NULL] # removing the id column

dt[, `:=`(
  id = paste(escola, aluno, sep = "_"),
  product = logico*idade
)]
dt[1:5]
## V1 escola aluno logico mulher idade id product
## 1: 89 1 89 -0.6110555 1 10.30411 1_89 -6.296382
## 2: 151 1 151 -1.0899623 0 10.64384 1_151 -11.601380
## 3: 152 1 152 -0.1321487 1 11.06027 1_152 -1.461600
## 4: 235 1 235 0.5063938 0 11.26301 1_235 5.703520
## 5: 281 1 281 1.6238429 1 11.04658 1_281 17.937903
Deleting columns
You can also pass a character vector of column names to the LHS of :=, while the RHS always holds a list of values. This way you can create new columns programmatically.
dt[, c("id", "product") := NULL]
In the example above we have assigned the NULL value to the columns id and product; this is how you remove columns from a data.table.
Sub-assign by reference
A reader paying close attention to the examples above must have noticed some observations without a school number assigned. Those observations have the value “S/escola”, and that’s why escola is of type character. Now we want to fix that. Let’s assign the value 99, by reference, only to the rows where escola == "S/escola". In other words, let’s update column escola for selected observations only. After that we can also change the column class to integer by chaining the operations.
dt[escola == "S/escola", escola := "99"][, escola := as.integer(escola)]
dt[, lapply(.SD, class)]
## V1 escola aluno logico mulher idade
## 1: integer integer integer numeric integer numeric
Notice that inside the second pair of brackets we did not filter any rows (i is empty); this means that j is applied to all observations.
Grouped operations
You can also perform grouped operations by reference using the by argument as before. This can be quite useful in data analysis, and in econometrics in particular. Suppose we want to demean age and exam score by school. That is, for each observation we want to subtract the group mean given by the school and save the results in columns named idade_demean and logico_demean.
dt[, by = escola,
   `:=`(
     idade_demean = idade - mean(idade),
     logico_demean = logico - mean(logico, na.rm = TRUE)
   )]
dt[1:5]
## V1 escola aluno logico mulher idade idade_demean logico_demean
## 1: 89 1 89 -0.6110555 1 10.30411 -0.6331050 -0.6651483
## 2: 151 1 151 -1.0899623 0 10.64384 -0.2933790 -1.1440552
## 3: 152 1 152 -0.1321487 1 11.06027 0.1230594 -0.1862416
## 4: 235 1 235 0.5063938 0 11.26301 0.3257991 0.4523009
## 5: 281 1 281 1.6238429 1 11.04658 0.1093607 1.5697500
Here is another way to accomplish the same result.
dt[, c("idade_demean", "logico_demean") := NULL] # Remove columns

cols <- c("idade", "logico")
new_cols <- paste(cols, "demean", sep = "_")
dt[, by = escola,
   (new_cols) := lapply(.SD, function(x) x - mean(x, na.rm = TRUE)),
   .SDcols = cols]
dt[1:5]
## V1 escola aluno logico mulher idade idade_demean logico_demean
## 1: 89 1 89 -0.6110555 1 10.30411 -0.6331050 -0.6651483
## 2: 151 1 151 -1.0899623 0 10.64384 -0.2933790 -1.1440552
## 3: 152 1 152 -0.1321487 1 11.06027 0.1230594 -0.1862416
## 4: 235 1 235 0.5063938 0 11.26301 0.3257991 0.4523009
## 5: 281 1 281 1.6238429 1 11.04658 0.1093607 1.5697500