dbplyr
The primary references for today’s discussion can be found here, here, here and here.
Once again, most of the material that we will cover for the next few weeks is heavily influenced by the amazing lecture notes by Grant McDermott. You can visit his Github page for more information.
General packages for SQL and R.
R> if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
R> pacman::p_load(tidyverse, DBI, dbplyr, RSQLite, bigrquery, hrbrthemes, glue)
For the world inequality database.
R> install.packages("devtools")
R> devtools::install_github("WIDworld/wid-r-tool")
The session from this week is going to be about databases and a quick introduction to the basics of the SQL programming language. I am assuming that most of you have only worked with data that can easily fit into your computer's memory, since R (by default) takes all the data that you import and places it in your Random Access Memory (RAM). The benefit of doing this is that RAM is very fast, which makes operations on data fast. The downside is that most of you don't have massive amounts of RAM on your computers, so it is easy for data manipulations to overwhelm your RAM capacity. You will most surely encounter a situation, in your not so distant future, where you will have to deal with large datasets that exceed the memory of your laptop or desktop PC. If this happens you might want to look into storing the data in a relational database.
Relational databases (or simply databases) exist either on your local computer or on some remote server (storage). Generally, in the world of "big data", we will access data remotely, which means that the data won't be found on your computer's physical drive. It is important to note that with large datasets we are often not concerned with the entire dataset itself, but simply a small subset of it. What we will focus on in this class is connecting to a database to retrieve the chunks of data that you need for your analysis. Asking the database for the information that we want is referred to as submitting a query. This means that if you have access to a really large dataset that is available in a public or private database, then you can submit a query without having to download the entire dataset onto your computer. A query is our way of telling the database that we want to work with a subset of the data, which we can then draw into the memory on our computer. Once again, in the case of remote servers we don't have to download the data directly onto the computer: we simply connect to these servers, and once that connection is established we submit our query and the selected data is drawn into our computer's memory.
In the next session we will be working directly with a SQLite database. We will run some SQL commands directly on the database. This is not a detailed guide on how to use SQL; it simply introduces some of the core concepts. There are many resources online for learning SQL. This lecture just provides a quick survey so that you have an idea of what SQL is and what it does.
We are going to import some data from the WID database and then insert that into a SQLite database for us to work with. We could also work with the data in memory, but we want to showcase some of the SQL commands. We are going to use the command line to interface with the SQLite database at first, and then we will use the dbplyr package to work directly through RStudio.
wid package
There is a package that allows you to directly interface with the WID database. If you want to download some data from there you could use the following code as an example,
R> library(wid) ## for download_wid(); installed above from WIDworld/wid-r-tool
R> data <- download_wid(
+   indicators = "tptinc", # Thresholds of pre-tax national income
+   areas = c("US", "ZA"), # United States and South Africa
+   perc = c("p10p100", "p50p100", "p90p100", "p99p100", "p99.9p100")
+ )
R> df <- tibble(data)
We will not use this functionality for this lecture though.
We could also access the full dataset for South Africa by downloading it from the website in CSV format. There are many ways to get data from websites, but in this case I just downloaded it directly by clicking on the appropriate link and storing it on my computer. The CSV file is in the data directory for this lecture, so you don't have to download it if you have cloned the repository. It is called wid-dta-za.csv.
In this section we will show you how to utilise the command line to convert the CSV to an SQLite database. We can use the sqlite3 command to do this as follows,
$ cd ./data
$ sqlite3 -separator ';' wid_data_za.db ".import wid-dta-za.csv wid-dta-za"
This has already been done for you. The SQLite database is contained in the GitHub repository for this course.
In order to use the SQLite commands interactively, we need to enter into the SQLite console.
So, open up a terminal, and run
$ cd ./data
$ sqlite3 wid_data_za.db
My screen then displays the following information.
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
sqlite>
The SQLite command is sqlite3 and you are telling SQLite to open up the wid_data_za.db file. You need to specify the .db file, otherwise SQLite will open up a temporary, empty database.
To get out of SQLite, type .exit or .quit. For some terminals, Ctrl-D can also work. If you forget any SQLite . (dot) command, type .help.
Note: If you want nice syntax highlighting in the terminal then you can use the LiteCLI package. I will be showing this in the lecture.
Remember that for this section things are interactive, so we will not get output in the notebook. We need to type the code into our terminal. The first thing we would like to do is see the tables in the database. In our case there is only going to be one table.
sqlite> .tables
We can also see what the structure of the table looks like by using the schema command.
sqlite> .schema
We can select all of the columns in a table using *. However, I would not recommend doing this, since it will print every row and every column to the screen, which is a problem if you have a massive dataset.
sqlite> SELECT *
FROM "wid-dta-za";
You can also select certain columns by replacing the * with the column name.
sqlite> SELECT year
FROM "wid-dta-za";
Once again, we don’t want to display everything on screen, so don’t run these commands.
Sometimes we want to look at the unique values in a column, in which case we can use the DISTINCT keyword.
sqlite> SELECT DISTINCT country
FROM "wid-dta-za";
In this case you will see that the only country in our dataset is South Africa, given by ZA. We can also use the DISTINCT keyword on multiple columns,
sqlite> SELECT DISTINCT country, year
FROM "wid-dta-za";
Another thing that we can do is order all of the rows by year.
sqlite> SELECT *
FROM "wid-dta-za"
ORDER BY year;
We can do this in either ASC or DESC order.
sqlite> SELECT *
FROM "wid-dta-za"
ORDER BY year DESC;
One of the most important features of a database is the ability to filter data. We have encountered this in R with the filter command. In the case of SQL we use the WHERE clause in a query,
sqlite> SELECT *
FROM "wid-dta-za"
WHERE year <= '1955';
If we want to find the values in the database after the Second World War AND where GDP was above 16.5 billion rand per year, we can use the following,
sqlite> SELECT *
FROM "wid-dta-za"
WHERE gross_domest_product > '16500' AND year >= '1945';
Are there any interesting patterns that you can observe here?
We can also calculate new values from existing values. Perhaps we realise that our GDP values after the year 2000 need to be corrected by 5% due to some measurement error. Rather than modifying the stored data we can do this calculation on the fly as follows.
sqlite> SELECT gross_domest_product, 1.05 * gross_domest_product
FROM "wid-dta-za"
WHERE year > '2000';
We can rename the new field as follows,
sqlite> SELECT gross_domest_product AS gdp, 1.05 * gross_domest_product AS new_gdp
FROM "wid-dta-za"
WHERE year > '2000';
Databases use a special value called NULL to represent missing information. Almost all operations on NULL produce NULL. Queries can test for NULLs using IS NULL and IS NOT NULL. Below is an example where we check for a NULL value:
sqlite> SELECT year
FROM "wid-dta-za"
WHERE one_perc_share_income IS NULL;
This reveals the years where there is a NULL value. These values can cause headaches wherever they appear. For example, suppose we want to find the years in which the one percent share of income is greater than 0.2 and the one percent share of wealth is greater than 0.5. Then we could do the following,
sqlite> SELECT year
FROM "wid-dta-za"
WHERE one_perc_share_income > 0.2 AND one_perc_share_wealth > 0.5;
This reveals that this occurs on three different occasions, 2007, 2020 and 2021. However, if we specified that the country should also be South Africa, we get the following,
sqlite> SELECT year
FROM "wid-dta-za"
WHERE one_perc_share_income > 0.2 AND one_perc_share_wealth > 0.5 AND country = 'ZA';
The result is that we lose the value for 2021, since there is no country value allocated for 2021. This might simply be a mistake in the data, but it prevents us from seeing an obvious instance where there is a high level of inequality.
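If we believe the missing country code is just a data entry issue, one way around it is to allow NULL country values explicitly in the filter so that 2021 is not silently dropped. This is a sketch using standard SQL; whether it is appropriate depends on the data.

sqlite> SELECT year
FROM "wid-dta-za"
WHERE one_perc_share_income > 0.2 AND one_perc_share_wealth > 0.5
  AND (country = 'ZA' OR country IS NULL);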
In SQL we can use aggregation functions such as min or max. Each of these functions takes a set of records as input and produces a single record as output:
sqlite> SELECT max(gross_domest_product)
FROM "wid-dta-za";
min and max are just two of the aggregation functions built into SQL. Three others are avg, count, and sum:
sqlite> SELECT avg(gross_domest_product)
FROM "wid-dta-za";
Interestingly, when there are no values to aggregate (for example, where there are no rows satisfying the WHERE clause), the aggregation's result is "don't know" rather than zero or some other arbitrary value:
sqlite> SELECT min(gross_domest_product)
FROM "wid-dta-za"
WHERE country = 'USA';
We have covered some of the main commands that you will encounter in SQL. However, this has truly just been an introduction to get you started. We haven't even dealt with a fundamental concept such as JOIN (a quick sketch follows), so there is much more to learn. The reason that we aren't exploring every dimension of SQL is that the R interface with SQL often means that you won't need to write much SQL code to execute your queries. We will see this below.
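Just to give a flavour of what a JOIN does: it combines rows from two tables that share a key. A minimal sketch, assuming a hypothetical second table called population that also has a year column, might look like this,

sqlite> SELECT w.year, w.gross_domest_product, p.population
FROM "wid-dta-za" AS w
JOIN population AS p
  ON w.year = p.year;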
After some experience with SQL you will agree that it is a bit of an archaic language that is much less intuitive than the R tools that we have been using thus far in the course. There is some good news: even if you aren't comfortable with SQL, you already have all the programming skills you need to start working with databases. This is because the tidyverse, through dplyr, allows for direct communication with databases from your local R environment.
This means that you can interact with the vast datasets that are stored in relational databases using the same tidyverse verbs and syntax that we already know. All of this is possible thanks to the dbplyr package, which provides a database backend to dplyr. What's happening even further behind the scenes is that, upon installation, dbplyr suggests the DBI package as a dependency. DBI provides a common interface that allows dplyr to work with many different databases using exactly the same code. You don't even need to leave your RStudio session or learn SQL!
Aside: You will probably want to learn SQL eventually! Luckily, dplyr and dbplyr come with several features that can really help to speed up the learning and translation process. We'll get to these later in the lecture.
While DBI is automatically bundled with dbplyr, you'll need to install a specific backend package for the type of database that you want to connect to. There are many backends to choose from. We have worked with SQLite since it is quite lightweight and already installed on most operating systems. We will talk about bigrquery, which connects to Google BigQuery, when we have our session on cloud computing and big data.
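Switching backends mostly just means swapping out the backend object inside dbConnect(). As a rough sketch (the project, dataset and billing values below are placeholders that you would replace with your own), a BigQuery connection would look something like this,

R> library(bigrquery)
R> bq_con <- dbConnect(
+   bigrquery::bigquery(),
+   project = "your-gcp-project",  # placeholder: your Google Cloud project
+   dataset = "your_dataset",      # placeholder: a dataset within that project
+   billing = "your-billing-id"    # placeholder: the project billed for queries
+ )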
Our goal for this section is to create a makeshift database on our local computers, using the excellent SQLite backend, and then connect to it from R. I'll use this to demonstrate the ease with which we can execute queries from R, as well as underscore some principles for working with databases in general. The lessons that we learn here will carry over to more complicated cases and much larger datasets.
Start by opening an (empty) database connection via the DBI::dbConnect() function, which we'll call con.
R> library(DBI) ## Already loaded
R> 
R> con = DBI::dbConnect(RSQLite::SQLite(), ":memory:")
R> 
R> dbListTables(con)
## character(0)
The arguments to DBI::dbConnect()
vary from database to database. However, the first argument is always the database backend, i.e. RSQLite::SQLite()
in this case since we’re using SQLite. Again, while this differs depending on the database type that you’re connecting with, SQLite only needs one other argument: the path to the database. Here we use the special string, “:memory:”, which causes SQLite to make a temporary in-memory database. We’ll explore more complicated connections later on that will involve things like password prompts for remote databases.
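As a quick illustration, if we instead wanted to connect to the on-disk SQLite database that we built from the command line earlier, we would simply point dbConnect() at the .db file (assuming it sits in the data directory of the cloned repository). For the rest of this section, though, we will stick with the in-memory connection con.

R> con_disk <- dbConnect(RSQLite::SQLite(), "data/wid_data_za.db")
R> dbListTables(con_disk)
R> dbDisconnect(con_disk)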
Our makeshift database connection con is currently empty. So let’s copy across the world inequality dataset that we worked with before. There are a couple of ways to do this, but here I’ll use the dplyr::copy_to()
convenience function. Note that we are specifying the table name (“wid_za”) that will exist within this database.
R> wid_za <- read.csv(file = 'data/wid-dta-za.csv', sep = ',')
R> copy_to(con, wid_za)
Now that we've copied over the data, we can reference it from R via the dplyr::tbl() function. This will allow us to treat it as a normal data frame that can be manipulated with dplyr commands.
R> wid_za_tbl <- tbl(con, "wid_za")
R> head(wid_za_tbl)
## # Source: lazy query [?? x 6]
## # Database: sqlite 3.37.2 [:memory:]
## year national_income gross_domest_product one_perc_share_i… one_perc_share_w…
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1914 NA NA 0.220 NA
## 2 1915 NA NA 0.219 NA
## 3 1916 NA NA 0.220 NA
## 4 1917 NA NA 0.228 NA
## 5 1918 NA NA 0.210 NA
## 6 1919 NA NA 0.190 NA
## # … with 1 more variable: carbon_footprint <dbl>
It worked! Everything looks pretty good, although you may notice something slightly strange about the output. We’ll get to that in a minute.
One of the best things about dplyr
is that it automatically translates tidyverse-style code into SQL for you. In fact, many of the key dplyr
verbs are based on SQL equivalents. Let’s try out a few queries using the typical dplyr
syntax that we already know.
R> ## Select some columns
R> wid_za_tbl %>% select(year, one_perc_share_income, gross_domest_product)
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.37.2 [:memory:]
## year one_perc_share_income gross_domest_product
## <int> <dbl> <dbl>
## 1 1914 0.220 NA
## 2 1915 0.219 NA
## 3 1916 0.220 NA
## 4 1917 0.228 NA
## 5 1918 0.210 NA
## 6 1919 0.190 NA
## 7 1920 0.202 NA
## 8 1921 0.218 NA
## 9 1922 0.192 NA
## 10 1923 0.195 NA
## # … with more rows
R> ## Get the mean GDP
R> wid_za_tbl %>%
+   summarise(mean_gdp = mean(gross_domest_product), na.rm = TRUE)
## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## # Source: lazy query [?? x 2]
## # Database: sqlite 3.37.2 [:memory:]
## mean_gdp na.rm
## <dbl> <int>
## 1 14664. 1
The modus operandi of dplyr is to be as lazy as possible. What this means in practice is that your R code is translated into SQL and executed in the database, not in R. This is a good thing, since the data never gets pulled into R unless you explicitly ask for it. For example, consider a case where we are interested in the values in the database after the Second World War and where GDP was above 16.5 billion rand per year.
R> gdp_post_ww2_db <- wid_za_tbl %>%
+   select(year, gross_domest_product) %>%
+   filter(year > 1945 & gross_domest_product > 16500)
Surprisingly, this sequence of operations never touches the database. It’s not until you actually ask for the data that dplyr
generates the SQL and requests the results from the database. Even then it tries to do as little work as possible and only pulls down a few rows.
R> gdp_post_ww2_db
## # Source: lazy query [?? x 2]
## # Database: sqlite 3.37.2 [:memory:]
## year gross_domest_product
## <int> <dbl>
## 1 1973 16742.
## 2 1974 17692.
## 3 1975 17301.
## 4 1976 16920.
## 5 1977 16684.
## 6 1978 16812.
## 7 1979 17173.
## 8 1980 18204.
## 9 1981 18082.
## 10 1982 17265.
## # … with more rows
Typically, you'll iterate a few times before you figure out what data you need from the database. Once you've figured it out, use collect() to pull all the data into a local data frame. I'm going to assign this collected data frame to a new object (gdp_post_ww2), but only because I want to keep the queried database object (gdp_post_ww2_db) separate for demonstrating some SQL translation principles in the next section.
R> gdp_post_ww2 <- gdp_post_ww2_db %>%
+   collect()
R> 
R> gdp_post_ww2
## # A tibble: 22 × 2
## year gross_domest_product
## <int> <dbl>
## 1 1973 16742.
## 2 1974 17692.
## 3 1975 17301.
## 4 1976 16920.
## 5 1977 16684.
## 6 1978 16812.
## 7 1979 17173.
## 8 1980 18204.
## 9 1981 18082.
## 10 1982 17265.
## # … with 12 more rows
Super. We have successfully pulled the results of our query into our local R environment as a data frame. You can now proceed to use it in exactly the same way as you would any other data frame.
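For instance, since gdp_post_ww2 is now just a regular tibble, we could plot it with ggplot2 like any other local data. A quick sketch (the axis labels are my own choice),

R> gdp_post_ww2 %>%
+   ggplot(aes(x = year, y = gross_domest_product)) +
+   geom_line() +
+   labs(x = "Year", y = "Gross domestic product")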
show_query
Behind the scenes, dplyr
is translating your R
code into SQL. You can use the show_query()
function to display the SQL code that was used to generate a queried table.
R> gdp_post_ww2_db %>% show_query()
## <SQL>
## SELECT *
## FROM (SELECT `year`, `gross_domest_product`
## FROM `wid_za`)
## WHERE (`year` > 1945.0 AND `gross_domest_product` > 16500.0)
Note that the SQL call is much less appealing/intuitive than our piped dplyr
code. In part, this is an artefact of the translation steps involved. The dplyr
translation engine includes various safeguards that are designed to ensure that the resulting SQL code works. But this comes at the expense of code concision (e.g. those repeated SELECT
commands at the top of the SQL string are redundant). However, it also reflects the simple fact that SQL is not an elegant language to work with. In particular, SQL imposes a lexical order of operations that doesn’t necessarily preserve the logical order of operations. This lexical ordering is also known as “order of execution” and is strict in the sense that every SQL query must follow the same hierarchy of commands.
I don’t really want to get into all of this now. But I do want to make you aware of the fact that SQL queries are not written in a way that you would think about them logically. Still, while it can take a while to wrap your head around, the good news is that SQL’s lexical ordering is certainly learnable.
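To make the ordering issue concrete, here is a generic query written in the conventional lexical order, with comments indicating the order in which the database logically evaluates each clause. This is standard SQL behaviour, and it includes the GROUP BY and HAVING clauses, which we have not covered in this lecture.

sqlite> SELECT country, avg(gross_domest_product) AS mean_gdp  -- written 1st, evaluated 5th
FROM "wid-dta-za"                                              -- written 2nd, evaluated 1st
WHERE year > 1945                                              -- written 3rd, evaluated 2nd
GROUP BY country                                               -- written 4th, evaluated 3rd
HAVING avg(gross_domest_product) > 0                           -- written 5th, evaluated 4th
ORDER BY mean_gdp DESC;                                        -- written 6th, evaluated last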
Now, at this point, you may be wondering: Do we even need to learn SQL, given that it’s a pain and that the dplyr
translation works so well?
That’s a fair question. The short answer is that, “yes”, at some point you will probably find yourself needing to write raw SQL code. Luckily, writing and submitting SQL queries directly from R
and RStudio is easily done, thanks to the DBI
package.
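As a small taste of this, DBI::dbGetQuery() sends a raw SQL string to an open connection and returns the result as a local data frame. For example, using the con connection and the wid_za table from above,

R> dbGetQuery(con, "SELECT year, gross_domest_product
+                   FROM wid_za
+                   WHERE year > 1945 AND gross_domest_product > 16500")

When we are finished, we close the connection.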
R> dbDisconnect(con)
In the section on cloud computing we might have some time to discuss the ideas behind BigQuery, which is like SQL in the cloud.