References

The primary references for today’s discussion can be found here, here, here and here.

Once again, most of the material that we will cover for the next few weeks is heavily influenced by the amazing lecture notes by Grant McDermott. You can visit his GitHub page for more information.

Packages

General packages for SQL and R.

R> if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
R> pacman::p_load(tidyverse, DBI, dbplyr, RSQLite, bigrquery, hrbrthemes, glue)

For the world inequality database.

R> install.packages("devtools")
R> devtools::install_github("WIDworld/wid-r-tool")

Introduction to databases

The session from this week is going to be about databases and a quick introduction to the basics of the SQL programming language. I am assuming that most of you have only worked with data that can easily fit into your computer’s memory, since R (by default) takes all data that you import and places it in your Random Access Memory (RAM). The benefit of doing this is that RAM is very fast, which makes operations on data fast. The downside is that most of you don’t have massive amounts of RAM on your computers, so it is easy for data manipulations to overwhelm your RAM capacity. You will almost surely encounter a situation, in your not so distant future, where you have to deal with datasets that are too large for the memory on your laptop or desktop PC. If this happens you might want to look into storing the data in a relational database.

Relational databases (or simply databases) exist either on your local computer or on some remote server (storage). Generally, in the world of “big data”, we will access data remotely, which means that the data won’t be found on your computer’s physical drive. It is important to note that with large datasets we are often not concerned with the entire dataset itself, but only with a small subset of it. What we will focus on in this class is connecting to a database to retrieve the chunks of data that you need for your analysis. Asking the database for the information we want is referred to as submitting a query. This means that if you have access to a really large dataset in a public or private database, you can submit a query without having to download the entire dataset onto your computer. A query is our way of telling the database that we want to work with a subset of the data, which we can then draw into the memory on our computer. In the case of remote servers, we simply establish a connection, submit our query, and the selected data is drawn into our computer’s memory.

Using SQL directly

In the next section we will work directly with a SQLite database, running some SQL commands on it. This is not a detailed guide on how to use SQL; it simply introduces some of the core concepts. There are many resources online for learning SQL. This lecture just provides a quick survey so that you have an idea of what SQL is and what it does.

World inequality database

We are going to import some data from the WID database and then insert that into a SQLite database for us to work with. We could also work with the data in memory, but we want to showcase some of the SQL commands. We are going to use the command line to interface with the SQLite database at first and then we will use the dbplyr package to work directly through RStudio.

wid package

There is a package that allows you to directly interface with the WID database. If you want to download some data from there you could use the following code as an example,

R> library(wid) ## Installed above via devtools::install_github
R> data <- download_wid(
    indicators = "tptinc", # Thresholds of pre-tax national income
    areas = c("US", "ZA"), # United States and South Africa
    perc = c("p10p100", "p50p100", "p90p100", "p99p100", "p99.9p100")
)

R> df <- as_tibble(data)

We will not use this functionality for this lecture though.

CSV to SQLite

We could also access the full dataset for South Africa by downloading it from the website in CSV format. There are many ways to get data from websites, but in this case I just downloaded it directly by clicking on the appropriate link and storing it on my computer. The CSV file is in the data directory for this lecture, so you don’t have to download it if you have cloned the repository. It is called wid-dta-za.csv.

In this section we will show you how to utilise the command line to convert the CSV to an SQLite database. We can use the sqlite3 command to do this as follows,

$ cd ./data
$ sqlite3 -separator ';' wid_data_za.db ".import wid-dta-za.csv wid-dta-za"

This has already been done for you. The SQLite database is contained in the GitHub repository for this course.

Access SQLite database

In order to use the SQLite commands interactively, we need to enter into the SQLite console.

So, open up a terminal, and run

$ cd ./data
$ sqlite3 wid_data_za.db

My screen then displays the following information.

SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
sqlite>

The SQLite command is sqlite3, and you are telling SQLite to open up wid_data_za.db. You need to specify the .db file, otherwise SQLite will open up a temporary, empty database.

To get out of SQLite, type out .exit or .quit. For some terminals, Ctrl-D can also work. If you forget any SQLite . (dot) command, type .help.

Note: If you want nice syntax highlighting in the terminal then you can use the LiteCLI package. I will be showing this in the lecture.

Selecting data

Remember that for this section things are interactive, so we will not get output in the notebook. We need to type the code into our terminal. The first thing we would like to do is see the tables in the database. In our case there is only going to be one table.

sqlite> .tables

We can also see what the structure of the table looks like by using the .schema command.

sqlite> .schema

We can select all of the columns in a table using *. However, I would not recommend doing this on a large table, since it will print every row of every column to the screen.

sqlite> SELECT * 
FROM "wid-dta-za";

You can also select certain columns by replacing the * with the column name.

sqlite> SELECT year 
FROM "wid-dta-za";

Once again, we don’t want to display everything on screen, so don’t run these commands.

Sometimes we want to look at the unique values in a column; for this we can use the DISTINCT keyword.

sqlite> SELECT DISTINCT country 
FROM "wid-dta-za";

In this case you will see that the only country in our dataset is South Africa, given by ZA. We can also use the DISTINCT keyword on multiple columns,

sqlite> SELECT DISTINCT country, year 
FROM "wid-dta-za";

Another thing that we can do is order all of the records by year.

sqlite> SELECT * 
FROM "wid-dta-za" 
ORDER BY year;

We can do this in either ASC or DESC order.

sqlite> SELECT * 
FROM "wid-dta-za" 
ORDER BY year DESC;

Filtering

One of the most important features of a database is the ability to filter data. We have encountered this in R with the filter command. In the case of SQL we use the WHERE clause in a query,

sqlite> SELECT * 
FROM "wid-dta-za" 
WHERE year <= 1955;

If we want to find the values in the database after the Second World War AND where GDP was above 16.5 billion rand per year, we can use the following,

sqlite> SELECT * 
FROM "wid-dta-za" 
WHERE gross_domest_product > 16500 AND year >= 1945;

Are there any interesting patterns that you can observe here?

New values

We can also calculate new values from existing values. Perhaps we realise that our GDP values after the year 2000 need to be corrected by 5% due to some measurement error. Rather than modifying the stored data we can do this calculation on the fly as follows.

sqlite> SELECT gross_domest_product, 1.05 * gross_domest_product 
FROM "wid-dta-za" 
WHERE year > 2000;

We can rename the new field as follows,

sqlite> SELECT gross_domest_product AS gdp, 1.05 * gross_domest_product AS new_gdp 
FROM "wid-dta-za" 
WHERE year > 2000;

Missing values

Databases use a special value called NULL to represent missing information. Almost all operations on NULL produce NULL. Queries can test for NULLs using IS NULL and IS NOT NULL. Below is an example where we check for a NULL value:

sqlite> SELECT year 
FROM "wid-dta-za"  
WHERE one_perc_share_income IS NULL;

This reveals the years where there is a NULL value. These values can cause headaches wherever they appear. For example, suppose we want to find the years in which the one percent share of income is greater than 0.2 and the one percent share of wealth is greater than 0.5. Then we could do the following,

sqlite> SELECT year  
FROM "wid-dta-za"   
WHERE one_perc_share_income > 0.2 AND one_perc_share_wealth > 0.5; 

This reveals that this occurs on three different occasions: 2007, 2020 and 2021. However, if we also specify that the country should be South Africa, we get the following,

sqlite> SELECT year  
FROM "wid-dta-za"   
WHERE one_perc_share_income > 0.2 AND one_perc_share_wealth > 0.5 AND country = 'ZA';

The result is that we lose the value for 2021, since there is no country value allocated for 2021. This might simply be a mistake in the data, but it prevents us from seeing an obvious instance where there is a high level of inequality.

Aggregation

In SQL we can use aggregation functions such as min or max. Each of these functions takes a set of records as input and produces a single record as output:

sqlite> SELECT max(gross_domest_product)  
FROM "wid-dta-za";

min and max are just two of the aggregation functions built into SQL. Three others are avg, count, and sum:

sqlite> SELECT avg(gross_domest_product)  
FROM "wid-dta-za";

Interestingly, when there are no values to aggregate (for example, when no rows satisfy the WHERE clause), the result of the aggregation is “don’t know” rather than zero or some other arbitrary value:

sqlite> SELECT min(gross_domest_product)  
FROM "wid-dta-za"
WHERE country = 'USA';

We have covered some of the main commands that you will encounter in SQL. However, this has truly just been an introduction to get you started. We haven’t even dealt with a fundamental concept such as JOIN, so there is much more to learn. The reason that we aren’t exploring every dimension of SQL is that the R interface with SQL often means that you won’t need to write much SQL code to execute your queries. We will see this below.
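
Just so the term is not completely foreign, a JOIN combines rows from two tables that share a key. Below is a minimal sketch in generic SQL, using two hypothetical tables (indicators and countries) and columns that are not part of our database:

SELECT i.year, i.gdp, c.country_name
FROM indicators AS i
JOIN countries AS c
  ON i.country_code = c.country_code;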

Databases and R

After some experience with SQL you will agree that it is a bit of an archaic language that is much less intuitive than the R tools that we have been using thus far in the course. There is some good news: even if you aren’t comfortable with SQL, you already have all the programming skills you need to start working with databases. This is because the tidyverse — through dplyr — allows for direct communication with databases from your local R environment.

This means that you can interact with the vast datasets that are stored in relational databases using the same tidyverse verbs and syntax that we already know. All of this is possible thanks to the dbplyr package, which provides a database backend to dplyr. What’s happening even further behind the scenes is that dbplyr depends on the DBI package. DBI provides a common interface that allows dplyr to work with many different databases using exactly the same code. You don’t even need to leave your RStudio session or learn SQL!

Aside: You will probably want to learn SQL eventually! Luckily, dplyr and dbplyr come with several features that can really help to speed up the learning and translation process. We’ll get to these later in the lecture.

While DBI is automatically bundled with dbplyr, you’ll need to install a specific backend package for the type of database that you want to connect to. There are many backends to use. We have worked with SQLite since it is quite lightweight and installed on most operating systems already. We will talk later about bigrquery which connects to Google BigQuery when we have our session on cloud computing and big data.

Our goal for this section is to create a makeshift database on our local computers — using the excellent SQLite backend — and then connect to it from R. I’ll use this to demonstrate the ease with which we can execute queries from R, as well as underscore some principles for working with databases in general. The lessons that we learn here will carry over to more complicated cases and much larger datasets.

Local database

Start by opening an (empty) database connection via the DBI::dbConnect() function, which we’ll call con.

R> library(DBI) ## Already loaded
R> 
R> con = DBI::dbConnect(RSQLite::SQLite(), ":memory:")
R> 
R> dbListTables(con)
## character(0)

The arguments to DBI::dbConnect() vary from database to database. However, the first argument is always the database backend, i.e. RSQLite::SQLite() in this case since we’re using SQLite. Again, while this differs depending on the database type that you’re connecting with, SQLite only needs one other argument: the path to the database. Here we use the special string, “:memory:”, which causes SQLite to make a temporary in-memory database. We’ll explore more complicated connections later on that will involve things like password prompts for remote databases.
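
For instance, to connect to the on-disk SQLite database that we built earlier, you would pass its file path instead of “:memory:”. A quick sketch, assuming your working directory is the lecture folder:

R> con_file = DBI::dbConnect(RSQLite::SQLite(), "data/wid_data_za.db") ## file-based database
R> dbDisconnect(con_file) ## close this extra connection again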

Our makeshift database connection con is currently empty. So let’s copy across the world inequality dataset that we worked with before. There are a couple of ways to do this, but here I’ll use the dplyr::copy_to() convenience function. Note that the table name (“wid_za”) that will exist within this database is taken from the name of the R object that we copy over; you could also set it explicitly via the name argument of copy_to().

R> wid_za <- read.csv(file = 'data/wid-dta-za.csv', sep = ',')
R> copy_to(con, wid_za)

Now that we’ve copied over the data, we can reference it from R via the dplyr::tbl() function. This will allow us to treat it as a normal data frame that can be manipulated with dplyr commands.

R> wid_za_tbl <- tbl(con, "wid_za")
R> head(wid_za_tbl)
## # Source:   lazy query [?? x 6]
## # Database: sqlite 3.37.2 [:memory:]
##    year national_income gross_domest_product one_perc_share_i… one_perc_share_w…
##   <int>           <dbl>                <dbl>             <dbl>             <dbl>
## 1  1914              NA                   NA             0.220                NA
## 2  1915              NA                   NA             0.219                NA
## 3  1916              NA                   NA             0.220                NA
## 4  1917              NA                   NA             0.228                NA
## 5  1918              NA                   NA             0.210                NA
## 6  1919              NA                   NA             0.190                NA
## # … with 1 more variable: carbon_footprint <dbl>

It worked! Everything looks pretty good, although you may notice something slightly strange about the output. We’ll get to that in a minute.

Generating queries

One of the best things about dplyr is that it automatically translates tidyverse-style code into SQL for you. In fact, many of the key dplyr verbs are based on SQL equivalents. Let’s try out a few queries using the typical dplyr syntax that we already know.

R> ## Select some columns
R> wid_za_tbl %>% select(year, one_perc_share_income, gross_domest_product)
## # Source:   lazy query [?? x 3]
## # Database: sqlite 3.37.2 [:memory:]
##     year one_perc_share_income gross_domest_product
##    <int>                 <dbl>                <dbl>
##  1  1914                 0.220                   NA
##  2  1915                 0.219                   NA
##  3  1916                 0.220                   NA
##  4  1917                 0.228                   NA
##  5  1918                 0.210                   NA
##  6  1919                 0.190                   NA
##  7  1920                 0.202                   NA
##  8  1921                 0.218                   NA
##  9  1922                 0.192                   NA
## 10  1923                 0.195                   NA
## # … with more rows
R> ## Get the mean GDP 
R> wid_za_tbl %>%
+   summarise(mean_gdp = mean(gross_domest_product, na.rm = TRUE))
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.37.2 [:memory:]
##   mean_gdp
##      <dbl>
## 1   14664.

Note that the na.rm = TRUE argument is needed here: missing values are always removed in SQL, and dbplyr will warn you if you leave the argument out.

Laziness as a virtue

The modus operandi of dplyr is to be as lazy as possible. What this means in practice is that your R code is translated into SQL and executed in the database, not in R. This is a good thing, since:

  1. It never pulls data into R unless you explicitly ask for it.
  2. It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step.

For example, suppose we are interested in the values in the database after the Second World War where GDP was above 16.5 billion rand per year.

R> gdp_post_ww2_db <- wid_za_tbl %>% 
+   select(year, gross_domest_product) %>%
+   filter(year > 1945 & gross_domest_product > 16500)

Surprisingly, this sequence of operations never touches the database. It’s not until you actually ask for the data that dplyr generates the SQL and requests the results from the database. Even then it tries to do as little work as possible and only pulls down a few rows.

R> gdp_post_ww2_db
## # Source:   lazy query [?? x 2]
## # Database: sqlite 3.37.2 [:memory:]
##     year gross_domest_product
##    <int>                <dbl>
##  1  1973               16742.
##  2  1974               17692.
##  3  1975               17301.
##  4  1976               16920.
##  5  1977               16684.
##  6  1978               16812.
##  7  1979               17173.
##  8  1980               18204.
##  9  1981               18082.
## 10  1982               17265.
## # … with more rows

Collecting data

Typically, you’ll iterate a few times before you figure out what data you need from the database. Once you’ve figured it out, use collect() to pull all the data into a local data frame. I’m going to assign this collected data frame to a new object (gdp_post_ww2), but only because I want to keep the queried database object (gdp_post_ww2_db) separate for demonstrating some SQL translation principles in the next section.

R> gdp_post_ww2 <- gdp_post_ww2_db %>%
+   collect()
R> 
R> gdp_post_ww2
## # A tibble: 22 × 2
##     year gross_domest_product
##    <int>                <dbl>
##  1  1973               16742.
##  2  1974               17692.
##  3  1975               17301.
##  4  1976               16920.
##  5  1977               16684.
##  6  1978               16812.
##  7  1979               17173.
##  8  1980               18204.
##  9  1981               18082.
## 10  1982               17265.
## # … with 12 more rows

Super. We have successfully pulled the queried data into our local R environment as a data frame. You can now proceed to use it in exactly the same way as you would any other data frame.
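
For example, since ggplot2 is loaded as part of the tidyverse, we can pipe the collected tibble straight into a plot. A quick sketch:

R> gdp_post_ww2 %>%
+   ggplot(aes(x = year, y = gross_domest_product)) +
+   geom_line()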

Using show_query

Behind the scenes, dplyr is translating your R code into SQL. You can use the show_query() function to display the SQL code that was used to generate a queried table.

R> gdp_post_ww2_db %>% show_query() 
## <SQL>
## SELECT *
## FROM (SELECT `year`, `gross_domest_product`
## FROM `wid_za`)
## WHERE (`year` > 1945.0 AND `gross_domest_product` > 16500.0)

Note that the SQL call is much less appealing/intuitive than our piped dplyr code. In part, this is an artefact of the translation steps involved. The dplyr translation engine includes various safeguards that are designed to ensure that the resulting SQL code works. But this comes at the expense of code concision (e.g. those repeated SELECT commands at the top of the SQL string are redundant). However, it also reflects the simple fact that SQL is not an elegant language to work with. In particular, SQL imposes a lexical order of operations that doesn’t necessarily preserve the logical order of operations. This lexical ordering is also known as “order of execution” and is strict in the sense that every SQL query must follow the same hierarchy of commands.

I don’t really want to get into all of this now. But I do want to make you aware of the fact that SQL queries are not written in a way that you would think about them logically. Still, while it can take a while to wrap your head around, the good news is that SQL’s lexical ordering is certainly learnable.
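
As a rough rule of thumb, a query must be written with its clauses in the lexical order below, while the database logically evaluates them in the numbered order given in the comments:

SELECT ...     -- 5. choose/compute the output columns
FROM ...       -- 1. pick the table(s)
WHERE ...      -- 2. filter rows
GROUP BY ...   -- 3. form groups
HAVING ...     -- 4. filter groups
ORDER BY ...   -- 6. sort the result
LIMIT ...      -- 7. cap the number of rows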

Now, at this point, you may be wondering: Do we even need to learn SQL, given that it’s a pain and that the dplyr translation works so well?

That’s a fair question. The short answer is that, “yes”, at some point you will probably find yourself needing to write raw SQL code. Luckily, writing and submitting SQL queries directly from R and RStudio is easily done, thanks to the DBI package.
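
As a small taster, the DBI::dbGetQuery() function sends a raw SQL string over an open connection and returns the result as a regular data frame. A minimal sketch against our con connection and wid_za table from above:

R> dbGetQuery(con, "SELECT year, gross_domest_product FROM wid_za WHERE year > 1945 LIMIT 5")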

R> dbDisconnect(con)

In the section on cloud computing we might have some time to discuss the ideas behind BigQuery, which is like SQL in the cloud.