Introduction

Many “big data” problems are actually “small data problems in disguise”. That is, we only really need a subset of the data, or maybe we want to aggregate the data into some larger dataset. For example, we might want to access Census data, but only for a handful of municipalities. Or, we might want to analyse climate data collected from a large number of weather stations, but aggregated up to the national or monthly level. In such cases, the underlying bottleneck is interacting with the original data, which is too big to fit into memory. How do we store data of this magnitude and and then access it effectively? The answer is through a database.

Databases can exist either locally or remotely, as well as in-memory or on-disk. Regardless of where a database is located, the key point is that information is stored in a way that allows for very quick extraction and/or aggregation. More oftne than not you will probably need to extract several subsets and harmonise or transform them. To facilitate and automate this task, you will need to write your own functions and know to iterate them over the relevant subsets of data. This week’s session thus ties in well with the sections on functions and iteration that we quickly touched upon during the last lab.

Joins and Databases 🔭

Although this week’s session is nominally about databases - and we will spend the majority of this session on them - we believe that joins in R are a key skill that deserves more attention than we were able to devote to it during the tidyverse session. Therefore, we will split the session in two. First we will cover joins as implemented by the tidyverse. In that section you will learn to:

  • join together different datasets
  • differentiate between types of joins

The second part of the session will deal with databases and SQL. Here you will learn to:

  • connect to remote databases with R
  • generate SQL queries in R with dbplyr
  • manipulate and transform data in a remote database
  • how to collect hosted data and store it locally

Joins with dplyr

One of the mainstays of the dplyr package is merging data with the family join operations.

  • inner_join(df1, df2)
  • left_join(df1, df2)
  • right_join(df1, df2)
  • full_join(df1, df2)
  • semi_join(df1, df2)
  • anti_join(df1, df2)

(You might find it helpful to to see visual depictions of the different join operations here.)

For the simple examples that I’m going to show here, we’ll need some data sets that come bundled with the nycflights13 package. - Load it now and then inspect these data frames in your own console.

Let’s perform a left join on the flights and planes datasets. - Note: I’m going subset columns after the join, but only to keep text on the slide.

left_join(flights, planes) %>%
  select(year, month, day, dep_time, arr_time, carrier, flight, tailnum, type, model) %>%
  head(3) ## Just to save vertical space in output

Note that dplyr made a reasonable guess about which columns to join on (i.e. columns that share the same name). It also told us its choices:

## Joining, by = c("year", "tailnum")

However, there’s an obvious problem here: the variable “year” does not have a consistent meaning across our joining datasets! - In one it refers to the year of flight, in the other it refers to year of construction.

Luckily, there’s an easy way to avoid this problem. - Try ?dplyr::join.

You just need to be more explicit in your join call by using the by = argument. - You can also rename any ambiguous columns to avoid confusion.

left_join(
  flights,
  planes %>% rename(year_built = year), ## Not necessary w/ below line, but helpful
  by = "tailnum" ## Be specific about the joining column
  ) %>%
  select(year, month, day, dep_time, arr_time, carrier, flight, tailnum, year_built, type, model) %>%
  head(3) 

Last thing I’ll mention for now; note what happens if we again specify the join column… but don’t rename the ambiguous “year” column in at least one of the given data frames.

left_join(
  flights,
  planes, ## Not renaming "year" to "year_built" this time
  by = "tailnum"
  ) %>%
  select(contains("year"), month, day, dep_time, arr_time, carrier, flight, tailnum, type, model) %>%
  head(3)

Make sure you know what “year.x” and “year.y” are. Again, it pays to be specific.


Working with databases 💾

So far, we have dealt with small datasets that easily fit into your computer’s memory. But what about datasets that are too large for your computer to handle as a whole? In this case, storing the data outside of R and organizing it in a database is helpful. Connecting to the database allows you to retrieve only the parts needed for your current analysis.

Even better, many large datasets are already available in public or private databases. You can query them without having to download the data first.

Necessary packages

Accessing databases in R requires a few packages:

  • dbplyrmakes use of the same functions as dplyr, but also works with remote data stored in databases.

  • DBI is a package that allows R to connect easily to a DBMS (DataBase Management System)

  • Some package to interact with the back-end of the remote database such as RSQLite, other options might be:

    • RMariaDB::MariaDB() for RMariaDB,
    • RPostgres::Postgres() for RPostgres,
    • odbc::odbc() for odbc,
    • and bigrquery::bigquery() for BigQuery.

Connecting to a database

To connect to the database, we will use dbConnect() from the DBI package which defines a common interface between R and database management systems. The first argument is the database driver which in our case is SQLite and the second argument is the name and location of the database.

Most existing databases don’t live in a file, but instead live on another server. In addition to the two arguments above, database drivers will therefore also require details like user, password, host, port, etc. That means your code will often look more like this:

con <- DBI::dbConnect(RMariaDB::MariaDB(), 
                      host = "database.rstudio.com",
                      user = "tom",
                      password = rstudioapi::askForPassword("Tom's password")
)

For the purposes of this lab however, we are connecting to an in-memory database. That way we can avoid potential issues regarding the registration for access to a database, creation and caching of credentials, as well as defining safe ports and other boring details.

To avoid all this hassle, we basically create and host our own (small) database. Luckily, the code to do so is the same as in the general case above. But, SQLite only needs a path to the database. (Here, “:memory:” is a special path that creates an in-memory database.)

We then save the database connection and store it in the object “con” for further use in exploring and querying data.

# set up connection with DBI and RSQLite
con <- dbConnect(RSQLite::SQLite(), ":memory:")

Next, let us get a quick summary of the database connection using summary(). It shows “SQLiteConnection” under class and we can ignore the other details for the time being. Great!

summary(con)
##           Length            Class             Mode 
##                1 SQLiteConnection               S4

If you were to connect to a real online database that someone else generated, you could now call DBI::dbListTables(con) to see a list of the tables present in the database. Our local database is however still devoid of content.

We need to populate our database. We copy the data from last week to our database connection. In real life this step would probably be taken care of by the responsible database maintainer. We need to populate our database. We copy the data from earlier to our database connection. In real life this step would probably be taken care of by the responsible database maintainer.

# upload local data frame into remote data source; here: database
copy_to(
  dest = con, 
  df = nycflights13::flights, 
  name = "flights")

Indexing

Unfortunately, it is not enough to just copy data to our database. We also need to pass a list of indexes to the function. In this example, we set up indexes that will allow us to quickly process the data by time, carrier, plane, and destination. Creating the right indices is key to good database performance. Again, in applications where we don’t set up the database, this will be taken care of by the database maintainer.

copy_to(
  dest = con, 
  df = nycflights13::flights, 
  name = "flights",
  temporary = FALSE, 
  indexes = list(
    c("year", "month", "day"), 
    "carrier", 
    "tailnum",
    "dest"
  ),
  overwrite = T # throws error as table already exists
)

List Tables in Database

Now that we are connected to a database, let us list all the tables present in it using dbListTables().

DBI::dbListTables(con)
## [1] "flights"      "sqlite_stat1" "sqlite_stat1" "sqlite_stat4" "sqlite_stat4"

As you can see there is only one table for now (flights). The other objects that show up are infrastructure specificities for SQLite that you can safely ignore. Usually you would find many different tables in a relational database.


Queries

A query is a request for data or information from a database table or combination of tables. 📖

Reference Table

So how do you query a table in a database?

It is actually fairly straightforward. You use the tbl()function where you indicate the connection and the name of the table you want to interact with.

# generate reference table from the database
flights_db <- tbl(con, "flights")
flights_db 

The console output shows that this is a remote source; the table is not stored in our RStudio environment. Nor should you need to transfer the entire table to your RStudio environment. You can perform operations directly on the remote source. What is more you can rely on the dplyr syntax from last week to formulate your queries. R will automatically translate it into SQL (more on that below). The console output shows that this is a remote source; the table is not stored in our RStudio environment. Nor should you need to transfer the entire table to your RStudio environment. You can perform operations directly on the remote source. What is more you can rely on the dplyr syntax from earlier to formulate your queries. R will automatically translate it into SQL (more on that below).

Selecting Columns

You can select specific columns:

# perform various queries
flights_db %>% select(year:day, dep_delay, arr_delay)

Filtering by Rows

Access only specific rows:

flights_db %>% filter(dep_delay > 240)

Summary Statisitics

Or immediately generate summary statistics for different groups:

flights_db %>% 
  group_by(dest) %>%
  summarise(delay = mean(dep_time))

More advanced operations

You can even generate and plot figures without the need to store the data in your local environment:

flights_db %>% 
  filter(distance > 75) %>%
  group_by(origin, hour) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(hour, delay, color = origin)) + geom_line()

Joins

Databases become more exciting with more tables. So let’s add a couple more:

copy_to(
  dest = con, 
  df = nycflights13::planes, 
  name = "planes",
  temporary = FALSE, 
  indexes = "tailnum"
)

copy_to(
  dest = con, 
  df = nycflights13::airlines, 
  name = "airlines",
  temporary = FALSE, 
  indexes = "carrier"
)

copy_to(
  dest = con, 
  df = nycflights13::airports, 
  name = "airports",
  temporary = FALSE, 
  indexes = "faa"
)

copy_to(
  dest = con, 
  df = nycflights13::weather, 
  name = "weather",
  temporary = FALSE, 
  indexes = list(
    c("year", "month", "day", "hour", "origin")
  )
)

Let us call dbListTables() again on our “con” database connection. As you can see, there are several more tables now.

dbListTables(con)
## [1] "airlines"     "airports"     "flights"      "planes"       "sqlite_stat1"
## [6] "sqlite_stat1" "sqlite_stat4" "sqlite_stat4" "weather"

The join syntax has its origin in SQL. Unsurprisingly, you can join tables without having to store the data in memory. Here is how you perform a left join:

planes_db = tbl(con, 'planes')
left_join(
  flights_db,
  planes_db %>% rename(year_built = year),
  by = "tailnum" ## Important: Be specific about the joining column
) %>%
  select(year, month, day, dep_time, arr_time, carrier, flight, tailnum,
         year_built, type, model) 

This should all feel very familiar right? 😁


A look under the Hood

As you saw, you can conduct your analyses in a database, the same way you are used to do it in R. All that without your data having to be stored on your own device.

Unfortunately, there are however some differences between ordinary data frames and remote database queries that are worth pointing out.

The most important among these is that your R code is translated into SQL and executed in the database on the remote server, not in R on your local machine.

This has the following implications. When working with databases, dplyr tries to be as lazy as possible 😪:

  • It never pulls data into R unless you explicitly ask for it.
  • It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step.

This even applies when you assign the output of a database query to an object:

tailnum_delay_db <- flights_db %>% 
  group_by(tailnum) %>%
  summarise(
    delay = mean(arr_delay),
    n = n()
  ) %>% 
  arrange(desc(delay)) %>%
  filter(n > 100)

This leads to some unexpected behaviour:

Exhibit A: Because there’s generally no way to determine how many rows a query will return unless you actually run it, nrow() is always NA.

nrow(tailnum_delay_db)
## [1] NA

Exhibit B: Because you can’t find the last few rows without executing the whole query, you can’t use tail().

tail(tailnum_delay_db)
## Error: tail() is not supported by sql sources

Inspecting queries

We can always inspect the SQL code that dbplyr is generating:

tailnum_delay_db %>% show_query()
## <SQL>
## SELECT *
## FROM (SELECT `tailnum`, AVG(`arr_delay`) AS `delay`, COUNT(*) AS `n`
## FROM `flights`
## GROUP BY `tailnum`)
## WHERE (`n` > 100.0)

That’s probably not how you would write the SQL yourself, but it works.

More information about SQL translation can be found here: vignette("translation-verb").


From remote to local storage

If you then want to pull the data into a local data frame, use collect():

tailnum_delay <- tailnum_delay_db %>% collect()
tailnum_delay

Using SQL directly in R

If, for whatever reason you might want to write your SQL queries yourself, you can use DBI::dbGetQuery() to run SQL queries in R scripts:

sql_query <- "SELECT * FROM flights WHERE dep_delay > 240.0 LIMIT 5"
dbGetQuery(con, sql_query)

If you want to learn more about writing SQL with dbplyr, check out vignette('sql', package = 'dbplyr').


Disconnect from database

When you are done with your SQL queries, it is a good idea to disconnect from the database. What seems evident, becomes increasingly important if you work with servers that charge you for their services!

DBI::dbDisconnect(con)

Exercises on Databases

If you want to practice accessing databases going forward, have a look at the practice script here. It comes with a database file so you can open the connection locally without the need to register. Thanks for the excellent example to our colleague Will Lowe! If you want to practice accessing databases going forward, have a look at the practice script here. It comes with a database file so you can open the connection locally without the need to register. Thanks for the excellent example to our colleague Will Lowe!


(Advanced) BigQuery

If you are still curious about databases and SQL and wonder how you might scale all this up in for the purposes of a real project, you might be interested to look into Google’s BigQuery service. You can register for a free Google Cloud account here. Be aware that you only have a certain amount of free queries (1 TB / month) before you are charged. BigQuery is the most widely used service to interact with online databses and it has a number of public datasets that you can easily practice with. Everything we saw above applies, with the exception that you need to specify the backend bigrquery::bigquery().

Here is an example of how it would look:

con <- dbConnect(
  bigrquery::bigquery(),
  project = "publicdata",
  dataset = "samples",
  billing = google_cloud_project_name # This will tell Google whom to charge
)

Actually learning R 🎒

Let us remind you again, the key to learning R is: Google! We can only give you an overview over basic R functions, but to really learn R you will have to actively use it yourself, trouble shoot, ask questions, and google! It is very likely that someone else has had the exact same or just similar enough issue before and that the R community has answered it with 5+ different solutions years ago. 😉


Sources

The section on databases and SQL relies on the vignette from the dbplyr package, RStudio Tutorial on databases as well as the Databases Session in McDermott’s Data Science for Economists by Grant McDermott.

 

A work by Lisa Oswald & Tom Arend

Prepared for Intro to Data Science, taught by Simon Munzert

Hertie School, Berlin