class: center, middle, inverse, title-slide # Introduction to Data Science ## Session 4: Relational databases and SQL ### Simon Munzert ### Hertie School |
GRAD-C11/E1339
--- <style type="text/css"> @media print { # print out incremental slides; see https://stackoverflow.com/questions/56373198/get-xaringan-incremental-animations-to-print-to-pdf/56374619#56374619 .has-continuation { display: block !important; } } </style> # Table of contents <br> 1. [Why databases?](#whydb) 2. [Relational database fundamentals](#relationaldb) 3. [Back to `dplyr`: joins](#joins) 4. [SQL](#sql) 5. [Talking to databases with R](#dbr) 7. [Summary](#summary) --- class: inverse, center, middle name: whydb # Why databases? <html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html> --- # The ubiqity of multi-dimensional data structures .pull-left[ ### From data frames... - When you have a background in social sciences, your top-of-the-head mental image of data might be a rectangular [spreadsheet](https://en.wikipedia.org/wiki/Spreadsheet). - In fact, much of classical "statistical" software (SPSS, Stata, MS Excel) operates with rectangular data frames by default. - At the same time, your perception might be [file-based](https://en.wikipedia.org/wiki/List_of_file_formats). Data is stored in files, and these files are read (and produced) by our data management software. - In many cases, the **two-dimensional structure** makes sense. For instance, we observe - persons x attitudes - countries x characteristics - social media posts x text features ] .pull-right-center[ <br><br> <div align="center"> <img src="pics/clippy-message.jpeg" height=300> </div> ] --- # The ubiqity of multi-dimensional data structures .pull-left-center[ <div align="center"> <img src="pics/legislator-structure.png" height=520> </div> `Credit` [saschagobel/legislatoR](https://github.com/saschagobel/legislatoR) ] .pull-right[ ### ... to complex data structures - However, the longer you think about it, the more problematic it becomes to store your data in two-dimensional structures. - **Examples**: - countries x persons x characteristics x time - countries x states x communities x time x variables - social media posts x retweets x users x user characteristics x network features x meta data - Mapping three- onto two-dimensional structures is easy (think: `pivot_longer`, `pivot_wider`). - With **multiple heterogeneous data sources**, things get messy. - Managing complex data structures is just one perk of using databases. ] --- # When databases become useful .pull-left[ ### Size and speed - You have **loads of data that exceed the working memory** on your computer. Databases are only limited by available disk size (or can be distributed across multiple disks/machines). - Your **data structure is complex**. Databases allow/encourage you to store, retrieve and subset data with complex data structures. - Your data is big and you have to **access/subset/operate frequently**. Querying databases is fast. - You care about **data quality** and have clear expectations how data should look like. Using databases you can define specific rules for extending and updating your database. ] -- .pull-right[ ### Accessibility and concurrency - You **collaborate with others** on a data collection project. With a database, you have a common, simultaneously accessible, and reliable infrastructure at hand that multiple users can access at the same time. - When several parties are involved, who is allowed to do what with the database might differ (e.g., read-only, access to parts of the data, limited admin rights, etc.). Most databases allow **defining different usage rights for different users**. ] --- # Talking about databases .pull-left[ ### What we should distinguish - The **types of databases**, e.g.: relational, navigational, NoSQL, NewSQL - The **database management system**, e.g.: PostgreSQL, Oracle, SQL Server, SQLite - The **data structure**, e.g.: tables, columns, keys, normal forms - The **data manipulations**, e.g.: selects, joins, grouping - The **query language**, e.g., SQL, SPARQL Also, there are so many more ways to [classify databases](https://en.wikipedia.org/wiki/Database#Classification). But that's enough for now. Today, **we focus on relational databases**. They are by no means the only type of databases (see above), but they're ubiquitous and won't go away any time soon. ] -- .pull-right[ ### Databases versus data frames When reading/talking about features of databases, you will encounter a particular jargon. Here's how database concepts map onto R data frame jargon: | R jargon | Database jargon | |---|---| | column | attribute/field | | row | tuple/record | | element/cell | attribute value | | data frame | relation/table | | column types | table schema | | bunch of related data frames | database | ] --- class: inverse, center, middle name: relationaldb # Relational database fundamentals <html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html> --- # Codd's relational model for databases .pull-left[ - The concept of relational databases builds on the [relational model (RM) for database management](https://en.wikipedia.org/wiki/Relational_model), as proposed by [Edgar F. "Ted" Codd](https://en.wikipedia.org/wiki/Edgar_F._Codd) in 1969/1970. - Todd described the RM formally, but also introduced it using concepts that are still in use today (normalization, keys, joins, redundancy, etc.). - The key assumption of the relational model is that all data can be represented as relations (tables). - Information is then represented by data values in relations. - When you think this is trivial, check out the [history of databases](https://en.wikipedia.org/wiki/Database) and live through the pain of the early era of navigational DBMS in the 1960s and the NoSQL era that we've (not yet) overcome. <!-- <div align="center"> <img src="pics/edgar-codd.jpeg" height=150> </div> [Ted Codd](https://en.wikipedia.org/wiki/File:Edgar_F_Codd.jpg), Turing Award winner 1981 --> ] .pull-right-center[ <div align="center"> <img src="pics/codd-relational-model.png" height=500> </div> `Credit` [Communications of the ACM 13(6), 1970](https://dl.acm.org/doi/10.1145/362384.362685) ] --- # Codd's relational model for databases (cont.) .pull-left[ ### Storing data in tables - Again, the key concept of relational databases is that all information can be represented in a table. - A single table already introduces relations: All data in one row belongs to the same record. - If we want to represent more complex relations (i.e., measuring a person's weight twice or measuring the weight of their children as well), we can relate data from one table to another. ### Example - We have collected data on Peter, Paul, and Mary. - We have information on birthdays, telephone numbers, and favorite foods. - How can we represent this information in tables? ] -- .pull-right[ <br> <div align="center"> <img src="pics/adcr-relational-data-1.png" width=600> </div> <div align="center"> <img src="pics/adcr-relational-data-2.png" width=300> </div> - We start representing the data in two tables. - They are linked via the key `nameid`, so we don't have to add the full names to the phone numbers table. - Note that we have a 1:m (one-to-many) relation here because Peter has two phone numbers. ] --- # Codd's relational model for databases (cont.) .pull-left[ ### Storing data in tables - Again, the key concept of relational databases is that all information can be represented in a table. - A single table already introduces relations: All data in one row belongs to the same record. - If we want to represent more complex relations (i.e., measuring a person's weight twice or measuring the weight of their children as well), we can relate data from one table to another. ### Example - We have collected data on Peter, Paul, and Mary. - We have information on birthdays, telephone numbers, and favorite foods. - How can we represent this information in tables? ] .pull-right[ <br> <div align="center"> <img src="pics/adcr-relational-data-1.png" width=600> </div> <div align="center"> <img src="pics/adcr-relational-data-2.png" width=300> </div> - However, the way we store the data is not ideal. In the first table, we have three columns measuring effectively the same thing. And what if there's more favorite food? Adding information in such a fashion creates a lot of redundant information. ] --- # Codd's relational model for databases (cont.) .pull-left[ ### Storing data in tables - Again, the key concept of relational databases is that all information can be represented in a table. - A single table already introduces relations: All data in one row belongs to the same record. - If we want to represent more complex relations (i.e., measuring a person's weight twice or measuring the weight of their children as well), we can relate data from one table to another. ### Example - We have collected data on Peter, Paul, and Mary. - We have information on birthdays, telephone numbers, and favorite foods. - How can we represent this information in tables? ] .pull-right[ <br> <div align="center"> <img src="pics/adcr-relational-data-3.png" width=300> </div> - Splitting up the information by creating another table for food preferences is better. - There's still some redundancy left. Is it really necessary to have `hamburger` in the table twice? ] --- # Codd's relational model for databases (cont.) .pull-left[ ### Storing data in tables - Again, the key concept of relational databases is that all information can be represented in a table. - A single table already introduces relations: All data in one row belongs to the same record. - If we want to represent more complex relations (i.e., measuring a person's weight twice or measuring the weight of their children as well), we can relate data from one table to another. ### Example - We have collected data on Peter, Paul, and Mary. - We have information on birthdays, telephone numbers, and favorite foods. - How can we represent this information in tables? ] .pull-right[ <br> <div align="center"> <img src="pics/adcr-relational-data-4.png" width=300> </div> <div align="center"> <img src="pics/adcr-relational-data-5.png" width=300> </div> - Now that's better. - In restructuring the information in our database, we **avoided redundancy (duplication)**. - This is the process of **database normalization**. ] --- # Database normalization .pull-left[ ### What is database normalization? **From the [Wikipedia](https://en.wikipedia.org/wiki/Database_normalization)**: "Database normalization is the process of **structuring a database**, usually a relational database, in accordance with a series of so-called **normal forms** in order to **reduce data redundancy and improve data integrity**. It was first proposed by Edgar F. Codd as part of his relational model." - You'll probably not have to apply normalization yourself because you are a user not a designer of databases. - However, it helps to have an idea of what the first normal forms are. - Higher-order normal forms imply lower-order normal forms (e.g., in order to satisfy the 3rd normal form, the 1st and 2nd normal forms have to be satisfied, too). ] -- .pull-right[ ### Normalization and tidy data There is also a straightforward link to [Hadley Wickham's "tidy data"](https://www.jstatsoft.org/article/view/v059i10): 1. Each variable forms a column. 2. Each observation forms a row. 3. Each type of observational unit forms a table. This is Codd's 3rd normal form using "statistical" jargon and applied to a single dataset. ] --- # Database normalization (cont.) The **normal forms** (from least normalized to most normalized): <div align="center"> <img src="pics/normal-forms.png" height=430> </div> `Credit` [English Wikipedia, "Database normalization"](https://en.wikipedia.org/wiki/Database_normalization) --- # Database normalization (cont.) <div align="center"> <img src="pics/normal-form-error-1.png" height=230> <img src="pics/normal-form-error-2.png" height=230> <img src="pics/normal-form-error-3.png" height=230> </div> <br> <div align="center"> <img src="pics/normal-form-error-4.png" height=230> <img src="pics/normal-form-error-5.png" height=230> </div> --- # Database schema .pull-left[ ### What schemas are - The database schema describes the structure of a database. It represents the map or blueprint of how the database is constructed. - The schema specifies all core ingredients of the database, including tables, fields, keys relationships, views, etc. - The visualization helps database users understand the relationships between the tables. <br> <div align="center"> <img src="pics/adcr-database-schema.png" width=550> </div> ] .pull-right[ ### How they can look like <div align="center"> <img src="pics/mediawiki-schema.png" width=600> </div> `Credit` [Timo Tijhof/Wikimedia Commons](https://en.wikipedia.org/wiki/Database_schema#/media/File:MediaWiki_1.28.0_database_schema.svg) ] --- # Databases and Database Management Systems .pull-left-wide[ ### What are databases? - Databases are an organized collection of data. - They are organized to afford efficient retrieval of (selections) of data. - They entail data + metadata about structure and organization. - They are generally accessed through a database management system. ### Where are databases? - Databases can exist locally or remotely, in-memory or on-disk. - When they are stored locally, they are stored as binary file (not text file). - Commonly, we think of a **client-server model**: - Databases live on a **server**, which manages them - Users interact with the server through a **client** program. - Lets multiple users **access** the same database **simultaneously**. ] --- # Databases and Database Management Systems .pull-left-wide[ ### What are DBMS? - Database Management Systems (DBMS) provide **efficient**, **reliable**, **convenient**, **safe**, **multi-user** storage of and access to **massive** data. - **Massive**: Think Terabytes, not Gigabytes. Handle data that resides outside memory. - **Safe**: Robust to power outages, node failures, etc. - **Multi-user**: Concurrency control. Not one user, but multiple. - **Convenient**: High-level query languages. - **Efficient**: Just fast. - **Reliable**: High uptime. - There are [so many DBMS](https://en.wikipedia.org/wiki/List_of_relational_database_management_systems) for relational database structures alone. - RDBMS differ in terms of capabilities, implemented features, operating system support, and much more. - You'll probably not be in the position to decide which database to use. If you're still interested in the differences, [this](https://www.digitalocean.com/community/tutorials/sqlite-vs-mysql-vs-postgresql-a-comparison-of-relational-database-management-systems) or [this overview](https://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems) might be a good starting point. - Also, I've heard good things about [DuckDB](https://duckdb.org/). ] .pull-right-small[ <br> <div align="center"> <img src="pics/mysql-logo.png" height=120> <img src="pics/postgresql-logo.png" height=150> <img src="pics/sqlite-logo.png" height=110> <img src="pics/duckdb-logo.png" height=50> </div> ] --- class: inverse, center, middle name: joins # Back to dplyr: joins <html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html> --- # Relational data in R For the simple examples that I'm going to show here, we'll need some data sets that come bundled with the [**nycflights13**](http://github.com/hadley/nycflights13) package. Let's load it now and then inspect these data frames in your own console. ```r R> library(nycflights13) ``` --- # Relational data in R (cont.) The package contains the tables `flights`, `airlines`, `airports`, `planes`, and `weather`. -- The `airlines` data frame lets you look up the full carrier name from its [abbreviated code](https://en.wikipedia.org/wiki/List_of_airline_codes): ```r R> head(airlines, 10) ``` ``` ## # A tibble: 10 × 2 ## carrier name ## <chr> <chr> ## 1 9E Endeavor Air Inc. ## 2 AA American Airlines Inc. ## 3 AS Alaska Airlines Inc. ## 4 B6 JetBlue Airways ## 5 DL Delta Air Lines Inc. ## 6 EV ExpressJet Airlines Inc. ## 7 F9 Frontier Airlines Inc. ## 8 FL AirTran Airways Corporation ## 9 HA Hawaiian Airlines Inc. ## 10 MQ Envoy Air ``` --- # Relational data in R (cont.) The package contains the tables `flights`, `airlines`, `airports`, `planes`, and `weather`. `airports` gives information about each airport, identified by the `faa` [airport code](https://airportcodes.io/en/faa-codes/): ```r R> head(airports, 10) ``` ``` ## # A tibble: 10 × 8 ## faa name lat lon alt tz dst tzone ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_Yo… ## 2 06A Moton Field Municipal A… 32.5 -85.7 264 -6 A America/Chicago ## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chicago ## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_Yo… ## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New_Yo… ## 6 0A9 Elizabethton Municipal … 36.4 -82.2 1593 -5 A America/New_Yo… ## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/New_Yo… ## 8 0G7 Finger Lakes Regional A… 42.9 -76.8 492 -5 A America/New_Yo… ## 9 0P2 Shoestring Aviation Air… 39.8 -76.6 1000 -5 U America/New_Yo… ## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/Los_An… ``` --- # Relational data in R (cont.) The package contains the tables `flights`, `airlines`, `airports`, `planes`, and `weather`. `planes` gives information about each plane, identified by its `tailnum` (aircraft registration a.k.a. [tail number](https://en.wikipedia.org/wiki/Aircraft_registration)): ```r R> head(planes, 10) ``` ``` ## # A tibble: 10 × 9 ## tailnum year type manufacturer model engines seats speed engine ## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr> ## 1 N10156 2004 Fixed wing m… EMBRAER EMB-1… 2 55 NA Turbo-… ## 2 N102UW 1998 Fixed wing m… AIRBUS INDUST… A320-… 2 182 NA Turbo-… ## 3 N103US 1999 Fixed wing m… AIRBUS INDUST… A320-… 2 182 NA Turbo-… ## 4 N104UW 1999 Fixed wing m… AIRBUS INDUST… A320-… 2 182 NA Turbo-… ## 5 N10575 2002 Fixed wing m… EMBRAER EMB-1… 2 55 NA Turbo-… ## 6 N105UW 1999 Fixed wing m… AIRBUS INDUST… A320-… 2 182 NA Turbo-… ## 7 N107US 1999 Fixed wing m… AIRBUS INDUST… A320-… 2 182 NA Turbo-… ## 8 N108UW 1999 Fixed wing m… AIRBUS INDUST… A320-… 2 182 NA Turbo-… ## 9 N109UW 1999 Fixed wing m… AIRBUS INDUST… A320-… 2 182 NA Turbo-… ## 10 N110UW 1999 Fixed wing m… AIRBUS INDUST… A320-… 2 182 NA Turbo-… ``` --- # Relational data in R (cont.) The package contains the tables `flights`, `airlines`, `airports`, `planes`, and `weather`. `weather` gives the [weather](https://www.youtube.com/watch?v=1ZyT_Aiey1U) at each NYC airport for each hour: ```r R> head(weather, 10) ``` ``` ## # A tibble: 10 × 15 ## origin year month day hour temp dewp humid wind_dir wind_speed ## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 ## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 ## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 ## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 ## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 ## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 ## 7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0 ## 8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4 ## 9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0 ## 10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8 ## # … with 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>, ## # visib <dbl>, time_hour <dttm> ``` --- # Relational data in R (cont.) The package contains the tables `flights`, `airlines`, `airports`, `planes`, and `weather`. Finally, `flights` gives data on each of the 336776 flights in the dataset: ```r R> head(flights, 10) ``` ``` ## # A tibble: 10 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- # Relational data in R (cont.) .pull-left[ - We can illustrate the relationships between the different tables with a schematic drawing.<sup>1</sup> - One table seems central (`flights`), but that's not a necessary feature of relational databases. - Key to understanding the diagram is that each relation always concerns a pair of tables: - `flights` connects to `planes` via `tailnum`. - `flights` connects to `airlines` via `carrier`. - `flights` connects to `airports` via `origin` and `dest`. - `flights` connects to `weather` via `origin` (location) and `year`, `month`, `day`, and `hour` (time). ] .pull-right[ <br> <div align="center"> <img src="pics/relational-nycflights.png" height=350> </div> ] .footnote[<sup>1</sup> Of course there are R packages that help you create such data models visually, e.g., the [dm package](https://cynkra.github.io/dm/).] --- # Relational data in R (cont.) .pull-left[ ### Question time 1. Imagine you wanted to draw (approximately) the route each plane flies from its origin to its destination. What variables would you need? What tables would you need to combine? 2. We know that some days of the year are “special”, and fewer people than usual fly on them. How might you represent that data as a data frame? What would be the primary keys of that table? How would it connect to the existing tables? ] .pull-right[ <br> <div align="center"> <img src="pics/relational-nycflights.png" height=350> </div> ] --- # Joins .pull-left[ One of the mainstays of the `dplyr` package is merging data with the family of [join operations](https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html). Luckily, the functions are both intuitive to apply and consistent with SQL join vocabulary. The main functions are: - `inner_join(df1, df2)` - `left_join(df1, df2)` - `right_join(df1, df2)` - `full_join(df1, df2)` - `semi_join(df1, df2)` - `anti_join(df1, df2)` ] .pull-right[ <br> <div align="center"> <img src="pics/joins-comic.png" height=350> </div> ] --- # The logic of joins We start with a simple setup of two tables `x` and `y`: <div align="center"> <img src="pics/join-setup.png" height=100> </div> -- The **colored columns** represent the key variables, the **gray columns** the value variables. The basic ideas of joining will generalize to multiple keys and values. -- Joining is about connecting each row in `x` to zero, one, or more rows in `y`: <div align="center"> <img src="pics/join-setup2.png" height=150> </div> --- # The logic of joins We start with a simple setup of two tables `x` and `y`: <div align="center"> <img src="pics/join-setup.png" height=100> </div> The **colored columns** represent the key variables, the **gray columns** the value variables. The basic ideas of joining will generalize to multiple keys and values. Joining is about connecting each row in `x` to zero, one, or more rows in `y`: <div align="center"> <img src="pics/join-inner.png" height=150> </div> In an actual join, matches will be indicated with dots. The number of dots = the number of matches = the number of rows in the output. --- # The logic of joins (cont.) .pull-left-wide[ ### Inner join - An inner join matches pairs of observations when their keys are equal. - The output of an inner join is a new table that contains the key and values of both tables. - Unmatched rows are not included in the result. To be used with caution, because [it's easy to lose observations](https://twitter.com/SolomonMg/status/1436742352039669760)! ### Outer joins - Outer joins keep observations that appear in at least one of the tables. - There are three types of outer joins: - A **left join** keeps all observations in `x`. - A **right join** keeps all observations in `y`. - A **full join** keeps all observations in `x` and `y`. - Use the left join unless you have a strong reason not to do so; it preserves original observations even when there isn't a match. ] .pull-right-narrow[ <div align="center"> <img src="pics/join-inner2.png" width=300> <br> <img src="pics/join-outer.png" width=300> </div> ] --- # Keys - Variables used to connect each pair of tables are called keys. - A key is a variable (or set of variables) that uniquely identifies an observation. - In simple cases, a single variable is sufficient to identify an observation (plane → `tailnum`). - In other cases, multiple variables are needed (look again at the `weather` table). -- ### Primary keys - A primary key uniquely identifies an observation in its own table. - It is either a column containing a (sometimes autogenerated and otherwise meaningless) identifier that uniquely identifies each row, or - Several substantively meaningful columns whose row values *taken together* uniquely identify each row. -- ### Foreign keys - A foreign key is a column containing primary key(s) from another table. - It is the piece of information necessary to join both tables. - Note that a variable can be both a primary *and* a foreign key. In our example, `origin` is part of the `weather` primary key, and is also a foreign key for the `airports` table. --- # Joins in R Let's perform a [left join](https://stat545.com/bit001_dplyr-cheatsheet.html#left_joinsuperheroes-publishers) on the flights and planes datasets. - *Note*: I'm going subset columns after the join, but only to keep text on the slide. -- ```r R> left_join(flights, planes) %>% + select(year, month, day, dep_time, arr_time, carrier, flight, tailnum, type, model) ``` ``` ## Joining, by = c("year", "tailnum") ``` ``` ## # A tibble: 336,776 × 10 ## year month day dep_time arr_time carrier flight tailnum type model ## <int> <int> <int> <int> <int> <chr> <int> <chr> <chr> <chr> ## 1 2013 1 1 517 830 UA 1545 N14228 <NA> <NA> ## 2 2013 1 1 533 850 UA 1714 N24211 <NA> <NA> ## 3 2013 1 1 542 923 AA 1141 N619AA <NA> <NA> ## 4 2013 1 1 544 1004 B6 725 N804JB <NA> <NA> ## 5 2013 1 1 554 812 DL 461 N668DN <NA> <NA> ## 6 2013 1 1 554 740 UA 1696 N39463 <NA> <NA> ## 7 2013 1 1 555 913 B6 507 N516JB <NA> <NA> ## 8 2013 1 1 557 709 EV 5708 N829AS <NA> <NA> ## 9 2013 1 1 557 838 B6 79 N593JB <NA> <NA> ## 10 2013 1 1 558 753 AA 301 N3ALAA <NA> <NA> ## # … with 336,766 more rows ``` --- # Joins in R (cont.) Note that `dplyr` made a reasonable guess about which columns to join on (i.e. columns that share the same name). It also told us its choices: ``` *### Joining, by = c("year", "tailnum") ``` However, there's an obvious problem here: the variable "year" does not have a consistent meaning across our joining datasets! - In one it refers to the *year of flight*, in the other it refers to *year of construction*. -- Luckily, there's an easy way to avoid this problem. - See if you can figure it out before turning to the next slide. - Try `?dplyr::join`. --- # Joins in R (cont.) You just need to be more explicit in your join call by using the `by = ` argument. - You can also rename any ambiguous columns to avoid confusion. ```r R> left_join( + flights, + planes %>% rename(year_built = year), ### Not necessary w/ below line, but helpful + by = "tailnum" ### Be specific about the joining column + ) %>% + select(year, month, day, dep_time, arr_time, carrier, flight, tailnum, year_built, type, model) %>% + head(3) ### Just to save vertical space on the slide ``` ``` ## # A tibble: 3 × 11 ## year month day dep_time arr_time carrier flight tailnum year_built type ## <int> <int> <int> <int> <int> <chr> <int> <chr> <int> <chr> ## 1 2013 1 1 517 830 UA 1545 N14228 1999 Fixed w… ## 2 2013 1 1 533 850 UA 1714 N24211 1998 Fixed w… ## 3 2013 1 1 542 923 AA 1141 N619AA 1990 Fixed w… ## # … with 1 more variable: model <chr> ``` --- # Joins in R (cont.) Last thing to mention on joins for now; note what happens if we again specify the join column... but don't rename the ambiguous "year" column in at least one of the given data frames: ```r R> left_join( + flights, + planes, ### Not renaming "year" to "year_built" this time + by = "tailnum" + ) %>% + select(contains("year"), month, day, dep_time, arr_time, carrier, flight, tailnum, type, model) %>% + head(3) ``` ``` ## # A tibble: 3 × 11 ## year.x year.y month day dep_time arr_time carrier flight tailnum type model ## <int> <int> <int> <int> <int> <int> <chr> <int> <chr> <chr> <chr> ## 1 2013 1999 1 1 517 830 UA 1545 N14228 Fixe… 737-… ## 2 2013 1998 1 1 533 850 UA 1714 N24211 Fixe… 737-… ## 3 2013 1990 1 1 542 923 AA 1141 N619AA Fixe… 757-… ``` -- Make sure you know what "year.x" and "year.y" are. Again, it pays to be specific. --- # Duplicate keys If you're lucky, keys are unique. But that's not always the case. There are two common scenarios: -- 1. One table has duplicate keys, the other hasn't. This gives us a one-to-many relationship. <div align="center"> <img src="pics/join-one-to-many.png" height=110> </div> -- <ol start="2"> <li>Both tables have duplicate keys. This is usually an error because in neither table do the keys uniquely identify an observation. When joining duplicated keys, we get all possible combinations (the Cartesian product, a many-to-many relationship):</li> </ol> <div align="center"> <img src="pics/join-many-to-many.png" height=140> </div> --- class: inverse, center, middle name: sql # SQL <html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html> --- # What is SQL? ### Background and history - SQL (pronounced [ɛsˌkjuːˈɛl] as in S-Q-L, or [siːkwəl] as in sequel) stands for **Structured Query Language**. Initially it was called SEQUEL (Structured English Query Language), but this was dropped due to trademark issues. - It's a domain-specific language designed to query data contained in relational databases. - Initially developed at IBM by [Donald D. Chamberlin](https://en.wikipedia.org/wiki/Donald_D._Chamberlin) and [Raymond F. Boyce](https://en.wikipedia.org/wiki/Raymond_F._Boyce) in 1974. -- ### Why SQL? - While database types differ, most of the relational databases you'll encounter speak SQL. - The key skill to work with databases (outside R) is to learn how to speak SQL. Once you've mastered this, you should be able to work with any of them. - SQL is featured as a required skill in many (most?) data science job ads out there. --- # General SQL syntax ### Classes of SQL syntax - **Data query language** (DQL) to perform queries on the data [`SELECT`, `FROM`, `WHERE`] - **Data definition language** (DDL) to describe data structure and its relations (create tables, columns, define data types, keys, constraints) [`CREATE`, `ALTER`, `DROP`] - **Data manipulation language** (DML) to fill database or retrieve information from it [`SELECT`, `FROM`, `WHERE`, `INSERT`, `UPDATE`, `DELETE`] - **Data control language** (DCL) to define usage/admin rights [`GRANT`, `REVOKE`] -- ### A generic query The main SQL tool is `SELECT`, which allows you to perform queries on a table in a database. It has the generic form: ```text R> SELECT columns or computations + FROM table + WHERE condition + GROUP BY columns + HAVING condition + ORDER BY column [ASC | DESC] + LIMIT offset,count; ``` --- # General SQL syntax (cont.) .pull-left[ ### Basic rules - SQL statements start with a command describing the desired action (`SELECT`), followed by the unit on which it should be executed (`SELECT column1`), and one or more clauses (`WHERE column 2 = 1`). - Although it's customary to write all SQL statements in capital letters, SQL is actually case insensitive towards its key words. - Each SQL statement ends with a semicolon, so SQL statements can span across multiple lines. - Comments either start with `--` or have to be put in between `/*` and `*/`. ] .pull-right[ <br> ```text R> CREATE DATABASE database1 ; + SELECT column1 FROM table1 WHERE column2 = 1 ; + UPDATE table1 SET column1 = 1 WHERE column2 > 3 ; + INSERT INTO table1 (column1, column2) + VALUES ('rc11', 'rc12'), ('rc21', 'rc22') ; ``` ```text R> -- One line comment. + /* + Comment spanning + several lines + */ ``` ] --- # General SQL syntax (cont.) ### Notes from a dplyr user - SQL syntax is intuitive [until it isn't](https://blog.jooq.org/a-beginners-guide-to-the-true-order-of-sql-operations/). - One problem: SQL imposes a **lexical order** of operations ("order of execution") which does not necessarily match the **logical order** of operations you'd have in mind. - Conceptually, every step (like "`WHERE`") of the query transforms its input. However, the query's steps don't happen in the order they're written (e.g., `SELECT` is written first but the operation comes much later in the process). - Compare this to our `dplyr` logic: Take this, do this, then do this, etc. -- ### The good news - With `dplyr` we have a package that can effectively speak SQL for us. - Those queries (formulated with `dplyr` commands!) can then easily be submitted as SQL queries to the database (also thanks to the `DBI` package). - So, do you still have to learn SQL? Probably not, but it won't hurt! --- # More resources to get started with SQL Probably the best way to learn SQL hands-on is the interactive tutorial at [SQLBolt](https://sqlbolt.com/). Another hands-on beginner SQL tutorial is at [DataQuest](https://www.dataquest.io/blog/sql-basics). Yet another good starting point to learn about the basics of SQL lives at [Codeacademy](https://www.codecademy.com/learn/learn-sql). If you want to know more about how indexes work in SQLite, check out [this resource](https://www.sqlite.org/queryplanner.html). Finally, another good read is [10 easy steps to a complete understanding of SQL](https://blog.jooq.org/10-easy-steps-to-a-complete-understanding-of-sql/) by Lukas Eder. --- class: inverse, center, middle name: dbr # Talking to databases with R <html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html> --- # Connecting to databases with R .pull-left-wide[ ### A database interface - DBMS implement SQL but all have somewhat different conventions. - R can connect to all major existing databases types. - The R package [`DBI`](https://cran.r-project.org/web/packages/DBI/vignettes/DBI-proposal.html) (**D**ata**B**ase **I**nterface) is a unified interface to them (car analogy: wheel, pedals). - In addition, we need a separate "driver" to connect to the database engine (car analogy: fuel (?)). ### Drivers for open-source/commercial DBs - There are various R packages that allow you to connect to particular **open-source** database types including SQLite (via [`RSQLite`](https://cran.r-project.org/web/packages/RSQLite/index.html)), MySQL ([`RMySQL`](https://cran.r-project.org/web/packages/RMySQL/index.html)), PostgreSQL ([`RPostgres`](https://cran.r-project.org/web/packages/RPostgres/index.html)), Google BigQuery ([`bigrquery`](https://cran.r-project.org/web/packages/bigrquery/index.html)), MariaDB ([`RMariaDB`](https://cran.r-project.org/web/packages/RMariaDB/index.html)), DuckDB ([`duckdb`](https://cran.r-project.org/web/packages/duckdb/index.html)), and more. - For **commercial databases** such as Microsoft SQL Server or Oracle, the [`odbc`](https://github.com/r-dbi/odbc) package provides a DBI-compliant interface to Open Database Connectivity (ODBC) drivers. ] .pull-right-small-center[ <br> <div align="center"> <img src="pics/db-connect-open-source.png" height=200> <br> <img src="pics/db-connect-commercial.png" height=200> </div> `Credit` [db.rstudio.com](https://db.rstudio.com/) ] --- # Talking to databases .pull-left[ ### dplyr as a database interface - Good news: `dplyr` is able to interact with databases directly by translating the `dplyr` verbs into SQL queries. This convenient feature allows you to "speak" directly with the database from R. - Using `dplyr` as an interface allows you to: 1. Run data exploration routines over all of the data, instead of importing parts of the data into R. 2. Use the SQL engine to run the data transformations. In effect, computation is being pushed to the database. 3. Collect into R only a targeted dataset. 4. Keep all your code in R. There is no need to alternate between languages or tools to perform the data exploration. ] .pull-right-center[ <br> <div align="center"> <img src="pics/db-interact.png" height=320> </div> `Credit` [db.rstudio.com](https://db.rstudio.com/) ] --- # A hands-on database session with R Before we start, we load the `DBI` package to connect to a DBMS and the `RSQLite` package to communicate with an SQLite database. Tidyverse and `dplyr`/`dbplyr` will be used, too. Also, we will rely on `nycflights13` for some toy data sets. ```r R> library(DBI) R> library(RSQLite) R> library(tidyverse) R> library(nycflights13) ``` --- # Connecting to the database Now, let's set up a connection with an SQLite database. In fact, we will not interact with an existing one but build up our own, which will live in the memory: ```r R> con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:") ``` `con` represents our database connection via which we'll interact with the database. -- The arguments to `DBI::dbConnect()` vary from database to database, but the first argument is always the database backend. For instance, it’s `RSQLite::SQLite()` for RSQLite, `RMariaDB::MariaDB()` for RMariaDB, `RPostgres::Postgres()`, and `bigrquery::bigquery()` for BigQuery. -- Also, in real life, chances are that the database lives on a server and you have to authenticate to connect to it. This could look as follows: ```r R> con <- DBI::dbConnect(RSQLite::SQLite(), + host = "mydatabase.host.com", + user = "simon", + password = "mypassword" + ) ``` --- # Filling the database Next, we upload a local data frame into the remote data source; here: our database. Note that this is specific to our (toy) example. You'll probably not have to build up your own database. ```r R> dplyr::copy_to( + dest = con, + df = nycflights13::flights, + name = "flights") ``` --- # Filling the database Next, we upload a local data frame into the remote data source; here: our database. Note that this is specific to our (toy) example. You'll probably not have to build up your own database. ```r R> dplyr::copy_to( + dest = con, + df = nycflights13::flights, + name = "flights", + temporary = FALSE, + indexes = list( + c("year", "month", "day"), + "carrier", + "tailnum", + "dest" + ) + ) ``` We can also explicitly set up indexes that will allow us to quickly process the data by day, carrier, plane, and destination. While creating the right indices is key to good database performance, in common applications this will be taken care of by the database maintainer. --- # Querying the database Now it's time to start querying our database. First, we generate a reference table from the database using `dplyr`'s `tbl`(): ```r R> flights_db <- tbl(con, "flights") ``` Note that `flights_db` is a remote source; the table is not stored in our local environment. We can use it as a "pointer" to the actual database. Next, we perform various queries, such as: ```r R> flights_db %>% select(year:day, dep_delay, arr_delay) ``` ``` ## # Source: lazy query [?? x 5] ## # Database: sqlite 3.36.0 ## # [/Users/simonmunzert/github/intro-to-data-science-21/lectures/04-databases/nycflightsdb] ## year month day dep_delay arr_delay ## <int> <int> <int> <dbl> <dbl> ## 1 2013 1 1 2 11 ## 2 2013 1 1 4 20 ## 3 2013 1 1 2 33 ## 4 2013 1 1 -1 -18 ## 5 2013 1 1 -6 -25 ## 6 2013 1 1 -4 12 ## 7 2013 1 1 -5 19 ## 8 2013 1 1 -3 -14 ## 9 2013 1 1 -3 -8 ## 10 2013 1 1 -2 8 ## # … with more rows ``` Yes, we can use `dplyr` syntax to do database queries! --- # Querying the database The most important difference between ordinary data frames and remote database queries is that your R code is translated into SQL and executed in the database on the remote server, not in R on your local machine. When working with databases, `dplyr` **tries to be as lazy as possible**: - It never pulls data into R unless you explicitly ask for it. - It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step. This even applies when you assign the output of a database query to an object: ```r R> tailnum_delay_db <- flights_db %>% + group_by(tailnum) %>% + summarise( + delay = mean(arr_delay), + n = n() + ) %>% + arrange(desc(delay)) %>% + filter(n > 100) ``` --- # Querying the database Laziness also has some downsides. Because there’s generally no way to determine how many rows a query will return unless you actually run it, nrow() is always NA: ```r R> nrow(tailnum_delay_db) ``` ``` ## [1] NA ``` Because you can’t find the last few rows without executing the whole query, you can’t use tail(): ```r R> tail(tailnum_delay_db) ``` ``` ## Error: tail() is not supported by sql sources ``` If you then want to pull the data into a local data frame, use `dplyr::collect()`: ```r R> tailnum_delay <- tailnum_delay_db %>% collect() R> tailnum_delay ``` ``` ## # A tibble: 1,201 × 3 ## tailnum delay n ## <chr> <dbl> <int> ## 1 <NA> NA 2512 ## 2 N0EGMQ 9.98 371 ## 3 N10156 12.7 153 ## 4 N10575 20.7 289 ## 5 N11106 14.9 129 ## 6 N11107 15.0 148 ## 7 N11109 14.9 148 ## 8 N11113 15.8 138 ## 9 N11119 30.3 148 ## 10 N11121 10.3 154 ## # … with 1,191 more rows ``` --- # Using SQL directly in R Again, because it is so cool: Yes, we can use `dplyr` syntax to do database queries! Behind the scenes, `dplyr` is translating your R code into SQL. The `dbplyr` package is doing the work for us. You can use the `show_query()` function to display the SQL code that was used to generate a queried table: ```r R> tailnum_delay_db %>% show_query() ``` ``` ## <SQL> ## SELECT * ## FROM (SELECT `tailnum`, AVG(`arr_delay`) AS `delay`, COUNT(*) AS `n` ## FROM `flights` ## GROUP BY `tailnum`) ## WHERE (`n` > 100.0) ``` If you still want to formulate SQL queries and pass them on to the DBMS, use `DBI::dbGetQuery()`: ```r R> sql_query <- "SELECT * FROM flights WHERE dep_delay > 240.0 LIMIT 5" R> dbGetQuery(con, sql_query) ``` ``` ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## 1 2013 1 1 848 1835 853 1001 1950 ## 2 2013 1 1 1815 1325 290 2120 1542 ## 3 2013 1 1 1842 1422 260 1958 1535 ## 4 2013 1 1 2115 1700 255 2330 1920 ## 5 2013 1 1 2205 1720 285 46 2040 ## arr_delay carrier flight tailnum origin dest air_time distance hour minute ## 1 851 MQ 3944 N942MQ JFK BWI 41 184 18 35 ## 2 338 EV 4417 N17185 EWR OMA 213 1134 13 25 ## 3 263 EV 4633 N18120 EWR BTV 46 266 14 22 ## 4 250 9E 3347 N924XJ JFK CVG 115 589 17 0 ## 5 246 AA 1999 N5DNAA EWR MIA 146 1085 17 20 ## time_hour ## 1 1357081200 ## 2 1357063200 ## 3 1357066800 ## 4 1357077600 ## 5 1357077600 ``` --- # More useful resources to get started with databases ### Relational databases FTW A comprehensive overview of database interaction using R is available at [db.rstudio.com](https://db.rstudio.com/getting-started/overview). It also features a set of best practices that go beyond what we covered today. Not to mention [Databases 101](https://raw.githack.com/uo-ec607/lectures/master/16-databases/16-databases.html) by Grant McDermott, which also features example code to connect to DuckDB and Google BigQuery. Finally, there is the [introduction to `dbplyr`](https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html) which comes as a package vignette. --- # And what about NoSQL databases? We haven't talked about non-relational databases so far, but you might encounter them in the future. NoSQL became popular in the 2000s as a consequence of several developments: Cost of data storage decreased, which made the need of efficient, non-redundant but complex data storage less urgent. At the same time, heterogeneity in data formats increased and coming up with well-defined schemas was difficult. NoSQL offered more flexibility. There are several types of NoSQL databases, including document DBs where documents contain pairs of fields and values (think: JSON), key-value DBs, or graph DBs. Depending on your use case, this might be exactly what you need. If you want to learn more, start [here](https://www.mongodb.com/nosql-explained) or [here](https://www.digitalvidya.com/blog/nosql-tutorial/). --- class: inverse, center, middle name: summary # Summary <html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html> --- # Summing it up .pull-left[ Most data scientists **never design a database**. - But they almost all end up interacting with them. - Also, with database administrators (DBAs). Lots of academic data work consists of working with *bad re-inventions* of the relational database. - Looking at you, Excel users. - Looking at you, instructor. Thinking about data relationally **will help you with statistics**. - Multilevel models - Time-series - Network analysis - NLP ] .pull-right[ </br> | Database concept | Statistical concept | |---|---| | Table | Sample | | Column | Variable | | Row | Unit | | Value | Observation | | Foreign key relationships | Nested variables | | Many-to-many relationships | Crossed variables (possibly unbalanced) | ] --- class: center background-image: url("pics/excel-not-database-simpsons.png") background-size: contain background-color: #000000 # Databases: Maybe not the most exciting technology... <br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br> <p style="background-color:black;color:white;"><b>... but awesomely useful and not going away.</b></p> --- # Coming up <br><br> ### Assignment None! But you'll get the chance to practice databasing with R in the lab. ### Next lecture Web data and technologies - relational data structures FTW!