Introduction to Data Science

class: center, middle, inverse, title-slide

# Introduction to Data Science
## Session 4: Relational databases and SQL
### Simon Munzert
### Hertie School | <a href="https://github.com/intro-to-data-science-21">GRAD-C11/E1339</a>

---

# Table of contents

<br>

1. [Why databases?](#whydb)

2. [Relational database fundamentals](#relationaldb)

3. [Back to `dplyr`: joins](#joins)

4. [SQL](#sql)

5. [Talking to databases with R](#dbr)

6. [Summary](#summary)

---
class: inverse, center, middle
name: whydb

# Why databases?

---

# The ubiquity of multi-dimensional data structures

.pull-left[
### From data frames...

- If you have a background in the social sciences, your top-of-the-head mental image of data might be a rectangular [spreadsheet](https://en.wikipedia.org/wiki/Spreadsheet).
- In fact, much of classical "statistical" software (SPSS, Stata, MS Excel) operates with rectangular data frames by default. 
- At the same time, your perception might be [file-based](https://en.wikipedia.org/wiki/List_of_file_formats). Data is stored in files, and these files are read (and produced) by our data management software.
- In many cases, the **two-dimensional structure** makes sense. For instance, we observe
  - persons x attitudes
  - countries x characteristics
  - social media posts x text features
]

.pull-right-center[
<br><br>
<div align="center">
<img src="pics/clippy-message.jpeg" height=300> 
</div>
]

---

# The ubiquity of multi-dimensional data structures

.pull-left-center[
<div align="center">
<img src="pics/legislator-structure.png" height=520> 
</div>
`Credit` [saschagobel/legislatoR](https://github.com/saschagobel/legislatoR)
]

.pull-right[
### ... to complex data structures

- However, the longer you think about it, the more problematic it becomes to store your data in two-dimensional structures.
- **Examples**:
  - countries x persons x characteristics x time
  - countries x states x communities x time x variables
  - social media posts x retweets x users x user characteristics x network features x meta data
- Mapping three- onto two-dimensional structures is easy (think: `pivot_longer`, `pivot_wider`).
- With **multiple heterogeneous data sources**, things get messy.
- Managing complex data structures is just one perk of using databases.
]
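
---

# Reshaping between wide and long: a sketch

A minimal sketch of the reshaping step mentioned on the previous slide, assuming the `tidyr` package is installed. The data frame `wide` and all of its values are made up for illustration: it flattens countries x variables x years into two dimensions.

```r
library(tidyr)

# Hypothetical wide data: variable and year are both encoded in column names
wide <- data.frame(
  country  = c("A", "B"),
  gdp_2019 = c(1.0, 2.0),
  gdp_2020 = c(1.5, 2.5),
  pop_2019 = c(10, 20),
  pop_2020 = c(11, 21)
)

# Wide -> long: one row per country x variable x year
long <- pivot_longer(
  wide,
  cols      = -country,
  names_to  = c("variable", "year"),
  names_sep = "_",
  values_to = "value"
)

# Long -> tidy: one row per country x year, one column per variable
tidy <- pivot_wider(long, names_from = variable, values_from = value)
```

With a single source this works well. The messy part starts when several such tables with different units of observation have to be kept in sync.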

---

# When databases become useful

.pull-left[
### Size and speed
- You have **loads of data that exceed the working memory** on your computer. Databases are only limited by available disk size (or can be distributed across multiple disks/machines).
- Your **data structure is complex**. Databases allow/encourage you to store, retrieve and subset data with complex data structures.
- Your data is big and you have to **access/subset/operate frequently**. Querying databases is fast.
- You care about **data quality** and have clear expectations of what the data should look like. With a database, you can define specific rules for extending and updating your data.
]

.pull-right[
### Accessibility and concurrency
- You **collaborate with others** on a data collection project. With a database, you have a common and reliable infrastructure at hand that multiple users can access at the same time.
- When several parties are involved, who is allowed to do what with the database might differ (e.g., read-only, access to parts of the data, limited admin rights, etc.). Most databases allow **defining different usage rights for different users**.
]
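
---

# Data quality rules: a sketch

A minimal sketch of what "rules for extending and updating" can look like, assuming the `DBI` and `RSQLite` packages are installed. The table `persons` and its columns are hypothetical. Other DBMS support similar constraints, but the exact syntax may differ.

```r
library(DBI)

# In-memory SQLite database (assumption: DBI and RSQLite are installed)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Rules are declared once, as part of the table definition
dbExecute(con, "
  CREATE TABLE persons (
    nameid INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    weight REAL CHECK (weight > 0)
  )
")

# A row that satisfies all rules is accepted
dbExecute(con, "INSERT INTO persons VALUES (1, 'Peter', 80.5)")

# A row that violates the CHECK constraint is rejected with an error
bad_insert <- tryCatch(
  dbExecute(con, "INSERT INTO persons VALUES (2, 'Paul', -3)"),
  error = function(e) conditionMessage(e)
)

# Only the valid row made it into the table
n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM persons")$n
dbDisconnect(con)
```

The point is that the database itself enforces the rule, regardless of which user or program tries to write the data.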

---

# Talking about databases

.pull-left[

### What we should distinguish

- The **types of databases**, e.g.: relational, navigational, NoSQL, NewSQL
- The **database management system**, e.g.: PostgreSQL, Oracle, SQL Server, SQLite
- The **data structure**, e.g.: tables, columns, keys, normal forms
- The **data manipulations**, e.g.: selects, joins, grouping
- The **query language**, e.g.: SQL, SPARQL

Also, there are so many more ways to [classify databases](https://en.wikipedia.org/wiki/Database#Classification). But that's enough for now.

Today, **we focus on relational databases**. They are by no means the only type of databases (see above), but they're ubiquitous and won't go away any time soon. 
]

--
.pull-right[

### Databases versus data frames

When reading/talking about features of databases, you will encounter a particular jargon. Here's how database concepts map onto R data frame jargon:

| R jargon  |  Database jargon |
|---|---|
| column  | attribute/field |
| row  | tuple/record |
| element/cell | attribute value |
|  data frame |  relation/table |
|  column types |  table schema |
|  bunch of related data frames |  database |

]
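
---

# Databases versus data frames: a sketch

To make the jargon mapping concrete, here is a small sketch, assuming the `DBI` and `RSQLite` packages are installed. The table name `people` and its contents are made up for illustration.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# A data frame becomes a table (relation) in the database
people <- data.frame(nameid = 1:3, name = c("Peter", "Paul", "Mary"))
dbWriteTable(con, "people", people)

# Tables in the database, and the fields (columns) of one table
tables <- dbListTables(con)
fields <- dbListFields(con, "people")

# Rows are records (tuples); a query returns them as a data frame again
records <- dbGetQuery(con, "SELECT * FROM people WHERE nameid <= 2")

dbDisconnect(con)
```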

---
class: inverse, center, middle
name: relationaldb

# Relational database fundamentals

---
# Codd's relational model for databases

.pull-left[
- The concept of relational databases builds on the [relational model (RM) for database management](https://en.wikipedia.org/wiki/Relational_model), as proposed by [Edgar F. "Ted" Codd](https://en.wikipedia.org/wiki/Edgar_F._Codd) in 1969/1970.
- Codd described the RM formally, but also introduced it using concepts that are still in use today (normalization, keys, joins, redundancy, etc.).
- The key assumption of the relational model is that all data can be represented as relations (tables).
- Information is then represented by data values in relations.
- If you think this is trivial, check out the [history of databases](https://en.wikipedia.org/wiki/Database) and live through the pain of the early era of navigational DBMS in the 1960s and of the NoSQL era that we have not (yet) overcome.

]

.pull-right-center[
<div align="center">
<img src="pics/codd-relational-model.png" height=500>
</div>
`Credit` [Communications of the ACM 13(6), 1970](https://dl.acm.org/doi/10.1145/362384.362685)
]

---
# Codd's relational model for databases (cont.)

.pull-left[
### Storing data in tables

- Again, the key concept of relational databases is that all information can be represented in a table.
- A single table already introduces relations: All data in one row belongs to the same record.
- If we want to represent more complex relations (e.g., measuring a person's weight twice or measuring the weight of their children as well), we can relate data from one table to another.

### Example

- We have collected data on Peter, Paul, and Mary.
- We have information on birthdays, telephone numbers, and favorite foods.
- How can we represent this information in tables?

]

.pull-right[
<br>
<div align="center">
<img src="pics/adcr-relational-data-1.png" width=600>
</div>
<div align="center">
<img src="pics/adcr-relational-data-2.png" width=300>
</div>

- We start representing the data in two tables.
- They are linked via the key `nameid`, so we don't have to add the full names to the phone numbers table.
- Note that we have a 1:m (one-to-many) relation here because Peter has two phone numbers.
]
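
---

# Linking tables via a key: a sketch

The two-table setup can be sketched with plain R data frames. All values below are made up; the tables shown in the figures may contain different ones. Only the key name `nameid` is taken from the example, the other column names are assumptions.

```r
# Hypothetical person table: one row per person
persons <- data.frame(
  nameid   = 1:3,
  name     = c("Peter", "Paul", "Mary"),
  birthday = as.Date(c("1990-01-01", "1991-02-02", "1992-03-03"))
)

# Hypothetical phone table: Peter (nameid 1) appears twice, a 1:m relation
phones <- data.frame(
  phoneid = 1:4,
  nameid  = c(1L, 1L, 2L, 3L),
  phone   = c("555-0101", "555-0102", "555-0103", "555-0104")
)

# Look up the names that belong to the phone numbers via the shared key
phonebook <- merge(persons, phones, by = "nameid")
```

The full name is stored once in `persons` and only attached to the phone numbers when needed.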

---
# Codd's relational model for databases (cont.)

.pull-left[
### Storing data in tables

### Example

- We have collected data on Peter, Paul, and Mary.
- We have information on birthdays, telephone numbers, and favorite foods.
- How can we represent this information in tables?

]

.pull-right[
<br>
<div align="center">
<img src="pics/adcr-relational-data-1.png" width=600>
</div>
<div align="center">
<img src="pics/adcr-relational-data-2.png" width=300>
</div>

- However, the way we store the data is not ideal. In the first table, we have three columns measuring effectively the same thing. And what if someone has more than three favorite foods? Adding information in such a fashion creates a lot of redundant information.
]

---
# Codd's relational model for databases (cont.)

.pull-left[
### Storing data in tables

### Example

- We have collected data on Peter, Paul, and Mary.
- We have information on birthdays, telephone numbers, and favorite foods.
- How can we represent this information in tables?

]

.pull-right[
<br>
<div align="center">
<img src="pics/adcr-relational-data-3.png" width=300>
</div>

- Splitting up the information by creating another table for food preferences is better.
- There's still some redundancy left. Is it really necessary to have `hamburger` in the table twice?
]

---
# Codd's relational model for databases (cont.)

.pull-left[
### Storing data in tables

### Example

- We have collected data on Peter, Paul, and Mary.
- We have information on birthdays, telephone numbers, and favorite foods.
- How can we represent this information in tables?

]

.pull-right[
<br>
<div align="center">
<img src="pics/adcr-relational-data-4.png" width=300>
</div>
<div align="center">
<img src="pics/adcr-relational-data-5.png" width=300>
</div>

- Now that's better.
- In restructuring the information in our database, we **avoided redundancy (duplication)**.
- This is the process of **database normalization**.
]
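
---

# Removing redundancy: a sketch

The last restructuring step can be sketched in base R. The preferences below are made up and the column name `foodid` is an assumption; the idea is only that each food is stored once and referenced by a key.

```r
# Hypothetical preferences table: 'hamburger' is stored twice
prefs <- data.frame(
  nameid = c(1L, 1L, 2L, 3L),
  food   = c("hamburger", "spaghetti", "hamburger", "fruit salad")
)

# Each distinct food is stored exactly once and gets its own key
foods <- data.frame(food = unique(prefs$food))
foods$foodid <- seq_len(nrow(foods))

# The link table only holds keys
links <- data.frame(
  nameid = prefs$nameid,
  foodid = foods$foodid[match(prefs$food, foods$food)]
)

# Joining the two tables recovers the original information
restored <- merge(links, foods, by = "foodid")[, c("nameid", "food")]
```

Renaming a food now means changing a single cell in `foods` instead of every row that mentions it.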

---
# Database normalization

.pull-left[
### What is database normalization?

**From [Wikipedia](https://en.wikipedia.org/wiki/Database_normalization)**: "Database normalization is the process of **structuring a database**, usually a relational database, in accordance with a series of so-called **normal forms** in order to **reduce data redundancy and improve data integrity**. It was first proposed by Edgar F. Codd as part of his relational model."

- You'll probably not have to apply normalization yourself because you are a user, not a designer, of databases.
- However, it helps to have an idea of what the first normal forms are.
- Higher-order normal forms imply lower-order normal forms (e.g., in order to satisfy the 3rd normal form, the 1st and 2nd normal forms have to be satisfied, too).
]

.pull-right[
### Normalization and tidy data

There is also a straightforward link to [Hadley Wickham's "tidy data"](https://www.jstatsoft.org/article/view/v059i10):

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

This is Codd's 3rd normal form using "statistical" jargon and applied to a single dataset. 
]

---

# Database normalization (cont.)

The **normal forms** (from least normalized to most normalized):

<div align="center">
<img src="pics/normal-forms.png" height=430>
</div>
`Credit` [English Wikipedia, "Database normalization"](https://en.wikipedia.org/wiki/Database_normalization)

---

# Database schema

.pull-left[

### What schemas are

- The database schema describes the structure of a database. It represents the map or blueprint of how the database is constructed.
- The schema specifies all core ingredients of the database, including tables, fields, keys, relationships, views, etc.
- The visualization helps database users understand the relationships between the tables.

]

.pull-right[

### What they can look like

<div align="center">
<img src="pics/mediawiki-schema.png" width=600>
</div>
`Credit` [Timo Tijhof/Wikimedia Commons](https://en.wikipedia.org/wiki/Database_schema#/media/File:MediaWiki_1.28.0_database_schema.svg)
]
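
---

# Database schema: a sketch

A schema can also be inspected as text. The sketch below assumes the `DBI` and `RSQLite` packages are installed and uses two hypothetical tables. The catalog table `sqlite_master` and `PRAGMA table_info` are specific to SQLite; other DBMS expose their schema differently.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Two tables linked by a key: a very small schema
dbExecute(con, "
  CREATE TABLE persons (
    nameid INTEGER PRIMARY KEY,
    name   TEXT NOT NULL
  )
")
dbExecute(con, "
  CREATE TABLE phones (
    phoneid INTEGER PRIMARY KEY,
    nameid  INTEGER REFERENCES persons (nameid),
    phone   TEXT
  )
")

# SQLite keeps the table definitions in a catalog table
schema <- dbGetQuery(
  con,
  "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
)

# Column-level view of one table (names, types, primary key flag)
phone_cols <- dbGetQuery(con, "PRAGMA table_info(phones)")

dbDisconnect(con)
```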

---

# Databases and Database Management Systems

.pull-left-wide[

### What are databases?

- A database is an organized collection of data.
- It is organized to afford efficient retrieval of (selections of) data.
- It contains data plus metadata about structure and organization.
- They are generally accessed through a database management system.

### Where are databases?

- Databases can exist locally or remotely, in-memory or on-disk.
- When they are stored locally, they are stored as binary files (not text files).
- Commonly, we think of a **client-server model**:
  - Databases live on a **server**, which manages them.
  - Users interact with the server through a **client** program.
  - This lets multiple users **access** the same database **simultaneously**.

]
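
---

# In-memory versus on-disk: a sketch

A small sketch of the local case, assuming the `DBI` and `RSQLite` packages are installed. SQLite is serverless, so this does not illustrate the client-server model; the table `people` and the temporary file path are made up for illustration.

```r
library(DBI)

# In-memory database: gone once the connection is closed
mem <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(mem, "people", data.frame(name = c("Peter", "Paul", "Mary")))
dbDisconnect(mem)

# On-disk database: a single file (here at a temporary path)
path <- tempfile(fileext = ".sqlite")
disk <- dbConnect(RSQLite::SQLite(), path)
dbWriteTable(disk, "people", data.frame(name = c("Peter", "Paul", "Mary")))
dbDisconnect(disk)

# The file is binary; SQLite files start with a fixed header string
header <- rawToChar(readBin(path, what = "raw", n = 15))

# Reconnecting to the file finds the table again
disk <- dbConnect(RSQLite::SQLite(), path)
tables <- dbListTables(disk)
dbDisconnect(disk)
```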

---
# Databases and Database Management Systems

.pull-left-wide[
### What are DBMS?

- Database Management Systems (DBMS) provide **efficient**, **reliable**, **convenient**, **safe**, **multi-user** storage of and access to **massive** data.
    - **Massive**: Think terabytes, not gigabytes. Handle data that resides outside memory.
    - **Safe**: Robust to power outages, node failures, etc.
    - **Multi-user**: Concurrency control. Not one user, but multiple.
    - **Convenient**: High-level query languages.
    - **Efficient**: Just fast.
    - **Reliable**: High uptime.
- There are [so many DBMS](https://en.wikipedia.org/wiki/List_of_relational_database_management_systems) for relational database structures alone.
- RDBMS differ in terms of capabilities, implemented features, operating system support, and much more. 
- You'll probably not be in a position to decide which database to use. If you're still interested in the differences, [this](https://www.digitalocean.com/community/tutorials/sqlite-vs-mysql-vs-postgresql-a-comparison-of-relational-database-management-systems) or [this overview](https://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems) might be a good starting point.
- Also, I've heard good things about [DuckDB](https://duckdb.org/).

]

.pull-right-small[

]

---
class: inverse, center, middle
name: joins

# Back to dplyr: joins

---

# Relational data in R

For the simple examples that I'm going to show here, we'll need some data sets that come bundled with the [**nycflights13**](http://github.com/hadley/nycflights13) package.

Let's load it now and then inspect these data frames in your own console.

```r
R> library(nycflights13)
```