QTM 151 - Introduction to Statistical Computing II

Lecture 11 - Introduction to SQL

Danilo Freire

danilo.freire@emory.edu

Emory University

Hey, there! 😊
I hope you’re all doing well!

Today’s lecture 📚

Today’s agenda

SQL essentials

Introduction to SQL: We will start by learning the fundamentals of SQL, focusing on SQLite
Setup: You will learn how to:
- Install and load the SQLite3 Editor in VSCode
- Create a database
- Create a table
- Add and query rows
- View and modify the table schema
- Import and export data

Structured Query Language (SQL)

SQL: A relational database management system (RDBMS)

SQL stands for Structured Query Language and it is used to interact with relational databases
What is a relational database?
- A database that stores data in tables and allows you to define relationships between tables
- The most common type of database
- Examples: PostgreSQL, MySQL, SQLite, Oracle, SQL Server
Here we will focus on SQLite
Why? It is open-source, powerful, and widely used in industry: phones, browsers, TVs, operating systems, and even cars use it!
It has several advanced features too, such as full-text search and window functions

Let’s get started! 🤓

What is SQLite?

SQLite is a serverless database, which means it does not require a server to run
It is a single file that you can use to store your data
This has several advantages, such as:
- Easy to install and use
- Lightweight and fast
- Perfect for small projects
SQLite has some cons too, such as:
- Not being easy to scale
- Not allowing multiple users to write the database at the same time (although some workarounds exist)
- Poor handling of time-series data. So far, TimescaleDB for Postgres is the best option for this
- But SQLite also has lots of extensions to help you with many tasks (more on that later)

Some people argue that SQLite is not a real database because it does not have a server, or that storing the database in a single file is not suitable for large projects
Both arguments are false:
- You can use LiteFS to scale SQLite and allow multiple users to write the database at the same time
- SQLite is capable of handling databases that are an Exabyte in size (that’s one million Terabytes, or one billion Gigabytes 🤯)
- I’m sure that this is more than enough for pretty much any project you will ever work on 😅
And to be honest, the difference between different SQL databases is not that big
If you learn one of them, you can easily learn the others

Installing SQLite

macOS and Linux

If you use MacOS or Linux, you probably already have SQLite installed
To check, open a terminal and type sqlite3
- Then type .quit to exit the SQLite prompt!
If you see the SQLite prompt, you are good to go! 😊
In case you don’t have SQLite installed, you can download it from the SQLite website or using Homebrew (recommended)
To install Homebrew, paste the following command in your terminal

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Then install SQLite and verify the installation with the following commands

brew install sqlite3
sqlite3 --version

Installing SQLite

Windows

Windows users can download the SQLite precompiled binaries from the SQLite website. Download the files that have the sqlite3 DLL and the command-line shell (sqlite-dll... and sqlite-tools...)
Then follow these steps:
- Create a folder C:\>sqlite and unzip the files in this folder, which will give you sqlite3.def, sqlite3.dll and sqlite3.exe files
- Double click the sqlite3 file to open the software, and you will see the SQLite prompt

C:\>sqlite3
SQLite version 3.7.15.2 2013-01-09 11:53:05
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>

If you want to use SQLite in the command line, you can add the folder to your PATH variable
Search and click on Edit the system environment variables
Click on Environment Variables...
Set C:\sqlite to Path in both or either User variables and/or System variables, then you can use SQLite on terminal (command prompt)
More about this here

Installing SQLite3 Editor in VSCode

To use SQLite in VSCode, you need to install the SQLite3 Editor extension (click on the link to go to the extension page)
Or you can search for SQLite3 Editor in the extensions tab in VSCode
Click on Install to install the extension
As we will see later, this extension allows you to connect to your SQLite database and run SQL commands directly in VSCode, and even edit files without having to write any SQL queries (!)
You can use it to create tables, export data to CSV and JSON formats, run queries in the command line, and compare versions using git diff
It is a very useful extension and I highly recommend it! 😊
We will use this extensions today to interact with our SQLite database

Download it here or search for it in the extensions tab in VSCode

Any issues with the installation? 🤔

Let’s create our first SQL table! 🚀

Creating a database with SQLite3 Editor

It is very easy to create a SQLite database file using the extension
Simply create a file ending in .sqlite or .db and open it with the SQLite3 Editor
Here I will create a file called example.db and open it with the extension

touch example.db

Then just click on the file (or right-click and select Open to the Side – same thing)
You will see a new tab with the SQLite3 Editor

Creating a table with SQLite3 Editor

Now that you have opened the file with the extension, you will see the following screen (right)
You will see that there is no table in the database yet
Most of the time, the data come from other sources, such as APIs, JSON or CSV files
But in small projects, it is common to create a database to store the data
So let’s create one right now! 😉
We will create a table with some information about Formula 1 drivers
System commands are in upper case and user-defined names are in lower case

Creating a table with SQLite3 Editor

Now let’s create our table
We won’t write any code yet, just add the values to the editor
We will create four columns in the table: driver_id, driver_name, nationality, and victories:
- driver_id is an integer, not null, and primary key (select these options in the editor)
- driver_name is a text variable
- nationality is a text variable too
- victories is an integer
Then click on commit to save the changes

This is equivalent of the following SQL code (as you can see in the editor):

CREATE TABLE drivers (
    driver_id INTEGER NOT NULL PRIMARY KEY,
    driver_name TEXT,
    nationality TEXT,
    victories INTEGER
);

Creating a table with SQLite3 Editor

Database structure and data types in SQLite

Database structure

SQL works best with tabular data, so the main object in SQL is the table (rows and columns)
SQLite (and most SQL databases) are strongly typed, that is, each column has a specific data type
The five core data types in SQLite:
- TEXT: Stores string values of any length using UTF-8 encoding
- INTEGER: Holds whole numbers (positive or negative)
- NUMERIC: Stores decimal numbers as floating-point values
- BLOB: Contains binary data preserved as input
- NULL: Represents missing or undefined data
SQLite also has other data types, such as REAL, BOOLEAN, DATE, TIME, and DATETIME

Primary and foreign keys

The PRIMARY KEY constraint is used to uniquely identify each row in a table
It cannot contain NULL values and are usually integers
If you don’t want to enter the numbers manually, you can use the AUTOINCREMENT keyword
- Example: driver_id INTEGER PRIMARY KEY AUTOINCREMENT
The FOREIGN KEY constraint is used to link two tables together
- It is a field in one table that refers to the PRIMARY KEY in another table
We can also use ROWID to uniquely identify each row in a table, but it is not recommended
- It is a hidden column that is automatically created, but deleting and reinserting a row can change the ROWID, and it is not compatible with other SQL databases

Example of tables with primary and foreign keys

Imagine that we have three tables, customers, products, and orders
The customers table will have a primary key customer_id
The products table will have a primary key product_id
And the orders table will have a primary key order_id
The orders table will also have two foreign keys: customer_id and product_id
We will see more about this in the next classes

customers
customer_id (PK)	first_name	last_name	email
1	John	Doe	john.doe@email.com
2	Jane	Smith	jane.smith@email.com
3	Bob	Johnson	bob.j@email.com

products
product_id (PK)	product_name	price
101	Laptop	999.99
102	Smartphone	599.99
103	Wireless Mouse	29.99

orders
order_id (PK)	customer_id (FK)	product_id (FK)	quantity	total_price
1	1	101	1	999.99
2	2	102	2	1199.98
3	3	103	3	89.97

Add rows to table using the editor

Adding rows to the table using SQL commands

We can add rows to a table by using the INSERT INTO command
This command adds rows with VALUES to the bottom of the table
In SQL, different commands need to be separated by a semicolon (;)
You write system commands in capital letters to easily distinguish column names from variables and values
Let’s add some information about the drivers
Click on Query Editor and paste the following code:

INSERT INTO drivers VALUES (2, 'Michael Schumacher', 'German', 53);
INSERT INTO drivers VALUES (3, 'Sebastian Vettel', 'German', 91);
INSERT INTO drivers VALUES (4, 'Fernando Alonso', 'Spanish', 32);

Or more efficiently:

INSERT INTO drivers (driver_id, driver_name, nationality, victories) VALUES
(2, 'Michael Schumacher', 'German', 53),
(3, 'Sebastian Vettel', 'German', 91),
(4, 'Fernando Alonso', 'Spanish', 32);

Now click on ▶️️ Execute ️to run the code

Adding rows to the table

Querying the table

The editor already shows all the rows in the table, but we can also query the table using the SELECT command

SELECT * FROM drivers;

This command selects all columns (*) from the drivers table
SELECT is probably the most used command in SQL and is used to extract data from a database
If you are running SQL in the command line, you need to type SELECT to see the results

Exercise 01

Click on Query Editor and create a table called students with the following columns:
- student_id (integer not null primary key)
- student_name (text)
- major (text)
- gpa (numeric)
Add the following rows to the table:
student_id: 1, student_name: ‘John Doe’, major: ‘Computer Science’, gpa: 3.5
student_id: 2, student_name: ‘Jane Doe’, major: ‘Mathematics’, gpa: 3.8
Query the table to see the results
If the table already exists on your database, you can drop it with the following command:

DROP TABLE students;

Appendix 01

Table Schemas 📄

What is a table schema?

The table schema is the structure of a table in a database
It defines the columns, data types, constraints, and relationships between tables
It is defined when the table is created but can be modified later
The schema is important because it ensures that the data is stored correctly and efficiently
Usually, there is no need to change the schema frequently, but it is possible to do so if needed
You can view it with the PRAGMA table_info() command

The PRAGMA table_info() command returns one row for each column in the table
The columns are:
- cid: Column ID
- name: Column name
- type: Data type
- notnull: 1 if the column cannot contain NULL values, 0 otherwise
- dflt_value: Default value for the column
- pk: 1 if the column is part of the primary key, 0 otherwise

Creating and modifying table schemas

To create a table, use the CREATE TABLE command, as we saw before
It is a good idea to use the IF NOT EXISTS clause to avoid errors if the table already exists
- Example: CREATE TABLE IF NOT EXISTS table_name (column1_name data_type, ...);
Add a column: ALTER TABLE table_name ADD column_name data_type;
Drop a column: ALTER TABLE table_name DROP COLUMN column_name;
Rename a column: ALTER TABLE table_name RENAME COLUMN old_name TO new_name;

To delete a table, use the DROP TABLE command
- Example: DROP TABLE table_name;
- Be careful with this command, as it will delete all the data in the table!
To delete a row, use the DELETE FROM command
- Example: DELETE FROM table_name WHERE condition;
- This command deletes all rows that satisfy the condition
If you don’t specify a condition, it will delete all rows in the table

Some examples with the `students` table

Add a column: ALTER TABLE students ADD COLUMN age INTEGER;

Drop a column: ALTER TABLE students DROP COLUMN major;

Rename a column: ALTER TABLE students RENAME COLUMN gpa TO grade;

Delete a row with student_id = 2: DELETE FROM students WHERE student_id = 2;

Table Operations 📊

Extracting columns from table

SELECT extracts certain columns

The SELECT command is used to extract certain columns from a table

SELECT driver_id, nationality FROM drivers;

Subsetting rows

The WHERE command evaluates logical conditions to subset rows

SELECT * FROM drivers
WHERE nationality = 'German';

AND and OR conditions

OR conditions

Evaluate OR conditions

SELECT * FROM drivers
WHERE (nationality = 'Spanish') OR (nationality = 'German');

AND conditions

Evaluate AND conditions

SELECT * FROM drivers
WHERE (nationality = 'German') AND (driver_id = 3);

Data description

Aggregate

To compute aggregate statistics use a statistic function, such as SUM() or AVG()
The AS is an alias which assigns a name to the aggregate statistic

SELECT SUM(victories) AS sum_victories,
       COUNT(*) AS num_rows,
       AVG(victories) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers;

Sort

Use ORDER BY to sort the data
Order in ascending order of victories with ASC (DESC for descending)

SELECT driver_name, victories FROM drivers
ORDER BY victories ASC;

Exercise 02

List drivers who are either ‘German’ or ‘British’ and have victories greater than 50
List all drivers with victories greater than the average victories
- Hint: you will have to run another SELECT command to get the average victories (a subquery)
- Aggregate functions are not allowed in the WHERE clause. See the reason why here. More about subqueries here.
- We will see another way to do this soon (with HAVING)
Appendix 02

Groupby + Aggregate

Groupby

Use GROUP BY to group data by a column

SELECT nationality,
       SUM(victories) AS sum_victories,
       AVG(victories) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers
GROUP BY nationality;

Round

Use ROUND() to round the values
The first argument is the value to be rounded and the second is the number of decimal places
In this example, the results will be the same as the previous one

SELECT nationality,
       SUM(victories) AS sum_victories,
       ROUND(AVG(victories), 1) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers
GROUP BY nationality;

Filtering after aggregating with HAVING

Use HAVING to filter variables after aggregating
This differs from WHERE, which is used to filter variables before aggregating

SELECT nationality,
       SUM(victories) AS sum_victories,
       ROUND(AVG(victories), 1) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers
GROUP BY nationality
HAVING SUM(victories) > 50;

HAVING must be used after GROUP BY and before ORDER BY
Why? Because HAVING is used to filter the results of the GROUP BY clause
So please remember: WHERE is used before aggregating and HAVING is used after aggregating!

Python vs. SQL

Here are some equivalences between Python and SQL commands

SQL Command	Pandas Equivalent
`SELECT`	`df[['column1', 'column2']]`
`INSERT INTO`	`df.loc[new_index] = new_row` or `df = df.append(new_row, ignore_index=True)`
`WHERE`	`df[df['column'] == value]`
`GROUP BY`	`df.groupby('column')`
`HAVING`	`df.groupby('column').filter(lambda x: condition)`
`ROUND()`	`df['column'].round(decimals)`
Aggregate Functions	`df.groupby('column').agg({'col1': 'sum', 'col2': 'mean', 'col3': 'min', 'col4': 'max'})`

-- SQL
SELECT nationality,
       SUM(victories) AS sum_victories
FROM drivers
GROUP BY nationality
HAVING SUM(victories) > 10;

# Python
import pandas as pd

# Group by 'nationality' and calculate 'sum_victories' using agg()
result = (
    drivers.groupby('nationality')
    .agg(sum_victories=('victories', 'sum'))
    .reset_index()
)

# Apply 'HAVING' condition
result = result[result['sum_victories'] > 10]

Importing and exporting data

The SQLite3 Editor allows you to easily import and export data to many formats, including CSV and JSON
Just click on Other Tools and then Export / Import
If you want to export an XLXS file, you can choose between exporting one or all tables
Importing is just as easy: select the file and the table you want to import the data from
You can also do this with SQLite commands:
- Export to CSV: sqlite3 example.db .mode csv .output example.csv SELECT * FROM drivers;
- More about exporting data here
- Import from CSV: sqlite3 example.db .mode csv .import example.csv drivers;
- More about importing data here

Do you want to learn more? 🤓

Here is a list of resources to help you learn more about SQL:
- Overview of the SQLite3 Extension: Worth reading to understand all the features of the extension
- SQLean: SQLite extensions. MacOS users should install the sqlean.py Python package, as sqlite3 is compiled without extension support
- SQLBolt: Interactive SQL tutorials
- W3Schools: SQL tutorials
- Kaggle: SQL course
- Mode Analytics: SQL tutorial
- SQLZoo: Interactive SQL tutorials
- SQL for Data Science: DataCamp course
- SQL for Data Science: Coursera course
Or just come to our next class! 😊

And that’s it for today! 🚀

See you next time! 😊

Appendix 01

Here is the code to create the students table and add the rows
You can paste this code into the SQL editor and execute the commands

DROP TABLE students;

CREATE TABLE students(
    student_id INTEGER NOT NULL PRIMARY KEY, 
    student_name TEXT,
    major TEXT,
    gpa NUMERIC
);

INSERT INTO students VALUES 
(1, 'John Doe', 'Computer Science', 3.5),
(2, 'Jane Doe', 'Mathematics', 3.8);

SELECT * FROM students;

Back to exercise 01

Appendix 02

Here is the code to list drivers who are either ‘German’ or ‘British’ and have victories greater than 50

SELECT driver_name, nationality, victories
FROM drivers
WHERE (nationality = 'German' OR nationality = 'British') AND victories > 50;

SELECT driver_name, victories
FROM drivers
WHERE victories > (SELECT AVG(victories) FROM drivers);

Back to exercise 02

QTM 151 - Introduction to Statistical Computing II

Hey, there! 😊 I hope you’re all doing well!

Today’s lecture 📚

Today’s agenda

SQL essentials

Structured Query Language (SQL)

SQL: A relational database management system (RDBMS)

Let’s get started! 🤓

What is SQLite?

Installing SQLite

macOS and Linux

Installing SQLite

Windows

Installing SQLite3 Editor in VSCode

Any issues with the installation? 🤔

Let’s create our first SQL table! 🚀

Creating a database with SQLite3 Editor

Creating a table with SQLite3 Editor

Creating a table with SQLite3 Editor

Creating a table with SQLite3 Editor

Database structure and data types in SQLite

Database structure

Primary and foreign keys

Example of tables with primary and foreign keys

Add rows to table using the editor

Adding rows to the table using SQL commands

Adding rows to the table

Querying the table

Exercise 01

Table Schemas 📄

What is a table schema?

Creating and modifying table schemas

Some examples with the students table

Table Operations 📊

Extracting columns from table

SELECT extracts certain columns

Subsetting rows

AND and OR conditions

OR conditions

AND conditions

Data description

Aggregate

Sort

Exercise 02

Groupby + Aggregate

Groupby

Round

Filtering after aggregating with HAVING

Python vs. SQL

Importing and exporting data

Do you want to learn more? 🤓

And that’s it for today! 🚀

See you next time! 😊

Appendix 01

Appendix 02

Hey, there! 😊
I hope you’re all doing well!

Some examples with the `students` table