QTM 350 - Data Science Computing

Lecture 17 - Introduction to SQL

Danilo Freire

Emory University

Hey, there! 😊
I hope you’re all doing well!

Today’s lecture 📚

Today’s agenda

SQL essentials

  • Introduction to SQL: We will start by learning the fundamentals of SQL, focusing on SQLite
  • Setup: You will learn how to:
    • Install and load the SQLite3 Editor in VSCode
    • Create a database
    • Create a table
    • Add and query rows
    • View and modify the table schema
    • Import and export data

Structured Query Language (SQL)

SQL: A relational database management system (RDBMS)

  • SQL stands for Structured Query Language and it is used to interact with relational databases
  • What is a relational database?
    • A database that stores data in tables and allows you to define relationships between tables
    • The most common type of database
    • Examples: PostgreSQL, MySQL, SQLite, Oracle, SQL Server
  • Here we will focus on SQLite
  • Why? It is open-source, powerful, and widely used in industry: phones, browsers, TVs, operating systems, and even cars use it!
  • It has several advanced features too, such as full-text search and window functions

Let’s get started! 🤓

What is SQLite?

  • SQLite is a serverless database, which means it does not require a server to run
  • It is a single file that you can use to store your data
  • This has several advantages, such as:
    • Easy to install and use
    • Lightweight and fast
    • Perfect for small projects
  • SQLite has some cons too, such as:
  • Some people argue that SQLite is not a real database because it does not have a server, or that storing the database in a single file is not suitable for large projects
  • Both arguments are false:
    • You can use LiteFS to scale SQLite and allow multiple users to write the database at the same time
    • SQLite is capable of handling databases that are an Exabyte in size (that’s one million Terabytes, or one billion Gigabytes 🤯)
    • I’m sure that this is more than enough for pretty much any project you will ever work on 😅
  • And to be honest, the difference between different SQL databases is not that big
  • If you learn one of them, you can easily learn the others

Installing SQLite

macOS and Linux

  • If you use MacOS or Linux, you probably already have SQLite installed
  • To check, open a terminal and type sqlite3
    • Then type .quit to exit the SQLite prompt!
  • If you see the SQLite prompt, you are good to go! 😊
  • In case you don’t have SQLite installed, you can download it from the SQLite website or using Homebrew (recommended)
  • To install Homebrew, paste the following command in your terminal
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  • Then install SQLite and verify the installation with the following commands
brew install sqlite3
sqlite3 --version

Installing SQLite

Windows

  • Windows users can download the SQLite precompiled binaries from the SQLite website. Download the files that have the sqlite3 DLL and the command-line shell (sqlite-dll... and sqlite-tools...)
  • Then follow these steps:
    • Create a folder C:\>sqlite and unzip the files in this folder, which will give you sqlite3.def, sqlite3.dll and sqlite3.exe files
    • Double click the sqlite3 file to open the software, and you will see the SQLite prompt
C:\>sqlite3
SQLite version 3.7.15.2 2013-01-09 11:53:05
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>
  • If you want to use SQLite in the command line, you can add the folder to your PATH variable
  • Search and click on Edit the system environment variables
  • Click on Environment Variables...
  • Set C:\sqlite to Path in both or either User variables and/or System variables, then you can use SQLite on terminal (command prompt)
  • More about this here

Installing SQLite3 Editor in VSCode

  • To use SQLite in VSCode, you need to install the SQLite3 Editor extension (click on the link to go to the extension page)
  • Or you can search for SQLite3 Editor in the extensions tab in VSCode
  • Click on Install to install the extension
  • As we will see later, this extension allows you to connect to your SQLite database and run SQL commands directly in VSCode, and even edit files without having to write any SQL queries (!)
  • You can use it to create tables, export data to CSV and JSON formats, run queries in the command line, and compare versions using git diff
  • It is a very useful extension and I highly recommend it! 😊
  • We will use this extensions today to interact with our SQLite database

Download it here or search for it in the extensions tab in VSCode

Any issues with the installation? 🤔

Let’s create our first SQL table! 🚀

Creating a database with SQLite3 Editor

  • It is very easy to create a SQLite database file using the extension
  • Simply create a file ending in .sqlite or .db and open it with the SQLite3 Editor
  • Here I will create a file called example.db and open it with the extension
touch example.db
  • Then just click on the file (or right-click and select Open to the Side – same thing)
  • You will see a new tab with the SQLite3 Editor

Creating a table with SQLite3 Editor

  • Now that you have opened the file with the extension, you will see the following screen (right)
  • You will see that there is no table in the database yet
  • Most of the time, the data come from other sources, such as APIs, JSON or CSV files
  • But in small projects, it is common to create a database to store the data
  • So let’s create one right now! 😉
  • We will create a table with some information about Formula 1 drivers
  • System commands are in upper case and user-defined names are in lower case

Creating a table with SQLite3 Editor

  • Now let’s create our table
  • We won’t write any code yet, just add the values to the editor
  • We will create four columns in the table: driver_id, driver_name, nationality, and victories:
    • driver_id is an integer, not null, and primary key (select these options in the editor)
    • driver_name is a text variable
    • nationality is a text variable too
    • victories is an integer
  • Then click on commit to save the changes

  • This is equivalent of the following SQL code (as you can see in the editor):
CREATE TABLE drivers (
    driver_id INTEGER NOT NULL PRIMARY KEY,
    driver_name TEXT,
    nationality TEXT,
    victories INTEGER
);

Creating a table with SQLite3 Editor

Database structure and data types in SQLite

Database structure

  • SQL works best with tabular data, so the main object in SQL is the table (rows and columns)
  • SQLite (and most SQL databases) are strongly typed, that is, each column has a specific data type
  • The five core data types in SQLite:
    • TEXT: Stores string values of any length using UTF-8 encoding
    • INTEGER: Holds whole numbers (positive or negative)
    • NUMERIC: Stores decimal numbers as floating-point values
    • BLOB: Contains binary data preserved as input
    • NULL: Represents missing or undefined data
  • SQLite also has other data types, such as REAL, BOOLEAN, DATE, TIME, and DATETIME

Primary and foreign keys

  • The PRIMARY KEY constraint is used to uniquely identify each row in a table
  • It cannot contain NULL values and are usually integers
  • If you don’t want to enter the numbers manually, you can use the AUTOINCREMENT keyword
    • Example: driver_id INTEGER PRIMARY KEY AUTOINCREMENT
  • The FOREIGN KEY constraint is used to link two tables together
    • It is a field in one table that refers to the PRIMARY KEY in another table
  • We can also use ROWID to uniquely identify each row in a table, but it is not recommended
    • It is a hidden column that is automatically created, but deleting and reinserting a row can change the ROWID, and it is not compatible with other SQL databases

Example of tables with primary and foreign keys

  • Imagine that we have three tables, customers, products, and orders
  • The customers table will have a primary key customer_id
  • The products table will have a primary key product_id
  • And the orders table will have a primary key order_id
  • The orders table will also have two foreign keys: customer_id and product_id
  • We will see more about this in the next classes
customers
customer_id (PK) first_name last_name email
1 John Doe john.doe@email.com
2 Jane Smith jane.smith@email.com
3 Bob Johnson bob.j@email.com
products
product_id (PK) product_name price
101 Laptop 999.99
102 Smartphone 599.99
103 Wireless Mouse 29.99
orders
order_id (PK) customer_id (FK) product_id (FK) quantity total_price
1 1 101 1 999.99
2 2 102 2 1199.98
3 3 103 3 89.97

Add rows to table using the editor

Adding rows to the table using SQL commands

  • We can add rows to a table by using the INSERT INTO command
  • This command adds rows with VALUES to the bottom of the table
  • In SQL, different commands need to be separated by a semicolon (;)
  • You write system commands in capital letters to easily distinguish column names from variables and values
  • Let’s add some information about the drivers
  • Click on Query Editor and paste the following code:
INSERT INTO drivers VALUES (2, 'Michael Schumacher', 'German', 53);
INSERT INTO drivers VALUES (3, 'Sebastian Vettel', 'German', 91);
INSERT INTO drivers VALUES (4, 'Fernando Alonso', 'Spanish', 32);

Or more efficiently:

INSERT INTO drivers (driver_id, driver_name, nationality, victories) VALUES
(2, 'Michael Schumacher', 'German', 53),
(3, 'Sebastian Vettel', 'German', 91),
(4, 'Fernando Alonso', 'Spanish', 32);
  • Now click on ▶️️ Execute ️to run the code

Adding rows to the table

Querying the table

  • The editor already shows all the rows in the table, but we can also query the table using the SELECT command
SELECT * FROM drivers;
  • This command selects all columns (*) from the drivers table
  • SELECT is probably the most used command in SQL and is used to extract data from a database
  • If you are running SQL in the command line, you need to type SELECT to see the results

Exercise 01

  • Click on Query Editor and create a table called students with the following columns:

    • student_id (integer not null primary key)
    • student_name (text)
    • major (text)
    • gpa (numeric)
  • Add the following rows to the table:

  • student_id: 1, student_name: ‘John Doe’, major: ‘Computer Science’, gpa: 3.5

  • student_id: 2, student_name: ‘Jane Doe’, major: ‘Mathematics’, gpa: 3.8

  • Query the table to see the results

  • If the table already exists on your database, you can drop it with the following command:

DROP TABLE students;

Table Schemas 📄

What is a table schema?

  • The table schema is the structure of a table in a database
  • It defines the columns, data types, constraints, and relationships between tables
  • It is defined when the table is created but can be modified later
  • The schema is important because it ensures that the data is stored correctly and efficiently
  • Usually, there is no need to change the schema frequently, but it is possible to do so if needed
  • You can view it with the PRAGMA table_info() command
  • The PRAGMA table_info() command returns one row for each column in the table
  • The columns are:
    • cid: Column ID
    • name: Column name
    • type: Data type
    • notnull: 1 if the column cannot contain NULL values, 0 otherwise
    • dflt_value: Default value for the column
    • pk: 1 if the column is part of the primary key, 0 otherwise

Creating and modifying table schemas

  • To create a table, use the CREATE TABLE command, as we saw before
  • It is a good idea to use the IF NOT EXISTS clause to avoid errors if the table already exists
    • Example: CREATE TABLE IF NOT EXISTS table_name (column1_name data_type, ...);
  • Add a column: ALTER TABLE table_name ADD column_name data_type;
  • Drop a column: ALTER TABLE table_name DROP COLUMN column_name;
  • Rename a column: ALTER TABLE table_name RENAME COLUMN old_name TO new_name;
  • To delete a table, use the DROP TABLE command
    • Example: DROP TABLE table_name;
    • Be careful with this command, as it will delete all the data in the table!
  • To delete a row, use the DELETE FROM command
    • Example: DELETE FROM table_name WHERE condition;
    • This command deletes all rows that satisfy the condition
  • If you don’t specify a condition, it will delete all rows in the table

Some examples with the students table

  • Add a column: ALTER TABLE students ADD COLUMN age INTEGER;

  • Drop a column: ALTER TABLE students DROP COLUMN major;

  • Rename a column: ALTER TABLE students RENAME COLUMN gpa TO grade;

  • Delete a row with student_id = 2: DELETE FROM students WHERE student_id = 2;

Table Operations 📊

Extracting columns from table

SELECT extracts certain columns

  • The SELECT command is used to extract certain columns from a table
SELECT driver_id, nationality FROM drivers;

Subsetting rows

  • The WHERE command evaluates logical conditions to subset rows
SELECT * FROM drivers
WHERE nationality = 'German';

AND and OR conditions

OR conditions

  • Evaluate OR conditions
SELECT * FROM drivers
WHERE (nationality = 'Spanish') OR (nationality = 'German');

AND conditions

  • Evaluate AND conditions
SELECT * FROM drivers
WHERE (nationality = 'German') AND (driver_id = 3);

Data description

Aggregate

  • To compute aggregate statistics use a statistic function, such as SUM() or AVG()
  • The AS is an alias which assigns a name to the aggregate statistic
SELECT SUM(victories) AS sum_victories,
       COUNT(*) AS num_rows,
       AVG(victories) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers;

Sort

  • Use ORDER BY to sort the data
  • Order in ascending order of victories with ASC (DESC for descending)
SELECT driver_name, victories FROM drivers
ORDER BY victories ASC;

Exercise 02

  • List drivers who are either ‘German’ or ‘British’ and have victories greater than 50
  • List all drivers with victories greater than the average victories
    • Hint: you will have to run another SELECT command to get the average victories (a subquery)
    • Aggregate functions are not allowed in the WHERE clause. See the reason why here. More about subqueries here.
    • We will see another way to do this soon (with HAVING)
  • Appendix 02

Groupby + Aggregate

Groupby

  • Use GROUP BY to group data by a column
SELECT nationality,
       SUM(victories) AS sum_victories,
       AVG(victories) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers
GROUP BY nationality;

Round

  • Use ROUND() to round the values
  • The first argument is the value to be rounded and the second is the number of decimal places
  • In this example, the results will be the same as the previous one
SELECT nationality,
       SUM(victories) AS sum_victories,
       ROUND(AVG(victories), 1) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers
GROUP BY nationality;

Filtering after aggregating with HAVING

  • Use HAVING to filter variables after aggregating
  • This differs from WHERE, which is used to filter variables before aggregating
SELECT nationality,
       SUM(victories) AS sum_victories,
       ROUND(AVG(victories), 1) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers
GROUP BY nationality
HAVING SUM(victories) > 50;
  • HAVING must be used after GROUP BY and before ORDER BY
  • Why? Because HAVING is used to filter the results of the GROUP BY clause
  • So please remember: WHERE is used before aggregating and HAVING is used after aggregating!

Python vs. SQL

  • Here are some equivalences between Python and SQL commands
SQL Command Pandas Equivalent
SELECT df[['column1', 'column2']]
INSERT INTO df.loc[new_index] = new_row or df = df.append(new_row, ignore_index=True)
WHERE df[df['column'] == value]
GROUP BY df.groupby('column')
HAVING df.groupby('column').filter(lambda x: condition)
ROUND() df['column'].round(decimals)
Aggregate Functions df.groupby('column').agg({'col1': 'sum', 'col2': 'mean', 'col3': 'min', 'col4': 'max'})


-- SQL
SELECT nationality,
       SUM(victories) AS sum_victories
FROM drivers
GROUP BY nationality
HAVING SUM(victories) > 10;
# Python
import pandas as pd

# Group by 'nationality' and calculate 'sum_victories' using agg()
result = (
    drivers.groupby('nationality')
    .agg(sum_victories=('victories', 'sum'))
    .reset_index()
)

# Apply 'HAVING' condition
result = result[result['sum_victories'] > 10]

Importing and exporting data

  • The SQLite3 Editor allows you to easily import and export data to many formats, including CSV and JSON
  • Just click on Other Tools and then Export / Import
  • If you want to export an XLXS file, you can choose between exporting one or all tables
  • Importing is just as easy: select the file and the table you want to import the data from
  • You can also do this with SQLite commands:

Do you want to learn more? 🤓

And that’s it for today! 🚀

See you next time! 😊

Appendix 01

  • Here is the code to create the students table and add the rows
  • You can paste this code into the SQL editor and execute the commands
DROP TABLE students;

CREATE TABLE students(
    student_id INTEGER NOT NULL PRIMARY KEY, 
    student_name TEXT,
    major TEXT,
    gpa NUMERIC
);

INSERT INTO students VALUES 
(1, 'John Doe', 'Computer Science', 3.5),
(2, 'Jane Doe', 'Mathematics', 3.8);

SELECT * FROM students;

Appendix 02

  • Here is the code to list drivers who are either ‘German’ or ‘British’ and have victories greater than 50
SELECT driver_name, nationality, victories
FROM drivers
WHERE (nationality = 'German' OR nationality = 'British') AND victories > 50;

SELECT driver_name, victories
FROM drivers
WHERE victories > (SELECT AVG(victories) FROM drivers);