DATASCI 350 - Data Science Computing

Lecture 16 - Introduction to SQL

Danilo Freire

Department of Data and Decision Sciences
Emory University

Hey, there! 😊
I hope you’re all doing well!

A quick reminder! 📢

Check your AWS account!

Billing and Cost Management Home

How to contact AWS support if you were charged by mistake

  • If you were charged by mistake, you can contact AWS support to request a refund
  • Go to the AWS Support Center
  • Click on Create case
  • Select Account and billing support
  • Fill in the form with your details and explain the situation
  • Submit the case and wait for a response from AWS support (it’s usually pretty quick!)
  • If you need help with this, feel free to reach out to me

AWS Support Center

Final project instructions

  • The instructions for the final project are now available on GitHub
  • Please find it here: https://github.com/danilofreire/datasci350/blob/main/project/project-instructions.pdf
  • The project is due on April 27, 2026
  • Groups of 4-5 students
  • Please send me a list with the names and emails of the group members by next Monday if possible
  • If you don’t have a group, no problem! We’ll randomly assign you to one next week
  • You will create a GitHub repository with a report based on World Bank data
  • The report should be in Quarto, the data cleaning and descriptive statistics in SQL, and the data analysis and visualisation in Python
  • Please let me know if you have any questions!

Today’s lecture 📚

Today’s agenda

SQL essentials

  • Introduction to SQL: We will start by learning the fundamentals of SQL, focusing on SQLite
  • Setup: You will learn how to:
    • Install and load the SQLite3 Editor in VS Code
    • Create a database
    • Create a table
    • Add and query rows
    • View and modify the table schema
    • Import and export data

Structured Query Language (SQL)

SQL: A relational database management system (RDBMS)

  • SQL stands for Structured Query Language and it is used to interact with relational databases
  • What is a relational database?
    • A database that stores data in tables and allows you to define relationships between tables
    • The most common type of database
    • Examples: PostgreSQL, MySQL, SQLite, Oracle, SQL Server
  • Here we will focus on SQLite
  • Why? It is open-source, powerful, and widely used in industry: phones, browsers, TVs, operating systems, LLM/RAGs, and even cars use it!
  • It has several advanced features too, such as full-text search and window functions

Let’s get started! 🤓

What is SQLite?

  • SQLite is a serverless database, so it does not require a server to run
  • It is a single file that you can use to store your data
  • This has several advantages, such as:
    • Easy to install and use
    • Lightweight and fast
    • Perfect for small projects
  • SQLite has some cons too, such as:
  • Some people argue that SQLite is not a real database because it does not have a server, or that storing the database in a single file is not suitable for large projects
  • Both arguments are false:
    • You can use LiteFS to scale SQLite and allow multiple users to write to the database at the same time
    • SQLite is capable of handling databases that are an Exabyte in size (that’s one million Terabytes, or one billion Gigabytes 🤯)
    • I’m sure that this is more than enough for pretty much any project you will ever work on 😅
  • And to be honest, the difference between different SQL databases is not that big
  • If you learn one of them, you can easily learn the others

Installing SQLite

macOS and Linux

  • If you use macOS or Linux, you probably already have SQLite installed
  • To check, open a terminal and type sqlite3
    • Then type .quit to exit the SQLite prompt
  • If you see the SQLite prompt, you are good to go! 😊
  • In case you don’t have SQLite installed, you can download it from the SQLite website or using Homebrew (recommended)
  • To install Homebrew, paste the following command in your terminal
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  • Then install SQLite and verify the installation with the following commands
brew install sqlite3
sqlite3 --version
  • On Ubuntu (WSL), you can install SQLite with:
sudo apt update
sudo apt-get install sqlite3

Installing SQLite

Windows

  • This is not necessary if you use WSL, but in case you want to install SQLite natively on Windows…
  • … you can download the SQLite precompiled binaries from the SQLite website. Download the files that have the sqlite3 DLL and the command-line shell (sqlite-dll... and sqlite-tools...)
  • Then follow these steps:
    • Create a folder C:\>sqlite and unzip the files in this folder, which will give you sqlite3.def, sqlite3.dll and sqlite3.exe files
    • Double-click the sqlite3 file to open the software, and you will see the SQLite prompt
C:\>sqlite3
SQLite version 3.7.15.2 2013-01-09 11:53:05
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>
  • If you want to use SQLite in the command line, you can add the folder to your PATH variable
  • Search and click on Edit the system environment variables
  • Click on Environment Variables...
  • Set C:\sqlite to Path in both or either User variables and/or System variables, then you can use SQLite on terminal (command prompt)
  • More about this here

Installing SQLite3 Editor in VS Code

  • To use SQLite in VS Code, you need to install the SQLite3 Editor extension (click on the link to go to the extension page)
  • Click on Install to install the extension
  • This extension allows you to connect to your SQLite database and run SQL commands directly in VS Code, and even edit files without having to write any SQL queries (!)
  • You can use it to create tables, export data to CSV and JSON formats, run queries in the command line, and compare versions using git diff
  • It is a very useful extension and I highly recommend it! 😊
  • We will use this extension today to interact with our SQLite database

Download it here or search for it in the extensions tab in VS Code

Any issues with the installation? 🤔

Let’s create our first SQL table! 🚀

Creating a database with SQLite3 Editor

  • It is very easy to create a SQLite database file using the extension
  • Simply create a file ending in .sqlite or .db and open it with the SQLite3 Editor
  • Here I will create a file called example.db and open it with the extension
touch example.db
  • Then just click on the file (or right-click and select Open to the Side – same thing)
  • You will see a new tab with the SQLite3 Editor

Creating a table with SQLite3 Editor

  • Now that you have opened the file with the extension, you will see the following screen (right)
  • You will see that there is no table in the database yet
  • Most of the time, the data come from other sources, such as APIs, JSON or CSV files
  • But in small projects, it is common to create a database to store the data
  • So let’s create one right now! 😉
  • We will create a table with some information about Formula 1 drivers
  • System commands are in upper case and user-defined names are in lower case

Creating a table with SQLite3 Editor

  • Now let’s create our table
  • We won’t write any code yet, just add the values to the editor
  • We will create four columns in the table: driver_id, driver_name, nationality, and victories:
    • driver_id is an integer, not null, and primary key (select these options in the editor)
    • driver_name is a text variable
    • nationality is a text variable too
    • victories is an integer
  • Then click on commit to save the changes

  • This is equivalent to the following SQL code (as you can see in the editor):
CREATE TABLE drivers (
    driver_id INTEGER NOT NULL PRIMARY KEY,
    driver_name TEXT,
    nationality TEXT,
    victories INTEGER
);

Creating a table with SQLite3 Editor

Database structure and data types in SQLite

Database structure

  • SQL works best with tabular data, so the main object in SQL is the table (rows and columns)
  • SQLite (and most SQL databases) are strongly typed, that is, each column has a specific data type
  • The five core data types in SQLite:
    • TEXT: Stores string values of any length using UTF-8 encoding
    • INTEGER: Holds whole numbers (positive or negative)
    • NUMERIC: Stores decimal numbers as floating-point values
    • BLOB: Contains binary data preserved as input
    • NULL: Represents missing or undefined data
  • SQLite also has other data types, such as REAL, BOOLEAN, DATE, TIME, and DATETIME

Primary and foreign keys

  • The PRIMARY KEY constraint is used to uniquely identify each row in a table
  • It cannot contain NULL values and is usually an integer
  • If you don’t want to enter the numbers manually, you can use the AUTOINCREMENT keyword
    • Example: driver_id INTEGER PRIMARY KEY AUTOINCREMENT
  • The FOREIGN KEY constraint is used to link two tables together
    • It is a field in one table that refers to the PRIMARY KEY in another table
  • We can also use ROWID to uniquely identify each row in a table, but it is not recommended
    • It is a hidden column that is automatically created, but deleting and reinserting a row can change the ROWID, and it is not compatible with other SQL databases

Example of tables with primary and foreign keys

  • Imagine that we have three tables, customers, products, and orders
  • The customers table will have a primary key customer_id
  • The products table will have a primary key product_id
  • And the orders table will have a primary key order_id
  • The orders table will also have two foreign keys: customer_id and product_id
  • We will see more about this in the next classes
customers
customer_id (PK) first_name last_name email
1 John Doe john.doe@email.com
2 Jane Smith jane.smith@email.com
3 Bob Johnson bob.j@email.com
products
product_id (PK) product_name price
101 Laptop 999.99
102 Smartphone 599.99
103 Wireless Mouse 29.99
orders
order_id (PK) customer_id (FK) product_id (FK) quantity total_price
1 1 101 1 999.99
2 2 102 2 1199.98
3 3 103 3 89.97

Add rows to table using the editor

Adding rows to the table using SQL commands

  • We can add rows to a table by using the INSERT INTO command
  • This command adds rows with VALUES to the bottom of the table
  • In SQL, different commands need to be separated by a semicolon (;)
  • You write system commands in capital letters to distinguish column names from variables and values
  • Let’s add some information about the drivers
  • Click on Query Editor and paste the following code:
INSERT INTO drivers VALUES (2, 'Michael Schumacher', 'German', 53);
INSERT INTO drivers VALUES (3, 'Sebastian Vettel', 'German', 91);
INSERT INTO drivers VALUES (4, 'Fernando Alonso', 'Spanish', 32);

Or more efficiently:

INSERT INTO drivers (driver_id, driver_name, nationality, victories) VALUES
(2, 'Michael Schumacher', 'German', 53),
(3, 'Sebastian Vettel', 'German', 91),
(4, 'Fernando Alonso', 'Spanish', 32);
  • Now click on ▶️️ Execute ️to run the code

Adding rows to the table

Querying the table

  • The editor already shows all the rows in the table, but we can also query the table using the SELECT command
SELECT * FROM drivers;
  • This command selects all columns (*) from the drivers table
  • SELECT is probably the most used command in SQL and is used to extract data from a database
  • If you are running SQL in the command line, you need to type SELECT to see the results

Exercise 01

  • Click on Query Editor and create a table called students with the following columns:

    • student_id (integer not null primary key)
    • student_name (text)
    • major (text)
    • gpa (numeric)
  • Add the following rows to the table:

  • student_id: 1, student_name: ‘John Doe’, major: ‘Computer Science’, gpa: 3.5

  • student_id: 2, student_name: ‘Jane Doe’, major: ‘Mathematics’, gpa: 3.8

  • Query the table to see the results

  • If the table already exists on your database, you can drop it with the following command:

DROP TABLE students;

Table Schemas 📄

What is a table schema?

  • The table schema is the structure of a table in a database
  • It defines the columns, data types, constraints, and relationships between tables
  • It is defined when the table is created but can be modified later
  • Usually, there is no need to change the schema frequently, but it is possible to do so if needed
  • You can view it with the PRAGMA table_info(table-name) command
  • The columns are:
    • cid: Column ID
    • name: Column name
    • type: Data type
    • notnull: 1 if the column cannot contain NULL values, 0 otherwise
    • dflt_value: Default value for the column
    • pk: 1 if the column is part of the primary key, 0 otherwise

Creating and modifying table schemas

  • To create a table, use the CREATE TABLE command, as we saw before
  • It is a good idea to use the IF NOT EXISTS clause to avoid errors if the table already exists
    • Example: CREATE TABLE IF NOT EXISTS table_name (column1_name data_type, ...);
  • Add a column: ALTER TABLE table_name ADD column_name data_type;
  • Drop a column: ALTER TABLE table_name DROP COLUMN column_name;
  • Rename a column: ALTER TABLE table_name RENAME COLUMN old_name TO new_name;
  • To delete a table, use the DROP TABLE command
    • Example: DROP TABLE table_name;
    • Be careful with this command, as it will delete all the data in the table!
  • To delete a row, use the DELETE FROM command
    • Example: DELETE FROM table_name WHERE condition;
    • This command deletes all rows that satisfy the condition
  • If you don’t specify a condition, it will delete all rows in the table (DELETE FROM table_name;)
  • To commit the changes, use the COMMIT; command

Some examples with the students table

  • Add a column: ALTER TABLE students ADD COLUMN age INTEGER;

  • Drop a column: ALTER TABLE students DROP COLUMN major;

  • Rename a column: ALTER TABLE students RENAME COLUMN gpa TO grade;

  • Delete a row with student_id = 2: DELETE FROM students WHERE student_id = 2;

Table Operations 📊

Extracting columns from table

SELECT extracts certain columns

  • The SELECT command is used to extract certain columns from a table
SELECT driver_id, nationality FROM drivers;

Subsetting rows

  • The WHERE command evaluates logical conditions to subset rows
SELECT * FROM drivers
WHERE nationality = 'German';

AND and OR conditions

OR conditions

  • Evaluate OR conditions
SELECT * FROM drivers
WHERE (nationality = 'Spanish') OR (nationality = 'German');

AND conditions

  • Evaluate AND conditions
SELECT * FROM drivers
WHERE (nationality = 'German') AND (driver_id = 3);

Data description

Aggregate

  • To compute aggregate statistics use a statistical function, such as SUM() or AVG()
  • The AS is an alias which assigns a name to the aggregate statistic
SELECT SUM(victories) AS sum_victories,
       COUNT(*) AS num_rows,
       AVG(victories) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers;

Sort

  • Use ORDER BY to sort the data
  • Order in ascending order of victories with ASC (DESC for descending)
SELECT driver_name, victories FROM drivers
ORDER BY victories ASC;

Exercise 02

  • List drivers who are either German or British and have victories greater than 50
  • List all drivers with victories greater than the average victories
    • Hint: you will have to run another SELECT command to get the average victories (a subquery)
    • Aggregate functions are not allowed in the WHERE clause. See the reason why here. More about subqueries here.
    • We will see another way to do this soon (with HAVING)
  • Appendix 02

Groupby + Aggregate

Groupby

  • Use GROUP BY to group data by a column
SELECT nationality,
       SUM(victories) AS sum_victories,
       AVG(victories) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers
GROUP BY nationality;

Round

  • Use ROUND() to round the values
  • The first argument is the value to be rounded and the second is the number of decimal places
  • In this example, the results will be the same as the previous one
SELECT nationality,
       SUM(victories) AS sum_victories,
       ROUND(AVG(victories), 1) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers
GROUP BY nationality;

Filtering after aggregating with HAVING

  • Use HAVING to filter variables after aggregating
  • This differs from WHERE, which is used to filter variables before aggregating
SELECT nationality,
       SUM(victories) AS sum_victories,
       ROUND(AVG(victories), 1) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers
GROUP BY nationality
HAVING SUM(victories) > 50;
  • HAVING must be used after GROUP BY and before ORDER BY
  • Why? Because HAVING is used to filter the results of the GROUP BY clause
  • So please remember: WHERE is used before aggregating and HAVING is used after aggregating!

Python vs. SQL

  • Some equivalences between Python and SQL commands
SQL Command Pandas Equivalent
SELECT df[['column1', 'column2']]
INSERT INTO df.loc[new_index] = new_row or df = df.append(new_row, ignore_index=True)
WHERE df[df['column'] == value]
GROUP BY df.groupby('column')
HAVING df.groupby('column').filter(lambda x: condition)
ROUND() df['column'].round(decimals)
Aggregate Functions df.groupby('column').agg({'col1': 'sum', 'col2': 'mean', 'col3': 'min', 'col4': 'max'})


-- SQL
SELECT nationality,
       SUM(victories) AS sum_victories
FROM drivers
GROUP BY nationality
HAVING SUM(victories) > 10;
# Python
import pandas as pd

# Group by 'nationality' and calculate 'sum_victories' using agg()
result = (
    drivers.groupby('nationality')
    .agg(sum_victories=('victories', 'sum'))
    .reset_index()
)

# Apply 'HAVING' condition
result = result[result['sum_victories'] > 10]

Importing and exporting data

  • The SQLite3 Editor allows you to easily import and export data to many formats, including CSV and JSON
  • Just click on Other Tools and then Export / Import
  • If you want to export an XLSX file, you can choose between exporting one or all tables
  • Importing is just as easy: select the file and the table you want to import the data into
  • You can also do this with SQLite commands:

Do you want to learn more? 🤓

And that’s it for today! 🚀

See you next time! 😊

Appendix 01

  • Here is the code to create the students table and add the rows
  • You can paste this code into the SQL editor and execute the commands
DROP TABLE students;

CREATE TABLE students(
    student_id INTEGER NOT NULL PRIMARY KEY, 
    student_name TEXT,
    major TEXT,
    gpa NUMERIC
);

INSERT INTO students VALUES 
(1, 'John Doe', 'Computer Science', 3.5),
(2, 'Jane Doe', 'Mathematics', 3.8);

SELECT * FROM students;

Appendix 02

  • Here is the code to list drivers who are either German or British and have victories greater than 50
SELECT driver_name, nationality, victories
FROM drivers
WHERE (nationality = 'German' OR nationality = 'British') AND victories > 50;

SELECT driver_name, victories
FROM drivers
WHERE victories > (SELECT AVG(victories) FROM drivers);