You will create a GitHub repository with a report based on World Bank data
The report should be in Quarto, the data cleaning and descriptive statistics in SQL, and the data analysis and visualisation in Python
QTM 531 students will have a different project
Thanks to those who have already sent me their groups!
Please let me know if you have any questions!
Basic joins 📊
Primary and foreign keys
As with many languages, you can merge two tables in SQL either by columns or by rows
The most common method is the JOIN clause
JOIN is used to combine rows and columns from two or more tables based on a related column between them
As you know, there are two types of keys, primary and foreign keys
The primary key is a column that uniquely identifies each row in a table
A foreign key is a column that identifies a column in another table
One table can have multiple foreign keys, and they can be NULL
SQLite supports foreign keys, and the Python sqlite3 module handles them correctly by default
Load the libraries and connect to the database
Let’s load the libraries and connect to the SQLite database. We’ll use a file named lecture19.db.
import pandas as pd;import sqlite3# Connect to the SQLite database (this will create the file if it doesn't exist)connection = sqlite3.connect('lecture19.db'); cursor = connection.cursor()cursor.execute('DROP TABLE IF EXISTS players;')cursor.execute('''CREATE TABLE players ( player_id INTEGER PRIMARY KEY AUTOINCREMENT, player_name TEXT NOT NULL UNIQUE, goals INT NOT NULL, victories INT NOT NULL);''')cursor.execute('DROP TABLE IF EXISTS teams;')cursor.execute('''CREATE TABLE teams ( team_id INTEGER PRIMARY KEY AUTOINCREMENT, team_name TEXT NOT NULL);''')connection.commit() # Commit changes
Create the tables
Now let’s insert some data into the tables!
# Insert data into the tablescursor.execute('''INSERT INTO players (player_name, goals, victories) VALUES('Messi', 10, 5),('Vini Jr', 8, 4),('Neymar', 6, 3),('Mbappe', 5, 2),('Lewandowski', 4, 1),('Haaland', 5, 3);''')cursor.execute('''INSERT INTO teams (team_name) VALUES('Inter Miami'),('Real Madrid'),('Santos'),('Real Madrid'),('Bayern');''')connection.commit() # Commit changes
Visualise the tables
Let’s see our tables using pandas.
read_sql works fine with the sqlite3 connection object.
pd.read_sql('SELECT * FROM players', connection)
player_id
player_name
goals
victories
0
1
Messi
10
5
1
2
Vini Jr
8
4
2
3
Neymar
6
3
3
4
Mbappe
5
2
4
5
Lewandowski
4
1
5
6
Haaland
5
3
pd.read_sql('SELECT * FROM teams', connection)
team_id
team_name
0
1
Inter Miami
1
2
Real Madrid
2
3
Santos
3
4
Real Madrid
4
5
Bayern
Types of joins
Inner join
The INNER JOIN returns only the records where there is a match between both tables (intersection) based on the join condition.
If there’s no match for a record in either table, that record will be excluded from the results.
The matching condition is specified in the ON clause (e.g., ON table1.id = table2.id).
Note that Haaland (player_id 6) is not in the teams table (which only has 5 rows), so he will not appear in the result set.
A cross join does not use any comparisons (like ON) to match rows.
Instead, the result is constructed by pairing every row from the first table with every row from the second table (Cartesian product).
Useful for generating all possible combinations (e.g., pairing all sizes with all colors).
# Displaying cross join between players and teamspd.read_sql('''SELECT players.player_name, teams.team_nameFROM playersCROSS JOIN teamsORDER BY players.player_id, teams.team_id;''', connection)
player_name
team_name
0
Messi
Inter Miami
1
Messi
Real Madrid
2
Messi
Santos
3
Messi
Real Madrid
4
Messi
Bayern
5
Vini Jr
Inter Miami
6
Vini Jr
Real Madrid
7
Vini Jr
Santos
8
Vini Jr
Real Madrid
9
Vini Jr
Bayern
10
Neymar
Inter Miami
11
Neymar
Real Madrid
12
Neymar
Santos
13
Neymar
Real Madrid
14
Neymar
Bayern
15
Mbappe
Inter Miami
16
Mbappe
Real Madrid
17
Mbappe
Santos
18
Mbappe
Real Madrid
19
Mbappe
Bayern
20
Lewandowski
Inter Miami
21
Lewandowski
Real Madrid
22
Lewandowski
Santos
23
Lewandowski
Real Madrid
24
Lewandowski
Bayern
25
Haaland
Inter Miami
26
Haaland
Real Madrid
27
Haaland
Santos
28
Haaland
Real Madrid
29
Haaland
Bayern
Cross join
Here’s another example generating T-shirt combinations. SQLite uses || for string concatenation, not CONCAT().
# Drop and recreate tablescursor.execute('DROP TABLE IF EXISTS colours;')cursor.execute('DROP TABLE IF EXISTS sizes;')cursor.execute('CREATE TABLE colours (colour_name TEXT);')cursor.execute('CREATE TABLE sizes (size_code TEXT);')cursor.execute("INSERT INTO colours VALUES ('Black'), ('Red');")cursor.execute("INSERT INTO sizes VALUES ('S'), ('M');")connection.commit()# Perform cross join and concatenate strings using ||pd.read_sql('''SELECT colours.colour_name, sizes.size_code, colours.colour_name || ' - ' || sizes.size_code as t_shirtFROM coloursCROSS JOIN sizesORDER BY colours.colour_name, sizes.size_code DESC;''', connection)
colour_name
size_code
t_shirt
0
Black
S
Black - S
1
Black
M
Black - M
2
Red
S
Red - S
3
Red
M
Red - M
Self join
A self join is a regular join, but the table is joined with itself (!) 🤯
It may not be immediately apparent how this could be useful, but it actually has many applications.
Often, tables describe entities that can have relationships with other entities of the same type within that table.
For instance, if you have a table of employees, each row could contain a manager_id column that references the employee_id of another employee in the same table.
A self join allows you to connect these related rows, for example, to list each employee alongside their manager’s name.
Since self joins reference the same table twice, table aliases are required to distinguish between the two instances of the table.
You could join the employees table like this: employees AS e JOIN employees AS m ON e.manager_id = m.employee_id.
This way, you can clearly specify which instance (e for employee, m for manager) you are referring to in the SELECT list and the ON condition.
The general syntax is: SELECT columns FROM table1 AS alias1 JOIN table1 AS alias2 ON alias1.column = alias2.column
Self join
Let’s see an example with a family table where mother_id refers back to person_id.
cursor.execute('DROP TABLE IF EXISTS family;')cursor.execute('''CREATE TABLE family ( person_id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, mother_id INT );''')cursor.execute('''INSERT INTO family (name, mother_id) VALUES ('Emma', NULL), -- grandmother (id 1) ('Sarah', 1), -- Emma's daughter (id 2) ('Lisa', 1), -- Emma's daughter (id 3) ('Tom', 2), -- Sarah's son (id 4) ('Alice', 2); -- Sarah's daughter (id 5)''')connection.commit()# Self join to find child-mother pairspd.read_sql('''SELECT children.name as child, mothers.name as motherFROM family AS childrenJOIN family AS mothers ON children.mother_id = mothers.person_idORDER BY mothers.name, children.name;''', connection)
child
mother
0
Lisa
Emma
1
Sarah
Emma
2
Alice
Sarah
3
Tom
Sarah
Self join
Let’s see another example using the players table.
Here we want to compare the goals of every player against every other player. We use p1.player_id < p2.player_id to avoid duplicate pairs (e.g., Messi vs Vini Jr. and Vini Jr. vs Messi) and comparing a player to themselves.
pd.read_sql('''SELECT p1.player_name, p1.goals, p2.player_name as compared_to, p2.goals as their_goals, p1.goals - p2.goals as differenceFROM players AS p1JOIN players AS p2ON p1.player_id < p2.player_id -- Avoid duplicates and self-comparisonORDER BY difference DESC;''', connection)
player_name
goals
compared_to
their_goals
difference
0
Messi
10
Lewandowski
4
6
1
Messi
10
Mbappe
5
5
2
Messi
10
Haaland
5
5
3
Messi
10
Neymar
6
4
4
Vini Jr
8
Lewandowski
4
4
5
Vini Jr
8
Mbappe
5
3
6
Vini Jr
8
Haaland
5
3
7
Messi
10
Vini Jr
8
2
8
Vini Jr
8
Neymar
6
2
9
Neymar
6
Lewandowski
4
2
10
Neymar
6
Mbappe
5
1
11
Neymar
6
Haaland
5
1
12
Mbappe
5
Lewandowski
4
1
13
Mbappe
5
Haaland
5
0
14
Lewandowski
4
Haaland
5
-1
Merge tables by row 🧩
Union
The UNION operator combines the result sets of two or more SELECT statements vertically (stacking rows).
It automatically removes duplicate rows from the combined results. If you want to keep duplicates, use UNION ALL.
The columns in each SELECT statement must be compatible: the same number of columns, and corresponding columns must have compatible data types.
Let’s find all players who have either scored more than 7 goals OR achieved more than 3 victories. UNION is suitable here because we are combining two subsets of the same entity type (players) based on different criteria, and we want a unique list.
# Select players with more than 5 goals OR more than 3 victoriespd.read_sql('''SELECT player_name, goals, victoriesFROM playersWHERE goals > 7 UNION -- Combines results and removes duplicatesSELECT player_name, goals, victoriesFROM playersWHERE victories > 3ORDER BY player_name;''', connection)
player_name
goals
victories
0
Messi
10
5
1
Vini Jr
8
4
Union all and intersect
Similar to UNION, UNION ALL also merges tables by rows (stacks results vertically).
Unlike UNION, UNION ALLretains all duplicate rows. It simply appends the results. It’s generally faster than UNION as it doesn’t need to check for duplicates.
# Combine players with > 7 goals and players with > 3 victories# UNION ALL keeps all rows, including duplicates if a player meets both criteria.print(pd.read_sql('''SELECT player_name, goals, victories, 'High Scorer (>7)' AS categoryFROM playersWHERE goals > 7UNION ALLSELECT player_name, goals, victories, 'Many Victories (>3)' AS categoryFROM playersWHERE victories > 3ORDER BY player_name, category;''', connection))
player_name goals victories category
0 Messi 10 5 High Scorer (>7)
1 Messi 10 5 Many Victories (>3)
2 Vini Jr 8 4 High Scorer (>7)
3 Vini Jr 8 4 Many Victories (>3)
Intersect
The INTERSECT operator returns only the rows that are common to the result sets of both SELECT statements. It finds the intersection.
Like UNION, it removes duplicates within the final result.
Let’s find players who are both high scorers (more than 7 goals) and have achieved many victories (more than 3).
# Find players who are BOTH high scorers (goals > 7) AND have many victories (victories > 3)print("INTERSECT Example: Players with goals > 7 AND victories > 3")print(pd.read_sql('''SELECT player_nameFROM playersWHERE goals > 9INTERSECTSELECT player_nameFROM playersWHERE victories > 3ORDER BY player_name;''', connection))
INTERSECT Example: Players with goals > 7 AND victories > 3
player_name
0 Messi
Except
EXCEPT returns the rows from the first SELECT statement’s result set that are not present in the second SELECT statement’s result set. It finds the difference.
Like UNION, it removes duplicates before returning the final result.
# Example for EXCEPT: Find players who scored more than 5 goals# but did NOT achieve more than 3 victories.print("EXCEPT Example: Players with goals > 5 but victories <= 3")print(pd.read_sql('''SELECT player_name FROM playersWHERE goals > 5EXCEPT SELECT player_nameFROM playersWHERE victories > 3ORDER BY player_name;''', connection))
EXCEPT Example: Players with goals > 5 but victories <= 3
player_name
0 Neymar
UPSERT (INSERT ... ON CONFLICT)
SQLite provides “UPSERT” (Update or Insert) operations using the INSERT ... ON CONFLICT clause.
This allows you to attempt an INSERT, and if it violates a constraint (like UNIQUE or PRIMARY KEY), you can specify an alternative action, typically a DO UPDATE.
You can specify different actions for different types of conflicts, such as IGNORE, REPLACE, or UPDATE.
Let’s see a simplified example: If we try to insert a player with an existing player_name, we update their goals instead.
You will notice that excluded is a special table that refers to the row that would have been inserted if there was no conflict. SQLite always uses this name.
player_data = ('Messi', 2, 1)sql_upsert =""" INSERT INTO players (player_name, goals, victories) VALUES (?, ?, ?) ON CONFLICT(player_name) DO UPDATE SET goals = goals + excluded.goals, victories = victories + excluded.victories;"""cursor.execute(sql_upsert, player_data)pd.read_sql('SELECT * FROM players', connection)
player_id
player_name
goals
victories
0
1
Messi
12
6
1
2
Vini Jr
8
4
2
3
Neymar
6
3
3
4
Mbappe
5
2
4
5
Lewandowski
4
1
5
6
Haaland
5
3
Views 🔎
Views
A VIEW is a virtual table based on the result of a SELECT query.
It does not store data itself but provides a way to simplify complex queries or encapsulate frequently used queries.
You can create a view using the CREATE VIEW statement, followed by the view name and the SELECT query.
In SQLite, you can create a view using the CREATE VIEW statement, and you can also drop it using DROP VIEW.
# Drop the view if it existscursor.execute('DROP VIEW IF EXISTS player_stats;')# Create the viewcursor.execute('''CREATE VIEW player_stats ASSELECT player_name, SUM(goals) AS total_goals, SUM(victories) AS total_victoriesFROM playersGROUP BY player_name;''')connection.commit()pd.read_sql('SELECT * FROM player_stats LIMIT 4', connection)
player_name
total_goals
total_victories
0
Haaland
5
3
1
Lewandowski
4
1
2
Mbappe
5
2
3
Messi
12
6
Views
You can also use the view in a join with another table. For example, let’s join the player_stats view with the teams table to see which players are in which teams.
# Drop the view if it already exists to avoid errors on re-runcursor.execute('DROP VIEW IF EXISTS colour_size;')# Create the view using cursor.execute()cursor.execute('''CREATE VIEW colour_size ASSELECT c.colour_name, s.size_code, c.colour_name || ' - ' || s.size_code as t_shirt -- Use || for concatenationFROM colours AS cCROSS JOIN sizes AS sORDER BY c.colour_name, s.size_code DESC;''')connection.commit() # Commit the view creation# Now showcase the view by querying it with pandaspd.read_sql('SELECT * FROM colour_size;', connection)
colour_name
size_code
t_shirt
0
Black
S
Black - S
1
Black
M
Black - M
2
Red
S
Red - S
3
Red
M
Red - M
Conclusion 📖
Conclusion
Today we learned about different types of joins in SQL (INNER, LEFT, RIGHT, FULL OUTER), noting potential version requirements in SQLite for RIGHT and FULL OUTER.
We also learned about special joins: CROSS JOIN for combinations and SELF JOIN for relating rows within the same table.
We saw how to merge tables vertically (by row) with UNION, UNION ALL, INTERSECT, and EXCEPT.
We also saw the SQLite alternative to MERGE, the INSERT ... ON CONFLICT (UPSERT) clause 🚀.
We saw how to create and use views in SQLite, which are virtual tables based on SELECT queries.
And that’s all for today! 🎉
Thank you and have a great rest of your day! 🙏
Appendix 01
Here is the solution to the exercise comparing INNER JOIN and LEFT JOIN for products and reviews.
print("INNER JOIN Results (Only products with reviews):")# INNER JOIN only includes products that have at least one review.# Products like 'Microwave' and 'Air Fryer' are excluded.print(pd.read_sql(''' SELECT p.product_name, r.rating, r.comment FROM products p INNER JOIN reviews r ON p.product_id = r.product_id ORDER BY p.product_id, r.review_id; -- Added review_id for consistent ordering''', connection))
INNER JOIN Results (Only products with reviews):
product_name rating comment
0 Coffee Maker 5 Great coffee maker!
1 Coffee Maker 4 Good but expensive
2 Toaster 3 Average toaster
3 Blender 5 Best blender ever
print("\nLEFT JOIN Results (All products, reviews where available):")print(pd.read_sql(''' SELECT p.product_name, r.rating, r.comment FROM products p LEFT JOIN reviews r ON p.product_id = r.product_id ORDER BY p.product_id, r.review_id;''', connection))
LEFT JOIN Results (All products, reviews where available):
product_name rating comment
0 Coffee Maker 5.0 Great coffee maker!
1 Coffee Maker 4.0 Good but expensive
2 Toaster 3.0 Average toaster
3 Blender 5.0 Best blender ever
4 Microwave NaN None
5 Air Fryer NaN None
Here is the solution to the self-join exercise comparing player victories.
Note the use of CAST(... AS REAL) or multiplying by 1.0 to ensure floating-point division in SQLite.
print("Self Join on Players to Compare Victories:")print(pd.read_sql('''SELECT p1.player_name, p1.victories, p2.player_name AS compared_to, p2.victories AS their_victories, -- Ensure floating point division by casting one operand to REAL or NUMERIC ROUND(CAST(p1.victories AS REAL) / p2.victories, 2) AS victories_ratioFROM players p1JOIN players p2 ON p1.player_id < p2.player_id -- Avoid duplicates and self-comparisonWHERE p2.victories > 0 -- Avoid division by zeroORDER BY p1.player_id, p2.player_id; -- Consistent ordering''', connection))
Unlike server-based databases like PostgreSQL, SQLite databases are typically single files.
“Cleaning” the database often means simply deleting the database file and starting fresh.
If you encounter issues, the easiest way to reset is to close any active connections to the database file and then remove the file using your operating system or Python’s os module.
Here’s how you can close the connection and delete the database file (lecture19.db) we used in this lecture using Python:
# Ensure the connection is closed firsttry: connection.close()print("SQLite connection closed.")exceptExceptionas e:print(f"Error closing connection (might be already closed): {e}")