You will create a GitHub repository with a report based on World Bank data
The report should be in Quarto, the data cleaning and descriptive statistics in SQL, and the data analysis and visualisation in Python
QTM 531 students will also have to create a website with the report, as well as to write additional sections in the report
Please let me know if you have any questions
Basic joins 📊
Primary and foreign keys
As with many languages, you can merge two tables in SQL either by columns or by rows
Each uses different methods, but the most common is the JOIN clause
The JOIN clause is used to combine rows and columns from two or more tables based on a related column between them
There are two types of keys, primary and foreign keys
The primary key is a column that uniquely identifies each row in a table
It must contain a unique value for each row of the table and cannot contain NULL values. One table can have only one primary key
A foreign key is a column that identifies a column in another table
It references the key of another table. One table can have multiple foreign keys, and they can be NULL
Load the libraries and connect to the database
Let’s load the libraries and connect to the database
import pandas as pdimport psycopg2from sqlalchemy import create_engine, text# Connect to the databaseengine = create_engine('postgresql+psycopg2://postgres:postgres@localhost:5432/postgres')connection = engine.connect()
Then let’s create two tables that we will use to test the JOIN clauses
# Create the tablesconnection.execute(text('''DROP TABLE IF EXISTS players;CREATE TABLE players ( player_id SERIAL PRIMARY KEY, player_name VARCHAR(50) NOT NULL, goals INT NOT NULL, victories INT NOT NULL);'''))connection.execute(text('''DROP TABLE IF EXISTS teams;CREATE TABLE teams ( team_id SERIAL PRIMARY KEY, team_name VARCHAR(50) NOT NULL);'''))connection.commit()
A cross join does not use any comparisons to determine whether the rows in each table match one another
Instead, results are constructed by simply adding each of the rows from the first table to each of the rows of the second table.
Useful for generating all possible combinations (e.g., creating a multiplication table)
Can be resource-intensive with large tables since result grows exponentially
Sometimes written as simple comma-separated tables in FROM clause: SELECT * FROM table1, table2
pd.read_sql('''SELECT players.player_name, teams.team_nameFROM playersCROSS JOIN teamsORDER BY players.player_id;''', connection)
player_name
team_name
0
Messi
Inter Miami
1
Messi
Real Madrid
2
Messi
Al Hilal
3
Messi
Real Madrid
4
Messi
Bayern
5
Vini Jr
Inter Miami
6
Vini Jr
Real Madrid
7
Vini Jr
Al Hilal
8
Vini Jr
Real Madrid
9
Vini Jr
Bayern
10
Neymar
Inter Miami
11
Neymar
Real Madrid
12
Neymar
Al Hilal
13
Neymar
Real Madrid
14
Neymar
Bayern
15
Mbappe
Inter Miami
16
Mbappe
Real Madrid
17
Mbappe
Al Hilal
18
Mbappe
Real Madrid
19
Mbappe
Bayern
20
Lewandowski
Inter Miami
21
Lewandowski
Real Madrid
22
Lewandowski
Al Hilal
23
Lewandowski
Real Madrid
24
Lewandowski
Bayern
25
Haaland
Inter Miami
26
Haaland
Real Madrid
27
Haaland
Al Hilal
28
Haaland
Real Madrid
29
Haaland
Bayern
Cross join
Sometimes it is useful to generate all possible combinations of two tables
# Drop and recreate tablesconnection.execute(text('''DROP TABLE IF EXISTS colors CASCADE;CREATE TABLE colors (color_name VARCHAR(20));DROP TABLE IF EXISTS sizes CASCADE;CREATE TABLE sizes (size_code VARCHAR(5));INSERT INTO colors VALUES ('Black'), ('Red');INSERT INTO sizes VALUES ('S'), ('M');'''))pd.read_sql(text('''SELECT colors.color_name, sizes.size_code, CONCAT(colors.color_name, ' - ', sizes.size_code) as t_shirtFROM colorsCROSS JOIN sizesORDER BY colors.color_name, sizes.size_code DESC;'''), connection)
color_name
size_code
t_shirt
0
Black
S
Black - S
1
Black
M
Black - M
2
Red
S
Red - S
3
Red
M
Red - M
Self join
A self join is a regular join, but the table is joined with itself (!) 🤯
It may not be immediately apparent how this could be useful, but it actually has many applications
Often, tables describe entities that can fulfill multiple roles in relationship to one another
For instance, if you have a table of people, each row could contain a mother column that reference other people in the table with the same person_id
A self join would allow you to stitch these different rows together by joining a second instance of the table to the first where these values match
Since self joins reference the same table twice, table aliases are required to disambiguate the references
You could join the two instances of the people table using the aliases people AS children and people AS mothers
That way, you can specify which instance of the table you are referring to when defining join conditions
The syntax is as follows: SELECT columns FROM table1 AS alias1 JOIN table1 AS alias2 ON alias1.column = alias2.column
Self join
Let see an example
connection.execute(text('''DROP TABLE IF EXISTS family CASCADE;CREATE TABLE family ( person_id SERIAL PRIMARY KEY, name VARCHAR(50), mother_id INT);INSERT INTO family (name, mother_id) VALUES ('Emma', NULL), -- grandmother ('Sarah', 1), -- Emma's daughter ('Lisa', 1), -- Emma's daughter ('Tom', 2), -- Sarah's son ('Alice', 2); -- Sarah's daughter'''))pd.read_sql('''SELECT children.name as child, mothers.name as motherFROM family childrenJOIN family mothers ON children.mother_id = mothers.person_idORDER BY mothers.name;''', connection)
child
mother
0
Sarah
Emma
1
Lisa
Emma
2
Tom
Sarah
3
Alice
Sarah
Self join
Let see another example
Here we want to calculate the difference in goals between players in the players table
pd.read_sql('''SELECT p1.player_name, p1.goals, p2.player_name as compared_to, p2.goals as their_goals, p1.goals - p2.goals as differenceFROM players p1JOIN players p2 ON p1.player_id < p2.player_idORDER BY difference DESC;''', connection)
player_name
goals
compared_to
their_goals
difference
0
Messi
10
Lewandowski
4
6
1
Messi
10
Mbappe
5
5
2
Messi
10
Haaland
5
5
3
Vini Jr
8
Lewandowski
4
4
4
Messi
10
Neymar
6
4
5
Vini Jr
8
Haaland
5
3
6
Vini Jr
8
Mbappe
5
3
7
Messi
10
Vini Jr
8
2
8
Vini Jr
8
Neymar
6
2
9
Neymar
6
Lewandowski
4
2
10
Neymar
6
Haaland
5
1
11
Mbappe
5
Lewandowski
4
1
12
Neymar
6
Mbappe
5
1
13
Mbappe
5
Haaland
5
0
14
Lewandowski
4
Haaland
5
-1
Try it yourself! 🧠
Write a SQL query that:
Uses a self join on the players table to compare players’ victories
Shows for each pair of players:
First player’s name and victories
Second player’s name and victories
The ratio of their victories (rounded to 2 decimal places)
Requirements:
Use CAST AS NUMERIC to handle decimal division (ROUND(CAST(p1.victories AS NUMERIC) / CAST(p2.victories AS NUMERIC), 2) as victories_ratio)
Avoid duplicate comparisons using player_id < player_id
Prevent division by zero by adding WHERE p2.victories > 0
You can also use the USING clause to specify the columns that will be used to join the tables
USING is a bit more concise than ON, as it only requires the column names
Let’s create a new column player_id in the teams table to demonstrate how USING works
# First add team_id to players if not existsconnection.execute(text('''ALTER TABLE teams ADD COLUMN IF NOT EXISTS player_id SERIAL;'''))# Show joined data using USING clausepd.read_sql('''SELECT player_name, team_name, goalsFROM playersJOIN teams USING (player_id);''', connection)
player_name
team_name
goals
0
Messi
Inter Miami
10
1
Vini Jr
Real Madrid
8
2
Neymar
Al Hilal
6
3
Mbappe
Real Madrid
5
4
Lewandowski
Bayern
4
Natural join
This is a cool feature of PostgreSQL!
A NATURAL JOIN does not specify any columns to be matched
Instead, PostgreSQL will automatically join the tables based on all columns that have matching columns in each database
The UNION operator is used to combine the result sets of two or more SELECT statements
It removes duplicate rows between the various SELECT statements
The columns in each SELECT statement must be in the same order, with the same data types
pd.read_sql('''SELECT player_name, goals FROM playersUNIONSELECT team_name, NULL FROM teamsORDER BY player_name;''', connection)
player_name
goals
0
Al Hilal
NaN
1
Bayern
NaN
2
Haaland
5.0
3
Inter Miami
NaN
4
Lewandowski
4.0
5
Mbappe
5.0
6
Messi
10.0
7
Neymar
6.0
8
Real Madrid
NaN
9
Vini Jr
8.0
Union all and intersect
Similar to UNION, UNION ALL also merges tables by rows
Unlike UNION, UNION ALL retains the duplicate records of both the tables
pd.read_sql('''INSERT INTO players (player_name, goals, victories) VALUES ('Real Madrid', 0, 0); SELECT player_name FROM playersUNION ALLSELECT team_name FROM teams;''', connection)
player_name
0
Messi
1
Vini Jr
2
Neymar
3
Mbappe
4
Lewandowski
5
Haaland
6
Real Madrid
7
Inter Miami
8
Real Madrid
9
Al Hilal
10
Real Madrid
11
Bayern
The INTERSECT operator is used to return the common rows between two queries
pd.read_sql('''INSERT INTO players (player_name, goals, victories) VALUES ('Real Madrid', 0, 0); SELECT player_name FROM playersINTERSECTSELECT team_name FROM teams;''', connection)
player_name
0
Real Madrid
Except
EXCEPT returns the records from the first table (left table) which are not present in the second table (right table)
pd.read_sql('''INSERT INTO players (player_name, goals, victories) VALUES ('Real Madrid', 0, 0); SELECT player_name FROM playersEXCEPTSELECT team_name FROM teams;''', connection)
player_name
0
Mbappe
1
Neymar
2
Lewandowski
3
Messi
4
Vini Jr
5
Haaland
Merge operator
Merge
PostgreSQL 15 introduced the MERGE operator, which allows you to perform an INSERT, UPDATE, or DELETE operation in a single statement
It is extremely flexible, easy to read, and efficient. More info here
It is better to see it with an example
connection.execute(text('''ALTER TABLE teams ADD COLUMN IF NOT EXISTS player_id SERIAL;ALTER TABLE teams ADD COLUMN IF NOT EXISTS team_player VARCHAR(100);MERGE INTO teams AS tUSING players AS pON t.player_id = p.player_idWHEN MATCHED THEN UPDATE SET team_player = (p.player_name || ' - ' || t.team_name)WHEN NOT MATCHED THEN INSERT (team_name) VALUES (p.player_name);'''))pd.read_sql('SELECT * FROM teams', connection)
team_id
team_name
player_id
team_player
0
1
Inter Miami
1
Messi - Inter Miami
1
2
Real Madrid
2
Vini Jr - Real Madrid
2
3
Al Hilal
3
Neymar - Al Hilal
3
4
Real Madrid
4
Mbappe - Real Madrid
4
5
Bayern
5
Lewandowski - Bayern
5
6
Haaland
6
None
6
7
Real Madrid
7
None
7
8
Real Madrid
8
None
8
9
Real Madrid
9
None
Conclusion 📖
Conclusion
Today we learned about different types of joins in SQL
We also learned about special joins, such as CROSS JOIN and SELF JOIN
We saw how to merge tables by row with UNION, INTERSECT, and EXCEPT
We also learned about join conditions with ON, USING, and NATURAL JOIN
And we saw the new MERGE operator in PostgreSQL 15+ 🚀
And that’s all for today! 🎉
Thank you and have a great rest of your day! 🙏
Appendix 01
Here is the solution to the exercise
print("INNER JOIN Results:")print(pd.read_sql(''' SELECT p.product_name, r.rating, r.comment FROM products p INNER JOIN reviews r ON p.product_id = r.product_id ORDER BY p.product_id;''', connection))
INNER JOIN Results:
product_name rating comment
0 Coffee Maker 5 Great coffee maker!
1 Coffee Maker 4 Good but expensive
2 Toaster 3 Average toaster
3 Blender 5 Best blender ever
print("\nLEFT JOIN Results:")print(pd.read_sql(''' SELECT p.product_name, r.rating, r.comment FROM products p LEFT JOIN reviews r ON p.product_id = r.product_id ORDER BY p.product_id;''', connection))
LEFT JOIN Results:
product_name rating comment
0 Coffee Maker 5.0 Great coffee maker!
1 Coffee Maker 4.0 Good but expensive
2 Toaster 3.0 Average toaster
3 Blender 5.0 Best blender ever
4 Microwave NaN None
5 Air Fryer NaN None