QTM 350 - Data Science Computing

Lecture 19 - Tables and Dates in SQL

Danilo Freire

Emory University

06 October, 2024

Hello, my friends! 😊

Brief recap 📚

Recap of last class and today’s plan

Last time we learned how to:

  • Connect SQL with Python with psychopg2 and sqlalchemy
  • Use many SQL commands, such as CASE WHEN, window functions, and string functions
  • Use pandas to write SQL queries
  • Fill missing data with COALESCE
  • Pivot tables in PostgreSQL

Today we will learn how to:

  • See different types of join in SQL
  • Use special joins, such as CROSS JOIN and SELF JOIN
  • Merge tables by row with UNION, INTERSECT, and EXCEPT
  • Use join conditions with ON, USING, and NATURAL JOIN
  • Solve exercises to practice what we learned
  • Let’s get started! 🚀

An announcement 📢

Another announcement 📢

Final project instructions

  • The instructions for the final project is now available on GitHub
  • Please find it here: https://github.com/danilofreire/qtm350/blob/main/project/project-instructions.pdf
  • The project is due on December 9th, 2024
  • Groups of 3-4 students
  • You will create a GitHub repository with a report based on World Bank data
  • The report should be in Quarto, the data cleaning and descriptive statistics in SQL, and the data analysis and visualisation in Python
  • QTM 531 students will also have to create a website with the report, as well as to write additional sections in the report
  • Please let me know if you have any questions

Basic joins 📊

Primary and foreign keys

  • As with many languages, you can merge two tables in SQL either by columns or by rows
  • Each uses different methods, but the most common is the JOIN clause
  • The JOIN clause is used to combine rows and columns from two or more tables based on a related column between them
  • There are two types of keys, primary and foreign keys
  • The primary key is a column that uniquely identifies each row in a table
    • It must contain a unique value for each row of the table and cannot contain NULL values. One table can have only one primary key
  • A foreign key is a column that identifies a column in another table
    • It references the key of another table. One table can have multiple foreign keys, and they can be NULL

Load the libraries and connect to the database

  • Let’s load the libraries and connect to the database
import pandas as pd
import psycopg2
from sqlalchemy import create_engine, text

# Connect to the database
engine = create_engine('postgresql+psycopg2://postgres:postgres@localhost:5432/postgres')
connection = engine.connect()
  • Then let’s create two tables that we will use to test the JOIN clauses
# Create the tables
connection.execute(text('''
DROP TABLE IF EXISTS players;
CREATE TABLE players (
    player_id SERIAL PRIMARY KEY,
    player_name VARCHAR(50) NOT NULL,
    goals INT NOT NULL,
    victories INT NOT NULL
);
'''))

connection.execute(text('''
DROP TABLE IF EXISTS teams;
CREATE TABLE teams (
    team_id SERIAL PRIMARY KEY,
    team_name VARCHAR(50) NOT NULL
);
'''))
connection.commit()

Create the tables

# Insert data into the tables
connection.execute(text('''
INSERT INTO players (player_name, goals, victories) VALUES
('Messi', 10, 5),
('Vini Jr', 8, 4),
('Neymar', 6, 3),
('Mbappe', 5, 2),
('Lewandowski', 4, 1),
('Haaland', 5, 3);
'''))

connection.execute(text('''
INSERT INTO teams (team_name) VALUES
('Inter Miami'),
('Real Madrid'),
('Al Hilal'),
('Real Madrid'),
('Bayern');
'''))
connection.commit()

Visualise the tables

  • Let’s see our tables
pd.read_sql('SELECT * FROM players', connection)
player_id player_name goals victories
0 1 Messi 10 5
1 2 Vini Jr 8 4
2 3 Neymar 6 3
3 4 Mbappe 5 2
4 5 Lewandowski 4 1
5 6 Haaland 5 3
pd.read_sql('SELECT * FROM teams', connection)
team_id team_name
0 1 Inter Miami
1 2 Real Madrid
2 3 Al Hilal
3 4 Real Madrid
4 5 Bayern

Types of joins

Inner join

  • The INNER JOIN returns only the records where there is a match between both tables (intersection)
  • If there’s no match for a record in either table, that record will be excluded from the results
  • Each match creates a new row in the result set that combines columns from both tables
  • The matching condition is specified in the ON clause (e.g., ON table1.id = table2.id)
  • It is commonly used to combine related data, like the ones we have in the players and teams tables
  • The syntax is as follows:
  • SELECT columns FROM table1 INNER JOIN table2 ON table1.column = table2.column
  • Note that Haaland is not in the teams table, so he will not appear in the result set
pd.read_sql('''
SELECT players.player_name, teams.team_name, players.goals, players.victories
FROM players
INNER JOIN teams 
ON players.player_id = teams.team_id;
''', connection)
player_name team_name goals victories
0 Messi Inter Miami 10 5
1 Vini Jr Real Madrid 8 4
2 Neymar Al Hilal 6 3
3 Mbappe Real Madrid 5 2
4 Lewandowski Bayern 4 1

Left join

  • The LEFT JOIN returns all records from the left table (first table) and the matched records from the right table (second table)
  • The result is NULL from the right side if there is no match
  • This is perhaps the most common type of join, as it keeps the data from the table we are usually interested in
  • The syntax is as follows (pretty much the same as INNER JOIN):
  • SELECT columns FROM table1 LEFT JOIN table2 ON table1.column = table2.column
  • Note that Haaland is here because he is in the players table, even though he is not in the teams table
pd.read_sql('''
SELECT players.player_name, teams.team_name, players.goals
FROM players
LEFT JOIN teams
ON players.player_id = teams.team_id;
''', connection)
player_name team_name goals
0 Messi Inter Miami 10
1 Vini Jr Real Madrid 8
2 Neymar Al Hilal 6
3 Mbappe Real Madrid 5
4 Lewandowski Bayern 4
5 Haaland None 5

Right join

  • The RIGHT JOIN returns all records from the right table (second table) and the matched records from the left table (first table)
  • The result is NULL from the left side if there is no match
  • As you have probably guessed, this is the opposite of the LEFT JOIN (and less common)
  • The syntax is as follows (you have probably guessed it by now 😄):
  • SELECT columns FROM table1 RIGHT JOIN table2 ON table1.column = table2.column
pd.read_sql('''
SELECT players.player_name, teams.team_name, players.goals
FROM players
RIGHT JOIN teams
ON players.player_id = teams.team_id;
''', connection)
player_name team_name goals
0 Messi Inter Miami 10
1 Vini Jr Real Madrid 8
2 Neymar Al Hilal 6
3 Mbappe Real Madrid 5
4 Lewandowski Bayern 4

Full outer join

  • The FULL OUTER JOIN returns all records when there is a match in either left (first) or right (second) table
  • It returns NULL values when there is no match
  • In my experience, this is the least common type of join
  • Why? Because it returns all the data from both tables, which can be a lot and not very useful
  • It also overloads the database, as it has to compare all the records from both tables
pd.read_sql('''
SELECT players.player_name, teams.team_name, players.goals
FROM players
FULL OUTER JOIN teams
ON players.player_id = teams.team_id;
''', connection)
player_name team_name goals
0 Messi Inter Miami 10
1 Vini Jr Real Madrid 8
2 Neymar Al Hilal 6
3 Mbappe Real Madrid 5
4 Lewandowski Bayern 4
5 Haaland None 5

Try it yourself! 🧠

  • Let’s create two new tables and insert some data into them
# Create the tables and insert data
connection.execute(text('''
DROP TABLE IF EXISTS products CASCADE;
CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    product_name VARCHAR(100) NOT NULL,
    price DECIMAL(10,2) -- 10 digits, 2 decimal places for money
);

-- Insert products (notice some won't have reviews)
INSERT INTO products (product_name, price) VALUES
    ('Coffee Maker', 99.99),
    ('Toaster', 29.99),
    ('Blender', 79.99),
    ('Microwave', 149.99),
    ('Air Fryer', 89.99);

DROP TABLE IF EXISTS reviews CASCADE;
CREATE TABLE reviews (
    review_id SERIAL PRIMARY KEY,
    product_id INT,
    rating INT CHECK (rating BETWEEN 1 AND 5),
    comment TEXT,
    FOREIGN KEY (product_id) REFERENCES products(product_id)
);

-- Insert reviews (only for some products)
INSERT INTO reviews (product_id, rating, comment) VALUES
    (1, 5, 'Great coffee maker!'),
    (1, 4, 'Good but expensive'),
    (2, 3, 'Average toaster'),
    (3, 5, 'Best blender ever');
'''))
connection.commit()

Try it yourself! 🧠

  • Now try to merge the directors and movies tables using INNER JOIN and LEFT JOIN
  • Explain the differences between the two results
  • Appendix 01

Special joins 🌟

Cross join

  • CROSS JOIN is also available in SQL
  • A cross join does not use any comparisons to determine whether the rows in each table match one another
  • Instead, results are constructed by simply adding each of the rows from the first table to each of the rows of the second table.
  • Useful for generating all possible combinations (e.g., creating a multiplication table)
  • Can be resource-intensive with large tables since result grows exponentially
  • Sometimes written as simple comma-separated tables in FROM clause: SELECT * FROM table1, table2
pd.read_sql('''
SELECT players.player_name, teams.team_name
FROM players
CROSS JOIN teams
ORDER BY players.player_id;
''', connection)
player_name team_name
0 Messi Inter Miami
1 Messi Real Madrid
2 Messi Al Hilal
3 Messi Real Madrid
4 Messi Bayern
5 Vini Jr Inter Miami
6 Vini Jr Real Madrid
7 Vini Jr Al Hilal
8 Vini Jr Real Madrid
9 Vini Jr Bayern
10 Neymar Inter Miami
11 Neymar Real Madrid
12 Neymar Al Hilal
13 Neymar Real Madrid
14 Neymar Bayern
15 Mbappe Inter Miami
16 Mbappe Real Madrid
17 Mbappe Al Hilal
18 Mbappe Real Madrid
19 Mbappe Bayern
20 Lewandowski Inter Miami
21 Lewandowski Real Madrid
22 Lewandowski Al Hilal
23 Lewandowski Real Madrid
24 Lewandowski Bayern
25 Haaland Inter Miami
26 Haaland Real Madrid
27 Haaland Al Hilal
28 Haaland Real Madrid
29 Haaland Bayern

Cross join

  • Sometimes it is useful to generate all possible combinations of two tables
# Drop and recreate tables
connection.execute(text('''
DROP TABLE IF EXISTS colors CASCADE;
CREATE TABLE colors (color_name VARCHAR(20));
DROP TABLE IF EXISTS sizes CASCADE;
CREATE TABLE sizes (size_code VARCHAR(5));
INSERT INTO colors VALUES ('Black'), ('Red');
INSERT INTO sizes VALUES ('S'), ('M');
'''))

pd.read_sql(text('''
SELECT 
    colors.color_name, 
    sizes.size_code,
    CONCAT(colors.color_name, ' - ', sizes.size_code) as t_shirt
FROM colors
CROSS JOIN sizes
ORDER BY colors.color_name, sizes.size_code DESC;
'''), connection)
color_name size_code t_shirt
0 Black S Black - S
1 Black M Black - M
2 Red S Red - S
3 Red M Red - M

Self join

  • A self join is a regular join, but the table is joined with itself (!) 🤯

  • It may not be immediately apparent how this could be useful, but it actually has many applications

  • Often, tables describe entities that can fulfill multiple roles in relationship to one another

  • For instance, if you have a table of people, each row could contain a mother column that reference other people in the table with the same person_id

  • A self join would allow you to stitch these different rows together by joining a second instance of the table to the first where these values match

  • Since self joins reference the same table twice, table aliases are required to disambiguate the references

  • You could join the two instances of the people table using the aliases people AS children and people AS mothers

  • That way, you can specify which instance of the table you are referring to when defining join conditions

  • The syntax is as follows: SELECT columns FROM table1 AS alias1 JOIN table1 AS alias2 ON alias1.column = alias2.column

Self join

  • Let see an example
connection.execute(text('''
DROP TABLE IF EXISTS family CASCADE;
CREATE TABLE family (
    person_id SERIAL PRIMARY KEY,
    name VARCHAR(50),
    mother_id INT
);

INSERT INTO family (name, mother_id) VALUES
    ('Emma', NULL), -- grandmother
    ('Sarah', 1),   -- Emma's daughter
    ('Lisa', 1),    -- Emma's daughter
    ('Tom', 2),     -- Sarah's son
    ('Alice', 2);   -- Sarah's daughter
'''))

pd.read_sql('''
SELECT 
    children.name as child,
    mothers.name as mother
FROM family children
JOIN family mothers ON children.mother_id = mothers.person_id
ORDER BY mothers.name;
''', connection)
child mother
0 Sarah Emma
1 Lisa Emma
2 Tom Sarah
3 Alice Sarah

Self join

  • Let see another example
  • Here we want to calculate the difference in goals between players in the players table
pd.read_sql('''
SELECT 
    p1.player_name,
    p1.goals,
    p2.player_name as compared_to,
    p2.goals as their_goals,
    p1.goals - p2.goals as difference
FROM players p1
JOIN players p2 
ON p1.player_id < p2.player_id
ORDER BY difference DESC;
''', connection)
player_name goals compared_to their_goals difference
0 Messi 10 Lewandowski 4 6
1 Messi 10 Mbappe 5 5
2 Messi 10 Haaland 5 5
3 Vini Jr 8 Lewandowski 4 4
4 Messi 10 Neymar 6 4
5 Vini Jr 8 Haaland 5 3
6 Vini Jr 8 Mbappe 5 3
7 Messi 10 Vini Jr 8 2
8 Vini Jr 8 Neymar 6 2
9 Neymar 6 Lewandowski 4 2
10 Neymar 6 Haaland 5 1
11 Mbappe 5 Lewandowski 4 1
12 Neymar 6 Mbappe 5 1
13 Mbappe 5 Haaland 5 0
14 Lewandowski 4 Haaland 5 -1

Try it yourself! 🧠

  • Write a SQL query that:
    • Uses a self join on the players table to compare players’ victories
  • Shows for each pair of players:
    • First player’s name and victories
    • Second player’s name and victories
    • The ratio of their victories (rounded to 2 decimal places)
  • Requirements:
  • Use CAST AS NUMERIC to handle decimal division (ROUND(CAST(p1.victories AS NUMERIC) / CAST(p2.victories AS NUMERIC), 2) as victories_ratio)
  • Avoid duplicate comparisons using player_id < player_id
  • Prevent division by zero by adding WHERE p2.victories > 0
  • Order results by first player’s ID
  • Appendix 02

Join conditions 🧩

Join conditions

  • The most standard way of defining the conditions for table joins is with the ON clause
  • ON uses an equals sign to specify the exact columns from each table that will be compared
  • PostgreSQL uses the provided columns to stitch together the rows from each table
  • ON is verbose, but also very flexible, as it allows you to specify any condition you want
pd.read_sql('''
SELECT players.player_name, teams.team_name, players.goals
FROM players
JOIN teams
ON players.player_id = teams.team_id;
''', connection)
player_name team_name goals
0 Messi Inter Miami 10
1 Vini Jr Real Madrid 8
2 Neymar Al Hilal 6
3 Mbappe Real Madrid 5
4 Lewandowski Bayern 4
  • You can also use the USING clause to specify the columns that will be used to join the tables
  • USING is a bit more concise than ON, as it only requires the column names
  • Let’s create a new column player_id in the teams table to demonstrate how USING works
# First add team_id to players if not exists
connection.execute(text('''
ALTER TABLE teams ADD COLUMN IF NOT EXISTS player_id SERIAL;
'''))

# Show joined data using USING clause
pd.read_sql('''
SELECT player_name, team_name, goals
FROM players
JOIN teams USING (player_id);
''', connection)
player_name team_name goals
0 Messi Inter Miami 10
1 Vini Jr Real Madrid 8
2 Neymar Al Hilal 6
3 Mbappe Real Madrid 5
4 Lewandowski Bayern 4

Natural join

  • This is a cool feature of PostgreSQL!
  • A NATURAL JOIN does not specify any columns to be matched
  • Instead, PostgreSQL will automatically join the tables based on all columns that have matching columns in each database
pd.read_sql('''
SELECT players.player_name, teams.team_name, players.goals
FROM players
NATURAL JOIN teams;
''', connection)
player_name team_name goals
0 Messi Inter Miami 10
1 Vini Jr Real Madrid 8
2 Neymar Al Hilal 6
3 Mbappe Real Madrid 5
4 Lewandowski Bayern 4

Merge tables by row 🧩

Union

  • The UNION operator is used to combine the result sets of two or more SELECT statements
  • It removes duplicate rows between the various SELECT statements
  • The columns in each SELECT statement must be in the same order, with the same data types
pd.read_sql('''
SELECT player_name, goals FROM players
UNION
SELECT team_name, NULL FROM teams
ORDER BY player_name;
''', connection)
player_name goals
0 Al Hilal NaN
1 Bayern NaN
2 Haaland 5.0
3 Inter Miami NaN
4 Lewandowski 4.0
5 Mbappe 5.0
6 Messi 10.0
7 Neymar 6.0
8 Real Madrid NaN
9 Vini Jr 8.0

Union all and intersect

  • Similar to UNION, UNION ALL also merges tables by rows
  • Unlike UNION, UNION ALL retains the duplicate records of both the tables
pd.read_sql('''
INSERT INTO players (player_name, goals, victories) VALUES ('Real Madrid', 0, 0); 
SELECT player_name FROM players
UNION ALL
SELECT team_name FROM teams;
''', connection)
player_name
0 Messi
1 Vini Jr
2 Neymar
3 Mbappe
4 Lewandowski
5 Haaland
6 Real Madrid
7 Inter Miami
8 Real Madrid
9 Al Hilal
10 Real Madrid
11 Bayern
  • The INTERSECT operator is used to return the common rows between two queries
pd.read_sql('''
INSERT INTO players (player_name, goals, victories) VALUES ('Real Madrid', 0, 0); 
SELECT player_name FROM players
INTERSECT
SELECT team_name FROM teams;
''', connection)
player_name
0 Real Madrid

Except

  • EXCEPT returns the records from the first table (left table) which are not present in the second table (right table)
pd.read_sql('''
INSERT INTO players (player_name, goals, victories) VALUES ('Real Madrid', 0, 0); 
SELECT player_name FROM players
EXCEPT
SELECT team_name FROM teams;
''', connection)
player_name
0 Mbappe
1 Neymar
2 Lewandowski
3 Messi
4 Vini Jr
5 Haaland

Merge operator

Merge

  • PostgreSQL 15 introduced the MERGE operator, which allows you to perform an INSERT, UPDATE, or DELETE operation in a single statement
  • It is extremely flexible, easy to read, and efficient. More info here
  • It is better to see it with an example
connection.execute(text('''
ALTER TABLE teams ADD COLUMN IF NOT EXISTS player_id SERIAL;
ALTER TABLE teams ADD COLUMN IF NOT EXISTS team_player VARCHAR(100);

MERGE INTO teams AS t
USING players AS p
ON t.player_id = p.player_id
WHEN MATCHED THEN
    UPDATE SET team_player = (p.player_name || ' - ' || t.team_name)
WHEN NOT MATCHED THEN
    INSERT (team_name) VALUES (p.player_name);
'''))

pd.read_sql('SELECT * FROM teams', connection)
team_id team_name player_id team_player
0 1 Inter Miami 1 Messi - Inter Miami
1 2 Real Madrid 2 Vini Jr - Real Madrid
2 3 Al Hilal 3 Neymar - Al Hilal
3 4 Real Madrid 4 Mbappe - Real Madrid
4 5 Bayern 5 Lewandowski - Bayern
5 6 Haaland 6 None
6 7 Real Madrid 7 None
7 8 Real Madrid 8 None
8 9 Real Madrid 9 None

Conclusion 📖

Conclusion

  • Today we learned about different types of joins in SQL
  • We also learned about special joins, such as CROSS JOIN and SELF JOIN
  • We saw how to merge tables by row with UNION, INTERSECT, and EXCEPT
  • We also learned about join conditions with ON, USING, and NATURAL JOIN
  • And we saw the new MERGE operator in PostgreSQL 15+ 🚀

And that’s all for today! 🎉

Thank you and have a great rest of your day! 🙏

Appendix 01

  • Here is the solution to the exercise
print("INNER JOIN Results:")
print(pd.read_sql('''
    SELECT p.product_name, r.rating, r.comment
    FROM products p
    INNER JOIN reviews r ON p.product_id = r.product_id
    ORDER BY p.product_id;
''', connection))
INNER JOIN Results:
   product_name  rating              comment
0  Coffee Maker       5  Great coffee maker!
1  Coffee Maker       4   Good but expensive
2       Toaster       3      Average toaster
3       Blender       5    Best blender ever
print("\nLEFT JOIN Results:")
print(pd.read_sql('''
    SELECT p.product_name, r.rating, r.comment
    FROM products p
    LEFT JOIN reviews r ON p.product_id = r.product_id
    ORDER BY p.product_id;
''', connection))

LEFT JOIN Results:
   product_name  rating              comment
0  Coffee Maker     5.0  Great coffee maker!
1  Coffee Maker     4.0   Good but expensive
2       Toaster     3.0      Average toaster
3       Blender     5.0    Best blender ever
4     Microwave     NaN                 None
5     Air Fryer     NaN                 None

Back to exercise

Appendix 02

pd.read_sql('''
SELECT p1.player_name, p1.victories, p2.player_name, p2.victories,
    ROUND(CAST(p1.victories AS NUMERIC) / CAST(p2.victories AS NUMERIC), 2) AS victories_ratio
FROM players p1
JOIN players p2
ON p1.player_id < p2.player_id
WHERE p2.victories > 0
ORDER BY p1.player_id;
''', connection)
player_name victories player_name victories victories_ratio
0 Messi 5 Vini Jr 4 1.25
1 Messi 5 Neymar 3 1.67
2 Messi 5 Mbappe 2 2.50
3 Messi 5 Lewandowski 1 5.00
4 Messi 5 Haaland 3 1.67
5 Vini Jr 4 Neymar 3 1.33
6 Vini Jr 4 Mbappe 2 2.00
7 Vini Jr 4 Lewandowski 1 4.00
8 Vini Jr 4 Haaland 3 1.33
9 Neymar 3 Mbappe 2 1.50
10 Neymar 3 Lewandowski 1 3.00
11 Neymar 3 Haaland 3 1.00
12 Mbappe 2 Lewandowski 1 2.00
13 Mbappe 2 Haaland 3 0.67
14 Lewandowski 1 Haaland 3 0.33

Back to exercise

Appendix 05

Cleaning all tables

  • I have encountered some issues with psycopg2 when trying to drop tables
  • So I created a function to clean all tables in the database
  • If you have a similar issue, you can use the function below
  • Replace the default values with your database name, user, and password
from sqlalchemy import create_engine, text, inspect
import time

engine = create_engine('postgresql+psycopg2://postgres:postgres@localhost:5432/postgres', future=True)
connection = engine.connect()

def clean_database(engine):
    with engine.connect() as conn:
        try:
            # Get inspector to check existing tables
            inspector = inspect(engine)
            existing_tables = inspector.get_table_names()
            
            if not existing_tables:
                print("No tables found in database")
                return
                
            print(f"Found {len(existing_tables)} tables: {existing_tables}")
            
            # Kill other connections
            conn.execute(text("""
                SELECT pg_terminate_backend(pid) 
                FROM pg_stat_activity 
                WHERE pid <> pg_backend_pid()
                AND datname = current_database()
            """))
            
            conn.execute(text("ROLLBACK"))
            conn.execute(text("SET statement_timeout = '30s'"))
            
            # Only drop tables that exist
            for table in existing_tables:
                try:
                    conn.execute(text(f"DROP TABLE IF EXISTS {table} CASCADE"))
                    print(f"Dropped {table}")
                    conn.commit()
                    time.sleep(1)
                except Exception as e:
                    print(f"Error with {table}: {str(e)}")
                    conn.execute(text("ROLLBACK"))
                    
        except Exception as e:
            print(f"Fatal error: {str(e)}")
            conn.execute(text("ROLLBACK"))

# Execute
clean_database(engine)

Back to the lecture