QTM 151 - Introduction to Statistical Computing II

Lecture 20 - Joining Tables in SQL

Danilo Freire

danilo.freire@emory.edu

Emory University

Hi! 😊 Hope your week is going well

Brief recap of lecture 19 📚

Some updates on the final project

The list of groups is now available on our reppository. Check it out here
All groups have 3-4 members, as we discussed
Many thanks to all of you who have already formed your groups and sent them to me! 🙏🏻
Those not in a group have been assigned to one at random
So please check the list and make sure you are in a group, and that you know your group members
Any questions? 😊

Brief recap of last lecture

Last time, we learned how to:

Use psycopg2 to connect to a PostgreSQL database
Use sqlalchemy and pandas to run SQL queries in Python
Use pandas to convert files to SQL tables and vice-versa
And we also learned how to use Python functions with SQL queries

Today’s plan 📋

Merge tables in SQL

We will learn how to merge tables in SQL
We will use JOIN statements to merge tables
There are many types of joins, and we will learn the most common ones
- INNER JOIN and LEFT JOIN
The syntax is similar to pandas’ merge method
We will also learn how to merge tables using USING and ON clauses

Merging tables in SQL 📊

The `JOIN` statement

The JOIN statement is used to combine rows from two or more tables, based on a related column between them (usually a primary key)
JOIN is very flexible and can be used in many ways for pretty much all types of data merge operations
The most common types of joins are:
- INNER JOIN
- LEFT JOIN
- RIGHT JOIN
- FULL JOIN
- CROSS JOIN
- SELF JOIN
LEFT JOIN is probably the most common type of join
You keep all rows from the left table and only the matching rows from the right table

The syntax is as follows:
SELECT * FROM table1 LEFT JOIN table2 ON table1.column = table2.column
The ON clause specifies the condition to join the tables, i.e., the column(s) that are common to both tables
The LEFT JOIN keyword returns all records from the left table (table1), and the matched records from the right table (table2)
The result is NULL from the right side if there is no match
Let’s see an example to make it clearer 😊

Set up the environment 🛠️

Let’s start by loading the necessary libraries

# Load the necessary libraries
import pandas as pd
import psycopg2
from sqlalchemy import create_engine, text

# Connect to the database
engine = create_engine('postgresql+psycopg2://postgres:postgres@localhost:5432/postgres')
conn = engine.connect()

Then let’s create two tables that we will use to test the JOIN clauses
Issues with the code? Please have a look at the appendix

# Rollback any existing transactions
conn.rollback()

# Drop the tables if they already exist
conn.execute(text('DROP TABLE IF EXISTS famous_people CASCADE'))

conn.execute(text('DROP TABLE IF EXISTS information CASCADE'))

# Create tables in the SQL database
conn.execute(text('''
CREATE TABLE IF NOT EXISTS famous_people (
    person_id INTEGER PRIMARY KEY,
    name TEXT
)
'''))

conn.execute(text('''
CREATE TABLE IF NOT EXISTS information (
    person_id INTEGER,
    country TEXT,
    profession TEXT
)
'''))

# Insert data into the tables
conn.execute(text('''
INSERT INTO famous_people (person_id, name) VALUES 
  (1, 'Albert Einstein'),
  (2, 'Isaac Newton'),
  (3, 'Marie Curie'),
  (4, 'Galileo Galilei'),
  (5, 'René Descartes'),
  (6, 'Blaise Pascal'),
  (7, 'Nikola Tesla')
'''))

conn.execute(text('''
INSERT INTO information (person_id, country, profession) VALUES 
  (1, 'Germany', 'Physicist'),
  (2, 'United Kingdom', 'Mathematician'),
  (3, 'Poland', 'Physicist'),
  (4, 'Italy', 'Astronomer'),
  (5, 'France', 'Philosopher'),
  (6, 'France', 'Mathematician')
'''))

conn.commit()
conn.close()

Let’s see the tables

Let’s see the tables we just created

# Connect to the database
conn = engine.connect()
pd.read_sql_query("SELECT * FROM famous_people", conn)

	person_id	name
0	1	Albert Einstein
1	2	Isaac Newton
2	3	Marie Curie
3	4	Galileo Galilei
4	5	René Descartes
5	6	Blaise Pascal
6	7	Nikola Tesla

pd.read_sql_query("SELECT * FROM information", conn)

	person_id	country	profession
0	1	Germany	Physicist
1	2	United Kingdom	Mathematician
2	3	Poland	Physicist
3	4	Italy	Astronomer
4	5	France	Philosopher
5	6	France	Mathematician

Error handling 🚨

sqlalchemy sometimes produces some unexpected behaviours when running SQL queries
In my experience, it sometimes fails to run queries that are too long or too complex
The worst thing is, it just keeps running forever without giving any error message! 🤬
So if you run a query and it takes too long, or if you suspect it is not working, just interrupt the kernel and try again
A good idea is to add conn.rollback() before running the query, to make sure the connection is clean
If you would like to save the data, it is also a good idea to run conn.commit() to make sure the data is saved
Then, close the connection to prevent memory leaks with conn.close()

The bad news is that you have to recreate the connection every time you encounter an error
If you typed conn.commit(), the table will indeed be in the database, so you won’t need to recreate it
But if you didn’t, you will need to recreate the table and insert the data again
So far I haven’t found a way to fix this error (it seems to happen quite often with others too), but I will keep looking for a solution 😊
Check this appendix for a workaround

Let’s join the tables with `LEFT JOIN` 🤝

Now that we have the tables, let’s join them using the LEFT JOIN clause

pd.read_sql("""
    SELECT * 
    FROM famous_people 
    LEFT JOIN information
    ON famous_people.person_id = information.person_id
""", conn)

	person_id	name	person_id	country	profession
0	1	Albert Einstein	1.0	Germany	Physicist
1	2	Isaac Newton	2.0	United Kingdom	Mathematician
2	3	Marie Curie	3.0	Poland	Physicist
3	4	Galileo Galilei	4.0	Italy	Astronomer
4	5	René Descartes	5.0	France	Philosopher
5	6	Blaise Pascal	6.0	France	Mathematician
6	7	Nikola Tesla	NaN	None	None

Let’s join the tables with `INNER JOIN` 🤝

Now let’s join the tables using the INNER JOIN clause

pd.read_sql("""
SELECT * 
FROM famous_people 
INNER JOIN information 
ON famous_people.person_id = information.person_id
""", conn)

	person_id	name	person_id	country	profession
0	1	Albert Einstein	1	Germany	Physicist
1	2	Isaac Newton	2	United Kingdom	Mathematician
2	3	Marie Curie	3	Poland	Physicist
3	4	Galileo Galilei	4	Italy	Astronomer
4	5	René Descartes	5	France	Philosopher
5	6	Blaise Pascal	6	France	Mathematician

Entity Relationship Diagram (ERD) 📊

What is an ERD?

An Entity Relationship Diagram (ERD) is a visual representation of the relationships between entities in a database
ERDs are used to model the logical structure of a database
Their main function is to indicate which entities (variables) are related to each other and how they are related
So they should always indicate the primary keys and foreign keys of the tables
A primary key is a unique identifier for each row in a table
A foreign key is a column that references the primary key of another table
Tables don’t need to have a foreign key, but they usually do as it helps to maintain data integrity

Click to enlarge Click on the image to enlarge

For instance, if we have a table with information about members of the US Congress, we could have the following ERD
Note the connections between the tables, which indicate the relationships between them

Let’s see another example with bigger tables 📊

This time, we will load two .csv files into the database
We will then join the tables using the LEFT JOIN clause

# Load the data
bills_actions       = pd.read_csv("data_raw/bills_actions.csv")
us_congress_members = pd.read_csv("data_raw/us_congress_members.csv")

# Recreate the connection
conn = engine.connect()

# Rollback any existing transactions
conn.rollback()

# Drop the tables if they already exist
conn.execute(text('DROP TABLE IF EXISTS bills_actions CASCADE'))
conn.execute(text('DROP TABLE IF EXISTS us_congress_members CASCADE'))
conn.commit()

# Create the tables in the database
bills_actions.to_sql("bills_actions", conn, if_exists="replace", index=False)
us_congress_members.to_sql("us_congress_members", conn, if_exists="replace", index=False)
conn.commit()

Let’s see the table 📊

# Recreate the connection
conn = engine.connect()

# Rollback any existing transactions
conn.rollback()

# Check if the table exists
pd.read_sql(text('SELECT * FROM bills_actions'), conn).head()

	congress	bill_number	bill_type	action	main_action	object	member_id
0	116	1029	s	S.Amdt.1274 Amendment SA 1274 proposed by Sena...	senate amendment proposed (on the floor)	amendment	858
1	116	1031	s	S.Amdt.2698 Amendment SA 2698 proposed by Sena...	senate amendment proposed (on the floor)	amendment	675
2	116	1160	s	S.Amdt.2659 Amendment SA 2659 proposed by Sena...	senate amendment proposed (on the floor)	amendment	858
3	116	1199	s	Committee on Health, Education, Labor, and Pen...	senate committee/subcommittee actions	senate bill	1561
4	116	1208	s	Committee on the Judiciary. Reported by Senato...	senate committee/subcommittee actions	senate bill	1580

Try it yourself! 🤓

Download and display the us_congress_members dataset from SQL
Be sure to use conn.rollback() before running the query to clean the connection
If you encounter an error, check the appendix or ask me for help 😊
Click here to see the solution

Merging the tables 🤝

Click to enlarge Click on the image to enlarge

Here member_id is the PRIMARY KEY of the first dataset
We can check that with SQL commands

# Recreate the connection
engine = create_engine('postgresql://postgres:postgres@localhost:5432/postgres')
conn = engine.connect()

# Rollback any existing transactions
conn.rollback()

# Check if the table exists and count distinct member_id
results = pd.read_sql(text('''
  SELECT COUNT(DISTINCT member_id) AS num_distinct,
  COUNT(*) AS num_members 
  FROM us_congress_members
'''), conn)

print(results)

conn.close()

   num_distinct  num_members
0          1811         1811

Try it yourself! 🤓

Merge two datasets
Similar to Python
FROM is followed by the name of the primary dataset
LEFT is followed by the name of the secondary dataset
ON is the id variable used for merging
Click here to see the solution

Appendix 01 - Error handling 🚨

If you encounter an error when running the code, you can use the following workaround
First, you can rollback the transaction to clean the connection
Add this before running the query

conn.rollback()

If you want to save the data, you can commit the transaction
Add this after running the query

conn.commit()

Finally, close the connection to prevent memory leaks

conn.close()

Appendix 01 - Error handling 🚨

If everything else fails, you can drop all tables and recreate them
This code will clean your database

from sqlalchemy import create_engine, text, inspect
import time

engine = create_engine('postgresql+psycopg2://postgres:postgres@localhost:5432/postgres', future=True)
connection = engine.connect()

def clean_database(engine):
    with engine.connect() as conn:
        try:
            # Get inspector to check existing tables
            inspector = inspect(engine)
            existing_tables = inspector.get_table_names()
            
            if not existing_tables:
                print("No tables found in database")
                return
                
            print(f"Found {len(existing_tables)} tables: {existing_tables}")
            
            # Kill other connections
            conn.execute(text("""
                SELECT pg_terminate_backend(pid) 
                FROM pg_stat_activity 
                WHERE pid <> pg_backend_pid()
                AND datname = current_database()
            """))
            
            conn.execute(text("ROLLBACK"))
            conn.execute(text("SET statement_timeout = '30s'"))
            
            # Only drop tables that exist
            for table in existing_tables:
                try:
                    conn.execute(text(f"DROP TABLE IF EXISTS {table} CASCADE"))
                    print(f"Dropped {table}")
                    conn.commit()
                    time.sleep(1)
                except Exception as e:
                    print(f"Error with {table}: {str(e)}")
                    conn.execute(text("ROLLBACK"))
                    
        except Exception as e:
            print(f"Fatal error: {str(e)}")
            conn.execute(text("ROLLBACK"))

# Execute
clean_database(engine)

Back to the lecture

Appendix 02

Download and display the us_congress_members dataset from SQL

# Recreate the connection
conn = engine.connect()

# Rollback any existing transactions
conn.rollback()

# Drop the tables if they already exist
conn.execute(text('''
DROP TABLE IF EXISTS us_congress_members CASCADE
'''))
conn.commit()

# Import "races"
us_congress_members.to_sql('us_congress_members', 
             con = conn, 
             if_exists='replace',
             index=False)

conn.commit()

# Check if the table exists
pd.read_sql_query("SELECT * FROM us_congress_members", conn)

	member_id	full_name	last_name	member_title	state	party_name	chamber
0	0	A. Donald McEachin	McEachin	Representative	Virginia	Democratic	House
1	1	Aaron Schock	Schock	Representative	Illinois	Republican	House
2	2	Abby Finkenauer	Finkenauer	Representative	Iowa	Democratic	House
3	3	Abigail Davis Spanberger	Spanberger	Representative	Virginia	Democratic	House
4	4	Adam H. Putnam	Putnam	Representative	Florida	Republican	House
...	...	...	...	...	...	...	...
1806	1806	Wm. Lacy Clay	Clay	Representative	Missouri	Democratic	House
1807	1807	Wyche, Jr. Fowler	Fowler	Senator	Georgia	Democratic	House
1808	1808	Wyche, Jr. Fowler	Fowler	Senator	Georgia	Democratic	Senate
1809	1809	Yvette D. Clarke	Clarke	Representative	New York	Democratic	House
1810	1810	Zell Miller	Miller	Senator	Georgia	Democratic	Senate

1811 rows × 7 columns

Back to the lecture

Appendix 03

Merge two datasets
Similar to Python
FROM is followed by the name of the primary dataset
LEFT is followed by the name of the secondary dataset
ON is the id variable used for merging
Here we are merging the us_congress_members dataset with the bills_actions dataset

# Restart the connection
conn = engine.connect()

# Rollback any existing transactions
conn.rollback()

# Merge the tables
results = pd.read_sql('''
SELECT *
FROM us_congress_members
LEFT JOIN bills_actions
ON us_congress_members.member_id = bills_actions.member_id
''', conn)

print(results.head())

conn.close()

   member_id                 full_name   last_name    member_title     state  \
0          0        A. Donald McEachin    McEachin  Representative  Virginia   
1          1              Aaron Schock      Schock  Representative  Illinois   
2          2           Abby Finkenauer  Finkenauer  Representative      Iowa   
3          2           Abby Finkenauer  Finkenauer  Representative      Iowa   
4          3  Abigail Davis Spanberger  Spanberger  Representative  Virginia   

   party_name chamber  congress  bill_number bill_type  \
0  Democratic   House       NaN          NaN      None   
1  Republican   House       NaN          NaN      None   
2  Democratic   House     116.0          3.0        hr   
3  Democratic   House     116.0       7617.0        hr   
4  Democratic   House     116.0       1644.0        hr   

                                              action              main_action  \
0                                               None                     None   
1                                               None                     None   
2  H.Amdt.719 Amendment (A009) offered by Ms. Fin...  house amendment offered   
3  H.Amdt.870 Amendment (A012) offered by Ms. Fin...  house amendment offered   
4  H.Amdt.167 Amendment (A011) offered by Ms. Spa...  house amendment offered   

      object  member_id  
0       None        NaN  
1       None        NaN  
2  amendment        2.0  
3  amendment        2.0  
4  amendment        3.0

Back to the lecture

QTM 151 - Introduction to Statistical Computing II

Hi! 😊 Hope your week is going well

Brief recap of lecture 19 📚

Some updates on the final project

Brief recap of last lecture

Last time, we learned how to:

Today’s plan 📋

Merge tables in SQL

Merging tables in SQL 📊

The JOIN statement

Set up the environment 🛠️

Let’s see the tables

Error handling 🚨

Let’s join the tables with LEFT JOIN 🤝

Let’s join the tables with INNER JOIN 🤝

Entity Relationship Diagram (ERD) 📊

What is an ERD?

Let’s see another example with bigger tables 📊

Let’s see the table 📊

Try it yourself! 🤓

Merging the tables 🤝

Try it yourself! 🤓

Appendix 01 - Error handling 🚨

Appendix 01 - Error handling 🚨

Appendix 02

Appendix 03

The `JOIN` statement

Let’s join the tables with `LEFT JOIN` 🤝

Let’s join the tables with `INNER JOIN` 🤝