QTM 151 - Introduction to Statistical Computing II

Lecture 20 - Joining Tables in SQL

Danilo Freire

Emory University

11 October, 2024

Hi! 😊 Hope your week is going well

Brief recap of lecture 19 📚

Some updates on the final project

  • The list of groups is now available on our reppository. Check it out here
  • All groups have 3-4 members, as we discussed
  • Many thanks to all of you who have already formed your groups and sent them to me! 🙏🏻
  • Those not in a group have been assigned to one at random
  • So please check the list and make sure you are in a group, and that you know your group members
  • Any questions? 😊

Brief recap of last lecture

Last time, we learned how to:

  • Use psycopg2 to connect to a PostgreSQL database
  • Use sqlalchemy and pandas to run SQL queries in Python
  • Use pandas to convert files to SQL tables and vice-versa
  • And we also learned how to use Python functions with SQL queries

Today’s plan 📋

Merge tables in SQL

  • We will learn how to merge tables in SQL
  • We will use JOIN statements to merge tables
  • There are many types of joins, and we will learn the most common ones
    • INNER JOIN and LEFT JOIN
  • The syntax is similar to pandasmerge method
  • We will also learn how to merge tables using USING and ON clauses

Merging tables in SQL 📊

The JOIN statement

  • The JOIN statement is used to combine rows from two or more tables, based on a related column between them (usually a primary key)
  • JOIN is very flexible and can be used in many ways for pretty much all types of data merge operations
  • The most common types of joins are:
    • INNER JOIN
    • LEFT JOIN
    • RIGHT JOIN
    • FULL JOIN
    • CROSS JOIN
    • SELF JOIN
  • LEFT JOIN is probably the most common type of join
  • You keep all rows from the left table and only the matching rows from the right table
  • The syntax is as follows:
  • SELECT * FROM table1 LEFT JOIN table2 ON table1.column = table2.column
  • The ON clause specifies the condition to join the tables, i.e., the column(s) that are common to both tables
  • The LEFT JOIN keyword returns all records from the left table (table1), and the matched records from the right table (table2)
  • The result is NULL from the right side if there is no match
  • Let’s see an example to make it clearer 😊

Set up the environment 🛠️

  • Let’s start by loading the necessary libraries
# Load the necessary libraries
import pandas as pd
import psycopg2
from sqlalchemy import create_engine, text

# Connect to the database
engine = create_engine('postgresql+psycopg2://postgres:postgres@localhost:5432/postgres')
conn = engine.connect()
# Rollback any existing transactions
conn.rollback()

# Drop the tables if they already exist
conn.execute(text('DROP TABLE IF EXISTS famous_people CASCADE'))

conn.execute(text('DROP TABLE IF EXISTS information CASCADE'))

# Create tables in the SQL database
conn.execute(text('''
CREATE TABLE IF NOT EXISTS famous_people (
    person_id INTEGER PRIMARY KEY,
    name TEXT
)
'''))

conn.execute(text('''
CREATE TABLE IF NOT EXISTS information (
    person_id INTEGER,
    country TEXT,
    profession TEXT
)
'''))

# Insert data into the tables
conn.execute(text('''
INSERT INTO famous_people (person_id, name) VALUES 
  (1, 'Albert Einstein'),
  (2, 'Isaac Newton'),
  (3, 'Marie Curie'),
  (4, 'Galileo Galilei'),
  (5, 'René Descartes'),
  (6, 'Blaise Pascal'),
  (7, 'Nikola Tesla')
'''))

conn.execute(text('''
INSERT INTO information (person_id, country, profession) VALUES 
  (1, 'Germany', 'Physicist'),
  (2, 'United Kingdom', 'Mathematician'),
  (3, 'Poland', 'Physicist'),
  (4, 'Italy', 'Astronomer'),
  (5, 'France', 'Philosopher'),
  (6, 'France', 'Mathematician')
'''))

conn.commit()
conn.close()

Let’s see the tables

  • Let’s see the tables we just created
# Connect to the database
conn = engine.connect()
pd.read_sql_query("SELECT * FROM famous_people", conn)
person_id name
0 1 Albert Einstein
1 2 Isaac Newton
2 3 Marie Curie
3 4 Galileo Galilei
4 5 René Descartes
5 6 Blaise Pascal
6 7 Nikola Tesla
pd.read_sql_query("SELECT * FROM information", conn)
person_id country profession
0 1 Germany Physicist
1 2 United Kingdom Mathematician
2 3 Poland Physicist
3 4 Italy Astronomer
4 5 France Philosopher
5 6 France Mathematician

Error handling 🚨

  • sqlalchemy sometimes produces some unexpected behaviours when running SQL queries
  • In my experience, it sometimes fails to run queries that are too long or too complex
  • The worst thing is, it just keeps running forever without giving any error message! 🤬
  • So if you run a query and it takes too long, or if you suspect it is not working, just interrupt the kernel and try again
  • A good idea is to add conn.rollback() before running the query, to make sure the connection is clean
  • If you would like to save the data, it is also a good idea to run conn.commit() to make sure the data is saved
  • Then, close the connection to prevent memory leaks with conn.close()
  • The bad news is that you have to recreate the connection every time you encounter an error
  • If you typed conn.commit(), the table will indeed be in the database, so you won’t need to recreate it
  • But if you didn’t, you will need to recreate the table and insert the data again
  • So far I haven’t found a way to fix this error (it seems to happen quite often with others too), but I will keep looking for a solution 😊
  • Check this appendix for a workaround

Let’s join the tables with LEFT JOIN 🤝

  • Now that we have the tables, let’s join them using the LEFT JOIN clause
pd.read_sql("""
    SELECT * 
    FROM famous_people 
    LEFT JOIN information
    ON famous_people.person_id = information.person_id
""", conn)
person_id name person_id country profession
0 1 Albert Einstein 1.0 Germany Physicist
1 2 Isaac Newton 2.0 United Kingdom Mathematician
2 3 Marie Curie 3.0 Poland Physicist
3 4 Galileo Galilei 4.0 Italy Astronomer
4 5 René Descartes 5.0 France Philosopher
5 6 Blaise Pascal 6.0 France Mathematician
6 7 Nikola Tesla NaN None None

Let’s join the tables with INNER JOIN 🤝

  • Now let’s join the tables using the INNER JOIN clause
pd.read_sql("""
SELECT * 
FROM famous_people 
INNER JOIN information 
ON famous_people.person_id = information.person_id
""", conn)
person_id name person_id country profession
0 1 Albert Einstein 1 Germany Physicist
1 2 Isaac Newton 2 United Kingdom Mathematician
2 3 Marie Curie 3 Poland Physicist
3 4 Galileo Galilei 4 Italy Astronomer
4 5 René Descartes 5 France Philosopher
5 6 Blaise Pascal 6 France Mathematician

Entity Relationship Diagram (ERD) 📊

What is an ERD?

  • An Entity Relationship Diagram (ERD) is a visual representation of the relationships between entities in a database
  • ERDs are used to model the logical structure of a database
  • Their main function is to indicate which entities (variables) are related to each other and how they are related
  • So they should always indicate the primary keys and foreign keys of the tables
  • A primary key is a unique identifier for each row in a table
  • A foreign key is a column that references the primary key of another table
  • Tables don’t need to have a foreign key, but they usually do as it helps to maintain data integrity

Click to enlarge Click on the image to enlarge

  • For instance, if we have a table with information about members of the US Congress, we could have the following ERD
  • Note the connections between the tables, which indicate the relationships between them

Let’s see another example with bigger tables 📊

  • This time, we will load two .csv files into the database
  • We will then join the tables using the LEFT JOIN clause
# Load the data
bills_actions       = pd.read_csv("data_raw/bills_actions.csv")
us_congress_members = pd.read_csv("data_raw/us_congress_members.csv")

# Recreate the connection
conn = engine.connect()

# Rollback any existing transactions
conn.rollback()

# Drop the tables if they already exist
conn.execute(text('DROP TABLE IF EXISTS bills_actions CASCADE'))
conn.execute(text('DROP TABLE IF EXISTS us_congress_members CASCADE'))
conn.commit()

# Create the tables in the database
bills_actions.to_sql("bills_actions", conn, if_exists="replace", index=False)
us_congress_members.to_sql("us_congress_members", conn, if_exists="replace", index=False)
conn.commit()

Let’s see the table 📊

# Recreate the connection
conn = engine.connect()

# Rollback any existing transactions
conn.rollback()

# Check if the table exists
pd.read_sql(text('SELECT * FROM bills_actions'), conn).head()
congress bill_number bill_type action main_action object member_id
0 116 1029 s S.Amdt.1274 Amendment SA 1274 proposed by Sena... senate amendment proposed (on the floor) amendment 858
1 116 1031 s S.Amdt.2698 Amendment SA 2698 proposed by Sena... senate amendment proposed (on the floor) amendment 675
2 116 1160 s S.Amdt.2659 Amendment SA 2659 proposed by Sena... senate amendment proposed (on the floor) amendment 858
3 116 1199 s Committee on Health, Education, Labor, and Pen... senate committee/subcommittee actions senate bill 1561
4 116 1208 s Committee on the Judiciary. Reported by Senato... senate committee/subcommittee actions senate bill 1580

Try it yourself! 🤓

  • Download and display the us_congress_members dataset from SQL
  • Be sure to use conn.rollback() before running the query to clean the connection
  • If you encounter an error, check the appendix or ask me for help 😊
  • Click here to see the solution

Merging the tables 🤝

Click to enlarge Click on the image to enlarge

  • Here member_id is the PRIMARY KEY of the first dataset
  • We can check that with SQL commands
# Recreate the connection
engine = create_engine('postgresql://postgres:postgres@localhost:5432/postgres')
conn = engine.connect()

# Rollback any existing transactions
conn.rollback()

# Check if the table exists and count distinct member_id
results = pd.read_sql(text('''
  SELECT COUNT(DISTINCT member_id) AS num_distinct,
  COUNT(*) AS num_members 
  FROM us_congress_members
'''), conn)

print(results)

conn.close()
   num_distinct  num_members
0          1811         1811

Try it yourself! 🤓

  • Merge two datasets
  • Similar to Python
  • FROM is followed by the name of the primary dataset
  • LEFT is followed by the name of the secondary dataset
  • ON is the id variable used for merging
  • Click here to see the solution

Appendix 01 - Error handling 🚨

  • If you encounter an error when running the code, you can use the following workaround
  • First, you can rollback the transaction to clean the connection
  • Add this before running the query
conn.rollback()
  • If you want to save the data, you can commit the transaction
  • Add this after running the query
conn.commit()
  • Finally, close the connection to prevent memory leaks
conn.close()

Appendix 01 - Error handling 🚨

  • If everything else fails, you can drop all tables and recreate them
  • This code will clean your database
from sqlalchemy import create_engine, text, inspect
import time

engine = create_engine('postgresql+psycopg2://postgres:postgres@localhost:5432/postgres', future=True)
connection = engine.connect()

def clean_database(engine):
    with engine.connect() as conn:
        try:
            # Get inspector to check existing tables
            inspector = inspect(engine)
            existing_tables = inspector.get_table_names()
            
            if not existing_tables:
                print("No tables found in database")
                return
                
            print(f"Found {len(existing_tables)} tables: {existing_tables}")
            
            # Kill other connections
            conn.execute(text("""
                SELECT pg_terminate_backend(pid) 
                FROM pg_stat_activity 
                WHERE pid <> pg_backend_pid()
                AND datname = current_database()
            """))
            
            conn.execute(text("ROLLBACK"))
            conn.execute(text("SET statement_timeout = '30s'"))
            
            # Only drop tables that exist
            for table in existing_tables:
                try:
                    conn.execute(text(f"DROP TABLE IF EXISTS {table} CASCADE"))
                    print(f"Dropped {table}")
                    conn.commit()
                    time.sleep(1)
                except Exception as e:
                    print(f"Error with {table}: {str(e)}")
                    conn.execute(text("ROLLBACK"))
                    
        except Exception as e:
            print(f"Fatal error: {str(e)}")
            conn.execute(text("ROLLBACK"))

# Execute
clean_database(engine)

Back to the lecture

Appendix 02

  • Download and display the us_congress_members dataset from SQL
# Recreate the connection
conn = engine.connect()

# Rollback any existing transactions
conn.rollback()

# Drop the tables if they already exist
conn.execute(text('''
DROP TABLE IF EXISTS us_congress_members CASCADE
'''))
conn.commit()

# Import "races"
us_congress_members.to_sql('us_congress_members', 
             con = conn, 
             if_exists='replace',
             index=False)

conn.commit()

# Check if the table exists
pd.read_sql_query("SELECT * FROM us_congress_members", conn)
member_id full_name last_name member_title state party_name chamber
0 0 A. Donald McEachin McEachin Representative Virginia Democratic House
1 1 Aaron Schock Schock Representative Illinois Republican House
2 2 Abby Finkenauer Finkenauer Representative Iowa Democratic House
3 3 Abigail Davis Spanberger Spanberger Representative Virginia Democratic House
4 4 Adam H. Putnam Putnam Representative Florida Republican House
... ... ... ... ... ... ... ...
1806 1806 Wm. Lacy Clay Clay Representative Missouri Democratic House
1807 1807 Wyche, Jr. Fowler Fowler Senator Georgia Democratic House
1808 1808 Wyche, Jr. Fowler Fowler Senator Georgia Democratic Senate
1809 1809 Yvette D. Clarke Clarke Representative New York Democratic House
1810 1810 Zell Miller Miller Senator Georgia Democratic Senate

1811 rows × 7 columns

Back to the lecture

Appendix 03

  • Merge two datasets
  • Similar to Python
  • FROM is followed by the name of the primary dataset
  • LEFT is followed by the name of the secondary dataset
  • ON is the id variable used for merging
  • Here we are merging the us_congress_members dataset with the bills_actions dataset
# Restart the connection
conn = engine.connect()

# Rollback any existing transactions
conn.rollback()

# Merge the tables
results = pd.read_sql('''
SELECT *
FROM us_congress_members
LEFT JOIN bills_actions
ON us_congress_members.member_id = bills_actions.member_id
''', conn)

print(results.head())

conn.close()
   member_id                 full_name   last_name    member_title     state  \
0          0        A. Donald McEachin    McEachin  Representative  Virginia   
1          1              Aaron Schock      Schock  Representative  Illinois   
2          2           Abby Finkenauer  Finkenauer  Representative      Iowa   
3          2           Abby Finkenauer  Finkenauer  Representative      Iowa   
4          3  Abigail Davis Spanberger  Spanberger  Representative  Virginia   

   party_name chamber  congress  bill_number bill_type  \
0  Democratic   House       NaN          NaN      None   
1  Republican   House       NaN          NaN      None   
2  Democratic   House     116.0          3.0        hr   
3  Democratic   House     116.0       7617.0        hr   
4  Democratic   House     116.0       1644.0        hr   

                                              action              main_action  \
0                                               None                     None   
1                                               None                     None   
2  H.Amdt.719 Amendment (A009) offered by Ms. Fin...  house amendment offered   
3  H.Amdt.870 Amendment (A012) offered by Ms. Fin...  house amendment offered   
4  H.Amdt.167 Amendment (A011) offered by Ms. Spa...  house amendment offered   

      object  member_id  
0       None        NaN  
1       None        NaN  
2  amendment        2.0  
3  amendment        2.0  
4  amendment        3.0  

Back to the lecture