Lecture 20 - Joining Tables in SQL
11 October, 2024
psycopg2
to connect to a PostgreSQL databasesqlalchemy
and pandas
to run SQL queries in Pythonpandas
to convert files to SQL tables and vice-versaJOIN
statements to merge tablesINNER JOIN
and LEFT JOIN
pandas
’ merge
methodUSING
and ON
clausesJOIN
statementJOIN
statement is used to combine rows from two or more tables, based on a related column between them (usually a primary key)JOIN
is very flexible and can be used in many ways for pretty much all types of data merge operationsINNER JOIN
LEFT JOIN
RIGHT JOIN
FULL JOIN
CROSS JOIN
SELF JOIN
LEFT JOIN
is probably the most common type of joinSELECT * FROM table1 LEFT JOIN table2 ON table1.column = table2.column
ON
clause specifies the condition to join the tables, i.e., the column(s) that are common to both tablesLEFT JOIN
keyword returns all records from the left table (table1), and the matched records from the right table (table2)NULL
from the right side if there is no matchJOIN
clauses# Rollback any existing transactions
conn.rollback()
# Drop the tables if they already exist
conn.execute(text('DROP TABLE IF EXISTS famous_people CASCADE'))
conn.execute(text('DROP TABLE IF EXISTS information CASCADE'))
# Create tables in the SQL database
conn.execute(text('''
CREATE TABLE IF NOT EXISTS famous_people (
person_id INTEGER PRIMARY KEY,
name TEXT
)
'''))
conn.execute(text('''
CREATE TABLE IF NOT EXISTS information (
person_id INTEGER,
country TEXT,
profession TEXT
)
'''))
# Insert data into the tables
conn.execute(text('''
INSERT INTO famous_people (person_id, name) VALUES
(1, 'Albert Einstein'),
(2, 'Isaac Newton'),
(3, 'Marie Curie'),
(4, 'Galileo Galilei'),
(5, 'René Descartes'),
(6, 'Blaise Pascal'),
(7, 'Nikola Tesla')
'''))
conn.execute(text('''
INSERT INTO information (person_id, country, profession) VALUES
(1, 'Germany', 'Physicist'),
(2, 'United Kingdom', 'Mathematician'),
(3, 'Poland', 'Physicist'),
(4, 'Italy', 'Astronomer'),
(5, 'France', 'Philosopher'),
(6, 'France', 'Mathematician')
'''))
conn.commit()
conn.close()
sqlalchemy
sometimes produces some unexpected behaviours when running SQL queriesconn.rollback()
before running the query, to make sure the connection is cleanconn.commit()
to make sure the data is savedconn.close()
conn.commit()
, the table will indeed be in the database, so you won’t need to recreate itLEFT JOIN
🤝LEFT JOIN
clausepd.read_sql("""
SELECT *
FROM famous_people
LEFT JOIN information
ON famous_people.person_id = information.person_id
""", conn)
person_id | name | person_id | country | profession | |
---|---|---|---|---|---|
0 | 1 | Albert Einstein | 1.0 | Germany | Physicist |
1 | 2 | Isaac Newton | 2.0 | United Kingdom | Mathematician |
2 | 3 | Marie Curie | 3.0 | Poland | Physicist |
3 | 4 | Galileo Galilei | 4.0 | Italy | Astronomer |
4 | 5 | René Descartes | 5.0 | France | Philosopher |
5 | 6 | Blaise Pascal | 6.0 | France | Mathematician |
6 | 7 | Nikola Tesla | NaN | None | None |
INNER JOIN
🤝INNER JOIN
clausepd.read_sql("""
SELECT *
FROM famous_people
INNER JOIN information
ON famous_people.person_id = information.person_id
""", conn)
person_id | name | person_id | country | profession | |
---|---|---|---|---|---|
0 | 1 | Albert Einstein | 1 | Germany | Physicist |
1 | 2 | Isaac Newton | 2 | United Kingdom | Mathematician |
2 | 3 | Marie Curie | 3 | Poland | Physicist |
3 | 4 | Galileo Galilei | 4 | Italy | Astronomer |
4 | 5 | René Descartes | 5 | France | Philosopher |
5 | 6 | Blaise Pascal | 6 | France | Mathematician |
Click on the image to enlarge
.csv
files into the databaseLEFT JOIN
clause# Load the data
bills_actions = pd.read_csv("data_raw/bills_actions.csv")
us_congress_members = pd.read_csv("data_raw/us_congress_members.csv")
# Recreate the connection
conn = engine.connect()
# Rollback any existing transactions
conn.rollback()
# Drop the tables if they already exist
conn.execute(text('DROP TABLE IF EXISTS bills_actions CASCADE'))
conn.execute(text('DROP TABLE IF EXISTS us_congress_members CASCADE'))
conn.commit()
# Create the tables in the database
bills_actions.to_sql("bills_actions", conn, if_exists="replace", index=False)
us_congress_members.to_sql("us_congress_members", conn, if_exists="replace", index=False)
conn.commit()
# Recreate the connection
conn = engine.connect()
# Rollback any existing transactions
conn.rollback()
# Check if the table exists
pd.read_sql(text('SELECT * FROM bills_actions'), conn).head()
congress | bill_number | bill_type | action | main_action | object | member_id | |
---|---|---|---|---|---|---|---|
0 | 116 | 1029 | s | S.Amdt.1274 Amendment SA 1274 proposed by Sena... | senate amendment proposed (on the floor) | amendment | 858 |
1 | 116 | 1031 | s | S.Amdt.2698 Amendment SA 2698 proposed by Sena... | senate amendment proposed (on the floor) | amendment | 675 |
2 | 116 | 1160 | s | S.Amdt.2659 Amendment SA 2659 proposed by Sena... | senate amendment proposed (on the floor) | amendment | 858 |
3 | 116 | 1199 | s | Committee on Health, Education, Labor, and Pen... | senate committee/subcommittee actions | senate bill | 1561 |
4 | 116 | 1208 | s | Committee on the Judiciary. Reported by Senato... | senate committee/subcommittee actions | senate bill | 1580 |
us_congress_members
dataset from SQLconn.rollback()
before running the query to clean the connectionClick on the image to enlarge
member_id
is the PRIMARY KEY
of the first dataset# Recreate the connection
engine = create_engine('postgresql://postgres:postgres@localhost:5432/postgres')
conn = engine.connect()
# Rollback any existing transactions
conn.rollback()
# Check if the table exists and count distinct member_id
results = pd.read_sql(text('''
SELECT COUNT(DISTINCT member_id) AS num_distinct,
COUNT(*) AS num_members
FROM us_congress_members
'''), conn)
print(results)
conn.close()
num_distinct num_members
0 1811 1811
FROM
is followed by the name of the primary datasetLEFT
is followed by the name of the secondary datasetON
is the id variable used for mergingfrom sqlalchemy import create_engine, text, inspect
import time
engine = create_engine('postgresql+psycopg2://postgres:postgres@localhost:5432/postgres', future=True)
connection = engine.connect()
def clean_database(engine):
with engine.connect() as conn:
try:
# Get inspector to check existing tables
inspector = inspect(engine)
existing_tables = inspector.get_table_names()
if not existing_tables:
print("No tables found in database")
return
print(f"Found {len(existing_tables)} tables: {existing_tables}")
# Kill other connections
conn.execute(text("""
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
AND datname = current_database()
"""))
conn.execute(text("ROLLBACK"))
conn.execute(text("SET statement_timeout = '30s'"))
# Only drop tables that exist
for table in existing_tables:
try:
conn.execute(text(f"DROP TABLE IF EXISTS {table} CASCADE"))
print(f"Dropped {table}")
conn.commit()
time.sleep(1)
except Exception as e:
print(f"Error with {table}: {str(e)}")
conn.execute(text("ROLLBACK"))
except Exception as e:
print(f"Fatal error: {str(e)}")
conn.execute(text("ROLLBACK"))
# Execute
clean_database(engine)
us_congress_members
dataset from SQL# Recreate the connection
conn = engine.connect()
# Rollback any existing transactions
conn.rollback()
# Drop the tables if they already exist
conn.execute(text('''
DROP TABLE IF EXISTS us_congress_members CASCADE
'''))
conn.commit()
# Import "races"
us_congress_members.to_sql('us_congress_members',
con = conn,
if_exists='replace',
index=False)
conn.commit()
# Check if the table exists
pd.read_sql_query("SELECT * FROM us_congress_members", conn)
member_id | full_name | last_name | member_title | state | party_name | chamber | |
---|---|---|---|---|---|---|---|
0 | 0 | A. Donald McEachin | McEachin | Representative | Virginia | Democratic | House |
1 | 1 | Aaron Schock | Schock | Representative | Illinois | Republican | House |
2 | 2 | Abby Finkenauer | Finkenauer | Representative | Iowa | Democratic | House |
3 | 3 | Abigail Davis Spanberger | Spanberger | Representative | Virginia | Democratic | House |
4 | 4 | Adam H. Putnam | Putnam | Representative | Florida | Republican | House |
... | ... | ... | ... | ... | ... | ... | ... |
1806 | 1806 | Wm. Lacy Clay | Clay | Representative | Missouri | Democratic | House |
1807 | 1807 | Wyche, Jr. Fowler | Fowler | Senator | Georgia | Democratic | House |
1808 | 1808 | Wyche, Jr. Fowler | Fowler | Senator | Georgia | Democratic | Senate |
1809 | 1809 | Yvette D. Clarke | Clarke | Representative | New York | Democratic | House |
1810 | 1810 | Zell Miller | Miller | Senator | Georgia | Democratic | Senate |
1811 rows × 7 columns
FROM
is followed by the name of the primary datasetLEFT
is followed by the name of the secondary datasetON
is the id variable used for mergingus_congress_members
dataset with the bills_actions
dataset# Restart the connection
conn = engine.connect()
# Rollback any existing transactions
conn.rollback()
# Merge the tables
results = pd.read_sql('''
SELECT *
FROM us_congress_members
LEFT JOIN bills_actions
ON us_congress_members.member_id = bills_actions.member_id
''', conn)
print(results.head())
conn.close()
member_id full_name last_name member_title state \
0 0 A. Donald McEachin McEachin Representative Virginia
1 1 Aaron Schock Schock Representative Illinois
2 2 Abby Finkenauer Finkenauer Representative Iowa
3 2 Abby Finkenauer Finkenauer Representative Iowa
4 3 Abigail Davis Spanberger Spanberger Representative Virginia
party_name chamber congress bill_number bill_type \
0 Democratic House NaN NaN None
1 Republican House NaN NaN None
2 Democratic House 116.0 3.0 hr
3 Democratic House 116.0 7617.0 hr
4 Democratic House 116.0 1644.0 hr
action main_action \
0 None None
1 None None
2 H.Amdt.719 Amendment (A009) offered by Ms. Fin... house amendment offered
3 H.Amdt.870 Amendment (A012) offered by Ms. Fin... house amendment offered
4 H.Amdt.167 Amendment (A011) offered by Ms. Spa... house amendment offered
object member_id
0 None NaN
1 None NaN
2 amendment 2.0
3 amendment 2.0
4 amendment 3.0