Lecture 20 - Joining Tables in SQL
psycopg2 to connect to a PostgreSQL databasesqlalchemy and pandas to run SQL queries in Pythonpandas to convert files to SQL tables and vice-versaJOIN statements to merge tablesINNER JOIN and LEFT JOINpandas’ merge methodUSING and ON clausesJOIN statementJOIN statement is used to combine rows from two or more tables, based on a related column between them (usually a primary key)JOIN is very flexible and can be used in many ways for pretty much all types of data merge operationsINNER JOINLEFT JOINRIGHT JOINFULL JOINCROSS JOINSELF JOINLEFT JOIN is probably the most common type of joinSELECT * FROM table1 LEFT JOIN table2 ON table1.column = table2.columnON clause specifies the condition to join the tables, i.e., the column(s) that are common to both tablesLEFT JOIN keyword returns all records from the left table (table1), and the matched records from the right table (table2)NULL from the right side if there is no matchJOIN clauses# Rollback any existing transactions
conn.rollback()
# Drop the tables if they already exist
conn.execute(text('DROP TABLE IF EXISTS famous_people CASCADE'))
conn.execute(text('DROP TABLE IF EXISTS information CASCADE'))
# Create tables in the SQL database
conn.execute(text('''
CREATE TABLE IF NOT EXISTS famous_people (
person_id INTEGER PRIMARY KEY,
name TEXT
)
'''))
conn.execute(text('''
CREATE TABLE IF NOT EXISTS information (
person_id INTEGER,
country TEXT,
profession TEXT
)
'''))
# Insert data into the tables
conn.execute(text('''
INSERT INTO famous_people (person_id, name) VALUES
(1, 'Albert Einstein'),
(2, 'Isaac Newton'),
(3, 'Marie Curie'),
(4, 'Galileo Galilei'),
(5, 'René Descartes'),
(6, 'Blaise Pascal'),
(7, 'Nikola Tesla')
'''))
conn.execute(text('''
INSERT INTO information (person_id, country, profession) VALUES
(1, 'Germany', 'Physicist'),
(2, 'United Kingdom', 'Mathematician'),
(3, 'Poland', 'Physicist'),
(4, 'Italy', 'Astronomer'),
(5, 'France', 'Philosopher'),
(6, 'France', 'Mathematician')
'''))
conn.commit()
conn.close()sqlalchemy sometimes produces some unexpected behaviours when running SQL queriesconn.rollback() before running the query, to make sure the connection is cleanconn.commit() to make sure the data is savedconn.close()conn.commit(), the table will indeed be in the database, so you won’t need to recreate itLEFT JOIN 🤝LEFT JOIN clausepd.read_sql("""
SELECT *
FROM famous_people
LEFT JOIN information
ON famous_people.person_id = information.person_id
""", conn)| person_id | name | person_id | country | profession | |
|---|---|---|---|---|---|
| 0 | 1 | Albert Einstein | 1.0 | Germany | Physicist |
| 1 | 2 | Isaac Newton | 2.0 | United Kingdom | Mathematician |
| 2 | 3 | Marie Curie | 3.0 | Poland | Physicist |
| 3 | 4 | Galileo Galilei | 4.0 | Italy | Astronomer |
| 4 | 5 | René Descartes | 5.0 | France | Philosopher |
| 5 | 6 | Blaise Pascal | 6.0 | France | Mathematician |
| 6 | 7 | Nikola Tesla | NaN | None | None |
INNER JOIN 🤝INNER JOIN clausepd.read_sql("""
SELECT *
FROM famous_people
INNER JOIN information
ON famous_people.person_id = information.person_id
""", conn)| person_id | name | person_id | country | profession | |
|---|---|---|---|---|---|
| 0 | 1 | Albert Einstein | 1 | Germany | Physicist |
| 1 | 2 | Isaac Newton | 2 | United Kingdom | Mathematician |
| 2 | 3 | Marie Curie | 3 | Poland | Physicist |
| 3 | 4 | Galileo Galilei | 4 | Italy | Astronomer |
| 4 | 5 | René Descartes | 5 | France | Philosopher |
| 5 | 6 | Blaise Pascal | 6 | France | Mathematician |
Click on the image to enlarge
.csv files into the databaseLEFT JOIN clause# Load the data
bills_actions = pd.read_csv("data_raw/bills_actions.csv")
us_congress_members = pd.read_csv("data_raw/us_congress_members.csv")
# Recreate the connection
conn = engine.connect()
# Rollback any existing transactions
conn.rollback()
# Drop the tables if they already exist
conn.execute(text('DROP TABLE IF EXISTS bills_actions CASCADE'))
conn.execute(text('DROP TABLE IF EXISTS us_congress_members CASCADE'))
conn.commit()
# Create the tables in the database
bills_actions.to_sql("bills_actions", conn, if_exists="replace", index=False)
us_congress_members.to_sql("us_congress_members", conn, if_exists="replace", index=False)
conn.commit()# Recreate the connection
conn = engine.connect()
# Rollback any existing transactions
conn.rollback()
# Check if the table exists
pd.read_sql(text('SELECT * FROM bills_actions'), conn).head()| congress | bill_number | bill_type | action | main_action | object | member_id | |
|---|---|---|---|---|---|---|---|
| 0 | 116 | 1029 | s | S.Amdt.1274 Amendment SA 1274 proposed by Sena... | senate amendment proposed (on the floor) | amendment | 858 |
| 1 | 116 | 1031 | s | S.Amdt.2698 Amendment SA 2698 proposed by Sena... | senate amendment proposed (on the floor) | amendment | 675 |
| 2 | 116 | 1160 | s | S.Amdt.2659 Amendment SA 2659 proposed by Sena... | senate amendment proposed (on the floor) | amendment | 858 |
| 3 | 116 | 1199 | s | Committee on Health, Education, Labor, and Pen... | senate committee/subcommittee actions | senate bill | 1561 |
| 4 | 116 | 1208 | s | Committee on the Judiciary. Reported by Senato... | senate committee/subcommittee actions | senate bill | 1580 |
us_congress_members dataset from SQLconn.rollback() before running the query to clean the connection Click on the image to enlarge
member_id is the PRIMARY KEY of the first dataset# Recreate the connection
engine = create_engine('postgresql://postgres:postgres@localhost:5432/postgres')
conn = engine.connect()
# Rollback any existing transactions
conn.rollback()
# Check if the table exists and count distinct member_id
results = pd.read_sql(text('''
SELECT COUNT(DISTINCT member_id) AS num_distinct,
COUNT(*) AS num_members
FROM us_congress_members
'''), conn)
print(results)
conn.close() num_distinct num_members
0 1811 1811
FROM is followed by the name of the primary datasetLEFT is followed by the name of the secondary datasetON is the id variable used for mergingfrom sqlalchemy import create_engine, text, inspect
import time
engine = create_engine('postgresql+psycopg2://postgres:postgres@localhost:5432/postgres', future=True)
connection = engine.connect()
def clean_database(engine):
with engine.connect() as conn:
try:
# Get inspector to check existing tables
inspector = inspect(engine)
existing_tables = inspector.get_table_names()
if not existing_tables:
print("No tables found in database")
return
print(f"Found {len(existing_tables)} tables: {existing_tables}")
# Kill other connections
conn.execute(text("""
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
AND datname = current_database()
"""))
conn.execute(text("ROLLBACK"))
conn.execute(text("SET statement_timeout = '30s'"))
# Only drop tables that exist
for table in existing_tables:
try:
conn.execute(text(f"DROP TABLE IF EXISTS {table} CASCADE"))
print(f"Dropped {table}")
conn.commit()
time.sleep(1)
except Exception as e:
print(f"Error with {table}: {str(e)}")
conn.execute(text("ROLLBACK"))
except Exception as e:
print(f"Fatal error: {str(e)}")
conn.execute(text("ROLLBACK"))
# Execute
clean_database(engine)us_congress_members dataset from SQL# Recreate the connection
conn = engine.connect()
# Rollback any existing transactions
conn.rollback()
# Drop the tables if they already exist
conn.execute(text('''
DROP TABLE IF EXISTS us_congress_members CASCADE
'''))
conn.commit()
# Import "races"
us_congress_members.to_sql('us_congress_members',
con = conn,
if_exists='replace',
index=False)
conn.commit()
# Check if the table exists
pd.read_sql_query("SELECT * FROM us_congress_members", conn)| member_id | full_name | last_name | member_title | state | party_name | chamber | |
|---|---|---|---|---|---|---|---|
| 0 | 0 | A. Donald McEachin | McEachin | Representative | Virginia | Democratic | House |
| 1 | 1 | Aaron Schock | Schock | Representative | Illinois | Republican | House |
| 2 | 2 | Abby Finkenauer | Finkenauer | Representative | Iowa | Democratic | House |
| 3 | 3 | Abigail Davis Spanberger | Spanberger | Representative | Virginia | Democratic | House |
| 4 | 4 | Adam H. Putnam | Putnam | Representative | Florida | Republican | House |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1806 | 1806 | Wm. Lacy Clay | Clay | Representative | Missouri | Democratic | House |
| 1807 | 1807 | Wyche, Jr. Fowler | Fowler | Senator | Georgia | Democratic | House |
| 1808 | 1808 | Wyche, Jr. Fowler | Fowler | Senator | Georgia | Democratic | Senate |
| 1809 | 1809 | Yvette D. Clarke | Clarke | Representative | New York | Democratic | House |
| 1810 | 1810 | Zell Miller | Miller | Senator | Georgia | Democratic | Senate |
1811 rows × 7 columns
FROM is followed by the name of the primary datasetLEFT is followed by the name of the secondary datasetON is the id variable used for mergingus_congress_members dataset with the bills_actions dataset# Restart the connection
conn = engine.connect()
# Rollback any existing transactions
conn.rollback()
# Merge the tables
results = pd.read_sql('''
SELECT *
FROM us_congress_members
LEFT JOIN bills_actions
ON us_congress_members.member_id = bills_actions.member_id
''', conn)
print(results.head())
conn.close() member_id full_name last_name member_title state \
0 0 A. Donald McEachin McEachin Representative Virginia
1 1 Aaron Schock Schock Representative Illinois
2 2 Abby Finkenauer Finkenauer Representative Iowa
3 2 Abby Finkenauer Finkenauer Representative Iowa
4 3 Abigail Davis Spanberger Spanberger Representative Virginia
party_name chamber congress bill_number bill_type \
0 Democratic House NaN NaN None
1 Republican House NaN NaN None
2 Democratic House 116.0 3.0 hr
3 Democratic House 116.0 7617.0 hr
4 Democratic House 116.0 1644.0 hr
action main_action \
0 None None
1 None None
2 H.Amdt.719 Amendment (A009) offered by Ms. Fin... house amendment offered
3 H.Amdt.870 Amendment (A012) offered by Ms. Fin... house amendment offered
4 H.Amdt.167 Amendment (A011) offered by Ms. Spa... house amendment offered
object member_id
0 None NaN
1 None NaN
2 amendment 2.0
3 amendment 2.0
4 amendment 3.0