QTM 151 - Introduction to Statistical Computing II

Lecture 19 - SQL and Python

Danilo Freire

Emory University

06 October, 2024

Hello, everyone! 😊
How are you doing today?

An announcement 📢

Another announcement 📢

Final project

  • The project is due on December 9th, as we talked about before
  • Feel free to create your own groups of up to 4 people and send them to me via email
  • I will also send an announcement on Canvas about that
  • Students who are not in a group by the end of this week will be assigned to a group at random
  • So please, make sure you have a group soon 🤓
  • What do you think? Any questions? 😊

Brief recap of lecture 17 📚

Do you still remember what SQL is? 😂

  • SQL is a language used to interact with databases
  • We learned how to connect to a PostgreSQL database, use pgAdmin, and run SQL queries both with pgAdmin and the SQL Notebook extension in VS Code
  • We also saw some of SQL’s basic commands:
    • SELECT
    • FROM
    • WHERE
    • ORDER BY
    • GROUP BY
    • INSERT INTO
    • CREATE TABLE
    • DROP TABLE
  • And today we will see how to use SQL with Python! 🐍

SQL in Python 🐍

Necessary libraries 📦

  • psycopg2 is a PostgreSQL adapter for Python. Info here
    • It allows us to connect to a PostgreSQL database and run SQL queries
  • sqlalchemy is a SQL toolkit. More info here
    • It provides a set of high-level APIs to interact with databases
  • They are a bit cumbersome to install and run, but they are very powerful tools
  • You won’t need to use pgAdmin or SQL Notebook anymore, as you can run SQL queries directly in Python
  • SQL is a great language to interact with databases, but it does not have any support for data visualisation or analysis
  • But we can use Python to do all the other things that SQL cannot do 🐍

Installing psycopg2 and sqlalchemy 🐘

  • You can install psycopg2 and sqlalchemy using condas or pip
conda install psycopg2 sqlalchemy

# or

pip install psycopg2 sqlalchemy
  • The command line method is actually more reliable (and easier) than using the Anaconda Navigator
  • So you can try to do it yourself! Open a terminal in VS Code and run the command above 🤓

Connecting to a PostgreSQL database

  • So let’s load the libraries! 🤓
import psycopg2
from sqlalchemy import create_engine, text
  • Now let’s connect to the database
  • We need the database name, user, password, and host
# Connection parameters
engine = create_engine('postgresql+psycopg2://postgres:postgres@localhost:5432/postgres')
connection = engine.connect()
  • Let’s see what every part of the connection string means
  • create_engine() creates a connection to the database
  • postgresql+psycopg2 is the database driver
  • postgres:postgres is the username and password
  • localhost:5432 is the host and port
  • postgres is the database name

Load data and import them into a table

  • Let’s load some data into a table
import pandas as pd

results  = pd.read_csv("data_raw/results.csv")
races    = pd.read_csv("data_raw/races.csv")
circuits = pd.read_csv("data_raw/circuits.csv")
  • We will use the to_sql() method from pandas to import the data into PostgreSQL
  • We need to specify the table name and the connection
results.to_sql('results', connection, if_exists='replace', index=False)
races.to_sql('results', connection, if_exists='replace', index=False)
102
  • The if_exists='replace' parameter will replace the table if it already exists
    • Like DROP TABLE and CREATE TABLE in SQL
  • The index=False parameter will not import the index column

Try it yourself! 🤓

  • Import the circuits table into PostgreSQL
  • Appendix 01

Escape the SQL queries

  • When we work with strings (text), we need to ensure that the SQL query is correctly formatted
  • SQL can get confused with single and double quotes (I get confused too 😂)
  • Use a backslash (\) to define strins over multiple lines
  • Do not include a space after \, otherwise it won’t work
example_string = "This is a string \
                  defined over multiple lines"

print(example_string)
This is a string                   defined over multiple lines
  • Double quotes inside a string
example_double = "This will \"supposedly\" put double quotes inside a string"
print(example_double)
This will "supposedly" put double quotes inside a string
  • Triple quotes
example_triple = '''This is a string with 'single' and "double" quotes'''
print(example_triple)
This is a string with 'single' and "double" quotes

Try it yourself! 🤓

  • Print a string SELECT "driverId" FROM results; using backslash
  • Appendix 02

Data operations 📊

Selecting data

  • We can use the pd.read_sql_query() method to run SQL queries
  • A query is a string that contains the SQL command
  • We can use the text() method from sqlalchemy to format the query
  • Why do we use it? Because it is safer and more reliable
  • So if you’re going to run a query, you’re always safe using text()
pd.read_sql(text("SELECT * FROM results;"), connection).head(3)
raceId year round circuitId name date time url fp1_date fp1_time fp2_date fp2_time fp3_date fp3_time quali_date quali_time sprint_date sprint_time
0 1 2009 1 1 Australian Grand Prix 2009-03-29 06:00:00 http://en.wikipedia.org/wiki/2009_Australian_G... \N \N \N \N \N \N \N \N \N \N
1 2 2009 2 2 Malaysian Grand Prix 2009-04-05 09:00:00 http://en.wikipedia.org/wiki/2009_Malaysian_Gr... \N \N \N \N \N \N \N \N \N \N
2 3 2009 3 17 Chinese Grand Prix 2009-04-19 07:00:00 http://en.wikipedia.org/wiki/2009_Chinese_Gran... \N \N \N \N \N \N \N \N \N \N
  • Here are other examples of similar queries:
# Extract all data from a column
example1 = pd.read_sql(text("SELECT * FROM results;"), connection)

# Extract a subset of columns
example2 = pd.read_sql(text("SELECT points \
                             FROM results;"), connection)

# Subset based on a string condition
example3 = pd.read_sql(text("SELECT * \
                             FROM races \
                             WHERE name = 'Abu Dhabi Grand Prix';"), connection)

Try it yourself! 🤓

  • Select raceId and constructorId from results
  • Please convert the results.csv file to a table in PostgreSQL again, replacing the table if it already exists
  • Appendix 03

Appendix 01

  • Import the circuits table into PostgreSQL
circuits.to_sql('circuits', connection, if_exists='replace', index=False)
77
  • Check if the table was imported correctly
pd.read_sql(text('''
SELECT * FROM circuits
LIMIT 3 -- Show only the first 3 rows
'''), connection)
circuitId circuitRef name location country lat lng alt url
0 1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.84970 144.9680 10 http://en.wikipedia.org/wiki/Melbourne_Grand_P...
1 2 sepang Sepang International Circuit Kuala Lumpur Malaysia 2.76083 101.7380 18 http://en.wikipedia.org/wiki/Sepang_Internatio...
2 3 bahrain Bahrain International Circuit Sakhir Bahrain 26.03250 50.5106 7 http://en.wikipedia.org/wiki/Bahrain_Internati...

Back to exercise

Appendix 02

  • Print a string SELECT "driverId" FROM results; using backslash
example_query = "SELECT \"driverId\" FROM results;"
print(example_query)
SELECT "driverId" FROM results;

Back to exercise

Appendix 03

  • Select “raceId” and “constructorId” from results
  • Use the pd.read_sql_query() method to run the query
results.to_sql('results', connection, if_exists='replace', index=False)
pd.read_sql(text("SELECT \"raceId\", \"constructorId\" FROM results;"), connection).head()
raceId constructorId
0 18 1
1 18 2
2 18 3
3 18 4
4 18 1

Back to exercise