QTM 151 - Introduction to Statistical Computing II

Lecture 19 - SQL and Python

Danilo Freire

danilo.freire@emory.edu

Emory University

Hello, everyone! 😊
How are you doing today?

An announcement 📢

Another announcement 📢

Final project

The project is due on December 9th, as we talked about before
Feel free to create your own groups of up to 4 people and send them to me via email
I will also send an announcement on Canvas about that
Students who are not in a group by the end of this week will be assigned to a group at random
So please, make sure you have a group soon 🤓
What do you think? Any questions? 😊

Brief recap of lecture 17 📚

Do you still remember what SQL is? 😂

SQL is a language used to interact with databases
We learned how to connect to a PostgreSQL database, use pgAdmin, and run SQL queries both with pgAdmin and the SQL Notebook extension in VS Code
We also saw some of SQL’s basic commands:
- SELECT
- FROM
- WHERE
- ORDER BY
- GROUP BY
- INSERT INTO
- CREATE TABLE
- DROP TABLE
And today we will see how to use SQL with Python! 🐍

SQL in Python 🐍

Necessary libraries 📦

psycopg2 is a PostgreSQL adapter for Python. Info here
- It allows us to connect to a PostgreSQL database and run SQL queries
sqlalchemy is a SQL toolkit. More info here
- It provides a set of high-level APIs to interact with databases
They are a bit cumbersome to install and run, but they are very powerful tools
You won’t need to use pgAdmin or SQL Notebook anymore, as you can run SQL queries directly in Python
SQL is a great language to interact with databases, but it does not have any support for data visualisation or analysis
But we can use Python to do all the other things that SQL cannot do 🐍

Installing `psycopg2` and `sqlalchemy` 🐘

You can install psycopg2 and sqlalchemy using condas or pip

conda install psycopg2 sqlalchemy

# or

pip install psycopg2 sqlalchemy

The command line method is actually more reliable (and easier) than using the Anaconda Navigator
So you can try to do it yourself! Open a terminal in VS Code and run the command above 🤓

Connecting to a PostgreSQL database

So let’s load the libraries! 🤓

import psycopg2
from sqlalchemy import create_engine, text

Now let’s connect to the database
We need the database name, user, password, and host

# Connection parameters
engine = create_engine('postgresql+psycopg2://postgres:postgres@localhost:5432/postgres')
connection = engine.connect()

Let’s see what every part of the connection string means
create_engine() creates a connection to the database
postgresql+psycopg2 is the database driver
postgres:postgres is the username and password
localhost:5432 is the host and port
postgres is the database name

Load data and import them into a table

Let’s load some data into a table

import pandas as pd

results  = pd.read_csv("data_raw/results.csv")
races    = pd.read_csv("data_raw/races.csv")
circuits = pd.read_csv("data_raw/circuits.csv")

We will use the to_sql() method from pandas to import the data into PostgreSQL
We need to specify the table name and the connection

results.to_sql('results', connection, if_exists='replace', index=False)
races.to_sql('results', connection, if_exists='replace', index=False)

The if_exists='replace' parameter will replace the table if it already exists
- Like DROP TABLE and CREATE TABLE in SQL
The index=False parameter will not import the index column

Try it yourself! 🤓

Import the circuits table into PostgreSQL
Appendix 01

Escape the SQL queries

When we work with strings (text), we need to ensure that the SQL query is correctly formatted
SQL can get confused with single and double quotes (I get confused too 😂)
Use a backslash (\) to define strins over multiple lines
Do not include a space after \, otherwise it won’t work

example_string = "This is a string \
                  defined over multiple lines"

print(example_string)

This is a string                   defined over multiple lines

Double quotes inside a string

example_double = "This will \"supposedly\" put double quotes inside a string"
print(example_double)

This will "supposedly" put double quotes inside a string

Triple quotes

example_triple = '''This is a string with 'single' and "double" quotes'''
print(example_triple)

This is a string with 'single' and "double" quotes

Try it yourself! 🤓

Print a string SELECT "driverId" FROM results; using backslash
Appendix 02

Data operations 📊

Selecting data

We can use the pd.read_sql_query() method to run SQL queries
A query is a string that contains the SQL command
We can use the text() method from sqlalchemy to format the query
Why do we use it? Because it is safer and more reliable
So if you’re going to run a query, you’re always safe using text()

pd.read_sql(text("SELECT * FROM results;"), connection).head(3)

	raceId	year	round	circuitId	name	date	time	url	fp1_date	fp1_time	fp2_date	fp2_time	fp3_date	fp3_time	quali_date	quali_time	sprint_date	sprint_time
0	1	2009	1	1	Australian Grand Prix	2009-03-29	06:00:00	http://en.wikipedia.org/wiki/2009_Australian_G...	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N
1	2	2009	2	2	Malaysian Grand Prix	2009-04-05	09:00:00	http://en.wikipedia.org/wiki/2009_Malaysian_Gr...	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N
2	3	2009	3	17	Chinese Grand Prix	2009-04-19	07:00:00	http://en.wikipedia.org/wiki/2009_Chinese_Gran...	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N

Here are other examples of similar queries:

# Extract all data from a column
example1 = pd.read_sql(text("SELECT * FROM results;"), connection)

# Extract a subset of columns
example2 = pd.read_sql(text("SELECT points \
                             FROM results;"), connection)

# Subset based on a string condition
example3 = pd.read_sql(text("SELECT * \
                             FROM races \
                             WHERE name = 'Abu Dhabi Grand Prix';"), connection)

Try it yourself! 🤓

Select raceId and constructorId from results
Please convert the results.csv file to a table in PostgreSQL again, replacing the table if it already exists
Appendix 03

Appendix 01

Import the circuits table into PostgreSQL

circuits.to_sql('circuits', connection, if_exists='replace', index=False)

Check if the table was imported correctly

pd.read_sql(text('''
SELECT * FROM circuits
LIMIT 3 -- Show only the first 3 rows
'''), connection)

	circuitId	circuitRef	name	location	country	lat	lng	alt	url
0	1	albert_park	Albert Park Grand Prix Circuit	Melbourne	Australia	-37.84970	144.9680	10	http://en.wikipedia.org/wiki/Melbourne_Grand_P...
1	2	sepang	Sepang International Circuit	Kuala Lumpur	Malaysia	2.76083	101.7380	18	http://en.wikipedia.org/wiki/Sepang_Internatio...
2	3	bahrain	Bahrain International Circuit	Sakhir	Bahrain	26.03250	50.5106	7	http://en.wikipedia.org/wiki/Bahrain_Internati...

Back to exercise

Appendix 02

Print a string SELECT "driverId" FROM results; using backslash

example_query = "SELECT \"driverId\" FROM results;"
print(example_query)

SELECT "driverId" FROM results;

Back to exercise

Appendix 03

Select “raceId” and “constructorId” from results
Use the pd.read_sql_query() method to run the query

results.to_sql('results', connection, if_exists='replace', index=False)
pd.read_sql(text("SELECT \"raceId\", \"constructorId\" FROM results;"), connection).head()

	raceId	constructorId
0	18	1
1	18	2
2	18	3
3	18	4
4	18	1

Back to exercise

QTM 151 - Introduction to Statistical Computing II

Hello, everyone! 😊 How are you doing today?

An announcement 📢

Another announcement 📢

Final project

Brief recap of lecture 17 📚

Do you still remember what SQL is? 😂

SQL in Python 🐍

Necessary libraries 📦

Installing psycopg2 and sqlalchemy 🐘

Connecting to a PostgreSQL database

Load data and import them into a table

Try it yourself! 🤓

Escape the SQL queries

Try it yourself! 🤓

Data operations 📊

Selecting data

Try it yourself! 🤓

Appendix 01

Appendix 02

Appendix 03

Hello, everyone! 😊
How are you doing today?

Installing `psycopg2` and `sqlalchemy` 🐘