QTM 350 - Data Science Computing

Lecture 18 - SQL and Python

Danilo Freire

Emory University

06 October, 2024

Hello! Great to see you again! 😊

Lecture overview 📚

Lecture overview

Last time we learned:

  • What are relational databases and why they are still dominant in data science
  • How to install and run PostgreSQL and pgAdmin
  • How to create databases, tables, and insert data
  • How to perform queries with SELECT, WHERE and ORDER BY
  • Group data with GROUP BY and filter groups with HAVING
  • Compute statistics with COUNT, SUM, AVG, MIN, and MAX
  • That’s quite a lot! 🤓
  • By the way, any issues with the installation? 🤔

Today we will learn:

  • How to connect SQL with Python, because we love Python! 🐍 😂
  • We will use the psycopg2 and sqlalchemy libraries (and pandas too! 🐼)
  • We will also learn about other SQL commands, such as LIKE, IN, BETWEEN, CASE, window functions, and string functions like LEFT, RIGHT, and LENGTH
  • How to fill missing data with COALESCE
  • If time allows, we will also see how to pivot tables in PostgreSQL
  • Let’s get started! 🚀

Connecting SQL with Python 🐍

Why connect SQL with Python?

  • SQL is great for data manipulation and querying, but it is not a programming language
  • So it is useful to connect SQL with a programming language that allows for more complex operations, data visualisation, and machine learning
  • Do you know any programming language that does that? Of course you do! 😂
  • So let’s see how to connect SQL with Python! 🐍
  • First, let’s install the psycopg2 and sqlalchemy libraries
pip install psycopg2 sqlalchemy

# or
# conda install psycopg2 sqlalchemy

  • psycopg2 is a PostgreSQL adapter for Python (more info here)
  • It allows you to connect to a PostgreSQL database and execute SQL commands
  • sqlalchemy is a SQL for Python, and it is database-agnostic
  • More information about it here
  • Both of them are very popular and work well together (and with other Python libraries too)

Connecting to a PostgreSQL database

  • sqlalchemy generates SQL statements and psycopg2 sends SQL statements to the database
  • So let’s load the libraries! 🤓
import psycopg2
from sqlalchemy import create_engine, text
  • Now let’s connect to the database
  • We need the database name, user, password, and host
# Connection parameters
engine = create_engine('postgresql+psycopg2://postgres:postgres@localhost:5432/postgres')
connection = engine.connect()
  • Let’s see what every part of the connection string means
  • create_engine() creates a connection to the database
  • postgresql+psycopg2 is the database driver
  • postgres:postgres is the username and password
  • localhost:5432 is the host and port
  • postgres is the database name

SQL commands in Python 🐍

Executing SQL commands in Python

  • Now that we have connected to the database, we can execute SQL commands
  • We can use the execute() method of the connection object
  • Let’s see an example
# Create drivers table
connection.execute(text('''
DROP TABLE IF EXISTS drivers;
CREATE TABLE drivers (
    driver_id SERIAL PRIMARY KEY,
    driver_name VARCHAR(50),
    team VARCHAR(50),
    nationality VARCHAR(50),
    victories INT
);
'''))
connection.commit()
  • The execute() method receives a string with the SQL command
  • We can use triple quotes to write multi-line strings
  • Let’s add some data to the table
# Add data to the drivers table
connection.execute(text('''
INSERT INTO drivers (driver_name, team, nationality, victories)
VALUES 
    ('Lewis Hamilton', 'Mercedes', 'British', 103),
    ('Max Verstappen', 'Red Bull Racing', 'Dutch', 55),
    ('Charles Leclerc', 'Ferrari', 'Monégasque', 5),
    ('Fernando Alonso', 'Aston Martin', NULL, NULL)
'''))
connection.commit()
  • Now let’s check if the data were inserted correctly
  • We can use the SELECT command for that, and fetchall() to retrieve the data
connection.execute(text('SELECT * FROM drivers')).fetchall()
[(1, 'Lewis Hamilton', 'Mercedes', 'British', 103),
 (2, 'Max Verstappen', 'Red Bull Racing', 'Dutch', 55),
 (3, 'Charles Leclerc', 'Ferrari', 'Monégasque', 5),
 (4, 'Fernando Alonso', 'Aston Martin', None, None)]

IN and BETWEEN operators

  • We use IN to filter data based on a list of values
connection.execute(text('''
SELECT * FROM drivers WHERE team IN ('Ferrari', 'Mercedes');
''')).fetchall()
[(1, 'Lewis Hamilton', 'Mercedes', 'British', 103),
 (3, 'Charles Leclerc', 'Ferrari', 'Monégasque', 5)]
  • BETWEEN is used to filter data within a range
connection.execute(text('''
SELECT * FROM drivers WHERE victories BETWEEN 50 AND 100;
''')).fetchall()
[(2, 'Max Verstappen', 'Red Bull Racing', 'Dutch', 55)]
  • We can also use NOT IN and NOT BETWEEN to exclude values
connection.execute(text('''
SELECT * FROM drivers WHERE team NOT IN ('Ferrari', 'Mercedes');
''')).fetchall()
[(2, 'Max Verstappen', 'Red Bull Racing', 'Dutch', 55),
 (4, 'Fernando Alonso', 'Aston Martin', None, None)]

LIKE, ILIKE, and SIMILAR TO operators

  • We can use the LIKE operator to filter data based on patterns
  • % is a wildcard that matches any sequence of characters
  • _ is a wildcard that matches any single character
connection.execute(text('''
SELECT * FROM drivers WHERE driver_name LIKE 'L%';
''')).fetchall()
[(1, 'Lewis Hamilton', 'Mercedes', 'British', 103)]
connection.execute(text('''
SELECT * FROM drivers WHERE team LIKE 'M______s';
''')).fetchall()
[(1, 'Lewis Hamilton', 'Mercedes', 'British', 103)]
  • We can also use NOT LIKE to exclude patterns
connection.execute(text('''
SELECT * FROM drivers WHERE driver_name NOT LIKE 'L%';
''')).fetchall()
[(2, 'Max Verstappen', 'Red Bull Racing', 'Dutch', 55),
 (3, 'Charles Leclerc', 'Ferrari', 'Monégasque', 5),
 (4, 'Fernando Alonso', 'Aston Martin', None, None)]
  • ILIKE is the case-insensitive version of LIKE
connection.execute(text('''
SELECT * FROM drivers WHERE driver_name ILIKE 'l%';
''')).fetchall()
[(1, 'Lewis Hamilton', 'Mercedes', 'British', 103)]
  • SIMILAR TO is another operator that allows for more complex patterns
connection.execute(text('''
SELECT * FROM drivers WHERE driver_name SIMILAR TO 'L[a-z]%';
''')).fetchall()
[(1, 'Lewis Hamilton', 'Mercedes', 'British', 103)]
connection.execute(text('''
SELECT * FROM drivers
WHERE team SIMILAR TO '%(Racing|Martin)';
''')).fetchall()
[(2, 'Max Verstappen', 'Red Bull Racing', 'Dutch', 55),
 (4, 'Fernando Alonso', 'Aston Martin', None, None)]

Try it yourself! 🤓

  • Now it’s your turn to practice!
  • Find all drivers whose names start with ‘M’
  • Find drivers whose nationality contain exactly 7 characters
  • List drivers whose names start with either ‘L’ or ‘M’
  • Find drivers who have won between 1 and 10 races
  • Appendix 01

Missing data

IS NULL and COALESCE operators

  • We can also use IS NULL to check for missing values
connection.execute(text('''
SELECT * FROM drivers WHERE victories IS NULL;
''')).fetchall()
[(4, 'Fernando Alonso', 'Aston Martin', None, None)]
  • We can fill missing values with COALESCE
  • It is a simple impuation method that replaces missing values with a constant
  • The syntax is COALESCE(column_name, value)
connection.execute(text('''
SELECT driver_name, COALESCE(victories, 30) FROM drivers;
''')).fetchall()
[('Lewis Hamilton', 103),
 ('Max Verstappen', 55),
 ('Charles Leclerc', 5),
 ('Fernando Alonso', 30)]

COALESCE with subqueries

  • We can also use COALESCE with subqueries
  • Subqueries are used when we need to compute a value based on another query, usually for aggregation purposes
  • In this example, we compute the average number of victories and use it to fill missing values (rounding to 2 decimal places)
  • Note the use of ::int to convert the result to an integer
connection.execute(text('''
SELECT 
  driver_name,
  COALESCE(victories, 
    (SELECT AVG(victories)::int
     FROM drivers 
     WHERE victories IS NOT NULL)) 
FROM drivers;
''')).fetchall()
[('Lewis Hamilton', 103),
 ('Max Verstappen', 55),
 ('Charles Leclerc', 5),
 ('Fernando Alonso', 54)]

Window functions

Window functions

  • Window functions are used to compute values based on a group of rows
  • They are similar to aggregation functions, but they do not reduce the number of rows
  • The syntax is function() OVER (PARTITION BY column ORDER BY column)
  • We can use ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), and many others
connection.execute(text('''
INSERT INTO drivers (driver_name, team, nationality, victories)
VALUES 
    ('Valtteri Bottas', 'Mercedes', 'Finnish', 10),
    ('Sergio Perez', 'Red Bull Racing', 'Mexican', 5),
    ('Lando Norris', 'McLaren', 'British', 2),
    ('Esteban Ocon', 'Ferrari', 'French', 1)
''')) 

connection.execute(text('''
SELECT driver_name, team, victories,
    ROUND(AVG(victories) OVER (), 2) AS avg_victories,
    ROUND(AVG(victories) OVER (PARTITION BY team), 2) AS avg_victories_team,
    RANK() OVER (ORDER BY victories DESC) AS rank
FROM drivers
ORDER BY victories DESC;
''')).fetchall()
[('Fernando Alonso', 'Aston Martin', None, Decimal('25.86'), None, 1),
 ('Lewis Hamilton', 'Mercedes', 103, Decimal('25.86'), Decimal('56.50'), 2),
 ('Max Verstappen', 'Red Bull Racing', 55, Decimal('25.86'), Decimal('30.00'), 3),
 ('Valtteri Bottas', 'Mercedes', 10, Decimal('25.86'), Decimal('56.50'), 4),
 ('Charles Leclerc', 'Ferrari', 5, Decimal('25.86'), Decimal('3.00'), 5),
 ('Sergio Perez', 'Red Bull Racing', 5, Decimal('25.86'), Decimal('30.00'), 5),
 ('Lando Norris', 'McLaren', 2, Decimal('25.86'), Decimal('2.00'), 7),
 ('Esteban Ocon', 'Ferrari', 1, Decimal('25.86'), Decimal('3.00'), 8)]

Window functions versus GROUP BY

GROUP BY

  • GROUP BY reduces the number of rows in the result set
  • It is used to aggregate data based on a column
  • It is useful to compute statistics for groups of data, but it is not as flexible as window functions
connection.execute(text('''
SELECT team, ROUND(AVG(victories), 2) AS avg_victories
FROM drivers
GROUP BY team;
''')).fetchall()
[('McLaren', Decimal('2.00')),
 ('Red Bull Racing', Decimal('30.00')),
 ('Ferrari', Decimal('3.00')),
 ('Mercedes', Decimal('56.50')),
 ('Aston Martin', None)]

Window functions

  • Window functions can have multiple partitions and orderings, and they are particularly useful for time series data, when we need to compute values based on a time window
  • For instance, they can be used to calculate moving averages, cumulative sums, and other statistics
connection.execute(text('''
SELECT 
    driver_name, 
    team, 
    victories,
    ROUND(AVG(victories) OVER (PARTITION BY team), 2) AS avg_victories_team,
    RANK() OVER (ORDER BY victories DESC) AS rank,
    RANK() OVER (PARTITION BY team ORDER BY victories DESC) AS rank_team
FROM drivers
WHERE victories IS NOT NULL
ORDER BY rank ASC;
''')).fetchall()
[('Lewis Hamilton', 'Mercedes', 103, Decimal('56.50'), 1, 1),
 ('Max Verstappen', 'Red Bull Racing', 55, Decimal('30.00'), 2, 1),
 ('Valtteri Bottas', 'Mercedes', 10, Decimal('56.50'), 3, 2),
 ('Charles Leclerc', 'Ferrari', 5, Decimal('3.00'), 4, 1),
 ('Sergio Perez', 'Red Bull Racing', 5, Decimal('30.00'), 4, 2),
 ('Lando Norris', 'McLaren', 2, Decimal('2.00'), 6, 1),
 ('Esteban Ocon', 'Ferrari', 1, Decimal('3.00'), 7, 2)]

Try it yourself! 🤓

  • Select the variables driver_name, nationality, and victories from the drivers table
  • Create a column called rank_nationality that ranks drivers based on the number of victories and their nationality
  • Drop missing values in the victories column
  • Order the results by rank_nationality
  • Appendix 02

String functions

String functions

  • SQL has several string functions that can be used to manipulate text data
  • We can use LEFT(), RIGHT(), LENGTH(), UPPER(), LOWER(), INITCAP(), TRIM(), REPLACE(), and many others
  • They are usually very similar to their Python counterparts (and straightforward to use!)
connection.execute(text('''
SELECT driver_name, 
    LEFT(driver_name, 5) AS first_name,
    RIGHT(driver_name, 8) AS last_name,
    LENGTH(driver_name) AS name_length,
    UPPER(driver_name) AS upper_name,
    LOWER(driver_name) AS lower_name,
    INITCAP(driver_name) AS initcap_name
FROM drivers;
''')).fetchall()
[('Lewis Hamilton', 'Lewis', 'Hamilton', 14, 'LEWIS HAMILTON', 'lewis hamilton', 'Lewis Hamilton'),
 ('Max Verstappen', 'Max V', 'rstappen', 14, 'MAX VERSTAPPEN', 'max verstappen', 'Max Verstappen'),
 ('Charles Leclerc', 'Charl', ' Leclerc', 15, 'CHARLES LECLERC', 'charles leclerc', 'Charles Leclerc'),
 ('Fernando Alonso', 'Ferna', 'o Alonso', 15, 'FERNANDO ALONSO', 'fernando alonso', 'Fernando Alonso'),
 ('Valtteri Bottas', 'Valtt', 'i Bottas', 15, 'VALTTERI BOTTAS', 'valtteri bottas', 'Valtteri Bottas'),
 ('Sergio Perez', 'Sergi', 'io Perez', 12, 'SERGIO PEREZ', 'sergio perez', 'Sergio Perez'),
 ('Lando Norris', 'Lando', 'o Norris', 12, 'LANDO NORRIS', 'lando norris', 'Lando Norris'),
 ('Esteban Ocon', 'Esteb', 'ban Ocon', 12, 'ESTEBAN OCON', 'esteban ocon', 'Esteban Ocon')]

String functions

  • We can also use TRIM() to remove leading and trailing spaces
  • STRPOS() returns the position of a substring in a string
  • REPLACE() is used to replace a substring with another substring
  • CONCAT() concatenates strings
connection.execute(text('''
SELECT driver_name, 
    TRIM(driver_name) AS trimmed_name,
    STRPOS(driver_name, 'a') AS a_position,
    REPLACE(driver_name, ' ', '_') AS replaced_name,
    CONCAT(driver_name, ' is a great driver!') AS message
FROM drivers;
''')).fetchall()
[('Lewis Hamilton', 'Lewis Hamilton', 8, 'Lewis_Hamilton', 'Lewis Hamilton is a great driver!'),
 ('Max Verstappen', 'Max Verstappen', 2, 'Max_Verstappen', 'Max Verstappen is a great driver!'),
 ('Charles Leclerc', 'Charles Leclerc', 3, 'Charles_Leclerc', 'Charles Leclerc is a great driver!'),
 ('Fernando Alonso', 'Fernando Alonso', 5, 'Fernando_Alonso', 'Fernando Alonso is a great driver!'),
 ('Valtteri Bottas', 'Valtteri Bottas', 2, 'Valtteri_Bottas', 'Valtteri Bottas is a great driver!'),
 ('Sergio Perez', 'Sergio Perez', 0, 'Sergio_Perez', 'Sergio Perez is a great driver!'),
 ('Lando Norris', 'Lando Norris', 2, 'Lando_Norris', 'Lando Norris is a great driver!'),
 ('Esteban Ocon', 'Esteban Ocon', 6, 'Esteban_Ocon', 'Esteban Ocon is a great driver!')]

CASE statement
(If-Then-Else)

CASE statement

  • SQL does not have an IF statement, but we can use the CASE statement
  • It works exactly like if statements in other programming languages
  • The syntax is CASE WHEN condition THEN value ELSE value END
connection.execute(text('''
SELECT driver_name, 
    CASE 
        WHEN victories > 50 THEN 'Great driver'
        ELSE 'Good driver'
    END
FROM drivers;
''')).fetchall()
[('Lewis Hamilton', 'Great driver'),
 ('Max Verstappen', 'Great driver'),
 ('Charles Leclerc', 'Good driver'),
 ('Fernando Alonso', 'Good driver'),
 ('Valtteri Bottas', 'Good driver'),
 ('Sergio Perez', 'Good driver'),
 ('Lando Norris', 'Good driver'),
 ('Esteban Ocon', 'Good driver')]
  • We can also use CASE with multiple conditions
  • The AS keyword is used to rename the column
connection.execute(text('''
SELECT driver_name, 
    CASE 
        WHEN victories > 50 THEN 'Great driver'
        WHEN victories < 10 THEN 'Average driver'
        ELSE 'No data'
    END AS driver_status
FROM drivers;
''')).fetchall()
[('Lewis Hamilton', 'Great driver'),
 ('Max Verstappen', 'Great driver'),
 ('Charles Leclerc', 'Average driver'),
 ('Fernando Alonso', 'No data'),
 ('Valtteri Bottas', 'No data'),
 ('Sergio Perez', 'Average driver'),
 ('Lando Norris', 'Average driver'),
 ('Esteban Ocon', 'Average driver')]
  • CASE can be combined with IN and BETWEEN operators
  • It is a very versatile statement! 🤓
connection.execute(text('''
SELECT driver_name, 
    CASE 
        WHEN victories IN (5, 54) THEN 'Good driver'
        WHEN victories BETWEEN 55 AND 110 THEN 'Great driver'
        ELSE 'No data'
    END AS driver_status
FROM drivers;
''')).fetchall()
[('Lewis Hamilton', 'Great driver'),
 ('Max Verstappen', 'Great driver'),
 ('Charles Leclerc', 'Good driver'),
 ('Fernando Alonso', 'No data'),
 ('Valtteri Bottas', 'No data'),
 ('Sergio Perez', 'Good driver'),
 ('Lando Norris', 'No data'),
 ('Esteban Ocon', 'No data')]

CASE for missing values

  • CASE can also be useful to fill missing values with conditions
  • Let’s add some missing values to the table
connection.execute(text('''
SELECT * FROM drivers WHERE victories IS NULL;
''')).fetchall()
[(4, 'Fernando Alonso', 'Aston Martin', None, None)]
connection.execute(text('''
SELECT driver_name, 
    CASE 
        WHEN nationality IS NULL THEN 'Spanish'
        ELSE nationality
    END,
    CASE 
        WHEN victories IS NULL THEN 30
        ELSE victories
    END
FROM drivers
''')).fetchall()
[('Lewis Hamilton', 'British', 103),
 ('Max Verstappen', 'Dutch', 55),
 ('Charles Leclerc', 'Monégasque', 5),
 ('Fernando Alonso', 'Spanish', 30),
 ('Valtteri Bottas', 'Finnish', 10),
 ('Sergio Perez', 'Mexican', 5),
 ('Lando Norris', 'British', 2),
 ('Esteban Ocon', 'French', 1)]
  • Note the use of commas to separate the columns

Try it yourself! 🤓

  • Let’s see if you can use the CASE statement
  • Create a column called driver_level that classifies drivers as ‘Beginner’, ‘Intermediate’, or ‘Expert’ based on the number of victories
    • ‘Beginner’ if victories < 10
    • ‘Intermediate’ if victories between 10 and 100
    • ‘Expert’ if victories > 100
  • Appendix 03

Using SQL with pandas 🐼

Using SQL with pandas

  • We all know that pandas is great! 😎
  • And it can be even better when combined with SQL queries
  • We can use the read_sql() method to read data from a SQL query
import pandas as pd

# Read data from SQL query
df = pd.read_sql('SELECT * FROM drivers', connection)
df
driver_id driver_name team nationality victories
0 1 Lewis Hamilton Mercedes British 103.0
1 2 Max Verstappen Red Bull Racing Dutch 55.0
2 3 Charles Leclerc Ferrari Monégasque 5.0
3 4 Fernando Alonso Aston Martin None NaN
4 5 Valtteri Bottas Mercedes Finnish 10.0
5 6 Sergio Perez Red Bull Racing Mexican 5.0
6 7 Lando Norris McLaren British 2.0
7 8 Esteban Ocon Ferrari French 1.0
  • You can pass any SQL query to the read_sql() method
pd.read_sql(
  "SELECT * FROM drivers WHERE victories > 50 AND nationality = 'British'", 
  connection
)
driver_id driver_name team nationality victories
0 1 Lewis Hamilton Mercedes British 103


  • Yes, it is that simple! 🤓
  • And you all now want to kill me for not showing this before! 😂
  • Just connect to the database using sqlalchemy and use read_sql() to read the data if that’s what you want
  • Note the use of double quotes for the SQL query and single quotes for the strings
  • SQL only accepts single quotes, so be careful with that 😉

Using SQL with pandas

  • We can then use any pandas method to manipulate the data
df.groupby('team').agg(avg_victories=('victories', 'mean')).dropna()
avg_victories
team
Ferrari 3.0
McLaren 2.0
Mercedes 56.5
Red Bull Racing 30.0
  • Using .query() to filter data
df.query('victories > 50')
driver_id driver_name team nationality victories
0 1 Lewis Hamilton Mercedes British 103.0
1 2 Max Verstappen Red Bull Racing Dutch 55.0
  • We can also write data to a SQL table with the to_sql() method
df.to_sql('drivers_copy', connection, if_exists='replace', index=False)

# Check if the data were written correctly
pd.read_sql('SELECT * FROM drivers_copy', connection)
driver_id driver_name team nationality victories
0 1 Lewis Hamilton Mercedes British 103.0
1 2 Max Verstappen Red Bull Racing Dutch 55.0
2 3 Charles Leclerc Ferrari Monégasque 5.0
3 4 Fernando Alonso Aston Martin None NaN
4 5 Valtteri Bottas Mercedes Finnish 10.0
5 6 Sergio Perez Red Bull Racing Mexican 5.0
6 7 Lando Norris McLaren British 2.0
7 8 Esteban Ocon Ferrari French 1.0

Try it yourself! 🤓

  • Create a new SQL table called employees with the following columns:
    • employee_id (serial, primary key)
    • employee_name (varchar(50))
    • department (varchar(50))
    • salary (int)
  • Remember to drop the table if it already exists
  • Insert some data into the table and read it with pandas (you can see the example below)
  • Commit the changes
  • Use pandas to compute the average salary by department
# Create employees table first, then insert data
'Alice', 'HR', 5000
'Bob', 'IT', 6000
'Charlie', 'HR', 7000
'David', 'IT', 8000

Pivot tables in SQL 📄

Pivot tables in SQL

  • We have already seen how to pivot tables in pandas
  • But we can also pivot tables in SQL too
  • We can use the CASE function to pivot tables, which is similar to the pivot_table() method in pandas
  • Let’s create another table with student scores
connection.execute(text('''
DROP TABLE IF EXISTS student_scores;
CREATE TABLE student_scores (
    student_name VARCHAR(50),
    subject VARCHAR(50),
    score INTEGER,
    term VARCHAR(10)
);

INSERT INTO student_scores (student_name, subject, score, term)
VALUES 
    ('Alice', 'Maths', 90, 'Q1'),
    ('Alice', 'Maths', 80, 'Q2'),
    ('Alice', 'Science', 80, 'Q1'),
    ('Alice', 'Science', 75, 'Q2'),
    ('Bob', 'Maths', 80, 'Q1'),
    ('Bob', 'Maths', 100, 'Q2'),
    ('Bob', 'Science', 80, 'Q1'),
    ('Bob', 'Science', 70, 'Q2'),
    ('Charles', 'Maths', 70, 'Q1'),
    ('Charles', 'Maths', 75, 'Q2'),
    ('Charles', 'Science', 90, 'Q1'),
    ('Charles', 'Science', 85, 'Q2')
'''))

connection.commit()

connection.execute(text('SELECT * FROM student_scores')).fetchall()
[('Alice', 'Maths', 90, 'Q1'),
 ('Alice', 'Maths', 80, 'Q2'),
 ('Alice', 'Science', 80, 'Q1'),
 ('Alice', 'Science', 75, 'Q2'),
 ('Bob', 'Maths', 80, 'Q1'),
 ('Bob', 'Maths', 100, 'Q2'),
 ('Bob', 'Science', 80, 'Q1'),
 ('Bob', 'Science', 70, 'Q2'),
 ('Charles', 'Maths', 70, 'Q1'),
 ('Charles', 'Maths', 75, 'Q2'),
 ('Charles', 'Science', 90, 'Q1'),
 ('Charles', 'Science', 85, 'Q2')]

Pivot tables in SQL

  • Now let’s pivot the table using the CASE function
  • We will focus on the students’ scores in Math and Science
  • Let’s calculate their average scores for each subject
df2 = pd.read_sql('''
SELECT 
    student_name,
    AVG(CASE WHEN subject = 'Maths' THEN score END) as Maths,
    AVG(CASE WHEN subject = 'Science' THEN score END) as Science,
    AVG(score) as total
FROM student_scores
GROUP BY student_name
ORDER BY total DESC;
''', connection)

df2
student_name maths science total
0 Bob 90.0 75.0 82.50
1 Alice 85.0 77.5 81.25
2 Charles 72.5 87.5 80.00

Pivot tables in SQL

  • PostgreSQL also has a crosstab function that can be used to pivot tables
  • It is a bit more complex than the CASE function, but it works fine
  • The syntax is crosstab(text, text) and it requires the tablefunc extension
  • We can install the extension with the command below
#} eval: true
connection.execute(text('CREATE EXTENSION IF NOT EXISTS tablefunc;'))
connection.commit()

df3 = pd.read_sql('''
DROP TABLE IF EXISTS ct;
SELECT student_name, Maths, Science, (Maths + Science)/2 as Total
FROM crosstab(
    'SELECT student_name, subject, AVG(score)
     FROM student_scores
     GROUP BY student_name, subject
     ORDER BY student_name, subject',
    'SELECT DISTINCT subject FROM student_scores ORDER BY subject'
) AS ct (
    student_name VARCHAR, Maths NUMERIC, Science NUMERIC
);
''', connection)

df3
student_name maths science total
0 Alice 85.0 77.5 81.25
1 Bob 90.0 75.0 82.50
2 Charles 72.5 87.5 80.00

Pivot tables in SQL

  • Let’s break the code down
  • We first dropped the table ct if it already exists, just to avoid conflicts
    • DROP TABLE IF EXISTS ct;
  • We then used the crosstab function to pivot the table
    • SELECT student_name, subject, AVG(score) FROM student_scores GROUP BY student_name, subject ORDER BY student_name, subject
    • SELECT DISTINCT subject FROM student_scores ORDER BY subject
  • We calculated the total score as the average of the scores in Maths and Science
  • The crosstab function requires two arguments
    • The first argument is the query to retrieve the data
    • The second argument is the query to retrieve the column names
  • We then used the AS keyword to rename the columns and assigned the result to a new table called ct
    • AS ct (student_name VARCHAR, Maths NUMERIC, Science NUMERIC)
  • And that’s it! 🤓

Conclusion 📚

Conclusion

  • SQL is cool, SQL + Python is even cooler! 😎
  • We can use psycopg2 and sqlalchemy to connect to a PostgreSQL database
  • We can execute SQL commands with execute() and fetchall()
  • Functions like IN, BETWEEN, LIKE, CASE, and window functions are very useful
  • There are also many string functions that can be used to manipulate text data
  • We can fill missing values with COALESCE and pivot tables with the CASE function or the crosstab function
  • And we can use pandas to read data from SQL queries and manipulate it

Source: Susan Ibach

And that’s all for today! 🚀

Have a great day! 😊

Appendix 01

  • Find all drivers whose names start with ‘M’
connection.execute(text('''
SELECT * FROM drivers WHERE driver_name LIKE 'M%';
''')).fetchall()
[(2, 'Max Verstappen', 'Red Bull Racing', 'Dutch', 55)]
  • Find drivers whose nationality contains exactly 7 characters
connection.execute(text('''
SELECT * FROM drivers WHERE nationality LIKE '_______';
''')).fetchall()
[(1, 'Lewis Hamilton', 'Mercedes', 'British', 103),
 (5, 'Valtteri Bottas', 'Mercedes', 'Finnish', 10),
 (6, 'Sergio Perez', 'Red Bull Racing', 'Mexican', 5),
 (7, 'Lando Norris', 'McLaren', 'British', 2)]
  • Or we can use the LENGTH function
connection.execute(text('''
SELECT * FROM drivers WHERE LENGTH(nationality) = 7;
''')).fetchall()
[(1, 'Lewis Hamilton', 'Mercedes', 'British', 103),
 (5, 'Valtteri Bottas', 'Mercedes', 'Finnish', 10),
 (6, 'Sergio Perez', 'Red Bull Racing', 'Mexican', 5),
 (7, 'Lando Norris', 'McLaren', 'British', 2)]

Back to exercise

Appendix 01

  • List drivers whose names start with either ‘L’ or ‘M’
connection.execute(text('''
SELECT * FROM drivers WHERE driver_name LIKE 'L%' OR driver_name LIKE 'M%';
''')).fetchall()
[(1, 'Lewis Hamilton', 'Mercedes', 'British', 103),
 (2, 'Max Verstappen', 'Red Bull Racing', 'Dutch', 55),
 (7, 'Lando Norris', 'McLaren', 'British', 2)]
  • Find drivers who have won between 10 and 50 races
connection.execute(text('''
SELECT * FROM drivers WHERE victories BETWEEN 1 AND 10;
''')).fetchall()
[(3, 'Charles Leclerc', 'Ferrari', 'Monégasque', 5),
 (5, 'Valtteri Bottas', 'Mercedes', 'Finnish', 10),
 (6, 'Sergio Perez', 'Red Bull Racing', 'Mexican', 5),
 (7, 'Lando Norris', 'McLaren', 'British', 2),
 (8, 'Esteban Ocon', 'Ferrari', 'French', 1)]

Back to exercise

Appendix 02

  • Create a column called rank_nationality that ranks drivers based on the number of victories and their nationality
connection.execute(text('''
SELECT driver_name, nationality, victories,
    RANK() OVER (PARTITION BY nationality ORDER BY victories DESC) AS rank_nationality
FROM drivers
WHERE victories IS NOT NULL
ORDER BY rank_nationality
''')).fetchall()
[('Valtteri Bottas', 'Finnish', 10, 1),
 ('Charles Leclerc', 'Monégasque', 5, 1),
 ('Max Verstappen', 'Dutch', 55, 1),
 ('Lewis Hamilton', 'British', 103, 1),
 ('Esteban Ocon', 'French', 1, 1),
 ('Sergio Perez', 'Mexican', 5, 1),
 ('Lando Norris', 'British', 2, 2)]

Back to exercise

Appendix 03

  • Create a column called driver_level that classifies drivers as ‘Beginner’, ‘Intermediate’, or ‘Expert’ based on the number of victories
    • Beginner if victories < 10
    • Intermediate if victories between 10 and 50
    • Expert if victories > 50
connection.execute(text('''
SELECT driver_name, 
    CASE 
        WHEN victories < 10 THEN 'Beginner'
        WHEN victories BETWEEN 10 AND 100 THEN 'Intermediate'
        ELSE 'Expert'
    END AS driver_level
FROM drivers;
''')).fetchall()
[('Lewis Hamilton', 'Expert'),
 ('Max Verstappen', 'Intermediate'),
 ('Charles Leclerc', 'Beginner'),
 ('Fernando Alonso', 'Expert'),
 ('Valtteri Bottas', 'Intermediate'),
 ('Sergio Perez', 'Beginner'),
 ('Lando Norris', 'Beginner'),
 ('Esteban Ocon', 'Beginner')]

Back to exercise

Appendix 04

  • Create a new SQL table called employees with the following columns:
    • employee_id (serial, primary key)
    • employee_name (varchar(50))
    • department (varchar(50))
    • salary (int)
  • Commit the changes
  • Insert some data into the table and read it with pandas (you can see the example below)
  • Use pandas to compute the average salary by department
connection.execute(text('''
DROP TABLE IF EXISTS employees;
CREATE TABLE employees (
    employee_id SERIAL PRIMARY KEY,
    employee_name VARCHAR(50),
    department VARCHAR(50),
    salary INT
);
'''))

connection.execute(text('''
INSERT INTO employees (employee_name, department, salary)
VALUES 
    ('Alice', 'HR', 5000),
    ('Bob', 'IT', 6000),
    ('Charlie', 'HR', 7000),
    ('David', 'IT', 8000)
'''))

connection.commit()

# Read data with pandas
pd.read_sql('SELECT * FROM employees', connection)

# Compute average salary by department
pd.read_sql('SELECT department, AVG(salary) FROM employees GROUP BY department', connection)
department avg
0 IT 7000.0
1 HR 6000.0

Back to exercise

Appendix 05

Cleaning all tables

  • I have encountered some issues with psycopg2 when trying to drop tables
  • So I created a function to clean all tables in the database
  • If you have a similar issue, you can use the function below
  • Replace the default values with your database name, user, and password
from sqlalchemy import create_engine, text, inspect
import time

engine = create_engine('postgresql+psycopg2://postgres:postgres@localhost:5432/postgres', future=True)
connection = engine.connect()

def clean_database(engine):
    with engine.connect() as conn:
        try:
            # Get inspector to check existing tables
            inspector = inspect(engine)
            existing_tables = inspector.get_table_names()
            
            if not existing_tables:
                print("No tables found in database")
                return
                
            print(f"Found {len(existing_tables)} tables: {existing_tables}")
            
            # Kill other connections
            conn.execute(text("""
                SELECT pg_terminate_backend(pid) 
                FROM pg_stat_activity 
                WHERE pid <> pg_backend_pid()
                AND datname = current_database()
            """))
            
            conn.execute(text("ROLLBACK"))
            conn.execute(text("SET statement_timeout = '30s'"))
            
            # Only drop tables that exist
            for table in existing_tables:
                try:
                    conn.execute(text(f"DROP TABLE IF EXISTS {table} CASCADE"))
                    print(f"Dropped {table}")
                    conn.commit()
                    time.sleep(1)
                except Exception as e:
                    print(f"Error with {table}: {str(e)}")
                    conn.execute(text("ROLLBACK"))
                    
        except Exception as e:
            print(f"Fatal error: {str(e)}")
            conn.execute(text("ROLLBACK"))

# Execute
clean_database(engine)

Back to the lecture