QTM 350 - Data Science Computing

Lecture 17 - Introduction to SQL

Danilo Freire

Emory University

30 October, 2024

Hey, there! 😊
I hope you’re all doing well!

Today’s lecture 📚

Today’s agenda

SQL essentials with PostgreSQL

  • Introduction to SQL: We will start by learning the fundamentals of SQL, focusing on PostgreSQL
  • Setup: You will learn how to:
  • Querying: We will cover basic queries to interact with your database
  • Connecting with VSCode: You will also learn how to connect to your database using VS Code

Structured Query Language (SQL)

SQL: A relational database management system (RDBMS)

  • SQL stands for Structured Query Language and it is used to interact with relational databases
  • What is a relational database?
    • A database that stores data in tables and allows you to define relationships between tables
    • The most common type of database
    • Examples: PostgreSQL, MySQL, SQLite, Oracle, SQL Server
  • Here we will focus on PostgreSQL
  • Why? It’s open-source, powerful, and widely used
  • According to the 2024 Stack Overflow survey, PostgreSQL is the world’s most popular database amongst developers
  • It has several advanced features too, such as JSON support, various backup options, and robust security

Let’s get started! 🤓

Installing PostgreSQL

  • Go to the PostgreSQL website
  • Download the installer for your operating system
  • Run the installer
  • Follow the installation instructions and accept the default settings
  • Make sure to remember the password you set for the postgres user (you can use postgres as the password)
  • No need to install the Stack Builder, pgAdmin will be enough for our purposes
  • pgAdmin is a graphical tool to manage your PostgreSQL databases and it comes installed with the interactive installer
  • If not, you can download it from the pgAdmin website

Click on the image to go to the website - There’s also a tutorial on our website

Default settings

  • The default settings for the PostgreSQL installation are:
    • Server: localhost
    • Port: 5432
    • User: postgres
    • Password: the one you set during the installation
    • Database: postgres
  • Why do we need to install PostgreSQL and pgAdmin?
    • PostgreSQL is the database management system
    • pgAdmin is a graphical tool to manage your databases
    • We will also see how to use the command line to interact with the database
    • But pgAdmin is more user-friendly

  • After the installation, you can just click on the pgAdmin icon on your desktop and you are good to go! 🚀
  • If you have any issues with this version of pgAdmin for MacOS, you can install it again directly from the official website

Any issues with the installation? 🤔

Let’s create our first SQL table! 🚀

Creating a table with pgAdmin

  • Open pgAdmin and click on postgres with the right mouse button, then click on Connect Server
  • You will be prompted to enter the password you set during the installation

Creating a table with pgAdmin

  • You will see a few icons below the postgres database
  • We’re on the right track! 😉

Creating a table with pgAdmin

  • Now click on Databases, then postgres below it. After this, click with the right mouse button and go to Query Tool

Creating a table with pgAdmin

  • You will see a new tab with a SQL editor
  • This is where we will write our SQL commands for now
  • Soon we will learn how to write SQL commands in VSCode and in Python!

Creating a database

  • We usually do not create databases using SQL queries
  • Most of the time, the data come from other sources, such as APIs, JSON or CSV files
  • But in small projects, it is common to create a database to store the data
  • So let’s create one right now! 😉
  • We will create a table with some information about Formula 1 drivers
  • System commands are in upper case and user-defined names are in lower case
DROP TABLE IF EXISTS drivers;

CREATE TABLE drivers(
    driver_id int,
    driver_name varchar(30),
    nationality varchar(15),
    victories int
);
  • What are we doing here?
  • Create empty TABLE called drivers
  • This example has five arguments separated by a comma (,) that contain the column_name and column_type, respectively
  • The driver_id is an int, the driver_name is a varchar with a maximum of 30 characters, the nationality is a varchar with a maximum of 15 characters, and the victories is an int
  • The driver_id is the primary key, which means it is unique for each driver
  • Paste that into the SQL editor and click on the Execute (▶️) button to run the code
NOTICE:  table "drivers" does not exist, skipping
CREATE TABLE

Query returned successfully in 88 msec.

Add rows to table

  • Now that we have created the table, let’s add some rows to it!
  • We can do this by using the INSERT INTO command
  • This command adds rows with VALUES to the bottom of the table
  • In SQL, different lines need to be separated by a semicolon (;)
  • As mentioned above, you write system commands in capital letters to easily distinguish column names from system commands
  • Let’s add some information about the drivers
INSERT INTO drivers VALUES (1, 'Lewis Hamilton','British', 103);

INSERT INTO drivers VALUES (4, 'Fernando Alonso', 'Spanish', 32);

INSERT INTO drivers VALUES (3, 'Sebastian Vettel', 'German', 91);

INSERT INTO drivers VALUES (2, 'Michael Schumacher', 'German', 53);
  • Now click on the Execute (▶️) button to run the code

Add rows to table

  • Click on the Data Output tab (bottom left) to see the results… but it’s empty! 🧐

:::

Querying the table

  • To see the data we just added, we need to query the table
SELECT * FROM drivers;
  • This command selects all columns (*) from the drivers table
  • Click on the Execute (▶️) button to run the code (you can also press F5)
  • Now you should see the data you just added to the table

Exercise 01

  • Create a table called students with the following columns:

    • student_id (int)
    • student_name (varchar, 30 characters)
    • major (varchar, 30 characters)
    • gpa (float)
  • Add the following rows to the table:

  • student_id: 1, student_name: ‘John Doe’, major: ‘Computer Science’, gpa: 3.5

  • student_id: 2, student_name: ‘Jane Doe’, major: ‘Mathematics’, gpa: 3.8

  • Query the table to see the results

  • Appendix 01

Connecting to the database in VSCode

SQL Notebook extension

  • We can also connect to the database using VSCode
  • There are many good extensions for SQL in VSCode, but we will use the SQL Notebook extension
  • Why this one? Because it not only works with PostgreSQL but also with other SQL databases
  • And it has the same look and feel as Jupyter Notebooks, which we are already familiar with 😊

SQL Notebook extension

Connecting to the database

  • Keep your pgAdmin open and let’s connected to the database
  • Click on the SQL Notebook icon on the left side of the screen
  • Then on New SQL Connection
  • Display name: localhost (or any name you want)
  • Database Driver: postgres
  • Database Host: localhost
  • Database Port: 5432
  • Database User: postgres
  • Database Password: the password you set during the installation
  • Database Name: postgres
  • Then click on Create

SQL Notebook extension

Connecting to the database

  • Now we only need to two two things:
    • Click on the icon to establish the connection
    • Open a .sql with SQL Notebook

  • Download the example.sql file from our website
  • Find the file in VSCode’s explorer and open it with the Open With menu option (right button)
  • Then, select the SQL Notebook format

SQL Notebook extension

SQL Notebook extension

Dataset Operations in SQL 📊

Extracting columns from table

SELECT extracts certain columns

  • The SELECT command is used to extract certain columns from a table
SELECT driver_id, nationality FROM drivers;

Subsetting rows

  • The WHERE command evaluates logical conditions to subset rows
SELECT * FROM drivers
WHERE nationality = 'German';

AND and OR conditions

OR conditions

  • Evaluate OR conditions
SELECT * FROM drivers
WHERE (nationality = 'Brazilian') OR (nationality = 'German');

AND conditions

  • Evaluate AND conditions
SELECT * FROM drivers
WHERE (nationality = 'German') AND (driver_id = 3);

Data description

Aggregate

  • To compute aggregate statistics use a statistic function, such as SUM() or AVG()
  • The AS is an alias which assigns a name to the aggregate statistic
SELECT SUM(victories) AS sum_victories,
       COUNT(*) AS num_rows,
       AVG(victories) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers;

Sort

  • Use ORDER BY to sort the data
  • Order in ascending order of victories with ASC (DESC for descending)
SELECT driver_name, victories FROM drivers
ORDER BY victories ASC;

Exercise 02

  • List drivers who are either ‘German’ or ‘British’ and have victories greater than 50
  • List all drivers with victories greater than the average victories
    • Hint: you will have to run another SELECT command to get the average victories (a subquery)
    • Aggregate functions are not allowed in the WHERE clause. See the reason why here. More about subqueries here.
    • We will see another way to do this soon (with HAVING)
  • Appendix 02

Groupby + Aggregate

Groupby

  • Use GROUP BY to group data by a column
SELECT nationality,
       SUM(victories) AS sum_victories,
       AVG(victories) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers
GROUP BY nationality;

Round

  • Use ROUND() to round the values
  • The first argument is the value to be rounded and the second is the number of decimal places
SELECT nationality,
       SUM(victories) AS sum_victories,
       ROUND(AVG(victories), 1) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers
GROUP BY nationality;

Filtering after aggregating with HAVING

  • Use HAVING to filter variables after aggregating
  • This differs from WHERE, which is used to filter variables before aggregating
SELECT nationality,
       SUM(victories) AS sum_victories,
       ROUND(AVG(victories), 1) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers
GROUP BY nationality
HAVING SUM(victories) > 50;
  • The command above computes the same table as the previous one
  • Then filters with HAVING

Python vs. SQL

  • Here are some equivalences between Python and SQL commands
SQL Command Pandas Equivalent
SELECT df[['column1', 'column2']]
INSERT INTO df.loc[new_index] = new_row or df = df.append(new_row, ignore_index=True)
WHERE df[df['column'] == value]
GROUP BY df.groupby('column')
HAVING df.groupby('column').filter(lambda x: condition)
ROUND() df['column'].round(decimals)
Aggregate Functions df.groupby('column').agg({'col1': 'sum', 'col2': 'mean', 'col3': 'min', 'col4': 'max'})


-- SQL
SELECT nationality,
       SUM(victories) AS sum_victories
FROM drivers
GROUP BY nationality
HAVING SUM(victories) > 10;
# Python
import pandas as pd

# Group by 'nationality' and calculate 'sum_victories' using agg()
result = (
    drivers.groupby('nationality')
    .agg(sum_victories=('victories', 'sum'))
    .reset_index()
)

# Apply 'HAVING' condition
result = result[result['sum_victories'] > 10]

Do you want to learn more? 🤓

And that’s it for today! 🚀

See you next time! 😊

Appendix 01

  • Here is the code to create the students table and add the rows
  • You can paste this code into the SQL editor and click on the Execute (▶️) button to run the code
DROP TABLE IF EXISTS students;

CREATE TABLE students(
    student_id int,
    student_name varchar(30),
    major varchar(30),
    gpa float
);

INSERT INTO students VALUES (1, 'John Doe', 'Computer Science', 3.5);

INSERT INTO students VALUES (2, 'Jane Doe', 'Mathematics', 3.8);

SELECT * FROM students;

Appendix 02

  • Here is the code to list drivers who are either ‘German’ or ‘British’ and have victories greater than 50
SELECT driver_name, nationality, victories
FROM drivers
WHERE (nationality = 'German' OR nationality = 'British') AND victories > 50;

SELECT driver_name, victories
FROM drivers
WHERE victories > (SELECT AVG(victories) FROM drivers);