QTM 151 - Introduction to Statistical Computing II

Lecture 17 - Introduction to SQL

Danilo Freire

Emory University

30 October, 2024

Hello, everyone! 😊
Nice to see you again!

Some reminders 🤓

Final Project 📖

  • Task: Write a report on a data analysis project based on the Formula 1 datasets
  • Groups: 3-4 students
  • Due date: December 9th
  • Submission: HTML file from Jupyter Notebook
  • Resources: GitHub repository
  • More information: Final Project Instructions

Other reminders 📝

Assignments

  • Assignment 08 is due this Friday, not today
  • Why? My mistake! I posted it in the wrong folder on Canvas 😅
  • So you guys have a few more days to submit it 😉
  • Assignment 09 will be posted later today

Quizzes

  • We will have three quizzes until the end of the semester
  • Why more quizzes now? Because you are more familiar with the content and are doing great! 🚀
  • The next quiz will be on November 4th (lecture 18) and it will cover Python, not SQL
  • As usual, I will post a repository link right before the quiz starts
  • But so far you guys are doing very well, congratulations! 🎉
  • So let’s move on to today’s topic: Introduction to SQL 📊🧑🏻‍💻👩🏼‍💻

What is SQL? 🤔

What is SQL?

  • SQL stands for Structured Query Language
  • It is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS)
  • SQL is widely used in industry and academia
  • Its main task is to extract data from a server
  • Many concepts are similar to Python
  • SQL is actually quite easy to learn and very powerful
    • It is a great skill to have in your toolbox! 🧰

PostgreSQL and pgAdmin 🐘

Getting started

  • Please check the PostgreSQL tutorial on our website
  • You can install PostgreSQL from the official website
  • If you install the Interactive installer by EDB, pgAdmin will come with it
  • During the installation, you may have to set a password for the default user, postgres. Please remember it, as you will need it to connect to the database
  • The default port for PostgreSQL is 5432 and the host is localhost (no need to change them)
  • The default username is postgres and the default database is postgres
  • No need to install the Stack Builder, pgAdmin will be enough for our purposes
  • If you do (there’s no harm in doing so), you don’t need to run it after the installation
  • Again, please check the tutorial as it includes step-by-step instructions and screenshots

Click on the image to go to the website

Installation

Installation

Installation

Installation

Installation

  • No need to launch Stack Builder after the installation

  • Then you can just click on the pgAdmin icon on your desktop and you are good to go! 🚀
  • If you have any issues with this version of pgAdmin for MacOS, you can install it again directly from the official website

Questions?
Did the installation work? 🤓

Let’s see how SQL works! 🧑🏻‍💻

Creating a table with pgAdmin

  • Open pgAdmin and click on postgres with the right mouse button, then click on Connect Server
  • You will be prompted to enter the password you set during the installation

Creating a table with pgAdmin

  • You will see a few icons below the postgres database
  • We’re on the right track! 😉

Creating a table with pgAdmin

  • Now click on Databases, then postgres below it. After this, click with the right mouse button and go to Query Tool

Creating a table with pgAdmin

  • You will see a new tab with a SQL editor
  • This is where we will write our SQL commands for now
  • Soon we will learn how to write SQL commands in VSCode and in Python!

Are you ready? 🚀

Let’s write some SQL code!

  • We will start by creating a table from scratch
  • But first, it is good practice to check if the table already exists and drop it if it does
  • This way we avoid errors and can start fresh 🤓
  • Since we’ve been talking about Formula 1, let’s create a table with some information about drivers
  • System commands are in upper case and user-defined names are in lower case
DROP TABLE IF EXISTS drivers;

CREATE TABLE drivers(
    driver_id int,
    driver_name varchar(30),
    nationality varchar(15),
    victories int
);
  • What are we doing here?
  • Create empty TABLE called drivers
  • This example has five arguments separated by a comma (,) that contain the column_name and column_type, respectively
  • The driver_id is an int, the driver_name is a varchar with a maximum of 30 characters, the nationality is a varchar with a maximum of 15 characters, and the victories is an int
  • The driver_id is the primary key, which means it is unique for each driver
  • Paste that into the SQL editor and click on the Execute (▶️) button to run the code
NOTICE:  table "drivers" does not exist, skipping
CREATE TABLE

Query returned successfully in 88 msec.

Add rows to table

  • Now that we have created the table, let’s add some rows to it!
  • We can do this by using the INSERT INTO command
  • This command adds rows with VALUES to the bottom of the table
  • In SQL, different lines need to be separated by a semicolon (;)
  • As mentioned above, you write system commands in capital letters to easily distinguish column names from system commands
  • Let’s add some information about the drivers
INSERT INTO drivers VALUES (1, 'Lewis Hamilton','British', 103);

INSERT INTO drivers VALUES (4, 'Fernando Alonso', 'Spanish', 32);

INSERT INTO drivers VALUES (3, 'Sebastian Vettel', 'German', 91);

INSERT INTO drivers VALUES (2, 'Michael Schumacher', 'German', 53);
  • Now click on the Execute (▶️) button to run the code

Add rows to table

  • Click on the Data Output tab (bottom left) to see the results… but there’s nothing there! 😱

:::

Nah, don’t worry 😅

  • The Data Output tab is empty because we need to run a SELECT command to see the data 😉
  • We will learn more about the SELECT command soon, but for now let’s just run the following command:
SELECT * FROM drivers;
  • This command selects all columns (*) from the drivers table
  • Click on the Execute (▶️) button to run the code (or press F5)
  • Now you should see the data you just added to the table! 🎉

Try it yourself! 🧠

  • Add two or three more rows to the drivers table on your own
  • Make sure to use a unique driver_id for each new driver
  • Remember to separate commands with a semicolon (;)
  • Here’s some code to help you get started:
INSERT INTO drivers VALUES (5, 'Max Verstappen', 'Dutch', 51); 

INSERT INTO drivers VALUES (6, 'Juan Pablo Montoya', 'Colombian', 6);

INSERT INTO drivers VALUES (7, 'Danilo Freire','Brazilian', 10);

Connecting to the database in VSCode

SQL Notebook extension

  • We can also connect to the database using VSCode
  • There are many good extensions for SQL in VSCode, but we will use the SQL Notebook extension
  • Why this one? Because it not only works with PostgreSQL but also with other SQL databases
  • And it has the same look and feel as Jupyter Notebooks, which we are already familiar with 😊

SQL Notebook extension

Connecting to the database

  • Keep your pgAdmin open and let’s connected to the database
  • Click on the SQL Notebook icon on the left side of the screen
  • Then on New SQL Connection
  • Display name: localhost (or any name you want)
  • Database Driver: postgres
  • Database Host: localhost
  • Database Port: 5432
  • Database User: postgres
  • Database Password: the password you set during the installation
  • Database Name: postgres
  • Then click on Create

SQL Notebook extension

Connecting to the database

  • Now we only need to two two things:
    • Click on the icon to establish the connection
    • Open a .sql with SQL Notebook

  • There is a 17-introduction-sql.sql file in the repository for this lecture
  • Open with the Open With menu option (right button). Then, select the SQL Notebook format

SQL Notebook extension

SQL Notebook extension

So far so good? 😉

Dataset Operations in SQL 📊

Extracting columns from table

SELECT extracts certain columns

  • The SELECT command is used to extract certain columns from a table
SELECT driver_id, nationality FROM drivers;

Subsetting rows

  • The WHERE command evaluates logical conditions to subset rows
SELECT * FROM drivers
WHERE nationality = 'German';

AND and OR conditions

OR conditions

  • Evaluate OR conditions
SELECT * FROM drivers
WHERE (nationality = 'Brazilian') OR (nationality = 'German');

AND conditions

  • Evaluate AND conditions
SELECT * FROM drivers
WHERE (nationality = 'German') AND (driver_id = 3);

Data description

Aggregate

  • To compute aggregate statistics use a statistic function, such as SUM() or AVG()
  • The AS is an alias which assigns a name to the aggregate statistic
SELECT SUM(victories) AS sum_victories,
       COUNT(*) AS num_rows,
       AVG(victories) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers;

Sort

  • Use ORDER BY to sort the data
  • Order in ascending order of victories with ASC (DESC for descending)
SELECT driver_name, victories FROM drivers
ORDER BY victories ASC;

Try it yourself! 🧠

  • Select nationality and driver_id for drivers with more than 60 victories
  • Write your own code! 🧠
  • Appendix 02

Groupby + Aggregate

Groupby

  • Use GROUP BY to group data by a column
SELECT nationality,
       SUM(victories) AS sum_victories,
       AVG(victories) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers
GROUP BY nationality;

Round

  • Use ROUND() to round the values
  • The first argument is the value to be rounded and the second is the number of decimal places
SELECT nationality,
       SUM(victories) AS sum_victories,
       ROUND(AVG(victories), 1) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers
GROUP BY nationality;

Filtering after aggregating with HAVING

  • Use HAVING to filter variables after aggregating
  • This differs from WHERE, which is used to filter variables before aggregating
SELECT nationality,
       SUM(victories) AS sum_victories,
       ROUND(AVG(victories), 1) AS mean_victories,
       MIN(victories) AS min_victories,
       MAX(victories) AS max_victories
FROM drivers
GROUP BY nationality
HAVING SUM(victories) > 50;
  • The command above computes the same table as the previous one
  • Then filters with HAVING

Python vs. SQL

  • Here are some equivalences between Python and SQL commands
SQL Command Pandas Equivalent
SELECT df[['column1', 'column2']]
INSERT INTO df.loc[new_index] = new_row or df = df.append(new_row, ignore_index=True)
WHERE df[df['column'] == value]
GROUP BY df.groupby('column')
HAVING df.groupby('column').filter(lambda x: condition)
ROUND() df['column'].round(decimals)
Aggregate Functions df.groupby('column').agg({'col1': 'sum', 'col2': 'mean', 'col3': 'min', 'col4': 'max'})


-- SQL
SELECT nationality,
       SUM(victories) AS sum_victories
FROM drivers
GROUP BY nationality
HAVING SUM(victories) > 10;
# Python
import pandas as pd

# Group by 'nationality' and calculate 'sum_victories' using agg()
result = (
    drivers.groupby('nationality')
    .agg(sum_victories=('victories', 'sum'))
    .reset_index()
)

# Apply 'HAVING' condition
result = result[result['sum_victories'] > 10]

Do you want to learn more? 🤓

And that’s it for today! 🚀

See you next time! 😊

Appendix 01

INSERT INTO drivers VALUES (5, 'Max Verstappen', 'Dutch', 51);

INSERT INTO drivers VALUES (6, 'Juan Pablo Montoya', 'Colombian', 6);

INSERT INTO drivers VALUES (7, 'Danilo Freire','Brazilian', 10);

Back to the exercise

Appendix 02

SELECT nationality, driver_id FROM drivers
WHERE victories > 60;

Back to the exercise