QTM 151 - Introduction to Statistical Computing II

Lecture 17 - Introduction to SQL with SQLite

Danilo Freire

Emory University

Hello, everyone! 😊
Nice to see you again!

Some reminders 🤓

Final Project 📖

  • Task: Write a report on a data analysis project based on the Formula 1 datasets
  • Groups: 3-4 students
  • Submission: HTML file from Jupyter Notebook
  • Resources: GitHub repository
  • More information: Final Project Instructions

Other reminders 📝

Quizzes

  • We will have three quizzes until the end of the semester
  • Why more quizzes now? Because you are more familiar with the content and are doing great! 🚀
  • The next quiz will cover Python, not SQL
  • As usual, I will post a repository link right before the quiz starts
  • But so far you guys are doing very well, congratulations! 🎉
  • Now let’s move on to today’s topic: Introduction to SQL with SQLite 📊🧑🏻‍💻👩🏼‍💻

What is SQL? 🤔

What is SQL?

  • SQL stands for Structured Query Language
  • It is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS)
  • SQL is widely used in industry and academia
  • Its main task is to extract data from a database
  • Many concepts are similar to Python
  • SQL is actually quite easy to learn and very powerful
    • It is a great skill to have in your toolbox! 🧰

SQLite 🪶

  • We will use SQLite as our database management system for this lecture.
  • SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine.
  • It’s the most used database engine in the world! 🌍
  • Key features:
    • Serverless: SQLite reads and writes directly to ordinary disk files. A complete SQL database with multiple tables, indices, triggers, and views, is contained in a single disk file.
    • Self-contained: Requires no external dependencies.
    • Transactional: ACID compliant, ensuring reliable transactions.
    • Zero-configuration: No setup or administration needed.
  • It’s excellent for learning SQL, local development, testing, and applications that need a simple, embedded database.

Getting started with SQLite

  • The best part about SQLite is that there’s usually no installation required! 🎉
  • It’s likely already included with your operating system or Python installation.
  • We will primarily interact with SQLite databases using:
    1. Python: Using the built-in sqlite3 module (we’ll see this later).
    2. VSCode Extensions: For browsing and querying .sqlite or .db files directly within VSCode.

VSCode Extension for SQLite

  • To easily view and manage SQLite database files, we recommend installing a VSCode extension.
  • A popular choice is SQLite by alexcvzz.
  • Go to the Extensions view (Ctrl+Shift+X or Cmd+Shift+X) and search for “SQLite”. Install the one by alexcvzz.
  • This extension allows you to:
    • Open .sqlite or .db files (it will create one if it doesn’t exist).
    • Explore the database by clicking the SQLite icon in the activity bar.
    • See the database schema (tables, columns).
    • Run SQL queries directly against the database file using .sql files or the query editor.

Searching for the SQLite extension in VSCode.

Ready to use SQLite?
Let’s create a database! 🧑🏻‍💻

Creating a database file and SQL script

  • Ensure you have the SQLite by alexcvzz extension installed.
  • Using the VSCode File Explorer, create a new, empty file named lecture17.sqlite in the current directory (lectures/lecture-17/). This file will hold our database.
  • Create another new file named 17-introduction-sql.sql (or use the existing one if you prefer, just clear its contents). This will be our SQL script file.
  • Open the 17-introduction-sql.sql file.
  • At the top of the editor, you should see a toolbar from the SQLite extension. Click the database icon (Choose Database) and select Attach existing database. Choose the lecture17.sqlite file you just created.
  • Now, any SQL command you run in this .sql file will be executed against the lecture17.sqlite database file.

Connecting the .sql file to the SQLite database file.

Let’s write some SQL code!

  • Now that our .sql file is connected to the lecture17.sqlite database, we can write and run SQL commands.
  • We will start by creating a table from scratch.
  • It is good practice to check if the table already exists and drop it if it does.
  • This way we avoid errors and can start fresh 🤓.
  • Since we’ve been talking about Formula 1, let’s create a table with some information about drivers.
  • System commands are often in upper case and user-defined names are often in lower case by convention.
-- Drop the table if it already exists
DROP TABLE IF EXISTS drivers;

-- Create the drivers table
CREATE TABLE drivers(
    driver_id INTEGER PRIMARY KEY, -- Use INTEGER PRIMARY KEY for auto-increment in SQLite
    driver_name TEXT,             -- TEXT is common for strings in SQLite
    nationality TEXT,
    victories INTEGER             -- INTEGER is the standard type for integers
);
  • What are we doing here?
  • DROP TABLE IF EXISTS drivers;: Removes the drivers table if it exists.
  • CREATE TABLE drivers(...): Creates an empty TABLE called drivers.
  • Inside the parentheses (...), we define the columns:
    • driver_id INTEGER PRIMARY KEY: An integer column that uniquely identifies each driver. In SQLite, this often acts as an auto-incrementing key.
    • driver_name TEXT: A column to store the driver’s name as text.
    • nationality TEXT: A column for the driver’s nationality.
    • victories INTEGER: A column for the number of victories.
  • Type the code above into the 17-introduction-sql.sql editor.
  • Select the code you want to run (or place the cursor within a command).
  • Right-click and choose Run Query or use the shortcut (Ctrl+Shift+Q or Cmd+Shift+Q).
  • Check the SQLite output panel for success messages.

Add rows to table

  • Now that we have created the table, let’s add some rows to it!
  • We can do this by using the INSERT INTO command.
  • This command adds rows with VALUES to the table.
  • In SQL, statements are typically separated by a semicolon (;).
  • Let’s add some information about the drivers into our .sql file:
-- Insert data into the drivers table
INSERT INTO drivers (driver_name, nationality, victories) VALUES ('Lewis Hamilton','British', 103);
INSERT INTO drivers (driver_name, nationality, victories) VALUES ('Fernando Alonso', 'Spanish', 32);
INSERT INTO drivers (driver_name, nationality, victories) VALUES ('Sebastian Vettel', 'German', 91); -- Note: Original had 91, corrected from 53
INSERT INTO drivers (driver_name, nationality, victories) VALUES ('Michael Schumacher', 'German', 53); -- Note: Original had 53, corrected from 91

Note: We don’t specify driver_id here; SQLite will auto-generate it because it’s an INTEGER PRIMARY KEY.

  • Select these INSERT statements in your .sql file and run them using Run Query.

Viewing the data

  • How do we see the data we just inserted? We use the SELECT command!
  • We will learn more about SELECT soon, but for now let’s run the following command in our .sql file:
-- Select all data from the drivers table
SELECT * FROM drivers;
  • This command selects all columns (*) from the drivers table.
  • Run this query (Ctrl+Shift+Q or Cmd+Shift+Q).
  • The results should appear in a new tab! 🎉

Result of SELECT FROM drivers.*

Try it yourself! 🧠

  • Add two or three more rows to the drivers table in your 17-introduction-sql.sql file.
  • Remember to only specify driver_name, nationality, and victories.
  • Run the INSERT statements.
  • Run SELECT * FROM drivers; again to see the updated table.
  • Here’s some code to help you get started:
INSERT INTO drivers (driver_name, nationality, victories) VALUES ('Max Verstappen', 'Dutch', 51);
INSERT INTO drivers (driver_name, nationality, victories) VALUES ('Juan Pablo Montoya', 'Colombian', 6);
INSERT INTO drivers (driver_name, nationality, victories) VALUES ('Danilo Freire','Brazilian', 10); -- Let's be optimistic! 😉

Interacting with the database in VSCode

Using the SQLite VSCode Extension

  • We’ve already seen how to connect a .sql file to our lecture17.sqlite database and run queries.
  • You write SQL commands directly in the .sql file editor.
  • To run a command (or selected commands), right-click in the editor and choose Run Query or use the shortcut (Ctrl+Shift+Q or Cmd+Shift+Q).
  • The results appear in a separate tab.

Running a query from the .sql file.

Exploring the Database

  • The SQLite extension also provides a database explorer view.
  • Click the SQLite icon in the VSCode Activity Bar (usually on the far left).
  • You should see your lecture17.sqlite database listed under the “SQLite Explorer” panel.
  • Expand it to see the tables (like drivers), their columns, and data types.
  • You can right-click on tables to perform actions like viewing the data (Show Table) or generating SELECT statements.

Using the SQLite Explorer in VSCode.

So far so good? 😉

Dataset Operations in SQL 📊

Extracting columns from table

SELECT extracts certain columns

  • The SELECT command is used to extract specific columns from a table.
SELECT driver_id, nationality FROM drivers;

Subsetting rows

  • The WHERE command evaluates logical conditions to filter rows.
SELECT * FROM drivers
WHERE nationality = 'German';

AND and OR conditions

OR conditions

  • Evaluate OR conditions to select rows matching either condition.
SELECT * FROM drivers
WHERE (nationality = 'Brazilian') OR (nationality = 'German');

AND conditions

  • Evaluate AND conditions to select rows matching both conditions.
SELECT * FROM drivers
WHERE (nationality = 'German') AND (victories > 60);

Other useful commands

ORDER BY

  • Sort results using ORDER BY. Default is ascending (ASC). Use DESC for descending.
SELECT * FROM drivers
ORDER BY victories DESC;

LIMIT

  • Limit the number of rows returned using LIMIT.
SELECT * FROM drivers
ORDER BY victories DESC
LIMIT 3; -- Get the top 3 drivers by victories

DISTINCT values

  • The DISTINCT keyword returns only unique values for a column.
SELECT DISTINCT nationality FROM drivers;

Basic calculations

  • You can perform calculations directly in SELECT. Use AS to name the result column.
SELECT
    driver_name,
    victories,
    victories * 1.1 AS victories_plus_10_percent
FROM drivers;

Aggregate functions

  • SQL has functions to summarize data: COUNT, SUM, AVG, MIN, MAX.
SELECT COUNT(*) AS total_drivers FROM drivers;

SELECT SUM(victories) AS total_victories FROM drivers;

SELECT AVG(victories) AS average_victories FROM drivers;

SELECT MAX(victories) AS max_victories FROM drivers;

GROUP BY

  • GROUP BY groups rows with the same values in specified columns, allowing aggregate functions per group.
SELECT
    nationality,
    COUNT(*) AS number_of_drivers,
    AVG(victories) AS avg_victories_per_nationality
FROM drivers
GROUP BY nationality
ORDER BY number_of_drivers DESC;

Questions? 🤔

Appendix

Appendix 01

  • Here’s the full table after adding the extra drivers from the exercise:
SELECT * FROM drivers ORDER BY driver_id;
driver_id driver_name nationality victories
1 Lewis Hamilton British 103
2 Fernando Alonso Spanish 32
3 Sebastian Vettel German 91
4 Michael Schumacher German 53
5 Max Verstappen Dutch 51
6 Juan Pablo Montoya Colombian 6
7 Danilo Freire Brazilian 10

Note: The exact driver_id values might differ if you dropped and recreated the table multiple times, but the relative order should be similar.