class: center, middle, inverse, title-slide .title[ # MKT615: Data Storytelling for Marketers ] .subtitle[ ## Lecture 7: MySQL ] .author[ ### Davide Proserpio ] .institute[ ### Marshall School of Business ] --- <style type="text/css"> @media print { .has-continuation { display: block !important; } } .large4 { font-size: 400% } .large2 { font-size: 200% } .small90 { font-size: 90% } .small75 { font-size: 75% } </style> # Table of contents 1. [Checklist](#check) 2. [Databases](#db) 3. [MySQL](#mysql) 4. [MySQL Commands](#commands) 5. [MySQL Operators](#operators) 6. [Connect to a DB using R](#R) 7. [Import CSV files](#csv) --- name: check # Checklist - Database management application for MySQL: [Sequel Pro](https://www.sequelpro.com/) (Mac) or [MySQL Workbench](https://www.mysql.com/products/workbench/) (Windows) - R libraries: **RMySQL**, **ggplot2**, **dplyr**, **data.table**, **ggthemes**, **tidyverse** --- class: inverse, center, middle name:db # Database <!-- <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --> --- # What is a database? - A database is an application that stores a collection of data in the form of tables - Each database has one or more distinct APIs for creating, accessing, managing, searching and replicating the data it holds. - Nowadays, we use Relational Database Management Systems (RDBMS) to store and manage huge volume of data. They are called relational database because all the data is stored into different tables and relations are established using keys (Primary or Foreign). - RDBMS makes it possible to analyze data in different ways while simultaneously keeping it organized in a systematic fashion, with minimal redundancy. --- name:mysql # MySQL - MySQL is a fast, easy-to-use RDBMS being used for many small and big businesses. MySQL is presently developed, distributed, and supported by Oracle Corporation. - MySQL is becoming so popular because of many good reasons: - MySQL is released under an open-source license. So you have nothing to pay to use it. - MySQL used for both small and large applications. - MySQL is fast, reliable, and flexible and easy to use. - MySQL supports standard SQL (Structured Query Language). - MySQL handles large databases. --- # Some important notions - **Database connection** - A database typically resides on a remote PC (host) - Before being able to do any operations on a DB, the client needs to connect to the PC hosting MySQL server using username and password. - Let's see how to do it with a database management application such as Sequel Pro or MySQL Workbench --- # Some important notions (cont.) - **Primary Key** - Uniquely identify a row in a Table. - Every table should have a primary key - Useful to avoid duplicates when updating the table --- # Some important notions (cont.) - **Indexes (extremely important!)** - Are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows. The larger the table, the more this costs. - If the table has an index for the columns in question, MySQL can quickly determine the position to seek to in the middle of the data file without having to look at all the data. This is much faster than reading every row sequentially. - Indexes can be create with one or more columns - So it is important to set indexes for the variables on which we are going to make some operation (join, filter, etc.) - Note that Primary Keys are indexes so there is no need to create an index for the column that is the Primary Key --- # Data types MySQL uses many different data types broken into three categories: 1. Numeric (int, float, decimal, etc) 2. Date and Time (date, datetime, timestamp, etc.) 3. String Types (varchar, text, etc.) [Here](https://www.tutorialspoint.com/mysql/mysql-data-types.htm) you can find a description of each of them --- # Database for today's practice - List of Yelp restaurant available in LA county - Three tables: 1. yelp_lacounty_businesses: list of restaurants, PK=business_id 2. yelp_lacounty_reviews: list of all restaurant reviews, PK= review_id 3. yelp_lacounty_users: list of all Yelp users that wrote the reviews in yelp_lacounty_reviews, PK=user_id - Note that I already created a bunch of indexes that are going to be useful today. - Let's get familiar with the application and the database before moving forward... --- class: inverse, center, middle name:commands # MySQL Commands <!-- <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --> --- # Basic syntax - MySQL is case insensitive but usually MySQL commands/operators/functions are upper case: SELECT, DROP, etc. - Comments are preceded by `#` or `--` . - `NULL` means that there is not value for a specif row or rows (PKs cannot be `NULL`). - Equality is `=` (not double equal). - You can create tables and columns aliases: `column1 AS c1`; `table1 as t1`. - Queries end with `;`. --- # Data Definition Language - CREATE - DROP - TRUNCATE - ALTER --- # Data Definition Language (cont.) **CREATE** - to create a database and its objects like table, index, primary key, etc. Examples - CREATE DATABASE: creates a new database ```sql CREATE DATABASE databasename;` ``` - CREATE TABLE: creates a new table ```sql CREATE TABLE table_name ( column_1 datatype, column_2 datatype, column_3 datatype ); ``` - CREATE INDEX: creates an index ```sql CREATE INDEX index_name ON table_name (column_name1, column_name2…); ``` --- # Data Definition Language (cont.) **DROP** - deletes objects from the database Examples - DROP DATABASE - delete a database ```sql DROP DATABASE database_name; ``` - DROP TABLE — deletes a table ```sql DROP TABLE table_name; ``` --- # Data Definition Language (cont.) **TRUNCATE** - remove all records from a table ```sql TRUNCATE table_name; ``` --- # Data Definition Language (cont.) - ALTER TABLE — modifies a table (add column, drop index, etc.) ```sql # Add a column ALTER TABLE table_name ADD column_name datatype; # Drop a column ALTER TABLE table_name DROP COLUMN column_name; # Drop an index ALTER TABLE table_name DROP INDEX index_name; ``` --- # Data Manipulation Language - SELECT - INSERT - UPDATE - DELETE --- # Data Manipulation Language (cont.) - SELECT - Extracts data from a database ```sql SELECT colum_name(s) FROM table_name ``` P.S.: `SELECT *` selects all the rows --- # Data Manipulation Language (cont.) - INSERT - Insert data into a table ```sql INSERT INTO table_name (column_1, column_2, column_3) VALUES (value_1, ‘value_2’, value_3); ``` --- # Data Manipulation Language (cont.) - UPDATE - updates existing data within a table ```sql UPDATE table_name SET some_column = some_value WHERE some_column = some_value; ``` --- # Data Manipulation Language (cont.) - DELETE - deletes data from a database ```sql DELETE FROM table_name WHERE some_column = some_value; ``` --- class: inverse, center, middle name:operators # MySQL Operators <!-- <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --> --- # Operators and functions Most of the data manipulation function can be combined with a MySQL operators and functions. We already saw `WHERE`, but there are many more. Let's see some of them. --- # Sorting the data - ORDER BY - order the data in ascending (ASC) or descending order (DESC) --- # Filtering the data - WHERE - filter rows based on specified conditions - DISTINCT - return distinct row values (eliminate duplicated records) - AND, OR - Boolean operators - IN, NOT IN - check if rows are (NOT) in a set of values. - BETWEEN - query data based on a range. - LIKE - query data based on a pattern. - LIMIT - limit the number of rows returned by a query. - IS NULL - test whether a value is NULL. --- # Filtering the data (cont.) Let's practice with what we have learned so far. 1. Create a query that returns all restaurant in the city of Los Angeles 2. Create a query that returns 1-star and 5-stars reviews. 3. Create a query that returns a list of unique restaurant cities. 4. Create a query that returns all Italian restaurants --- # Grouping the data - GROUP BY - group data by some variables. ```sql SELECT col1, max(col2), min(col3) FROM table_name GROUP BY col1; ``` P.S.: You can use variables "position" in the group by operator, e.g.: ```sql SELECT col1, max(col2), min(col3) FROM table_name GROUP BY 1; # means group by the first colum ``` - HAVING - filter the groups by a specific condition. ```sql SELECT col1, max(col2) AS a, min(col3) AS b # Note the cols aliases FROM table_name GROUP BY col1 HAVING a > 30 ``` --- # Grouping the data (count.) Let's practice these two operators. 1. Create a query that returns the number of restaurants by city in descending order (Hint.: use `count(*)` to count records in a group) 2. As above, but exclude cities with less that 1000 restaurants --- # Set operators - UNION - combine two or more results set into one. By default it returns only distinct values. To allow duplicates use UNION ALL. ```sql SELECT col1, col2 FROM table1 UNION SELECT col1, col2 from table2 ``` - INTERSECT - return the intersection of two or more results set. ```sql SELECT col1, col2 FROM table1 INTERSECT SELECT col1, col2 from table2 ``` Note that columns types (not names) need to match --- # Joining tables - JOIN - returns common records. ```sql SELECT f.col1, f.col2, s.col1 FROM table1 f # note the table alias JOIN table2 s on f.col2 = s.col2 ``` Note that if the joining column name is the same you can replace `ON` with `USING(colname)`. - LEFT JOIN - returns all records of the first table. - RIGHT JOIN - return all records of the second table. --- # Conditional statements ```sql IF(condition, what to do if condition is TRUE, what to do if condition is FALSE) ``` Example: ```sql SELECT col1, IF(col2>5,1,0) from table1 ``` --- # Mathematical operators - count() - sum() - avg() - min() - max() --- # String functions There are a ton of string manipulation functions, some of them are: - CHAR_LENGTH (LENGTH) - return length of a string in characters (bytes) - SUBSTRING_INDEX - split a string using a delimiter and return a subset of it - SUBSTR - return a subset of a string - CONCAT - concatenates two strings - REPLACE - replace a string pattern with another one For more, see https://dev.mysql.com/doc/refman/8.0/en/string-functions.html --- # More practice exercises 1. Create a query that returns count the average length of a review and average number of photos by star-rating 2. Count the number of reviews containing the word "sorry" by star-rating --- class: inverse, center, middle name:R # Use R to connect to my MySQL <!-- <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --> --- # RMySQL: Connect to DB ```sql library(data.table) library(RMySQL) mydb = dbConnect(MySQL(), user='username', password= rstudioapi::askForPassword("Database password"), dbname='database-name', host='mshresearch.marshall.usc.edu') ``` --- # RMySQL: Query DB Execute a query: ```sql rs = dbSendQuery(mydb, query_string) data = data.table(fetch(rs, n=-1)) ``` --- # RMySQL: Write to DB ```sql dbWriteTable(db_connection, value = data.table or data.fram, name = "MyTable", append = TRUE, row.names=FALSE) ``` --- Let's practice. Open Rstudio, connect to the practice DB and: 1. Create a query that returns the average ratings by city. Create a barplot of the top-10 cites by average rating in descending order. 2. Create a query that for each review, returns the restaurant ID, review ID, star-rating, review length, number of photos, review_date. Regress star-rating against review length and photos. Plot review length and photos by star-rating using a boxplot. 3. Plot average number of reviews by (1) year-month and (2) by month --- name:csv # Import CSV files to DB - Extremely easy using a database management application - Also possible using a query Step 1: Create Table: ```sql CREATE TABLE discounts ( id INT NOT NULL, title VARCHAR(255) NOT NULL, PRIMARY KEY (id) ); ``` Step 2: Import file: ```sql LOAD DATA INFILE 'c:/tmp/discounts.csv' INTO TABLE discounts FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 ROWS; # skip headers ```