class: center, middle, inverse, title-slide .title[ # Topic 17: Databases and Big Data ] .author[ ### Nick Hagerty
ECNS 460/560 Fall 2023
Montana State University ] .date[ ### .small[
*These slides are adapted from
“Data Science for Economists”
by Grant R. McDermott, used under the
MIT License
.] ] --- exclude: true --- # Table of contents 1. [Tools for big data](#bigdata) 1. [Databases in R](#databases) 1. [Writing SQL queries](#sql) 1. [Getting started with BigQuery](#bigquery) --- class: inverse, middle name: bigdata # Tools for big data --- # Tools for big data **What if your datasets are too big to fit in your computer's memory?** * You probably don't have more than 32 or 64 GB of RAM. * Many types of data (e.g., raster data or videos) can easily exceed that. * Companies like Amazon and Google work with petabytes of data. -- **What if your models are taking too long to run on your computer?** * OLS has a closed-form solution -- `\((X'X)^{-1}X'Y\)` -- but complex specifications can still take a while. * Many other statistical models take much longer, since they require numerical optimization. * Machine learning models can take orders of magnitude longer to fit and tune. -- **"Big data is when your workflow breaks."** -Randall Prium --- # Tools for big data Solutions that are suboptimal but might be good enough: **1. Use a random sample of your data to develop your code.** * I.e., use only 10%, 1%, 0.001%, etc. * And then run it on the full dataset only once at the end. * Or don't -- maybe a sample is enough for your purposes! **2. Talk someone into letting you use a more powerful computer.** * Not always feasible, easy, sustainable, or sufficient. * Check out Montana State's [high performance computing resources](https://www.montana.edu/uit/rci/hpc/). --- # Tools for big data Better solutions: 1. **Databases** * An architecture for storing and retrieving data that avoids loading the entire dataset into memory. 1. **Parallelization ("embarassingly parallel")** * Speed up iterations by parallelizing them across multiple cores on one computer. 1. **Distributed computing** * Parallelize tasks by breaking up data into chunks and handing it to multiple computers, then combining results. 1. **Cloud computing** * Get more computing power, memory, or processors by working on virtual machines on remote servers. --- # Cloud computing **Set up a virtual machine (VM) on a cloud computing platform.** * Easy and cheap. (Much more so than setting up your own server!) * Get billed for exactly as much power, memory, and time as you need. * For most purposes, you won't use more than a few dollars per month. **Platforms:** * [Google Cloud Compute Engine](https://cloud.google.com/compute/) * [Amazon Web Services](https://aws.amazon.com/) * [Posit Cloud](https://posit.cloud/) (affiliated with RStudio). **Tutorials** by Grant McDermott: [Part 1](https://raw.githack.com/uo-ec510-2020-spring/lectures/master/13-gce-i/13-gce-i.html), [Part 2](https://raw.githack.com/uo-ec510-2020-spring/lectures/master/14-gce-ii/14-gce-ii.html). --- # Distributed computing: Spark **Spark is a "unified analytics engine for large-scale data processing."** * Spark Core: enables distributed computing * Spark SQL: a SQL implementation * MLlib: An extensive machine learning library. Spark can scale in ways that other systems can't. * Move from a local test dataset to a massive dataset on a remote cluster, with minimal changes to your code. Key advantage: Ability to process data both **in memory** and **on disk.** * Deploys the optimal combination of resources for where it's deployed. **Tutorial** by Grant McDermott: [Here](https://raw.githack.com/uo-ec607/lectures/master/17-spark/17-spark.html) --- class: inverse, middle name: databases # Databases in R --- # Databases **Databases can help when:** 1. Your data is too large to fit into memory at once - And you don't actually need all of it, only parts or a summary. 1. Your data is continuously updated by other people/processes. **Databases store data:** * On disk, not in memory. * On remote servers (usually, but may also be local). * Relationally: they contain multiple tables linked by keys. To retrieve data from a databases, we submit a **query**. * We only get back the results of the query, not the entire raw data. * Databases are extremely efficient at executing queries over data stored on disk. --- # Databases and the tidyverse Basically all databases use **SQL** (Structured Query Language). * SQL is powerful but old and less intuitive than the tidyverse. Fortunately, `dbplyr` makes it super easy to interact with databases from R. * It can automatically translate tidyverse pipelines to SQL. * You don't even *need* to learn any SQL to use databases in R. But we'll still cover some SQL syntax. * Knowing SQL gives you a lot more flexibility. * And it's necessary for many data science jobs. --- # SQLite We'll use **SQLite** as our specific database backend. * Not very scalable, but lightweight and open-source. * Otherwise we would need to separately install a whole database system. Some other (more powerful) implementations of SQL: * **MySQL:** Open-source. Used by Google, Facebook, LinkedIn, Twitter. * **PostgreSQL:** Open-source. Used by Instagram, Reddit. * **Oracle** and **Microsoft SQL Server:** widespread in corporate environments. * **Google BigQuery:** Cloud solution that you can use with `bigrquery`. --- # R packages Run these preliminaries: ```r if (!require("pacman")) install.packages("pacman") pacman::p_load(tidyverse, DBI, dbplyr, RSQLite, bigrquery, nycflights13, tictoc) ``` **The R packages we'll need:** 1. `dbplyr` translates tidyverse code to SQL. 2. `DBI` helps R connect to the database. 3. A backend interface to the specific type of database we want to work with. * `RSQLite` provides this interface * And also fully embeds a SQLite database. --- # Connecting to a database Open an (empty) database connection via `DBI::dbConnect`: * The first argument is always the database backend. * Other arguments vary across database implementations. ```r lite_con = dbConnect(RSQLite::SQLite(), path = ":memory:") ``` The path `":memory:"` creates a temporary in-memory database that we'll play around with *as if* it is a remote on-disk database. --- # Populate some tables Our database is empty so far. Let's copy in our old friend, the `flights` dataframe from `nycflights13`. ```r copy_to( dest = lite_con, df = nycflights13::flights, name = "flights", temporary = FALSE, indexes = list(c("year", "month", "day"), "carrier", "tailnum", "dest") ) ``` --- # Reference a database table Now we can use `dplyr::tbl` to use this database table as if it were a dataframe in R: ```r flights_db = tbl(lite_con, "flights") class(flights_db) ``` ``` #> [1] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql" #> [4] "tbl_lazy" "tbl" ``` ```r flights_db ``` ``` #> # Source: table<flights> [?? x 19] #> # Database: sqlite 3.41.2 [] #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 517 515 2 830 819 #> 2 2013 1 1 533 529 4 850 830 #> 3 2013 1 1 542 540 2 923 850 #> 4 2013 1 1 544 545 -1 1004 1022 #> 5 2013 1 1 554 600 -6 812 837 #> 6 2013 1 1 554 558 -4 740 728 #> 7 2013 1 1 555 600 -5 913 854 #> 8 2013 1 1 557 600 -3 709 723 #> 9 2013 1 1 557 600 -3 838 846 #> 10 2013 1 1 558 600 -2 753 745 #> # ℹ more rows #> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, #> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, #> # hour <dbl>, minute <dbl>, time_hour <dbl> ``` --- # Generating queries `dbplr` will automatically translate tidyverse code to SQL. ```r # Select columns flights_db |> select(year:day, dep_delay, arr_delay) ``` ``` #> # Source: SQL [?? x 5] #> # Database: sqlite 3.41.2 [] #> year month day dep_delay arr_delay #> <int> <int> <int> <dbl> <dbl> #> 1 2013 1 1 2 11 #> 2 2013 1 1 4 20 #> 3 2013 1 1 2 33 #> 4 2013 1 1 -1 -18 #> 5 2013 1 1 -6 -25 #> 6 2013 1 1 -4 12 #> 7 2013 1 1 -5 19 #> 8 2013 1 1 -3 -14 #> 9 2013 1 1 -3 -8 #> 10 2013 1 1 -2 8 #> # ℹ more rows ``` --- # Generating queries `dbplr` will automatically translate tidyverse code to SQL. ```r # Filter rows flights_db |> filter(dep_delay > 240) ``` ``` #> # Source: SQL [?? x 19] #> # Database: sqlite 3.41.2 [] #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> <int> <int> <int> <int> <int> <dbl> <int> <int> #> 1 2013 1 1 848 1835 853 1001 1950 #> 2 2013 1 1 1815 1325 290 2120 1542 #> 3 2013 1 1 1842 1422 260 1958 1535 #> 4 2013 1 1 2115 1700 255 2330 1920 #> 5 2013 1 1 2205 1720 285 46 2040 #> 6 2013 1 1 2343 1724 379 314 1938 #> 7 2013 1 2 1332 904 268 1616 1128 #> 8 2013 1 2 1412 838 334 1710 1147 #> 9 2013 1 2 1607 1030 337 2003 1355 #> 10 2013 1 2 2131 1512 379 2340 1741 #> # ℹ more rows #> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, #> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, #> # hour <dbl>, minute <dbl>, time_hour <dbl> ``` --- # Generating queries `dbplr` will automatically translate tidyverse code to SQL. ```r # Calculate the mean delay by destination flights_db |> group_by(dest) |> summarize(delay = mean(dep_delay)) ``` ``` #> # Source: SQL [?? x 2] #> # Database: sqlite 3.41.2 [] #> dest delay #> <chr> <dbl> #> 1 ABQ 13.7 #> 2 ACK 6.46 #> 3 ALB 23.6 #> 4 ANC 12.9 #> 5 ATL 12.5 #> 6 AUS 13.0 #> 7 AVL 8.19 #> 8 BDL 17.7 #> 9 BGR 19.5 #> 10 BHM 29.7 #> # ℹ more rows ``` --- # Laziness as a virtue Why does it say `# Source: SQL [?? x 2]`? **Evaluation is lazy.** * `dbplyr` does as little in R as possible. * Instead it passes everything to the database. * It doesn't pull any data into R until you ask for it. * It delays the analysis until the last possible moment. **Demonstration:** How long does it take to calculate mean departure and arrival delays for each plane (i.e., unique tail number)? --- # Laziness as a virtue Using the `flights` data frame: ```r tic() tailnum_delay_tv = flights |> group_by(tailnum) |> summarize( mean_dep_delay = mean(dep_delay), mean_arr_delay = mean(arr_delay), n = n() ) |> filter(n > 100) |> arrange(desc(mean_arr_delay)) toc() ``` ``` #> 0.17 sec elapsed ``` --- # Laziness as a virtue Using the `flights_db` database table: ```r tic() tailnum_delay_db = flights_db |> group_by(tailnum) |> summarize( mean_dep_delay = mean(dep_delay), mean_arr_delay = mean(arr_delay), n = n() ) |> filter(n > 100) |> arrange(desc(mean_arr_delay)) toc() ``` ``` #> 0.17 sec elapsed ``` -- This code does not actually do any calculations! Or even communicate with the database! --- # Laziness as a virtue Only when you print the object does it generate the SQL and query the database: ```r tic() print(tailnum_delay_db, n = 2) ``` ``` #> # Source: SQL [?? x 4] #> # Database: sqlite 3.41.2 [] #> # Ordered by: desc(mean_arr_delay) #> tailnum mean_dep_delay mean_arr_delay n #> <chr> <dbl> <dbl> <int> #> 1 N11119 32.6 30.3 148 #> 2 N16919 32.4 29.9 251 #> # ℹ more rows ``` ```r toc() ``` ``` #> 1.51 sec elapsed ``` Even then it only pulls down exactly as many rows as you request. --- # Collect data into your R environment Once you figure out what data you need, use `collect` to pull the data into a local data frame. ```r tailnum_delay = tailnum_delay_db |> collect() class(tailnum_delay) ``` ``` #> [1] "tbl_df" "tbl" "data.frame" ``` ```r tailnum_delay ``` ``` #> # A tibble: 1,201 × 4 #> tailnum mean_dep_delay mean_arr_delay n #> <chr> <dbl> <dbl> <int> #> 1 N11119 32.6 30.3 148 #> 2 N16919 32.4 29.9 251 #> 3 N14998 29.4 27.9 230 #> 4 N15910 29.3 27.6 280 #> 5 N13123 29.6 26.0 121 #> 6 N11192 27.5 25.9 154 #> 7 N14950 26.2 25.3 219 #> 8 N21130 27.0 25.0 126 #> 9 N24128 24.8 24.9 129 #> 10 N22971 26.5 24.7 230 #> # ℹ 1,191 more rows ``` --- # Collect data into your R environment Now this is a real data frame and we can use it like any other. Is there a relationship between departure and arrival delays? ```r tailnum_delay |> ggplot(aes(x = mean_dep_delay, y = mean_arr_delay)) + geom_point(alpha = 0.3) + geom_abline(intercept = 0, slope = 1, col="orange") ``` <img src="17-Databases_files/figure-html/tailnum_delay_ggplot-1.svg" width="70%" style="display: block; margin: auto;" /> --- class: inverse, middle name: sql # Writing SQL queries --- # SQL queries Let's see what SQL query was generated by `dbplyr`: .pull-left[ ```r tailnum_delay_db = flights_db |> group_by(tailnum) |> summarize( mean_dep_delay = mean(dep_delay), mean_arr_delay = mean(arr_delay), n = n() ) |> filter(n > 100) |> arrange(desc(mean_arr_delay)) ``` ] .pull-right[ ```r tailnum_delay_db |> show_query() ``` ``` #> <SQL> #> SELECT #> `tailnum`, #> AVG(`dep_delay`) AS `mean_dep_delay`, #> AVG(`arr_delay`) AS `mean_arr_delay`, #> COUNT(*) AS `n` #> FROM `flights` #> GROUP BY `tailnum` #> HAVING (COUNT(*) > 100.0) #> ORDER BY `mean_arr_delay` DESC ``` ] The SQL code is arguably less elegant. It also does not follow the logical order of operations. --- layout: false class: clear <img src="https://wizardzines.com/zines/sql/samples/from.png" width="47%" style="display: block; margin: auto;" /> .smaller[Source: Julia Evans, [*Become a Select Star*](https://wizardzines.com/zines/sql/)] --- # Using SQL directly in R Let's first generate a simple SQL query we can use to build on: ```r flights_db |> filter(dep_delay > 240) |> head(5) |> show_query() ``` ``` #> <SQL> #> SELECT * #> FROM `flights` #> WHERE (`dep_delay` > 240.0) #> LIMIT 5 ``` Note: The backticks around objects (`flights` and `dep_delay`) are not strictly necessary. --- # SQL chunks in R Markdown When you're using R Markdown, you can directly embed SQL queries: ```sql SELECT * FROM flights WHERE dep_delay > 240 LIMIT 5 ``` <table> <caption>5 records</caption> <thead> <tr> <th style="text-align:right;"> year </th> <th style="text-align:right;"> month </th> <th style="text-align:right;"> day </th> <th style="text-align:right;"> dep_time </th> <th style="text-align:right;"> sched_dep_time </th> <th style="text-align:right;"> dep_delay </th> <th style="text-align:right;"> arr_time </th> <th style="text-align:right;"> sched_arr_time </th> <th style="text-align:right;"> arr_delay </th> <th style="text-align:left;"> carrier </th> <th style="text-align:right;"> flight </th> <th style="text-align:left;"> tailnum </th> <th style="text-align:left;"> origin </th> <th style="text-align:left;"> dest </th> <th style="text-align:right;"> air_time </th> <th style="text-align:right;"> distance </th> <th style="text-align:right;"> hour </th> <th style="text-align:right;"> minute </th> <th style="text-align:right;"> time_hour </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 848 </td> <td style="text-align:right;"> 1835 </td> <td style="text-align:right;"> 853 </td> <td style="text-align:right;"> 1001 </td> <td style="text-align:right;"> 1950 </td> <td style="text-align:right;"> 851 </td> <td style="text-align:left;"> MQ </td> <td style="text-align:right;"> 3944 </td> <td style="text-align:left;"> N942MQ </td> <td style="text-align:left;"> JFK </td> <td style="text-align:left;"> BWI </td> <td style="text-align:right;"> 41 </td> <td style="text-align:right;"> 184 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 35 </td> <td style="text-align:right;"> 1357081200 </td> </tr> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1815 </td> <td style="text-align:right;"> 1325 </td> <td style="text-align:right;"> 290 </td> <td style="text-align:right;"> 2120 </td> <td style="text-align:right;"> 1542 </td> <td style="text-align:right;"> 338 </td> <td style="text-align:left;"> EV </td> <td style="text-align:right;"> 4417 </td> <td style="text-align:left;"> N17185 </td> <td style="text-align:left;"> EWR </td> <td style="text-align:left;"> OMA </td> <td style="text-align:right;"> 213 </td> <td style="text-align:right;"> 1134 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 25 </td> <td style="text-align:right;"> 1357063200 </td> </tr> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1842 </td> <td style="text-align:right;"> 1422 </td> <td style="text-align:right;"> 260 </td> <td style="text-align:right;"> 1958 </td> <td style="text-align:right;"> 1535 </td> <td style="text-align:right;"> 263 </td> <td style="text-align:left;"> EV </td> <td style="text-align:right;"> 4633 </td> <td style="text-align:left;"> N18120 </td> <td style="text-align:left;"> EWR </td> <td style="text-align:left;"> BTV </td> <td style="text-align:right;"> 46 </td> <td style="text-align:right;"> 266 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 1357066800 </td> </tr> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2115 </td> <td style="text-align:right;"> 1700 </td> <td style="text-align:right;"> 255 </td> <td style="text-align:right;"> 2330 </td> <td style="text-align:right;"> 1920 </td> <td style="text-align:right;"> 250 </td> <td style="text-align:left;"> 9E </td> <td style="text-align:right;"> 3347 </td> <td style="text-align:left;"> N924XJ </td> <td style="text-align:left;"> JFK </td> <td style="text-align:left;"> CVG </td> <td style="text-align:right;"> 115 </td> <td style="text-align:right;"> 589 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1357077600 </td> </tr> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2205 </td> <td style="text-align:right;"> 1720 </td> <td style="text-align:right;"> 285 </td> <td style="text-align:right;"> 46 </td> <td style="text-align:right;"> 2040 </td> <td style="text-align:right;"> 246 </td> <td style="text-align:left;"> AA </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> N5DNAA </td> <td style="text-align:left;"> EWR </td> <td style="text-align:left;"> MIA </td> <td style="text-align:right;"> 146 </td> <td style="text-align:right;"> 1085 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 1357077600 </td> </tr> </tbody> </table> --- # SQL queries in regular R scripts Using `DBI::dbGetQuery()`: ```r dbGetQuery(lite_con, "SELECT * FROM flights WHERE dep_delay > 240.0 LIMIT 5") ``` ``` #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> 1 2013 1 1 848 1835 853 1001 1950 #> 2 2013 1 1 1815 1325 290 2120 1542 #> 3 2013 1 1 1842 1422 260 1958 1535 #> 4 2013 1 1 2115 1700 255 2330 1920 #> 5 2013 1 1 2205 1720 285 46 2040 #> arr_delay carrier flight tailnum origin dest air_time distance hour minute #> 1 851 MQ 3944 N942MQ JFK BWI 41 184 18 35 #> 2 338 EV 4417 N17185 EWR OMA 213 1134 13 25 #> 3 263 EV 4633 N18120 EWR BTV 46 266 14 22 #> 4 250 9E 3347 N924XJ JFK CVG 115 589 17 0 #> 5 246 AA 1999 N5DNAA EWR MIA 146 1085 17 20 #> time_hour #> 1 1357081200 #> 2 1357063200 #> 3 1357066800 #> 4 1357077600 #> 5 1357077600 ``` --- # SQL queries in regular R scripts What tables do we have in our database? ```r dbListTables(lite_con) ``` ``` #> [1] "flights" "sqlite_stat1" "sqlite_stat4" ``` Show the columns/fields of a table: ```r dbListFields(lite_con, "flights") ``` ``` #> [1] "year" "month" "day" "dep_time" #> [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time" #> [9] "arr_delay" "carrier" "flight" "tailnum" #> [13] "origin" "dest" "air_time" "distance" #> [17] "hour" "minute" "time_hour" ``` --- # SQL syntax **From before:** Return the first 5 rows that meet a specified filter: ```sql SELECT * FROM flights WHERE dep_delay > 240 LIMIT 5 ``` <table> <caption>5 records</caption> <thead> <tr> <th style="text-align:right;"> year </th> <th style="text-align:right;"> month </th> <th style="text-align:right;"> day </th> <th style="text-align:right;"> dep_time </th> <th style="text-align:right;"> sched_dep_time </th> <th style="text-align:right;"> dep_delay </th> <th style="text-align:right;"> arr_time </th> <th style="text-align:right;"> sched_arr_time </th> <th style="text-align:right;"> arr_delay </th> <th style="text-align:left;"> carrier </th> <th style="text-align:right;"> flight </th> <th style="text-align:left;"> tailnum </th> <th style="text-align:left;"> origin </th> <th style="text-align:left;"> dest </th> <th style="text-align:right;"> air_time </th> <th style="text-align:right;"> distance </th> <th style="text-align:right;"> hour </th> <th style="text-align:right;"> minute </th> <th style="text-align:right;"> time_hour </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 848 </td> <td style="text-align:right;"> 1835 </td> <td style="text-align:right;"> 853 </td> <td style="text-align:right;"> 1001 </td> <td style="text-align:right;"> 1950 </td> <td style="text-align:right;"> 851 </td> <td style="text-align:left;"> MQ </td> <td style="text-align:right;"> 3944 </td> <td style="text-align:left;"> N942MQ </td> <td style="text-align:left;"> JFK </td> <td style="text-align:left;"> BWI </td> <td style="text-align:right;"> 41 </td> <td style="text-align:right;"> 184 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 35 </td> <td style="text-align:right;"> 1357081200 </td> </tr> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1815 </td> <td style="text-align:right;"> 1325 </td> <td style="text-align:right;"> 290 </td> <td style="text-align:right;"> 2120 </td> <td style="text-align:right;"> 1542 </td> <td style="text-align:right;"> 338 </td> <td style="text-align:left;"> EV </td> <td style="text-align:right;"> 4417 </td> <td style="text-align:left;"> N17185 </td> <td style="text-align:left;"> EWR </td> <td style="text-align:left;"> OMA </td> <td style="text-align:right;"> 213 </td> <td style="text-align:right;"> 1134 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 25 </td> <td style="text-align:right;"> 1357063200 </td> </tr> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1842 </td> <td style="text-align:right;"> 1422 </td> <td style="text-align:right;"> 260 </td> <td style="text-align:right;"> 1958 </td> <td style="text-align:right;"> 1535 </td> <td style="text-align:right;"> 263 </td> <td style="text-align:left;"> EV </td> <td style="text-align:right;"> 4633 </td> <td style="text-align:left;"> N18120 </td> <td style="text-align:left;"> EWR </td> <td style="text-align:left;"> BTV </td> <td style="text-align:right;"> 46 </td> <td style="text-align:right;"> 266 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 1357066800 </td> </tr> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2115 </td> <td style="text-align:right;"> 1700 </td> <td style="text-align:right;"> 255 </td> <td style="text-align:right;"> 2330 </td> <td style="text-align:right;"> 1920 </td> <td style="text-align:right;"> 250 </td> <td style="text-align:left;"> 9E </td> <td style="text-align:right;"> 3347 </td> <td style="text-align:left;"> N924XJ </td> <td style="text-align:left;"> JFK </td> <td style="text-align:left;"> CVG </td> <td style="text-align:right;"> 115 </td> <td style="text-align:right;"> 589 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1357077600 </td> </tr> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2205 </td> <td style="text-align:right;"> 1720 </td> <td style="text-align:right;"> 285 </td> <td style="text-align:right;"> 46 </td> <td style="text-align:right;"> 2040 </td> <td style="text-align:right;"> 246 </td> <td style="text-align:left;"> AA </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:left;"> N5DNAA </td> <td style="text-align:left;"> EWR </td> <td style="text-align:left;"> MIA </td> <td style="text-align:right;"> 146 </td> <td style="text-align:right;"> 1085 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 1357077600 </td> </tr> </tbody> </table> --- # SQL syntax Return only selected columns: ```sql SELECT dep_delay, arr_delay FROM flights LIMIT 5 ``` <table> <caption>5 records</caption> <thead> <tr> <th style="text-align:right;"> dep_delay </th> <th style="text-align:right;"> arr_delay </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 11 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 20 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 33 </td> </tr> <tr> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> -18 </td> </tr> <tr> <td style="text-align:right;"> -6 </td> <td style="text-align:right;"> -25 </td> </tr> </tbody> </table> --- # SQL syntax Count rows: ```sql SELECT COUNT(*) FROM flights ``` <table> <caption>1 records</caption> <thead> <tr> <th style="text-align:right;"> COUNT(*) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 336776 </td> </tr> </tbody> </table> Count distinct/unique values: ```sql SELECT COUNT(DISTINCT dep_delay) FROM flights ``` <table> <caption>1 records</caption> <thead> <tr> <th style="text-align:right;"> COUNT(DISTINCT dep_delay) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 527 </td> </tr> </tbody> </table> --- # Challenge `mutate(new_var = var 1 * var2)` translates to `SELECT var1 * var2 AS near_var`. Can you translate this R code to a SQL query? ```r flights_db |> select(distance, air_time) |> mutate(speed = distance / (air_time / 60)) ``` -- ```sql SELECT distance, air_time, distance / (air_time / 60) AS speed FROM flights ``` <table> <caption>Displaying records 1 - 10</caption> <thead> <tr> <th style="text-align:right;"> distance </th> <th style="text-align:right;"> air_time </th> <th style="text-align:right;"> speed </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1400 </td> <td style="text-align:right;"> 227 </td> <td style="text-align:right;"> 370.0441 </td> </tr> <tr> <td style="text-align:right;"> 1416 </td> <td style="text-align:right;"> 227 </td> <td style="text-align:right;"> 374.2731 </td> </tr> <tr> <td style="text-align:right;"> 1089 </td> <td style="text-align:right;"> 160 </td> <td style="text-align:right;"> 408.3750 </td> </tr> <tr> <td style="text-align:right;"> 1576 </td> <td style="text-align:right;"> 183 </td> <td style="text-align:right;"> 516.7213 </td> </tr> <tr> <td style="text-align:right;"> 762 </td> <td style="text-align:right;"> 116 </td> <td style="text-align:right;"> 394.1379 </td> </tr> <tr> <td style="text-align:right;"> 719 </td> <td style="text-align:right;"> 150 </td> <td style="text-align:right;"> 287.6000 </td> </tr> <tr> <td style="text-align:right;"> 1065 </td> <td style="text-align:right;"> 158 </td> <td style="text-align:right;"> 404.4304 </td> </tr> <tr> <td style="text-align:right;"> 229 </td> <td style="text-align:right;"> 53 </td> <td style="text-align:right;"> 259.2453 </td> </tr> <tr> <td style="text-align:right;"> 944 </td> <td style="text-align:right;"> 140 </td> <td style="text-align:right;"> 404.5714 </td> </tr> <tr> <td style="text-align:right;"> 733 </td> <td style="text-align:right;"> 138 </td> <td style="text-align:right;"> 318.6957 </td> </tr> </tbody> </table> --- # Joins Joins are more flexible in SQL than R. Let's put the `weather` data frame in the database as well: ```r copy_to( dest = lite_con, df = nycflights13::weather, name = "weather", temporary = FALSE, indexes = list(c("year", "month", "day", "hour")) ) ``` --- # Joins Here's an example: ```sql SELECT * FROM flights AS f LEFT JOIN weather AS w ON f.year = w.year AND f.month = w.month AND f.day = w.day AND f.hour = w.hour ``` <table> <caption>Displaying records 1 - 10</caption> <thead> <tr> <th style="text-align:right;"> year </th> <th style="text-align:right;"> month </th> <th style="text-align:right;"> day </th> <th style="text-align:right;"> dep_time </th> <th style="text-align:right;"> sched_dep_time </th> <th style="text-align:right;"> dep_delay </th> <th style="text-align:right;"> arr_time </th> <th style="text-align:right;"> sched_arr_time </th> <th style="text-align:right;"> arr_delay </th> <th style="text-align:left;"> carrier </th> <th style="text-align:right;"> flight </th> <th style="text-align:left;"> tailnum </th> <th style="text-align:left;"> origin </th> <th style="text-align:left;"> dest </th> <th style="text-align:right;"> air_time </th> <th style="text-align:right;"> distance </th> <th style="text-align:right;"> hour </th> <th style="text-align:right;"> minute </th> <th style="text-align:right;"> time_hour </th> <th style="text-align:left;"> origin </th> <th style="text-align:right;"> year </th> <th style="text-align:right;"> month </th> <th style="text-align:right;"> day </th> <th style="text-align:right;"> hour </th> <th style="text-align:right;"> temp </th> <th style="text-align:right;"> dewp </th> <th style="text-align:right;"> humid </th> <th style="text-align:right;"> wind_dir </th> <th style="text-align:right;"> wind_speed </th> <th style="text-align:right;"> wind_gust </th> <th style="text-align:right;"> precip </th> <th style="text-align:right;"> pressure </th> <th style="text-align:right;"> visib </th> <th style="text-align:right;"> time_hour </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 517 </td> <td style="text-align:right;"> 515 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 830 </td> <td style="text-align:right;"> 819 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:left;"> UA </td> <td style="text-align:right;"> 1545 </td> <td style="text-align:left;"> N14228 </td> <td style="text-align:left;"> EWR </td> <td style="text-align:left;"> IAH </td> <td style="text-align:right;"> 227 </td> <td style="text-align:right;"> 1400 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 1357034400 </td> <td style="text-align:left;"> EWR </td> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 39.02 </td> <td style="text-align:right;"> 28.04 </td> <td style="text-align:right;"> 64.43 </td> <td style="text-align:right;"> 260 </td> <td style="text-align:right;"> 12.65858 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1011.9 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 1357034400 </td> </tr> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 517 </td> <td style="text-align:right;"> 515 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 830 </td> <td style="text-align:right;"> 819 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:left;"> UA </td> <td style="text-align:right;"> 1545 </td> <td style="text-align:left;"> N14228 </td> <td style="text-align:left;"> EWR </td> <td style="text-align:left;"> IAH </td> <td style="text-align:right;"> 227 </td> <td style="text-align:right;"> 1400 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 1357034400 </td> <td style="text-align:left;"> JFK </td> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 39.02 </td> <td style="text-align:right;"> 26.96 </td> <td style="text-align:right;"> 61.63 </td> <td style="text-align:right;"> 260 </td> <td style="text-align:right;"> 14.96014 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1012.1 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 1357034400 </td> </tr> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 517 </td> <td style="text-align:right;"> 515 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 830 </td> <td style="text-align:right;"> 819 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:left;"> UA </td> <td style="text-align:right;"> 1545 </td> <td style="text-align:left;"> N14228 </td> <td style="text-align:left;"> EWR </td> <td style="text-align:left;"> IAH </td> <td style="text-align:right;"> 227 </td> <td style="text-align:right;"> 1400 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 1357034400 </td> <td style="text-align:left;"> LGA </td> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 39.92 </td> <td style="text-align:right;"> 24.98 </td> <td style="text-align:right;"> 54.81 </td> <td style="text-align:right;"> 250 </td> <td style="text-align:right;"> 14.96014 </td> <td style="text-align:right;"> 21.86482 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1011.4 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 1357034400 </td> </tr> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 533 </td> <td style="text-align:right;"> 529 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 850 </td> <td style="text-align:right;"> 830 </td> <td style="text-align:right;"> 20 </td> <td style="text-align:left;"> UA </td> <td style="text-align:right;"> 1714 </td> <td style="text-align:left;"> N24211 </td> <td style="text-align:left;"> LGA </td> <td style="text-align:left;"> IAH </td> <td style="text-align:right;"> 227 </td> <td style="text-align:right;"> 1416 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 29 </td> <td style="text-align:right;"> 1357034400 </td> <td style="text-align:left;"> EWR </td> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 39.02 </td> <td style="text-align:right;"> 28.04 </td> <td style="text-align:right;"> 64.43 </td> <td style="text-align:right;"> 260 </td> <td style="text-align:right;"> 12.65858 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1011.9 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 1357034400 </td> </tr> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 533 </td> <td style="text-align:right;"> 529 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 850 </td> <td style="text-align:right;"> 830 </td> <td style="text-align:right;"> 20 </td> <td style="text-align:left;"> UA </td> <td style="text-align:right;"> 1714 </td> <td style="text-align:left;"> N24211 </td> <td style="text-align:left;"> LGA </td> <td style="text-align:left;"> IAH </td> <td style="text-align:right;"> 227 </td> <td style="text-align:right;"> 1416 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 29 </td> <td style="text-align:right;"> 1357034400 </td> <td style="text-align:left;"> JFK </td> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 39.02 </td> <td style="text-align:right;"> 26.96 </td> <td style="text-align:right;"> 61.63 </td> <td style="text-align:right;"> 260 </td> <td style="text-align:right;"> 14.96014 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1012.1 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 1357034400 </td> </tr> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 533 </td> <td style="text-align:right;"> 529 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 850 </td> <td style="text-align:right;"> 830 </td> <td style="text-align:right;"> 20 </td> <td style="text-align:left;"> UA </td> <td style="text-align:right;"> 1714 </td> <td style="text-align:left;"> N24211 </td> <td style="text-align:left;"> LGA </td> <td style="text-align:left;"> IAH </td> <td style="text-align:right;"> 227 </td> <td style="text-align:right;"> 1416 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 29 </td> <td style="text-align:right;"> 1357034400 </td> <td style="text-align:left;"> LGA </td> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 39.92 </td> <td style="text-align:right;"> 24.98 </td> <td style="text-align:right;"> 54.81 </td> <td style="text-align:right;"> 250 </td> <td style="text-align:right;"> 14.96014 </td> <td style="text-align:right;"> 21.86482 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1011.4 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 1357034400 </td> </tr> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 542 </td> <td style="text-align:right;"> 540 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 923 </td> <td style="text-align:right;"> 850 </td> <td style="text-align:right;"> 33 </td> <td style="text-align:left;"> AA </td> <td style="text-align:right;"> 1141 </td> <td style="text-align:left;"> N619AA </td> <td style="text-align:left;"> JFK </td> <td style="text-align:left;"> MIA </td> <td style="text-align:right;"> 160 </td> <td style="text-align:right;"> 1089 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 40 </td> <td style="text-align:right;"> 1357034400 </td> <td style="text-align:left;"> EWR </td> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 39.02 </td> <td style="text-align:right;"> 28.04 </td> <td style="text-align:right;"> 64.43 </td> <td style="text-align:right;"> 260 </td> <td style="text-align:right;"> 12.65858 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1011.9 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 1357034400 </td> </tr> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 542 </td> <td style="text-align:right;"> 540 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 923 </td> <td style="text-align:right;"> 850 </td> <td style="text-align:right;"> 33 </td> <td style="text-align:left;"> AA </td> <td style="text-align:right;"> 1141 </td> <td style="text-align:left;"> N619AA </td> <td style="text-align:left;"> JFK </td> <td style="text-align:left;"> MIA </td> <td style="text-align:right;"> 160 </td> <td style="text-align:right;"> 1089 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 40 </td> <td style="text-align:right;"> 1357034400 </td> <td style="text-align:left;"> JFK </td> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 39.02 </td> <td style="text-align:right;"> 26.96 </td> <td style="text-align:right;"> 61.63 </td> <td style="text-align:right;"> 260 </td> <td style="text-align:right;"> 14.96014 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1012.1 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 1357034400 </td> </tr> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 542 </td> <td style="text-align:right;"> 540 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 923 </td> <td style="text-align:right;"> 850 </td> <td style="text-align:right;"> 33 </td> <td style="text-align:left;"> AA </td> <td style="text-align:right;"> 1141 </td> <td style="text-align:left;"> N619AA </td> <td style="text-align:left;"> JFK </td> <td style="text-align:left;"> MIA </td> <td style="text-align:right;"> 160 </td> <td style="text-align:right;"> 1089 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 40 </td> <td style="text-align:right;"> 1357034400 </td> <td style="text-align:left;"> LGA </td> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 39.92 </td> <td style="text-align:right;"> 24.98 </td> <td style="text-align:right;"> 54.81 </td> <td style="text-align:right;"> 250 </td> <td style="text-align:right;"> 14.96014 </td> <td style="text-align:right;"> 21.86482 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1011.4 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 1357034400 </td> </tr> <tr> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 544 </td> <td style="text-align:right;"> 545 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> 1004 </td> <td style="text-align:right;"> 1022 </td> <td style="text-align:right;"> -18 </td> <td style="text-align:left;"> B6 </td> <td style="text-align:right;"> 725 </td> <td style="text-align:left;"> N804JB </td> <td style="text-align:left;"> JFK </td> <td style="text-align:left;"> BQN </td> <td style="text-align:right;"> 183 </td> <td style="text-align:right;"> 1576 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 45 </td> <td style="text-align:right;"> 1357034400 </td> <td style="text-align:left;"> EWR </td> <td style="text-align:right;"> 2013 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 39.02 </td> <td style="text-align:right;"> 28.04 </td> <td style="text-align:right;"> 64.43 </td> <td style="text-align:right;"> 260 </td> <td style="text-align:right;"> 12.65858 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1011.9 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 1357034400 </td> </tr> </tbody> </table> --- # Tidyverse-to-SQL summary table |tidyverse | SQL | |--|--| | <code>dataset |></code> | `FROM dataset` | | `select()` | `SELECT` | | `filter()` | `WHERE` | | `mutate()` | `SELECT ... AS` | | `arrange()` | `ORDER BY` | | <code>group_by() |> summarize()</code> | `SELECT ... AS ... GROUP BY` | | `left_join()` | `LEFT JOIN ... ON` | | `head()` | `LIMIT` | --- # More challenges **Use SQL queries to answer these questions for year 2013:** 1. How many nonstop flights went from New York airports to Bozeman (BZN)? Return all matching rows. 2. List the number of flights to Bozeman per month, naming this column `flights_count`. Were they year-round or in specific seasons? 3. What were the top 10 longest flights out of NYC airports? 4. Ignore duplicate routes. Which 10 distinct *routes* had the longest flights? 5. **Advanced:** What percentage of flights took off in freezing temperatures? Make sure to confirm the join gives you what you want. *(Hint: First just count the number of flights that took off in temperatures below/above freezing. Calculating the percentage within SQL requires a subquery.)* --- # Disconnect When you're done using a database, make sure to **disconnect** from the connection: ```r dbDisconnect(lite_con) ``` --- class: inverse, middle name: bigquery # Getting started with BigQuery --- # Google Cloud BigQuery If you want to learn more SQL, I suggest you play around with **Google BigQuery** using the [web interface](https://console.cloud.google.com/bigquery). - Sandbox (Google Cloud Free Tier): Free to analyze 1 TB per month. - Free trial of Google Cloud: Gives you a $300 credit to use for 90 days. - Cheap after that: Half a cent per GB. Incredible [public datasets](https://www.reddit.com/r/bigquery/wiki/datasets/). E.g.: * All open-source code on GitHub (1.7+ TB) * 1.9 billion comments on Reddit (546+ GB) * 1 billion taxi trips in NYC (130+ GB) You can get started with Google's tutorial on [how to query a public dataset](https://cloud.google.com/bigquery/docs/quickstarts/query-public-dataset-console). --- # Try it out **Start here:** [https://console.cloud.google.com/bigquery](https://console.cloud.google.com/bigquery) 1. Log in with your personal Google account 2. Agree to the terms to use BigQuery 3. **DISMISS** the banner asking you to start your free trial (do not activate it unless you want to) 4. Create a project (Select a Project --> New Project) 5. Copy your Project Number (click on the Google Cloud logo) Store your project number as an **environment variable,** like with API keys. * Run this line **IN YOUR R CONSOLE,** replacing the fake string of digits with your project number: ```r Sys.setenv(GCE_DEFAULT_PROJECT_ID = "1234567890") ``` --- # Connect to BigQuery In particular, using public data from Global Fishing Watch: ```r gfw_con = dbConnect( bigrquery::bigquery(), project = "global-fishing-watch", dataset = "global_footprint_of_fisheries", billing = Sys.getenv("GCE_DEFAULT_PROJECT_ID") ) dbListTables(gfw_con) ``` ``` #> [1] "fishing_effort" "fishing_effort_byvessel" #> [3] "fishing_vessels" "vessels" ``` --- # Which countries fish the most? ```sql SELECT flag, SUM(fishing_hours) AS total_fishing_hours FROM fishing_effort GROUP BY flag ORDER BY total_fishing_hours DESC ``` What this looks like in the R Markdown script: ````markdown ```{sql connection = gfw_con, output.var = "effort_by_country"} SELECT flag, SUM(fishing_hours) AS total_fishing_hours FROM fishing_effort GROUP BY flag ORDER BY total_fishing_hours DESC ``` Running job '1234567890.job_HiDalrETlYSSTb6XvKty8tMX4j_f.US' [|] 2s Complete Billed: 2.35 GB Downloading first chunk of data. Downloading the remaining 10,412 rows in 2 chunks of (up to) 10,000 rows. ```` --- # Which countries fish the most? ```r effort_by_country ``` ``` #> # A tibble: 126 × 2 #> flag total_fishing_hours #> <chr> <dbl> #> 1 CHN 57711389. #> 2 ESP 8806223. #> 3 ITA 6790417. #> 4 FRA 6122613. #> 5 RUS 5660001. #> 6 KOR 5585248. #> 7 TWN 5337054. #> 8 GBR 4383738. #> 9 JPN 4347252. #> 10 NOR 4128516. #> # ℹ 116 more rows ``` --- # Extract a 1° grid of fishing hours We need a nested query: ```sql SELECT lat_bin_center, lon_bin_center, SUM(fishing_hours) AS fishing_hours FROM ( SELECT *, FLOOR(lat_bin / 100.0) * 1.0 + 0.5 * 1.0 AS lat_bin_center, FLOOR(lon_bin / 100.0) * 1.0 + 0.5 * 1.0 AS lon_bin_center FROM fishing_effort WHERE _PARTITIONTIME >= '2016-01-01 00:00:00' AND _PARTITIONTIME <= '2016-12-31 00:00:00' AND fishing_hours > 0.0 ) GROUP BY lat_bin_center, lon_bin_center ``` --- # Extract a 1° grid of fishing hours Here are the results: ```r globe ``` ``` #> # A tibble: 20,412 × 3 #> lat_bin_center lon_bin_center fishing_hours #> <dbl> <dbl> <dbl> #> 1 52.5 -12.5 15109. #> 2 75.5 40.5 14164. #> 3 70.5 47.5 1251. #> 4 52.5 154. 4258. #> 5 -55.5 -72.5 349. #> 6 -49.5 -61.5 1726. #> 7 51.5 -7.5 33791. #> 8 63.5 -16.5 7498. #> 9 61.5 176. 4180. #> 10 69.5 38.5 1748. #> # ℹ 20,402 more rows ``` --- # Extract a 1° grid of fishing hours <img src="17-Databases_files/figure-html/unnamed-chunk-19-1.svg" style="display: block; margin: auto;" /> --- # Always disconnect ```r dbDisconnect(gfw_con) ``` --- # Summary 1. [Tools for big data](#bigdata) 1. [Databases in R](#databases) 1. [Writing SQL queries](#sql) 1. [Getting started with BigQuery](#bigquery)