class: center, middle, inverse, title-slide .title[ # Big Data and Economics ] .subtitle[ ## Lecture 1: Introduction ] .author[ ### Kyle Coombs (he/him/his) ] .date[ ### Bates College |
EC/DCS 368
] --- <style type="text/css"> # CSS for including pauses in printed PDF output (see bottom of lecture) @media print { .has-continuation { display: block !important; } } </style> # Table of contents 1. [Prologue](#prologue) 2. [Veggies: Syllabus highlights](#syllabus) 2. [Dessert: What is data science?](#what_is_data_science) 4. [Appetizer: Getting started](#started) 5. [Entree: R for data science](#r4ds) 6. [Second Dessert: Data visualization with ggplot2](#ggplot2) --- class: inverse, center, middle name: prologue # Prologue <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Introductions ### Course
https://github.com/big-data-and-economics You'll soon receive access to this GitHub organization, where we submit assignments, upload presentations, etc. -- ### Me
[Kyle Coombs](https://kylecoombs.com/)
[kcoombs@bates.edu](mailto:kcoombs@bates.edu)
Assistant Professor (economics)
From Scotia, New York
Live in Maine and Massachusetts
Research fields: Public and Labor, interested in applied econometrics and data science --- class: inverse, center, middle name: syllabus # Syllabus highlights <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> (Read the full document [here](https://github.com/big-data-and-economics/syllabus/blob/master/syllabus.pdf).) --- # Why this course? Fill in the gaps left by traditional econometrics and methods classes. - Practical skills that tools that will benefit your thesis and future career. - Neglected skills like how to actually find datasets in the wild and clean them. - Apply skills to analyze empirical questions on economic and social problems. Data science skills are largely distinct from (and complementary to) the core 'metrics familiar to economists. - Acquiring data; scraping; maintaining databases; etc. - Data viz, cleaning and wrangling; programming; cloud computation; relational databases; machine learning; etc. -- > *"In short, we will cover things that I wish someone had taught me when I was starting out in college."* --- # Caveat - This course will be **_hard._** You will need to: - Teach yourself new skills I cannot cover in 12 weeks - Be entrepreneurial: If you find a better way to do something, do (and share) it! - Be patient: You will encounter bugs and errors, and you will need to learn how to fix them - This course will also be **_rewarding_** - You can avoid the mistakes you make here on your thesis and in your career - You will learn skills that employers, pre-doc programs, and grad schools want - You will learn how to be a better researcher and citizen - Seriously, a little data science goes a long way in helping you see through BS --- # Tips for coding fruitfully You're gonna write a lot of code for this class, which means you're gonna troubleshoot a lot of bugs. - Some of these will be bugs of your own making, some will be bugs of open source R tools 1. Try to describe in plain words/simple pictures what you want code to do before you write it - Read in a CSV file with variables for annual population at the county level and calculate the change in population from the previous year 2. Break this description into smaller steps (1: Read in data, 2: Drop rows with NA County, etc.) 3. Write code "modularly" to do each step 4. Then you can troubleshoot modularly too 5. The `help` documentation for R functions is the best place to start for troubleshooting 6. Google is your dictionary, ChatGPT is your weird friend who knows a lot of words but sometimes uses them wrong<sup>1</sup> - Be precise in your Google searches and ChatGPT instructions 7. If you're stuck, ask for help from me and classmates on GitHub Discussions <sup>1</sup> GitHub CoPilot (ChatGPT's cousin) wrote that sentence fragment! --- # You, at the end of this course <div align="center"> <img src="pics/awesome.jpg"> </div> --- # Syllabus readthrough - Read over the class [ReadMe](https://github.com/big-data-and-economics/big-data-class-materials) for 3 minutes - Identify at least one word you're unfamiliar with - Anyone have guesses for how to define them? - [Mentimeter](https://www.mentimeter.com/app/presentation/bldvy2hwjfb483btveenvon9k7mbtkdm/xqvbqbhy8hzo) -- #### Highlights 1. Final project: Paper replication 2. 6 problem sets, drop the two lowest, grade improvement 3. 5-minute presentations 4. GitHub Participation 5. _Optional_ Lewiston Hack-a-thon: 1-day data science competition at end of semester - Details TBD - I will soon be asking your availability for 24-hour periods in late March/April - Does 24 hours lead to immediate red flags for anyone's schedule? If so please let me know now. --- # Class outline .pull-left[ ### Data science basics - Version control with Git and GitHub - R language basics - Data cleaning and wrangling - Webscraping - Data visualization ### Analysis and Programming - Regression analysis in R - Spatial analysis in R - Functions in R - Parallel programming ] .pull-right[ ### Causal inference - Regression discontinuity design - Panel data and fixed effects - Field experiments ### Scaling up: Big data, ML, and cloud computation - High performance computing (Leavitt cluster) - Databases: SQL(ite) and BigQuery - Machine Learning techniques - Text analysis ] --- class: inverse, center, middle name: what_is_data_science # What is Data Science? --- # What is Data Science? - .hi[Data science (DS):] The scientific discipline that deals with transforming data into useful information ("insights") using a variety of stats/ML techniques - Facebook: Collects data on search history, friendship links, site clicks, occupation, etc. - Chetty et al. used FB data to estimate users' SES and social network ([Social Capital Atlas](https://www.socialcapital.org/)) - The rise of data science has come because of the so-called Big Data revolution - The rise of the internet in the late-1990s and 2000s `\(\Rightarrow \,\uparrow\)` opportunities for companies and governments to collect data on consumers & citizens - Spread of mobile devices & social media from late 2000s until now generated even more data #### Pillars of data science - Programming (automation of data collection, manipulation, cleaning, visualization, and modeling) - Visualization & exploration - Causal inference (to be able to make a policy prescription) - Machine learning (to select models, compress data, predict outcomes) ...Assuming one has the appropriate foundation of basic calculus and statistics --- # The data science workflow .center[ <img src="https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png" width="85%" style="display: block; margin: auto;" /> Source: [R for Data Science](http://r4ds.had.co.nz/introduction.html) ] --- # Big Data .center[ <img src="pics/frisch.jpg" width="90%" style="display: block; margin: auto;" /> ] --- # Big Data .center[ <img src="pics/frisch.jpg" width="90%" style="display: block; margin: auto;" /> ] Source: Frisch, Ragnar. 1933. "Editor's Note" _Econometrica_ 1(1): 1-4 --- # What is Big Data? It depends on who you ask. It could mean: 1. "Wild" data (unstructured; happenstance; collected without a particular intention; e.g. twitter, contrast with Census surveys) 2. "Wide" data (a.k.a. "Big-K" data because `\(K>N\)`, customer data sets where each click is a variable) 3. "Long" data (a.k.a. "Big-N" data because `\(N\)` very, very large [and may not all fit onto a single hard drive!], government tax records, Medicare claims data, etc.) 4. Any data set that cannot be analyzed with classical methods like OLS (e.g. all combinations of the above three types) "Big Data" not so much about size of data, but about whether or not "small data" (read: classical) methods can be used --- # Long data <img src="pics/longdata.png" width="90%" style="display: block; margin: auto;" /> - Main application: *identifying causal effects* - Example: effects of improving schools on income --- # Wide data <img src="pics/widedata.png" width="90%" style="display: block; margin: auto;" /> - Main application: *prediction* - Example: predicting income to target ads from tons of information like location, links clicked, etc. --- # Why does data shape matter? -- 1. Data shape determines how much memory is required to store information 2. Data shape determines what method you can use to analyze the data -- - A regression model with year fixed effects requires that year changes between observations, so you need long data - You cannot have a wide data set with one row per unit and one column per year --- # What can we measure with data? - Data can measure a lot of information about the world, but not everything - Yes, that sounds like buzzword soup, but it is true - Any dataset, no matter how big, has simplified the world in some way - You want the simplification to match the question - How do you record where a person is? - County? Lots of people have same location. - IP address? Changes frequently - GPS coordinates? Too precise, and changes every second! - The solution? Decide why you need location - Are you curious about the effect of local government policies or firms on people? - Are you looking to measure the effect of air pollution on health? - Do you want to see how people change their commute patterns over time? When there is a road closure? - Different questions require different levels of precision --- # Big data & machine learning - You'll often hear the phrase "big data and machine learning" - This is because many machine learning algorithms are helpful for big data problems: - Selecting which `\(k<K\)` covariates should enter your model - Streamlined techniques for processing "wild" data - New modeling approaches that can leverage the greater amount of information that Big Data has - ML also gave us generative AI ([ChatGPT](https://chat.openai.com/) and [GitHub CoPilot](https://github.com/features/copilot)), which I encourage you to use heavily to help you learn R and Git - Think of it as a more interactive version of Googling for the solution to a bug - It is not (yet) a replacement for thinking through the best way to code something - It will actually perform better if you think through the basic coding tasks first, then ask it to fill in the blanks --- # What is machine learning? What is AI? - .hi[Machine learning (ML):] Allowing computers to learn for themselves without explicitly being programmed - USPS: Computer to read handwriting on envelopes - Google: AlphaGo, computer that defeated world champion Go player - Apple/Amazon/Microsoft: Siri, Alexa, Cortana, Talon voice assistants - .hi[Artificial intelligence (AI):] Constructing machines (robots, computers) to think and act like human beings - ML is a subset of AI --- class: inverse, center, middle name: started # Getting started <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Software installation and registration 1. Download [R](https://www.r-project.org/). 2. Download [RStudio](https://www.rstudio.com/products/rstudio/download/preview/). 3. Download [Git](https://git-scm.com/downloads). 4. Create an account on [GitHub](https://github.com/) and register for a student/educator [discount](https://education.github.com/discount_requests/new). - We will use GitHub to disseminate and submit assignments, receive feedback and grading, etc. 5. Make a folder on your computer for this class. Any and all repositories for this class should be cloned into this folder. -- If you had trouble completing any of these steps, please raise your hand. - My go-to place for installation guidance and troubleshooting is Jenny Bryan's http://happygitwithr.com. --- # Some OS-specific extras I'll detail further software requirements as and when the need arises. However, to help smooth some software installation issues further down the road, please also do the following (depending on your OS): - **Windows:** Install [Rtools](https://cran.r-project.org/bin/windows/Rtools/). I also recommend that you install [Chocolately](https://chocolatey.org/) and [Windows Subsystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/install). - **Mac:** Install [Homebrew](https://brew.sh/). I also recommend that you configure/open your C++ toolchain (see [here](https://github.com/rmacoslib/r-macos-rtools#installer-package-for-macos-r-toolchain-).) - **Linux:** None (you should be good to go). --- # Checklist ☑ Do you have the most recent version of R? ```r version$version.string ``` ``` ## [1] "R version 4.3.1 (2023-06-16 ucrt)" ``` ☑ Do you have the most recent version of RStudio? (The [preview version](https://www.rstudio.com/products/rstudio/download/preview/) is fine.) ```r RStudio.Version()$version ## Requires an interactive session but should return something like "[1] ‘1.4.1100’" ``` ☑ Have you updated all of your R packages? ```r update.packages(ask = FALSE, checkBuilt = TRUE) ``` --- # Checklist (cont.) Open up the [shell](http://happygitwithr.com/shell.html#shell). - Windows users, make sure that you installed a Bash-compatible version of the shell. If you installed [Git for Windows](https://gitforwindows.org), then you should be good to go. ☑ Which version of Git have you installed? ```bash git --version ``` ``` ## git version 2.34.1 ``` ☑ Did you introduce yourself to Git? (Substitute in your details.) ```bash git config --global user.name 'kgcsport' git config --global user.email 'kcoombs@bates.edu' git config --global --list ``` ☑ Did you register an account in GitHub? --- # Checklist (cont.) - Navigate to the [class materials repository](https://github.com/big-data-and-economics/big-data-class-materials) - Click the green "Code" button and copy the HTTPS link. - Under Codespaces, click `Create codespace on main`. This will create a cloud-based server for you to work on. - It may take a few minutes to get up and running. - Once inside, navigate to `PORTS` at the bottom, and click the link under "Local Address" for the RStudio Port. The username and password are rstudio/rstudio. - You should now be in RStudio! - Open up test.Rmd and click `Knit` at the top to test that it works! - Navigate back to the main GitHub repository page and click `Code`, the three dots next to the codespace name, and `Delete codespace`. - I'll encourage you to use Codespaces on some problem sets when I want to hit the ground running instead of troubleshooting package installation issues. --- # Checklist (cont.) We will make sure that everything is working properly with your R and GitHub setup next lecture. For the rest of today's lecture, I want to go over some very basic ChatGPT and R concepts. -- </br> PS — Just so you know where we're headed: We'll return to these R concepts (and delve much deeper) next week after a brief, but important detour to the lands of coding best practices and Git(Hub). --- class: inverse, center, middle name: r4ds # R and Generative AI for data science <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- name: whyr # Why R and RStudio? ### Data science positivism - Alongside Python, R has become the *de facto* language for data science. - See: [*The Impressive Growth of R*](https://stackoverflow.blog/2017/10/10/impressive-growth-r/), [*The Popularity of Data Science Software*](http://r4stats.com/articles/popularity/) - Open-source (free!) with a global user-base spanning academia and industry. - "Do you want to be a profit source or a cost center?" ### Bridge to applied economics and other tools - Already has all of the statistics and econometrics support, and is amazingly adaptable as a “glue” language to other programming languages and APIs. - The RStudio IDE and ecosystem allow for further, seemless integration. ### Path dependency - It's also the language that I know (second) best. - (Learning multiple languages is a good idea, though.) --- # R code example (linear regression) ```r fit = lm(mpg ~ wt, data = mtcars) summary(fit) ``` ``` ## ## Call: ## lm(formula = mpg ~ wt, data = mtcars) ## ## Residuals: ## Min 1Q Median 3Q Max ## -4.5432 -2.3647 -0.1252 1.4096 6.8727 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 37.2851 1.8776 19.858 < 2e-16 *** ## wt -5.3445 0.5591 -9.559 1.29e-10 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.046 on 30 degrees of freedom ## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446 ## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10 ``` --- # Base R plot ```r par(mar = c(4, 4, 1, .1)) ## Just for nice plot margins on this slide deck plot(mtcars$wt, mtcars$mpg) abline(fit, col = "red") ``` <img src="01-Intro_files/figure-html/mtcars_baseplot-1.svg" style="display: block; margin: auto;" /> --- # ggplot2 ```r library(ggplot2) ggplot(data = mtcars, aes(x = wt, y = mpg)) + geom_smooth(method = "lm", col = "red") + geom_point() ``` <img src="01-Intro_files/figure-html/mtcars_ggplot-1.svg" style="display: block; margin: auto;" /> --- # Do `\(\uparrow\)` in GDP cause life expectancy to `\(\uparrow\)`? - Let's use our new-found R knowledge to try to separate correlation from causation for a critical question in economics: - Does increasing the economic pie (GDP) lead to longer lives (life expectancy)? - We can use the gapminder dataset to explore this question - The [gapminder](https://github.com/jennybc/gapminder) dataset contains panel data on life expectancy, population size, and GDP per capita for 142 countries since the 1950s - Any predictions about what we'll learn? - Also, we can tap into generative AI to help us climb the steep learning curve of coding in a new language --- class: inverse, center, middle name: ggplot2 # More ggplot2 <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Install and load Open up your laptops. For the remainder of this first lecture, we're going to play around with [ggplot2](https://ggplot2.tidyverse.org/) (i.e. livecoding) to answer our question about GDP and life expectancy. If you don't have them already, install the `ggplot2` and `gapminder` packages via either: - **Console:** Enter `install.packages(c("ggplot2", "gapminder"), dependencies=T)`. - **RStudio:** Click the "Packages" tab in the bottom-right window pane. Then click "Install" and search for these two packages. ![](pics/install.gif) --- # Install and load (cont.) Once the packages are installed, load them into your R session with the `library()` function. ```r library(ggplot2) library(gapminder) ## We're just using this package for the gapminder data ``` Notice too that you don't need quotes around the package names any more. Reason: R now recognises these packages as defined objects with given names. -- PS — A convenient way to combine the package installation and loading steps is with the [pacman package's](https://github.com/trinker/pacman) `p_load()` function. If you run `pacman::p_load(ggplot, gapminder)` it will first look to see whether it needs to install either package before loading them. Clever. - We'll get to this later, but if you want to run a function from an (installed) package without loading it, you can use the `PACKAGE::package_function()` syntax. --- name: ai # Leverage ChatGPT and GitHub CoPilot - I'm going to ask you to use a number of new technical tools in this class - The documentation will be concise and oftentimes you won't know where to start - For example, how in the world are you supposed to make a ggplot plot of the gapminder data if you've never used ggplot before? - What if you've never even opened R? How do you even look at the gapminder data? - I strongly encourage you to use ChatGPT and GitHub CoPilot in all aspects of this class - ChatGPT is a large language model (LLM) developed by OpenAI - Create an account at http://chat.openai.com/ - ChatGPT is excellent at producing basic code and helping resolve bugs - It serves as a bridge between the ideas in your head and crisp instructions for a computer (code) --- # ChatGPT example First, submit a [prompt on Mentimeter](https://www.mentimeter.com/app/presentation/bldvy2hwjfb483btveenvon9k7mbtkdm/emjehaixxc7c) that loads the gapminder data, displays some key checks, then makes a scatterplot of GDP per capita and life expectancy -- **Human:** Write an R script that does the following: (a) Reads in the gapminder dataset using the library gapminder. (b) Prints out the first 10 rows of the gapminder dataset. (c) Lists which countries are in the sample (i.e. list unique values of the county variable) (d) Plots a scatterplot of the gdpPercap by lifeExp using the gapminder dataset using ggplot. --- # ChatGPT example (continued) **Assistant:** Certainly! Here's an R script that accomplishes these tasks using the gapminder dataset: .scroll-box-8[ ```r # Load necessary libraries library(gapminder) library(ggplot2) # Step (a) - Read in the gapminder dataset data("gapminder") gapminder_df <- gapminder # Step (b) - Print first 10 rows print(head(gapminder_df, 10)) # Step (c) - List unique countries unique_countries <- unique(gapminder_df$country) print(unique_countries) # Step (d) - Plot scatterplot of gdpPercap by lifeExp ggplot(gapminder_df, aes(x = gdpPercap, y = lifeExp)) + geom_point() + labs(x = "GDP per Capita", y = "Life Expectancy") + ggtitle("Relationship between GDP per Capita and Life Expectancy") ``` ] This script assumes that the gapminder package is installed and loaded. It reads in the gapminder dataset, prints the first 10 rows, lists unique countries present in the dataset, and then creates a scatterplot of gdpPercap (GDP per Capita) against lifeExp (Life Expectancy) using ggplot2. Adjust the plot aesthetics as needed to suit your preferences. *** Exported on 1/8/2024. --- # How's the code run? ``` ## # A tibble: 10 × 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 1952 28.8 8425333 779. ## 2 Afghanistan Asia 1957 30.3 9240934 821. ## 3 Afghanistan Asia 1962 32.0 10267083 853. ## 4 Afghanistan Asia 1967 34.0 11537966 836. ## 5 Afghanistan Asia 1972 36.1 13079460 740. ## 6 Afghanistan Asia 1977 38.4 14880372 786. ## 7 Afghanistan Asia 1982 39.9 12881816 978. ## 8 Afghanistan Asia 1987 40.8 13867957 852. ## 9 Afghanistan Asia 1992 41.7 16317921 649. ## 10 Afghanistan Asia 1997 41.8 22227415 635. ``` ``` ## [1] Afghanistan Albania Algeria ## [4] Angola Argentina Australia ## [7] Austria Bahrain Bangladesh ## [10] Belgium Benin Bolivia ## [13] Bosnia and Herzegovina Botswana Brazil ## [16] Bulgaria Burkina Faso Burundi ## [19] Cambodia Cameroon Canada ## [22] Central African Republic Chad Chile ## [25] China Colombia Comoros ## [28] Congo, Dem. Rep. Congo, Rep. Costa Rica ## [31] Cote d'Ivoire Croatia Cuba ## [34] Czech Republic Denmark Djibouti ## [37] Dominican Republic Ecuador Egypt ## [40] El Salvador Equatorial Guinea Eritrea ## [43] Ethiopia Finland France ## [46] Gabon Gambia Germany ## [49] Ghana Greece Guatemala ## [52] Guinea Guinea-Bissau Haiti ## [55] Honduras Hong Kong, China Hungary ## [58] Iceland India Indonesia ## [61] Iran Iraq Ireland ## [64] Israel Italy Jamaica ## [67] Japan Jordan Kenya ## [70] Korea, Dem. Rep. Korea, Rep. Kuwait ## [73] Lebanon Lesotho Liberia ## [76] Libya Madagascar Malawi ## [79] Malaysia Mali Mauritania ## [82] Mauritius Mexico Mongolia ## [85] Montenegro Morocco Mozambique ## [88] Myanmar Namibia Nepal ## [91] Netherlands New Zealand Nicaragua ## [94] Niger Nigeria Norway ## [97] Oman Pakistan Panama ## [100] Paraguay Peru Philippines ## [103] Poland Portugal Puerto Rico ## [106] Reunion Romania Rwanda ## [109] Sao Tome and Principe Saudi Arabia Senegal ## [112] Serbia Sierra Leone Singapore ## [115] Slovak Republic Slovenia Somalia ## [118] South Africa Spain Sri Lanka ## [121] Sudan Swaziland Sweden ## [124] Switzerland Syria Taiwan ## [127] Tanzania Thailand Togo ## [130] Trinidad and Tobago Tunisia Turkey ## [133] Uganda United Kingdom United States ## [136] Uruguay Venezuela Vietnam ## [139] West Bank and Gaza Yemen, Rep. Zambia ## [142] Zimbabwe ## 142 Levels: Afghanistan Albania Algeria Angola Argentina Australia ... Zimbabwe ``` --- # How's the code run? (cont.) <img src="01-Intro_files/figure-html/gapminder-chatgpt-eval2-1.png" style="display: block; margin: auto;" /> --- # Tips for using ChatGPT - Be as specific as possible in your instructions - If you know the name of the variables in your dataset, use them - Think of it as a more interactive version of Googling for the solution to a bug - Try things iteratively and in small steps - If you're not sure how to do something, try to break it down into smaller steps - This is a good tip for coding in general - Your brain is still the most powerful tool you have - ChatGPT is a tool to help you, not replace you - You will not get much mileage if you say, "Read in the gapminder dataset and do something interesting with it" - Often it only "skeleton code" for you, so you'll need to fill in the blanks --- # Elements of ggplot2 [Hadley Wickham's](http://hadley.nz/) ggplot2 is one of the most popular packages in the entire R canon. - It also happens to be built upon some deep visualization theory: i.e. Leland Wilkinson's [*The Grammar of Graphics*](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448). There's a lot to say about ggplot2's implementation of this "grammar of graphics" approach, but the three key elements are: 1. Your plot ("the visualization") is linked to your variables ("the data") through various **aesthetic mappings**. 2. Once the aesthetic mappings are defined, you can represent your data in different ways by choosing different **geoms** (i.e. "geometric objects" like points, lines or bars). 3. You build your plot in **layers**. -- </br> That's kind of abstract. Let's break down the elements of ggplot2 in turn with some actual plots. - As a shortcut, we'll use AI to write the basic code for us then we'll fill in the blanks. --- # 1. Aesthetic mappings ```r ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point() ``` <img src="01-Intro_files/figure-html/aesthetics0-1.png" style="display: block; margin: auto;" /> --- # 1. Aesthetic mappings (cont.) ```r ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point() ``` Focus on the top line, which contains the initialising `ggplot()` function call. This function accepts various arguments, including: - Where the data come from (i.e. `data = gapminder`). - What the aesthetic mappings are (i.e. `mapping = aes(x = gdpPercap, y = lifeExp)`). -- The aesthetic mappings here are pretty simple: They just define an x-axis (GDP per capita) and a y-axis (life expecancy). - To get a sense of the power and flexibility that comes with this approach, however, consider what happens if we add more aesthetics to the plot call... --- # 1. Aesthetic mappings (cont.) - Can you ask ChatGPT to add color the dots by continent? ```r ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point() ``` - Submit your guesses on [Mentimeter](https://www.mentimeter.com/app/presentation/bldvy2hwjfb483btveenvon9k7mbtkdm/cc861vpassv5). --- # 1. Aesthetic mappings (cont.) ```r ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, size = pop, col = continent)) + geom_point(alpha = 0.3) ## "alpha" controls transparency. Takes a value between 0 and 1. ``` <img src="01-Intro_files/figure-html/aesthetics3-1.svg" style="display: block; margin: auto;" /> -- - I've dropped the "mapping =" part of the ggplot call. `ggplot2` knows the order of the arguments. --- # 1. Aesthetic mappings (cont.) The aesthetics expect "long" data. This is distinct from the "wide" format, where each series has its own column. -- .pull-left[ #### Long will work ``` ## # A tibble: 1,704 × 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 1952 28.8 8425333 779. ## 2 Afghanistan Asia 1957 30.3 9240934 821. ## 3 Afghanistan Asia 1962 32.0 10267083 853. ## 4 Afghanistan Asia 1967 34.0 11537966 836. ## 5 Afghanistan Asia 1972 36.1 13079460 740. ## 6 Afghanistan Asia 1977 38.4 14880372 786. ## 7 Afghanistan Asia 1982 39.9 12881816 978. ## 8 Afghanistan Asia 1987 40.8 13867957 852. ## 9 Afghanistan Asia 1992 41.7 16317921 649. ## 10 Afghanistan Asia 1997 41.8 22227415 635. ## # ℹ 1,694 more rows ``` ] .pull-right[ #### Wide would not work ``` ## # A tibble: 142 × 26 ## country continent lifeExp_1952 lifeExp_1957 lifeExp_1962 lifeExp_1967 ## <fct> <fct> <dbl> <dbl> <dbl> <dbl> ## 1 Afghanistan Asia 28.8 30.3 32.0 34.0 ## 2 Albania Europe 55.2 59.3 64.8 66.2 ## 3 Algeria Africa 43.1 45.7 48.3 51.4 ## 4 Angola Africa 30.0 32.0 34 36.0 ## 5 Argentina Americas 62.5 64.4 65.1 65.6 ## 6 Australia Oceania 69.1 70.3 70.9 71.1 ## 7 Austria Europe 66.8 67.5 69.5 70.1 ## 8 Bahrain Asia 50.9 53.8 56.9 59.9 ## 9 Bangladesh Asia 37.5 39.3 41.2 43.5 ## 10 Belgium Europe 68 69.2 70.2 70.9 ## # ℹ 132 more rows ## # ℹ 20 more variables: lifeExp_1972 <dbl>, lifeExp_1977 <dbl>, ## # lifeExp_1982 <dbl>, lifeExp_1987 <dbl>, lifeExp_1992 <dbl>, ## # lifeExp_1997 <dbl>, lifeExp_2002 <dbl>, lifeExp_2007 <dbl>, ## # gdpPercap_1952 <dbl>, gdpPercap_1957 <dbl>, gdpPercap_1962 <dbl>, ## # gdpPercap_1967 <dbl>, gdpPercap_1972 <dbl>, gdpPercap_1977 <dbl>, ## # gdpPercap_1982 <dbl>, gdpPercap_1987 <dbl>, gdpPercap_1992 <dbl>, … ``` ] --- # 1. Aesthetic mappings (cont.) We can specify aesthetic mappings in the geom layer too. ```r ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) + ## Applicable to all geoms geom_point(aes(size = pop, col = continent), alpha = 0.3) ## Applicable to this geom only ``` <img src="01-Intro_files/figure-html/aesthetics4-1.svg" style="display: block; margin: auto;" /> --- # 1. Aesthetic mappings (cont.) Oops. What went wrong here? ```r ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) + geom_point(aes(size = "big", col="black"), alpha = 0.3) ``` <img src="01-Intro_files/figure-html/aesthetics_mistake-1.svg" style="display: block; margin: auto;" /> -- **Answer:** Aesthetics must be mapped to variables, not descriptions! --- # 1. Aesthetic mappings (cont.) At this point, instead of repeating the same ggplot2 call every time, it will prove convenient to define an intermediate plot object that we can re-use. ```r p = ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) p ``` <img src="01-Intro_files/figure-html/p-1.svg" style="display: block; margin: auto;" /> --- # 2. Geoms Once your variable relationships have been defined by the aesthetic mappings, you can invoke and combine different geoms to generate different visualizations. ```r p + geom_point(alpha = 0.3) + geom_smooth(method = "loess") # A smoothed "locally estimated scatterplot smoothing" line ``` <img src="01-Intro_files/figure-html/geoms1-1.svg" style="display: block; margin: auto;" /> --- # 2. Geoms (cont.) Aesthetics can be applied differentially across geoms. ```r p + geom_point(aes(size = pop, col = continent), alpha = 0.3) + geom_smooth(method = "loess") ``` <img src="01-Intro_files/figure-html/geoms2-1.svg" style="display: block; margin: auto;" /> --- # 2. Geoms (cont.) The previous plot provides a good illustration of the power (or effect) that comes from assigning aesthetic mappings "globally" vs in the individual geom layers. - Compare: What happens if you run the below code chunk? ```r ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, size = pop, col = continent)) + geom_point(alpha = 0.3) + geom_smooth(method = "loess") ``` --- # 2. Geoms (cont.) The previous plot provides a good illustration of the power (or effect) that comes from assigning aesthetic mappings "globally" vs in the individual geom layers. - Compare: What happens if you run the below code chunk? ```r ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, size = pop, col = continent)) + geom_point(alpha = 0.3) + geom_smooth(method = "loess") ``` <img src="01-Intro_files/figure-html/geoms3-eval-1.png" style="display: block; margin: auto;" /> --- # 3. Build your plot in layers We've already seen how we can chain (or "layer") consecutive plot elements using the `+` connector. - The fact that we can create and then re-use an intermediate plot object (e.g. "p") is testament to this. But it bears repeating: You can build out some truly impressive complexity and transformation of your visualization through this simple layering process. - You don't have to transform your original data; ggplot2 takes care of all of that. - For example (see next slide for figure). - **Bonus:** Maybe this will help make sense of the non-linear relationship between GDP per capita and life expectancy? --- # Build your plot in layers (cont.) ```r p2 = p + geom_point(aes(size = pop, col = continent), alpha = 0.3) + scale_color_brewer(name = "Continent", palette = "Set1") + ## Different colour scale scale_size(name = "Population", labels = scales::comma) + ## Different point (i.e. legend) scale scale_x_log10(labels = scales::dollar) + ## Switch to logarithmic scale on x-axis. Use dollar units. labs(x = "Log (GDP per capita)", y = "Life Expectancy") + ## Better axis titles theme_minimal() ## Try a minimal (b&w) plot theme ``` - Before executing, what will this do? - The comments help, as will Google and ChatGPT, but the function names are somewhat intuitive too. --- # 3. Build your plot in layers (cont.) <img src="01-Intro_files/figure-html/layers2-1.svg" style="display: block; margin: auto;" /> --- # What else? We have barely scratched the surface of ggplot2's or ChatGPT's functionality... let alone talked about the entire ecosystem of packages that has been built around it. - Here's are an additional example to whet your appetite -- Note that you will need to install and load some additional packages if you want to recreate the next two figures on your own machine. A quick way to do this: ```r if (!require("pacman")) install.packages("pacman") pacman::p_load(hrbrthemes, gganimate) ``` #### Animation! (See the next slide for the resulting GIF.) ```r # library(gganimate) ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, colour = country)) + geom_point(alpha = 0.7, show.legend = FALSE) + scale_colour_manual(values = country_colors) + scale_size(range = c(2, 12)) + scale_x_log10(labels = scales::dollar) + facet_wrap(~continent) + # Here comes the gganimate specific bits labs(title = 'Year: {frame_time}', x = 'Log (GDP per capita)', y = 'Life expectancy') + transition_time(year) + ease_aes('linear') ``` - What is different about this code? --- # What else? (cont.) <img src="01-Intro_files/figure-html/gganim2-1.gif" style="display: block; margin: auto;" /> -- Note that this animated plot provides a much more intuitive understanding of the underlying data. Just as [Hans Rosling](https://www.ted.com/talks/hans_rosling_the_best_stats_you_ve_ever_seen) intended. --- # But do GDP per cap increases cause life expectancy to increase? - We can't answer this question with a simple plot. - We also can't answer this question with a very fancy plot. - What do we need? > - A model > - A causal identification strategy > - More granular (bigger) data --- # What else? (cont.) We also haven't touched on ggplot2's relationship to "tidy" data. - It actually forms part of a suite of packages collectively known as the [tidyverse](https://www.tidyverse.org/). - We will get back to this. Rest assured, you will be using ggplot2 throughout the rest of this course and developing your skills along the way. - Your second assignment (coming up) is a chance specifically to hone some of those skills. You can do some reading and practice on your own. Pick any of the following (or choose among the litany of online resources) and work through their examples: - [Chapter 3](https://r4ds.had.co.nz/data-visualisation.html) of *R for Data Science* by Hadley Wickham and Garett Grolemund. - [*Data Visualization: A Practical Guide*](https://socviz.co/makeplot.html) by Kieran Healy. - [Designing ggplots](https://designing-ggplots.netlify.com) by Malcom Barrett. For next class on GitHub, please complete the following: - Problem Set 0 - Work through Chapters 1-14 of https://happygitwithr.com/ - Read through the [Git fundamentals](https://happygitwithr.com/git-intro) unit --- class: inverse, center, middle # Next lecture: Deep dive into Git(Hub). <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- name: appendix class: inverse, center, middle # Appendix --- # Some R basics 1. Everything is an object. 2. Everything has a name. 3. You do things using functions. 4. Functions come pre-written in packages (i.e. "libraries"), although you can — and should — write your own functions too. -- </br> Points 1. and 2. can be summarised as an [object-oriented programming](https://en.wikipedia.org/wiki/Object-oriented_programming) (OOP) approach. - This may sound super abstract now, but we'll see *lots* of examples over the coming weeks that will make things clear. --- # R vs Stata If you're coming from Stata, some additional things worth emphasizing: - Multiple objects (e.g. data frames) can exist happily in the same workspace. - No more `keep`, `preserve`, `restore` hackery. (Though, props to [Stata 16](https://www.stata.com/new-in-stata/multiple-datasets-in-memory/).) - This is a direct consequence of the OOP approach. - You will load packages at the start of every new R session. Make peace with this. - "Base" R comes with tons of useful in-built functions. It also provides all the tools necessary for you to write your own functions. - However, many of R's best data science functions and tools come from external packages written by other users. - R easily and infinitely parallelizes. For free. - Compare the cost of a [Stata/MP](https://www.stata.com/statamp/) license, nevermind the fact that you effectively pay per core... - You don't need to `tsset or xtset` your data. (Although you can too.)