class: center, middle, inverse, title-slide

.title[
# Applied Data Analysis for Public Policy Studies
]
.subtitle[
## Introduction
]
.author[
### Michele Fioretti
]
.date[
### Sciences Po Paris 2023-08-31
]

---
layout: true

<div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div>

---

# Welcome to Applied Data Analysis for Public Policy Studies

--

- This course will teach you the core tools of ***econometrics***. The goal of the course is to provide you with *essential data analysis tools*.

- We will meet on ***Tuesdays at 10.15 AM***.

- You will meet with the TA, Eléonore Richard, to discuss the material and review problem sets. Her email address is ***eleonore.richard@sciencespo.fr***

--

# Welcome to ScPoEconometrics!

--

- We will learn these tools through examples based on ***the programming language R***.

- This course builds on the (amazing) [open-source lectures](https://github.com/ScPoEcon/ScPoEconometrics) by the ScPo Econometrics team (Florian Oswald, Gustave Kenedi and Pierre Villedieu).

- A [book overview](https://michelefioretti.github.io/ScPoEconometrics/index.html) of the material is also available.

---

# What is *econometrics*?

- A set of techniques and methods to answer questions with data.

- Econometrics shares many things with *applied statistics* and *machine learning*.

- Some examples!

---

# Answering Important Questions with Econometrics

[<ru-blockquote>
Does immigration *cause* lower wages and higher unemployment for locals?
</ru-blockquote>](http://davidcard.berkeley.edu/papers/mariel-impact.pdf)

--

[<ru-blockquote>
Does increasing the minimum wage *cause* greater unemployment?
</ru-blockquote>](http://davidcard.berkeley.edu/papers/njmin-aer.pdf)

--

[<ru-blockquote>
Does more education *cause* higher wages?
</ru-blockquote>](http://davidcard.berkeley.edu/papers/causal_educ_earnings.pdf)

--

[<ru-blockquote>
Do higher public debt levels *cause* lower economic growth?
</ru-blockquote>](https://www.imf.org/external/pubs/ft/wp/2014/wp1434.pdf)

--

[<ru-blockquote>
Does birth order *cause* differing education trajectories?
</ru-blockquote>](https://academic.oup.com/qje/article/120/2/669/1933962)

---
background-image: url("https://media.giphy.com/media/3o7TKMMlq8TItpr7Co/giphy.gif")
background-position: 80% 80%
background-size: 300pt

# Causality

.pull-left[

* Notice the keyword **cause** in all of the above.

* Notice also that *many other factors could have caused* each of those outcomes.

* Econometrics is often about spelling out conditions under which we can claim to measure causal relationships.

* We will encounter the most basic of those conditions, and talk about some potential pitfalls.

]

.pull-right[

As in the acclaimed [Book of Why](https://www.amazon.fr/Book-Why-Science-Cause-Effect/dp/046509760X), we often ask *why* something happened.

]

---

# This Course

- Teach you the basics of ***linear regression***, ***statistical inference*** and ***impact evaluation***.

- Equip you with a framework to think more deeply about causality.

- Introduce you to the `R` software environment.

- ⚠️ This is *not* a course about `R`.

--

## Grading

1. There will be ***periodic quizzes*** on Moodle roughly every two weeks => 20%

--

1. There will be ***a problem set / case study*** => 30%

--

1. There will be ***a take-home exam*** => 50%

---

# Course Materials

1. The [Book](https://michelefioretti.github.io/ScPoEconometrics/)

1. The [Slides](https://michelefioretti.github.io/ScPoEconometrics-Slides/)

1. The code repository for the [R package](https://github.com/ScPoEcon/ScPoEconometrics)

1. Quizzes on [Moodle](https://moodle.sciences-po.fr)

---

# Syllabus 🤞

.pull-left[
<br>

Lecture 1: **Introduction**

Lecture 2/3: **Summarising Data**

*Quiz 1 (after lecture 2)*

Lecture 4: **Simple Linear Regression**

*Quiz 2*

Lecture 5: **Introduction to Causality**

*Midterm Project*

Lecture 6: **Multiple Linear Regression**

]

.pull-right[

Lecture 7: **Sampling**

Lecture 8/9: **Statistical Inference**

Lecture 10: **Differences-in-Differences**

Lecture 11: **Regression Discontinuity**

*Quiz 3*

Lecture 12: ***Recap***

*Final Project*

]

---

# Useful Resources (Other Than our [_Book_](https://scpoecon.github.io/ScPoEconometrics/))

## Econometrics

- [*Mastering Metrics*](https://catalogue-bibliotheque.sciencespo.fr/discovery/fulldisplay?docid=alma991003947949705808&context=L&vid=33USPC_SPO:SPO&lang=fr&search_scope=MyInst_and_CI&adaptor=Local%20Search%20Engine&tab=Everything&query=title,contains,Mastering%20Metrics,AND&query=creator,contains,Angrist%20and%20Pischke,AND&mode=advanced&offset=0) by Angrist and Pischke

- [*Introductory Econometrics: A Modern Approach*](https://catalogue-bibliotheque.sciencespo.fr/discovery/fulldisplay?docid=alma991003442689705808&context=L&vid=33USPC_SPO:SPO&lang=fr&search_scope=MyInst_and_CI&adaptor=Local%20Search%20Engine&isFrbr=true&tab=Everything&query=title,contains,Econometrics,AND&query=creator,contains,Wooldridge,AND&mode=advanced&offset=0) by Wooldridge

- [*Introduction to Econometrics*](https://catalogue-bibliotheque.sciencespo.fr/discovery/fulldisplay?docid=alma991003436429705808&context=L&vid=33USPC_SPO:SPO&lang=fr&search_scope=MyInst_and_CI&adaptor=Local%20Search%20Engine&isFrbr=true&tab=Everything&query=title,contains,Econometrics,AND&query=creator,contains,Stock%20and%20Watson,AND&mode=advanced&offset=0) by Stock and Watson

- [*Causal Inference: The Mixtape*](https://scunning.com/cunningham_mixtape.pdf) by Cunningham

- [Ben Lambert's YouTube channel](https://www.youtube.com/user/SpartacanUsuals)

## Metrics and `R`

- [ModernDive](https://moderndive.com/)

- [Introduction to Econometrics with R](https://www.econometrics-with-r.org/)

- [R for Data Science](https://r4ds.had.co.nz/)

---
layout: false
class: title-slide-section-red, middle

# R

---
layout: true

<div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div>

---

## What is `R`?

`R` is a __programming language__ with powerful statistical and graphic capabilities.

## Why are we using `R`?<sup>1</sup>

1. `R` is __free__ and __open source__, saving both you and the university 💰💵💰.

1. `R` is very __flexible and powerful__: adaptable to nearly any task, _e.g._, data cleaning, data visualization, econometrics, spatial data analysis, machine learning, web scraping, ...

1. `R` has a vibrant, thriving online community that will (almost) always have a solution to your problem. ([Stack Overflow](https://stackoverflow.com/questions/tagged/r))

1. If you put in the work,<sup>2</sup> you will come away with a __very valuable and marketable__ tool.

.footnote[
[1]: This list has been inspired by [Ed Rubin's](https://github.com/edrubin/EC421S19).

[2]: Learning `R` definitely requires time and effort but it's worth it, trust me! 💪
]

---

# Why can't we just use Excel?

Many reasons but here are just a few:

--

- Not reproducible.

--

- Not straightforward to merge datasets together.

--

- Very tedious to clean data.

--

- Limited to small datasets.

--

- Not designed for proper econometric analyses, maps, complex visualisations, etc.

---
layout: false
class: title-slide-section-red, middle

# R SHOWCASE

---
layout: true

<div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div>

---

# Showcase #1: Spatial Data

<!-- The shp file is missing-->

.pull-left[

* `R` is very strong with spatial data, in particular via the [`sf`](https://cran.r-project.org/web/packages/sf/index.html) package.

* We can represent _any_ shape or geometry.
* Maps are the most obvious example:

```r
library(sf)
library(tmap)
library(dplyr) # provides %>%, mutate(), select(), filter() and left_join()

iris_shfl <- read_sf("chapter1_files/figure-html/contours_iris_2015/CONTOURS-IRIS.shp") %>%
    mutate(dep = substr(INSEE_COM,1,2)) %>%
    select(CODE_IRIS, dep, geometry) %>%
    filter(dep == "75")

iris_income <- readRDS("../rds/iris_inc.rds") %>%
    mutate(CODE_IRIS = IRIS) %>%
    select(CODE_IRIS, DISP_MED15)

iris_map <- left_join(iris_shfl, iris_income, by = "CODE_IRIS")

tmap_mode("plot")
tm_shape(iris_map) +
    tm_borders() +
    tm_fill(col = "DISP_MED15", title = "Median household income (euros)")
```

]

--

.pull-right[

* Can be improved but you get this with only __15 lines of code!__

<img src="chapter1_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />

]

---

# Showcase #2: Interactive web applications

--

- Santé publique France has created a [simple web application](https://cpelat.shinyapps.io/mass/) to track the weekly epidemiological situation in French regions.

.center[![Santé publique France](chapter1_files/figure-html/sante_publique_france.png)]

---

# In Practice: Data Wrangling

* You will spend a lot of time preparing data for further analysis.

--

* The `gapminder` dataset contains data on life expectancy, GDP per capita and population by country between 1952 and 2007.

* Suppose we want to know the average life expectancy and average GDP per capita for each continent in each year.
* We need to group the data by continent *and* year, then compute the average life expectancy and average GDP per capita

.left-column[

```r
# load gapminder package
library(gapminder)
# load the dataset in object gapminder
gapminder = gapminder::gapminder
# display variables in the dataset
names(gapminder)
# show first 4 lines of this dataframe
head(gapminder,n = 4)
```

]

.right-column[

```
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
```

```
## # A tibble: 4 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
```

]

---

# In Practice: Data Wrangling

* There are always several ways to achieve a goal. (As in life 😄)

* Here we will only focus on the `dplyr` way:

```r
# You may want to install the dplyr package first:
# install.packages("dplyr")
library(dplyr)
# compute the required statistics
gapminder_dplyr <- gapminder %>%
  group_by(continent, year) %>%
  summarise(
    count = n(),
    mean_lifeexp = mean(lifeExp, na.rm = TRUE),
    mean_gdppercap = mean(gdpPercap, na.rm = TRUE)
  )
```

```r
# show first 4 lines of this dataframe
head(gapminder_dplyr, n = 4)
```

```
## # A tibble: 4 × 5
## # Groups:   continent [1]
##   continent  year count mean_lifeexp mean_gdppercap
##   <fct>     <int> <int>        <dbl>          <dbl>
## 1 Africa     1952    52         39.1          1253.
## 2 Africa     1957    52         41.3          1385.
## 3 Africa     1962    52         43.3          1598.
## 4 Africa     1967    52         45.3          2050.
```

---

# Visualisation

.pull-left[

* Now we could *look* at the result in `gapminder_dplyr`, or compute some statistics from it.
* Nothing beats a picture, though:

```r
# to install the ggplot2 package:
# install.packages("ggplot2")
library(ggplot2)
# create a plot
ggplot(data = gapminder_dplyr,
       mapping = aes(x = mean_lifeexp,
                     y = mean_gdppercap,
                     color = continent,
                     size = count)) +
  geom_point(alpha = 1/3) +
  labs(x = "Average life expectancy",
       y = "Average GDP per capita",
       color = "Continent",
       size = "Nb of countries") +
  theme_bw() # white background
```

]

.pull-right[

<img src="chapter1_files/figure-html/fig.h-1.svg" style="display: block; margin: auto;" />

]

---

# Animated Plotting 👌 <sup>1</sup>

.center[![Gapminder](chapter1_files/figure-html/ex_gganimate.gif)]

.footnote[
[1]: This animation is taken from [Ed Rubin](https://raw.githack.com/edrubin/EC421S19/master/LectureNotes/01Intro/01_intro.html#40).
]

---
layout: false
class: title-slide-section-red, middle

# R 101: Here Is Where You Start

---
layout: true

<div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div>

---
background-image: url("https://media.giphy.com/media/3oKIPqsXYcdjcBcXL2/giphy-downsized.gif")
background-position: 90% 50%
background-size: 300pt

# Tool Time!

.pull-left[

## Getting R and RStudio

* Download `R` from [CRAN](https://cran.r-project.org) for your OS.

* Download `RStudio` from [here](https://www.rstudio.com/products/rstudio/download/#download) for your OS.

]

---

# Start your `RStudio`!

## First Glossary of Terms

* `R`: a programming language.

* `RStudio`: an integrated development environment (IDE) to work with `R`.

--

* *command*: user input (text or numbers) that `R` *understands*.

* *script*: a list of commands collected in a text file, each separated by a new line, to be run one after the other.

--

* To run a script, you need to highlight the relevant code lines and hit `Ctrl`+`Enter` (Windows) or `Cmd`+`Enter` (Mac).
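
To make the idea of a *script* concrete, here is a minimal sketch of what one could look like. The object names `radius` and `area` are made up for illustration and are not part of the course material; any three commands on separate lines would do.

```r
# A tiny script: one command per line, run from top to bottom
radius <- 2             # store the value 2 in an object called radius
area <- pi * radius^2   # reuse that object to compute the area of a circle
print(area)             # display the result in the console
```

Highlighting all three lines and running them executes the commands in order. Running only the last line in a fresh session would fail, because `area` would not exist yet.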
---

# `RStudio` Layout

<img src="chapter1_files/figure-html/rstudio.png" width="600px" style="display: block; margin: auto;" />

---

# R as a Calculator

.pull-left[

* You can use the `R` console like a calculator.

* Just type an arithmetic operation after `>` and hit `Enter`!

]

--

.pull-right[

* Some basic arithmetic first:

```r
4 + 1
```

```
## [1] 5
```

```r
8 / 2
```

```
## [1] 4
```

* Great! What about this?

```r
log(exp(1))
```

```
## [1] 1
```

```r
# by the way: this is a comment! (R disregards it)
```

]

---
class: inverse

# Task 1 (5 minutes)

* Create a new R script (File `\(\rightarrow\)` New File `\(\rightarrow\)` R Script). Save it somewhere as `lecture_intro.R`.

* Write the following code in your script and run it (`Ctrl` or `Cmd` + `Enter`):

```r
4 * 8
```

* Write the following code in your script and run it. What happens if you only run the first line of the code?

```r
x = 5 # equivalently x <- 5
x
```

* Congratulations, you have created your first `R` "object"! Everything is an object in R! Objects are assigned using `=` or `<-`.

* Lastly, find the cube of `x` and assign that value to `x_3`.

---

# Where to get Help?

* `R` built-in `help`:

```r
?log
?sin
?paste
?lm
help(lm)   # help() is equivalent
??plot     # get all help on keyword "plot"
help(ggplot,package="ggplot2")  # show help from a certain package
```

* Help from Humans!

    * Google is your best friend!

    * [stackoverflow.com](https://stackoverflow.com)

    * Your classroom channel on Slack. If you haven't joined, click [here](https://join.slack.com/t/applieddata/shared_invite/zt-h8gsmw9y-xdzV46sY5T2ldgK74u~0Mw) to join.

    * [rstudio forum](https://community.rstudio.com)

---

# HOW to get Help?

Follow this for Slack questions!

1. Describe what you want to do.

1. Describe what you *expect* your code to do.

1. Describe what your code *does instead*.
    * Provide the entire error message.

1. Provide enough code to *reproduce* your error.
    * You can post code snippets on Slack and Stack Overflow.

---

# R Packages

* `R` users contribute add-on data and functions as *packages*.

* Installing packages is easy!

```r
install.packages("ggplot2")
```

* To *use* the contents of a package, we must load it from our library:

```r
library(ggplot2)
```

---

# `ScPoApps` package

* The ScPoEconometrics team wrote an `R` package for you, which is hosted on [GitHub](https://github.com/ScPoEcon/ScPoApps)

* You can install (and frequently update!) from here:

```r
if (!require(devtools)) {install.packages("devtools"); library(devtools)}
install_github(repo = "ScPoEcon/ScPoApps") # Let me know if you encounter an error
```

--

* Did it work?

```r
library(ScPoApps)
packageVersion("ScPoApps") # latest release is 0.2
```

```
## [1] '0.2'
```

* Check also that the package finds the tutorials:

```r
ScPoApps::runTutorial()
```

```
## Error: Please run `runTutorial()` with a valid tutorial as an argument.
## Valid tutorial are: 'airline-safety', 'chapter2', 'chapter2-script', 'correlation', 'lm-example', 'non-normal', 'rescaling', 'trade'
```

---

# Vectors

* What *is* a **vector**?

* The `c` function creates vectors.

```r
c(1, 3, 5, 7, 8, 9)
```

```
## [1] 1 3 5 7 8 9
```

* Coercion to a single type (all elements of a vector share one type):

```r
c(42, "Statistics", TRUE)
```

```
## [1] "42"         "Statistics" "TRUE"
```

* Creating a *range*

```r
c(y = 1:6)
```

```
## y1 y2 y3 y4 y5 y6
##  1  2  3  4  5  6
```

---

# `data.frame`s

`data.frame`s are like spreadsheets.

```r
example_data = data.frame(x = c(1, 3, 5, 7),
                          y = c(rep("Hello", 3), "Goodbye"),
                          z = sample(c(TRUE,FALSE),size=4,replace=TRUE))
example_data
```

```
##   x       y    z
## 1 1   Hello TRUE
## 2 3   Hello TRUE
## 3 5   Hello TRUE
## 4 7 Goodbye TRUE
```

In practice, you will be importing files that contain the data into `R` rather than creating `data.frame`s by hand.

---
class: inverse

# Task 2 (10 minutes)

* Find out (using Google) how to import a .csv file.
* Import [gun_murders.csv](https://www.dropbox.com/s/zuk0qcfm3kyzs4e/gun_murders.csv?dl=1)<sup>1</sup> into a new object `murders`. This file contains data on gun murders by US state in 2010. (Hint: objects are created using `=`)

* Ensure that `murders` is a data.frame by running:

```r
# Check class
class(murders)
```

* Find out what variables are contained in `murders` by running:

```r
# Obtain variable names
names(murders)
```

* View the contents of `murders` by clicking on `murders` in your workspace
    - What does the `total` variable correspond to?

.footnote[
[1]: This dataset is taken from the `dslabs` package.
]

---

# `data.frame`s

* Useful functions for a data.frame:

```r
str(murders) # describes the data.frame
```

```
## 'data.frame':	51 obs. of  5 variables:
##  $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ abb       : chr  "AL" "AK" "AZ" "AR" ...
##  $ region    : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
##  $ population: num  4779736 710231 6392017 2915918 37253956 ...
##  $ total     : num  135 19 232 93 1257 ...
```

```r
names(murders) # column names
```

```
## [1] "state"      "abb"        "region"     "population" "total"
```

```r
nrow(murders) # number of rows
```

```
## [1] 51
```

```r
ncol(murders) # number of columns
```

```
## [1] 5
```

---

# Data on Gun Murders in the US

* Let's ***View*** the data in a table format

```r
View(murders) # Open a data.frame/vector in an Excel style table (up to 50 columns)
```

* Let's look at the first rows of `murders`.

```r
head(murders, n = 3) # show first 3 rows
```

```
##     state abb region population total
## 1 Alabama  AL  South    4779736   135
## 2  Alaska  AK   West     710231    19
## 3 Arizona  AZ   West    6392017   232
```

* To access one of the variables **as a vector**, we use the `$` operator as in `murders$state`.
We can check the type of `murders$state` with

```r
class(murders$state) # type of the state variable in the murders data.frame
```

```
## [1] "character"
```

* Or we use the column name or index: `murders[,"state"]` or `murders[,1]`

---

# Subsetting `data.frame`s

* Subsetting a data.frame: `murders[row condition, column number]` or `murders[row condition, "column name"]`

```r
# Only keep states with over 500 gun murders and keep only the "state" and "total" variables
murders[murders$total > 500, c("state", "total")]
```

```
##         state total
## 5  California  1257
## 10    Florida   669
## 33   New York   517
## 44      Texas   805
```

* There is also a special function for subsetting data:

```r
subset(murders, subset = total > 500, select = c("state", "total"))
```

```
##         state total
## 5  California  1257
## 10    Florida   669
## 33   New York   517
## 44      Texas   805
```

---
class: inverse

# Task 3 (10 minutes)

1. How many observations are there in `murders`?

1. How many variables? What are the data types of each variable?

1. Notice that the colon operator `a:b` is just short for *construct a sequence from `a` to `b`*. Create a new object `murders_2` containing the rows 10 to 25 of `murders`.

1. Create a new object `murders_3` which only contains the columns `state` and `total`. (Recall that `c` creates vectors.)

1. What is the average value of `total`?

1. What is the average value of `total` for states in the "South", i.e. with `region == "South"`?

1. Create a `total_percap` variable (gun murders per 10,000 inhabitants):

```r
murders$total_percap = (murders$total / murders$population) * 10000
```

Congratulations, you've created your first variable!

---
class: title-slide-final, middle
background-image: url(../img/logo/ScPo-econ.png)
background-size: 250px
background-position: 9% 19%

# SEE YOU NEXT WEEK!

|                                                                                                            |                                   |
| :--------------------------------------------------------------------------------------------------------- | :-------------------------------- |
| <a href="mailto:michele.fioretti@sciencespo.fr">.ScPored[<i class="fa fa-paper-plane fa-fw"></i>] | michele.fioretti@sciencespo.fr |
| <a href="https://michelefioretti.github.io/ScPoEconometrics-Slides/">.ScPored[<i class="fa fa-link fa-fw"></i>] | Slides |
| <a href="https://michelefioretti.github.io/ScPoEconometrics/">.ScPored[<i class="fa fa-link fa-fw"></i>] | Book |
| <a href="http://twitter.com/ScPoEcon">.ScPored[<i class="fa fa-twitter fa-fw"></i>] | @ScPoEcon |
| <a href="http://github.com/ScPoEcon">.ScPored[<i class="fa fa-github fa-fw"></i>] | @ScPoEcon |