.title[
# Applied Data Analysis for Public Policy Studies
]
.subtitle[
## Introduction
]
.author[
### Michele Fioretti
]
.date[
### Sciences Po Paris <br> 2023-08-31
]

---

# Welcome to Applied Data Analysis for Public Policy Studies

- This course will teach you the core tools of ***econometrics***. The goal of the course is to provide you with *essential data analysis tools*.

- We will meet on ***Tuesdays at 10:15 AM***.

- You will meet with the TA, Eléonore Richard, to discuss the material and review problem sets. Her email address is ***eleonore.richard@sciencespo.fr***.

# Welcome to ScPoEconometrics!

- We will learn these tools through examples based on ***the programming language R***.

- This course builds on the (amazing) [open-source lectures](https://github.com/ScPoEcon/ScPoEconometrics) by the ScPo Econometrics team (Florian Oswald, Gustave Kenedi and Pierre Villedieu).

- A [book overview](https://michelefioretti.github.io/ScPoEconometrics/index.html) of the material is also available.

---

# What is *econometrics*?

- A set of techniques and methods to answer questions with data.

- Econometrics shares many things with *applied statistics* and *machine learning*.

- Some examples!

---

# Answering Important Questions with Econometrics

[<ru-blockquote>
Does immigration *cause* lower wages and higher unemployment for locals?
</ru-blockquote>](http://davidcard.berkeley.edu/papers/mariel-impact.pdf)

[<ru-blockquote>
Does increasing the minimum wage *cause* greater unemployment?
</ru-blockquote>](http://davidcard.berkeley.edu/papers/njmin-aer.pdf)

[<ru-blockquote>
Does more education *cause* higher wages?
</ru-blockquote>](http://davidcard.berkeley.edu/papers/causal_educ_earnings.pdf)

[<ru-blockquote>
Do higher public debt levels *cause* lower economic growth?
</ru-blockquote>](https://www.imf.org/external/pubs/ft/wp/2014/wp1434.pdf)

[<ru-blockquote>
Does birth order *cause* differing education trajectories?
</ru-blockquote>](https://academic.oup.com/qje/article/120/2/669/1933962)

---

background-image: url("https://media.giphy.com/media/3o7TKMMlq8TItpr7Co/giphy.gif")
background-position: 80% 80%
background-size: 300pt

# Causality

.pull-left[
* Notice the keyword **cause** in all of the above.

* Notice also that *many other factors could have caused* each of those outcomes.

* Econometrics is often about spelling out conditions under which we can claim to measure causal relationships.

* We will encounter the most basic of those conditions, and talk about some potential pitfalls.
]

.pull-right[
As in the acclaimed [Book of Why](https://www.amazon.fr/Book-Why-Science-Cause-Effect/dp/046509760X), we often ask *why* something happened.
]

---

# This Course

- Teach you the basics of ***linear regression***, ***statistical inference*** and ***impact evaluation***.

- Equip you with a framework to think more deeply about causality.

- Introduce you to the `R` software environment.

- ⚠️ This is *not* a course about `R`.

## Grading

1. There will be ***periodic quizzes*** on Moodle roughly every two weeks => 20%

1. There will be ***a problem set / case study*** => 30%

1. There will be ***a take-home exam*** => 50%

---

# Course Materials

1. The [Book](https://michelefioretti.github.io/ScPoEconometrics/)

1. The [Slides](https://michelefioretti.github.io/ScPoEconometrics-Slides/)

1. The code repository for the [R package](https://github.com/ScPoEcon/ScPoEconometrics)

1. Quizzes on [Moodle](https://moodle.sciences-po.fr)

---

# Syllabus 🤞

.pull-left[
Lecture 2/3: **Summarising Data**

*Quiz 1 (after lecture 2)*

Lecture 4: **Simple Linear Regression**

*Quiz 2*

Lecture 5: **Introduction to Causality**

*Midterm Project*

Lecture 6: **Multiple Linear Regression**
]

.pull-right[
Lecture 8/9: **Statistical Inference**

Lecture 10: **Differences-in-Differences**

Lecture 11: **Regression Discontinuity**

*Quiz 3*

Lecture 12: ***Recap***

*Final Project*
]

---

# Useful Resources (Other Than our [_Book_](https://scpoecon.github.io/ScPoEconometrics/))

## Econometrics
- [*Mastering Metrics*](https://catalogue-bibliotheque.sciencespo.fr/discovery/fulldisplay?docid=alma991003947949705808&context=L&vid=33USPC_SPO:SPO&lang=fr&search_scope=MyInst_and_CI&adaptor=Local%20Search%20Engine&tab=Everything&query=title,contains,Mastering%20Metrics,AND&query=creator,contains,Angrist%20and%20Pischke,AND&mode=advanced&offset=0) by Angrist and Pischke
- [*Introductory Econometrics: A Modern Approach*](https://catalogue-bibliotheque.sciencespo.fr/discovery/fulldisplay?docid=alma991003442689705808&context=L&vid=33USPC_SPO:SPO&lang=fr&search_scope=MyInst_and_CI&adaptor=Local%20Search%20Engine&isFrbr=true&tab=Everything&query=title,contains,Econometrics,AND&query=creator,contains,Wooldridge,AND&mode=advanced&offset=0) by Wooldridge
- [*Introduction to Econometrics*](https://catalogue-bibliotheque.sciencespo.fr/discovery/fulldisplay?docid=alma991003436429705808&context=L&vid=33USPC_SPO:SPO&lang=fr&search_scope=MyInst_and_CI&adaptor=Local%20Search%20Engine&isFrbr=true&tab=Everything&query=title,contains,Econometrics,AND&query=creator,contains,Stock%20and%20Watson,AND&mode=advanced&offset=0) by Stock and Watson
- [*Causal Inference: The Mixtape*](https://scunning.com/cunningham_mixtape.pdf) by Cunningham
- [Ben Lambert's YouTube channel](https://www.youtube.com/user/SpartacanUsuals)

## Metrics and `R`
- [ModernDive](https://moderndive.com/)
- [Introduction to Econometrics with R](https://www.econometrics-with-r.org/)
- [R for Data Science](https://r4ds.had.co.nz/)

---

# R

---

## What is `R`?

`R` is a __programming language__ with powerful statistical and graphic capabilities.

## Why are we using `R`?<sup>1</sup>

1. `R` is __free__ and __open source__—saving both you and the university 💰💵💰.

1. `R` is very __flexible and powerful__—adaptable to nearly any task, _e.g._, data cleaning, data visualization, econometrics, spatial data analysis, machine learning, web scraping, ...

1. `R` has a vibrant, thriving online community that will (almost) always have a solution to your problem (see [Stack Overflow](https://stackoverflow.com/questions/tagged/r)).

1. If you put in the work,<sup>2</sup> you will come away with a __very valuable and marketable__ tool.

.footnote[
[1]: This list was inspired by [Ed Rubin](https://github.com/edrubin/EC421S19).  
[2]: Learning `R` definitely requires time and effort but it's worth it, trust me! 💪
]

---

# Why can't we just use Excel?

Many reasons but here are just a few:

- Not reproducible.

- Not straightforward to merge datasets together.

- Very tedious to clean data.

- Limited to small datasets.

- Not designed for proper econometric analyses, maps, complex visualisations, etc.

---

# R SHOWCASE

---

# Showcase #1: Spatial Data

.pull-left[
* `R` is very strong with spatial data, in particular via the [`sf`](https://cran.r-project.org/web/packages/sf/index.html) package.

* We can represent _any_ shape or geometry.

* Maps are the most obvious example:

```r
library(sf)
library(tmap)
library(dplyr) # provides %>%, mutate(), select(), filter() and left_join()
iris_shfl <- read_sf("chapter1_files/figure-html/contours_iris_2015/CONTOURS-IRIS.shp") %>%
  mutate(dep = substr(INSEE_COM, 1, 2)) %>%
  select(CODE_IRIS, dep, geometry) %>%
  filter(dep == "75")
iris_income <- readRDS("../rds/iris_inc.rds") %>%
  mutate(CODE_IRIS = IRIS) %>%
  select(CODE_IRIS, DISP_MED15)
iris_map <- left_join(iris_shfl, iris_income, by = "CODE_IRIS")
tmap_mode("plot")
tm_shape(iris_map) +
  tm_borders() +
  tm_fill(col = "DISP_MED15", title = "Median household income (euros)")
```
]

.pull-right[
* This can be improved, but you get it with only __15 lines of code!__

<img src="chapter1_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />
]

---

# Showcase #2: Interactive web applications

- Santé publique France has created a [simple web application](https://cpelat.shinyapps.io/mass/) to track the weekly epidemiological situation in French regions.

---

# In Practice: Data Wrangling

* You will spend a lot of time preparing data for further analysis.

* The `gapminder` dataset contains data on life expectancy, GDP per capita and population by country between 1952 and 2007.

* Suppose we want to know the average life expectancy and average GDP per capita for each continent in each year.

* We need to group the data by continent *and* year, then compute the average life expectancy and average GDP per capita.

.pull-left[
```r
# load the gapminder package
library(gapminder)
# store the dataset in the object gapminder
gapminder = gapminder::gapminder
# display the variables in the dataset
names(gapminder)
# show the first 4 rows of this data frame
head(gapminder, n = 4)
```
]

.pull-right[
```
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
```

```
## # A tibble: 4 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
```
]

---

# In Practice: Data Wrangling

* There are always several ways to achieve a goal. (As in life 😄)

* Here we will only focus on the `dplyr` way:

```r
# You may want to install the dplyr package first: install.packages("dplyr")
library(dplyr) 
# compute the required statistics
gapminder_dplyr <- gapminder %>% 
  group_by(continent, year) %>% 
  summarise(
    count = n(),
    mean_lifeexp = mean(lifeExp, na.rm = TRUE),
    mean_gdppercap = mean(gdpPercap, na.rm = TRUE)
  )
```

```r
  # show first 4 lines of this dataframe
head(gapminder_dplyr, n = 4)
```

```
## # A tibble: 4 × 5
## # Groups:   continent [1]
##   continent  year count mean_lifeexp mean_gdppercap
##   <fct>     <int> <int>        <dbl>          <dbl>
## 1 Africa     1952    52         39.1          1253.
## 2 Africa     1957    52         41.3          1385.
## 3 Africa     1962    52         43.3          1598.
## 4 Africa     1967    52         45.3          2050.
```
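
As noted above, there are several ways to reach the same result. As a minimal sketch of one alternative, base `R`'s `aggregate()` can compute group means without any extra package. The `toy` data frame and its values are made up for illustration and are not the `gapminder` data.

```r
# Hypothetical toy data: two continents, two years (values are made up)
toy <- data.frame(
  continent = c("A", "A", "B", "B", "A", "B"),
  year      = c(1952, 1952, 1952, 1952, 1957, 1957),
  lifeExp   = c(40, 50, 60, 70, 55, 75)
)

# Mean of lifeExp for every continent-year combination
toy_means <- aggregate(lifeExp ~ continent + year, data = toy, FUN = mean)
toy_means
```

The formula `lifeExp ~ continent + year` plays the role of `group_by(continent, year)` followed by `summarise()`.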

---

# Visualisation

.pull-left[
* Now we could *look* at the result in `gapminder_dplyr`, or compute some statistics from it.

* Nothing beats a picture, though:

```r
  # to install the ggplot2 package: install.packages("ggplot2")
library(ggplot2)
  # create a plot
ggplot(data = gapminder_dplyr, 
       mapping = aes(x = mean_lifeexp,
                     y = mean_gdppercap,
                     color = continent,
                     size = count)) +
  geom_point(alpha = 1/3) +
  labs(x = "Average life expectancy",
       y = "Average GDP per capita",
       color = "Continent",
       size = "Nb of countries") +
  theme_bw() # white background
```
]

.pull-right[
<img src="chapter1_files/figure-html/fig.h-1.svg" style="display: block; margin: auto;" />
]

---

# Animated Plotting 👌 <sup>1</sup>

.footnote[
[1]: This animation is taken from [Ed Rubin](https://raw.githack.com/edrubin/EC421S19/master/LectureNotes/01Intro/01_intro.html#40).
]

---

# R 101: Here Is Where You Start

---

background-image: url("https://media.giphy.com/media/3oKIPqsXYcdjcBcXL2/giphy-downsized.gif")
background-position: 90% 50%
background-size: 300pt

# Tool Time!

.pull-left[
* Download `R` from [CRAN](https://cran.r-project.org) for your OS.

* Download `RStudio` from [here](https://www.rstudio.com/products/rstudio/download/#download) for your OS.
]

---

# Start your `RStudio`!

## First Glossary of Terms

* `R`: a programming language.

* `RStudio`: an integrated development environment (IDE) to work with `R`.

* *command*: user input (text or numbers) that `R` *understands*.

* *script*: a list of commands collected in a text file, each separated by a new line, to be run one after the other.

* To run a script, you need to highlight the relevant code lines and hit `Ctrl`+`Enter` (Windows) or `Cmd`+`Enter` (Mac).

---

# `RStudio` Layout

---

# R as a Calculator

.pull-left[
* You can use the `R` console like a calculator.

* Just type an arithmetic operation after `>` and hit `Enter`!

]

.pull-right[
* Some basic arithmetic first:
    
    ```r
    4 + 1
    ```
    
    ```
    ## [1] 5
    ```
    
    ```r
    8 / 2
    ```
    
    ```
    ## [1] 4
    ```

* Great! What about this?
    
    ```r
    log(exp(1))
    ```
    
    ```
    ## [1] 1
    ```
    
    ```r
    # by the way: this is a comment! (R disregards it)
    ```
]
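
A few more operators you can try in the console, shown here as a small sketch. The numbers are arbitrary.

```r
2^10       # exponentiation: 1024
17 %/% 5   # integer division: 3
17 %% 5    # remainder (modulo): 2
sqrt(16)   # square root: 4
```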
---

# Task 1 (5 minutes)

* Create a new R script (File `$\rightarrow$` New File `$\rightarrow$` R Script). Save it somewhere as `lecture_intro.R`.

* Write in your script and run the following code: (`Ctrl` or `Cmd` + `Enter`)
    
    ```r
    4 * 8
    ```

* Write in your script and run the following code. What happens if you only run the first line of the code?
    
    ```r
    x = 5 # equivalently x <- 5
    x
    ```

* Congratulations, you have created your first `R` "object"! Everything is an object in R! Objects are assigned using `=` or `<-`.

* Lastly, find the cube of `x` and assign that value to `x_3`.

---

# Where to get Help?

* `R` built-in `help`:
    
    ```r
    ?log
    ?sin
    ?paste
    ?lm
    help(lm)   # help() is equivalent
    ??plot  # get all help on keyword "plot"
    help(ggplot,package="ggplot2")  # show help from a certain package
    ```

* Help from Humans!
  * Google is your best friend!
  * [stackoverflow.com](https://stackoverflow.com)
  * Your classroom channel on Slack. If you haven't joined yet, click [here](https://join.slack.com/t/applieddata/shared_invite/zt-h8gsmw9y-xdzV46sY5T2ldgK74u~0Mw).
  * [rstudio forum](https://community.rstudio.com)
  
---

# HOW to get Help? Follow this for Slack questions!

1. Describe what you want to do.

1. Describe what you *expect* your code to do.

1. Describe what your code *does instead*.
  * Provide the entire error message.
  
1. Provide enough code to *reproduce* your error.
  * You can post code snippets on Slack and Stack Overflow.
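
One possible way to make a question reproducible, sketched under the assumption that your problem involves a small data frame: `dput()` prints `R` code that rebuilds an object, so others can recreate your data by copy-pasting. The `toy` data frame is a hypothetical stand-in for your own data.

```r
# Hypothetical data frame standing in for your real data
toy <- data.frame(x = c(1, 3, 5), y = c("a", "b", "c"))

# dput() prints R code that recreates the object;
# paste that output into your question so others can rebuild your data
dput(toy)

# R version and loaded packages, often useful alongside an error message
sessionInfo()
```

For large datasets, `dput(head(toy))` keeps the snippet short.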

---

# R Packages

* `R` users contribute add-on data and functions as *packages*

* Installing packages is easy!
    
    ```r
    install.packages("ggplot2")
    ```

* To *use* the contents of a package, we must load it from our library:
    
    ```r
    library(ggplot2)
    ```
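
Installing is only needed once per computer, while `library()` is needed in every new session. A minimal sketch of checking whether a package is already installed before installing it; `is_installed()` is a hypothetical helper written for this example, not part of `R`.

```r
# Hypothetical helper: TRUE if a package is already installed
# (requireNamespace() also loads the package's namespace as a side effect)
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

is_installed("stats")   # "stats" ships with every R installation

# Install only when needed, then load (uncomment to run; needs internet access)
# if (!is_installed("ggplot2")) install.packages("ggplot2")
# library(ggplot2)
```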

---

# `ScPoApps` package

* The ScPoEconometrics team wrote an `R` package for you, which is hosted on [GitHub](https://github.com/ScPoEcon/ScPoApps)

* You can install (and frequently update!) from here:
    
    ```r
    if (!require(devtools)) {install.packages("devtools"); library(devtools)}
    install_github(repo = "ScPoEcon/ScPoApps") # Let me know if you encounter an error
    ```

* Did it work?
    
    ```r
    library(ScPoApps)
    packageVersion("ScPoApps") # latest release is 0.2
    ```
    
    ```
    ## [1] '0.2'
    ```

* Check also if the package finds the tutorials
    
    ```r
    ScPoApps::runTutorial() 
    ```
    
    ```
    ## Error: Please run `runTutorial()` with a valid tutorial as an argument.
    ## Valid tutorial are: 'airline-safety', 'chapter2', 'chapter2-script', 'correlation', 'lm-example', 'non-normal', 'rescaling', 'trade'
    ```

---

# Vectors

* What *is* a **vector**?

* The `c` function creates vectors.
    
    ```r
    c(1, 3, 5, 7, 8, 9)
    ```
    
    ```
    ## [1] 1 3 5 7 8 9
    ```

* Coercion to a single type (all elements of a vector share one type):
    
    ```r
    c(42, "Statistics", TRUE)
    ```
    
    ```
    ## [1] "42"         "Statistics" "TRUE"
    ```

* Creating a *range*
    
    ```r
    c(y = 1:6)
    ```
    
    ```
    ## y1 y2 y3 y4 y5 y6 
    ##  1  2  3  4  5  6
    ```
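
A short sketch of how coercion and element-wise operations behave, using `class()` to inspect the resulting type. The vector `x` is just an illustrative name.

```r
class(c(1, 3, 5))            # "numeric"
class(c(1, TRUE))            # "numeric": TRUE becomes 1
class(c(42, "Statistics"))   # "character": 42 becomes "42"

x <- c(1, 3, 5)
x * 2        # arithmetic is applied element by element
length(x)    # number of elements
x[2]         # second element (indexing starts at 1)
```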

---

# `data.frame`s

`data.frame`s are like spreadsheets.

```r
example_data = data.frame(x = c(1, 3, 5, 7),
                          y = c(rep("Hello", 3), "Goodbye"),
                          z = sample(c(TRUE,FALSE),size=4,replace=TRUE))
example_data
```

```
##   x       y    z
## 1 1   Hello TRUE
## 2 3   Hello TRUE
## 3 5   Hello TRUE
## 4 7 Goodbye TRUE
```

In practice, you will be importing files that contain the data into `R` rather than creating `data.frame`s by hand.
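
Because `sample()` draws at random, the `z` column above can differ from one run to the next. A minimal sketch, assuming you want the same draw every time: call `set.seed()` first. The seed value `123` is arbitrary.

```r
set.seed(123)  # arbitrary seed, chosen only for illustration
draw_1 <- sample(c(TRUE, FALSE), size = 4, replace = TRUE)

set.seed(123)  # resetting the same seed reproduces the same draw
draw_2 <- sample(c(TRUE, FALSE), size = 4, replace = TRUE)

identical(draw_1, draw_2)
```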

---

# Task 2 (10 minutes)

* Find out (using Google) how to import a .csv file.

* Import [gun_murders.csv](https://www.dropbox.com/s/zuk0qcfm3kyzs4e/gun_murders.csv?dl=1)<sup>1</sup> in a new object `murders`. This file contains data on gun murders by US state in 2010. (Hint: objects are created using `=`)

* Ensure that `murders` is a data.frame by running:
    
    ```r
    # Check class
    class(murders)
    ```

* Find out what variables are contained in `murders` by running:
    
    ```r
    # Obtain variable names
    names(murders)
    ```

* View the contents of `murders` by clicking on `murders` in your workspace
  - What does the `total` variable correspond to?

---

# `data.frame`s

* Useful methods for a dataframe:

```r
str(murders) # describes the data.frame
```

```
## 'data.frame':	51 obs. of  5 variables:
##  $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ abb       : chr  "AL" "AK" "AZ" "AR" ...
##  $ region    : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
##  $ population: num  4779736 710231 6392017 2915918 37253956 ...
##  $ total     : num  135 19 232 93 1257 ...
```

```r
names(murders) # column names
```

```
## [1] "state"      "abb"        "region"     "population" "total"
```

```r
nrow(murders) # number of rows
```

```
## [1] 51
```

```r
ncol(murders) # number of columns
```

```
## [1] 5
```

---

# Data on Gun Murders in the US

* Let's ***View*** the data in a table format
    
    ```r
    View(murders) # Open a data.frame/vector in an Excel style table (up to 50 columns)
    ```

* Let's look at the first rows of `murders`.
    
    ```r
    head(murders, n = 3)  # show first 3 rows
    ```
    
    ```
    ##     state abb region population total
    ## 1 Alabama  AL  South    4779736   135
    ## 2  Alaska  AK   West     710231    19
    ## 3 Arizona  AZ   West    6392017   232
    ```
    
* To access one of the variables **as a vector**, we use the `$` operator as in `murders$state`. We can check the type of `murders$state` with
    
    ```r
    class(murders$state)  # type of the state variable in the murders data.frame
    ```
    
    ```
    ## [1] "character"
    ```

* Or we use the column name or index: `murders[,"state"]` or `murders[,1]`
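
The different ways of extracting a column return the same vector. A small sketch on a made-up data frame (`toy` and its values are hypothetical, not the `murders` data):

```r
# Hypothetical toy data frame
toy <- data.frame(state = c("A", "B", "C"), total = c(10, 20, 30))

by_dollar <- toy$total        # $ operator
by_name   <- toy[, "total"]   # column name
by_index  <- toy[, 2]         # column position
by_double <- toy[["total"]]   # double brackets also return a vector

identical(by_dollar, by_name)
identical(by_dollar, by_index)
identical(by_dollar, by_double)
```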

---

# Subsetting `data.frame`s

* Subsetting a data.frame: `murders[row condition, column number]` or `murders[row condition, "column name"]`
    
    ```r
    # Only keep states with over 500 gun murders and keep only the "state" and "total" variables
    murders[murders$total > 500, c("state", "total")]
    ```
    
    ```
    ##         state total
    ## 5  California  1257
    ## 10    Florida   669
    ## 33   New York   517
    ## 44      Texas   805
    ```
    
* There is also a special function for subsetting data:
    
    ```r
    subset(murders, subset = total > 500, select = c("state", "total"))
    ```
    
    ```
    ##         state total
    ## 5  California  1257
    ## 10    Florida   669
    ## 33   New York   517
    ## 44      Texas   805
    ```
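
Row conditions can be combined with `&` (and) and `|` (or). A minimal sketch on a made-up data frame; `toy` and its values are hypothetical.

```r
# Hypothetical toy data frame
toy <- data.frame(
  state  = c("A", "B", "C", "D"),
  region = c("South", "West", "South", "West"),
  total  = c(100, 600, 700, 50)
)

# Rows where BOTH conditions hold
both <- toy[toy$region == "South" & toy$total > 500, ]
both

# Rows where AT LEAST ONE condition holds, keeping two columns
either <- toy[toy$region == "West" | toy$total > 500, c("state", "total")]
either
```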

---

# Task 3 (10 minutes)

1. How many observations are there in `murders`?

1. How many variables? What are the data types of each variable?

1. Notice that the colon operator `a:b` is just short for *construct a sequence from `a` to `b`*. Create a new object `murders_2` containing the rows 10 to 25 of `murders`.

1. Create a new object `murders_3` which only contains the columns `state` and `total`. (Recall that `c` creates vectors.)

1. What is the average value of `total`?

1. What is the average value of `total` for states in the "South", i.e. with `region == "South"`?

1. Create a `total_percap` variable (gun murders per 10,000 inhabitants):
    
    ```r
    murders$total_percap = (murders$total / murders$population) * 10000
    ```

Congratulations, you've created your first variable!

---

class: title-slide-final, middle
background-image: url(../img/logo/ScPo-econ.png)
background-size: 250px
background-position: 9% 19%

# SEE YOU NEXT WEEK!

|                                                                                                            |                                   |
| :--------------------------------------------------------------------------------------------------------- | :-------------------------------- |
| <a href="mailto:michele.fioretti@sciencespo.fr">.ScPored[<i class="fa fa-paper-plane fa-fw"></i>]</a>               | michele.fioretti@sciencespo.fr       |
| <a href="https://michelefioretti.github.io/ScPoEconometrics-Slides/">.ScPored[<i class="fa fa-link fa-fw"></i>]</a> | Slides |
| <a href="https://michelefioretti.github.io/ScPoEconometrics/">.ScPored[<i class="fa fa-link fa-fw"></i>]</a> | Book |
| <a href="http://twitter.com/ScPoEcon">.ScPored[<i class="fa fa-twitter fa-fw"></i>]</a>                          | @ScPoEcon                         |
| <a href="http://github.com/ScPoEcon">.ScPored[<i class="fa fa-github fa-fw"></i>]</a>                          | @ScPoEcon                       |