Applied Data Analysis for Public Policy Studies

# Applied Data Analysis for Public Policy Studies
## Summarising, Visualizing and Tidying Data
### Michele Fioretti
### SciencesPo Paris </br> 2022-08-29

---

---

# Recap from last week

* ***Causality*** plays a central role in modern econometrics!

* Using R is ***very*** valuable

* Basic data wrangling:
  - `View`, `str`, `names`, `nrow`, `ncol`
  - *subsetting:* `murders[row condition, "column name"]`
  - *variable creation:* `murders$total_percap = (murders$total / murders$population) * 10000`

## Today

* Deeper dive into data wrangling with `R`:
  - __summarizing__ data,
  - __visualisation__ data,
  - __tidying__ data

---
  
# Working With Data

* Econometrics is about `data`.

* According a to [2014 NYTimes article](https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html), "data scientists [...] spend from ***50 percent to 80 percent of their time*** mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."

* In the next two lectures you will learn the __basics__ of summarizing, visualising and tidying data

---

# The `gapminder` dataset: Overview

* Let's first load a dataset with these commands:
    
    ```r
    library(dslabs)
    gapminder <- dslabs::gapminder
    ```

* Here are the first 3 rows
    
    ```r
    head(gapminder, n = 3)
    ```
    
    ```
    ##   country year infant_mortality life_expectancy fertility population
    ## 1 Albania 1960            115.4           62.87      6.19    1636054
    ## 2 Algeria 1960            148.2           47.50      7.65   11124892
    ## 3  Angola 1960            208.0           35.98      7.32    5270844
    ##           gdp continent          region
    ## 1          NA    Europe Southern Europe
    ## 2 13828152297    Africa Northern Africa
    ## 3          NA    Africa   Middle Africa
    ```

---
# The `gapminder` dataset: Overview

* What variables does this dataset contain?
    
    ```r
    names(gapminder)
    ```
    
    ```
    ## [1] "country"          "year"             "infant_mortality" "life_expectancy" 
    ## [5] "fertility"        "population"       "gdp"              "continent"       
    ## [9] "region"
    ```

* `tail` gives you the last (6) rows.
    
    ```r
    tail(gapminder)
    ```

---

# The `gapminder` dataset: Datatypes

* It's important to know how the data is stored.

* We can use `str` for that:

```r
str(gapminder)
```

```
## 'data.frame':	10545 obs. of  9 variables:
##  $ country         : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
##  $ infant_mortality: num  115.4 148.2 208 NA 59.9 ...
##  $ life_expectancy : num  62.9 47.5 36 63 65.4 ...
##  $ fertility       : num  6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
##  $ population      : num  1636054 11124892 5270844 54681 20619075 ...
##  $ gdp             : num  NA 1.38e+10 NA NA 1.08e+11 ...
##  $ continent       : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
##  $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...
```

---

# Task 1 (7 minutes)

* Create a new variable called `gdppercap` corresponding to `gdp` divided by `population`

* Which countries had a 2011 GDP per capita greater than 30.000?

* Filter the dataset to only keep the year 2015: `gapminder_2015`

* How many countries have an infant mortality in 2015 greater than 90 (per 1000)?

* What is the average life expectancy in Africa in 2015?

---

# Summarizing

---

---

# Summarizing Data

* One can learn only a limited amount from **looking** at a `data.frame`. 🔍

* Even if you *could* see all rows of the dataset, you would not know very much **about it**.

* We need to **summarize** the data for us to learn from it.

* In general, we can compute summary statistics, or visualize the data with plots.

* Let's start with some statistics first!

* Let's look at two features: *central tendency* and *spread*.

---

# Central Tendency

.pull-left[
1. `mean(x)`: the average of all values in `x`.
`$$\bar{x} = \frac{1}{N}\sum_{i=1}^N x_i$$`

```r
x <- c(1,2,2,2,2,100)
mean(x)
```

```
## [1] 18.16667
```

```r
mean(x) == sum(x) / length(x)
```

```
## [1] TRUE
```

***Your turn:*** What's the mean of `infant_mortality` in 1960? Read the help for `mean` to remove `NA`s.
]

.pull-right[
1. `median`: the value `$x_j$` below and above which 50% of the values in `x` lie. `$m$` is the median if
    `$$\Pr(X \leq m) \geq 0.5 \text{ and } \Pr(X \geq m) \geq 0.5$$`
    
1. The median is robust against *outliers*. 🤔? (later).

```r
median(x)
```

```
## [1] 2
```

***Your turn:*** What's the median of `infant_mortality` in 1960?
]

---

# Missing Values: `NA`

.pull-left[
* Whenever a value is *missing*, we code it as `NA`.
    
    ```r
    x <- NA
    ```
* `R` propagates `NA` through operations:
    
    ```r
    NA > 5
    ```
    
    ```
    ## [1] NA
    ```
    
    ```r
    NA + 10
    ```
    
    ```
    ## [1] NA
    ```
* the function `is.na(x)` returns `TRUE` if `x` is an `NA`.
    
    ```r
    is.na(x)
    ```
    
    ```
    ## [1] TRUE
    ```
]

.pull-right[
* What is confusing is that 
    
    ```r
    NA == NA
    ```
    
    ```
    ## [1] NA
    ```

* It's easy to illustrate like that:
    
    ```r
    # Let x be Mary's age. We don't know how old she is.
    x <- NA
    
    # Let y be John's age. We don't know how old he is.
    y <- NA
    
    # Are John and Mary the same age?
    x == y
    ```
    
    ```
    ## [1] NA
    ```
    
    ```r
    #> [1] NA
    # We don't know!
    ```
]

---

# Spread

.pull-left[
* Another interesting feature is how much a variable is *spread out* about it's center (the mean in this case).

* The *variance* is such a measure.
    `$$Var(X) = \frac{1}{N} \sum_{i=1}^N(x_i-\bar{x})^2$$`
    
* Consider two `normal distributions` with equal mean at `0`:
]

.pull-right[
<img src="chapter2_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" />

* Compute with:

```r
    var(x)
    range(x)   # range: max value - min value
```
]
---

# Example: the Weight of Women and Men<sup>1</sup>

.pull-left[
* Men weight 65 kg and women 55 kg on average. The variance is 25.
    
    ```r
    # Generate a random dataset
    set.seed(1234) # `set.seed` allows replicating random data
    df <- data.frame(
        sex=factor(rep(c("F", "M"), each=200)),
        weight=round(c(rnorm(200, mean=55, sd=5), # ?rnorm | Note: sd = sqrt(variance)
                       rnorm(200, mean=65, sd=5))))
    ```

* Plot the overall density
    
    ```r
    ggplot(df, aes(x=weight)) + geom_density() + theme_bw() #white background
    ```
    
* Plot separated densities
    
    ```r
    # Change density plot line colors by groups
    ggplot(df, aes(x=weight, color=sex)) + geom_density() + theme_bw() # White background
    ```
]

.pull-right[
<img src="chapter2_files/figure-html/unnamed-chunk-21-1.svg" style="display: block; margin: auto;" />

.footnote[
[1]: This example is taken from [sthda.com](http://www.sthda.com/english/wiki/ggplot2-density-plot-quick-start-guide-r-software-and-data-visualization).  
]

]

---

# The `table` function

* `table(x)` is a useful function that counts the occurence of each unique value in `x`:
    
    ```r
    table(gapminder$continent)
    ```
    
    ```
    ## 
    ##   Africa Americas     Asia   Europe  Oceania 
    ##     2907     2052     2679     2223      684
    ```
    
    ```r
    table(gapminder$region)
    ```
    
    ```
    ## 
    ## Australia and New Zealand                 Caribbean           Central America 
    ##                       114                       741                       456 
    ##              Central Asia            Eastern Africa              Eastern Asia 
    ##                       285                       912                       342 
    ##            Eastern Europe                 Melanesia                Micronesia 
    ##                       570                       285                       114 
    ##             Middle Africa           Northern Africa          Northern America 
    ##                       456                       342                       171 
    ##           Northern Europe                 Polynesia             South America 
    ##                       570                       171                       684 
    ##        South-Eastern Asia           Southern Africa             Southern Asia 
    ##                       570                       285                       456 
    ##           Southern Europe            Western Africa              Western Asia 
    ##                       684                       912                      1026 
    ##            Western Europe 
    ##                       399
    ```

---

# Crosstables

* Given two vectors, `table` produces a contingency table:
    
    ```r
    gapminder_2015 <- subset(gapminder, year == 2015)
    gapminder_2015$fertility_above_2 = (gapminder_2015$fertility > 2.1) # dummy variable for fertility rate above replacement level fertility
    table(gapminder_2015$fertility_above_2,gapminder_2015$continent)
    ```
    
    ```
    ##        
    ##         Africa Americas Asia Europe Oceania
    ##   FALSE      2       15   20     39       4
    ##   TRUE      49       20   27      0       8
    ```

* With `prop.table`, we can get proportions:
    
    ```r
    # proportions by row
    prop.table(table(gapminder_2015$fertility_above_2,gapminder_2015$continent), margin = 1)
    # proportions by column
    prop.table(table(gapminder_2015$fertility_above_2,gapminder_2015$continent), margin = 2) 
    ```

* ⚠️ To obtain `table`s with `NA`s, use the `useNA = "always"` or `useNA = "ifany"`
---

# Plotting

---
layout: true

---

# Plotting

* There is an extremely powerful alternative in package `ggplot2`. We'll see both.

* First example: *histograms*. A histogram counts how many obserations fall within a certain bin.
]

```r
gapminder_2015 <- gapminder[gapminder$year == 2015,]
hist(gapminder_2015$life_expectancy)
```

]

---

# A Nicer Histogram

* Look at `?hist` for more.

```r
hist(gapminder_2015$life_expectancy, 
     xlab   = "Life Expectancy", 
     main   = "Histogram of life expectancy in 2015", 
     breaks = seq(from = 40, to = 90, by = 5),
     las = 1, # label orientation
     col  = "#d90502",
     border = "white")
```

]

.pull-right[
<img src="chapter2_files/figure-html/unnamed-chunk-29-1.svg" style="display: block; margin: auto;" />
]

---

# Looking for Outliers: Boxplots

* An *outlier* is a datapoint far removed from the center of a distribution.

* Boxplots are an effective way to visualise the distribution of a variable.

* The *box* typically denotes the __interquartile range__ (observations between 25th pctile and 75th pctile).

* The *thick line* corresponds to the __median__.

* The *dots* are __outliers__ (⚠️ no universally accepted definition).

---

# Looking for Outliers: Boxplots

---

# Looking for Outliers: Boxplots

```r
boxplot(life_expectancy ~ continent,
    data = gapminder_2015,
    xlab   = "Continent",
    ylab   = "Life expectancy in 2015",
    main   = "Life expectancy by continent in 2015",
    pch = 20, cex = 2, # colour and size of outliers
    col ="#d90502",border = "black", las = 1)
```

* see `?boxplot` for more options

***Your turn:*** How does the same plot look like for `fertility`? For `infant_mortality`?
]

.pull-right[
<img src="chapter2_files/figure-html/unnamed-chunk-32-1.svg" style="display: block; margin: auto;" />
]

---

# Scatter Plots

* Two variables `$x$` and `$y$`

* Natural to ask: How often do certain pairs of `$(x_i,y_i)$` occur?
    
    ```r
    head(gapminder_2015[,c("fertility","infant_mortality")])
    ```
    
    ```
    ##       fertility infant_mortality
    ## 10176      1.78             12.5
    ## 10177      2.71             21.9
    ## 10178      5.65             96.0
    ## 10179      2.06              5.8
    ## 10180      2.15             11.1
    ## 10181      1.41             12.6
    ```
* That's what a scatter plots shows.

---

# Scatter Plots

```r
plot(fertility ~ infant_mortality,
    data = gapminder_2015,
    xlab = "Infant mortality",
    ylab = "Fertility rate",
    xlim = c(0,200), # range of x axis
    ylim = c(0,8), # range of y axis
    main = "Relationship between fertility and infant mortality in 2015",
    col = "#d90502",
    las = 1)
```

<img src="chapter2_files/figure-html/unnamed-chunk-34-1.svg" style="display: block; margin: auto;" />
]

* We often call it one *observation*.

* Corresponding to one *row* of the `data.frame`.

* Why do some dots appear *darker* than others here?

***Your turn:*** How does the same scatter plot looks like for 1960?
]

---

background-image: url("../img/logo/ggplot2.svg")
background-position: 90% 5%
background-size: 180px

# Quick `ggplot2` Intro

* Great intro to `ggplot2` [here](https://pkg.garrickadenbuie.com/gentle-ggplot2).

* Based on *The __G__rammar of __G__raphics* (hence __gg__plot).

* More powerful than base `R` plotting

* Let's reproduce the previous graphs in ggplot
]

.pull-right[
<br>
<br>
<img src="chapter2_files/figure-html/ggplot_grammar_graphics.png" width="500px" style="display: block; margin: auto;" />

###### source: [BloggoType](http://bloggotype.blogspot.com/2016/08/holiday-notes2-grammar-of-graphics.html)
]

---

# `ggplot2`: Basic Histogram

```r
library(ggplot2)

ggplot(gapminder_2015,
       aes(x = life_expectancy)) +
  geom_histogram()
```
]

.pull-right[
<img src="chapter2_files/figure-html/unnamed-chunk-37-1.svg" style="display: block; margin: auto;" />
]

---

# `ggplot2`: Fancy Histogram

```r
library(ggplot2)

ggplot(gapminder_2015,
       aes(x = life_expectancy)) +
  geom_histogram(binwidth = 5,
                 boundary = 0, # try with 1: what does it do?,
                 colour = "white",
                 fill = "#d90502") +
  labs(x = "Life Expectancy",
       y = "Frequency",
       title = "Histogram of life expectancy in 2015") +
  theme_bw(base_size = 16) # white background + base font size adjusted to 16 pts
```
]

.pull-right[
<img src="chapter2_files/figure-html/unnamed-chunk-39-1.svg" style="display: block; margin: auto;" />
]

---

# `ggplot2`: Fancy Histogram with `facet_grid()`

```r
library(ggplot2)

ggplot(gapminder_2015,
       aes(x = life_expectancy)) +
  geom_histogram(binwidth = 5,
                 boundary = 0, 
                 colour = "white",
                 fill = "#d90502") +
  labs(x = "Life Expectancy",
       y = "Frequency",
       title = "Histogram of life expectancy in 2015") +
  theme_bw(base_size = 16) +
  facet_grid(rows = vars(continent))
```
]

.pull-right[
<img src="chapter2_files/figure-html/unnamed-chunk-41-1.svg" style="display: block; margin: auto;" />
]

---

# `ggplot2`: Boxplots

```r
ggplot(gapminder_2015,
       aes(x = continent, y = life_expectancy)) +
  geom_boxplot(colour = "black",
               fill = "#d90502") +
  labs(x = "Continent",
       y = "Life expectancy in 2015", 
       title = "Life expectancy by continent in 2015") +
  theme_bw(base_size = 20)
```
]

.pull-right[
<img src="chapter2_files/figure-html/unnamed-chunk-43-1.svg" style="display: block; margin: auto;" />
]

---

# `ggplot2`: Scatter Plots

```r
ggplot(gapminder_2015,
       aes(x = infant_mortality,
           y = fertility)) +
  geom_point(size = 3,
             alpha = 0.5,
             colour = "#d90502") +
  expand_limits(x = 0, y = 0) +
  labs(x = "Infant mortality",
       y = "Fertility rate",
       title = "Relationship between fertility and infant mortality in 2015") +
  theme_bw(base_size = 16)
```
]

.pull-right[
<img src="chapter2_files/figure-html/unnamed-chunk-45-1.svg" style="display: block; margin: auto;" />
]

---

# It's Tutorial Time!

---

# Tutorial 1 (10 minutes)

Time for our first tutorial!!

Type this into your `RStudio` console:

```r
library(ScPoApps)
runTutorial('chapter2')
```

If you have trouble with the interactive doc, try this version (no interactive content):

```r
runTutorial('chapter2-script')
```

---

# How are x and y related? Covariance and Correlation
* [This](https://michelefioretti.github.io/ScPoEconometrics/sum.html#summarize-two) is the relevant section in the book about Covariance.

.pull-left[
<img src="chapter2_files/figure-html/x-y-corr-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
* The covariance is a measure of __joint variability__ of two variables.
    `$$Cov(x,y) = \frac{1}{N} \sum_{i=1}^N(x_i-\bar{x})(y_i-\bar{y})$$`

```
## [1] 24.21146
```

* The correlation is a measure of the strenght of the __linear association__ between two variables.
    `$$Cor(x,y) = \frac{Cov(x,y)}{\sqrt(Var(x))\sqrt(Var(y))}$$`

```
## [1] 0.8286402
```
]

---

# Correlation App

```r
library(ScPoApps)
runTutorial('correlation')
```

---

# Wrangling

---
layout: true

---

# Intro to `dplyr`

.pull-left[
<br>
<br>
<br>
* [`dplyr`](https://dplyr.tidyverse.org) is part of the [tidyverse](https://www.tidyverse.org) package family.

* [`data.table`](https://github.com/Rdatatable/data.table/wiki) is an alternative. Very fast but a bit more difficult.

* Both have pros and cons. We'll start you off with `dplyr`. 
]

![:scale 35%](../img/logo/r-datatable.svg)
]

---

# `dplyr` Overview

.pull-left[
<br>
<br>
* You *must* read through [Hadley Wickham's chapter](https://r4ds.had.co.nz/transform.html). It's concise.

* The package is organized around a set of **verbs**, i.e. *actions* to be taken.

* We operate on `data.frames` or `tibbles` (*nicer looking* data.frames.)

* All *verbs*: First argument is a data.frame, subsequent arguments describe what to do, returns another data.frame.

]

## Verbs

1. `filter()`: Choose observations based on a certain value (i.e. subset)

1. `arrange()`: Reorder rows

1. `select()`: Select variables by name

1. `mutate()`: Create new variables out of existing ones

1. `group_by()` and `summarise()`: Summarise variables
]

---

# Data on 2016 US election polls from the `dslabs` package

* This dataset contains __real__ data on polls made during the 2016 US Presidential elections and compiled by [fivethirtyeight](fivethirtyeight.com)

```r
library(dslabs, dplyr)
data(polls_us_election_2016) # this data is from fivethirtyeight.com
polls_us_election_2016 <- dplyr::as_tibble(polls_us_election_2016)
head(polls_us_election_2016[,1:6], n = 6) # show first 6 lines of first 6 variables
```

```
## # A tibble: 6 x 6
##   state startdate  enddate    pollster                          grade samplesize
##   <fct> <date>     <date>     <fct>                             <fct>      <int>
## 1 U.S.  2016-11-03 2016-11-06 ABC News/Washington Post          A+          2220
## 2 U.S.  2016-11-01 2016-11-07 Google Consumer Surveys           B          26574
## 3 U.S.  2016-11-02 2016-11-06 Ipsos                             A-          2195
## 4 U.S.  2016-11-04 2016-11-07 YouGov                            B           3677
## 5 U.S.  2016-11-03 2016-11-06 Gravis Marketing                  B-         16639
## 6 U.S.  2016-11-03 2016-11-06 Fox News/Anderson Robbins Resear… A           1295
```

🚨 This is a `tibble` (more informative than a  `data.frame`)

What variables does this dataset contain?

1. `?polls_us_election_2016`
1. [FiveThirtyEight’s Pollster Ratings website](https://projects.fivethirtyeight.com/pollster-ratings/)

---

# `filter()`: subset a data.frame

* `filter` has the same purpose as `subset`

* Example: Which A graded poll with at least 2,000 people had Trump win at least 45% of the vote?
    
    ```r
    filter(polls_us_election_2016,
           grade == "A" & samplesize > 2000 & rawpoll_trump > 45)
    ```
--
    
    ```
    ## # A tibble: 1 x 15
    ##   state startdate  enddate    pollster grade samplesize population
    ##   <fct> <date>     <date>     <fct>    <fct>      <int> <chr>     
    ## 1 Indi… 2016-04-26 2016-04-28 Marist … A           2149 rv        
    ## # … with 8 more variables: rawpoll_clinton <dbl>, rawpoll_trump <dbl>,
    ## #   rawpoll_johnson <dbl>, rawpoll_mcmullin <dbl>, adjpoll_clinton <dbl>,
    ## #   adjpoll_trump <dbl>, adjpoll_johnson <dbl>, adjpoll_mcmullin <dbl>
    ```
   
---

# Create a Filter: Comparisons and Logical Operators

.pull-left[
* We have a standard suite of comparison operators:
  - `>`: greater than,
  - `<`: smaller than,
  - `>=`: greater than or equal to,
  - `<=`: smaller than or equal to,
  - `!=`: not equal to,
  - `==`: equal to.

* Construct more complex filters with logical operators
    1. `x & y`: `x` **and** `y`
    1. `x | y`: `x` **or** `y`
    1. `!y`: **not** `y`
]

.pull-right[
* `R` has the convenient `x %in% y` operator (conversely `!(x %in% y)`), `TRUE` if `x` is *a member of* `y`.
    
    ```r
    3 %in% 1:3
    ```
    
    ```
    ## [1] TRUE
    ```
    
    ```r
    c(2,5) %in% 2:10  # also vectorized
    ```
    
    ```
    ## [1] TRUE TRUE
    ```
    
    ```r
    c("S","Po") %in% c("Sciences","Po")  # also strings
    ```
    
    ```
    ## [1] FALSE  TRUE
    ```
]

---

# `mutate()`: create new variables

* *Example*: What was
    1. the combined vote share of Trump and Clinton for each poll?
    2. the difference between Trump's raw poll vote share and 538's adjusted vote share?

```r
mutate(polls_us_election_2016,
       trump_clinton_tot = rawpoll_trump + rawpoll_clinton,
       trump_raw_adj_diff = rawpoll_trump - adjpoll_trump)
```

# `select()`: only keep some variables

* *Example*: Only keep the variables `state,startdate,enddate,pollster,rawpoll_clinton,rawpoll_trump`

```r
select(polls_us_election_2016,
       state,startdate,enddate,pollster,rawpoll_clinton,rawpoll_trump)
```

---

# Task 2 (10 minutes)

1. Which polls had more vote intentions for Trump than for Clinton.

1. How many polls have a missing `grade`?

1. Which polls were (i) polled by American Strategies, GfK Group or Merrill Poll, _and_ (ii) had a sample size greater than 1,000, _and_ (iii) started on October 20th, 2016?

*For the following questions you should use `filter` and `mutate`.*

1. Which polls (i) did not have missing poll data for Johnson, (ii) had a combined raw poll vote share for Trump and Clinton greater than 95% _and_ (iii) had a sample size greater than 1,000.?

1. Which polls (i) did not poll for vote intentions for Johnson, (ii) had a difference in raw poll vote shares between Trump and Clinton greater than 5, and (iii) were done in the state of Iowa?

---

# Split-Apply-Combine

.pull-left[
* Often we do *some* operation **by** some group in our dataset:
    * Mean vote share for Clinton by pollster grade.
    * Maximum vote share for Trump by poll month, etc

* For this, we need to 
    1. Split the data **by** group
    2. Apply to each group the operation
    3. Recombine all groups into one table
    
* In `dplyr`, this is achieved with `group_by()` and `summarise`.
]

.pull-right[
1. `group_by(polls_us_election_2016, grade)` groups/splits `polls_us_election_2016` by pollster `grade`:
    
    ```r
    polls_grade = group_by(polls_us_election_2016, grade)
    ```

1. `summarise` each chunk and re-combine
    
    ```r
    summarise(polls_grade, mean_vote_clinton = mean(rawpoll_clinton))
    ```
    
    ```
    ## # A tibble: 11 x 2
    ##    grade mean_vote_clinton
    ##    <fct>             <dbl>
    ##  1 D                  46.7
    ##  2 C-                 43.2
    ##  3 C                  41.8
    ##  4 C+                 44.2
    ##  5 B-                 43.9
    ##  6 B                  37.3
    ##  7 B+                 44.1
    ##  8 A-                 43.0
    ##  9 A                  45.3
    ## 10 A+                 45.8
    ## 11 NA                 43.2
    ```
]

---

background-image: url("../img/logo/magrittr.svg")
background-position: 90% 5%
background-size: 180px

# Chaining 🔗 Commands Together: The Pipe

* The `magrittr` package gives us the *pipe* `%>%`.

* `x %>% f(y)` becomes `f(x,y)`.

* With the *pipe* you construct data *pipelines*.

```r
polls_us_election_2016 %>%
  group_by(grade) %>% 
  summarise(
    mean_vote_clinton = mean(rawpoll_clinton)
    )
```
which is equivalent to, but nicer than:

```r
summarise(
  group_by(polls_us_election_2016, grade),
  mean_vote_clinton = mean(rawpoll_clinton))
```
]

```r
polls_us_election_2016 %>%
  mutate(trump_clinton_diff = rawpoll_trump-rawpoll_clinton) %>%
  filter(trump_clinton_diff>5 &
        state == "Iowa" &
        is.na(rawpoll_johnson)) %>%
  select(pollster)
```

```
## # A tibble: 3 x 1
##   pollster
##   <fct>   
## 1 Ipsos   
## 2 Ipsos   
## 3 Ipsos
```
]

---
class: title-slide-final, middle
background-image: url(../img/logo/ScPo-econ.png)
background-size: 250px
background-position: 9% 19%

# SEE YOU IN TWO WEEKS!

|                                                                                                            |                                   |
| :--------------------------------------------------------------------------------------------------------- | :-------------------------------- |
| <a href="mailto:michele.fioretti@sciencespo.fr">.ScPored[<i class="fa fa-paper-plane fa-fw"></i>]               | michele.fioretti@sciencespo.fr       |
| <a href="https://michelefioretti.github.io/ScPoEconometrics-Slides/">.ScPored[<i class="fa fa-link fa-fw"></i>] | Slides |
| <a href="https://michelefioretti.github.io/ScPoEconometrics/">.ScPored[<i class="fa fa-link fa-fw"></i>] | Book |
| <a href="http://twitter.com/ScPoEcon">.ScPored[<i class="fa fa-twitter fa-fw"></i>]                          | @ScPoEcon                         |
| <a href="http://github.com/ScPoEcon">.ScPored[<i class="fa fa-github fa-fw"></i>]                          | @ScPoEcon                       |