class: center, middle, inverse, title-slide

# Applied Data Analysis for Public Policy Studies
## Summarising, Visualizing and Tidying Data
### Michele Fioretti
### SciencesPo Paris 2022-08-29

---
layout: true

<div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div>

---

# Recap from last week

* ***Causality*** plays a central role in modern econometrics!

* Using R is ***very*** valuable

* Basic data wrangling:

    - `View`, `str`, `names`, `nrow`, `ncol`
    - *subsetting:* `murders[row condition, "column name"]`
    - *variable creation:* `murders$total_percap = (murders$total / murders$population) * 10000`

--

## Today

* Deeper dive into data wrangling with `R`:
    - __summarizing__ data,
    - __visualising__ data,
    - __tidying__ data

---

# Working With Data

* Econometrics is about `data`.

<img src="chapter2_files/figure-html/data_science_pipeline.png" width="400px" style="display: block; margin: auto;" />

--

* According to a [2014 NYTimes article](https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html), "data scientists [...] spend from ***50 percent to 80 percent of their time*** mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."
* In the next two lectures you will learn the __basics__ of summarizing, visualising and tidying data

---

# The `gapminder` dataset: Overview

* Let's first load a dataset with these commands:

```r
library(dslabs)
gapminder <- dslabs::gapminder
```

* Here are the first 3 rows

```r
head(gapminder, n = 3)
```

```
##   country year infant_mortality life_expectancy fertility population
## 1 Albania 1960            115.4           62.87      6.19    1636054
## 2 Algeria 1960            148.2           47.50      7.65   11124892
## 3  Angola 1960            208.0           35.98      7.32    5270844
##           gdp continent          region
## 1          NA    Europe Southern Europe
## 2 13828152297    Africa Northern Africa
## 3          NA    Africa   Middle Africa
```

---

# The `gapminder` dataset: Overview

* What variables does this dataset contain?

```r
names(gapminder)
```

```
## [1] "country"          "year"             "infant_mortality" "life_expectancy" 
## [5] "fertility"        "population"       "gdp"              "continent"       
## [9] "region"
```

* `tail` gives you the last (6) rows.

```r
tail(gapminder)
```

---

# The `gapminder` dataset: Datatypes

* It's important to know how the data is stored.

* We can use `str` for that:

```r
str(gapminder)
```

```
## 'data.frame':	10545 obs. of  9 variables:
##  $ country         : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
##  $ infant_mortality: num  115.4 148.2 208 NA 59.9 ...
##  $ life_expectancy : num  62.9 47.5 36 63 65.4 ...
##  $ fertility       : num  6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
##  $ population      : num  1636054 11124892 5270844 54681 20619075 ...
##  $ gdp             : num  NA 1.38e+10 NA NA 1.08e+11 ...
##  $ continent       : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
##  $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...
```

---
class: inverse

# Task 1 (7 minutes)

* Create a new variable called `gdppercap` corresponding to `gdp` divided by `population`

* Which countries had a 2011 GDP per capita greater than 30,000?
* Filter the dataset to only keep the year 2015: `gapminder_2015`

* How many countries have an infant mortality in 2015 greater than 90 (per 1000)?

* What is the average life expectancy in Africa in 2015?

---
layout: false
class: title-slide-section-red, middle

# Summarizing

---
layout: true

<div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div>

---

# Summarizing Data

* One can learn only a limited amount from **looking** at a `data.frame`. 🔍

* Even if you *could* see all rows of the dataset, you would not know very much **about it**.

--

* We need to **summarize** the data for us to learn from it.

* In general, we can compute summary statistics, or visualize the data with plots.

--

* Let's start with some statistics first!

* Let's look at two features: *central tendency* and *spread*.

---

# Central Tendency

.pull-left[

1. `mean(x)`: the average of all values in `x`.
`$$\bar{x} = \frac{1}{N}\sum_{i=1}^N x_i$$`

```r
x <- c(1,2,2,2,2,100)
mean(x)
```

```
## [1] 18.16667
```

```r
mean(x) == sum(x) / length(x)
```

```
## [1] TRUE
```

***Your turn:*** What's the mean of `infant_mortality` in 1960? Read the help for `mean` to remove `NA`s.

]

--

.pull-right[

1. `median`: the value `\(x_j\)` below and above which 50% of the values in `x` lie. `\(m\)` is the median if
`$$\Pr(X \leq m) \geq 0.5 \text{ and } \Pr(X \geq m) \geq 0.5$$`

1. The median is robust against *outliers*. 🤔? (later).

```r
median(x)
```

```
## [1] 2
```

***Your turn:*** What's the median of `infant_mortality` in 1960?

]

---

# Missing Values: `NA`

.pull-left[

* Whenever a value is *missing*, we code it as `NA`.

```r
x <- NA
```

* `R` propagates `NA` through operations:

```r
NA > 5
```

```
## [1] NA
```

```r
NA + 10
```

```
## [1] NA
```

* the function `is.na(x)` returns `TRUE` if `x` is an `NA`.
```r
is.na(x)
```

```
## [1] TRUE
```

]

--

.pull-right[

* What is confusing is that

```r
NA == NA
```

```
## [1] NA
```

* It's easy to illustrate like that:

```r
# Let x be Mary's age. We don't know how old she is.
x <- NA

# Let y be John's age. We don't know how old he is.
y <- NA

# Are John and Mary the same age?
x == y
```

```
## [1] NA
```

```r
#> [1] NA
# We don't know!
```

]

---

# Spread

.pull-left[

* Another interesting feature is how much a variable is *spread out* about its center (the mean in this case).

* The *variance* is such a measure.
`$$Var(X) = \frac{1}{N} \sum_{i=1}^N(x_i-\bar{x})^2$$`

* Consider two `normal distributions` with equal mean at `0`:

]

--

.pull-right[

<img src="chapter2_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" />

* Compute with:

```r
var(x)   # note: R's `var` divides by N - 1 rather than N
range(x) # range: returns the min and the max value
```

]

---

# Example: the Weight of Women and Men<sup>1</sup>

.pull-left[

* Men weigh 65 kg and women 55 kg on average. The variance is 25.

```r
# Generate a random dataset
set.seed(1234) # `set.seed` allows replicating random data
df <- data.frame(
  sex=factor(rep(c("F", "M"), each=200)),
  weight=round(c(rnorm(200, mean=55, sd=5), # ?rnorm | Note: sd = sqrt(variance)
                 rnorm(200, mean=65, sd=5))))
```

* Plot the overall density

```r
library(ggplot2)
ggplot(df, aes(x=weight)) +
  geom_density() +
  theme_bw() # white background
```

* Plot separated densities

```r
# Change density plot line colors by groups
ggplot(df, aes(x=weight, color=sex)) +
  geom_density() +
  theme_bw() # white background
```

]

--

.pull-right[

<img src="chapter2_files/figure-html/unnamed-chunk-21-1.svg" style="display: block; margin: auto;" />

<img src="chapter2_files/figure-html/unnamed-chunk-22-1.svg" style="display: block; margin: auto;" />

.footnote[
[1]: This example is taken from [sthda.com](http://www.sthda.com/english/wiki/ggplot2-density-plot-quick-start-guide-r-software-and-data-visualization).
]
]

---

# The `table` function

* `table(x)` is a useful function that counts the occurrences of each unique value in `x`:

```r
table(gapminder$continent)
```

```
## 
##   Africa Americas     Asia   Europe  Oceania 
##     2907     2052     2679     2223      684
```

```r
table(gapminder$region)
```

```
## 
## Australia and New Zealand                 Caribbean           Central America 
##                       114                       741                       456 
##              Central Asia            Eastern Africa              Eastern Asia 
##                       285                       912                       342 
##            Eastern Europe                 Melanesia                Micronesia 
##                       570                       285                       114 
##             Middle Africa           Northern Africa          Northern America 
##                       456                       342                       171 
##           Northern Europe                 Polynesia             South America 
##                       570                       171                       684 
##        South-Eastern Asia           Southern Africa             Southern Asia 
##                       570                       285                       456 
##           Southern Europe            Western Africa              Western Asia 
##                       684                       912                      1026 
##            Western Europe 
##                       399
```

---

# Crosstables

* Given two vectors, `table` produces a contingency table:

```r
gapminder_2015 <- subset(gapminder, year == 2015)
gapminder_2015$fertility_above_2 = (gapminder_2015$fertility > 2.1) # dummy variable for fertility rate above replacement level fertility
table(gapminder_2015$fertility_above_2,gapminder_2015$continent)
```

```
##        
##         Africa Americas Asia Europe Oceania
##   FALSE      2       15   20     39       4
##   TRUE      49       20   27      0       8
```

--

* With `prop.table`, we can get proportions:

```r
# proportions by row
prop.table(table(gapminder_2015$fertility_above_2,gapminder_2015$continent), margin = 1)
# proportions by column
prop.table(table(gapminder_2015$fertility_above_2,gapminder_2015$continent), margin = 2)
```

* ⚠️ To include `NA`s in a `table`, set the argument `useNA = "always"` or `useNA = "ifany"`

---
layout: false
class: title-slide-section-red, middle

# Plotting

---
layout: true

<div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div>

---

# Plotting

.pull-left[

* `R` base plotting is fairly good.

* There is an extremely powerful alternative in package `ggplot2`. We'll see both.

* First example: *histograms*. A histogram counts how many observations fall within a certain bin.
]

--

.pull-right[

```r
gapminder_2015 <- gapminder[gapminder$year == 2015,]
hist(gapminder_2015$life_expectancy)
```

<img src="chapter2_files/figure-html/unnamed-chunk-26-1.svg" style="display: block; margin: auto;" />

]

---

# A Nicer Histogram

.pull-left[

* We can give additional arguments to `hist`.

* Look at `?hist` for more.

```r
hist(gapminder_2015$life_expectancy,
     xlab = "Life Expectancy",
     main = "Histogram of life expectancy in 2015",
     breaks = seq(from = 40, to = 90, by = 5),
     las = 1, # label orientation
     col = "#d90502",
     border = "white")
```

]

.pull-right[

<img src="chapter2_files/figure-html/unnamed-chunk-29-1.svg" style="display: block; margin: auto;" />

]

---

# Looking for Outliers: Boxplots

* An *outlier* is a datapoint far removed from the center of a distribution.

* Boxplots are an effective way to visualise the distribution of a variable.

* The *box* typically denotes the __interquartile range__ (observations between the 25th and the 75th percentile).

* The *thick line* corresponds to the __median__.

* The *dots* are __outliers__ (⚠️ no universally accepted definition).

---

# Looking for Outliers: Boxplots

<img src="chapter2_files/figure-html/boxplot_explanation.png" width="850px" style="display: block; margin: auto;" />

---

# Looking for Outliers: Boxplots

.pull-left[

```r
boxplot(life_expectancy ~ continent,
        data = gapminder_2015,
        xlab = "Continent",
        ylab = "Life expectancy in 2015",
        main = "Life expectancy by continent in 2015",
        pch = 20, cex = 2, # shape and size of outlier points
        col = "#d90502", border = "black",
        las = 1)
```

* see `?boxplot` for more options

***Your turn:*** What does the same plot look like for `fertility`? For `infant_mortality`?

]

.pull-right[

<img src="chapter2_files/figure-html/unnamed-chunk-32-1.svg" style="display: block; margin: auto;" />

]

---

# Scatter Plots

* Two variables `\(x\)` and `\(y\)`

--

* Natural to ask: How often do certain pairs of `\((x_i,y_i)\)` occur?
```r
head(gapminder_2015[,c("fertility","infant_mortality")])
```

```
##       fertility infant_mortality
## 10176      1.78             12.5
## 10177      2.71             21.9
## 10178      5.65             96.0
## 10179      2.06              5.8
## 10180      2.15             11.1
## 10181      1.41             12.6
```

* That's what a scatter plot shows.

---

# Scatter Plots

.pull-left[

```r
plot(fertility ~ infant_mortality,
     data = gapminder_2015,
     xlab = "Infant mortality",
     ylab = "Fertility rate",
     xlim = c(0,200), # range of x axis
     ylim = c(0,8), # range of y axis
     main = "Relationship between fertility and infant mortality in 2015",
     col = "#d90502",
     las = 1)
```

<img src="chapter2_files/figure-html/unnamed-chunk-34-1.svg" style="display: block; margin: auto;" />

]

.pull-right[

<br>

* Each dot is one pair `\((x_i,y_i)\)`.

* We often call it one *observation*.

* Corresponding to one *row* of the `data.frame`.

* Why do some dots appear *darker* than others here?

***Your turn:*** What does the same scatter plot look like for 1960?

]

---
background-image: url("../img/logo/ggplot2.svg")
background-position: 90% 5%
background-size: 180px

# Quick `ggplot2` Intro

.pull-left[

* Excellent cheatsheet on [project website](https://ggplot2.tidyverse.org).

* Great intro to `ggplot2` [here](https://pkg.garrickadenbuie.com/gentle-ggplot2).

* Based on *The __G__rammar of __G__raphics* (hence __gg__plot).
* More powerful than base `R` plotting

* Let's reproduce the previous graphs in ggplot

]

.pull-right[

<br>
<br>

<img src="chapter2_files/figure-html/ggplot_grammar_graphics.png" width="500px" style="display: block; margin: auto;" />

###### source: [BloggoType](http://bloggotype.blogspot.com/2016/08/holiday-notes2-grammar-of-graphics.html)

]

---

# `ggplot2`: Basic Histogram

.pull-left[

```r
library(ggplot2)
ggplot(gapminder_2015, aes(x = life_expectancy)) +
  geom_histogram()
```

]

.pull-right[

<img src="chapter2_files/figure-html/unnamed-chunk-37-1.svg" style="display: block; margin: auto;" />

]

---

# `ggplot2`: Fancy Histogram

.pull-left[

```r
library(ggplot2)
ggplot(gapminder_2015, aes(x = life_expectancy)) +
  geom_histogram(binwidth = 5,
                 boundary = 0, # try with 1: what does it do?
                 colour = "white",
                 fill = "#d90502") +
  labs(x = "Life Expectancy",
       y = "Frequency",
       title = "Histogram of life expectancy in 2015") +
  theme_bw(base_size = 16) # white background + base font size adjusted to 16 pts
```

]

.pull-right[

<img src="chapter2_files/figure-html/unnamed-chunk-39-1.svg" style="display: block; margin: auto;" />

]

---

# `ggplot2`: Fancy Histogram with `facet_grid()`

.pull-left[

```r
library(ggplot2)
ggplot(gapminder_2015, aes(x = life_expectancy)) +
  geom_histogram(binwidth = 5,
                 boundary = 0,
                 colour = "white",
                 fill = "#d90502") +
  labs(x = "Life Expectancy",
       y = "Frequency",
       title = "Histogram of life expectancy in 2015") +
  theme_bw(base_size = 16) +
  facet_grid(rows = vars(continent))
```

]

.pull-right[

<img src="chapter2_files/figure-html/unnamed-chunk-41-1.svg" style="display: block; margin: auto;" />

]

---

# `ggplot2`: Boxplots

.pull-left[

```r
ggplot(gapminder_2015, aes(x = continent, y = life_expectancy)) +
  geom_boxplot(colour = "black", fill = "#d90502") +
  labs(x = "Continent",
       y = "Life expectancy in 2015",
       title = "Life expectancy by continent in 2015") +
  theme_bw(base_size = 20)
```

]

.pull-right[

<img src="chapter2_files/figure-html/unnamed-chunk-43-1.svg"
style="display: block; margin: auto;" />

]

---

# `ggplot2`: Scatter Plots

.pull-left[

```r
ggplot(gapminder_2015, aes(x = infant_mortality, y = fertility)) +
  geom_point(size = 3, alpha = 0.5, colour = "#d90502") +
  expand_limits(x = 0, y = 0) +
  labs(x = "Infant mortality",
       y = "Fertility rate",
       title = "Relationship between fertility and infant mortality in 2015") +
  theme_bw(base_size = 16)
```

]

.pull-right[

<img src="chapter2_files/figure-html/unnamed-chunk-45-1.svg" style="display: block; margin: auto;" />

]

---
class: inverse, middle

# It's Tutorial Time!

---

# Tutorial 1 (10 minutes)

Time for our first tutorial!! Type this into your `RStudio` console:

```r
library(ScPoApps)
runTutorial('chapter2')
```

If you have trouble with the interactive doc, try this version (no interactive content):

```r
runTutorial('chapter2-script')
```

---

# How are x and y related? Covariance and Correlation

* [This](https://michelefioretti.github.io/ScPoEconometrics/sum.html#summarize-two) is the relevant section in the book about Covariance.

.pull-left[

<img src="chapter2_files/figure-html/x-y-corr-1.svg" style="display: block; margin: auto;" />

]

--

.pull-right[

* The covariance is a measure of __joint variability__ of two variables.
`$$Cov(x,y) = \frac{1}{N} \sum_{i=1}^N(x_i-\bar{x})(y_i-\bar{y})$$`

```
## [1] 24.21146
```

* The correlation is a measure of the strength of the __linear association__ between two variables.
`$$Cor(x,y) = \frac{Cov(x,y)}{\sqrt{Var(x)}\sqrt{Var(y)}}$$`

```
## [1] 0.8286402
```

]

---
class: inverse

# Correlation App

```r
library(ScPoApps)
runTutorial('correlation')
```

---
layout: false
class: title-slide-section-red, middle

# Wrangling

---
layout: true

<div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div>

---

# Intro to `dplyr`

.pull-left[

<br>
<br>
<br>

* [`dplyr`](https://dplyr.tidyverse.org) is part of the [tidyverse](https://www.tidyverse.org) package family.
* [`data.table`](https://github.com/Rdatatable/data.table/wiki) is an alternative. Very fast but a bit more difficult.

* Both have pros and cons. We'll start you off with `dplyr`.

]

.pull-right[

]

---

# `dplyr` Overview

.pull-left[

<br>
<br>

* You *must* read through [Hadley Wickham's chapter](https://r4ds.had.co.nz/transform.html). It's concise.

* The package is organized around a set of **verbs**, i.e. *actions* to be taken.

* We operate on `data.frames` or `tibbles` (*nicer looking* data.frames.)

* All *verbs*: First argument is a data.frame, subsequent arguments describe what to do, returns another data.frame.

]

--

.pull-right[

## Verbs

1. `filter()`: Choose observations based on a certain value (i.e. subset)

1. `arrange()`: Reorder rows

1. `select()`: Select variables by name

1. `mutate()`: Create new variables out of existing ones

1. `group_by()` and `summarise()`: Summarise variables

]

---

# Data on 2016 US election polls from the `dslabs` package

* This dataset contains __real__ data on polls conducted during the 2016 US presidential election and compiled by [fivethirtyeight](https://fivethirtyeight.com)

```r
library(dslabs)
library(dplyr) # `library()` attaches one package per call
data(polls_us_election_2016) # this data is from fivethirtyeight.com
polls_us_election_2016 <- dplyr::as_tibble(polls_us_election_2016)
head(polls_us_election_2016[,1:6], n = 6) # show first 6 lines of first 6 variables
```

```
## # A tibble: 6 x 6
##   state startdate  enddate    pollster                          grade samplesize
##   <fct> <date>     <date>     <fct>                             <fct>      <int>
## 1 U.S.  2016-11-03 2016-11-06 ABC News/Washington Post          A+          2220
## 2 U.S.  2016-11-01 2016-11-07 Google Consumer Surveys           B          26574
## 3 U.S.  2016-11-02 2016-11-06 Ipsos                             A-          2195
## 4 U.S.  2016-11-04 2016-11-07 YouGov                            B           3677
## 5 U.S.  2016-11-03 2016-11-06 Gravis Marketing                  B-         16639
## 6 U.S.  2016-11-03 2016-11-06 Fox News/Anderson Robbins Resear… A           1295
```

🚨 This is a `tibble` (more informative than a `data.frame`)

What variables does this dataset contain?

--

1. `?polls_us_election_2016`

1. [FiveThirtyEight’s Pollster Ratings website](https://projects.fivethirtyeight.com/pollster-ratings/)

---

# `filter()`: subset a data.frame

* `filter` has the same purpose as `subset`

* Example: Which A-graded polls with more than 2,000 respondents had Trump at more than 45% of the vote?

```r
filter(polls_us_election_2016,
       grade == "A" & samplesize > 2000 & rawpoll_trump > 45)
```

--

```
## # A tibble: 1 x 15
##   state startdate  enddate    pollster grade samplesize population
##   <fct> <date>     <date>     <fct>    <fct>      <int> <chr>     
## 1 Indi… 2016-04-26 2016-04-28 Marist … A           2149 rv        
## # … with 8 more variables: rawpoll_clinton <dbl>, rawpoll_trump <dbl>,
## #   rawpoll_johnson <dbl>, rawpoll_mcmullin <dbl>, adjpoll_clinton <dbl>,
## #   adjpoll_trump <dbl>, adjpoll_johnson <dbl>, adjpoll_mcmullin <dbl>
```

---

# Create a Filter: Comparisons and Logical Operators

.pull-left[

* We have a standard suite of comparison operators:
    - `>`: greater than,
    - `<`: smaller than,
    - `>=`: greater than or equal to,
    - `<=`: smaller than or equal to,
    - `!=`: not equal to,
    - `==`: equal to.

* Construct more complex filters with logical operators

    1. `x & y`: `x` **and** `y`
    1. `x | y`: `x` **or** `y`
    1. `!y`: **not** `y`

]

.pull-right[

* `R` has the convenient `x %in% y` operator (conversely `!(x %in% y)`), `TRUE` if `x` is *a member of* `y`.

```r
3 %in% 1:3
```

```
## [1] TRUE
```

```r
c(2,5) %in% 2:10 # also vectorized
```

```
## [1] TRUE TRUE
```

```r
c("S","Po") %in% c("Sciences","Po") # also strings
```

```
## [1] FALSE  TRUE
```

]

---

# `mutate()`: create new variables

* *Example*: What was

    1. the combined vote share of Trump and Clinton for each poll?
    2. the difference between Trump's raw poll vote share and 538's adjusted vote share?
```r
mutate(polls_us_election_2016,
       trump_clinton_tot = rawpoll_trump + rawpoll_clinton,
       trump_raw_adj_diff = rawpoll_trump - adjpoll_trump)
```

# `select()`: only keep some variables

* *Example*: Only keep the variables `state,startdate,enddate,pollster,rawpoll_clinton,rawpoll_trump`

```r
select(polls_us_election_2016,
       state,startdate,enddate,pollster,rawpoll_clinton,rawpoll_trump)
```

---
class: inverse

# Task 2 (10 minutes)

1. Which polls had more vote intentions for Trump than for Clinton?

1. How many polls have a missing `grade`?

1. Which polls were (i) polled by American Strategies, GfK Group or Merrill Poll, _and_ (ii) had a sample size greater than 1,000, _and_ (iii) started on October 20th, 2016?

*For the following questions you should use `filter` and `mutate`.*

1. Which polls (i) did not have missing poll data for Johnson, (ii) had a combined raw poll vote share for Trump and Clinton greater than 95% _and_ (iii) had a sample size greater than 1,000?

1. Which polls (i) did not poll for vote intentions for Johnson, (ii) had a difference in raw poll vote shares between Trump and Clinton greater than 5, and (iii) were done in the state of Iowa?

---

# Split-Apply-Combine

.pull-left[

* Often we do *some* operation **by** some group in our dataset:
    * Mean vote share for Clinton by pollster grade.
    * Maximum vote share for Trump by poll month, etc

* For this, we need to
    1. Split the data **by** group
    2. Apply to each group the operation
    3. Recombine all groups into one table

* In `dplyr`, this is achieved with `group_by()` and `summarise()`.

]

--

.pull-right[

1. `group_by(polls_us_election_2016, grade)` groups/splits `polls_us_election_2016` by pollster `grade`:

```r
polls_grade = group_by(polls_us_election_2016, grade)
```

1. `summarise` each chunk and re-combine

```r
summarise(polls_grade,
          mean_vote_clinton = mean(rawpoll_clinton))
```

```
## # A tibble: 11 x 2
##    grade mean_vote_clinton
##    <fct>             <dbl>
##  1 D                  46.7
##  2 C-                 43.2
##  3 C                  41.8
##  4 C+                 44.2
##  5 B-                 43.9
##  6 B                  37.3
##  7 B+                 44.1
##  8 A-                 43.0
##  9 A                  45.3
## 10 A+                 45.8
## 11 NA                 43.2
```

]

---
background-image: url("../img/logo/magrittr.svg")
background-position: 90% 5%
background-size: 180px

# Chaining 🔗 Commands Together: The Pipe

* The `magrittr` package gives us the *pipe* `%>%`.

* `x %>% f(y)` becomes `f(x,y)`.

* With the *pipe* you construct data *pipelines*.

--

.pull-left[

Our above example would become:

```r
polls_us_election_2016 %>%
  group_by(grade) %>%
  summarise(
    mean_vote_clinton = mean(rawpoll_clinton)
  )
```

which is equivalent to, but nicer than:

```r
summarise(
  group_by(polls_us_election_2016, grade),
  mean_vote_clinton = mean(rawpoll_clinton))
```

]

--

.pull-right[

Works for all `dplyr` verbs***!!***

```r
polls_us_election_2016 %>%
  mutate(trump_clinton_diff = rawpoll_trump-rawpoll_clinton) %>%
  filter(trump_clinton_diff>5 &
         state == "Iowa" &
         is.na(rawpoll_johnson)) %>%
  select(pollster)
```

```
## # A tibble: 3 x 1
##   pollster
##   <fct>   
## 1 Ipsos   
## 2 Ipsos   
## 3 Ipsos
```

]

---
class: title-slide-final, middle
background-image: url(../img/logo/ScPo-econ.png)
background-size: 250px
background-position: 9% 19%

# SEE YOU IN TWO WEEKS!

|                                                                                                            |                                   |
| :--------------------------------------------------------------------------------------------------------- | :-------------------------------- |
| <a href="mailto:michele.fioretti@sciencespo.fr">.ScPored[<i class="fa fa-paper-plane fa-fw"></i>]</a> | michele.fioretti@sciencespo.fr |
| <a href="https://michelefioretti.github.io/ScPoEconometrics-Slides/">.ScPored[<i class="fa fa-link fa-fw"></i>]</a> | Slides |
| <a href="https://michelefioretti.github.io/ScPoEconometrics/">.ScPored[<i class="fa fa-link fa-fw"></i>]</a> | Book |
| <a href="http://twitter.com/ScPoEcon">.ScPored[<i class="fa fa-twitter fa-fw"></i>]</a> | @ScPoEcon |
| <a href="http://github.com/ScPoEcon">.ScPored[<i class="fa fa-github fa-fw"></i>]</a> | @ScPoEcon |