Read the help for `mean` to remove `NA`s.
]

--

.pull-right[

1. `median`: the value `\(x_j\)` below and above which 50% of the values in `x` lie. `\(m\)` is the median if `$$\Pr(X \leq m) \geq 0.5 \text{ and } \Pr(X \geq m) \geq 0.5$$`

1. The median is robust against *outliers*. 🤔? (later).

```r
median(x)
```

```
## [1] 2
```

***Your turn:*** What's the median of `infant_mortality` in 1960?

]

---

# Missing Values: `NA`

.pull-left[

* Whenever a value is *missing*, we code it as `NA`.

```r
x <- NA
```

* `R` propagates `NA` through operations:

```r
NA > 5
```

```
## [1] NA
```

```r
NA + 10
```

```
## [1] NA
```

* The function `is.na(x)` returns `TRUE` if `x` is an `NA`.

```r
is.na(x)
```

```
## [1] TRUE
```

]

--

.pull-right[

* What can be confusing is that comparing `NA` with itself also gives `NA`:

```r
NA == NA
```

```
## [1] NA
```

* An example makes this easier to see:

```r
# Let x be Mary's age. We don't know how old she is.
x <- NA

# Let y be John's age. We don't know how old he is.
y <- NA

# Are John and Mary the same age?
x == y
```

```
## [1] NA
```

```r
#> [1] NA
# We don't know!
```

]

---

# Spread

.pull-left[

* Another interesting feature is how much a variable is *spread out* about its center (the mean in this case).

* The *variance* is such a measure.

`$$Var(X) = \frac{1}{N} \sum_{i=1}^N(x_i-\bar{x})^2$$`

* Consider two `normal distributions` with equal mean at `0`:

]

--

.pull-right[

<img src="chapter2_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" />

* Compute with:

```r
var(x)
range(x) # min and max values; diff(range(x)) gives max - min
```

]

---

# Example: the Weight of Women and Men<sup>1</sup>

.pull-left[

* Men weigh 65 kg and women 55 kg on average. The variance is 25.
```r
library(ggplot2) # needed for the density plots below

# Generate a random dataset
set.seed(1234) # `set.seed` allows replicating random data
df <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = round(c(rnorm(200, mean = 55, sd = 5), # ?rnorm | Note: sd = sqrt(variance)
                   rnorm(200, mean = 65, sd = 5))))
```

* Plot the overall density

```r
ggplot(df, aes(x=weight)) +
  geom_density() +
  theme_bw() # white background
```

* Plot separated densities

```r
# Change density plot line colors by groups
ggplot(df, aes(x=weight, color=sex)) +
  geom_density() +
  theme_bw() # white background
```

]

--

.pull-right[

<img src="chapter2_files/figure-html/unnamed-chunk-21-1.svg" style="display: block; margin: auto;" />

<img src="chapter2_files/figure-html/unnamed-chunk-22-1.svg" style="display: block; margin: auto;" />

.footnote[
[1]: This example is taken from [](
]

]

---

# The `table` function

* `table(x)` is a useful function that counts the occurrences of each unique value in `x`:

```r
table(gapminder$continent)
```

```
## 
##   Africa Americas     Asia   Europe  Oceania
##     2907     2052     2679     2223      684
```

```r
table(gapminder$region)
```

```
## 
## Australia and New Zealand                 Caribbean           Central America
##                       114                       741                       456
##              Central Asia            Eastern Africa              Eastern Asia
##                       285                       912                       342
##            Eastern Europe                 Melanesia                Micronesia
##                       570                       285                       114
##             Middle Africa           Northern Africa          Northern America
##                       456                       342                       171
##           Northern Europe                 Polynesia             South America
##                       570                       171                       684
##        South-Eastern Asia           Southern Africa             Southern Asia
##                       570                       285                       456
##           Southern Europe            Western Africa              Western Asia
##                       684                       912                      1026
##            Western Europe
##                       399
```

---

# Crosstables

* Given two vectors, `table` produces a contingency table:

```r
gapminder_2015 <- subset(gapminder, year == 2015)
# dummy variable for a fertility rate above replacement-level fertility
gapminder_2015$fertility_above_2 = (gapminder_2015$fertility > 2.1)
table(gapminder_2015$fertility_above_2,gapminder_2015$continent)
```

```
##        
##         Africa Americas Asia Europe Oceania
##   FALSE      2       15   20     39       4
##   TRUE      49       20   27      0       8
```

--

* With
`prop.table`, we can get proportions:

```r
# proportions by row
prop.table(table(gapminder_2015$fertility_above_2,gapminder_2015$continent), margin = 1)
# proportions by column
prop.table(table(gapminder_2015$fertility_above_2,gapminder_2015$continent), margin = 2)
```

* ⚠️ To include `NA`s in a `table`, use the argument `useNA = "always"` or `useNA = "ifany"`.

---
layout: false
class: title-slide-section-red, middle

# Plotting

---
layout: true

<div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div>

---

# Plotting

.pull-left[

* `R` base plotting is fairly good.

* There is an extremely powerful alternative in package `ggplot2`. We'll see both.

* First example: *histograms*. A histogram counts how many observations fall within each bin.

]

--

.pull-right[

```r
gapminder_2015 <- gapminder[gapminder$year == 2015,]
hist(gapminder_2015$life_expectancy)
```

<img src="chapter2_files/figure-html/unnamed-chunk-26-1.svg" style="display: block; margin: auto;" />

]

---

# A Nicer Histogram

.pull-left[

* We can give additional arguments to `hist`.

* Look at `?hist` for more.

```r
hist(gapminder_2015$life_expectancy,
     xlab = "Life Expectancy",
     main = "Histogram of life expectancy in 2015",
     breaks = seq(from = 40, to = 90, by = 5),
     las = 1,  # label orientation
     col = "#d90502",
     border = "white")
```

]

.pull-right[

<img src="chapter2_files/figure-html/unnamed-chunk-29-1.svg" style="display: block; margin: auto;" />

]

---

# Looking for Outliers: Boxplots

* An *outlier* is a datapoint far removed from the center of a distribution.

* Boxplots are an effective way to visualise the distribution of a variable.

* The *box* typically denotes the __interquartile range__ (observations between the 25th and 75th percentiles).

* The *thick line* corresponds to the __median__.

* The *dots* are __outliers__ (⚠️ no universally accepted definition).

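* One widely used convention, and the default in `boxplot`, flags a point as an outlier if it lies more than 1.5 times the interquartile range beyond the box. A minimal sketch on made-up numbers (the vector `x` is purely illustrative and not taken from `gapminder`):

```r
# Toy data, invented for illustration: one value sits far from the rest
x <- c(2, 3, 4, 5, 6, 7, 8, 30)

q   <- quantile(x, probs = c(0.25, 0.75)) # 25th and 75th percentiles
iqr <- IQR(x)                             # interquartile range

lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Points outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]
outliers

# Base R applies the same 1.5 rule internally when drawing boxplots
boxplot.stats(x)$out
```

* Both approaches flag only the value 30 here. `boxplot.stats()` computes the box from hinges rather than `quantile()`, so on other data the two can disagree slightly for borderline points.
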
---

# Looking for Outliers: Boxplots

<img src="chapter2_files/figure-html/boxplot_explanation.png" width="850px" style="display: block; margin: auto;" />

---

# Looking for Outliers: Boxplots

.pull-left[

```r
boxplot(life_expectancy ~ continent,
        data = gapminder_2015,
        xlab = "Continent",
        ylab = "Life expectancy in 2015",
        main = "Life expectancy by continent in 2015",
        pch = 20, cex = 2, # plotting symbol and size of outliers
        col = "#d90502", border = "black",
        las = 1)
```

* see `?boxplot` for more options

***Your turn:*** What does the same plot look like for `fertility`? For `infant_mortality`?

]

.pull-right[

<img src="chapter2_files/figure-html/unnamed-chunk-32-1.svg" style="display: block; margin: auto;" />

]

---

# Scatter Plots

* Two variables `\(x\)` and `\(y\)`

--

* Natural to ask: How often do certain pairs of `\((x_i,y_i)\)` occur?

```r
head(gapminder_2015[,c("fertility","infant_mortality")])
```

```
##       fertility infant_mortality
## 10176      1.78             12.5
## 10177      2.71             21.9
## 10178      5.65             96.0
## 10179      2.06              5.8
## 10180      2.15             11.1
## 10181      1.41             12.6
```

* That's what a scatter plot shows.

---

# Scatter Plots

.pull-left[

```r
plot(fertility ~ infant_mortality,
     data = gapminder_2015,
     xlab = "Infant mortality",
     ylab = "Fertility rate",
     xlim = c(0,200), # range of x axis
     ylim = c(0,8),   # range of y axis
     main = "Relationship between fertility and infant mortality in 2015",
     col = "#d90502",
     las = 1)
```

<img src="chapter2_files/figure-html/unnamed-chunk-34-1.svg" style="display: block; margin: auto;" />

]

.pull-right[

<br>

* Each dot is one pair `\((x_i,y_i)\)`.

* We often call it one *observation*.

* It corresponds to one *row* of the `data.frame`.

* Why do some dots appear *darker* than others here?

***Your turn:*** What does the same scatter plot look like for 1960?
]

---
background-image: url("../img/logo/ggplot2.svg")
background-position: 90% 5%
background-size: 180px

# Quick `ggplot2` Intro

.pull-left[

* Excellent cheatsheet on [project website](

* Great intro to `ggplot2` [here](

* Based on *The __G__rammar of __G__raphics* (hence __gg__plot).

* More powerful than base `R` plotting.

* Let's reproduce the previous graphs in `ggplot2`.

]

.pull-right[

<br>
<br>

<img src="chapter2_files/figure-html/ggplot_grammar_graphics.png" width="500px" style="display: block; margin: auto;" />

###### source: [BloggoType](

]

---

# `ggplot2`: Basic Histogram

.pull-left[

```r
library(ggplot2)
ggplot(gapminder_2015, aes(x = life_expectancy)) +
  geom_histogram()
```

]

.pull-right[

<img src="chapter2_files/figure-html/unnamed-chunk-37-1.svg" style="display: block; margin: auto;" />

]

---

# `ggplot2`: Fancy Histogram

.pull-left[

```r
library(ggplot2)
ggplot(gapminder_2015, aes(x = life_expectancy)) +
  geom_histogram(binwidth = 5,
                 boundary = 0, # try with 1: what does it do?
                 colour = "white",
                 fill = "#d90502") +
  labs(x = "Life Expectancy",
       y = "Frequency",
       title = "Histogram of life expectancy in 2015") +
  theme_bw(base_size = 16) # white background + base font size adjusted to 16 pts
```

]

.pull-right[

<img src="chapter2_files/figure-html/unnamed-chunk-39-1.svg" style="display: block; margin: auto;" />

]

---

# `ggplot2`: Fancy Histogram with `facet_grid()`

.pull-left[

```r
library(ggplot2)
ggplot(gapminder_2015, aes(x = life_expectancy)) +
  geom_histogram(binwidth = 5,
                 boundary = 0,
                 colour = "white",
                 fill = "#d90502") +
  labs(x = "Life Expectancy",
       y = "Frequency",
       title = "Histogram of life expectancy in 2015") +
  theme_bw(base_size = 16) +
  facet_grid(rows = vars(continent))
```

]

.pull-right[

<img src="chapter2_files/figure-html/unnamed-chunk-41-1.svg" style="display: block; margin: auto;" />

]

---

# `ggplot2`: Boxplots

.pull-left[

```r
ggplot(gapminder_2015, aes(x = continent, y = life_expectancy)) +
  geom_boxplot(colour = "black", fill = "#d90502") +
  labs(x = "Continent",
       y = "Life expectancy in 2015",
       title = "Life expectancy by continent in 2015") +
  theme_bw(base_size = 20)
```

]

.pull-right[

<img src="chapter2_files/figure-html/unnamed-chunk-43-1.svg" style="display: block; margin: auto;" />

]

---

# `ggplot2`: Scatter Plots

.pull-left[

```r
ggplot(gapminder_2015, aes(x = infant_mortality, y = fertility)) +
  geom_point(size = 3, alpha = 0.5, colour = "#d90502") +
  expand_limits(x = 0, y = 0) +
  labs(x = "Infant mortality",
       y = "Fertility rate",
       title = "Relationship between fertility and infant mortality in 2015") +
  theme_bw(base_size = 16)
```

]

.pull-right[

<img src="chapter2_files/figure-html/unnamed-chunk-45-1.svg" style="display: block; margin: auto;" />

]

---
class: inverse, middle

# It's Tutorial Time!

---

# Tutorial 1 (10 minutes)

Time for our first tutorial!!

Type this into your `RStudio` console:

```r
library(ScPoApps)
runTutorial('chapter2')
```

If you have trouble with the interactive doc, try this version (no interactive content):

```r
runTutorial('chapter2-script')
```

---

# How are x and y related? Covariance and Correlation

* [This]( is the relevant section in the book about Covariance.

.pull-left[

<img src="chapter2_files/figure-html/x-y-corr-1.svg" style="display: block; margin: auto;" />

]

--

.pull-right[

* The covariance is a measure of __joint variability__ of two variables.

`$$Cov(x,y) = \frac{1}{N} \sum_{i=1}^N(x_i-\bar{x})(y_i-\bar{y})$$`

```
## [1] 24.21146
```

* The correlation is a measure of the strength of the __linear association__ between two variables.
`$$Cor(x,y) = \frac{Cov(x,y)}{\sqrt{Var(x)}\sqrt{Var(y)}}$$`

```
## [1] 0.8286402
```

]

---
class: inverse

# Correlation App

```r
library(ScPoApps)
runTutorial('correlation')
```

---
layout: false
class: title-slide-section-red, middle

# Wrangling

---
layout: true

<div class="my-footer"><img src="../img/logo/ScPo-shield.png" style="height: 60px;"/></div>

---

# Intro to `dplyr`

.pull-left[

<br>
<br>
<br>

* [`dplyr`]( is part of the [tidyverse]( package family.

* [`data.table`]( is an alternative. Very fast but a bit more difficult.

* Both have pros and cons. We'll start you off with `dplyr`.

]

.pull-right[   ]

---

# `dplyr` Overview

.pull-left[

<br>
<br>

* You *must* read through [Hadley Wickham's chapter]( It's concise.

* The package is organized around a set of **verbs**, i.e. *actions* to be taken.

* We operate on `data.frames` or `tibbles` (*nicer looking* data.frames).

* All *verbs* work the same way: the first argument is a data.frame, subsequent arguments describe what to do, and the result is another data.frame.

]

--

.pull-right[

## Verbs

1. `filter()`: Choose observations based on a certain value (i.e. subset)

1. `arrange()`: Reorder rows

1. `select()`: Select variables by name

1. `mutate()`: Create new variables out of existing ones

1. `group_by()` and `summarise()`: Summarise variables

]

---

# Data on 2016 US election polls from the `dslabs` package

* This dataset contains __real__ data on polls made during the 2016 US Presidential elections and compiled by [fivethirtyeight](

```r
library(dslabs)
library(dplyr) # `library` loads one package per call
data(polls_us_election_2016) # this data is from
polls_us_election_2016 <- dplyr::as_tibble(polls_us_election_2016)
head(polls_us_election_2016[,1:6], n = 6) # show first 6 lines of first 6 variables
```

```
## # A tibble: 6 x 6
##   state startdate  enddate    pollster                          grade samplesize
##   <fct> <date>     <date>     <fct>                             <fct>      <int>
## 1 U.S.  2016-11-03 2016-11-06 ABC News/Washington Post          A+          2220
## 2 U.S.  2016-11-01 2016-11-07 Google Consumer Surveys           B          26574
## 3 U.S.  2016-11-02 2016-11-06 Ipsos                             A-          2195
## 4 U.S.  2016-11-04 2016-11-07 YouGov                            B           3677
## 5 U.S.  2016-11-03 2016-11-06 Gravis Marketing                  B-         16639
## 6 U.S.  2016-11-03 2016-11-06 Fox News/Anderson Robbins Resear… A           1295
```

🚨 This is a `tibble` (more informative than a `data.frame`)

What variables does this dataset contain?

--

1. `?polls_us_election_2016`

1. [FiveThirtyEight’s Pollster Ratings website](

---

# `filter()`: subset a data.frame

* `filter` has the same purpose as `subset`

* Example: Which A-graded poll with more than 2,000 respondents had Trump at more than 45% of the vote?

```r
filter(polls_us_election_2016,
       grade == "A" & samplesize > 2000 & rawpoll_trump > 45)
```

--

```
## # A tibble: 1 x 15
##   state startdate  enddate    pollster grade samplesize population
##   <fct> <date>     <date>     <fct>    <fct>      <int> <chr>
## 1 Indi… 2016-04-26 2016-04-28 Marist … A           2149 rv
## # … with 8 more variables: rawpoll_clinton <dbl>, rawpoll_trump <dbl>,
## #   rawpoll_johnson <dbl>, rawpoll_mcmullin <dbl>, adjpoll_clinton <dbl>,
## #   adjpoll_trump <dbl>, adjpoll_johnson <dbl>, adjpoll_mcmullin <dbl>
```

---

# Create a Filter: Comparisons and Logical Operators

.pull-left[

* We have a standard suite of comparison operators:
    - `>`: greater than,
    - `<`: less than,
    - `>=`: greater than or equal to,
    - `<=`: less than or equal to,
    - `!=`: not equal to,
    - `==`: equal to.

* Construct more complex filters with logical operators:
    1. `x & y`: `x` **and** `y`
    1. `x | y`: `x` **or** `y`
    1. `!y`: **not** `y`

]

.pull-right[

* `R` has the convenient `x %in% y` operator (conversely `!(x %in% y)`), which is `TRUE` if `x` is *a member of* `y`.

```r
3 %in% 1:3
```

```
## [1] TRUE
```

```r
c(2,5) %in% 2:10 # also vectorized
```

```
## [1] TRUE TRUE
```

```r
c("S","Po") %in% c("Sciences","Po") # also strings
```

```
## [1] FALSE  TRUE
```

]

---

# `mutate()`: create new variables

* *Example*: What was
    1. the combined vote share of Trump and Clinton for each poll?
    2. the difference between Trump's raw poll vote share and 538's adjusted vote share?

```r
mutate(polls_us_election_2016,
       trump_clinton_tot = rawpoll_trump + rawpoll_clinton,
       trump_raw_adj_diff = rawpoll_trump - adjpoll_trump)
```

# `select()`: only keep some variables

* *Example*: Only keep the variables `state,startdate,enddate,pollster,rawpoll_clinton,rawpoll_trump`

```r
select(polls_us_election_2016,
       state,startdate,enddate,pollster,rawpoll_clinton,rawpoll_trump)
```

---
class: inverse

# Task 2 (10 minutes)

1. Which polls had more vote intentions for Trump than for Clinton?
1. How many polls have a missing `grade`?
1. Which polls were (i) polled by American Strategies, GfK Group or Merrill Poll, _and_ (ii) had a sample size greater than 1,000, _and_ (iii) started on October 20th, 2016?

*For the following questions you should use `filter` and `mutate`.*

1. Which polls (i) did not have missing poll data for Johnson, (ii) had a combined raw poll vote share for Trump and Clinton greater than 95%, _and_ (iii) had a sample size greater than 1,000?
1. Which polls (i) did not poll for vote intentions for Johnson, (ii) had a difference in raw poll vote shares between Trump and Clinton greater than 5, _and_ (iii) were done in the state of Iowa?

Apply to each group the operation 3. 