Tidying, Visualising and Summarising Data

Task 1: Data wrangling

Load the data by running the following code:

library(dslabs)
data(polls_us_election_2016)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

1. Which polls had a missing grade?

Only the first six rows are shown, using head(); otherwise the document would be very long.

polls_us_election_2016 %>%
    filter(is.na(grade)) %>%
    head

##          state  startdate    enddate                 pollster grade samplesize
## 1   New Mexico 2016-11-06 2016-11-06                 Zia Poll  <NA>       8439
## 2         U.S. 2016-11-05 2016-11-07 The Times-Picayune/Lucid  <NA>       2521
## 3         U.S. 2016-11-01 2016-11-07    USC Dornsife/LA Times  <NA>       2972
## 4     Virginia 2016-11-01 2016-11-02                Remington  <NA>       3076
## 5    Wisconsin 2016-11-01 2016-11-02                Remington  <NA>       2720
## 6 Pennsylvania 2016-11-01 2016-11-02                Remington  <NA>       2683
##   population rawpoll_clinton rawpoll_trump rawpoll_johnson rawpoll_mcmullin
## 1         lv           46.00         44.00               6               NA
## 2         lv           45.00         40.00               5               NA
## 3         lv           43.61         46.84              NA               NA
## 4         lv           46.00         44.00              NA               NA
## 5         lv           49.00         41.00              NA               NA
## 6         lv           46.00         45.00              NA               NA
##   adjpoll_clinton adjpoll_trump adjpoll_johnson adjpoll_mcmullin
## 1        44.82594      41.59978        7.870127               NA
## 2        45.13966      42.26495        3.679914               NA
## 3        45.32156      43.38579              NA               NA
## 4        45.27399      41.91459              NA               NA
## 5        48.22713      38.86464              NA               NA
## 6        45.30896      42.94988              NA               NA
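
The filter above relies on is.na() rather than a comparison with NA. A minimal sketch of why, using a made-up grade vector (the values are hypothetical, not taken from the poll data):

```r
# Toy grades (hypothetical values, not from polls_us_election_2016)
grade <- c("A", NA, "B+", NA)

# Comparing with == propagates NA instead of giving TRUE/FALSE
compared <- grade == NA
compared

# is.na() returns a proper logical vector
missing <- is.na(grade)
missing

# Number of missing grades
n_missing <- sum(missing)
n_missing
```

Since filter() only keeps rows where the condition is TRUE, a condition such as grade == NA would keep no rows at all.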

2. Which polls were (i) polled by American Strategies, GfK Group or Merrill Poll, (ii) had a sample size greater than 1,000, and (iii) started on October 20th, 2016? (Hint: for (i) %in% might come in handy. Recall that vectors are created using the c() function. For (iii) make sure to check the format of the variable containing the poll’s start date.)

polls_us_election_2016 %>%
    filter(pollster %in% c("American Strategies","GfK Group","Merrill Poll") &
               samplesize > 1000 &
               startdate == "2016-10-20")

##   state  startdate    enddate  pollster grade samplesize population
## 1  U.S. 2016-10-20 2016-10-24 GfK Group    B+       1212         lv
##   rawpoll_clinton rawpoll_trump rawpoll_johnson rawpoll_mcmullin
## 1              51            37               6               NA
##   adjpoll_clinton adjpoll_trump adjpoll_johnson adjpoll_mcmullin
## 1        50.28058      39.98632        4.733277               NA
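
Two details in this answer can be checked on toy vectors: %in% tests membership element by element, and a Date column can be compared with a date written as "YYYY-MM-DD". The sketch below uses made-up values (the pollster "Some Other Pollster" is hypothetical):

```r
# Toy pollster names (hypothetical mix, for illustration only)
pollster <- c("GfK Group", "Merrill Poll", "Some Other Pollster", "GfK Group")
wanted <- c("American Strategies", "GfK Group", "Merrill Poll")

# One TRUE/FALSE per element of pollster
is_wanted <- pollster %in% wanted
is_wanted

# Toy start dates stored as Date, like startdate in the poll data
startdate <- as.Date(c("2016-10-20", "2016-10-25"))
class(startdate)

# Explicit Date comparison
same_day <- startdate == as.Date("2016-10-20")

# Comparing with a character string also works: it is converted to Date
same_day_chr <- startdate == "2016-10-20"
same_day
same_day_chr
```

Writing as.Date() explicitly is slightly longer but makes the intended type obvious to the reader.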

3. Which polls (i) did not have missing poll data for Johnson, (ii) had a combined raw poll vote share for Trump and Clinton greater than 95% and (iii) were done in the state of Ohio? (Hint: it might be practical to first create a variable containing the combined raw poll vote share for Trump and Clinton and then filter.)

polls_us_election_2016 %>%
    mutate(rawpoll_clintontrump = rawpoll_clinton + rawpoll_trump) %>%
    filter(!is.na(rawpoll_johnson) & rawpoll_clintontrump > 95 & state == "Ohio")

##  [1] state                startdate            enddate             
##  [4] pollster             grade                samplesize          
##  [7] population           rawpoll_clinton      rawpoll_trump       
## [10] rawpoll_johnson      rawpoll_mcmullin     adjpoll_clinton     
## [13] adjpoll_trump        adjpoll_johnson      adjpoll_mcmullin    
## [16] rawpoll_clintontrump
## <0 rows> (or 0-length row.names)

No poll satisfies all three conditions, so the result is an empty data frame.

4. Which state had the highest average Trump vote share for polls which had at least a sample size of 2,000? (Hint: you’ll have to use filter, group_by, summarise and arrange. To obtain ranking in descending order check arrange’s help page.)

polls_us_election_2016 %>%
    filter(samplesize >= 2000) %>%
    group_by(state) %>%
    summarise(mean_trump = mean(rawpoll_trump)) %>%
    arrange(desc(mean_trump))

## # A tibble: 26 × 2
##    state          mean_trump
##    <fct>               <dbl>
##  1 Alabama              62.5
##  2 Missouri             48.5
##  3 Indiana              47  
##  4 Texas                46.0
##  5 South Carolina       45.5
##  6 Georgia              45.3
##  7 Kansas               44  
##  8 New Mexico           44  
##  9 Florida              44.0
## 10 Ohio                 43.9
## # ℹ 16 more rows
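
When ranking group averages, it can be worth checking how many observations each average is based on, since an average over very few polls is less informative. A small sketch on toy data (the states "A" and "B" and their values are made up), adding a count with n():

```r
library(dplyr)

# Toy polls: state "A" has a single poll, state "B" has three
polls <- data.frame(
  state = c("A", "B", "B", "B"),
  rawpoll_trump = c(62.5, 40, 45, 50)
)

ranked <- polls %>%
  group_by(state) %>%
  summarise(
    mean_trump = mean(rawpoll_trump),  # average per state
    n_polls = n()                      # number of polls behind that average
  ) %>%
  arrange(desc(mean_trump))

ranked
```

The same n_polls column could be added to the summarise() call used in the answer above.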

Task 2: Understanding the data

Load the data by running the following code:

library(dslabs)
data(gapminder, package = "dslabs")

1. Compute the average population per continent per year, mean_pop, and assign the output to a new object gapminder_mean. (Hint: you should have one observation (row) per continent for each year. You’ll have to use group_by and summarise.)

gapminder_mean <- gapminder %>%
  group_by(continent, year) %>%
  summarise(mean_pop = mean(population))

## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
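
The message above is informational: after summarise(), the result is still grouped by continent. A minimal sketch on toy data (made-up values) showing the default behaviour and how .groups = "drop" returns an ungrouped result:

```r
library(dplyr)

# Toy data with the same column names as gapminder (values are made up)
toy <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe"),
  year = c(2000, 2001, 2000, 2001),
  population = c(10, 20, 30, 40)
)

# Default: the last grouping variable (year) is dropped, continent remains
by_default <- toy %>%
  group_by(continent, year) %>%
  summarise(mean_pop = mean(population))
default_groups <- group_vars(by_default)
default_groups

# .groups = "drop" removes all grouping and silences the message
ungrouped <- toy %>%
  group_by(continent, year) %>%
  summarise(mean_pop = mean(population), .groups = "drop")
remaining_groups <- group_vars(ungrouped)
remaining_groups
```

Whether to drop the grouping depends on what comes next in the pipeline; for plotting or printing it usually makes no difference.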

Task 3: Visualising data

Using the gapminder data, create the following plots using ggplot2. Don’t forget to label the axes.

1. A histogram of life expectancy in 2015. (Hint: do you need to specify a y in aes() for a histogram?) Once you’ve created the histogram, within the appropriate geom_* set: binwidth to 5, boundary to 45, colour to “white” and fill to “#d90502”. What does each of these options do?
Optional: Using the previous graph, facet it by continent such that each continent’s plot is a new row. (Hint: check the help for facet_grid.)

The basic histogram:

gapminder %>%
    filter(year == 2015) %>%
    ggplot() +
    aes(x = life_expectancy) +
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

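The fancy histogram (with axis labels). binwidth sets the width of each bin (here 5 years), boundary aligns the bins so that one bin edge falls exactly at 45, colour sets the colour of the bar outlines, and fill sets the colour of the bar interiors: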

life_exp_hist <- gapminder %>%
    filter(year == 2015) %>%
    ggplot() +
    aes(x = life_expectancy) +
    geom_histogram(binwidth = 5,
                   boundary = 45,
                   colour = "white",
                   fill = "#d90502") +
    labs(x = "Life expectancy",
         y = "Frequency")
life_exp_hist

The faceted fancy histogram:

life_exp_hist +
    facet_grid(rows = vars(continent))
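
To get a feel for what binwidth = 5 and boundary = 45 imply, the bins can be reproduced by hand. This is only a sketch with base R's cut() on made-up life expectancy values; it mimics the binning rather than calling ggplot2 itself:

```r
# Toy life expectancies (hypothetical values)
life_expectancy <- c(47.2, 52.8, 58.1, 61.4, 63.9, 66.0, 71.3, 74.5, 78.9, 82.6)

# Bin edges: start at the boundary (45) and step by the bin width (5)
breaks <- seq(from = 45, to = 85, by = 5)

# Assign each value to a bin; include.lowest keeps a value of exactly 45
bins <- cut(life_expectancy, breaks = breaks, include.lowest = TRUE)

# Counts per bin, i.e. the bar heights of the histogram
bin_counts <- table(bins)
bin_counts
```

Changing the boundary to, say, 47.5 would shift every bin edge by 2.5 and could change the bar heights even though the data are the same.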

2. A boxplot of average life expectancy per year by continent. Within the appropriate geom_* set: colour to “black” and fill to “#d90502”. (Hint: you need to group by both continent and year.)

gapminder %>%
    group_by(continent, year) %>%
    summarise(mean_life_exp = mean(life_expectancy)) %>%
    ggplot() +
    aes(x = continent, y = mean_life_exp) +
    geom_boxplot(colour = "black",
                 fill = "#d90502") +
    labs(x = "Continent",
         y = "Life expectancy")

## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.

3. A scatter plot of fertility rate (y-axis) with respect to infant mortality (x-axis) in 2015. Once you’ve created the scatter plot, within the appropriate geom_* set: size to 3, alpha to 0.5, colour to “#d90502”.

The basic scatter plot:

gapminder %>%
    filter(year == 2015) %>%
    ggplot() +
    aes(x = infant_mortality, y = fertility) +
    geom_point()

## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).

The fancy scatter plot with axis labels:

gapminder %>%
    filter(year == 2015) %>%
    ggplot() +
    aes(x = infant_mortality, y = fertility) +
    geom_point(size = 3,
               alpha = 0.5,
               colour = "#d90502") +
    labs(x = "Infant mortality", y = "Fertility")

## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).
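
The warning says that some rows could not be plotted because at least one of the two variables is missing. A sketch of how such rows can be counted beforehand, using a toy data frame with made-up values:

```r
# Toy data with missing values in either column (values are made up)
toy <- data.frame(
  infant_mortality = c(3.1, NA, 45.0, 12.4, NA),
  fertility = c(1.8, 2.1, NA, 2.4, NA)
)

# A row is incomplete if either plotted variable is NA
incomplete <- !complete.cases(toy[, c("infant_mortality", "fertility")])

# Number of rows a scatter plot of these two columns would have to drop
n_dropped <- sum(incomplete)
n_dropped
```

Applying the same check to the 2015 subset of gapminder should match the number reported in the warning.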

Task 4: Summarising data

1. Compute the mean of GDP in 2011 and assign to object mean. You should exclude missing values. (Hint: read the help page for mean to see how to remove NAs.)

mean <- gapminder %>%
    filter(year == 2011) %>%
    summarise(mean(gdp, na.rm = T))
mean

##   mean(gdp, na.rm = T)
## 1         246954895975
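
The object created above is a one-row data frame, not a plain number, and its name is the same as that of the function mean. The task asks for exactly that, but as an alternative sketch (the object name mean_gdp and the toy values are assumptions made for illustration), pull() extracts the value as a numeric vector:

```r
library(dplyr)

# Toy GDP values with one missing entry (made up)
toy <- data.frame(gdp = c(10, 20, NA, 60))

# Name the summary column, then pull it out as a plain numeric value
mean_gdp <- toy %>%
  summarise(mean_gdp = mean(gdp, na.rm = TRUE)) %>%
  pull(mean_gdp)

mean_gdp
is.numeric(mean_gdp)
```

A plain numeric value can be passed directly to functions such as geom_vline() without wrapping it in as.numeric().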

2. Compute the median of GDP in 2011 and assign to object median. Again, you should exclude missing values. Is it greater or smaller than the average?

median <- gapminder %>%
    filter(year == 2011) %>%
    summarise(median(gdp, na.rm = T))
median

##   median(gdp, na.rm = T)
## 1            16031265698

The median (about 16 billion) is much smaller than the average (about 247 billion): the average is roughly 15 times larger.
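
A gap of this kind between mean and median is typical of skewed data. A tiny sketch with made-up numbers, where a single large value pulls the mean up while the median barely reacts:

```r
# Toy values: four small observations and one very large one (made up)
gdp <- c(1, 2, 3, 4, 1000)

gdp_mean <- mean(gdp)      # pulled up by the value 1000
gdp_median <- median(gdp)  # middle value, unaffected by the size of the outlier

gdp_mean
gdp_median
```

Replacing 1000 by 5 would bring the mean down to 3 while leaving the median unchanged at 3.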

3. Create a density plot of GDP in 2011 using geom_density. A density plot is a way of representing the distribution of a numeric variable. Add the following code to your plot to show the median and mean as vertical lines. What do you observe?

geom_vline(xintercept = as.numeric(mean), colour = "red") +
geom_vline(xintercept = as.numeric(median), colour = "orange")

gdp_density <- gapminder %>%
    filter(year == 2011) %>%
    ggplot() +
    aes(x = gdp) +
    geom_density() +
    geom_vline(xintercept = as.numeric(mean), colour = "red") +
    geom_vline(xintercept = as.numeric(median), colour = "orange")
gdp_density

## Warning: Removed 17 rows containing non-finite outside the scale range
## (`stat_density()`).

The distribution of GDP is highly skewed: there are many countries with small GDPs and very few with huge GDPs (U.S., Japan, China). In such cases, the mean is pulled up by the few very large values and ends up well above the median. To see this more clearly, here’s a graph where I’ve put the x-axis on a log10 scale: equal distances along the axis correspond to equal multiplicative steps rather than equal additive ones (for example, 100,000, 1 million and 10 million are equally spaced).

gdp_density +
    scale_x_log10()

## Warning: Removed 17 rows containing non-finite outside the scale range
## (`stat_density()`).
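
The effect of a log10 axis can be illustrated numerically: values that differ by a factor of 10 end up one unit apart after taking log10. A short sketch on three illustrative values:

```r
# Values that each differ by a factor of 10
values <- c(1e5, 1e6, 1e7)

# Positions on a log10 axis
positions <- log10(values)
positions

# The gaps between consecutive positions are all equal
gaps <- diff(positions)
gaps
```

On a linear axis the same three values would be 900,000 and 9,000,000 apart, so the smaller ones would be squeezed together near the origin.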

4. Compute the correlation between fertility and infant mortality in 2015. To drop NAs in either variable set the argument use to “pairwise.complete.obs” in your cor() function. Is this correlation consistent with the graph you produced in Task 3?

gapminder %>%
    filter(year == 2015) %>%
    summarise(cor(fertility, infant_mortality, use = "pairwise.complete.obs"))

##   cor(fertility, infant_mortality, use = "pairwise.complete.obs")
## 1                                                       0.8286402

This correlation is positive and strong (relatively close to 1) which is consistent with the graph produced in Task 3. Indeed, that graph displayed a positive relationship between these two variables and the points were not that dispersed.
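
The use argument matters because, by default, cor() returns NA as soon as either variable contains a missing value. A sketch on toy vectors (made-up values) showing that "pairwise.complete.obs" amounts to dropping the incomplete pairs first:

```r
# Toy vectors with one NA each, in different positions (values are made up)
fertility <- c(1.5, 2.0, 3.5, 5.0, NA)
infant_mortality <- c(4, 10, NA, 60, 30)

# Default behaviour: any NA makes the result NA
r_default <- cor(fertility, infant_mortality)
r_default

# Only pairs where both values are present are used
r_pairwise <- cor(fertility, infant_mortality, use = "pairwise.complete.obs")

# Same result when the incomplete pairs are removed by hand
keep <- complete.cases(fertility, infant_mortality)
r_manual <- cor(fertility[keep], infant_mortality[keep])

all.equal(r_pairwise, r_manual)
```

With only two variables, "pairwise.complete.obs" and "complete.obs" give the same answer; they can differ when cor() is applied to a matrix with three or more columns.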

Tidying, Visualising and Summarising Data - Tasks

Florian Oswald

2025-09-10
