Load the data by running the following code:
library(dslabs)
data(polls_us_election_2016)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
1. Which polls had a missing grade
?
Only showing the first 6 rows using head
, otherwise the
document would be very long.
polls_us_election_2016 %>%
filter(is.na(grade)) %>%
head
## state startdate enddate pollster grade samplesize
## 1 New Mexico 2016-11-06 2016-11-06 Zia Poll <NA> 8439
## 2 U.S. 2016-11-05 2016-11-07 The Times-Picayune/Lucid <NA> 2521
## 3 U.S. 2016-11-01 2016-11-07 USC Dornsife/LA Times <NA> 2972
## 4 Virginia 2016-11-01 2016-11-02 Remington <NA> 3076
## 5 Wisconsin 2016-11-01 2016-11-02 Remington <NA> 2720
## 6 Pennsylvania 2016-11-01 2016-11-02 Remington <NA> 2683
## population rawpoll_clinton rawpoll_trump rawpoll_johnson rawpoll_mcmullin
## 1 lv 46.00 44.00 6 NA
## 2 lv 45.00 40.00 5 NA
## 3 lv 43.61 46.84 NA NA
## 4 lv 46.00 44.00 NA NA
## 5 lv 49.00 41.00 NA NA
## 6 lv 46.00 45.00 NA NA
## adjpoll_clinton adjpoll_trump adjpoll_johnson adjpoll_mcmullin
## 1 44.82594 41.59978 7.870127 NA
## 2 45.13966 42.26495 3.679914 NA
## 3 45.32156 43.38579 NA NA
## 4 45.27399 41.91459 NA NA
## 5 48.22713 38.86464 NA NA
## 6 45.30896 42.94988 NA NA
2. Which polls were (i) polled by American Strategies, GfK Group or
Merrill Poll, (ii) had a sample size greater than 1,000, and
(iii) started on October 20th, 2016? (Hint: for (i)
%in%
might come in handy. Recall that vectors are created
using the c()
function. For (iii) make sure to check the
format of the variable containing the poll’s start date.)
polls_us_election_2016 %>%
filter(pollster %in% c("American Strategies","GfK Group","Merrill Poll") &
samplesize > 1000 &
startdate == "2016-10-20")
## state startdate enddate pollster grade samplesize population
## 1 U.S. 2016-10-20 2016-10-24 GfK Group B+ 1212 lv
## rawpoll_clinton rawpoll_trump rawpoll_johnson rawpoll_mcmullin
## 1 51 37 6 NA
## adjpoll_clinton adjpoll_trump adjpoll_johnson adjpoll_mcmullin
## 1 50.28058 39.98632 4.733277 NA
3. Which polls (i) did not have missing poll data for Johnson, (ii) had a combined raw poll vote share for Trump and Clinton greater than 95% and (iii) were done in the state of Ohio? (Hint: it might be practical to first create a variable containing the combined raw poll vote share for Trump and Clinton and then filter.)
polls_us_election_2016 %>%
mutate(rawpoll_clintontrump = rawpoll_clinton + rawpoll_trump) %>%
filter(!is.na(rawpoll_johnson) & rawpoll_clintontrump > 95 & state == "Ohio")
## [1] state startdate enddate
## [4] pollster grade samplesize
## [7] population rawpoll_clinton rawpoll_trump
## [10] rawpoll_johnson rawpoll_mcmullin adjpoll_clinton
## [13] adjpoll_trump adjpoll_johnson adjpoll_mcmullin
## [16] rawpoll_clintontrump
## <0 rows> (or 0-length row.names)
4. Which state had the highest average Trump vote share for polls
which had at least a sample size of 2,000? (Hint: you’ll have to use
filter
, group_by
, summarise
and
arrange
. To obtain ranking in descending order check
arrange
’s help page.)
polls_us_election_2016 %>%
filter(samplesize >= 2000) %>%
group_by(state) %>%
summarise(mean_trump = mean(rawpoll_trump)) %>%
arrange(desc(mean_trump))
## # A tibble: 26 × 2
## state mean_trump
## <fct> <dbl>
## 1 Alabama 62.5
## 2 Missouri 48.5
## 3 Indiana 47
## 4 Texas 46.0
## 5 South Carolina 45.5
## 6 Georgia 45.3
## 7 Kansas 44
## 8 New Mexico 44
## 9 Florida 44.0
## 10 Ohio 43.9
## # ℹ 16 more rows
Load the data by running the following code:
library(dslabs)
data(gapminder, package = "dslabs")
1. Compute the average population per continent per year,
mean_pop
, and assign the output to a new object
gapminder_mean
. (Hint: you should have one observation
(row) per continent for each year. You’ll have to use
group_by
and summarise
.)
gapminder_mean <- gapminder %>%
group_by(continent, year) %>%
summarise(mean_pop = mean(population))
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
Using the gapminder
data, create the following plots
using ggplot2
. Don’t forget to label the axes.
1. A histogram of life expectancy in 2015. (Hint: do you need to
specify a y
in aes()
for a histogram?)
Once you’ve created the histogram, within the appropriate
geom_*
set: binwidth
to 5,
boundary
to 45, colour
to “white” and
fill
to “#d90502”. What does each of these options do?
Optional: Using the previous graph, facet it by continent such
that each continent’s plot is a new row. (Hint: check the help for
facet_grid
.)
The basic histogram:
gapminder %>%
filter(year == 2015) %>%
ggplot() +
aes(x = life_expectancy) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The fancy histogram (with axis labels):
life_exp_hist <- gapminder %>%
filter(year == 2015) %>%
ggplot() +
aes(x = life_expectancy) +
geom_histogram(binwidth = 5,
boundary = 45,
colour = "white",
fill = "#d90502") +
labs(x = "Life expectancy",
y = "Frequency")
life_exp_hist
The faceted fancy histogram:
life_exp_hist +
facet_grid(rows = vars(continent))
2. A boxplot of average life expectancy per year by continent. Within
the appropriate geom_*
set: colour
to “black”
and fill
to “#d90502”. (Hint: you need to group by both
continent
and year
.)
gapminder %>%
group_by(continent, year) %>%
summarise(mean_life_exp = mean(life_expectancy)) %>%
ggplot() +
aes(x = continent, y = mean_life_exp) +
geom_boxplot(colour = "black",
fill = "#d90502") +
labs(x = "Continent",
y = "Life expectancy")
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
3. A scatter plot of fertility rate (y-axis) with respect to infant
mortality (x-axis) in 2015. Once you’ve created the scatter plot, within
the appropriate geom_*
set: size
to 3,
alpha
to 0.5, colour
to “#d90502”.
The basic scatter plot:
gapminder %>%
filter(year == 2015) %>%
ggplot() +
aes(x = infant_mortality, y = fertility) +
geom_point()
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).
The fancy scatter plot with axis labels:
gapminder %>%
filter(year == 2015) %>%
ggplot() +
aes(x = infant_mortality, y = fertility) +
geom_point(size = 3,
alpha = 0.5,
colour = "#d90502") +
labs(x = "Infant mortality", y = "Fertility")
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).
1. Compute the mean of GDP in 2011 and assign to object
mean
. You should exclude missing values. (Hint: read
the help for mean
to remove NA
s).
mean <- gapminder %>%
filter(year == 2011) %>%
summarise(mean(gdp, na.rm = T))
mean
## mean(gdp, na.rm = T)
## 1 246954895975
2. Compute the median of GDP in 2011 and assign to object
median
. Again, you should exclude missing values. Is it
greater or smaller than the average?
median <- gapminder %>%
filter(year == 2011) %>%
summarise(median(gdp, na.rm = T))
median
## median(gdp, na.rm = T)
## 1 16031265698
The median is much smaller than the average.
3. Create a density plot of GDP in 2011 using
geom_density
. A density plot is a way of representing the
distribution of a numeric variable. Add the following code to your plot
to show the median and mean as vertical lines. What do you observe?
geom_vline(xintercept = as.numeric(mean), colour = "red") + <br> geom_vline(xintercept = as.numeric(median), colour = "orange")
gdp_density <- gapminder %>%
filter(year == 2011) %>%
ggplot() +
aes(x = gdp) +
geom_density() +
geom_vline(xintercept = as.numeric(mean), colour = "red") +
geom_vline(xintercept = as.numeric(median), colour = "orange")
gdp_density
## Warning: Removed 17 rows containing non-finite outside the scale range
## (`stat_density()`).
The distribution of GDP is highly skewed: there are many countries with small GDPs and very few with huge GDPs (U.S., Japan, China). In such cases, the average will be (significantly) greater than the median. To see this more clearly, here’s a graph where I’ve transformed the x-axis such that each tick is 10 times larger than the previous one (the scale is therefore not linear, i.e. the first tick is 100,000, the second is 1 million, the third is 10 million, etc.).
gdp_density +
scale_x_log10()
## Warning: Removed 17 rows containing non-finite outside the scale range
## (`stat_density()`).
4. Compute the correlation between fertility and infant mortality in
2015. To drop NA
s in either variable set the argument
use
to “pairwise.complete.obs” in your cor()
function. Is this correlation consistent with the graph you produced in
Task 3?
gapminder %>%
filter(year == 2015) %>%
summarise(cor(fertility, infant_mortality, use = "pairwise.complete.obs"))
## cor(fertility, infant_mortality, use = "pairwise.complete.obs")
## 1 0.8286402
This correlation is positive and strong (relatively close to 1) which is consistent with the graph produced in Task 3. Indeed, that graph displayed a positive relationship between these two variables and the points were not that dispersed.