According to a 2014 NYTimes article, "data scientists [...] spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."
gapminder dataset: Overview
Let's first load a dataset with these commands:
library(dslabs)
gapminder <- dslabs::gapminder
Here are the first 3 rows
head(gapminder, n = 3)
##   country year infant_mortality life_expectancy fertility population
## 1 Albania 1960            115.4           62.87      6.19    1636054
## 2 Algeria 1960            148.2           47.50      7.65   11124892
## 3  Angola 1960            208.0           35.98      7.32    5270844
##           gdp continent          region
## 1          NA    Europe Southern Europe
## 2 13828152297    Africa Northern Africa
## 3          NA    Africa   Middle Africa
gapminder dataset: Overview
What variables does this dataset contain?
names(gapminder)
## [1] "country"          "year"             "infant_mortality" "life_expectancy"
## [5] "fertility"        "population"       "gdp"              "continent"
## [9] "region"
tail(gapminder) gives you the last (6) rows.
gapminder dataset: Datatypes
It's important to know how the data is stored.
str(gapminder)
## 'data.frame':    10545 obs. of  9 variables:
##  $ country         : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
##  $ infant_mortality: num  115.4 148.2 208 NA 59.9 ...
##  $ life_expectancy : num  62.9 47.5 36 63 65.4 ...
##  $ fertility       : num  6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
##  $ population      : num  1636054 11124892 5270844 54681 20619075 ...
##  $ gdp             : num  NA 1.38e+10 NA NA 1.08e+11 ...
##  $ continent       : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
##  $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...
Create a new variable called gdppercap corresponding to gdp divided by population.
Which countries had a 2011 GDP per capita greater than 30,000?
Filter the dataset to keep only the year 2015, and store the result as gapminder_2015.
How many countries have an infant mortality in 2015 greater than 90 (per 1000)?
What is the average life expectancy in Africa in 2015?
In general, we can compute summary statistics, or visualize the data with plots.
Let's start with some statistics first!
mean(x): the average of all values in x
\bar{x} = \frac{1}{N} \sum_{i=1}^N x_i
x <- c(1,2,2,2,2,100)
mean(x)
## [1] 18.16667
mean(x) == sum(x) / length(x)
## [1] TRUE
Your turn: What's the mean of infant_mortality in 1960? Read the help for mean to see how to remove NAs.
median(x): the value x_j below and above which 50% of the values in x lie. m is the median if P(X \leq m) \geq 0.5 and P(X \geq m) \geq 0.5.
The median is robust against outliers. 🤔 Outliers? (More on those later.)
median(x)
## [1] 2
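To make the robustness claim concrete, here is a small sketch that reuses the vector x from the mean example and drops its largest value. The name x_trimmed is only an illustrative choice, not part of the original slides.

```r
# Same toy vector as in the mean example
x <- c(1, 2, 2, 2, 2, 100)

# Drop the extreme value (x_trimmed is a hypothetical helper name)
x_trimmed <- x[x < 100]

mean(x)            # 18.16667: pulled up by the single value 100
mean(x_trimmed)    # 1.8
median(x)          # 2
median(x_trimmed)  # 2: unchanged when the extreme value is removed
```

The mean moves by an order of magnitude, while the median does not move at all.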
Your turn: What's the median of infant_mortality in 1960?
What is confusing about missing values (NA) is that
NA == NA
## [1] NA
This is easy to illustrate:
# Let x be Mary's age. We don't know how old she is.
x <- NA
# Let y be John's age. We don't know how old he is.
y <- NA
# Are John and Mary the same age?
x == y
## [1] NA
# We don't know!
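Since == cannot be used to find missing values, base R provides is.na() for that purpose. A minimal sketch, assuming a made-up vector of ages (the values are hypothetical and only serve the illustration):

```r
# Hypothetical ages; the second one is unknown
ages <- c(31, NA, 25)

ages == NA        # NA NA NA: comparing with NA never tells us anything
is.na(ages)       # FALSE TRUE FALSE: this is the right way to test for NA
sum(is.na(ages))  # 1: number of missing values
```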
Another interesting feature is how much a variable is spread out around its center (the mean in this case).
The variance is such a measure. Var(X) = \frac{1}{N} \sum_{i=1}^N(x_i-\bar{x})^2
Consider two normal distributions with equal mean at 0.
var(x)
range(x) # returns c(min, max); the range itself is max value - min value, i.e. diff(range(x))
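One detail worth checking by hand: the formula above divides by N, whereas R's var() divides by N - 1 (the sample variance). The sketch below compares the two on the toy vector from the mean example; pop_var and samp_var are names chosen only for this illustration.

```r
x <- c(1, 2, 2, 2, 2, 100)
n <- length(x)

# Variance as in the formula above (divide by N)
pop_var <- sum((x - mean(x))^2) / n

# R's built-in var() divides by N - 1 instead
samp_var <- var(x)

# The two differ only by the factor N / (N - 1)
all.equal(samp_var, pop_var * n / (n - 1))  # TRUE

# range() returns c(min, max); diff() turns that into max - min
diff(range(x))  # 99
```

For large N the difference between the two versions is negligible.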
Men weigh 65 kg and women 55 kg on average. The variance is 25 in both groups.
# Generate a random dataset
set.seed(1234) # `set.seed` allows replicating random data
df <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = round(c(rnorm(200, mean = 55, sd = 5), # ?rnorm | Note: sd = sqrt(variance)
                   rnorm(200, mean = 65, sd = 5))))
Plot the overall density
library(ggplot2)
ggplot(df, aes(x=weight)) +
  geom_density() +
  theme_bw() # white background
Plot separated densities
# Change density plot line colors by groups
ggplot(df, aes(x=weight, color=sex)) +
  geom_density() +
  theme_bw() # white background
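The two bumps in the separated densities can also be checked numerically. The sketch below regenerates the same simulated data and computes the mean and variance within each group using base R's tapply(); the names group_means and group_vars are illustrative only. The simulated values should land near 55 and 65 for the means and near 25 for the variances, but not exactly on them, since the data are random draws.

```r
# Regenerate the simulated weights (same seed and parameters as above)
set.seed(1234)
df <- data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = round(c(rnorm(200, mean = 55, sd = 5),
                   rnorm(200, mean = 65, sd = 5))))

# Mean and variance of weight within each sex
group_means <- tapply(df$weight, df$sex, mean)
group_vars  <- tapply(df$weight, df$sex, var)

group_means
group_vars

# Variance of the pooled data: larger, because it mixes two group means
var(df$weight)
```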
[1]: This example is taken from
table(x) is a useful function that counts the occurrences of each unique value in x
table(gapminder$continent)
## 
##   Africa Americas     Asia   Europe  Oceania 
##     2907     2052     2679     2223      684
table(gapminder$region)
## 
## Australia and New Zealand                 Caribbean           Central America 
##                       114                       741                       456 
##              Central Asia            Eastern Africa              Eastern Asia 
##                       285                       912                       342 
##            Eastern Europe                 Melanesia                Micronesia 
##                       570                       285                       114 
##             Middle Africa           Northern Africa          Northern America 
##                       456                       342                       171 
##           Northern Europe                 Polynesia             South America 
##                       570                       171                       684 
##        South-Eastern Asia           Southern Africa             Southern Asia 
##                       570                       285                       456 
##           Southern Europe            Western Africa              Western Asia 
##                       684                       912                      1026 
##            Western Europe 
##                       399
Given two vectors, table produces a contingency table:
gapminder_2015 <- subset(gapminder, year == 2015)
# dummy variable for fertility rate above replacement level fertility
gapminder_2015$fertility_above_2 = (gapminder_2015$fertility > 2.1)
table(gapminder_2015$fertility_above_2, gapminder_2015$continent)
##        
##         Africa Americas Asia Europe Oceania
##   FALSE      2       15   20     39       4
##   TRUE      49       20   27      0       8
With prop.table, we can get proportions:
# proportions by row
prop.table(table(gapminder_2015$fertility_above_2, gapminder_2015$continent), margin = 1)
# proportions by column
prop.table(table(gapminder_2015$fertility_above_2, gapminder_2015$continent), margin = 2)
⚠️ To obtain tables that also count NAs, use the useNA = "always" or useNA = "ifany" argument.
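A short sketch of what the useNA argument changes, using a made-up vector (answers is hypothetical toy data, not part of gapminder):

```r
# Hypothetical toy vector with one missing value
answers <- c("A", "B", NA, "A")

# By default, table() silently drops the NA
default_counts <- table(answers)
default_counts

# useNA = "ifany" adds an NA column only when NAs are present
na_counts <- table(answers, useNA = "ifany")
na_counts

# Proportions are computed over whatever the table contains
prop.table(default_counts)  # shares among non-missing values only
prop.table(na_counts)       # shares including the NA category
```

With useNA = "always", the NA column is shown even when its count is zero.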
R base plotting is fairly good.
There is an extremely powerful alternative in package ggplot2. We'll see both.
First example: histograms. A histogram counts how many observations fall within a certain bin.
gapminder_2015 <- gapminder[gapminder$year == 2015,]
hist(gapminder_2015$life_expectancy)
We can give additional arguments to hist. Look at ?hist for more.
hist(gapminder_2015$life_expectancy,
     xlab = "Life Expectancy",
     main = "Histogram of life expectancy in 2015",
     breaks = seq(from = 40, to = 90, by = 5),
     las = 1, # label orientation
     col = "#d90502",
     border = "white")
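To see the counting that a histogram does behind the scenes, the sketch below bins a handful of made-up values by hand with cut() and table(), then compares with the counts that hist() computes. The vector values and the 10-year bins are assumptions chosen only for the illustration.

```r
# Hypothetical values, not taken from gapminder
values <- c(41, 52, 58, 63, 67, 71, 74, 79, 83)
breaks <- seq(from = 40, to = 90, by = 10)

# Assign each value to a bin, then count values per bin
bins <- cut(values, breaks = breaks)
manual_counts <- table(bins)
manual_counts

# hist() performs the same counting; plot = FALSE skips the drawing
h <- hist(values, breaks = breaks, plot = FALSE)
h$counts  # 1 2 2 3 1
```

Changing breaks changes the counts, which is why the same data can look quite different under different bin widths.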
An outlier is a datapoint far removed from the center of a distribution.
Boxplots are an effective way to visualise the distribution of a variable.
The box typically denotes the interquartile range (observations between the 25th and 75th percentiles).
The thick line corresponds to the median.
The dots are outliers (⚠️ no universally accepted definition).
boxplot(life_expectancy ~ continent,
        data = gapminder_2015,
        xlab = "Continent",
        ylab = "Life expectancy in 2015",
        main = "Life expectancy by continent in 2015",
        pch = 20, cex = 2, # symbol and size of outliers
        col = "#d90502",
        border = "black",
        las = 1)
See ?boxplot for more options.
Your turn: What does the same plot look like for fertility? For infant_mortality?
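The rule that R's boxplot uses by default to flag dots can be inspected with boxplot.stats(): points lying more than 1.5 times the box length beyond the box are returned as outliers. This is one convention among several. A minimal sketch on made-up numbers (the vector life is hypothetical, not gapminder data):

```r
# Hypothetical life expectancies, with one very low value
life <- c(20, 50, 60, 62, 65, 70, 72, 75, 78, 80)

stats <- boxplot.stats(life)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
stats$stats  # 50.0 60.0 67.5 75.0 80.0

# Points drawn as individual dots
stats$out    # 20
```

Here the box runs from 60 to 75, so anything below 60 - 1.5 * 15 = 37.5 or above 97.5 is flagged, which singles out the value 20.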
Two variables x and y
Natural to ask: How often do certain pairs of (x_i,y_i) occur?
##       fertility infant_mortality
## 10176      1.78             12.5
## 10177      2.71             21.9
## 10178      5.65             96.0
## 10179      2.06              5.8
## 10180      2.15             11.1
## 10181      1.41             12.6
plot(fertility ~ infant_mortality,
     data = gapminder_2015,
     xlab = "Infant mortality",
     ylab = "Fertility rate",
     xlim = c(0,200), # range of x axis
     ylim = c(0,8), # range of y axis
     main = "Relationship between fertility and infant mortality in 2015",
     col = "#d90502",
     las = 1)
Each dot is one pair (x_i,y_i).
We often call it one observation.
Corresponding to one row of the data.frame
Why do some dots appear darker than others here?
Your turn: What does the same scatter plot look like for 1960?
Great intro to ggplot2
Based on The Grammar of Graphics (hence ggplot).
More powerful than base R
Let's reproduce the previous graphs in ggplot
ggplot2: Basic Histogram
library(ggplot2)
ggplot(gapminder_2015, aes(x = life_expectancy)) +
  geom_histogram()
ggplot2: Fancy Histogram
library(ggplot2)
ggplot(gapminder_2015, aes(x = life_expectancy)) +
  geom_histogram(binwidth = 5,
                 boundary = 0, # try with 1: what does it do?
                 colour = "white",
                 fill = "#d90502") +
  labs(x = "Life Expectancy",
       y = "Frequency",
       title = "Histogram of life expectancy in 2015") +
  theme_bw(base_size = 16) # white background + base font size adjusted to 16 pts
ggplot2: Fancy Histogram with facet_grid()
library(ggplot2)
ggplot(gapminder_2015, aes(x = life_expectancy)) +
  geom_histogram(binwidth = 5,
                 boundary = 0,
                 colour = "white",
                 fill = "#d90502") +
  labs(x = "Life Expectancy",
       y = "Frequency",
       title = "Histogram of life expectancy in 2015") +
  theme_bw(base_size = 16) +
  facet_grid(rows = vars(continent))
ggplot2: Boxplots
ggplot(gapminder_2015, aes(x = continent, y = life_expectancy)) +
  geom_boxplot(colour = "black", fill = "#d90502") +
  labs(x = "Continent",
       y = "Life expectancy in 2015",
       title = "Life expectancy by continent in 2015") +
  theme_bw(base_size = 20)
ggplot2: Scatter Plots
ggplot(gapminder_2015, aes(x = infant_mortality, y = fertility)) +
  geom_point(size = 3, alpha = 0.5, colour = "#d90502") +
  expand_limits(x = 0, y = 0) +
  labs(x = "Infant mortality",
       y = "Fertility rate",
       title = "Relationship between fertility and infant mortality in 2015") +
  theme_bw(base_size = 16)
Time for our first tutorial!!
Type this into your RStudio
If you have trouble with the interactive doc, try this version (no interactive content):
## [1] 24.21146
## [1] 0.8286402
data.table is an alternative. Very fast but a bit more difficult.
Both have pros and cons. We'll start you off with dplyr
You must read through Hadley Wickham's chapter. It's concise.
The package is organized around a set of verbs, i.e. actions to be taken.
We operate on data.frames or tibbles (nicer looking data.frames).
All verbs: First argument is a data.frame, subsequent arguments describe what to do, returns another data.frame.
filter(): Choose observations based on a certain value (i.e. subset)
arrange(): Reorder rows
select(): Select variables by name
mutate(): Create new variables out of existing ones
group_by() and summarise(): Summarise variables
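As a quick orientation, the sketch below applies each verb once to a tiny made-up data.frame. The data and all object names (toy, kept, ordered, and so on) are hypothetical and only meant to show the common pattern: data.frame in, data.frame out.

```r
library(dplyr)

# Hypothetical toy data, not part of the election polls dataset
toy <- data.frame(name = c("a", "b", "c"), value = c(3, 1, 2))

kept     <- filter(toy, value > 1)                   # rows with value above 1
ordered  <- arrange(toy, value)                      # rows sorted by value
names_df <- select(toy, name)                        # only the name column
doubled  <- mutate(toy, double = value * 2)          # adds a new column
overall  <- summarise(toy, mean_value = mean(value)) # collapses to one row

kept
overall
```

None of these calls modifies toy itself; each returns a new data.frame that has to be assigned if it is to be kept.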
package
library(dslabs)
library(dplyr) # library() loads one package per call
data(polls_us_election_2016) # this data is from fivethirtyeight.com
polls_us_election_2016 <- dplyr::as_tibble(polls_us_election_2016)
head(polls_us_election_2016[,1:6], n = 6) # show first 6 lines of first 6 variables
## # A tibble: 6 x 6
##   state startdate  enddate    pollster                          grade samplesize
##   <fct> <date>     <date>     <fct>                             <fct>      <int>
## 1 U.S.  2016-11-03 2016-11-06 ABC News/Washington Post          A+          2220
## 2 U.S.  2016-11-01 2016-11-07 Google Consumer Surveys           B          26574
## 3 U.S.  2016-11-02 2016-11-06 Ipsos                             A-          2195
## 4 U.S.  2016-11-04 2016-11-07 YouGov                            B           3677
## 5 U.S.  2016-11-03 2016-11-06 Gravis Marketing                  B-         16639
## 6 U.S.  2016-11-03 2016-11-06 Fox News/Anderson Robbins Resear… A           1295
🚨 This is a tibble (more informative than a data.frame).
What variables does this dataset contain?
filter(): subset a data.frame
filter has the same purpose as subset.
Example: Which A-graded poll with more than 2,000 people had Trump win more than 45% of the vote?
filter(polls_us_election_2016, grade == "A" & samplesize > 2000 & rawpoll_trump > 45)
## # A tibble: 1 x 15
##   state startdate  enddate    pollster grade samplesize population
##   <fct> <date>     <date>     <fct>    <fct>      <int> <chr>
## 1 Indi… 2016-04-26 2016-04-28 Marist … A           2149 rv
## # … with 8 more variables: rawpoll_clinton <dbl>, rawpoll_trump <dbl>,
## #   rawpoll_johnson <dbl>, rawpoll_mcmullin <dbl>, adjpoll_clinton <dbl>,
## #   adjpoll_trump <dbl>, adjpoll_johnson <dbl>, adjpoll_mcmullin <dbl>
We have a standard suite of comparison operators:
>: greater than,
<: smaller than,
>=: greater than or equal to,
<=: smaller than or equal to,
!=: not equal to,
==: equal to.
Construct more complex filters with logical operators:
x & y: x and y
x | y: x or y
!y: not y
R has the convenient x %in% y operator (conversely !(x %in% y)), which returns TRUE if x is a member of y.
3 %in% 1:3
## [1] TRUE
c(2,5) %in% 2:10 # also vectorized
## [1] TRUE TRUE
c("S","Po") %in% c("Sciences","Po") # also strings
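The comparison and logical operators all work element by element on vectors, which is what makes them usable inside filter(). A small base R sketch on a made-up vector (sizes is hypothetical):

```r
# Hypothetical sample sizes
sizes <- c(1, 5, 10)

sizes > 2               # FALSE  TRUE  TRUE
sizes > 2 & sizes < 8   # FALSE  TRUE FALSE: both conditions must hold
sizes < 2 | sizes > 8   #  TRUE FALSE  TRUE: at least one condition holds
!(sizes %in% c(1, 10))  # FALSE  TRUE FALSE: not a member of the set

# %in% matches whole strings exactly, not parts of them
c("S", "Po") %in% c("Sciences", "Po")  # FALSE  TRUE
```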
mutate(): create new variables
mutate(polls_us_election_2016,
       trump_clinton_tot = rawpoll_trump + rawpoll_clinton,
       trump_raw_adj_diff = rawpoll_trump - adjpoll_trump)
select(): only keep some variables
Only keep the variables state, startdate, enddate, pollster, rawpoll_clinton, rawpoll_trump:
select(polls_us_election_2016, state, startdate, enddate, pollster, rawpoll_clinton, rawpoll_trump)
Which polls had more vote intentions for Trump than for Clinton?
How many polls have a missing grade?
Which polls were (i) polled by American Strategies, GfK Group or Merrill Poll, and (ii) had a sample size greater than 1,000, and (iii) started on October 20th, 2016?
For the following questions you should use filter and mutate.
Which polls (i) did not have missing poll data for Johnson, (ii) had a combined raw poll vote share for Trump and Clinton greater than 95% and (iii) had a sample size greater than 1,000?
Which polls (i) did not poll for vote intentions for Johnson, (ii) had a difference in raw poll vote shares between Trump and Clinton greater than 5, and (iii) were done in the state of Iowa?
Often we do some operation by some group in our dataset:
For this, we need to split the data into groups, apply an operation to each chunk, and re-combine the results.
In dplyr, this is achieved with group_by() and summarise().
group_by(polls_us_election_2016, grade) groups/splits polls_us_election_2016 by pollster grade:
polls_grade = group_by(polls_us_election_2016, grade)
summarise each chunk and re-combine:
summarise(polls_grade, mean_vote_clinton = mean(rawpoll_clinton))
## # A tibble: 11 x 2
##    grade mean_vote_clinton
##    <fct>             <dbl>
##  1 D                  46.7
##  2 C-                 43.2
##  3 C                  41.8
##  4 C+                 44.2
##  5 B-                 43.9
##  6 B                  37.3
##  7 B+                 44.1
##  8 A-                 43.0
##  9 A                  45.3
## 10 A+                 45.8
## 11 NA                 43.2
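The same split, summarise and re-combine steps are easier to follow on a dataset small enough to check by hand. The sketch below uses three made-up polls; toy_polls, by_grade and result are hypothetical names, and the numbers are not real poll data.

```r
library(dplyr)

# Hypothetical polls: two graded "A", one graded "B"
toy_polls <- data.frame(grade = c("A", "A", "B"),
                        share = c(40, 50, 45))

# Split: mark the rows as belonging to groups
by_grade <- group_by(toy_polls, grade)

# Summarise each group and re-combine: one row per grade
result <- summarise(by_grade,
                    mean_share = mean(share), # (40 + 50) / 2 = 45 for "A"
                    n_polls = n())            # number of rows in each group
result
```

group_by() on its own does not change the data; it only records the grouping that summarise() then uses.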
The magrittr package gives us the pipe %>%.
x %>% f(y) becomes f(x,y)
With the pipe you construct data pipelines.
Our above example would become:
polls_us_election_2016 %>%
  group_by(grade) %>%
  summarise(mean_vote_clinton = mean(rawpoll_clinton))
which is equivalent to, but nicer than:
summarise(
  group_by(polls_us_election_2016, grade),
  mean_vote_clinton = mean(rawpoll_clinton))
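The rewriting rule of the pipe can be verified on plain numbers, without any data.frame. A minimal sketch, assuming the magrittr package is installed (dplyr also makes %>% available when loaded); piped, nested and total are names chosen for the illustration:

```r
library(magrittr)

# x %>% f(y) is rewritten as f(x, y)
piped  <- 16 %>% log(base = 2)
nested <- log(16, base = 2)
all.equal(piped, nested)  # TRUE

# Chained pipes read left to right, nested calls inside out
total <- c(1, 4, 9) %>% sqrt() %>% sum()
total  # 6, the same as sum(sqrt(c(1, 4, 9)))
```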
Works for all dplyr verbs:
polls_us_election_2016 %>%
  mutate(trump_clinton_diff = rawpoll_trump - rawpoll_clinton) %>%
  filter(trump_clinton_diff > 5 & state == "Iowa" & is.na(rawpoll_johnson)) %>% # no poll data for Johnson
  select(pollster)
## # A tibble: 3 x 1
##   pollster
##   <fct>
## 1 Ipsos
## 2 Ipsos
## 3 Ipsos
