This workshop demonstrates basic exploration of COVID-19 data, with the focus on computing and graphing data on relative timelines, in which the “day zero” is unique for each country.
After this workshop participants should be able to:
ggplot2
packageplotly
packageIn this tutorial we will have three visualisation goals. First, we would like to have a tool for plotting multiple trajectories onto the same canvas and explore the differences between trajectories plotted on the same scale. Second, we would like to create a unique temporal context for each trajectory and express it with respect meaningful anchors, such as first detected case or first confirmed death. Thirdly, we would like to better understand the sequence of these temporal anchors to have a better understanding of how the pandemic unfolded.
Here are the graphs we will be creating in this tutorial:
Create a line graph of many trajectories with an interactive tooltip.
Create a trajectory graph depicting how the epidemic unfolded in each contry relative to its key epidemiological events.
Create a graph depicting the sequence of key epidemiological events across countries.
Non-technical readers are welcome to skip this section.
library(magrittr) #Pipes
library(ggplot2) #For graphing
library(dplyr) # for shorter function names. but still prefer dplyr:: stems
library(lubridate) # for working with dates
library(plotly) # interactive graphs
library(crosstalk)
config <- config::get() # common definitions stored in `./config.yml`
# to be applied to every graph we will make
ggplot2::theme_set(
ggplot2::theme_bw(
)+
theme(
strip.background = element_rect(fill="grey90", color = NA)
)
)
# important dates we will refer to in analysis
date_of_exodus <- lubridate::date("2020-01-13") # first case outside of China
date_of_pandemic <- lubridate::date("2020-03-11") # WHO declares pandemic
# to help us focus on a manageable set of countries for purposes of demonstration
oecd_countries <- c(
"AUS", "AUT", "BEL", "CAN", "CZE", "DNK", "EST", "FIN", "FRA",
"DEU", "GRC", "HUN", "ISL", "IRL", "ISR", "ITA", "JPN", "KOR",
"LVA", "LTU", "MEX", "NLD", "NZL", "NOR", "POL", "PRT", "SVK",
"SVN", "ESP", "SWE", "CHE", "TUR", "GBR", "USA", "RUS", "ZAF"
)
focus_countries <- c("CAN","USA","ITA","TUR", "NLD","CHE")
# adds neat styling to your knitr table
neat <- function(x, output_format = "html"){
# knitr.table.format = output_format
if(output_format == "pandoc"){
x_t <- knitr::kable(x, format = "pandoc")
}else{
x_t <- x %>%
# x %>%
# knitr::kable() %>%
knitr::kable(format=output_format) %>%
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover", "condensed","responsive"),
# bootstrap_options = c( "condensed"),
full_width = F,
position = "left"
)
}
return(x_t)
}
The data comes from European Centre for Disease Prevention and Control, with the source available from here. I demonstrate the preparation of this data for analysis in ./manipulation/ellis-covid.R script of this repository.
# covid data
ds_covid <- readr::read_csv(config$path_input_covid)
ds_covid %>% glimpse()
Rows: 31,775
Columns: 8
$ date <date> 2019-12-31, 2020-01-01, 2020-01-02, 2020-01-03, 2020-01-04, 2020-01-05, 2020-01-06, 2020...
$ n_cases <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ n_deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ country_code <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG"...
$ n_population_2018 <dbl> 37172386, 37172386, 37172386, 37172386, 37172386, 37172386, 37172386, 37172386, 37172386,...
$ country_code2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF",...
$ country_label <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan",...
$ continent_label <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "...
# note that only `date`, `n_cases`, and `n_deatsh` change with time
# other variables have the same value within each country
A few things to notice about this data set:
Only date
, n_cases
, and n_deaths
change over time. All other variables have a single unique value for each country.
For some countries, the observations are missing for certain dates, for example:
ds_covid %>% filter(country_code == "FIN") %>% filter(date > as_date("2020-02-26"))
# A tibble: 97 x 8
date n_cases n_deaths country_code n_population_2018 country_code2 country_label continent_label
<date> <dbl> <dbl> <chr> <dbl> <chr> <chr> <chr>
1 2020-02-27 1 0 FIN 5518050 FI Finland Europe
2 2020-02-28 0 0 FIN 5518050 FI Finland Europe
3 2020-02-29 1 0 FIN 5518050 FI Finland Europe
4 2020-03-01 0 0 FIN 5518050 FI Finland Europe
5 2020-03-02 3 0 FIN 5518050 FI Finland Europe
6 2020-03-03 NA NA FIN 5518050 FI Finland Europe
7 2020-03-04 1 0 FIN 5518050 FI Finland Europe
8 2020-03-05 NA NA FIN 5518050 FI Finland Europe
9 2020-03-06 5 0 FIN 5518050 FI Finland Europe
10 2020-03-07 7 0 FIN 5518050 FI Finland Europe
# ... with 87 more rows
The data preparation step ensured that each country has the same number of rows, creating these missing cells. This is important for two reasons: A) we want to be able to differentiate between the absence of cases and the absence of reporting and B) missing dates will complicate comutation of relative timelines.
# to help us focus on a smaller subset of countries
oecd_countries <- c(
"AUS", "AUT", "BEL", "CAN", "CZE", "DNK", "EST", "FIN", "FRA",
"DEU", "GRC", "HUN", "ISL", "IRL", "ISR", "ITA", "JPN", "KOR",
"LVA", "LTU", "MEX", "NLD", "NZL", "NOR", "POL", "PRT", "SVK",
"SVN", "ESP", "SWE", "CHE", "TUR", "GBR", "USA", "RUS", "ZAF"
)
focus_countries <- c("CAN","USA","ITA","TUR", "NLD","CHE")
# to have a handy filter
ds_covid <- ds_covid %>%
mutate(
oecd = country_code %in% oecd_countries
,focus = country_code %in% focus_countries
)
We focus on 36 country members of the Organization for Economic Co-operation and Development (OECD) because they report more nuanced data on their economic and social development, so we can have a richer pool of explanatory variables (see http://stats.oecd.org/ )
plotly is a JavaScript library used for creating interactive visualizations and dashboards. Its implementation in python
and R
, the latter offered via plotly
package. Its syntax is very similar to ggplot2
, but nevertheless sufficiently different to be able to distract you into a (potentially lenghty) side quest. Luckily, there is a magic function plotly::ggplotly
that transforms a graph built in ggplot2
into an interactive plotly
object.
To demonstrate, let us first build a basic graph with multiple timeseries:
g1 <-
ds_covid %>%
filter(focus) %>%
ggplot(aes(
x = date
,y = n_deaths
,group = country_label
))+
geom_line()
g1
And now we enhanse this plot with highlight component using the plotly::highlight()
function and some additional options:
g1 <-
ds_covid %>%
filter(focus) %>%
plotly::highlight_key(~ country_label) %>% # add BEFORE bulding the ggplot!
ggplot(aes(
x = date
,y = n_deaths
,group = country_label
))+
geom_line()
# g1 # it would still print as a regular static ggplot
g1p <-
plotly::ggplotly(g1) %>% # make into a plotly object
plotly::highlight( # add highlight functionality
on = "plotly_click" # or "plotly_hover"
,dynamic = TRUE # adds color option
,selectize = TRUE # select what to highlight
,defaultValues = "Canada" # highlights in the beginning
) %>%
plotly::layout(margin = list(l = 0, r = 0, b = 100, t = 0, pad = 0))
g1p
For more options and syntax guide, see https://plotly-r.com/client-side-linking.html. There is also a package for rendering ggplot2
into interactive graphs (ggplotly
), which offers less flexibility in display design compared to creating graphs directly with plotly
, but is much simpler to implement.
It is often makes sense to compare the progression of epidemics across countries using a meaningful “time zero”, for example the day of the first confirmed case or the first confirmed death in the country. To create the Goal 2 graph we first need to derive a number of new variables. The following subgoals will help learn the operations used to derive them
Subgoals:
To help us carry out the computation, let us construct a fictional example that we can use to develop the script
# create reproducible example (reprex) to test out your function
d_reprex <- tibble::tribble(
~country_code, ~date, ~n_cases,
"Alabnia", "2020-03-01", NA,
"Alabnia", "2020-03-02", 0,
"Alabnia", "2020-03-03", 1,
"Alabnia", "2020-03-04", 0,
"Alabnia", "2020-03-05", 3,
"Botswana", "2020-04-01", 0,
"Botswana", "2020-04-02", NA,
"Botswana", "2020-04-03", 2,
"Botswana", "2020-04-04", 3,
"Botswana", "2020-04-05", 0,
"Chile", "2020-05-01", 2,
"Chile", "2020-05-02", 0,
"Chile", "2020-05-03", 0,
"Chile", "2020-05-04", 3,
"Chile", "2020-05-05", 1,
) %>%
mutate(
date = lubridate::as_date(date)
)
d_reprex %>% neat()
country_code | date | n_cases |
---|---|---|
Alabnia | 2020-03-01 | NA |
Alabnia | 2020-03-02 | 0 |
Alabnia | 2020-03-03 | 1 |
Alabnia | 2020-03-04 | 0 |
Alabnia | 2020-03-05 | 3 |
Botswana | 2020-04-01 | 0 |
Botswana | 2020-04-02 | NA |
Botswana | 2020-04-03 | 2 |
Botswana | 2020-04-04 | 3 |
Botswana | 2020-04-05 | 0 |
Chile | 2020-05-01 | 2 |
Chile | 2020-05-02 | 0 |
Chile | 2020-05-03 | 0 |
Chile | 2020-05-04 | 3 |
Chile | 2020-05-05 | 1 |
Computing cumulative (running) sum is easily accomplished with cumsum
function paired with group_by
, however, watch out for NA
values: they will break the running sum, resulting in the NA
for the rest of the column. To avoid this, we use convert NA
to 0
on the fly with tidyr::replace_na()
function
d_reprex_timeline <- d_reprex %>%
group_by(country_code) %>%
mutate(
n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
)
d_reprex_timeline %>% neat()
country_code | date | n_cases | n_cases_cum |
---|---|---|---|
Alabnia | 2020-03-01 | NA | 0 |
Alabnia | 2020-03-02 | 0 | 0 |
Alabnia | 2020-03-03 | 1 | 1 |
Alabnia | 2020-03-04 | 0 | 1 |
Alabnia | 2020-03-05 | 3 | 4 |
Botswana | 2020-04-01 | 0 | 0 |
Botswana | 2020-04-02 | NA | 0 |
Botswana | 2020-04-03 | 2 | 2 |
Botswana | 2020-04-04 | 3 | 5 |
Botswana | 2020-04-05 | 0 | 5 |
Chile | 2020-05-01 | 2 | 2 |
Chile | 2020-05-02 | 0 | 2 |
Chile | 2020-05-03 | 0 | 2 |
Chile | 2020-05-04 | 3 | 5 |
Chile | 2020-05-05 | 1 | 6 |
This option is better than converting NA
s to zero during the data preparation step, as this would mask the absence of reporting, overwriting it with a definitive value of 0
cases.
Now we will use a simple logical test to create a logical variable indicating when the running total exceeded the value of the chosen threshold.
d_reprex_timeline <- d_reprex %>%
group_by(country_code) %>%
mutate(
n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
,onset_case = n_cases_cum > 0 # NEW LINE
)
d_reprex_timeline %>% neat()
country_code | date | n_cases | n_cases_cum | onset_case |
---|---|---|---|---|
Alabnia | 2020-03-01 | NA | 0 | FALSE |
Alabnia | 2020-03-02 | 0 | 0 | FALSE |
Alabnia | 2020-03-03 | 1 | 1 | TRUE |
Alabnia | 2020-03-04 | 0 | 1 | TRUE |
Alabnia | 2020-03-05 | 3 | 4 | TRUE |
Botswana | 2020-04-01 | 0 | 0 | FALSE |
Botswana | 2020-04-02 | NA | 0 | FALSE |
Botswana | 2020-04-03 | 2 | 2 | TRUE |
Botswana | 2020-04-04 | 3 | 5 | TRUE |
Botswana | 2020-04-05 | 0 | 5 | TRUE |
Chile | 2020-05-01 | 2 | 2 | TRUE |
Chile | 2020-05-02 | 0 | 2 | TRUE |
Chile | 2020-05-03 | 0 | 2 | TRUE |
Chile | 2020-05-04 | 3 | 5 | TRUE |
Chile | 2020-05-05 | 1 | 6 | TRUE |
It may make sense to use other operationalization the “onset” event.. For example, we can define it as “the date of the 10th case” or “.01% of population infected” or “5th days of registering new cases in a row”.
Now with the column onset_case
marking whether the running total is higher than a threshold, we can identify the first occurrence of TRUE
:
d_reprex_timeline <- d_reprex %>%
group_by(country_code) %>%
mutate(
n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
,onset_case = n_cases_cum > 0
,first_case = cumsum(onset_case) == 1L # NEW LINE
)
d_reprex_timeline %>% neat()
country_code | date | n_cases | n_cases_cum | onset_case | first_case |
---|---|---|---|---|---|
Alabnia | 2020-03-01 | NA | 0 | FALSE | FALSE |
Alabnia | 2020-03-02 | 0 | 0 | FALSE | FALSE |
Alabnia | 2020-03-03 | 1 | 1 | TRUE | TRUE |
Alabnia | 2020-03-04 | 0 | 1 | TRUE | FALSE |
Alabnia | 2020-03-05 | 3 | 4 | TRUE | FALSE |
Botswana | 2020-04-01 | 0 | 0 | FALSE | FALSE |
Botswana | 2020-04-02 | NA | 0 | FALSE | FALSE |
Botswana | 2020-04-03 | 2 | 2 | TRUE | TRUE |
Botswana | 2020-04-04 | 3 | 5 | TRUE | FALSE |
Botswana | 2020-04-05 | 0 | 5 | TRUE | FALSE |
Chile | 2020-05-01 | 2 | 2 | TRUE | TRUE |
Chile | 2020-05-02 | 0 | 2 | TRUE | FALSE |
Chile | 2020-05-03 | 0 | 2 | TRUE | FALSE |
Chile | 2020-05-04 | 3 | 5 | TRUE | FALSE |
Chile | 2020-05-05 | 1 | 6 | TRUE | FALSE |
Notice, that we use the property of logical
class: when used in mathematical expression, FALSE
assumes the value of 0
, while TRUE
is interpreted as 1
.
Now we can use this indicator to extract the date associated with this row:
d_reprex_timeline <- d_reprex %>%
group_by(country_code) %>%
mutate(
n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
,onset_case = n_cases_cum > 0
,first_case = cumsum(onset_case) == 1L
,date_of_1case = ifelse(first_case, date, NA) %>% lubridate::as_date() # NEW LINE
)
d_reprex_timeline %>% neat()
country_code | date | n_cases | n_cases_cum | onset_case | first_case | date_of_1case |
---|---|---|---|---|---|---|
Alabnia | 2020-03-01 | NA | 0 | FALSE | FALSE | NA |
Alabnia | 2020-03-02 | 0 | 0 | FALSE | FALSE | NA |
Alabnia | 2020-03-03 | 1 | 1 | TRUE | TRUE | 2020-03-03 |
Alabnia | 2020-03-04 | 0 | 1 | TRUE | FALSE | NA |
Alabnia | 2020-03-05 | 3 | 4 | TRUE | FALSE | NA |
Botswana | 2020-04-01 | 0 | 0 | FALSE | FALSE | NA |
Botswana | 2020-04-02 | NA | 0 | FALSE | FALSE | NA |
Botswana | 2020-04-03 | 2 | 2 | TRUE | TRUE | 2020-04-03 |
Botswana | 2020-04-04 | 3 | 5 | TRUE | FALSE | NA |
Botswana | 2020-04-05 | 0 | 5 | TRUE | FALSE | NA |
Chile | 2020-05-01 | 2 | 2 | TRUE | TRUE | 2020-05-01 |
Chile | 2020-05-02 | 0 | 2 | TRUE | FALSE | NA |
Chile | 2020-05-03 | 0 | 2 | TRUE | FALSE | NA |
Chile | 2020-05-04 | 3 | 5 | TRUE | FALSE | NA |
Chile | 2020-05-05 | 1 | 6 | TRUE | FALSE | NA |
There will be only one date associated with traspassing the threshold, so we populate the rest of the cells with it
d_reprex_timeline <- d_reprex %>%
group_by(country_code) %>%
mutate(
n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
,onset_case = n_cases_cum > 0
,first_case = cumsum(onset_case) == 1L
,date_of_1case = ifelse(first_case, date, NA) %>% lubridate::as_date()
,date_of_1case = min(date_of_1case, na.rm = T) # NEW LINE
)
d_reprex_timeline %>% neat()
country_code | date | n_cases | n_cases_cum | onset_case | first_case | date_of_1case |
---|---|---|---|---|---|---|
Alabnia | 2020-03-01 | NA | 0 | FALSE | FALSE | 2020-03-03 |
Alabnia | 2020-03-02 | 0 | 0 | FALSE | FALSE | 2020-03-03 |
Alabnia | 2020-03-03 | 1 | 1 | TRUE | TRUE | 2020-03-03 |
Alabnia | 2020-03-04 | 0 | 1 | TRUE | FALSE | 2020-03-03 |
Alabnia | 2020-03-05 | 3 | 4 | TRUE | FALSE | 2020-03-03 |
Botswana | 2020-04-01 | 0 | 0 | FALSE | FALSE | 2020-04-03 |
Botswana | 2020-04-02 | NA | 0 | FALSE | FALSE | 2020-04-03 |
Botswana | 2020-04-03 | 2 | 2 | TRUE | TRUE | 2020-04-03 |
Botswana | 2020-04-04 | 3 | 5 | TRUE | FALSE | 2020-04-03 |
Botswana | 2020-04-05 | 0 | 5 | TRUE | FALSE | 2020-04-03 |
Chile | 2020-05-01 | 2 | 2 | TRUE | TRUE | 2020-05-01 |
Chile | 2020-05-02 | 0 | 2 | TRUE | FALSE | 2020-05-01 |
Chile | 2020-05-03 | 0 | 2 | TRUE | FALSE | 2020-05-01 |
Chile | 2020-05-04 | 3 | 5 | TRUE | FALSE | 2020-05-01 |
Chile | 2020-05-05 | 1 | 6 | TRUE | FALSE | 2020-05-01 |
This allows us for a very straightforward computation of the distance between any given date and the date of the “onset”, in this case the date of the first confirmed case:
d_reprex_timeline <- d_reprex %>%
group_by(country_code) %>%
mutate(
n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
,onset_case = n_cases_cum > 0
,first_case = cumsum(onset_case) == 1L
,date_of_1case = ifelse(first_case, date, NA) %>% lubridate::as_date()
,date_of_1case = min(date_of_1case, na.rm = T)
,days_since_1case = (date - date_of_1case) %>% as.integer() # NEW LINE
)
d_reprex_timeline %>% neat()
country_code | date | n_cases | n_cases_cum | onset_case | first_case | date_of_1case | days_since_1case |
---|---|---|---|---|---|---|---|
Alabnia | 2020-03-01 | NA | 0 | FALSE | FALSE | 2020-03-03 | -2 |
Alabnia | 2020-03-02 | 0 | 0 | FALSE | FALSE | 2020-03-03 | -1 |
Alabnia | 2020-03-03 | 1 | 1 | TRUE | TRUE | 2020-03-03 | 0 |
Alabnia | 2020-03-04 | 0 | 1 | TRUE | FALSE | 2020-03-03 | 1 |
Alabnia | 2020-03-05 | 3 | 4 | TRUE | FALSE | 2020-03-03 | 2 |
Botswana | 2020-04-01 | 0 | 0 | FALSE | FALSE | 2020-04-03 | -2 |
Botswana | 2020-04-02 | NA | 0 | FALSE | FALSE | 2020-04-03 | -1 |
Botswana | 2020-04-03 | 2 | 2 | TRUE | TRUE | 2020-04-03 | 0 |
Botswana | 2020-04-04 | 3 | 5 | TRUE | FALSE | 2020-04-03 | 1 |
Botswana | 2020-04-05 | 0 | 5 | TRUE | FALSE | 2020-04-03 | 2 |
Chile | 2020-05-01 | 2 | 2 | TRUE | TRUE | 2020-05-01 | 0 |
Chile | 2020-05-02 | 0 | 2 | TRUE | FALSE | 2020-05-01 | 1 |
Chile | 2020-05-03 | 0 | 2 | TRUE | FALSE | 2020-05-01 | 2 |
Chile | 2020-05-04 | 3 | 5 | TRUE | FALSE | 2020-05-01 | 3 |
Chile | 2020-05-05 | 1 | 6 | TRUE | FALSE | 2020-05-01 | 4 |
Finally, we can re-express these steps more succinctly, however, it might be advisible to leave these step in comments in case you need to retrace your steps or debug an error down the stream
d_reprex_timeline <- d_reprex %>%
group_by(country_code) %>%
mutate(
n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
#,onset_case = n_cases_cum > 0
# ,first_case = cumsum(onset_case) == 1L
# ,date_of_1case = ifelse(first_case, date, NA) %>% lubridate::as_date()
# ,date_of_1case = min(date_of_1case, na.rm = T)
# alternatively, as a single sentence:
,date_of_1case = ifelse(cumsum(n_cases_cum > 0) == 1L, date, NA) %>% # NEW LINE
min(na.rm=T) %>% # NEW LINE
lubridate::as_date() # NEW LINE
# relative timeline
,days_since_1case = (date - date_of_1case) %>% as.integer()
) %>%
ungroup()
d_reprex_timeline %>% neat()
country_code | date | n_cases | n_cases_cum | date_of_1case | days_since_1case |
---|---|---|---|---|---|
Alabnia | 2020-03-01 | NA | 0 | 2020-03-03 | -2 |
Alabnia | 2020-03-02 | 0 | 0 | 2020-03-03 | -1 |
Alabnia | 2020-03-03 | 1 | 1 | 2020-03-03 | 0 |
Alabnia | 2020-03-04 | 0 | 1 | 2020-03-03 | 1 |
Alabnia | 2020-03-05 | 3 | 4 | 2020-03-03 | 2 |
Botswana | 2020-04-01 | 0 | 0 | 2020-04-03 | -2 |
Botswana | 2020-04-02 | NA | 0 | 2020-04-03 | -1 |
Botswana | 2020-04-03 | 2 | 2 | 2020-04-03 | 0 |
Botswana | 2020-04-04 | 3 | 5 | 2020-04-03 | 1 |
Botswana | 2020-04-05 | 0 | 5 | 2020-04-03 | 2 |
Chile | 2020-05-01 | 2 | 2 | 2020-05-01 | 0 |
Chile | 2020-05-02 | 0 | 2 | 2020-05-01 | 1 |
Chile | 2020-05-03 | 0 | 2 | 2020-05-01 | 2 |
Chile | 2020-05-04 | 3 | 5 | 2020-05-01 | 3 |
Chile | 2020-05-05 | 1 | 6 | 2020-05-01 | 4 |
The reprex
data set we have created allows to inspect the logic of our script manually and have a much higher confidence that it does what we expect it to do. Now we can apply this solution to the real data. Notice that we create a variation of this solution, applying another definition of the “day zero” - the day of the first confirmed mortality in the country.
ds_covid_timeline <- ds_covid %>%
group_by(country_code) %>%
mutate(
# compute timeline of cumulative confirmed cases
n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
,date_of_1case = ifelse(cumsum(n_cases_cum > 0) == 1L, date, NA) %>%
min(na.rm=T) %>%
lubridate::as_date()
,days_since_1case = (date - date_of_1case) %>% as.integer()
# compute timeine of cumulative deaths
,n_deaths_cum = cumsum(tidyr::replace_na(n_deaths,0))
,date_of_1death = ifelse(cumsum(n_deaths_cum > 0) == 1L, date, NA) %>%
min(na.rm=T) %>%
lubridate::as_date()
,days_since_1death = (date - date_of_1death) %>% as.integer()
# compute absolute timeline
,days_since_exodus = as.integer(date - date_of_exodus) # first case outside of china
,days_since_pandemic = as.integer(date - date_of_pandemic) # WHO declares pandemic
,n_deaths_cum_per_1m = as.integer(n_deaths_cum/n_population_2018*1000000)
,n_cases_cum_per_1m = as.integer(n_cases_cum/ n_population_2018*1000000)
) %>%
ungroup() %>%
select(
date, country_code,
n_cases, n_deaths, n_cases_cum, n_deaths_cum, n_cases_cum_per_1m, n_deaths_cum_per_1m,
days_since_1case, days_since_1death, days_since_exodus, days_since_pandemic,
dplyr::everything()
)
ds_covid_timeline %>% glimpse()
Rows: 31,775
Columns: 20
$ date <date> 2019-12-31, 2020-01-01, 2020-01-02, 2020-01-03, 2020-01-04, 2020-01-05, 2020-01-06, 20...
$ country_code <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AF...
$ n_cases <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ n_deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ n_cases_cum <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ n_deaths_cum <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ n_cases_cum_per_1m <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ n_deaths_cum_per_1m <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ days_since_1case <int> -56, -55, -54, -53, -52, -51, -50, -49, -48, -47, -46, -45, -44, -43, -42, -41, -40, -3...
$ days_since_1death <int> -84, -83, -82, -81, -80, -79, -78, -77, -76, -75, -74, -73, -72, -71, -70, -69, -68, -6...
$ days_since_exodus <int> -13, -12, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1...
$ days_since_pandemic <int> -71, -70, -69, -68, -67, -66, -65, -64, -63, -62, -61, -60, -59, -58, -57, -56, -55, -5...
$ n_population_2018 <dbl> 37172386, 37172386, 37172386, 37172386, 37172386, 37172386, 37172386, 37172386, 3717238...
$ country_code2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF...
$ country_label <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan...
$ continent_label <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia",...
$ oecd <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
$ focus <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
$ date_of_1case <date> 2020-02-25, 2020-02-25, 2020-02-25, 2020-02-25, 2020-02-25, 2020-02-25, 2020-02-25, 20...
$ date_of_1death <date> 2020-03-24, 2020-03-24, 2020-03-24, 2020-03-24, 2020-03-24, 2020-03-24, 2020-03-24, 20...
Also notice that we create a few other helper variables: two other operationalization of the absolute time line (days_since_exodus
and days_since_pandemic
) as well as population adjustments for cumulative cases and deaths in each country(n_cases_cum_per_1m
and n_deaths_cum_per_1m
)
With all variable prepared for plotting, let us develop the barebones graph first, to help us see the composition of the plot:
d2 <- ds_covid_timeline %>% filter(focus)
g2dev <-
d2 %>%
ggplot(aes(
x = days_since_exodus
,y = n_cases_cum / 1000
,group = country_code
))+
geom_line()+
facet_wrap(~country_label, scales = "free")+
geom_point(data = d2 %>% filter(days_since_1case == 0 ))+
geom_point(data = d2 %>% filter(days_since_1death == 0 ), size = 3 )
g2dev
Now let’s enhance the appearance of the plot:
d2 <- ds_covid_timeline %>% filter(focus)
g2 <-
d2 %>%
filter(focus) %>%
ggplot(aes(
x = days_since_exodus
,y = n_cases_cum / 1000
))+
geom_line(aes(group = country_code))+
facet_wrap(~country_label, scale = "free", ncol = 3)+
geom_point(
data = d2 %>% filter(days_since_1case == 1)
,size = 2, fill = "#1b9e77", color = "black", alpha = .5, shape = 21
)+
geom_point(
data = d2 %>% filter(days_since_1death == 1)
,size = 2, fill = "#d95f02", color = "black", alpha = .5, shape = 21
)+
geom_vline(xintercept = 58, linetype = "dotted",)+
geom_vline(xintercept = 75, linetype = "dashed", alpha = .5)+
geom_vline(xintercept = 100, linetype = "dashed", color = "red", alpha = .5)+
scale_x_continuous(breaks = seq(0,100, 50))+
labs(
title = "Timeline of COVID-19: Cumulative Cases"
,y = "Cumulative Cases (in thousands)"
,x = "Days since the first confirmed case outside of China"
,caption = "(first dot) = 1st confirmed case, (second dot) = 1st confirmed death,
(dotted line) = pandemic announced by WHO, (dashed lines) = 75 and 100th day since Exodus"
)
g2
# you can overwrite the mapping to plot a different measure:
g2 +
aes(y = n_deaths_cum_per_1m)+
labs(
y = "Cumulative Deaths per 1 mil"
,title = "Timeline of COVID-19: Cumulative Deaths per 1 million"
)
Similarly with the plot from Goal 2, let us first sketch the essential components
d3 <-
ds_covid_timeline %>%
filter(days_since_1case == 0) %>%
filter(oecd) %>%
mutate(
country_label = forcats::fct_reorder(country_label, days_since_exodus),
)
g3dev <-
d3 %>%
ggplot(aes(x = date, y = country_label))+
geom_point(color = "deepskyblue")+
geom_point(aes(x = date_of_1death), color = "tomato")+
geom_segment(aes(yend = country_label, xend = date_of_1death))+
geom_text(aes(label = country_code2, x = date_of_1death), hjust = -1)
g3dev
and then enhance the appearance:
g3 <- ds_covid_timeline %>%
filter(oecd) %>%
filter(days_since_1case == 0) %>%
mutate(
country_label = forcats::fct_reorder(country_label, days_since_exodus)
) %>%
ggplot(aes(x = date, y = country_label))+
geom_point(shape = 21, size =2, alpha = .6, fill = "#1b9e77")+
geom_point(aes(x = date_of_1death), shape = 21, size =2, alpha = .6, fill = "#d95f02")+
geom_segment(aes(yend = country_label, xend = date_of_1death, color = "red"))+
geom_text(aes(label = country_code2, x = date_of_1death), hjust = -1, size = 3, color = "grey60")+
# scale_x_continuous(breaks = seq(0,140, 20))+
guides(color = F)+
labs(
title = "COVID Timeline: Days to 1st Case and 1st Death"
,x = "Date", y = NULL
,caption = "(green dot) = 1st confirmed case, (orange dot) = 1st confirmed death"
)
g3
To help you further you mastery of these graphical forms consider the following exercises:
Using the graph from Goal 1, re-design the interactive highlights so that they depicts both the trajectory line and the dots for temporal anchors ( day of first case and day of first death). Hint: you can no longer filter within the geom, so you need to create a new variable before tranforming the dataset with highlight_key
.
Re-draw the graph in Goal 2 with a different operationalization of the relative timeline: Days since 5 daily new deaths first reported.
Redraw the graph in Goal 3 to sort the countries according to the descending distance between the day of the first case and the day of the first death.
For the sake of documentation and reproducibility, the current report was rendered in the following environment. Click the line below to expand.
Environment
- Session info -------------------------------------------------------------------------------------------------------
setting value
version R version 3.6.3 (2020-02-29)
os Windows 10 x64
system x86_64, mingw32
ui RTerm
language (EN)
collate English_United States.1252
ctype English_United States.1252
tz America/New_York
date 2020-06-02
- Packages -----------------------------------------------------------------------------------------------------------
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.2)
backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.1)
callr 3.4.3 2020-03-28 [1] CRAN (R 3.6.3)
cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.3)
codetools 0.2-16 2018-12-24 [2] CRAN (R 3.6.3)
colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.1)
config 0.3 2018-03-27 [1] CRAN (R 3.6.3)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.2)
crosstalk * 1.0.0 2016-12-21 [1] CRAN (R 3.6.2)
data.table 1.12.8 2019-12-09 [1] CRAN (R 3.6.2)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.2)
devtools 2.3.0 2020-04-10 [1] CRAN (R 3.6.3)
digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.3)
dplyr * 0.8.5 2020-03-07 [1] CRAN (R 3.6.3)
ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.2)
evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.2)
fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.2)
farver 2.0.3 2020-01-16 [1] CRAN (R 3.6.2)
fastmap 1.0.1 2019-10-08 [1] CRAN (R 3.6.2)
forcats 0.4.0 2019-02-17 [1] CRAN (R 3.6.2)
fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.2)
generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.2)
ggplot2 * 3.2.1 2019-08-10 [1] CRAN (R 3.6.2)
glue 1.4.0 2020-04-03 [1] CRAN (R 3.6.3)
gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.2)
highr 0.8 2019-03-20 [1] CRAN (R 3.6.2)
hms 0.5.3 2020-01-08 [1] CRAN (R 3.6.2)
htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.2)
htmlwidgets 1.5.1 2019-10-08 [1] CRAN (R 3.6.2)
httpuv 1.5.2 2019-09-11 [1] CRAN (R 3.6.2)
httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.2)
jsonlite 1.6.1 2020-02-02 [1] CRAN (R 3.6.2)
kableExtra 1.1.0 2019-03-16 [1] CRAN (R 3.6.3)
knitr * 1.28 2020-02-06 [1] CRAN (R 3.6.2)
labeling 0.3 2014-08-23 [1] CRAN (R 3.6.0)
later 1.0.0 2019-10-04 [1] CRAN (R 3.6.2)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 3.6.2)
lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.3)
lubridate * 1.7.8 2020-04-06 [1] CRAN (R 3.6.3)
magrittr * 1.5 2014-11-22 [1] CRAN (R 3.6.2)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.2)
mime 0.9 2020-02-04 [1] CRAN (R 3.6.2)
munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.2)
pillar 1.4.3 2019-12-20 [1] CRAN (R 3.6.2)
pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.6.2)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.2)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.2)
plotly * 4.9.2 2020-02-12 [1] CRAN (R 3.6.2)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.2)
processx 3.4.2 2020-02-09 [1] CRAN (R 3.6.2)
promises 1.1.0 2019-10-04 [1] CRAN (R 3.6.2)
ps 1.3.2 2020-02-13 [1] CRAN (R 3.6.2)
purrr 0.3.4 2020-04-17 [1] CRAN (R 3.6.3)
R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.2)
RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 3.6.0)
Rcpp 1.0.4.6 2020-04-09 [1] CRAN (R 3.6.3)
readr 1.3.1 2018-12-21 [1] CRAN (R 3.6.2)
remotes 2.1.1 2020-02-15 [1] CRAN (R 3.6.2)
rlang 0.4.5 2020-03-01 [1] CRAN (R 3.6.3)
rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.2)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.2)
rstudioapi 0.11 2020-02-07 [1] CRAN (R 3.6.2)
rvest 0.3.5 2019-11-08 [1] CRAN (R 3.6.2)
scales 1.1.0 2019-11-18 [1] CRAN (R 3.6.2)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.2)
shiny 1.4.0 2019-10-10 [1] CRAN (R 3.6.2)
stringi 1.4.6 2020-02-17 [1] CRAN (R 3.6.2)
stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.2)
testthat 2.3.2 2020-03-02 [1] CRAN (R 3.6.3)
tibble 3.0.1 2020-04-20 [1] CRAN (R 3.6.3)
tidyr 1.0.2 2020-01-24 [1] CRAN (R 3.6.2)
tidyselect 1.0.0 2020-01-27 [1] CRAN (R 3.6.2)
usethis 1.6.0 2020-04-09 [1] CRAN (R 3.6.3)
utf8 1.1.4 2018-05-24 [1] CRAN (R 3.6.2)
vctrs 0.2.4 2020-03-10 [1] CRAN (R 3.6.3)
viridisLite 0.3.0 2018-02-01 [1] CRAN (R 3.6.2)
webshot 0.5.2 2019-11-22 [1] CRAN (R 3.6.3)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.2)
xfun 0.12 2020-01-13 [1] CRAN (R 3.6.2)
xml2 1.2.2 2019-08-09 [1] CRAN (R 3.6.2)
xtable 1.8-4 2019-04-21 [1] CRAN (R 3.6.2)
yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.2)
[1] C:/Users/an499583/Documents/R/win-library/3.6
[2] C:/Users/an499583/Documents/R/R-3.6.3/library