TL;DR

This workshop demonstrates basic exploration of COVID-19 data, with the focus on computing and graphing data on relative timelines, in which the “day zero” is unique for each country.

Learning Objectives

After this workshop participants should be able to:

Plot time series of COVID-19 cases using ggplot2 package
Add interactive highlights to trajectories using plotly package
Compute indicators for key epidemiological events in each country (e.g. day of the first death)
Construct country-specific timelines relative to key epidemiological events
Visualize the sequence of key events for a group of countries

Visualization Goals

In this tutorial we will have three visualisation goals. First, we would like to have a tool for plotting multiple trajectories onto the same canvas and explore the differences between trajectories plotted on the same scale. Second, we would like to create a unique temporal context for each trajectory and express it with respect meaningful anchors, such as first detected case or first confirmed death. Thirdly, we would like to better understand the sequence of these temporal anchors to have a better understanding of how the pandemic unfolded.

Here are the graphs we will be creating in this tutorial:

Goal 1 - Absolute, Interactive

Create a line graph of many trajectories with an interactive tooltip.

Goal 2 - Relative, faceted

Create a trajectory graph depicting how the epidemic unfolded in each contry relative to its key epidemiological events.

Goal 3 - Temporal sequence

Create a graph depicting the sequence of key epidemiological events across countries.

Environment

Non-technical readers are welcome to skip this section.

library(magrittr) #Pipes
library(ggplot2) #For graphing
library(dplyr) # for shorter function names. but still prefer dplyr:: stems
library(lubridate) # for working with dates
library(plotly) # interactive graphs
library(crosstalk)

config <- config::get() # common definitions stored in `./config.yml`

# to be applied to every graph we will make
ggplot2::theme_set(
  ggplot2::theme_bw(
  )+
    theme(
      strip.background = element_rect(fill="grey90", color = NA)
    )
)
# important dates we will refer to in analysis
date_of_exodus   <- lubridate::date("2020-01-13") # first case outside of China
date_of_pandemic <- lubridate::date("2020-03-11") # WHO declares pandemic
# to help us focus on a manageable set of countries for purposes of demonstration
oecd_countries <- c(
  "AUS", "AUT", "BEL", "CAN", "CZE", "DNK", "EST", "FIN", "FRA",
  "DEU", "GRC", "HUN", "ISL", "IRL", "ISR", "ITA", "JPN", "KOR",
  "LVA", "LTU", "MEX", "NLD", "NZL", "NOR", "POL", "PRT", "SVK",
  "SVN", "ESP", "SWE", "CHE", "TUR", "GBR", "USA", "RUS", "ZAF"
)
focus_countries <- c("CAN","USA","ITA","TUR", "NLD","CHE")

# adds neat styling to your knitr table
neat <- function(x, output_format = "html"){
  # knitr.table.format = output_format
  if(output_format == "pandoc"){
    x_t <- knitr::kable(x, format = "pandoc")
  }else{
    x_t <- x %>%
      # x %>%
      # knitr::kable() %>%
      knitr::kable(format=output_format) %>%
      kableExtra::kable_styling(
        bootstrap_options = c("striped", "hover", "condensed","responsive"),
        # bootstrap_options = c( "condensed"),
        full_width = F,
        position = "left"
      )
  }
  return(x_t)
}

Data

The data comes from European Centre for Disease Prevention and Control, with the source available from here. I demonstrate the preparation of this data for analysis in ./manipulation/ellis-covid.R script of this repository.

# covid data
ds_covid <- readr::read_csv(config$path_input_covid)
ds_covid %>% glimpse()

Rows: 31,775
Columns: 8
$ date              <date> 2019-12-31, 2020-01-01, 2020-01-02, 2020-01-03, 2020-01-04, 2020-01-05, 2020-01-06, 2020...
$ n_cases           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ n_deaths          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ country_code      <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG"...
$ n_population_2018 <dbl> 37172386, 37172386, 37172386, 37172386, 37172386, 37172386, 37172386, 37172386, 37172386,...
$ country_code2     <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF",...
$ country_label     <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan",...
$ continent_label   <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "...

# note that only `date`, `n_cases`, and `n_deatsh` change with time
# other variables have the same value within each country

A few things to notice about this data set:

Only date , n_cases, and n_deaths change over time. All other variables have a single unique value for each country.
For some countries, the observations are missing for certain dates, for example:

ds_covid %>% filter(country_code == "FIN") %>% filter(date > as_date("2020-02-26"))

# A tibble: 97 x 8
   date       n_cases n_deaths country_code n_population_2018 country_code2 country_label continent_label
   <date>       <dbl>    <dbl> <chr>                    <dbl> <chr>         <chr>         <chr>          
 1 2020-02-27       1        0 FIN                    5518050 FI            Finland       Europe         
 2 2020-02-28       0        0 FIN                    5518050 FI            Finland       Europe         
 3 2020-02-29       1        0 FIN                    5518050 FI            Finland       Europe         
 4 2020-03-01       0        0 FIN                    5518050 FI            Finland       Europe         
 5 2020-03-02       3        0 FIN                    5518050 FI            Finland       Europe         
 6 2020-03-03      NA       NA FIN                    5518050 FI            Finland       Europe         
 7 2020-03-04       1        0 FIN                    5518050 FI            Finland       Europe         
 8 2020-03-05      NA       NA FIN                    5518050 FI            Finland       Europe         
 9 2020-03-06       5        0 FIN                    5518050 FI            Finland       Europe         
10 2020-03-07       7        0 FIN                    5518050 FI            Finland       Europe         
# ... with 87 more rows

The data preparation step ensured that each country has the same number of rows, creating these missing cells. This is important for two reasons: A) we want to be able to differentiate between the absence of cases and the absence of reporting and B) missing dates will complicate comutation of relative timelines.

Data Tweaks

# to help us focus on a smaller subset of countries
oecd_countries <- c(
  "AUS", "AUT", "BEL", "CAN", "CZE", "DNK", "EST", "FIN", "FRA",
  "DEU", "GRC", "HUN", "ISL", "IRL", "ISR", "ITA", "JPN", "KOR",
  "LVA", "LTU", "MEX", "NLD", "NZL", "NOR", "POL", "PRT", "SVK",
  "SVN", "ESP", "SWE", "CHE", "TUR", "GBR", "USA", "RUS", "ZAF"
)
focus_countries <- c("CAN","USA","ITA","TUR", "NLD","CHE")
# to have a handy filter
ds_covid <- ds_covid %>%
  mutate(
    oecd   = country_code %in% oecd_countries
    ,focus = country_code %in% focus_countries
  )

We focus on 36 country members of the Organization for Economic Co-operation and Development (OECD) because they report more nuanced data on their economic and social development, so we can have a richer pool of explanatory variables (see http://stats.oecd.org/ )

Goal 1

plotly is a JavaScript library used for creating interactive visualizations and dashboards. Its implementation in python and R, the latter offered via plotly package. Its syntax is very similar to ggplot2, but nevertheless sufficiently different to be able to distract you into a (potentially lenghty) side quest. Luckily, there is a magic function plotly::ggplotly that transforms a graph built in ggplot2 into an interactive plotly object.

To demonstrate, let us first build a basic graph with multiple timeseries:

g1 <-
  ds_covid %>%
  filter(focus) %>%
  ggplot(aes(
    x  = date
    ,y = n_deaths
    ,group = country_label
  ))+
  geom_line()
g1

And now we enhanse this plot with highlight component using the plotly::highlight() function and some additional options:

g1 <-
  ds_covid %>%
  filter(focus) %>%
  plotly::highlight_key(~ country_label) %>% # add BEFORE bulding the ggplot!
  ggplot(aes(
    x  = date
    ,y = n_deaths
    ,group = country_label
  ))+
  geom_line()
# g1 # it would still print as a regular static ggplot
g1p <-
  plotly::ggplotly(g1) %>%                    # make into a plotly object
  plotly::highlight(                         # add highlight functionality
    on             = "plotly_click"          # or "plotly_hover"
    ,dynamic       = TRUE                    # adds color option
    ,selectize     = TRUE                    # select what to highlight
    ,defaultValues = "Canada"                # highlights in the beginning
  ) %>%
  plotly::layout(margin = list(l = 0, r = 0, b = 100, t = 0, pad = 0))
g1p

For more options and syntax guide, see https://plotly-r.com/client-side-linking.html. There is also a package for rendering ggplot2 into interactive graphs (ggplotly), which offers less flexibility in display design compared to creating graphs directly with plotly, but is much simpler to implement.

Goal 2

It is often makes sense to compare the progression of epidemics across countries using a meaningful “time zero”, for example the day of the first confirmed case or the first confirmed death in the country. To create the Goal 2 graph we first need to derive a number of new variables. The following subgoals will help learn the operations used to derive them

Subgoals:

1. Compute comulative cases (Running total)

2. Create onset markers (which values exceed a chosen threshold?)

3. First case marker (what row is the first to exceed a chosen threshold?)

4. Date of first case (what is the date associated with exceeding a chosen threshold?)

5. Number of days since 1st case (how many days have passed between current date and the date of exceeding a threshold?)

Reprex

To help us carry out the computation, let us construct a fictional example that we can use to develop the script

# create reproducible example (reprex) to test out your function
d_reprex <- tibble::tribble(
  ~country_code, ~date, ~n_cases,
  "Alabnia",  "2020-03-01", NA,
  "Alabnia",  "2020-03-02", 0,
  "Alabnia",  "2020-03-03", 1,
  "Alabnia",  "2020-03-04", 0,
  "Alabnia",  "2020-03-05", 3,
  "Botswana", "2020-04-01", 0,
  "Botswana", "2020-04-02", NA,
  "Botswana", "2020-04-03", 2,
  "Botswana", "2020-04-04", 3,
  "Botswana", "2020-04-05", 0,
  "Chile",    "2020-05-01", 2,
  "Chile",    "2020-05-02", 0,
  "Chile",    "2020-05-03", 0,
  "Chile",    "2020-05-04", 3,
  "Chile",    "2020-05-05", 1,
) %>%
  mutate(
    date = lubridate::as_date(date)
  )
d_reprex  %>% neat()

country_code	date	n_cases
Alabnia	2020-03-01	NA
Alabnia	2020-03-02	0
Alabnia	2020-03-03	1
Alabnia	2020-03-04	0
Alabnia	2020-03-05	3
Botswana	2020-04-01	0
Botswana	2020-04-02	NA
Botswana	2020-04-03	2
Botswana	2020-04-04	3
Botswana	2020-04-05	0
Chile	2020-05-01	2
Chile	2020-05-02	0
Chile	2020-05-03	0
Chile	2020-05-04	3
Chile	2020-05-05	1

1. Compute comulative cases (Running total)

Computing cumulative (running) sum is easily accomplished with cumsum function paired with group_by, however, watch out for NA values: they will break the running sum, resulting in the NA for the rest of the column. To avoid this, we use convert NA to 0 on the fly with tidyr::replace_na() function

d_reprex_timeline <- d_reprex %>%
  group_by(country_code) %>%
  mutate(
    n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
  )
d_reprex_timeline %>% neat()

country_code	date	n_cases	n_cases_cum
Alabnia	2020-03-01	NA	0
Alabnia	2020-03-02	0	0
Alabnia	2020-03-03	1	1
Alabnia	2020-03-04	0	1
Alabnia	2020-03-05	3	4
Botswana	2020-04-01	0	0
Botswana	2020-04-02	NA	0
Botswana	2020-04-03	2	2
Botswana	2020-04-04	3	5
Botswana	2020-04-05	0	5
Chile	2020-05-01	2	2
Chile	2020-05-02	0	2
Chile	2020-05-03	0	2
Chile	2020-05-04	3	5
Chile	2020-05-05	1	6

This option is better than converting NAs to zero during the data preparation step, as this would mask the absence of reporting, overwriting it with a definitive value of 0 cases.

2. Create onset markers (which values exceed a chosen threshold?)

Now we will use a simple logical test to create a logical variable indicating when the running total exceeded the value of the chosen threshold.

d_reprex_timeline <- d_reprex %>%
  group_by(country_code) %>%
  mutate(
    n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
    ,onset_case = n_cases_cum > 0                      # NEW LINE
  )
d_reprex_timeline %>% neat()

country_code	date	n_cases	n_cases_cum	onset_case
Alabnia	2020-03-01	NA	0	FALSE
Alabnia	2020-03-02	0	0	FALSE
Alabnia	2020-03-03	1	1	TRUE
Alabnia	2020-03-04	0	1	TRUE
Alabnia	2020-03-05	3	4	TRUE
Botswana	2020-04-01	0	0	FALSE
Botswana	2020-04-02	NA	0	FALSE
Botswana	2020-04-03	2	2	TRUE
Botswana	2020-04-04	3	5	TRUE
Botswana	2020-04-05	0	5	TRUE
Chile	2020-05-01	2	2	TRUE
Chile	2020-05-02	0	2	TRUE
Chile	2020-05-03	0	2	TRUE
Chile	2020-05-04	3	5	TRUE
Chile	2020-05-05	1	6	TRUE

It may make sense to use other operationalization the “onset” event.. For example, we can define it as “the date of the 10th case” or “.01% of population infected” or “5th days of registering new cases in a row”.

3. First case marker (what row is the first to exceed a chosen threshold?)

Now with the column onset_case marking whether the running total is higher than a threshold, we can identify the first occurrence of TRUE :

d_reprex_timeline <- d_reprex %>%
  group_by(country_code) %>%
  mutate(
    n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
    ,onset_case = n_cases_cum > 0
    ,first_case = cumsum(onset_case) == 1L             # NEW LINE
  )
d_reprex_timeline %>% neat()

country_code	date	n_cases	n_cases_cum	onset_case	first_case
Alabnia	2020-03-01	NA	0	FALSE	FALSE
Alabnia	2020-03-02	0	0	FALSE	FALSE
Alabnia	2020-03-03	1	1	TRUE	TRUE
Alabnia	2020-03-04	0	1	TRUE	FALSE
Alabnia	2020-03-05	3	4	TRUE	FALSE
Botswana	2020-04-01	0	0	FALSE	FALSE
Botswana	2020-04-02	NA	0	FALSE	FALSE
Botswana	2020-04-03	2	2	TRUE	TRUE
Botswana	2020-04-04	3	5	TRUE	FALSE
Botswana	2020-04-05	0	5	TRUE	FALSE
Chile	2020-05-01	2	2	TRUE	TRUE
Chile	2020-05-02	0	2	TRUE	FALSE
Chile	2020-05-03	0	2	TRUE	FALSE
Chile	2020-05-04	3	5	TRUE	FALSE
Chile	2020-05-05	1	6	TRUE	FALSE

Notice, that we use the property of logical class: when used in mathematical expression, FALSE assumes the value of 0, while TRUE is interpreted as 1.

4. Date of first case (what is the date associated with exceeding a chosen threshold?)

Now we can use this indicator to extract the date associated with this row:

d_reprex_timeline <- d_reprex %>%
  group_by(country_code) %>%
  mutate(
    n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
    ,onset_case = n_cases_cum > 0
    ,first_case = cumsum(onset_case) == 1L
    ,date_of_1case = ifelse(first_case, date, NA) %>% lubridate::as_date() # NEW LINE
  )
d_reprex_timeline %>% neat()

country_code	date	n_cases	n_cases_cum	onset_case	first_case	date_of_1case
Alabnia	2020-03-01	NA	0	FALSE	FALSE	NA
Alabnia	2020-03-02	0	0	FALSE	FALSE	NA
Alabnia	2020-03-03	1	1	TRUE	TRUE	2020-03-03
Alabnia	2020-03-04	0	1	TRUE	FALSE	NA
Alabnia	2020-03-05	3	4	TRUE	FALSE	NA
Botswana	2020-04-01	0	0	FALSE	FALSE	NA
Botswana	2020-04-02	NA	0	FALSE	FALSE	NA
Botswana	2020-04-03	2	2	TRUE	TRUE	2020-04-03
Botswana	2020-04-04	3	5	TRUE	FALSE	NA
Botswana	2020-04-05	0	5	TRUE	FALSE	NA
Chile	2020-05-01	2	2	TRUE	TRUE	2020-05-01
Chile	2020-05-02	0	2	TRUE	FALSE	NA
Chile	2020-05-03	0	2	TRUE	FALSE	NA
Chile	2020-05-04	3	5	TRUE	FALSE	NA
Chile	2020-05-05	1	6	TRUE	FALSE	NA

There will be only one date associated with traspassing the threshold, so we populate the rest of the cells with it

d_reprex_timeline <- d_reprex %>%
  group_by(country_code) %>%
  mutate(
    n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
    ,onset_case = n_cases_cum > 0
    ,first_case = cumsum(onset_case) == 1L
    ,date_of_1case = ifelse(first_case, date, NA) %>% lubridate::as_date()
    ,date_of_1case = min(date_of_1case, na.rm = T) # NEW LINE
  )
d_reprex_timeline %>% neat()

country_code	date	n_cases	n_cases_cum	onset_case	first_case	date_of_1case
Alabnia	2020-03-01	NA	0	FALSE	FALSE	2020-03-03
Alabnia	2020-03-02	0	0	FALSE	FALSE	2020-03-03
Alabnia	2020-03-03	1	1	TRUE	TRUE	2020-03-03
Alabnia	2020-03-04	0	1	TRUE	FALSE	2020-03-03
Alabnia	2020-03-05	3	4	TRUE	FALSE	2020-03-03
Botswana	2020-04-01	0	0	FALSE	FALSE	2020-04-03
Botswana	2020-04-02	NA	0	FALSE	FALSE	2020-04-03
Botswana	2020-04-03	2	2	TRUE	TRUE	2020-04-03
Botswana	2020-04-04	3	5	TRUE	FALSE	2020-04-03
Botswana	2020-04-05	0	5	TRUE	FALSE	2020-04-03
Chile	2020-05-01	2	2	TRUE	TRUE	2020-05-01
Chile	2020-05-02	0	2	TRUE	FALSE	2020-05-01
Chile	2020-05-03	0	2	TRUE	FALSE	2020-05-01
Chile	2020-05-04	3	5	TRUE	FALSE	2020-05-01
Chile	2020-05-05	1	6	TRUE	FALSE	2020-05-01

5. Number of days since 1st case (how many days have passed between current date and the date of exceeding a threshold?)

This allows us for a very straightforward computation of the distance between any given date and the date of the “onset”, in this case the date of the first confirmed case:

d_reprex_timeline <- d_reprex %>%
  group_by(country_code) %>%
  mutate(
    n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
    ,onset_case = n_cases_cum > 0
    ,first_case = cumsum(onset_case) == 1L
    ,date_of_1case = ifelse(first_case, date, NA) %>% lubridate::as_date()
    ,date_of_1case = min(date_of_1case, na.rm = T)
    ,days_since_1case = (date - date_of_1case) %>% as.integer() # NEW LINE
  )
d_reprex_timeline %>% neat()

country_code	date	n_cases	n_cases_cum	onset_case	first_case	date_of_1case	days_since_1case
Alabnia	2020-03-01	NA	0	FALSE	FALSE	2020-03-03	-2
Alabnia	2020-03-02	0	0	FALSE	FALSE	2020-03-03	-1
Alabnia	2020-03-03	1	1	TRUE	TRUE	2020-03-03	0
Alabnia	2020-03-04	0	1	TRUE	FALSE	2020-03-03	1
Alabnia	2020-03-05	3	4	TRUE	FALSE	2020-03-03	2
Botswana	2020-04-01	0	0	FALSE	FALSE	2020-04-03	-2
Botswana	2020-04-02	NA	0	FALSE	FALSE	2020-04-03	-1
Botswana	2020-04-03	2	2	TRUE	TRUE	2020-04-03	0
Botswana	2020-04-04	3	5	TRUE	FALSE	2020-04-03	1
Botswana	2020-04-05	0	5	TRUE	FALSE	2020-04-03	2
Chile	2020-05-01	2	2	TRUE	TRUE	2020-05-01	0
Chile	2020-05-02	0	2	TRUE	FALSE	2020-05-01	1
Chile	2020-05-03	0	2	TRUE	FALSE	2020-05-01	2
Chile	2020-05-04	3	5	TRUE	FALSE	2020-05-01	3
Chile	2020-05-05	1	6	TRUE	FALSE	2020-05-01	4

Re-write for brevity

Finally, we can re-express these steps more succinctly, however, it might be advisible to leave these step in comments in case you need to retrace your steps or debug an error down the stream

d_reprex_timeline <- d_reprex %>%
  group_by(country_code) %>%
  mutate(
    n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
    #,onset_case = n_cases_cum > 0
    # ,first_case = cumsum(onset_case) == 1L
    # ,date_of_1case = ifelse(first_case, date, NA) %>% lubridate::as_date()
    # ,date_of_1case = min(date_of_1case, na.rm = T)
    # alternatively, as a single sentence:
    ,date_of_1case = ifelse(cumsum(n_cases_cum > 0) == 1L, date, NA) %>% # NEW LINE
      min(na.rm=T) %>%                                                   # NEW LINE
      lubridate::as_date()                                               # NEW LINE
    # relative timeline
    ,days_since_1case = (date - date_of_1case) %>% as.integer()
    ) %>%
  ungroup()
d_reprex_timeline %>% neat()

country_code	date	n_cases	n_cases_cum	date_of_1case	days_since_1case
Alabnia	2020-03-01	NA	0	2020-03-03	-2
Alabnia	2020-03-02	0	0	2020-03-03	-1
Alabnia	2020-03-03	1	1	2020-03-03	0
Alabnia	2020-03-04	0	1	2020-03-03	1
Alabnia	2020-03-05	3	4	2020-03-03	2
Botswana	2020-04-01	0	0	2020-04-03	-2
Botswana	2020-04-02	NA	0	2020-04-03	-1
Botswana	2020-04-03	2	2	2020-04-03	0
Botswana	2020-04-04	3	5	2020-04-03	1
Botswana	2020-04-05	0	5	2020-04-03	2
Chile	2020-05-01	2	2	2020-05-01	0
Chile	2020-05-02	0	2	2020-05-01	1
Chile	2020-05-03	0	2	2020-05-01	2
Chile	2020-05-04	3	5	2020-05-01	3
Chile	2020-05-05	1	6	2020-05-01	4

Compute relative timelines

The reprex data set we have created allows to inspect the logic of our script manually and have a much higher confidence that it does what we expect it to do. Now we can apply this solution to the real data. Notice that we create a variation of this solution, applying another definition of the “day zero” - the day of the first confirmed mortality in the country.

ds_covid_timeline <- ds_covid %>%
  group_by(country_code) %>%
  mutate(
    # compute timeline of cumulative confirmed cases
    n_cases_cum  = cumsum(tidyr::replace_na(n_cases,0))
    ,date_of_1case = ifelse(cumsum(n_cases_cum > 0) == 1L, date, NA) %>%
      min(na.rm=T) %>%
      lubridate::as_date()
    ,days_since_1case = (date - date_of_1case) %>% as.integer()
    # compute timeine of cumulative deaths
    ,n_deaths_cum  = cumsum(tidyr::replace_na(n_deaths,0))
    ,date_of_1death = ifelse(cumsum(n_deaths_cum > 0) == 1L, date, NA) %>%
      min(na.rm=T) %>%
      lubridate::as_date()
    ,days_since_1death = (date - date_of_1death) %>% as.integer()
    # compute absolute timeline
    ,days_since_exodus   = as.integer(date - date_of_exodus) # first case outside of china
    ,days_since_pandemic = as.integer(date - date_of_pandemic) # WHO declares pandemic
    ,n_deaths_cum_per_1m = as.integer(n_deaths_cum/n_population_2018*1000000)
    ,n_cases_cum_per_1m  = as.integer(n_cases_cum/ n_population_2018*1000000)
  ) %>%
  ungroup() %>%
  select(
    date, country_code,
    n_cases, n_deaths, n_cases_cum, n_deaths_cum, n_cases_cum_per_1m, n_deaths_cum_per_1m,
    days_since_1case, days_since_1death, days_since_exodus, days_since_pandemic,
    dplyr::everything()
  )
ds_covid_timeline %>% glimpse()

Rows: 31,775
Columns: 20
$ date                <date> 2019-12-31, 2020-01-01, 2020-01-02, 2020-01-03, 2020-01-04, 2020-01-05, 2020-01-06, 20...
$ country_code        <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AF...
$ n_cases             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ n_deaths            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ n_cases_cum         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ n_deaths_cum        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ n_cases_cum_per_1m  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ n_deaths_cum_per_1m <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ days_since_1case    <int> -56, -55, -54, -53, -52, -51, -50, -49, -48, -47, -46, -45, -44, -43, -42, -41, -40, -3...
$ days_since_1death   <int> -84, -83, -82, -81, -80, -79, -78, -77, -76, -75, -74, -73, -72, -71, -70, -69, -68, -6...
$ days_since_exodus   <int> -13, -12, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1...
$ days_since_pandemic <int> -71, -70, -69, -68, -67, -66, -65, -64, -63, -62, -61, -60, -59, -58, -57, -56, -55, -5...
$ n_population_2018   <dbl> 37172386, 37172386, 37172386, 37172386, 37172386, 37172386, 37172386, 37172386, 3717238...
$ country_code2       <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF...
$ country_label       <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan...
$ continent_label     <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia",...
$ oecd                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
$ focus               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
$ date_of_1case       <date> 2020-02-25, 2020-02-25, 2020-02-25, 2020-02-25, 2020-02-25, 2020-02-25, 2020-02-25, 20...
$ date_of_1death      <date> 2020-03-24, 2020-03-24, 2020-03-24, 2020-03-24, 2020-03-24, 2020-03-24, 2020-03-24, 20...

Also notice that we create a few other helper variables: two other operationalization of the absolute time line (days_since_exodus and days_since_pandemic) as well as population adjustments for cumulative cases and deaths in each country(n_cases_cum_per_1m and n_deaths_cum_per_1m)

Goal 2 graph

With all variable prepared for plotting, let us develop the barebones graph first, to help us see the composition of the plot:

d2 <- ds_covid_timeline %>% filter(focus)
g2dev <-
  d2 %>%
  ggplot(aes(
    x = days_since_exodus
    ,y = n_cases_cum / 1000
    ,group = country_code
  ))+
  geom_line()+
  facet_wrap(~country_label, scales = "free")+
  geom_point(data = d2 %>% filter(days_since_1case == 0 ))+
  geom_point(data = d2 %>% filter(days_since_1death == 0 ), size = 3 )
g2dev

Now let’s enhance the appearance of the plot:

d2 <- ds_covid_timeline %>% filter(focus)
g2 <-
  d2 %>%
  filter(focus) %>%
  ggplot(aes(
    x = days_since_exodus
    ,y = n_cases_cum / 1000
  ))+
  geom_line(aes(group = country_code))+
  facet_wrap(~country_label, scale = "free", ncol = 3)+
  geom_point(
    data  = d2 %>% filter(days_since_1case == 1)
    ,size = 2, fill = "#1b9e77", color = "black", alpha = .5, shape = 21
  )+
  geom_point(
    data  = d2 %>% filter(days_since_1death == 1)
    ,size = 2, fill = "#d95f02", color = "black", alpha = .5, shape = 21
  )+
  geom_vline(xintercept = 58, linetype = "dotted",)+
  geom_vline(xintercept = 75, linetype = "dashed", alpha = .5)+
  geom_vline(xintercept = 100, linetype = "dashed", color = "red", alpha = .5)+
  scale_x_continuous(breaks = seq(0,100, 50))+
  labs(
    title = "Timeline of COVID-19: Cumulative Cases"
    ,y = "Cumulative Cases (in thousands)"
    ,x = "Days since the first confirmed case outside of China"
    ,caption = "(first dot) = 1st confirmed case, (second dot) = 1st confirmed death,
    (dotted line) = pandemic announced by WHO, (dashed lines) = 75 and 100th day since Exodus"
  )
g2

# you can overwrite the mapping to plot a different measure:
g2 +
  aes(y = n_deaths_cum_per_1m)+
  labs(
    y = "Cumulative Deaths per 1 mil"
    ,title = "Timeline of COVID-19: Cumulative Deaths per 1 million"
  )

Goal 3

Similarly with the plot from Goal 2, let us first sketch the essential components

d3 <-
  ds_covid_timeline %>%
  filter(days_since_1case == 0) %>%
  filter(oecd) %>%
  mutate(
    country_label = forcats::fct_reorder(country_label, days_since_exodus),
  )
g3dev <-
  d3 %>%
  ggplot(aes(x = date, y = country_label))+
  geom_point(color = "deepskyblue")+
  geom_point(aes(x = date_of_1death), color = "tomato")+
  geom_segment(aes(yend = country_label, xend = date_of_1death))+
  geom_text(aes(label = country_code2, x = date_of_1death), hjust = -1)
g3dev

and then enhance the appearance:

g3 <- ds_covid_timeline %>%
  filter(oecd) %>%
  filter(days_since_1case == 0) %>%
  mutate(
    country_label                = forcats::fct_reorder(country_label, days_since_exodus)
  ) %>%
  ggplot(aes(x = date, y = country_label))+
  geom_point(shape = 21, size =2, alpha = .6, fill = "#1b9e77")+
  geom_point(aes(x = date_of_1death), shape = 21, size =2, alpha = .6, fill = "#d95f02")+
  geom_segment(aes(yend = country_label, xend = date_of_1death, color = "red"))+
  geom_text(aes(label = country_code2, x = date_of_1death), hjust = -1, size = 3, color = "grey60")+
  # scale_x_continuous(breaks = seq(0,140, 20))+
  guides(color = F)+
  labs(
    title = "COVID Timeline: Days to 1st Case and 1st Death"
    ,x = "Date", y = NULL
    ,caption = "(green dot) = 1st confirmed case, (orange dot) = 1st confirmed death"
  )
g3

Exercises

To help you further you mastery of these graphical forms consider the following exercises:

Using the graph from Goal 1, re-design the interactive highlights so that they depicts both the trajectory line and the dots for temporal anchors ( day of first case and day of first death). Hint: you can no longer filter within the geom, so you need to create a new variable before tranforming the dataset with highlight_key.
Re-draw the graph in Goal 2 with a different operationalization of the relative timeline: Days since 5 daily new deaths first reported.
Redraw the graph in Goal 3 to sort the countries according to the descending distance between the day of the first case and the day of the first death.

session information

For the sake of documentation and reproducibility, the current report was rendered in the following environment. Click the line below to expand.

Environment

- Session info -------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.6.3 (2020-02-29)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RTerm                       
 language (EN)                        
 collate  English_United States.1252  
 ctype    English_United States.1252  
 tz       America/New_York            
 date     2020-06-02                  

- Packages -----------------------------------------------------------------------------------------------------------
 package      * version date       lib source        
 assertthat     0.2.1   2019-03-21 [1] CRAN (R 3.6.2)
 backports      1.1.5   2019-10-02 [1] CRAN (R 3.6.1)
 callr          3.4.3   2020-03-28 [1] CRAN (R 3.6.3)
 cli            2.0.2   2020-02-28 [1] CRAN (R 3.6.3)
 codetools      0.2-16  2018-12-24 [2] CRAN (R 3.6.3)
 colorspace     1.4-1   2019-03-18 [1] CRAN (R 3.6.1)
 config         0.3     2018-03-27 [1] CRAN (R 3.6.3)
 crayon         1.3.4   2017-09-16 [1] CRAN (R 3.6.2)
 crosstalk    * 1.0.0   2016-12-21 [1] CRAN (R 3.6.2)
 data.table     1.12.8  2019-12-09 [1] CRAN (R 3.6.2)
 desc           1.2.0   2018-05-01 [1] CRAN (R 3.6.2)
 devtools       2.3.0   2020-04-10 [1] CRAN (R 3.6.3)
 digest         0.6.25  2020-02-23 [1] CRAN (R 3.6.3)
 dplyr        * 0.8.5   2020-03-07 [1] CRAN (R 3.6.3)
 ellipsis       0.3.0   2019-09-20 [1] CRAN (R 3.6.2)
 evaluate       0.14    2019-05-28 [1] CRAN (R 3.6.2)
 fansi          0.4.1   2020-01-08 [1] CRAN (R 3.6.2)
 farver         2.0.3   2020-01-16 [1] CRAN (R 3.6.2)
 fastmap        1.0.1   2019-10-08 [1] CRAN (R 3.6.2)
 forcats        0.4.0   2019-02-17 [1] CRAN (R 3.6.2)
 fs             1.3.1   2019-05-06 [1] CRAN (R 3.6.2)
 generics       0.0.2   2018-11-29 [1] CRAN (R 3.6.2)
 ggplot2      * 3.2.1   2019-08-10 [1] CRAN (R 3.6.2)
 glue           1.4.0   2020-04-03 [1] CRAN (R 3.6.3)
 gtable         0.3.0   2019-03-25 [1] CRAN (R 3.6.2)
 highr          0.8     2019-03-20 [1] CRAN (R 3.6.2)
 hms            0.5.3   2020-01-08 [1] CRAN (R 3.6.2)
 htmltools      0.4.0   2019-10-04 [1] CRAN (R 3.6.2)
 htmlwidgets    1.5.1   2019-10-08 [1] CRAN (R 3.6.2)
 httpuv         1.5.2   2019-09-11 [1] CRAN (R 3.6.2)
 httr           1.4.1   2019-08-05 [1] CRAN (R 3.6.2)
 jsonlite       1.6.1   2020-02-02 [1] CRAN (R 3.6.2)
 kableExtra     1.1.0   2019-03-16 [1] CRAN (R 3.6.3)
 knitr        * 1.28    2020-02-06 [1] CRAN (R 3.6.2)
 labeling       0.3     2014-08-23 [1] CRAN (R 3.6.0)
 later          1.0.0   2019-10-04 [1] CRAN (R 3.6.2)
 lazyeval       0.2.2   2019-03-15 [1] CRAN (R 3.6.2)
 lifecycle      0.2.0   2020-03-06 [1] CRAN (R 3.6.3)
 lubridate    * 1.7.8   2020-04-06 [1] CRAN (R 3.6.3)
 magrittr     * 1.5     2014-11-22 [1] CRAN (R 3.6.2)
 memoise        1.1.0   2017-04-21 [1] CRAN (R 3.6.2)
 mime           0.9     2020-02-04 [1] CRAN (R 3.6.2)
 munsell        0.5.0   2018-06-12 [1] CRAN (R 3.6.2)
 pillar         1.4.3   2019-12-20 [1] CRAN (R 3.6.2)
 pkgbuild       1.0.6   2019-10-09 [1] CRAN (R 3.6.2)
 pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 3.6.2)
 pkgload        1.0.2   2018-10-29 [1] CRAN (R 3.6.2)
 plotly       * 4.9.2   2020-02-12 [1] CRAN (R 3.6.2)
 prettyunits    1.1.1   2020-01-24 [1] CRAN (R 3.6.2)
 processx       3.4.2   2020-02-09 [1] CRAN (R 3.6.2)
 promises       1.1.0   2019-10-04 [1] CRAN (R 3.6.2)
 ps             1.3.2   2020-02-13 [1] CRAN (R 3.6.2)
 purrr          0.3.4   2020-04-17 [1] CRAN (R 3.6.3)
 R6             2.4.1   2019-11-12 [1] CRAN (R 3.6.2)
 RColorBrewer   1.1-2   2014-12-07 [1] CRAN (R 3.6.0)
 Rcpp           1.0.4.6 2020-04-09 [1] CRAN (R 3.6.3)
 readr          1.3.1   2018-12-21 [1] CRAN (R 3.6.2)
 remotes        2.1.1   2020-02-15 [1] CRAN (R 3.6.2)
 rlang          0.4.5   2020-03-01 [1] CRAN (R 3.6.3)
 rmarkdown      2.1     2020-01-20 [1] CRAN (R 3.6.2)
 rprojroot      1.3-2   2018-01-03 [1] CRAN (R 3.6.2)
 rstudioapi     0.11    2020-02-07 [1] CRAN (R 3.6.2)
 rvest          0.3.5   2019-11-08 [1] CRAN (R 3.6.2)
 scales         1.1.0   2019-11-18 [1] CRAN (R 3.6.2)
 sessioninfo    1.1.1   2018-11-05 [1] CRAN (R 3.6.2)
 shiny          1.4.0   2019-10-10 [1] CRAN (R 3.6.2)
 stringi        1.4.6   2020-02-17 [1] CRAN (R 3.6.2)
 stringr        1.4.0   2019-02-10 [1] CRAN (R 3.6.2)
 testthat       2.3.2   2020-03-02 [1] CRAN (R 3.6.3)
 tibble         3.0.1   2020-04-20 [1] CRAN (R 3.6.3)
 tidyr          1.0.2   2020-01-24 [1] CRAN (R 3.6.2)
 tidyselect     1.0.0   2020-01-27 [1] CRAN (R 3.6.2)
 usethis        1.6.0   2020-04-09 [1] CRAN (R 3.6.3)
 utf8           1.1.4   2018-05-24 [1] CRAN (R 3.6.2)
 vctrs          0.2.4   2020-03-10 [1] CRAN (R 3.6.3)
 viridisLite    0.3.0   2018-02-01 [1] CRAN (R 3.6.2)
 webshot        0.5.2   2019-11-22 [1] CRAN (R 3.6.3)
 withr          2.1.2   2018-03-15 [1] CRAN (R 3.6.2)
 xfun           0.12    2020-01-13 [1] CRAN (R 3.6.2)
 xml2           1.2.2   2019-08-09 [1] CRAN (R 3.6.2)
 xtable         1.8-4   2019-04-21 [1] CRAN (R 3.6.2)
 yaml           2.2.1   2020-02-01 [1] CRAN (R 3.6.2)

[1] C:/Users/an499583/Documents/R/win-library/3.6
[2] C:/Users/an499583/Documents/R/R-3.6.3/library

Visualising the Pandemic

Comparing trajectories of COVID-19 across countries

Andriy Koval

June 2, 2020

TL;DR

Learning Objectives

Visualization Goals

Goal 1 - Absolute, Interactive

Goal 2 - Relative, faceted

Goal 3 - Temporal sequence

Environment

Data

Data Tweaks

Goal 1

Goal 2

1. Compute comulative cases (Running total)

2. Create onset markers (which values exceed a chosen threshold?)

3. First case marker (what row is the first to exceed a chosen threshold?)

4. Date of first case (what is the date associated with exceeding a chosen threshold?)

5. Number of days since 1st case (how many days have passed between current date and the date of exceeding a threshold?)

Reprex

1. Compute comulative cases (Running total)

2. Create onset markers (which values exceed a chosen threshold?)

3. First case marker (what row is the first to exceed a chosen threshold?)

4. Date of first case (what is the date associated with exceeding a chosen threshold?)

5. Number of days since 1st case (how many days have passed between current date and the date of exceeding a threshold?)

Re-write for brevity

Compute relative timelines

Goal 2 graph

Goal 3

Exercises

session information