tidyverse
The tidyverse
is a powerful collection of packages that facilitate data wrangling and visualization. Today you will use various tidyverse
functions to describe drug overdoses in the US between 1999 and 2017.
tidyverse
Use the p_load
function to install and load the tidyverse
:
library(pacman)
p_load(tidyverse)
The data file is 02-Introduction_tidyverse_data.csv
. You can find it on Canvas under Files
then Labs
. Download the file and save it somewhere on your computer.1
The data include drug overdose deaths from the CDC and poverty and unemployment rates from the University of Kentucky Poverty Research Center.
Import the data using read_csv
:
state_data <- read_csv("02-Introduction_tidyverse_data.csv")
You can preview the structure of the data with glimpse
:
glimpse(state_data)
## Observations: 969
## Variables: 9
## $ stname <chr> "Alabama", "Alaska", "Arizona", "Arkansas", ...
## $ year <dbl> 1999, 1999, 1999, 1999, 1999, 1999, 1999, 19...
## $ deaths <dbl> 169, 46, 511, 113, 2662, 349, 310, 50, 48, 9...
## $ population <dbl> 4430143, 624781, 5023824, 2651860, 33499204,...
## $ division <chr> "East South Central", "Pacific", "Mountain",...
## $ region <chr> "South", "West", "West", "South", "West", "W...
## $ stabbr <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "D...
## $ poverty_rate <dbl> 15.2, 7.6, 12.2, 14.7, 14.0, 8.5, 7.2, 10.4,...
## $ unemployment_rate <dbl> 4.7, 6.5, 4.4, 4.6, 5.2, 3.1, 2.9, 3.4, 6.4,...
The glimpse
output shows that drug overdose deaths (deaths
) are counts. Use the mutate
function to create a new variable that expresses overdoses deaths as rates (deaths per 100,000 people):
state_data <- state_data %>%
mutate(death_rate = deaths/population*100000)
What’s up with %>%
?
state_data %>% mutate(death_rate = deaths/population*100000)
as “take the object state_data
and then give it to mutate()
.”%>%
means “and then.”Summary statistics can give you a quick sense of the central tendency and spread of the variables in your data. Common summary statistics include the mean, median, min, max, and standard deviation.
You can produce summary statistics using the summarize
function:
# start with summary stats for overdose death rates
state_data %>%
summarize(min = min(death_rate),
max = max(death_rate),
median = median(death_rate),
mean = mean (death_rate),
sd = sd(death_rate))
## # A tibble: 1 x 5
## min max median mean sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.82 53.6 11.6 12.6 6.69
What’s the minimum overdose death rate? What’s the maximum overdose death rate?
The code above produces summary statistics for a single variable. It is common practice to generate summary stats for all of the numerical variables in your dataset. One way you could do this is to copy-paste the code above and replace death_rate
a different a variable until you have summary stats for each variable. Generally speaking, this is bad practice. Constantly repeating yourself in your code can 1) propagate errors and 2) make it difficult to scale-up your code later (e.g., including additional variables).
For repetitive tasks, you can define functions:
summ_fun <- function(x){
state_data %>%
summarize(min = min(x),
max = max(x),
median = median(x),
mean = mean (x),
sd = sd(x))
}
Functions require inputs. The x
is the input to the function defined above. It serves as a placeholder for a variable name from state_data
.
You can use summ_fun
to obtain summary statistics for each variable:
summ_fun(state_data$death_rate)
## # A tibble: 1 x 5
## min max median mean sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.82 53.6 11.6 12.6 6.69
summ_fun(state_data$poverty_rate)
## # A tibble: 1 x 5
## min max median mean sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4.5 23.1 12.1 12.6 3.38
summ_fun(state_data$unemployment_rate)
## # A tibble: 1 x 5
## min max median mean sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2.3 13.7 5.2 5.64 1.97
summ_fun(state_data$population)
## # A tibble: 1 x 5
## min max median mean sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 491780 39536653 4146101 5946471. 6667352.
summ_fun(state_data$year)
## # A tibble: 1 x 5
## min max median mean sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1999 2017 2008 2008 5.48
Putting state_data$
before the variable name lets R know where to look for the variable. For example, state_data$death_rate
says “hey R, look for death_rate
inside the object state_data
.”
How have overdose death rates changes over time in each state?
You can make a time series plot using ggplot
with geom_line
:
state_data %>%
ggplot(aes(x = year, y = death_rate)) +
geom_line()
Okay. Something didn’t quite work. It looks like ggplot
connected all the dots when you wanted it to connect the dots by state. You can fix this with the group
argument in aes
:
state_data %>%
ggplot(aes(x = year, y = death_rate, group = stname)) +
geom_line()
While the plot looks like spaghetti, you can see that drug overdose death rates increase over time. There is an especially pronounced increase in a handful of states starting around 2010. What do you think is driving these increases?
To get a better sense of where these changes are happening, you can facet your plot by region using facet_wrap()
:
state_data %>%
ggplot(aes(x = year, y = death_rate, group = stname)) +
geom_line() +
facet_wrap(~division)
division
refers to census divisions.
To change the axis labels and add a title, add labs()
to your ggplot
:
state_data %>%
ggplot(aes(x = year, y = death_rate, group = stname)) +
geom_line() +
facet_wrap(~division) +
labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)")
You can change the appearance of your plot by specifying a theme:
state_data %>%
ggplot(aes(x = year, y = death_rate, group = stname)) +
geom_line() +
facet_wrap(~division) +
labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)") +
theme_bw()
What about national trends? To examine national overdose death rates over time, you can aggregate he state data using group_by
and summarize
:
state_data %>%
group_by(year) %>%
summarize(death_rate = weighted.mean(x = death_rate, w = population)) %>%
ggplot(aes(x = year, y = death_rate)) +
geom_line() +
labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)") +
theme_bw()
Do overdoses correlate with economic conditions?
state_data %>%
ggplot(aes(x = poverty_rate, y = death_rate)) +
geom_point() +
labs(title = "Deaths of despair?", x = "Poverty Rate (%)", y = "Drug Overdose Death Rate (per 100,000)") +
theme_bw()
state_data %>%
ggplot(aes(x = unemployment_rate, y = death_rate)) +
geom_point() +
labs(title = "Deaths of despair?", x = "Unemployment Rate (%)", y = "Drug Overdose Death Rate (per 100,000)") +
theme_bw()
You can calculate correlation coefficients using cor
:
cor(state_data$death_rate, state_data$poverty_rate)
## [1] 0.2167742
cor(state_data$death_rate, state_data$unemployment_rate)
## [1] 0.2065104
Are the correlations between economic conditions and overdose deaths positive or negative? How strong are the correlations? Are they relatively strong? Relatively weak?
There are some pitfalls to examining bivariate relationships with data from several different years. You will learn about these in EC 421. For now, you can sidestep some of these issues by focusing your attention on the most recent year in the sample period. Using the filter
function, you can restrict the sample to observations from 2017:
state_data_2017 <- state_data %>%
filter(year == 2017)
Note the use of ==
instead of =
. ==
is for logical comparisons while =
for assignment.2
state_data_2017 %>%
ggplot(aes(x = poverty_rate, y = death_rate)) +
geom_point() +
labs(title = "Deaths of despair?", x = "Poverty Rate (%)", y = "Drug Overdose Death Rate (per 100,000)", caption = "Note: 2017 data only.") +
theme_bw()
state_data_2017 %>%
ggplot(aes(x = unemployment_rate, y = death_rate)) +
geom_point() +
labs(title = "Deaths of despair?", x = "Unemployment Rate (%)", y = "Drug Overdose Death Rate (per 100,000)", caption = "Note: 2017 data only.") +
theme_bw()
cor(state_data_2017$death_rate, state_data_2017$poverty_rate)
## [1] 0.0552481
cor(state_data_2017$death_rate, state_data_2017$unemployment_rate)
## [1] 0.3593693
Do the correlations change when you restrict the sample to 2017?
In case you were wondering, the deaths of despair explanation for the opioid crisis doesn’t really hold up to closer scrutiny.
Unsolicited life advice: You should make an EC_320
folder where you can store your work in one place. Also, you should pick file and folder names without spaces. Why? R often chokes on spaces in file names when loading data. In place of spaces, you can use underscores (_
) or dashes (-
).↩
Some people use =
as an alternative to the assignment operator <-
. If you want, you can too. Just be consistent.↩