Simple data analysis and visualization

Now that we know some useful functions in the tidyverse package and how to generate data (a tibble), we are all set for some simple analysis using observational data. Today you will use various tidyverse functions to describe drug overdoses in the US between 1999 and 2017.

Load the tidyverse

Use the p_load function from the pacman package to install and load the tidyverse:

library(pacman)
p_load(tidyverse)
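
If pacman itself is not yet installed on your machine (an assumption about your setup), install it once first:

# Run once if the pacman package is missing
install.packages("pacman")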

Data

The data file is 04-Introduction_tidyverse_data.csv. You can find it on Canvas under Files then Labs. Download the file and save it somewhere on your computer.1

The data include drug overdose deaths from the CDC and poverty and unemployment rates from the University of Kentucky Poverty Research Center.

Import data

Import the data using read_csv:

state_data <- read_csv("04-Introduction_tidyverse_data.csv")
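
If read_csv complains that it cannot find the file, R is probably looking in a different working directory than the folder where you saved the data. A sketch of one way to sort this out (the path below is a placeholder, not a real location):

# Where is R currently looking for files?
getwd()
# Either move the CSV there, or give read_csv the full path to wherever you saved it
state_data <- read_csv("path/to/04-Introduction_tidyverse_data.csv")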

You can preview the structure of the data with glimpse:

glimpse(state_data)
## Rows: 969
## Columns: 9
## $ stname            <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califor…
## $ year              <dbl> 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999…
## $ deaths            <dbl> 169, 46, 511, 113, 2662, 349, 310, 50, 48, 997, 283,…
## $ population        <dbl> 4430143, 624781, 5023824, 2651860, 33499204, 4226018…
## $ division          <chr> "East South Central", "Pacific", "Mountain", "West S…
## $ region            <chr> "South", "West", "West", "South", "West", "West", "N…
## $ stabbr            <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC"…
## $ poverty_rate      <dbl> 15.2, 7.6, 12.2, 14.7, 14.0, 8.5, 7.2, 10.4, 14.7, 1…
## $ unemployment_rate <dbl> 4.7, 6.5, 4.4, 4.6, 5.2, 3.1, 2.9, 3.4, 6.4, 3.9, 3.…

Wrangling data

The glimpse output shows that drug overdose deaths (deaths) are counts. Use the mutate function to create a new variable that expresses overdose deaths as rates (deaths per 100,000 people):

state_data <- state_data %>% 
  mutate(death_rate = deaths/population*100000)
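
As a quick check (optional; a sketch), you can confirm that the new column was created and that the rates look sensible:

# Peek at the first few rows of the relevant columns
state_data %>% 
  select(stname, year, deaths, population, death_rate) %>% 
  head()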

Pipe operator

Review of the %>% operator

  • Called the “pipe operator.”
  • Allows you to chain commands without having to make new objects.
  • You can think of state_data %>% mutate(death_rate = deaths/population*100000) as “take the object state_data and then give it to mutate().”
  • In English, %>% means “and then.” (See the short example below.)
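
To make the equivalence concrete, here is a minimal sketch: the two lines below do exactly the same thing, one with the pipe and one without.

# With the pipe: take state_data, and then mutate it
state_data %>% mutate(death_rate = deaths/population*100000)
# Without the pipe: pass state_data as the first argument of mutate()
mutate(state_data, death_rate = deaths/population*100000)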

Summary statistics

Summary statistics can give you a quick sense of the central tendency and spread of the variables in your data. Common summary statistics include the mean, median, min, max, and standard deviation.

  • The mean describes the “typical” value of a variable.
  • Comparing the median to the mean can give you a rough sketch of the skew of a particular variable. For example, if the mean exceeds the median, then the frequency distribution of the variable may exhibit a relatively long right tail.
  • The min, max, and standard deviation describe the spread of a variable.

You can produce summary statistics using the summarize function:

# start with summary stats for overdose death rates
state_data %>% 
  summarize(min = min(death_rate),
            max = max(death_rate),
            median = median(death_rate),
            mean = mean(death_rate),
            sd = sd(death_rate))
## # A tibble: 1 × 5
##     min   max median  mean    sd
##   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1  1.82  53.6   11.6  12.6  6.69

What’s the minimum overdose death rate? What’s the maximum overdose death rate?

Writing a summary stats function

The code above produces summary statistics for a single variable. It is common practice to generate summary stats for all of the numerical variables in your dataset. One way you could do this is to copy-paste the code above and replace death_rate with a different variable until you have summary stats for each variable. Generally speaking, this is bad practice. Constantly repeating yourself in your code can 1) propagate errors and 2) make it difficult to scale up your code later (e.g., including additional variables).

For repetitive tasks, you can define functions:

summ_fun <- function(x){
  state_data %>% 
    summarize(min = min(x),
              max = max(x),
              median = median(x),
              mean = mean(x),
              sd = sd(x))
}

Functions require inputs. The x is the input to the function defined above. It serves as a placeholder for a variable (column) from state_data.

You can use summ_fun to obtain summary statistics for each variable:

summ_fun(state_data$death_rate)
## # A tibble: 1 × 5
##     min   max median  mean    sd
##   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1  1.82  53.6   11.6  12.6  6.69
summ_fun(state_data$poverty_rate)
## # A tibble: 1 × 5
##     min   max median  mean    sd
##   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1   4.5  23.1   12.1  12.6  3.38
summ_fun(state_data$unemployment_rate)
## # A tibble: 1 × 5
##     min   max median  mean    sd
##   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1   2.3  13.7    5.2  5.64  1.97
summ_fun(state_data$population)
## # A tibble: 1 × 5
##      min      max  median     mean       sd
##    <dbl>    <dbl>   <dbl>    <dbl>    <dbl>
## 1 491780 39536653 4146101 5946471. 6667352.
summ_fun(state_data$year)
## # A tibble: 1 × 5
##     min   max median  mean    sd
##   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1  1999  2017   2008  2008  5.48

Putting state_data$ before the variable name lets R know where to look for the variable. For example, state_data$death_rate says “hey R, look for death_rate inside the object state_data.”
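
If you prefer to stay inside a single pipeline, a tidyverse alternative (a sketch, assuming dplyr 1.0 or later; not required for the lab) is across(), which applies the same set of summary functions to every numeric column at once:

# Summary stats for all numeric columns in one call
state_data %>% 
  summarize(across(where(is.numeric),
                   list(min = min, max = max, median = median, mean = mean, sd = sd)))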

Plotting

How to make plots

We are going to use the ggplot() function, which implements the grammar of graphics and is included in the ggplot2 package. You don’t need to load ggplot2 explicitly here because it is loaded automatically when you load the tidyverse.

?ggplot
# ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())
# data argument: Default dataset to use for plot
# mapping argument: Default list of aesthetic mappings to use for plot
ggplot(data = state_data)

# If you want to specify the x-axis and y-axis variables, you can do so inside `aes()`:
ggplot(data = state_data, mapping=aes(x = year, y = death_rate))

# Nothing appears! You need to add geoms -- visual marks that represent data points
# Here's one representing the data with points
# Note that you layer up the plot using `+`, not `%>%`. 
ggplot(data = state_data, mapping=aes(x = year, y = death_rate)) + 
  geom_point()

# Here's one representing the data with a line
ggplot(data = state_data, mapping=aes(x = year, y = death_rate)) + 
  geom_line()

# You can layer multiple geoms in one plot by connecting them with `+`.
# The following code layers `geom_point` and `geom_line` on the same plot.
ggplot(data = state_data, mapping=aes(x = year, y = death_rate)) + 
  geom_point() + geom_line()

# What if you want to label the x-axis and y-axis and also add a title to the plot?
# You add another layer with `labs`, which is short for `labels`.
ggplot(data = state_data, mapping=aes(x = year, y = death_rate)) + 
  geom_point() + geom_line() + 
  labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)") 

# We can use the pipe operator here by pulling out the first argument of the ggplot function,
# placing it in front of `ggplot()`, and then connecting it with `%>%`.
state_data %>% ggplot(data = ., mapping=aes(x = year, y = death_rate)) + 
  geom_point() + geom_line() + 
  labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)") 

# If you want to color the points and lines differently by region,
# specify the `color` argument inside `aes()` with the grouping variable.
state_data %>% ggplot(data = ., mapping=aes(x = year, y = death_rate, color=region)) + 
  geom_point() + geom_line() + 
  labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)") 

# Something didn't quite work though. 
# It looks like `ggplot` connected all the dots within each region
# when you wanted it to connect the dots by state. 
# You can fix this by mapping the `color` argument in `aes()` to the state variable.
# What if we color (and therefore group) by state instead of region?
state_data %>% ggplot(data = ., mapping=aes(x = year, y = death_rate, color=stname)) + 
  geom_point() + geom_line() + 
  labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)") 

# We can easily create different plots for each region using `facet_wrap()`. 
# The tilde '~' can be read as "by". 
# This means that we facet *by* region. 
# Notice that we changed `color=stname` to `group=stname`.
# This groups the data by `stname` but does not color each state differently.
state_data %>% ggplot(data = ., mapping=aes(x = year, y = death_rate, group=stname)) + 
  geom_point() + geom_line() + 
  labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)") +
  facet_wrap(~region)
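
If you want to keep a plot for a problem set or paper, ggsave() writes the most recently displayed plot to a file (a quick sketch; the file name and dimensions below are just examples):

# Save the last plot you displayed; width and height are in inches by default
ggsave("overdose_deaths_by_state.png", width = 8, height = 5)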

Overdose deaths over time

While the plot looks like spaghetti, you can see that drug overdose death rates increase over time. There is an especially pronounced increase in a handful of states starting around 2010. What do you think is driving these increases?

To get a better sense of where these changes are happening, you can facet your plot by census division using facet_wrap():

state_data %>% 
  ggplot(aes(x = year, y = death_rate, group = stname)) +
  geom_point()+
  geom_line() +
  facet_wrap(~division)+
  labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)")

division refers to census divisions.
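
If you are curious which divisions fall into which regions, a quick sketch: distinct() lists each unique region and division combination in the data.

# List each unique region and division combination
state_data %>% distinct(region, division)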

You can change the appearance of your plot by specifying a theme:

state_data %>% 
  ggplot(aes(x = year, y = death_rate, group = stname)) +
  geom_line() +
  facet_wrap(~division) + 
  labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)") +
  theme_bw()

What about national trends? To examine national overdose death rates over time, you can aggregate the state data using group_by and summarize. Notice that because we are aggregating up the death rate, we take a weighted average, where each state’s weight is its population, to calculate the national death rate:

state_data %>% 
  group_by(year) %>% 
  summarize(death_rate = weighted.mean(x = death_rate, w = population)) %>% 
  ggplot(aes(x = year, y = death_rate)) +
  geom_line() +
  labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)") +
  theme_bw()

Bivariate relationships

Do overdoses correlate with economic conditions?

state_data %>% 
  ggplot(aes(x = poverty_rate, y = death_rate)) +
  geom_point() +
  labs(title = "Deaths of despair?", x = "Poverty Rate (%)", y = "Drug Overdose Death Rate (per 100,000)") +
  theme_bw()

state_data %>% 
  ggplot(aes(x = unemployment_rate, y = death_rate)) +
  geom_point() +
  labs(title = "Deaths of despair?", x = "Unemployment Rate (%)", y = "Drug Overdose Death Rate (per 100,000)") +
  theme_bw()

You can calculate correlation coefficients using cor:

cor(state_data$death_rate, state_data$poverty_rate)
## [1] 0.2167742
cor(state_data$death_rate, state_data$unemployment_rate)
## [1] 0.2065104
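
If you prefer to stay in a pipeline, a sketch of an alternative: compute both correlations inside a single summarize() call.

state_data %>% 
  summarize(cor_poverty = cor(death_rate, poverty_rate),
            cor_unemployment = cor(death_rate, unemployment_rate))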

Are the correlations between economic conditions and overdose deaths positive or negative? How strong are the correlations? Are they relatively strong? Relatively weak?

There are some pitfalls to examining bivariate relationships with data from several different years. You will learn about these in EC 421. For now, you can sidestep some of these issues by focusing your attention on the most recent year in the sample period. Using the filter function, you can restrict the sample to observations from 2017:

state_data_2017 <- state_data %>% 
  filter(year == 2017)

Note the use of == instead of =. == is for logical comparisons, while = is for assignment.2
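
A quick illustration of the difference (a minimal sketch; the object name latest_year is just illustrative):

# `==` asks a question and returns TRUE or FALSE for each element
head(state_data$year == 2017)
# `=` assigns a value, e.g., to a function argument or (for some people) an object
latest_year = 2017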

state_data_2017 %>% 
  ggplot(aes(x = poverty_rate, y = death_rate)) +
  geom_point() +
  labs(title = "Deaths of despair?", x = "Poverty Rate (%)", y = "Drug Overdose Death Rate (per 100,000)", caption = "Note: 2017 data only.") +
  theme_bw() 

state_data_2017 %>% 
  ggplot(aes(x = unemployment_rate, y = death_rate)) +
  geom_point() +
  labs(title = "Deaths of despair?", x = "Unemployment Rate (%)", y = "Drug Overdose Death Rate (per 100,000)", caption = "Note: 2017 data only.") +
  theme_bw()

cor(state_data_2017$death_rate, state_data_2017$poverty_rate)
## [1] 0.0552481
cor(state_data_2017$death_rate, state_data_2017$unemployment_rate)
## [1] 0.3593693

Do the correlations change when you restrict the sample to 2017?

In case you were wondering, the deaths of despair explanation for the opioid crisis doesn’t really hold up to closer scrutiny.

Linear Regression

Introduction of lm()

Now let’s look at how to add a fitted line to a plot: the line that minimizes the sum of squared residuals.

# Add another layer `geom_smooth(method="lm", se=FALSE)`
state_data_2017 %>% 
  ggplot(aes(x = poverty_rate, y = death_rate)) +
  geom_point() +
  labs(title = "Deaths of despair?", x = "Poverty Rate (%)", y = "Drug Overdose Death Rate (per 100,000)", caption = "Note: 2017 data only.") +
  theme_bw() +
  geom_smooth(method="lm", se=FALSE)

Now, using the formulas covered in class, we will reproduce the intercept and slope coefficient estimates from the OLS regression.

Slope coefficient

\[\hat{\beta}_1 = \dfrac{\sum_{i=1}^n (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^n (X_i - \bar{X})^2}\]

Intercept

\[ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]

# You can use `lm()` to perform ordinary least squares regression analysis. 
?lm()
# "A typical model has the form response ~ terms 
# where response is the (numeric) response vector 
# and terms is a series of terms which specifies a linear predictor for response."
# In other words, the dependent variable comes first.
# Then we use `~` to separate it from the independent variable(s).
# If we run a regression where the outcome variable is death_rate
# and the independent variable is poverty_rate, it looks as follows:
lm(death_rate ~ poverty_rate, data=state_data_2017) %>% summary()
## 
## Call:
## lm(formula = death_rate ~ poverty_rate, data = state_data_2017)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.1801  -8.3651  -0.8157   6.5827  30.1764 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   20.0379     6.3711   3.145  0.00282 **
## poverty_rate   0.1979     0.5110   0.387  0.70019   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.52 on 49 degrees of freedom
## Multiple R-squared:  0.003052,   Adjusted R-squared:  -0.01729 
## F-statistic:  0.15 on 1 and 49 DF,  p-value: 0.7002
# The intercept estimate reported by lm() is 20.0379
# The slope estimate reported by lm() is 0.1979
# Assign the dependent variable, death_rate, to an object called y
y <- state_data_2017$death_rate
# Assign the independent variable, poverty_rate, to an object called x
x <- state_data_2017$poverty_rate
# Slope estimate, using the formula above
beta1_hat <- sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)
beta1_hat
## [1] 0.1979355
# Intercept estimate, using the formula above
mean(y) - beta1_hat * mean(x)
## [1] 20.03793
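
As a cross-check (a sketch; fit is just an illustrative name), you can store the fitted model and extract the same estimates with coef(). They should match the intercept (20.0379) and slope (0.1979) reported above.

# Store the fitted model and pull out its coefficient estimates
fit <- lm(death_rate ~ poverty_rate, data = state_data_2017)
coef(fit)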

Now your turn

Please open 04-Exercise.R and fill in your answer to each question.


  1. Unsolicited life advice: You should make an EC_320 folder where you can store your work in one place. Also, you should pick file and folder names without spaces. Why? R often chokes on spaces in file names when loading data. In place of spaces, you can use underscores (_) or dashes (-).↩︎

  2. Some people use = as an alternative to the assignment operator <-. If you want, you can too. Just be consistent.↩︎