Now we know some useful functions in tidyverse
package, how to generate a data (tibble), we are all set for some simple analysis using observational data. Today you will use various tidyverse
functions to describe drug overdoses in the US between 1999 and 2017.
The data file is 04-Introduction_tidyverse_data.csv
. You can find it on Canvas under Files
then Labs
. Download the file and save it somewhere on your computer.1
The data include drug overdose deaths from the CDC and poverty and unemployment rates from the University of Kentucky Poverty Research Center.
Import the data using read_csv
:
You can preview the structure of the data with glimpse
:
## Rows: 969
## Columns: 9
## $ stname <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califor…
## $ year <dbl> 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999…
## $ deaths <dbl> 169, 46, 511, 113, 2662, 349, 310, 50, 48, 997, 283,…
## $ population <dbl> 4430143, 624781, 5023824, 2651860, 33499204, 4226018…
## $ division <chr> "East South Central", "Pacific", "Mountain", "West S…
## $ region <chr> "South", "West", "West", "South", "West", "West", "N…
## $ stabbr <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC"…
## $ poverty_rate <dbl> 15.2, 7.6, 12.2, 14.7, 14.0, 8.5, 7.2, 10.4, 14.7, 1…
## $ unemployment_rate <dbl> 4.7, 6.5, 4.4, 4.6, 5.2, 3.1, 2.9, 3.4, 6.4, 3.9, 3.…
Review of the %>%
operator
state_data %>% mutate(death_rate = deaths/population*100000)
as “take the object state_data
and then give it to mutate()
.”%>%
means “and then.”Summary statistics can give you a quick sense of the central tendency and spread of the variables in your data. Common summary statistics include the mean, median, min, max, and standard deviation.
You can produce summary statistics using the summarize
function:
# start with summary stats for overdose death rates
state_data %>%
summarize(min = min(death_rate),
max = max(death_rate),
median = median(death_rate),
mean = mean (death_rate),
sd = sd(death_rate))
## # A tibble: 1 × 5
## min max median mean sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.82 53.6 11.6 12.6 6.69
What’s the minimum overdose death rate? What’s the maximum overdose death rate?
The code above produces summary statistics for a single variable. It is common practice to generate summary stats for all of the numerical variables in your dataset. One way you could do this is to copy-paste the code above and replace death_rate
a different a variable until you have summary stats for each variable. Generally speaking, this is bad practice. Constantly repeating yourself in your code can 1) propagate errors and 2) make it difficult to scale-up your code later (e.g., including additional variables).
For repetitive tasks, you can define functions:
summ_fun <- function(x){
state_data %>%
summarize(min = min(x),
max = max(x),
median = median(x),
mean = mean (x),
sd = sd(x))
}
Functions require inputs. The x
is the input to the function defined above. It serves as a placeholder for a variable name from state_data
.
You can use summ_fun
to obtain summary statistics for each variable:
## # A tibble: 1 × 5
## min max median mean sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.82 53.6 11.6 12.6 6.69
## # A tibble: 1 × 5
## min max median mean sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4.5 23.1 12.1 12.6 3.38
## # A tibble: 1 × 5
## min max median mean sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2.3 13.7 5.2 5.64 1.97
## # A tibble: 1 × 5
## min max median mean sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 491780 39536653 4146101 5946471. 6667352.
## # A tibble: 1 × 5
## min max median mean sd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1999 2017 2008 2008 5.48
Putting state_data$
before the variable name lets R know where to look for the variable. For example, state_data$death_rate
says “hey R, look for death_rate
inside the object state_data
.”
We are going to use ggplot()
function, aka grammar of graphics included in ggplot2
package. At this point you don’t need to load ggplot2
package explicitly as it’s already loaded when you’ve loaded tidyverse
package.
?ggplot
# ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())
# data argument: Default dataset to use for plot
# mapping argument: Default list of aesthetic mappings to use for plot
# If you want to specify the x axis and y axis, you could do so by
ggplot(data = state_data, mapping=aes(x = year, y = death_rate))
# Nothing appears! You want to add the geoms -- visual marks that present data points
# Here's one with representing data with points
# Note that you want to layer up the plot by using `+` not `%>%`.
ggplot(data = state_data, mapping=aes(x = year, y = death_rate)) +
geom_point()
# Here's one with representing data with line
ggplot(data = state_data, mapping=aes(x = year, y = death_rate)) +
geom_line()
# You could layer up multiple geoms in one plot by connecting them with `+`.
# Following code layered the `geom_point` and `geom_line` on the plot.
ggplot(data = state_data, mapping=aes(x = year, y = death_rate)) +
geom_point() + geom_line()
# What if you want to label x-axis and y-axis and also add title to the plot?
# You add another layer of `labs` which is short for `labels`
ggplot(data = state_data, mapping=aes(x = year, y = death_rate)) +
geom_point() + geom_line() +
labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)")
# We can use pipe operator here by pulling out the first argument of ggplot function,
# place it in front of `ggplot()`, then connect it with `%>%`.
state_data %>% ggplot(data = ., mapping=aes(x = year, y = death_rate)) +
geom_point() + geom_line() +
labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)")
# If you want to color differently by region, specify color argument inside `aes()`.
# Then add the grouping variable
state_data %>% ggplot(data = ., mapping=aes(x = year, y = death_rate, color=region)) +
geom_point() + geom_line() +
labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)")
# Something didn't quite work though.
# It looks like `ggplot` connected all the dots
# when you wanted it to connect the dots by state.
# You can fix this with the `color` argument in `aes`.
# What if we group the data by states instead of region?
state_data %>% ggplot(data = ., mapping=aes(x = year, y = death_rate, color=stname)) +
geom_point() + geom_line() +
labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)")
# We can easily create different plots for each region using `facet_wrap()`.
# The tilde '~' can be read as "by".
# This means that we facet *by* region.
# Notice that we changed `color=stname` to `group=stname`.
# This just groups the data by `stname` but does not color differently by each state.
state_data %>% ggplot(data = ., mapping=aes(x = year, y = death_rate, group=stname)) +
geom_point() + geom_line() +
labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)") +
facet_wrap(~region)
While the plot looks like spaghetti, you can see that drug overdose death rates increase over time. There is an especially pronounced increase in a handful of states starting around 2010. What do you think is driving these increases?
To get a better sense of where these changes are happening, you can facet your plot by region using facet_wrap()
:
state_data %>%
ggplot(aes(x = year, y = death_rate, group = stname)) +
geom_point()+
geom_line() +
facet_wrap(~division)+
labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)")
division
refers to census divisions.
You can change the appearance of your plot by specifying a theme:
state_data %>%
ggplot(aes(x = year, y = death_rate, group = stname)) +
geom_line() +
facet_wrap(~division) +
labs(title = "Can you spot the opioid crisis?", x = "Year", y = "Drug Overdose Death Rate (per 100,000)") +
theme_bw()
What about national trends? To examine national overdose death rates over time, you can aggregate the state data using group_by
and summarize
. Now notice that since we are aggregating up the death rate, we are taking the weighted average where the weight here corresponds to population of each state to calculate the national death rate:
Do overdoses correlate with economic conditions?
state_data %>%
ggplot(aes(x = poverty_rate, y = death_rate)) +
geom_point() +
labs(title = "Deaths of despair?", x = "Poverty Rate (%)", y = "Drug Overdose Death Rate (per 100,000)") +
theme_bw()
state_data %>%
ggplot(aes(x = unemployment_rate, y = death_rate)) +
geom_point() +
labs(title = "Deaths of despair?", x = "Unemployment Rate (%)", y = "Drug Overdose Death Rate (per 100,000)") +
theme_bw()
You can calculate correlation coefficients using cor
:
## [1] 0.2167742
## [1] 0.2065104
Are the correlations between economic conditions and overdose deaths positive or negative? How strong are the correlations? Are they relatively strong? Relatively weak?
There are some pitfalls to examining bivariate relationships with data from several different years. You will learn about these in EC 421. For now, you can sidestep some of these issues by focusing your attention on the most recent year in the sample period. Using the filter
function, you can restrict the sample to observations from 2017:
Note the use of ==
instead of =
. ==
is for logical comparisons while =
for assignment.2
state_data_2017 %>%
ggplot(aes(x = poverty_rate, y = death_rate)) +
geom_point() +
labs(title = "Deaths of despair?", x = "Poverty Rate (%)", y = "Drug Overdose Death Rate (per 100,000)", caption = "Note: 2017 data only.") +
theme_bw()
state_data_2017 %>%
ggplot(aes(x = unemployment_rate, y = death_rate)) +
geom_point() +
labs(title = "Deaths of despair?", x = "Unemployment Rate (%)", y = "Drug Overdose Death Rate (per 100,000)", caption = "Note: 2017 data only.") +
theme_bw()
## [1] 0.0552481
## [1] 0.3593693
Do the correlations change when you restrict the sample to 2017?
In case you were wondering, the deaths of despair explanation for the opioid crisis doesn’t really hold up to closer scrutiny.
lm()
Now let’s look at how to create a fitted linear line that minimizes the sum of residuals on plots.
# Add another layer `geom_smooth(method="lm", se=FALSE)`
state_data_2017 %>%
ggplot(aes(x = poverty_rate, y = death_rate)) +
geom_point() +
labs(title = "Deaths of despair?", x = "Poverty Rate (%)", y = "Drug Overdose Death Rate (per 100,000)", caption = "Note: 2017 data only.") +
theme_bw() +
geom_smooth(method="lm", se=FALSE)
Now using the formula covered in class, we will regenerate the intercept and slope coefficient estimates from OLS regression.
Slope coefficient
\[\hat{\beta}_1 = \dfrac{\sum_{i=1}^n (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^n (X_i - \bar{X})^2}\]
Intercept
\[ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]
# You can use `lm()` to perform ordinary least squares regression analysis.
?lm()
# "A typical model has the form response ~ terms
# where response is the (numeric) response vector
# and terms is a series of terms which specifies a linear predictor for response."
# In other words, the dependent variable comes first.
# Then we use `~` to separate it with independent variables.
# If we were to run a regression when the outcome variable is death_rate
# and the independent variable is poverty_rate, then it looks as follows:
lm(death_rate ~ poverty_rate, data=state_data_2017) %>% summary()
##
## Call:
## lm(formula = death_rate ~ poverty_rate, data = state_data_2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.1801 -8.3651 -0.8157 6.5827 30.1764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.0379 6.3711 3.145 0.00282 **
## poverty_rate 0.1979 0.5110 0.387 0.70019
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.52 on 49 degrees of freedom
## Multiple R-squared: 0.003052, Adjusted R-squared: -0.01729
## F-statistic: 0.15 on 1 and 49 DF, p-value: 0.7002
# Intercept estimate is reported to be 20.0379 using lm function
# Slope estimate is reported to be 0.1979 using lm function
# Let's assign the dependent variable, death_rate as an object called y
y=state_data_2017$death_rate
# Let's assign the treatment variable, poverty_rate as an object called x
x=state_data_2017$poverty_rate
# slope estimate
beta1_hat= sum((y-mean(y))*(x-mean(x)))/sum((x-mean(x))^2)
beta1_hat
## [1] 0.1979355
## [1] 20.03793
Please open up the 04-Exercise.R
and fill out your answer for each question.
Unsolicited life advice: You should make an EC_320
folder where you can store your work in one place. Also, you should pick file and folder names without spaces. Why? R often chokes on spaces in file names when loading data. In place of spaces, you can use underscores (_
) or dashes (-
).↩︎
Some people use =
as an alternative to the assignment operator <-
. If you want, you can too. Just be consistent.↩︎