class: center, middle, inverse, title-slide .title[ # MKT615: Data Storytelling for Marketers ] .subtitle[ ## Lecture 5.3: Data Visualization ] .author[ ### Davide Proserpio ] .institute[ ### Marshall School of Business ] --- <style type="text/css"> @media print { .has-continuation { display: block !important; } } .large4 { font-size: 400% } .large2 { font-size: 200% } .small90 { font-size: 90% } .small75 { font-size: 75% } </style> # Table of contents 1. [Checklist](#check) 2. [ggplot2](#ggplot2) 3. [Aesthetic](#aesthetic) 4. [Geoms](#geoms) 5. [Layers](#layers) 6. [More things](#more) 7. [Exercises](#exe) 8. [Readings](#readings) --- name: check # Checklist We will need **ggplot2**, **gapminder**, **dplyr**, **data.table**, **ggthemes**, **gganimate**, **hrbrthemes**, **scales**, **tidyverse** --- class: inverse, center, middle name:ggplot2 # ggplot2 <!-- <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --> --- # Elements of ggplot2 [Hadley Wickham's](http://hadley.nz/) ggplot2 is one of the most popular R packages. - It also happens to be built upon some deep visualization theory: i.e. Leland Wilkinson's [*The Grammar of Graphics*](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448). There's a lot to say about ggplot2's implementation of this "grammar of graphics" approach, but the three key elements are: 1. Your plot ("the visualization") is linked to your variables ("the data") through various **aesthetic mappings**. 2. Once the aesthetic mappings are defined, you can represent your data in different ways by choosing different **geoms** (i.e. "geometric objects" like points, lines or bars). 3. You build your plot in **layers**. -- Let's review each element in turn with some actual plots. --- class: inverse, center, middle name: aesthetic # Aesthetics <!-- <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --> --- # 1. Aesthetic mappings ```r p = ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) p = p + geom_point() p ``` <img src="07-dataviz_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" /> --- # 1. Aesthetic mappings (cont.) ```r p = ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) p = p + geom_point() p ``` Focus on the top line, which contains the initialising `ggplot()` function call. This function accepts various arguments, including: - Where the data come from (i.e. `data = gapminder`). - What the aesthetic mappings are (i.e. `mapping = aes(x = gdpPercap, y = lifeExp)`). - The `aes()` function gathers together each of the aesthetic mappings used by a layer and passes them to the layer’s mapping argument. -- In this case, the aesthetic mappings here are pretty simple: They just define an x-axis (GDP per capita) and a y-axis (life expecancy). - However, aesthetics are a very powerful tool. To get a sense of the power and flexibility that comes with this approach, consider what happens if we add more aesthetics to the plot call. --- # 1. Aesthetic mappings (cont.) E.g., let's map **color** and point **size** to two variables: ```r p = ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, size = pop, color = continent)) p = p + geom_point() p ``` <img src="07-dataviz_files/figure-html/aesthetics2-1.svg" style="display: block; margin: auto;" /> -- Note that I've dropped the "data =" and "mapping =" part of the ggplot call (`ggplot2` knows the order of the arguments.) --- # 1. Aesthetic mappings (cont.) E.g., let's map **color** and point **size** to two variables: <img src="07-dataviz_files/figure-html/aesthetics22-1.svg" style="display: block; margin: auto;" /> - As you can see, ggplot automatically assigns *levels* to the aesthetics and creates a legend for each of them. - (For `x` and `y` aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label.) --- # 1. Aesthetic mappings (cont.) We can specify aesthetic mappings in the geom layer too. Doing so aesthetics are applied only locally (i.e., to the specific geometry). ```r p = ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) ## Applicable to all geoms p = p + geom_point(aes(size = pop, col = continent)) ## Applicable to this geom only p ``` <img src="07-dataviz_files/figure-html/aesthetics3-1.svg" style="display: block; margin: auto;" /> --- # 1. Aesthetic mappings (cont.) Oops. What went wrong here? ```r p = ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) p = p + geom_point(aes(size = "big", col="black")) p ``` ``` ## Warning: Using size for a discrete variable is not advised. ``` <img src="07-dataviz_files/figure-html/aesthetics4-1.svg" style="display: block; margin: auto;" /> -- **Answer: **Aesthetics must be mapped to variables, not descriptions! --- # 1. Aesthetic mappings (cont.) Note that outside of the `aes()` function, you can pass values and set the aesthetic **manually**, e.g., ```r p = ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) p = p + geom_point(size = 0.1, col = "blue") p ``` <img src="07-dataviz_files/figure-html/aesthetics44-1.svg" style="display: block; margin: auto;" /> --- class: inverse, center, middle name: geoms # Geoms <!-- <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --> --- # 2. Geoms Once your variable relationships have been defined by the aesthetic mappings, you can invoke and combine different geoms to generate different visualizations. ```r p = ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) p = p + geom_point() p = p + geom_smooth(method = "loess") p ``` ``` ## `geom_smooth()` using formula = 'y ~ x' ``` <img src="07-dataviz_files/figure-html/geoms1-1.svg" style="display: block; margin: auto;" /> --- # 2. Geoms (cont.) Aesthetics can be applied differentially across geoms. ```r p = ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) p = p + geom_point(aes(size = pop, col = continent)) p = p + geom_smooth(method = "loess") p ``` ``` ## `geom_smooth()` using formula = 'y ~ x' ``` <img src="07-dataviz_files/figure-html/geoms2-1.svg" style="display: block; margin: auto;" /> --- # 2. Geoms (cont.) The previous plot provides a good illustration of the power (or effect) that comes from assigning aesthetic mappings "globally" vs "locally" (i.e., in the individual geom layers.) - Compare: What happens if you run the below code chunk? ```r p = ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp, size = pop, col = continent)) p = p + geom_point(alpha = 0.3) p = p + geom_smooth(method = "loess") p ``` --- # 2. Geoms (cont.) You can use the same idea to specify different data for each layer. ```r p = ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) p = p + geom_point(aes(color = year)) p = p + geom_smooth( data = subset(gapminder, year == 2002), se = F) p ``` ``` ## `geom_smooth()` using method = 'loess' and formula = 'y ~ x' ``` <img src="07-dataviz_files/figure-html/geoms6-1.svg" style="display: block; margin: auto;" /> --- # 2. Geoms (cont.) Similarly, note that some geoms only accept a subset of mappings. E.g. `geom_density()` doesn't know what to do with the "y" aesthetic mapping. ```r p = ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) p + geom_density() ``` ``` ## Warning: The following aesthetics were dropped during statistical transformation: y. ## ℹ This can happen when ggplot fails to infer the correct grouping structure in ## the data. ## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical ## variable into a factor? ``` ``` ## Error in `geom_density()`: ## ! Problem while setting up geom. ## ℹ Error occurred in the 1st layer. ## Caused by error in `compute_geom_1()`: ## ! `geom_density()` requires the following missing aesthetics: y. ``` --- # 2. Geoms (cont.) We can fix that by being more careful about how we build the plot. ```r p = ggplot(data = gapminder) ## i.e. No "global" aesthetic mappings" p + geom_density(aes(x = gdpPercap, fill = continent), alpha=0.3) ``` <img src="07-dataviz_files/figure-html/geoms5-1.svg" style="display: block; margin: auto;" /> --- # 2. Geoms (cont.) - ggplot2 provides over 40 geoms, and extension packages provide even more (see https://exts.ggplot2.tidyverse.org/gallery/ for a sampling). - The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at http://rstudio.com/resources/cheatsheets. - To learn more about any single geom, use help: ?geom_smooth. --- class: inverse, center, middle name: build # Layers <!-- <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --> --- # 3. Build your plot in layers We've already seen how we can chain (or "layer") consecutive plot elements. But it bears repeating: You can build out some truly impressive complexity and transformation of your visualization through this simple layering process. - You don't have to transform your original data; ggplot2 takes care of all of that. - For example (see next slide for figure). ```r p2 = ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) + geom_point(aes(size = pop, col = continent), alpha = 0.3) + scale_color_brewer(name = "Continent", palette = "Set1") + ## Different color scale scale_size(name = "Population", labels = scales::comma) + ## Different point (i.e. legend) scale scale_x_log10(labels = scales::dollar) + ## Switch to logarithmic scale on x-axis. Use dollar units. labs(x = "\n Log (GDP per capita)", y = "Life Expectancy\n") + ## Better axis titles theme_minimal() ## Try a minimal (b&w) plot theme ``` --- # 3. Build your plot in layers (cont.) <img src="07-dataviz_files/figure-html/layers2-1.svg" style="display: block; margin: auto;" /> --- class: inverse, center, middle name: more # More things <!-- <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --> --- # Facets One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into **facets**, subplots that each display one subset of the data. ```r p = ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) p = p + geom_point() p = p + facet_wrap(~ continent, nrow = 2) p ``` <img src="07-dataviz_files/figure-html/facet-1.svg" style="display: block; margin: auto;" /> --- # Facets (cont.) To facet your plot on the combination of two variables, add facet_grid() to your plot call. ```r p = ggplot(data = subset(gapminder, year %in% c(2002,2007)), aes(x = gdpPercap, y = lifeExp)) p = p + geom_point() p = p + facet_grid( year ~ continent) p ``` <img src="07-dataviz_files/figure-html/facetgrid-1.svg" style="display: block; margin: auto;" /> --- # Facets (cont.) If you prefer to not facet in the rows or columns dimension, use a . instead of a variable name, e.g. `+ facet_grid(. ~ cyl)` (the dot is optional). ```r p = ggplot(data = subset(gapminder, year %in% c(2002,2007)), aes(x = gdpPercap, y = lifeExp)) p = p + geom_point() p = p + facet_grid( ~ continent) p ``` <img src="07-dataviz_files/figure-html/facetgrid2-1.svg" style="display: block; margin: auto;" /> --- # Statistical transformations `stat_summary` is very useful to create plots with summary stats w/o altering the original data ```r p = ggplot(data = gapminder, aes(x = year)) p = p + stat_summary(aes(y=gdpPercap), fun = "mean", geom = "point") p ``` <img src="07-dataviz_files/figure-html/summary-1.svg" style="display: block; margin: auto;" /> -- In the exercises, we are going to see a few additional ways to do statistical transformations. --- # Saving plots `ggsave` is very useful to save plots in any format. Note that for papers, the best format is pdf. ```r p = ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) p = p + geom_point() p ``` ```r ggsave("figures/test.pdf", p, h = 3, w = 5) ``` -- Note that you can also combine more than one plot in one pdf, but you need the `patchwork` library: ```r library(patchwork) p = ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) p = p + geom_point() p1 = p p2 = p ggsave("figures/side-by-side.pdf", p1+p2, h = 3, w = 5) ``` --- # There is much more We have barely scratched the surface of ggplot2's functionality... let alone talked about the entire ecosystem of packages that has been built around it. - Here's are two quick additional examples -- Note that you will need to install and load some additional packages if you want to recreate the next two figures on your own machine. A quick way to do this: ```r if (!require("pacman")) install.packages("pacman") ``` ``` ## Loading required package: pacman ``` ```r pacman::p_load(hrbrthemes, gganimate) ``` --- # There is much more (cont.) Simple extension: Use an external package theme. ```r # library(hrbrthemes) p2 + theme_modern_rc() + geom_point(aes(size = pop, col = continent), alpha = 0.2) ``` <img src="07-dataviz_files/figure-html/modern_rc_theme-1.svg" style="display: block; margin: auto;" /> --- # There is much more (cont.) Elaborate extension: Animation! ```r # library(gganimate) ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, colour = country)) + geom_point(alpha = 0.7, show.legend = FALSE) + scale_colour_manual(values = country_colors) + scale_size(range = c(2, 12)) + scale_x_log10() + facet_wrap(~continent) + # Here comes the gganimate specific bits labs(title = 'Year: {frame_time}', x = 'GDP per capita', y = 'life expectancy') + transition_time(year) + ease_aes('linear') ``` --- name: exe <!-- --- --> <!-- #More (cont.) --> <!-- ```{r ggamin2, echo=FALSE, eval=TRUE} --> <!-- ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, colour = country)) + --> <!-- geom_point(alpha = 0.7, show.legend = FALSE) + --> <!-- scale_colour_manual(values = country_colors) + --> <!-- scale_size(range = c(2, 12)) + --> <!-- scale_x_log10(labels = scales::dollar) + --> <!-- facet_wrap(~continent) + --> <!-- # Here comes the gganimate specific bits --> <!-- labs(title = 'Year: {frame_time}', x = 'Log (GDP per capita)', y = 'Life expectancy') + --> <!-- transition_time(year) + --> <!-- ease_aes('linear') --> <!-- ``` --> # Exercises Let's use the ggplot2 `mpg` dataset. What does it contain? 1. Plot the average highway miles per gallon (a dot) by type of car (class) <!-- p = ggplot(mpg, aes(x=class)) --> <!-- p = p + stat_summary(aes(y = hwy), fun = 'mean', geom = 'point') --> <!-- p --> 2. Rename x-axis label to "Type of Car", and y-axis label to "Hwy Miles/Gallon" 3. Now redo the plot but using geom_boxplot (in this case you don't need to summarize the data) and ordering the plot from low to high average hwy mileage per gallon <!-- p = ggplot(data = mpg) --> <!-- p = p + geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = mean), y = hwy)) --> <!-- p --> 4. Create a bar plot of the number of cars by type of car (class) <!-- p = ggplot(mpg, aes(x = class)) --> <!-- p = p + geom_bar() --> <!-- p --> 5. Reorder the plot above by count (highest bar first) <!-- p = ggplot(mpg, aes(x = fct_infreq(class))) --> <!-- p = p + geom_bar() --> <!-- p --> 6. Create a scatter plot of highway per miles against (x_axis) engine displacement, color the dots by type of car (class) <!-- ggplot(data = mpg) + --> <!-- geom_point(mapping = aes(x = displ, y = hwy, color = class)) --> 7. Do the plot above but create a panel for each type of car (use facets) <!-- ggplot(data = mpg) + --> <!-- geom_point(mapping = aes(x = displ, y = hwy)) + --> <!-- facet_wrap(~ class, nrow = 2) --> --- name: readings # Readings If you want to do some reading and practice on your own: - [Chapter 3](https://r4ds.had.co.nz/data-visualisation.html) of *R for Data Science* by Hadley Wickham and Garett Grolemund. - [*Data Visualization: A Practical Guide*](https://socviz.co/makeplot.html) by Kieran Healy. - [Designing ggplots](https://designing-ggplots.netlify.com) by Malcom Barrett.