Today we’ll be using SelectorGadget, which is a Chrome extension that makes it easy to discover CSS selectors. (Install the extension directly here.) Please note that SelectorGadget is only available for Chrome. If you prefer using Firefox, then you can try ScrapeMate.
Recall that rvest was automatically installed with the rest of the tidyverse. However, these lecture notes assume that you have rvest 1.0.0, which (at the time of writing) has to be installed as the development version from GitHub. The code chunk below should take care of installing (if necessary) and loading the packages that you need for today’s lecture.
## Install development version of rvest if necessary
if (numeric_version(packageVersion("rvest")) < numeric_version('0.99.0')) {
  remotes::install_github('tidyverse/rvest')
}
## Load and install the packages that we'll be using today
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, rvest, lubridate, janitor, data.table, hrbrthemes)
## My preferred ggplot2 plotting theme (optional)
theme_set(hrbrthemes::theme_ipsum())
Tip: If you get an error about missing fonts whilst following along with this lecture, that’s probably because you don’t have Arial Narrow (required by the hrbrthemes::theme_ipsum() ggplot2 theme that I’m using here) installed on your system. You can resolve this by downloading the font and adding it to your font book (Google it), or by switching to a different theme (e.g. theme_set(theme_minimal())).
The next two lectures are about getting data, or “content”, off the web and onto our computers. We’re all used to seeing this content in our browsers (Chrome, Firefox, etc.). So we know that it must exist somewhere. However, it’s important to realise that there are actually two ways that web content gets rendered in a browser: server-side and client-side.
You can read here for more details (including example scripts), but for our purposes the essential features are as follows:
Over the next two lectures, we’ll go over the main differences between the two approaches and cover the implications for any webscraping activity. I want to forewarn you that webscraping typically involves a fair bit of detective work. You will often have to adjust your steps according to the type of data you want, and the steps that worked on one website may not work on another. (Or even on the same website a few months later.) All this is to say that webscraping involves as much art as it does science.
The good news is that both server-side and client-side websites allow for webscraping.1 If you can see it in your browser, you can scrape it.
The previous sentence elides some important ethical considerations. Just because you can scrape it, doesn’t mean you should. Now, I first have to tell you that this paragraph used to contain a warning about the legal restrictions pertaining to webscraping activity. I’ve decided to drop that warning in the wake of the landmark hiQ Labs vs LinkedIn court ruling. (Short version: It is currently legal to scrape data from the web using automated tools, as long as the data are publicly available.) However, it’s still important to realise that the tools we’ll be using over these next two lectures are very powerful. A computer can process commands much, much faster than we can ever type them up manually. It’s pretty easy to write up a function or program that can overwhelm a host server or application through the sheer weight of requests. Or, just as likely, the host server has built-in safeguards that will block you in case of a suspected malicious attack. We’ll return to the “be nice” mantra at the end of this lecture, as well as in the next lecture.
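To make the “sheer weight of requests” point concrete, here is a minimal sketch of one way to space requests out. The helper name fetch_politely() and its arguments are my own invention for illustration (they are not part of rvest), and the one-second default pause is an arbitrary assumption rather than a recommended value.

```r
## Hypothetical helper (not part of rvest): apply `fetch` to each URL in turn,
## pausing for `pause` seconds between consecutive requests.
fetch_politely = function(urls, fetch = rvest::read_html, pause = 1) {
  pages = vector("list", length(urls))
  for (i in seq_along(urls)) {
    pages[[i]] = fetch(urls[[i]])
    ## No need to wait after the final request
    if (i < length(urls)) Sys.sleep(pause)
  }
  pages
}
```

You would call it as something like fetch_politely(my_urls, pause = 2), where my_urls is a character vector of pages you want to read. Passing the fetching function as an argument is just a convenience, so that the pacing logic can be tried out without touching a real server.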
The primary R package that we’ll be using today is rvest (link), a simple webscraping library inspired by Python’s Beautiful Soup (link), but with extra tidyverse functionality. rvest is designed to work with webpages that are built server-side and thus requires knowledge of the relevant CSS selectors… Which means that now is probably a good time for us to cover what these are.
Time for a student presentation on CSS (i.e. Cascading Style Sheets) and SelectorGadget. Click on the links if you are reading this after the fact. In short, CSS is a language for specifying the appearance of HTML documents (including web pages). It does this by providing web browsers a set of display rules, which are formed by:
The key point is that if you can identify the CSS selector(s) of the content you want, then you can isolate it from the rest of the webpage content that you don’t want. This is where SelectorGadget comes in. We’ll work through an extended example (with a twist!) below, but I highly recommend looking over this quick vignette before proceeding.
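As a quick illustration of what “isolating content with a selector” looks like in practice, here is a small sketch that runs entirely offline. The HTML snippet and the class name "note" are made up for the example; the functions minimal_html(), html_elements() and html_text2() are from rvest 1.0.0.

```r
library(rvest)

## A tiny made-up page: a header and two paragraphs, one with class "note"
page = minimal_html('
  <h1>Sprint records</h1>
  <p class="note">Times are in seconds.</p>
  <p>Wind-assisted marks are excluded.</p>
')

## Element selector: every <p> element on the page
page %>% html_elements("p") %>% html_text2()

## Class selector: only the elements with class "note"
page %>% html_elements(".note") %>% html_text2()
```

The first call returns the text of both paragraphs, while the second returns only the paragraph tagged with the "note" class. SelectorGadget's job is to help you find the equivalent of ".note" on a real page.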
Okay, let’s get to an application. Say that we want to scrape the Wikipedia page on the Men’s 100 metres world record progression.
First, open up this page in your browser. Take a look at its structure: What type of objects does it contain? How many tables does it have? Do these tables all share the same columns? Are there any row- and column-spans? Etc.
Once you’ve familiarised yourself with the structure, read the whole page into R using the rvest::read_html() function.
# library(rvest) ## Already loaded
m100 = read_html("http://en.wikipedia.org/wiki/Men%27s_100_metres_world_record_progression")
m100
## {html_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject ...
As you can see, this is an XML document2 that contains everything needed to render the Wikipedia page. It’s kind of like viewing someone’s entire LaTeX document (preamble, syntax, etc.) when all we want are the data from some tables in their paper.
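If you want to get a feel for pulling tables out of a document like this before we turn to the real page, the sketch below uses a small stand-in document rather than the live Wikipedia page, so it runs without an internet connection. The table contents are made-up placeholders, not real records.

```r
library(rvest)

## Stand-in for a scraped page: some text plus two small tables (made-up data)
doc = minimal_html('
  <p>Some introductory text.</p>
  <table><tr><th>Time</th><th>Athlete</th></tr><tr><td>10.8</td><td>Runner A</td></tr></table>
  <table><tr><th>Time</th><th>Athlete</th></tr><tr><td>10.6</td><td>Runner B</td></tr></table>
')

## How many <table> elements does the document contain?
tables = html_elements(doc, "table")
length(tables)

## Parse just the first table into a data frame
html_table(tables[[1]])
```

The same two calls should work if you swap doc for the m100 object above, although a generic "table" selector will grab every table on the page rather than the specific one we are after.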
Let’s start by scraping the first table on the page, which documents the unofficial progression before the IAAF. The first thing we need to do is identify the table’s unique CSS selector. Here’s a GIF of me using SelectorGadget to do that.