Software requirements

External software

Today we’ll be using SelectorGadget, which is a Chrome extension that makes it easy to discover CSS selectors. (Install the extension directly here.) Please note that SelectorGadget is only available for Chrome. If you prefer using Firefox, then you can try ScrapeMate.

R packages

  • New: rvest, janitor
  • Already used: tidyverse, lubridate, data.table, hrbrthemes

Recall that rvest was automatically installed with the rest of the tidyverse. However, these lecture notes assume that you have rvest 1.0.0, which — at the time of writing — has to be installed as the development version from GitHub. The code chunk below should take care of installing (if necessary) and loading the packages that you need for today’s lecture.

## Install development version of rvest if necessary
if (numeric_version(packageVersion("rvest")) < numeric_version('0.99.0')) {
  remotes::install_github('tidyverse/rvest')
}
## Load and install the packages that we'll be using today
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, rvest, lubridate, janitor, data.table, hrbrthemes)
## My preferred ggplot2 plotting theme (optional)
theme_set(hrbrthemes::theme_ipsum())

Tip: If you get an error about missing fonts whilst following along with this lecture, that’s probably because you don’t have Arial Narrow — required by the hrbrthemes::theme_ipsum() ggplot2 theme that I’m using here — installed on your system. You can resolve this by downloading the font and adding it to your font book (Google it), or by switching to a different theme (e.g. theme_set(theme_minimal())).

Webscraping basics

The next two lectures are about getting data, or “content”, off the web and onto our computers. We’re all used to seeing this content in our browsers (Chrome, Firefox, etc.). So we know that it must exist somewhere. However, it’s important to realise that there are actually two ways that web content gets rendered in a browser:

  1. Server-side
  2. Client-side

You can read here for more details (including example scripts), but for our purposes the essential features are as follows:

1. Server-side

  • The scripts that “build” the website are not run on our computer, but rather on a host server that sends down all of the HTML code.
    • E.g. Wikipedia tables are already populated with all of the information — numbers, dates, etc. — that we see in our browser.
  • In other words, the information that we see in our browser has already been processed by the host server.
  • You can think of this information as being embedded directly in the webpage’s HTML.
  • Webscraping challenges: Finding the correct CSS (or Xpath) “selectors”. Iterating through dynamic webpages (e.g. “Next page” and “Show More” tabs).
  • Key concepts: CSS, Xpath, HTML

2. Client-side

  • The website contains an empty template of HTML and CSS.
    • E.g. It might contain a “skeleton” table without any values.
  • However, when we actually visit the page URL, our browser sends a request to the host server.
  • If everything is okay (e.g. our request is valid), then the server sends a response script, which our browser executes and uses to populate the HTML template with the specific information that we want.
  • Webscraping challenges: Finding the “API endpoints” can be tricky, since these are sometimes hidden from view.
  • Key concepts: APIs, API endpoints
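
To make the client-side idea a little more concrete, here’s a minimal sketch of requesting data directly from an API endpoint. It uses the jsonlite package (not part of today’s load list, so install it separately if needed) and GitHub’s public API as an illustration; you’ll need an internet connection, and we’ll cover how to find endpoints properly next lecture.

```r
# library(jsonlite) ## Not loaded above; install.packages("jsonlite") if needed

## Instead of parsing rendered HTML, we ask the server for the raw data.
## GitHub's public API returns JSON, which fromJSON() parses into an R list.
hadley = jsonlite::fromJSON("https://api.github.com/users/hadley")
hadley$name
## [1] "Hadley Wickham"
```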

Over the next two lectures, we’ll go over the main differences between the two approaches and cover the implications for any webscraping activity. I want to forewarn you that webscraping typically involves a fair bit of detective work. You will often have to adjust your steps according to the type of data you want, and the steps that worked on one website may not work on another. (Or may not even work on the same website a few months later.) All this is to say that webscraping involves as much art as it does science.

The good news is that both server-side and client-side websites allow for webscraping.1 If you can see it in your browser, you can scrape it.

Webscraping with rvest (server-side)

The primary R package that we’ll be using today is rvest (link), a simple webscraping library inspired by Python’s Beautiful Soup (link), but with extra tidyverse functionality. rvest is designed to work with webpages that are built server-side and thus requires knowledge of the relevant CSS selectors… Which means that now is probably a good time for us to cover what these are.

Student presentation: CSS and SelectorGadget

Time for a student presentation on CSS (i.e. Cascading Style Sheets) and SelectorGadget. Click on the links if you are reading this after the fact. In short, CSS is a language for specifying the appearance of HTML documents (including web pages). It does this by providing web browsers with a set of display rules, which are formed by:

  1. Properties. CSS properties are the “how” of the display rules. These are things like which font family, styles and colours to use, page width, etc.
  2. Selectors. CSS selectors are the “what” of the display rules. They identify which rules should be applied to which elements. E.g. Text elements matched by the “h1” selector (i.e. top-level headers) are usually larger and displayed more prominently than text elements matched by “h2” (i.e. sub-headers).

The key point is that if you can identify the CSS selector(s) of the content you want, then you can isolate it from the rest of the webpage content that you don’t want. This is where SelectorGadget comes in. We’ll work through an extended example (with a twist!) below, but I highly recommend looking over this quick vignette before proceeding.
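
To see the selector distinction in action, here’s a small self-contained example that uses rvest’s minimal_html() helper to build a toy document in memory (no webpage needed). The element names and the "intro" class are made up for illustration.

```r
# library(rvest) ## Already loaded

## A toy HTML document, built in memory with rvest's minimal_html() helper
page = minimal_html('
  <h1>Report</h1>
  <p class="intro">Summary text.</p>
  <p>Body text.</p>
')

## Type selector: "p" matches every <p> element
page %>% html_elements("p") %>% html_text()
## [1] "Summary text." "Body text."

## Class selector: the leading "." restricts the match to class="intro"
page %>% html_elements(".intro") %>% html_text()
## [1] "Summary text."
```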

Application 1: Wikipedia

Okay, let’s get to an application. Say that we want to scrape the Wikipedia page on the Men’s 100 metres world record progression.

First, open up this page in your browser. Take a look at its structure: What type of objects does it contain? How many tables does it have? Do these tables all share the same columns? Do any cells span multiple rows or columns? Etc.

Once you’ve familiarised yourself with the structure, read the whole page into R using the rvest::read_html() function.

# library(rvest) ## Already loaded

m100 = read_html("http://en.wikipedia.org/wiki/Men%27s_100_metres_world_record_progression") 
m100
## {html_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...

As you can see, this is an XML document2 that contains everything needed to render the Wikipedia page. It’s kind of like viewing someone’s entire LaTeX document (preamble, syntax, etc.) when all we want are the data from some tables in their paper.

Table 1: Pre-IAAF (1881–1912)

Let’s start by scraping the first table on the page, which documents the unofficial progression before the IAAF. The first thing we need to do is identify the table’s unique CSS selector. Here’s a GIF of me using SelectorGadget to do that.
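
Once you have a selector in hand, the workflow is: pass it to html_element() to isolate the node, then convert it with html_table(). As a stand-in for whatever (more precise) selector SelectorGadget identifies, the generic "table" selector below simply grabs the first table on the page.

```r
pre_iaaf = m100 %>%
  html_element("table") %>% ## "table" is a stand-in selector; it matches the
                            ## first <table> element on the page
  html_table()              ## Convert the node to a tibble
pre_iaaf
```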