We’re going to be downloading economic data from the FRED API. This will require that you first create a user account and then register a personal API key.
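Once you’ve registered, you’ll need to tell R about your key. Here’s a minimal sketch of two common approaches, assuming your key is the (obviously fake) string "abcd1234":

```r
## Option 1: Set the key for the current R session only
fredr::fredr_set_key("abcd1234")

## Option 2: Store it permanently by adding the line FRED_API_KEY=abcd1234
## to your .Renviron file. The usethis package makes editing that file easy.
usethis::edit_r_environ()
```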
Today I’ll be using JSONView, a browser extension that renders JSON output nicely in Chrome and Firefox. (Not required, but recommended.)
Here’s a convenient way to install (if necessary) and load all of the above packages.
```r
## Load and install the packages that we'll be using today
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, httr, lubridate, hrbrthemes, janitor, jsonlite, fredr,
               listviewer, usethis)
## My preferred ggplot2 plotting theme (optional)
theme_set(hrbrthemes::theme_ipsum())
```
During the last lecture, we saw that websites and web applications fall into two categories: 1) server-side and 2) client-side. We then practiced scraping data that falls into the first category (i.e. rendered server-side) using the rvest package, a technique that relies on CSS selectors (with help from SelectorGadget) and HTML tags. We also saw that webscraping often involves as much art as science: the plethora of CSS options and the flexibility of HTML itself mean that steps which work perfectly on one website can easily fail on another.
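As a quick refresher, a minimal version of that server-side workflow looks something like the following sketch. The URL and CSS selector here are just placeholders:

```r
library(rvest)

## Read the (server-side rendered) HTML of a page...
html = read_html("https://en.wikipedia.org/wiki/Main_Page")

## ... and extract the text of all elements matching a CSS selector
html %>%
  html_elements("h2") %>%
  html_text2()
```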
Today we focus on the second category: scraping web data that is rendered client-side. The good news is that, when available, this approach typically makes it much easier to scrape data from the web. The downside is that, again, it can involve as much art as science. Moreover, as I emphasised last time, just because we can scrape data doesn’t mean that we should (i.e. ethical, legal and other considerations apply). These admonishments aside, let’s proceed…
Recall that websites or applications built using a client-side framework typically involve something like the following steps:

1. You visit a URL that contains a template of static content (HTML tables, CSS, etc.), but no actual data.
2. Loading the page triggers one or more requests to a separate server or database, asking for the data needed to populate the page.
3. Upon receiving the response(s), your browser renders the data within the static template.
All of this requesting, responding and rendering takes place through the host application’s API (or application programming interface). Time for a student presentation to go over APIs in more depth…
If you’re new to APIs or reading this after the fact, then I recommend this excellent resource from Zapier: An Introduction to APIs. It’s fairly in-depth, but you don’t need to work through the whole thing to get the gist. The summary version is that an API is really just a collection of rules and methods that allow different software applications to interact and share information. This includes not only web servers and browsers, but also software packages like the R libraries we’ve been using.1 Key concepts include:
- Methods: The “verbs” that clients use to talk to a server. The main method we’ll be using is `GET` (i.e. ask a server to retrieve information), but other common methods are `POST`, `PUT` and `DELETE`.

A key point in all of this is that, in the case of web APIs, we can access information directly from the API database if we can specify the correct URL(s). These URLs are known as API endpoints.
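To make this concrete, here’s a minimal sketch of a GET request from R using the httr package. The endpoint and query parameters below are hypothetical placeholders:

```r
library(httr)

## Send a GET request to a (hypothetical) API endpoint. Query parameters are
## passed as a named list and appended to the URL as ?series=gdp&format=json
resp = GET(
  url = "https://api.example.com/data",
  query = list(series = "gdp", format = "json")
)

## Check the HTTP status of the response (we're hoping for 200, i.e. success)
http_status(resp)

## Extract the content of the response as a text string
content(resp, as = "text")
```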
API endpoints are in many ways similar to the normal website URLs that we’re all used to visiting. For starters, you can navigate to them in your web browser. However, whereas normal websites display information in rich HTML content — pictures, cat videos, nice formatting, etc. — an API endpoint is much less visually appealing. Navigate your browser to an API endpoint and you’ll just see a load of seemingly unformatted text. In truth, what you’re really seeing is (probably) either JSON (JavaScript Object Notation) or XML (Extensible Markup Language).
You don’t need to worry too much about the syntax of JSON and XML. The important thing is that the object in your browser (that load of seemingly unformatted text) is actually very precisely structured and formatted. Moreover, it contains valuable information that we can easily read into R (or Python, Julia, etc.). We just need to know the right API endpoint for the data that we want.
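For example, assuming we know a valid JSON endpoint, reading it into R is essentially a one-liner with the jsonlite package. (The URL below is a placeholder.)

```r
library(jsonlite)

## fromJSON() accepts a URL (or file path, or raw string) and converts the
## JSON that it finds there into native R objects (data frames, lists, etc.)
dat = fromJSON("https://api.example.com/data.json")
str(dat)
```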
Let’s practice doing this through a few example applications. I’ll start with the simplest case (no API key required, explicit API endpoint) and then work through some more complicated examples.
NYC Open Data is a pretty amazing initiative. Its mission is to “make the wealth of public data generated by various New York City agencies and other City organizations available for public use”. You can get data on everything from arrest data, to the location of wi-fi hotspots, to city job postings, to homeless population counts, to dog licenses, to a directory of toilets in public parks… The list goes on. I highly encourage you to explore in your own time, but we’re going to do something “earthy” for this first application: Download a sample of tree data from the 2015 NYC Street Tree Census.
I wanted to begin with an example from NYC Open Data, because you don’t need to set up an API key in advance.2 All you need to do is complete the following steps:

1. Navigate to the landing page for the 2015 Street Tree Census data set on the NYC Open Data portal.
2. Click on the API tab.
3. Copy the API endpoint that appears in the popup box.
4. Optional: Paste that endpoint into a new browser tab to see the raw JSON output (rendered nicely if you have JSONView installed).
Here’s a GIF of me completing these steps:
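Once you’ve copied the endpoint, reading the data into R should look something like the sketch below. Use whatever endpoint you copied in step 3; I believe the resource ID shown here is the one for the 2015 Street Tree Census, but treat it as illustrative.

```r
library(jsonlite)
library(tidyverse)

## Read the tree data from the copied API endpoint. Note that Socrata-based
## endpoints like this one return a maximum of 1,000 rows per request by
## default, so this gives us a sample rather than the full census.
nyc_trees =
  fromJSON("https://data.cityofnewyork.us/resource/uvpi-gqnh.json") %>%
  as_tibble()
nyc_trees
```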