We’re going to be downloading economic data from the FRED API. This will require that you first create a user account and then register a personal API key.
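Once you’ve registered your key, a common pattern is to store it as an environment variable rather than hard-coding it into your scripts. A minimal sketch, assuming you want to use the usethis helper for editing your .Renviron file and the `FRED_API_KEY` variable name that fredr looks for:

```r
library(usethis)
library(fredr)

## Open your .Renviron file and add a line like: FRED_API_KEY="abc123..."
usethis::edit_r_environ()

## After restarting R, register the key with fredr without ever exposing
## it in your script
fredr::fredr_set_key(Sys.getenv("FRED_API_KEY"))
```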
Today I’ll be using JSONView, a browser extension that renders JSON output nicely in Chrome and Firefox. (Not required, but recommended.)
A quick note: httr2 is a successor package to httr that is currently in development. Many people recommend it, but I’m sticking with httr for now since it has fewer bells and whistles to teach. I’ll probably switch to httr2 in the next iteration of this course once the package is more stable.
Here’s a convenient way to install (if necessary) and load all of the above packages.
```r
## Load and install the packages that we'll be using today
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, httr, lubridate, hrbrthemes, janitor, jsonlite, fredr,
               listviewer, usethis, tidycensus)
## My preferred ggplot2 plotting theme (optional)
theme_set(theme_minimal())
```
Websites and web applications fall into two categories: 1) Server-side and 2) Client-side. We then practiced scraping data that falls into the first category — i.e. rendered server-side — using the rvest package. That approach relies on CSS selectors (with help from SelectorGadget) and HTML tags. We also saw that webscraping often involves as much art as science. The plethora of CSS options and the flexibility of HTML itself mean that steps which work perfectly well on one website can easily fail on another.
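To jog your memory, the server-side workflow boils down to: download the rendered HTML, then extract the bits you want via CSS selectors or tags. A minimal sketch (the Wikipedia URL here is just an illustrative target, not necessarily one we used):

```r
library(rvest)

## Server-side scraping in a nutshell: read the fully-rendered HTML page...
wiki_page = read_html("https://en.wikipedia.org/wiki/New_York_City")

## ...then extract an element of interest via its HTML tag (here, the
## page title in the <h1> tag)
wiki_page %>%
  html_element("h1") %>%
  html_text()
```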
Today we focus on the second category: Scraping web data that is rendered client-side. The good news is that, when available, this approach typically makes it much easier to scrape data from the web. The downside is that, again, it can involve as much art as it does science. Moreover, as I emphasised last time, just because we can scrape data doesn’t mean that we should (i.e. ethical, legal and other considerations). These admonishments aside, let’s proceed…
Recall that websites or applications that are built using a client-side framework typically involve something like the following steps:

1. You visit a URL, which loads a template of (mostly static) content like the page skeleton and styling.
2. As the page loads, it sends one or more requests to the host’s server asking for the actual data.
3. The server queries its database(s) and responds with the requested data.
4. Your browser renders that data into the finished page you see.
All of this requesting, responding and rendering takes place through the host application’s API (or Application Programming Interface).
A key point in all of this is that, in the case of web APIs, we can access information directly from the API database if we can specify the correct URL(s). These URLs are known as API endpoints.
API endpoints are in many ways similar to the normal website URLs that we’re all used to visiting. For starters, you can navigate to them in your web browser. However, whereas normal websites display information in rich HTML content — pictures, cat videos, nice formatting, etc. — an API endpoint is much less visually appealing. Navigate your browser to an API endpoint and you’ll just see a load of seemingly unformatted text. In truth, what you’re really seeing is (probably) either JSON (JavaScript Object Notation) or XML (Extensible Markup Language).
You don’t need to worry too much about the syntax of JSON and XML. The important thing is that the object in your browser — that load of seemingly unformatted text — is actually very precisely structured and formatted. Moreover, it contains valuable information that we can easily read into R (or Python, Julia, etc.). We just need to know the right API endpoint for the data that we want.
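To see what “precisely structured” means in practice, here’s a toy example using the jsonlite package (loaded above). The JSON string is hand-written for illustration; reading from an actual endpoint URL works exactly the same way:

```r
library(jsonlite)

## A tiny JSON array of objects...
json_string = '[{"series": "GDP", "value": 21.4}, {"series": "CPI", "value": 2.3}]'

## ...parses directly into a regular R data frame, with the object keys
## becoming column names
fromJSON(json_string)
```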
Let’s practice doing this through a few example applications. I’ll start with the simplest case (no API key required, explicit API endpoint) and then work through some more complicated examples.
NYC Open Data is a pretty amazing initiative. Its mission is to “make the wealth of public data generated by various New York City agencies and other City organizations available for public use”. You can get data on everything from arrest data, to the location of wi-fi hotspots, to city job postings, to homeless population counts, to dog licenses, to a directory of toilets in public parks… The list goes on. I highly encourage you to explore in your own time, but we’re going to do something “earthy” for this first application: Download a sample of tree data from the 2015 NYC Street Tree Census.
I wanted to begin with an example from NYC Open Data, because you don’t need to set up an API key in advance.1 All you need to do is complete the following steps:
Here’s a GIF of me completing these steps:
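Once you’ve copied the endpoint from the portal, reading it into R is essentially a one-liner. A sketch, assuming the standard Socrata API conventions — the dataset ID (`uvpi-gqnh`) and the `$limit` query parameter below are my best guesses for the tree census, so use whatever endpoint the portal actually gives you:

```r
library(jsonlite)

## Read a small sample of rows straight from the API endpoint into an R
## data frame. (The dataset ID and $limit parameter are assumptions; copy
## the exact endpoint URL from the NYC Open Data portal.)
nyc_trees = fromJSON("https://data.cityofnewyork.us/resource/uvpi-gqnh.json?$limit=5")
```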