Web Scraping
Collecting Data from the Web using R
After talking quite a bit about web data in the last session, today’s session is dedicated to data collection - from the web!
What we will cover:
- scraping static webpages
- scraping multiple static webpages
- API calls
- building up and maintaining your own original sets of web-based data
What we will not cover (today):
- scraping dynamic webpages
Why webscrape with R? 🌍
Web scraping broadly includes:
- getting (unstructured) data from the web, and
- bringing it into shape (e.g. cleaning it, getting it into tabular format).
Why web scrape? While some influential people consider “Data Scientist” 👩💻 to be the sexiest job of the 21st century (congratulations!), one of the sexiest emerging academic disciplines is Computational Social Science (CSS). Why is that?
- data abundance online
- social interaction online
- services track social behavior
Online data are a very promising source of insights for you as a data scientist for the common good.
BUT online data are usually meant for display, not for (clean) download!
Luckily, with R we can automate the whole pipeline of downloading, parsing, and post-processing to make our projects easily reproducible.
In general, remember that the basic workflow for scraping static webpages is: parse the page source, extract the elements of interest, and post-process them into a tidy format.
Scraping static sites with rvest
🚜
Who doesn’t love Wikipedia? Let’s use this as our first, straightforward test case. Let’s take a look at the Cologne page.
📝 To keep in mind
For illustrative purposes, in this tutorial we will parse the page source directly from the live webpage. You should know that the best practice to ensure reproducibility would be to download the HTML file locally. In that way, you can avoid issues arising from changes in the content or structure of the source.
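For reference, that local-download approach could look roughly like this (the file name is arbitrary):
# download the page once and store it locally for reproducibility ...
download.file("https://en.wikipedia.org/wiki/Cologne", destfile = "cologne.html")
# ... and parse the local copy from then on
parsed_local <- rvest::read_html("cologne.html")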
Step 1. Load the packages rvest and stringr.
Step 2. Parse the page source. If encoding is necessary, you can add it in the reading html step.
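A minimal sketch of these two steps (assuming the English Wikipedia page on Cologne as our source), which creates the parsed_url object used throughout this section:
# Step 1: load the packages
library(rvest)
library(stringr)
# Step 2: parse the page source (pass an encoding via the encoding argument if needed)
parsed_url <- rvest::read_html("https://en.wikipedia.org/wiki/Cologne")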
Step 3. Extract information.
## [1] "The Cologne carnival is one of the largest street festivals in Europe. In Cologne, the carnival season officially starts on 11 November at 11 minutes past 11 a.m. with the proclamation of the new Carnival Season, and continues until Ash Wednesday. However, the so-called \"Tolle Tage\" (crazy days) do not start until Weiberfastnacht (Women's Carnival) or, in dialect, Wieverfastelovend, the Thursday before Ash Wednesday, which is the beginning of the street carnival. Zülpicher Strasse and its surroundings, Neumarkt square, Heumarkt and all bars and pubs in the city are crowded with people in costumes dancing and drinking in the streets. Hundreds of thousands of visitors flock to Cologne during this time. Generally, around a million people celebrate in the streets on the Thursday before Ash Wednesday.[73]"
There are many ways to get the same content:
# Method 1: Semantic/content-based selection
# Finds the h3 heading with id "Carnival", then grabs the paragraph that follows it
# More robust because it relies on the content structure (headings and their IDs)
# Will still work even if Wikipedia adds/removes other content on the page
parsed_url |>
rvest::html_element(xpath = '//div[contains(@class, "mw-heading mw-heading3")]//h3[@id="Carnival"]/following::p') |>
rvest::html_text()
## [1] "The Cologne carnival is one of the largest street festivals in Europe. In Cologne, the carnival season officially starts on 11 November at 11 minutes past 11 a.m. with the proclamation of the new Carnival Season, and continues until Ash Wednesday. However, the so-called \"Tolle Tage\" (crazy days) do not start until Weiberfastnacht (Women's Carnival) or, in dialect, Wieverfastelovend, the Thursday before Ash Wednesday, which is the beginning of the street carnival. Zülpicher Strasse and its surroundings, Neumarkt square, Heumarkt and all bars and pubs in the city are crowded with people in costumes dancing and drinking in the streets. Hundreds of thousands of visitors flock to Cologne during this time. Generally, around a million people celebrate in the streets on the Thursday before Ash Wednesday.[73]"
# Method 2: Absolute path selection
# Goes to the exact 76th paragraph in a very specific location in the HTML tree
# Super fragile - breaks if Wikipedia changes anything above this paragraph
# Like giving directions "take the 76th door" instead of "go to the room labeled Carnival"
parsed_url |>
rvest::html_element(xpath = "/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/p[76]") |>
rvest::html_text()
## [1] "The Cologne carnival is one of the largest street festivals in Europe. In Cologne, the carnival season officially starts on 11 November at 11 minutes past 11 a.m. with the proclamation of the new Carnival Season, and continues until Ash Wednesday. However, the so-called \"Tolle Tage\" (crazy days) do not start until Weiberfastnacht (Women's Carnival) or, in dialect, Wieverfastelovend, the Thursday before Ash Wednesday, which is the beginning of the street carnival. Zülpicher Strasse and its surroundings, Neumarkt square, Heumarkt and all bars and pubs in the city are crowded with people in costumes dancing and drinking in the streets. Hundreds of thousands of visitors flock to Cologne during this time. Generally, around a million people celebrate in the streets on the Thursday before Ash Wednesday.[73]"
How can we draft our queries? 🤔
Here we present two ways:
- Manually inspecting the source code
- Using selector tools (e.g., Selector Gadget)
Option 1. On your page of interest, go to a section/table that you’d like to scrape. Our favorite browser for web scraping is Google Chrome, but others work as well. On Chrome, go to View > Developer > Inspect Elements, or right click on the element you are interested in and select Inspect.
If you hover over the code on the right, you should see boxes of different colors framing different elements of the page. Once the part of the page you would like to scrape is selected, right click on the HTML code and choose Copy > Copy XPath. That’s it.
Let’s try it with the Cologne Wikipedia page.
parsed_url |>
rvest::html_elements(xpath = '//*[@id="mw-content-text"]/div[1]/p[76]') |>
rvest::html_text()
## [1] "The Cologne carnival is one of the largest street festivals in Europe. In Cologne, the carnival season officially starts on 11 November at 11 minutes past 11 a.m. with the proclamation of the new Carnival Season, and continues until Ash Wednesday. However, the so-called \"Tolle Tage\" (crazy days) do not start until Weiberfastnacht (Women's Carnival) or, in dialect, Wieverfastelovend, the Thursday before Ash Wednesday, which is the beginning of the street carnival. Zülpicher Strasse and its surroundings, Neumarkt square, Heumarkt and all bars and pubs in the city are crowded with people in costumes dancing and drinking in the streets. Hundreds of thousands of visitors flock to Cologne during this time. Generally, around a million people celebrate in the streets on the Thursday before Ash Wednesday.[73]"
Option 2. SelectorGadget. You can download the Chrome extension SelectorGadget and activate it while browsing the page you’d like to scrape from. You will see a selection box moving with your cursor. You select an element by clicking on it. It turns green, as does all other content that would be selected with the current XPath.
Now click on any irrelevant elements to deselect them (they’ll turn red when deselected). Scroll through the entire page to make sure you’re only selecting what you actually want. Once you’re happy with your selection, click the XPath button at the bottom of the SelectorGadget window to get the XPath expression.
Important: Use single quotation marks when you paste this XPath into your R code, since the XPath itself contains double quotes.
parsed_url |>
rvest::html_elements(xpath = '//p[(((count(preceding-sibling::*) + 1) = 179) and parent::*)]') |>
rvest::html_text()
## [1] "The Cologne carnival is one of the largest street festivals in Europe. In Cologne, the carnival season officially starts on 11 November at 11 minutes past 11 a.m. with the proclamation of the new Carnival Season, and continues until Ash Wednesday. However, the so-called \"Tolle Tage\" (crazy days) do not start until Weiberfastnacht (Women's Carnival) or, in dialect, Wieverfastelovend, the Thursday before Ash Wednesday, which is the beginning of the street carnival. Zülpicher Strasse and its surroundings, Neumarkt square, Heumarkt and all bars and pubs in the city are crowded with people in costumes dancing and drinking in the streets. Hundreds of thousands of visitors flock to Cologne during this time. Generally, around a million people celebrate in the streets on the Thursday before Ash Wednesday.[73]"
Exercise 1 🏋
Your task is to extract the second paragraph from the “Roman Cologne” section of the Wikipedia page on Cologne any way you like. Use the parsed_url object we created earlier.
Scraping HTML tables 🚀
Oftentimes, we would like to scrape tabular data from the web.
Let’s take a look at the List of Spotify streaming records Wikipedia entry.
# load html
url_p <- rvest::read_html("https://en.wikipedia.org/wiki/List_of_Spotify_streaming_records")
# extract table
spotify_table_raw <- rvest::html_table(url_p, header = T) |> # extracts all <table> elements from page
purrr::pluck(1) |> # get table in nth place (1)
janitor::clean_names() # clean the table names
# inspect the table we got
head(spotify_table_raw)
## # A tibble: 6 × 6
## rank song artist_s streams_billions release_date ref
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 "\"Blinding Lights\"" The Weeknd 5.057 29 November… [1]
## 2 2 "\"Shape of You\"" Ed Sheeran 4.579 6 January 2… [2]
## 3 3 "\"Starboy\"" The Weeknd … 4.125 21 Septembe… [3]
## 4 4 "\"Someone You Loved\"" Lewis Capal… 4.075 8 November … [4]
## 5 5 "\"Sweater Weather\"" The Neighbo… 4.066 3 December … [5]
## 6 6 "\"As It Was\"" Harry Styles 4.060 1 April 2022 [6]
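It is also worth looking at the bottom of the raw table, e.g. with tail(); as the output below shows, the last row is an “As of …” note rather than actual data:
# inspect the bottom of the raw table
tail(spotify_table_raw)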
## # A tibble: 6 × 6
## rank song artist_s streams_billions release_date ref
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 96 "\"Clean Ba… Dream S… 2.430 29 April 20… ""
## 2 97 "\"Sweet Ch… Guns N'… 2.428 3 June 1988 "[96…
## 3 98 "\"Levitati… Dua Lip… 2.425 1 October 2… "[97…
## 4 99 "\"Creep\"" Radiohe… 2.423 21 Septembe… ""
## 5 100 "\"Jocelyn … XXXTent… 2.420 31 October … "[98…
## 6 As of 5 October 2025 "As of 5 Oc… As of 5… As of 5 October… As of 5 Oct… "As …
# clean up table a bit
spotify_table <- spotify_table_raw |>
dplyr::mutate(song = stringr::str_remove_all(song, '\"'), # remove quotation marks from song titles
streams_billions = as.numeric(streams_billions) # make stream_billions a numeric variable
) |>
dplyr::slice(1:100) |> # drop last row
dplyr::select(-ref) # drop ref col
head(spotify_table)
## # A tibble: 6 × 5
## rank song artist_s streams_billions release_date
## <chr> <chr> <chr> <dbl> <chr>
## 1 1 Blinding Lights The Weeknd 5.06 29 November…
## 2 2 Shape of You Ed Sheeran 4.58 6 January 2…
## 3 3 Starboy The Weeknd and Daft Punk 4.12 21 Septembe…
## 4 4 Someone You Loved Lewis Capaldi 4.08 8 November …
## 5 5 Sweater Weather The Neighbourhood 4.07 3 December …
## 6 6 As It Was Harry Styles 4.06 1 April 2022
Exercise 2
Can you tell us:
- What does it take to be within the top-100 most streamed songs on Spotify? (i.e., how many streams?)
- Which artists have the highest number of most streamed songs? (Note that the artist_s string might contain multiple unique artists)
Bonus at home 🏠
- Which artists have the most cumulative streams?
Scraping multiple pages 🤖
Whenever you want to really understand what’s going on within the functions of a new R package, it is very likely that there is a relevant article published in the Journal of Statistical Software. Let’s say you are interested in how the journal was doing over the past years [2023-2024].
Step 1. Inspect the source. Basically, follow the same steps as before to extract the XPath information.
Step 2. Develop a scraping strategy. We need a set of URLs leading to all sources. Inspect the URLs of different sources and find the pattern. Then, construct the list of URLs from scratch.
## URL list build
# base
baseurl <- "http://www.jstatsoft.org/article/view/v"
# volume number
volurl <- as.character(105:110) # volumes 105 to 110
# issue number
issueurl <- c(paste0("0", 1:9), 10:12) # 01 to 12 maximum number of issues in a volume
# (there is a more efficient way of dealing with the inconsistency, but
# let's try to keep this simple)
combinations <- tidyr::expand_grid(volurl, issueurl)
#tidyr::expand_grid produces all combinations of volurl and issueurl
urls_list <- combinations |>
dplyr::mutate(url = paste0(baseurl, volurl, 'i', issueurl)) |>
dplyr::pull(url)
names_for_files <- combinations |>
dplyr::mutate(name = paste0(volurl, '_', issueurl, '.html')) |>
dplyr::pull(name)
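A quick sanity check of what we just built:
# peek at the first constructed URLs and file names
head(urls_list, 3)
head(names_for_files, 3)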
Step 3. Think about where you want your scraped material to be stored and create a directory.
here::here() # print the project root
tempwd <- here::here("session-05-web-scraping/data/jstatsoftStats")
# here::here() from the here package creates a character string that represents a path to a directory
dir.create(tempwd, recursive = TRUE)
#recursive = TRUE indicates that it should create all directories along the specified path if they don't exist
setwd(tempwd)
Step 4. Download the pages. Note that we did not do this step last time, when we were only scraping one page.
folder <- paste0(tempwd, "/html_articles/")
#concatenate the tempwd path with "/html_articles/" to create another path
dir.create(folder, recursive = TRUE)
for (i in seq_along(urls_list)) {
# only update, don't replace
if (!file.exists(paste0(folder, names_for_files[i]))) {
# skip article when we run into an error
tryCatch(
download.file(urls_list[i], destfile = paste0(folder, names_for_files[i])),
error = function(e)
e #the error object (denoted by e) is returned but not acted upon, meaning the error is essentially ignored
)
# don't kill their server --> be polite!
Sys.sleep(runif(1, 0, 1)) #one random sleep interval of 0-1 seconds
}
}
While R is downloading the pages for you, you can watch it directly in the directory you defined…
Check whether it worked.
list_files <- list.files(folder, pattern = "0.*") #list of file names that match the regex
list_files_path <- list.files(folder, pattern = "0.*", full.names = TRUE) #full file paths including directory paths of matched files
length(list_files)
Yay! Apparently, we scraped the HTML pages of 60 articles.
Step 5. Import files and parse out information. A loop is helpful here!
# define output first
authors <- character()
title <- character()
datePublish <- character()
# then run the loop
for (i in seq_along(list_files_path)) {
html_out <- rvest::read_html(list_files_path[i], encoding = "UTF-8")
authors[i] <- html_out |>
rvest::html_elements(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "authors_long", " " ))]//strong') |>
rvest::html_text2() |>
paste(collapse = ", ") # collapse multiple author names into a single string per article
title[i] <- html_out |>
rvest::html_elements(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "page-header", " " ))]') |>
rvest::html_text2()
datePublish[i] <- html_out |>
rvest::html_elements(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "article-meta", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "row", " " )) and (((count(preceding-sibling::*) + 1) = 2) and parent::*)]//*[contains(concat( " ", @class, " " ), concat( " ", "col-sm-8", " " ))]') |>
rvest::html_text2()
}
# inspect data
authors[1:3]
title[1:3]
datePublish[1:3]
# create a data frame
dat <- data.frame(authors = authors, title = title, datePublish = datePublish)
head(dat)
Step 6. Clean data…
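What exactly needs cleaning depends on what the XPath queries returned; a minimal sketch that only squishes excess whitespace (anything beyond that, such as parsing the publication date, depends on the page layout) could be:
# a minimal cleaning sketch: remove excess whitespace in all three columns
dat <- dat |>
dplyr::mutate(
authors = stringr::str_squish(authors),
title = stringr::str_squish(title),
datePublish = stringr::str_squish(datePublish)
)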
You see, scraping data from multiple pages is no problem in R. Most of the brain work often goes into developing a scraping strategy and tidying the data, not into the actual downloading/scraping part.
(Git)ignoring files 🙅
In case your scraping project is linked to GitHub (as it will be in your assignment!), it can be useful to .gitignore the folder of downloaded files. This means that the folder can be stored in your local directory of the project but will not be synced with the remote (main) repository. Here is information on how to do this using RStudio.
In GitHub Desktop it is very simple: you do your scraping work, the folder is created in your local repository, and before you commit and push these changes, you go to Repository > Repository Settings > Ignored Files and edit the .gitignore file (add the names of the new folders/files you don’t want to sync). More generally, it makes sense to exclude .Rproj files, .RData files (and other binary or large data files), draft folders and sensitive information from version control. Remember, git is built to track changes in code, not in large data files.
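For illustration, a few entries in such a .gitignore could look like this (the first line matches the download folder we created above; the others are generic patterns):
# example .gitignore entries
session-05-web-scraping/data/jstatsoftStats/
*.Rproj
.RData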
Or you can access the .gitignore file directly in your file explorer (on macOS, press Command + Shift + . to show hidden files).
On an API far, far away… ⭐
To get data from an API, we suggest following a workflow like this:
- Read the API’s documentation!
- Get the base URL
- Find out the parameters referring to the resources of interest to you
- Create a query URL from the base URL and the query parameters
- Run the GET function on the query URL
- Depending on the encoding (usually, it’s JSON), you will need to:
  - Parse the result with the content function
  - Either use jsonlite or xml2 to parse the JSON or XML files
Let’s have a look at an example, the Star Wars API:
From the API documentation, we can see that the API has six main resource types we can query:
films
people
planets
species
starships
vehicles
Let’s start by querying the films resource:
baseurl <- "https://swapi.dev/api/"
query <- 'films'
httr::GET(paste0(baseurl, query)) |> # Make API call
httr::content(as = 'text') |> # extract content as text
jsonlite::fromJSON() |> # parse the JSON text into an R list
purrr::pluck(4) |> # grab the 4th element of that list ("results")
head(1)
## title episode_id
## 1 A New Hope 4
## opening_crawl
## 1 It is a period of civil war.\r\nRebel spaceships, striking\r\nfrom a hidden base, have won\r\ntheir first victory against\r\nthe evil Galactic Empire.\r\n\r\nDuring the battle, Rebel\r\nspies managed to steal secret\r\nplans to the Empire's\r\nultimate weapon, the DEATH\r\nSTAR, an armored space\r\nstation with enough power\r\nto destroy an entire planet.\r\n\r\nPursued by the Empire's\r\nsinister agents, Princess\r\nLeia races home aboard her\r\nstarship, custodian of the\r\nstolen plans that can save her\r\npeople and restore\r\nfreedom to the galaxy....
## director producer release_date
## 1 George Lucas Gary Kurtz, Rick McCallum 1977-05-25
## characters
## 1 https://swapi.dev/api/people/1/, https://swapi.dev/api/people/2/, https://swapi.dev/api/people/3/, https://swapi.dev/api/people/4/, https://swapi.dev/api/people/5/, https://swapi.dev/api/people/6/, https://swapi.dev/api/people/7/, https://swapi.dev/api/people/8/, https://swapi.dev/api/people/9/, https://swapi.dev/api/people/10/, https://swapi.dev/api/people/12/, https://swapi.dev/api/people/13/, https://swapi.dev/api/people/14/, https://swapi.dev/api/people/15/, https://swapi.dev/api/people/16/, https://swapi.dev/api/people/18/, https://swapi.dev/api/people/19/, https://swapi.dev/api/people/81/
## planets
## 1 https://swapi.dev/api/planets/1/, https://swapi.dev/api/planets/2/, https://swapi.dev/api/planets/3/
## starships
## 1 https://swapi.dev/api/starships/2/, https://swapi.dev/api/starships/3/, https://swapi.dev/api/starships/5/, https://swapi.dev/api/starships/9/, https://swapi.dev/api/starships/10/, https://swapi.dev/api/starships/11/, https://swapi.dev/api/starships/12/, https://swapi.dev/api/starships/13/
## vehicles
## 1 https://swapi.dev/api/vehicles/4/, https://swapi.dev/api/vehicles/6/, https://swapi.dev/api/vehicles/7/, https://swapi.dev/api/vehicles/8/
## species
## 1 https://swapi.dev/api/species/1/, https://swapi.dev/api/species/2/, https://swapi.dev/api/species/3/, https://swapi.dev/api/species/4/, https://swapi.dev/api/species/5/
## created edited
## 1 2014-12-10T14:23:31.880000Z 2014-12-20T19:49:45.256000Z
## url
## 1 https://swapi.dev/api/films/1/
Now let’s query the API for people matching “Skywalker”. We’ll go through this step by step to understand the API response structure.
Step 1. Construct the query and check what type of output it returns.
query <- 'people/?search=skywalker'
httr::GET(paste0(baseurl, query)) |>
httr::http_type() # Check what output returns
## [1] "application/json"
Step 2. Look at the raw JSON response to understand its structure.
The JSON has four main elements: “count”, “next”, “previous”, and “results”. The actual data we want is in “results”, the 4th element.
# Make the API call and show the raw JSON text
raw_json <- httr::GET(paste0(baseurl, query)) |>
httr::content(as = 'text') # Get raw JSON as text
# Print the raw JSON to see the structure
cat(substr(jsonlite::prettify(raw_json), 1, 2000))
## {
## "count": 3,
## "next": null,
## "previous": null,
## "results": [
## {
## "name": "Luke Skywalker",
## "height": "172",
## "mass": "77",
## "hair_color": "blond",
## "skin_color": "fair",
## "eye_color": "blue",
## "birth_year": "19BBY",
## "gender": "male",
## "homeworld": "https://swapi.dev/api/planets/1/",
## "films": [
## "https://swapi.dev/api/films/1/",
## "https://swapi.dev/api/films/2/",
## "https://swapi.dev/api/films/3/",
## "https://swapi.dev/api/films/6/"
## ],
## "species": [
##
## ],
## "vehicles": [
## "https://swapi.dev/api/vehicles/14/",
## "https://swapi.dev/api/vehicles/30/"
## ],
## "starships": [
## "https://swapi.dev/api/starships/12/",
## "https://swapi.dev/api/starships/22/"
## ],
## "created": "2014-12-09T13:50:51.644000Z",
## "edited": "2014-12-20T21:17:56.891000Z",
## "url": "https://swapi.dev/api/people/1/"
## },
## {
## "name": "Anakin Skywalker",
## "height": "188",
## "mass": "84",
## "hair_color": "blond",
## "skin_color": "fair",
## "eye_color": "blue",
## "birth_year": "41.9BBY",
## "gender": "male",
## "homeworld": "https://swapi.dev/api/planets/1/",
## "films": [
## "https://swapi.dev/api/films/4/",
## "https://swapi.dev/api/films/5/",
## "https://swapi.dev/api/films/6/"
## ],
## "species": [
##
## ],
## "vehicles": [
## "https://swapi.dev/api/vehicles/44/",
## "https://swapi.dev/api/vehicles/46/"
## ],
## "starships": [
## "https://swapi.dev/api/starships/39/",
## "https://swapi.dev/api/starships/59/"
Step 3. Parse the results and create an R object.
# Parse the JSON
parsed_data <- httr::GET(paste0(baseurl, query)) |>
httr::content(as = 'text') |>
jsonlite::fromJSON()
# Extract the "results" element (4th position) and display
parsed_data |>
purrr::pluck(4) |> # Get the 4th element ("results")
head(3)
## name height mass hair_color skin_color eye_color birth_year
## 1 Luke Skywalker 172 77 blond fair blue 19BBY
## 2 Anakin Skywalker 188 84 blond fair blue 41.9BBY
## 3 Shmi Skywalker 163 unknown black fair brown 72BBY
## gender homeworld
## 1 male https://swapi.dev/api/planets/1/
## 2 male https://swapi.dev/api/planets/1/
## 3 female https://swapi.dev/api/planets/1/
## films
## 1 https://swapi.dev/api/films/1/, https://swapi.dev/api/films/2/, https://swapi.dev/api/films/3/, https://swapi.dev/api/films/6/
## 2 https://swapi.dev/api/films/4/, https://swapi.dev/api/films/5/, https://swapi.dev/api/films/6/
## 3 https://swapi.dev/api/films/4/, https://swapi.dev/api/films/5/
## species
## 1 NULL
## 2 NULL
## 3 NULL
## vehicles
## 1 https://swapi.dev/api/vehicles/14/, https://swapi.dev/api/vehicles/30/
## 2 https://swapi.dev/api/vehicles/44/, https://swapi.dev/api/vehicles/46/
## 3
## starships
## 1 https://swapi.dev/api/starships/12/, https://swapi.dev/api/starships/22/
## 2 https://swapi.dev/api/starships/39/, https://swapi.dev/api/starships/59/, https://swapi.dev/api/starships/65/
## 3
## created edited
## 1 2014-12-09T13:50:51.644000Z 2014-12-20T21:17:56.891000Z
## 2 2014-12-10T16:20:44.310000Z 2014-12-20T21:17:50.327000Z
## 3 2014-12-19T17:57:41.191000Z 2014-12-20T21:17:50.401000Z
## url
## 1 https://swapi.dev/api/people/1/
## 2 https://swapi.dev/api/people/11/
## 3 https://swapi.dev/api/people/43/
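As a small aside (a sketch, not part of the pipeline above), you can also pluck the results element by name rather than by position, which is a bit more robust, and keep only a few columns:
# extract "results" by name and keep selected columns
parsed_data |>
purrr::pluck("results") |>
dplyr::select(name, height, mass, birth_year)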
API keys and authentication 🔒
For many APIs, you will need to obtain an API key to retrieve data. Once you have received your API key (or token), you will also need to adapt your GET query. How exactly you do this depends a lot on the API. Let’s take a look at how to do this with the API for the US Congress. The same API key is actually valid across a number of US government institutions.
We can sign up for an API key here. The API website gives us detailed information on how to build our queries. But how do we actually authenticate ourselves with the API key? The documentation tells us that there are multiple ways to go about this: we can use HTTP basic authentication, a GET query parameter, or an HTTP header.
Here’s a quick overview of how to implement these with httr:
# sign up for API key: https://api.data.gov/signup/
#Sys.setenv(MY_API_KEY = "[YOUR API KEY GOES HERE]") # store your personal API key in R environment
# US Congress: Bills (API key passed as a GET query parameter)
baseurl <- "https://api.congress.gov/v3/" # base url (remains consistent across queries)
api_key <- glue::glue("api_key={Sys.getenv('MY_API_KEY')}") # your personal API key
query <- "bill" # query
try2 <- httr::GET(glue::glue(baseurl, query, "?", api_key)) |>
httr::content(as = "text") |>
jsonlite::fromJSON() |>
purrr::pluck(1) |>
as.data.frame()
# US Congress: Actions on a specific nomination
query <- "nomination/115/2259/actions"
httr::GET(glue::glue(baseurl, query, "?", api_key)) |>
httr::content(as = "text") |>
jsonlite::fromJSON() |>
purrr::pluck(1) |>
as.data.frame()
# US Congress: Summaries filtered by congress and bill type
query <- "summaries/117/hr?fromDateTime=2022-04-01T00:00:00Z&toDateTime=2022-04-03T00:00:00Z&sort=updateDate+desc"
httr::GET(glue::glue("{baseurl}{query}&{api_key}")) |> # another way of using glue()
httr::content(as = 'text') |>
jsonlite::fromJSON() |>
purrr::pluck(3) |>
as.data.frame()
# US Congress: Legislation sponsored by a specific congress member
query <- "member/L000174/sponsored-legislation.xml" # data can be formatted as xml too
httr::GET(glue::glue(baseurl, query, "?", api_key),
httr::add_headers("X-Api-Key" = Sys.getenv("MY_API_KEY"))) |> # API key passed via HTTP header
httr::content(as = "text") |>
xml2::read_xml() |>
xml2::xml_find_all("//sponsoredLegislation//title") |>
xml2::xml_text()
Retrieving data from APIs using httr can at times be quite tiresome. Luckily, there are many R libraries that make it much easier to retrieve data from APIs. Here is a list of ready-made R bindings to web APIs. Actually, even the Star Wars API we queried earlier has its own R package, rwars!
A note on good scraping practice
There is a set of general rules to the game:
- You take all the responsibility for your web scraping work.
- Think about the nature of the data. Does it entail sensitive information? Do not collect personal data without explicit permission.
- Take the copyright law of a country’s jurisdiction into account. If you publish data, do not infringe copyright.
- If possible, stay identifiable. Stay polite. Stay friendly. Obey the scraping etiquette.
- If in doubt, ask the author/creator/provider of data for permission—if your interest is entirely scientific, chances aren’t bad that you get data.
How do I know the scraping etiquette of a site? 🤝
Robot exclusion standards (robots.txt) are informal protocols to prohibit web robots from crawling certain content. They list which documents a crawler is allowed to access and which not. It is not a technical barrier but an ask for compliance.
They are located in the root directory of a website (e.g., https://de.wikipedia.org/robots.txt).
For example, let’s have a look at Wikipedia’s robots.txt file, which is very human readable.
General rules are listed under User-agent: *, which is the part most interesting for R-based crawlers. A universal ban for a directory looks like this: Disallow: /. Sometimes crawl delays (in seconds) are suggested: Crawl-delay: 2.
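If you want to inspect a robots.txt from within R, you can simply read it like any other text file, for example:
# read a site's robots.txt directly from R and look at the first lines
robots <- readLines("https://de.wikipedia.org/robots.txt")
head(robots, 20)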
What is “polite” scraping? 🐌
First thing would be not to scrape at a speed that causes trouble for the server. Therefore, whenever you loop over a list of URLs, add a pause at the end of each iteration, e.g., Sys.sleep(runif(1, 1, 2)): runif(1, 1, 2) draws a random number between 1 and 2, and Sys.sleep() then pauses the R script for that many seconds.
And generally, it is better practice to store data on your local drive first (download.file()), then parse (rvest::read_html()).
🌳 A footnote on this practice.
In the digital context, we often forget that our actions do have physical consequences. For example, training AI, using blockchain, and streaming videos all create considerable amounts of \(CO_2\) emissions. So does bombarding a server with requests (certainly to a much lesser extent than the previous examples), but please consider whether you have to re-run a large scraping project 100 times in order to debug things.
Furthermore, downloading massive amounts of data may arouse attention from server administrators. Assuming that you’ve got nothing to hide, you should stay identifiable beyond your IP address.
How can I stay identifiable? 👤
Option 1: Get in touch with website administrators / data owners.
Option 2: Use the HTTP header fields From and User-Agent to provide information about yourself.
url <- "http://a-totally-random-website.com"
#rvest's session() creates a session object that responds to HTTP and HTML methods.
rvest_session <- rvest::session(url,
httr::add_headers(
`From` = "my@email.com",
`User-Agent` = R.Version()$version.string # note: the header field is spelled "User-Agent"
)
)
scraped_text <- rvest_session |>
rvest::html_elements(xpath = "//p//a") |> # links inside paragraphs
rvest::html_text()
rvest’s session() creates a session object that responds to HTTP and HTML methods. Here, we provide our email address and the current R version as User-Agent information. This will pop up in the server logs: the webpage administrator has the chance to easily get in touch with you.
Acknowledgements
This tutorial drew heavily on Simon Munzert’s book Automated Data Collection with R and related course materials. We also used an example from Keith McNulty’s blog post on tidy web scraping in R. For the regex part, we used examples from the string manipulation section in Hadley Wickham’s R for Data Science book.
This script was drafted by Tom Arendt and Lisa Oswald, with contributions by Steve Kerr, Hiba Ahmad, Carmen Garro, Sebastian Ramirez-Ruiz, Killian Conyngham and Carol Sobral.