` and `<span>`

- They are used to group content over lines (`<div>`, creating a block-level element) or within lines (`<span>`, creating an inline element).
- By grouping or dividing content into blocks, it's easier to identify or apply different styling to them.
- They do not change the layout themselves but work together with CSS (see later!).

.pull-left[
Example of CSS definition:
```{css, prompt = FALSE, eval = FALSE}
div.happy {
  color:pink;
  font-family:"Comic Sans MS";
  font-size:120%
}
span.happy {
  color:pink;
  font-family:"Comic Sans MS";
  font-size:120%
}
```
]

.pull-right[
In the HTML document:
```{html, prompt = FALSE, eval = FALSE}
<div class="happy">
  <p>I am a happy-styled paragraph</p>
</div>

unhappy text with <span class="happy">some happiness</span>
```
]

---

# Important tags and attributes

### Form tag `<form>`

- Allows you to incorporate HTML forms.
- The client can send information to the server via forms.
- Whenever you type something into a field or click on radio buttons in your browser, you are interacting with forms.

Example:
```{html, prompt = FALSE, eval = FALSE}
password:
```

---

# Important tags and attributes

### Table tags `<table>`, `<tr>`, `<td>`, and `<th>`

- HTML tables follow a standard architecture.
- The different tags allow defining the table as a whole, individual rows (including the heading), and cells.
- If the data is hidden in tables, scraping will be straightforward.

Example:
```{html, prompt = FALSE, eval = FALSE}
<table>
  <tr> <th>Rank</th> <th>Nominal GDP</th> <th>Name</th> </tr>
  <tr> <th></th> <th>(per capita, USD)</th> <th></th> </tr>
  <tr> <td>1</td> <td>170,373</td> <td>Liechtenstein</td> </tr>
  <tr> <td>2</td> <td>167,021</td> <td>Monaco</td> </tr>
  <tr> <td>3</td> <td>115,377</td> <td>Luxembourg</td> </tr>
  <tr> <td>4</td> <td>98,565</td> <td>Norway</td> </tr>
  <tr> <td>5</td> <td>92,682</td> <td>Qatar</td> </tr>
</table>
```

---

# More resources on HTML

.pull-left-wide[
### More HTML

- All in all there are over 100 HTML elements.
- But overall, it's still a fairly tight and easy-to-understand markup language.
- Knowing more about the rest is probably not necessary to become a good web scraper, but it helps you parse HTML documents (in your brain) more quickly.

### More resources

- Check out the excellent [MDN Web Docs](https://developer.mozilla.org/en-US/docs/Web/HTML) for an overview, which also points to additional tutorials and references.
- The [W3Schools tutorials](https://www.w3schools.com/) are also a classic.
- While you're at it, you might also want to learn about related technologies such as CSS (used to specify a webpage's appearance/layout) and JavaScript (used to enrich HTML documents with additional functionality and options to interact).
]

.pull-right-small[

]

---

# Accessing the web using your browser vs. R

.pull-left-wide[
### Using your browser to access webpages

1. You click on a link, enter a URL, run a Google query, etc.
2. The browser/your machine sends a request to the server that hosts the website.
3. The server returns a resource (often an HTML document).
4. The browser interprets the HTML and renders it in a nice fashion.

### Using R to access webpages

1. You manually specify a resource.
2. R/your machine sends a request to the server that hosts the website.
3. The server returns a resource (e.g., an HTML file).
4. R parses the HTML, but does not render it in a nice fashion.
5. It's up to you to tell R what content to extract.
]

.pull-right-small[

]

---

# Interacting with your browser

### On web browsers

- Modern browsers are complex pieces of software that take care of multiple operations while you browse the web. And they're basically all doing a good job.¹ Common operations are to retrieve resources, render and display information, and provide an interface for user-webpage interaction.
- Although our goal is to automate web data retrieval, the browser is an important tool in the web scraping workflow.

### The use of browsers for web scraping

- Give you an intuitive impression of the architecture of a webpage
- Allow you to inspect the source code
- Let you construct XPath/CSS selector expressions with plugins
- Render dynamic web content (JavaScript interpreter)

.footnote[¹ Check out this Wikipedia article on the [Browser Wars](https://en.wikipedia.org/wiki/Browser_wars) that happened in the 1990s and 2000s (yes, there was Browser War I and Browser War II - and for once Germany was not to blame) to relive some of your instructor's pains when he started to look into this "internet".]

---

# Inspecting HTML source code

.pull-left-small[

- Goal: retrieving data from a Wikipedia page on [List of tallest buildings](https://en.wikipedia.org/wiki/List_of_tallest_buildings)
- Right-click on page (anywhere)
- Select `View Page Source`
- HTML (CSS, JavaScript) code can be ugly
- But looking more closely, we find the displayed information
]

.pull-right-wide[

]

---

# Inspecting the live HTML source code with the DOM explorer

.pull-left-small[
- Goal: retrieving data from a Wikipedia page on [List of tallest buildings](https://en.wikipedia.org/wiki/List_of_tallest_buildings)
- Right-click on the element of interest
- Select `Inspect`
- The Web Developer Tools window pops up
- Corresponding part in the HTML tree is highlighted
- Interaction with the tree possible!
]

.pull-right-wide[

]

---

# When to do what with your browser

.pull-left-wide[
### When to inspect the complete page source

- Check whether data is in static source code (the search function helps!)
- For small HTML files: understand structure

### When to use the DOM explorer

- Almost always
- Particularly useful to construct XPath/CSS selector expressions
- To monitor dynamic changes in the DOM tree

### A note on browser differences

- Inspecting the source code (as shown on the previous slides) works more or less identically in Chrome and Firefox.
- In Safari, go to → `Preferences`, then → `Advanced` and select `Show Develop menu in menu bar`. This unlocks the `Show Page Source` and `Inspect` options and the Web Developer Tools.
]

.pull-right-small-center[

`Credit` [watershedcreative.com](http://watershedcreative.com/naked/html-tree.html)
]

---
class: inverse, center, middle
name: xpath

# XPath basics

---

# Accessing the DOM tree with R

### Different perspectives on HTML

- HTML documents are human-readable.
- HTML tags structure the document; together they form the DOM tree.
- **Web user perspective**: The browser interprets the code and renders the page.
- **Web scraper perspective**: Parse the document retaining the structure, use the tree/tags to locate information.

--

### HTML parsing

- Our goal is to get HTML into R while retaining the tree structure. That's similar to getting a spreadsheet into R and retaining the rectangular structure.
- HTML is human-readable, so we could also import HTML files as plain text via `readLines()`. That's a bad option though: the document's structure would not be retained.
- The `xml2` package allows us to parse XML-style documents. HTML is a close relative of XML, and `xml2` comes with a parser for it, so it works for us.
- The `rvest` package, which we will mainly use for scraping, wraps the `xml2` package, so we rarely have to load it manually.
- There is one high-level function to remember: `read_html()`. It represents the HTML in a list-style fashion.

---

# Accessing the DOM tree with R (cont.)

### Getting HTML into R

Parsing a website is straightforward:

```{r, eval = TRUE, message= FALSE}
library(rvest)
parsed_doc <- read_html("https://google.com")
parsed_doc
```

There are various functions to inspect the parsed document. They aren't really helpful - better use the browser instead if you want to dive into the HTML.

```{r, eval = FALSE, message= FALSE}
xml2::html_structure(parsed_doc)
xml2::as_list(parsed_doc)
```

---

# What's XPath?

### Definition

- Short for **XML Path Language**, another W3C standard.
- A query language for XML-based documents (including HTML).
- With XPath we can access node sets (e.g., elements, attributes) and extract content.

### Why XPath for web scraping?

- Source code of webpages (HTML) structures both layout and content.
- Not only content, but context matters!
- XPath enables us to extract content based on its location in the document (and potentially other features).
- With XPath, we can tell R to do things like:
  1. Give me all `` elements in the document!
  2. Look for all `` elements in the document and give me the third one!
  3. Extract all content in `` elements that is labelled with `class=newscontent`!

---

# Example: source code

```{html, prompt = FALSE, eval = FALSE}
Collected R wisdoms

Robert Gentleman

'What we have is nice, but we need something very different'

Source: Statistical Computing 2003, Reisensburg

Rolf Turner

'R is wonderful, but it cannot work magic'
answering a request for automatic generation of 'data from a known mean and 95% CI'

Source: R-help

The book homepage

```

---

# Example: DOM tree
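
The tree view can also be approximated from within R. The following is a minimal sketch, assuming a tiny made-up document: the markup in `mini_doc` is hypothetical and written for illustration only, it is not the `fortunes.html` file used on these slides. It parses the string and then walks down the tree one level at a time.

```{r, eval = FALSE}
library(rvest)

# Hypothetical miniature document (an assumption, not the original file)
mini_doc <- read_html('
<html>
  <head><title>Tiny page</title></head>
  <body>
    <div lang="english">
      <h1>Some Author</h1>
      <p><i>A short quote</i></p>
    </div>
  </body>
</html>')

# Walk down the tree: children of <body>, then children of the <div>
body <- html_element(mini_doc, xpath = "//body")
html_name(html_children(body))

div_children <- html_children(html_element(body, xpath = "./div"))
html_name(div_children)

# Print the whole tree as an indented outline
xml2::html_structure(mini_doc)
```

For real pages the outline quickly becomes long, which is why the browser's DOM explorer is usually the more comfortable tool.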

---

# Applying XPath on HTML in R

- Load package `rvest`
- Parse HTML document with `read_html()`

```{r, eval = TRUE, message= FALSE}
library(rvest)
parsed_doc <- read_html("materials/fortunes.html")
parsed_doc
```

- Query document using `html_elements()`
- `rvest` can process XPath queries as well as CSS selectors.
- Today, we'll focus on XPath:

```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "//div[last()]/p/i")
```

---

# Grammar of XPath

### Basic rules

1. We access nodes/elements by writing down the hierarchical structure in the DOM that locates the element set of interest.
2. A sequence of nodes is separated by `/`.
3. The easiest localization of an element is given by the absolute path (but often not the most efficient one!).
4. Apply XPath on DOM in R using `html_elements()`.

```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "//div[last()]/p/i")
```

---

# Grammar of XPath

### Absolute vs. relative paths

**Absolute paths** start at the root element and follow the whole way down to the target element (with simple slashes, `/`).

```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "/html/body/div/p/i")
```

**Relative paths** skip nodes (with double slashes, `//`).

```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "//body//p/i")
```

Relative paths are often preferable. They are faster to write and more comprehensive. On the other hand, they are less targeted and therefore potentially less robust, and running them takes more computing time, as the entire tree has to be evaluated. But that's usually not relevant for reasonably small documents.

---

# Grammar of XPath

### The wildcard operator

- Meta symbol `*`
- Matches any element
- Stands in for exactly one element, i.e. one level of the hierarchy
- Far less important than, e.g., wildcards in content-based queries (regex!)

```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "/html/body/div/*/i")
# the following does not work (the wildcard covers only one level):
html_elements(parsed_doc, xpath = "/html/body/*/i")
```

---

# Grammar of XPath

### Navigational operators `"."` and `".."`

- `"."` refers to the current element ("self axis"), which is useful when working with predicates (see later!).
- `".."` refers to the parent of the current element, i.e. one level up in the hierarchy.

```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "//title/..")
```

```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "//div[starts-with(./@id, 'R')]")
```

---

# Element (node) relations ("axes") in XPath

.pull-left[
### Family relations between elements

- The tools learned so far are sometimes not sufficient to access specific elements without accessing other, undesired elements as well.
- Family relations between elements help make a query unambiguous.
- Can be combined with other elements of the grammar
- Basic syntax: `element1/relation::element2`
- We describe the relation of `element2` to `element1`
- `element2` is to be extracted - we always extract the element at the end!
]

.pull-right[

]

---

# Element (node) relations in XPath

| Axis name | Description |
|---|---|
| `ancestor` | All ancestors (parent, grandparent etc.) of the current element |
| `ancestor-or-self` | All ancestors of the current element and the current element itself |
| `attribute` | All attributes of the current element |
| `child` | All children of the current element |
| `descendant` | All descendants (children, grandchildren etc.) of the current element |
| `descendant-or-self` | All descendants of the current element and the current element itself |
| `following` | Everything in the document after the closing tag of the current element |
| `following-sibling` | All siblings after the current element |
| `parent` | The parent of the current element |
| `preceding` | All elements that appear before the current element, except its ancestors and attribute nodes |
| `preceding-sibling` | All siblings before the current element |
| `self` | The current element |

---

# Element (node) relations in XPath

Example: access the `<div>` elements that are ancestors to an `<a>` element:

```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "//a/ancestor::div")
```

Another example: Select all `<h1>` nodes that precede a `<p>` node:

```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "//p/preceding-sibling::h1")
```

---

# Predicates

### What are predicates?

- Predicates are conditions based on an element's features (`true/false`).
- Think of them as ways to filter nodesets.
- They are applicable to a variety of features: name, value, attributes.
- Basic syntax: `element[predicate]`

Select the first `<p>` child of every `<div>` element, using a numeric predicate:

```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "//div/p[1]")
```

--

Can you find out what the following expressions do?

```{r, eval = FALSE, message= FALSE}
html_elements(parsed_doc, xpath = "//div/p[last()-1]")
html_elements(parsed_doc, xpath = "//div[count(./@*)>2]")
html_elements(parsed_doc, xpath = "//*[string-length(text())>50]")
```

---

# Predicates (cont.)

Select all `<div>` nodes that have a `date` attribute with the value `'October/2011'`, using a textual predicate:

```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "//div[@date='October/2011']")
```

Rudimentary string matching is also possible using string functions like `contains()` or `starts-with()`. (`ends-with()` belongs to XPath 2.0, which the libxml2 engine behind `xml2` does not support.)

--

Can you tell what the following calls do?

```{r, eval = FALSE, message= FALSE}
html_elements(parsed_doc, xpath = "//div[starts-with(./@id, 'R')]")
html_elements(parsed_doc, xpath = "//div[substring-after(./@date, '/')='2003']//i")
```

---

# Content extraction

- Until now, we used XPath expressions to extract complete nodes or nodesets (that is, elements with tags).
- However, in most cases we're interested in extracting the content only.
- To that end, we can use extractor functions that are applied on the output of XPath query calls.

| Function | Argument | Return value |
|---|---|---|
| `html_text()` | `trim` | Element value |
| `html_text2()` | | Element value (with a bit more cleanup) |
| `html_attr()` | `name` | Element attribute |
| `html_attrs()` | | (All) element attributes |
| `html_name()` | | Element name |
| `html_children()` | | Element children |

---

# Content extraction (cont.)

Extracting element values/content:

```{r, eval = TRUE}
html_elements(parsed_doc, xpath = "//title") %>% html_text2()
```

Extracting attributes:

```{r, eval = TRUE}
html_elements(parsed_doc, xpath = "//div[1]") %>% html_attrs()
```

Extracting attribute values:

```{r, eval = TRUE}
html_elements(parsed_doc, xpath = "//div") %>% html_attr("lang")
```

---

# More XPath?

### Training resources

- XPath is a little language of its own. As always with languages, mastery comes with practice.
- A good environment for practice is the [XPath expression testbed at whitebeam.org](http://www.whitebeam.org/library/guide/TechNotes/xpathtestbed.rhtm).
- Also check out this [cheat sheet](https://devhints.io/xpath).
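
For offline practice, a small document parsed from a string works as a testbed, too. The sketch below assumes a made-up page: `practice_doc`, its element contents, and the `example.com` link are hypothetical and only loosely modelled on the examples above. It runs one query for each piece of grammar covered so far.

```{r, eval = FALSE}
library(rvest)

# Hypothetical practice document (made up for illustration)
practice_doc <- read_html('
<html>
  <head><title>Practice page</title></head>
  <body>
    <div id="first" date="June/2003">
      <h1>Author A</h1>
      <p><i>Quote A</i></p>
      <p>Source A</p>
    </div>
    <div date="October/2011">
      <h1>Author B</h1>
      <p><i>Quote B</i></p>
      <p><a href="https://example.com">Source B</a></p>
    </div>
  </body>
</html>')

# Absolute vs. relative path: both reach the same <i> elements here
abs_quotes <- html_elements(practice_doc, xpath = "/html/body/div/p/i") %>% html_text2()
rel_quotes <- html_elements(practice_doc, xpath = "//p/i") %>% html_text2()

# Numeric predicate: the first <p> within each <div>
first_p <- html_elements(practice_doc, xpath = "//div/p[1]") %>% html_text2()

# Textual predicate on an attribute, then one step down to the heading
author_2011 <- html_elements(practice_doc, xpath = "//div[@date='October/2011']/h1") %>%
  html_text2()

# Axis: the <div> that is an ancestor of the <a> element
link_div_date <- html_elements(practice_doc, xpath = "//a/ancestor::div") %>%
  html_attr("date")
```

Changing one step of an expression at a time and comparing the returned node sets is a quick way to build intuition for the grammar.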

### XPath creator tools

- Now, do you really have to construct XPath expressions on your own? No! At least not always.
- SelectorGadget: [http://selectorgadget.com](http://selectorgadget.com) is a browser plugin that constructs XPath statements via a point-and-click approach. The generated expressions are not always efficient and effective though (more on this later).
- Web developer tools - the internal browser functionality to study the DOM, among other things, also lets you extract XPath statements for selected nodes. These are specific to unique nodes/elements though, and therefore less helpful to extract node sets. (But they come in handy when we want to script live navigation, e.g. for Selenium.)

---
class: inverse, center, middle
name: css

# CSS basics
--- # What is CSS? .pull-left[ ### Background - Cascading Style Sheets (CSS) is a style sheet language that allows web developers to adjust the "look and feel" of websites. - By using CSS to adjust style features such as layout, colors, and fonts, it's easier to separate content (HTML) from presentation (CSS). ### Three ways to insert CSS into HTML 1. External CSS. Inside `` with a reference to the external file inside the `` element. 2. Internal CSS. Inside `` and stored in ` ``` Inline CSS ```{html, prompt = FALSE, eval = FALSE}
This is a paragraph.
```
]

---

# CSS selectors

.pull-left[
### Selectors

- CSS selectors find/select the HTML elements that should be styled.
- There are various categories of selectors. In addition to generic element selectors (which select just based on the element name, such as `<p>`), we often care about:
  - CSS id selectors, which use the `id` attribute of an HTML element. Think of them as "labels", as in `<p id="para1">`. The respective CSS selector would be `#para1`.
  - CSS class selectors, which use the `class` attribute of an HTML element, as in `<p class="center large">`. Note that these can refer to more than one class (here: `center` and `large`). The respective CSS selector would be `p.center.large`.
]

--

.pull-right[
### Writing CSS selectors

- Just like XPath, CSS selectors are a little language of their own.
- I won't teach you more about it, but you might nevertheless want to learn it.
- Check out the CSS diner tutorial at https://flukeout.github.io/. It's one of the best tutorials of anything out there.
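
The selector types described on this slide can be tried out in R as well, because `html_elements()` accepts a `css` argument. A minimal sketch, assuming a made-up three-paragraph document (`css_doc` and its contents are hypothetical):

```{r, eval = FALSE}
library(rvest)

# Hypothetical document with one id and two class combinations
css_doc <- read_html('
<html><body>
  <p id="para1">Paragraph with an id</p>
  <p class="center large">Paragraph with two classes</p>
  <p class="center">Paragraph with one class</p>
</body></html>')

# Element selector: every <p>
n_paragraphs <- length(html_elements(css_doc, css = "p"))

# Id selector
by_id <- html_elements(css_doc, css = "#para1") %>% html_text2()

# Class selector requiring both classes
by_two_classes <- html_elements(css_doc, css = "p.center.large") %>% html_text2()

# Class selector requiring only one class (matches two paragraphs)
by_one_class <- html_elements(css_doc, css = "p.center") %>% html_text2()
```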

]

---
class: inverse, center, middle
name: scrapingstatic

# Scraping static webpages with R
---

# The scraping workflow

.pull-left[
### Key tools for scraping static webpages

1. You are able to inspect HTML pages in your browser using the web developer tools.
2. You are able to parse HTML into R with `rvest`.
3. You are able to speak XPath (or CSS selectors).
4. You are able to apply XPath expressions with `rvest`.
5. You are able to tidy web data with R/`dplyr`/`regex`.

### The big picture

- Every scraping project is different, but the coding pipeline is fundamentally similar.
- The (technically) hardest steps are location (XPath, CSS selectors) and extraction (clean-up), and sometimes scaling (from one to multiple sources).
]

.pull-right[

]

---

# Web scraping with rvest

.pull-left-wide[
`rvest` is a suite of scraping tools. It is part of the tidyverse and has made scraping with R much more convenient.

There are three key `rvest` verbs that you need to learn.¹

1. `read_html()`: Read (parse) an HTML resource.
2. `html_elements()`: Find elements that match a CSS selector or XPath expression.
3. `html_text2()`: Extract the text/value inside the node set.
]

.footnote[
¹ There is more in `rvest` than what we can cover today. Have a glimpse at the [overview at tidyverse.org](https://rvest.tidyverse.org/) and at this excellent (unofficial) [cheat sheet](https://github.com/yusuzech/r-web-scraping-cheat-sheet).
]

.pull-right-small-center[

]

---

# Web scraping with rvest: example

.pull-left-vsmall[
- We are going to scrape information from a Wikipedia article on women philosophers available at [https://en.wikipedia.org/wiki/](https://en.wikipedia.org/wiki/List_of_women_philosophers) [List_of_women_philosophers](https://en.wikipedia.org/wiki/List_of_women_philosophers).
- The article provides two types of lists - one by period and one sorted alphabetically. We want the alphabetical list.
- The information we are actually interested in - names - is stored in unordered list elements.
]

.pull-right-vwide[

]

---

# Web scraping with rvest: example (cont.)

Step 1: Parse the page

```{r, eval = FALSE}
url_p <- read_html("https://en.wikipedia.org/wiki/List_of_women_philosophers")
```

```{r, eval = TRUE, echo = FALSE}
library(rvest)
url <- "https://en.wikipedia.org/w/index.php?title=List_of_women_philosophers&oldid=1041210397"
url_p <- read_html(url)
```

--

Step 2: Develop an XPath expression (or multiple) that selects the information of interest and apply it

```{r, eval = TRUE}
elements_set <- html_elements(url_p, xpath = "//h2/span[text()='Alphabetically']//following::li/a[1]")
```

--

The XPath expression reads:

- `//h2`: Look for `h2` elements anywhere in the document.
- `/span[text()='Alphabetically']`: Within that element look for `span` elements with the content `"Alphabetically"`.
- `//following::li`: In the DOM tree following that element (at any level), look for `li` elements.
- `/a[1]`: Within these elements, look for the first `a` element you can find.

---

# Web scraping with rvest: example (cont.)

Step 3: Extract information

```{r, eval = TRUE}
phil_names <- elements_set %>% html_text2()
phil_names[c(1:2, 101:102)]
```

--

Step 4: Clean up (here: select the subset of links we care about)

```{r, eval = TRUE}
library(stringr) # provides str_detect()
# keep everything from the first to the last name of the alphabetical list
names_iffer <-
  seq_along(phil_names) >= seq_along(phil_names)[str_detect(phil_names, "Felicia Nimue Ackerman")] &
  seq_along(phil_names) <= seq_along(phil_names)[str_detect(phil_names, "Alenka Zupančič")]
philosopher_names_clean <- phil_names[names_iffer]
length(philosopher_names_clean)
philosopher_names_clean[1:5]
```

---

# Quick-n-dirty static webscraping with SelectorGadget

.pull-left[
### The hassle with XPath

- The most cumbersome part of web scraping (data tidying aside) is the construction of XPath expressions that match the components of a page you want to extract.
- It will take a couple of scraping projects until you’ll truly have mastered XPath.

### A much-appreciated helper

- SelectorGadget is a JavaScript browser plugin that constructs XPath statements (or CSS selectors) via a point-and-click approach.
- It is available here: http://selectorgadget.com/ (there is also a Chrome extension).
- The tool is magic and you will love it.
]

--

.pull-right[
### What does SelectorGadget do?

- You activate the tool on any webpage you want to scrape.
- Based on your selection of components, the tool learns about your desired components and generates an XPath expression (or CSS selector) for you.

### Under the hood

- Based on your selection(s), the tool looks for similar elements on the page.
- The underlying algorithm, which draws on Google’s diff-match-patch libraries, focuses on CSS characteristics, such as tag names and `<div>` and `<span>` attributes.
]

---

# SelectorGadget: example

---

# SelectorGadget: example (cont.)

```{r, eval = FALSE}
library(rvest)
url_p <- read_html("https://www.nytimes.com")
# xpath: paste the expression from Selectorgadget!
# note: we use single quotation marks here (' instead of ") to wrap around the expression!
xpath <- '//*[contains(concat( " ", @class, " " ), concat( " ", "erslblw0", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "e1lsht870", " " ))]'
headlines <- html_elements(url_p, xpath = xpath)
headlines_raw <- html_text(headlines)
length(headlines_raw)
head(headlines_raw)
```

```{r, eval = TRUE, echo = FALSE}
library(rvest)
url_p <- read_html("materials/nytimes-com-2021-09-29.html")
# we use single quotation marks here to wrap around the expression!
xpath <- '//*[contains(concat( " ", @class, " " ), concat( " ", "erslblw0", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "e1lsht870", " " ))]'
headlines <- html_elements(url_p, xpath = xpath)
headlines_raw <- html_text(headlines)
length(headlines_raw)
head(headlines_raw)
```

---

# SelectorGadget: when to use and not to use it

Having learned about a semi-automated approach to generating XPath expressions, you might ask:

Why bother with learning XPath at all?

Well...

- SelectorGadget is not perfect. Sometimes, the algorithm will fail.
- Starting from a different element sometimes (but not always!) helps.
- Often the generated expressions are unnecessarily complex and therefore difficult to debug.
- In my experience, SelectorGadget works 50-60% of the time when scraping from static webpages.
- Knowing XPath, you are also prepared for the remaining 40-50%!

---

# Scraping HTML tables
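
As a minimal, self-contained sketch of table scraping with `rvest`'s `html_table()`: the table below is a made-up toy example (`table_doc`, its column names, and its values are assumptions for illustration), parsed from a string so that no web access is needed.

```{r, eval = FALSE}
library(rvest)

# Hypothetical toy table with a header row made of <th> cells
table_doc <- read_html('
<table>
  <tr> <th>Rank</th> <th>Name</th> <th>Score</th> </tr>
  <tr> <td>1</td> <td>Alpha</td> <td>9.5</td> </tr>
  <tr> <td>2</td> <td>Beta</td> <td>8.0</td> </tr>
</table>')

# Applied to a whole document, html_table() returns one data frame per <table>
tables <- html_table(table_doc)
toy_table <- tables[[1]]
toy_table
```

Because the first row consists of `<th>` cells, it is used as the header by default, and the remaining columns are type-converted (here: numbers stay numbers, names stay text).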

---

# Scraping HTML tables

- HTML tables are everywhere.
- They are easy to spot in the wild - just look for `<table>` tags!
- Precisely because scraping tables is an easy and repetitive task, there is a dedicated `rvest` function for it: `html_table()`.

.pull-left-vsmall[

**Function definition**

```{r, eval = FALSE}
html_table(x,
  header = NA,
  trim = TRUE,
  dec = ".",
  na.strings = "NA",
  convert = TRUE
)
```
]

.pull-right-vwide[

| Argument | Description |
|---|---|
| `x` | Document (from `read_html()`) or node set (from `html_elements()`). |
| `header` | Use first row as header? If `NA`, will use first row if it consists of `<th>` tags. |
| `trim` | Remove leading and trailing whitespace within each cell? |
| `dec` | The character used as decimal place marker. |
| `na.strings` | Character vector of values that will be converted to `NA` if `convert` is `TRUE`. |
| `convert` | If `TRUE`, will run `type.convert()` to interpret texts as int, dbl, or `NA`. |
]

---

# Scraping HTML tables: example

.pull-left-small[
- We are going to scrape a small table from the Wikipedia page [https://en.wikipedia.org/wiki/](https://en.wikipedia.org/wiki/List_of_human_spaceflights) [List_of_human_spaceflights](https://en.wikipedia.org/wiki/List_of_human_spaceflights).
- (Note that we're actually using an old version of the page (dating back to May 1, 2018), which is accessible [here](https://en.wikipedia.org/w/index.php?title=List_of_human_spaceflights&oldid=778165808). Wikipedia pages change, but this old revision and associated link won't.)
- The table is not entirely clean: There are some empty cells, but also images and links.
- The HTML code looks straightforward though.
]

.pull-right-wide[

]

---

# Scraping HTML tables: example (cont.)

```{r, eval = FALSE}
library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_human_spaceflights"
url_p <- read_html(url)
tables <- html_table(url_p, header = TRUE)
spaceflights <- tables[[1]]
spaceflights
```

```{r, eval = TRUE, echo = FALSE}
library(rvest)
url <- "https://en.wikipedia.org/w/index.php?title=List_of_human_spaceflights&oldid=778165808"
url_p <- read_html(url)
tables <- html_table(url_p, header = TRUE)
spaceflights <- tables[[1]]
spaceflights
```

---
class: inverse, center, middle
name: goodpractice

# Web scraping: good practice

---

# Scraping: the rules of the game

1. You take all the responsibility for your web scraping work.
2. Think about the nature of the data. Does it entail sensitive information? Do not collect personal data without explicit permission.
3. Take the copyright law of the relevant jurisdiction into account. If you publish data, do not infringe copyright.
4. If possible, stay identifiable. Stay polite. Stay friendly. Obey the scraping etiquette.
5. If in doubt, ask the author/creator/provider of data for permission. If your interest is entirely scientific, chances aren’t bad that you will get the data.

---

# Consult robots.txt

.pull-left[
### What's robots.txt?

- "[Robots exclusion standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)", informal protocol to prohibit web robots from crawling content
- Located in the root directory of a website (e.g., [google.com/robots.txt](https://www.google.com/robots.txt))
- Documents which bot is allowed to crawl which resources (and which not)
- Not a technical barrier, but a sign that asks for compliance

### Syntax of robots.txt

- Not an official W3C standard
- Rules listed bot by bot
- General rule listed under `User-agent: *` (most interesting entry for R-based crawlers)
- Directories/folders listed separately
]

.pull-right[
**Example**
```{txt, prompt = FALSE, eval = FALSE}
User-agent: Googlebot
Disallow: /images/
Disallow: /private/
```

**Universal ban**
```{txt, prompt = FALSE, eval = FALSE}
User-agent: *
Disallow: /
```

**Allow declaration**
```{txt, prompt = FALSE, eval = FALSE}
User-agent: *
Disallow: /images/
Allow: /images/public/
```

**Crawl delay (in seconds)**
```{txt, prompt = FALSE, eval = FALSE}
User-agent: *
Crawl-delay: 2
```
]

---

# Downloading HTML files

.pull-left[
### Stay modest when accessing lots of data

- Content on the web is publicly available.
- But accessing the data causes server traffic.
- Stay polite by querying resources as sparsely as possible.

### Two easy-to-implement practices

1. Do not bombard the server with requests - and if you have to, do so at a modest pace.
2. Store web data on your local drive first, then parse.
]

.pull-right[
### Looping over a list of URLs

```{r, eval = FALSE}
for (i in seq_along(list_of_urls)) {
  if (!file.exists(paste0(folder, file_names[i]))) {
    download.file(list_of_urls[i],
      destfile = paste0(folder, file_names[i])
    )
    Sys.sleep(runif(1, 1, 2))
  }
}
```

- `!file.exists()` checks whether a file does not exist in the specified location.
- `download.file()` downloads the file to a folder. The destination file (location + name) has to be specified.
- `Sys.sleep()` suspends the execution of R code for a given time interval (in seconds).
]

---

# Staying identifiable

.pull-left[
### Don't be a phantom

- Downloading massive amounts of data may arouse attention from server administrators.
- Assuming that you've got nothing to hide, you should stay identifiable beyond your IP address.

### Two easy-to-implement practices

1. Get in touch with website administrators / data owners.
2. Use HTTP header fields `From` and `User-Agent` to provide information about yourself (by passing these to `add_headers()` from the `httr` package).
]

.pull-right[
### Staying identifiable in practice

```{r, eval = FALSE}
url <- "http://a-totally-random-website.com"
rvest_session <- session(url,
  httr::add_headers(
    From = "my@email.com",
    `User-Agent` = R.Version()$version.string
  )
)
headlines <- rvest_session %>%
  html_elements(xpath = "//p//a") %>%
  html_text()
```

- `rvest`'s `session()` creates a session object that responds to HTTP and HTML methods.
- Here, we provide our email address and the current R version as `User-Agent` information.
- This will pop up in the server logs: The webpage administrator has the chance to easily get in touch with you.
]

---

# Scraping etiquette (cont.)
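
Part of the etiquette is reading `robots.txt` before scraping. The rules for all bots can be pulled out with a few lines of base R. This is only a rough sketch under simplifying assumptions: `rules_for_all_bots()` is a hypothetical helper written for this slide, the `robots_txt` lines are made up along the lines of the examples shown earlier, and the parsing ignores comments and groups with several `User-agent` lines.

```{r, eval = FALSE}
# Hypothetical robots.txt content (in practice: readLines() on the real file)
robots_txt <- c(
  "User-agent: Googlebot",
  "Disallow: /private/",
  "",
  "User-agent: *",
  "Disallow: /images/",
  "Allow: /images/public/",
  "Crawl-delay: 2"
)

# Hypothetical helper: return the rules listed under "User-agent: *"
rules_for_all_bots <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[lines != ""]
  # split each "Field: value" line into its two parts
  field <- tolower(trimws(sub(":.*$", "", lines)))
  value <- trimws(sub("^[^:]*:", "", lines))
  # every "User-agent" line starts a new block of rules
  block <- cumsum(field == "user-agent")
  star_blocks <- block[field == "user-agent" & value == "*"]
  keep <- block %in% star_blocks & field != "user-agent"
  data.frame(field = field[keep], value = value[keep], stringsAsFactors = FALSE)
}

rules <- rules_for_all_bots(robots_txt)
rules

# The crawl delay (if any) can feed directly into Sys.sleep()
crawl_delay <- as.numeric(rules$value[rules$field == "crawl-delay"])
```

For real projects, a dedicated package such as `robotstxt` is probably the more robust choice; the sketch is only meant to show that the file is plain text with a simple structure.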

---
class: inverse, center, middle
name: summary

# Summary

---

# Outlook

Until now, the toy examples were limited to single HTML pages. However, often we want to **scrape data from multiple pages**. You might think of newspaper articles, Wikipedia pages, shopping items and the like. In such scenarios, automating the scraping process becomes really powerful. Also, principles of polite scraping are more relevant then.

In other cases, you might be confronted with

- forms,
- authentication,
- dynamic (JavaScript-enriched) content, or want to
- automatically navigate through pages interactively.

Moreover, we've ignored a major alternative way to collect data from the web so far which goes beyond scraping: accessing [web APIs](https://en.wikipedia.org/wiki/Web_API). Be sure to check out the respective sessions in the workshop.

There's only so much we can cover in one session. Check out more material online [here](https://github.com/hertie-data-science-lab/ds-workshop-webscraping) and [there](https://github.com/yusuzech/r-web-scraping-cheat-sheet) to learn about solutions to some of these problems.

---

# Coming up

### Assignment

Assignment 3 is about to go online on GitHub Classroom. Check it out and start scraping the web (politely).

### Next lecture

Model fitting and simulation. Now that we know how to retrieve data, let's learn how to fit models and learn from them.
