---
title: "Introduction to Data Science"
subtitle: "Session 6: Web data and technologies"
author: "Simon Munzert"
institute: "Hertie School | [GRAD-C11/E1339](https://github.com/intro-to-data-science-23)" #"`r format(Sys.time(), '%d %B %Y')`"
output:
  xaringan::moon_reader:
    css: [default, 'simons-touch.css', metropolis, metropolis-fonts]
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
      ratio: '16:9'
      hash: true
---
```{css, echo=FALSE}
@media print { /* print out incremental slides; see https://stackoverflow.com/questions/56373198/get-xaringan-incremental-animations-to-print-to-pdf/56374619#56374619 */
.has-continuation {
display: block !important;
}
}
```
```{r setup, include=FALSE}
# figures formatting setup
options(htmltools.dir.version = FALSE)
library(knitr)
opts_chunk$set(
prompt = T,
fig.align="center", #fig.width=6, fig.height=4.5,
# out.width="748px", #out.length="520.75px",
dpi=300, #fig.path='Figs/',
cache=F, #echo=F, warning=F, message=F
engine.opts = list(bash = "-l")
)
## Next hook based on this SO answer: https://stackoverflow.com/a/39025054
knit_hooks$set(
prompt = function(before, options, envir) {
options(
prompt = if (options$engine %in% c('sh','bash')) '$ ' else 'R> ',
continue = if (options$engine %in% c('sh','bash')) '$ ' else '+ '
)
})
library(tidyverse)
```
# Table of contents
1. [Web data for data science](#webdata)
2. [HTML basics](#html)
3. [XPath basics](#xpath)
4. [CSS basics](#css)
5. [Regular expressions](#regex)
6. [Summary](#summary)
---
class: inverse, center, middle
name: webdata
# Web data for data science
---
# What is web data?
---
# What is web data? (cont.)
---
# What is web data? (cont.)
.pull-left[
### So what is web data, really?
- Not all data you get from the web is "web data".
- Web data is **data that is created on, for, or via the web**. By that definition, a survey dataset that you download from a data repository is not web data.
- On the other hand, survey data collected online (i.e., web/mobile questionnaires) is web data but we don't consider it in today's session.
- Examples of web data:
- Online news articles
- Social media network structures
- Crowdsourced databases (e.g., Wikidata)
- Server logs (e.g., viewership statistics)
- Data from surveys, experiments, clickworkers
- Just any website
]
--
.pull-right[
### And why is web data attractive?
- Data is abundant online.
- Human behavior increasingly takes place online.
- Countless services track human behavior.
- Getting data from the web is cheap and often quick.
- An analysis workflow that involves web data can often be easily updated.
- The vast majority of web data was not created with a data analysis purpose in mind. This fact is often a feature, not a bug.
]
---
# Technologies of the world wide web
.pull-left[
- To fully unlock the potential of web data for data science, we draw on certain web technologies.
- Importantly, often a basic understanding of these technologies is sufficient as the focus is on web data collection, not [web development](https://en.wikipedia.org/wiki/Web_development).
- Specifically, we have to understand
- How our machine/browser/R communicates with web servers (→ **HTTP/S**)
- How websites are built (→ **HTML**, **CSS**, basics of **JavaScript**)
- How content in webpages can be effectively located (→ **XPath**, **CSS selectors**)
- How dynamic web applications are executed and tapped (→ **AJAX**, **Selenium**)
- How data from web services is distributed and processed (→ **APIs**, **JSON**, **XML**)
]
.pull-right-center[
`Credit` [ADCR](http://r-datacollection.com/)
]
---
class: inverse, center, middle
name: html
# HTML basics
---
# HTML background
.pull-left-wide[
### What is HTML?
- **H**yper**T**ext **M**arkup **L**anguage
- Markup language = plain text + markups
- Originally specified by [Tim Berners-Lee](https://en.wikipedia.org/wiki/Tim_Berners-Lee) at [CERN](https://en.wikipedia.org/wiki/CERN) in 1989/90
- [W3C](https://en.wikipedia.org/wiki/World_Wide_Web_Consortium) standard for the construction of websites.
- The fundamentals of HTML haven't changed much over the years. The current version is HTML 5.2 (published in 2017).
### What is it good for?
- In the early days, the internet was mainly good for sharing texts. But plain text is boring. Markup is *fun*!
- HTML lies underneath what you see in your browser. You don't see it because your browser interprets and renders it for you.
- A basic understanding of HTML helps us locate the information we want to retrieve.
]
.pull-right-small-center[
]
---
# HTML tree structure
.pull-left[
### The DOM tree
- HTML documents are hierarchically structured. Think of them as a tree with multiple nodes and branches.
- When a webpage (HTML resource) is loaded, the browser creates a [Document Object Model](https://en.wikipedia.org/wiki/Document_Object_Model) of that page - the **DOM Tree**.
- Think of it as a representation that considers all HTML elements as objects that can be accessed.
### Parts of the tree
- The DOM is constituted of **nodes**, which are just data types that can be referred to - such as "attribute node", "text node", or "element node".
- A **node set** is a set of nodes. This will become relevant when you learn about XPath, which you can use to access multiple nodes (e.g., all `title` nodes).
]
.pull-right[
```{html, prompt = FALSE, eval = FALSE}
<html>
<head>
<title>First HTML</title>
</head>
<body>
I am your first HTML file!
</body>
</html>
```
]
---
# HTML: elements and attributes
.pull-left[
### Elements
- Elements are a combination of start tags, content, and end tags.
- Example: `<title>First HTML</title>`
- An element spans everything from the element's start tag to its end tag (both included) and may contain other, nested elements.
- Syntax:
| Component | Representation |
|---|---|
| Element title | `title` |
| Start tag | `<title>` |
| End tag | `</title>` |
| Value | `First HTML` |
]
.pull-right[
### Attributes
- Describe elements and are stored in the start tag.
- There are specific attributes for specific elements.
- Example: `<a href="https://www.example.com/">Link to Homepage</a>`
- Syntax:
- Name-value pairs: `name="value"`
- Single and double quotation marks are possible
- Several attributes per element possible
### Why tags and attributes are important
- Tags structure HTML documents.
- In the context of web scraping, the structure can be exploited to locate and extract data from websites.
]
---
# Important tags and attributes
### Anchor tag `<a>`
- Links to other pages or resources.
- Classical links are always formatted with an anchor tag.
- The `href` attribute determines the target location.
- The element's value is the link text that is displayed.
Link to another resource:
```{html, eval = FALSE, prompt = FALSE}
<a href="https://www.example.com/">Link with absolute path</a>
```
Reference within a document:
```{html, eval = FALSE, prompt = FALSE}
<a name="anchor1">Reference point</a>
```
Link to a reference within a document:
```{html, eval = FALSE, prompt = FALSE}
<a href="#anchor1">Link to reference point</a>
```
---
# Important tags and attributes
### Heading tags `<h1>`, `<h2>`, ..., and paragraph tag `<p>`
- Structure text and paragraphs.
- Heading tags range from level 1 to 6.
- Paragraph tag induces a line break.
Examples:
```{html, eval = FALSE, prompt = FALSE}
<p>This text is going to be a paragraph one day and separated from other text by line breaks.</p>
```
```{html, eval = FALSE, prompt = FALSE}
<h1>heading of level 1 - this will be BIG</h1>
...
<h6>heading of level 6 - the smallest heading</h6>
```
---
# Important tags and attributes
### Listing tags `<ol>`, `<ul>`, and `<dl>`
- The `<ol>` tag creates a numbered list.
- The `<ul>` tag creates an unnumbered list.
- The `<dl>` tag creates a description list.
- List elements within `<ol>` and `<ul>` are wrapped in `<li>` tags (see the example below).
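For illustration, minimal numbered and unnumbered lists (the list content here is made up):
```{html, eval = FALSE, prompt = FALSE}
<ol>
  <li>First item</li>
  <li>Second item</li>
</ol>
<ul>
  <li>A bullet point</li>
  <li>Another bullet point</li>
</ul>
```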
---
# Important tags and attributes
### Organizational and styling tags `<div>` and `<span>`
- They are used to group content over lines (`<div>`, creating a block-level element) or within lines (`<span>`, creating an inline element).
- By grouping or dividing content into blocks, it's easier to identify or apply different styling to them.
- They do not change the layout themselves but work together with CSS (see later!).
.pull-left[
Example of CSS definition:
```{css, prompt = FALSE, eval = FALSE}
div.happy {
color:pink;
font-family:"Comic Sans MS";
font-size:120%
}
span.happy {
color:pink;
font-family:"Comic Sans MS";
font-size:120%
}
```
]
.pull-right[
In the HTML document:
```{html, prompt = FALSE, eval = FALSE}
<div class="happy">I am a happy-styled paragraph</div>
unhappy text with some <span class="happy">happiness</span>
```
]
---
# Important tags and attributes
### Form tag `<form>`
- Forms collect user input and send it to a server upon submission.
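A minimal sketch of a search form for illustration - the `action` target and field names are made up. The `action` attribute determines where the input is sent, `method` the HTTP method used:
```{html, eval = FALSE, prompt = FALSE}
<form action="/search" method="get">
  <input type="text" name="query">
  <input type="submit" value="Search">
</form>
```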
---
# Important tags and attributes
### Table tags `<table>`, `<tr>`, `<th>`, and `<td>`
- HTML tables follow a standard architecture.
- The different tags allow defining the table as a whole, individual rows (including headers), and cells.
- If the data is stored in tables, scraping is often straightforward.
Example:
```{html, prompt = FALSE, eval = FALSE}
<table>
  <tr>
    <th>Rank</th>
    <th>Nominal GDP</th>
    <th>Name</th>
  </tr>
  <tr>
    <th></th>
    <th>(per capita, USD)</th>
    <th></th>
  </tr>
  <tr><td>1</td><td>170,373</td><td>Liechtenstein</td></tr>
  <tr><td>2</td><td>167,021</td><td>Monaco</td></tr>
  <tr><td>3</td><td>115,377</td><td>Luxembourg</td></tr>
  <tr><td>4</td><td>98,565</td><td>Norway</td></tr>
  <tr><td>5</td><td>92,682</td><td>Qatar</td></tr>
</table>
```
---
# More resources on HTML
.pull-left-wide[
### More HTML
- All in all there are over 100 HTML elements.
- But overall, it's still a fairly tight and easy-to-understand markup language.
- Knowing the rest is probably not necessary to become a good web scraper, but it helps you parse HTML documents (in your head) more quickly.
### More resources
- Check out the excellent [MDN Web Docs](https://developer.mozilla.org/en-US/docs/Web/HTML) for an overview, which also point to additional tutorials and references.
- The [W3Schools tutorials](https://www.w3schools.com/) are also a classic.
- While you're at it, you might also want to learn about related technologies such as CSS (used to specify a webpage's appearance/layout) and JavaScript (used to enrich HTMLs with additional functionality and options to interact).
]
.pull-right-small[
]
---
# Accessing the web using your browser vs. R
.pull-left-wide[
### Using your browser to access webpages
1. You click on a link, enter a URL, run a Google query, etc.
2. Browser/your machine sends request to server that hosts website.
3. Server returns resource (often an HTML document).
4. Browser interprets HTML and renders it in a nice fashion.
### Using R to access webpages
1. You manually specify a resource.
2. R/your machine sends a request to the server that hosts the website.
3. The server returns a resource (e.g., an HTML file).
4. R parses the HTML, but does not render it in a nice fashion.
5. It's up to you to tell R what content to extract (see the sketch below).
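A minimal sketch of this workflow with `rvest` (the URL and extracted element are just for illustration):
```{r, eval = FALSE}
library(rvest)
# steps 1-3: specify a resource; R sends the request, the server returns HTML
parsed_doc <- read_html("https://en.wikipedia.org/wiki/List_of_tallest_buildings")
# steps 4-5: R parses the HTML; we tell it what to extract (here: the page title)
html_text2(html_element(parsed_doc, "title"))
```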
]
.pull-right-small[
]
---
# Interacting with your browser
### On web browsers
- Modern browsers are complex pieces of software that take care of multiple operations while you browse the web. And they're basically all doing a good job.¹ Common operations are to retrieve resources, render and display information, and provide an interface for user-webpage interaction.
- Although our goal is to automate web data retrieval, the browser is an important tool in the web scraping workflow.
### The use of browsers for web scraping
- Give you an intuitive impression of the architecture of a webpage
- Allow you to inspect the source code
- Let you construct XPath/CSS selector expressions with plugins
- Render dynamic web content (JavaScript interpreter)
.footnote[¹ Check out this Wikipedia article on the [Browser Wars](https://en.wikipedia.org/wiki/Browser_wars) that happened in the 1990s and 2000s (yes, there was Browser War I and Browser War II - and for once Germany was not to blame) to relive some of your instructor's pains when he started to look into this "internet".]
---
# Inspecting HTML source code
.pull-left-small[
- Goal: retrieving data from a Wikipedia page on [List of tallest buildings](https://en.wikipedia.org/wiki/List_of_tallest_buildings)
- Right-click on page (anywhere)
- Select `View Page Source`
- HTML (CSS, JavaScript) code can be ugly
- But looking more closely, we find the displayed information
]
.pull-right-wide[
]
---
# Inspecting the live HTML source code with the DOM explorer
.pull-left-small[
- Goal: retrieving data from a Wikipedia page on [List of tallest buildings](https://en.wikipedia.org/wiki/List_of_tallest_buildings)
- Right-click on the element of interest
- Select `Inspect`
- The Web Developer Tools window pops up
- Corresponding part in the HTML tree is highlighted
- Interaction with the tree possible!
]
.pull-right-wide[
]
---
# When to do what with your browser
.pull-left-wide[
### When to inspect the complete page source
- Check whether data is in static source code (the search function helps!)
- For small HTML files: understand structure
### When to use the DOM explorer
- Almost always
- Particularly useful to construct XPath/CSS selector expressions
- To monitor dynamic changes in the DOM tree
### A note on browser differences
- Inspecting the source code (as shown on the following slides) works more or less identically in Chrome and Firefox.
- In Safari, go to → `Preferences`, then → `Advanced` and select `Show Develop menu in menu bar`. This unlocks the `Show Page Source` and `Inspect` options and the Web Developer Tools.
]
.pull-right-small-center[
`Credit` [watershedcreative.com](http://watershedcreative.com/naked/html-tree.html)
]
---
class: inverse, center, middle
name: xpath
# XPath basics
---
# Accessing the DOM tree with R
### Different perspectives on HTML
- HTML documents are human-readable.
- HTML tags structure the document, comprising the DOM.
- **Web user perspective**: The browser interprets the code and renders the page.
- **Web scraper perspective**: Parse the document retaining the structure, use the tree/tags to locate information.
--
### HTML parsing
- Our goal is to get HTML into R while retaining the tree structure. That's similar to getting a spreadsheet into R and retaining the rectangular structure.
- HTML is human-readable, so we could also import HTML files as plain text via `readLines()`. That's a bad option though - the document's structure would not be retained.
- The `xml2` package allows us to parse XML-style documents. HTML is a "flavor" of XML, so it works for us.
- The `rvest` package, which we will mainly use for scraping, wraps the `xml2` package, so we rarely have to load it manually.
- There is one high-level function to remember: `read_html()`. It represents the HTML in a list-style fashion.
---
# Accessing the DOM tree with R (cont.)
### Getting HTML into R
Parsing a website is straightforward:
```{r, eval = TRUE, message= FALSE}
library(rvest)
parsed_doc <- read_html("https://google.com")
parsed_doc
```
There are various functions to inspect the parsed document. They aren't really helpful - better use the browser instead if you want to dive into the HTML.
```{r, eval = FALSE, message= FALSE}
xml2::html_structure(parsed_doc)
xml2::as_list(parsed_doc)
```
---
# What's XPath?
### Definition
- Short for **XML Path Language**, another W3C standard.
- A query language for XML-based documents (including HTML).
- With XPath we can access node sets (e.g., elements, attributes) and extract content.
### Why XPath for web scraping?
- Source code of webpages (HTML) structures both layout and content.
- Not only content, but context matters!
- XPath enables us to extract content based on its location in the document (and potentially other features).
- With XPath, we can tell R to do things like the following (see the sketch below):
  1. Give me all `<p>` elements in the document!
  2. Look for all `<p>` elements in the document and give me the third one!
  3. Extract all content in `<p>` elements that is labelled with `class=newscontent`!
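As a sketch, these three requests map onto XPath queries roughly as follows (`parsed_doc` is a parsed HTML document, as shown on the next slides; the `<p>` tag and the `newscontent` class are illustrative):
```{r, eval = FALSE}
html_elements(parsed_doc, xpath = "//p")                       # 1. all <p> elements
html_elements(parsed_doc, xpath = "(//p)[3]")                  # 2. the third <p> element in the document
html_elements(parsed_doc, xpath = "//p[@class='newscontent']") # 3. <p> elements labelled with class "newscontent"
```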
---
# Example: source code
```{html, prompt = FALSE, eval = FALSE}
<html>
<head><title>Collected R wisdoms</title></head>
<body>
<div id="R Inventor" lang="english" date="June/2003">
  <h1>Robert Gentleman</h1>
  <p><i>'What we have is nice, but we need something very different'</i></p>
  <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
</div>
<div lang="english" date="October/2011">
  <h1>Rolf Turner</h1>
  <p><i>'R is wonderful, but it cannot work magic'</i> <emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
</div>
</body>
</html>
```
---
# Applying XPath on HTML in R
- Load package `rvest`
- Parse HTML document with `read_html()`
```{r, eval = TRUE, message= FALSE}
library(rvest)
parsed_doc <- read_html("materials/fortunes.html")
parsed_doc
```
- Query document using `html_elements()`
- `rvest` can process XPath queries as well as CSS selectors.
- Today, we'll focus on XPath:
```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "//div[last()]/p/i")
```
---
# Grammar of XPath
### Basic rules
1. We access nodes/elements by writing down the hierarchical structure in the DOM that locates the element set of interest.
2. A sequence of nodes is separated by `/`.
3. The easiest localization of an element is given by the absolute path (but often not the most efficient one!).
4. Apply XPath on DOM in R using `html_elements()`.
```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "//div[last()]/p/i")
```
---
# Grammar of XPath
### Absolute vs. relative paths
**Absolute paths** start at the root element and follow the whole way down to the target element (with simple slashes, `/`).
```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "/html/body/div/p/i")
```
**Relative paths** skip nodes (with double slashes, `//`).
```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "//body//p/i")
```
Relative paths are often preferable. They are faster to write and easier to comprehend. On the other hand, they are less targeted and therefore potentially less robust, and running them takes more computing time, as the entire tree has to be evaluated. But that's usually not relevant for reasonably small documents.
---
# Grammar of XPath
### The wildcard operator
- Meta symbol `*`
- Matches any element
- Stands in for exactly one (arbitrary) element in the path
- Far less important than, e.g., wildcards in content-based queries (regex!)
```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "/html/body/div/*/i")
# the following returns an empty set (the wildcard must match exactly one element):
html_elements(parsed_doc, xpath = "/html/body/*/i")
```
---
# Grammar of XPath
### Navigational operators `"."` and `".."`
- `"."` refers to the current element itself ("self axis"), which is useful when working with predicates (see later!).
- `".."` accesses the parent element, one hierarchical level up.
```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "//title/..")
```
```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "//div[starts-with(./@id, 'R')]")
```
---
# Element (node) relations ("axes") in XPath
.pull-left[
### Family relations between elements
- The tools learned so far are sometimes not sufficient to access specific elements without accessing other, undesired elements as well.
- Specifying relations between elements helps establish unambiguity.
- Can be combined with other elements of the grammar
- Basic syntax: `element1/relation::element2`
- We describe relation of `element2` to `element1`
- `element2` is to be extracted - we always extract the element at the end!
]
.pull-right[
]
---
# Element (node) relations in XPath
| Axis name | Description |
|---|---|
| `ancestor` | All ancestors (parent, grandparent etc.) of the current element |
| `ancestor-or-self` | All ancestors of the current element and the current element itself |
| `attribute` | All attributes of the current element |
| `child` | All children of the current element |
| `descendant` | All descendants (children, grandchildren etc.) of the current element |
| `descendant-or-self` | All descendants of the current element and the current element itself |
| `following` | Everything in the document after the closing tag of the current element |
| `following-sibling` | All siblings after the current element |
| `parent` | The parent of the current element |
| `preceding` | All elements that appear before the current element, except ancestors/attribute elements |
| `preceding-sibling` | All siblings before the current element |
| `self` | The current element |
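As a sketch, two of these axes applied to the fortunes document parsed earlier (assuming the document structure shown before):
```{r, eval = FALSE}
# all <p> siblings that follow an <h1> element
html_elements(parsed_doc, xpath = "//h1/following-sibling::p")
# the parent <div> of each <h1> element
html_elements(parsed_doc, xpath = "//h1/parent::div")
```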
---
# Element (node) relations in XPath
Example: access the `