---
title: "Introduction to Data Science"
subtitle: "Session 6: Web data and technologies"
author: "Simon Munzert"
institute: "Hertie School | [GRAD-C11/E1339](https://github.com/intro-to-data-science-23)" #"`r format(Sys.time(), '%d %B %Y')`"
output:
  xaringan::moon_reader:
    css: [default, 'simons-touch.css', metropolis, metropolis-fonts]
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
      ratio: '16:9'
      hash: true
---
```{css, echo=FALSE}
@media print { /* print out incremental slides; see https://stackoverflow.com/questions/56373198/get-xaringan-incremental-animations-to-print-to-pdf/56374619#56374619 */
.has-continuation {
display: block !important;
}
}
```
```{r setup, include=FALSE}
# figures formatting setup
options(htmltools.dir.version = FALSE)
library(knitr)
opts_chunk$set(
prompt = T,
fig.align="center", #fig.width=6, fig.height=4.5,
# out.width="748px", #out.length="520.75px",
dpi=300, #fig.path='Figs/',
cache=F, #echo=F, warning=F, message=F
engine.opts = list(bash = "-l")
)
## Next hook based on this SO answer: https://stackoverflow.com/a/39025054
knit_hooks$set(
prompt = function(before, options, envir) {
options(
prompt = if (options$engine %in% c('sh','bash')) '$ ' else 'R> ',
continue = if (options$engine %in% c('sh','bash')) '$ ' else '+ '
)
})
library(tidyverse)
```
# Table of contents
1. [Web data for data science](#webdata)
2. [HTML basics](#html)
3. [XPath basics](#xpath)
4. [CSS basics](#css)
5. [Regular expressions](#regex)
6. [Summary](#summary)
---
class: inverse, center, middle
name: webdata
# Web data for data science
---
# What is web data?
---
# What is web data? (cont.)
---
# What is web data? (cont.)
.pull-left[
### So what is web data, really?
- Not all data you get from the web is "web data".
- Web data is **data that is created on, for, or via the web**. By that definition, a survey dataset that you download from a data repository is not web data.
- On the other hand, survey data collected online (i.e., web/mobile questionnaires) is web data but we don't consider it in today's session.
- Examples of web data:
- Online news articles
- Social media network structures
- Crowdsourced databases (e.g., Wikidata)
- Server logs (e.g., viewership statistics)
- Data from surveys, experiments, clickworkers
- Just any website
]
--
.pull-right[
### And why is web data attractive?
- Data is abundant online.
- Human behavior increasingly takes place online.
- Countless services track human behavior.
- Getting data from the web is cheap and often quick.
- An analysis workflow that involves web data can often be easily updated.
- The vast majority of web data was not created with a data analysis purpose in mind. This fact is often a feature, not a bug.
]
---
# Technologies of the world wide web
.pull-left[
- To fully unlock the potential of web data for data science, we draw on certain web technologies.
- Importantly, often a basic understanding of these technologies is sufficient as the focus is on web data collection, not [web development](https://en.wikipedia.org/wiki/Web_development).
- Specifically, we have to understand
- How our machine/browser/R communicates with web servers (→ **HTTP/S**)
- How websites are built (→ **HTML**, **CSS**, basics of **JavaScript**)
- How content in webpages can be effectively located (→ **XPath**, **CSS selectors**)
- How dynamic web applications are executed and tapped (→ **AJAX**, **Selenium**)
- How data from web services is distributed and processed (→ **APIs**, **JSON**, **XML**)
]
.pull-right-center[
`Credit` [ADCR](http://r-datacollection.com/)
]
---
class: inverse, center, middle
name: html
# HTML basics
---
# HTML background
.pull-left-wide[
### What is HTML?
- **H**yper**T**ext **M**arkup **L**anguage
- Markup language = plain text + markups
- Originally specified by [Tim Berners-Lee](https://en.wikipedia.org/wiki/Tim_Berners-Lee) at [CERN](https://en.wikipedia.org/wiki/CERN) in 1989/90
- [W3C](https://en.wikipedia.org/wiki/World_Wide_Web_Consortium) standard for the construction of websites.
- The fundamentals of HTML haven't changed much over the years. The current version is HTML 5.2 (published in 2017).
### What is it good for?
- In the early days, the internet was mainly good for sharing texts. But plain text is boring. Markup is *fun*!
- HTML lies underneath what you see in your browser. You don't see it because your browser interprets and renders it for you.
- A basic understanding of HTML helps us locate the information we want to retrieve.
]
.pull-right-small-center[
]
---
# HTML tree structure
.pull-left[
### The DOM tree
- HTML documents are hierarchically structured. Think of them as a tree with multiple nodes and branches.
- When a webpage (HTML resource) is loaded, the browser creates a [Document Object Model](https://en.wikipedia.org/wiki/Document_Object_Model) of that page - the **DOM Tree**.
- Think of it as a representation that considers all HTML elements as objects that can be accessed.
### Parts of the tree
- The DOM is constituted of **nodes**, which are just data types that can be referred to - such as "attribute node", "text node", or "element node".
- A **node set** is a set of nodes. This will become relevant when you learn about XPath, which you can use to access multiple nodes (e.g., all `title` nodes).
]
.pull-right[
```{html, prompt = FALSE, eval = FALSE}
<html>
<head>
<title>First HTML</title>
</head>
<body>
I am your first HTML file!
</body>
</html>
```
]
---
# HTML: elements and attributes
.pull-left[
### Elements
- Elements are a combination of start tags, content, and end tags.
- Example: `<title>First HTML</title>`
- An element spans everything from the element's start tag to its end tag (both included) and may contain other, nested elements.
- Syntax:
| Component | Representation |
|---|---|
| Element title | `title` |
| Start tag | `<title>` |
| End tag | `</title>` |
| Value | `First HTML` |
]
.pull-right[
### Attributes
- Describe elements and are stored in the start tag.
- There are specific attributes for specific elements.
- Example: `<a href="https://www.example.com/">Link to Homepage</a>`
- Syntax:
- Name-value pairs: `name="value"`
- Single and double quotation marks are possible
- Several attributes per element possible
### Why tags and attributes are important
- Tags structure HTML documents.
- In the context of web scraping, the structure can be exploited to locate and extract data from websites.
]
---
# Important tags and attributes
### Anchor tag `<a>`
- Links to other pages or resources.
- Classical links are always formatted with an anchor tag.
- The `href` attribute determines the target location.
- The element's value is the link text that is displayed.
Link to another resource:
```{html, eval = FALSE, prompt = FALSE}
<a href="https://www.example.com/">Link with absolute path</a>
```
Reference within a document:
```{html, eval = FALSE, prompt = FALSE}
<a name="anchor1">Reference point</a>
```
Link to a reference within a document:
```{html, eval = FALSE, prompt = FALSE}
<a href="#anchor1">Link to reference point</a>
```
---
# Important tags and attributes
### Heading tags `<h1>`, `<h2>`, ..., and paragraph tag `<p>`
- Structure text and paragraphs.
- Heading tags range from level 1 to 6.
- Paragraph tag induces a line break.
Examples:
```{html, eval = FALSE, prompt = FALSE}
<p>This text is going to be a paragraph one day and separated from other text by line breaks.</p>
```
```{html, eval = FALSE, prompt = FALSE}
<h1>heading of level 1 - this will be BIG</h1>
...
<h6>heading of level 6 - the smallest heading</h6>
```
---
# Important tags and attributes
### Listing tags `<ol>`, `<ul>`, and `<dl>`
- The `<ol>` tag creates a numbered list.
- The `<ul>` tag creates an unnumbered list.
- The `<dl>` tag creates a description list.
- List elements within `<ol>` and `<ul>` are wrapped in `<li>` tags (see the example below).
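For illustration, minimal numbered and unnumbered lists (the list content here is made up):
```{html, eval = FALSE, prompt = FALSE}
<ol>
  <li>First item</li>
  <li>Second item</li>
</ol>
<ul>
  <li>A bullet point</li>
  <li>Another bullet point</li>
</ul>
```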
---
# Important tags and attributes
### Organizational and styling tags `<div>` and `<span>`
- They are used to group content over lines (`<div>`, creating a block-level element) or within lines (`<span>`, creating an inline element).
- By grouping or dividing content into blocks, it's easier to identify or apply different styling to them.
- They do not change the layout themselves but work together with CSS (see later!).
.pull-left[
Example of CSS definition:
```{css, prompt = FALSE, eval = FALSE}
div.happy {
color:pink;
font-family:"Comic Sans MS";
font-size:120%
}
span.happy {
color:pink;
font-family:"Comic Sans MS";
font-size:120%
}
```
]
.pull-right[
In the HTML document:
```{html, prompt = FALSE, eval = FALSE}
<div class="happy">I am a happy-styled paragraph</div>
unhappy text with some <span class="happy">happiness</span>
```
]
---
# Important tags and attributes
### Form tag `<form>`
- Forms collect user input and send it to a server upon submission.
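A minimal sketch of a search form for illustration - the `action` target and field names are made up. The `action` attribute determines where the input is sent, `method` the HTTP method used:
```{html, eval = FALSE, prompt = FALSE}
<form action="/search" method="get">
  <input type="text" name="query">
  <input type="submit" value="Search">
</form>
```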
---
# Important tags and attributes
### Table tags `<table>`, `<tr>`, `<th>`, and `<td>`
- HTML tables follow a standard architecture.
- The different tags allow defining the table as a whole, individual rows (including headers), and cells.
- If the data is stored in tables, scraping is often straightforward.
Example:
```{html, prompt = FALSE, eval = FALSE}
<table>
  <tr>
    <th>Rank</th>
    <th>Nominal GDP</th>
    <th>Name</th>
  </tr>
  <tr>
    <th></th>
    <th>(per capita, USD)</th>
    <th></th>
  </tr>
  <tr><td>1</td><td>170,373</td><td>Liechtenstein</td></tr>
  <tr><td>2</td><td>167,021</td><td>Monaco</td></tr>
  <tr><td>3</td><td>115,377</td><td>Luxembourg</td></tr>
  <tr><td>4</td><td>98,565</td><td>Norway</td></tr>
  <tr><td>5</td><td>92,682</td><td>Qatar</td></tr>
</table>
```
---
# More resources on HTML
.pull-left-wide[
### More HTML
- All in all there are over 100 HTML elements.
- But overall, it's still a fairly tight and easy-to-understand markup language.
- Knowing the rest is probably not necessary to become a good web scraper, but it helps you parse HTML documents (in your head) more quickly.
### More resources
- Check out the excellent [MDN Web Docs](https://developer.mozilla.org/en-US/docs/Web/HTML) for an overview, which also point to additional tutorials and references.
- The [W3Schools tutorials](https://www.w3schools.com/) are also a classic.
- While you're at it, you might also want to learn about related technologies such as CSS (used to specify a webpage's appearance/layout) and JavaScript (used to enrich HTMLs with additional functionality and options to interact).
]
.pull-right-small[
]
---
# Accessing the web using your browser vs. R
.pull-left-wide[
### Using your browser to access webpages
1. You click on a link, enter a URL, run a Google query, etc.
2. Browser/your machine sends request to server that hosts website.
3. Server returns resource (often an HTML document).
4. Browser interprets HTML and renders it in a nice fashion.
### Using R to access webpages
1. You manually specify a resource.
2. R/your machine sends a request to the server that hosts the website.
3. The server returns a resource (e.g., an HTML file).
4. R parses the HTML, but does not render it in a nice fashion.
5. It's up to you to tell R what content to extract (see the sketch below).
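A minimal sketch of this workflow with `rvest` (the URL and extracted element are just for illustration):
```{r, eval = FALSE}
library(rvest)
# steps 1-3: specify a resource; R sends the request, the server returns HTML
parsed_doc <- read_html("https://en.wikipedia.org/wiki/List_of_tallest_buildings")
# steps 4-5: R parses the HTML; we tell it what to extract (here: the page title)
html_text2(html_element(parsed_doc, "title"))
```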
]
.pull-right-small[
]
---
# Interacting with your browser
### On web browsers
- Modern browsers are complex pieces of software that take care of multiple operations while you browse the web. And they're basically all doing a good job.¹ Common operations are to retrieve resources, render and display information, and provide an interface for user-webpage interaction.
- Although our goal is to automate web data retrieval, the browser is an important tool in the web scraping workflow.
### The use of browsers for web scraping
- Give you an intuitive impression of the architecture of a webpage
- Allow you to inspect the source code
- Let you construct XPath/CSS selector expressions with plugins
- Render dynamic web content (JavaScript interpreter)
.footnote[¹ Check out this Wikipedia article on the [Browser Wars](https://en.wikipedia.org/wiki/Browser_wars) that happened in the 1990s and 2000s (yes, there was Browser War I and Browser War II - and for once Germany was not to blame) to relive some of your instructor's pains when he started to look into this "internet".]
---
# Inspecting HTML source code
.pull-left-small[
- Goal: retrieving data from a Wikipedia page on [List of tallest buildings](https://en.wikipedia.org/wiki/List_of_tallest_buildings)
- Right-click on page (anywhere)
- Select `View Page Source`
- HTML (CSS, JavaScript) code can be ugly
- But looking more closely, we find the displayed information
]
.pull-right-wide[
]
---
# Inspecting the live HTML source code with the DOM explorer
.pull-left-small[
- Goal: retrieving data from a Wikipedia page on [List of tallest buildings](https://en.wikipedia.org/wiki/List_of_tallest_buildings)
- Right-click on the element of interest
- Select `Inspect`
- The Web Developer Tools window pops up
- Corresponding part in the HTML tree is highlighted
- Interaction with the tree possible!
]
.pull-right-wide[
]
---
# When to do what with your browser
.pull-left-wide[
### When to inspect the complete page source
- Check whether data is in static source code (the search function helps!)
- For small HTML files: understand structure
### When to use the DOM explorer
- Almost always
- Particularly useful to construct XPath/CSS selector expressions
- To monitor dynamic changes in the DOM tree
### A note on browser differences
- Inspecting the source code (as shown on the following slides) works more or less identically in Chrome and Firefox.
- In Safari, go to → `Preferences`, then → `Advanced` and select `Show Develop menu in menu bar`. This unlocks the `Show Page Source` and `Inspect` options and the Web Developer Tools.
]
.pull-right-small-center[
`Credit` [watershedcreative.com](http://watershedcreative.com/naked/html-tree.html)
]
---
class: inverse, center, middle
name: xpath
# XPath basics
---
# Accessing the DOM tree with R
### Different perspectives on HTML
- HTML documents are human-readable.
- HTML tags structure the document, comprising the DOM.
- **Web user perspective**: The browser interprets the code and renders the page.
- **Web scraper perspective**: Parse the document retaining the structure, use the tree/tags to locate information.
--
### HTML parsing
- Our goal is to get HTML into R while retaining the tree structure. That's similar to getting a spreadsheet into R and retaining the rectangular structure.
- HTML is human-readable, so we could also import HTML files as plain text via `readLines()`. That's a bad option though - the document's structure would not be retained.
- The `xml2` package allows us to parse XML-style documents. HTML is a "flavor" of XML, so it works for us.
- The `rvest` package, which we will mainly use for scraping, wraps the `xml2` package, so we rarely have to load it manually.
- There is one high-level function to remember: `read_html()`. It represents the HTML in a list-style fashion.
---
# Accessing the DOM tree with R (cont.)
### Getting HTML into R
Parsing a website is straightforward:
```{r, eval = TRUE, message= FALSE}
library(rvest)
parsed_doc <- read_html("https://google.com")
parsed_doc
```
There are various functions to inspect the parsed document. They aren't really helpful - better use the browser instead if you want to dive into the HTML.
```{r, eval = FALSE, message= FALSE}
xml2::html_structure(parsed_doc)
xml2::as_list(parsed_doc)
```
---
# What's XPath?
### Definition
- Short for **XML Path Language**, another W3C standard.
- A query language for XML-based documents (including HTML).
- With XPath we can access node sets (e.g., elements, attributes) and extract content.
### Why XPath for web scraping?
- Source code of webpages (HTML) structures both layout and content.
- Not only content, but context matters!
- XPath enables us to extract content based on its location in the document (and potentially other features).
- With XPath, we can tell R to do things like the following (see the sketch below):
  1. Give me all `<p>` elements in the document!
  2. Look for all `<p>` elements in the document and give me the third one!
  3. Extract all content in `<p>` elements that is labelled with `class=newscontent`!
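As a sketch, these three requests map onto XPath queries roughly as follows (`parsed_doc` is a parsed HTML document, as shown on the next slides; the `<p>` tag and the `newscontent` class are illustrative):
```{r, eval = FALSE}
html_elements(parsed_doc, xpath = "//p")                       # 1. all <p> elements
html_elements(parsed_doc, xpath = "(//p)[3]")                  # 2. the third <p> element in the document
html_elements(parsed_doc, xpath = "//p[@class='newscontent']") # 3. <p> elements labelled with class "newscontent"
```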
---
# Example: source code
```{html, prompt = FALSE, eval = FALSE}
<html>
<head><title>Collected R wisdoms</title></head>
<body>
<div id="R Inventor" lang="english" date="June/2003">
  <h1>Robert Gentleman</h1>
  <p><i>'What we have is nice, but we need something very different'</i></p>
  <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
</div>
<div lang="english" date="October/2011">
  <h1>Rolf Turner</h1>
  <p><i>'R is wonderful, but it cannot work magic'</i> <emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
</div>
</body>
</html>
```
---
# Applying XPath on HTML in R
- Load package `rvest`
- Parse HTML document with `read_html()`
```{r, eval = TRUE, message= FALSE}
library(rvest)
parsed_doc <- read_html("materials/fortunes.html")
parsed_doc
```
- Query document using `html_elements()`
- `rvest` can process XPath queries as well as CSS selectors.
- Today, we'll focus on XPath:
```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "//div[last()]/p/i")
```
---
# Grammar of XPath
### Basic rules
1. We access nodes/elements by writing down the hierarchical structure in the DOM that locates the element set of interest.
2. A sequence of nodes is separated by `/`.
3. The easiest localization of an element is given by the absolute path (but often not the most efficient one!).
4. Apply XPath on DOM in R using `html_elements()`.
```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "//div[last()]/p/i")
```
---
# Grammar of XPath
### Absolute vs. relative paths
**Absolute paths** start at the root element and follow the whole way down to the target element (with simple slashes, `/`).
```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "/html/body/div/p/i")
```
**Relative paths** skip nodes (with double slashes, `//`).
```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "//body//p/i")
```
Relative paths are often preferable. They are faster to write and easier to comprehend. On the other hand, they are less targeted and therefore potentially less robust, and running them takes more computing time, as the entire tree has to be evaluated. But that's usually not relevant for reasonably small documents.
---
# Grammar of XPath
### The wildcard operator
- Meta symbol `*`
- Matches any element
- Stands in for exactly one (arbitrary) element in the path
- Far less important than, e.g., wildcards in content-based queries (regex!)
```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "/html/body/div/*/i")
# the following returns an empty set (the wildcard must match exactly one element):
html_elements(parsed_doc, xpath = "/html/body/*/i")
```
---
# Grammar of XPath
### Navigational operators `"."` and `".."`
- `"."` refers to the current element itself ("self axis"), which is useful when working with predicates (see later!).
- `".."` accesses the parent element, one hierarchical level up.
```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "//title/..")
```
```{r, eval = TRUE, message= FALSE}
html_elements(parsed_doc, xpath = "//div[starts-with(./@id, 'R')]")
```
---
# Element (node) relations ("axes") in XPath
.pull-left[
### Family relations between elements
- The tools learned so far are sometimes not sufficient to access specific elements without accessing other, undesired elements as well.
- Specifying relations between elements helps establish unambiguity.
- Can be combined with other elements of the grammar
- Basic syntax: `element1/relation::element2`
- We describe relation of `element2` to `element1`
- `element2` is to be extracted - we always extract the element at the end!
]
.pull-right[
]
---
# Element (node) relations in XPath
| Axis name | Description |
|---|---|
| `ancestor` | All ancestors (parent, grandparent etc.) of the current element |
| `ancestor-or-self` | All ancestors of the current element and the current element itself |
| `attribute` | All attributes of the current element |
| `child` | All children of the current element |
| `descendant` | All descendants (children, grandchildren etc.) of the current element |
| `descendant-or-self` | All descendants of the current element and the current element itself |
| `following` | Everything in the document after the closing tag of the current element |
| `following-sibling` | All siblings after the current element |
| `parent` | The parent of the current element |
| `preceding` | All elements that appear before the current element, except ancestors/attribute elements |
| `preceding-sibling` | All siblings before the current element |
| `self` | The current element |
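As a sketch, two of these axes applied to the fortunes document parsed earlier (assuming the document structure shown before):
```{r, eval = FALSE}
# all <p> siblings that follow an <h1> element
html_elements(parsed_doc, xpath = "//h1/following-sibling::p")
# the parent <div> of each <h1> element
html_elements(parsed_doc, xpath = "//h1/parent::div")
```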
---
# Element (node) relations in XPath
Example: access the `