---
title: "Introduction to Data Science"
subtitle: "Session 5: Web data and technologies"
author: "Simon Munzert"
institute: "Hertie School | [GRAD-C11/E1339](https://github.com/intro-to-data-science-21)" #"`r format(Sys.time(), '%d %B %Y')`"
output:
  xaringan::moon_reader:
    css: [default, 'simons-touch.css', metropolis, metropolis-fonts]
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
      ratio: '16:9'
      hash: true
---
```{css, echo=FALSE}
/* print out incremental slides; see https://stackoverflow.com/questions/56373198/get-xaringan-incremental-animations-to-print-to-pdf/56374619#56374619 */
@media print {
  .has-continuation {
    display: block !important;
  }
}
```
```{r setup, include=FALSE}
# figures formatting setup
options(htmltools.dir.version = FALSE)
library(knitr)
opts_chunk$set(
  prompt = TRUE,
  fig.align = "center", #fig.width=6, fig.height=4.5,
  # out.width="748px", #out.length="520.75px",
  dpi = 300, #fig.path='Figs/',
  cache = FALSE, #echo=F, warning=F, message=F
  engine.opts = list(bash = "-l")
)
## Next hook based on this SO answer: https://stackoverflow.com/a/39025054
knit_hooks$set(
  prompt = function(before, options, envir) {
    options(
      prompt = if (options$engine %in% c('sh','bash')) '$ ' else 'R> ',
      continue = if (options$engine %in% c('sh','bash')) '$ ' else '+ '
    )
  })
library(tidyverse)
```
# Table of contents
1. [Web data for data science](#webdata)
2. [HTML basics](#html)
3. [XPath basics](#xpath)
4. [CSS basics](#css)
5. [Scraping static webpages with R](#scrapingstatic)
6. [Web scraping: good practice](#goodpractice)
7. [Summary](#summary)
---
class: inverse, center, middle
name: webdata
# Web data for data science
---
# What is web data?
---
# What is web data? (cont.)
---
# What is web data? (cont.)
.pull-left[
### So what is web data, really?
- Not all data you get from the web is "web data".
- Web data is **data that is created on, for, or via the web**. By that definition, a survey dataset that you download from a data repository is not web data.
- On the other hand, survey data collected online (e.g., via web or mobile questionnaires) is web data, but we don't cover it in today's session.
- Examples of web data:
  - Online news articles
  - Social media network structures
  - Crowdsourced databases (e.g., Wikidata)
  - Server logs (e.g., viewership statistics)
  - Data from surveys, experiments, clickworkers
  - Just any website
]
--
.pull-right[
### And why is web data attractive?
- Data is abundant online.
- Human behavior increasingly takes place online.
- Countless services track human behavior.
- Getting data from the web is cheap and often quick.
- An analysis workflow that involves web data can often be easily updated.
- The vast majority of web data was not created with a data analysis purpose in mind. This fact is often a feature, not a bug.

Today, we focus on one particular way of collecting data from the web: web scraping. This also limits the type of web data we'll be talking about (basically: data from static webpages). But it'll be fun nevertheless.
]
---
# Web scraping
.pull-left[
### What is web scraping?
1. Pulling (unstructured) data from websites (HTML documents)
2. Bringing it into shape (into an analysis-ready format)
### The philosophy of scraping with R
- No point-and-click procedure
- Script the entire process from start to finish
- **Automate**
  - The downloading of files
  - The scraping of information from websites
  - Tapping APIs
  - Parsing of web content
  - Data tidying, text data processing
- Easily scale up scraping procedures
- Scheduling of scraping tasks
]
.pull-right-center[
`Credit` [prowebscraping.com](http://prowebscraping.com/web-scraping-vs-web-crawling/)
]
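The two steps of web scraping (pull, then shape) can be sketched in a few lines of R. This is only a minimal sketch: it assumes the `rvest` package, and it uses a made-up HTML snippet instead of a live website, so the table contents and all object names are purely illustrative.

```r
library(rvest)

# Hypothetical page content; in practice this would be a URL passed to read_html()
html <- '<html><body>
  <table>
    <tr><th>fruit</th><th>count</th></tr>
    <tr><td>apples</td><td>3</td></tr>
    <tr><td>pears</td><td>5</td></tr>
  </table>
</body></html>'

# Step 1: pull the data, i.e. parse the HTML and locate the table
page <- read_html(html)
tab_node <- html_element(page, "table")

# Step 2: bring it into shape, i.e. convert the table into a data frame
fruits <- html_table(tab_node)
fruits
```

Because everything is scripted, the same code can be rerun whenever the source changes.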
---
# Technologies of the world wide web
.pull-left[
- To fully unlock the potential of web data for data science, we draw on certain web technologies.
- Importantly, a basic understanding of these technologies is often sufficient, as the focus is on web data collection, not [web development](https://en.wikipedia.org/wiki/Web_development).
- Specifically, we have to understand
  - How our machine/browser/R communicates with web servers (→ **HTTP/S**)
  - How websites are built (→ **HTML**, **CSS**, basics of **JavaScript**)
  - How content in webpages can be effectively located (→ **XPath**, **CSS selectors**)
  - How dynamic web applications are executed and tapped (→ **AJAX**, **Selenium**)
  - How data by web services is distributed and processed (→ **APIs**, **JSON**, **XML**)
]
.pull-right-center[
`Credit` [ADCR](http://r-datacollection.com/)
]
---
class: inverse, center, middle
name: html
# HTML basics
---
# HTML background
.pull-left-wide[
### What is HTML?
- **H**yper**T**ext **M**arkup **L**anguage
- Markup language = plain text + markups
- Originally specified by [Tim Berners-Lee](https://en.wikipedia.org/wiki/Tim_Berners-Lee) at [CERN](https://en.wikipedia.org/wiki/CERN) in 1989/90
- [W3C](https://en.wikipedia.org/wiki/World_Wide_Web_Consortium) standard for the construction of websites.
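- The fundamentals of HTML haven't changed much in recent years. The last numbered W3C version is HTML 5.2 (published in 2017); HTML is now maintained by the WHATWG as a continuously updated "Living Standard".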
### What is it good for?
- In the early days, the internet was mainly good for sharing texts. But plain text is boring. Markup is *fun*!
- HTML lies underneath what you see in your browser. You don't see it because your browser interprets and renders it for you.
- A basic understanding of HTML helps us locate the information we want to retrieve.
]
.pull-right-small-center[
]
---
# HTML tree structure
.pull-left[
### The DOM tree
- HTML documents are hierarchically structured. Think of them as a tree with multiple nodes and branches.
- When a webpage (HTML resource) is loaded, the browser creates a [Document Object Model](https://en.wikipedia.org/wiki/Document_Object_Model) of that page - the **DOM Tree**.
- Think of it as a representation that treats all HTML elements as objects that can be accessed.
### Parts of the tree
- The DOM is made up of **nodes**, the objects that can be referred to. Nodes come in different types, such as "element node", "attribute node", or "text node".
- A **node set** is a set of nodes. This will become relevant when you learn about XPath, which you can use to access multiple nodes (e.g., all `title` nodes).
]
.pull-right[
```{html, prompt = FALSE, eval = FALSE}
<!DOCTYPE html>
<html>
  <head>
    <title>First HTML</title>
  </head>
  <body>
    I am your first HTML file!
  </body>
</html>
```
]
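To make the tree idea concrete, here is a minimal sketch of how a parsed document can be inspected from R. It assumes the `xml2` package; the HTML string is a small illustrative document (the `<h1>` tag is my own choice), and the object names are hypothetical.

```r
library(xml2)

# A small illustrative HTML document (tag choice in the body is an assumption)
html <- '<!DOCTYPE html>
<html>
  <head><title>First HTML</title></head>
  <body><h1>I am your first HTML file!</h1></body>
</html>'

doc <- read_html(html)

# The root element and its children: the top branches of the tree
root_name   <- xml_name(doc)
child_names <- xml_name(xml_children(doc))

# A node set: all element nodes named "title", selected with XPath
titles <- xml_find_all(doc, "//title")
length(titles)
xml_text(titles)
```

`xml_find_all()` always returns a node set, even when only one node matches, which is why `length()` is a useful first check.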
---
# HTML: elements and attributes
.pull-left[
### Elements
- Elements are a combination of start tags, content, and end tags.
- Example: `<title>First HTML</title>`
- An element comprises everything from its start tag to its end tag (both inclusive), including any elements nested within it.
- Syntax:

| Component | Representation |
|---|---|
| Element name | `title` |
| Start tag | `<title>` |
| End tag | `</title>` |
| Value | `First HTML` |
]
.pull-right[
### Attributes
- Describe elements and are stored in the start tag.
- There are specific attributes for specific elements.
- Example: `<a href="https://www.example.com/">Link to Homepage</a>`
- Syntax:
  - Name-value pairs: `name="value"`
  - Single and double quotation marks possible
  - Several attributes per element possible
### Why tags and attributes are important
- Tags structure HTML documents.
- In the context of web scraping, the structure can be exploited to locate and extract data from websites.
]
---
# Important tags and attributes
### Anchor tag `<a>`
- Links to other pages or resources.
- Classical links are always formatted with an anchor tag.
- The `href` attribute determines the target location.
- The value (the content between start and end tag) is the name of the link shown to the user.
Link to another resource:
```{html, eval = FALSE, prompt = FALSE}
<a href="https://www.example.com/">Link with absolute path</a>
```
Reference within a document:
```{html, eval = FALSE, prompt = FALSE}
<a id="top">Reference point</a>
```
Link to a reference within a document:
```{html, eval = FALSE, prompt = FALSE}
<a href="#top">Link to reference point</a>
```
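For scraping, the three kinds of anchors described above matter because link text and link target are stored in different places: the text is the element's content, the target is the `href` attribute. A minimal sketch, assuming the `rvest` package; the URL and the `id` value `top` are placeholders, not taken from a real page.

```r
library(rvest)

# Made-up snippet; the URL and the id "top" are placeholders
html <- '<html><body>
  <a href="https://www.example.com/">Link with absolute path</a>
  <a id="top">Reference point</a>
  <a href="#top">Link to reference point</a>
</body></html>'

page <- read_html(html)
anchors <- html_elements(page, "a")

# Link text: the content of each anchor element
link_text <- html_text2(anchors)
# Target location: NA where an anchor has no href attribute
link_href <- html_attr(anchors, "href")

data.frame(text = link_text, href = link_href)
```

The anchor that only serves as a reference point has no `href`, so it shows up with a missing value rather than being dropped.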
---
# Important tags and attributes
### Heading tags `<h1>`, `<h2>`, ..., and paragraph tag `<p>`
- Structure text and paragraphs.
- Heading tags range from level 1 to 6.
- Paragraph tag induces a line break.
Examples:
```{html, eval = FALSE, prompt = FALSE}
<p>This text is going to be a paragraph one day and separated from other text by line breaks.</p>
```
```{html, eval = FALSE, prompt = FALSE}
<h1>heading of level 1 - this will be BIG</h1>
...
<h6>heading of level 6 - the smallest heading</h6>
```
---
# Important tags and attributes
### Listing tags `<ul>`, `<ol>`, and `<dl>`
- The `<ol>` tag creates a numbered list.
- The `<ul>` tag creates an unnumbered list.
- The `<dl>` tag creates a description list.
- List elements within `<ol>` and `<ul>` are indicated with the `<li>` tag.
---
# Important tags and attributes
### Organizational and styling tags `<div>` and `<span>`
- They are used to group content over lines (`<div>`, creating a block-level element) or within lines (`<span>`, creating an inline element).
- By grouping or dividing content into blocks, it's easier to identify or apply different styling to them.
- They do not change the layout themselves but work together with CSS (see later!).
.pull-left[
Example of CSS definition:
```{css, prompt = FALSE, eval = FALSE}
div.happy {
  color: pink;
  font-family: "Comic Sans MS";
  font-size: 120%;
}
span.happy {
  color: pink;
  font-family: "Comic Sans MS";
  font-size: 120%;
}
```
]
.pull-right[
In the HTML document:
```{html, prompt = FALSE, eval = FALSE}
<div class="happy"><p>I am a happy-styled paragraph</p></div>
unhappy text with some <span class="happy">happiness</span>
```
]
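The same `tag.class` notation that CSS uses for styling can be reused to locate content when scraping. A minimal sketch, assuming the `rvest` package and a made-up snippet with the class `happy`; the object names are illustrative only.

```r
library(rvest)

# Made-up snippet with one block-level and one inline element of class "happy"
html <- '<html><body>
  <div class="happy">I am a happy-styled paragraph</div>
  <p>unhappy text with some <span class="happy">happiness</span></p>
</body></html>'

page <- read_html(html)

# CSS selectors: element name, a dot, then the class value
happy_div  <- html_text2(html_elements(page, "div.happy"))
happy_span <- html_text2(html_elements(page, "span.happy"))

# Any element of class "happy", regardless of its tag
happy_all <- html_text2(html_elements(page, ".happy"))
```

Selecting by class is often more robust than selecting by position, since classes tend to survive small changes in page layout.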
---
# Important tags and attributes
### Form tag `