---
title: "Introduction to Data Science"
subtitle: "Session 5: Web data and technologies"
author: "Simon Munzert"
institute: "Hertie School | [GRAD-C11/E1339](https://github.com/intro-to-data-science-21)" #"`r format(Sys.time(), '%d %B %Y')`"
output:
  xaringan::moon_reader:
    css: [default, 'simons-touch.css', metropolis, metropolis-fonts]
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
      ratio: '16:9'
      hash: true
---

```{css, echo=FALSE}
/* print out incremental slides; see https://stackoverflow.com/questions/56373198/get-xaringan-incremental-animations-to-print-to-pdf/56374619#56374619 */
@media print {
  .has-continuation {
    display: block !important;
  }
}
```

```{r setup, include=FALSE}
# figures formatting setup
options(htmltools.dir.version = FALSE)
library(knitr)
opts_chunk$set(
  prompt = TRUE,
  fig.align = "center", #fig.width=6, fig.height=4.5,
  # out.width="748px", #out.length="520.75px",
  dpi = 300, #fig.path='Figs/',
  cache = FALSE, #echo=F, warning=F, message=F
  engine.opts = list(bash = "-l")
)

## Next hook based on this SO answer: https://stackoverflow.com/a/39025054
knit_hooks$set(
  prompt = function(before, options, envir) {
    # show a shell prompt for sh/bash chunks and an R prompt otherwise
    options(
      prompt   = if (options$engine %in% c('sh','bash')) '$ ' else 'R> ',
      continue = if (options$engine %in% c('sh','bash')) '$ ' else '+ '
    )
  })

library(tidyverse)
```

# Table of contents

1. [Web data for data science](#webdata)

2. [HTML basics](#html)

3. [XPath basics](#xpath)

4. [CSS basics](#css)

5. [Scraping static webpages with R](#scrapingstatic)

6. [Web scraping: good practice](#goodpractice)

7. [Summary](#summary)

---
class: inverse, center, middle
name: webdata

# Web data for data science

---
# What is web data?

---
# What is web data? (cont.)

---
# What is web data? (cont.)

.pull-left[
### So what is web data, really?

- Not all data you get from the web is "web data".
- Web data is **data that is created on, for, or via the web**. By that definition, a survey dataset that you download from a data repository is not web data.
- On the other hand, survey data collected online (i.e., via web/mobile questionnaires) is web data, but we don't consider it in today's session.
- Examples of web data:
  - Online news articles
  - Social media network structures
  - Crowdsourced databases (e.g., Wikidata)
  - Server logs (e.g., viewership statistics)
  - Data from surveys, experiments, clickworkers
  - Just any website
]

--

.pull-right[
### And why is web data attractive?

- Data is abundant online.
- Human behavior increasingly takes place online.
- Countless services track human behavior.
- Getting data from the web is cheap and often quick.
- An analysis workflow that involves web data can often be easily updated.
- The vast majority of web data was not created with a data analysis purpose in mind. This fact is often a feature, not a bug.

Today, we focus on one particular way of collecting data from the web: web scraping. This also limits the type of web data we'll be talking about (basically: data from static webpages). But it'll be fun nevertheless.

]

---
# Web scraping

.pull-left[
### What is web scraping?

1. Pulling (unstructured) data from websites (HTML documents)
2. Bringing it into shape (into an analysis-ready format)

### The philosophy of scraping with R

- No point-and-click procedure
- Script the entire process from start to finish
- **Automate**
  - The downloading of files
  - The scraping of information from websites
  - Tapping APIs
  - Parsing of web content
  - Data tidying, text data processing
- Easily scale up scraping procedures
- Scheduling of scraping tasks
]

.pull-right-center[
`Credit` [prowebscraping.com](http://prowebscraping.com/web-scraping-vs-web-crawling/)
]

---
# Technologies of the world wide web

.pull-left[
- To fully unlock the potential of web data for data science, we draw on certain web technologies.
- Importantly, often a basic understanding of these technologies is sufficient as the focus is on web data collection, not [web development](https://en.wikipedia.org/wiki/Web_development).
- Specifically, we have to understand
  - How our machine/browser/R communicates with web servers (→ **HTTP/S**)
  - How websites are built (→ **HTML**, **CSS**, basics of **JavaScript**)
  - How content in webpages can be effectively located (→ **XPath**, **CSS selectors**)
  - How dynamic web applications are executed and tapped (→ **AJAX**, **Selenium**)
  - How data by web services is distributed and processed (→ **APIs**, **JSON**, **XML**)
]

.pull-right-center[

`Credit` [ADCR](http://r-datacollection.com/)
]

---
class: inverse, center, middle
name: html

# HTML basics

---
# HTML background

.pull-left-wide[
### What is HTML?

- **H**yper**T**ext **M**arkup **L**anguage
- Markup language = plain text + markups
- Originally specified by [Tim Berners-Lee](https://en.wikipedia.org/wiki/Tim_Berners-Lee) at [CERN](https://en.wikipedia.org/wiki/CERN) in 1989/90
- [W3C](https://en.wikipedia.org/wiki/World_Wide_Web_Consortium) standard for the construction of websites.
- The fundamentals of HTML haven't changed much recently. HTML 5.2 (published in 2017) was the last versioned W3C release; the specification is now maintained as the WHATWG HTML Living Standard.

### What is it good for?

- In the early days, the web was mainly good for sharing texts. But plain text is boring. Markup is *fun*!
- HTML lies underneath what you see in your browser. You don't see it because your browser interprets and renders it for you.
- A basic understanding of HTML helps us locate the information we want to retrieve.
]

.pull-right-small-center[


]

---
# HTML tree structure

.pull-left[
### The DOM tree

- HTML documents are hierarchically structured. Think of them as a tree with multiple nodes and branches.
- When a webpage (HTML resource) is loaded, the browser creates a [Document Object Model](https://en.wikipedia.org/wiki/Document_Object_Model) of that page - the **DOM Tree**.
- Think of it as a representation that considers all HTML elements as objects that can be accessed.

### Parts of the tree

- The DOM consists of **nodes**, which are just data types that can be referred to - such as "attribute node", "text node", or "element node".
- A **node set** is a set of nodes. This will become relevant when you learn about XPath, which you can use to access multiple nodes (e.g., all `title` nodes).
]

.pull-right[
```{html, prompt = FALSE, eval = FALSE}
<!DOCTYPE html>
<html>
  <head>
    <title>First HTML</title>
  </head>
  <body>
    I am your first HTML file!
  </body>
</html>
```
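
To see the tree from R's side, here is a minimal sketch that assumes the `xml2` package is installed. The markup stored in `page` is an assumed version of the toy file on this slide, and the object names (`page`, `doc`, `children`, `titles`) are made up for illustration.

```{r, eval = FALSE, prompt = FALSE}
library(xml2)

# Assumed markup for the toy page, held in a string instead of a file
page <- '<!DOCTYPE html>
<html>
  <head>
    <title>First HTML</title>
  </head>
  <body>
    I am your first HTML file!
  </body>
</html>'

# Parse the text into a tree of nodes
doc <- read_html(page)

# Element nodes directly below the root <html> element
children <- xml_children(doc)
xml_name(children)

# A node set: every <title> element in the document
titles <- xml_find_all(doc, "//title")
length(titles)
xml_text(titles)
```

The `"//title"` expression is XPath, which gets its own section later; for now it just says "all `title` nodes, wherever they sit in the tree".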
]

---
# HTML: elements and attributes

.pull-left[
### Elements

- Elements are a combination of start tags, content, and end tags.
- Example: `<title>First HTML</title>`
- An element is everything from (and including) the element's start tag to (and including) the element's end tag, along with any other elements nested within it.
- Syntax:

| Component | Representation |
|---|---|
| Element title | `title` |
| Start tag | `<title>` |
| End tag | `</title>` |
| Value | `First HTML` |
]

.pull-right[
### Attributes

- Describe elements and are stored in the start tag.
- There are specific attributes for specific elements.
- Example: `<a href="...">Link to Homepage</a>`
- Syntax:
  - Name-value pairs: `name="value"`
  - Single and double quotation marks possible
  - Several attributes per element possible

### Why tags and attributes are important

- Tags structure HTML documents.
- In the context of web scraping, the structure can be exploited to locate and extract data from websites.
]

---
# Important tags and attributes

### Anchor tag `<a>`

- Links to other pages or resources.
- Classical links are always formatted with an anchor tag.
- The `href` attribute determines the target location.
- The value (the element's content) is the name of the link, i.e., the text you click on.

Link to another resource:

```{html, eval = FALSE, prompt = FALSE}
<a href="...">Link with absolute path</a>
```

Reference within a document:

```{html, eval = FALSE, prompt = FALSE}
<a id="...">Reference point</a>
```

Link to a reference within a document:

```{html, eval = FALSE, prompt = FALSE}
<a href="#...">Link to reference point</a>
```

---
# Important tags and attributes

### Heading tags `<h1>`, `<h2>`, ..., and paragraph tag `<p>`

- Structure text and paragraphs.
- Heading tags range from level 1 to 6.
- Paragraph tag induces a line break.

Examples:

```{html, eval = FALSE, prompt = FALSE}
<p>This text is going to be a paragraph one day and separated from other text by line breaks.</p>
```

```{html, eval = FALSE, prompt = FALSE}
<h1>heading of level 1 - this will be BIG</h1>
...
<h6>heading of level 6 - the smallest heading</h6>
```

---
# Important tags and attributes

### Listing tags `