---
title: "Solutions: Webscraping in R"
author: "B Kleinberg, F Soldner, D Hammocks"
date: 22 January 2019
subtitle: Dept of Security and Crime Science, UCL
output: html_notebook
---

**SOLUTIONS**

---

Tutorial 2, Advanced Crime Analysis, BSc Security and Crime Science, UCL

---

## Aim of this tutorial

This tutorial will help you consolidate some techniques presented and used earlier in this module about APIs and web-scraping. You will also be able to start working on your own webscraping programme that might be useful for your final project.


## Task 1: Geographical variation of public perception on Twitter

Use Twitter's API to retrieve Tweets about "crime" in these cities: (1) New York City, (2) London, (3) Los Angeles, (4) Austin, Texas, and (5) Dublin.

Store all tweets in a single dataframe with a column identifying the city.

```{r}
#your code comes here
```
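
One possible way to organise this task is sketched below. It assumes the `rtweet` package and valid Twitter API credentials for the real request; the coordinates are approximate city centres, and the 25-mile radius, the helper name `collect_city_tweets` and the stand-in `fake_fetch` are illustrative assumptions, not part of the original tutorial. The sketch only shows how per-city results can be labelled and combined into a single dataframe.

```r
library(dplyr)

# Approximate city-centre coordinates; the "25mi" radius is an assumption
city_geocodes <- c(
  "New York City" = "40.7128,-74.0060,25mi",
  "London"        = "51.5074,-0.1278,25mi",
  "Los Angeles"   = "34.0522,-118.2437,25mi",
  "Austin, Texas" = "30.2672,-97.7431,25mi",
  "Dublin"        = "53.3498,-6.2603,25mi"
)

# Hypothetical helper: run one search per city, label the rows, bind them together.
# `fetch` is any function taking (query, n, geocode) and returning a data frame.
collect_city_tweets <- function(geocodes, fetch, query = "crime", n = 100) {
  per_city <- lapply(names(geocodes), function(city) {
    tweets <- fetch(query, n = n, geocode = geocodes[[city]])
    tweets$city <- rep(city, nrow(tweets))
    tweets
  })
  bind_rows(per_city)
}

# Stand-in for the API call so the sketch runs without credentials
fake_fetch <- function(q, n, geocode) {
  data.frame(text = paste("placeholder tweet", 1:2), stringsAsFactors = FALSE)
}

demo_tweets <- collect_city_tweets(city_geocodes, fetch = fake_fetch)
table(demo_tweets$city)

# With credentials in place, the real call might look like this:
# all_tweets <- collect_city_tweets(city_geocodes, fetch = rtweet::search_tweets)
```

Passing the fetch function as an argument keeps the combining logic separate from the API call, so it can be checked before spending any of the request quota.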


## Task 2: Using YouTube's API to analyse highly controversial content

Recently, the razor manufacturer Gillette released a video called [We Believe: The Best Men Can Be](https://www.youtube.com/watch?v=koPmuEyP3a0). That video has been extremely controversial and has evoked a number of strongly opinionated responses.

Use YouTube's API to gather information about that video. _Note: retrieving all comments will exceed your quota and will take a very long time._

```{r}
#your code comes here
```
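
A minimal sketch of one way to approach this, assuming the `httr` and `jsonlite` packages and a YouTube Data API v3 key of your own (`YOUR_API_KEY` is a placeholder). The helper names `build_youtube_url` and `fetch_json` are made up for illustration. The comment request asks for a single page of at most 100 threads, in line with the quota note above.

```r
library(httr)
library(jsonlite)

video_id <- "koPmuEyP3a0"   # taken from the video link above
api_key  <- "YOUR_API_KEY"  # placeholder: replace with your own key

# Hypothetical helper: build a request URL for a YouTube Data API v3 endpoint
build_youtube_url <- function(endpoint, query) {
  modify_url(paste0("https://www.googleapis.com/youtube/v3/", endpoint),
             query = query)
}

# Title, description and view/like counts of the video
video_url <- build_youtube_url(
  "videos",
  list(part = "snippet,statistics", id = video_id, key = api_key)
)

# One page of top-level comment threads (100 is the maximum per request)
comments_url <- build_youtube_url(
  "commentThreads",
  list(part = "snippet", videoId = video_id, maxResults = "100", key = api_key)
)

# Hypothetical helper: send the request and parse the JSON response
fetch_json <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}

# Uncomment once a valid key is set:
# video_info <- fetch_json(video_url)
# comments   <- fetch_json(comments_url)
```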


## Task 3: Retrieving the public opinion on that controversial topic.

You might note that some comments below that video border on hate speech or use outright aggressive language.

Let's look at a different source of opinion about that video. This [recent opinion article](https://www.theguardian.com/commentisfree/2019/jan/16/men-masculinity-gillette-advertisement) in the Guardian discusses the video.

Try to access the comments made on that article and store them in a dataframe along with a unique identifier.

```{r}
#your code comes here
```


## Task 4: Recreate the lyrics scraping example with your own artist selection

[This blog post](https://towardsdatascience.com/learn-to-create-your-own-datasets-web-scraping-in-r-f934a31748a5) shows step by step how to scrape data on popular music artists and then access the lyrics of their songs.

(Re-)use the code from the above blog post and re-do their scraping process using your own artist (charts) selection.

```{r}
#NOTE: Differences in html node names - CORRECTED
#NOTE: Found that artists included featuring... which meant I got 404 errors - CORRECTED

#Load Packages
library(tidyverse)
library(rvest)

#Identify the url from where you want to extract data
URLBase<-"https://www.billboard.com/charts/hot-100-60th-anniversary"
#Read HTML of that page
pageBase<-read_html(URLBase)

#Get the artist name
artist<-html_nodes(pageBase, ".chart-list-item__artist")
#Convert to char
artist<-as.character(html_text(artist))

#Get the artist rank
artistRank<-html_nodes(pageBase, ".chart-list-item__rank")
#Convert to numeric
artistRank<-as.numeric(html_text(artistRank))

# Save it to a tibble df
  #Note how we replace the '\n' character (which means newline) with nothing
  #Then we filter for top 10 using the Rank column of the tibble
artistsTop<-tibble('Artist' = gsub("\n", "", artist),
                      'Rank' = artistRank) %>%
                  filter(Rank <= 10)

#Format the link to navigate to the artists genius webpage
URLGenius<-paste0("https://genius.com/artists/",artistsTop$Artist)

#Issues arise when songs in the chart are from one artist featuring another
#Because these collaborations do not have their own artist page
#We therefore consider the main artist only

#Edit artist link
  #Remove everything from the word Featuring onwards, plus any trailing whitespace
  #sub() is vectorised, so no lapply()/unlist() is needed
URLGenius<-trimws(sub('Featuring.*', '', URLGenius))

#Initialize a tibble df to store the results
artistLyrics<-tibble()

#Retrieve song links for each artist 
for (i in seq_along(URLGenius)) {
  pageGenius<-read_html(URLGenius[i])
  URLSongs <- html_nodes(pageGenius, ".mini_card_grid-song a") %>%
      html_attr("href") 
  
   #Inner loop to get the Song Name and Lyrics from the Song Link
   #head() guards against artists with fewer than 10 listed songs
    for (songURL in head(URLSongs, 10)) {
      
      # Read the song page once and reuse it for both lyrics and title
      pageSong <- read_html(songURL)
      
      # Get lyrics
      songLyrics <- pageSong %>%
          html_nodes("div.lyrics p") %>%
          html_text()
        
      # Get song name
      songName <- pageSong %>%
         html_nodes("h1.header_with_cover_art-primary_info-title") %>%
         html_text()
        
      # Save the details to a tibble df
      artistLyrics <- rbind(artistLyrics, tibble(Rank = artistsTop$Rank[i],
                                                   Artist = artistsTop$Artist[i],
                                                   Song = songName,
                                                   Lyrics = songLyrics ))
      
      # Insert time delay
      Sys.sleep(10)
     }
} 

#View results
artistLyrics
```
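
The clean-up of the artist names can be checked in isolation before any page is requested. The sketch below uses made-up placeholder names rather than real chart entries, and the helper `clean_artist` is hypothetical. Percent-encoding the spaces with `URLencode` is an assumption about what the artist pages accept, so it is worth testing one link by hand first.

```r
# Hypothetical helper: strip newlines, drop the "Featuring ..." part, trim whitespace
clean_artist <- function(x) {
  x <- gsub("\n", "", x, fixed = TRUE)
  x <- sub("\\s*Featuring.*$", "", x)
  trimws(x)
}

# Placeholder names standing in for scraped chart entries
raw_artists <- c("\nArtist A\n", "\nArtist B Featuring Artist C\n")

cleaned <- clean_artist(raw_artists)
cleaned

# Build the artist links; encoding the spaces is an assumption (see above)
artist_urls <- paste0(
  "https://genius.com/artists/",
  vapply(cleaned, utils::URLencode, character(1), USE.NAMES = FALSE)
)
artist_urls
```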

## Task 5: Creating a local database of missing persons

The problem of missing persons in the UK is increasingly recognised in academic research. However, to date no curated database exists that researchers can easily download to query the data of people reported missing.

_Note that some of the images show deceased persons and might be distressing to look at._

Your task is to create a local database on your computer of missing females. Use this URL [(https://www.missingpersons.police.uk/en-gb/case-search/9444442)](https://www.missingpersons.police.uk/en-gb/case-search/9444442) as a starting point and retrieve the (1) gender, (2) age, (3) ethnicity and (4) circumstances details of the first three pages of search results.

```{r}
#your code comes here
#Takes the base URL and pastes with 1,2,3 to get a list of pages to search
#Each page is then searched for the 'View Case Details' buttons and the hyperlinks noted
URLMissingPersons <- lapply(paste0('https://www.missingpersons.police.uk/en-gb/case-search/9444442?orderBy=dateDesc&page=', 1:3),
                function(url){
                    url %>% read_html() %>% 
                        html_nodes(".Case") %>%
                        html_nodes(".btn-default") %>%
                        html_attr('href')
                })
```

```{r}
#View Output
URLMissingPersons

#Unlist output
#Simplify as we have nested lists: one for each page
URLMissingPersons<-unlist(URLMissingPersons)

#View Output
URLMissingPersons

#Add base URL to get complete URL
#paste0() is vectorised, so it can be applied to the whole vector at once
URLMissingPersons <- paste0("https://www.missingpersons.police.uk/en-gb", URLMissingPersons)

#ViewOutput
URLMissingPersons
```

```{r}
#Extract Case Details Section
#Uses lapply to apply the following function to every URL in our case list
  #Read the case page once
  #Extract a list of the case detail field titles
  #Extract a list of the case details field details
  #Bind the two together
  #Return a dataframe

dbMissingPersons<- lapply(URLMissingPersons,
                          function(url){
                            casePage <- read_html(url)
                            caseDetailsTitle<- casePage %>%
                              html_nodes('dl.CaseDetails dt') %>%
                              html_text()
                            caseDetailsDescription<- casePage %>%
                              html_nodes('dl.CaseDetails dd') %>%
                              html_text()
          
                            caseDetails<-data.frame(caseDetailsTitle, caseDetailsDescription)
                            
                            return(caseDetails)
                          })

#ViewOutput
View(dbMissingPersons)
```
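
The result above is a list with one title/description dataframe per case. To get closer to the four fields the task asks for, the list can be reshaped into one row per case. The following is a sketch on made-up placeholder records. The field labels in `wantedFields` are assumptions about how the site names its fields and need to be checked against the scraped titles, and the helper `to_row` is hypothetical.

```r
# Placeholder records in the same shape as the elements of dbMissingPersons
toy_cases <- list(
  data.frame(caseDetailsTitle = c("Gender", "Age", "Ethnicity", "Circumstances"),
             caseDetailsDescription = c("Female", "30", "Value A", "Text A")),
  data.frame(caseDetailsTitle = c("Gender", "Circumstances"),
             caseDetailsDescription = c("Female", "Text B"))
)

# Assumed field labels: adjust to the titles actually found on the case pages
wantedFields <- c("Gender", "Age", "Ethnicity", "Circumstances")

# Hypothetical helper: turn one title/description dataframe into a one-row dataframe
to_row <- function(caseDetails, fields) {
  values <- setNames(trimws(as.character(caseDetails$caseDetailsDescription)),
                     trimws(as.character(caseDetails$caseDetailsTitle)))
  row <- values[fields]   # fields missing from a case become NA
  names(row) <- fields
  as.data.frame(as.list(row), stringsAsFactors = FALSE)
}

# One row per case, plus a simple running identifier
dbWide <- do.call(rbind, lapply(toy_cases, to_row, fields = wantedFields))
dbWide$caseID <- seq_len(nrow(dbWide))
dbWide
```

Running the same two lines on `dbMissingPersons` instead of `toy_cases` would give one row per scraped case, with `NA` wherever a case page lacks a field.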



## Task 6: Web-scraping for the detection of potentially suspicious items

You can use web-scraping to look for suspicious items on online marketplaces. Let's take the example of clothing from the "Supreme" brand on Gumtree.

You can access the search results for `supreme hoodie` here: [https://www.gumtree.com/search?featured_filter=false&urgent_filter=false&sort=date&search_scope=false&photos_filter=false&search_category=all&q=supreme+hoodie&tq=%7B%22i%22%3A%22supreme%22%2C%22s%22%3A%22supreme+hoodie%22%2C%22p%22%3A9%2C%22t%22%3A14%7D&search_location=](https://www.gumtree.com/search?featured_filter=false&urgent_filter=false&sort=date&search_scope=false&photos_filter=false&search_category=all&q=supreme+hoodie&tq=%7B%22i%22%3A%22supreme%22%2C%22s%22%3A%22supreme+hoodie%22%2C%22p%22%3A9%2C%22t%22%3A14%7D&search_location=)

Your task now is to scrape the (1) price, (2) title, and (3) location of each ad.

```{r}
#your code comes here

#access gumtree page
target_url = "https://www.gumtree.com/search?featured_filter=false&urgent_filter=false&sort=date&search_scope=false&photos_filter=false&search_category=all&q=supreme+hoodie&tq=%7B%22i%22%3A%22supreme%22%2C%22s%22%3A%22supreme+hoodie%22%2C%22p%22%3A9%2C%22t%22%3A14%7D&search_location="
target_page = read_html(target_url)
```

```{r}
# look for page structure and find products
ProductListings = target_page %>%
  html_nodes(xpath = '//*[@id="srp-results"]') %>%
  html_nodes("div.srp-results") %>%
  html_nodes("div.grid-row") %>%
  html_nodes("div.space-mbxs:nth-child(3)") %>%
  html_nodes(".list-listing-mini") %>%
  html_nodes("li")
```

```{r}
# contents for products
SingleProds = ProductListings %>% html_nodes("article") %>% html_nodes("a.listing-link") %>% html_nodes("div.listing-content")
```

```{r}
# title, price, location of products
ProductsTitle = SingleProds %>% html_node("h2.listing-title") %>% html_text()
ProductsPrice = SingleProds %>% html_node("span.listing-price") %>% html_text()
ProductsLocation = SingleProds %>% html_node("div.listing-location") %>% html_text()
```

```{r}
# save data in data frame
AllData = data.frame(ProductsTitle, ProductsPrice, ProductsLocation) 
```
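
The solution uses `html_node` (singular) on the set of listings, which returns exactly one result per listing and `NA` where an element is missing. That keeps the three vectors the same length, which `data.frame` requires. The toy example below illustrates the difference on two made-up listings, one of them without a price. The final conversion of the price text to a number is an additional illustrative step and not part of the task.

```r
library(rvest)

# Two made-up listings; the second one has no price element
toy_page <- read_html('
  <ul>
    <li><h2 class="listing-title">Hoodie A</h2><span class="listing-price">&pound;50</span></li>
    <li><h2 class="listing-title">Hoodie B</h2></li>
  </ul>')

listings <- html_nodes(toy_page, "li")

# html_node: one value per listing, NA where the element is missing
titles <- listings %>% html_node("h2.listing-title") %>% html_text()
prices <- listings %>% html_node("span.listing-price") %>% html_text()

# html_nodes: only the elements that exist, so the lengths no longer match
all_prices <- toy_page %>% html_nodes("span.listing-price") %>% html_text()

length(titles)      # one entry per listing
length(all_prices)  # shorter: the missing price is silently dropped

# Strip everything except digits and the decimal point before converting
price_num <- as.numeric(gsub("[^0-9.]", "", prices))

toy_data <- data.frame(titles, prices, price_num)
toy_data
```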


## Task 7: Start with your own web-scraping

Choose a website that you want to scrape (e.g. for your final project for this module). In this task, try to understand the structure of that website and how you can retrieve the content you wish to access.

Write a step-by-step plan of what you need to do to obtain that data.

```{r}
#write your plan here

#step1:
#step2:
#step3:
#step4:
#step5:
```


Now start by creating a `my_target_url` variable and make an initial request to that URL with the `read_html` function:

```{r}
#your code comes here
```
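
A first request can fail for many reasons (typo in the address, no connection, the site blocking the request), so it can help to wrap `read_html` in `tryCatch`. The sketch below is one way to do that. The helper name `safe_read_html` is made up, and `https://example.com` is only a placeholder for your own target.

```r
library(rvest)

# Hypothetical helper: return the parsed page, or NULL with a message on failure
safe_read_html <- function(url) {
  tryCatch(
    read_html(url),
    error = function(e) {
      message("Could not read ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

my_target_url <- "https://example.com"   # placeholder: replace with your own site
my_target_page <- safe_read_html(my_target_url)

# Quick sanity check: print the page title if the request worked
if (!is.null(my_target_page)) {
  my_target_page %>% html_node("title") %>% html_text()
}
```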


## Task 8: Your custom web-scraper

Start building your own webscraper here:

```{r}
#your code comes here
```


