Web scraping 2

Advanced Crime Analysis UCL

Bennett Kleinberg

21 Jan 2019

Getting data from the Internet

Webscraping 2

Today

  • “Real” webscraping: basics of a webpage
  • Access webpage in the wild
  • Retrieve wild data
  • Download data through webscraping

APIs: Pros & Cons

Pro

  • easy to access
  • nicely documented
  • works even if website changes

Cons

  • quota limits ($ $ $)
  • under the platforms’ control
  • only for few platforms

Don’t let the data determine your research!

COOL

But what about:

No APIs

  • incels.me
  • stormfront
  • 4chan

  • APIs are restrictive!

… what about:

Your research question –> no API?

Main problem:

Really ‘juicy’ data of the Internet vs APIs

“Real” webscraping: basics of a webpage

Three elements of a webpage

  1. Structure
    • HTML (HyperText Markup Language)
    • structured with <tags>
    • contains the pure content of the webpage
  2. Behaviour
    • JavaScript (!= Java)
    • user interaction
    • examples: alerts, popups, server-interaction
  3. Style
    • CSS (Cascading Style Sheets)
    • formatting, design, responsiveness
    • examples: submit buttons, app interfaces

For now: HTML

The very basics of HTML:

Raw architecture of a webpage

<!DOCTYPE html>
<html>
<body>

HERE COMES THE VISIBLE PART!!

</body>
</html>

Note: Almost every opening tag <tag> is closed by a matching </tag> (a few, such as <img>, have no closing tag). The content sits between the opening and the closing tag.
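A minimal sketch of how this skeleton looks from R, assuming the rvest package is installed. The page below is a made-up string (not a real website), so nothing is downloaded; read_html() parses it the same way it would parse a page fetched from a URL.

```r
library(rvest)

# A hypothetical minimal page, written as a string instead of fetched from a URL
raw_html = '<!DOCTYPE html>
<html>
<body>
<h1>I am a heading</h1>
<p>I am the visible part</p>
</body>
</html>'

# read_html() parses the string into a document we can query
page = read_html(raw_html)

# Everything inside <body> ... </body>, with the tags stripped
body_text = page %>%
  html_nodes('body') %>%
  html_text()

# The heading on its own
heading_text = page %>%
  html_nodes('h1') %>%
  html_text()
```

The tag names ('body', 'h1') are all that is needed to address parts of the page.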

HTML basics

Ways to put content in the <body> ... </body> tag:

  • headings: <h1>I'm a heading at level 1</h1>

Content in the body tag

  • paragraphs: <p>This is a paragraph</p>

Content in the body tag

  • images: <img src="./img/ucl.jpg">

Content in the body tag

  • links: <a href="https://www.ucl.ac.uk/">Click here to go to UCL's website</a>
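A link carries two pieces of information: the text you see and the address it points to. A small sketch, assuming rvest and a slightly shortened, made-up copy of the link above: html_text() returns the visible text, html_attr('href') returns the target.

```r
library(rvest)

# Hypothetical one-link page (link text shortened for the example)
link_page = read_html('<a href="https://www.ucl.ac.uk/">Click here to go to the UCL website</a>')

# The visible text between <a ...> and </a>
link_text = link_page %>%
  html_nodes('a') %>%
  html_text()

# The address stored in the href attribute
link_target = link_page %>%
  html_nodes('a') %>%
  html_attr('href')
```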

Content in the body tag

  • tables
<table>
  <tr>
    <th>Departments</th>
    <th>Location</th>
  </tr>
  <tr>
    <td>Dept. of Security and Crime Science</td>
    <td>35 Tavistock Square</td>
  </tr>
  <tr>
    <td>Division of Psychology and Language Sciences</td>
    <td>26 Bedford Way</td>
  </tr>
</table>

Html <table>...</table>
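HTML tables map quite naturally onto data frames. A minimal sketch, assuming rvest and a table written as a string with one department per row: html_table() turns the <th> cells into column names and each further <tr> into a row.

```r
library(rvest)

# Hypothetical table as a string: one department per row
table_page = read_html('<table>
  <tr><th>Departments</th><th>Location</th></tr>
  <tr><td>Dept. of Security and Crime Science</td><td>35 Tavistock Square</td></tr>
  <tr><td>Division of Psychology and Language Sciences</td><td>26 Bedford Way</td></tr>
</table>')

# html_node() picks the first <table>, html_table() converts it to a data frame
departments = table_page %>%
  html_node('table') %>%
  html_table()

# Column names come from the <th> cells
names(departments)
```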

Content in the body tag

  • lists
<ul>
  <li>Terrorism</li>
  <li>Cyber Crime</li>
  <li>Data Science</li>
</ul> 
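The same idea works for lists: each <li> is one item. A short sketch, assuming rvest and the list above copied into a string:

```r
library(rvest)

# The list from the slide as a string
list_page = read_html('<ul>
  <li>Terrorism</li>
  <li>Cyber Crime</li>
  <li>Data Science</li>
</ul>')

# All <li> elements, reduced to their text: one vector entry per list item
list_items = list_page %>%
  html_nodes('li') %>%
  html_text()
```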

HTML basics

Elements (can) have IDs:

<p id='paragraph1'>This is a paragraph</p>
<img id='ucl_image' src="./img/ucl.jpg">

Same for tables, links, etc.

Every element can have an ID.

You need unique IDs! Two elements cannot have the same ID.
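Why IDs matter for scraping: they let us point at exactly one element. A minimal sketch, assuming rvest and the two elements above as a string. In a CSS selector, '#' means "the element with this ID".

```r
library(rvest)

# The two elements from the slide as a string
id_page = read_html('<p id="paragraph1">This is a paragraph</p>
<img id="ucl_image" src="./img/ucl.jpg">')

# html_node() returns the first match; with unique IDs there is only one
paragraph_text = id_page %>%
  html_node('#paragraph1') %>%
  html_text()

# The image has no text, so we read its src attribute instead
image_source = id_page %>%
  html_node('#ucl_image') %>%
  html_attr('src')
```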

HTML basics

Common elements (can) have CLASSES:

<p id="paragraph1" class="paragraph_class">I am the first paragraph</p>
<p class="paragraph_class">I am the second paragraph</p>
<p class="paragraph_class">I am the third paragraph</p>

Multiple elements can have the same class.
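Classes are what we use to grab many similar elements at once. A small sketch, assuming rvest and the three paragraphs above as a string. In a CSS selector, '.' means "elements with this class", and 'p.paragraph_class' restricts that to <p> tags.

```r
library(rvest)

# The three paragraphs from the slide as a string
class_page = read_html('<p id="paragraph1" class="paragraph_class">I am the first paragraph</p>
<p class="paragraph_class">I am the second paragraph</p>
<p class="paragraph_class">I am the third paragraph</p>')

# All <p> elements that carry the class 'paragraph_class'
class_texts = class_page %>%
  html_nodes('p.paragraph_class') %>%
  html_text()

# One entry per matching paragraph
length(class_texts)
```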

Now what?

Web scraping logic

If all webpages are built in this structure…

… then we could access this structure programmatically.

But where do I find that structure?

Is it just “there”?

YES!!

How to see the html structure?

Example 1: Missing persons

Example 2: FBI most wanted

Webscraping in a nutshell

  1. understand the structure of a webpage
  2. exploit that structure for web-scraping

Webscraping in practice

Case for today: Missing persons FBI

https://www.fbi.gov/wanted/kidnap

Explore the target page

Aims

  1. Get a list of all names
  2. Store the bio information
  3. Extract the “details” description
  4. Download the poster

Getting started

Set up your workspace first:

library(rvest)
## Loading required package: xml2
target_url = 'https://www.fbi.gov/wanted/kidnap'

Remember…

  • understanding the structure of a webpage
  • exploiting that structure for web-scraping

1. Get a list of all names

understanding the structure of a webpage

https://www.fbi.gov/wanted/kidnap

1. Get a list of all names

exploiting that structure for web-scraping

Access the full html page (snapshot-mode):

target_page = read_html(target_url)
target_page
## {xml_document}
## <html lang="en" data-gridsystem="bs3">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body id="visual-portal-wrapper" class="  portaltype-folder site-Plo ...

1. Get a list of all names

Key here: look for the <h3> heading with class title:

target_page %>%
  html_nodes('h3.title')
## {xml_nodeset (40)}
##  [1] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/aria ...
##  [2] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/ange ...
##  [3] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/jenn ...
##  [4] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/feli ...
##  [5] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/mark ...
##  [6] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/mich ...
##  [7] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/ashl ...
##  [8] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/jaym ...
##  [9] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/ilen ...
## [10] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/carl ...
## [11] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/lisa ...
## [12] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/kyro ...
## [13] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/bian ...
## [14] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/jabe ...
## [15] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/shan ...
## [16] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/abby ...
## [17] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/ruoc ...
## [18] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/josh ...
## [19] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/copy ...
## [20] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/eric ...
## ...
# note: equivalent to html_nodes(target_page, 'h3.title')

1. Get a list of all names

A closer look:

all_titles = target_page %>%
  html_nodes('h3.title')

all_titles[1]
## {xml_nodeset (1)}
## [1] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/arian ...

The name we want is the text inside the <a href=...> tag.

1. Get a list of all names

So what we want is:

  1. Access the full html page
  2. Search all h3 headings with class “title”
  3. Find all <a > tags (= links)
  4. Extract the text

1. Get a list of all names

  1. Access the full html page read_html(target_url)
  2. Search all h3 headings with class “title” html_nodes('h3.title')
  3. Find all <a > tags (= links) html_nodes('a')
  4. Extract the text html_text()

1. Get a list of all names

Combined:

all_names = target_page %>%
  html_nodes('h3.title') %>%
  html_nodes('a') %>%
  html_text()

1. Get a list of all names

all_names
##  [1] "ARIANNA FITTS"                     
##  [2] "ANGELA MAE MEEKER"                 
##  [3] "JENNIFER LYNN MARCUM"              
##  [4] "FELIX BATISTA"                     
##  [5] "MARK HIMEBAUGH"                    
##  [6] "MICHAELA JOY GARECHT"              
##  [7] "ASHLEY SUMMERS"                    
##  [8] "JAYME CLOSS"                       
##  [9] "ILENE BETH MISHELOFF"              
## [10] "CARLA VICENTINI"                   
## [11] "LISA IRWIN"                        
## [12] "KYRON RICHARD HORMAN"              
## [13] "BIANCA LEBRON"                     
## [14] "JABEZ SPANN"                       
## [15] "SHANNA GENELLE PEOPLES"            
## [16] "ABBY LYNN PATTERSON"               
## [17] "RUOCHEN LIAO"                      
## [18] "JOSHUA KESHABA SIERRA GARCIA"      
## [19] "ALEXIS TIARA MURPHY"               
## [20] "ERICA NICOLE HUNT"                 
## [21] "TIONDA Z. BRADLEY"                 
## [22] "DIAMOND YVETTE BRADLEY"            
## [23] "NEFERTIRI R. TRADER"               
## [24] "KRISTEN MODAFFERI"                 
## [25] "LASHAYA STINE"                     
## [26] "JONATHAN FRASER"                   
## [27] "AMINA AND BELEL KANDIL"            
## [28] "FALOMA LUHK"                       
## [29] "MALEINA LUHK"                      
## [30] "KAYLAH HUNTER AND KRISTIAN JUSTICE"
## [31] "RUSSELL JOHN MORT"                 
## [32] "AKIA SHAWNTA EGGLESTON"            
## [33] "RONDREIZ PHILLIPS"                 
## [34] "DEVONTE JORDAN HART"               
## [35] "AMBER ELIZABETH CATES"             
## [36] "SHAINA ASHLEY KIRKPATRICK"         
## [37] "SHAUSHA LATINE HENSON"             
## [38] "AMY LYNN BRADLEY"                  
## [39] "SUZANNE G. LYALL"                  
## [40] "MYRA LEWIS"

2. Store the bio information

understanding the structure of a webpage

https://www.fbi.gov/wanted/kidnap

2. Store the bio information

We know: there’s a table with class wanted-person-description that contains the data we want.

But: we need to access each ‘kidnapped’ person!?

For-loops to the rescue…

2. Store the bio information

exploiting that structure for web-scraping

So what we want is:

  1. Access the full html page
  2. Search all h3 headings with class “title”
  3. Find all <a > tags (= links)
  4. Extract the actual link (the href attribute) instead of the text
  5. Access that page
  6. Extract the table with class wanted-person-description
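Steps 5 and 6 can be sketched for a single page as follows. This is only an illustration under assumptions: extract_bio is a hypothetical helper name, and the example markup and values are made up so the code runs without touching the FBI website. Only the class name wanted-person-description comes from the real page.

```r
library(rvest)

# Hypothetical helper: takes an already parsed person page and returns
# the first table with class 'wanted-person-description' as a data frame
extract_bio = function(person_page){
  person_page %>%
    html_node('table.wanted-person-description') %>%
    html_table()
}

# Stand-in for one person page (made-up markup and values, not real FBI data)
example_page = read_html('<table class="wanted-person-description">
  <tr><td>Hair</td><td>Brown</td></tr>
  <tr><td>Eyes</td><td>Blue</td></tr>
</table>')

example_bio = extract_bio(example_page)

# On the live site, the same helper could be applied to a character vector
# of person URLs (called all_persons_links in these slides), e.g.:
# all_bios = list()
# for(i in all_persons_links){
#   all_bios[[i]] = extract_bio(read_html(i))
#   Sys.sleep(1)  # pause between requests
# }
```

Keeping the extraction in a function that takes a parsed page means it can be tried on a small local example first and only then run in a loop over the real pages.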

2. Store the bio information

all_persons_links = target_page %>%
  html_nodes('h3.title') %>%
  html_nodes('a') %>%
  html_attr('href')

all_persons_links
##  [1] "https://www.fbi.gov/wanted/kidnap/arianna-fitts"               
##  [2] "https://www.fbi.gov/wanted/kidnap/angela-mae-meeker"           
##  [3] "https://www.fbi.gov/wanted/kidnap/jennifer-lynn-marcum"        
##  [4] "https://www.fbi.gov/wanted/kidnap/felix-batista"               
##  [5] "https://www.fbi.gov/wanted/kidnap/mark-himebaugh"              
##  [6] "https://www.fbi.gov/wanted/kidnap/michaela-joy-garecht"        
##  [7] "https://www.fbi.gov/wanted/kidnap/ashley-summers"              
##  [8] "https://www.fbi.gov/wanted/kidnap/jayme-closs"                 
##  [9] "https://www.fbi.gov/wanted/kidnap/ilene-beth-misheloff"        
## [10] "https://www.fbi.gov/wanted/kidnap/carla-vicentini"             
## [11] "https://www.fbi.gov/wanted/kidnap/lisa-irwin"                  
## [12] "https://www.fbi.gov/wanted/kidnap/kyron-richard-horman"        
## [13] "https://www.fbi.gov/wanted/kidnap/bianca-lebron"               
## [14] "https://www.fbi.gov/wanted/kidnap/jabez-spann"                 
## [15] "https://www.fbi.gov/wanted/kidnap/shanna-genelle-peoples"      
## [16] "https://www.fbi.gov/wanted/kidnap/abby-lynn-patterson"         
## [17] "https://www.fbi.gov/wanted/kidnap/ruochen-liao"                
## [18] "https://www.fbi.gov/wanted/kidnap/joshua-keshaba-sierra-garcia"
## [19] "https://www.fbi.gov/wanted/kidnap/copy_of_alexis-tiara-murphy" 
## [20] "https://www.fbi.gov/wanted/kidnap/erica-nicole-hunt"           
## [21] "https://www.fbi.gov/wanted/kidnap/tionda-z.-bradley"           
## [22] "https://www.fbi.gov/wanted/kidnap/diamond-yvette-bradley"      
## [23] "https://www.fbi.gov/wanted/kidnap/nefertiri-trader"            
## [24] "https://www.fbi.gov/wanted/kidnap/kristen-modafferi"           
## [25] "https://www.fbi.gov/wanted/kidnap/lashaya-stine"               
## [26] "https://www.fbi.gov/wanted/kidnap/jonathan-fraser"             
## [27] "https://www.fbi.gov/wanted/kidnap/amina-and-belel-kandil"      
## [28] "https://www.fbi.gov/wanted/kidnap/faloma-luhk"                 
## [29] "https://www.fbi.gov/wanted/kidnap/maleina-luhk"                
## [30] "https://www.fbi.gov/wanted/kidnap/kaylah-hunter"               
## [31] "https://www.fbi.gov/wanted/kidnap/russell-john-mort"           
## [32] "https://www.fbi.gov/wanted/kidnap/akia-shawnta-eggleston"      
## [33] "https://www.fbi.gov/wanted/kidnap/rondreiz-phillips"           
## [34] "https://www.fbi.gov/wanted/kidnap/devonte-jordan-hart"         
## [35] "https://www.fbi.gov/wanted/kidnap/amber-elizabeth-cates"       
## [36] "https://www.fbi.gov/wanted/kidnap/shaina-ashley-kirkpatrick"   
## [37] "https://www.fbi.gov/wanted/kidnap/shausha-latine-henson"       
## [38] "https://www.fbi.gov/wanted/kidnap/amy-lynn-bradley"            
## [39] "https://www.fbi.gov/wanted/kidnap/suzanne-g.-lyall"            
## [40] "https://www.fbi.gov/wanted/kidnap/myra-lewis"

2. Store the bio information

Now what?

for(i in all_persons_links){
  print(i)
}
## [1] "https://www.fbi.gov/wanted/kidnap/arianna-fitts"
## [1] "https://www.fbi.gov/wanted/kidnap/angela-mae-meeker"
## [1] "https://www.fbi.gov/wanted/kidnap/jennifer-lynn-marcum"
## [1] "https://www.fbi.gov/wanted/kidnap/felix-batista"
## [1] "https://www.fbi.gov/wanted/kidnap/mark-himebaugh"
## [1] "https://www.fbi.gov/wanted/kidnap/michaela-joy-garecht"
## [1] "https://www.fbi.gov/wanted/kidnap/ashley-summers"
## [1] "https://www.fbi.gov/wanted/kidnap/jayme-closs"
## [1] "https://www.fbi.gov/wanted/kidnap/ilene-beth-misheloff"
## [1] "https://www.fbi.gov/wanted/kidnap/carla-vicentini"
## [1] "https://www.fbi.gov/wanted/kidnap/lisa-irwin"
## [1] "https://www.fbi.gov/wanted/kidnap/kyron-richard-horman"
## [1] "https://www.fbi.gov/wanted/kidnap/bianca-lebron"
## [1] "https://www.fbi.gov/wanted/kidnap/jabez-spann"
## [1] "https://www.fbi.gov/wanted/kidnap/shanna-genelle-peoples"
## [1] "https://www.fbi.gov/wanted/kidnap/abby-lynn-patterson"
## [1] "https://www.fbi.gov/wanted/kidnap/ruochen-liao"
## [1] "https://www.fbi.gov/wanted/kidnap/joshua-keshaba-sierra-garcia"
## [1] "https://www.fbi.gov/wanted/kidnap/copy_of_alexis-tiara-murphy"
## [1] "https://www.fbi.gov/wanted/kidnap/erica-nicole-hunt"
## [1] "https://www.fbi.gov/wanted/kidnap/ti