Getting data from the Internet
Webscraping 2
Pro
Cons
But what about:
4chan
APIs are restrictive!
Main problem:
Really ‘juicy’ data of the Internet vs APIs
<tags>
The very basics of HTML:
Raw architecture of a webpage
<!DOCTYPE html>
<html>
<body>
HERE COMES THE VISIBLE PART!!
</body>
</html>
Note: Every tags < >
is closed < />
. Content is contained within the tag.
Ways to put content in the <body> ... </body>
tag:
<h1>I'm a heading at level 1</>
<p>This is a paragraph</p>
<img src="./img/ucl.jpg">
<a href="https://www.ucl.ac.uk/">Click here to go to UCL's website</a></a>
<table>
<tr>
<th>Departments</th>
<th>Location</th>
</tr>
<tr>
<td>Dept. of Security and Crime Science</td>
<td>Division of Psychology and Language Sciences</td>
</tr>
<tr>
<td>35 Tavistock Square</td>
<td>26 Bedford Way</td>
</tr>
</table>
<table>...</table>
<ul>
<li>Terrorism</li>
<li>Cyber Crime</li>
<li>Data Science</li>
</ul>
Elements (can) have IDs:
<p id='paragraph1'>This is a paragraph</p>
<img id='ucl_image' src="./img/ucl.jpg">
Same for tables, links, etc.
Every element can have an ID.
You need unique IDs! Two elements cannot have the same ID.
Common elements (can) have CLASSES:
<p id="paragraph1" class="paragraph_class">I am the first paragraph</p>
<p class="paragraph_class">I am the second paragraph</p>
<p class="paragraph_class">I am the third paragraph</p>
Multiple elements can have the same class.
If all webpages are built in this structure…
… then we could access this structure programmatically.
Is it just “there”?
YES!!
Case for today: Missing persons FBI
https://www.fbi.gov/wanted/kidnap
Explore the target page
Set up your workspace first:
library(rvest)
## Loading required package: xml2
target_url = 'https://www.fbi.gov/wanted/kidnap'
understanding the structure of a webpage
exploiting that structure for web-scraping
Access the full html page (snapshot-mode):
target_page = read_html(target_url)
target_page
## {xml_document}
## <html lang="en" data-gridsystem="bs3">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body id="visual-portal-wrapper" class=" portaltype-folder site-Plo ...
Key here: look for the <h3>
heading with class title
:
target_page %>%
html_nodes('h3.title')
## {xml_nodeset (40)}
## [1] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/aria ...
## [2] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/ange ...
## [3] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/jenn ...
## [4] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/feli ...
## [5] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/mark ...
## [6] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/mich ...
## [7] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/ashl ...
## [8] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/jaym ...
## [9] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/ilen ...
## [10] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/carl ...
## [11] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/lisa ...
## [12] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/kyro ...
## [13] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/bian ...
## [14] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/jabe ...
## [15] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/shan ...
## [16] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/abby ...
## [17] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/ruoc ...
## [18] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/josh ...
## [19] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/copy ...
## [20] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/eric ...
## ...
#note: equivalent to "html_nodes(target_page, 'h3.title')""
A closer look:
all_titles = target_page %>%
html_nodes('h3.title')
all_titles[1]
## {xml_nodeset (1)}
## [1] <h3 class="title">\n<a href="https://www.fbi.gov/wanted/kidnap/arian ...
It’s the text of the <a href=...>
tag.
So what we want is:
<a >
tags (= links)read_html(target_url)
html_nodes('h3.title')
<a >
tags (= links) html_nodes('a')
html_text()
Combined:
all_names = target_page %>%
html_nodes('h3.title') %>%
html_nodes('a') %>%
html_text()
all_names
## [1] "ARIANNA FITTS"
## [2] "ANGELA MAE MEEKER"
## [3] "JENNIFER LYNN MARCUM"
## [4] "FELIX BATISTA"
## [5] "MARK HIMEBAUGH"
## [6] "MICHAELA JOY GARECHT"
## [7] "ASHLEY SUMMERS"
## [8] "JAYME CLOSS"
## [9] "ILENE BETH MISHELOFF"
## [10] "CARLA VICENTINI"
## [11] "LISA IRWIN"
## [12] "KYRON RICHARD HORMAN"
## [13] "BIANCA LEBRON"
## [14] "JABEZ SPANN"
## [15] "SHANNA GENELLE PEOPLES"
## [16] "ABBY LYNN PATTERSON"
## [17] "RUOCHEN LIAO"
## [18] "JOSHUA KESHABA SIERRA GARCIA"
## [19] "ALEXIS TIARA MURPHY"
## [20] "ERICA NICOLE HUNT"
## [21] "TIONDA Z. BRADLEY"
## [22] "DIAMOND YVETTE BRADLEY"
## [23] "NEFERTIRI R. TRADER"
## [24] "KRISTEN MODAFFERI"
## [25] "LASHAYA STINE"
## [26] "JONATHAN FRASER"
## [27] "AMINA AND BELEL KANDIL"
## [28] "FALOMA LUHK"
## [29] "MALEINA LUHK"
## [30] "KAYLAH HUNTER AND KRISTIAN JUSTICE"
## [31] "RUSSELL JOHN MORT"
## [32] "AKIA SHAWNTA EGGLESTON"
## [33] "RONDREIZ PHILLIPS"
## [34] "DEVONTE JORDAN HART"
## [35] "AMBER ELIZABETH CATES"
## [36] "SHAINA ASHLEY KIRKPATRICK"
## [37] "SHAUSHA LATINE HENSON"
## [38] "AMY LYNN BRADLEY"
## [39] "SUZANNE G. LYALL"
## [40] "MYRA LEWIS"
understanding the structure of a webpage
We know: there’s a table with class wanted-person-description
that contains the data we want.
But: we need to access each ‘kidnapped’ person!?
For-loops to the rescue…
exploiting that structure for web-scraping
So what we want is:
<a >
tags (= links)wanted-person-description
all_persons_links = target_page %>%
html_nodes('h3.title') %>%
html_nodes('a') %>%
html_attr('href')
all_persons_links
## [1] "https://www.fbi.gov/wanted/kidnap/arianna-fitts"
## [2] "https://www.fbi.gov/wanted/kidnap/angela-mae-meeker"
## [3] "https://www.fbi.gov/wanted/kidnap/jennifer-lynn-marcum"
## [4] "https://www.fbi.gov/wanted/kidnap/felix-batista"
## [5] "https://www.fbi.gov/wanted/kidnap/mark-himebaugh"
## [6] "https://www.fbi.gov/wanted/kidnap/michaela-joy-garecht"
## [7] "https://www.fbi.gov/wanted/kidnap/ashley-summers"
## [8] "https://www.fbi.gov/wanted/kidnap/jayme-closs"
## [9] "https://www.fbi.gov/wanted/kidnap/ilene-beth-misheloff"
## [10] "https://www.fbi.gov/wanted/kidnap/carla-vicentini"
## [11] "https://www.fbi.gov/wanted/kidnap/lisa-irwin"
## [12] "https://www.fbi.gov/wanted/kidnap/kyron-richard-horman"
## [13] "https://www.fbi.gov/wanted/kidnap/bianca-lebron"
## [14] "https://www.fbi.gov/wanted/kidnap/jabez-spann"
## [15] "https://www.fbi.gov/wanted/kidnap/shanna-genelle-peoples"
## [16] "https://www.fbi.gov/wanted/kidnap/abby-lynn-patterson"
## [17] "https://www.fbi.gov/wanted/kidnap/ruochen-liao"
## [18] "https://www.fbi.gov/wanted/kidnap/joshua-keshaba-sierra-garcia"
## [19] "https://www.fbi.gov/wanted/kidnap/copy_of_alexis-tiara-murphy"
## [20] "https://www.fbi.gov/wanted/kidnap/erica-nicole-hunt"
## [21] "https://www.fbi.gov/wanted/kidnap/tionda-z.-bradley"
## [22] "https://www.fbi.gov/wanted/kidnap/diamond-yvette-bradley"
## [23] "https://www.fbi.gov/wanted/kidnap/nefertiri-trader"
## [24] "https://www.fbi.gov/wanted/kidnap/kristen-modafferi"
## [25] "https://www.fbi.gov/wanted/kidnap/lashaya-stine"
## [26] "https://www.fbi.gov/wanted/kidnap/jonathan-fraser"
## [27] "https://www.fbi.gov/wanted/kidnap/amina-and-belel-kandil"
## [28] "https://www.fbi.gov/wanted/kidnap/faloma-luhk"
## [29] "https://www.fbi.gov/wanted/kidnap/maleina-luhk"
## [30] "https://www.fbi.gov/wanted/kidnap/kaylah-hunter"
## [31] "https://www.fbi.gov/wanted/kidnap/russell-john-mort"
## [32] "https://www.fbi.gov/wanted/kidnap/akia-shawnta-eggleston"
## [33] "https://www.fbi.gov/wanted/kidnap/rondreiz-phillips"
## [34] "https://www.fbi.gov/wanted/kidnap/devonte-jordan-hart"
## [35] "https://www.fbi.gov/wanted/kidnap/amber-elizabeth-cates"
## [36] "https://www.fbi.gov/wanted/kidnap/shaina-ashley-kirkpatrick"
## [37] "https://www.fbi.gov/wanted/kidnap/shausha-latine-henson"
## [38] "https://www.fbi.gov/wanted/kidnap/amy-lynn-bradley"
## [39] "https://www.fbi.gov/wanted/kidnap/suzanne-g.-lyall"
## [40] "https://www.fbi.gov/wanted/kidnap/myra-lewis"
Now what?
for(i in all_persons_links){
print(i)
}
## [1] "https://www.fbi.gov/wanted/kidnap/arianna-fitts"
## [1] "https://www.fbi.gov/wanted/kidnap/angela-mae-meeker"
## [1] "https://www.fbi.gov/wanted/kidnap/jennifer-lynn-marcum"
## [1] "https://www.fbi.gov/wanted/kidnap/felix-batista"
## [1] "https://www.fbi.gov/wanted/kidnap/mark-himebaugh"
## [1] "https://www.fbi.gov/wanted/kidnap/michaela-joy-garecht"
## [1] "https://www.fbi.gov/wanted/kidnap/ashley-summers"
## [1] "https://www.fbi.gov/wanted/kidnap/jayme-closs"
## [1] "https://www.fbi.gov/wanted/kidnap/ilene-beth-misheloff"
## [1] "https://www.fbi.gov/wanted/kidnap/carla-vicentini"
## [1] "https://www.fbi.gov/wanted/kidnap/lisa-irwin"
## [1] "https://www.fbi.gov/wanted/kidnap/kyron-richard-horman"
## [1] "https://www.fbi.gov/wanted/kidnap/bianca-lebron"
## [1] "https://www.fbi.gov/wanted/kidnap/jabez-spann"
## [1] "https://www.fbi.gov/wanted/kidnap/shanna-genelle-peoples"
## [1] "https://www.fbi.gov/wanted/kidnap/abby-lynn-patterson"
## [1] "https://www.fbi.gov/wanted/kidnap/ruochen-liao"
## [1] "https://www.fbi.gov/wanted/kidnap/joshua-keshaba-sierra-garcia"
## [1] "https://www.fbi.gov/wanted/kidnap/copy_of_alexis-tiara-murphy"
## [1] "https://www.fbi.gov/wanted/kidnap/erica-nicole-hunt"
## [1] "https://www.fbi.gov/wanted/kidnap/tionda-z.-bradley"
## [1] "https://www.fbi.gov/wanted/kidnap/diamond-yvette-bradley"
## [1] "https://www.fbi.gov/wanted/kidnap/nefertiri-trader"
## [1] "https://www.fbi.gov/wanted/kidnap/kristen-modafferi"
## [1] "https://www.fbi.gov/wanted/kidnap/lashaya-stine"
## [1] "https://www.fbi.gov/wanted/kidnap/jonathan-fraser"
## [1] "https://www.fbi.gov/wanted/kidnap/amina-and-belel-kandil"
## [1] "https://www.fbi.gov/wanted/kidnap/faloma-luhk"
## [1] "https://www.fbi.gov/wanted/kidnap/maleina-luhk"
## [1] "https://www.fbi.gov/wanted/kidnap/kaylah-hunter"
## [1] "https://www.fbi.gov/wanted/kidnap/russell-john-mort"
## [1] "https://www.fbi.gov/wanted/kidnap/akia-shawnta-eggleston"
## [1] "https://www.fbi.gov/wanted/kidnap/rondreiz-phillips"
## [1] "https://www.fbi.gov/wanted/kidnap/devonte-jordan-hart"
## [1] "https://www.fbi.gov/wanted/kidnap/amber-elizabeth-cates"
## [1] "https://www.fbi.gov/wanted/kidnap/shaina-ashley-kirkpatrick"
## [1] "https://www.fbi.gov/wanted/kidnap/shausha-latine-henson"
## [1] "https://www.fbi.gov/wanted/kidnap/amy-lynn-bradley"
## [1] "https://www.fbi.gov/wanted/kidnap/suzanne-g.-lyall"
## [1] "https://www.fbi.gov/wanted/kidnap/myra-lewis"
Before you write a loop…
arianna = all_persons_links[1]
temp_target_url = arianna
temp_target_page = read_html(temp_target_url)
Single-case proof.
description = temp_target_page %>%
html_nodes('table.wanted-person-description') %>%
html_table()
description
## [[1]]
## X1 X2
## 1 Date(s) of Birth Used September 6, 2013
## 2 Place of Birth California
## 3 Hair Black
## 4 Eyes Brown
## 5 Height 2'0" (at time of disappearance)
## 6 Weight 45 pounds (at time of disappearance)
## 7 Sex Female
## 8 Race Black
## 9 Nationality American
What we need for the for-loop:
list_for_data = list()
for(i in all_persons_links){
print(paste('Accessing:', i))
temp_target_url = i
temp_target_page = read_html(temp_target_url)
description = temp_target_page %>%
html_nodes('table.wanted-person-description') %>%
html_table()
index_of_i = which(i == all_persons_links)
list_for_data[[index_of_i]] = description
print('--- NEXT ---')
}
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/arianna-fitts"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/angela-mae-meeker"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/jennifer-lynn-marcum"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/felix-batista"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/mark-himebaugh"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/michaela-joy-garecht"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/ashley-summers"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/jayme-closs"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/ilene-beth-misheloff"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/carla-vicentini"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/lisa-irwin"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/kyron-richard-horman"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/bianca-lebron"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/jabez-spann"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/shanna-genelle-peoples"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/abby-lynn-patterson"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/ruochen-liao"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/joshua-keshaba-sierra-garcia"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/copy_of_alexis-tiara-murphy"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/erica-nicole-hunt"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/tionda-z.-bradley"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/diamond-yvette-bradley"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/nefertiri-trader"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/kristen-modafferi"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/lashaya-stine"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/jonathan-fraser"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/amina-and-belel-kandil"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/faloma-luhk"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/maleina-luhk"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/kaylah-hunter"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/russell-john-mort"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/akia-shawnta-eggleston"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/rondreiz-phillips"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/devonte-jordan-hart"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/amber-elizabeth-cates"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/shaina-ashley-kirkpatrick"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/shausha-latine-henson"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/amy-lynn-bradley"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/suzanne-g.-lyall"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/myra-lewis"
## [1] "--- NEXT ---"
Now we have a list of tables.
Each table contains the details of one missing person:
# thirteenth element in the list
list_for_data[[13]]
## [[1]]
## X1 X2
## 1 Date(s) of Birth Used June 26, 1991
## 2 Hair Long Dark Brown
## 3 Eyes Hazel
## 4 Height 4'11"
## 5 Weight 115 pounds
## 6 Sex Female
## 7 Race White (Hispanic)
understanding the structure of a webpage
We want: the text of the <div >
with class wanted-person-details
exploiting that structure for web-scraping
So what we want is:
<a >
tags (= links)wanted-person-description
<div >
with class wanted-person-details
Start with single-case proof:
arianna = all_persons_links[1]
temp_target_url = arianna
temp_target_page = read_html(temp_target_url)
person_details = temp_target_page %>%
html_nodes('div.wanted-person-details') %>%
html_text()
person_details
## [1] "\nDetails:\nArianna Fitts was reported missing from the San Francisco, California, area on April 5, 2016. She was last seen in Oakland, California, in February of 2016. On April 8, 2016, Arianna's mother, Nicole Fitts, was found murdered and buried in a public park in San Francisco. It is believed that Arianna was not with her mother when she was killed.\n"
Putting it in a loop:
list_for_person_details = list()
for(i in all_persons_links){
print(paste('Accessing:', i))
temp_target_url = i
temp_target_page = read_html(temp_target_url)
person_details = temp_target_page %>%
html_nodes('div.wanted-person-details') %>%
html_text()
index_of_i = which(i == all_persons_links)
list_for_person_details[[index_of_i]] = person_details
print('--- NEXT ---')
}
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/arianna-fitts"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/angela-mae-meeker"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/jennifer-lynn-marcum"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/felix-batista"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/mark-himebaugh"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/michaela-joy-garecht"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/ashley-summers"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/jayme-closs"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/ilene-beth-misheloff"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/carla-vicentini"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/lisa-irwin"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/kyron-richard-horman"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/bianca-lebron"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/jabez-spann"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/shanna-genelle-peoples"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/abby-lynn-patterson"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/ruochen-liao"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/joshua-keshaba-sierra-garcia"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/copy_of_alexis-tiara-murphy"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/erica-nicole-hunt"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/tionda-z.-bradley"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/diamond-yvette-bradley"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/nefertiri-trader"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/kristen-modafferi"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/lashaya-stine"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/jonathan-fraser"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/amina-and-belel-kandil"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/faloma-luhk"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/maleina-luhk"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/kaylah-hunter"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/russell-john-mort"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/akia-shawnta-eggleston"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/rondreiz-phillips"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/devonte-jordan-hart"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/amber-elizabeth-cates"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/shaina-ashley-kirkpatrick"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/shausha-latine-henson"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/amy-lynn-bradley"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/suzanne-g.-lyall"
## [1] "--- NEXT ---"
## [1] "Accessing: https://www.fbi.gov/wanted/kidnap/myra-lewis"
## [1] "--- NEXT ---"
#look at the 20th element
list_for_person_details[[20]]
## [1] "\nDetails:\nErica Nicole Hunt was last seen on July 4, 2016, near a relative's house in Opelousas, Louisiana. Her roommate reported her missing on July 5, 2016, and she has not been seen or heard from since.\n"
understanding the structure of a webpage
What do we want?
understanding the structure of a webpage
Each kidnapped person has a ‘download link’…
https://www.fbi.gov/wanted/kidnap/arianna-fitts/download.pdf
Don’t overthink it!
Compare these two:
https://www.fbi.gov/wanted/kidnap/arianna-fitts/download.pdf
arianna
## [1] "https://www.fbi.gov/wanted/kidnap/arianna-fitts"
Notice something?
We can just ‘work around’ this:
download_url_arianna = paste(tolower(arianna), '/download.pdf', sep="")
download_url_arianna
## [1] "https://www.fbi.gov/wanted/kidnap/arianna-fitts/download.pdf"
https://www.fbi.gov/wanted/kidnap/arianna-fitts/download.pdf
Make use of R’s vectorised structure:
all_download_links = paste(tolower(all_persons_links), '/download.pdf', sep="")
all_download_links
## [1] "https://www.fbi.gov/wanted/kidnap/arianna-fitts/download.pdf"
## [2] "https://www.fbi.gov/wanted/kidnap/angela-mae-meeker/download.pdf"
## [3] "https://www.fbi.gov/wanted/kidnap/jennifer-lynn-marcum/download.pdf"
## [4] "https://www.fbi.gov/wanted/kidnap/felix-batista/download.pdf"
## [5] "https://www.fbi.gov/wanted/kidnap/mark-himebaugh/download.pdf"
## [6] "https://www.fbi.gov/wanted/kidnap/michaela-joy-garecht/download.pdf"
## [7] "https://www.fbi.gov/wanted/kidnap/ashley-summers/download.pdf"
## [8] "https://www.fbi.gov/wanted/kidnap/jayme-closs/download.pdf"
## [9] "https://www.fbi.gov/wanted/kidnap/ilene-beth-misheloff/download.pdf"
## [10] "https://www.fbi.gov/wanted/kidnap/carla-vicentini/download.pdf"
## [11] "https://www.fbi.gov/wanted/kidnap/lisa-irwin/download.pdf"
## [12] "https://www.fbi.gov/wanted/kidnap/kyron-richard-horman/download.pdf"
## [13] "https://www.fbi.gov/wanted/kidnap/bianca-lebron/download.pdf"
## [14] "https://www.fbi.gov/wanted/kidnap/jabez-spann/download.pdf"
## [15] "https://www.fbi.gov/wanted/kidnap/shanna-genelle-peoples/download.pdf"
## [16] "https://www.fbi.gov/wanted/kidnap/abby-lynn-patterson/download.pdf"
## [17] "https://www.fbi.gov/wanted/kidnap/ruochen-liao/download.pdf"
## [18] "https://www.fbi.gov/wanted/kidnap/joshua-keshaba-sierra-garcia/download.pdf"
## [19] "https://www.fbi.gov/wanted/kidnap/copy_of_alexis-tiara-murphy/download.pdf"
## [20] "https://www.fbi.gov/wanted/kidnap/erica-nicole-hunt/download.pdf"
## [21] "https://www.fbi.gov/wanted/kidnap/tionda-z.-bradley/download.pdf"
## [22] "https://www.fbi.gov/wanted/kidnap/diamond-yvette-bradley/download.pdf"
## [23] "https://www.fbi.gov/wanted/kidnap/nefertiri-trader/download.pdf"
## [24] "https://www.fbi.gov/wanted/kidnap/kristen-modafferi/download.pdf"
## [25] "https://www.fbi.gov/wanted/kidnap/lashaya-stine/download.pdf"
## [26] "https://www.fbi.gov/wanted/kidnap/jonathan-fraser/download.pdf"
## [27] "https://www.fbi.gov/wanted/kidnap/amina-and-belel-kandil/download.pdf"
## [28] "https://www.fbi.gov/wanted/kidnap/faloma-luhk/download.pdf"
## [29] "https://www.fbi.gov/wanted/kidnap/maleina-luhk/download.pdf"
## [30] "https://www.fbi.gov/wanted/kidnap/kaylah-hunter/download.pdf"
## [31] "https://www.fbi.gov/wanted/kidnap/russell-john-mort/download.pdf"
## [32] "https://www.fbi.gov/wanted/kidnap/akia-shawnta-eggleston/download.pdf"
## [33] "https://www.fbi.gov/wanted/kidnap/rondreiz-phillips/download.pdf"
## [34] "https://www.fbi.gov/wanted/kidnap/devonte-jordan-hart/download.pdf"
## [35] "https://www.fbi.gov/wanted/kidnap/amber-elizabeth-cates/download.pdf"
## [36] "https://www.fbi.gov/wanted/kidnap/shaina-ashley-kirkpatrick/download.pdf"
## [37] "https://www.fbi.gov/wanted/kidnap/shausha-latine-henson/download.pdf"
## [38] "https://www.fbi.gov/wanted/kidnap/amy-lynn-bradley/download.pdf"
## [39] "https://www.fbi.gov/wanted/kidnap/suzanne-g.-lyall/download.pdf"
## [40] "https://www.fbi.gov/wanted/kidnap/myra-lewis/download.pdf"
Now:
Create filenames:
library(stringr)
file_names = paste(all_names, '.pdf', sep="")
file_names
## [1] "ARIANNA FITTS.pdf"
## [2] "ANGELA MAE MEEKER.pdf"
## [3] "JENNIFER LYNN MARCUM.pdf"
## [4] "FELIX BATISTA.pdf"
## [5] "MARK HIMEBAUGH.pdf"
## [6] "MICHAELA JOY GARECHT.pdf"
## [7] "ASHLEY SUMMERS.pdf"
## [8] "JAYME CLOSS.pdf"
## [9] "ILENE BETH MISHELOFF.pdf"
## [10] "CARLA VICENTINI.pdf"
## [11] "LISA IRWIN.pdf"
## [12] "KYRON RICHARD HORMAN.pdf"
## [13] "BIANCA LEBRON.pdf"
## [14] "JABEZ SPANN.pdf"
## [15] "SHANNA GENELLE PEOPLES.pdf"
## [16] "ABBY LYNN PATTERSON.pdf"
## [17] "RUOCHEN LIAO.pdf"
## [18] "JOSHUA KESHABA SIERRA GARCIA.pdf"
## [19] "ALEXIS TIARA MURPHY.pdf"
## [20] "ERICA NICOLE HUNT.pdf"
## [21] "TIONDA Z. BRADLEY.pdf"
## [22] "DIAMOND YVETTE BRADLEY.pdf"
## [23] "NEFERTIRI R. TRADER.pdf"
## [24] "KRISTEN MODAFFERI.pdf"
## [25] "LASHAYA STINE.pdf"
## [26] "JONATHAN FRASER.pdf"
## [27] "AMINA AND BELEL KANDIL.pdf"
## [28] "FALOMA LUHK.pdf"
## [29] "MALEINA LUHK.pdf"
## [30] "KAYLAH HUNTER AND KRISTIAN JUSTICE.pdf"
## [31] "RUSSELL JOHN MORT.pdf"
## [32] "AKIA SHAWNTA EGGLESTON.pdf"
## [33] "RONDREIZ PHILLIPS.pdf"
## [34] "DEVONTE JORDAN HART.pdf"
## [35] "AMBER ELIZABETH CATES.pdf"
## [36] "SHAINA ASHLEY KIRKPATRICK.pdf"
## [37] "SHAUSHA LATINE HENSON.pdf"
## [38] "AMY LYNN BRADLEY.pdf"
## [39] "SUZANNE G. LYALL.pdf"
## [40] "MYRA LEWIS.pdf"
Refine filenames:
refined_file_names = tolower(str_replace_all(string = file_names, pattern = " ", replacement = "_"))
refined_file_names
## [1] "arianna_fitts.pdf"
## [2] "angela_mae_meeker.pdf"
## [3] "jennifer_lynn_marcum.pdf"
## [4] "felix_batista.pdf"
## [5] "mark_himebaugh.pdf"
## [6] "michaela_joy_garecht.pdf"
## [7] "ashley_summers.pdf"
## [8] "jayme_closs.pdf"
## [9] "ilene_beth_misheloff.pdf"
## [10] "carla_vicentini.pdf"
## [11] "lisa_irwin.pdf"
## [12] "kyron_richard_horman.pdf"
## [13] "bianca_lebron.pdf"
## [14] "jabez_spann.pdf"
## [15] "shanna_genelle_peoples.pdf"
## [16] "abby_lynn_patterson.pdf"
## [17] "ruochen_liao.pdf"
## [18] "joshua_keshaba_sierra_garcia.pdf"
## [19] "alexis_tiara_murphy.pdf"
## [20] "erica_nicole_hunt.pdf"
## [21] "tionda_z._bradley.pdf"
## [22] "diamond_yvette_bradley.pdf"
## [23] "nefertiri_r._trader.pdf"
## [24] "kristen_modafferi.pdf"
## [25] "lashaya_stine.pdf"
## [26] "jonathan_fraser.pdf"
## [27] "amina_and_belel_kandil.pdf"
## [28] "faloma_luhk.pdf"
## [29] "maleina_luhk.pdf"
## [30] "kaylah_hunter_and_kristian_justice.pdf"
## [31] "russell_john_mort.pdf"
## [32] "akia_shawnta_eggleston.pdf"
## [33] "rondreiz_phillips.pdf"
## [34] "devonte_jordan_hart.pdf"
## [35] "amber_elizabeth_cates.pdf"
## [36] "shaina_ashley_kirkpatrick.pdf"
## [37] "shausha_latine_henson.pdf"
## [38] "amy_lynn_bradley.pdf"
## [39] "suzanne_g._lyall.pdf"
## [40] "myra_lewis.pdf"
Download each pdf from the url and use the refined filenames
for(i in 1:length(all_download_links)){
print(paste('Accessing URL: ', all_download_links[i], sep=""))
download.file(url = all_download_links[i]
, destfile = refined_file_names[i]
, mode = "wb")
}
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/arianna-fitts/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/angela-mae-meeker/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/jennifer-lynn-marcum/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/felix-batista/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/mark-himebaugh/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/michaela-joy-garecht/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/ashley-summers/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/jayme-closs/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/ilene-beth-misheloff/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/carla-vicentini/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/lisa-irwin/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/kyron-richard-horman/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/bianca-lebron/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/jabez-spann/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/shanna-genelle-peoples/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/abby-lynn-patterson/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/ruochen-liao/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/joshua-keshaba-sierra-garcia/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/copy_of_alexis-tiara-murphy/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/erica-nicole-hunt/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/tionda-z.-bradley/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/diamond-yvette-bradley/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/nefertiri-trader/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/kristen-modafferi/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/lashaya-stine/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/jonathan-fraser/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/amina-and-belel-kandil/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/faloma-luhk/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/maleina-luhk/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/kaylah-hunter/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/russell-john-mort/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/akia-shawnta-eggleston/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/rondreiz-phillips/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/devonte-jordan-hart/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/amber-elizabeth-cates/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/shaina-ashley-kirkpatrick/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/shausha-latine-henson/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/amy-lynn-bradley/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/suzanne-g.-lyall/download.pdf"
## [1] "Accessing URL: https://www.fbi.gov/wanted/kidnap/myra-lewis/download.pdf"
Same idea, different code details:
Same idea, different code details:
Same idea, different code details:
Same idea, different code details:
Tutorial: APIs + Webscraping in R
Homework: Do the readings for today (blogposts/guides)
Next week: Text data 1