Strings! Are they numeric values’ unruly, difficult, unpopular cousin? We think not! Strings not only play an important role in data cleaning; text itself is increasingly treated as a rich source of data for research in political science and public policy (see, for example, https://journals.sagepub.com/doi/full/10.1177/20531680211022206 or https://onlinelibrary.wiley.com/doi/10.1111/padm.12656). So today, we want to dive deeper into how to manipulate strings using the stringr package.
Most of us already became somewhat familiar with the basic functionality of the stringr package when we used it for the previous assignments. stringr is part of the tidyverse; the version used here, 1.4.0, was released in 2019. All functions in stringr start with “str_”, and the first argument is always the vector of strings that you want to modify, e.g. str_replace(argument1, …), which comes in handy when using the pipe to write your code.
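To make the naming convention concrete, here is a minimal sketch with a made-up vector (fruits is our own illustrative example, not part of the workshop data). Because the string vector always comes first, the functions chain naturally with the pipe.

```r
library(stringr)
library(magrittr)  # provides %>%; the pipe is also available after library(tidyverse)

# hypothetical example vector
fruits <- c("apple pie", "banana split")

# direct call: the string vector is the first argument
direct <- str_replace(fruits, "a", "A")

# piped call: the vector on the left is passed in as the first argument
piped <- fruits %>% str_replace("a", "A")

direct
identical(direct, piped)
```

Both calls replace only the first "a" in each string, since str_replace stops after the first match.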
In this document, we want to walk you through key functions of the stringr package and some applications.
First, let’s load the necessary packages we’ll need for this exercise:
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.6
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stringr)
library(xml2)
For the purpose of this workshop, we will be working with a tried and trusted data source: newspaper headlines! Thanks to the folks at the Internet Archive (https://archive.org), we can look at newspaper headlines across different points in time. Below, we store the link to the Guardian headlines from the first day of classes, September 1, 2021.
guardian_url <- "https://web.archive.org/web/20210901040912/https://www.theguardian.com/international"
Let’s extract the newspaper headlines from that day.
guardian_headlines <- guardian_url %>% read_html() %>%
  rvest::html_nodes(xpath = '//a[contains(concat( " ", @class, " " ), concat( " ", "js-headline-text", " " ))]') %>% rvest::html_text()
guardian_headlines[c(1:5)]
## [1] "Biden calls for new era in US foreign policy in defensive speech"
## [2] "Biden sets himself apart by placing Afghanistan blame at predecessors’ feet"
## [3] "Wheelchair basketball, road cycling, badminton begins and more – live!"
## [4] "Father seeking $2m before stepping down as conservator, court filing claims"
## [5] "Up to half of world’s wild tree species could be at risk of extinction"
Nice! What if we wanted to access the headlines of the following day? Let’s take a moment and think about the logic behind the URL:
https://web.archive.org/web/20210901040912/https://www.theguardian.com/international
We have the Internet Archive “…/web/” followed by a time stamp and the URL of our news source. Let’s use str_split to split the string into the base part of the URL, which we call url_archive, and the changing part, which contains the date and the news source. str_split splits a string wherever the pattern occurs, drops the pattern itself, and returns a list of substrings. Here we want to split after “web/”, so let us make this our splitting pattern.
guardian_split <- guardian_url %>% str_split(pattern = "web/")
guardian_split
## [[1]]
## [1] "https://web.archive.org/"
## [2] "20210901040912/https://www.theguardian.com/international"
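As a small aside on the return value: str_split returns a list with one element per input string, which is why we index with [[1]] in the next step. The sketch uses the same URL; the object names parts and parts_matrix are ours.

```r
library(stringr)

guardian_url <- "https://web.archive.org/web/20210901040912/https://www.theguardian.com/international"

# a list of length 1, holding a character vector with two pieces
parts <- str_split(guardian_url, pattern = "web/")
length(parts)
parts[[1]]

# simplify = TRUE returns a character matrix instead of a list
parts_matrix <- str_split(guardian_url, pattern = "web/", simplify = TRUE)
dim(parts_matrix)
```

The list form is convenient when splitting many URLs at once, while the matrix form can be easier to index when every string yields the same number of pieces.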
We can then store the base part and the changing part in separate strings.
url_archive <- guardian_split[[1]][1]
url_date <- guardian_split[[1]][2]
url_archive
## [1] "https://web.archive.org/"
url_date
## [1] "20210901040912/https://www.theguardian.com/international"
We can see that the first four digits are the year, followed by two digits for the month, two for the day, and the remaining 6 for the time of day at which the information was collected (we can disregard the time of day for now).
Using the str_sub command and specifying the start and end positions in the string, we extract the year, month, and day. (Negative indexing, which counts from the end of the string, is also possible.)
url_year <- url_date %>% str_sub(start = 1, end = 4)
url_month <- url_date %>% str_sub(start = 5, end = 6)
url_day <- url_date %>% str_sub(start = 7, end = 8)
url_year
## [1] "2021"
url_month
## [1] "09"
url_day
## [1] "01"
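The remark on negative indexing can be illustrated with a short sketch: negative positions count backwards from the end of the string. The variable stamp is a name we introduce here for the 14-digit time stamp.

```r
library(stringr)

stamp <- "20210901040912"

# last six characters: the time of day (hhmmss)
time_part <- str_sub(stamp, start = -6, end = -1)
time_part

# everything except the last six characters: the date
date_part <- str_sub(stamp, start = 1, end = -7)
date_part
```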
If we wanted to replace the information about year, month, or day, we could use the str_sub or str_replace function. With str_sub, we assign a new value to the given positions of the string; with str_replace, we replace the first match of a pattern.
str_sub(url_date, start = 7, end = 8) <- '02'
url_date
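## [1] "20210902040912/https://www.theguardian.com/international"
# url_date was already updated above, so it no longer contains "01" and str_replace returns it unchanged
url_date2 <- str_replace(url_date, "01", "02")
url_date2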
## [1] "20210902040912/https://www.theguardian.com/international"
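One caveat when choosing between the two approaches: str_replace substitutes the first match of the pattern, wherever it occurs, whereas str_sub works on fixed positions. The following sketch uses a made-up time stamp for January 10 (not one of our archive snapshots) to show the difference.

```r
library(stringr)

# hypothetical time stamp: 10 January 2021, 04:09:12
stamp <- "20210110040912"

# the first "01" belongs to the month, so this changes the month, not the day
by_pattern <- str_replace(stamp, "01", "02")
by_pattern

# position-based replacement changes exactly the day (characters 7 and 8)
by_position <- stamp
str_sub(by_position, start = 7, end = 8) <- "11"
by_position
```

For fixed-width time stamps like ours, the position-based version is the safer choice.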
Now that we have changed the day in the URL to September 2, 2021, we merge the two substrings again. We can use str_c for this. Keep in mind that earlier we split the URL at ‘web/’, which removed that part, so we need to add it back to obtain a working URL.
new_url <- str_c(url_archive, url_date, sep = "web/")
new_url
## [1] "https://web.archive.org/web/20210902040912/https://www.theguardian.com/international"
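Because str_c is vectorised, the same idea extends to several days at once. The sketch below builds URLs for the first three days of September 2021; str_pad adds the leading zero. We reuse the time of day from our original link as an assumption, and we do not check here whether the archive holds a snapshot at exactly that time. The names url_source, days, stamps and new_urls are ours.

```r
library(stringr)

url_archive <- "https://web.archive.org/"
url_source <- "/https://www.theguardian.com/international"  # our own helper name

# "01", "02", "03"
days <- str_pad(as.character(1:3), width = 2, side = "left", pad = "0")

# year and month + day + time of day (assumed to stay 040912)
stamps <- str_c("202109", days, "040912")

# one URL per day
new_urls <- str_c(url_archive, "web/", stamps, url_source)
new_urls
```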
With our new URL, we can now scrape the headlines for the next day’s edition of the Guardian newspaper!
guardian_headlines_new <- new_url %>% read_html() %>%
  rvest::html_nodes(xpath = '//a[contains(concat( " ", @class, " " ), concat( " ", "js-headline-text", " " ))]') %>% rvest::html_text()
guardian_headlines_new[c(1:5)]
## [1] "Texas law 'blatantly violates' constitutional rights, says Biden"
## [2] "Sackler family set to pay $4.5bn to settle claims "
## [3] "US military chief could work with Taliban on IS counter-terror strikes"
## [4] "Joe Rogan has the virus – and his treatment will make health experts feel ill"
## [5] "Site bans Covid misinformation forum after ‘go dark’ protest"
Now that we know how to retrieve headlines, let us have a look at the headlines we got for 1 September. First, we might be interested in ordering them alphabetically. We can use str_sort for this.
str_sort(guardian_headlines)
## [1] " More from the series "
## [2] " The British citizens left behind in Kabul "
## [3] "‘People are broken’: Afghans describe first day under full Taliban control"
## [4] "10 ways to approach a sensitive, daunting conversation"
## [5] "75 cases recorded after two days of falls "
## [6] "A human is not a horse. So why is a livestock drug sweeping America?"
## [7] "Afghan athlete evacuated from Kabul belatedly competes"
## [8] "Afghanistan women’s cricketers left feeling abandoned"
## [9] "Afghanistan’s neighbours offered millions in aid to harbour refugees"
## [10] "Astonishing and petrifying"
## [11] "Banned BBC journalist says country ‘moving in reverse’ in final report "
## [12] "Biden calls for new era in US foreign policy in defensive speech"
## [13] "Biden sets himself apart by placing Afghanistan blame at predecessors’ feet"
## [14] "Brazil football legend in hospital for routine exams"
## [15] "Can the ‘high heel index’ predict economic growth?"
## [16] "Concern grows for global supply as Vietnam struggles in lockdown"
## [17] "Cyclist becomes Summer and Winter Paralympic champion with gold"
## [18] "Cyprus prepares for oil spill from Syrian power plant"
## [19] "Doja Cat, fires and festivals"
## [20] "Dunkley eager to face New Zealand and judge Hundred impact"
## [21] "Evin prison guards investigated after abuse video leak"
## [22] "Family of US journalist Danny Fenster calls for release after 100 days of detention "
## [23] "Fatah critic’s death highlights brutality of Palestinian Authority"
## [24] "Father seeking $2m before stepping down as conservator, court filing claims"
## [25] "Floating wind turbines could open up vast ocean tracts for generation"
## [26] "Football legend in hospital in Brazil for routine exams"
## [27] "For years it was seen as a far-off problem"
## [28] "Fox News accused of stoking violence after Tucker Carlson ‘revolt’ prediction"
## [29] "Germany warns union against setting target of Afghan refugees"
## [30] "Governor pardons seven Black men executed in 1951 for rape of a white woman"
## [31] "Griezmann re-joins Atlético from Barcelona in shock move"
## [32] "Hate crimes rise to highest level in 12 years, says FBI report"
## [33] "Health authorities warn against mixing Covid vaccine types"
## [34] "Hicks adds time trial gold to track silver medal"
## [35] "How a hot blob off the country's coast is contributing to drought in South America"
## [36] "How artificial birds are relaying the secrets of ocean currents"
## [37] "How artificial birds are relaying the secrets of ocean currents"
## [38] "How did a Bob Ross documentary become so contentious?"
## [39] "How have you been affected by those in southern Europe?"
## [40] "How New Zealand’s Maori are reclaiming land with occupations"
## [41] "How the US created a world of endless war"
## [42] "I fear for my family in Kabul, but I know the Taliban can be resisted"
## [43] "I hitchhiked 100 miles home from my school for the blind"
## [44] "In Afghanistan, Islamic State is seeking to exploit divisions within the Taliban"
## [45] "Israel registers record daily coronavirus cases"
## [46] "Its responsibilities don’t end here"
## [47] "Judge orders hospital to treat Covid patient with ivermectin"
## [48] "Kane has no regrets over trying to force move from Spurs"
## [49] "Last man out: the haunting image of America’s final moments in Afghanistan"
## [50] "Lockdown has made us fall in love with the sea"
## [51] "Miss Marple back on the case in stories by Naomi Alderman, Ruth Ware and more"
## [52] "New Zealand minister’s TV interview interrupted by son waving phallic carrot"
## [53] "No-cook dinners for summer nights"
## [54] "Outrage after Ivory Coast TV presenter asks guest to simulate rape"
## [55] "Outrage after Ivory Coast TV presenter asks guest to simulate rape"
## [56] "Passports will make hesitant people ‘even more reluctant to get jabbed’"
## [57] "People can self-identify as male or female in Scottish census, says guidance"
## [58] "Population surpasses 5m for first time since 1851"
## [59] "Princess Diana film debuts as industry aims for return to normality"
## [60] "Queen hired as set designer on new Netflix film"
## [61] "Raducanu shrugs off nerves to reach second round"
## [62] "Sergei Kovalev obituary"
## [63] "Share your experience of coronavirus"
## [64] "Skyscraper plans threaten UK’s oldest synagogue"
## [65] "Storm leaves trail of destruction as road collapse raises death toll to four"
## [66] "Storm leaves trail of destruction as road collapse raises death toll to four"
## [67] "Storm’s rampage through Louisiana"
## [68] "Taliban enjoy moment of victory as focus shifts to challenges ahead"
## [69] "The art show co-curated by a five-year-old"
## [70] "The end of Geronimo and the last US soldier in Afghanistan"
## [71] "The fight over Jeff Buckley’s final recordings"
## [72] "The photography of Hiro "
## [73] "The Stranglers on fights, drugs and finally growing up"
## [74] "The Sydney suburbs under curfew: street lights and empty spaces "
## [75] "The US supreme court is deciding more and more cases in a secretive ‘shadow docket’"
## [76] "The wild adventures of Fred Baldwin "
## [77] "To understand what happens next in Afghanistan, look to its neighbours"
## [78] "Unstoppable movement: how New Zealand’s Maori are reclaiming land with occupations"
## [79] "Up to half of world’s wild tree species could be at risk of extinction"
## [80] "US national parks are overcrowded. Some think ‘selfie stations’ will help"
## [81] "US veterans on seeing Afghanistan fall to the Taliban"
## [82] "Vaccine passports will make hesitant people ‘even more reluctant to get jabbed’"
## [83] "Walking-and-talking romance never gets anywhere "
## [84] "Welsh teen in hospital with Covid targeted online by anti-vaxxers "
## [85] "Western economies can’t return to ‘business as usual’ after the pandemic"
## [86] "What evolutionary advantage comes from women having considerably less body hair than men?"
## [87] "What was the moment that changed you?"
## [88] "Wheelchair basketball, road cycling, badminton begins and more – live!"
## [89] "Why a rude carrot can spark sheer joy – from ancient Egypt to today"
## [90] "Wild cockatoos observed using tools as ‘cutlery’ to extract seeds from tropical fruit"
## [91] "Without a guiding purpose, Boris Johnson will always be governing in a crisis"
Did you notice that the first two headlines start with whitespace and are therefore mistakenly ranked first? We can trim whitespace at the beginning and end of each string using str_trim. Furthermore, let us truncate the strings to 17 characters using str_trunc to make the output neater. This gives us a nicely sorted list.
str_sort(str_trunc(str_trim(guardian_headlines),17))
## [1] "‘People are br..." "10 ways to app..." "75 cases recor..."
## [4] "A human is not..." "Afghan athlete..." "Afghanistan wo..."
## [7] "Afghanistan’s ..." "Astonishing an..." "Banned BBC jou..."
## [10] "Biden calls fo..." "Biden sets him..." "Brazil footbal..."
## [13] "Can the ‘high ..." "Concern grows ..." "Cyclist become..."
## [16] "Cyprus prepare..." "Doja Cat, fire..." "Dunkley eager ..."
## [19] "Evin prison gu..." "Family of US j..." "Fatah critic’s..."
## [22] "Father seeking..." "Floating wind ..." "Football legen..."
## [25] "For years it w..." "Fox News accus..." "Germany warns ..."
## [28] "Governor pardo..." "Griezmann re-j..." "Hate crimes ri..."
## [31] "Health authori..." "Hicks adds tim..." "How a hot blob..."
## [34] "How artificial..." "How artificial..." "How did a Bob ..."
## [37] "How have you b..." "How New Zealan..." "How the US cre..."
## [40] "I fear for my ..." "I hitchhiked 1..." "In Afghanistan..."
## [43] "Israel registe..." "Its responsibi..." "Judge orders h..."
## [46] "Kane has no re..." "Last man out: ..." "Lockdown has m..."
## [49] "Miss Marple ba..." "More from the ..." "New Zealand mi..."
## [52] "No-cook dinner..." "Outrage after ..." "Outrage after ..."
## [55] "Passports will..." "People can sel..." "Population sur..."
## [58] "Princess Diana..." "Queen hired as..." "Raducanu shrug..."
## [61] "Sergei Kovalev..." "Share your exp..." "Skyscraper pla..."
## [64] "Storm leaves t..." "Storm leaves t..." "Storm’s rampag..."
## [67] "Taliban enjoy ..." "The art show c..." "The British ci..."
## [70] "The end of Ger..." "The fight over..." "The photograph..."
## [73] "The Stranglers..." "The Sydney sub..." "The US supreme..."
## [76] "The wild adven..." "To understand ..." "Unstoppable mo..."
## [79] "Up to half of ..." "US national pa..." "US veterans on..."
## [82] "Vaccine passpo..." "Walking-and-ta..." "Welsh teen in ..."
## [85] "Western econom..." "What evolution..." "What was the m..."
## [88] "Wheelchair bas..." "Why a rude car..." "Wild cockatoos..."
## [91] "Without a guid..."
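To see what the two helpers do on their own, here is a minimal sketch on two of the headlines above (the vector name x is ours).

```r
library(stringr)

x <- c(" More from the series ", "Astonishing and petrifying")

# remove whitespace on both sides (the default), or on one side only
trimmed <- str_trim(x)
trimmed_left <- str_trim(x, side = "left")

# shorten to at most 17 characters, including the "..." ellipsis
shortened <- str_trunc(trimmed, width = 17)

trimmed
trimmed_left
shortened
```

Note that the width passed to str_trunc includes the three dots, so only 14 characters of each headline remain.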
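If we want to know the order in which the headlines would appear once sorted, we use str_order. It returns the positions of the elements in the original vector, arranged in sorted order: the first value, 82, tells us that headline 82 comes first alphabetically.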
str_order(str_trim(guardian_headlines))
## [1] 82 62 20 32 22 27 91 54 46 1 2 10 65 7 23 53 80 30 45 9 16 4 41 28 43
## [26] 50 51 11 25 49 19 24 42 14 67 55 75 13 66 31 60 38 87 34 21 29 69 64 59 40
## [51] 89 63 8 84 18 48 52 56 57 26 68 73 47 6 44 81 90 12 39 77 17 79 15 76 37
## [76] 78 36 86 5 71 70 85 58 83 35 72 74 3 61 88 33
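The relationship between str_order and str_sort is easiest to see on a tiny vector. In this sketch (with made-up words), indexing the vector by its order reproduces the sorted vector, and applying order() to the result gives each element's rank.

```r
library(stringr)

words <- c("pear", "apple", "fig")

# positions of the elements in sorted order
ord <- str_order(words)
ord

# indexing by the order is the same as sorting
same_as_sort <- identical(words[ord], str_sort(words))
same_as_sort

# rank of each element in the sorted vector
ranks <- order(ord)
ranks
```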
Finally, we might also be interested in finding out how many characters each headline has. We can use str_length for this.
str_length(guardian_headlines)
## [1] 64 75 70 75 70 76 64 66 84 52 75 42 60 63 54 66 46 71 58 42 60 54 63 48 56
## [26] 48 53 55 56 58 69 68 77 35 72 70 83 80 43 22 69 82 42 76 54 71 47 76 62 77
## [51] 61 49 53 26 53 67 47 48 77 56 67 54 33 46 50 41 63 23 74 53 73 89 36 37 55
## [76] 64 58 36 24 29 33 74 66 66 79 82 47 85 76 67 68
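Note that str_length counts every character, including leading and trailing whitespace, so trimming first can change the result. A small sketch with one of the headlines (the variable names are ours):

```r
library(stringr)

headline <- " More from the series "

# counts the two surrounding spaces as well
raw_length <- str_length(headline)
raw_length

# length of the trimmed headline
trimmed_length <- str_length(str_trim(headline))
trimmed_length
```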
Next, we want to process the headlines so that we can analyse the words in them properly. First, we want to make sure that our list of headlines does not include duplicates. We can eliminate duplicates with unique. FYI: if we wanted to duplicate strings instead, we could use str_dup, specifying the number of repetitions in the second argument.
guardian_headlines <- unique(guardian_headlines)
headlines_triple <- str_dup(guardian_headlines[1], times = 3)
headlines_triple
## [1] "Biden calls for new era in US foreign policy in defensive speechBiden calls for new era in US foreign policy in defensive speechBiden calls for new era in US foreign policy in defensive speech"
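Both helpers are easy to try on a toy vector. The sketch below (values made up for illustration) also shows that str_dup is vectorised over its input.

```r
library(stringr)

x <- c("a", "b", "a")

# drop repeated elements, keeping the first occurrence
deduplicated <- unique(x)
deduplicated

# repeat each string; times is the second argument
tripled <- str_dup("ab", times = 3)
doubled <- str_dup(c("a", "b"), times = 2)
tripled
doubled
```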
Second, we want to collapse all headlines into a single string, for which we use str_flatten. Note that the second argument is the string that will be placed between the elements. In our case we want a space, so we specify " ".
headline_text <- str_flatten(guardian_headlines, " ")
headline_text
## [1] "Biden calls for new era in US foreign policy in defensive speech Biden sets himself apart by placing Afghanistan blame at predecessors’ feet Wheelchair basketball, road cycling, badminton begins and more – live! Father seeking $2m before stepping down as conservator, court filing claims Up to half of world’s wild tree species could be at risk of extinction Storm leaves trail of destruction as road collapse raises death toll to four Concern grows for global supply as Vietnam struggles in lockdown Outrage after Ivory Coast TV presenter asks guest to simulate rape Family of US journalist Danny Fenster calls for release after 100 days of detention Brazil football legend in hospital for routine exams Governor pardons seven Black men executed in 1951 for rape of a white woman The art show co-curated by a five-year-old How New Zealand’s Maori are reclaiming land with occupations How artificial birds are relaying the secrets of ocean currents The Stranglers on fights, drugs and finally growing up Fatah critic’s death highlights brutality of Palestinian Authority The fight over Jeff Buckley’s final recordings Passports will make hesitant people ‘even more reluctant to get jabbed’ Health authorities warn against mixing Covid vaccine types 75 cases recorded after two days of falls Judge orders hospital to treat Covid patient with ivermectin Afghan athlete evacuated from Kabul belatedly competes Cyclist becomes Summer and Winter Paralympic champion with gold Hicks adds time trial gold to track silver medal Griezmann re-joins Atlético from Barcelona in shock move Raducanu shrugs off nerves to reach second round Afghanistan women’s cricketers left feeling abandoned Football legend in hospital in Brazil for routine exams Kane has no regrets over trying to force move from Spurs Dunkley eager to face New Zealand and judge Hundred impact I fear for my family in Kabul, but I know the Taliban can be resisted A human is not a horse. So why is a livestock drug sweeping America? 
Without a guiding purpose, Boris Johnson will always be governing in a crisis Its responsibilities don’t end here Western economies can’t return to ‘business as usual’ after the pandemic To understand what happens next in Afghanistan, look to its neighbours The US supreme court is deciding more and more cases in a secretive ‘shadow docket’ In Afghanistan, Islamic State is seeking to exploit divisions within the Taliban The British citizens left behind in Kabul More from the series Floating wind turbines could open up vast ocean tracts for generation How a hot blob off the country's coast is contributing to drought in South America For years it was seen as a far-off problem Evin prison guards investigated after abuse video leak Banned BBC journalist says country ‘moving in reverse’ in final report Skyscraper plans threaten UK’s oldest synagogue People can self-identify as male or female in Scottish census, says guidance Hate crimes rise to highest level in 12 years, says FBI report Fox News accused of stoking violence after Tucker Carlson ‘revolt’ prediction Germany warns union against setting target of Afghan refugees Population surpasses 5m for first time since 1851 Cyprus prepares for oil spill from Syrian power plant Astonishing and petrifying How did a Bob Ross documentary become so contentious? Princess Diana film debuts as industry aims for return to normality Queen hired as set designer on new Netflix film Walking-and-talking romance never gets anywhere Miss Marple back on the case in stories by Naomi Alderman, Ruth Ware and more I hitchhiked 100 miles home from my school for the blind Why a rude carrot can spark sheer joy – from ancient Egypt to today 10 ways to approach a sensitive, daunting conversation No-cook dinners for summer nights Lockdown has made us fall in love with the sea Can the ‘high heel index’ predict economic growth? 
How the US created a world of endless war Sergei Kovalev obituary Last man out: the haunting image of America’s final moments in Afghanistan US veterans on seeing Afghanistan fall to the Taliban US national parks are overcrowded. Some think ‘selfie stations’ will help What evolutionary advantage comes from women having considerably less body hair than men? Share your experience of coronavirus What was the moment that changed you? How have you been affected by those in southern Europe? The Sydney suburbs under curfew: street lights and empty spaces The end of Geronimo and the last US soldier in Afghanistan The wild adventures of Fred Baldwin The photography of Hiro Doja Cat, fires and festivals Storm’s rampage through Louisiana ‘People are broken’: Afghans describe first day under full Taliban control Welsh teen in hospital with Covid targeted online by anti-vaxxers Vaccine passports will make hesitant people ‘even more reluctant to get jabbed’ Unstoppable movement: how New Zealand’s Maori are reclaiming land with occupations Israel registers record daily coronavirus cases Wild cockatoos observed using tools as ‘cutlery’ to extract seeds from tropical fruit New Zealand minister’s TV interview interrupted by son waving phallic carrot Taliban enjoy moment of victory as focus shifts to challenges ahead Afghanistan’s neighbours offered millions in aid to harbour refugees"
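A minimal sketch on a short vector shows the effect of the second argument, which stringr calls collapse. For a character vector like this one, str_c with its collapse argument gives the same result.

```r
library(stringr)

x <- c("Astonishing and petrifying", "Sergei Kovalev obituary")

# join all elements into one string, separated by a space
flat <- str_flatten(x, collapse = " ")
flat

# same result via str_c
same_as_str_c <- identical(flat, str_c(x, collapse = " "))
same_as_str_c
```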
Nice! However, there is still punctuation and too much whitespace in the text. Let us eliminate all punctuation at once using str_replace_all. The regex class [:punct:] matches punctuation characters, which eliminates the need to specify dots, exclamation marks etc. separately. Similarly, [:space:] matches line breaks, tabs and spaces. Replacing each whitespace character with a single space does not collapse runs of spaces, though, so alternatively we can use str_squish, which trims the ends and reduces repeated whitespace to a single space. Compare the results.
headline_text <- str_replace_all(headline_text, "[:punct:]", "")
headline_text <- str_replace_all(headline_text, "[:space:]", " ")
headline_text2 <- str_squish(headline_text)
headline_text
## [1] "Biden calls for new era in US foreign policy in defensive speech Biden sets himself apart by placing Afghanistan blame at predecessors feet Wheelchair basketball road cycling badminton begins and more live Father seeking $2m before stepping down as conservator court filing claims Up to half of worlds wild tree species could be at risk of extinction Storm leaves trail of destruction as road collapse raises death toll to four Concern grows for global supply as Vietnam struggles in lockdown Outrage after Ivory Coast TV presenter asks guest to simulate rape Family of US journalist Danny Fenster calls for release after 100 days of detention Brazil football legend in hospital for routine exams Governor pardons seven Black men executed in 1951 for rape of a white woman The art show cocurated by a fiveyearold How New Zealands Maori are reclaiming land with occupations How artificial birds are relaying the secrets of ocean currents The Stranglers on fights drugs and finally growing up Fatah critics death highlights brutality of Palestinian Authority The fight over Jeff Buckleys final recordings Passports will make hesitant people even more reluctant to get jabbed Health authorities warn against mixing Covid vaccine types 75 cases recorded after two days of falls Judge orders hospital to treat Covid patient with ivermectin Afghan athlete evacuated from Kabul belatedly competes Cyclist becomes Summer and Winter Paralympic champion with gold Hicks adds time trial gold to track silver medal Griezmann rejoins Atlético from Barcelona in shock move Raducanu shrugs off nerves to reach second round Afghanistan womens cricketers left feeling abandoned Football legend in hospital in Brazil for routine exams Kane has no regrets over trying to force move from Spurs Dunkley eager to face New Zealand and judge Hundred impact I fear for my family in Kabul but I know the Taliban can be resisted A human is not a horse So why is a livestock drug sweeping America Without a guiding 
purpose Boris Johnson will always be governing in a crisis Its responsibilities dont end here Western economies cant return to business as usual after the pandemic To understand what happens next in Afghanistan look to its neighbours The US supreme court is deciding more and more cases in a secretive shadow docket In Afghanistan Islamic State is seeking to exploit divisions within the Taliban The British citizens left behind in Kabul More from the series Floating wind turbines could open up vast ocean tracts for generation How a hot blob off the countrys coast is contributing to drought in South America For years it was seen as a faroff problem Evin prison guards investigated after abuse video leak Banned BBC journalist says country moving in reverse in final report Skyscraper plans threaten UKs oldest synagogue People can selfidentify as male or female in Scottish census says guidance Hate crimes rise to highest level in 12 years says FBI report Fox News accused of stoking violence after Tucker Carlson revolt prediction Germany warns union against setting target of Afghan refugees Population surpasses 5m for first time since 1851 Cyprus prepares for oil spill from Syrian power plant Astonishing and petrifying How did a Bob Ross documentary become so contentious Princess Diana film debuts as industry aims for return to normality Queen hired as set designer on new Netflix film Walkingandtalking romance never gets anywhere Miss Marple back on the case in stories by Naomi Alderman Ruth Ware and more I hitchhiked 100 miles home from my school for the blind Why a rude carrot can spark sheer joy from ancient Egypt to today 10 ways to approach a sensitive daunting conversation Nocook dinners for summer nights Lockdown has made us fall in love with the sea Can the high heel index predict economic growth How the US created a world of endless war Sergei Kovalev obituary Last man out the haunting image of Americas final moments in Afghanistan US veterans on seeing Afghanistan 
fall to the Taliban US national parks are overcrowded Some think selfie stations will help What evolutionary advantage comes from women having considerably less body hair than men Share your experience of coronavirus What was the moment that changed you How have you been affected by those in southern Europe The Sydney suburbs under curfew street lights and empty spaces The end of Geronimo and the last US soldier in Afghanistan The wild adventures of Fred Baldwin The photography of Hiro Doja Cat fires and festivals Storms rampage through Louisiana People are broken Afghans describe first day under full Taliban control Welsh teen in hospital with Covid targeted online by antivaxxers Vaccine passports will make hesitant people even more reluctant to get jabbed Unstoppable movement how New Zealands Maori are reclaiming land with occupations Israel registers record daily coronavirus cases Wild cockatoos observed using tools as cutlery to extract seeds from tropical fruit New Zealand ministers TV interview interrupted by son waving phallic carrot Taliban enjoy moment of victory as focus shifts to challenges ahead Afghanistans neighbours offered millions in aid to harbour refugees"
headline_text2
## [1] "Biden calls for new era in US foreign policy in defensive speech Biden sets himself apart by placing Afghanistan blame at predecessors feet Wheelchair basketball road cycling badminton begins and more live Father seeking $2m before stepping down as conservator court filing claims Up to half of worlds wild tree species could be at risk of extinction Storm leaves trail of destruction as road collapse raises death toll to four Concern grows for global supply as Vietnam struggles in lockdown Outrage after Ivory Coast TV presenter asks guest to simulate rape Family of US journalist Danny Fenster calls for release after 100 days of detention Brazil football legend in hospital for routine exams Governor pardons seven Black men executed in 1951 for rape of a white woman The art show cocurated by a fiveyearold How New Zealands Maori are reclaiming land with occupations How artificial birds are relaying the secrets of ocean currents The Stranglers on fights drugs and finally growing up Fatah critics death highlights brutality of Palestinian Authority The fight over Jeff Buckleys final recordings Passports will make hesitant people even more reluctant to get jabbed Health authorities warn against mixing Covid vaccine types 75 cases recorded after two days of falls Judge orders hospital to treat Covid patient with ivermectin Afghan athlete evacuated from Kabul belatedly competes Cyclist becomes Summer and Winter Paralympic champion with gold Hicks adds time trial gold to track silver medal Griezmann rejoins Atlético from Barcelona in shock move Raducanu shrugs off nerves to reach second round Afghanistan womens cricketers left feeling abandoned Football legend in hospital in Brazil for routine exams Kane has no regrets over trying to force move from Spurs Dunkley eager to face New Zealand and judge Hundred impact I fear for my family in Kabul but I know the Taliban can be resisted A human is not a horse So why is a livestock drug sweeping America Without a guiding 
purpose Boris Johnson will always be governing in a crisis Its responsibilities dont end here Western economies cant return to business as usual after the pandemic To understand what happens next in Afghanistan look to its neighbours The US supreme court is deciding more and more cases in a secretive shadow docket In Afghanistan Islamic State is seeking to exploit divisions within the Taliban The British citizens left behind in Kabul More from the series Floating wind turbines could open up vast ocean tracts for generation How a hot blob off the countrys coast is contributing to drought in South America For years it was seen as a faroff problem Evin prison guards investigated after abuse video leak Banned BBC journalist says country moving in reverse in final report Skyscraper plans threaten UKs oldest synagogue People can selfidentify as male or female in Scottish census says guidance Hate crimes rise to highest level in 12 years says FBI report Fox News accused of stoking violence after Tucker Carlson revolt prediction Germany warns union against setting target of Afghan refugees Population surpasses 5m for first time since 1851 Cyprus prepares for oil spill from Syrian power plant Astonishing and petrifying How did a Bob Ross documentary become so contentious Princess Diana film debuts as industry aims for return to normality Queen hired as set designer on new Netflix film Walkingandtalking romance never gets anywhere Miss Marple back on the case in stories by Naomi Alderman Ruth Ware and more I hitchhiked 100 miles home from my school for the blind Why a rude carrot can spark sheer joy from ancient Egypt to today 10 ways to approach a sensitive daunting conversation Nocook dinners for summer nights Lockdown has made us fall in love with the sea Can the high heel index predict economic growth How the US created a world of endless war Sergei Kovalev obituary Last man out the haunting image of Americas final moments in Afghanistan US veterans on seeing Afghanistan 
fall to the Taliban US national parks are overcrowded Some think selfie stations will help What evolutionary advantage comes from women having considerably less body hair than men Share your experience of coronavirus What was the moment that changed you How have you been affected by those in southern Europe The Sydney suburbs under curfew street lights and empty spaces The end of Geronimo and the last US soldier in Afghanistan The wild adventures of Fred Baldwin The photography of Hiro Doja Cat fires and festivals Storms rampage through Louisiana People are broken Afghans describe first day under full Taliban control Welsh teen in hospital with Covid targeted online by antivaxxers Vaccine passports will make hesitant people even more reluctant to get jabbed Unstoppable movement how New Zealands Maori are reclaiming land with occupations Israel registers record daily coronavirus cases Wild cockatoos observed using tools as cutlery to extract seeds from tropical fruit New Zealand ministers TV interview interrupted by son waving phallic carrot Taliban enjoy moment of victory as focus shifts to challenges ahead Afghanistans neighbours offered millions in aid to harbour refugees"
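To compare the steps on a smaller scale, here is a sketch on a single made-up string with punctuation and irregular whitespace (all object names are ours).

```r
library(stringr)

messy <- " No-cook dinners,\tfor summer  nights! "

# 1. remove punctuation
no_punct <- str_replace_all(messy, "[:punct:]", "")

# 2a. turn every whitespace character into a space (runs are kept)
spaces_only <- str_replace_all(no_punct, "[:space:]", " ")

# 2b. trim both ends and collapse runs of whitespace
squished <- str_squish(no_punct)

spaces_only
squished
```

The first variant still carries the leading, trailing and doubled spaces, whereas str_squish leaves exactly one space between words.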
This looks far better. However, some words are still capitalised because they used to be the first word of a headline. We can convert strings to lower case using str_to_lower. For functions that capitalise strings or convert them to title case, have a look at the cheat sheet.
headline_text <- str_to_lower(headline_text)
headline_text
## [1] "biden calls for new era in us foreign policy in defensive speech biden sets himself apart by placing afghanistan blame at predecessors feet wheelchair basketball road cycling badminton begins and more live father seeking $2m before stepping down as conservator court filing claims up to half of worlds wild tree species could be at risk of extinction storm leaves trail of destruction as road collapse raises death toll to four concern grows for global supply as vietnam struggles in lockdown outrage after ivory coast tv presenter asks guest to simulate rape family of us journalist danny fenster calls for release after 100 days of detention brazil football legend in hospital for routine exams governor pardons seven black men executed in 1951 for rape of a white woman the art show cocurated by a fiveyearold how new zealands maori are reclaiming land with occupations how artificial birds are relaying the secrets of ocean currents the stranglers on fights drugs and finally growing up fatah critics death highlights brutality of palestinian authority the fight over jeff buckleys final recordings passports will make hesitant people even more reluctant to get jabbed health authorities warn against mixing covid vaccine types 75 cases recorded after two days of falls judge orders hospital to treat covid patient with ivermectin afghan athlete evacuated from kabul belatedly competes cyclist becomes summer and winter paralympic champion with gold hicks adds time trial gold to track silver medal griezmann rejoins atlético from barcelona in shock move raducanu shrugs off nerves to reach second round afghanistan womens cricketers left feeling abandoned football legend in hospital in brazil for routine exams kane has no regrets over trying to force move from spurs dunkley eager to face new zealand and judge hundred impact i fear for my family in kabul but i know the taliban can be resisted a human is not a horse so why is a livestock drug sweeping america without a guiding 
purpose boris johnson will always be governing in a crisis its responsibilities dont end here western economies cant return to business as usual after the pandemic to understand what happens next in afghanistan look to its neighbours the us supreme court is deciding more and more cases in a secretive shadow docket in afghanistan islamic state is seeking to exploit divisions within the taliban the british citizens left behind in kabul more from the series floating wind turbines could open up vast ocean tracts for generation how a hot blob off the countrys coast is contributing to drought in south america for years it was seen as a faroff problem evin prison guards investigated after abuse video leak banned bbc journalist says country moving in reverse in final report skyscraper plans threaten uks oldest synagogue people can selfidentify as male or female in scottish census says guidance hate crimes rise to highest level in 12 years says fbi report fox news accused of stoking violence after tucker carlson revolt prediction germany warns union against setting target of afghan refugees population surpasses 5m for first time since 1851 cyprus prepares for oil spill from syrian power plant astonishing and petrifying how did a bob ross documentary become so contentious princess diana film debuts as industry aims for return to normality queen hired as set designer on new netflix film walkingandtalking romance never gets anywhere miss marple back on the case in stories by naomi alderman ruth ware and more i hitchhiked 100 miles home from my school for the blind why a rude carrot can spark sheer joy from ancient egypt to today 10 ways to approach a sensitive daunting conversation nocook dinners for summer nights lockdown has made us fall in love with the sea can the high heel index predict economic growth how the us created a world of endless war sergei kovalev obituary last man out the haunting image of americas final moments in afghanistan us veterans on seeing afghanistan 
fall to the taliban us national parks are overcrowded some think selfie stations will help what evolutionary advantage comes from women having considerably less body hair than men share your experience of coronavirus what was the moment that changed you how have you been affected by those in southern europe the sydney suburbs under curfew street lights and empty spaces the end of geronimo and the last us soldier in afghanistan the wild adventures of fred baldwin the photography of hiro doja cat fires and festivals storms rampage through louisiana people are broken afghans describe first day under full taliban control welsh teen in hospital with covid targeted online by antivaxxers vaccine passports will make hesitant people even more reluctant to get jabbed unstoppable movement how new zealands maori are reclaiming land with occupations israel registers record daily coronavirus cases wild cockatoos observed using tools as cutlery to extract seeds from tropical fruit new zealand ministers tv interview interrupted by son waving phallic carrot taliban enjoy moment of victory as focus shifts to challenges ahead afghanistans neighbours offered millions in aid to harbour refugees"
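As a quick illustration of the other case-conversion functions, here is a minimal sketch on a made-up headline. The string `example_headline` is a hypothetical example and not part of the scraped data; `str_to_sentence` assumes stringr 1.4.0 or later.

```r
library(stringr)

# Toy headline (hypothetical, not taken from the scraped data)
example_headline <- "biden calls for new era in US foreign policy"

str_to_upper(example_headline)     # every letter in upper case
str_to_lower(example_headline)     # every letter in lower case
str_to_title(example_headline)     # first letter of each word in upper case
str_to_sentence(example_headline)  # only the first letter of the string in upper case
```

All four functions leave the original object untouched, so the result has to be assigned back (as we did with headline_text) if you want to keep it.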
Nice! Now we are ready to analyse the words. Let us say we are interested in finding the most frequent nouns used in the headlines. Following https://r4ds.had.co.nz/strings.html, we could define a noun, imperfectly, as a word preceded by either “the” or “a”. Using this definition, we can use either str_extract_all or str_match_all to extract all the nouns from the text.
noun_function <- "(a|the) ([^ ]+)"
extracted_words <- str_extract_all(headline_text, noun_function)
extracted_words2 <- str_match_all(headline_text, noun_function)
extracted_words
## [[1]]
## [1] "a in" "a white" "the art" "a fiveyearold"
## [5] "the secrets" "the stranglers" "the fight" "a in"
## [9] "the taliban" "a human" "a horse" "a livestock"
## [13] "a without" "a guiding" "a crisis" "the pandemic"
## [17] "the us" "a secretive" "the taliban" "the british"
## [21] "the series" "a hot" "the countrys" "a for"
## [25] "a faroff" "a bob" "a film" "the case"
## [29] "the blind" "a rude" "a sensitive" "the sea"
## [33] "the high" "the us" "a world" "the haunting"
## [37] "the taliban" "the moment" "the sydney" "the end"
## [41] "the last" "the wild" "the photography" "a cat"
## [45] "a people"
extracted_words2
## [[1]]
## [,1] [,2] [,3]
## [1,] "a in" "a" "in"
## [2,] "a white" "a" "white"
## [3,] "the art" "the" "art"
## [4,] "a fiveyearold" "a" "fiveyearold"
## [5,] "the secrets" "the" "secrets"
## [6,] "the stranglers" "the" "stranglers"
## [7,] "the fight" "the" "fight"
## [8,] "a in" "a" "in"
## [9,] "the taliban" "the" "taliban"
## [10,] "a human" "a" "human"
## [11,] "a horse" "a" "horse"
## [12,] "a livestock" "a" "livestock"
## [13,] "a without" "a" "without"
## [14,] "a guiding" "a" "guiding"
## [15,] "a crisis" "a" "crisis"
## [16,] "the pandemic" "the" "pandemic"
## [17,] "the us" "the" "us"
## [18,] "a secretive" "a" "secretive"
## [19,] "the taliban" "the" "taliban"
## [20,] "the british" "the" "british"
## [21,] "the series" "the" "series"
## [22,] "a hot" "a" "hot"
## [23,] "the countrys" "the" "countrys"
## [24,] "a for" "a" "for"
## [25,] "a faroff" "a" "faroff"
## [26,] "a bob" "a" "bob"
## [27,] "a film" "a" "film"
## [28,] "the case" "the" "case"
## [29,] "the blind" "the" "blind"
## [30,] "a rude" "a" "rude"
## [31,] "a sensitive" "a" "sensitive"
## [32,] "the sea" "the" "sea"
## [33,] "the high" "the" "high"
## [34,] "the us" "the" "us"
## [35,] "a world" "a" "world"
## [36,] "the haunting" "the" "haunting"
## [37,] "the taliban" "the" "taliban"
## [38,] "the moment" "the" "moment"
## [39,] "the sydney" "the" "sydney"
## [40,] "the end" "the" "end"
## [41,] "the last" "the" "last"
## [42,] "the wild" "the" "wild"
## [43,] "the photography" "the" "photography"
## [44,] "a cat" "a" "cat"
## [45,] "a people" "a" "people"
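Some of the matches above, such as “a in” or “a without”, are not articles at all: the pattern also matches an “a” at the end of a longer word (for example “era in”). One possible refinement, shown here only as a sketch, is to require a word boundary (`\\b`) before the article. The names `toy_text`, `loose_pattern` and `strict_pattern` are hypothetical and used for illustration only.

```r
library(stringr)

# Toy string (hypothetical) that reproduces the "a in" problem
toy_text <- "new era in the art show by a human"

# Pattern used above: "a" or "the" may sit at the end of a longer word
loose_pattern <- "(a|the) ([^ ]+)"

# Assumed variant: \\b demands a word boundary directly before the article
strict_pattern <- "\\b(a|the) ([^ ]+)"

loose_matches  <- str_extract_all(toy_text, loose_pattern)[[1]]
strict_matches <- str_extract_all(toy_text, strict_pattern)[[1]]

loose_matches   # includes "a in", taken from the end of "era"
strict_matches  # only "the art" and "a human"
```

Even with the boundary, the definition stays imperfect: adjectives such as “hot” or “rude” would still be picked up as nouns.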
The convenience of str_match_all is that it also saves each capture group in its own column: column 1 holds the full match, column 2 the article and column 3 the word that follows it. This way we can save the nouns to a new vector and use it to find the five most frequently used nouns.
nouns <- extracted_words2[[1]][,3]
top5words <- tibble(nouns) %>%
  count(nouns) %>%
  arrange(desc(n)) %>%
  head(5)
top5words
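To make the counting step easier to follow, here is a minimal sketch on a made-up vector. `toy_nouns` and `toy_counts` are hypothetical names, and the values are not taken from the headlines. It uses the `sort = TRUE` argument of `count()`, which gives the same ordering as the separate `arrange(desc(n))` step.

```r
library(dplyr)
library(tibble)

# Toy vector (hypothetical) standing in for the extracted nouns
toy_nouns <- c("taliban", "us", "taliban", "art", "us", "taliban")

toy_counts <- tibble(noun = toy_nouns) %>%
  count(noun, sort = TRUE) %>%  # one row per distinct noun, ordered by descending n
  head(2)                       # keep the two most frequent nouns

toy_counts
```

The result is a tibble with the columns noun and n, with the most frequent noun in the first row.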
Scrape the newspaper headlines from the date of your most recent birthday. What were the most common nouns?
Compare two or more of the following news outlets’ headlines. Which one has the longest average headlines on that day?
A work by Max Eckert & Kai Foerster
Prepared for Intro to Data Science, taught by Simon Munzert