Let's do some string manipulation together!
Make sure to load the stringr and stringi packages. We will go through different levels of difficulty in string manipulation.
The first exercises consist of straightforward applications of some of the main functions of the stringr and stringi packages.
In the medium level, we will deal with more complex tasks that involve string manipulation.
The advanced exercises include the use of regular expressions. 😨
Exercise 1: Create two string vectors that say “Today I am gonna learn how to process strings.” & “This will be a lot of fun!” 😊
Exercise 2: Please extract the component “Science” out of the string “Data Science”. 💻
Exercise 3: Count the characters in your strings from exercise 1.
Exercise 4: Please add the first and the second string from exercise 1 together. Be mindful of the whitespace after the first sentence.
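One possible approach to exercises 1 to 4 is sketched below. The variable names (`first`, `second`, `n_chars`, `motto`) are our own choice, and other stringr functions would work just as well.

```r
library(stringr)

# Exercise 1: two character vectors, each holding one sentence
first  <- "Today I am gonna learn how to process strings."
second <- "This will be a lot of fun!"

# Exercise 2: "Science" occupies characters 6 to 12 of "Data Science"
science <- str_sub("Data Science", 6, 12)

# Exercise 3: number of characters in each string
n_chars <- str_length(c(first, second))

# Exercise 4: join both sentences, separated by a single space
motto <- str_c(first, second, sep = " ")
```

Using `sep = " "` puts the whitespace between the two sentences without having to edit either of them.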
Exercise 5: For some tasks, we need all the characters to be in lowercase, for example, if we want to count specific words without counting the lowercase and the uppercase version of a word separately. From now on, please use the lowercase version of our motto.
Exercise 6: Check whether the word “horror” is in the concatenated string. Do the same for “fun”. 😎
Exercise 7: Not everyone thinks that string processing is “a lot of” fun. Create a motto for these people (using str_replace).
Exercise 8: We want to extract the first word of our motto.
Exercise 9: We want to count how many vowels there are in our motto.
Exercise 10: Consider the following case - all strings must have a width of 100 characters. Find out how long our string is and then pad our motto to the necessary length.
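A minimal sketch for exercises 5 to 10, assuming the motto is the concatenated string from exercise 4. The replacement text "no" in exercise 7 and the choice to pad on the right in exercise 10 are just illustrations.

```r
library(stringr)

# Exercise 5: lowercase version of the motto
motto <- str_to_lower(
  "Today I am gonna learn how to process strings. This will be a lot of fun!"
)

# Exercise 6: does the motto contain a given word?
has_horror <- str_detect(motto, "horror")
has_fun    <- str_detect(motto, "fun")

# Exercise 7: replace the first occurrence of "a lot of"
grumpy_motto <- str_replace(motto, "a lot of", "no")

# Exercise 8: first whitespace-separated word
first_word <- word(motto, 1)

# Exercise 9: count lowercase vowels (the motto is already lowercase)
n_vowels <- str_count(motto, "[aeiou]")

# Exercise 10: current width, then pad with spaces up to 100 characters
current_width <- str_length(motto)
padded <- str_pad(motto, width = 100, side = "right")
```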
In this part, we will show how to create a function with stringr and stringi. Creating a function for string processing can be helpful if you want to execute the same manipulation on a lot of different character vectors, for example tweets or sentences in a sentiment analysis. 🕊 The rest of the exercises cover the manipulation and analysis of web-scraped data from Wikipedia. 💬
Exercise 11: We want to create a function that counts the words in a sentence, a paragraph or, for example, a tweet. Think about this: Do you want to include punctuation? And if not, how can we make sure these signs are not counted? Make sure to try out the function with our motto from the beginner exercises.
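One way to approach exercise 11 is sketched below. The function name `count_words` is hypothetical. It assumes that punctuation should not count and removes it before counting, so a standalone dash is not mistaken for a word.

```r
library(stringr)

# Count words in each element of a character vector.
# Punctuation is removed first so that free-standing signs such as "-"
# do not count as words; a word is then any run of non-whitespace characters.
count_words <- function(text) {
  cleaned <- str_remove_all(text, "[[:punct:]]")
  str_count(cleaned, "\\S+")
}

motto <- "Today I am gonna learn how to process strings. This will be a lot of fun!"
count_words(motto)
```

Removing punctuation (rather than replacing it with a space) keeps contractions such as "don't" as one word. Whether that is the right call depends on the analysis.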
For the following exercises, we first need some strings we can process. We prepared a short scraping code to get data from a Wikipedia page. Please run the code - then we can start with the string processing.
Webscraper
Exercise 12: We are only interested in any information about the moon that includes numbers - the size, the thickness, anything related to numbers. How can we do that with the stringr package?
Hint: What do we see when we take a look at the
paragraph? The string vector also contains the reference numbers ([89],
[99]). 😢.
If we want to extract only the sentences that contain numerical
information about the moon, we need to get rid of the reference
numbers.
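The idea from the hint could be sketched as follows. Because the scraped paragraph is not reproduced here, `moon_text` is a short hand-written stand-in, and the sentence splitting on ".", "!" or "?" followed by whitespace is a simplification that will break on abbreviations.

```r
library(stringr)

# Hand-written stand-in for the scraped paragraph (not the real scraper output)
moon_text <- "The Moon is Earth's only natural satellite.[89] Its diameter is about 3,474 km.[99] It has no substantial atmosphere."

# Step 1: drop reference markers such as [89] so they do not count as numbers
no_refs <- str_remove_all(moon_text, "\\[\\d+\\]")

# Step 2: split into sentences at whitespace that follows ., ! or ?
sentences <- str_split(no_refs, "(?<=[.!?])\\s+")[[1]]

# Step 3: keep only the sentences that still contain a digit
with_numbers <- str_subset(sentences, "\\d")
```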
Exercise 13: For this exercise, we are interested in the length of the different words in the paragraph to prove that science words are super long 😲!
Hint: The filler words “the, a, and” slow down any further operations. Let’s get rid of them before we perform the length analysis. 💁
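A possible sketch of the word-length analysis, again on a hand-written stand-in sentence (`sample_text`) rather than the scraped paragraph. The names `filler_words` and `word_lengths` are our own.

```r
library(stringr)

# Hand-written stand-in for the scraped paragraph
sample_text <- "The Moon is a differentiated body and the crust is thick."

# Lowercase, strip punctuation, and split into words
words <- str_split(
  str_remove_all(str_to_lower(sample_text), "[[:punct:]]"),
  "\\s+"
)[[1]]

# Drop the filler words before measuring lengths
filler_words <- c("the", "a", "and")
words <- words[!words %in% filler_words]

# Length of every remaining word, longest first
word_lengths <- str_length(words)
names(word_lengths) <- words
sorted_lengths <- sort(word_lengths, decreasing = TRUE)
```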
Imagine you are a new data science intern at an economic research institute. The institute has historical data on names, job titles, and employers. The records are stored in an inconsistent format that looks like sentences. The institute would like to extract the names, job titles, and employers in order to investigate the relationship between terminal degrees, job titles, and employers.
Data on employee names, positions, and employers are saved in the following format:
## [1] "Yandel Erdman is employed as a Clinical biochemist at Jakubowski-Jakubowski"
## [2] "Dr. Garland Zboncak is employed as a Public librarian at Sanford-Sanford"
## [3] "Sanford-Sanford employs Kelsie Zieme as a Pilot, airline"
## [4] "Dwain Nicolas-Considine is employed as a Broadcast presenter at Sanford-Sanford"
## [5] "Wuckert Inc employs Con Koch as a Editor, commissioning"
## [6] "Tiera Hauck works as a Seismic interpreter at Olson, Olson and Olson"
Exercise 1: There are three formats that the data is written in. Please write regular expressions to identify rows that match each format. Make sure your regexes do not match other formats as well.
Hint: When simulating the data, I used the following formats:
formats <- c("{name} is employed as a {job} at {company}",
"{company} employs {name} as a {job}",
"{name} works as a {job} at {company}")
Exercise 2: Now write code that takes a dataframe job_data with the column job_strings and creates a column format to identify which format the row uses.
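Exercises 1 and 2 could be approached along these lines. The three example rows are taken from the sample output above. The format labels (`employed_as`, `employs`, `works_as`) are hypothetical names, and the patterns assume that names and companies never contain the connecting phrases themselves.

```r
library(stringr)

job_data <- data.frame(
  job_strings = c(
    "Yandel Erdman is employed as a Clinical biochemist at Jakubowski-Jakubowski",
    "Sanford-Sanford employs Kelsie Zieme as a Pilot, airline",
    "Tiera Hauck works as a Seismic interpreter at Olson, Olson and Olson"
  ),
  stringsAsFactors = FALSE
)

# One anchored pattern per format
format_patterns <- c(
  employed_as = "^.+ is employed as a .+ at .+$",
  employs     = "^.+ employs .+ as a .+$",
  works_as    = "^.+ works as a .+ at .+$"
)

# Exercise 1 check: each row should match exactly one pattern
matches <- sapply(format_patterns,
                  function(p) str_detect(job_data$job_strings, p))

# Exercise 2: label each row with the format it matches (NA if none)
job_data$format <- NA_character_
for (fmt in names(format_patterns)) {
  hit <- str_detect(job_data$job_strings, format_patterns[[fmt]])
  job_data$format[hit] <- fmt
}
```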
Exercise 3: Now that we know the format of each line, we can write regular expressions to extract the job title from each format. Let’s start with the first format, {name} is employed as a {job} at {company}. How can we extract job from job_strings?
Exercise 4: Now write code that does the same thing for all three formats stored in job_data$job_strings.
Exercise 5: Now extract the names from job_strings.
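For exercises 3 to 5, capture groups with `str_match()` are one option. The helper `parse_job_string` below is a hypothetical name, and the sketch assumes the connecting phrases (" is employed as a ", " employs ", " works as a ", " at ") do not occur inside names, jobs, or companies.

```r
library(stringr)

job_strings <- c(
  "Yandel Erdman is employed as a Clinical biochemist at Jakubowski-Jakubowski",
  "Sanford-Sanford employs Kelsie Zieme as a Pilot, airline",
  "Tiera Hauck works as a Seismic interpreter at Olson, Olson and Olson"
)

# Parse one string by trying each format in turn.
# str_match() returns the full match in column 1 and the groups after it.
parse_job_string <- function(x) {
  m <- str_match(x, "^(.+) is employed as a (.+) at (.+)$")
  if (!is.na(m[1, 1])) {
    return(c(name = m[1, 2], job = m[1, 3], company = m[1, 4]))
  }
  m <- str_match(x, "^(.+) employs (.+) as a (.+)$")
  if (!is.na(m[1, 1])) {
    return(c(name = m[1, 3], job = m[1, 4], company = m[1, 2]))
  }
  m <- str_match(x, "^(.+) works as a (.+) at (.+)$")
  if (!is.na(m[1, 1])) {
    return(c(name = m[1, 2], job = m[1, 3], company = m[1, 4]))
  }
  c(name = NA_character_, job = NA_character_, company = NA_character_)
}

# One row per input string, with columns name, job, company
parsed <- t(sapply(job_strings, parse_job_string, USE.NAMES = FALSE))
```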
Exercise 6: What regular expression could be used to identify terminal degrees (PhD, ScD, MD, DDS, DVM) in name?
Hint: You can create groups in regular expressions using parentheses and use | as an OR operator.
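The hint could translate into something like the sketch below. How degrees actually appear in the simulated names is not shown here, so the example vector `name` (a degree written as a suffix) is an assumption made purely for illustration.

```r
library(stringr)

# Hypothetical names; the real data may format degrees differently
name <- c("Garland Zboncak PhD", "Kelsie Zieme", "Tiera Hauck DVM")

# Group the alternatives with parentheses; \\b keeps "MD" from matching
# inside a longer word
degree_pattern <- "\\b(PhD|ScD|MD|DDS|DVM)\\b"

has_degree <- str_detect(name, degree_pattern)
degree     <- str_extract(name, degree_pattern)
```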
Exercise 7: Now let’s find the most common jobs held by people with advanced degrees.
Exercise 8: Why does this code evaluate to false?
## [1] FALSE
Exercise 9: Look at the canonical regex for validating email addresses and think about when regexes are and are not appropriate. Imagine how complicated this regular expression will become once unicode support is added to the email address specification.
Exercise 10: Let’s revisit the regular expression for extracting domains from email addresses. How can we rewrite it using stringr?
## [1] "Emails: person@icloud.com ; person@gmail.com ; person@MacBook"
gsub("\\.[a-zA-Z]{2,}$", "", gsub("^.+@", "", emails[grep("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}", emails)]))
## [1] "icloud" "gmail"
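One possible stringr version uses a capture group instead of nested `gsub()` calls. This sketch assumes `emails` is the single string printed above. If it is a vector with one address per element, `str_match()` would be used in place of `str_match_all()`.

```r
library(stringr)

emails <- "Emails: person@icloud.com ; person@gmail.com ; person@MacBook"

# Same character classes as the base R version; the parentheses capture
# everything between "@" and the final ".tld"
email_pattern <- "[a-zA-Z0-9._%+-]+@([a-zA-Z0-9.-]+)\\.[a-zA-Z]{2,}"

# str_match_all() returns one matrix per input string:
# column 1 is the full match, column 2 the captured domain
domains <- str_match_all(emails, email_pattern)[[1]][, 2]
```

"person@MacBook" is skipped because it has no top-level domain, which matches the behaviour of the original `grep()` filter.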
Exercise 11: You are now processing log files that contain personally identifiable information (PII). Your employer wants you to strip out the PII while retaining as much data as possible. Your legal team has determined that email domains and internal identifiers like UserID should be retained, but they want email addresses converted to “user@domain” and credit card numbers replaced with “CREDIT-CARD”. (Note: we are not using the Luhn algorithm because it cannot be implemented using regular expressions, so ignore the invalid credit card numbers and pretend any 16-digit number is a credit card.) Please write code that takes logs and returns a version without PII.
logs <- c(
"2021-10-28 12:34:56 INFO User j.doe@example.com logged in successfully. UserID: 102938",
"2021-10-28 12:35:12 INFO User j.doe@example.com added item to cart. ItemID: 7890",
"2021-10-28 12:36:32 ERROR Payment failed for j.doe@example.com. Error code: 345",
"2021-10-28 12:40:15 INFO User jane.d@example.net logged in successfully. UserID: 475869",
"2021-10-28 12:41:09 INFO User jane.d@example.net made a purchase. OrderID: 192837 with 4000-6000-2023-1900",
"2021-10-28 12:45:23 INFO User bill.gates@microsoft.com logged in successfully. UserID: 918273",
"2021-10-28 12:46:45 INFO User bill.gates@microsoft.com added item to cart. ItemID: 5647",
"2021-10-28 12:50:00 INFO User elon.musk@spacex.com logged in successfully. UserID: 546372",
"2021-10-28 12:50:12 INFO User elon.musk@spacex.com made a purchase. OrderID: 293847 with 4567-1234-1900-2023",
"2021-10-28 12:52:19 ERROR Payment failed for bill.gates@microsoft.com. Error code: 908",
"2021-10-28 12:54:00 INFO User satya.nadella@microsoft.com logged in successfully. UserID: 192847",
"2021-10-28 12:54:56 INFO User satya.nadella@microsoft.com added item to cart. ItemID: 6574",
"2021-10-28 12:58:10 INFO User tim.cook@apple.com logged in successfully. UserID: 109283",
"2021-10-28 12:59:12 INFO User tim.cook@apple.com made a purchase. OrderID: 546372 with gift card 123124123",
"2021-10-28 13:01:25 ERROR Payment failed for satya.nadella@microsoft.com. Error code: 789",
"2021-10-28 13:03:45 INFO User sundar.pichai@google.com logged in successfully. UserID: 546789",
"2021-10-28 13:05:09 INFO User sundar.pichai@google.com made a purchase. OrderID: 908172 with 4321-9876-2000-1009"
)
# Thank you ChatGPT for generating a first cut of this log data, although I had to manually generate the 16 digit codes and use a jailbreak
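A minimal sketch of one way to scrub the logs. The function name `scrub_pii` is hypothetical. It interprets “user@domain” as keeping the real domain and replacing only the local part, and it treats any 16 digits in groups of four (optionally separated by a dash or space) as a credit card number.

```r
library(stringr)

scrub_pii <- function(x) {
  # Replace the local part of every email address, keep the domain (\\1)
  x <- str_replace_all(
    x,
    "[A-Za-z0-9._%+-]+@([A-Za-z0-9.-]+\\.[A-Za-z]{2,})",
    "user@\\1"
  )
  # Replace 16-digit numbers written as four groups of four digits
  str_replace_all(x, "\\b(?:\\d{4}[- ]?){3}\\d{4}\\b", "CREDIT-CARD")
}

# Three of the log lines from above
sample_logs <- c(
  "2021-10-28 12:36:32 ERROR Payment failed for j.doe@example.com. Error code: 345",
  "2021-10-28 12:41:09 INFO User jane.d@example.net made a purchase. OrderID: 192837 with 4000-6000-2023-1900",
  "2021-10-28 12:59:12 INFO User tim.cook@apple.com made a purchase. OrderID: 546372 with gift card 123124123"
)

clean_logs <- scrub_pii(sample_logs)
```

Timestamps, UserID, OrderID, and the nine-digit gift card number are left untouched because none of them form four groups of four digits.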