Welcome to our workshop! 👋

Let's do some string manipulation together!

Notes on the practice sheet

Make sure to load the stringr and stringi packages. We will work through different levels of difficulty in string manipulation.
The first exercises consist of straightforward applications of some of the main functions from the stringr and stringi packages.
At the medium level, we will deal with more complex tasks that involve string manipulation.
The advanced exercises involve the use of regular expressions. 😨

Exercises

Beginner exercises 🤓

Exercise 1: Create two strings that say “Today I am gonna learn how to process strings.” and “This will be a lot of fun!” 😊
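
A minimal sketch of one way to set this up (the object names string_1 and string_2 are only suggestions):

string_1 <- "Today I am gonna learn how to process strings."
string_2 <- "This will be a lot of fun!"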

Exercise 2: Please extract the component “Science” out of the string “Data Science”. 💻
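
One possible approach, either by position or by pattern:

library(stringr)

str_sub("Data Science", start = 6)      # characters 6 to the end
str_extract("Data Science", "Science")  # or match the word directly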

Exercise 3: Count the characters in your strings from exercise 1.
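
A one-line sketch, assuming the strings from exercise 1 are stored as string_1 and string_2:

str_length(string_1)
str_length(string_2)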

Exercise 4: Please combine the first and the second string from exercise 1. Be mindful of the whitespace after the first sentence.
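
One way to handle the whitespace is to let str_c() insert the separator for us; storing the result as motto lets us reuse it below:

motto <- str_c(string_1, string_2, sep = " ")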

Exercise 5: For some tasks, we need all characters to be in lowercase - for example, if we want to count specific words and do not want to treat the lowercase and uppercase versions of a word separately. From now on, please use the lowercase version of our motto.
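
A short sketch that keeps the lowercase version for the following exercises:

motto <- str_to_lower(motto)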

Exercise 6: Check whether the word “horror” appears in the concatenated string. Do the same for “fun”. 😎
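
str_detect() is one way to check this:

str_detect(motto, "horror")
str_detect(motto, "fun")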

Exercise 7: Not everyone thinks that string processing is “a lot of” fun. Create a motto for these people (using str_replace).
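
A possible sketch; the replacement wording is just one suggestion:

str_replace(motto, "a lot of", "not much")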

Exercise 8: We want to extract the first word of our motto.
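
Two possible approaches, with word() from stringr or a small regex:

word(motto, 1)
str_extract(motto, "^\\w+")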

Exercise 9: We want to count how many vowels there are in our motto.
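
A sketch using a character class for the vowels (add the uppercase letters if you are not working with the lowercase motto):

str_count(motto, "[aeiou]")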

Exercise 10: Consider the following case - all strings must have a width of 100 characters. Find out how long our string is and then pad our motto to the necessary length.
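
A sketch using str_pad(); padding on the right is an assumption, any side works:

str_length(motto)
str_pad(motto, width = 100, side = "right")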

Medium difficulty exercises 🦊

In this part, we will show how to create a function with stringr and stringi. Creating a function for string processing can be helpful if you want to execute the same manipulation on a lot of different character vectors, for example tweets or sentences in a sentiment analysis 🕊. The rest of the exercises involve the manipulation and analysis of web-scraped data from Wikipedia. 💬

Exercise 11: We want to create a function that counts the words in a sentence, a paragraph or, for example, a tweet. Think about this: Do you want to include punctuation? And if not, how can we make sure those characters are not counted? Make sure to try out the function with our motto from the beginner exercises.
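
A minimal sketch of such a function; stripping punctuation first is just one of several possible choices:

count_words <- function(string) {
  string |>
    str_remove_all("[[:punct:]]") |>   # drop punctuation so it is not counted
    str_squish() |>                    # collapse stray whitespace
    str_count(boundary("word"))        # count word boundaries
}

count_words(motto)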

Webscraping exercises

For the following exercises, we first need some strings we can process. We prepared a short scraping code to get data from a Wikipedia page. Please run the code - then we can start with the string processing.

Webscraper

Exercise 12: We are only interested in any information about the moon that includes numbers - the size, the thickness, anything related to numbers. How can we do that with the stringr package?

Hint: What do we see when we take a look at the paragraph? The string vector also contains the reference numbers ([89], [99]) 😢.
If we want to extract only the sentences that contain numerical information about the moon, we need to get rid of the reference numbers.
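
A sketch of one approach, assuming the scraped sentences are stored in a character vector called moon_sentences (the actual name depends on the scraping code above):

cleaned <- str_remove_all(moon_sentences, "\\[\\d+\\]")  # drop reference markers like [89]
str_subset(cleaned, "\\d")                               # keep only sentences that still contain digits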

Exercise 13: For this exercise, we are interested in the length of the different words in the paragraph to prove that science words are super long 😲!
Hint: The filler words “the”, “a”, and “and” slow down any further operations. Let’s get rid of them before we perform the length analysis. 💁
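
A possible sketch, again assuming the scraped text lives in moon_sentences:

words <- unlist(str_split(str_to_lower(moon_sentences), boundary("word")))
words <- words[!words %in% c("the", "a", "and")]  # drop the filler words first
str_length(words)                                 # length of every remaining word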

Advanced Exercise 💥

Extended Exercise

Imagine you are a new data science intern at an economic research institute. The institute has historical data on names, job titles, and employers. The records are stored in an inconsistent format that looks like sentences. The institute would like to extract the names, job titles, and employers in order to investigate the relationship between terminal degrees, job titles, and employers.

Data on employee names, positions, and employers are saved in the following format:

head(job_strings)
## [1] "Yandel Erdman is employed as a Clinical biochemist at Jakubowski-Jakubowski"    
## [2] "Dr. Garland Zboncak is employed as a Public librarian at Sanford-Sanford"       
## [3] "Sanford-Sanford employs Kelsie Zieme as a Pilot, airline"                       
## [4] "Dwain Nicolas-Considine is employed as a Broadcast presenter at Sanford-Sanford"
## [5] "Wuckert Inc employs Con Koch as a Editor, commissioning"                        
## [6] "Tiera Hauck works as a Seismic interpreter at Olson, Olson and Olson"

Exercise 1: The data is written in three different formats. Please write regular expressions to identify rows that match each format. Make sure your regexes do not match the other formats as well.

Hint: When simulating the data, I used the following formats:

formats <- c("{name} is employed as a {job} at {company}",
             "{company} employs {name} as a {job}",
             "{name} works as a {job} at {company}")

Exercise 2: Now write code that takes a dataframe job_data with the column job_strings and creates a column format to identify which format the row uses.
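
A sketch with dplyr::case_when(), reusing the patterns from the previous exercise:

library(dplyr)

job_data <- job_data |>
  mutate(format = case_when(
    str_detect(job_strings, "is employed as a .+ at ") ~ 1L,
    str_detect(job_strings, " employs .+ as a ")       ~ 2L,
    str_detect(job_strings, " works as a .+ at ")      ~ 3L
  ))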


Exercise 3: Now that we know the format of each line, we can write regular expressions to extract the name from each format. Let’s start with the first format, {name} is employed as a {job} at {company}. How can we extract job from job_strings?
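
One option is str_match(), which returns the full match in the first column and each capture group in the following columns:

# column 2 = job, column 3 = company; rows in another format simply give NA
str_match(job_data$job_strings, "is employed as a (.+) at (.+)$")[, 2]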

Exercise 4: Now write code that does the same thing for all three formats stored in job_data$job_strings.
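
A possible sketch that picks the matching pattern per row, using the format column from exercise 2:

job_data$job <- case_when(
  job_data$format == 1 ~ str_match(job_data$job_strings, "is employed as a (.+) at ")[, 2],
  job_data$format == 2 ~ str_match(job_data$job_strings, " employs .+ as a (.+)$")[, 2],
  job_data$format == 3 ~ str_match(job_data$job_strings, " works as a (.+) at ")[, 2]
)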

Exercise 5: Now extract the names from job_strings.
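
The same idea works for the names; in formats 1 and 3 the name sits at the start of the string:

job_data$name <- case_when(
  job_data$format == 1 ~ str_match(job_data$job_strings, "^(.+) is employed as a ")[, 2],
  job_data$format == 2 ~ str_match(job_data$job_strings, " employs (.+) as a ")[, 2],
  job_data$format == 3 ~ str_match(job_data$job_strings, "^(.+) works as a ")[, 2]
)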

Exercise 6: What regular expression could be used to identify terminal degrees (PhD, ScD, MD, DDS, DVM) in name?

Hint: You can create groups in regular expressions using parentheses and use | as an OR operator.
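
A sketch of such a pattern; the word boundaries keep it from matching inside longer words:

degree_regex <- "\\b(PhD|ScD|MD|DDS|DVM)\\b"
str_detect(job_data$name, degree_regex)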

Exercise 7: Now let’s find the most common jobs held by people with advanced degrees.
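
A possible sketch, assuming job_data now carries the name and job columns from the previous exercises:

job_data |>
  filter(str_detect(name, "\\b(PhD|ScD|MD|DDS|DVM)\\b")) |>  # keep advanced-degree holders
  count(job, sort = TRUE)                                    # most common jobs first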

Miscellaneous Exercises

Exercise 8: Why does this code evaluate to false?

"look" == "lооk"
## [1] FALSE
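
One way to see what is going on: the two strings are most likely homoglyphs, i.e. the second one uses the Cyrillic letter о (U+043E) instead of the Latin o (U+006F), so the glyphs look identical but the code points differ:

latin    <- "look"
cyrillic <- "l\u043E\u043Ek"   # built explicitly with the Cyrillic letter
latin == cyrillic              # FALSE
utf8ToInt(latin)               # 108  111  111  107
utf8ToInt(cyrillic)            # 108 1086 1086  107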

Exercise 9: Look at the canonical regex for validating email addresses and think about when regexes are and are not appropriate. Imagine how complicated this regular expression will become once Unicode support is added to the email address specification.

Exercise 10: Let’s revisit the regular expression for extracting domains from email addresses. How can we rewrite it using stringr?

## [1] "Emails:  person@icloud.com ; person@gmail.com ; person@MacBook"
gsub("\\.[a-zA-Z]{2,}$", "", gsub("^.+@", "", emails[grep("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}", emails)]))
## [1] "icloud" "gmail"

Exercise 11: You are now processing log files that contain personally identifiable information (PII). Your employer wants you to strip out the PII while retaining as much data as possible. Your legal team has determined that email domains and internal identifiers like UserID should be retained, but they want email addresses converted to “user@domain” and credit card numbers replaced with “CREDIT-CARD” (Note: we are not using the Luhn algorithm because it cannot be implemented using regular expressions, so ignore the invalid credit card numbers and pretend any 16-digit number is a credit card). Please write code that takes logs and returns a version without PII.

logs <- c(
"2021-10-28 12:34:56 INFO User j.doe@example.com logged in successfully. UserID: 102938",
"2021-10-28 12:35:12 INFO User j.doe@example.com added item to cart. ItemID: 7890",
"2021-10-28 12:36:32 ERROR Payment failed for j.doe@example.com. Error code: 345",
"2021-10-28 12:40:15 INFO User jane.d@example.net logged in successfully. UserID: 475869",
"2021-10-28 12:41:09 INFO User jane.d@example.net made a purchase. OrderID: 192837 with 4000-6000-2023-1900",
"2021-10-28 12:45:23 INFO User bill.gates@microsoft.com logged in successfully. UserID: 918273",
"2021-10-28 12:46:45 INFO User bill.gates@microsoft.com added item to cart. ItemID: 5647",
"2021-10-28 12:50:00 INFO User elon.musk@spacex.com logged in successfully. UserID: 546372",
"2021-10-28 12:50:12 INFO User elon.musk@spacex.com made a purchase. OrderID: 293847 with 4567-1234-1900-2023",
"2021-10-28 12:52:19 ERROR Payment failed for bill.gates@microsoft.com. Error code: 908",
"2021-10-28 12:54:00 INFO User satya.nadella@microsoft.com logged in successfully. UserID: 192847",
"2021-10-28 12:54:56 INFO User satya.nadella@microsoft.com added item to cart. ItemID: 6574",
"2021-10-28 12:58:10 INFO User tim.cook@apple.com logged in successfully. UserID: 109283",
"2021-10-28 12:59:12 INFO User tim.cook@apple.com made a purchase. OrderID: 546372 with gift card 123124123",
"2021-10-28 13:01:25 ERROR Payment failed for satya.nadella@microsoft.com. Error code: 789",
"2021-10-28 13:03:45 INFO User sundar.pichai@google.com logged in successfully. UserID: 546789",
"2021-10-28 13:05:09 INFO User sundar.pichai@google.com made a purchase. OrderID: 908172 with 4321-9876-2000-1009"
)
# Thank you ChatGPT for generating a first cut of this log data, although I had to manually generate the 16 digit codes and use a jailbreak
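
A minimal sketch under the assumptions above (any 16-digit number, with or without dashes, counts as a credit card; the helper name scrub_pii is just a suggestion):

library(stringr)

scrub_pii <- function(logs) {
  logs |>
    # keep the domain, drop the local part of every email address
    str_replace_all("[a-zA-Z0-9._%+-]+@([a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)+)", "user@\\1") |>
    # replace anything that looks like a 16-digit card number
    str_replace_all("\\b(?:\\d{4}[- ]?){3}\\d{4}\\b", "CREDIT-CARD")
}

scrub_pii(logs)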