The following activity is adapted from work by Joshua C. Fjelstul and Julia Silge and David Robinson.
Here’s what we’ll be starting with to do some text tricks today.
Regular expressions are a way to describe patterns in text. They are
used in many different programming languages to find and manipulate
text. In R, regular expressions are implemented in the
stringr
package.
stringr
packageThe stringr
package has a number of functions that allow
you to manipulate text using regular expressions. The most commonly used
functions are:
str_extract()
extracts the first matchstr_extract_all()
extracts all matchesstr_detect()
detects if a string matches a patternstr_count()
counts the number of matchesstr_locate()
locates the position of the first
matchstr_locate_all()
locates the position of all
matchesstr_replace()
replaces the first matchstr_replace_all()
replaces all matchesstr_split()
splits a string into a vector of
stringsstr_subset()
returns a subset of strings that match a
patternThere are many special symbols that can be used in regular expressions. Here are some of the most commonly used symbols:
For more on regular expressions, see resources like regex101, RegExplain, stringr Cheatsheet.
Today, you’ll walk through an example of how you can use regular expressions to replicate text. I’ll get you started out and then I’ll ask you to try a few on your own.
First, we’ll pull in the Ask A Manager dataset for 2023. You’ll do some basic cleaning with it.
The following code reads in the data from the Ask A Manager Google Sheet. Then it immediately names the columns meaningful things. Finally, it converts all string variables to lowercase.
column_names <- c('timestamp','age','industry','area','jobtitle','jobtitle2',
'annual_salary','income_additional','currency','currency_other',
'income_additional','country','state','city','remote','experience_overall',
'experience_field','education','gender','race')
managers2023 <- read_csv(gsheet::gsheet2text('https://docs.google.com/spreadsheets/d/ 1ioUjhnz6ywSpEbARI-G3RoPyO0NRBqrJnWf-7C_eirs/edit?resourcekey#gid=1854892322'),
col_names = column_names,
skip = 1)
## No encoding supplied: defaulting to UTF-8.
## New names:
## Rows: 17070 Columns: 20
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (18): timestamp, age, industry, area, jobtitle, jobtitle2, currency, cur... dbl
## (2): annual_salary, income_additional...8
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `income_additional` -> `income_additional...8`
## • `income_additional` -> `income_additional...11`
## # A tibble: 6 × 20
## timestamp age industry area jobtitle jobtitle2 annual_salary
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 4/11/2023 11:02:00 35-44 Government & … Engi… Materia… <NA> 125000
## 2 4/11/2023 11:02:07 25-34 Galleries, Li… Gall… Assista… <NA> 71000
## 3 4/11/2023 11:02:12 35-44 Education (Hi… Educ… Directo… <NA> 60000
## 4 4/11/2023 11:02:15 25-34 Education (Hi… Gove… Adminis… <NA> 42000
## 5 4/11/2023 11:02:25 18-24 Accounting, B… Admi… Executi… <NA> 65000
## 6 4/11/2023 11:02:29 25-34 Government & … Law Counsel <NA> 88000
## # ℹ 13 more variables: income_additional...8 <dbl>, currency <chr>,
## # currency_other <chr>, income_additional...11 <chr>, country <chr>,
## # state <chr>, city <chr>, remote <chr>, experience_overall <chr>,
## # experience_field <chr>, education <chr>, gender <chr>, race <chr>
Let’s homogenize the text to make better sense of it. That means we want to lower case things and remove special characters.
If you look at the data, you likely see many special characters and different cases. Sometimes these are useful, sometimes they are not. For example, if you want to count the number of times a word appears in a text, you don’t want to count “word” and “word.” as two different words. Alternatively, to pick out percentages, you’ll want to keep percent.
The following code uses the tolower
and
str_replace_all()
functions to replace all special
characters other than spaces with nothing. In
str_replace_all()
, the first argument is the string you
want to replace. The second argument is the pattern you want to replace.
The third argument is what you want to replace the pattern with. In this
case, we want to replace the pattern with nothing.
Below I remove all of the non-alphanumeric and space characters from
the jobtitle
variable. Try adding a similar command to
remove all of the non-alphanumeric and space characters other than the
percent sign from the income_additional
variable.
Now, let’s pick out the percentages from the
income_additional
variable. We’ll use the
str_extract_all()
function to extract all of the
percentages from the income_additional
variable. The
pattern we want to match is any number followed by a percent sign. The
pattern for any number is ‘\d+’. The pattern for a percent sign is ‘\%’.
The pattern for any number followed by a percent sign is ‘\d+\%’.
managers2023 %>% select(income_additional) %>%
filter(!is.na(income_additional)) %>%
mutate(add_percentage=str_extract(income_additional, '')) %>%
head(5)
Are there other types of numbers in the
income_additional
variable that are not percentages? How
can you find them? For example, someone may mention that if they exceed
X hours they get a bonus. That will require a different pattern to flag
and you want to save it as a separate column if you want to use it
later. Why? You don’t want to accidentally count the number of hours
someone works as a percentage of their income in a calculation of their
bonus pay.
The industry variable is a mess. There are many different ways to describe the same industry. For example, “software” and “software development” are the same industry. This is because of free-form text entry. That’s neat, but it makes it hard to analyze the data. Let’s use the North American Industry Classification System (NAICS) data that the Census and BLS uses.
Naics follows a 2-6 digit system. The first two digits are the sector. The first three digits are the subsector. The first four digits are the industry group. The first five digits are the NAICS industry. The first six digits are the national industry.
For example, 54 is professional, scientific, and technical services. 541 is professional, scientific, and technical services. 5413 is architectural, engineering, and related services. 54133 is engineering services. 541330 is engineering services.
## Rows: 2678 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): industry_code, industry_title
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s use the fuzzyjoin
package to match the industry
variable to the NAICS data. The fuzzyjoin
package allows
you to match strings that are not exactly the same. This is useful when
you have free-form text entry.
The fuzzyjoin
package has a number of functions that
allow you to match strings. There is a great vignette explaining
how it works. The most commonly used functions are:
stringdist_left_join()
joins two data frames based on a
string distancestringdist_inner_join()
joins two data frames based on
a string distancestringdist_semi_join()
returns all rows from the first
data frame that have a match in the second data framestringdist_anti_join()
returns all rows from the first
data frame that do not have a match in the second data framestringdist_full_join()
joins two data frames based on a
string distanceFirst we’ll need to prepare the NAICS data a little bit. We’ll need
to make the industries lowercase to match the managers data. You’ll also
need to change the str_replace_all()
function to remove all
the leading numeric characters and the word “naics” from the
industry_title
variable, which I rename to
industry
.
There may be other changes you want to include like removal of
special characters OR homogenizing of special characters. For example,
use str_replace_all()
to replace all the “&” in the
NAICS data with “and.”
Notably, if a NAICS code has no further subdivisions, the name is repeated and a zero added to the end of the code. For example, 11116 is Rice farming, which is as granular as that industry gets. 111160 is also Rice farming to make sure all code hierarchies terminate at six digits. To work around that, we’ll just remove distinct duplicates.
I’m also removing the “NAICS” for 10, which is “Total, all industries” as that is a bit silly to match on.
Now we’ll use the stringdist_left_join()
function to
match the industry
variable in the
managers2023
data to the industry
variable in
the naics_prep
data. The
stringdist_left_join()
function takes four arguments. The
first argument is the data frame you want to add the new variable to.
The second argument is the data frame you want to match to. The third
argument is the variable you want to match on in the first data frame.
The fourth argument is the variable you want to match on in the second
data frame. The max_dist
argument is the maximum distance
between the two strings. The distance_col
argument is the
name of the column that will be added to the first data frame that
contains the distance between the two strings. The method
argument is the method used to calculate the distance between the two
strings. The ignore_case
argument is whether or not to
ignore case when calculating the distance between the two strings. The
full_join
argument is whether or not to do a full join. The
max_dist
argument is the maximum distance between the two
strings. The distance_col
argument is the name of the
column that will be added to the first data frame that contains the
distance between the two strings. The method
argument is
the method used to calculate the distance between the two strings. The
ignore_case
argument is whether or not to ignore case when
calculating the distance between the two strings.
managers2023_naics <- head(managers2023_no_special,25) %>%
stringdist_left_join(naics_prep, by=c('industry'='industry'), max_dist=.4, distance_col='distance', method='jw', ignore_case=TRUE)
Check the data you got back. Notice any issues? For one, I got 4,972 rows back from 25 rows of data. That’s a lot of rows. Why? Because there are many matches per row of data. For example, “government & public administration” matched to “natural resources and mining” with a 0.354 distance. That’s not a good match, but it was permitted.
What will we get out of a better match? Instead of trying to match the groups of industries in the raw data, we’ll have them immediately mapped to one key industry – the NAICS industry. That will make it easier to link to other datasets.
Of course, in the end it may be easier to do this by hand for the
largest industries and then use the fuzzyjoin
package to
match the rest or even just drop the rest.
Let’s make a wordcloud of the income_additional variable. I’m curious what language is used to described unusual compensation schemes.
We’ll need to start by tokenizing the text. Tokenization is the process of breaking a string into tokens. A token is a sequence of characters that represents a unit of meaning. For example, a word is a token in a sentence. Tokenization is useful because it breaks a string into meaningful units that can be analyzed.
Navigate to the slides from today’s lecture and augment the code below to make a wordcloud of the income_additional variable.
tokens <- managers2023 %>%
select(jobtitle) %>%
unnest_tokens(word,jobtitle) %>%
count(word,sort=T)
tokens
## # A tibble: 2,384 × 2
## word n
## <chr> <int>
## 1 manager 3483
## 2 senior 1924
## 3 director 1856
## 4 engineer 1088
## 5 of 976
## 6 assistant 945
## 7 analyst 916
## 8 specialist 852
## 9 associate 800
## 10 coordinator 662
## # ℹ 2,374 more rows
There’s probably a lot of words in the tokens data frame that are not useful. For example, “the” and “a” are not useful words. These are called stop words. Stop words are words that are so common that they are not useful for analysis.
The tidytext
package has a built-in list of stop words.
You can access it using the stop_words
function. Use the
anti_join
function to remove the stop words from the tokens
data frame.
Last, let’s make a wordcloud! I use pal brewer.pal()
to
make it more pleasing to the eye.
Can you make the same thing with 2-word ngrams?