QTM 151 - Introduction to Statistical Computing II

Lecture 24 - Text Data

Danilo Freire

danilo.freire@emory.edu

Emory University

Hello, everyone! 😊

Recap 🤓

Time series data

Last time, we…

Dived a bit deeper into time series data
Saw how to plot (multiple) trends in the same plot
Learned how to calculate growth rates and normalise data in different ways
Learned how to use the diff() function to calculate differences between consecutive observations and the shift() function to shift the data
Saw how to use query() to filter data

Today’s plan 📚

Text data

Today we will work with text data in Python
Text data is as popular as ever, thanks to social media, LLMs, and the like
We use text data to estimate sentiment, classify documents, estimate the ideal points of politicians, and much more
Text is usually messy and unstructured, so we need to clean it before we can use it
We will learn how to clean text data and how to use it for analysis

We will use some pandas functions today
But there are several libraries that are worth checking out too, such as nltk, spaCy, and gensim
We will also introduce the concept of regular expressions, which allow you to search for patterns in text

Some interesting applications of text analysis

Ideal point estimation

Barberá, P. (2015). Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data. Political Analysis, 23(1), 76-91.

Word frequency as a proxy for importance

Rozado, D., Al-Gharbi, M., & Halberstadt, J. (2023). Prevalence of prejudice-denoting words in news media discourse: A chronological analysis. Social Science Computer Review, 41(1), 99-122.

Sentiment analysis

Robinson, D. Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half. Variance Explained.

What gets censored in China?

King, G., Pan, J., & Roberts, M. E. (2013). How censorship in China allows government criticism but silences collective expression. American Political Science Review, 107(2), 326-343.

Text data 📜

Text data

Import the necessary libraries and data

As usual, let’s start by importing the necessary libraries and the data we will use today
The data are about congressional bills in the US

import pandas as pd

# Load the data
bills_actions = pd.read_csv("data_raw/bills_actions.csv")
bills_actions.dtypes

Congress        int64
bill_number     int64
bill_type      object
action         object
main_action    object
category       object
member_id       int64
dtype: object

Let’s have a quick look at the data

bills_actions.head()

	Congress	bill_number	bill_type	action	main_action	category	member_id
0	116	1029	s	S.Amdt.1274 Amendment SA 1274 proposed by Sena...	senate amendment proposed (on the floor)	amendment	858
1	116	1031	s	S.Amdt.2698 Amendment SA 2698 proposed by Sena...	senate amendment proposed (on the floor)	amendment	675
2	116	1160	s	S.Amdt.2659 Amendment SA 2659 proposed by Sena...	senate amendment proposed (on the floor)	amendment	858
3	116	1199	s	Committee on Health, Education, Labor, and Pen...	senate committee/subcommittee actions	senate bill	1561
4	116	1208	s	Committee on the Judiciary. Reported by Senato...	senate committee/subcommittee actions	senate bill	1580

Basic text operations

Counting categories

A simple way to start working with text data is to count how many distinct values a text column has and how often each one appears
Let’s see how many categories we have in the category column

bills_actions["category"].nunique()

We can also use value_counts() to see the frequency of each category

bills_actions["category"].value_counts()

category
amendment                       1529
house bill                       902
senate bill                      514
house resolution                 234
senate resolution                 60
house joint resolution            22
house concurrent resolution       20
senate concurrent resolution      14
senate joint resolution            8
Name: count, dtype: int64
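
To see shares instead of raw counts, value_counts() also accepts normalize=True. Here is a minimal sketch on a small made-up Series (toy_category and its values are hypothetical, not taken from bills_actions):

```python
import pandas as pd

# Hypothetical toy data standing in for the "category" column
toy_category = pd.Series(
    ["amendment", "house bill", "amendment", "senate bill", "amendment"]
)

# Number of distinct categories
n_categories = toy_category.nunique()

# Raw counts and proportions for each category
counts = toy_category.value_counts()
shares = toy_category.value_counts(normalize=True)

print(n_categories)
print(counts)
print(shares)
```

The proportions add up to 1, which makes categories easier to compare across datasets of different sizes.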

Subset text categories

Here we are only interested in bills. So let’s use query() to subset the data
We select all entries whose value in the category column is contained in list_categories
- in tests whether a value belongs to a list
- @ references a Python variable defined outside the query string (here, list_categories)

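list_categories = ["house bill","senate bill"]
# .copy() gives an independent DataFrame, so adding columns later does not raise a SettingWithCopyWarning
bills = bills_actions.query('category in @list_categories').copy()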

# Verify that the code worked:
bills["category"].value_counts()

category
house bill     902
senate bill    514
Name: count, dtype: int64
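
The same subset can also be written as a boolean mask with isin(). The sketch below compares both approaches on a small made-up DataFrame (toy and keep are hypothetical names used only for illustration):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "category": ["house bill", "amendment", "senate bill", "house resolution"],
    "bill_number": [1, 2, 3, 4],
})

keep = ["house bill", "senate bill"]

# query() with @ to reference the Python variable `keep`
via_query = toy.query("category in @keep")

# Equivalent boolean mask with isin()
via_isin = toy[toy["category"].isin(keep)]

# Both approaches select the same rows
same_rows = via_query.equals(via_isin)
print(same_rows)
```

Which one to use is mostly a matter of taste: query() reads closer to plain English, while isin() is easier to combine with other boolean masks.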

Data manipulation with sentences

The str attribute allows us to access string methods, and there are a lot of them
Here we will use str.contains() to check if a sentence contains a specific word
Let’s see how many bills contain the word Senator (the match is case-sensitive by default)

# Check if the action contains the word "Senator"
bool_contains = bills["action"].str.contains("Senator")

# Check the result
bool_contains.head()

3    True
4    True
5    True
6    True
7    True
Name: action, dtype: bool

# Calculate the proportion 
bool_contains.mean()

0.3199152542372881

Let’s double-check the result

bills[bills["action"].str.contains("Senator")]

	Congress	bill_number	bill_type	action	main_action	category	member_id
3	116	1199	s	Committee on Health, Education, Labor, and Pen...	senate committee/subcommittee actions	senate bill	1561
4	116	1208	s	Committee on the Judiciary. Reported by Senato...	senate committee/subcommittee actions	senate bill	1580
5	116	1231	s	Committee on the Judiciary. Reported by Senato...	senate committee/subcommittee actions	senate bill	1580
6	116	1228	s	Committee on Commerce, Science, and Transporta...	senate committee/subcommittee actions	senate bill	1002
7	116	123	s	Committee on Veterans' Affairs. Reported by Se...	senate committee/subcommittee actions	senate bill	1490
...	...	...	...	...	...	...	...
2944	116	617	hr	Committee on Energy and Natural Resources. Rep...	senate committee/subcommittee actions	house bill	1581
3081	116	762	hr	Committee on Energy and Natural Resources. Rep...	senate committee/subcommittee actions	house bill	1581
3142	116	828	hr	Committee on Homeland Security and Governmenta...	senate committee/subcommittee actions	house bill	1701
3150	116	829	hr	Committee on Homeland Security and Governmenta...	senate committee/subcommittee actions	house bill	1701
3195	116	887	hr	Committee on Homeland Security and Governmenta...	senate committee/subcommittee actions	house bill	1701

453 rows × 7 columns
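
Two optional arguments of str.contains() are handy with messy text: case=False ignores capitalisation, and na=False treats missing values as non-matches. A minimal sketch on made-up sentences (toy_action is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with mixed case and a missing value
toy_action = pd.Series([
    "Reported by Senator Graham",
    "Remarks by a senator from Ohio",
    "Motion to reconsider",
    None,
])

# Case-sensitive (default): only "Senator" with a capital S matches
strict = toy_action.str.contains("Senator", na=False)

# Case-insensitive: "senator" in lower case matches as well
loose = toy_action.str.contains("senator", case=False, na=False)

print(strict.mean())
print(loose.mean())
```

Without na=False, the missing row would come back as a missing value rather than False, which can cause trouble when the result is used as a filter.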

Replacing text

We can also use str.replace() to replace text
Let’s replace the word Senator with Sen. in the action column

bills["action_custom"] = bills["action"].str.replace("Senator","Sen.")
bills[["action","action_custom"]].head(10)

	action	action_custom
3	Committee on Health, Education, Labor, and Pen...	Committee on Health, Education, Labor, and Pen...
4	Committee on the Judiciary. Reported by Senato...	Committee on the Judiciary. Reported by Sen. G...
5	Committee on the Judiciary. Reported by Senato...	Committee on the Judiciary. Reported by Sen. G...
6	Committee on Commerce, Science, and Transporta...	Committee on Commerce, Science, and Transporta...
7	Committee on Veterans' Affairs. Reported by Se...	Committee on Veterans' Affairs. Reported by Se...
9	Committee on Homeland Security and Governmenta...	Committee on Homeland Security and Governmenta...
10	Committee on Homeland Security and Governmenta...	Committee on Homeland Security and Governmenta...
12	Committee on Foreign Relations. Reported by Se...	Committee on Foreign Relations. Reported by Se...
13	Committee on the Judiciary. Reported by Senato...	Committee on the Judiciary. Reported by Sen. G...
15	Committee on Foreign Relations. Reported by Se...	Committee on Foreign Relations. Reported by Se...
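
By default str.replace() changes every occurrence in each row. The sketch below, on made-up sentences (toy_action is hypothetical), uses regex=False to make the literal match explicit and n=1 to replace only the first occurrence:

```python
import pandas as pd

# Hypothetical toy data
toy_action = pd.Series([
    "Reported by Senator Graham",
    "Senator Moran and Senator Wicker",
])

# Literal replacement of every occurrence of "Senator"
literal = toy_action.str.replace("Senator", "Sen.", regex=False)

# Replace only the first occurrence in each row
first_only = toy_action.str.replace("Senator", "Sen.", n=1, regex=False)

print(literal.tolist())
print(first_only.tolist())
```

Passing regex=False is a safe habit when the text to find contains characters such as dots, which have a special meaning in regular expressions.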

Try it yourself! 🤓

Create a new dataset called resolutions that keeps only the rows whose category is one of:
- ["house resolution","senate resolution"]
Create a new column called action_custom which replaces the word resolution with res. in the action column
Check the first 10 rows of the new dataset
Appendix 01

Regular expressions 🔎

Regular expressions

Regular expressions (“regex”) are a flexible tool to search for patterns in text
Regex is a language in itself, and it is used in many programming languages
Regular expressions are very powerful, but they can also be very complex
Here we will see just the basics of regex, but it is worth learning more about it (if you have the courage!)
An example of a very simple regex used to validate email addresses: ^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
It means “a string that starts with a sequence of letters, numbers, underscores, hyphens, and dots, followed by an @, followed by another sequence of letters, numbers, underscores, hyphens, and dots, followed by a dot and a sequence of letters with length between 2 and 5” 😅
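
To see the email pattern in action, here is a minimal sketch that tests it against a few strings with Python's re module. The candidate strings other than the address shown on the title slide are made up, and this simple pattern is not a complete email validator:

```python
import re

# The email pattern from the slide, compiled once for reuse
email_pattern = re.compile(
    r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$"
)

# Candidate strings (all but the first are hypothetical)
candidates = [
    "danilo.freire@emory.edu",
    "not-an-email",
    "user@domain",
    "a_b@example.co.uk",
]

# True if the whole string matches the pattern, False otherwise
results = {c: bool(email_pattern.match(c)) for c in candidates}

for candidate, is_match in results.items():
    print(candidate, is_match)
```

The strings without an @ or without a dot after the @ fail, because every part of the pattern must be present between ^ and $.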

Regular expressions in Python

Let’s load the dataset again

dataset = pd.read_csv("data_raw/bills_actions.csv")

And let’s split the data into two datasets: one for senate bills and another for amendments

senate_bills = dataset.query('category == "senate bill"')
amendments = dataset.query('category == "amendment"')

Let’s see all actions that contain the words to reconsider
We can use the str.contains() method again

dataset[dataset['action'].str.contains('to reconsider')].head()

	Congress	bill_number	bill_type	action	main_action	category	member_id
38	116	1	s	Motion by Senator McConnell to reconsider the ...	senate floor actions	senate bill	858
39	116	1	s	Motion by Senator McConnell to reconsider the ...	senate floor actions	senate bill	858
40	116	1	s	Motion by Senator McConnell to reconsider the ...	senate floor actions	senate bill	858
41	116	1	s	Motion by Senator McConnell to reconsider the ...	senate floor actions	senate bill	858
268	116	2657	s	Motion by Senator McConnell to reconsider the ...	senate floor actions	senate bill	858

Regular expressions in Python

Now let’s use regular expressions with pandas string methods
Python handles regex with the re module, which is part of the standard library
pandas string methods such as str.contains() and str.findall() use it under the hood, but let’s import it too

import re

We can use the str.findall() method to find all occurrences of a pattern in a string
This is similar to str.contains(), but it returns a list of the actual matches for each row instead of a boolean (True/False)
We will search for all occurrences of the word Amdt followed by a dot, one or more digits, and then a single non-digit character (e.g., "Amdt.1 ", "Amdt.23!", "Amdt.456a")

The regex pattern is Amdt\.\d+\D
- Amdt is the word we are looking for
- \. is an escaped dot; an unescaped dot is a special character in regex that matches any character
- \d matches any digit
- + matches one or more occurrences of the previous element, so \d+ matches one or more digits
- \D matches a single non-digit character (here, the space after the number)
Note the r before the string, which marks it as a raw string (it prevents Python from interpreting the backslashes as escape sequences)
Let’s see the result
Let’s see the result

amendments["action"].str.findall(r'Amdt\.\d+\D').head()

0     [Amdt.1274 ]
1     [Amdt.2698 ]
2     [Amdt.2659 ]
8     [Amdt.2424 ]
11    [Amdt.1275 ]
Name: action, dtype: object
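
To compare what the different string methods return, here is a minimal sketch on made-up sentences (toy_action is hypothetical, loosely modelled on the action column). It also uses str.extract(), which is not covered in these slides, to pull out only the digits via a capture group:

```python
import pandas as pd

# Hypothetical toy data
toy_action = pd.Series([
    "S.Amdt.1274 Amendment SA 1274 proposed",
    "Committee on the Judiciary. Reported",
    "S.Amdt.12 and S.Amdt.345 considered together",
])

# True/False: does the row contain the pattern?
has_amdt = toy_action.str.contains(r"Amdt\.\d+")

# List of every match in each row (empty list if there is none)
all_amdt = toy_action.str.findall(r"Amdt\.\d+")

# Only the part inside the parentheses, for the first match in each row
first_number = toy_action.str.extract(r"Amdt\.(\d+)", expand=False)

print(has_amdt.tolist())
print(all_amdt.tolist())
print(first_number.tolist())
```

Rows without a match give False with str.contains(), an empty list with str.findall(), and a missing value with str.extract().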

Wildcards and quantifiers

There are several special characters in regex. Examples:
. = any character except a newline
\d = digit
\D = non-digit character
\w = any word character (alphanumeric character plus underscore)
\W = any non-word character
\s = any whitespace character
\S = any non-whitespace character
+ = one or more occurrences of the previous character

There are many more special characters and quantifiers
Check the documentation for more information
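
A quick way to get a feel for these special characters is to try them on a single string with re.findall(). A minimal sketch, using a made-up sentence:

```python
import re

# Hypothetical example sentence
text = "S.Amdt.1274 proposed by Senator McConnell"

digits = re.findall(r"\d+", text)      # runs of digits
words = re.findall(r"\w+", text)       # runs of word characters
non_space = re.findall(r"\S+", text)   # runs of non-whitespace characters
dot_any = re.findall(r"A.dt", text)    # "." stands for any single character

print(digits)
print(words)
print(non_space)
print(dot_any)
```

Note that \w+ splits the text at dots and spaces, while \S+ splits it only at whitespace.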

Some examples

Get digits after string

amendments["action"].str.findall(r"Amdt\.\d+").head()

0     [Amdt.1274]
1     [Amdt.2698]
2     [Amdt.2659]
8     [Amdt.2424]
11    [Amdt.1275]
Name: action, dtype: object

Get any word character before the string mdt.

amendments["action"].str.findall(r"\wmdt\.").head()

0     [Amdt.]
1     [Amdt.]
2     [Amdt.]
8     [Amdt.]
11    [Amdt.]
Name: action, dtype: object

Get two word characters before the string dt. and four word characters after it

amendments["action"].str.findall(r"\w{2}dt\.\w{4}").head()

0     [Amdt.1274]
1     [Amdt.2698]
2     [Amdt.2659]
8     [Amdt.2424]
11    [Amdt.1275]
Name: action, dtype: object

Wildcards and quantifiers

Quantifiers are used to specify how many occurrences of a character we want to match
* = zero or more occurrences of the previous character
? = zero or one occurrence of the previous character
{n} = exactly n occurrences of the previous character
{n,} = at least n occurrences of the previous character
{n,m} = between n and m occurrences of the previous character

Other useful special characters are anchors, sets, alternation, and groups
^ = start of a string
$ = end of a string
[] = a set of characters
| = or
() = group

Enough for today! 😅

Match any characters (including none) before Amdt, then any run of non-whitespace characters after it

amendments["action"].str.findall(r".*Amdt\S*").head()

0     [S.Amdt.1274]
1     [S.Amdt.2698]
2     [S.Amdt.2659]
8     [S.Amdt.2424]
11    [S.Amdt.1275]
Name: action, dtype: object
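
The quantifiers are easier to compare side by side on one string. Here is a minimal sketch on a made-up sentence. It also uses two features not covered in these slides: (?:...) groups characters without capturing them, and *? is the "lazy" version of *, which matches as few characters as possible:

```python
import re

# Hypothetical example sentence with two amendment numbers
text = "S.Amdt.1274 replaces S.Amdt.12 in full"

# {4}: exactly four digits after "Amdt."
exactly_four = re.findall(r"Amdt\.\d{4}", text)

# {2,4}: between two and four digits after "Amdt."
two_to_four = re.findall(r"Amdt\.\d{2,4}", text)

# ?: the "S." prefix is optional
optional_prefix = re.findall(r"(?:S\.)?Amdt\.\d+", text)

# .* is greedy: it grabs as much text as possible before the last "Amdt"
greedy = re.findall(r".*Amdt\S*", text)

# .*? is lazy: it stops at the first "Amdt" it can find
lazy = re.findall(r".*?Amdt\S*", text)

print(exactly_four)
print(two_to_four)
print(optional_prefix)
print(greedy)
print(lazy)
```

The greedy pattern returns a single long match covering both amendments, which is why .* works cleanly only when Amdt appears once per row.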

Try it yourself! 😅

Practice using the senate_bills dataset

senate_bills = dataset.query('category == "senate bill"')

Use .str.findall() to find the word Senator
Use the regular expression "Senator \S" to extract the first letter of each senator’s name
Use * to extract senator names
Appendix 02

That’s all for today! 🎉

Happy Thanksgiving! 🦃

Appendix 01

# Subset the data (.copy() avoids a SettingWithCopyWarning when adding a column)
list_categories = ["house resolution","senate resolution"]
resolutions = bills_actions.query('category in @list_categories').copy()

# Replace the word "resolution" with "res."
resolutions["action_custom"] = resolutions["action"].str.replace("resolution","res.")

# Check the first 10 rows
resolutions[["action","action_custom"]].head(10)

	action	action_custom
485	Committee on Foreign Relations. Reported by Se...	Committee on Foreign Relations. Reported by Se...
486	Committee on Foreign Relations. Reported by Se...	Committee on Foreign Relations. Reported by Se...
487	Committee on Foreign Relations. Reported by Se...	Committee on Foreign Relations. Reported by Se...
488	Committee on Foreign Relations. Reported by Se...	Committee on Foreign Relations. Reported by Se...
489	Committee on Foreign Relations. Reported by Se...	Committee on Foreign Relations. Reported by Se...
490	Committee on Foreign Relations. Reported by Se...	Committee on Foreign Relations. Reported by Se...
493	Committee on Foreign Relations. Reported by Se...	Committee on Foreign Relations. Reported by Se...
494	Committee on Foreign Relations. Reported by Se...	Committee on Foreign Relations. Reported by Se...
496	Committee on Foreign Relations. Reported by Se...	Committee on Foreign Relations. Reported by Se...
497	Committee on Foreign Relations. Reported by Se...	Committee on Foreign Relations. Reported by Se...


Appendix 02

# Find the word "Senator"
senate_bills["action"].str.findall(r"Senator").head()

3    [Senator]
4    [Senator]
5    [Senator]
6    [Senator]
7    [Senator]
Name: action, dtype: object

# Extract the first letter of senator
senate_bills["action"].str.findall(r"Senator \S").head()

3    [Senator A]
4    [Senator G]
5    [Senator G]
6    [Senator W]
7    [Senator M]
Name: action, dtype: object

# Extract senator names
senate_bills["action"].str.findall(r"Senator \S*").head()

3    [Senator Alexander]
4       [Senator Graham]
5       [Senator Graham]
6       [Senator Wicker]
7        [Senator Moran]
Name: action, dtype: object
