QTM 151 - Introduction to Statistical Computing II

Lecture 24 - Text Data

Danilo Freire

Emory University

25 November, 2024

Hello, everyone! 😊

Recap 🤓

Time series data

Last time, we…

  • Dived a bit deeper into time series data
  • Saw how to plot (multiple) trends in the same plot
  • Learned how to calculate growth rates and normalise data in different ways
  • Learned how to use the diff() function to calculate differences between consecutive observations and the shift() function to shift the data
  • Saw how to use query() to filter data

Today’s plan 📚

Text data

  • Today we will work with text data in Python
  • Text data is as popular as ever, thanks to social media, LLMs, and the like
  • We use text data to estimate sentiment, classify documents, estimate the ideal points of politicians, and much more
  • Text is usually messy and unstructured, so we need to clean it before we can use it
  • We will learn how to clean text data and how to use it for analysis
  • We will use some pandas functions today
  • But there are several libraries that are worth checking out too, such as nltk, spaCy, and gensim
  • We will also introduce the concept of regular expressions, which allow you to search for patterns in text

Some interesting applications of text analysis

Ideal point estimation

Barberá, P. (2015). Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data. Political Analysis, 23(1), 76-91. YouTube video

Some interesting applications of text analysis

Word frequency as a proxy for importance

Rozado, D., Al-Gharbi, M., & Halberstadt, J. (2023). Prevalence of prejudice-denoting words in news media discourse: A chronological analysis. Social Science Computer Review, 41(1), 99-122.

Some interesting applications of text analysis

Sentiment analysis

Robinson, D. Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half. Variance Explained.

Some interesting applications of text analysis

What gets censored in China?

King, G., Pan, J., & Roberts, M. E. (2013). How censorship in China allows government criticism but silences collective expression. American Political Science Review, 107(2), 326-343.

Text data 📜

Text data

Import the necessary libraries and data

  • As usual, let’s start by importing the necessary libraries and the data we will use today
  • The data are about congressional bills in the US
import pandas as pd

# Load the data
bills_actions = pd.read_csv("data_raw/bills_actions.csv")
bills_actions.dtypes
Congress        int64
bill_number     int64
bill_type      object
action         object
main_action    object
category       object
member_id       int64
dtype: object
  • Let’s have a quick look at the data
bills_actions.head()
   Congress  bill_number  bill_type                                             action                               main_action     category  member_id
0       116         1029          s  S.Amdt.1274 Amendment SA 1274 proposed by Sena...  senate amendment proposed (on the floor)    amendment        858
1       116         1031          s  S.Amdt.2698 Amendment SA 2698 proposed by Sena...  senate amendment proposed (on the floor)    amendment        675
2       116         1160          s  S.Amdt.2659 Amendment SA 2659 proposed by Sena...  senate amendment proposed (on the floor)    amendment        858
3       116         1199          s  Committee on Health, Education, Labor, and Pen...     senate committee/subcommittee actions  senate bill       1561
4       116         1208          s  Committee on the Judiciary. Reported by Senato...     senate committee/subcommittee actions  senate bill       1580

Basic text operations

Counting categories

  • A simple way to start working with text data is to count how many distinct values (categories) a text column contains
  • Let’s see how many categories we have in the category column
bills_actions["category"].nunique()
9
  • We can also use value_counts() to see the frequency of each category
bills_actions["category"].value_counts()
category
amendment                       1529
house bill                       902
senate bill                      514
house resolution                 234
senate resolution                 60
house joint resolution            22
house concurrent resolution       20
senate concurrent resolution      14
senate joint resolution            8
Name: count, dtype: int64
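
The two calls above can be tried on a tiny, made-up column. This is only a sketch: the toy values are invented for illustration and do not come from bills_actions.csv. The normalize=True argument of value_counts() returns shares instead of raw counts.

```python
import pandas as pd

# Toy data, invented for illustration (not the bills dataset)
toy = pd.DataFrame({
    "category": ["amendment", "house bill", "amendment", "senate bill", "amendment"]
})

# Number of distinct categories
n_categories = toy["category"].nunique()

# Frequency of each category, from most to least common
counts = toy["category"].value_counts()

# Relative frequencies instead of raw counts
shares = toy["category"].value_counts(normalize=True)

print(n_categories)
print(counts)
print(shares)
```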

Subset text categories

  • Here we are only interested in bills. So let’s use query() to subset the data
  • We select all rows whose value in the category column is contained in list_categories
    • in is used to test whether a value belongs to a list
    • @ is the syntax to reference a Python variable defined outside the query string
list_categories = ["house bill","senate bill"]
bills = bills_actions.query('category in @list_categories')

# Verify that the code worked:
bills["category"].value_counts()
category
house bill     902
senate bill    514
Name: count, dtype: int64
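
A small sketch of the same subsetting idea on invented data. The names toy and wanted are hypothetical. The second version uses isin(), which should give the same rows with a boolean mask instead of a query string.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "category": ["amendment", "house bill", "senate bill", "house resolution"],
    "bill_number": [1, 2, 3, 4],
})

wanted = ["house bill", "senate bill"]  # hypothetical list name

# query(): @wanted refers to the Python variable defined above
subset_query = toy.query("category in @wanted")

# Equivalent boolean-mask version using isin()
subset_isin = toy[toy["category"].isin(wanted)]

print(subset_query)
```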

Data manipulation with sentences

  • The str attribute allows us to access string methods, and there are a lot of them
  • Here we will use str.contains() to check if a sentence contains a specific word
  • Let’s see how many bill actions contain the word Senator (the search is case-sensitive by default)
# Check if the action contains the word "Senator"
bool_contains = bills["action"].str.contains("Senator")

# Check the result
bool_contains.head()
3    True
4    True
5    True
6    True
7    True
Name: action, dtype: bool
# Calculate the proportion 
bool_contains.mean()
0.3199152542372881
  • Let’s double-check the result
bills[bills["action"].str.contains("Senator")]
      Congress  bill_number  bill_type                                             action                            main_action     category  member_id
3          116         1199          s  Committee on Health, Education, Labor, and Pen...  senate committee/subcommittee actions  senate bill       1561
4          116         1208          s  Committee on the Judiciary. Reported by Senato...  senate committee/subcommittee actions  senate bill       1580
5          116         1231          s  Committee on the Judiciary. Reported by Senato...  senate committee/subcommittee actions  senate bill       1580
6          116         1228          s  Committee on Commerce, Science, and Transporta...  senate committee/subcommittee actions  senate bill       1002
7          116          123          s  Committee on Veterans' Affairs. Reported by Se...  senate committee/subcommittee actions  senate bill       1490
...        ...          ...        ...                                                ...                                    ...          ...        ...
2944       116          617         hr  Committee on Energy and Natural Resources. Rep...  senate committee/subcommittee actions   house bill       1581
3081       116          762         hr  Committee on Energy and Natural Resources. Rep...  senate committee/subcommittee actions   house bill       1581
3142       116          828         hr  Committee on Homeland Security and Governmenta...  senate committee/subcommittee actions   house bill       1701
3150       116          829         hr  Committee on Homeland Security and Governmenta...  senate committee/subcommittee actions   house bill       1701
3195       116          887         hr  Committee on Homeland Security and Governmenta...  senate committee/subcommittee actions   house bill       1701

453 rows × 7 columns
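
Two optional arguments of str.contains() are often handy with messy text: case=False ignores capitalisation, and na=False treats missing text as "no match". A minimal sketch on invented strings (not rows from the bills data):

```python
import pandas as pd

# Toy strings, invented for illustration
actions = pd.Series([
    "Reported by Senator Graham",
    "Motion by senator McConnell",
    "Referred to the House committee",
    None,  # missing value
])

# Default search: case-sensitive
strict = actions.str.contains("Senator")

# case=False ignores capitalisation; na=False counts missing text as False
relaxed = actions.str.contains("Senator", case=False, na=False)

print(relaxed.tolist())

# The mean of a boolean Series is the proportion of True values
print(relaxed.mean())
```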

Replacing text

  • We can also use str.replace() to replace text
  • Let’s replace the word Senator with Sen. in the action column
bills["action_custom"] = bills["action"].str.replace("Senator","Sen.")
bills[["action","action_custom"]].head(10)
                                               action                                      action_custom
3   Committee on Health, Education, Labor, and Pen...  Committee on Health, Education, Labor, and Pen...
4   Committee on the Judiciary. Reported by Senato...  Committee on the Judiciary. Reported by Sen. G...
5   Committee on the Judiciary. Reported by Senato...  Committee on the Judiciary. Reported by Sen. G...
6   Committee on Commerce, Science, and Transporta...  Committee on Commerce, Science, and Transporta...
7   Committee on Veterans' Affairs. Reported by Se...  Committee on Veterans' Affairs. Reported by Se...
9   Committee on Homeland Security and Governmenta...  Committee on Homeland Security and Governmenta...
10  Committee on Homeland Security and Governmenta...  Committee on Homeland Security and Governmenta...
12  Committee on Foreign Relations. Reported by Se...  Committee on Foreign Relations. Reported by Se...
13  Committee on the Judiciary. Reported by Senato...  Committee on the Judiciary. Reported by Sen. G...
15  Committee on Foreign Relations. Reported by Se...  Committee on Foreign Relations. Reported by Se...
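
A self-contained sketch of the replacement step on invented data. Two details here are assumptions rather than part of the slides: calling .copy() on the subset, which in some pandas versions avoids a SettingWithCopyWarning when a column is added, and passing regex=False so that "Senator" is treated as plain text.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "category": ["senate bill", "amendment"],
    "action": ["Reported by Senator Graham", "Proposed by Senator Moran"],
})

# .copy() makes the subset an independent DataFrame before adding a column
toy_bills = toy.query('category == "senate bill"').copy()

# regex=False treats "Senator" as plain text rather than as a pattern
toy_bills["action_custom"] = toy_bills["action"].str.replace(
    "Senator", "Sen.", regex=False
)

print(toy_bills[["action", "action_custom"]])
```

Note that str.replace() returns a new Series, so the original action column is left unchanged.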

Try it yourself! 🤓

  • Create a new dataset called resolutions that keeps only the rows whose category is one of:
    • ["house resolution","senate resolution"]
  • Create a new column called action_custom which replaces the word resolution with res. in the action column
  • Check the first 10 rows of the new dataset
  • Appendix 01

Regular expressions 🔎

Regular expressions

  • Regular expressions (“regex”) are a flexible tool to search for patterns in text
  • Regex is a language in itself, and it is used in many programming languages
  • They are indeed very powerful, but can also be very complex
  • Here we will see just the basics of regex, but it is worth learning more about it (if you have the courage!)
  • An example of a very simple regex used to validate email addresses: ^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
  • It means “a string that starts with a sequence of letters, numbers, underscores, hyphens, and dots, followed by an @, followed by another sequence of letters, numbers, underscores, hyphens, and dots, followed by a dot and a sequence of letters with length between 2 and 5” 😅
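
To see the email pattern in action, here is a minimal sketch using Python's built-in re module. The addresses are made up for illustration. The third one shows a limitation of this particular pattern: the final part must have between 2 and 5 letters, so longer endings are rejected.

```python
import re

# The pattern from the slide, written as a raw string
email_pattern = re.compile(
    r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$"
)

# Made-up addresses for illustration
candidates = [
    "jane.doe@emory.edu",
    "no-at-sign.example.com",
    "user@site.technology",
]

for text in candidates:
    match = email_pattern.match(text)
    if match:
        # groups() returns the three parenthesised parts of the pattern
        print(text, "->", match.groups())
    else:
        print(text, "-> no match")

results = {text: bool(email_pattern.match(text)) for text in candidates}
```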

Regular expressions in Python

  • Let’s load the dataset again
dataset = pd.read_csv("data_raw/bills_actions.csv")
  • And let’s split the data into two datasets: one for senate bills and another for amendments
senate_bills = dataset.query('category == "senate bill"')
amendments = dataset.query('category == "amendment"')
  • Let’s see all actions that contain the words to reconsider
  • We can use the str.contains() method again
dataset[dataset['action'].str.contains('to reconsider')].head()
Congress bill_number bill_type action main_action category member_id
38 116 1 s Motion by Senator McConnell to reconsider the ... senate floor actions senate bill 858
39 116 1 s Motion by Senator McConnell to reconsider the ... senate floor actions senate bill 858
40 116 1 s Motion by Senator McConnell to reconsider the ... senate floor actions senate bill 858
41 116 1 s Motion by Senator McConnell to reconsider the ... senate floor actions senate bill 858
268 116 2657 s Motion by Senator McConnell to reconsider the ... senate floor actions senate bill 858

Regular expressions in Python

  • Now let’s combine the str.contains() method with a regular expression
  • Python handles regex with the re module, which is part of the standard library
  • Let’s import it too
import re
  • We can use the str.findall() method to find all occurrences of a pattern in a string
  • This is similar to str.contains(), but it returns the actual matches instead of a boolean (True/False) when a match is found
  • We will search for all occurrences of the word Amdt followed by a dot, one or more digits, and then one non-digit character (e.g., Amdt.1 followed by a space, Amdt.2!, Amdt.3., Amdt.4@, Amdt.5a, etc.)
  • The regex pattern is Amdt\.\d+\D
    • Amdt is the word we are looking for
    • \. is an escaped dot: a bare dot is a special character in regex, so the backslash makes it match a literal dot
    • \d is used to match any digit
    • + is used to match one or more occurrences of the previous character
    • \D is used to match a single non-digit character
  • Note the r before the string, which marks it as a raw string (it stops Python from interpreting the backslashes as escape sequences)
  • Let’s see the result
amendments["action"].str.findall(r'Amdt\.\d+\D').head()
0     [Amdt.1274 ]
1     [Amdt.2698 ]
2     [Amdt.2659 ]
8     [Amdt.2424 ]
11    [Amdt.1275 ]
Name: action, dtype: object
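
The trailing \D matters more than it may seem. A minimal sketch on invented strings (modelled on the visible part of the action column, not actual rows) compares the pattern with and without it:

```python
import pandas as pd

# Toy strings, invented for illustration
toy_actions = pd.Series([
    "S.Amdt.1274 Amendment SA 1274 proposed",
    "S.Amdt.2698",                 # nothing after the number
    "Committee on the Judiciary",  # no amendment number at all
])

# Digits only: matches even when the number ends the string
digits_only = toy_actions.str.findall(r"Amdt\.\d+")

# Digits plus one non-digit: the extra character becomes part of the match,
# and a number at the very end of the string no longer matches
digits_plus_one = toy_actions.str.findall(r"Amdt\.\d+\D")

# A raw string and a string with doubled backslashes are the same pattern
same_pattern = r"Amdt\.\d+" == "Amdt\\.\\d+"

print(digits_only.tolist())
print(digits_plus_one.tolist())
```

Rows without a match return an empty list rather than a missing value.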

Wildcards and quantifiers

  • There are several special characters in regex. Examples:
  • . = any character except a newline
  • \d = digit
  • \D = non-digit character
  • \w = any word character (alphanumeric character plus underscore)
  • \W = any non-word character
  • \s = any whitespace character
  • \S = any non-whitespace character
  • + = one or more occurrences of the previous character

  • There are many more special characters and quantifiers
  • Check the documentation for more information
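
A quick way to get a feel for these special characters is to try them on a single string with re.findall() from the standard library. The sentence below is made up for illustration:

```python
import re

# Made-up sentence for illustration
text = "S.Amdt.1274 proposed by Senator_A on 3 May"

digits = re.findall(r"\d+", text)      # runs of digits
words = re.findall(r"\w+", text)       # runs of word characters (note the underscore)
non_space = re.findall(r"\S+", text)   # runs of non-whitespace characters
any_char = re.findall(r"A.d", text)    # "A", then any one character, then "d"

print(digits)
print(words)
print(non_space)
print(any_char)
```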

Some examples

  • Match Amdt. followed by one or more digits
amendments["action"].str.findall(r"Amdt\.\d+").head()
0     [Amdt.1274]
1     [Amdt.2698]
2     [Amdt.2659]
8     [Amdt.2424]
11    [Amdt.1275]
Name: action, dtype: object
  • Match any single word character followed by mdt.
amendments["action"].str.findall(r"\wmdt\.").head()
0     [Amdt.]
1     [Amdt.]
2     [Amdt.]
8     [Amdt.]
11    [Amdt.]
Name: action, dtype: object
  • Match two word characters before dt. and four word characters after it
amendments["action"].str.findall(r"\w{2}dt\.\w{4}").head()
0     [Amdt.1274]
1     [Amdt.2698]
2     [Amdt.2659]
8     [Amdt.2424]
11    [Amdt.1275]
Name: action, dtype: object

Wildcards and quantifiers

  • Quantifiers are used to specify how many occurrences of a character we want to match
  • * = zero or more occurrences of the previous character
  • ? = zero or one occurrence of the previous character
  • {n} = exactly n occurrences of the previous character
  • {n,} = at least n occurrences of the previous character
  • {n,m} = between n and m occurrences of the previous character
  • Other useful special characters (these are not quantifiers):
  • ^ = start of a string
  • $ = end of a string
  • [] = a set of characters
  • | = or
  • () = group
  • Enough for today! 😅

  • Match any characters (including none) before Amdt, then Amdt itself and any run of non-whitespace characters after it
amendments["action"].str.findall(r".*Amdt\S*").head()
0     [S.Amdt.1274]
1     [S.Amdt.2698]
2     [S.Amdt.2659]
8     [S.Amdt.2424]
11    [S.Amdt.1275]
Name: action, dtype: object
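
The symbols above can be tested one at a time with the re module. The text is made up for illustration. The last two lines go slightly beyond the slides: .* is "greedy" (it matches as much as possible), while the variant .*? is "lazy" (it matches as little as possible).

```python
import re

# Made-up sentence for illustration
text = "H.Amdt.12 and S.Amdt.1274 were proposed"

optional = re.findall(r"Amdts?\.", "Amdt. Amdts.")  # ? = zero or one "s"
exactly4 = re.findall(r"\d{4}", text)               # exactly four digits
two_to_four = re.findall(r"\d{2,4}", text)          # between two and four digits
either = re.findall(r"[HS]\.Amdt", text)            # [] = one character from the set
alt = re.findall(r"proposed|agreed", text)          # | = or
starts = bool(re.search(r"^H\.", text))             # ^ = start of the string
ends = bool(re.search(r"proposed$", text))          # $ = end of the string
numbers = re.findall(r"Amdt\.(\d+)", text)          # () = return only the grouped part

# Greedy: the match runs up to the LAST "Amdt" in the string
greedy = re.findall(r".*Amdt\S*", text)

# Lazy: each match stops at the first "Amdt" it can reach
lazy = re.findall(r".*?Amdt\S*", text)

print(greedy)
print(lazy)
```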

Try it yourself! 😅

  • Practice using the senate_bills dataset
senate_bills = dataset.query('category == "senate bill"')
  • Use .str.findall() to find the word Senator
  • Use the regular expression "Senator \S" to extract the first letter of each senator’s name
  • Use * to extract the senators’ names
  • Appendix 02

That’s all for today! 🎉

Happy Thanksgiving! 🦃

Appendix 01

# Subset the data
list_categories = ["house resolution","senate resolution"]
resolutions = bills_actions.query('category in @list_categories')

# Replace the word "resolution" by "res."
resolutions["action_custom"] = resolutions["action"].str.replace("resolution","res.")

# Check the first 10 rows
resolutions[["action","action_custom"]].head(10)
                                                action                                      action_custom
485  Committee on Foreign Relations. Reported by Se...  Committee on Foreign Relations. Reported by Se...
486  Committee on Foreign Relations. Reported by Se...  Committee on Foreign Relations. Reported by Se...
487  Committee on Foreign Relations. Reported by Se...  Committee on Foreign Relations. Reported by Se...
488  Committee on Foreign Relations. Reported by Se...  Committee on Foreign Relations. Reported by Se...
489  Committee on Foreign Relations. Reported by Se...  Committee on Foreign Relations. Reported by Se...
490  Committee on Foreign Relations. Reported by Se...  Committee on Foreign Relations. Reported by Se...
493  Committee on Foreign Relations. Reported by Se...  Committee on Foreign Relations. Reported by Se...
494  Committee on Foreign Relations. Reported by Se...  Committee on Foreign Relations. Reported by Se...
496  Committee on Foreign Relations. Reported by Se...  Committee on Foreign Relations. Reported by Se...
497  Committee on Foreign Relations. Reported by Se...  Committee on Foreign Relations. Reported by Se...


Appendix 02

# Find the word "Senator"
senate_bills["action"].str.findall(r"Senator").head()
3    [Senator]
4    [Senator]
5    [Senator]
6    [Senator]
7    [Senator]
Name: action, dtype: object
# Extract the first letter of senator
senate_bills["action"].str.findall(r"Senator \S").head()
3    [Senator A]
4    [Senator G]
5    [Senator G]
6    [Senator W]
7    [Senator M]
Name: action, dtype: object
# Extract senator names
senate_bills["action"].str.findall(r"Senator \S*").head()
3    [Senator Alexander]
4       [Senator Graham]
5       [Senator Graham]
6       [Senator Wicker]
7        [Senator Moran]
Name: action, dtype: object
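
One caveat with \S*: it matches every non-whitespace character, so punctuation attached to a name is captured too. A small sketch on a made-up sentence (not a row from the data) compares it with \w+ and with a group:

```python
import re

# Made-up sentence in the style of the action column
text = "Reported by Senator Graham, with an amendment."

with_punctuation = re.findall(r"Senator \S*", text)  # \S* also captures the comma
letters_only = re.findall(r"Senator \w+", text)      # \w+ stops before the comma
name_only = re.findall(r"Senator (\w+)", text)       # the group keeps just the name

print(with_punctuation)
print(letters_only)
print(name_only)
```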
