Exploring Political Discussions on Social Media
Introduction
In this tutorial, we will explore political discussions online using data from Reddit. We will use the Reddit API to collect posts and comments, perform basic text analysis, and visualize the network of interactions.
Prerequisites
To follow this tutorial, you need:
- An R environment set up on your computer (e.g., RStudio)
- The following R packages installed: httr, jsonlite, tidyverse, tidytext, topicmodels, tm, igraph, ggraph, RedditExtractoR, knitr, kableExtra, and digest.
You can install these packages with:
install.packages(c("httr", "jsonlite", "tidyverse", "tidytext", "igraph", "ggraph", "RedditExtractoR", "tm))
Start by loading all necessary packages
Step 1: Access the Reddit API
If you’d like to work with data from a different platform, you may have a look at the Bluesky API. The authentication process is a bit different from Reddit’s, and the implementation may take you a bit longer. 🦋
To access Instagram, Facebook, X, or TikTok data, you currently have to go through an approval process and/or pay a considerable amount for data access. Reddit is still pretty open to research, which is why we start here.
Setting Up API Access
- Create an account on Reddit if you don’t already have one.
- Go to Reddit Apps and create a new application. Give it a name, add a brief description, and select “script”. As redirect URI, you can use http://localhost:8080. Note the client ID and secret. At this point, you may also want to have a look at the following:
- Create an R script that you call authenticate.R and populate it with the following information:
# Replace with your client ID, secret, and username/password
client_id <- "your_client_id"
client_secret <- "your_client_secret"
username <- "your_username"
password <- "your_password"
- Use the httr package to authenticate and make requests. This is the standard “raw” way to query APIs using HTTP GET and POST requests.
# Load credentials from authenticate.R
source("authenticate.R")
library(httr)
# Obtain OAuth token
<- POST("https://www.reddit.com/api/v1/access_token",
response authenticate(client_id, client_secret),
body = list(grant_type = "password", username = username, password = password),
encode = "form")
<- content(response)$access_token
token <- add_headers(Authorization = paste("bearer", token)) headers
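If you want to make sure the token actually works before fetching data, you can send a quick request to Reddit’s /api/v1/me endpoint. This is a minimal sanity-check sketch; the me object and the fields shown are just illustrations.
# Optional sanity check: if authentication worked, this endpoint
# returns information about your own Reddit account.
me <- GET("https://oauth.reddit.com/api/v1/me", headers)
status_code(me)    # 200 means the token was accepted
content(me)$name   # your Reddit username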
Fetching Data
For this tutorial, we will fetch posts and comments from the subreddit r/politics. If you inspect the output, you’ll see that content from a raw GET request looks pretty messy and requires significant parsing…
# Fetch recent posts from r/politics
<- "https://oauth.reddit.com/r/politics/hot"
url <- GET(url, headers)
response <- content(response, as = "parsed", simplifyVector = TRUE)
data
str(data)
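To give you a feel for the parsing involved, here is a rough sketch of how you might pull the post titles out of the parsed listing. The nesting shown (data$data$children$data) is an assumption about Reddit’s listing format and about how jsonlite simplifies it, so check str(data) first.
# A sketch (assumed structure): with simplifyVector = TRUE, the posts
# usually end up in a nested data frame under data$data$children$data.
raw_posts <- data$data$children$data
head(raw_posts$title)
head(raw_posts[, c("title", "author", "score", "num_comments")])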
…However, many social media APIs come with wrappers - packages built by people who make your life easier. In R, a package that simplifies working with the Reddit API is RedditExtractoR. If you seriously want to work with Reddit data, I’d recommend moving your data collection to Python and using PRAW.
# Find thread URLs
thread_urls <- find_thread_urls(subreddit = "politics", sort_by = "hot")

# Extract content from the first few threads
threads_content <- get_thread_content(thread_urls$url[1:10])

# Inspect the data - there are two datasets: one for the posts ($threads) and one for the comments ($comments)
str(threads_content)

posts <- threads_content$threads
#write_csv(posts, "../data/posts.csv")
comments <- threads_content$comments
#write_csv(comments, "../data/comments.csv")
# Hash usernames
library(digest)
hash_username <- function(usernames) {
  sapply(usernames, function(x) digest(x, algo = "sha256"))
}

posts$author <- hash_username(posts$author)
comments$author <- hash_username(comments$author)
#write_csv(posts, "../data/posts_hashed.csv")
#write_csv(comments, "../data/comments_hashed.csv")
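Note that a plain SHA-256 hash of a username can, in principle, be reversed by hashing candidate usernames and comparing. If that is a concern for your project, you can add a secret salt that you keep out of any shared files. A sketch; my_salt is a hypothetical placeholder:
# A sketch: a salted variant of the hashing function above.
# Keep the salt secret and out of any files you share.
my_salt <- "replace-with-a-long-random-string"  # hypothetical placeholder
hash_username_salted <- function(usernames, salt = my_salt) {
  sapply(usernames, function(x) digest(paste0(salt, x), algo = "sha256"))
}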
In case you got stuck along the way, you can work with the posts and comments I collected.
<- read.csv("../data/posts_hashed.csv")
posts <- read.csv("../data/comments_hashed.csv") comments
Step 2: Basic Text Analysis
We will analyze the text content of the posts using tidytext.
# Tokenize words
words <- posts %>%
  unnest_tokens(word, title) %>%
  anti_join(stop_words) # Remove stopwords

# Count word frequencies
word_counts <- words %>%
  count(word, sort = TRUE)

# Display top words
word_counts %>%
  top_n(10, n) %>%
  arrange(desc(n)) %>%
  kbl(col.names = c("Word", "Frequency"), caption = "Top 10 Words in Reddit Posts") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Word | Frequency |
---|---|
trump | 25 |
bill | 11 |
house | 5 |
judge | 5 |
republicans | 5 |
tax | 5 |
administration | 4 |
beautiful | 4 |
court | 4 |
cut | 4 |
foreign | 4 |
health | 4 |
medicaid | 4 |
musk | 4 |
students | 4 |
Bigram Analysis
# Tokenize the text into bigrams (pairs of consecutive words)
bigrams <- posts %>%
  unnest_tokens(bigram, title, token = "ngrams", n = 2)

# Separate the bigrams into two words
bigrams_separated <- bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ")

# Remove rows with stopwords
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word)

# Unite the words back into bigrams
bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")

# Count bigram frequencies
bigram_counts <- bigrams_united %>%
  count(bigram, sort = TRUE)

# Visualize the top 10 bigrams
bigram_counts %>%
  top_n(10, n) %>%
  arrange(desc(n)) %>%
  kbl(col.names = c("Bigram", "Frequency"), caption = "Top 10 Bigrams in Reddit Posts") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Bigram | Frequency |
---|---|
beautiful bill | 4 |
elon musk | 3 |
supreme court | 3 |
tax bill | 3 |
trump administration | 3 |
bond market | 2 |
charter school | 2 |
donald trump | 2 |
education department | 2 |
foreign students | 2 |
international students | 2 |
medicaid cuts | 2 |
oval office | 2 |
public religious | 2 |
religious charter | 2 |
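If you want to see how these word pairs hang together, you can also draw them as a small word network with igraph and ggraph, reusing bigrams_filtered from above. This is a sketch; the cut-off of at least two occurrences is arbitrary.
# A sketch: turn the more frequent word pairs into a word network
bigram_graph <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE) %>%
  filter(n >= 2) %>%
  graph_from_data_frame(directed = TRUE)

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n)) +
  geom_node_point(color = "darkgrey", size = 3) +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  theme_void()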
Simple sentiment analysis
# Load the sentiment lexicon
<- get_sentiments("bing")
bing_lexicon
# Join the unigrams with the sentiment lexicon
<- words %>%
sentiment_analysis inner_join(bing_lexicon, by = "word") %>%
count(sentiment, word, sort = TRUE)
# Visualize the top sentiment-bearing words
%>%
sentiment_analysis group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
ggplot(aes(x = reorder(word, n), y = n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
scale_fill_manual(values = c("pink","blue"))+
facet_wrap(~ sentiment, scales = "free_y") +
coord_flip() +
labs(title = "Top Sentiment Words in Reddit Posts", x = "Word", y = "Frequency")+
theme_bw()
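Instead of looking at single words, you could also aggregate sentiment per post title. A sketch: posts without any sentiment-bearing words drop out of the inner join, and the positive/negative columns assume both sentiment categories actually occur in your data.
# A sketch: net sentiment (positive minus negative words) per post title
post_sentiment <- posts %>%
  mutate(post_id = row_number()) %>%
  unnest_tokens(word, title) %>%
  inner_join(bing_lexicon, by = "word") %>%
  count(post_id, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative)

head(post_sentiment)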
Hmm, how useful is this actually? Let’s try something else…
Topic Modeling with LDA
# Assign a unique identifier to each post
posts <- posts %>%
  mutate(post_id = row_number())

# Tokenize and count word frequencies per post
post_words <- posts %>%
  unnest_tokens(word, title) %>%
  anti_join(stop_words, by = "word") %>%
  count(post_id, word, sort = TRUE)

# Create the DTM (document term matrix)
dtm <- post_words %>%
  cast_dtm(document = post_id, term = word, value = n)

# Set a seed for reproducibility
set.seed(123)

# Fit the LDA model
lda_model <- LDA(dtm, k = 3, control = list(seed = 123))

topics <- tidy(lda_model, matrix = "beta")

top_terms <- topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered() +
  scale_fill_manual(values = c("lightgrey", "grey", "darkgrey")) +
  labs(title = "Top Terms per Topic",
       x = NULL, y = "Beta") +
  theme_bw()
The plot shows the 10 words most representative for each topic. The beta values are the probabilities of a term being associated with a specific topic.
What could these topics be? Do the results make sense to you? 🤔
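Besides beta, the fitted model also gives you gamma, the per-document topic probabilities, which tell you which topic dominates each post:
# Per-document topic probabilities (gamma): dominant topic per post
post_topics <- tidy(lda_model, matrix = "gamma") %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup()

head(post_topics)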
These simple, out-of-the-box approaches may not answer the core questions you’re interested in. Fortunately, LLMs enable a whole new level of content analysis using approaches that get pretty close to human annotation (Gilardi et al., 2023).
Can you develop a prompt (or a set of prompts) that could help you annotate the data to answer a question you are interested in?
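As a starting point, here is one possible prompt template, built as a plain string in R. The wording and the label set are hypothetical examples, and actually sending the prompt to a model (and choosing which model) is left to you.
# A sketch: build a zero-shot annotation prompt for a single post title.
# The categories below are hypothetical - replace them with labels that
# fit your research question.
build_prompt <- function(title) {
  paste0(
    "You are annotating Reddit post titles from r/politics.\n",
    "Classify the following title into exactly one category: ",
    "'policy', 'elections', 'courts', or 'other'.\n",
    "Answer with the category only.\n\n",
    "Title: ", title
  )
}

cat(build_prompt(posts$title[1]))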
You may want to spend the rest of the time polishing the prompts. In case you realize that your question is rather structural or relational, you may want to continue to Step 3 and explore the network structure of the data.
Step 3: Network Visualization
We will create a discussion network from our data using igraph and ggraph. You may also want to have a look at the network analysis resources listed under Further reading below.
# Let's focus on the largest thread in our dataset here
largest_thread_url <- comments %>%
  group_by(url) %>%
  tally(name = "comment_count") %>%
  slice_max(order_by = comment_count, n = 1, with_ties = FALSE) %>%
  pull(url)

# Parse the thread structure
thread_data <- comments %>%
  filter(url == largest_thread_url) %>%
  filter(!is.na(comment_id)) %>%
  # remove top level comments
  filter(str_detect(comment_id, "_")) %>%
  mutate(
    # Identify comment levels
    level = str_count(comment_id, "_") + 1,
    # Extract parent comment_id
    parent = str_replace(comment_id, "_[^_]+$", "")
  )

interactions <- thread_data %>%
  filter(parent %in% comment_id) %>% # Ensure parent exists in dataset
  select(from = parent, to = comment_id, author_from = author)

# Create a graph object
graph <- graph_from_data_frame(interactions, directed = TRUE)
# Plot the graph
ggraph(graph, layout = "fr") +
geom_edge_link(aes(width = after_stat(index)), edge_alpha = 0.8) +
geom_node_point(size = 5, color = "blue") +
geom_node_text(aes(label = name), repel = TRUE, size = 1) +
theme_void()+
theme(legend.position = "none")
# Let's try a different layout - as we are looking at discussion threads
ggraph(graph, layout = "tree") +
geom_edge_link(width = 0.3, edge_alpha = 0.8) +
geom_node_point(size = 2, color = "blue") +
geom_node_text(aes(label = name),
hjust = 1,
size = 1) +
theme_void() +
theme(legend.position = "none")
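Before wrapping up, you could compute a few simple descriptives of the discussion graph; everything below uses standard igraph functions.
# Basic descriptives of the discussion tree
vcount(graph)   # number of comments (nodes)
ecount(graph)   # number of reply relations (edges)

# Comments that received the most direct replies
# (edges run from parent comment to reply, so we count out-degree)
sort(degree(graph, mode = "out"), decreasing = TRUE)[1:5]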
Conclusion
In this very basic introductory tutorial, we covered how to use the Reddit API to collect data, perform basic text analysis, and visualize interactions. Feel free to explore your data further!
Further reading
Social media APIs as access points for social science research have been around for a while, and many people have developed fantastic materials to guide you through this. Some examples to help you dig deeper are:
- Paul C. Bauer, APIs for social scientists
- David Schoch, Tidy Network Analysis in R