Text data analysis in R

Five example workflows using ✨LLMs✨

Najmus Saqib

Context

Artificial intelligence (AI), and large language models (LLMs) in particular, have captured the public’s imagination since the release of ChatGPT in late 2022.

While the broad-based utility of LLMs is still up for debate, recent developments have made it possible for R users to integrate these tools into their existing workflows.

What has changed?

Recent developments have opened up new possibilities for leveraging LLMs:

  • Release of free and open-source LLMs; previously, proprietary tools were the only option
  • Release of tools that make it possible to deploy LLMs on a laptop; previously the only option was to send data to third-party APIs
  • Smaller models that perform admirably relative to the largest models, reducing the resource requirements to run them
  • Development of R packages that make it possible to integrate LLMs like any other data analysis tool

Setup process

Follow the steps below to set up your system:

  • Download and install Ollama
  • Open a terminal (e.g. Windows PowerShell) and download one of the models, for example: ollama run llama3.2
  • In R, install the mall package: install.packages("mall") (see the setup sketch below)
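
Below is a minimal setup sketch. The readr and dplyr packages are assumptions on my part, included only because the examples that follow use read_csv() and the %>% pipe; adjust to your own environment.

# Install the required packages once
install.packages(c("mall", "readr", "dplyr"))

# Load them at the start of each session
library(mall)   # llm_use(), llm_custom(), llm_extract(), llm_sentiment(), llm_summarize()
library(readr)  # read_csv()
library(dplyr)  # %>%, mutate(), left_join()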

Five example workflows

While there are various ways one can incorporate an LLM into an R workflow, the following five use cases illustrate how users within the Agency might choose to leverage this technology:

  1. Handling typos

  2. Data classification

  3. Pattern matching

  4. Sentiment analysis

  5. Text summarization

1. Handling typos - Import Data

To illustrate how an LLM can help deal with typos, let’s import a dummy data set of travellers coming into Canada.

traveller_data <- read_csv("https://raw.githubusercontent.com/najsaqib/naj_lab/refs/heads/main/traveller_data.csv", show_col_types = FALSE)

traveller_data
# A tibble: 10 × 2
       id citizen     
    <dbl> <chr>       
 1 362062 France      
 2 937423 Canada      
 3 945390 India       
 4 973331 cnada       
 5 504350 Spain       
 6 886613 canda       
 7 276570 South Africa
 8 640876 Canadian    
 9 967784 Cananda     
10 695475 Camada      

1. Handling typos - Utilize LLM

Since there are various typos in the citizen field for “Canada,” let’s see if we can prompt the LLM to identify them all correctly.

llm_use("ollama", "llama3.2", seed = 100, .silent = TRUE)

my_prompt <- paste(
  "Answer a question.",
  "Return only the answer, no explanation",
  "Acceptable answers are 'yes', 'no'",
  "Answer this about the following text, Is this text related to Canada or a misspelling of Canada?:"
)

traveller_final <- traveller_data %>% llm_custom(citizen, my_prompt)

traveller_final

1. Handling typos - Utilize LLM

# A tibble: 10 × 3
       id citizen      .pred
    <dbl> <chr>        <chr>
 1 362062 France       No   
 2 937423 Canada       Yes  
 3 945390 India        No   
 4 973331 cnada        Yes  
 5 504350 Spain        No   
 6 886613 canda        Yes  
 7 276570 South Africa No   
 8 640876 Canadian     Yes  
 9 967784 Cananda      Yes  
10 695475 Camada       Yes  

1. Handling typos - Considerations

  • You might find the prediction accuracy to be very sensitive to the choice of seed and to the exact wording of the prompt

  • However, the main advantage of this approach is that it can handle misspellings of Canada that you might not encounter until later

  • Another option could be to use fuzzy/probabilistic matching techniques, as sketched below
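
A hedged sketch of the fuzzy-matching alternative using base R’s agrepl(), which flags values within a maximum edit distance of “Canada”. The max.distance value and the is_canada column name are illustrative; unlike the LLM approach, this only catches misspellings within the chosen distance.

# Non-LLM alternative: approximate string matching with base R's agrepl()
traveller_fuzzy <- traveller_data %>%
  mutate(is_canada = agrepl("Canada", citizen,
                            max.distance = 2, ignore.case = TRUE))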

2. Data classification - Import Data

Let’s import a dummy global COVID-19 data set to showcase how an LLM can be used for classification.

covid_data <- read_csv("https://raw.githubusercontent.com/najsaqib/naj_lab/refs/heads/main/covid_deaths.csv", show_col_types = FALSE)  

covid_data
# A tibble: 16 × 2
   country        deaths_million
   <chr>                   <dbl>
 1 Bulgaria                 5669
 2 Ghana                    5114
 3 Hungary                  5065
 4 India                    4799
 5 Tunisia                  4766
 6 Japan                    4519
 7 Nigeria                  4317
 8 Czech Republic           4076
 9 Canada                   4027
10 Laos                     3973
11 Greece                   3770
12 Cambodia                 3693
13 Romania                  3590
14 Pakistan                 3482
15 United Kingdom           3404
16 Italy                    3309

2. Data classification - Utilize LLM

We can take advantage of the LLM’s existing knowledge base to classify data by almost any category.

llm_use("ollama", "llama3.2", seed = 99, .silent = TRUE)

covid_final <- covid_data %>% llm_extract(country, "continent")

print(covid_final)

2. Data classification - Utilize LLM

# A tibble: 16 × 3
   country        deaths_million .extract       
   <chr>                   <dbl> <chr>          
 1 Bulgaria                 5669 "europe"       
 2 Ghana                    5114 " africa"      
 3 Hungary                  5065 "europe"       
 4 India                    4799 "asia"         
 5 Tunisia                  4766 "africa"       
 6 Japan                    4519 "asia"         
 7 Nigeria                  4317 "africa"       
 8 Czech Republic           4076 "europe"       
 9 Canada                   4027 "north america"
10 Laos                     3973 "asia"         
11 Greece                   3770 "europe"       
12 Cambodia                 3693 "asia"         
13 Romania                  3590 "europe"       
14 Pakistan                 3482 "asia"         
15 United Kingdom           3404 "europe"       
16 Italy                    3309 "europe"       

2. Data classification - Considerations

  • The accuracy of this approach will decrease when dealing with lesser-known subject matter (e.g. classifying cities by health region)

  • In this instance, it would be easy to find or create a continent lookup data set and left join it with the COVID-19 data set (sketched after this list), but not all use cases will be this simple

  • Homework: what would happen if the data set contained records with Aragorn and Gondor for countries? What about Wakanda?
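
A hedged sketch of the left-join alternative mentioned above. The continent_lookup table is hypothetical and only covers a few countries for illustration; a real table would need to cover every country in the data.

# Non-LLM alternative: join a (partial, illustrative) continent lookup table
continent_lookup <- tibble::tribble(
  ~country,   ~continent,
  "Bulgaria", "Europe",
  "Ghana",    "Africa",
  "India",    "Asia",
  "Canada",   "North America"
)

covid_joined <- covid_data %>%
  left_join(continent_lookup, by = "country")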

3. Pattern matching - Import Data

For this example, we will import a dummy data set on quarantine.

quarantine_data <- read_csv("https://raw.githubusercontent.com/najsaqib/naj_lab/refs/heads/main/quarantine_data.csv", show_col_types = FALSE)  

quarantine_data$activity
[1] "I walked through the park on a crisp autumn morning, enjoying the scenery near L4P 1A8, which was particularly beautiful today"     
[2] "The smell of fresh coffee filled the air as I sipped my latte V3M 4B9 and enjoyed the view from the café's patio."                  
[3] "After a long hike, I sat down to rest on a bench overlooking L1B 4T1 in Banff, which was breathtakingly stunning"                   
[4] "I browsed through the bookstore's shelves in search of N2J 2X8 novels near Victoria, but couldn't find anything that caught my eye."
[5] "The sound of waves crashing against A1C 5V5 shore was soothing as I walked along the beach in St. John's, feeling very relaxed."    

3. Pattern matching - Utilize LLM

Let’s use the LLM to pattern-match and extract a specific piece of information, in this case the postal code.

llm_use("ollama", "llama3.2", seed = 99, .silent = TRUE)

quarantine_final <- quarantine_data %>% llm_extract(activity, "postal code")

print(quarantine_final$.extract)

3. Pattern matching - Utilize LLM

[1] "l4p 1a8" "v3m 4b9" "l1b 4t1" "n2j 2x8" "a1c 5v5"

3. Pattern matching - Considerations

  • Model accuracy will decrease for more niche use cases (e.g. postal codes for smaller countries)

  • It is simple to extract the relevant pieces of information from relatively large bodies of text

  • A traditional option is to use regular expressions (regex), but the syntax can be challenging to learn; a sketch follows this list
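
A hedged sketch of the regex alternative using stringr::str_extract(). The pattern assumes the standard Canadian postal code format (letter-digit-letter, optional space, digit-letter-digit); messier real-world text may need a more forgiving pattern.

# Non-LLM alternative: extract postal codes with a regular expression
library(stringr)

quarantine_regex <- quarantine_data %>%
  mutate(postal_code = str_extract(activity, "[A-Za-z]\\d[A-Za-z] ?\\d[A-Za-z]\\d"))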

4. Sentiment analysis - Import Data

Let’s import a dummy data set of user reviews for the ArriveCan application.

app_data <- read_csv("https://raw.githubusercontent.com/najsaqib/naj_lab/refs/heads/main/app_reviews.csv", show_col_types = FALSE)    

app_data$review
[1] "This app was advertised on the news as a way to save time upon arrival, as you would be able to use an \"express line\". Although the app itself was easy to use, I saved zero time when I arrived at YVR as I had to line up and wait with everybody else to see one of the 2 CBSA agents available. I didn't see any benefit to using it, so not sure I'll use it again."                                                                                                              
[2] "ArriveCan is very useful, coz you declare ahead before you even arrive in Canada, so it speeds up the process once you are there. :)"                                                                                                                                                                                                                                                                                                                                                    
[3] "The app hangs upon saving traveller profiles and you cannot cancel or go back. Need to restart the app and try again. Took 2, 4, 4, and 13 tries (gave up on the last user entry after trying manually, scanning, and adding midway through the declaration process). If other people are experiencing the same issues with adding their profiles, this should be a priority to fix in the app."                                                                                         
[4] "Extremely easy to complete. This is a great way to speed the process of transiting Grigg the Canadian airports."                                                                                                                                                                                                                                                                                                                                                                         
[5] "terrible app. it fails even simple things. camera didn't focus properly, camera didn't read passports well, app should be able to read chips like Australia and other countries, saving a traveller hung at 12% and had to be restarted losing all travellers, even after registering all 4 I was blocked from a customs déclaration for all 4 travellers saying I had duplicate birthdates (there were none). I'm on a modern android: pixel 6 pro. what a total waste of 59.5 million."
[6] "This is simple and convenient. I think completing this online saves time."                                                                                                                                                                                                                                                                                                                                                                                                               
[7] "Extremely easy to complete. This is a great way to speed the process of transiting Grigg the Canadian airports."                                                                                                                                                                                                                                                                                                                                                                         

4. Sentiment analysis - Utilize LLM

It could be helpful to quickly get a sense of user sentiment towards the app as new updates are released.

llm_use("ollama", "llama3.2", seed = 100, .silent = TRUE)

app_final <- app_data %>% llm_sentiment(review)

print(app_final$.sentiment)

4. Sentiment analysis - Utilize LLM

[1] "negative" "positive" "negative" "positive" "negative" "positive" "positive"

4. Sentiment analysis - Considerations

  • Compared to traditional approaches, LLMs have a better grasp of the overall context, including subtleties such as sarcasm, irony, satire, slang, etc.

  • Less compute-intensive approaches (e.g. via the tidytext package) rely on pre-determined lexicons and calculate the sentiment of a text as the sum of the sentiments of its individual words; a sketch follows this list

  • Generally speaking, accuracy improves with more powerful hardware and larger models
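
A hedged sketch of the lexicon-based alternative using tidytext and the Bing lexicon. The review_id and score names are illustrative; the approach simply sums word-level positive and negative matches per review.

# Lexicon-based alternative: word-level sentiment with tidytext
library(tidytext)
library(tidyr)

app_lexicon <- app_data %>%
  mutate(review_id = row_number()) %>%
  unnest_tokens(word, review) %>%                       # one row per word
  inner_join(get_sentiments("bing"), by = "word") %>%   # keep words found in the lexicon
  count(review_id, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(score = positive - negative)                   # net sentiment per review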

5. Text summarization - Import Data

For this exercise, let’s import a data set containing the opening messages/statements from all of the annual CPHO reports published during Dr. Tam’s tenure.

cpho_data <- read_csv("https://raw.githubusercontent.com/najsaqib/naj_lab/refs/heads/main/cpho_message.csv", show_col_types = FALSE)   

cpho_data
# A tibble: 8 × 2
   year text                                                                    
  <dbl> <chr>                                                                   
1  2017 "Without being aware of it, our neighbourhoods and how they are built i…
2  2018 "I am pleased to present my annual report, which is a snapshot of the h…
3  2019 "By and large, we are a healthy nation. We can be proud of Canada's hea…
4  2020 "The COVID-19 pandemic is having a profound impact on the health, socia…
5  2021 "The COVID-19 pandemic represents the biggest public health crisis that…
6  2022 "Over the past 2 and a half years, we have been challenged by the COVID…
7  2023 "In recent years, our communities have faced monumental challenges, fro…
8  2024 "This year, the global community celebrates 50 years of progress since …

5. Text summarization - Utilize LLM

It would be helpful to get a quick summary of the message for each year.

llm_use("ollama", "llama3.2", seed = 100, .silent = TRUE)

cpho_final <- cpho_data %>% llm_summarize(text, max_words = 30)

print(cpho_final$.summary)

5. Text summarization - Utilize LLM

[1] "canada's chief public health officer reports that chronic diseases like diabetes are linked to unhealthy living environments and that designing healthy communities can help reduce rates of these diseases."
[2] "canada's health status is generally good, but persistent inequalities and social factors threaten public health, with major chronic diseases and mental health impacting many lives."                        
[3] "canada's health status is strong but vulnerable to trends like rising measles cases and opioid crisis; addressing stigma is crucial for improving health inequities and achieving optimal health for all."   
[4] "canada's covid-19 pandemic has led to fundamental changes in daily life causing suffering, loss, job loss and isolation for many people especially seniors and essential workers."                           
[5] "canada's covid-19 response has shown remarkable achievements but also exposed cracks in its public health system which lacks resources and tools to protect all people living in canada."                    
[6] "canada's public health systems face a crucial test with the climate crisis, must adapt and respond collaboratively with other sectors, and prioritize efforts for immediate health benefits."                
[7] "emergency response is becoming more complex and challenging; investing in social infrastructure is crucial for community resilience and health promotion can support emergency management efforts."          
[8] "canada celebrates 50 years of progress since the launch of the expanded programme on immunization, with estimated 154 million lives saved worldwide as a result of vaccines."                                

5. Text summarization - Considerations

  • The LLM’s context awareness makes it a great tool for summarization

  • While this is a very compute-intensive task, there are no good alternatives for producing summaries programmatically and consistently short of doing them manually

  • Homework: how will the LLM summarize text that is semantically meaningless? This also applies to sentiment analysis.

Concluding thoughts

  • While the jury is still out on the overall utility of LLMs, there are already several proven use cases where their deployment can greatly supplement ongoing work, especially with respect to text data and qualitative analysis.

  • In situations where other technological alternatives exist, LLM-based solutions have downsides: hallucinations, vulnerability to adversarial attacks, failures that are sometimes inexplicable, and a sensitivity to prompt wording that is not easily dealt with systematically.

  • Implementation of Retrieval-Augmented Generation (RAG) methods can improve utility by giving the model access to domain/organizational knowledge, but at present this requires the use of Python