Artificial intelligence, and large language models (LLMs) in particular, have captured the public’s imagination since the release of ChatGPT in late 2022.
While the broad-based utility of LLMs is still up for debate, recent developments have made it possible for R users to integrate these tools into their existing workflows.
What has changed?
Recent developments have opened up new possibilities for leveraging LLMs:
Release of free and open-source LLMs; proprietary tools were the only options previously
Release of tools that make it possible to deploy LLMs on a laptop; previously the only option was to send data to third-party APIs
Smaller models that perform admirably relative to the largest models, reducing the resource requirements to run them
Development of R packages that make it possible to integrate LLMs like any other data analysis tool
After installing Ollama, open Windows PowerShell and download one of the models. For example: ollama run llama3.2
In R, install the mall package: install.packages("mall")
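Putting the two setup steps together, a minimal sketch of the R side (the model name and seed match the example used later in this document):

# One-time install of the mall package
install.packages("mall")

library(mall)
library(dplyr)

# Tell mall to use the locally running Ollama model pulled above;
# a fixed seed makes results easier to reproduce
llm_use("ollama", "llama3.2", seed = 100, .silent = TRUE)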
Five example workflows
While there are various ways to incorporate an LLM into an R workflow, the following five use cases illustrate how users within the Agency might choose to leverage this technology:
Handling typos
Data classification
Pattern matching
Sentiment analysis
Text summarization
1. Handling typos - Import Data
To illustrate how an LLM can help deal with typos, let’s import a dummy data set of travellers coming into Canada.
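The original import code is not shown; one way to reconstruct this dummy data set (values taken from the output below) is:

library(tibble)

# Dummy traveller data, including several misspellings of "Canada"
traveller_data <- tribble(
  ~id,    ~citizen,
  362062, "France",
  937423, "Canada",
  945390, "India",
  973331, "cnada",
  504350, "Spain",
  886613, "canda",
  276570, "South Africa",
  640876, "Canadian",
  967784, "Cananda",
  695475, "Camada"
)

traveller_data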
# A tibble: 10 × 2
id citizen
<dbl> <chr>
1 362062 France
2 937423 Canada
3 945390 India
4 973331 cnada
5 504350 Spain
6 886613 canda
7 276570 South Africa
8 640876 Canadian
9 967784 Cananda
10 695475 Camada
1. Handling typos - Utilize LLM
Since there are various typos of “Canada” in the citizen field, let’s see if we can prompt the LLM to identify them all correctly.
llm_use("ollama", "llama3.2", seed =100, .silent =TRUE)my_prompt <-paste("Answer a question.","Return only the answer, no explanation","Acceptable answers are 'yes', 'no'","Answer this about the following text, Is this text related to Canada or a misspelling of Canada?:")traveller_final <- traveller_data %>%llm_custom(citizen, my_prompt)traveller_final
1. Handling typos - Utilize LLM
# A tibble: 10 × 3
id citizen .pred
<dbl> <chr> <chr>
1 362062 France No
2 937423 Canada Yes
3 945390 India No
4 973331 cnada Yes
5 504350 Spain No
6 886613 canda Yes
7 276570 South Africa No
8 640876 Canadian Yes
9 967784 Cananda Yes
10 695475 Camada Yes
1. Handling typos - Considerations
You might find the prediction accuracy to be very sensitive to the choice of seed and to the exact wording of the prompt
However, the main advantage of this approach is that it can handle misspellings of Canada that you might not encounter until later
Another option could be using fuzzy/probabilistic matching techniques
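For comparison, a minimal sketch of the fuzzy-matching alternative using base R’s approximate string matching via agrepl(); the distance threshold is an assumption and would need tuning for a real data set:

library(dplyr)

# Flag citizen values within a small edit distance of "Canada" (case-insensitive);
# note that agrepl() matches the pattern anywhere in the string, so "Canadian"
# is also flagged here
traveller_data %>%
  mutate(is_canada = agrepl("Canada", citizen,
                            max.distance = 2, ignore.case = TRUE))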
2. Data classification - Import Data
Let’s import a dummy global COVID-19 data set to showcase how an LLM can be used for classification
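The COVID-19 data set itself is not reproduced here; as a hypothetical sketch, assuming it has a country column and is named covid_data (placeholder names), mall’s llm_classify() could be used to assign each record to a continent:

library(mall)
library(dplyr)

llm_use("ollama", "llama3.2", seed = 100, .silent = TRUE)

# Hypothetical example: classify each country into one of five continents
covid_data %>%
  llm_classify(
    country,
    labels = c("Africa", "Americas", "Asia", "Europe", "Oceania"),
    pred_name = "continent"
  )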
2. Data classification - Considerations
The accuracy of this approach will decrease when dealing with lesser-known subject matter (e.g. classifying cities into health regions)
In this instance, it would be easy to find or create a new data set of continents and left join it with the COVID-19 data set, but not all use cases will be as simple
Homework: what would happen if the data set contained records with Aragorn and Gondor for countries? What about Wakanda?
3. Pattern matching - Import Data
For this example, we will import a dummy data set on quarantine
[1] "I walked through the park on a crisp autumn morning, enjoying the scenery near L4P 1A8, which was particularly beautiful today"
[2] "The smell of fresh coffee filled the air as I sipped my latte V3M 4B9 and enjoyed the view from the café's patio."
[3] "After a long hike, I sat down to rest on a bench overlooking L1B 4T1 in Banff, which was breathtakingly stunning"
[4] "I browsed through the bookstore's shelves in search of N2J 2X8 novels near Victoria, but couldn't find anything that caught my eye."
[5] "The sound of waves crashing against A1C 5V5 shore was soothing as I walked along the beach in St. John's, feeling very relaxed."
3. Pattern matching - Utilize LLM
Let’s use the LLM to pattern match and extract a specific piece of information; in this case, the postal code
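A minimal sketch of this step using mall’s llm_extract(), assuming the sentences above are stored in a character vector named sentences (a placeholder name) and that a single "postal code" label describes the entity to pull out:

library(mall)
library(dplyr)
library(tibble)

llm_use("ollama", "llama3.2", seed = 100, .silent = TRUE)

# Wrap the sentences in a data frame, then ask the model to extract
# the postal code mentioned in each one
tibble(text = sentences) %>%
  llm_extract(text, labels = "postal code", pred_name = "postal_code")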
[1] "This app was advertised on the news as a way to save time upon arrival, as you would be able to use an \"express line\". Although the app itself was easy to use, I saved zero time when I arrived at YVR as I had to line up and wait with everybody else to see one of the 2 CBSA agents available. I didn't see any benefit to using it, so not sure I'll use it again."
[2] "ArriveCan is very useful, coz you declare ahead before you even arrive in Canada, so it speeds up the process once you are there. :)"
[3] "The app hangs upon saving traveller profiles and you cannot cancel or go back. Need to restart the app and try again. Took 2, 4, 4, and 13 tries (gave up on the last user entry after trying manually, scanning, and adding midway through the declaration process). If other people are experiencing the same issues with adding their profiles, this should be a priority to fix in the app."
[4] "Extremely easy to complete. This is a great way to speed the process of transiting Grigg the Canadian airports."
[5] "terrible app. it fails even simple things. camera didn't focus properly, camera didn't read passports well, app should be able to read chips like Australia and other countries, saving a traveller hung at 12% and had to be restarted losing all travellers, even after registering all 4 I was blocked from a customs déclaration for all 4 travellers saying I had duplicate birthdates (there were none). I'm on a modern android: pixel 6 pro. what a total waste of 59.5 million."
[6] "This is simple and convenient. I think completing this online saves time."
[7] "Extremely easy to complete. This is a great way to speed the process of transiting Grigg the Canadian airports."
4. Sentiment analysis - Utilize LLM
It could be helpful to quickly get a sense of the sentiment about the app as new updates are released (see the sketch after the points below)
Compared to traditional approaches, LLMs have a better grasp of the overall context, including subtleties such as sarcasm, irony, satire, slang, etc.
Less compute-intensive approaches (e.g. through the use of the tidytext package) rely on pre-determined lexicons and calculate the sentiment of the text as the sum of the sentiment scores of its individual words.
Generally speaking, accuracy is improved by better machines and larger models
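A minimal sketch of this step using mall’s llm_sentiment(), assuming the reviews above are stored in a review column of a data frame named app_reviews (both placeholder names):

library(mall)
library(dplyr)

llm_use("ollama", "llama3.2", seed = 100, .silent = TRUE)

# Label each review as positive, negative, or neutral
app_reviews %>%
  llm_sentiment(review, options = c("positive", "negative", "neutral"))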
5. Text summarization - Import Data
For this exercise, let’s import a data set containing the opening messages/statements from all of the annual CPHO reports under Dr. Tam’s tenure
# A tibble: 8 × 2
year text
<dbl> <chr>
1 2017 "Without being aware of it, our neighbourhoods and how they are built i…
2 2018 "I am pleased to present my annual report, which is a snapshot of the h…
3 2019 "By and large, we are a healthy nation. We can be proud of Canada's hea…
4 2020 "The COVID-19 pandemic is having a profound impact on the health, socia…
5 2021 "The COVID-19 pandemic represents the biggest public health crisis that…
6 2022 "Over the past 2 and a half years, we have been challenged by the COVID…
7 2023 "In recent years, our communities have faced monumental challenges, fro…
8 2024 "This year, the global community celebrates 50 years of progress since …
5. Text summarization - Utilize LLM
It would be helpful to get a quick summary of the messages for each year
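The exact call used to produce the summaries below is not shown; a minimal sketch using mall’s llm_summarize(), assuming the data set above is named cpho_data (a placeholder) and allowing roughly 25 words per summary (an assumption based on the output length):

library(mall)
library(dplyr)

llm_use("ollama", "llama3.2", seed = 100, .silent = TRUE)

# Summarize the 'text' column (one opening message per year) in ~25 words each
cpho_data %>%
  llm_summarize(text, max_words = 25)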
[1] "canada's chief public health officer reports that chronic diseases like diabetes are linked to unhealthy living environments and that designing healthy communities can help reduce rates of these diseases."
[2] "canada's health status is generally good, but persistent inequalities and social factors threaten public health, with major chronic diseases and mental health impacting many lives."
[3] "canada's health status is strong but vulnerable to trends like rising measles cases and opioid crisis; addressing stigma is crucial for improving health inequities and achieving optimal health for all."
[4] "canada's covid-19 pandemic has led to fundamental changes in daily life causing suffering, loss, job loss and isolation for many people especially seniors and essential workers."
[5] "canada's covid-19 response has shown remarkable achievements but also exposed cracks in its public health system which lacks resources and tools to protect all people living in canada."
[6] "canada's public health systems face a crucial test with the climate crisis, must adapt and respond collaboratively with other sectors, and prioritize efforts for immediate health benefits."
[7] "emergency response is becoming more complex and challenging; investing in social infrastructure is crucial for community resilience and health promotion can support emergency management efforts."
[8] "canada celebrates 50 years of progress since the launch of the expanded programme on immunization, with estimated 154 million lives saved worldwide as a result of vaccines."
5. Text summarization - Considerations
The LLM’s context awareness makes it a great tool for summarization
While this is a very compute-intensive task, there are no good alternatives for producing summaries programmatically and consistently without doing them manually
Homework: how will the LLM summarize text that is semantically meaningless? This also applies to sentiment analysis.
Concluding thoughts
While the jury is still out on the overall utility of LLMs, there are already several proven use cases where their deployment can greatly supplement ongoing work, especially with respect to text data and qualitative analysis.
In situations where other technological alternatives exist, LLM-based solutions carry the downsides of hallucinations, vulnerability to adversarial attacks, failures that are sometimes inexplicable, and a sensitivity to prompt wording that is not easily dealt with systematically.
Implementation of Retrieval-Augmented Generation (RAG) methods can improve utility by giving the model access to domain/organizational knowledge, but at present this requires the use of Python