class: center, middle, inverse, title-slide .title[ # 11 - Text as Data ] .subtitle[ ## ml4econ, HUJI 2025 ] .author[ ### Itamar Caspi ] .date[ ### June 15, 2025 (updated: 2025-06-15) ] --- # Packages ``` r if (!require("pacman")) install.packages("pacman") pacman::p_load( quanteda, textdata, tidytext, tidyverse, knitr, xaringan, RefManageR ) ``` ``` ## package 'textdata' successfully unpacked and MD5 sums checked ## ## The downloaded binary packages are in ## C:\Users\internet\AppData\Local\Temp\RtmpwHla0e\downloaded_packages ``` The [`quanteda`](http://quanteda.io/) package is a Swiss army knife for handling text with R. More on that later. --- # Outline - [Representing Text as Data](#data) - [Text Regressions](#reg) - [Dictionary-based Methods](#dict) - [Topic Modeling](#lda) --- class: title-slide-section-blue, center, middle name: data # Representing Text as Data --- # Getting Started with Text Analysis - For an insightful introduction to text analysis, featuring numerous real-world examples from the social sciences, consider reading __"Text as Data"__ by Gentzkow, Kelly, and Taddy (JEL 2019). - Comprehensive lecture notes provided by [Maximilian Kasy](https://maxkasy.github.io/home/files/teaching/TopicsEconometrics2019/TextAsData-Slides.pdf) from Harvard and [Matt Taddy](https://github.com/TaddyLab/MBAcourse/blob/master/lectures/text.pdf) from Chicago and Amazon also serve as valuable resources. --- # Essential Terminology for Text Analysis Let's define some basic terms: - _Corpus_: This refers to a collection of `\(D\)` documents. These documents can be emails, tweets, speeches, articles, etc. - _Vocabulary_: This is a comprehensive list of unique words appearing in the corpus. - `\(\mathbf{X}\)`: This is a numerical array representation of text. Here, rows represent documents indexed as `\(i=1,\dots,D\)` and columns correspond to words indexed as `\(j=1,\dots,N\)`. - `\({Y}\)`: This is a vector of predicted outcomes (e.g., spam/ham, trump/not trump, etc.), with one outcome allocated per document. - `\({F}\)`: This stands for a low-dimensional representation of `\(\mathbf{X}\)`. --- # Document Term Matrix (DTM) In many applications, raw text transforms into a numerical array `\(\mathbf{X}\)`. Here, the elements of the array, `\({X}_{ij}\)`, represent counts of words or, more generally, _tokens_. We will discuss this in more detail later. --- # An Example: Distinguishing Spam from Ham Consider the task of spam detection: <img src="figs/spam.png" width="100%" style="display: block; margin: auto;" /> In this scenario: - Documents refer to individual emails. - Vocabulary comprises words that appear in _every single_ email. __NOTE__: Clearly, spam detection constitutes a supervised learning task where `\(Y_i=\{\text{spam, ham}\}\)`. --- # Converting a Corpus to a DTM Let's take a look at a corpus with two documents `\((D=2)\)`: ``` r txt <- c(doc1 = "Shipment of gold damaged in a fire.", doc2 = "Delivery of silver, arrived in 2 silver trucks") tokens_txt <- tokens(txt) # tokenize the text tokens_txt %>% quanteda::dfm() # transform tokes into a document term matrix ``` ``` ## Document-feature matrix of: 2 documents, 14 features (42.86% sparse) and 0 docvars. ## features ## docs shipment of gold damaged in a fire . delivery silver ## doc1 1 1 1 1 1 1 1 1 0 0 ## doc2 0 1 0 0 1 0 0 0 1 2 ## [ reached max_nfeat ... 4 more features ] ``` This example originates from the [quanteda's getting started examples](https://quanteda.io/index.html). --- # Do All Words Matter? ¯\\\_(ツ)_/¯ We can considerably reduce the dimension of `\(\mathbf{X}\)` by: - Excluding highly common ("stop words") and rare words. - Eliminating numbers and punctuation. - Implementing stemming, i.e., replacing words with their roots (use _economi_ instead of _economics, economists, economy_). - Converting all text to lower case. __WARNING:__ Be judicious when using text preprocessing steps. These steps should be tailored to the specific application. --- # Demonstrating Common Preprocessing Steps In the following example, we remove stop words, punctuation, numbers, and implement word stemming: ``` r txt <- c(doc1 = "Shipment of gold damaged in a fire.", doc2 = "Delivery of silver, arrived in 2 silver trucks") txt %>% tokens(remove_punct = TRUE, remove_numbers = TRUE) |> dfm() |> dfm_wordstem() |> dfm_remove(stopwords("english")) ``` ``` ## Document-feature matrix of: 2 documents, 8 features (50.00% sparse) and 0 docvars. ## features ## docs shipment gold damag fire deliveri silver arriv truck ## doc1 1 1 1 1 0 0 0 0 ## doc2 0 0 0 0 1 2 1 1 ``` Please note, the number of features has been reduced from 14 to 8. --- # Introduction to `\(n\)`-grams - In certain scenarios, multiword expressions such as "not guilty" or "labor market" might be significant. - We can define tokens (the fundamental units of text) as `\(n\)`-grams, which are sequences of `\(n\)` words from a given text sample. __NOTE__: Using `\(n\)`-grams with `\(n\)`>2 typically becomes impractical as the column dimension of `\(\mathbf{X}\)` grows exponentially with the order `\(n\)`. --- # DTM with Bigrams In this example, our sample text includes just two "documents". Here, we define tokens as _bigrams_ (sequences of two words): ``` r txt %>% tokens(remove_punct = TRUE, remove_numbers = TRUE) %>% tokens_ngrams() %>% dfm() ``` ``` ## Document-feature matrix of: 2 documents, 12 features (50.00% sparse) and 0 docvars. ## features ## docs shipment_of of_gold gold_damaged damaged_in in_a a_fire delivery_of of_silver silver_arrived ## doc1 1 1 1 1 1 1 0 0 0 ## doc2 0 0 0 0 0 0 1 1 1 ## features ## docs arrived_in ## doc1 0 ## doc2 1 ## [ reached max_nfeat ... 2 more features ] ``` --- # The Text Mining Playbook for Social Sciences Follow these steps for effective text mining: 1. Collect text and create a corpus. 2. Represent the corpus as a DTM `\(\mathbf{X}\)`. 3. Next, choose one of the following steps: - Employ `\(\mathbf{X}\)` to predict an outcome `\(Y\)` using high-dimensional methods (e.g., lasso, Ridge, etc.). In some scenarios, proceed with `\(\hat{Y}\)` for subsequent analysis. - Use dimensionality reduction techniques (like dictionary, PCA, LDA, etc.) on `\(\mathbf{X}\)` and proceed with the resulting `\(\mathbf{F}\)` for further analysis. > "_Text information is usually best as part of a larger system. Use text data to fill in the cracks around what you know. Don’t ignore good variables with stronger signal than text!_" (Matt Taddy) --- class: title-slide-section-blue, center, middle name: reg # Text Regression --- # Familiar Territory: High Dimensionality Problem Our aim is to predict a certain `\(Y\)` using `\(\mathbf{X}\)`. Evidently, dealing with text data introduces the high-dimensionality issue, where `\(\mathbf{X}\)` has `\(M \times N\)` elements. Traditional methods like OLS fall short in this case `\(\Rightarrow\)` thus the need for machine learning approaches. Penalized linear/non-linear regression methods (like Lasso, Ridge, etc.) are typically suitable. Other methods such as random forest may also work. __EXAMPLE__: Consider Lasso text regression `glmnet(Y, X)` where: `$$\hat{\beta}=\underset{\beta\in \mathbb{R}^N}{\operatorname{argmin}} \sum_{i=1}^N\left(Y_{i}-X_{i} \beta\right)^{2}+\lambda \lVert\boldsymbol{\beta}\rVert_1$$` This method easily extends to binary / categorical `\(Y\)`, e.g., `glmnet(X, Y, family = "binomial")` --- # Practical Guidelines for Using Penalized Text Regression - DTM entries usually count the number of times word `\(i\)` appears in document `\(d\)`, which provides an "intuitive" interpretation for regression coefficients. - Depending on the application, different transformations for `\(\mathbf{X}\)` might be more suitable, such as: - Normalizing each row by document length. - Using a binary inclusion dummy instead of count. - However, refrain from attributing a causal interpretation to the Lasso's coefficients (remember the **irrepresentability** condition). --- class: title-slide-section-blue, center, middle name: dict # Dictionary-based Methods --- # Dimensionality Reduction Using Dictionaries - Dictionary-based methods offer a low-dimensional representation of high-dimensional text data. - This is, by far, the most frequently employed method in social science literature that utilizes text (Gentzkow et al., forthcoming). - Essentially, consider `\(F\)` as an unobserved characteristic of the text that we're trying to estimate. Dictionary-based methods provide a mapping from `\(\mathbf{X}\)` onto a lower-dimensional `\(F\)`: `$$g: \mathbf{X}\rightarrow F$$` --- # Example: Sentiment Analysis - A common example of dictionary-based methods is sentiment analysis. - The latent factor we aim to estimate is the writer's attitude towards the discussed topic. - The prevalent approach relies on predefined dictionaries that classify words according to predetermined sentiment classes, such as "positive", "negative", and "neutral". - The sentiment _score_ of each document is typically a function of the relative frequencies of positive, negative, neutral, etc., words. __REMARK__: Sentiment analysis can also be supervised. For instance, available labeled movie reviews (rated 1-5 stars) can be used to train a model, and its predictions can then be used to classify unlabeled reviews. --- # Example: Loughran and McDonald Financial Sentiment Dictionary Below is a random list of words from the Loughran and McDonald (2011) financial sentiment dictionary, which includes positive, negative, litigious, uncertain, and constraining sentiments: ``` r library(tidytext) sample_n(get_sentiments("loughran"),8) ``` ``` ## Do you want to download: ## Name: Loughran-McDonald Sentiment lexicon ## URL: https://sraf.nd.edu/textual-analysis/resources/ ## License: License required for commercial use. Please contact tloughra@nd.edu. ## Size: 6.7 MB (cleaned 142 KB) ## Download mechanism: https ## ## 1: Yes ## 2: No ## ## Enter an item from the menu, or 0 to exit ``` ``` ## # A tibble: 8 × 2 ## word sentiment ## <chr> <chr> ## 1 predeceased litigious ## 2 demolish negative ## 3 legislators litigious ## 4 contingents uncertainty ## 5 disadvantaged negative ## 6 antitrust litigious ## 7 punishment negative ## 8 noninfringement litigious ``` --- # Application: Bank of Israel Communication <img src="figs/box.png" width="70%" style="display: block; margin: auto;" /> Source: [Benchimol and Caspi (2019)](https://www.boi.org.il/en/NewsAndPublications/PressReleases/Documents/Measuring%20Communication%20Quality%20in%20the%20Interest%20Rate%20Announcements.pdf) --- class: title-slide-section-blue, center, middle name: lda # Topic Modeling --- # Topic Models - Topic models enhance unsupervised learning methods for text data. - They classify documents and words into latent topics, often serving as a precursor to more conventional empirical methods. - The cornerstone of topic modeling is the Latent Dirichlet Allocation model (Blei, Ng, and Jordan, 2003), commonly referred to as LDA. --- # Intuition Behind LDA <img src="figs/intuition0.png" width="55%" style="display: block; margin: auto;" /> --- # The Intuition Behind LDA <img src="figs/intuition1.png" width="65%" style="display: block; margin: auto;" /> - A topic is a distribution across _all_ the words within a _fixed_ vocabulary. - A word can have non-zero probabilities in multiple topics (e.g., "bank"). - Each document is a mixture of different topics. - Each word is selected from one of these topics. --- # Intuition Behind LDA <img src="figs/intuition2.png" width="65%" style="display: block; margin: auto;" /> __QUESTION:__ How realistic is the LDA setup? Does it matter? What's our goal here anyway? --- # Notation - A _vocabulary_ comprises of words represented by the vector `\(\{1,\dots,V\}\)`. - Each _word_ is represented by a unit vector `\(\boldsymbol{\delta}_v=(0,\dots,v,\dots,0)'\)`. - A _document_ is a sequence of `\(N\)` words denoted by `\(\mathbf{w}=(w_1,\dots,w_N)\)`. - A _corpus_ is a collection of `\(M\)` documents denoted by `\(\mathcal{D}=(\mathbf{w}_1,\dots,\mathbf{w}_M)\)`. --- # Prerequisite: The Beta Distribution .pull-left[ The probability density function (PDF) for the Beta distribution, denoted as `\(B(\alpha,\beta)\)`, is given by: `$$p(\theta|\alpha,\beta)\propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$$` This function holds for `\(\theta\in[0,1]\)` and `\(\alpha,\beta>0\)`. Due to its properties, the Beta distribution is useful as a prior for probabilities. ] .pull-right[ <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/Beta_distribution_pdf.svg/325px-Beta_distribution_pdf.svg.png" width="100%" style="display: block; margin: auto;" /> ] --- # The Dirichlet Distribution The Dirichlet distribution, denoted as `\(\text{Dir}(\boldsymbol{\alpha})\)`, is a multivariate generalization of the Beta distribution. Let `\(\mathbf{\theta}=(\theta_1, \theta_2,\dots,\theta_K)\sim\text{Dir}(\boldsymbol{\alpha})\)`. The probability density function (PDF) for a `\(K\)`-dimensional Dirichlet distribution is `$$p(\boldsymbol{\theta}|\boldsymbol{\alpha})\propto \prod_{i=1}^{K} \theta_{i}^{\alpha_{i}-1}$$` Here, `\(K\geq2\)` is the number of categories, `\(\alpha_i>0\)` and `\(\theta_{i}\in(0,1)\)` for all `\(i\)` and `\(\sum_{i=1}^{K} \theta_{i}=1\)`. __Remark:__ The parameter `\(\boldsymbol{\alpha}\)` controls the sparsity of `\(\boldsymbol{\theta}\)`. __Bottom Line:__ Vectors drawn from a Dirichlet distribution represent probabilities. --- # Visualizing the Dirichlet Distribution .pull-left[ __On the right:__ The change in the density function `\((K = 3)\)` as the vector `\(\boldsymbol{\alpha}\)` changes from `\(\boldsymbol{\alpha} = (0.3, 0.3, 0.3)\)` to `\((2.0, 2.0, 2.0)\)`, while keeping `\(\alpha_1=\alpha_2=\alpha_3\)`. __Remark:__ Placing `\(\boldsymbol{\alpha}=(1,1,1)\)` results in a uniform distribution over the simplex. ] .pull-right[ <p><a href="https://commons.wikimedia.org/wiki/File:LogDirichletDensity-alpha_0.3_to_alpha_2.0.gif#/media/File:LogDirichletDensity-alpha_0.3_to_alpha_2.0.gif"><img src="https://upload.wikimedia.org/wikipedia/commons/5/54/LogDirichletDensity-alpha_0.3_to_alpha_2.0.gif" alt="LogDirichletDensity-alpha 0.3 to alpha 2.0.gif" height="480" width="480"></a><br>By <a href="https://en.wikipedia.org/wiki/en:File:LogDirichletDensity-alpha_0.1_to_alpha_1.9.gif" class="extiw" title="w:en:File:LogDirichletDensity-alpha 0.1 to alpha 1.9.gif">Initial version</a> by <a href="//commons.wikimedia.org/w/index.php?title=Panos_Ipeirotis&action=edit&redlink=1" class="new" title="Panos Ipeirotis (page does not exist)">Panos Ipeirotis</a>, later modified by <a href="//commons.wikimedia.org/w/index.php?title=User:Love_Sun_and_Dreams&action=edit&redlink=1" class="new" title="User:Love Sun and Dreams (page does not exist)">Love Sun and Dreams</a> - <a class="external autonumber" href="http://en.wikipedia.org/wiki/File:LogDirichletDensity-alpha_0.3_to_alpha_2.0.gif">[1]</a>, <a href="https://creativecommons.org/licenses/by/3.0" title="Creative Commons Attribution 3.0">CC BY 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=10073606">Link</a></p> ] --- # The Data Generating Process Behind LDA __Assumption:__ The number of topics `\(K\)` and the size of the vocabulary `\(V\)` are fixed. The Data Generating Process (DGP): For each document `\(d=1,\dots,\mathcal{D}\)`: 1. Choose topic proportions `\(\theta_d\sim\text{Dir}(\boldsymbol{\alpha})\)`. 2. For each word `\(n=1,\dots,N\)`: 2.1. Choose a topic assignment `\(Z_{dn}\sim\text{Mult}(\theta_d)\)`. 2.2. Choose a word `\(W_{dn}\sim\text{Mult}(\beta_{z_{dn}})\)`. __Remark:__ Note the "factor model" aspects of LDA, where topics act as factors and word probabilities act as loadings, both affecting the probability of selecting a word. --- # Aside: Plate Notation <img src="figs/plates.png" width="45%" style="display: block; margin: auto;" /> - Each _node_ represents a random variable. - _Shaded_ nodes indicate observables. - _Edges_ represent dependencies. - _Plates_ indicate replicated structures. The depicted graph corresponds to the following expression: `$$p\left(y, x_{1}, \ldots, x_{N}\right)=p(y) \prod_{n=1}^{N} p\left(x_{n} | y\right)$$` --- # LDA in plate notation <img src="figs/LDA.png" width="70%" style="display: block; margin: auto;" /> _Source_: [http://videolectures.net/mlss09uk_blei_tm/#](http://videolectures.net/mlss09uk_blei_tm/#). --- # Aside: Conjugate Priors The Dirichlet distribution serves as a conjugate prior for the Multinomial distribution. Let `\(n(Z_{i})\)` denote the count of topic `\(i\)`. `$$\boldsymbol{\theta}|Z_{1,\dots,N}\sim \text{Dir}(\boldsymbol{\alpha}+n(Z_{1,\dots,N}))$$` In other words, as the number of times we observe topic `\(i\)` increases, our posterior distribution becomes more concentrated around the `\(i^{\text{th}}\)` component of `\(\boldsymbol{\theta}\)`. --- # Extension #1: Correlated Topic Models (Lafferty and Blei, 2005) - LDA assumes that topics independently co-occur in documents. - However, this assumption is clearly incorrect. - For instance, a document about _economics_ is more likely to also discuss _politics_ than it is to talk about _cooking_. - Lafferty and Blei address this issue by relaxing the independence assumption and drawing topic proportions from a logistic normal distribution, allowing for correlations between topic proportions: <img src="figs/corrLDA.png" width="50%" style="display: block; margin: auto;" /> Here, `\(\mu\)` and `\(\Sigma\)` represent priors for the logistic normal distribution. --- # Extension #2: Dynamic LDA (Blei and Lafferty, 2006) .pull-left[ Dynamic LDA takes into account the ordering of documents and provides a more detailed posterior topical structure compared to traditional LDA. In dynamic topic modeling, a topic is a _sequence_ of distributions over words. Topics evolve systematically over time. Specifically, the parameter vector for topic `\(k\)` in period `\(t\)` evolves with Gaussian noise: `$$\beta_{t, k} | \beta_{t-1, k} \sim \mathcal{N}\left(\beta_{t-1, k}, \sigma^{2} I\right).$$` ] .pull-right[ <img src="figs/DynLDA.png" width="100%" style="display: block; margin: auto;" /> ] --- # Dynamic LDA: _Science_, 1881-1999 The posterior estimate of the frequency of several words as a function of year for two topics, "Theoretical Physics" and "Neuroscience": <img src="figs/DLDA_science.png" width="60%" style="display: block; margin: auto;" /> _Source: Blei and Lafferty (2006)._ --- # Extension #3: Supervised Topic Model (McAuliffe and Blei, 2008) An additional connection is made between `\(Z_{dn}\)` and an observable attribute `\(Y_d\)` in the Supervised Topic Model: <img src="figs/sLDA.png" width="50%" style="display: block; margin: auto;" /> _Source: McAuliffe and Blei (2008)._ --- # Structural Topic Models (Roberts, Stewart, and Tingley) About the Structural Topic Model (STM): > _"The Structural Topic Model is a general framework for topic modeling with document-level covariate information. The covariates can improve inference and qualitative interpretability and are allowed to affect topical prevalence, topical content or both."_ In STM, topics are drawn from the following logistic normal distribution, `$$\boldsymbol{\theta}_{d} | X_{d} \gamma, \Sigma \sim \text { LogisticNormal }\left(\mu=X_{d} \gamma, \Sigma\right)$$` where `\(X_{d}\)` is a vector of observed document covariates. __REMARK:__ In the case of no covariates, the STM reduces to a (fast) implementation of the Correlated Topic Model (Blei and Lafferty, 2007). --- # `stm`: R Package for Structural Topic Models ## Roberts, Stewart, and Tingley (JSS, 2014) About the `stm` R package: > _"The software package implements the estimation algorithms for the model and also includes tools for every stage of a standard workflow from reading in and processing raw text through making publication quality figures."_ The package is available on CRAN and can be installed using: ```r install.packages("stm") ``` To get started, see the [vignette](https://github.com/bstewart/stm/blob/master/inst/doc/stmVignette.pdf?raw=true) which includes several example analyses. --- # Applying Topic Models to Measure the Effect of Transparency Hansen, McMahon, and Prat (QJE 2017) examine the impact of increased transparency in the Federal Open Market Committee (FOMC) meetings on the level of debate. Here are some key points: - FOMC meetings have been recorded since the 1970s to create minutes. - Committee members were under the impression that these tapes were erased afterward. - In October 1993, Fed Chair Alan Greenspan discovered and revealed that the tapes had been transcribed and stored in archives all along before being erased. - Following Greenspan's revelation, the Fed agreed to publish all past transcripts and extended this policy to include all future transcripts with a five-year time lag. - This provides Hansen et al. with access to periods when policymakers believed their deliberations would and would not be made public. --- # Topic Modeling of FOMC Meeting Transcripts Data: - The dataset consists of 149 FOMC meeting transcripts during Alan Greenspan's tenure, spanning both pre-1993 and post-1993 periods. - The unit of observation is a member-meeting. - The outcomes of interest include: - The proportion of words devoted to `\(K\)` different topics. - The concentration of topic weights. - The frequency of data citation. --- # Estimation To estimate the topics, LDA (Latent Dirichlet Allocation) is employed. The LDA output is then utilized to construct the outcomes of interest. Difference-in-Differences regressions are applied to estimate the effects of the change in transparency on these outcomes. For instance, Hansen et al. estimate the following model: `$$y_{it}=\alpha_{i}+\gamma D(\text{Trans})_{t}+\lambda X_{t}+\varepsilon_{it}$$` Here: - `\(y_{it}\)` represents any of the communication measures for member `\(i\)` at time `\(t\)`. - `\(D(\text{Trans})\)` is an indicator for being in the transparency regime (1 after November 1993, 0 before). - `\(X_t\)` is a vector of macro controls for the meeting at time `\(t\)`. --- # Pro-cyclical Topics <img src="figs/hansen_pro.png" width="45%" style="display: block; margin: auto;" /> _Source_: Hansen, McMahon, and Prat (QJE 2017). --- # Counter-cyclical Topics <img src="figs/hansen_count.png" width="43%" style="display: block; margin: auto;" /> _Source_: Hansen, McMahon, and Prat (QJE 2017). --- # Increased Accountability: More References to Data <img src="figs/hansen_table_V.png" width="45%" style="display: block; margin: auto;" /> _Source_: Hansen, McMahon, and Prat (QJE 2017). --- # Increased Conformity: Increased Document Similarity <img src="figs/hansen_table_VI.png" width="40%" style="display: block; margin: auto;" /> _Source_: Hansen, McMahon, and Prat (QJE 2017). --- class: .title-slide-final, center, inverse, middle # `slides |> end()` [<i class="fa fa-github"></i> Source code](https://github.com/ml4econ/notes-spring2025/tree/master/11-text-mining) --- # Selected References - Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. _Journal of Machine Learning Research_, 3(Jan), 993-1022. - Blei, D. M., & Lafferty, J. D. (2006, June). Dynamic topic models. In _Proceedings of the 23rd international conference on Machine learning_ (pp. 113-120). ACM. - Gentzkow, M., Kelly, B.T., & Taddy, M. (2019). Text as data. _Journal of Economic Literature_ 57(3), 535-574. - Hansen, S., McMahon, M., & Prat, A. (2017). Transparency and Deliberation Within the FOMC: A Computational Linguistics Approach. _The Quarterly Journal of Economics_, 133(2), 801–870. - Lafferty, J. D., & Blei, D. M. (2006). Correlated topic models. In _Advances in neural information processing systems_ (pp. 147-154). - Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10‐Ks. _The Journal of Finance_, 66(1), 35-65. --- # Selected References - Roberts, M.E., Stewart, B.M., Tingley, D., Lucas, C., Leder‐Luis, J., Gadarian, S.K., Albertson, B., & Rand, D.G. (2014). Structural topic models for open‐ended survey responses. _American Journal of Political Science_, 58(4), 1064-1082. - Roberts, M.E., Stewart, B.M., & Tingley, D. (2014). stm: R package for structural topic models. _Journal of Statistical Software_, 10(2), 1-40.