Collocation and keyword analysis on India-related text of Natural History
1 Introduction¹
Pliny the Elder’s Natural History is widely recognized as the earliest encyclopedia in the world. Written around AD 77, the work is thematically divided into 37 books covering a diverse range of subjects, including astronomy, geography, zoology, botany, and medicine. In a pioneering attempt to catalog the knowledge of his era, Pliny drew on a wide range of Greek and Roman scholarship and interwove his own interpretations and comments into the narrative.
Natural History stands as a seminal work in the history of encyclopedias, serving as a gateway to understanding the knowledge and worldview of ancient Rome. Beyond its role as a compendium of diverse subjects, the text offers a glimpse into the exchange of knowledge, culture, and merchandise (Dodd 2021) between the Roman Empire and other regions (Gibson & Morello 2011).
As Beagon (2011) points out, Pliny, unlike his predecessors, showed a “terrestrial curiosity” in Natural History, emphasizing recognition of the physical, material world. Drawing on long-established topographical and ethnographic traditions, Pliny connects the volumes dedicated to geography (books 3-6) with broader natural elements, human activities, and cultural, historical, and societal contexts (Roller 2022), as exemplified in his descriptions of plants, tribal habitats, imperial expeditions, and trade products. This study therefore analyses the text of Natural History from a spatial perspective, since the geographical names mentioned in the text play a pivotal role in distributing information, knowledge, and discussion throughout the work.
Natural History was originally written in Latin. This study uses the English translation by Henry T. Riley (1816-1878) and John Bostock (1773-1846), first published in 1855. The translated text was obtained in digitized form from the TOPOSText project, which sourced it from the Perseus Project; it is licensed under a Creative Commons Attribution-Share-Alike 3.0 U.S. License.
In addition to the digitized text of Natural History, TOPOSText annotates the place names mentioned in the text with their corresponding coordinates. Since the annotations are available in HTML format, they are scraped and restructured into a dataset of geography-related text in Natural History with the Beautiful Soup library in Python². The dataset structure, the scraping procedure, and the corpus extension are explained further in the Corpus Description section (Section 3).
As a starting point, Figure 1 shows the 20 most frequently mentioned geographical names in the text.
Additionally, Figure 2 maps how often each place name is mentioned in the work. Each place is represented by a dot whose size reflects the number of times it occurs in the text: the larger the dot, the more frequently the place is mentioned.
An intriguing observation from these figures is that India is mentioned prominently in the work despite lying outside the Mediterranean.
Based on this observation, the study narrows its focus to the India-related text in Natural History to investigate how India is described in the work. By examining the themes associated with the Indian region through collocation and keyword analysis, the study aims to characterize Pliny the Elder’s portrayal of India, which may shed light on the spatial perception and imagination of representative writers and scholars of ancient Rome.
The following sections provide a detailed explanation of the methodology and corpus used, an interpretation and discussion of the results of the collocation and keyword analysis, and an overall conclusion in response to the research questions.
2 Methodology
To investigate the prominent themes in the India-related content of Natural History, the study employs collocation analysis and keyword analysis, two methods from corpus linguistics, and is conducted in the R environment (R Core Team 2022). The discussion is based on the mclm::assoc_scores objects returned by the mclm package (Speelman & Montes 2022). In addition, the tidyverse (Wickham 2023), leaflet (Cheng, Karambelkar & Xie 2023) and kableExtra (Zhu 2021) packages are used for data manipulation and visualization.
In the Data Analysis section (Section 4), both the collocation analysis and the keyword analysis are examined with three measures, namely absolute frequency, PMI, and signed \(G^2\), and the themes they suggest are discussed in combination with a close reading of the text.
Absolute frequency is the simple count of each type (distinct word) observed in the text. In collocation analysis, it shows how often a type collocates with the target word; in keyword analysis, it shows how often a type occurs in the target corpus. As a basic counting measure, absolute frequency provides a general view of the word distribution.
PMI (Pointwise Mutual Information) is an effect-size measure of association. It compares the observed probability of two outcomes co-occurring with the probability expected if they occurred independently of each other. It can be applied to associations between words as well as between words and context categories (Bestvater & Shah 2022). Each increase of one in the PMI value means that the ratio of observed to expected co-occurrence doubles. High PMI values indicate strong associations.
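The PMI definition can be made concrete with a small worked example. The sketch below is an illustration only and is not the mclm implementation: it assumes a 2x2 contingency table with cells named a, b, c, d, a base-2 logarithm (consistent with the doubling described above), and hypothetical counts that are not taken from the corpus.

```python
import math


def pmi(a: float, b: float, c: float, d: float) -> float:
    """Pointwise mutual information (base 2) from a 2x2 contingency table.

    Assumed cell layout (hypothetical naming):
      a: item in the target context      b: item in the reference context
      c: other items in the target       d: other items in the reference
    """
    n = a + b + c + d
    # Count of the item expected in the target context under independence
    expected_a = (a + b) * (a + c) / n
    return math.log2(a / expected_a)


# Hypothetical counts: the item occurs 20 times overall, 8 of them in a
# target context that holds 100 of the 1000 tokens.
example_score = pmi(a=8, b=12, c=92, d=888)
print(example_score)
```

With these counts the expected frequency in the target context is 2, so the observed frequency of 8 is four times the expectation and the PMI is 2.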
\(G^2\) is the log-likelihood ratio statistic, which measures the strength of the statistical evidence for an observation (Blume 2017). In this study, it indicates how significant the association between two factors is (either between two words or between a keyword and the target corpus). Specifically, the signed \(G^2\) score returned by mclm::assoc_scores shows the strength of evidence for attraction, i.e. co-occurrence. High signed \(G^2\) values suggest that the co-occurrence is unlikely to be due to chance.
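As a rough illustration of the measure, the following sketch computes a log-likelihood ratio from a 2x2 contingency table and attaches a sign. It is not the mclm implementation: the cell layout, the sign convention (positive when the item is observed at least as often as expected in the target context, negative otherwise), and the counts are assumptions made for this example.

```python
import math


def signed_g2(a: float, b: float, c: float, d: float) -> float:
    """Signed log-likelihood ratio (G^2) for a 2x2 contingency table.

    Assumed cell layout (hypothetical naming):
      a: item in target      b: item in reference
      c: others in target    d: others in reference
    """
    n = a + b + c + d
    observed = [a, b, c, d]
    # Expected counts under independence: row total * column total / n
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    # Cells with an observed count of zero contribute nothing to the sum
    g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    # Assumed sign convention: positive for attraction, negative for repulsion
    return g2 if a >= expected[0] else -g2


# Hypothetical counts, not taken from the corpus
print(signed_g2(8, 12, 92, 888))
```

A table whose observed counts equal the expected counts yields a score of zero, and the score grows as the observed counts move away from independence.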
Since a PMI score of two means that a co-occurrence is observed four times as often as expected under independence, and a signed \(G^2\) score above 3.84 indicates a significant association at the 95% confidence level (Montes 2022), the filter “PMI >= 2 & signed \(G^2\) >= 3.84” is applied to the mclm::assoc_scores objects returned for both methods, so that only strong and significant associations are retained for further discussion.
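The filter itself is a simple conjunction of two thresholds. The sketch below shows the idea on a handful of rows; it is an assumption-laden illustration rather than the R code used in the study. The first two rows reuse scores reported in Table 1, while the rows named example_weak and example_rare are hypothetical and exist only to show a rejected case for each threshold.

```python
PMI_THRESHOLD = 2.0
G2_THRESHOLD = 3.84

# (type, PMI, signed G^2); the last two rows are hypothetical
scores = [
    ("arabia", 3.06, 82.6),
    ("ganges", 4.09, 41.1),
    ("example_weak", 1.20, 5.0),   # significant but weak association
    ("example_rare", 3.00, 2.1),   # strong effect size but not significant
]

# Keep only rows that pass both thresholds
kept = [
    word
    for word, pmi_score, g2_score in scores
    if pmi_score >= PMI_THRESHOLD and g2_score >= G2_THRESHOLD
]
print(kept)
```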
2.1 Collocation Analysis
Collocation analysis is a technique used in corpus linguistics to identify words that frequently occur together in a given text. This technique is based on the idea that certain words tend to occur together more often than they would by chance.
This study examines the textual co-occurrence of the word “India” in Natural History. Lists of the words that co-occur most frequently with “India” are obtained and sorted by each measure. The potential themes of the India-related content in the work are then discussed on the basis of these textual collocates of “India”.
2.2 Keyword Analysis
Keyword analysis is a statistical method used in corpus linguistics to identify words that occur more frequently in a target corpus than in a reference corpus. Words that are more strongly associated with the target corpus than their frequency in the reference corpus would predict are considered the “keywords” of the target corpus.
In this study, the target corpus includes all paragraphs in Natural History that mention place names in the Indian region, and the reference corpus includes all other paragraphs in the work. Lists of words with a relatively higher frequency in the target corpus are obtained and sorted by the different measures. The “keywords” found in the India-related text provide more descriptive information about the topics and elements related to “India” in Pliny’s narrative.
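The comparison behind keyword analysis can be sketched as building, for each type in the target corpus, a 2x2 table of its frequency in the target and reference corpora against all remaining tokens. The function name, the cell order, and the toy token lists below are hypothetical; the study itself obtains these scores through mclm in R.

```python
from collections import Counter


def keyword_tables(target_tokens, reference_tokens):
    """Return {type: (a, b, c, d)} for every type in the target corpus.

    a: frequency in target          b: frequency in reference
    c: other tokens in target       d: other tokens in reference
    """
    target_freq = Counter(target_tokens)
    reference_freq = Counter(reference_tokens)
    n_target = len(target_tokens)
    n_reference = len(reference_tokens)
    tables = {}
    for word, a in target_freq.items():
        b = reference_freq[word]  # Counter returns 0 for unseen types
        tables[word] = (a, b, n_target - a, n_reference - b)
    return tables


# Toy data, not taken from the corpus
target = "the ganges and the indus".split()
reference = "the tiber and the nile and the po".split()
tables = keyword_tables(target, reference)
print(tables["ganges"])
```

Each resulting table can then be passed to an association measure such as PMI or signed \(G^2\).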
By combining the two methods and integrating them with a close reading of the text, this study aims to gain insights into how “India” is discussed and represented in Natural History.
3 Corpus Description
As mentioned, an English translation by Henry T. Riley (1816-1878) and John Bostock (1773-1846) of the original Latin text of Natural History is employed in the study.
The digitized text is available on the TOPOSText website as paragraphs with geographical annotations, in two parts (books 1-11 and books 12-37), both in HTML format. By extracting all elements of the “place” class from the HTML files, the study collects, for every annotated place name in the work, its URI and coordinates, the book, chapter, and paragraph number where it is mentioned, and the corresponding textual content. These records are structured as a dataset of geography-related text in Natural History, which is used for an initial exploration of the prominent places mentioned in the text.
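The extraction of “place” elements can be sketched as follows. This is not the scraping code used in the study (which relies on Beautiful Soup); it swaps in Python's standard-library html.parser so that the example has no dependencies. Apart from the class name “place”, which is mentioned above, the attribute names about, lat and long and the sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser


class PlaceParser(HTMLParser):
    """Collect elements whose class attribute contains "place"."""

    def __init__(self):
        super().__init__()
        self.places = []
        self._current = None  # record being filled while inside a place tag

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "place" in (attributes.get("class") or "").split():
            # Attribute names below are hypothetical
            self._current = {
                "uri": attributes.get("about"),
                "lat": attributes.get("lat"),
                "lon": attributes.get("long"),
                "name": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["name"] += data

    def handle_endtag(self, tag):
        # Simplification: assumes place elements contain no nested tags
        if self._current is not None:
            self.places.append(self._current)
            self._current = None


sample = (
    '<p>the river <a class="place" about="https://example.org/place/1" '
    'lat="25.0" long="85.0">Ganges</a> flows</p>'
)
parser = PlaceParser()
parser.feed(sample)
print(parser.places)
```

A fuller version would also record the book, chapter, and paragraph number of the enclosing paragraph, which depends on how the source HTML marks them.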
The initial observation shown in Figure 1 and Figure 2 leads the study to focus on the India-related text. With reference to the Barrington Atlas of the Greek and Roman World (Talbert 2000a; b), the approximate geographical coordinates defining the Indian region at the time of Natural History are as follows³:
Latitude: 5-35 degrees North
Longitude: 65-95 degrees East
The records in the geography-related text dataset whose coordinates fall within this range are extracted as the India-related text dataset.
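The selection amounts to a bounding-box test on each record. The sketch below uses the latitude and longitude limits given above; the record layout, the field names, and the two sample records are hypothetical, and treating the limits as inclusive is an assumption.

```python
LAT_MIN, LAT_MAX = 5.0, 35.0    # degrees North
LON_MIN, LON_MAX = 65.0, 95.0   # degrees East


def in_indian_region(lat: float, lon: float) -> bool:
    """True if the coordinates fall inside the bounding box (inclusive)."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


# Hypothetical records; names and coordinates are for illustration only
records = [
    {"place": "place_inside", "lat": 25.6, "lon": 85.1},
    {"place": "place_outside", "lat": 41.9, "lon": 12.5},
]
india_records = [r for r in records if in_indian_region(r["lat"], r["lon"])]
print([r["place"] for r in india_records])
```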
Each distinct paragraph in the India-related text dataset is stored as a separate TXT file in a corpus folder, giving 146 files in total as the target corpus. The file names encode the book, chapter, and paragraph number of the text. The target corpus contains 37489 tokens and 6003 types, a type-token ratio of 0.16.
The full text of the work, separated into paragraphs, is also scraped and stored in TXT format as a general corpus. This general corpus of 3493 files contains 710117 tokens and 31380 types, a type-token ratio of 0.04.
The remaining 3347 files, i.e. the general corpus without the target corpus, serve as the reference corpus, which contains 672628 tokens and 30294 types, a type-token ratio of 0.05.
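The reported figures can be checked with a few lines of arithmetic. The sketch below recomputes the type-token ratios (types divided by tokens, rounded to two decimals) from the counts stated above and confirms that the reference corpus is the general corpus minus the target corpus; the dictionary layout is only a convenience for this example.

```python
# name: (files, tokens, types), as reported in the text
corpora = {
    "target": (146, 37489, 6003),
    "general": (3493, 710117, 31380),
    "reference": (3347, 672628, 30294),
}

# Type-token ratio = number of types / number of tokens
ttr = {name: round(types / tokens, 2) for name, (_, tokens, types) in corpora.items()}
print(ttr)

# The reference corpus is what remains after removing the target corpus
files_left = corpora["general"][0] - corpora["target"][0]
tokens_left = corpora["general"][1] - corpora["target"][1]
print(files_left, tokens_left)
```

Type counts are not additive in the same way, because a type that occurs in both sub-corpora is counted once in the general corpus.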
The general corpus is used for the collocation analysis, and the target and reference corpora are used for the keyword analysis.
During extraction of the corpus files, the texts are converted to lower case. In addition, the closing quotation mark “’”, which ranked high in the initial frequency lists while carrying little information, is replaced by an empty string so that the returned lists are more informative.
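The two normalization steps can be sketched in a couple of lines. The function name and the sample string are hypothetical, and the sketch assumes that the removed character is the right single quotation mark (U+2019).

```python
def normalise(text: str) -> str:
    """Lower-case the text and delete the closing quotation mark (U+2019)."""
    return text.lower().replace("\u2019", "")


print(normalise("The \u2018Smaragdus\u2019 of India"))
```

Because U+2019 also serves as the apostrophe in typeset text, this replacement removes apostrophes inside words as well, which is worth keeping in mind when reading the frequency lists.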
4 Data Analysis
4.1 Textual collocates of “India” in Natural History
Collocation analysis traces words or phrases that frequently occur in close proximity to each other within a given context. Unlike surface collocation, which considers a collocation span, i.e. the specific words to the left and right of the target word (Evert 2009), textual collocation takes the whole textual unit around the target word as the span. With “India” set as the target word and all paragraphs of Natural History as the corpus, the mclm::text_cooc() function returns the frequency of the words appearing in the context of “India”. As mentioned in the Methodology section (Section 2), the returned results are filtered with the statistical criteria to obtain a list of words whose collocation with “India” is both frequent and supported by strong evidence.
The prominent textual collocates of “India” identified with this approach suggest the potential themes about the Indian region in the narrative of Pliny.
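The idea of textual co-occurrence can be sketched as follows: each paragraph is one textual unit, units containing the node word form the target context, and all other units form the reference context. This is an illustration of the general technique, not a reproduction of mclm::text_cooc(); in particular, counting each type once per text and splitting on whitespace are assumptions, and the toy texts are not taken from the corpus.

```python
from collections import Counter


def text_cooccurrence(texts, node):
    """Count, per type, the number of texts with and without the node word."""
    target = Counter()      # texts that contain the node
    reference = Counter()   # texts that do not contain the node
    n_target = n_reference = 0
    for text in texts:
        types = set(text.lower().split())
        if node in types:
            n_target += 1
            target.update(types - {node})  # the node is not its own collocate
        else:
            n_reference += 1
            reference.update(types)
    return target, reference, n_target, n_reference


# Toy paragraphs, not taken from the corpus
toy_texts = [
    "india sends pearls and gems",
    "arabia sends gum",
    "pearls from india",
    "rome buys gum",
]
target, reference, n_target, n_reference = text_cooccurrence(toy_texts, "india")
print(target.most_common(1), n_target, n_reference)
```

The two counters, together with the numbers of target and reference texts, provide the cells of the contingency table from which association scores are computed.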
Table 1, Table 2 and Table 3 show the top 20 textual collocates of the word “India”, sorted in descending order by absolute frequency, signed \(G^2\), and PMI score, respectively.

Table 1: Top textual collocates of “India”, sorted by absolute frequency

Type | Frequency | PMI | Signed $G^2$ |
---|---|---|---|
arabia | 28 | 3.06 | 82.6 |
indian | 23 | 3.40 | 78.9 |
hundred | 21 | 2.25 | 37.4 |
ethiopia | 15 | 3.06 | 42.7 |
alexander | 14 | 2.08 | 21.3 |
numerous | 13 | 2.15 | 20.9 |
elephants | 12 | 3.09 | 34.4 |
rock-crystal | 11 | 3.59 | 40.2 |
luxury | 10 | 2.13 | 15.6 |
closely | 10 | 2.15 | 15.8 |
gems | 10 | 2.98 | 26.9 |
gum | 9 | 2.07 | 13.3 |
valuable | 9 | 2.51 | 18.3 |
pearls | 9 | 2.95 | 23.8 |
indians | 9 | 3.54 | 32.0 |
ganges | 9 | 4.09 | 41.1 |
shadow | 9 | 2.41 | 17.2 |
bright | 9 | 2.09 | 13.6 |
thousand | 9 | 2.41 | 17.2 |
bearing | 8 | 2.05 | 11.6 |
indus | 8 | 3.44 | 27.0 |
juba | 8 | 2.40 | 15.1 |
resemblance | 8 | 2.13 | 12.4 |
transparent | 8 | 2.58 | 17.0 |
elephant | 8 | 2.78 | 19.2 |
resemble | 8 | 2.27 | 13.8 |
cappadocia | 8 | 2.87 | 20.2 |
deserts | 8 | 2.47 | 15.8 |
caspian | 8 | 3.02 | 21.9 |
blue | 8 | 2.62 | 17.4 |
gem | 8 | 3.66 | 30.0 |
sail | 8 | 2.70 | 18.2 |

Table 2: Top textual collocates of “India”, sorted by signed \(G^2\) score

Type | Frequency | PMI | Signed $G^2$ |
---|---|---|---|
arabia | 28.0 | 3.06 | 82.6 |
indian | 23.0 | 3.40 | 78.9 |
ethiopia | 15.0 | 3.06 | 42.7 |
ganges | 9.0 | 4.09 | 41.1 |
rock-crystal | 11.0 | 3.59 | 40.2 |
hundred | 21.0 | 2.25 | 37.4 |
elephants | 12.0 | 3.09 | 34.4 |
megasthenes | 5.5 | 4.69 | 33.6 |
indians | 9.0 | 3.54 | 32.0 |
gem | 8.0 | 3.66 | 30.0 |
lustre | 7.0 | 3.94 | 29.7 |
identical | 6.0 | 4.09 | 27.3 |
'smaragdus | 6.0 | 4.09 | 27.3 |
indus | 8.0 | 3.44 | 27.0 |
gems | 10.0 | 2.98 | 26.9 |
gemstone | 7.0 | 3.55 | 24.9 |
pearls | 9.0 | 2.95 | 23.8 |
onesicritus | 5.0 | 4.15 | 23.3 |
mart | 4.0 | 4.51 | 22.0 |
caspian | 8.0 | 3.02 | 21.9 |

Table 3: Top textual collocates of “India”, sorted by PMI score

Type | Frequency | PMI | Signed $G^2$ |
---|---|---|---|
megasthenes | 5.5 | 4.69 | 33.6 |
thorn-bush | 3.5 | 4.62 | 20.5 |
carnelian | 3.5 | 4.62 | 20.5 |
obsidian | 3.5 | 4.62 | 20.5 |
cophes | 3.5 | 4.62 | 20.5 |
merchandize | 3.5 | 4.62 | 20.5 |
mart | 4.0 | 4.51 | 22.0 |
expeditions | 3.0 | 4.41 | 15.7 |
patala | 3.0 | 4.41 | 15.7 |
arii | 3.0 | 4.41 | 15.7 |
ichthyophagi | 3.0 | 4.41 | 15.7 |
pursuits | 3.0 | 4.41 | 15.7 |
vermilion | 4.0 | 4.24 | 19.4 |
biggest | 4.0 | 4.24 | 19.4 |
onesicritus | 5.0 | 4.15 | 23.3 |
identical | 6.0 | 4.09 | 27.3 |
ganges | 9.0 | 4.09 | 41.1 |
engrave | 3.0 | 4.09 | 13.6 |
'smaragdus | 6.0 | 4.09 | 27.3 |
honey-coloured | 4.0 | 4.02 | 17.5 |
reflects | 4.0 | 4.02 | 17.5 |
The top 20 lists by absolute frequency and by signed \(G^2\) score share 11 of their 20 entries. As shown in Figure 3, signed \(G^2\) score and absolute frequency are positively correlated in the textual collocation analysis.
The high-ranking collocates of “India”, sorted by absolute frequency and signed \(G^2\) score, are categorized as follows. Words marked “A” appear only in the absolute frequency list and words marked “G” only in the signed \(G^2\) list; unmarked words appear in both. The overlapping words are considered to have the strongest evidence of frequently collocating with “India”.
Geographical connections: arabia, indian, ethiopia, ganges, indus(G)
Animal/Plant: elephants, gum(A), lustre(G)
Precious stone: stones, rock-crystal, gems, gem(G), gemstone(G), pearls, smaragdus(G)
Historical figure: alexander(A), megasthenes(G), onesicritus(G)
Merchandise exchange: mart(G)
Descriptive expression: hundred, thousand(A), whole(A), numerous(A), luxury(A), closely(A), valuable(A), bright(A), identical(G)
Others: shadow(A), year(A)
These categories indicate that the “India” narrative in Natural History covers three major dimensions. The first is geographical information, including India’s connections to other places (primarily Arabia and Ethiopia), the two major rivers of the region, the Ganges and the Indus, and the animals, plants, and precious stones originating there. The second is historical figures, some related to the history of conquest (alexander refers to Alexander the Great, whose campaign reached the Indian subcontinent) and some to the scholarship Pliny consulted for the work. The third is merchandise exchange, highlighted by the term mart and by descriptive words such as luxury and valuable.
In the mclm::text_cooc() function, the given text is separated into a target context (the text surrounding the target node) and a reference context (the text not surrounding the target node). As shown in Table 4, wherever a reference value is available, the top-ranking words by absolute frequency and signed \(G^2\) score have a normalized frequency in the target context that is at least four times higher than in the reference context, and more than 20 times higher for some words (for example, ganges and mart).
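
Table 4: Normalized frequency in the target and reference contexts for the top collocates by absolute frequency and signed \(G^2\) score

Type | Normalized frequency (target context) | Normalized frequency (reference context) |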
---|---|---|
arabia | 2276 | 198.81 |
indian | 1870 | 115.73 |
hundred | 1707 | 308.61 |
stones | 1545 | 344.21 |
ethiopia | 1220 | 106.83 |
elephants | 976 | 83.09 |
rock-crystal | 894 | 44.51 |
gems | 813 | 77.15 |
pearls | 732 | 71.22 |
indians | 732 | 38.58 |
ganges | 732 | 17.80 |
indus | 650 | 38.58 |
gem | 650 | 29.67 |
gemstone | 569 | 29.67 |
lustre | 569 | 17.80 |
identical | 488 | 11.87 |
'smaragdus | 488 | 11.87 |
onesicritus | 407 | 8.90 |
megasthenes | 407 | NA |
mart | 325 | 2.97 |
The top 20 list sorted by PMI score shows the same pattern: as presented in Table 5, every word with an available reference value has a higher normalized frequency in the target context than in the reference context.
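
Table 5: Normalized frequency in the target and reference contexts for the top collocates by PMI score

Type | Normalized frequency (target context) | Normalized frequency (reference context) |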
---|---|---|
ganges | 732 | 17.80 |
identical | 488 | 11.87 |
'smaragdus | 488 | 11.87 |
onesicritus | 407 | 8.90 |
megasthenes | 407 | NA |
honey-coloured | 325 | 8.90 |
reflects | 325 | 8.90 |
vermilion | 325 | 5.93 |
biggest | 325 | 5.93 |
mart | 325 | 2.97 |
expeditions | 244 | 2.97 |
thorn-bush | 244 | NA |
patala | 244 | 2.97 |
arii | 244 | 2.97 |
engrave | 244 | 5.93 |
carnelian | 244 | NA |
obsidian | 244 | NA |
cophes | 244 | NA |
ichthyophagi | 244 | 2.97 |
merchandize | 244 | NA |
pursuits | 244 | 2.97 |
Because PMI sorting identifies the words most strongly attracted to “India” in context, the listed words provide additional descriptive information about the India-related themes in the work. Most of them fit the categories set up for the absolute frequency and signed \(G^2\) results, while some form new categories related to the existing ones. The top-ranking words sorted by PMI are grouped as follows:
Geographical connections: cophes, patala, ganges
Animal/Plant: thorn-bush
Precious stone: carnelian, obsidian, ’smaragdus
Historical figure: megasthenes, onesicritus
Merchandise exchange: merchandize, mart
Descriptive expression: vermilion, biggest, identical, honey-coloured
Others: reflects
Activity (new category): expeditions, pursuits, engrave
Ethnic group (new category): arii, ichthyophagi
Besides adding more specific geographic connections and plant and gemstone names, the list contains descriptive expressions that pertain to the appearance of stones (such as vermilion and honey-coloured) and to special traits of natural creatures originating in India (such as biggest and identical). The words in the new categories relate to the existing ones. For example, the terms in “Activity” refer to Alexander the Great’s expedition to the Indian subcontinent (expeditions), the necessity for exchange of merchandise (pursuits), and the processing of gemstones (engrave). The terms in “Ethnic group” pertain to the human communities living in India and its surrounding regions. This additional information reinforces the conclusion that, when mentioning “India”, Pliny tended to focus on its geographical conditions and its trade relations with the Roman Empire.
The observations from the top ranking “India”-collocating word lists of the three measures highlight how the narrative of Natural History underscores India’s geographical interconnection and its important role in merchandise and cultural exchange.
5 Conclusion
In conclusion, this paper investigates how Pliny the Elder describes India in his work, Natural History, by examining the potential themes and topics related to India. The research question is motivated by the notable frequency of references to “India” throughout the text. The analysis employs methods of collocation and keyword analysis within a corpus linguistics framework.
The exploration of the textual collocates of the word “India” reveals the geographical connections, trade networks, historical figures, and scholarly references associated with India in Natural History. The collocates also emphasize India’s close relationship with neighboring regions such as Arabia and Ethiopia.
Moreover, the examination of keywords in the India-related text aligns with the observations from the collocation analysis and highlights stones and rivers as potential topics in the treatment of India in the work. These findings enhance the interpretation of the identified themes, such as geographical connectivity and merchandise trade, and also reflect Pliny’s general geographical concern in the historical context of ancient Rome.
As this paper serves as a preliminary exploration, it encourages further in-depth investigation of the text. A comparative analysis of collocates and keywords centered on regions neighboring India, and a closer examination of “stone” or “river” in context, informed by additional literature on the historical and cultural background, may offer greater insights and enrich the discussion.
References
Footnotes
1. The present paper is a pilot study for my master’s thesis, Mapping India in Pliny the Elder’s Natural History; the description of the research question is therefore connected with the content of that thesis.
2. The scraping code is uploaded to the same repository as the paper.
3. As indicated in the map-by-map directory, the range spans the territories of the “modern states of India (minus the Punjab), Bangladesh, Bhutan, Burma, Nepal, and Sri Lanka”.