Orientalism has declined over the last one hundred years

Oslo, 2023

Christopher Barrie

Orientalism


What does this scholarship do?

Spans:

  1. Humanities
  2. Social sciences
  3. Social/critical theory

What does this scholarship do?

  1. Exegesis (tafsīr?) of Said
  2. Postcolonial theory
  3. Literary criticism

What does this scholarship not do?

  1. Quantify trends in Orientalism over time (/test if Said was right!)

This despite:

  1. Rise of digital humanities
  2. Rise of quantitative techniques for analyzing semantic content of texts

What do I do?

  • Data:
    • Google Books corpus: English (fic. + non-fic.), French (fic. + non-fic.), English (fic. only)
    • Embeddings from the Hamilton et al. (2018) HistWords project, trained with the GloVe algorithm

What is a word embedding?

        Dimension 1   Dimension 2   Dimension 3
king        1.2           0.7          -0.3
queen       1.1           0.8          -0.2
man         1.0          -0.2           0.9
woman       0.9          -0.1           1.0

…but bigger than this

Component parts

Pointwise mutual information (PMI) is defined by the following formula:

\[ \text{PMI}(x, y) = \log\left(\frac{P(x,y)}{P(x) \cdot P(y)}\right) \]

Where:

  • \(P(x,y)\) is the probability of word \(x\) appearing within a six-word window of word \(y\)
  • \(P(x)\) is the probability of word \(x\) appearing in the whole corpus
  • \(P(y)\) is the probability of word \(y\) appearing in the whole corpus
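
As a rough illustration, a minimal R sketch of PMI computed from a dense co-occurrence count matrix (`fcm_counts` is a hypothetical object; the marginals here are approximated from the co-occurrence counts themselves rather than from raw corpus frequencies):

# PMI from a dense co-occurrence count matrix (rows/columns = vocabulary)
pmi_from_counts <- function(fcm_counts) {
  total <- sum(fcm_counts)
  p_xy <- fcm_counts / total          # joint probabilities P(x, y)
  p_x  <- rowSums(fcm_counts) / total # marginal P(x)
  p_y  <- colSums(fcm_counts) / total # marginal P(y)
  log(p_xy / outer(p_x, p_y))         # PMI = log(P(x,y) / (P(x) * P(y)))
}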

Windows and targets

Word embeddings with GloVe

After we’ve generated our term (feature) co-occurrence matrix

library(text2vec)

WINDOW_SIZE <- 6   # co-occurrence window used to build the FCM
DIM <- 300         # embedding dimensions
ITERS <- 100       # maximum training iterations

glove <- GlobalVectors$new(rank = DIM, 
                           x_max = 100,
                           learning_rate = 0.05)

# toks_fcm is the feature co-occurrence matrix generated earlier
wv_main <- glove$fit_transform(toks_fcm, n_iter = ITERS,
                               convergence_tol = 1e-3, 
                               n_threads = parallel::detectCores()) # use all available cores

wv_context <- glove$components
local_glove <- wv_main + t(wv_context) # final word vectors: main + context
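
For reference, a minimal sketch of how toks_fcm might be built with quanteda (assuming `corpus_texts` is a character vector of raw texts; not necessarily the exact preprocessing used here):

library(quanteda)

toks <- tokens_tolower(tokens(corpus_texts, remove_punct = TRUE))

toks_fcm <- fcm(toks,
                context = "window",
                window = WINDOW_SIZE, # 6, as above
                count = "weighted",   # counts down-weighted by distance, as in GloVe
                tri = FALSE)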

Back to Said…

We are working with these:

        Dimension 1   Dimension 2   Dimension 3
king        1.2           0.7          -0.3
queen       1.1           0.8          -0.2
man         1.0          -0.2           0.9
woman       0.9          -0.1           1.0

But we have these for every decade, 1800-2010, trained on the entirety of Google Books (tens of millions of published works).

So why are these useful?

  • Allow us to look at the relationship between words (toy example below)
    • “we shall know a word by the company it keeps” (Firth, 1957)
  • Allow us to understand the relationship between words over time
    • This has made possible a number of important articles excavating over-time trends in bias and prejudice…
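
A toy illustration using the example vectors from the table above; cosine similarity is the workhorse throughout:

cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

king  <- c(1.2,  0.7, -0.3)
queen <- c(1.1,  0.8, -0.2)
man   <- c(1.0, -0.2,  0.9)

cos_sim(king, queen) # ~0.99: "king" and "queen" keep similar company
cos_sim(king, man)   # ~0.41: less similar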


Estimation procedure

  • Same as in articles by Manzini et al. (2018) and Charlesworth et al. (2022)
    • Mean average cosine similarity (MAC), implemented in the sweater R package by Chung-hong Chan.

Estimation procedure

\[ \text{MAC} = \frac{1}{N} \sum_{i=1}^{N} \frac{\mathbf{A}_i \cdot \mathbf{B}_i}{||\mathbf{A}_i|| \cdot ||\mathbf{B}_i||} \]

Where:

  • \(\mathbf{A}_i\) and \(\mathbf{B}_i\) are the i-th pair of vectors from the two sets.

  • \(\mathbf{A}_i \cdot \mathbf{B}_i\) is the dot product of the vectors.

  • \(||\mathbf{A}_i||\) and \(||\mathbf{B}_i||\) are the magnitudes of the vectors.

  • \(N\) is the total number of vector pairs.
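
A minimal R sketch of the formula above, as a direct implementation rather than the sweater one (A and B are matrices whose i-th rows form the i-th vector pair):

cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

mac_manual <- function(A, B) {
  stopifnot(nrow(A) == nrow(B))
  sims <- vapply(seq_len(nrow(A)),
                 function(i) cos_sim(A[i, ], B[i, ]),
                 numeric(1))
  mean(sims) # average of the N pairwise cosine similarities
}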

Seed words

Select four dimensions based on secondary scholarship + close reading:

  1. Exoticism
  2. Irrationality
  3. Despotism
  4. Eroticism

Expand using gensim (word2vec trained on the Google News corpus) + manual classification; an R analogue is sketched below.
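
The expansion step itself used gensim in Python; for illustration, a rough R analogue of gensim's most_similar() using text2vec (`emb` is a hypothetical embedding matrix with words as row names):

library(text2vec)

nearest <- function(emb, word, n = 10) {
  sims <- sim2(emb, emb[word, , drop = FALSE], method = "cosine")
  head(sort(sims[, 1], decreasing = TRUE), n + 1)[-1] # drop the seed word itself
}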

For each set of words:

Calculate cosine similarity between the word “Arab” and each individual word in each dimension.

library(sweater)

S1 <- exotic_strange_words # seed words for the exoticism dimension
A1 <- c("arab")

# wordvecs.dat is a list of embedding matrices, one per decade
P_all <- data.frame(matrix(ncol = 0, nrow = length(S1)))
for (i in seq_along(wordvecs.dat)) {

  embedding <- as.matrix(wordvecs.dat[[i]])
  x <- mac(embedding, S1, A1) # mac() from sweater calculates MAC

  P <- as.data.frame(x$P)     # cosine similarity for each seed word
  colnames(P) <- decade_names[[i]]
  P_all <- cbind(P_all, P)
}

For each set of words:

library(dplyr)
library(tidyr)
library(tibble)

P_all <- P_all %>%
  rownames_to_column("word")

P_all_long <- P_all %>% gather(year, value, -c("word"))
P_all_long$year <- as.numeric(P_all_long$year)

# this replicates what `mac_es()` would return if called inside the loop above
P_all_mean <- P_all_long %>%
  group_by(year) %>%
  summarise(mean_sim = mean(value, na.rm = TRUE), # average across non-NA values
            word = "average")
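
A minimal sketch of how the decade trend could then be plotted (not necessarily the code behind the figures that follow):

library(ggplot2)

ggplot(P_all_mean, aes(x = year, y = mean_sim)) +
  geom_point() +
  geom_line() +
  labs(x = "Decade", y = "Mean cosine similarity with \"arab\"")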

Results


Interpretation

  • “I have already lost, Kingdom after Kingdom, province after province, the more beautiful half of the universe, and soon I will know of no place in which I can find a refuge for my dreams; but it is Egypt that I most regret having driven out of my imagination, now that I have sadly placed it in my memory.” (Nerval, cited in Said, 100)

  • “Unable to recognize ‘its’ Orient in the new Third World, Orientalism now faced a challenging and politically armed Orient.” (Said 1978, 104)

Future directions

  • Change point analyses
  • French corpus
  • External validity
  • Alternative dimensions (?)

Materials

https://github.com/cjbarrie/oslo_embeddings