How do social actors or entities connect to one another? On its face, the answer is obvious: Actors are directly tied to one another through social relations (friends with, talks to, gets advice from, and so on). But social scientists have long viewed social connectivity more broadly to include indirect ties through co-membership or affiliation (members of the same team, members of the same corporate board, co-authorship, and so on). As Cooley wrote, “A man may be regarded as the point of intersection of an indefinite number of circles representing social groups, having as many arcs passing through him as there are groups.”
When we discuss affiliation networks, we ask:
What does it mean to be affiliated through a literal group, activity, or event?
And we may also ask:
What does it mean to be connected through a generic group or an abstraction?
It is probably not controversial to claim that people who are in the exact same social group share some type of meaningful connection through their common membership. If I attend a book club with you, then we likely share numerous characteristics, at minimum an appreciation of similar books. However, if we attend book clubs in different cities, does generic book club membership provide a meaningful tie?
Scholars interested in duality and affiliation pursue these kinds of questions, and the questions extend to non-human entities as well: scientific papers, emotions, and other entities can be modeled as two-mode or affiliation networks.
Earlier work on duality (Breiger 1974) builds upon Cooley’s idea: People intersect through their associations, which define (in part) their individuality. The concept of duality recognizes that relations among groups imply relations among individuals, and vice versa.
As Breiger (1974:87) writes, “With respect to the membership network…persons who are actors in one picture (the P matrix) are with equal legitimacy viewed as connections in the dual picture (the G matrix), and conversely for groups.”
The resulting network after these transformations:

- Is always symmetric
- Has a diagonal that tells you how many groups a person belongs to (in the person matrix) or how many members a group has (in the group matrix)
We can see how these types of transformations work by turning to a toy example.
First, we should load the packages that we will need in this workshop. We will use bibliometrix for some two-mode science network analysis, igraph for network analysis and visualization, ggraph for visualization, udpipe for some language parsing, and dplyr, tidygraph, and ggforce for a few data organization and visualization tasks. We use tidytext and widyr for some text processing functions.
library(bibliometrix)
library(igraph)
library(ggraph)
library(udpipe)
library(dplyr)
library(tidygraph)
library(ggforce)
library(tidytext)
library(widyr)
Let’s begin with a person-to-group matrix. Each column is a group, each row is a person, and a cell equals 1 if the person in that row belongs to the group in that column. You can calculate how many groups two individuals share by comparing their rows: identify every column where both rows equal 1 and count them. This is the overlap. You can also find the total number of members in a group by summing its column, and the total number of groups an individual belongs to by summing their row.
Let’s make a matrix in R.
one <- c(0, 1, 1, 0, 0, 0)
two <- c(0, 0, 1, 1, 0, 0)
three <- c(0, 0, 0, 1, 1, 1)
four <- c(0, 0, 0, 1, 0, 1)
five <- c(1, 0, 0, 1, 0, 0)
dual <- cbind(one, two, three, four, five)
rownames(dual) <- c("A", "B", "C", "D", "E", "F")
print(dual)
## one two three four five
## A 0 0 0 0 1
## B 1 0 0 0 0
## C 1 1 0 0 0
## D 0 1 1 1 1
## E 0 0 1 0 0
## F 0 0 1 1 0
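To make the row comparison concrete, here is a quick check in base R (a minimal sketch using the toy matrix we just built):

# columns where both C's and D's rows equal 1, summed: their overlap
sum(dual["C", ] * dual["D", ])

## [1] 1

# group sizes come from column sums; per-person membership counts from row sums
colSums(dual)
rowSums(dual)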
Recall that the matrix is a simple and elegant way to manipulate networks. That is most obviously the case, perhaps, when thinking through duality.
From persons-to-groups to persons and groups.
If we multiply our person-to-group matrix by its transpose, we get the person-to-person matrix. If we multiply the transpose of our person-to-group matrix by the matrix itself, we get the group-to-group matrix.
The Person-Person Matrix:
person <- dual %*% t(dual)
print(person)
## A B C D E F
## A 1 0 0 1 0 0
## B 0 1 1 0 0 0
## C 0 1 2 1 0 0
## D 1 0 1 4 1 2
## E 0 0 0 1 1 1
## F 0 0 0 2 1 2
The Group-Group Matrix:
groups <- t(dual) %*% dual
print(groups)
## one two three four five
## one 2 1 0 0 0
## two 1 2 1 1 1
## three 0 1 3 2 1
## four 0 1 2 2 1
## five 0 1 1 1 2
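We can verify the claim about the diagonals directly:

# the diagonals recover memberships: groups per person and members per group
all(diag(person) == rowSums(dual))

## [1] TRUE

all(diag(groups) == colSums(dual))

## [1] TRUE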
The general class of affiliation networks is known as bipartite or two-mode networks. The matrix transformation described above, where we transform the network from two-mode to one-mode, is called a projection.
One-mode projections, while perhaps easier to interpret, obviously involve a substantial loss of information. In the toy example, for instance, the person matrix records that D and F share two groups, but not which groups they share.
While it remains more common, perhaps, to project two-mode networks, a growing body of work examines the tractability of analyzing two-mode networks in their original two-mode form.
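As a small illustration of keeping a network in its two-mode form, we can turn the toy matrix into a bipartite igraph object (a minimal sketch; the object name dual.2mode is ours):

dual.2mode <- graph_from_incidence_matrix(dual)

# type is FALSE for persons (rows) and TRUE for groups (columns)
V(dual.2mode)$type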
Science networks have been a major area where duality has been examined. We can think of all of the two-mode networks that structure science: people in labs, departments, and so on, or ideas in books, articles, and the like. From bibliographic records we can construct numerous two-mode networks and project them to one-mode networks if we so choose. We can plot university-by-article or university-by-journal networks. We can plot country-by-keyword networks or typical coauthorship or cocitation networks.
I provide an overview of downloading data from the Web of Science below. We download these data manually at this point and use the Bibliometrix package to bring the data into a usable file. So, we can start by reading the data into Bibliometrix.
First, we need to bring the raw WoS data into a readable file. We can do this with the convert2df function in Bibliometrix. This requires a path to where the raw files are stored. We can grab those paths with the list.files function, being sure to specify full.names = TRUE to grab the entire path.
sh.articles <- list.files(path="data/articles/", full.names = TRUE)
sh.df <- convert2df(file=sh.articles, dbsource="isi", format="plaintext")
We will return to this data frame for reasons described below. But it may be worthwhile to build a bibliometrix object and use some of this package’s functions. We can use the biblioAnalysis function to build the object and then use summary and plot to see an overview of the corpus.
sh.biblio <- biblioAnalysis(sh.df, sep=";")
summary(sh.biblio, k=10)
##
##
## MAIN INFORMATION ABOUT DATA
##
## Timespan 1970 : 2023
## Sources (Journals, Books, etc) 751
## Documents 1679
## Annual Growth Rate % 13.07
## Document Average Age 7.97
## Average citations per doc 45.54
## Average citations per year per doc 3.891
## References 60012
##
## DOCUMENT TYPES
## abstract of published item 1
## article 1249
## article; book 1
## article; book chapter 70
## article; early access 56
## article; proceedings paper 19
## article; retracted publication 1
## book 8
## book review 11
## correction 1
## editorial material 8
## editorial material; book chapter 3
## meeting abstract 2
## note 1
## proceedings paper 176
## review 69
## review; book chapter 2
## review; early access 1
##
## DOCUMENT CONTENTS
## Keywords Plus (ID) 2345
## Author's Keywords (DE) 3725
##
## AUTHORS
## Authors 3405
## Author Appearances 4661
## Authors of single-authored docs 247
##
## AUTHORS COLLABORATION
## Single-authored docs 282
## Documents per Author 0.493
## Co-Authors per Doc 2.78
## International co-authorships % 29.9
##
##
## Annual Scientific Production
##
## Year Articles
## 1970 1
## 1987 1
## 1993 3
## 1994 6
## 1995 6
## 1997 4
## 1998 9
## 1999 3
## 2000 7
## 2001 5
## 2002 7
## 2003 13
## 2004 15
## 2005 21
## 2006 15
## 2007 41
## 2008 56
## 2009 57
## 2010 64
## 2011 79
## 2012 85
## 2013 95
## 2014 80
## 2015 92
## 2016 82
## 2017 85
## 2018 132
## 2019 117
## 2020 133
## 2021 126
## 2022 137
## 2023 45
##
## Annual Percentage Growth Rate 7.446603
##
##
## Most Productive Authors
##
## Authors Articles Authors Articles Fractionalized
## 1 BURT RS 26 BURT RS 17.83
## 2 GUAN JC 15 GUAN JC 5.92
## 3 CHEN Y 11 LIU CH 5.28
## 4 LIU CH 11 SODA G 4.58
## 5 SODA G 11 MOLINA-MORALES FX 4.33
## 6 ZHANG Y 11 TANG CY 4.33
## 7 KILDUFF M 10 SHIPILOV AV 3.95
## 8 LI Y 10 DIEZ-VIAL I 3.92
## 9 MOLINA-MORALES FX 10 ZAHEER A 3.87
## 10 WANG JB 10 KILDUFF M 3.53
##
##
## Top manuscripts per citations
##
## Paper DOI TC TCperYear NTC
## 1 BURT RS, 2004, AM J SOCIOL 10.1086/421787 2756 137.8 8.74
## 2 AHUJA G, 2000, ADMIN SCI QUART 10.2307/2667105 2749 114.5 3.06
## 3 BURT RS, 2000, RES ORGAN BEHAV 10.1016/S0191-3085(00)22009-1 1937 80.7 2.16
## 4 BURT RS, 1997, ADMIN SCI QUART 10.2307/2393923 1773 65.7 2.64
## 5 OWEN-SMITH J, 2004, ORGAN SCI 10.1287/orsc.1030.0054 1241 62.0 3.94
## 6 OBSTFELD D, 2005, ADMIN SCI QUART 10.2189/asqu.2005.50.1.100 1070 56.3 5.21
## 7 PROVAN KG, 2007, J MANAGE 10.1177/0149206307302554 887 52.2 5.81
## 8 ZAHEER A, 2005, STRATEGIC MANAGE J 10.1002/smj.482 883 46.5 4.30
## 9 WALKER G, 1997, ORGAN SCI 10.1287/orsc.8.2.109 873 32.3 1.30
## 10 PODOLNY JM, 2001, AM J SOCIOL 10.1086/323038 825 35.9 2.46
##
##
## Corresponding Author's Countries
##
## Country Articles Freq SCP MCP MCP_Ratio
## 1 CHINA 485 0.2981 362 123 0.254
## 2 USA 380 0.2336 284 96 0.253
## 3 UNITED KINGDOM 83 0.0510 46 37 0.446
## 4 SPAIN 74 0.0455 57 17 0.230
## 5 ITALY 66 0.0406 42 24 0.364
## 6 NETHERLANDS 63 0.0387 40 23 0.365
## 7 FRANCE 54 0.0332 29 25 0.463
## 8 AUSTRALIA 45 0.0277 24 21 0.467
## 9 CANADA 45 0.0277 27 18 0.400
## 10 GERMANY 40 0.0246 29 11 0.275
##
##
## SCP: Single Country Publications
##
## MCP: Multiple Country Publications
##
##
## Total Citations per Country
##
## Country Total Citations Average Article Citations
## 1 USA 40706 107.121
## 2 CHINA 6314 13.019
## 3 UNITED KINGDOM 4681 56.398
## 4 NETHERLANDS 3800 60.317
## 5 FRANCE 3458 64.037
## 6 CANADA 2949 65.533
## 7 SPAIN 1940 26.216
## 8 AUSTRALIA 1932 42.933
## 9 ITALY 1437 21.773
## 10 GERMANY 1173 29.325
##
##
## Most Relevant Sources
##
## Sources Articles
## 1 ORGANIZATION SCIENCE 39
## 2 STRATEGIC MANAGEMENT JOURNAL 39
## 3 SOCIAL NETWORKS 33
## 4 RESEARCH POLICY 28
## 5 ACADEMY OF MANAGEMENT JOURNAL 24
## 6 ADMINISTRATIVE SCIENCE QUARTERLY 24
## 7 SCIENTOMETRICS 24
## 8 SUSTAINABILITY 23
## 9 JOURNAL OF MANAGEMENT 22
## 10 TECHNOLOGICAL FORECASTING AND SOCIAL CHANGE 21
##
##
## Most Relevant Keywords
##
## Author Keywords (DE) Articles Keywords-Plus (ID) Articles
## 1 STRUCTURAL HOLES 205 STRUCTURAL HOLES 979
## 2 SOCIAL NETWORKS 155 PERFORMANCE 485
## 3 SOCIAL CAPITAL 138 INNOVATION 392
## 4 INNOVATION 105 KNOWLEDGE 288
## 5 SOCIAL NETWORK ANALYSIS 103 SOCIAL NETWORKS 218
## 6 NETWORKS 85 EMBEDDEDNESS 193
## 7 STRUCTURAL HOLE 63 ABSORPTIVE-CAPACITY 192
## 8 SOCIAL NETWORK 62 COLLABORATION 170
## 9 BROKERAGE 53 NETWORKS 170
## 10 CENTRALITY 37 IMPACT 159
plot(x=sh.biblio, k=5, pause=FALSE)
Bibliometrix has a number of useful functions, and you can see a good introduction here.
For example, we can plot the co-authorship network by using the biblioNetwork function on the data frame and specifying analysis="collaboration" and network="authors".
coauth <- biblioNetwork(sh.df, analysis="collaboration", network="authors", short=TRUE)
And then we can plot using networkPlot; here we also remove isolates and only label the top 10 nodes.
networkPlot(coauth, remove.isolates=TRUE, label.n=10, cluster="none")
We can see that the network contains a relatively small main component and numerous smaller components.
Manipulating the network in Bibliometrix isn’t always intuitive. So, we can move back and forth between the matrices and data frames conveniently generated by Bibliometrix and igraph or other packages.
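For example, here is a hedged sketch of handing the coauthorship matrix from biblioNetwork to igraph (the argument choices are ours):

# coauth is a sparse adjacency matrix; its diagonal counts each author's
# documents rather than a tie, so we drop it
coauth.g <- graph_from_adjacency_matrix(coauth, mode="undirected", weighted=TRUE, diag=FALSE)

summary(coauth.g)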
Keyword networks are another strategy for looking at the structure of a scientific field. They provide evidence of the epistemic space, or the knowledge area, in which scholars work.
Keyword networks are two-mode networks (articles x keywords) and we can keep them as two-modes or project them to a keyword x keyword network.
Let’s look at both of these networks in turn.
We begin by using the cocMatrix function in Bibliometrix to pull out the two-mode keyword matrix. Here, we set Field="DE", as “DE” is the Web of Science code for author-provided keywords. Then, we use graph_from_incidence_matrix from igraph to create an igraph object from the matrix.
kw.mat <- cocMatrix(sh.df, Field="DE")
kw.2mode <- graph_from_incidence_matrix(kw.mat)
We can plot the two-mode keyword network after removing any isolates, or articles without any keywords, using delete.vertices in igraph.
kw2.noiso <- delete.vertices(kw.2mode, which(degree(kw.2mode)==0))
plot(kw2.noiso, layout=layout_with_fr, vertex.label=NA, vertex.size=4)
We can see that there appears to be a large connected component and many smaller components. This isn’t super informative, but we can compute some basic statistics for reference. Here, we count the number of components of various sizes, the degree centralization, and the edge density.
#can examine the components here
print(table(components(kw2.noiso)$csize))
##
## 3 4 5 6 7 8 9 11 15 4724
## 1 6 22 13 5 3 3 1 1 1
#centralization
print(centralization.degree(kw2.noiso)$centralization)
## [1] 0.04006339
# density
print(edge_density(kw2.noiso))
## [1] 0.0005306664
Let’s dig deeper into the large component. First, we use components from igraph; then we identify the largest component using which.max in base R.
comps <- components(kw.2mode)
bigcomp <- which.max(comps$csize)
Next, we identify the vertices that are in the component identified with which.max using brackets, and we shrink the graph using induced_subgraph in igraph.
vert_ids <- V(kw.2mode)[comps$membership == bigcomp]
kw2.gcc <- induced_subgraph(kw.2mode, vert_ids)
Let’s look at some clusters using cluster_louvain and plot in igraph.
lv <- cluster_louvain(kw2.gcc)
plot(kw2.gcc, layout=layout_with_kk, vertex.color=lv$membership, vertex.label.cex=.5, vertex.size=degree(kw2.gcc)/5)
The labels make the plot very difficult to read, so let’s see if we can plot only the ones that are attached to nodes with high degree. We can combine which and quantile to do this task.
lab.keep <- which(degree(kw2.gcc) > quantile(degree(kw2.gcc), .99))
And then we plot using ifelse to label only those in the top group.
plot(kw2.gcc, vertex.label = ifelse(V(kw2.gcc) %in% lab.keep, V(kw2.gcc)$name, NA),
vertex.color=lv$membership, vertex.label.cex=.5, vertex.size=degree(kw2.gcc)/10)
Still somewhat of a mess, but we can imagine how pruning the network may lead to something that helps us see the structure of this literature.
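For instance, one pruning strategy is to drop low-degree vertices before plotting (a sketch; the threshold of 3 is arbitrary and the object name is ours):

kw2.pruned <- delete.vertices(kw2.gcc, which(degree(kw2.gcc) < 3))
kw2.pruned <- delete.vertices(kw2.pruned, which(degree(kw2.pruned)==0))

plot(kw2.pruned, layout=layout_with_kk, vertex.label=NA, vertex.size=3)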
The one-mode approach is exactly the same, but we start by using bipartite_projection from igraph on the two-mode network and selecting which="true" because we know the keywords are in the columns of the original matrix. We can plot that network, of course.
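A minimal sketch of that projection (the object name kw.1mode is ours):

kw.1mode <- bipartite_projection(kw.2mode, which="true")

plot(kw.1mode, layout=layout_with_fr, vertex.label=NA, vertex.size=4)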
We can then go through the same kinds of processes as with the two-mode network: finding the largest connected component, locating communities, plotting with reduced labels, and so forth, to locate patterns in the keyword structure, as sketched below.
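A hedged sketch of that pipeline on the keyword projection (object names are ours):

kw1.comps <- components(kw.1mode)
kw1.gcc <- induced_subgraph(kw.1mode, V(kw.1mode)[kw1.comps$membership == which.max(kw1.comps$csize)])

kw1.lv <- cluster_louvain(kw1.gcc)

plot(kw1.gcc, layout=layout_with_kk, vertex.color=kw1.lv$membership, vertex.label=NA, vertex.size=degree(kw1.gcc)/10)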
Text networks are increasingly common and relatively simple methods for document classification and/or generating thematic relationships across texts.
The goal is data reduction: Take a bunch of texts and simplify the relationship between them and/or their ideas.
Fundamentally, these are bipartite networks consisting of documents and words. We can think of any number of networks in similar ways, such as cocitation networks that consist of documents and citations.
Steps in processing texts:
1. Find a collection of texts
2. Force the collection into a .csv (etc.)
3. Preprocess the text data
- Remove stop words and or select POS (e.g. nouns)
- N-grams?
- Word length
- Common words
4. Build document by word matrix
- "Bag of words": word order doesn't matter
- Weighted elements? A common weight is tf-idf
5. Project to terms or documents
6. Plot
Let’s build a text network and a network of documents from the Web of Science on “structural holes.”
Bibliometrix has some text analysis capabilities, but we may want more flexibility in terms of what counts as text and so on. For example, we can organize and clean the data by deleting articles that don’t have an abstract and by combining keywords, titles, and abstracts.
Next, we can parse the combined text field and tag the words in the field by part of speech using a POS tagger. In this case, we use udpipe: first we download an English language model using udpipe_download_model(language = "english-lines"), then we load it into our R session with the local path using udpipe_load_model(file=pathname). Last, we use udpipe_annotate on the text field to tag each word with a part of speech. Here, we also use filter in dplyr to select nouns, but you may or may not want to look only at nouns.
# "short" is the article data frame prepared from the WoS records above
holes <- short %>% filter(AB != "")
holes$text <- paste(holes$DE, holes$TI, holes$AB)

m_eng_lines <- udpipe_download_model(language = "english-lines")
m_eng_lines_path <- m_eng_lines$file_model
m_eng_lines_loaded <- udpipe_load_model(file=m_eng_lines_path)

text_annotated <- udpipe_annotate(m_eng_lines_loaded, x = holes$text) %>%
  as.data.frame() %>%
  select(-sentence)

nouns <- text_annotated %>% filter(upos=="NOUN")
After we select nouns, we can also clean the data in ways that are typical for text networks, such as getting rid of short words, numbers, and words that appear only a handful of times (here, 10 or fewer times). You will see that most of these cleaning steps use either base R or dplyr.
nouns$lemma <- tolower(nouns$lemma)

data("stop_words")
nouns <- nouns %>%
  filter(!lemma %in% stop_words$word)

word_count <- as.data.frame(table(nouns$lemma))
word_count <- rename(word_count, lemma = Var1)

# drop tokens that contain digits
word_count$isnum <- nchar(gsub("[^0-9]+", "", word_count$lemma))
word_count <- word_count %>% filter(isnum==0)

word_count$lemma <- as.character(word_count$lemma)

# drop words of three or fewer characters and words appearing 10 or fewer times
word_count$wlength <- nchar(word_count$lemma)
word_count <- word_count %>% filter(wlength>3)
word_count <- word_count %>% filter(Freq>10)

# keep only the surviving words in the annotated data
word_count$keep <- 1
noun_fin <- left_join(nouns, word_count)
noun_fin[is.na(noun_fin)] <- 0
noun_fin <- noun_fin %>% filter(keep==1)
It is common to adjust the weight of tokens or words by their frequency so very frequent words are penalized. We can use tidytext for that.
art_nouns <- noun_fin %>%
  count(doc_id, lemma, sort = TRUE)

total_nouns <- art_nouns %>%
  group_by(doc_id) %>%
  summarize(total = sum(n))

art_nouns <- left_join(art_nouns, total_nouns)

noun_tf_idf <- art_nouns %>% bind_tf_idf(lemma, doc_id, n)
The output of the data cleaning process is an edge list, if we select (from dplyr) the lemma, or root word, and the doc_id. We can use pairwise_count from widyr to create a weighted edge list. If we rename the count variable “weight”, igraph will read the network as weighted. To create an igraph object, we can use graph_from_data_frame on this data frame edge list.
noun_edge <- noun_tf_idf %>% select(lemma, doc_id, tf_idf)
noun_pair <- noun_edge %>% pairwise_count(lemma, doc_id, wt=tf_idf)
noun_pair <- rename(noun_pair, weight=n)

noun.g <- graph_from_data_frame(noun_pair)
noun.g <- as.undirected(noun.g)

# match word frequencies to vertices by name rather than assuming the row orders align
V(noun.g)$wrd_cnt <- word_count$Freq[match(V(noun.g)$name, word_count$lemma)]

noun.g <- delete.vertices(noun.g, V(noun.g)[degree(noun.g)==0])
plot(noun.g, layout=layout_with_kk, vertex.size=V(noun.g)$wrd_cnt/100, main="Structural Holes Noun Network")
This is a messy graph, so we can use ggraph to clean it up and do some other editing, like using delete.edges in igraph, where we use the quantile function to keep only the top 10 percent of edges by weight.
n1sub.g <- delete.edges(noun.g, which(E(noun.g)$weight < quantile(E(noun.g)$weight, .90)))
n1sub.g <- delete.vertices(n1sub.g, V(n1sub.g)[degree(n1sub.g)==0])

cls_fast <- cluster_fast_greedy(n1sub.g)

membs <- data.frame(clusters=as.numeric(cls_fast$membership))
clrs <- data.frame(clusters=c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                   clr=c("purple", "yellow", "green", "skyblue", "red", "pink", "grey", "tomato", "gold"))
membs <- left_join(membs, clrs)
ggraph(n1sub.g, layout = "kk") +
geom_edge_link(alpha=.25, aes(width=weight), color="gray") +
geom_node_point(aes(size=wrd_cnt, fill = membs$clr), shape=21) +
geom_node_text(aes(label=ifelse(wrd_cnt > 100, name, NA))) +
scale_size_continuous(range = c(2, 20)) +
theme_void() +
theme(legend.position = "none") +
labs(title="Structural Hole Network: 1986-2000", caption = "This graph visualizes the text network for structural holes based on words in the abstract, title, and keywords.
Nodes are words. Edges are the count of overlaps across articles. Communities using FastGreedy.
Node labels are words that appear in at least 100 articles.")
## Warning: Removed 142 rows containing missing values (geom_text).
We download bibliographic data from the Web of Science. We could use other databases, each of which has its own strengths in terms of coverage. WoS is widely viewed as having good coverage of the social sciences. It also plays nicely with Bibliometrix, which helps because bringing the data into R can be tricky.
The first step is to access the Web of Science. You may have to VPN into your library’s network. You can access the main search page via https://www.webofscience.com/wos/woscc/basic-search. Here, you have some decisions to make. Do you want to search the entire collection, or do you want to search specific collections like the Social Sciences Citation Index? You can also run an advanced search. For our purposes, we search “structural hole*” in all fields. The asterisk indicates that we want any variation of hole, specifically here “hole” and “holes.”
After clicking search, we can see the list of articles that were located in the database. Next, we want to store these as a marked list. If you click “Add to Marked List” without having any articles selected, you can mark 50,000 at a time.
After clicking add, the marked list is stored in the folder in the right-hand navigation menu. If you click on the folder, you will see the marked list. Now, we want to export these files. Bibliometrix can convert several types of files to data frames, but we click on the Export button and choose plain text.
After choosing the export format, the export records box will appear. This box changes slightly depending on what kind of information you want. For example, the number of records that you can export changes. We default to capturing the most information, Full Record and Cited References, which reduces the number of records per export to 500. Export the corpus by changing the numbers in the records box. You can see how this can be quite time consuming.
These records will be stored in your local Downloads folder. I recommend that you move them. Now, we can clean and organize the data in Bibliometrix as described above.