How do social actors or entities connect to one another? On its face, the answer is obvious: Actors are directly tied to one another through social relations (friends with, talks to, gets advice from, and so on). But social scientists have long viewed social connectivity more broadly to include indirect ties through co-membership or affiliation (members of the same team, members of the same corporate board, co-authorship, and so on). As Cooley wrote, “A man may be regarded as the point of intersection of an indefinite number of circles representing social groups, having as many arcs passing through him as there are groups.”
When we discuss affiliation networks, we ask:
What does it mean to be affiliated through a literal group, activity, or event?
And we may also ask:
What does it mean to be connected through a generic group or an abstraction?
It is probably not controversial to claim that people who are in the exact same social group share some type of meaningful connection through their common membership. If I attend a book club with you, then we likely share numerous characteristics, at minimum an appreciation of similar books. However, if we attend book clubs in different cities, does generic book club membership provide a meaningful tie?
Scholars interested in duality and affiliation pursue these kinds of questions, and the questions extend to non-human entities as well: scientific papers, emotions, and other entities can be modeled as two-mode or affiliation networks.
Earlier work on duality (Breiger 1974) builds upon Cooley’s idea: People intersect through their associations, which define (in part) their individuality. The concept of duality recognizes that relations among groups imply relations among individuals, and vice versa.
As Breiger (1974:87) writes, “With respect to the membership network…persons who are actors in one picture (the P matrix) are with equal legitimacy viewed as connections in the dual picture (the G matrix), and conversely for groups.”
The resulting network after these transformations:

- Is always symmetric
- Has a diagonal that tells you how many groups a person belongs to (in the person matrix) or how many members a group has (in the group matrix)
We can see how these types of transformations work by turning to a toy example.
First, we should load the packages that we will need in this workshop. We will use bibliometrix for some two-mode science network analysis, igraph for network analysis and visualization, ggraph for visualization, udpipe for some language parsing, and dplyr, tidygraph, and ggforce for a few data organization and visualization tasks. We use tidytext and widyr for some text processing functions.
library(bibliometrix)
library(igraph)
library(ggraph)
library(udpipe)
library(dplyr)
library(tidygraph)
library(ggforce)
library(tidytext)
library(widyr)
Let’s begin with a person-to-group matrix. Each column is a group, each row is a person, and a cell equals 1 if the person in that row belongs to the group in that column. You can calculate how many groups two individuals share by comparing their rows: identify every column where both rows equal 1 and count them. This is the overlap. You can also find the total number of members in a group by summing its column, and the total number of groups an individual belongs to by summing their row.
Let’s make a matrix in R.
one <- c(0, 1, 1, 0, 0, 0)
two <- c(0, 0, 1, 1, 0, 0)
three <- c(0, 0, 0, 1, 1, 1)
four <- c(0, 0, 0, 1, 0, 1)
five <- c(1, 0, 0, 1, 0, 0)
dual <- cbind(one, two, three, four, five)
rownames(dual) <- c("A", "B", "C", "D", "E", "F")
print(dual)
## one two three four five
## A 0 0 0 0 1
## B 1 0 0 0 0
## C 1 1 0 0 0
## D 0 1 1 1 1
## E 0 0 1 0 0
## F 0 0 1 1 0
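To make the row comparison concrete, here is a quick check in base R (a minimal sketch using the toy matrix we just built):

# columns where both C's and D's rows equal 1, summed: their overlap
sum(dual["C", ] * dual["D", ])

## [1] 1

# group sizes come from column sums; per-person membership counts from row sums
colSums(dual)
rowSums(dual)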
Recall that the matrix is a simple and elegant way to manipulate networks. That is most obviously the case, perhaps, when thinking through duality.
From persons-to-groups to persons and groups.
If we multiply our person-to-group matrix by its transpose, we get the person-to-person matrix. If we multiply the transpose of our person-to-group matrix by the matrix itself, we get the group-to-group matrix.
The Person-Person Matrix:
person <- dual %*% t(dual)
print(person)
## A B C D E F
## A 1 0 0 1 0 0
## B 0 1 1 0 0 0
## C 0 1 2 1 0 0
## D 1 0 1 4 1 2
## E 0 0 0 1 1 1
## F 0 0 0 2 1 2
The Group-Group Matrix:
groups <- t(dual) %*% dual
print(groups)
## one two three four five
## one 2 1 0 0 0
## two 1 2 1 1 1
## three 0 1 3 2 1
## four 0 1 2 2 1
## five 0 1 1 1 2
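We can verify the claim about the diagonals directly:

# the diagonals recover memberships: groups per person and members per group
all(diag(person) == rowSums(dual))

## [1] TRUE

all(diag(groups) == colSums(dual))

## [1] TRUE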
The general class of affiliation networks is known as bipartite or two-mode networks. The matrix transformation described above, where we transform the network from two-mode to one-mode, is called a projection.
One-mode projections, while perhaps easier to interpret, obviously involve a substantial loss of information. In the toy example, for instance, the person matrix records that D and F share two groups, but not which groups they share.
While it remains more common, perhaps, to project two-mode networks, a growing body of work examines the tractability of analyzing two-mode networks in their original two-mode form.
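As a small illustration of keeping a network in its two-mode form, we can turn the toy matrix into a bipartite igraph object (a minimal sketch; the object name dual.2mode is ours):

dual.2mode <- graph_from_incidence_matrix(dual)

# type is FALSE for persons (rows) and TRUE for groups (columns)
V(dual.2mode)$type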
Science networks have been a major area where duality has been examined. We can think of all of the two-mode networks that structure science: people in labs, departments, and so on, or ideas in books, articles, and the like. From bibliographic records we can construct numerous two-mode networks and project them to one-mode networks if we so choose. We can plot university-by-article or university-by-journal networks. We can plot country-by-keyword networks or typical coauthorship or cocitation networks.
I provide an overview of downloading data from the Web of Science below. We download these data manually at this point and use the Bibliometrix package to bring the data into a usable file. So, we can start by reading the data into Bibliometrix.
First, we need to bring the raw WoS data into a readable file. We can do this with the convert2df function in Bibliometrix. This requires a path to where the raw files are stored. We can grab those paths with the list.files function, being sure to specify full.names = TRUE to grab the entire path.
sh.articles <- list.files(path="data/articles/", full.names = TRUE)
sh.df <- convert2df(file=sh.articles, dbsource="isi", format="plaintext")
We will return to this data frame for reasons described below. But it may be worthwhile to build a bibliometrix object and use some of this package’s functions. We can use the biblioAnalysis function to build the object and then use summary and plot to see an overview of the corpus.
sh.biblio <- biblioAnalysis(sh.df, sep=";")
summary(sh.biblio, k=10)
##
##
## MAIN INFORMATION ABOUT DATA
##
## Timespan 1970 : 2023
## Sources (Journals, Books, etc) 751
## Documents 1679
## Annual Growth Rate % 13.07
## Document Average Age 7.97
## Average citations per doc 45.54
## Average citations per year per doc 3.891
## References 60012
##
## DOCUMENT TYPES
## abstract of published item 1
## article 1249
## article; book 1
## article; book chapter 70
## article; early access 56
## article; proceedings paper 19
## article; retracted publication 1
## book 8
## book review 11
## correction 1
## editorial material 8
## editorial material; book chapter 3
## meeting abstract 2
## note 1
## proceedings paper 176
## review 69
## review; book chapter 2
## review; early access 1
##
## DOCUMENT CONTENTS
## Keywords Plus (ID) 2345
## Author's Keywords (DE) 3725
##
## AUTHORS
## Authors 3405
## Author Appearances 4661
## Authors of single-authored docs 247
##
## AUTHORS COLLABORATION
## Single-authored docs 282
## Documents per Author 0.493
## Co-Authors per Doc 2.78
## International co-authorships % 29.9
##
##
## Annual Scientific Production
##
## Year Articles
## 1970 1
## 1987 1
## 1993 3
## 1994 6
## 1995 6
## 1997 4
## 1998 9
## 1999 3
## 2000 7
## 2001 5
## 2002 7
## 2003 13
## 2004 15
## 2005 21
## 2006 15
## 2007 41
## 2008 56
## 2009 57
## 2010 64
## 2011 79
## 2012 85
## 2013 95
## 2014 80
## 2015 92
## 2016 82
## 2017 85
## 2018 132
## 2019 117
## 2020 133
## 2021 126
## 2022 137
## 2023 45
##
## Annual Percentage Growth Rate 7.446603
##
##
## Most Productive Authors
##
## Authors Articles Authors Articles Fractionalized
## 1 BURT RS 26 BURT RS 17.83
## 2 GUAN JC 15 GUAN JC 5.92
## 3 CHEN Y 11 LIU CH 5.28
## 4 LIU CH 11 SODA G 4.58
## 5 SODA G 11 MOLINA-MORALES FX 4.33
## 6 ZHANG Y 11 TANG CY 4.33
## 7 KILDUFF M 10 SHIPILOV AV 3.95
## 8 LI Y 10 DIEZ-VIAL I 3.92
## 9 MOLINA-MORALES FX 10 ZAHEER A 3.87
## 10 WANG JB 10 KILDUFF M 3.53
##
##
## Top manuscripts per citations
##
## Paper DOI TC TCperYear NTC
## 1 BURT RS, 2004, AM J SOCIOL 10.1086/421787 2756 137.8 8.74
## 2 AHUJA G, 2000, ADMIN SCI QUART 10.2307/2667105 2749 114.5 3.06
## 3 BURT RS, 2000, RES ORGAN BEHAV 10.1016/S0191-3085(00)22009-1 1937 80.7 2.16
## 4 BURT RS, 1997, ADMIN SCI QUART 10.2307/2393923 1773 65.7 2.64
## 5 OWEN-SMITH J, 2004, ORGAN SCI 10.1287/orsc.1030.0054 1241 62.0 3.94
## 6 OBSTFELD D, 2005, ADMIN SCI QUART 10.2189/asqu.2005.50.1.100 1070 56.3 5.21
## 7 PROVAN KG, 2007, J MANAGE 10.1177/0149206307302554 887 52.2 5.81
## 8 ZAHEER A, 2005, STRATEGIC MANAGE J 10.1002/smj.482 883 46.5 4.30
## 9 WALKER G, 1997, ORGAN SCI 10.1287/orsc.8.2.109 873 32.3 1.30
## 10 PODOLNY JM, 2001, AM J SOCIOL 10.1086/323038 825 35.9 2.46
##
##
## Corresponding Author's Countries
##
## Country Articles Freq SCP MCP MCP_Ratio
## 1 CHINA 485 0.2981 362 123 0.254
## 2 USA 380 0.2336 284 96 0.253
## 3 UNITED KINGDOM 83 0.0510 46 37 0.446
## 4 SPAIN 74 0.0455 57 17 0.230
## 5 ITALY 66 0.0406 42 24 0.364
## 6 NETHERLANDS 63 0.0387 40 23 0.365
## 7 FRANCE 54 0.0332 29 25 0.463
## 8 AUSTRALIA 45 0.0277 24 21 0.467
## 9 CANADA 45 0.0277 27 18 0.400
## 10 GERMANY 40 0.0246 29 11 0.275
##
##
## SCP: Single Country Publications
##
## MCP: Multiple Country Publications
##
##
## Total Citations per Country
##
## Country Total Citations Average Article Citations
## 1 USA 40706 107.121
## 2 CHINA 6314 13.019
## 3 UNITED KINGDOM 4681 56.398
## 4 NETHERLANDS 3800 60.317
## 5 FRANCE 3458 64.037
## 6 CANADA 2949 65.533
## 7 SPAIN 1940 26.216
## 8 AUSTRALIA 1932 42.933
## 9 ITALY 1437 21.773
## 10 GERMANY 1173 29.325
##
##
## Most Relevant Sources
##
## Sources Articles
## 1 ORGANIZATION SCIENCE 39
## 2 STRATEGIC MANAGEMENT JOURNAL 39
## 3 SOCIAL NETWORKS 33
## 4 RESEARCH POLICY 28
## 5 ACADEMY OF MANAGEMENT JOURNAL 24
## 6 ADMINISTRATIVE SCIENCE QUARTERLY 24
## 7 SCIENTOMETRICS 24
## 8 SUSTAINABILITY 23
## 9 JOURNAL OF MANAGEMENT 22
## 10 TECHNOLOGICAL FORECASTING AND SOCIAL CHANGE 21
##
##
## Most Relevant Keywords
##
## Author Keywords (DE) Articles Keywords-Plus (ID) Articles
## 1 STRUCTURAL HOLES 205 STRUCTURAL HOLES 979
## 2 SOCIAL NETWORKS 155 PERFORMANCE 485
## 3 SOCIAL CAPITAL 138 INNOVATION 392
## 4 INNOVATION 105 KNOWLEDGE 288
## 5 SOCIAL NETWORK ANALYSIS 103 SOCIAL NETWORKS 218
## 6 NETWORKS 85 EMBEDDEDNESS 193
## 7 STRUCTURAL HOLE 63 ABSORPTIVE-CAPACITY 192
## 8 SOCIAL NETWORK 62 COLLABORATION 170
## 9 BROKERAGE 53 NETWORKS 170
## 10 CENTRALITY 37 IMPACT 159
plot(x=sh.biblio, k=5, pause=FALSE)
Bibliometrix has a number of useful functions, and you can see a good introduction here.
For example, we can plot the co-authorship network by using the biblioNetwork function on the data frame and specifying analysis="collaboration" and network="authors".
coauth <- biblioNetwork(sh.df, analysis="collaboration", network="authors", short=TRUE)
And then we can plot using networkPlot; here we also remove isolates and only label the top 10 nodes.
networkPlot(coauth, remove.isolates=TRUE, label.n=10, cluster="none")
We can see that the network contains a relatively small main component and numerous smaller components.
Manipulating the network in Bibliometrix isn’t always intuitive. So, we can move back and forth between the matrices and data frames conveniently generated by Bibliometrix and igraph or other packages.
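For example, here is a hedged sketch of handing the coauthorship matrix from biblioNetwork to igraph (the argument choices are ours):

# coauth is a sparse adjacency matrix; its diagonal counts each author's
# documents rather than a tie, so we drop it
coauth.g <- graph_from_adjacency_matrix(coauth, mode="undirected", weighted=TRUE, diag=FALSE)

summary(coauth.g)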
Keyword networks are another strategy for looking at the structure of a scientific field. They provide evidence of the epistemic space, or the knowledge area, in which scholars work.
Keyword networks are two-mode networks (articles x keywords) and we can keep them as two-modes or project them to a keyword x keyword network.
Let’s look at both of these networks in turn.
We begin by using the cocMatrix function in Bibliometrix to pull out the two-mode keyword matrix. Here, we set Field="DE", as “DE” is the Web of Science code for author-provided keywords. Then, we use graph_from_incidence_matrix from igraph to create an igraph object from the matrix.
kw.mat <- cocMatrix(sh.df, Field="DE")
kw.2mode <- graph_from_incidence_matrix(kw.mat)
We can plot the two-mode keyword network after removing any isolates, or articles without any keywords, using delete.vertices in igraph.
kw2.noiso <- delete.vertices(kw.2mode, which(degree(kw.2mode)==0))
plot(kw2.noiso, layout=layout_with_fr, vertex.label=NA, vertex.size=4)
We can see that there appears to be a large connected component and many smaller components. This isn’t super informative, but we can compute some basic statistics for reference. Here, we count the number of components of various sizes, the degree centralization, and the edge density.
#can examine the components here
print(table(components(kw2.noiso)$csize))
##
## 3 4 5 6 7 8 9 11 15 4724
## 1 6 22 13 5 3 3 1 1 1
#centralization
print(centralization.degree(kw2.noiso)$centralization)
## [1] 0.04006339
# density
print(edge_density(kw2.noiso))
## [1] 0.0005306664
Let’s dig deeper into the large component. First, we use components from igraph; then we identify the largest component using which.max in base R.
comps <- components(kw.2mode)
bigcomp <- which.max(comps$csize)
Next, we identify the vertices that are in the component identified with which.max using brackets, and we shrink the graph using induced_subgraph in igraph.
vert_ids <- V(kw.2mode)[comps$membership == bigcomp]
kw2.gcc <- induced_subgraph(kw.2mode, vert_ids)
Let’s look at some clusters using cluster_louvain and plot in igraph.
lv <- cluster_louvain(kw2.gcc)
plot(kw2.gcc, layout=layout_with_kk, vertex.color=lv$membership, vertex.label.cex=.5, vertex.size=degree(kw2.gcc)/5)
The labels make the plot very difficult to read, so let’s see if we can plot only the ones that are attached to nodes with high degree. We can combine which and quantile to do this task.
lab.keep <- which(degree(kw2.gcc) > quantile(degree(kw2.gcc), .99))
And then we plot using ifelse to label only those in the top group.
plot(kw2.gcc, vertex.label = ifelse(V(kw2.gcc) %in% lab.keep, V(kw2.gcc)$name, NA),
vertex.color=lv$membership, vertex.label.cex=.5, vertex.size=degree(kw2.gcc)/10)
Still somewhat of a mess, but we can imagine how pruning the network may lead to something that helps us see the structure of this literature.
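For instance, one pruning strategy is to drop low-degree vertices before plotting (a sketch; the threshold of 3 is arbitrary and the object name is ours):

kw2.pruned <- delete.vertices(kw2.gcc, which(degree(kw2.gcc) < 3))
kw2.pruned <- delete.vertices(kw2.pruned, which(degree(kw2.pruned)==0))

plot(kw2.pruned, layout=layout_with_kk, vertex.label=NA, vertex.size=3)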
The one-mode approach is exactly the same, but we start by using bipartite_projection from igraph on the two-mode network and selecting which="true" because we know the keywords are in the columns of the original matrix. We can plot that network, of course.
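A minimal sketch of that projection (the object name kw.1mode is ours):

kw.1mode <- bipartite_projection(kw.2mode, which="true")

plot(kw.1mode, layout=layout_with_fr, vertex.label=NA, vertex.size=4)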
We can then go through the same kinds of processes as with the two-mode network: finding the largest connected component, locating communities, plotting with reduced labels, and so forth, to locate patterns in the keyword structure, as sketched below.
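A hedged sketch of that pipeline on the keyword projection (object names are ours):

kw1.comps <- components(kw.1mode)
kw1.gcc <- induced_subgraph(kw.1mode, V(kw.1mode)[kw1.comps$membership == which.max(kw1.comps$csize)])

kw1.lv <- cluster_louvain(kw1.gcc)

plot(kw1.gcc, layout=layout_with_kk, vertex.color=kw1.lv$membership, vertex.label=NA, vertex.size=degree(kw1.gcc)/10)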
Text networks are increasingly common and relatively simple methods for document classification and/or generating thematic relationships across texts.
The goal is data reduction: Take a bunch of texts and simplify the relationship between them and/or their ideas.
Fundamentally, these are bipartite networks consisting of documents and words. We can think of any number of networks in similar ways, such as cocitation networks that consist of documents and citations.
Steps in processing texts:
1. Find a collection of texts
2. Force the collection into a .csv (etc.)
3. Preprocess the text data
- Remove stop words and or select POS (e.g. nouns)
- N-grams?
- Word length
- Common words
4. Build document by word matrix
- "Bag of words": word order doesn't matter
- Weighted elements? A common weight is tf-idf
5. Project to terms or documents
6. Plot
Let’s build a text network and a network of documents from the Web of Science on “structural holes.”
Bibliometrix has some text analysis capabilities, but we may want more flexibility in terms of what counts as text and so on. For example, we can organize and clean the data by deleting articles that don’t have an abstract and by combining keywords, titles, and abstracts.
Next, we can parse the combined text field and tag the words in the field by part of speech using a POS tagger. In this case, we use udpipe: first we download an English language model using udpipe_download_model(language = "english-lines"), then we load it into our R session with the local path using udpipe_load_model(file=pathname). Last, we use udpipe_annotate on the text field to tag each word with a part of speech. Here, we also use filter in dplyr to select nouns, but you may or may not want to look only at nouns.
# "short" is the article data frame prepared from the WoS records above
holes <- short %>% filter(AB != "")
holes$text <- paste(holes$DE, holes$TI, holes$AB)

m_eng_lines <- udpipe_download_model(language = "english-lines")
m_eng_lines_path <- m_eng_lines$file_model
m_eng_lines_loaded <- udpipe_load_model(file=m_eng_lines_path)

text_annotated <- udpipe_annotate(m_eng_lines_loaded, x = holes$text) %>%
  as.data.frame() %>%
  select(-sentence)

nouns <- text_annotated %>% filter(upos=="NOUN")
After we select nouns, we can also clean the data in ways that are typical for text networks, such as getting rid of short words, numbers, and words that appear only a handful of times (here, 10 or fewer times). You will see that most of these cleaning steps use either base R or dplyr.
nouns$lemma <- tolower(nouns$lemma)

data("stop_words")
nouns <- nouns %>%
  filter(!lemma %in% stop_words$word)

word_count <- as.data.frame(table(nouns$lemma))
word_count <- rename(word_count, lemma = Var1)

# drop tokens that contain digits
word_count$isnum <- nchar(gsub("[^0-9]+", "", word_count$lemma))
word_count <- word_count %>% filter(isnum==0)

word_count$lemma <- as.character(word_count$lemma)

# drop words of three or fewer characters and words appearing 10 or fewer times
word_count$wlength <- nchar(word_count$lemma)
word_count <- word_count %>% filter(wlength>3)
word_count <- word_count %>% filter(Freq>10)

# keep only the surviving words in the annotated data
word_count$keep <- 1
noun_fin <- left_join(nouns, word_count)
noun_fin[is.na(noun_fin)] <- 0
noun_fin <- noun_fin %>% filter(keep==1)
It is common to adjust the weight of tokens or words by their frequency so very frequent words are penalized. We can use tidytext for that.
art_nouns <- noun_fin %>%
  count(doc_id, lemma, sort = TRUE)

total_nouns <- art_nouns %>%
  group_by(doc_id) %>%
  summarize(total = sum(n))

art_nouns <- left_join(art_nouns, total_nouns)

noun_tf_idf <- art_nouns %>% bind_tf_idf(lemma, doc_id, n)
The output of the data cleaning process is an edge list, if we select (from dplyr) the lemma, or root word, and the doc_id. We can use pairwise_count from widyr to create a weighted edge list. If we rename the count variable “weight”, igraph will read the network as weighted. To create an igraph object, we can use graph_from_data_frame on this data frame edge list.
noun_edge <- noun_tf_idf %>% select(lemma, doc_id, tf_idf)
noun_pair <- noun_edge %>% pairwise_count(lemma, doc_id, wt=tf_idf)
noun_pair <- rename(noun_pair, weight=n)

noun.g <- graph_from_data_frame(noun_pair)
noun.g <- as.undirected(noun.g)

# match word frequencies to vertices by name rather than assuming the row orders align
V(noun.g)$wrd_cnt <- word_count$Freq[match(V(noun.g)$name, word_count$lemma)]

noun.g <- delete.vertices(noun.g, V(noun.g)[degree(noun.g)==0])
plot(noun.g, layout=layout_with_kk, vertex.size=V(noun.g)$wrd_cnt/100, main="Structural Holes Noun Network")
This is a messy graph, so we can use ggraph to clean it up and do some other editing, like using delete.edges in igraph, where we use the quantile function to keep only the top 10 percent of edges by weight.
n1sub.g <- delete.edges(noun.g, which(E(noun.g)$weight < quantile(E(noun.g)$weight, .90)))
n1sub.g <- delete.vertices(n1sub.g, V(n1sub.g)[degree(n1sub.g)==0])

cls_fast <- cluster_fast_greedy(n1sub.g)

membs <- data.frame(clusters=as.numeric(cls_fast$membership))
clrs <- data.frame(clusters=c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                   clr=c("purple", "yellow", "green", "skyblue", "red", "pink", "grey", "tomato", "gold"))
membs <- left_join(membs, clrs)
ggraph(n1sub.g, layout = "kk") +
geom_edge_link(alpha=.25, aes(width=weight), color="gray") +
geom_node_point(aes(size=wrd_cnt, fill = membs$clr), shape=21) +
geom_node_text(aes(label=ifelse(wrd_cnt > 100, name, NA))) +
scale_size_continuous(range = c(2, 20)) +
theme_void() +
theme(legend.position = "none") +
labs(title="Structural Hole Network: 1986-2000", caption = "This graph visualizes the text network for structural holes based on words in the abstract, title, and keywords.
Nodes are words. Edges are the count of overlaps across articles. Communities using FastGreedy.
Node labels are words that appear in at least 100 articles.")
## Warning: Removed 142 rows containing missing values (geom_text).
We download bibliographic data from the Web of Science. We could use other databases, each of which has its own strengths in terms of coverage. WoS is widely viewed as having good coverage of the social sciences. It also plays nicely with Bibliometrix, which helps because bringing the data into R can be tricky.
The first step is to access the Web of Science. You may have to VPN into your library’s network. You can access the main search page via https://www.webofscience.com/wos/woscc/basic-search. Here, you have some decisions to make. Do you want to search the entire collection, or do you want to search specific collections like the Social Sciences Citation Index? You can also run an advanced search. For our purposes, we search “structural hole*” in all fields. The asterisk indicates that we want any variation of hole, specifically here “hole” and “holes.”
After clicking search, we can see the list of articles that were located in the database. Next, we want to store these as a marked list. If you click “Add to Marked List” without having any articles selected, you can mark 50,000 at a time.
After clicking add, the marked list is stored in the folder in the right-hand navigation menu. If you click on the folder, you will see the marked list. Now, we want to export these files. Bibliometrix can convert several types of files to data frames, but we click on the Export button and choose plain text.
After choosing the export format, the export records box will appear. This box changes slightly depending on what kind of information you want. For example, the number of records that you can export changes. We default to capturing the most information, Full Record and Cited References, which reduces the number of records per export to 500. Export the corpus by changing the numbers in the records box. You can see how this can be quite time consuming.
These records will be stored in your local Downloads folder. I recommend that you move them. Now, we can clean and organize the data in Bibliometrix as described above.