What is a social group?

One of the fundamental goals of social network analysis is to locate groups of nodes that seem to be “grouped” together. This grouping pattern between nodes can be the result of several factors. For example, there could be exogenous groupings based on affiliation (e.g. work units or afterschool activities) or they could be based on connectivity. Strategies for finding groups in networks use endogenous network properties to identify groups or the propensity to group.

The most basic way to think about this is density, which of course is at the node level.

First, we need to load the packages for this workshop. sna and network will be used for blockmodeling. We load them first as they share functions with igraph and we want to priortize igraph, which we load next. We load RColorBrewer so that we can prettify the igraph networks and intergraph for switching between igraph and network objects.

library(sna)
library(network)
library(igraph)
library(RColorBrewer)
library(intergraph)

Finding Groups: Zachary’s Karate Club

Let’s get a quick look at our network for this week: Zachary’s Karate Club. The nodes are members of a karate club and edges are interactions. The network occurs just prior to a split between the Teacher (node 1) and the club president (node 34). This is the most common exemplar network. Its popularity inspired me to include the network as the front image of the Oxford Handbook of Social Networks.

I have stored the network as an R object that we can simply load.

load("data/karate.rda")

As always we should examine the plot. I want to look at a few things when thinking about groups. For example, the density of the network might be a good initial look at group formation. We also leverage one way of thinking about groups - as ego networks - using the make_ego_graph(g) function that we set to one for first-order neighbors.

Also note that we store the layout of the graph using the layout_with_kk(g) function in igraph so that each network throughout the workshop can have the same layout by using the layout= function when plotting networks.

layout.kk <- layout_with_kk(karate)

kdens <- make_ego_graph(karate, 1) %>%
vapply(graph.density, numeric(1))

plot.igraph(karate, layout=layout.kk, vertex.size=kdens*10)

Defining Groups

Let’s start with several basic approaches to locating groups: components, bicomponents, cliques, and cores.

These are basic because there are not estimated by a community detection algorithm, but are more simply defined by node and path characteristics.

Components

A component is a maximally connected subgraph where every node can be reached. Components could be isolates, dyads, triads, are larger structures, but they are otherwise disconnected from other parts of the network. Strong components are directed and weak components are undirected. The giant connected component is the largest maximally connected subgraph.

A bicomponent is a maximally connected subgraph with at least two paths between each node. Every bicomponent will have at minimum a degree of 2.

We can use components(g) & biconnected.components(g) in igraph.

First we calculate components.

V(karate)$name <- c(1:34)

comps <- components(karate)

print(comps$csize)
## [1] 34

Next, we can calculate bicomponents and plot the bicomponent. biconnected.components results in several objects, but to pick the subgraph we can select from the list of bicomponents stored as $components. If we want to identify the largest bicomponent we can use max(length). So, if we store the results of biconnected.components in an object called biconnected, we can select the largest bicomponent by:

bicomponent.org$components[[max(length(bicomponent.org$components))]]

Next, we use the induced.subgraph(g, vids=x) function in igraph where g is the graph and x equals the list of vertices that you want to use in the subgraph. Tihs is the results of the above example.

bikarate <- lapply(biconnected_components(karate)$components, 
              length) %>% which.max()

bikaratevs <- biconnected_components(karate)$components[[bikarate]]

bikarate.g <- induced_subgraph(graph=karate, vids=bikaratevs)

plot.igraph(bikarate.g)

This is a densely connected graph, so it doesn’t change to much. Note that the leaf node, 12, is removed.

Cliques

We may want to know groups of actors that really hang close to one another. Cliques capture one way of thinking about this groupiness. A clique is a maximally connected subgraph (e.g., every node is connected to one another) of at least three nodes. These are essentially groups of closed triads, so they are dense connections.

In igraph:

cliques(g) | max_cliques(g) | largest_cliques(g)

cliques <- largest_cliques(karate)

c1 <- cliques[[max(length(cliques))]]

clique2 <- induced.subgraph(graph=karate,vids=c1)

plot(clique2)

Cores

k-Cores are more relaxed way of thinking about groups. These are subgraphs where each node is maximally connected to at least k neighboring nodes. Some scholars also think of coreness as a proxy for centrality in very large networks.

coreness(g) in igraph cacluates cores. Each node is assigned a k for its largest k-core. We can quickly examine the distribution of largest cores using tablein base R.

kcore <- coreness(karate)    

table(kcore)
## kcore
##  1  2  3  4 
##  1 11 12 10

We can store the core for each node as a vertex or nodal attribute called “core” or whatever using the usual format V(g)$core <- kcore where kcore is the output for coreness(g).

We can assign a color to each core in several ways. An effective solution is to use brewer.pal from RColorBrewer to index the core groups as factors and storing them as a vector in these steps:

Grab four colors as there are four core groups.

thing <- brewer.pal(4, "Set1")

Index to the four as stored as a vertex attribute in the graph. We need this to be a as.factor.

my_color <- thing[as.numeric(as.factor(V(g)$core))]

Then we plot the network in igraph using vertex.color=my_color. Note there is also a legend example included here.

V(karate)$core <- kcore      

coul <- brewer.pal(4, "Set1") 
 
my_color <- coul[as.numeric(as.factor(V(karate)$core))]

plot.igraph(karate, , layout=layout.kk, vertex.color=my_color)
legend("bottomleft", legend=levels(as.factor(V(karate)$core)) , 
col = coul , bty = "n", pch=20 , pt.cex = 3, cex = 1.5, 
horiz = FALSE, inset = c(0.1, 0.1))

Cohesion and Cut-Points

K-components are defined by: - (a) the number of nodes that must be removed from the graph to disconnect the components. - (b) the number of ties that must be removed from the graph to disconnect the components.

Mathematically, a & b are equivalent.

To locate k-components we first identify cut-points/articulation points. These are nodes whose removal results in the graph splitting apart. These can also be seen as “important” nodes.

articulation.points(g) in igraph identify cut points.

art_points <- articulation.points(karate)

rem <- delete.vertices(karate, art_points)

plot(rem)

art.points2 <- articulation.points(rem)

rem2 <- delete.vertices(rem, art.points2)

plot(rem2)

Modularity Maximization

Modularity maximization is a poular way of thinking about “groupiness” within a complete network.

Maximization is: - generally thought of as describing a tendency towards having more in-group than out-group ties - in-/out- groups are NOT defined on node characteristics (ala homophily), but on tie-patterns - typically estimated through divisive/agglomerative assignment or looking at the consequences of iteratively removing nodes.

Newman-Girvan and Louvain community detection both use modularity maximization.

Newman-Girvan (fast-greedy)

Newman-Girvan is a divisive approach:

  • Find edge with highest betweenness & remove
  • Calculate community groups & fitness
  • Iterate over 1-2 (as many as N times)

How to define “fitness”?

  • Community-based mixing matrix (proportion of edges w/in & across groups)
  • Compute matrix “trace” (the sum of the main diagonal)
  • High values indicate well identified partitions
  • Identify cross-group links (via row/col sums)
  • Modularity = sum of difference between diagonal & off diagonal elements
  • High values indicate a more partitioned / segmented (modular) network
  • Typically examined for local maximization
fg_comms <- cluster_fast_greedy(karate)

V(karate)$fg_comms <- fg_comms$membership

plot.igraph(karate, layout=layout.kk, vertex.color=V(karate)$fg_comms)

Louvain

Louvain (Blondel et al. 2008) is a modularity optimization approach:

-Optimize modularity on all nodes then iterate - Does switching the node to another group increase modularity? - stop at solution that produces the largest increase in modularity - This has been optimized in the Leiden approach (see Leiden pacakge).

lv_comms <- cluster_louvain(karate)

V(karate)$lv_comms <- lv_comms$membership

plot.igraph(karate, layout=layout.kk, vertex.color=V(karate)$lv_comms)

Leiden

The Leiden algorithm is a variation on Louvain that allows you to adjust a refinement parameter or to generate bigger or smaller groups.

cluster_leiden(g, resolution_parameter=#) in igraph with 1 as the default resolution parameter estimates this solution. A higher resolution locates smaller communities, while a lower resolution locates larger groups.

Here we explore a large and a small resolution.

leid_large_comms <- cluster_leiden(karate, resolution_parameter = 3)

V(karate)$leid_large_comms <- leid_large_comms$membership

plot.igraph(karate, layout=layout.kk, vertex.color=V(karate)$leid_large_comms)

leid_small_comms <- cluster_leiden(karate, resolution_parameter = .25)

V(karate)$leid_small_comms <- leid_small_comms$membership

plot.igraph(karate, layout=layout.kk, vertex.color=V(karate)$leid_small_comms)

You can see that the higher resolution is starting to parse smaller bits of the network, here, individual nodes. A very high resolution will basically just put every node into its own community.

Random Walks: Walktrap

Walktrap community detection locates subgraphs via random walks. Nodes that are connected via short random walks will be clustered together.

A random walk is exactly as it sounds: Imagine traversing nodes randomly along available edges.

Infomap community detection also uses random walks.

cluster_walktrap(g) in igraph detects communities useing walktrap community detection.

wt_comms <- cluster_walktrap(karate)

V(karate)$wt_comms <- wt_comms$membership

plot.igraph(karate, layout=layout.kk, vertex.color=V(karate)$wt_comms)

Blockmodels

Another way of thinking about groups is to consider how actors or nodes are related based on the patterns of ties that they share. So, instead of being grouped by some type of proximity, groups are formed based on similarity. For example, we could use structural equivalence as a criteria for groupiness. Two nodes are in the same group if they have the exact same ties to one another, line cooks may have the exact same ties to chefs, managers, and other line cooks. Groups based on tie patterns are called blockmodels and this form of blockmodel based on structural equivalence is its most restrictive form. Obviously, these kinds of groups are pretty rare.

Most common blockmodels are relaxed forms of structural equivalence and allow us to ask which nodes are in similar positions. Blockmodels based on regular equivalence capture groups of nodes that share similar patterns of nodes, but not the exact same nodes such that a prep cook and a line cook could be in the same group if they are connected to similarly structured others (e.g., the same manager, different supervising chef, other prep or line cooks). This has implications for hierachies and for the distributions of resources, such as information.

The hierarchical aspect has been particularly useful in understanding organziations and the world system.

igraph doesn’t implement blockmodels. So we will use sna package instead.

First, we use asNetwork(g) from intergraph to switch from igraph to sna.

knet <- asNetwork(karate)

Blockmodeling involves two major steps.

  1. “A partition of actors in the network into discrete subsets called positions.”
  2. “For each pair of positions a statement of presence or absence of a tie or between positions on each of the relations.” (Wasserman and Faust)

After plotting the network in sna using the plot.network function, we can use equiv.clust(g) to locate clusters. equiv.clust “computes the distances between all pairs of positions” and plot the solution which is a dendogram. We can inspect logical places to slice the solution.

network::plot.network(knet) 

kclus <- sna::equiv.clust(knet, mode="digraph")

plot(kclus)

After inspecting the dendogram, we can construct our blockmodel with a 4 block solution given the break around height=25.

blockmodel(g, k=[# of Blocks]) in sna constructs blockmodels.

kbm <- sna::blockmodel(knet, kclus, k=4)        

kbm
## 
## Network Blockmodel:
## 
## Block membership:
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  1  2  2  2  3  3  3  2  3  3  3  2  2  2  3  3  3  2  3  2  3  2  3  3  3  3 
## 27 28 29 30 31 32 33 34 
##  3  3  3  3  3  3  4  4 
## 
## Reduced form blockmodel:
## 
##   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 
##           Block 1    Block 2    Block 3   Block 4
## Block 1       NaN 1.00000000 0.28571429 0.0000000
## Block 2 1.0000000 0.28888889 0.02380952 0.1500000
## Block 3 0.2857143 0.02380952 0.07619048 0.5714286
## Block 4 0.0000000 0.15000000 0.57142857 1.0000000
sna::plot.blockmodel(kbm)   

kbm.mem <- kbm$block.membership

We can also plot the block in igraph.

blks <- data.frame(id=kbm$order.vector, block=kbm$block.membership)

blks <- blks[order(blks$id),]

V(karate)$blocks <- blks$block

plot(karate, layout=layout.kk, vertex.color=V(karate)$blocks)

Compare Modularity

How do we evaluate solutions in terms of community detection. While some of the algorithms intentionally maximize modularity, they may not be the most maximizing. We can compare modularity across solutions using modularity(g, membership, weights = NULL, ...) in igraph.

mod.df <- data.frame(louvain=modularity(karate, lv_comms$membership), fast_greedy=modularity(karate, fg_comms$membership), leiden_large=modularity(karate, leid_large_comms$membership), leiden_small=modularity(karate, leid_small_comms$membership), walktrap=modularity(karate, wt_comms$membership), blocks=modularity(karate, kbm.mem))

knitr::kable(mod.df)
louvain fast_greedy leiden_large leiden_small walktrap blocks
0.4151052 0.3806706 -0.0498028 0.3739316 0.3532216 0.1245069

Example: Legislative Networks

Next, we can include a couple of examples that put community detection to good use. There are, of course, many examples at this point.

Example: Scientific Consensus (Shwed and Bearman 2010)

  • 1.Construct a citation network among literature corpus on identified topic.
  • 2.Identify “epistemic period” & “slice” network.
  • 3.Plot citation network modularity (changes) across (2).
  • 4.Declines/equilibrium identify arrival of scientific consensus

Example: Scientific Consensus (adams and Light 2015)

  • 1.Construct a citation network among literature corpus on identified topic.
  • 2.Identify “epistemic period” & “slice” network.
  • 3.Plot citation network modularity (changes) across (2).
  • 4.Declines/equilibrium identify arrival of scientific consensus