Community Detection in Network Analysis: Part 1

2 Likes

In network analysis, community detection refers to clustering or grouping nodes within a network in such a way to reveal important patterns or relationships within the network. In practice, community detection can be used to:

Develop and/or improve recommendation engines
Perform market basket analyses by compiling receipt data from a retail organization
Analyze text, call, and/or social media communications
Analyze financial transactions (i.e., deposits and withdrawals from bank accounts)
Aggregate and interpret the results from survey questions
and plenty more!

What each of these examples have in common is that some, if not all, of the underlying data can be constructed as a network comprised of nodes and links. Nodes and links can also be referred to as vertices and edges, respectfully. For consistency I'll refer to them exclusively as nodes and links.

Network Basics: Review

A node is a general term used to denote and actor or an object in a network. Visually, nodes are depicted as circles on a graph. A link represents an interaction between two nodes in the network, and are visually depicted as the lines connecting two nodes.

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

In the example above, the link is undirected, implying a reciprocal association between the two nodes. For example, if I'm connected to you on LinkedIn, it's implied that you're also connected to me on LinkedIn.

But links can also be directed, implying an asymmetrical association between two nodes. For example, just because a direct flight exists from RDU to JFK does not imply that a direct flight also exists from JFK to RDU. For directed networks, the link direction is visually expressed as an arrow from one node to another.

For broader context, the graph below depicts an undirected network containing 3,715 nodes and 5,368 links.

Depending on the type of network you're working with, the nodes and links could represent a variety of different types of relationships. For example:

In a streaming service network:
- Nodes: All of the shows currently available to watch
- Links: A link exists between two nodes if a person watched both shows

In a receipt network:
- Nodes: All of the SKUs offered for sale by a retailer
- Links: A link exists between two nodes if a person purchased both SKUs in the same transaction

In a social network:
- Nodes: All of the individual accounts (i.e., if you have a LinkedIn account, you are one node on the LinkedIn social network)
- Links: A link exists between two nodes if two individuals are connected on the social network

In a telecommunications network:
- Nodes: All of the unique cell phone numbers
- Links: A link exists between two nodes if a form of communication (i.e., text message) exists between two cell phone numbers

In a survey network:
- Nodes: All of the survey options available to choose from (e.g., "Out of all of these ice cream flavors, which ones do you like?")
- Links: A link exists between two nodes if a survey participant likes both ice cream flavors

In the SAS Education course network:
- Nodes: All of the SAS courses (see the list here: https://learn.sas.com/)
- Links: A link exists between two nodes if one person has taken both SAS courses

In the University course network:
- Nodes: All of the classes offered by the university during a given semester
- Links: A link exists between two nodes if one student is enrolled in both classes

In all of these cases, numerous insights can be gained by grouping these nodes (i.e., streaming shows, SKUs, individuals, survey responses, courses, etc.) using community detection. Additionally, the links can be weighted to indicate the strength of the relationship.

Ice Cream Flavor Survey Network

Below is an example of a survey network, where 20,000 individuals were asked to select which ice cream flavors they liked. Each person was allowed to choose as many (or as few) of the ice cream flavors as they liked.

The nodes represent the ice cream flavors. The links represent whether individuals tend to prefer both flavors, and the link weights indicate the strength of those relationships. The links are undirected, since if you like Key Lime and Vanilla, it's implied that you also like Vanilla and Key Lime. In this example, the link weights range from 0 to 1. The larger the link weight, the stronger the association in preference between the two flavors.

For example, individuals who prefer Chocolate tend to prefer Peanut Butter (PB) more than Peach, Vanilla, Blueberry, or Strawberry. Since Chocolate isn't connected to Key Lime or Sherbet, then individuals who prefer Chocolate don't tend to prefer Key Lime or Sherbet at all.

Introduction to Community Detection

The objective of community detection is to partition a network into n distinct communities, so that each community is as densely connected as possible. One of the key factors of community detection is that a node can belong to only one community.

In reality, a network can be partitioned numerous different ways. To evaluate the quality of the community partitions, the default community detection algorithm in the NETWORK procedure (i.e., the Louvain algorithm) uses a measure called modularity.

Modularity (M) is a global network measure that quantifies the quality of the community partitions. Modularity can range between -1 and +1, where a higher modularity implies better, more densely connected communities.

The Louvain algorithm in the NETWORK procedure is the default community detection algorithm, which partitions a network into n distinct communities to maximize modularity (M). Each partition (i.e., community) receives its own modularity score (Mc), which for unweighted networks is calculated as:

For weighted networks, like the Ice Cream Flavor Survey network, the weights are integrated into the above formula. For both weighted and unweighted networks, summing each community's modularity score together (Mc1 + Mc2 + ... + Mcn) equals the total modularity (M) score across the entire network. The code below creates the mycas.icFlavors table representing the weighted undirected Ice Cream Flavor Survey network, and in only six lines of code, the NETWORK procedure performs community detection on the network and outputs two tables for us to view the results.

data mycas.icFlavors;
 length from $10 to $10;
 input from $ to $ weight;
datalines;
Blueberry Chocolate 0.12752
Blueberry Peach 0.26923
Blueberry Strawberry 0.25455
Blueberry Vanilla 0.10526
Chocolate PB 0.27917
Chocolate Peach 0.11429
Chocolate Strawberry 0.20732
Chocolate Vanilla 0.13187
KeyLime Sherbet 0.20
KeyLime Vanilla 0.19828
Peach Strawberry 0.28761
Peach Vanilla 0.12707
Sherbet Vanilla 0.28448
Strawberry Vanilla 0.16901
;

proc network
 links = mycas.icFlavors
 outNodes = mycas.commNodeOut;
   community
    outLevel = mycas.commLevelOut;
run;

From the mycas.commLevelOut table, the results show that modularity (M) is maximized at 0.28071 by partitioning the network into three communities:

The mycas.commNodeOut table shows the community partitions that maximize modularity (M):

Notice how the strength of the link weights, along with the underlying network topology, both play an integral role in the optimal community partition.

---> Chocolate's strongest connection is to PB, and vice versa

---> Strawberry's strongest connections are to Peach and Blueberry, and vice versa

---> Vanilla's strongest connections are to Key Lime and Sherbet, and vice versa

These principles of community detection remain consistent no matter if your network contains eight nodes, like the Ice Cream Flavor Survey network, or eight million nodes!

Conclusion + Next Steps

By visualizing how community detection works on a small network first, hopefully this post has served to build intuition around how to apply and interpret the results from the community detection algorithm inside of PROC NETWORK. Integrating these results into broader recommendation engines, market basket analyses, etc. can further drive significant insights into your client's business operations that are difficult, if not impossible, to replicate using tabular data alone.

If your new to network analysis and want to learn more, our e-Learning course titled Network Analysis and Network Optimization in SAS Viya is a great place to start. Spoiler alert: in that course, you'll not only learn how to calculate the link weights in the Ice Cream Flavor Survey network, you'll create that entire network from scratch directly from raw survey data.

In the next post we'll use the Zachary Karate Club network (see below) to dive deeper into the set of options for performing community detection inside of the NETWORK procedure.

You'll notice the Zachary Karate Club network below is comprised of two communities, but what if your client has explicitly requested the network be partitioned into three communities? Or five communities? Also, what if your client wants to force certain nodes into the same community, or force specific nodes to be in different communities? How can we accommodate these requirements into the community detection algorithm?

Stay tuned for Part 2 to learn how!

Find more articles from SAS Global Enablement and Learning here.