Maybe you’ve heard of the six degrees of separation theory: that everyone in the world is connected through six or fewer other people. The network analysis in SAS Visual Analytics helps you glean valuable information from even highly complex network data. This article will show you how to get started! Check out other articles in this series: Sensitivity & Specificity in Disease Testing: Part 1 Statistical Concepts in the Time of Coronavirus...and SAS Helps You Understand Disease Spread: Part 2 Biostat Concepts in the Time of Coronavirus.
Or maybe you’ve played the game of “6 degrees of Kevin Bacon,” where you name an actor and try to link that actor to Kevin Bacon through people who’ve been in films together. For example, if I select Kerry Washington, I figure out that I can get to Kevin Bacon in 3 leaps (links) through 2 people (nodes). Kerry Washington was in "Little Fires Everywhere" with Reese Witherspoon. Reese Witherspoon was in "Legally Blonde" with Luke Wilson. Luke Wilson was in "My Dog Skip" with Kevin Bacon. Boom. Done.
But wait! Is there a shorter path? Kerry Washington was in "The Last King of Scotland" with James McAvoy (who, incidentally, donated £275,000 to the UK National Health Service (NHS) for PPE). James McAvoy was in "X-Men: First Class" with Kevin Bacon. Even shorter! Along 2 links and through just one person.
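The search for the shortest chain can be sketched with a breadth-first search. This is only an illustration in Python, not how SAS Visual Analytics works internally; the co-star pairs come from the examples above, and the function names are made up for this sketch.

```python
from collections import deque

# Co-starring links from the examples in the text (undirected).
CO_STARS = [
    ("Kerry Washington", "Reese Witherspoon"),
    ("Reese Witherspoon", "Luke Wilson"),
    ("Luke Wilson", "Kevin Bacon"),
    ("Kerry Washington", "James McAvoy"),
    ("James McAvoy", "Kevin Bacon"),
]

def build_graph(pairs):
    """Turn a list of pairs into an undirected adjacency mapping."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

path = shortest_path(build_graph(CO_STARS), "Kerry Washington", "Kevin Bacon")
print(path, "links:", len(path) - 1)
```

Because breadth-first search explores all one-link neighbors before any two-link neighbors, it finds the James McAvoy route (2 links) rather than the longer route through Reese Witherspoon and Luke Wilson.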
But neither of these gives the full picture. Network analysis is complex; people or other entities can be linked in many intricate ways.
If we try to see too many links at once, we get essentially a hairball. The more complex the network, the more you will need network analytics to gain insights and make good decisions.
How people are linked is very important to the spread of disease. For diseases that can spread through droplets in the air, how much time people spend together and how close they are both affect spread. Also, if the infector is actively talking, singing, laughing, and especially coughing and/or sneezing around the infectee, that can reduce the time needed for the infectee to receive a dose of the virus sufficient to contract the disease. Read more about disease spread in my previous article.
A network analysis may be ungrouped or hierarchical.
Data for ungrouped network analysis must have one row for each source-target pair. For information on preparing data for network analysis, see SAS Education’s course by Nicole Ball and Lynn Matthews, Visual Analytics 2 for SAS Viya: Advanced (YVA285) https://support.sas.com/edu/schedules.html?id=17167&ctry=US, Lesson 4: Performing Network Analysis.
Node: An entity, e.g., an individual; also called a vertex.
Link: A connection between nodes; also may be called ties or edges. Links may be directed (asymmetric) or undirected (symmetric).
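To make the "one row per source-target pair" layout and the directed/undirected distinction concrete, here is a minimal Python sketch. The column names (`source`, `target`, `duration`) and the sample rows are assumptions made for illustration; they are not taken from the article's data set.

```python
# Hypothetical rows: one row per source-target pair.
rows = [
    {"source": "Olivia", "target": "Ava", "duration": 12},
    {"source": "Olivia", "target": "Ben", "duration": 25},
    {"source": "Chuck", "target": "Wilson", "duration": 40},
]

def to_adjacency(rows, directed=False):
    """Build a node -> set-of-neighbors mapping from link rows."""
    adjacency = {}
    for row in rows:
        source, target = row["source"], row["target"]
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set())  # every node gets an entry
        if not directed:
            # Undirected (symmetric) link: store it in both directions.
            adjacency[target].add(source)
    return adjacency

print(to_adjacency(rows))                 # symmetric links
print(to_adjacency(rows, directed=True))  # source -> target only
```

In the directed version, Ava has no outgoing link to Olivia; in the undirected version the link is shared by both nodes.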
Let’s look first at an ungrouped example.
In an ungrouped network, there can be many links among many entities. Let’s use a hypothetical data set representing face-to-face conversations between individuals. Each individual is represented by a node, and each individual may be both a talker and a listener. The links represent the conversations. Most of the individuals are associated with one of four care centers.
We open a new page and drag the Network analysis object to the canvas. We assign as follows:
Link Width: Duration
The Disconnected Network ID can also be added to the Color role. It provides a label and color code for each separate disconnected group. Below we can see that most of the people are in the yellow network ID 1. However, two individuals, Chuck and Wilson, are stranded on an island, with no one to talk to but each other, and they are in the blue network ID 0.
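The idea behind a disconnected network ID can be sketched as a connected-components search. This Python sketch is an assumption about the general technique, not a description of how SAS assigns its IDs; apart from Chuck, Wilson, and Olivia, the names are hypothetical.

```python
def component_ids(adjacency):
    """Label each node with the ID of the connected group it belongs to."""
    ids = {}
    next_id = 0
    for start in adjacency:
        if start in ids:
            continue
        # Flood-fill everything reachable from this unlabeled node.
        ids[start] = next_id
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in adjacency[node]:
                if neighbor not in ids:
                    ids[neighbor] = next_id
                    stack.append(neighbor)
        next_id += 1
    return ids

# Undirected toy network: Chuck and Wilson only talk to each other.
adjacency = {
    "Chuck": {"Wilson"},
    "Wilson": {"Chuck"},
    "Olivia": {"Ava", "Ben"},
    "Ava": {"Olivia"},
    "Ben": {"Olivia"},
}
print(component_ids(adjacency))
```

In this sketch IDs are handed out in the order groups are discovered, so the numbering itself carries no meaning; only membership in the same group does.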
Communities are highly connected clusters of nodes. Community is a derived attribute that is automatically created by the network analysis object. Community can be added to the Color role as shown below. Below we see 5 distinct communities, including one that is completely disconnected.
Reach and closeness provide information about links. Stress and betweenness provide information about the nodes (e.g., entities or individuals). All of the metrics are based on the shortest path.
Reach is the number of links between a node and the farthest connected node (on the shortest path). The range includes whole numbers greater than or equal to 0.
Let’s filter our network to get a smaller group to illustrate this. Here we have filtered to view only the OpulentCare group, which coincides with one of our communities. We further filter to include only those conversations of 10 minutes or greater.
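Outside the tool, the same two filters amount to keeping only the link rows that match a group and a minimum duration. A minimal Python sketch follows; the column names and every value except the group name OpulentCare are hypothetical.

```python
# Hypothetical conversation rows; "OtherCenter" is a made-up group name.
rows = [
    {"source": "Olivia", "target": "Ava", "group": "OpulentCare", "duration": 12},
    {"source": "Olivia", "target": "Ben", "group": "OpulentCare", "duration": 4},
    {"source": "Dana", "target": "Eli", "group": "OtherCenter", "duration": 30},
]

def filter_rows(rows, group, min_minutes):
    """Keep conversations within one group lasting at least min_minutes."""
    return [
        row for row in rows
        if row["group"] == group and row["duration"] >= min_minutes
    ]

kept = filter_rows(rows, "OpulentCare", 10)
print(kept)
```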
From our Roles pane, we select Reach Centrality for node Size. Most of the nodes have a Reach Centrality of 2, i.e., the shortest distance to the farthest node is 2 links. Only one node has a Reach Centrality of 1, because that node is directly connected to every other node, just one hop away.
From the Options pane, let’s add data labels. We see that Olivia is the one who has the lowest reach centrality.
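Reach as defined above (the number of links to the farthest connected node along shortest paths) can be sketched with a breadth-first search. The small network below is hypothetical, built so that Olivia is one hop from everyone; it is not the article's data.

```python
from collections import deque

def reach(adjacency, source):
    """Links from source to its farthest connected node (0 if isolated)."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return max(distance.values())

# Hypothetical undirected network; only the name Olivia is from the text.
adjacency = {
    "Olivia": {"Ava", "Ben", "Cara"},
    "Ava": {"Olivia", "Ben"},
    "Ben": {"Olivia", "Ava"},
    "Cara": {"Olivia"},
    "Loner": set(),
}
for name in adjacency:
    print(name, reach(adjacency, name))
```

Olivia gets a reach of 1, the other connected nodes get 2, and the isolated node gets 0, which is why the range starts at 0.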
Closeness centrality measures how near an entity is, along shortest paths, to every other entity in a network. This metric is normalized to a range from 0 to 1, with 1 being the highest closeness. Thus the highest number of links is normalized to 0 (0 closeness) and the lowest number of links is normalized to 1. (For more information on normalizing, see Changing the Scale: Transforming Data.)
In real-world examples of disease spread, those entities with high closeness scores could be considered “broadcasters” or “superspreaders” of information or of disease. We see that Olivia has the maximum closeness score of 1.
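One way to read the normalization described above is as an inverted min-max scaling of each node's total shortest-path distance to the others. The Python sketch below follows that reading; it is an assumption for illustration and may differ from the exact formula SAS Visual Analytics uses. The network is hypothetical.

```python
from collections import deque

def total_distance(adjacency, source):
    """Sum of shortest-path link counts from source to reachable nodes."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return sum(distance.values())

def closeness(adjacency):
    """Inverted min-max scaling: fewest total links -> 1, most -> 0."""
    totals = {node: total_distance(adjacency, node) for node in adjacency}
    low, high = min(totals.values()), max(totals.values())
    if high == low:  # every node is equally close
        return {node: 1.0 for node in totals}
    return {node: (high - t) / (high - low) for node, t in totals.items()}

# Hypothetical connected network; only the name Olivia is from the text.
adjacency = {
    "Olivia": {"Ava", "Ben", "Cara"},
    "Ava": {"Olivia", "Ben"},
    "Ben": {"Olivia", "Ava"},
    "Cara": {"Olivia"},
}
print(closeness(adjacency))
```

Here Olivia needs the fewest total links (3) and scores 1, Cara needs the most (5) and scores 0, and Ava and Ben land in between.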
Stress indicates how much shortest-path traffic passes through a node. Specifically, stress centrality identifies the nodes that are crossed most frequently (when taking the shortest paths between nodes), regardless of where that stress originates. Its value is normalized to a range from 0 to 1, so the node (or nodes) crossed most frequently, i.e., the most trafficked, has a stress centrality of 1.
Like stress centrality, betweenness centrality identifies the nodes that are crossed most frequently using shortest distance paths, and it also ranges from 0 to 1. However, betweenness centrality also accounts for multiple shortest paths between two nodes.
Again, at least one node in the network has a value of 1. The highest betweenness scores identify nodes that are critical to lots of origin-destination pairs, but are not necessarily stressed the most. Betweenness measures the number of shortest paths an entity is on, which in turn indicates how often entities can reach each other through it. A high score indicates a likely path for flow of whatever is being measured, such as information or disease.
If these nodes represent people who have face-to-face conversations, based on the centrality measures alone, which node has the potential to spread disease more than the other nodes?
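The difference between stress and betweenness shows up when a pair of nodes has more than one shortest path. The Python sketch below uses the textbook definitions on a small hypothetical undirected network and divides by the maximum to get a 0 to 1 range; the scaling choice is an assumption, and SAS Visual Analytics may compute or normalize these differently.

```python
from collections import deque
from itertools import combinations

def bfs_counts(adjacency, source):
    """Shortest distances and number of shortest paths from source."""
    distance, paths = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                paths[neighbor] = 0
                queue.append(neighbor)
            if distance[neighbor] == distance[node] + 1:
                paths[neighbor] += paths[node]
    return distance, paths

def stress_and_betweenness(adjacency):
    """Raw stress (path counts) and betweenness (path fractions)."""
    info = {node: bfs_counts(adjacency, node) for node in adjacency}
    stress = {node: 0 for node in adjacency}
    between = {node: 0.0 for node in adjacency}
    for s, t in combinations(adjacency, 2):
        dist_s, paths_s = info[s]
        dist_t, paths_t = info[t]
        if t not in dist_s:
            continue  # s and t are not connected
        for v in adjacency:
            if v in (s, t) or v not in dist_s or v not in dist_t:
                continue
            if dist_s[v] + dist_t[v] == dist_s[t]:  # v is on a shortest path
                through = paths_s[v] * paths_t[v]
                stress[v] += through                 # counts every path
                between[v] += through / paths_s[t]   # shares credit among paths
    return stress, between

def scale_to_max(scores):
    """Divide by the largest score so the top node gets 1."""
    top = max(scores.values())
    return {node: (value / top if top else 0.0) for node, value in scores.items()}

# Hypothetical network: a diamond A-B-D-C-A with E hanging off D.
adjacency = {
    "A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"},
    "D": {"B", "C", "E"}, "E": {"D"},
}
stress, between = stress_and_betweenness(adjacency)
print(scale_to_max(stress))
print(scale_to_max(between))
```

In this toy network A and D are linked by two equally short paths, one through B and one through C. Stress counts each path in full for the node it crosses, while betweenness gives B and C half credit each for that pair, so the two measures rank the middle nodes on different scales even though D tops both.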
For more details and information on how these metrics are calculated, see this video.
Hierarchical analysis requires a standard hierarchy structure of categorical values. Organizational structure or regional structure are common hierarchies.
For this illustration, let’s say that we know that the US Centers for Disease Control (CDC) spends $5.7 billion on contracts a year. CDC lies within the Department of Health and Human Services (HHS). Maybe we want to get a picture of the overall US contract spending. We can do this with a hierarchical network.
Hierarchical networks are commonly fairly self-explanatory. For example, the US Spending data includes four hierarchical levels: Country, Type, Department, and Branch. These are shown below in a List Table.
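Conceptually, a hierarchical network links each value to the value one level below it in the same row. The Python sketch below illustrates that idea using the four level names from the text; the sample rows (including the Type values and the NIH and Army branches) are hypothetical stand-ins, not the article's data.

```python
LEVELS = ["Country", "Type", "Department", "Branch"]

# Hypothetical rows shaped like the four-level hierarchy in the text.
rows = [
    {"Country": "US", "Type": "Civilian", "Department": "HHS", "Branch": "CDC"},
    {"Country": "US", "Type": "Civilian", "Department": "HHS", "Branch": "NIH"},
    {"Country": "US", "Type": "Military", "Department": "Defense", "Branch": "Army"},
]

def hierarchy_links(rows, levels):
    """Collect unique (parent, child) links between adjacent levels."""
    links = set()
    for row in rows:
        for parent_level, child_level in zip(levels, levels[1:]):
            links.add((row[parent_level], row[child_level]))
    return links

for parent, child in sorted(hierarchy_links(rows, LEVELS)):
    print(parent, "->", child)
```

This simple version identifies nodes by name alone, so two different parents with an identically named child would share one node; a fuller version would key each node by its whole path.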
You can create a hierarchy in SAS Visual Analytics easily, by selecting + New data item, Hierarchy.
Then double-click each item in the order you want the hierarchy, or use the plus-arrow icon to move the items to the right.
This creates your hierarchy.
Use the + tab to create a new page. From the Objects pane, drag your Network analysis icon to the canvas.
Open the Options tab on the right and select Type, Hierarchical.
In the Roles tab on the right, under Levels, select your hierarchy.
Return to the Options pane and use the slider to change Additional levels to the maximum (in this case 3). Under Network Diagram (also in the Options pane), check Data labels, and change the Text style to font size 11, Bold. Voila, you have a hierarchical network diagram.
You can add information using the Roles tab, for example, setting Size to ContractSpendingMillions and Color to NumberOfPersonnel.
Notice that if you don’t like where items are displayed, you can select a node and move it. If you roll over an item, you will see the data tips. Here we see in the data tip below, for example, annual Contract Spending for the US Department of Defense is $358,300,000,000.
The data used here are publicly available from the internet. Defense contracts may be for services (e.g., operations, maintenance, R&D) or products (e.g., planes, radios). Examples of large contractors are companies like Lockheed Martin, Boeing, General Dynamics, Raytheon, United Technologies, and Huntington. If you want to create your own hierarchical network to see which contractors get the most US tax dollars, go ahead! You can find the details on the internet, and now you have the skills to create a hierarchical network!