BookmarkSubscribeRSS Feed

Introduction to the PageRank Algorithm: Ranking Sports Teams in SAS Viya

Started ‎09-19-2023 by
Modified ‎09-19-2023 by
Views 728

The objective of this article is to introduce the PageRank algorithm, and learn how it can be used to rank local, college, and professional sports teams in a league. This article describes the intuition behind the algorithm, the matrix algebra for computing PageRank scores (i.e., team rankings), and demonstrates two ways to execute the algorithm in SAS Viya.

 

The PageRank algorithm was originally developed by Google's co-founders Larry Page and Sergey Brin to rank the importance of web pages on the internet. The algorithm is based on the premise that the internet can be represented as one giant network, where the nodes represent web pages, and the directed links represent hyperlinks connecting one web page to another.

 

For example, on the https://www.mlb.com webpage, at the bottom is a hyperlink connecting to the Boys and Girls Clubs of America (https://www.bgca.org/).

 

JL_1_PageRank-png-01.png

 

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

 

Constructed as a directed network, that's equivalent to an outbound link from https://www.mlb.com (node A) pointing to https://www.bgca.org (node B). JL_2_PageRank-png-02.png

 

For ranking the importance of nodes in a network, especially for giant networks with numerous inbound and outbound links, one way that nodes could be ranked is to simply count the number of inbound links each node has. The idea is that the more inbound links a node has, the more important it is, and as a result, it should receive a higher rank compared to other nodes in the network.

 

This ranking method is called degree centrality, and it's a simple, straightforward way to rank node importance. In some networks it can be quite effective. Kim Kardashian currently has over 350 million followers (i.e., over 350 million inbound links) on social media, and it's why companies pay her hundreds of thousands of dollars to promote their product line(s). The more followers she has, the larger of an audience she reaches when posting on social media. More followers = more inbound links = higher in-degree centrality = more important in that network, at least from a marketing standpoint.

 

PageRank centrality works differently. For the PageRank algorithm, the number of inbound links is only part of the story.

 

Consider a Winner network, where each node represents a team in a college or professional sports league. Inbound links represent wins, and outbound links represent losses.

JL_3_PageRank-png-03.png

 

Counting the inbound and outbound links for each node simply counts the number of wins and losses for each team, which, if you recall from a moment ago, is equivalent to a team’s in- and out-degree centrality.

 

JL_4_PageRank-png-04.png

 

Newspapers and sports websites have been displaying standings tables like these forever. However, notice that by constructing the teams, wins, and losses as a network, we get some additional information that the standings tables (i.e., in- and out-degree centrality measures) ignore. Constructed as a network, not only can wins and losses be counted, but the network also tells us who defeated whom. This is critical to understanding how the PageRank algorithm works, because, unlike the standings tables that only tally wins and losses, the PageRank algorithm does not consider all inbound and outbound links to be equally as important to a team's ranking.

 

For example, defeating a strong opponent (i.e., one with many inbound links, and few outbound links) improves a team's ranking more than defeating a weaker opponent. Similarly, losing to a strong opponent penalizes a team's ranking less than losing to a weaker opponent.

 

In other words, the PageRank algorithm not only considers the number of in- and outbound links (i.e., wins and losses) each team has, it also factors in the quality of each team's wins and losses when determining its rank.

 

In the PageRank algorithm, a fundamental tradeoff exists between these two factors (i.e., total wins and losses vs. opponent quality). There's a famous quote by economist, Thomas Sowell, "There are no solutions, there are only trade-offs; and the goal is to find the best trade-off you can." While Sowell was referring to economic policy, it's equally true for the PageRank algorithm.

 

So, then, how does the PageRank algorithm determine which factor is more important to a team’s ranking? The answer is that it depends on what you (yes, you... this is a user-defined parameter) set the damping parameter, alpha, to be. 

 

JL_5_PageRank-png-05.png

 

The closer alpha is to 1, the more the emphasis is placed on opponent quality. When alpha is set to 0.85, notice below that Team C (1 win, 1 loss, .500 winning percentage) has a higher PageRank value than Team E (6 wins, 3 losses, .667 winning percentage).

 

The reason is Team C's lone win is against the best team in the league (i.e., the one with the most inbound links (wins) and fewest outbound links (losses)), Team B. Even though Team E has a higher winning percentage, the overwhelming majority of Team E's wins are against inferior opponents.

JL_6_PageRank-png-06.png

 

As the damping parameter, alpha, approaches 0, more emphasis is placed on total inbound links (i.e., wins), and less on opponent quality. For example, when alpha is changed to 0.25, Team E is now ranked higher than Team C, simply because Team E has more overall wins (i.e., inbound links), regardless of opponent quality.

JL_7_PageRank-png-07.png

 

In its original context, alpha, represented the probability of traversing the internet network via hyperlink, and (1 - alpha) represented the probability of "teleporting" from one node to the next. Think about it: we don't always traverse the internet by hopping from hyperlink to hyperlink. There's nothing preventing us from going to the search bar, typing in our desired web page address, and "teleporting" across the network to the website of our choice.

 

The point here is twofold. First, the PageRank algorithm can be used on a variety of different types of networks to rank node importance. Second, depending on what the nodes and links in the network represent, the damping parameter can take on different interpretations, but the underlying mechanics for ranking nodes using the PageRank algorithm remain consistent.

 

Ranking nodes using the PageRank algorithm has never been easier in SAS Viya. The NETWORK procedure includes a number of graph theory and network analysis algorithms to augment data mining and machine learning approaches. PageRank is one of the many network algorithms embedded within the NETWORK procedure, and it's easy to use.

 

The code below reads in a directed edge list, computes PageRank centrality, and outputs the results to a new CAS table called PageRankScores. The pageRankalpha option allows you to set the damping parameter to a user-defined level between 0 and 1.

 

proc network
  direction = directed
  links = mycas.linksTable
  outNodes = mycas.PageRankScores;
     centrality
       pageRank = unweight
       pageRankalpha = 0.85;
run;

 

Below are the results from the CAS table defined in the outNodes= statement. Nodes are sorted by most important node. While not shown below, sometimes it’s preferred to normalize the PageRank scores by dividing each score by the largest value, so the node with the largest PageRank score equals 1, and all other nodes receive a proportion of its value.

 

JL_8_PageRank-png-08.png

 

Similar to other AI and machine learning algorithms, behind the scenes, the PageRank algorithm is computed using linear algebra. SAS has a rich and comprehensive matrix programming language called SAS IML (Interactive Matrix Language), allowing users to build and customize AI and ML applications.

 

Within the iml Action in SAS Viya, the code below uses the methodology proposed in Langville and Meyer, 2012 to construct a custom pageRank( ) function to compute PageRank scores from an adjacency matrix.

 

start pageRank(AdjMatrix, alpha=0.85, trProbVec=);

   n = nrow(AdjMatrix); 

    if isskipped(trProbVec) then do; 
        e = j(n,1,1);  
       eT = t(e);     
       vT = (1/n)*eT; 
       customTeleVec = 'No';
    end;
    else if ^isskipped(trProbVec) then do; 
       vT = trProbVec; 
       customTeleVec = 'Yes';
    end;

   H = AdjMatrix / (j(n,1,1e-12) <> AdjMatrix[,+]); 

   leafBinary=j(n,1,0); /* set leafBinary[i] = 0 for all i */

   if ^isempty(loc(AdjMatrix[,+]=0)) then leafBinary[loc(AdjMatrix[,+]=0)]=1; 

   pi = vT; /* initialize page rank vector to equal vT */

   maxIters = 1e8;
   tolerance = 1e-12;
   i = 0; /* initialize iteration scalar */ 

   t = time();

     do while (i < maxIters); /* power method */
      i = i + 1; 
	   prevpi = pi;
	   pi = alpha * pi * H + (alpha * (pi * leafBinary) + 1 - alpha) * vT;
           delta = distance(pi, prevpi) / max(prevpi[+,]);
	     if delta < tolerance then do;
	       iters = i;
	       i = maxIters;
	       pageRankVec = t(pi);
	     end;
      end;

   time = time() - t; /* calculate total time */

   return(pageRankVec);
finish;

 

Given an adjacency matrix, A, representing the nodes and links in a network, a new matrix, H, is created by converting the elements (i.e., links) into transition probabilities by dividing by the row sum (i.e., total outbound links) for each node. The result is a new, row stochastic matrix, H, where each row sums to 1.JL_9_PageRank-png-09.png

But wait a second... the first row in H doesn't sum to 1! It turns out that in this network, the node corresponding to row 1 in the adjacency matrix doesn't have any outbound links, so dividing each element in row 1 by the row sum is equivalent to dividing 0 by 0. This node is called a dangling node. In the Winner network above, node A is an example of a dangling node because it has no outbound links (i.e., losses).

 

In a connected network, dangling nodes teleport with equal probability (1/n*eT) to each node in the network, where eT is a row vector of ones and n is the total number of nodes in the network. In other words, for dangling nodes, every element receives the value (1/n*eT), meaning that each dangling node is equally likely to teleport to any other node in the network.

 

To accommodate dangling nodes, a new matrix, S, is created, where S = H + a(1/n * eT). The elements in the nx1 binary vector, a, equal one if the node is a dangling node, and zero otherwise. Therefore, a(1/n * eT) is applied only to the dangling nodes in the network. If the network has no dangling nodes, the equation effectively reduces to S = H.

 

JL_10_PageRank-png-10.png

 

Working through the equation, now every row in S sums to 1 regardless of its dangling node status.

JL_11_PageRank-png-11.png

 

The damping parameter, alpha, is introduced by converting the S matrix to the G matrix, where G = alpha * S + (1 - alpha) * 1/n * e * eT. The alpha value used to create the G matrix below is 0.85.

 

JL_12_PageRank-png-12.png

 

Notice by converting S to G, transition probabilities are dampened based on the value of alpha, so that all transition probabilities now contain non-zero values.

 

JL_13_PageRank-png-13.png

 

The PageRank scores are the dominant left-hand eigenvector of the G matrix, but because G is completely dense, and H is sparse, it’s computationally advantageous to solve G in terms of H, where G = alpha * H + (alpha * a + (1 - alpha) * e) * 1/n * eT. The SAS IML function created above uses the power method, which is an iterative technique, to find the dominant eigenpair from the matrix, and is one of several options available within the NETWORK procedure.

 

Inside of the IML action, the adjacency matrix from the Winner network is created, along with a vector of row and column names, and the pageRank() function is used to compute PageRank scores. 

 

nodeVec = {'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K'};

A = { 0 0 0 0 0 0 0 0 0 0 0, 0 0 1 0 0 0 0 0 0 0 0, 0 1 0 0 0 0 0 0 0 0 0, 1 1 0 0 0 0 0 0 0 0 0, 0 1 0 1 0 1 0 0 0 0 0, 0 1 0 0 1 0 0 0 0 0 0, 0 0 0 0 1 0 0 0 0 0 0, 0 0 0 0 1 0 0 0 0 0 0, 0 1 0 0 1 0 0 0 0 0 0, 0 1 0 0 1 0 0 0 0 0 0, 0 1 0 0 1 0 0 0 0 0 0 }; PageRankVec = pageRank(A,0.85); PageRankVecNorm = PageRankVec / PageRankVec[<>]; print PageRankVec[rowname=nodeVec] PageRankVecNorm[rowname=nodeVec];

 

The output from the print statement below shows consistent results between the two approaches in SAS Viya.

JL_14_PageRank-png-14.png

 

Interested in learning more about the functionality packed into the NETWORK procedure, like community detection, shortest paths, network projection, pattern matching, cycles, etc., all with real world applications? Check out our Network Analysis and Network Optimization in SAS Viya training course.

 

See the full PROC NETWORK and SAS IML programs below.

 

SAS IML:  

 

cas mysess; 
libname mycas cas sessref=mysess;

proc cas;
 session mysess; 
 loadactionset 'iml'; /* load the IML action set */
 source PageRank;

 start pageRank(AdjMatrix, alpha=0.85, trProbVec=);

   n = nrow(AdjMatrix); 

    if isskipped(trProbVec) then do; 
	e = j(n,1,1);  
       eT = t(e);     
       vT = (1/n)*eT; 
       customTeleVec = 'No';
    end;
    else if ^isskipped(trProbVec) then do; 
      vT = trProbVec; 
      customTeleVec = 'Yes';
    end;

   H = AdjMatrix / (j(n,1,1e-12) <> AdjMatrix[,+]); 

   leafBinary=j(n,1,0); /* set leafBinary[i] = 0 for all i */

   if ^isempty(loc(AdjMatrix[,+]=0)) then leafBinary[loc(AdjMatrix[,+]=0)]=1; 

   pi = vT; /* initialize page rank vector to equal vT */

   maxIters = 1e8;
   tolerance = 1e-12;
   i = 0; /* initialize iteration scalar */ 

   t = time();

     do while (i < maxIters); /* power method */
       i = i + 1; 
	   prevpi = pi;
	   pi = alpha * pi * H + (alpha * (pi * leafBinary) + 1 - alpha) * vT;
           delta = distance(pi, prevpi) / max(prevpi[+,]);
	     if delta < tolerance then do;
	       iters = i;
	       i = maxIters;
	       pageRankVec = t(pi);
	     end;
      end;

   time = time() - t; /* calculate total time */
   print iters time;

    return(pageRankVec);
 finish;

  nodeVec={'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K'};

  A = { 0 0 0 0 0 0 0 0 0 0 0,
 	    0 0 1 0 0 0 0 0 0 0 0,
	    0 1 0 0 0 0 0 0 0 0 0,
	    1 1 0 0 0 0 0 0 0 0 0,
	    0 1 0 1 0 1 0 0 0 0 0,
	    0 1 0 0 1 0 0 0 0 0 0,
	    0 0 0 0 1 0 0 0 0 0 0,
	    0 0 0 0 1 0 0 0 0 0 0,
	    0 1 0 0 1 0 0 0 0 0 0,
	    0 1 0 0 1 0 0 0 0 0 0,
	    0 1 0 0 1 0 0 0 0 0 0 };

   PageRankVec = pageRank(A,0.85);

   PageRankVecNorm = PageRankVec / PageRankVec[<>];

   print PageRankVec[rowname=nodeVec] PageRankVecNorm[rowname=nodeVec];

 endsource;
 iml / code=PageRank nthreads=4;
run;

 

PROC NETWORK:

 

cas mysess; 
libname mycas cas sessref=mysess;

data mycas.linksTable;
 input from $ to $;
datalines;
B C
C B
D A
D B
E B
E D
E F
F B
F E
G E
H E
I E
I B
J E
J B
K B
K E
;

proc network
  direction = directed
  links = mycas.linksTable
  outNodes = mycas.PageRankScores;
     centrality
       pageRank = unweight
       pageRankalpha = 0.85;
run;

 

Find more articles from SAS Global Enablement and Learning here.

Comments

Great! Thanks for sharing! 

Version history
Last update:
‎09-19-2023 09:35 AM
Updated by:
Contributors

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags