SAS macro for finding all paths in a directed graph (network)

1 Like

/*

The ALLPATHS macro finds all paths between two nodes in a directed network. The

network is described by a list of arcs (from-to node pairs). Nodes can be identified

by numeric or character values. Node pairs for which paths are requested are listed

in a dataset (start-end node pairs).

The macro creates a dataset describing all possible paths for each node pair requested.

Syntax

%macro allPaths(Network, Pairs, Paths, maxPathLength=100);

Input/Output datasets (variables) are:

Input Network (from, to)

Input Pairs (start, end)

Output Paths (start, end, pathNo, nodeOrder, from, to)

The ALLPATHS macro has one optional parameter:

maxPathLength gives the maximum path length (number of arcs) to search for.

Default=100.

IMPORTANT : Prior to calling the macro, Network dataset must be sorted and indexed

by from nodes. For example :

proc sort data=myNetwork out=myNetwork(index=(from));

where from ne to;

by from to;

run;

IMPORTANT : Node variables from, to, start, and end should be of the same

type and length.

PG

*/

%macro allPaths(network, pairs, paths, maxPathLength=100);

%let nodeIDlength=8;

%let which=whichn;

data _null_;

set &network;

if vtype(from) = "C" then do;

call symputx("nodeIDlength", cats("$",vlength(from)));

call symputx("which", "whichc");

end;

stop;

run;

data &paths(keep=start end pathNo nodeOrder from to);

set &pairs;

array arc[&maxPathLength];

array node[0:&maxPathLength] &nodeIDlength;

pathNo = 0;

len = 0;

node[len] = start;

do k = 1 to 10000; /* Catch infinite loop, just in case */

if missing(node[len]) then do; /* Move to sibling or back off*/

if len = 0 then leave;

p = arc[len] + 1;

if p > maxp then do; /* Last arc: no sibling. Back off */

len + (-1);

call missing(node[len]);

end;

else do;

set &network point=p nobs=maxp;

if from = node[len-1] then do; /* Sibling found */

node[len] = to;

arc[len] = p;

end;

else do; /* No more sibling. Back off */

len + (-1);

call missing(node[len]);

end;

else do; /* Go further down the path */

if node[len] = end then do; /* Simple path found */

pathNo + 1;

do j = 1 to len;

from = node[j-1]; to = node; nodeOrder=j; output;

end;

call missing(node[len]);

end;

else if &which(node[len], of node

) < len+1 then do; /* Cycle detected in path */

call missing(node[len]);

end;

else do; /* Move ahead in path */

from = node[len];

set &network key=from/unique curobs=p;

if p and len < dim(arc) then do; /* To node found, add it to path, if possible */

len + 1;

node[len] = to;

arc[len] = p;

end;

else do; /* No arc found or no room left on path. Dead end */

_error_ = 0;

call missing(node[len]);

end;

if pathNo = 0 then do; /* No path found. Output missing values */

call missing(from, to, nodeOrder); output; end;

run;

%mend allPaths;

/*

* Example of use ;

* List the edges of a directed network. Must include variables FROM and TO

(see attached picture) ;

data network;

length from to $4;

input from to @@;

datalines;

a b b c c d d e e f f a

a c c e e a

f d d b b f

;

* IMPORTANT : Sort and index the network by FROM node ;

proc sort data=network out=network(index=(from)); where from ne to; by from to; run;

* List of node pairs for which we want to enumerate the paths.

Variables START and END must match the type of network variables FROM and TO ;

data requests;

length start end $4;

input start end;

label start="Starting Node" end="Ending Node";

datalines;

a a

a x

x a

a b

b f

f b

f .

. .

;

* Call the allPaths macro ;

%allPaths(network, requests, paths);

* Print the paths in a tabular format ;

data printNet;

length Path $200;

do until(last.pathNo);

set paths; by start end pathNo notsorted;

if first.pathNo then path = cats(from);

path = catx(" - ", path, to);

end;

if pathNo = 0 then path = "No path";

keep start end pathNo path;

label pathNo="Path No";

run;

title "List of all paths";

proc print data=printNet noobs label;

by start end notsorted; id start end; run;

*/

/*

* Example of use (with numeric nodes) ;

* List the edges of a directed network. Must include variables FROM and TO ;

data network;

length from to 8;

input from to @@;

datalines;

1 2 2 3 3 4 4 5 5 6 6 1

1 3 3 5 5 1

6 4 4 2 2 6

;

* IMPORTANT : Sort and index the network by FROM node ;

proc sort data=network out=network(index=(from)); by from to; run;

* List of node pairs for which we want to enumerate the paths.

Variables START and END must match the type of network variables FROM and TO ;

data requests;

length start end 8;

input start end;

label start="Starting Node" end="Ending Node";

datalines;

1 1

1 9

9 1

1 2

2 6

6 2

6 .

. .

;

* Call the allPaths macro ;

%allPaths(network, requests, paths);

* Print the paths in a tabular format ;

data printNet;

length Path $200;

do until(last.pathNo);

set paths; by start end pathNo notsorted;

if first.pathNo then path = cats(from);

path = catx(" - ", path, to);

end;

keep start end pathNo path;

label pathNo="Path No";

run;

title "List of all paths";

proc print data=printNet noobs label;

by start end notsorted; id start end; run;

*/

PG

Ksharp · ‎10-08-2014

PG, I think my code is better.

DeepH · ‎07-01-2015

PG,

Why there are no paths shown when the Starting Node & Ending Node are same!! (i.e For a a or 1 1 in the requests dataset)

PGStats · ‎07-01-2015

A path is a series of arcs. The algorithm doesn't allow for reflexive arcs and more generally for cycles in the graph. So there cannot be a path starting and ending at the same node. - PG

DeepH · ‎07-01-2015

ok got it.

Is there any other algorithm that would help me identify cycles in a graph?

Ksharp · ‎07-01-2015

Did you check my URL ?

PGStats · ‎07-01-2015

Also, the CYCLE statement in PROC OPTNET (requires SAS/OR) does just that. - PG

RobPratt · ‎07-01-2015

Also, as of SAS/OR 13.1, the network solver in PROC OPTMODEL provides access to the same algorithms from PROC OPTNET:

SAS/OR(R) 13.2 User's Guide: Mathematical Programming

DeepH · ‎07-02-2015

Hi Xia,

I did check your URL. But in my case I am falling short of memory in case of hash objects, as my dataset is huge.

DeepH · ‎07-02-2015

PG,

Unfortunately I don't have SAS/OR package. So any other alternative?

DeepH · ‎07-02-2015

Hi Rob,

No SAS/OR.

Ksharp · ‎07-02-2015

How Huge ?You can recode your variable and save memory for Hash Table.

And set system option memsize=max When you start SAS .

DeepH · ‎07-02-2015

The dataset would contain nearly 10 million records. And when it tries to find cycles the output dataset would contain nearly 25 million records.

The hash table will store all the intermediate path, which consumes lot of primary memory.

So, I was looking after an alternative way.

Ksharp · ‎07-02-2015

No. You don't understand my code .

The only thing infect the memory is how many layers of deep there are . If it is only 10 or 20 , that could be nothing for my code.

As matter of fact , I clear Hash Table every time when an Obs has been searched , That could reuse these memory again.

And you can also recode the variable to save memory for Hash Table . Like Tom -> A Arthur->B John->C ....etc.

And the point is how deep you are going to search for a branch ? If it were one million or ten million , I am afraid you gotta take a look at SAS/OR , and see if OR could handle such big data .

Xia Keshan

DeepH · ‎07-02-2015

Is there any way to identify the what memory would be required for hash table.

Cause I got the following error:

ERROR: Hash object added 32505840 items when memory failure occurred.

FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.

As my input table is huge, the count of intermediate path is also high.

So, i there any other way to accomplish the same using Data Step or Proc SQL?

Ksharp · ‎07-02-2015

No. I don't think you can do it via Data OR SQL on account of your big table .

First of all , you need to see how many unique value there are in your table .

That means you could know how deep the branch could be .

and Base on it,You can set the right length of PATH to avoid the dead loop like your LOG said.

Second , Set length of PATH as long as it could be

.

length path _path $ 32767 ;

Third, you need to filter the missing value of START and END to avoid dead loop.

declare hash ha(hashexp:20,dataset:'have(where=(_start is not missing and _end is not missing))',multidata:'Y');

If there are 1 million unique value , I am afraid you need to resort to SAS/OR . See if OR could handle it .

BTW. also could free the memory by deleting the path in Hash Table have been searched.

DeepH · ‎07-02-2015

First: My branch would go upto 15 level.

Second: Path length is fine.

Third: Kept check to avoid dead loop.

Forth: The clear() method in your code, I believe does the same. i.e. free the memory by deleting the path searched in hash table.

I have done all those changes.

But is there any way where I can break my input data sets based on id and then execute the hash ?

Ksharp · ‎07-02-2015

"Forth: The clear() method in your code, I believe does the same. i.e. free the memory by deleting the path searched in hash table."

No. I should delete the path as long as it is searched .

"But is there any way where I can break my input data sets based on id and then execute the hash ?"

You put many ID into one table together ? But this version of code could only handle one ID group .

Search it at this forum, there are a bunch of method to split table into lots of small tables . Data Step, Hash Table also could do it .

Tomorrow, I might post the updated code . today is too late . I have to leave now .

Ksharp · ‎07-02-2015

I remember there is a person named BillFish who also post some code to search a tree via data step + array , But I don't know if his code could identify cycles . Search him at this forum .

DeepH · ‎07-02-2015

"Forth: The clear() method in your code, I believe does the same. i.e. free the memory by deleting the path searched in hash table."

No. I should delete the path as long as it is searched .

Tomorrow can you pls elaborate how to delete the path as long as it is searched?

Ksharp · ‎07-03-2015

The code has been updated .

It could handle one million obs for about six seconds running under my crappy laptop .

DeepH · ‎07-03-2015

Thanks. I will now try and execute the code against my data.

Ksharp · ‎07-06-2015

Sorry. I found a problem in my code . I should use FINDW() not FIND() .

Use the first code and let me know what happened .

Anyway, Thank you to keep me to enhance my code.

Xia Keshan

DeepH · ‎07-10-2015

The output that I am getting is fine. But still I am facing memory issue.

Can we:

1) Break the input dataset into parts based on ID group

2) Execute hash for one ID group

3) Delete the hash

4) Again execute the hash for second ID group

Ksharp · ‎07-10-2015

Of course I can. But that code is designed for only one ID Group. Therefore you need to split your big data into lots of small tables(each ID is a table). Otherwise, I need to rewrite my code .

I think it is easy: 1) split your table 2) run my code on every small table .

Ksharp · ‎07-11-2015

Is there some duplicated obs in your table ?

Try

proc sort data=have nodupkey;

by _start _end;

run;

to get rid of these duplicated obs .

Matthew_Galati · ‎03-21-2022

FYI - you can solve this directly with proc optnetwork or proc network with the PATH algorithm:

https://go.documentation.sas.com/doc/en/pgmsascdc/v_024/casnopt/casnopt_optnet_details55.htm

SAS macro for finding all paths in a directed graph (network)

Registration is open

SAS AI and Machine Learning Courses