We’re smarter together. Learn from this collection of community knowledge and add your expertise.

SAS macro for finding all paths in a directed graph (network)

by Respected Advisor on ‎10-07-2014 02:26 PM - edited on ‎10-05-2015 02:40 PM by Community Manager (2,969 Views)

/*

The ALLPATHS macro finds all paths between two nodes in a directed network. The

network is described by a list of arcs (from-to node pairs). Nodes can be identified

by numeric or character values. Node pairs for which paths are requested are listed

in a dataset (start-end node pairs).

The macro creates a dataset describing all possible paths for each node pair requested. 

 

Syntax

 

%macro allPaths(Network, Pairs, Paths, maxPathLength=100);

 

Input/Output datasets (variables) are:

 

Input Network (from, to)

Input Pairs (start, end)

Output Paths (start, end, pathNo, nodeOrder, from, to)

 

The ALLPATHS macro has one optional parameter:

 

maxPathLength gives the maximum path length (number of arcs) to search for.

Default=100.

 

IMPORTANT : Prior to calling the macro, Network dataset must be sorted and indexed

by from nodes. For example :

 

  proc sort data=myNetwork out=myNetwork(index=(from));

  where from ne to;

  by from to;

  run;

 

IMPORTANT : Node variables from, to, start, and end should be of the same

type and length.

 

PG

*/

 

%macro allPaths(network, pairs, paths, maxPathLength=100);

%let nodeIDlength=8;

%let which=whichn;

data _null_;

set &network;

if vtype(from) = "C" then do;

    call symputx("nodeIDlength", cats("$",vlength(from)));

    call symputx("which", "whichc");

    end;

stop;

run;

 

data &paths(keep=start end pathNo nodeOrder from to);

set &pairs;

array arc[&maxPathLength];

array node[0:&maxPathLength] &nodeIDlength;

pathNo = 0;

len = 0;

node[len] = start;

 

do k = 1 to 10000; /* Catch infinite loop, just in case */

    if missing(node[len]) then do; /* Move to sibling or back off*/

        if len = 0 then leave;

        p = arc[len] + 1;

        if p > maxp then do; /* Last arc: no sibling. Back off */

            len + (-1);

            call missing(node[len]);

            end;

        else do;

            set &network point=p nobs=maxp;

            if from = node[len-1] then do; /* Sibling found */

                node[len] = to;

                arc[len] = p;

                end;

            else do; /* No more sibling. Back off */

                len + (-1);

                call missing(node[len]);

                end;

            end;

        end;

    else do; /* Go further down the path */

        if node[len] = end then do; /* Simple path found */

            pathNo + 1;

            do j = 1 to len;

                from = node[j-1]; to = node; nodeOrder=j; output;

                end;

            call missing(node[len]);

            end;

        else if &which(node[len], of node

  • ) < len+1 then do; /* Cycle detected in path */

 

            call missing(node[len]);

            end;

        else do; /* Move ahead in path */

            from = node[len];

            set &network key=from/unique curobs=p;

            if p and len < dim(arc) then do; /* To node found, add it to path, if possible */

                len + 1;

                node[len] = to;

                arc[len] = p;

                end;

            else do; /* No arc found or no room left on path. Dead end */

                _error_ = 0;

                call missing(node[len]);

                end;

            end;

        end;

    end;

if pathNo = 0 then do; /* No path found. Output missing values */

    call missing(from, to, nodeOrder); output; end;

run;

%mend allPaths;

 

/*

* Example of use ;

 

* List the edges of a directed network. Must include variables FROM and TO

  (see attached picture) ;

data network;

length from to $4;

input from to @@;

datalines;

a b  b c  c d  d e  e f  f a

a c  c e  e a

f d  d b  b f

;

 

* IMPORTANT : Sort and index the network by FROM node ;

proc sort data=network out=network(index=(from)); where from ne to; by from to; run;

 

* List of node pairs for which we want to enumerate the paths.

  Variables START and END must match the type of network variables FROM and TO ;

data requests;

length start end $4;

input start end;

label start="Starting Node" end="Ending Node";

datalines;

a a

a x

x a

a b

b f

f b

f .

. .

;

 

* Call the allPaths macro ;

%allPaths(network, requests, paths);

 

* Print the paths in a tabular format ;

data printNet;

length Path $200;

do until(last.pathNo);

    set paths; by start end pathNo notsorted;

    if first.pathNo then path = cats(from);

    path = catx(" - ", path, to);

    end;

if pathNo = 0 then path = "No path";

keep start end pathNo path;

label pathNo="Path No";

run;

 

title "List of all paths";

proc print data=printNet noobs label;

by start end notsorted; id start end; run;

 

*/

/*

* Example of use (with numeric nodes) ;

 

* List the edges of a directed network. Must include variables FROM and TO ;

data network;

length from to 8;

input from to @@;

datalines;

1 2  2 3  3 4  4 5  5 6  6 1

1 3  3 5  5 1

6 4  4 2  2 6

;

 

* IMPORTANT : Sort and index the network by FROM node ;

proc sort data=network out=network(index=(from)); by from to; run;

 

* List of node pairs for which we want to enumerate the paths.

  Variables START and END must match the type of network variables FROM and TO ;

data requests;

length start end 8;

input start end;

label start="Starting Node" end="Ending Node";

datalines;

1 1

1 9

9 1

1 2

2 6

6 2

6 .

. .

;

 

* Call the allPaths macro ;

%allPaths(network, requests, paths);

 

* Print the paths in a tabular format ;

data printNet;

length Path $200;

do until(last.pathNo);

    set paths; by start end pathNo notsorted;

    if first.pathNo then path = cats(from);

    path = catx(" - ", path, to);

    end;

keep start end pathNo path;

label pathNo="Path No";

run;

 

title "List of all paths";

proc print data=printNet noobs label;

by start end notsorted; id start end; run;

*/

 

PG

Comments
by Super User
on ‎10-08-2014 09:39 AM

PG, I think my code is better. :smileyhappy:

by Occasional Contributor DeepH
on ‎07-01-2015 09:54 AM

PG,

Why there are no paths shown when the Starting Node & Ending Node are same!! (i.e For a a or 1 1 in the requests dataset)

by Respected Advisor
on ‎07-01-2015 10:04 AM

A path is a series of arcs. The algorithm doesn't allow for reflexive arcs and more generally for cycles in the graph. So there cannot be a path starting and ending at the same node. - PG

by Occasional Contributor DeepH
on ‎07-01-2015 10:07 AM

ok got it.

Is there any other algorithm that would help me identify cycles in a graph?

by Super User
on ‎07-01-2015 10:13 AM

Did you check my URL ?

by Respected Advisor
on ‎07-01-2015 10:18 AM

Also, the CYCLE statement in PROC OPTNET (requires SAS/OR) does just that. - PG

by SAS Employee RobPratt
on ‎07-01-2015 10:22 AM

Also, as of SAS/OR 13.1, the network solver in PROC OPTMODEL provides access to the same algorithms from PROC OPTNET:

SAS/OR(R) 13.2 User's Guide: Mathematical Programming

by Occasional Contributor DeepH
on ‎07-02-2015 01:13 AM

Hi Xia,

I did check your URL. But in my case I am falling short of memory in case of hash objects, as my dataset is huge.

by Occasional Contributor DeepH
on ‎07-02-2015 01:14 AM

PG,

Unfortunately I don't have SAS/OR package. So any other alternative?

by Occasional Contributor DeepH
on ‎07-02-2015 01:14 AM

Hi Rob,

No SAS/OR. :smileysad:

by Super User
on ‎07-02-2015 07:57 AM

How Huge ?You can recode your variable and save memory for Hash Table.

And  set system option  memsize=max  When you start SAS .

by Occasional Contributor DeepH
on ‎07-02-2015 08:46 AM

The dataset would contain nearly 10 million records. And when it tries to find cycles the output dataset would contain nearly 25 million records.

The hash table will store all the intermediate path, which consumes lot of primary memory.

So, I was looking after an alternative way.

by Super User
on ‎07-02-2015 09:18 AM

No. You don't understand my code .

The only thing infect the memory is how many layers of deep there are . If it is only 10 or 20 , that could be nothing for my code.

As matter of fact , I clear Hash Table every time when an Obs has been searched , That could reuse these memory again.

And you can also recode the variable to save memory for Hash Table . Like    Tom -> A  Arthur->B  John->C ....etc.

And the point is how deep you are going to search for a branch ? If it were one million or ten million , I am afraid you gotta take a look at SAS/OR , and see if OR could handle such big data .

Xia Keshan

by Occasional Contributor DeepH
on ‎07-02-2015 09:31 AM

Is there any way to identify the what memory would be required for hash table.

Cause I got the following error:

ERROR: Hash object added 32505840 items when memory failure occurred.

FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase.

As my input table is huge, the count of intermediate path is also high.

So, i there any other way to accomplish the same using Data Step or Proc SQL?

by Super User
on ‎07-02-2015 09:50 AM

No. I don't think you can do it via Data OR SQL on account of your big table .

First of all , you need to see how many unique value there are in your table .

That means you could know how deep the branch could be .

and Base on it,You can set the right length of PATH to avoid the dead loop like your LOG said.

Second , Set  length of PATH as long as it could be

.

length path _path  $ 32767 ;

Third, you need to filter the missing value of START and END to avoid dead loop.

declare hash ha(hashexp:20,dataset:'have(where=(_start is not missing and _end is not missing))',multidata:'Y');

If there are 1 million unique value , I am afraid you need to resort to SAS/OR . See if OR could handle it .

BTW. also could free the memory by deleting the path in Hash Table have been searched. 

by Occasional Contributor DeepH
on ‎07-02-2015 10:01 AM

First: My branch would go upto 15 level.

Second: Path length is fine.

Third: Kept check to avoid dead loop.

Forth: The clear() method in your code, I believe does the same. i.e. free the memory by deleting the path searched in hash table.

I have done all those changes.

But is there any way where I can break my input data sets based on id and then execute the hash ?

by Super User
on ‎07-02-2015 10:28 AM

"Forth: The clear() method in your code, I believe does the same. i.e. free the memory by deleting the path searched in hash table."

No. I should delete the path as long as it is searched .

"But is there any way where I can break my input data sets based on id and then execute the hash ?"

You put many ID into one table together ? But this version of code could only handle one ID group .

Search it at this forum, there are a bunch of method to split table into lots of small tables .  Data Step, Hash Table also could do it .


Tomorrow, I might post the updated code . today is too late . I have to leave now .

by Super User
on ‎07-02-2015 10:36 AM

I remember there is a person named BillFish who also post some code to search a tree via data step + array , But I don't know if his code could identify cycles .  Search him at this forum .

by Occasional Contributor DeepH
on ‎07-02-2015 10:41 AM

"Forth: The clear() method in your code, I believe does the same. i.e. free the memory by deleting the path searched in hash table."

No. I should delete the path as long as it is searched .


Tomorrow can you pls elaborate how to delete the path as long as it is searched?

by Super User
on ‎07-03-2015 07:54 AM

The code has been updated .

It could handle one million obs for about six seconds running under my crappy laptop .

by Occasional Contributor DeepH
on ‎07-03-2015 08:49 AM

Thanks. I will now try and execute the code against my data.

by Super User
on ‎07-06-2015 11:12 PM

Sorry. I found a problem in my code . I should use FINDW() not FIND() . :smileysad:

Use the first code and let me know what happened .

Anyway, Thank you to keep me to enhance my code.

Xia Keshan

by Occasional Contributor DeepH
on ‎07-10-2015 01:24 AM

The output that I am getting is fine. But still I am facing memory issue.

Can we:

1) Break the input dataset into parts based on ID group

2) Execute hash for one ID group

3) Delete the hash

4) Again execute the hash for second ID group

by Super User
on ‎07-10-2015 07:49 AM

Of course I can. But that code is designed for only one ID Group. Therefore you need to split your big data into lots of small tables(each ID is a table). Otherwise, I need to rewrite my code .

I think it is easy: 1) split your table 2) run my code on every small table .

by Super User
on ‎07-11-2015 02:32 AM

Is there some duplicated obs in your table ?

Try

proc sort data=have nodupkey;

by _start _end;

run;

to get rid of these duplicated obs .

Your turn
Sign In!

Want to write an article? Sign in with your profile.