Re: Create a flow direction in proc sql

lpy0521 · Posted 10-26-2018 12:20 PM

Hi, there.

Just curious how to achieve this in sas. Basically I want to tract each id's full life cycle from the origin to destination as below example shows:

have:
ID	from	to
1	a	b
1	b	c
1	c	d
2	e	f
2	f	g

want:
ID	from	to/from	to/from	to/from
1	a	b	c	d
2	e	f	f	g

Any comment is highly appreciated.

novinosrin · Posted 10-26-2018 12:33 PM

why is f missing if id 2 traveled from e to f and then from f to g?

lpy0521 · Posted 10-26-2018 12:40 PM

Sorry it was a typo. Thanks for pointing out!

mkeintz · Posted 10-26-2018 12:47 PM

Do you have ~~sas/ets~~ (sorry, I meant SAS/OR)? It includes proc optnet, which will directly solve your problem.

Run

proc setinit;

run;

and see ~~if sas/ets~~ sas/or is listed.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

lpy0521 · Posted 10-26-2018 12:55 PM

It looks like I have sas/or.
Never use it before, please guide me.
Thanks.

novinosrin · Posted 10-26-2018 12:57 PM

Yes look for OR as Mark suggested.

But just to clear the air a bit:

I would have thought the logical result to be

want:
ID	from	to/from	to/from	to/from
1	a	b	c	d
2	e	f	g

instead of

want:
ID	from	to/from	to/from	to/from
1	a	b	c	d
2	e	f	f	g

Can you correct my understanding?

lpy0521 · Posted 10-26-2018 01:04 PM

Yes, your logic is correct, The reason for me to create that two fs is just to align the column such that I can use the last column as the final destination, for future analysis.
Thanks.

novinosrin · Posted 10-26-2018 01:10 PM

Ok Thanks. Let's together wait if Mark has the time to demo or lead us with some guidance. I am excited about OR too. Thankfully I haven't lost much hair after having learned base sas fairly well. Otherwise your req is a simple double proc transpose

mkeintz · Posted 10-28-2018 06:09 PM

Having used proc optnet for other from/to linkage tasks, I thought this problem would be convenient to solve. But it's not:

proc optnet doesn't honor BY groups, so you'd have to run it multiple times, and then concatenate all the by-group results.
Within each ID group you have to ask proc optnet to generate the longest path (a-d is longer than all the others). If would provide 2 such paths for ID group 1: a->b->c->d and d->c->b->a. You'd have to compare the path output file to other output files from optnet to determine which one honors the flow direction desired, and then arrange data to be ready for transpose from tall to wide. A pain you-know-where.

So here's two step solution

In a data step
1. make a hash call triple, which
  1. Includes ID as a key in the hash object, making honoring id groups easier
  2. Makes a triple for each node in the path, containing (in addition to ID) NODE, PRIOR, and NEXT
2. At the end of each by group
  1. Find the source (the node with no prior)
  2. output each successive node until reaching the sink (the node with no next)
  3. Label the nodes as "source", "node_2", "node_3", ... to "sink".
After the data step run a proc transpose:

EDITTED NOTE: To accommodate a single record ID, a node value has to be arbitrarily assigned prior to the triple.find() method. That statement ("if missing(node) then node=from;") is included below:

data have ;
  input ID from :$1. to :$1.;
datalines;
1 b c 
1 c d 
1 a b 
2 e f 
2 f g 
run;

data need (keep=id nodename node);  
  set have;
  by id;
  if _n_=1 then do;
    length prior node next $1;
	declare hash triples();
	  triples.definekey('id','node');
	  triples.definedata('id','node','prior','next');
	  triples.definedone();
  end;

  /*Populate triples hash */
  if triples.find(key:id,key:from)^=0 then call missing(prior);
  triples.replace(key:id,key:from,data:id,data:from,data:prior,data:to);

  if triples.find(key:id,key:to)^=0 then call missing(next);
  triples.replace(key:id,key:to,data:id,data:to,data:from,data:next);

  if last.id;
  triples.output(dataset:'triples');

  if missing(node) then node=from;  /*Added to accommodate single-link ID's*/
  triples.find();
  if not missing(prior) then do until (missing(prior));
    triples.find(key:id,key:prior);
  end;

  nodename='Source'; /* Output nodes from source to sink */
  output;
  if not missing(next) then do seq=2 by 1 until(missing(next));
    triples.find(key:id,key:next);
	if missing(next) then nodename='Sink';
    else nodename=cats('Node_',seq);
    output; 
  end;
run;

proc transpose data=need out=want (drop=_name_) ;
  by id;
  var node;
  id nodename;
run;

The only part of the original request unsatisfied here is that some internal values (such as NODE_3 for id group 2) are missing. That should be easy to fix, if really necessary.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

lpy0521 · Posted 10-29-2018 12:29 AM

Thanks a lot for the explanation. Hash is absolutely new to me and just test the code but give an error when I am adding new case: there only contains source and sink node as id3? Would you please show me how can i modify the code accordingly? Thanks!!!

data have ;
  input ID from :$1. to :$1.;
datalines;
1 b c 
1 c d 
1 a b 
2 e f 
2 f g 
3 x y
run;

mkeintz · Posted 10-29-2018 12:33 AM

Please show the log containing the error message and related source code.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

lpy0521 · Posted 10-29-2018 12:51 AM

data have ;
  input ID from :$1. to :$1.;
datalines;
1 b c 
1 c d 
1 a b 
2 e f 
2 f g 
3 x y
run;

data need (keep=id nodename node);  
  set have;
  by id;
  if _n_=1 then do;
    length prior node next $1;
	declare hash triples();
	  triples.definekey('id','node');
	  triples.definedata('id','node','prior','next');
	  triples.definedone();
  end;

  /*Populate triples hash */
  if triples.find(key:id,key:from)^=0 then call missing(prior);
  triples.replace(key:id,key:from,data:id,data:from,data:prior,data:to);

  if triples.find(key:id,key:to)^=0 then call missing(next);
  triples.replace(key:id,key:to,data:id,data:to,data:from,data:next);

  if last.id;
  triples.output(dataset:'triples');

  triples.find();
  if not missing(prior) then do until (missing(prior));
    triples.find(key:id,key:prior);
  end;

  nodename='Source'; /* Output nodes from source to sink */
  output;
  if not missing(next) then do seq=2 by 1 until(missing(next));
    triples.find(key:id,key:next);
	if missing(next) then nodename='Sink';
    else nodename=cats('Node_',seq);
    output; 
  end;
run;

proc transpose data=need out=want (drop=_name_) ;
  by id;
  var node;
  id nodename;
run;

basically I create a new case with id3, there is only two nodes x and y. The error in the log file shows:

1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

55

56 data have ;

57 input ID from :$1. to :$1.;

58 datalines;

NOTE: The data set WORK.HAVE has 6 observations and 3 variables.

NOTE: DATA statement used (Total process time):

real time 0.00 seconds

cpu time 0.00 seconds

65 run;

66

67 data need (keep=id nodename node);

68 set have;

69 by id;

70 if _n_=1 then do;

71 length prior node next $1;

72 declare hash triples();

73 triples.definekey('id','node');

74 triples.definedata('id','node','prior','next');

75 triples.definedone();

76 end;

77

78 /*Populate triples hash */

79 if triples.find(key:id,key:from)^=0 then call missing(prior);

80 triples.replace(key:id,key:from,data:id,data:from,data:prior,data:to);

81

82 if triples.find(key:id,key:to)^=0 then call missing(next);

83 triples.replace(key:id,key:to,data:id,data:to,data:from,data:next);

84

85 if last.id;

86 triples.output(dataset:'triples');

87

88 triples.find();

89 if not missing(prior) then do until (missing(prior));

90 triples.find(key:id,key:prior);

91 end;

92

93 nodename='Source'; /* Output nodes from source to sink */

94 output;

95 if not missing(next) then do seq=2 by 1 until(missing(next));

96 triples.find(key:id,key:next);

97 if missing(next) then nodename='Sink';

98 else nodename=cats('Node_',seq);

99 output;

100 end;

101 run;

NOTE: Variable node is uninitialized.

NOTE: The data set WORK.TRIPLES has 4 observations and 4 variables.

NOTE: The data set WORK.TRIPLES has 7 observations and 4 variables.

NOTE: The data set WORK.TRIPLES has 9 observations and 4 variables.

ERROR: Key not found.

NOTE: The SAS System stopped processing this step because of errors.

NOTE: There were 6 observations read from the data set WORK.HAVE.

WARNING: The data set WORK.NEED may be incomplete. When this step was stopped there were 8 observations and 3 variables.

WARNING: Data set WORK.NEED was not replaced because this step was stopped.

NOTE: DATA statement used (Total process time):

real time 0.04 seconds

cpu time 0.02 seconds

102

103 proc transpose data=need out=want (drop=_name_) ;

104 by id;

105 var node;

106 id nodename;

107 run;

NOTE: There were 8 observations read from the data set WORK.NEED.

NOTE: The data set WORK.WANT has 3 observations and 5 variables.

NOTE: PROCEDURE TRANSPOSE used (Total process time):

real time 0.01 seconds

cpu time 0.00 seconds

108

109

110 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

122

Thanks!!!

Ksharp · Posted 10-27-2018 07:22 AM

data have;
input ID	from $	to $ ;
rename from=_start to=_end;
cards;
1	a	b
1	b	c
1	c	d
2	e	f
2	f	g
;
run;
 
proc sql;
create table x as
 select * from have
 where catx(' ',id,_start) not in 
(select catx(' ',id,_end) from have);
quit; 

data want(keep=id path);
if _n_ eq 1 then do;
length path _path  $ 700 ;
if 0 then set have;
declare hash ha(hashexp:20,dataset:'have(where=(_start is not missing and _end is not missing))',multidata:'y');
ha.definekey('_start');
ha.definedata('_end');
ha.definedone();

declare hash pa(ordered:'y');
declare hiter hi_path('pa');
pa.definekey('n');
pa.definedata('n','path');
pa.definedone();

end;

set x;
count=1;n=1;_n=1;
path=catx('|',_start,_end);
   
pa.add();
do while(hi_path.next()=0);
 if n ne 1 then pa.remove(key:_n);_n=n;
 _path=path;   
 _start=scan(path,-1,'|');
 rc=ha.find(); if rc ne 0 then output;
 do while(rc=0);
  if not findw(path,strip(_end),'|') then do;
   if length(path)+length(_end)+1 gt lengthc(path) then do;
    putlog 'ERROR: The length of path and _path are set too short';
    stop;
   end;
   
   count+1;n=count;
   path=catx('|',path,_end);
   pa.add();
   path=_path;
 end;
 else output;
  rc=ha.find_next();
end;
end;
pa.clear();

run;

lpy0521 · Posted 10-29-2018 12:31 AM

Thanks a lot! I tested the code, it gives me same id with various different path. I will need to closely investigate my data source.

novinosrin · Posted 10-28-2018 09:32 PM

To OP ,Mark, Xia who may have a couple of mins to clarify plz

I had this question printed and showed that to a friend of mine who is majoring in data science.

He pointed out. Isn't it always the following L shape that's the flow

have:
ID	from	to
1	a	b
1	b	c
1	c	d
2	e	f
2	f	g

Can you provide some intuition?

Registration is open

SAS Training: Just a Click Away