BookmarkSubscribeRSS Feed
lpy0521
Fluorite | Level 6

Hi, there.

 

Just curious how to achieve this in sas. Basically I want to tract each id's full life cycle from the origin to destination as below example shows:

 

have:  
IDfromto
1ab
1bc
1cd
2ef
2fg

 

want:    
IDfromto/fromto/fromto/from
1abcd
2effg

 

 

Any comment is highly appreciated.

16 REPLIES 16
novinosrin
Tourmaline | Level 20

why is f missing if id 2 traveled from e to f and then from f to g?

lpy0521
Fluorite | Level 6

Sorry it was a typo. Thanks for pointing out!

mkeintz
PROC Star

Do you have sas/ets (sorry, I meant SAS/OR)? It includes proc optnet, which will directly solve your problem.

 

Run

  proc setinit;

  run;

 

and see if sas/ets sas/or is listed.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
lpy0521
Fluorite | Level 6

It looks like I have sas/or.
Never use it before, please guide me.
Thanks.

novinosrin
Tourmaline | Level 20

Yes look for OR as Mark suggested. 

 

But just to clear the air a bit:

I would have thought the logical result to be

want:        
ID from to/from to/from to/from
1 a b c d
2 e f g  

 

instead of 

 

want:        
ID from to/from to/from to/from
1 a b c d
2 e f f g

 

Can you correct my understanding?

lpy0521
Fluorite | Level 6
Yes, your logic is correct, The reason for me to create that two fs is just to align the column such that I can use the last column as the final destination, for future analysis.
Thanks.
novinosrin
Tourmaline | Level 20

Ok Thanks. Let's together wait if Mark has the time to demo or lead us with some guidance. I am excited about OR too. Thankfully I haven't lost much hair after having learned base sas fairly well. Otherwise your req is a simple double proc transpose 

mkeintz
PROC Star

Having used proc optnet for other from/to linkage tasks, I thought this problem would be convenient to solve.  But it's not:

  1. proc optnet doesn't honor BY groups, so you'd have to run it multiple times, and then concatenate all the by-group results.
  2. Within each ID group you have to ask proc optnet to generate the longest path (a-d is longer than all the others).  If would provide 2 such paths for ID group 1:  a->b->c->d   and d->c->b->a.   You'd have to compare the path output file to other output files from optnet to determine which one honors the flow direction desired, and then arrange data to be ready for transpose from tall to wide.  A pain you-know-where.

 

So here's two step solution

  1. In a data step
    1. make a hash call triple, which
      1. Includes ID as a key in the hash object, making honoring id groups easier
      2. Makes a triple for each node in the path, containing (in addition to ID)    NODE, PRIOR, and NEXT
    2. At the end of each by group
      1. Find the source (the node with no prior)
      2. output each successive node until reaching the sink (the node with no next)
      3. Label the nodes as "source", "node_2", "node_3", ... to "sink".
  2. After the data step run a proc transpose:

 

EDITTED NOTE:   To accommodate a single record ID, a node value has to be arbitrarily assigned prior to the triple.find() method.  That statement ("if missing(node) then node=from;") is included below:

 

data have ;
  input ID from :$1. to :$1.;
datalines;
1 b c 
1 c d 
1 a b 
2 e f 
2 f g 
run;

data need (keep=id nodename node);  
  set have;
  by id;
  if _n_=1 then do;
    length prior node next $1;
	declare hash triples();
	  triples.definekey('id','node');
	  triples.definedata('id','node','prior','next');
	  triples.definedone();
  end;

  /*Populate triples hash */
  if triples.find(key:id,key:from)^=0 then call missing(prior);
  triples.replace(key:id,key:from,data:id,data:from,data:prior,data:to);

  if triples.find(key:id,key:to)^=0 then call missing(next);
  triples.replace(key:id,key:to,data:id,data:to,data:from,data:next);

  if last.id;
  triples.output(dataset:'triples');

  if missing(node) then node=from;  /*Added to accommodate single-link ID's*/
  triples.find();
  if not missing(prior) then do until (missing(prior));
    triples.find(key:id,key:prior);
  end;

  nodename='Source'; /* Output nodes from source to sink */
  output;
  if not missing(next) then do seq=2 by 1 until(missing(next));
    triples.find(key:id,key:next);
	if missing(next) then nodename='Sink';
    else nodename=cats('Node_',seq);
    output; 
  end;
run;

proc transpose data=need out=want (drop=_name_) ;
  by id;
  var node;
  id nodename;
run;

 

The only part of the original request unsatisfied here is that some internal values (such as NODE_3 for id group 2) are missing.  That should be easy to fix, if really necessary.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
lpy0521
Fluorite | Level 6

Thanks a lot for the explanation. Hash is absolutely new to me and just test the code but give an error when I am adding new case: there only contains source and sink node as id3? Would you please show me how can i modify the code accordingly? Thanks!!!

data have ;
  input ID from :$1. to :$1.;
datalines;
1 b c 
1 c d 
1 a b 
2 e f 
2 f g 
3 x y
run;

 

mkeintz
PROC Star

Please show the log containing the error message and related source code.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
lpy0521
Fluorite | Level 6
data have ;
  input ID from :$1. to :$1.;
datalines;
1 b c 
1 c d 
1 a b 
2 e f 
2 f g 
3 x y
run;

data need (keep=id nodename node);  
  set have;
  by id;
  if _n_=1 then do;
    length prior node next $1;
	declare hash triples();
	  triples.definekey('id','node');
	  triples.definedata('id','node','prior','next');
	  triples.definedone();
  end;

  /*Populate triples hash */
  if triples.find(key:id,key:from)^=0 then call missing(prior);
  triples.replace(key:id,key:from,data:id,data:from,data:prior,data:to);

  if triples.find(key:id,key:to)^=0 then call missing(next);
  triples.replace(key:id,key:to,data:id,data:to,data:from,data:next);

  if last.id;
  triples.output(dataset:'triples');

  triples.find();
  if not missing(prior) then do until (missing(prior));
    triples.find(key:id,key:prior);
  end;

  nodename='Source'; /* Output nodes from source to sink */
  output;
  if not missing(next) then do seq=2 by 1 until(missing(next));
    triples.find(key:id,key:next);
	if missing(next) then nodename='Sink';
    else nodename=cats('Node_',seq);
    output; 
  end;
run;

proc transpose data=need out=want (drop=_name_) ;
  by id;
  var node;
  id nodename;
run;

basically I create a new case with id3, there is only two nodes x and y. The error in the log file shows:

 
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
55
56 data have ;
57 input ID from :$1. to :$1.;
58 datalines;
 
NOTE: The data set WORK.HAVE has 6 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
 
65 run;
 
66
67 data need (keep=id nodename node);
68 set have;
69 by id;
70 if _n_=1 then do;
71 length prior node next $1;
72 declare hash triples();
73 triples.definekey('id','node');
74 triples.definedata('id','node','prior','next');
75 triples.definedone();
76 end;
77
78 /*Populate triples hash */
79 if triples.find(key:id,key:from)^=0 then call missing(prior);
80 triples.replace(key:id,key:from,data:id,data:from,data:prior,data:to);
81
82 if triples.find(key:id,key:to)^=0 then call missing(next);
83 triples.replace(key:id,key:to,data:id,data:to,data:from,data:next);
84
85 if last.id;
86 triples.output(dataset:'triples');
87
88 triples.find();
89 if not missing(prior) then do until (missing(prior));
90 triples.find(key:id,key:prior);
91 end;
92
93 nodename='Source'; /* Output nodes from source to sink */
94 output;
95 if not missing(next) then do seq=2 by 1 until(missing(next));
96 triples.find(key:id,key:next);
97 if missing(next) then nodename='Sink';
98 else nodename=cats('Node_',seq);
99 output;
100 end;
101 run;
 
NOTE: Variable node is uninitialized.
NOTE: The data set WORK.TRIPLES has 4 observations and 4 variables.
NOTE: The data set WORK.TRIPLES has 7 observations and 4 variables.
NOTE: The data set WORK.TRIPLES has 9 observations and 4 variables.
ERROR: Key not found.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 6 observations read from the data set WORK.HAVE.
WARNING: The data set WORK.NEED may be incomplete. When this step was stopped there were 8 observations and 3 variables.
WARNING: Data set WORK.NEED was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.02 seconds
 
 
102
103 proc transpose data=need out=want (drop=_name_) ;
104 by id;
105 var node;
106 id nodename;
107 run;
 
NOTE: There were 8 observations read from the data set WORK.NEED.
NOTE: The data set WORK.WANT has 3 observations and 5 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds
 
 
108
109
110 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
122
 
 
Thanks!!!
Ksharp
Super User
data have;
input ID	from $	to $ ;
rename from=_start to=_end;
cards;
1	a	b
1	b	c
1	c	d
2	e	f
2	f	g
;
run;
 
proc sql;
create table x as
 select * from have
 where catx(' ',id,_start) not in 
(select catx(' ',id,_end) from have);
quit; 

data want(keep=id path);
if _n_ eq 1 then do;
length path _path  $ 700 ;
if 0 then set have;
declare hash ha(hashexp:20,dataset:'have(where=(_start is not missing and _end is not missing))',multidata:'y');
ha.definekey('_start');
ha.definedata('_end');
ha.definedone();

declare hash pa(ordered:'y');
declare hiter hi_path('pa');
pa.definekey('n');
pa.definedata('n','path');
pa.definedone();

end;

set x;
count=1;n=1;_n=1;
path=catx('|',_start,_end);
   
pa.add();
do while(hi_path.next()=0);
 if n ne 1 then pa.remove(key:_n);_n=n;
 _path=path;   
 _start=scan(path,-1,'|');
 rc=ha.find(); if rc ne 0 then output;
 do while(rc=0);
  if not findw(path,strip(_end),'|') then do;
   if length(path)+length(_end)+1 gt lengthc(path) then do;
    putlog 'ERROR: The length of path and _path are set too short';
    stop;
   end;
   
   count+1;n=count;
   path=catx('|',path,_end);
   pa.add();
   path=_path;
 end;
 else output;
  rc=ha.find_next();
end;
end;
pa.clear();

run;
lpy0521
Fluorite | Level 6

Thanks a lot! I tested the code, it gives me same id with various different path. I will need to closely investigate my data source.  

novinosrin
Tourmaline | Level 20

To OP ,Mark, Xia who may have a couple of mins to clarify plz

 

I had this question printed and showed that to a friend of mine who is majoring in data science. 

 

He  pointed out. Isn't it always the following L shape that's the flow

 

have:    
ID from to
1 a b
1 b c
1 c d
2 e f
2 f g

 

Can you provide some intuition? 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 16 replies
  • 1710 views
  • 4 likes
  • 4 in conversation