BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
KonstantinVasil
Obsidian | Level 7

 

I have a data set with two variables previous_id and next_id. The rows are not always ordered so that previous and next of two adjacent rows are the same, there is also no available variable for grouping by list origin.

 

Clarification: There is no variable by which to sort or group in the starting data set. The ids of the variables are not sequential numerically or alphabetically, they can be thought of as random (in the example they are sequential for simplicity of illustration).

 

I need an algorithm to track all id's that follow each unique originating id and output it in a new dataset.

 

Have:

 

data id_lists;
   input prev_id $ next_id $;
   datalines;
a b
w x
b c
x y
c d
;

Need:

 

 

data result;
    input id_sequence $ sequence_origin$;
    datalines;
a a
b a
c a
d a
w w
x w
y w
;

 

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

No matter what solution you select, you will need to guard against the possibility of an infinite loop.  For example:

 

data id_lists;
   input prev_id $ next_id $;
   datalines;
a b
b c
c a
;

Perhaps this shouldn't happen, but the program has to make sure or it might run for a very long time.

 

I suggest building a format that translates the NEXT_ID to PREV_ID.  After creating ID_LISTS:

 

data id_lists2;
   set id_lists;
   retain fmtname '$id';
   rename next_id = label prev_id=start;
run;
proc format cntlin=id_lists2;
run;

Given these translations, you can apply them in this fashion.  Begin with the list of IDs you would like to trace:

 

 

data result;
    input id_sequence $;
    datalines;
a 
b 
c 
d 
w 
x 
y 
;

.  Then trace them back in this way:

 

data want;
   set result;
   sequence_origin = id_sequence;
   do k=1 to 50 until (sequence_origin = test_next);
      sequence_origin = test_next;
      test_next = put(sequence_origin, $id.);
   end;
run;


      

The idea is to loop through the translations up to 50 times per observation (stopping if an origin is found).

 

It's untested code, so you may need to tweak it a bit.

View solution in original post

13 REPLIES 13
andreas_lds
Jade | Level 19

And now please explain how to create the first observation in your result dataset.

KonstantinVasil
Obsidian | Level 7

How to create the results data set is the purpose of the post. What is the particular problem with the first observation?

andreas_lds
Jade | Level 19

@KonstantinVasil wrote:

How to create the results data set is the purpose of the post. What is the particular problem with the first observation?


Maybe the word "create" in my post was not well chosen. If you want someone to write code for you, you should explain in full details what you want in way that someone who never had to solve such an issue understands the relationship between what you have and what you want.

 

I would start with sorting id_lists, then use retain or lag to compare prev_id with next_id of the previous observation to set/update sequence origin, retaining that variable seems to be necessary.

KonstantinVasil
Obsidian | Level 7

This is one of the main difficulties, there is no variable by which to sort or group! The ids of the variables are not sequential in any way. I will add this clarification to the description.

andreas_lds
Jade | Level 19

Try:

data id_lists;
   input prev_id $ next_id $;
   datalines;
a b
w x
b c
x y
c d
;
run;

proc sort data=work.id_lists out=work.sorted;
   by prev_id next_id;
run;


data work.result;
   set work.sorted end=jobDone;

   length id_sequence sequence_origin expected_id $ 1;
   retain sequence_origin expected_id;

   if _n_ = 1 or expected_id ^= prev_id then do;
      sequence_origin = prev_id;
   end;

   id_sequence = prev_id;
   expected_id = next_id;

   output;

   if jobDone then do;
      id_sequence = next_id;
      output;
   end;
run;


proc print;run;
KonstantinVasil
Obsidian | Level 7
I am sorry for misleading. As I updated, sorting or grouping by any of the ids is not possible as they are not sequential.
Astounding
PROC Star

No matter what solution you select, you will need to guard against the possibility of an infinite loop.  For example:

 

data id_lists;
   input prev_id $ next_id $;
   datalines;
a b
b c
c a
;

Perhaps this shouldn't happen, but the program has to make sure or it might run for a very long time.

 

I suggest building a format that translates the NEXT_ID to PREV_ID.  After creating ID_LISTS:

 

data id_lists2;
   set id_lists;
   retain fmtname '$id';
   rename next_id = label prev_id=start;
run;
proc format cntlin=id_lists2;
run;

Given these translations, you can apply them in this fashion.  Begin with the list of IDs you would like to trace:

 

 

data result;
    input id_sequence $;
    datalines;
a 
b 
c 
d 
w 
x 
y 
;

.  Then trace them back in this way:

 

data want;
   set result;
   sequence_origin = id_sequence;
   do k=1 to 50 until (sequence_origin = test_next);
      sequence_origin = test_next;
      test_next = put(sequence_origin, $id.);
   end;
run;


      

The idea is to loop through the translations up to 50 times per observation (stopping if an origin is found).

 

It's untested code, so you may need to tweak it a bit.

KonstantinVasil
Obsidian | Level 7
Thank you, but this seems to use the result data to trace back the input. What I need is to produce the result data from the input id_lists itself.
gamotte
Rhodochrosite | Level 12

Hello,

 

I think there is a proc for this kind of things but i don't remember the name.

Here is a solution with a recursive macro. An additional stop criterion should be

added to avoid infinite loops if they are likely to occur.

 

data id_lists;
   input prev_id $ next_id $;
   datalines;
a b
w x
b c
x y
c d
;
run;

proc sql noprint;
    CREATE TABLE steps AS
    SELECT DISTINCT step0 FROM
    (
    SELECT prev_id AS step0 FROM id_lists
    UNION
    SELECT next_id AS step0 FROM id_lists
    );
quit;


%macro recursive_join(iter=0);

    %local cont iter1;

    %let iter1=%eval(&iter.+1);

    proc sql noprint;
        CREATE TABLE steps AS
        SELECT i.prev_id AS step&iter1., s.*
        FROM steps s	
        LEFT JOIN id_lists i
        ON i.next_id=s.step&iter.;

        SELECT sum(step&iter1. ne " ")>0
        INTO :cont
        FROM steps;
    quit;

    %if &cont. %then %do;
        %recursive_join(iter=%eval(&iter1.));
    %end;

%mend;

%recursive_join;

data want;
    set steps;
    array s step:;
    rename step0=id_sequence;
    length sequence_origin $8.;

    sequence_origin=coalescec(of s(*));

    keep sequence_origin step0;
run;
Astounding
PROC Star

You're right about that.  But it should be easy enough to adapt the program.  It starts out the same way, creating the format.  Then apply the format to the ID_LISTS data.  However, I need to get some clarification to do this right.

 

Why does the number of observations change?  What's the logic behind that?

 

 

KonstantinVasil
Obsidian | Level 7

 

Yes I managed to modify it. Thanks for the hint - the solution with custom format is very elegant.

 

%macro track_id_sequences(id1,id2,input_data,output_data,max_iterations);
	/* 1. Store id data as a custom format = dictionary*/
	data temp; set &input_data.;
	   retain fmtname '$id';
	   rename &id2. = label &id1.=start;
	run;
	proc format cntlin=temp; run;

	/* 2. Translate the next id of each row multiple times until reaching the sequence end
		+ store number of translation = order of the id within the sequence */
	data &output_data.;   set &input_data.;
	   sequence_end = &id1.;
	   test_next = &id2.;
	   do seq_order=1 to &max_iterations. until (sequence_end = test_next);
	      sequence_end = test_next;
	      test_next = put(sequence_end, $id.);
	   end;
	run;

	/* 3. Group sequences by ending and order rows by order within the sequence
		+ add sequence ending as additional row */
	proc sort data=&output_data. out=&output_data.;
		by sequence_end descending seq_order;
	run;
	data &output_data.(keep= id_sequences sequence_end seq_order); set &output_data.;
		by sequence_end;
		if last.sequence_end then do;
			output; &id1. = sequence_end; seq_order=0; output;
		end; else output;
		rename &id1. = id_sequences ; 
	run;
%mend track_id_sequences;

data id_lists;
   input id_prev $ id_next $;
   datalines;
a b
w x
b c
x y
c d
;

%track_id_sequences(id1=id_prev , id2=id_next , input_data=id_lists, output_data=Result, max_iterations=100);
Ksharp
Super User

1) if there were one to many relationship in data ? like 

 

a b
w x
a c
w y
w d

 

2) if there are some common nodes in multiple path ,what you gonna do ? like

 

a b
w x
b c
x b

 

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @KonstantinVasil 

 

I tried to follow @Astounding's idea with a format and came to the following solution. It expects input to be a hierarchy, i.e. any node can have only one parent-node, but several child-notes, and I added an extra line to input, so c->d and c->e to make an extra child-node.

 

data id_lists;
   input prev_id $ next_id $;
   datalines;
a b
w x
b c
x y
c d
c e
;

* Check input - any node can have only one parent-node;
* A hierarchy with more than one child-node to a parent-node is supported;
proc sql;
	create table a as
		select next_id, count(*) as antal from id_lists
		group by next_id
		having antal > 1;
quit;
%if &sqlobs = 0 %then %do;

	* Add originating parent-nodes as their own parents;
	* Done because this is part of the wanted output (a-a, w-w);
	proc sql; 
		create table self_origin as 
			select distinct prev_id, prev_id as next_id
			from id_lists
			where 
				prev_id ne '' and
				prev_id not in (select next_id from id_lists);
	quit;
	data have; set id_lists self_origin;
	run;

	* Build a format to translate node to parent-node;
	data fmt; set have (rename=(next_id=start prev_id=label));
		retain fmtname '$parent';
	run;
	proc format cntlin=fmt;
	run;

	* Translate child-note to parent-node until translation is exhausted;
	data a (drop=prev_id oldorig); set have (rename=(next_id=id_sequence));
		sequence_origin = put(id_sequence,$parent.);
		do until (sequence_origin = '*');
			oldorig = sequence_origin;
			sequence_origin = put(sequence_origin,$parent.);
			if oldorig ne '' and oldorig = sequence_origin then do;
				output;
				leave;
			end;
		end;
	run;

	* sort output in same order as wanted, until now no sorting was done;
	proc sort data=a; by sequence_origin id_sequence;
	run;
%end;
%else %do;
	%put Sorry - input data is not a hierarchy;
%end;



The output is:

 

origin.gif

 

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 1188 views
  • 5 likes
  • 6 in conversation