Solved: Extract id_sequence variable from previous_id and next _id variables

KonstantinVasil · Posted 02-28-2019 03:08 AM

I have a data set with two variables previous_id and next_id. The rows are not always ordered so that previous and next of two adjacent rows are the same, there is also no available variable for grouping by list origin.

Clarification: There is no variable by which to sort or group in the starting data set. The ids of the variables are not sequential numerically or alphabetically, they can be thought of as random (in the example they are sequential for simplicity of illustration).

I need an algorithm to track all id's that follow each unique originating id and output it in a new dataset.

Have:

data id_lists;
   input prev_id $ next_id $;
   datalines;
a b
w x
b c
x y
c d
;

Need:

data result;
    input id_sequence $ sequence_origin$;
    datalines;
a a
b a
c a
d a
w w
x w
y w
;

Astounding · Posted 02-28-2019 07:26 AM

No matter what solution you select, you will need to guard against the possibility of an infinite loop. For example:

data id_lists;
   input prev_id $ next_id $;
   datalines;
a b
b c
c a
;

Perhaps this shouldn't happen, but the program has to make sure or it might run for a very long time.

I suggest building a format that translates the NEXT_ID to PREV_ID. After creating ID_LISTS:

data id_lists2;
   set id_lists;
   retain fmtname '$id';
   rename next_id = label prev_id=start;
run;
proc format cntlin=id_lists2;
run;

Given these translations, you can apply them in this fashion. Begin with the list of IDs you would like to trace:

data result;
    input id_sequence $;
    datalines;
a 
b 
c 
d 
w 
x 
y 
;

. Then trace them back in this way:

data want;
   set result;
   sequence_origin = id_sequence;
   do k=1 to 50 until (sequence_origin = test_next);
      sequence_origin = test_next;
      test_next = put(sequence_origin, $id.);
   end;
run;

The idea is to loop through the translations up to 50 times per observation (stopping if an origin is found).

It's untested code, so you may need to tweak it a bit.

View solution in original post

andreas_lds · Posted 02-28-2019 03:14 AM

And now please explain how to create the first observation in your result dataset.

KonstantinVasil · Posted 02-28-2019 03:18 AM

How to create the results data set is the purpose of the post. What is the particular problem with the first observation?

andreas_lds · Posted 02-28-2019 03:31 AM

@KonstantinVasil wrote:

How to create the results data set is the purpose of the post. What is the particular problem with the first observation?

Maybe the word "create" in my post was not well chosen. If you want someone to write code for you, you should explain in full details what you want in way that someone who never had to solve such an issue understands the relationship between what you have and what you want.

I would start with sorting id_lists, then use retain or lag to compare prev_id with next_id of the previous observation to set/update sequence origin, retaining that variable seems to be necessary.

KonstantinVasil · Posted 02-28-2019 03:39 AM

This is one of the main difficulties, there is no variable by which to sort or group! The ids of the variables are not sequential in any way. I will add this clarification to the description.

andreas_lds · Posted 02-28-2019 03:40 AM

Try:

data id_lists;
   input prev_id $ next_id $;
   datalines;
a b
w x
b c
x y
c d
;
run;

proc sort data=work.id_lists out=work.sorted;
   by prev_id next_id;
run;


data work.result;
   set work.sorted end=jobDone;

   length id_sequence sequence_origin expected_id $ 1;
   retain sequence_origin expected_id;

   if _n_ = 1 or expected_id ^= prev_id then do;
      sequence_origin = prev_id;
   end;

   id_sequence = prev_id;
   expected_id = next_id;

   output;

   if jobDone then do;
      id_sequence = next_id;
      output;
   end;
run;


proc print;run;

KonstantinVasil · Posted 02-28-2019 05:52 AM

I am sorry for misleading. As I updated, sorting or grouping by any of the ids is not possible as they are not sequential.

Astounding · Posted 02-28-2019 07:26 AM

No matter what solution you select, you will need to guard against the possibility of an infinite loop. For example:

data id_lists;
   input prev_id $ next_id $;
   datalines;
a b
b c
c a
;

Perhaps this shouldn't happen, but the program has to make sure or it might run for a very long time.

I suggest building a format that translates the NEXT_ID to PREV_ID. After creating ID_LISTS:

data id_lists2;
   set id_lists;
   retain fmtname '$id';
   rename next_id = label prev_id=start;
run;
proc format cntlin=id_lists2;
run;

Given these translations, you can apply them in this fashion. Begin with the list of IDs you would like to trace:

data result;
    input id_sequence $;
    datalines;
a 
b 
c 
d 
w 
x 
y 
;

. Then trace them back in this way:

data want;
   set result;
   sequence_origin = id_sequence;
   do k=1 to 50 until (sequence_origin = test_next);
      sequence_origin = test_next;
      test_next = put(sequence_origin, $id.);
   end;
run;

The idea is to loop through the translations up to 50 times per observation (stopping if an origin is found).

It's untested code, so you may need to tweak it a bit.

KonstantinVasil · Posted 02-28-2019 07:49 AM

Thank you, but this seems to use the result data to trace back the input. What I need is to produce the result data from the input id_lists itself.

gamotte · Posted 02-28-2019 08:00 AM

Hello,

I think there is a proc for this kind of things but i don't remember the name.

Here is a solution with a recursive macro. An additional stop criterion should be

added to avoid infinite loops if they are likely to occur.

data id_lists;
   input prev_id $ next_id $;
   datalines;
a b
w x
b c
x y
c d
;
run;

proc sql noprint;
    CREATE TABLE steps AS
    SELECT DISTINCT step0 FROM
    (
    SELECT prev_id AS step0 FROM id_lists
    UNION
    SELECT next_id AS step0 FROM id_lists
    );
quit;


%macro recursive_join(iter=0);

    %local cont iter1;

    %let iter1=%eval(&iter.+1);

    proc sql noprint;
        CREATE TABLE steps AS
        SELECT i.prev_id AS step&iter1., s.*
        FROM steps s	
        LEFT JOIN id_lists i
        ON i.next_id=s.step&iter.;

        SELECT sum(step&iter1. ne " ")>0
        INTO :cont
        FROM steps;
    quit;

    %if &cont. %then %do;
        %recursive_join(iter=%eval(&iter1.));
    %end;

%mend;

%recursive_join;

data want;
    set steps;
    array s step:;
    rename step0=id_sequence;
    length sequence_origin $8.;

    sequence_origin=coalescec(of s(*));

    keep sequence_origin step0;
run;

Astounding · Posted 02-28-2019 08:11 AM

You're right about that. But it should be easy enough to adapt the program. It starts out the same way, creating the format. Then apply the format to the ID_LISTS data. However, I need to get some clarification to do this right.

Why does the number of observations change? What's the logic behind that?

KonstantinVasil · Posted 02-28-2019 09:19 AM

Yes I managed to modify it. Thanks for the hint - the solution with custom format is very elegant.

%macro track_id_sequences(id1,id2,input_data,output_data,max_iterations);
	/* 1. Store id data as a custom format = dictionary*/
	data temp; set &input_data.;
	   retain fmtname '$id';
	   rename &id2. = label &id1.=start;
	run;
	proc format cntlin=temp; run;

	/* 2. Translate the next id of each row multiple times until reaching the sequence end
		+ store number of translation = order of the id within the sequence */
	data &output_data.;   set &input_data.;
	   sequence_end = &id1.;
	   test_next = &id2.;
	   do seq_order=1 to &max_iterations. until (sequence_end = test_next);
	      sequence_end = test_next;
	      test_next = put(sequence_end, $id.);
	   end;
	run;

	/* 3. Group sequences by ending and order rows by order within the sequence
		+ add sequence ending as additional row */
	proc sort data=&output_data. out=&output_data.;
		by sequence_end descending seq_order;
	run;
	data &output_data.(keep= id_sequences sequence_end seq_order); set &output_data.;
		by sequence_end;
		if last.sequence_end then do;
			output; &id1. = sequence_end; seq_order=0; output;
		end; else output;
		rename &id1. = id_sequences ; 
	run;
%mend track_id_sequences;

data id_lists;
   input id_prev $ id_next $;
   datalines;
a b
w x
b c
x y
c d
;

%track_id_sequences(id1=id_prev , id2=id_next , input_data=id_lists, output_data=Result, max_iterations=100);

Ksharp · Posted 02-28-2019 08:08 AM

1) if there were one to many relationship in data ? like

a b
w x
a c
w y
w d

2) if there are some common nodes in multiple path ,what you gonna do ? like

a b
w x
b c
x b

ErikLund_Jensen · Posted 02-28-2019 09:50 AM

Hi @KonstantinVasil

I tried to follow @Astounding's idea with a format and came to the following solution. It expects input to be a hierarchy, i.e. any node can have only one parent-node, but several child-notes, and I added an extra line to input, so c->d and c->e to make an extra child-node.

data id_lists;
   input prev_id $ next_id $;
   datalines;
a b
w x
b c
x y
c d
c e
;

* Check input - any node can have only one parent-node;
* A hierarchy with more than one child-node to a parent-node is supported;
proc sql;
	create table a as
		select next_id, count(*) as antal from id_lists
		group by next_id
		having antal > 1;
quit;
%if &sqlobs = 0 %then %do;

	* Add originating parent-nodes as their own parents;
	* Done because this is part of the wanted output (a-a, w-w);
	proc sql; 
		create table self_origin as 
			select distinct prev_id, prev_id as next_id
			from id_lists
			where 
				prev_id ne '' and
				prev_id not in (select next_id from id_lists);
	quit;
	data have; set id_lists self_origin;
	run;

	* Build a format to translate node to parent-node;
	data fmt; set have (rename=(next_id=start prev_id=label));
		retain fmtname '$parent';
	run;
	proc format cntlin=fmt;
	run;

	* Translate child-note to parent-node until translation is exhausted;
	data a (drop=prev_id oldorig); set have (rename=(next_id=id_sequence));
		sequence_origin = put(id_sequence,$parent.);
		do until (sequence_origin = '*');
			oldorig = sequence_origin;
			sequence_origin = put(sequence_origin,$parent.);
			if oldorig ne '' and oldorig = sequence_origin then do;
				output;
				leave;
			end;
		end;
	run;

	* sort output in same order as wanted, until now no sorting was done;
	proc sort data=a; by sequence_origin id_sequence;
	run;
%end;
%else %do;
	%put Sorry - input data is not a hierarchy;
%end;

The output is:

Extract id_sequence variable from previous_id and next _id variables

Re: Extract id_sequence variable from previous_id and next _id variables

Re: Extract sequence list from previous_id and next _id variables

Re: Extract sequence list from previous_id and next _id variables

Re: Extract sequence list from previous_id and next _id variables

Re: Extract sequence list from previous_id and next _id variables

Re: Extract sequence list from previous_id and next _id variables

Re: Extract sequence list from previous_id and next _id variables

Re: Extract id_sequence variable from previous_id and next _id variables

Re: Extract id_sequence variable from previous_id and next _id variables

Re: Extract id_sequence variable from previous_id and next _id variables

Re: Extract id_sequence variable from previous_id and next _id variables

Re: Extract id_sequence variable from previous_id and next _id variables

Re: Extract id_sequence variable from previous_id and next _id variables

Re: Extract id_sequence variable from previous_id and next _id variables

Classroom Training Available!