Solved: Re: Recursive lookup for ID's - Page 2

Reeza · Posted 03-29-2016 01:50 PM

I posted the macro here as well if you wanted to try it:

https://gist.github.com/statgeek/14e3aa2a9f718f551cd98134e9ceed30

FreelanceReinh · Posted 03-29-2016 01:46 PM

I reckon your output is based on a different version of T1, because when I run the following steps, I obtain the result shown further below. Please note that datasets T1 and T2 are overwritten again and again in the iteration (%DO %UNTIL loop).

data have;
input ID1 $ ID2 $;
cards;
A1 A2
A2 A3
A3 A1
A4 A5
A6 A7
A7 A8
A1 A9
;

data have2;
set have
    have(rename=(ID1=ID2 ID2=ID1));
run;

proc sort data=have2;
by ID1;
run;

data t1;
set have2;
by ID1;
g+first.ID1;
run;

proc print data=t1 noobs;
run;

Result:

ID1    ID2    g

A1     A2     1
A1     A9     1
A1     A3     1
A2     A3     2
A2     A1     2
A3     A1     3
A3     A2     3
A4     A5     4
A5     A4     5
A6     A7     6
A7     A8     7
A7     A6     7
A8     A7     8
A9     A1     9

munitech4u · Posted 03-29-2016 02:00 PM

oh, maybe thats why. It was so much confusing me

munitech4u · Posted 03-30-2016 05:07 AM

%do %until(&n1=&n2);

Now this is sum of g in 2 tables. and we want to run the loop, until it gets equal. Don't you think, there are chances that these numbers may match prior to optimum?

e.g. 1+2+5+10 and 1+3+4+10 are same, especially when the observations get large

FreelanceReinh · Posted 03-30-2016 07:20 AM

@munitech4u wrote:
%do %until(&n1=&n2);

Now this is sum of g in 2 tables. and we want to run the loop, until it gets equal. Don't you think, there are chances that these numbers may match prior to optimum?

e.g. 1+2+5+10 and 1+3+4+10 are same, especially when the observations get large

I don't think so.

My reasoning behind this criterion was: After the initial assignment of group numbers on one "side" (i.e. based on ID1) in dataset HAVE2, the same procedure of reassignments of group numbers is applied alternately based on ID2, ID1, ID2, ID1 and so on.

As the new group numbers are always less than or equal to the previous group numbers (they are replaced by the minimum of their respective BY group), the sum of all group numbers is monotonically decreasing from one iteration step to the next. If it stays the same in one step (&n1=&n2), not a single group number may have changed in this step (because otherwise the decrease of this group number would have reduced the sum of all group numbers as well). This means, after the group numbers have been made consistent with regard to one of the two IDs in the previous iteration step, they now turn out to be already consistent also with regard to the other ID. Hence, the algorithm terminates.

A situation where the two sequences of group numbers from two consecutive steps are (1, 2, 5, 10) and (1, 3, 4, 10) is impossible, because group numbers can never increase (from 2 to 3 or from 4 to 5). The same argument holds for any two different sequences with equal sums (&n1=&n2).

Another question would be: Is it possible that the algorithm ends up in a sort of "local" optimum which is not the "global" optimum?

I say no. Because: In this case there would be at least two different IDs x, y such that their cluster numbers were equal (say, 1) in the "global optimum" solution and unequal (say, 1 and 2) in the "local optimum" solution. The former implies that there is a (finite) chain of linked IDs (linked by pairs in dataset HAVE2) from x to y all of which have cluster number 1. Following this same chain from both ends (x in cluster 1 and y in cluster 2) in the "local optimum" solution necessarily leads to a contradiction (an ID with two different cluster numbers).

Thanks for analyzing the algorithm so thoroughly. It would indeed be interesting to prove its correctness mathematically.

Since other solutions have now been suggested, you can also compare the results for validation purposes.

Reeza · Posted 03-29-2016 11:37 AM

https://communities.sas.com/t5/Base-SAS-Programming/How-to-create-a-group-ID-for-interlinked-ID-numb...

https://communities.sas.com/t5/tkb/articleprintpage/tkb-id/library/article-id/1045

Use the Subgraphs macro

FreelanceReinh · Posted 03-29-2016 11:18 AM

@munitech4u wrote:
Mine, is even more complex, as the relationship is not just one way, it can be two way. So like in the example give in other solution, the player id can occur in assignment id and vice versa.

I don't think that this makes it significantly more complex. In order to use the solutions in the other recent thread as much as possible, one could regard your ID1 and ID2 as completely separate sets of IDs (like the assignment and player IDs in the other thread), then assign the cluster numbers (these correspond to the group numbers g in my solution in the other thread) and finally merge the ID1 and ID2 values within each cluster. But intuitively I guess that the solution can be even simpler, given that there is only one type of ID.

munitech4u · Posted 03-29-2016 11:16 AM

I have an idea, but not sure how to execute this:

1. Get a dataset with all distinct IDs.

2. Now Pick the first ID, and look for all the possible IDs in ID1 or ID2 that occurs with this ID, put them in cluster 1.

3. Now Get all other distinct IDs that occur above and repeat the step above. Keep updating the cluster.

4. Loop until we reach at a point, when we have iterated through all the IDs in cluster 1.

5. Eliminate the cluster 1 IDs from the dataset with distinct IDs and the dataset we are looking into.

6. Repeat step 2-5 until we are finished with all IDs.

Example:

Distinct IDs

A1

A2

A3

A4

A5

A6

A7

A8

A9

First ID: A1

All possible combination from dataset:

A1 A2

A1 A9

Cluster 1: A1,A2,A9

Loop Through cluster 1 except A1: For A2, possible combination:

A2 A3

Update cluster 1: A1,A2,A3,A9

Loop Through cluster 1 except A1,A2: For A9, possible combination:

A9 A1

Update cluster 1: A1,A2,A3,A9

Eliminate A1,A2,A3,A9 from distinct IDs and observations wherever they occur in dataset

Repeat above steps, keep incrementing the cluster.

Ksharp · Posted 03-29-2016 10:54 PM

Here is :





data have;
infile cards ;
input from $  to $ ;
cards;
A1 A2
A2 A3
A4 A5
A6 A7
A7 A8
A1 A9
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata&colon;'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

munitech4u · Posted 03-30-2016 04:14 AM

Thanks for your effort Xia. But this hash thing goes over my head. 😄

Re: How to perform this in sas?

Re: How to perform this in sas?

Re: How to perform this in sas?

Re: How to perform this in sas?

Re: How to perform this in sas?

Re: How to perform this in sas?

Re: How to perform this in sas?

Re: How to perform this in sas?

Re: Recursive lookup for ID's

Re: Recursive lookup for ID's

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!