I posted the macro here as well if you wanted to try it:
https://gist.github.com/statgeek/14e3aa2a9f718f551cd98134e9ceed30
I suspect your output is based on a different version of T1, because when I run the steps below, I get the result shown after the code. Please note that datasets T1 and T2 are overwritten repeatedly in the iteration (the %DO %UNTIL loop).
data have;
  input ID1 $ ID2 $;
  cards;
A1 A2
A2 A3
A3 A1
A4 A5
A6 A7
A7 A8
A1 A9
;

data have2;
  set have
      have(rename=(ID1=ID2 ID2=ID1));
run;

proc sort data=have2;
  by ID1;
run;

data t1;
  set have2;
  by ID1;
  g+first.ID1;
run;

proc print data=t1 noobs;
run;
Result:
ID1  ID2  g
A1   A2   1
A1   A9   1
A1   A3   1
A2   A3   2
A2   A1   2
A3   A1   3
A3   A2   3
A4   A5   4
A5   A4   5
A6   A7   6
A7   A8   7
A7   A6   7
A8   A7   8
A9   A1   9
@munitech4u wrote:
%do %until(&n1=&n2);
Now this is the sum of g in the two tables, and we want to run the loop until they become equal. Don't you think there is a chance that these sums could match before reaching the optimum?
E.g., 1+2+5+10 and 1+3+4+10 are equal, and this becomes more likely when the number of observations gets large.
I don't think so.
My reasoning behind this criterion was: After the initial assignment of group numbers on one "side" (i.e. based on ID1) in dataset HAVE2, the same procedure of reassignments of group numbers is applied alternately based on ID2, ID1, ID2, ID1 and so on.
As the new group numbers are always less than or equal to the previous group numbers (each is replaced by the minimum of its respective BY group), the sum of all group numbers decreases monotonically from one iteration step to the next. If it stays the same in one step (&n1=&n2), not a single group number can have changed in this step (otherwise the decrease of that group number would have reduced the sum of all group numbers as well). This means that, after the group numbers have been made consistent with regard to one of the two IDs in the previous iteration step, they turn out to be already consistent with regard to the other ID as well. Hence, the algorithm terminates.
A situation where the two sequences of group numbers from two consecutive steps are (1, 2, 5, 10) and (1, 3, 4, 10) is impossible, because group numbers can never increase (from 2 to 3 or from 4 to 5). The same argument holds for any two different sequences with equal sums (&n1=&n2).
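In case you want to experiment with this termination criterion, here is a minimal sketch of the loop. It is not the exact macro from the gist linked above; the table names T1/T2, the group variable G and the macro variables N1/N2 are assumptions taken from the description, with T1 created as in the code above (initial group numbers based on ID1).

%macro cluster_sketch;
  %local n1 n2;
  %do %until(&n1=&n2);
    /* Reassign each G to the minimum of its BY group based on ID2 ... */
    proc sort data=t1; by ID2 g; run;
    data t2;
      set t1;
      by ID2;
      retain newg;
      if first.ID2 then newg=g;
      g=newg;
      drop newg;
    run;
    /* ... then the same based on ID1 */
    proc sort data=t2; by ID1 g; run;
    data t1;
      set t2;
      by ID1;
      retain newg;
      if first.ID1 then newg=g;
      g=newg;
      drop newg;
    run;
    /* Compare the sums of G in the two tables */
    proc sql noprint;
      select sum(g) into :n2 trimmed from t2;
      select sum(g) into :n1 trimmed from t1;
    quit;
  %end;
%mend cluster_sketch;

%cluster_sketch

After the loop, rows of T1 that share the same value of G belong to the same cluster (the G values need not be consecutive).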
Another question would be: Is it possible that the algorithm ends up in a sort of "local" optimum which is not the "global" optimum?
I say no, because in this case there would be at least two different IDs x, y such that their cluster numbers were equal (say, 1) in the "global optimum" solution and unequal (say, 1 and 2) in the "local optimum" solution. The former implies that there is a (finite) chain of linked IDs (linked by pairs in dataset HAVE2) from x to y, all of which have cluster number 1. Following this same chain from both ends (x in cluster 1 and y in cluster 2) in the "local optimum" solution necessarily leads to a contradiction (an ID with two different cluster numbers).
Thanks for analyzing the algorithm so thoroughly. It would indeed be interesting to prove its correctness mathematically.
Since other solutions have now been suggested, you can also compare the results for validation purposes.
@munitech4u wrote:
Mine is even more complex, as the relationship is not just one-way; it can be two-way. So, as in the example given in the other solution, the player ID can occur as an assignment ID and vice versa.
I don't think that this makes it significantly more complex. In order to use the solutions in the other recent thread as much as possible, one could regard your ID1 and ID2 as completely separate sets of IDs (like the assignment and player IDs in the other thread), then assign the cluster numbers (these correspond to the group numbers g in my solution in the other thread) and finally merge the ID1 and ID2 values within each cluster. But intuitively I guess that the solution can be even simpler, given that there is only one type of ID.
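As a rough illustration of that final merge step, here is a sketch only; it assumes the pairs with their cluster number G sit in a table named T1 as above:

proc sql;
  create table id_clusters as
  select ID1 as ID, g from t1
  union                          /* UNION removes duplicate (ID, G) pairs */
  select ID2 as ID, g from t1
  order by g, ID;
quit;

This yields one row per ID with its cluster number, provided the G values are already consistent within each cluster.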
I have an idea, but I'm not sure how to execute it:
1. Get a dataset with all distinct IDs (see the sketch after the example below).
2. Now pick the first ID, look for all IDs in ID1 or ID2 that occur together with it, and put them in cluster 1.
3. Now get all the other distinct IDs found above and repeat the previous step, updating the cluster as you go.
4. Loop until we have iterated through all the IDs in cluster 1.
5. Eliminate the cluster 1 IDs from the distinct-ID dataset and from the dataset we are working on.
6. Repeat steps 2-5 until all IDs are finished.
Example:
Distinct IDs
A1
A2
A3
A4
A5
A6
A7
A8
A9
First ID: A1
All possible combinations from the dataset:
A1 A2
A1 A9
Cluster 1: A1,A2,A9
Loop through cluster 1 except A1: for A2, the possible combinations are:
A2 A3
Update cluster 1: A1,A2,A3,A9
Loop through cluster 1 except A1, A2: for A9, the possible combinations are:
A9 A1
Update cluster 1: A1,A2,A3,A9
Eliminate A1, A2, A3, A9 from the distinct IDs and from the observations wherever they occur in the dataset.
Repeat the above steps, incrementing the cluster number each time.
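To illustrate step 1 of the list above: the distinct-ID dataset can be built with a simple UNION (using the dataset HAVE from the earlier posts); the traversal in steps 2-6 corresponds closely to the hash-object solution in the next reply.

proc sql;
  create table distinct_ids as
  select ID1 as ID from have
  union                          /* UNION keeps only distinct values */
  select ID2 as ID from have
  order by ID;
quit;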
Here is a solution using hash objects:

data have;
  infile cards;
  input from $ to $;
cards;
A1 A2
A2 A3
A4 A5
A6 A7
A7 A8
A1 A9
;
run;

/* Double the edges (both directions) and collect the distinct node names */
data full;
  set have end=last;
  if _n_ eq 1 then do;
    declare hash h();
    h.definekey('node');
    h.definedata('node');
    h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node; output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;

/* Traverse the graph node by node and assign a household (cluster) number */
data want(keep=node household);
  declare hash ha(ordered:'a');        /* nodes discovered in the current cluster, in order */
  declare hiter hi('ha');
  ha.definekey('count');
  ha.definedata('last');
  ha.definedone();
  declare hash _ha(hashexp: 20);       /* already-visited nodes */
  _ha.definekey('key');
  _ha.definedone();

  if 0 then set full;
  /* edge list (both directions) as a multidata hash */
  declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
  from_to.definekey('from');
  from_to.definedata('to');
  from_to.definedone();

  if 0 then set node;
  declare hash no(dataset:'node');     /* nodes not yet assigned to a cluster */
  declare hiter hi_no('no');
  no.definekey('node');
  no.definedata('node');
  no.definedone();

  do while(hi_no.next()=0);
    household+1;
    output;
    count=1;
    key=node; _ha.add();
    last=node; ha.add();
    rc=hi.first();
    do while(rc=0);
      from=last; rx=from_to.find();
      do while(rx=0);
        key=to; ry=_ha.check();
        if ry ne 0 then do;
          node=to; output; rr=no.remove(key:node);
          key=to; _ha.add();
          count+1;
          last=to; ha.add();
        end;
        rx=from_to.find_next();
      end;
      rc=hi.next();
    end;
    ha.clear(); _ha.clear();
  end;
  stop;
run;