@munitech4u wrote: %do %until(&n1=&n2); Now this is sum of g in 2 tables. and we want to run the loop, until it gets equal. Don't you think, there are chances that these numbers may match prior to optimum? e.g. 1+2+5+10 and 1+3+4+10 are same, especially when the observations get large
I don't think so.
My reasoning behind this criterion was: After the initial assignment of group numbers on one "side" (i.e. based on ID1) in dataset HAVE2, the same procedure of reassignments of group numbers is applied alternately based on ID2, ID1, ID2, ID1 and so on.
As the new group numbers are always less than or equal to the previous group numbers (they are replaced by the minimum of their respective BY group), the sum of all group numbers is monotonically decreasing from one iteration step to the next. If it stays the same in one step (&n1=&n2), not a single group number may have changed in this step (because otherwise the decrease of this group number would have reduced the sum of all group numbers as well). This means, after the group numbers have been made consistent with regard to one of the two IDs in the previous iteration step, they now turn out to be already consistent also with regard to the other ID. Hence, the algorithm terminates.
A situation where the two sequences of group numbers from two consecutive steps are (1, 2, 5, 10) and (1, 3, 4, 10) is impossible, because group numbers can never increase (from 2 to 3 or from 4 to 5). The same argument holds for any two different sequences with equal sums (&n1=&n2).
Another question would be: Is it possible that the algorithm ends up in a sort of "local" optimum which is not the "global" optimum?
I say no. Because: In this case there would be at least two different IDs x, y such that their cluster numbers were equal (say, 1) in the "global optimum" solution and unequal (say, 1 and 2) in the "local optimum" solution. The former implies that there is a (finite) chain of linked IDs (linked by pairs in dataset HAVE2) from x to y all of which have cluster number 1. Following this same chain from both ends (x in cluster 1 and y in cluster 2) in the "local optimum" solution necessarily leads to a contradiction (an ID with two different cluster numbers).
Thanks for analyzing the algorithm so thoroughly. It would indeed be interesting to prove its correctness mathematically.
Since other solutions have now been suggested, you can also compare the results for validation purposes.
... View more