The issue is that, with any sort of sequential processing like a SAS data step, the linkage info can show up much later in the file, which requires (somehow) going back to all previously assigned ids and correcting all of them! Those corrections may make more corrections necessary, so there is a cascading ripple effect, hence recursion or a loop-until. This is the crux of the thorny problem.

My code for the Labor Day challenge was built for householding, where the fields are like "name", "address", "phone", "account number". Any records that share the same value in the same column must be put into a group. Structurally this is different from the OP's problem, where it's the same value in any column, hence my not posting it here. They are closely related though. With some modifications, it should work. Also note that whether it's 2 fields, 4 fields, or n fields is not really material; the concept is the same. And if you think about it, the OP really has only 1 field, because it is "same value in any column"! The records are really statements of arcs or linkages.

For my case, with the loop-until-no-more-inconsistency-is-found approach on 8 million customers with about 20 million records, it ran for 5-10 hours, making 15+ sweeps through the dataset. The number of sweeps reflects the order and depth implied by the data. This problem is interesting in that if you sort the input data stream so as to minimize linkages showing up later, it runs fast, but if the sorting is the opposite, you experience horrendous run times.

For the theoretical worst case, we can think of this as a binary tree. We put the bottom nodes first in the dataset: (1 2) (3 4) (1 3). Now we can replicate this:

(1 2) (3 4) (2 3) (5 6) (7 8) (5 7) (1 5) <this record connects the two subtrees and causes one branch to need to be fully relabeled>

and again:

(9 10) (11 12) (9 11) (12 13) (14 15) (12 14) (9 12) (1 9)

and again, and again, to any depth we wish.
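To make the sweep behaviour concrete, here is a minimal Python sketch of a loop-until-stable relabeling pass. It is not my SAS code; the function name `sweep_labels` and the "smallest id wins" rule are assumptions made purely for illustration. It runs the seven records from the example above in the order given and then in reverse order, and counts the sweeps each ordering needs.

```python
def sweep_labels(records):
    """Relabel by repeated sequential sweeps until nothing changes.

    Illustrative assumption: each id starts as its own group label and
    the smaller label wins whenever a record links two ids.
    Returns (labels, number_of_sweeps), counting the final clean sweep.
    """
    label = {}
    for a, b in records:
        label.setdefault(a, a)
        label.setdefault(b, b)

    sweeps = 0
    changed = True
    while changed:
        changed = False
        sweeps += 1
        for a, b in records:  # one sequential pass, like a data step
            low = min(label[a], label[b])
            if label[a] != low or label[b] != low:
                label[a] = label[b] = low
                changed = True
    return label, sweeps


# The seven records from the example, bottom nodes first.
records = [(1, 2), (3, 4), (2, 3), (5, 6), (7, 8), (5, 7), (1, 5)]

fwd_labels, fwd_sweeps = sweep_labels(records)
rev_labels, rev_sweeps = sweep_labels(list(reversed(records)))

print("as given:", fwd_sweeps, "sweeps")
print("reversed:", rev_sweeps, "sweeps")
```

Both orderings end with all eight ids in one group, but the ordering that puts the binding record last needs more sweeps. On a tree this small the gap is only one sweep; the point is that the sweep count depends on record order and depth, not just on record count.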
Always the last record binds two large trees, only much bigger and deeper. And the relabeling can only begin from the top of the tree and needs to cascade all the way to the bottom. If we were using a language that supported a) building linked-list data structures in memory and b) recursive function calls, this would be easy, provided the machine has sufficient stack depth at run time to accommodate the recursion depth. But that is not the SAS data step at all, hence the pain. We can write code that would generate this binary tree structure to any arbitrary depth (hmm... should be fun) and feed it to any proposed code to verify correctness.
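As a rough sketch of what such a test harness might look like (in Python rather than SAS, with hypothetical names `tree_edges` and `component_labels`), the generator below emits the leaves-first pattern to a chosen depth. The reference grouping uses a union-find (disjoint-set) structure, which is a technique I am swapping in here, not the linked-list-plus-recursion approach mentioned above; it is written iteratively so stack depth is not a concern.

```python
def tree_edges(depth, start=1):
    """Linkage records for 2**depth ids, bottom nodes first.

    The last record of every (sub)tree is the one that binds its two
    halves, matching the worst-case pattern described in the post.
    """
    if depth == 0:
        return []  # a single id, no linkage records
    half = 2 ** (depth - 1)
    left = tree_edges(depth - 1, start)
    right = tree_edges(depth - 1, start + half)
    return left + right + [(start, start + half)]


def component_labels(records):
    """Reference grouping via union-find; label = smallest id in group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:      # walk up to the root
            root = parent[root]
        while parent[x] != root:         # compress the path just walked
            parent[x], x = root, parent[x]
        return root

    for a, b in records:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smaller id stays the root
    return {x: find(x) for x in list(parent)}


edges = tree_edges(3)
print(edges)

# Every id in a generated tree must end up in a single group.
labels = component_labels(tree_edges(10))
print(len(labels), "ids in", len(set(labels.values())), "group(s)")
```

The output of any proposed solution can then be compared id by id against `component_labels` on the same records, for any depth and for shuffled or reversed record orders.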