@Patrick: 1. The problem with the sorting approach (regardless of how the sorting is done) is that it can map two strings which aren't duplicates per OP's definition the same way and, therefore, kill more records than needed. For example, the strings (case #1):
[1. 2. 3] [3.1.2] [2.3.1]
are duplicates of each other since they can be transformed into one another by moving the front item to the back a number of times. In other words, each string above represents the same circular queue. By contrast, the string (case #2):
[2.1.3]
isn't a duplicate of any of the case #1 strings because the items 1 and 2 are transposed, and so no circular rotation of its items can transform it into any of the case #1 strings.. For [2.1.3], the duplicates are [3.2.1] and [1.3.2] - again, all three belong to the same circular queue; but it's different from the queue in case #1.
Since sorting involves transpositions, it would map all of the above strings to [1.2.3] and thus render them all duplicate, scrambling the distinction between the cases #1 and #2. This is why the approach I've offered is to rotate the queue until the largest item becomes rightmost. After this is done, any string for case #1 becomes [1.2.3], while the strings for case #2 map to [2.1.3]. Hence, the hash table considers all the strings from case #1 duplicate, and so it does for all the strings in case #2; but it sees no duplicates between the cases.
If strings should contain duplicate items, the rotating algorithm above may run into a snag pointed out by @ChrisNZ. For example, if we have [1.2.3.3] and [3.3.1.2], the first rotation of the latter results in [3.1.2.3]. If, per the algorithm above, we stop here, the hash table will consider [1.2.3.3] and [3.1.2.3] different. To circumvent this, we need to rotate until he largest item is on the right, and yet it's not the same as the item on the left. Rotating [3.1.2.3] one more time achieves that by moving 3 from the left to the right resulting in [1.2.3.3]. Thus, by stipulating the criterion that we rotate until he largest item is on the right, and yet it's not the same as the item on the left, we always map the same circular queue regardless of the arrangement of its items to the same surrogate key.
There remains a case where rotating a queue in this manner results in an infinite loop - namely, when all the items are the same, and so UNTIL or WHILE intended to stop the loop when the leftmost and rightmost items become different can never evaluate as true. This can be worked around by either (a) not looping at all if all the items are the same (there're plenty of ways to ascertain that beforehand) or (b) always stopping the loop after it has iterated N(number of items)-1 times (as iterating one more time results into the string before the rotation).
2. I'm all for using an MD5 digest to reduce the hash entry length. But why make it $32 instead of $16 (which is what it is)? If this is just for the convenience of seeing only the hex characters rather than "garbage" and unprintable characters, we can simply attach the $hex32. format to the variable in question.
Kind regards
Paul D.
... View more