Solved: Re: non-trivial algorythm for char fields - Page 2

FreelanceReinh · Posted 10-22-2019 01:36 PM

@Reeza: I take it that @Ivan555 regards two strings (in variable NAME) as "duplicates" if they are cyclic permutations of each other, where the items to be permuted are the "words" (numbers) separated by periods. That is, the set of potential duplicates of 'B.A.C.D' would be

{'A.C.D.B', 'B.A.C.D', 'C.D.B.A', 'D.B.A.C'}

Ivan555 · Posted 10-22-2019 02:02 PM

@FreelanceReinh you took it correct, thank you.

Patrick · Posted 10-22-2019 10:40 PM

The following added later after reading @hashman 's comment: Below code will group all permutations of identical elements together. It will not distinct between different chains with identical elements.

Below approach reads the source string into an array, sorts the array and then creates a md5 digest value as group key.

The 2nd part of the code then uses a hash table to only keep the first instance per group key. You could also skip this step and replace it with a proc sort nodupkey.

data want /*(drop=_:)*/;
  set have;
  length _GroupKey $32;
  array _words {20} $10. _temporary_;
  call missing(of _words[*]);

  /* create variable _GroupKey */
  do _i=1 to dim(_words);
    _words[_i]=scan(name,_i,'.');
    if missing(_words[_i]) then leave;
  end;
  call sortc(of _words[*]);
  _GroupKey=put(md5(catx('|',of _words[*])),hex32.);

  /* keep only first instance per _GroupKey */
  if _n_=1 then
    do;
      dcl hash h1();
      h1.defineKey('_GroupKey');
      h1.defineDone();
    end;
  if h1.check() ne 0 then
    do;
      output;
      h1.add();
    end;
run ;

proc print;
run;

Or here another option which will work for any number of words in the source string.


data want /*(drop=_:)*/;

  set have;
  length _NameSorted $32 _word $20;

  /* create and populate variable _NameSorted */
  if _n_=1 then
    do;
      dcl hash h1(ordered:'y', multidata:'y', hashexp:4);
      dcl hiter hh1('h1');
      h1.defineKey('_word');
      h1.defineData('_word');
      h1.defineDone();
    end;
  h1.clear();

  do _i=1 by 1 until(missing(_word));
    _word=scan(name,_i,'.');
    h1.add();
  end;

  _rc = hh1.first();
  do while(_rc = 0);
    _NameSorted=catx('.',_NameSorted,_word);
    _rc = hh1.next();
  end;

  /* keep only first instance per _NameSorted */
  if _n_=1 then
    do;
      dcl hash h2();
      h2.defineKey('_NameSorted');
      h2.defineDone();
    end;
  if h2.check() ne 0 then
    do;
      output;
      h2.add();
    end;
run ;

hashman · Posted 10-23-2019 12:19 AM

@Patrick:
1. The problem with the sorting approach (regardless of how the sorting is done) is that it can map two strings which aren't duplicates per OP's definition the same way and, therefore, kill more records than needed. For example, the strings (case #1):

[1. 2. 3] [3.1.2] [2.3.1]

are duplicates of each other since they can be transformed into one another by moving the front item to the back a number of times. In other words, each string above represents the same circular queue. By contrast, the string (case #2):

[2.1.3]

isn't a duplicate of any of the case #1 strings because the items 1 and 2 are transposed, and so no circular rotation of its items can transform it into any of the case #1 strings.. For [2.1.3], the duplicates are [3.2.1] and [1.3.2] - again, all three belong to the same circular queue; but it's different from the queue in case #1.

Since sorting involves transpositions, it would map all of the above strings to [1.2.3] and thus render them all duplicate, scrambling the distinction between the cases #1 and #2. This is why the approach I've offered is to rotate the queue until the largest item becomes rightmost. After this is done, any string for case #1 becomes [1.2.3], while the strings for case #2 map to [2.1.3]. Hence, the hash table considers all the strings from case #1 duplicate, and so it does for all the strings in case #2; but it sees no duplicates between the cases.

If strings should contain duplicate items, the rotating algorithm above may run into a snag pointed out by @ChrisNZ. For example, if we have [1.2.3.3] and [3.3.1.2], the first rotation of the latter results in [3.1.2.3]. If, per the algorithm above, we stop here, the hash table will consider [1.2.3.3] and [3.1.2.3] different. To circumvent this, we need to rotate until he largest item is on the right, and yet it's not the same as the item on the left. Rotating [3.1.2.3] one more time achieves that by moving 3 from the left to the right resulting in [1.2.3.3]. Thus, by stipulating the criterion that we rotate until he largest item is on the right, and yet it's not the same as the item on the left, we always map the same circular queue regardless of the arrangement of its items to the same surrogate key.

There remains a case where rotating a queue in this manner results in an infinite loop - namely, when all the items are the same, and so UNTIL or WHILE intended to stop the loop when the leftmost and rightmost items become different can never evaluate as true. This can be worked around by either (a) not looping at all if all the items are the same (there're plenty of ways to ascertain that beforehand) or (b) always stopping the loop after it has iterated N(number of items)-1 times (as iterating one more time results into the string before the rotation).

2. I'm all for using an MD5 digest to reduce the hash entry length. But why make it $32 instead of $16 (which is what it is)? If this is just for the convenience of seeing only the hex characters rather than "garbage" and unprintable characters, we can simply attach the $hex32. format to the variable in question.

Kind regards

Paul D.

Patrick · Posted 10-24-2019 08:01 PM

@hashman

Thanks for the clarification. I've missed the circular requirement when I've posted my solution. I've had permutations in mind.

I'll add a comment to my post.

PGStats · Posted 10-23-2019 12:42 AM

A graph is defined both by nodes and links. For two graphs to be equal, they must have the same nodes and links. Here is a simple way to ensure that:

data HAVE ;                                                                                                                             
  input NUM NAME :$200. ; 
  cards ;                                                                                                                               
 1 281.3891.3891.281                                                                                                                   
 2 281.281.3891.3891                                                                                                                   
 3 1162.5645.5645.500835.500835.1162                                                                                                   
 4 5645.500835.500835.1162.1162.5645                                                                                                   
 5 500835.1162.1162.5645.5645.500835                                                                                                   
 6 1349.1162.1162.5645.5645.500835.500835.1349                                                                                         
 7 1162.5645.5645.500835.500835.1349.1349.1162                                                                                         
 8 5645.500835.500835.1349.1349.1162.1162.5645                                                                                         
 9 500835.1349.1349.1162.1162.5645.5645.500835                                                                                         
10 5645.1162.1162.500835.500835.5645                                                                                                   
11 1162.500835.500835.5645.5645.1162                                                                                                   
12 500835.5645.5645.1162.1162.500835
13 1234                                                                                             
;

data temp;
array c_{99} $16;
set have;
length key $400;
n = countw(name,".");
do i = 0 to n-1;
	c_{i+1} = catx("-", scan(name,i+1,"."), scan(name,mod(i+1,n)+1,"."));
	end;
call sortc(of c_{*});
key = catx(".", of c_{*});
drop n i c_:;
run;

proc sort data=temp out=want(drop=key) nodupkey; by key; run;

Note, I added a trivial case (13), just for safety.

PG

hashman · Posted 10-23-2019 12:55 PM

@PGStats:

An interesting angle to look at the problem - and the corresponding code.

As far as I'm concerned, the more angles, the merrier.

Kind regards

Paul D.

PGStats · Posted 10-23-2019 01:44 PM

After sleeping on it and reading @FreelanceReinh investigation, I realize that directed graphs can be traversed with many non identical paths when they include sub-cycles, such as:

data HAVE ;                                                                                                                             
  input NUM NAME :$200. ; 
  cards ;                                                                                                                               
 1 1.2.3.1.4.5.1.6.7.1                                                                                                                   
 2 1.2.3.1.6.7.1.4.5.1                                                                                                                 
;

My code identifies both paths as identical, when they are not.

PG

hashman · Posted 10-22-2019 03:40 PM

@Ivan555:

As I understand, your definition of duplicates is the names whose dot-delimited parts form the same circular queue. If so, we need to:

find the largest (or smallest) part
rotate the queue until this part ends up at the end (or beginning) of the name
if a subsequent name formed in such a manner is the same as previous, ignore the record

(EDIT: @ChrisNZ has pointed out - correctly - that there's a hole in the original queue rotation logic in case the largest chunks ends up both at the top and the bottom. The queue should be rotated further and stop when the largest chunk is on the top but not on the bottom. Below, a provision is also made to stop the loop from rotating infinitely if all the chunks in the string are the same. The test string @ChrisNZ used to show the flaw is included in the sample data set as the 0 record.)

In other (SAS) words:

data have ;                                                                                                                             
  input NUM NAME :$200. ;                                                                                                               
  cards ;                                                                                                                               
 0  281.281.3891.3891                                                                                                                   
 1  281.3891.3891.281                                                                                                                   
 2  3891.281.281.3891                                                                                                                   
 3  1162.5645.5645.500835.500835.1162                                                                                                   
 4  5645.500835.500835.1162.1162.5645                                                                                                   
 5  500835.1162.1162.5645.5645.500835                                                                                                   
 6  1349.1162.1162.5645.5645.500835.500835.1349                                                                                         
 7  1162.5645.5645.500835.500835.1349.1349.1162                                                                                         
 8  5645.500835.500835.1349.1349.1162.1162.5645                                                                                         
 9  500835.1349.1349.1162.1162.5645.5645.500835                                                                                         
10  5645.1162.1162.500835.500835.5645                                                                                                   
11  1162.500835.500835.5645.5645.1162                                                                                                   
12  500835.5645.5645.1162.1162.500835                                                                                                   
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want (drop = _:) ;                                                                                                                 
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h () ;                                                                                                                     
    h.definekey ("_nm") ;                                                                                                               
    h.definedone () ;                                                                                                                   
  end ;                                                                                                                                 
  set have ;                                                                                                                            
  length _tm _t $ 16 ;                                                                                                                  
  _nm = name ;                                                                                                                          
  do _n_ = 1 to countw (_nm) ;                                                                                                          
    _t = scan (_nm, _n_) ;                                                                                                              
    if _t > _tm then _tm = _t ;                                                                                                         
  end ;                                                                                                                                 
  do _n_ = 1 to countw (_nm) - 1 while (not (scan (_nm, 1) < _t = _tm)) ;                                                               
    _nm = catx (".", substr (_nm, findc (_nm, ".") + 1), scan (_nm, 1)) ;                                                               
    _t = scan (_nm, -1) ;                                                                                                               
  end ;                                                                                                                                 
  if h.check() ne 0 ;                                                                                                                   
  h.add() ;                                                                                                                             
run ;

As a result, you get the records 0, 3, 6, 10.

Kind regards

Paul D.

ChrisNZ · Posted 10-22-2019 05:34 PM

@hashman

You logic fails for this record #2.

I amended it slightly to generate a better sorted comparison string.

data HAVE ;                                                                                                                             
  input NUM NAME :$200. ; 
  cards ;                                                                                                                               
 1 281.3891.3891.281                                                                                                                   
 2 281.281.3891.3891                                                                                                                   
 3 1162.5645.5645.500835.500835.1162                                                                                                   
 4 5645.500835.500835.1162.1162.5645                                                                                                   
 5 500835.1162.1162.5645.5645.500835                                                                                                   
 6 1349.1162.1162.5645.5645.500835.500835.1349                                                                                         
 7 1162.5645.5645.500835.500835.1349.1349.1162                                                                                         
 8 5645.500835.500835.1349.1349.1162.1162.5645                                                                                         
 9 500835.1349.1349.1162.1162.5645.5645.500835                                                                                         
10 5645.1162.1162.500835.500835.5645                                                                                                   
11 1162.500835.500835.5645.5645.1162                                                                                                   
12 500835.5645.5645.1162.1162.500835                                                                                                   
run;  

data WANT (drop = _:) ;                                                                                                                 
  if _N_ = 1 then do ;                                                                                                                  
    dcl hash H () ;                                                                                                                     
    H.definekey ("_NM") ;                                                                                                               
    H.definedone () ;                                                                                                                   
  end ;                                                                                                                                 
  set HAVE ;                                                                                                                            
  length _TM _T $16 ;                                                                                                                  
  _NM = NAME ;                                                                                                                          
  do _N_ = 1 to countw (_NM) ;                                                                                                          
    _T = scan (_NM, _N_) ;                                                                                                              
    if _T > _TM then _TM = _T ;                                                                                                         
  end ;                                                                                                                           
  do until (scan (_NM, 1) ^= _TM & scan (_NM, -1) = _TM ) ;                                                                                                                
    _T = scan (_NM, 1) ;                                                                                                                
    _NM = catx ('.', substr (_NM, findc (_NM, ".") + 1), _T) ;                                                                       
  end ;                                                                                                                                 
  if H.check() ne 0 ;                                                                                                                   
  H.add() ;                                                                                                                             
run ;

High-Performance SAS Coding - Third Edition

hashman · Posted 10-22-2019 08:06 PM

@ChrisNZ:

Thanks for pointing it out! and amending the code.

The logic is fundamentally correct but I've underrotated the queue with respect to possible dupes in the string itself. The rotation should stop when the largest chunk is atop the queue and not at the bottom at the same time.

Note that we had better make a provision to avoid the infinite loop in case all chunks are the same. I'll include it in the amendment to my original reply.

Kind regards

Paul D.

FreelanceReinh · Posted 10-23-2019 07:51 AM

I'd like to suggest a variant of @hashman's hash approach, adding all different cyclic permutations of the names from HAVE to the hash object (see, however, less memory consuming variant under "Edit 3"):

data want(drop=_:);
if _n_=1 then do;
  dcl hash h();
  h.definekey('_nm');
  h.definedone();
end;
set have;
_nm=name;
if h.check();
h.add();
do _i=1 to countw(_nm,'.')-1;
  _nm=catx('.',substr(_nm,findc(_nm,'.')+1),scan(_nm,1));
  h.ref();
end;
run;

I'm afraid to say that this variant, the original (i.e. the edited version after ChrisNZ's comment) and @PGStats's code yield three different results for the below HAVE dataset; 629, 759 and 599 observations, respectively.

data have(drop=i:);
length name $9;
do i1=1 to 5;
  do i2=1 to 5;
    do i3=1 to 5;
      do i4=1 to 5;
        do i5=1 to 5;
          name=catx('.',of i:);
          num+1;
          output;
        end;
      end;
    end;
  end;
end;
run; /* 5**5=3125 obs. */

Edit: Comparing the three results, it appears that among the 759 obs. there are both '1.1.2.1.2' and '1.2.1.1.2' in spite of their equivalence and in the 599 obs. the equivalence class of '1.1.3.1.2' is not represented.

Edit 2: Evidence that a correct solution would have 629 observations can be found here: http://oeis.org/A056665

Edit 3: Here's another variant, which saves only one representative per equivalence class in the hash object.

data want(drop=_:);
if _n_=1 then do;
  dcl hash h();
  h.definekey('_nm');
  h.definedone();
end;
set have;
_nm=name;
_rc=h.check();
do _i=1 to countw(_nm)-1 while(_rc);
  _nm=catx('.',substr(_nm,findc(_nm,'.')+1),scan(_nm,1));
  _rc=h.check();
end;
if _rc;
h.add();
run;

Ivan555 · Posted 10-24-2019 07:08 AM

Big thanks to Everyone.

When wryting the topic I did not expect to get such detailed response.

Several answers are solutions to my problem, and I am a little confused which one I should mark as a solution.

I think it might be fair to choose the @hashman's approach but solutions of @PGStats @ChrisNZ @FreelanceReinh also give correct results on my data.

The only-SQL approach of @PGStats looks very smart, thanks a lot for it - I didn’t think that SQL could solve this issue so elegantly.

__
Looking at the code I have one simple question.

Inside the code some of you use an underscore near the variables (the “_” character).

The purpose of this is to increase code readability, am I right?

From the SAS compiler's point the variables "N" and "_N", "n" and "_n_" are the same, right?

FreelanceReinh · Posted 10-24-2019 09:04 AM

@Ivan555 wrote:

Several answers are solutions to my problem, and I am a little confused which one I should mark as a solution.

Ideally, the accepted solution should work not only for your current data, but also for similar data which later readers (or yourself or later users of your program) might apply it to.

@Ivan555 wrote:

I think it might be fair to choose the @hashman's approach ...

@hashman: You may integrate my (or another) correction into your post if you like, so that everything is in one place.

Looking at the code I have one simple question.

Inside the code some of you use an underscore near the variables (the “_” character).

The purpose of this is to increase code readability, am I right?

No, not in the first place. The leading underscore in user-defined variable names (at least as I use it) has two purposes:

Assuming that, as a rule, people's real datasets do not contain variables whose names start with an underscore (except for standard variables from some procedures, e.g. PROC SUMMARY), the leading underscore avoids name conflicts. (Sometimes, even double or triple underscores are used to be "really" sure that those variables don't collide with existing variables.)
It indicates temporary variables which can be dropped eventually using the variable list _: in a DROP statement or DROP= dataset option.

From the SAS compiler's point the variables "N" and "_N", "n" and "_n_" are the same, right?

No, only N and n are the same. _n_ (=_N_) is an automatic variable (see documentation) in every DATA step. It counts DATA step iterations (but can also be used for other purposes by programmers who know what they're doing) and is dropped automatically. It must not be confused with a variable _N, which a user may create like any other variable (such as X, n, testvar1, _foo, etc.).

Ivan555 · Posted 10-24-2019 11:54 AM

@FreelanceReinh wrote:
Ideally, the accepted solution should work not only for your current data, but also for similar data which later readers (or yourself or later users of your program) might apply it to.

If the proposed methods were combined in one post, I would choose it, but useful information is split between many posts of the thread.. I can't judge about later users, but for me almost every post in this thread is useful 🙂

@FreelanceReinh wrote:
No, only N and n are the same. _n_ (=_N_) is an automatic variable (see documentation) in every DATA step. It counts DATA step iterations (but can also be used for other purposes by programmers who know what they're doing) and is dropped automatically. It must not be confused with a variable _N, which a user may create like any other variable (such as X, n, testvar1, _foo, etc.).

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!