BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
DominicCavenagh
Fluorite | Level 6

Hi All,

I am trying to compute all valid character strings based on some constraints:

1. Each string is 4 alpha characters long i.e. "ABCD" or "QRTS"

2. A string is declared invalid if there is already a valid string where there are the same two characters next to each other. i.e. If "ABCD" is already a valid string, then "BACD" would be invalid due to AB being next to each other in the first two positions. Consequently "ABCD" would make the following three strings invalid: "BACD", "ACBD", "ABDC".

 

My approach so far has been to generate all 26^4 combinations (CHAR4 in the code). Then for each string, generate the strings that this string makes invalid (T_12, T_23, T_34 in the code).  Then loop through each row and merge back into the full list of strings, marking any string that is the same as {T_12, T_23, T_34} as invalid. Then move to the next string and repeat. The resulting dataset should have all the strings, and those that are invalid are flagged i.e. ineligible=1.

 

As you can probably imagine due to the ~500,000 strings and the amount of repeated merging, this process is taking a long time and I havent been able to run it till finish yet. So far I have left it going for 24 hours without any luck.

I feel like there is a better way to do this, but cant put my finger on it, does anyone have any advice on a way to optimise my program, or a better way to produce these strings altogether? Thanks in advance

 

See below for my current approach:

** SET UP POSSIBLE PERMUTATIONS OF THE 4-PART TEXT STRING - 26^4=456,976 combinations ;
proc format;
	value alpha
		1="A"  2="B" 3="C" 4="D" 5="E" 6="F" 7="G" 8="H" 9="I" 10="J"
		11="K" 12="L" 13="M" 14="N" 15="O" 16="P" 17="Q" 18="R" 19="S" 20="T"
		21="U" 22="V" 23="W" 24="X" 25="Y" 26="Z";
	run;

data alpha;
	do N1=1 to 26;
		do N2=1 to 26;
			do N3=1 to 26;
				do N4=1 to 26;
					output;
				end;
			end;
		end;
	end;
	format N1 N2 N3 N4 alpha.;
	run;
data new1;
	set alpha;
	length C1 C2 C3 C4 $1.;
	c1=compress(vvalue(N1));
	c2=compress(vvalue(N2));
	c3=compress(vvalue(N3));
	c4=compress(vvalue(N4));

	CHAR4=C1||C2||C3||C4;

	* by default: ineligible=0;
	INELIGIBLE=0;

	* this variable will be used for 1-to-many merging;
	ALL=1;

	* set up the combinations with potential transcription errors;
	T_12=c2||c1||c3||c4;
	T_23=c1||c3||c2||c4;
	T_34=c1||c2||c4||c3;
	run;

* this is the dataset to be overwritten with each loop ;
data new2;
	set new1;
	run;
proc sort; by all; run;

%macro loopy (start, stop);	
%do i=&start. %to &stop.;
	data select1;
		set new2;
		if _N_=&i. and ineligible=0;
		* keep the combinations with potential transcription errors;
		TRANSPOSE12=T_12;
		TRANSPOSE23=T_23;
		TRANSPOSE34=T_34;
		keep ALL TRANSPOSE12 TRANSPOSE23 TRANSPOSE34;
		run;

	data new2;
		merge NEW2 (in=in1) SELECT1;
		by all;
		if in1;
		* if CHAR4 is the same as any of the transposed combinations, then indicate as INELIGIBLE=1;
		if CHAR4=TRANSPOSE12 or CHAR4=TRANSPOSE23 or CHAR4=TRANSPOSE34 then INELIGIBLE=1;
		drop TRANSPOSE12 TRANSPOSE23 TRANSPOSE34;
		run;
%end;

%mend;
1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

Editted note:  After this solution was accepted, a much more optimal solution was shown by @FreelanceReinh.

Take look at it in this topic thread.

 

Your task is effectively to examine all 358,800 four-character strings taken from the 26 letter alphabet, where no letter is present in the string more than once.

 

You need to consider the string as 3 overlapping letter-pairs, starting in positions 1, 2, and 3 respectively.

 

For a string to be declared valid (i.e. VALID='Y'), none of its 3 letter-pairs can be a reverse of the corresponding letter-pair in other valid strings.

 

This suggests that, as you determine a string to be valid, you can record its three letter-pairs (reversed) in a hash (three hashes actually: _RP1 for reversed position 1 pairs, _RP2 for reversed position 2 pairs, and _RP3).  Then any subsequent string that has a pair starting in position 1 that is found in hash _RP1, must not be valid.

 

This task is doable in one data step, but I put it in two steps below.  It takes less than 1 second:  It produces 24,173 with VALID='Y', and 334,627 with VALID='N'.

 

data need (drop=i j k l p) ;
  array LTR {26} $1 _temporary_ ('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z');
  length strng $4 ;
  array pr {3} $2;  /*Each 4-letter string has 3 pairs */
  array rp {3} $2;  /*Reversed Pairs */

  n=0;
  do i=1 to 26; 
    do j=1 to 26; 
      if j=i then continue  /* See discussion */;
      do k=1 to 26; 
        if k=i or k=j then continue  /* See discussion */;
        do l=1 to 26;
          if l=i or l=j or l=k then continue  /* See discussion */;
          strng=cats(ltr{i},ltr{j},ltr{k},ltr{l});
          do p=1 to length(strng)-1;
            pr{p}=substr(strng,p,2);
            rp{p}=reverse(pr{p});
          end;
          n=n+1;
          output;
        end;
      end;
    end;
  end;
run;

data want (drop=_:);
  set need  ;

  label link_rp1='Earliest N generating this rp1 value'
        link_rp2='Earliest N generating this rp2 value'
        link_rp3='Earliest N generating this rp3 value'  ;

  if _n_=1 then do;
    declare hash _rp1 ();  /* reverse pair1 values encountered so far */
      _rp1.definekey('rp1');
      _rp1.definedata('rp1','link_rp1');  /*Link_rp1, */
      _rp1.definedone();
    declare hash _rp2 ();
      _rp2.definekey('rp2');
      _rp2.definedata('rp2','link_rp2');
      _rp2.definedone();
    declare hash _rp3 ();
      _rp3.definekey('rp3');
      _rp3.definedata('rp3','link_rp3');
      _rp3.definedone();
  end;


  call missing (of link_:);
  _rc=_rp1.find(key:pr1); /*Get prior LINK_RP1, if any, for current PR1*/
  _rc=_rp2.find(key:pr2);
  _rc=_rp3.find(key:pr3);

  length valid $1;    /* will be either "Y" or "N" */
  if N(of link_:)>0 then valid='N'; /* At least 1 current pair found in a prior reverse pair*/
  else valid='Y';

  if valid='Y' then do;  /*If current is valid, then add new reverse pairs for lookup */
    if _rp1.check(key:rp1) then _rp1.add(key:rp1,data:rp1,data:n);
    if _rp2.check(key:rp2) then _rp2.add(key:rp2,data:rp2,data:n);
    if _rp3.check(key:rp3) then _rp3.add(key:rp3,data:rp3,data:n);
  end;
run;

proc freq data=want; tables valid;run;

Note this generates candidate strings in lexicographic order   (i.e. ABCD is valid, so BACD cannot be valid).  You could take away this lexicographic "bias" by randomly re-ordering the values in the LTR array.

 

Editted annotation: Regarding the three statements with  /* See discussion */ :  Remove these statements to allow a letter to appear in the strings more than once.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

13 REPLIES 13
ballardw
Super User

I am not sure if you are clear on the difference between "combination" and "permutation". Permutation, which you have created, considers order of the values. Combination will just have one value with any specific set of values, so if done properly and have "ABCD", then combinations will not include ABDC, BACD, DABC and so on.

So you can start by creating COMBINATIONS.

Which is moderately easy after some time playing with the different combinatorial functions:

data start;
   array x(26) $ 1 ('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z');
   n=dim(x);
   k=4;
   ncomb=comb(n, k);
   do j=1 to ncomb;
      rc=allcomb(j, k, of x[*]);
      string=cats(x1,x2,x3,x4);
      output;
   end;
   keep string;
run;

Not the function COMB will give you the number of N items taken K at a time. In this case 14,590. So I suspect starting there might reduce the scope of this exercise quite a bit as if ABCD is generated won't have a BACD in the output (or other permutations of the specific 4 characters.

 

I personally might be more concerned with a list of "valid" values. Once I have those it is extremely easy to mark anything else as invalid and typically the combinations of "valid" are generally shorter starting something like code values. Perhaps there is a data set with the valid values somewhere?

 

I am wondering about why you have used the comment

* set up the combinations with potential transcription errors;
	T_12=c2||c1||c3||c4;

Transcription tends to make me think that you have some data that you want compared and perhaps this is a poor start to begin with and should be looking at actual "transcribed values".

 

This may be a case of considering one piece of the problem in detail is hiding the actual need and it may help to describe the whole problem a bit more.

DominicCavenagh
Fluorite | Level 6

Hi, thanks for your reply. The combinations idea is very interesting and has given me some food for thought.

In terms of the "valid" values, these are generated in the same loop. So they are highly dependent on which is your first choice. i.e.

1. Generate all string permutations.

2. Choose one string as "valid" value (call it "str")

3. Mark all strings that "str" makes invalid as invalid

4. Choose another string that has not yet been marked invalid.

5.repeat steps 2-4, until you run out of valid strings to choose.

 

Obviously from above, if the first choice is ABCD, then BACD will not be valid. However if your first choice is BACD then ABCD will not be valid. In my original program I was just going through the list of permutations as sorted alphabetically, however this may not be the best approach, and it may be better to just pick at random.

 

In terms of the transcription comment, this is domain specific. The purpose of producing these ID's is that they need to be handwritten on samples and we don't want people accidently writing BACD when they read ABCD. There was some corncern this would happen, so we sought to produce IDs that if this did occur, it would not be a valid ID and we would know something wrong happened, rather than the sample being assigned to the wrong ID.

 

While the combination method you describe guarantees success for our criteria, it is only producing a subset of the valid list of ID's. I think this is what you meant when saying this is a good place to start. If the combination method chooses ABCD, then there can be no other string with all these characters. However, based on our constraints, DBCA, CBAD and ADCB would be valid, as the repeated (/swapped) letters are not next to each other. Producing these extras is relatively easy to add in to the method, so I will have a go at that.

 

Thanks again for your help.

ballardw
Super User

You could define specific "allowed" 2 letter combinations instead of 4 single letters for creating permutations or combinations.

You do need to define a "start" rule in any case.

And what about same letter twice such as "AA" together? Unfortunately a very small example often does not completely describe the rejection rule.

I humbly suggest that if the "entry codes" have not been defined yet that you consider exactly how many are actually needed. That might reduce the scope of the problem. If that is not possible then you may well be in an environment when the next thing is , "Now we need 5 characters" and have a lot more exclusion rules to impose.

Ksharp
Super User
proc format;
 value alpha
  1="A"  2="B" 3="C" 4="D" 5="E" 6="F" 7="G" 8="H" 9="I" 10="J"
  11="K" 12="L" 13="M" 14="N" 15="O" 16="P" 17="Q" 18="R" 19="S" 20="T"
  21="U" 22="V" 23="W" 24="X" 25="Y" 26="Z";
 run;

%let total=%eval(26**4);
proc plan seed=123 ;
factors id=&total. ordered x=4 of 26 comb/noprint;
output out=temp;
quit;
data temp2;
 set temp;
 alpha=put(x,alpha.);
run;
proc transpose data=temp2 out=temp3(drop=_:);
by id;
var alpha;
run;

data want;
 set temp3;
 length want $ 20;
 want=cats(of col:);
 keep id want;
run;
FreelanceReinh
Jade | Level 19

Hi @DominicCavenagh and welcome to the SAS Support Communities!

 

I think this task (as I understand it) can be viewed as a problem in graph theory (which I don't know much about, unfortunately): The four-letter combinations can be thought of as the vertices of a graph and an edge connects two vertices if (and only if) the combinations differ by exactly one permutation of two neighboring letters.

 

While the entire graph with its 26**4=456976 vertices is huge, it can be decomposed into small disjoint components ("subgraphs") comprising only 24 or fewer vertices. This is because switching two letters doesn't change the choice of the letters. The components are

  1. comb(26,4)   = 14950 combinations of four different letters (with 24 possible permutations each)
  2. comb(26,3)*3 =  7800 combinations of three different letters one of which occurs twice (with 12 possible permutations each)
  3. comb(26,2)*2 =   650 combinations of two different letters one of which occurs three times (with 4 possible permutations each)
  4. comb(26,2)   =   325 combinations of two different letters both of which occur twice (with 6 possible permutations each)
  5.                   26 combinations consisting of four equal letters.

(14950*24 + 7800*12 + 650*4 + 325*6 + 26 = 456976)

 

Each of the above five classes of components can be investigated separately, based on an arbitrarily selected single element of the class (because a one-to-one mapping of the letters doesn't change the structure of a combination). The connection rule implies that a vertex of this particular graph can have at most three edges.

 

Let's start with the easiest case: class 5. Obviously, all of the 26 constituent combinations AAAA, BBBB, ..., ZZZZ can be selected as "valid" because they are isolated points in the graph.

 

Class 4: The components of this class can be represented by a subgraph like this:

          ABBA
         /    \
 AABB──ABAB  BABA──BBAA
         \    /
          BAAB

Now you see that some selections are more favorable than others: For example, if you selected ABAB and BABA, you would end up with only these two elements -- the other four elements would then be invalid. However, the selection AABB, ABBA, BAAB, BBAA yields four valid strings -- there is no direct connection (edge) between any two of these four permutations in the graph. It's easy to see that you can't get more than four valid elements out of those six. So, if you want to obtain as many valid four-letter strings as possible, you need to find "optimum" selections like this, i.e., with the maximum number of valid elements, for each class.

 

[Continuation #1]

Class 3: The representatives of this class have a simple chain structure:

ABBB──BABB──BBAB──BBBA

Obviously, there are three possible "optimum" selections: {ABBB, BBAB}, {ABBB, BBBA} and {BABB, BBBA}.

 

Class 2: Using the letter choice A, B, C, C, the representative of this class has the below structure:

   ACBC──ABCC──BACC──BCAC
   /  \              /  \
ACCB  CABC────────CBAC  BCCA
   \  /              \  /   
   CACB──CCAB──CCBA──CBCA

The subset of the six permutations highlighted in bold constitutes one of the "optimum" selections (one of two, as it turns out). While this statement seems plausible, it's not obvious (to me). So, let's try to prove it (using SAS):

A "better" selection than the above must contain seven or more out of the twelve permutations. If we can show that a selection of seven would always contain an invalid element, there couldn't be more than seven a fortiori.

%let n=12;
%let k=7;
data test;
array x[&n] $4 ('ABCC' 'ACBC' 'ACCB' 'BACC' 'BCAC' 'BCCA' 'CABC' 'CACB' 'CBAC' 'CBCA' 'CCAB' 'CCBA');
do i=1 to comb(&n, &k);
  rc=allcomb(i, &k, of x[*]);
  f=0;
  do j=1 to &k-1 until(f);
    do k=j+1 to &k until(f);
      if compged(x[j],x[k])=20 then f=1;
    end;
  end;
  if ~f then output;
end;
run; /* --> 0 observations */

Apparently, each of the comb(12, 7)=792 subsets of 7 out of the 12 permutations in the array creates one or more pairs with generalized edit distance 20 (which in this case is equivalent to a "SWAP", i.e., a transposition of two neighboring letters). The same code with k=6 yields two observations: the two optimum selections. (The second one being the complement of the first, i.e., comprising the six non-bold permutations in the graph above.)

 

[Continuation #2]

Class 1: Using the letter choice A, B, C, D, the representative of this class has the below structure:

      ____________________
     /                    \
   ABDC    ACBD──ACDB    ADBC
   /  \   / │      │ \   /  \
BADC  ABCD  │      │  ADCB  DABC
 / \  /     │      │     \  / \
│  BACD    CABD──CADB    DACB  │
│    │    /          \    │    │
│    │  CBAD        CDAB  │    │
│    │ /  \          /  \ │    │
│   BCAD   CBDA──CDBA   DCAB   │
│      \  /          \  /      │
│       BCDA        DCBA       │
│         \          /         │
\          BDCA──DBCA         /
 \          │      │         /
  \         │      │        /
   \        │      │       /
    \──────BDAC──DBAC─────/

Actually it's rotational symmetric ... Each vertex is linked to exactly three others (as it should, with four different letters, considering the connection rule). We can use almost the same code as above to create the optimum selections (two again) -- each comprising 12 out of the 24 permutations -- and to prove that selecting more than 12 valid elements is impossible. One of the optimum selections is shown above in boldface. The other one is just the complement of it, i.e., the non-bold permutations.

%let n=24;
%let k=12;
data test2;
array c[4] $1 ('A' 'B' 'C' 'D'); 
array x[&n] $4;
do p=1 to &n;   
  rc=allperm(p, of c[*]);
  x[p]=cat(of c[*]);
end;
do i=1 to comb(&n, &k);
  rc=allcomb(i, &k, of x[*]);
  f=0;
  do j=1 to &k-1 until(f);
    do k=j+1 to &k until(f);
      if compged(x[j],x[k])=20 then f=1;
    end;
  end;
  if ~f then output;
end;
run; /* --> 2 observations */

Again, with k=13 this step results in 0 observations. Either way the run time was only a few seconds on my workstation.

 

Given the above results, it should be feasible to select a subset of the original 26**4 four-letter combinations with the maximum possible number of valid elements. And this number should be

14950*12 + 7800*6 + 650*2 + 325*4 + 26 = 228826.

Just translate one of the optimum selections given for each of the five classes to each of the 14950, 7800, 650, 325 and 26 letter combinations, respectively (e.g., by replacing A, B, C, D with K, R, W, Y, etc.). I can help you with that tomorrow (CEST) if needed.

 

(And if a graph theory adept in the forum recognizes this as something for which a standard solution is known, please chime in.)

DominicCavenagh
Fluorite | Level 6

Thanks for this very thorough solution to the problem. It really opened my eyes to the differences in number of solutions that can be produced given the initial selection. I love the graph theory approach. Is it your feeling that all the strings could be produced by starting with one of the optional strings you mention and then just iterating through the letters? Then do this for each case?I would be interested to see your approach to this.

If i could have, I would have really liked to mark this as an acceptable solution as well.

Many thanks!

FreelanceReinh
Jade | Level 19

Here is the implementation of what I outlined in my previous post:

/* Create a maximum-size set of permutations of four capital letters so that any two permutations
   from the set differ by more than just a transposition of two neighboring letters */

data want(keep=string);
call streaminit(27182818);
length string $4;           /* resulting four-letter strings */
array a[26] $1;             /* the alphabet: 'A', 'B', ..., 'Z' */
array c[4]  $1;             /* selected letters for a combination */
array x[24] $4;             /* systematic permutations of four letters */
array d[4]     _temporary_; /* indices for selecting from four characters */
array v[47,4]  _temporary_; /* combinations of indices 1, 2, 3, 4 for "optimum" selections */
array m[5,4]   _temporary_ (1 2 3 4 1 1 2 3 1 1 1 2 1 1 2 2 1 1 1 1); /* see array D */
array t[5]     _temporary_ (24 12  4  6  1); /* number of permutations per selection for class 1, ..., 5 */
array u[5]     _temporary_ (12  6  2  4  1); /* maximum number of valid strings per selection for class 1, ..., 5 */
array s[5]     _temporary_ ( 2  2  3  1  1); /* number of different "optimum" solutions for class 1, ..., 5 */

/* Phase 1: Prepare "optimum" selections of combinations */

do g=1 to dim(t); /* g=1, ..., 5 is the "class" number */
  do i=1 to dim(d);
    d[i]=m[g,i];
  end;
  do i=1 to t[g];
    call lexperm(i, of d[*]);
    x[i]=cat(of d[*]);
  end;
  do i=1 to comb(t[g], u[g]); /* search through maximum-size selections of permutations of four digits */
    select(g);
      when(1) rc=allcomb(i, u[g], of x[*]);
      when(2) rc=allcomb(i, u[g], of x1-x12);
      when(3) rc=allcomb(i, u[g], of x1-x4);
      when(4) rc=allcomb(i, u[g], of x1-x6);
      otherwise;
    end;
    f=0; /* flag variable to indicate the detection of an invalid selection */
    do j=1 to u[g]-1 until(f);              /* check validity of pairs of selected permutations */
      do k=j+1 to u[g] until(f);
        if compged(x[j],x[k])=20 then f=1;  /* here, 20 means one "SWAP" (transposition of neighboring digits) */
      end;
    end;
    if ~f then do;
      do j=1 to u[g];   /* write indices for valid selections to array V */
        y+1;
        do k=1 to dim(d);
          v[y,k]=input(char(x[j],k),1.);
        end;
      end;
    end;
  end;
end;

/* Phase 2: Create strings based on the above results */

do g=1 to dim(t); /* again, g=1, ..., 5 is the "class" number */
  link init;      /* reset the "alphabet" array A for a clean start */
  select(g);
    when(1) do i=1 to comb(26, 4);
              rc=allcomb(i, 4, of a[*]);       /* In class 1, a selection of four letters (e.g., A, B, C, D) */
              do j=1 to dim(c);                /* leads to one combination: ABCD.                            */
                c[j]=a[j];
              end;
              link write;
            end;
    when(2) do i=1 to comb(26, 3);             /* In class 2, a selection of three letters (e.g., A, B, C) */
              rc=allcomb(i, 3, of a[*]);       /* gives rise to three different combinations:              */
              c1=a1; c2=a2; c3=a3; link write; /* AABC, */
              c1=a2; c2=a1; c3=a3; link write; /* BBAC, */
              c1=a3; c2=a1; c3=a2; link write; /* CCAB. */
            end;
    when(3) do i=1 to comb(26, 2);             /* In class 3, a selection of two letters (e.g., A, B) */
              rc=allcomb(i, 2, of a[*]);       /* gives rise to two different combinations:           */
              c1=a1; c2=a2; link write;        /* AAAB, */
              c1=a2; c2=a1; link write;        /* BBBA. */
            end;
    when(4) do i=1 to comb(26, 2);             /* In class 4, a selection of two letters (e.g., A, B) */
              rc=allcomb(i, 2, of a[*]);       /* leads to one combination: AABB.                     */
              c1=a1; c2=a2; link write;
            end;
    when(5) do i=1 to 26;                      /* In class 5, a selected letter (e.g., A) */
              do j=1 to dim(c);                /* leads to one combination: AAAA.         */
                c[j]=a[i];
              end;
              link write;
            end;
    otherwise;
  end;
  e+u[g]*s[g]; /* offset for dimension 1 of array V */
end;
return;
init:
  do i=1 to 26;
    a[i]=byte(i+rank('A')-1); /* write letters 'A', ..., 'Z' to array A */
  end;
return;
write:
  r=rand('integer',s[g]); /* randomly select from several possible "optimum" solutions (if any) */
  do j=e+u[g]*(r-1)+1 to e+u[g]*r;
    string=cat(c[v[j,1]],c[v[j,2]],c[v[j,3]],c[v[j,4]]); /* combine selected letters according to the */
    output;                                              /* patterns stored in "solution" array V     */
  end;
return;
run; /* 14950*12 + 7800*6 + 650*2 + 325*4 + 26 = 228826 obs. (from class 1 through 5) */

 

As predicted, 228,826 "valid" four-letter strings were found. The typical run time of 4 - 5 seconds includes the time needed to find optimum solutions (see "phase 1" of the code). In cases where more than one selection of maximum size exists, one of the possible solutions is chosen randomly on a case-by-case basis.

DominicCavenagh
Fluorite | Level 6
Wow, thanks so much for this - this is a beautiful solution to the problem! Thanks so much for your effort to implement this! I wish I could accept multiple solutions!
Quentin
Super User

Maybe a hash approach could help?

 

So start with your data set of ~500,000 candidate 4-character strings.

 

In a data step,  create an empty hash table with just a key value, which is a 2-character string.  Then read through your dataset of candidate strings.  Each candidate string has three pairs of values you want to check to see if they would introduce risk of transposition errors.  For each candidate string, check all three pairs of values to see if they are already in the hash table.  So if the candidate string is 'ABCD', check if 'AB' 'BC' or 'CD' is already in the hash table. If any are, then this string is INELIGIBLE.  If none of them are in the table, then this string is ELIGIBLE, and you add six values to the hash table 'AB' 'BA' 'BC' 'CB' 'CD' 'DC'.

 

EDIT:

In re-reading your question, you only want to make a string ineligible if the 2-character combination exists in the same position as another string.  So I think you would want a hash table with two keys: Position and Substring.  Position would be 1, 2, or 3.  So in my example, you would add six values to the hash table:

 

position:1 substr:'AB' 

position:1 substr:'BA'

position:2 substr:'BC'

position:2 substr:'CB'

position:3 substr:'CD'

position:3 substr:'DC'

 

 

 

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
mkeintz
PROC Star

Editted note:  After this solution was accepted, a much more optimal solution was shown by @FreelanceReinh.

Take look at it in this topic thread.

 

Your task is effectively to examine all 358,800 four-character strings taken from the 26 letter alphabet, where no letter is present in the string more than once.

 

You need to consider the string as 3 overlapping letter-pairs, starting in positions 1, 2, and 3 respectively.

 

For a string to be declared valid (i.e. VALID='Y'), none of its 3 letter-pairs can be a reverse of the corresponding letter-pair in other valid strings.

 

This suggests that, as you determine a string to be valid, you can record its three letter-pairs (reversed) in a hash (three hashes actually: _RP1 for reversed position 1 pairs, _RP2 for reversed position 2 pairs, and _RP3).  Then any subsequent string that has a pair starting in position 1 that is found in hash _RP1, must not be valid.

 

This task is doable in one data step, but I put it in two steps below.  It takes less than 1 second:  It produces 24,173 with VALID='Y', and 334,627 with VALID='N'.

 

data need (drop=i j k l p) ;
  array LTR {26} $1 _temporary_ ('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z');
  length strng $4 ;
  array pr {3} $2;  /*Each 4-letter string has 3 pairs */
  array rp {3} $2;  /*Reversed Pairs */

  n=0;
  do i=1 to 26; 
    do j=1 to 26; 
      if j=i then continue  /* See discussion */;
      do k=1 to 26; 
        if k=i or k=j then continue  /* See discussion */;
        do l=1 to 26;
          if l=i or l=j or l=k then continue  /* See discussion */;
          strng=cats(ltr{i},ltr{j},ltr{k},ltr{l});
          do p=1 to length(strng)-1;
            pr{p}=substr(strng,p,2);
            rp{p}=reverse(pr{p});
          end;
          n=n+1;
          output;
        end;
      end;
    end;
  end;
run;

data want (drop=_:);
  set need  ;

  label link_rp1='Earliest N generating this rp1 value'
        link_rp2='Earliest N generating this rp2 value'
        link_rp3='Earliest N generating this rp3 value'  ;

  if _n_=1 then do;
    declare hash _rp1 ();  /* reverse pair1 values encountered so far */
      _rp1.definekey('rp1');
      _rp1.definedata('rp1','link_rp1');  /*Link_rp1, */
      _rp1.definedone();
    declare hash _rp2 ();
      _rp2.definekey('rp2');
      _rp2.definedata('rp2','link_rp2');
      _rp2.definedone();
    declare hash _rp3 ();
      _rp3.definekey('rp3');
      _rp3.definedata('rp3','link_rp3');
      _rp3.definedone();
  end;


  call missing (of link_:);
  _rc=_rp1.find(key:pr1); /*Get prior LINK_RP1, if any, for current PR1*/
  _rc=_rp2.find(key:pr2);
  _rc=_rp3.find(key:pr3);

  length valid $1;    /* will be either "Y" or "N" */
  if N(of link_:)>0 then valid='N'; /* At least 1 current pair found in a prior reverse pair*/
  else valid='Y';

  if valid='Y' then do;  /*If current is valid, then add new reverse pairs for lookup */
    if _rp1.check(key:rp1) then _rp1.add(key:rp1,data:rp1,data:n);
    if _rp2.check(key:rp2) then _rp2.add(key:rp2,data:rp2,data:n);
    if _rp3.check(key:rp3) then _rp3.add(key:rp3,data:rp3,data:n);
  end;
run;

proc freq data=want; tables valid;run;

Note this generates candidate strings in lexicographic order   (i.e. ABCD is valid, so BACD cannot be valid).  You could take away this lexicographic "bias" by randomly re-ordering the values in the LTR array.

 

Editted annotation: Regarding the three statements with  /* See discussion */ :  Remove these statements to allow a letter to appear in the strings more than once.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
DominicCavenagh
Fluorite | Level 6

Thanks for this solution, it is very elegant. I have not come across hashes in SAS before so these are new to me and I will need to do a bit of work to fully understand your implementation.

Your solution answers how the initial question was posed but does not allow for multiples of the same letter. Technically this is allowed by the constraints only if they do not violate the other constraints. Your code was very easy to edit to allow this however (just commented out the equality tests in the initial loop). 

As mentioned by @FreelanceReinhard in his solution, there appears to be many possible solutions and some will produce more valid strings than others. I will be interested to have a play around with the selection order of the strings to see how we can change the number of solutions.It is possible to see this by randomly sorting the first dataset produced in your code before running the second part.

Anyways, I really appreciate you taking time out to help.

Many thanks!

mkeintz
PROC Star

@DominicCavenagh wrote:

Thanks for this solution, it is very elegant. I have not come across hashes in SAS before so these are new to me and I will need to do a bit of work to fully understand your implementation.

Your solution answers how the initial question was posed but does not allow for multiples of the same letter. Technically this is allowed by the constraints only if they do not violate the other constraints. Your code was very easy to edit to allow this however (just commented out the equality tests in the initial loop). 

As mentioned by @FreelanceReinhard in his solution, there appears to be many possible solutions and some will produce more valid strings than others. I will be interested to have a play around with the selection order of the strings to see how we can change the number of solutions.It is possible to see this by randomly sorting the first dataset produced in your code before running the second part.

Anyways, I really appreciate you taking time out to help.

Many thanks!


Your comments on the order of the strings sent to the second data step are interesting.  I reordered of data set NEED by changing

          n=n+1;
          output;

 

to

          n=n+1;
          call streaminit(1295086);
          rn=rand('uniform');
          output;

and then sorted by RN.

 

After permitting letters to appears multiple times, the number of VALID='Y' jumped from 25,750 (using original order) to 53,040 with the above random sort.

 

Do you know of some formula that would provide the size of the largest possible set of strings satisfying your criteria?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
DominicCavenagh
Fluorite | Level 6
Not really. After looking at @FreelanceReinhard s graph theory solution, it seems like this would take a little bit more thought, and someone much smarter than me to figure out!

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 3454 views
  • 12 likes
  • 6 in conversation