Hi All,

I am trying to compute all valid character strings subject to two constraints:

1. Each string is 4 alphabetic characters long, e.g. "ABCD" or "QRTS".
2. A string is invalid if it can be obtained from an already-valid string by swapping two adjacent characters. For example, if "ABCD" is already a valid string, then "BACD" is invalid because A and B are next to each other in the first two positions. Consequently "ABCD" makes the following three strings invalid: "BACD", "ACBD", "ABDC".

My approach so far has been to generate all 26^4 combinations (CHAR4 in the code). Then, for each string, I generate the strings that it makes invalid (T_12, T_23, T_34 in the code). I then loop through each row and merge back into the full list of strings, marking any string that equals one of {T_12, T_23, T_34} as invalid, before moving to the next string and repeating. The resulting dataset should contain all the strings, with the invalid ones flagged, i.e. ineligible=1.

As you can probably imagine, with 456,976 strings and one full merge per string, this process takes a long time and I haven't been able to run it to completion yet. So far I have left it going for 24 hours without any luck.

I feel like there is a better way to do this but can't put my finger on it. Does anyone have any advice on how to optimise my program, or on a better way to produce these strings altogether?

Thanks in advance.

See below for my current approach:

** SET UP POSSIBLE PERMUTATIONS OF THE 4-PART TEXT STRING - 26^4=456,976 combinations ;
proc format;
value alpha
1="A" 2="B" 3="C" 4="D" 5="E" 6="F" 7="G" 8="H" 9="I" 10="J"
11="K" 12="L" 13="M" 14="N" 15="O" 16="P" 17="Q" 18="R" 19="S" 20="T"
21="U" 22="V" 23="W" 24="X" 25="Y" 26="Z";
run;
data alpha;
do N1=1 to 26;
do N2=1 to 26;
do N3=1 to 26;
do N4=1 to 26;
output;
end;
end;
end;
end;
format N1 N2 N3 N4 alpha.;
run;
data new1;
set alpha;
length C1 C2 C3 C4 $1.;
c1=compress(vvalue(N1));
c2=compress(vvalue(N2));
c3=compress(vvalue(N3));
c4=compress(vvalue(N4));
CHAR4=C1||C2||C3||C4;
* by default: ineligible=0;
INELIGIBLE=0;
* this variable will be used for 1-to-many merging;
ALL=1;
* set up the combinations with potential transcription errors;
T_12=c2||c1||c3||c4;
T_23=c1||c3||c2||c4;
T_34=c1||c2||c4||c3;
run;
* this is the dataset to be overwritten with each loop ;
data new2;
set new1;
run;
proc sort data=new2; by all; run;
%macro loopy (start, stop);
%do i=&start. %to &stop.;
data select1;
set new2;
if _N_=&i. and ineligible=0;
* keep the combinations with potential transcription errors;
TRANSPOSE12=T_12;
TRANSPOSE23=T_23;
TRANSPOSE34=T_34;
keep ALL TRANSPOSE12 TRANSPOSE23 TRANSPOSE34;
run;
data new2;
merge NEW2 (in=in1) SELECT1;
by all;
if in1;
* if CHAR4 is the same as any of the transposed combinations, then indicate as INELIGIBLE=1;
if CHAR4=TRANSPOSE12 or CHAR4=TRANSPOSE23 or CHAR4=TRANSPOSE34 then INELIGIBLE=1;
drop TRANSPOSE12 TRANSPOSE23 TRANSPOSE34;
run;
%end;
%mend;
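For comparison, the same greedy rule (walk the strings in generation order, and let each still-eligible string block its three adjacent-swap variants) can probably be done in a single DATA step, without any merging, by keeping the blocked strings in a hash object. The following is only a minimal sketch under that assumption, not a tested replacement for the macro: the names FLAGGED, BLOCKED, SWAPPED and LETTERS are hypothetical, and it assumes the intended processing order is the same A-to-Z order in which the ALPHA dataset is built.

```sas
* Sketch: flag ineligible strings in one pass using a hash lookup ;
data flagged;
  length CHAR4 SWAPPED $4 C1 C2 C3 C4 $1 LETTERS $26;
  LETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';

  * BLOCKED holds every string ruled out by an earlier eligible string ;
  declare hash blocked();
  blocked.defineKey('SWAPPED');
  blocked.defineDone();
  call missing(SWAPPED);

  do N1 = 1 to 26;
    do N2 = 1 to 26;
      do N3 = 1 to 26;
        do N4 = 1 to 26;
          C1 = substr(LETTERS, N1, 1);
          C2 = substr(LETTERS, N2, 1);
          C3 = substr(LETTERS, N3, 1);
          C4 = substr(LETTERS, N4, 1);
          CHAR4 = C1 || C2 || C3 || C4;

          * CHECK returns 0 when the key is already in the hash ;
          INELIGIBLE = (blocked.check(key: CHAR4) = 0);

          * only an eligible string blocks its adjacent-swap variants ;
          if INELIGIBLE = 0 then do;
            SWAPPED = C2 || C1 || C3 || C4; rc = blocked.replace();
            SWAPPED = C1 || C3 || C2 || C4; rc = blocked.replace();
            SWAPPED = C1 || C2 || C4 || C3; rc = blocked.replace();
          end;

          output;
        end;
      end;
    end;
  end;

  keep CHAR4 INELIGIBLE;
run;
```

Because the flag for each string is set before its own swaps are added to the hash, a string with a repeated adjacent letter (such as "AABC", whose first swap is itself) does not mark itself ineligible in this sketch. Each string costs one hash lookup and at most three hash updates, so the work grows linearly with the number of strings rather than requiring a full merge per string.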