About DominicCavenagh

DominicCavenagh · 08-30-2022

Wow, thanks so much for this - this is a beautiful solution to the problem! Thanks so much for your effort to implement this! I wish I could accept multiple solutions!

DominicCavenagh · 08-29-2022

Not really. After looking at @FreelanceReinhard's graph theory solution, it seems like this would take a bit more thought, and someone much smarter than me, to figure out!

DominicCavenagh · 08-29-2022

Thanks for this solution, it is very elegant. I have not come across hashes in SAS before, so these are new to me and I will need to do a bit of work to fully understand your implementation. Your solution answers the question as it was initially posed, but it does not allow multiples of the same letter. Technically, repeated letters are allowed as long as they do not violate the other constraints. However, your code was very easy to edit to allow this (I just commented out the equality tests in the initial loop). As mentioned by @FreelanceReinhard in his solution, there appear to be many possible solutions, and some will produce more valid strings than others. I will be interested to have a play around with the selection order of the strings to see how we can change the number of solutions. It is possible to see this by randomly sorting the first dataset produced in your code before running the second part. Anyway, I really appreciate you taking the time to help. Many thanks!
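The effect of selection order can be tried out on a small scale. The following is a minimal Python sketch, not the hash-based SAS code from the reply (which is not reproduced here). It assumes a simple first-come greedy pass as the selection rule, a reduced five-letter alphabet to keep the run short, and that swapping two equal neighbouring letters does not invalidate anything. The names `adjacent_swaps`, `greedy_select` and `count_by_order` are made up for this illustration.

```python
import itertools
import random


def adjacent_swaps(s):
    """Strings obtained from s by swapping one pair of neighbouring characters."""
    out = set()
    for i in range(len(s) - 1):
        t = s[:i] + s[i + 1] + s[i] + s[i + 2:]
        if t != s:  # assumption: swapping two equal letters changes nothing
            out.add(t)
    return out


def greedy_select(candidates):
    """Accept each candidate in the given order unless an earlier pick blocks it."""
    valid, blocked = [], set()
    for s in candidates:
        if s not in blocked:
            valid.append(s)
            blocked |= adjacent_swaps(s)
    return valid


def count_by_order(alphabet="ABCDE", length=4, trials=5, seed=0):
    """Count accepted strings for alphabetical order and for a few shuffles."""
    strings = ["".join(p) for p in itertools.product(alphabet, repeat=length)]
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    counts = {"alphabetical": len(greedy_select(strings))}
    for k in range(trials):
        shuffled = strings[:]
        rng.shuffle(shuffled)
        counts[f"shuffle {k}"] = len(greedy_select(shuffled))
    return counts


if __name__ == "__main__":
    for label, n in count_by_order().items():
        print(label, n)
```

Because the blocked strings are held in a set, each candidate is checked once, so the whole pass is linear in the number of strings. Comparing the printed counts shows how much the total depends on the order in which strings are offered.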

DominicCavenagh · 08-29-2022

Thanks for this very thorough solution to the problem. It really opened my eyes to the differences in the number of solutions that can be produced given the initial selection. I love the graph theory approach. Is it your feeling that all the strings could be produced by starting with one of the optional strings you mention and then just iterating through the letters? Then do this for each case? I would be interested to see your approach to this. If I could have, I would have really liked to mark this as an accepted solution as well. Many thanks!

DominicCavenagh · 08-28-2022

Hi, thanks for your reply. The combinations idea is very interesting and has given me some food for thought.

In terms of the "valid" values, these are generated in the same loop, so they are highly dependent on your first choice, i.e.:

1. Generate all string permutations.
2. Choose one string as a "valid" value (call it "str").
3. Mark all strings that "str" makes invalid as invalid.
4. Choose another string that has not yet been marked invalid.
5. Repeat steps 2-4 until you run out of valid strings to choose.

It follows that if the first choice is ABCD, then BACD will not be valid. However, if your first choice is BACD, then ABCD will not be valid. In my original program I was just going through the list of permutations sorted alphabetically; however, this may not be the best approach, and it may be better to just pick at random.

In terms of the transcription comment, this is domain specific. The purpose of producing these IDs is that they need to be handwritten on samples, and we don't want people accidentally writing BACD when they read ABCD. There was some concern this would happen, so we sought to produce IDs such that, if this did occur, the result would not be a valid ID and we would know something had gone wrong, rather than the sample being assigned to the wrong ID.

While the combination method you describe guarantees success for our criteria, it only produces a subset of the valid list of IDs. I think this is what you meant when saying this is a good place to start. If the combination method chooses ABCD, then there can be no other string with all these characters. However, based on our constraints, DBCA, CBAD and ADCB would be valid, as the repeated (/swapped) letters are not next to each other. Producing these extras is relatively easy to add in to the method, so I will have a go at that. Thanks again for your help.
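The claim that DBCA, CBAD and ADCB can sit alongside ABCD can be checked pair by pair. Here is a minimal Python sketch (for illustration only, not part of the SAS program). It assumes that two IDs clash exactly when one is the other with a single pair of neighbouring characters swapped. The function names are invented for this example.

```python
from itertools import combinations


def one_swap_apart(a, b):
    """True if b equals a with exactly one pair of neighbouring characters swapped."""
    if len(a) != len(b) or a == b:
        return False
    # Positions where the two strings differ
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diff) == 2
        and diff[1] == diff[0] + 1          # the two positions are adjacent
        and a[diff[0]] == b[diff[1]]        # and the characters are exchanged
        and a[diff[1]] == b[diff[0]]
    )


def mutually_valid(ids):
    """True if no two IDs in the collection are one adjacent swap apart."""
    return not any(one_swap_apart(a, b) for a, b in combinations(ids, 2))


if __name__ == "__main__":
    print(one_swap_apart("ABCD", "BACD"))                    # True
    print(mutually_valid(["ABCD", "DBCA", "CBAD", "ADCB"]))  # True
    print(mutually_valid(["ABCD", "BACD"]))                  # False
```

A pairwise check like this is only practical for small sets, but it is a handy way to confirm that a finished list of IDs really satisfies the constraint.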

DominicCavenagh · 08-28-2022

Hi All,

I am trying to compute all valid character strings based on some constraints:

1. Each string is 4 alpha characters long, e.g. "ABCD" or "QRTS".
2. A string is declared invalid if there is already a valid string that differs from it only by a swap of two characters that are next to each other. For example, if "ABCD" is already a valid string, then "BACD" would be invalid because A and B, next to each other in the first two positions, are swapped. Consequently "ABCD" would make the following three strings invalid: "BACD", "ACBD", "ABDC".

My approach so far has been to generate all 26^4 combinations (CHAR4 in the code). Then, for each string, I generate the strings that this string makes invalid (T_12, T_23, T_34 in the code). I then loop through each row and merge back into the full list of strings, marking any string that is the same as one of {T_12, T_23, T_34} as invalid, before moving to the next string and repeating. The resulting dataset should have all the strings, with those that are invalid flagged, i.e. ineligible=1.

As you can probably imagine, with ~500,000 strings and the amount of repeated merging, this process takes a long time, and I haven't been able to run it to completion yet. So far I have left it going for 24 hours without any luck. I feel like there is a better way to do this, but I can't put my finger on it. Does anyone have any advice on a way to optimise my program, or a better way to produce these strings altogether?

Thanks in advance

See below for my current approach:

** SET UP POSSIBLE PERMUTATIONS OF THE 4-PART TEXT STRING - 26^4=456,976 combinations ;
proc format;
  value alpha
    1="A" 2="B" 3="C" 4="D" 5="E" 6="F" 7="G" 8="H" 9="I" 10="J"
    11="K" 12="L" 13="M" 14="N" 15="O" 16="P" 17="Q" 18="R" 19="S" 20="T"
    21="U" 22="V" 23="W" 24="X" 25="Y" 26="Z";
run;

data alpha;
  do N1=1 to 26;
    do N2=1 to 26;
      do N3=1 to 26;
        do N4=1 to 26;
          output;
        end;
      end;
    end;
  end;
  format N1 N2 N3 N4 alpha.;
run;

data new1;
  set alpha;
  length C1 C2 C3 C4 $1.;
  c1=compress(vvalue(N1));
  c2=compress(vvalue(N2));
  c3=compress(vvalue(N3));
  c4=compress(vvalue(N4));
  CHAR4=C1||C2||C3||C4;
  * by default: ineligible=0;
  INELIGIBLE=0;
  * this variable will be used for 1-to-many merging;
  ALL=1;
  * set up the combinations with potential transcription errors;
  T_12=c2||c1||c3||c4;
  T_23=c1||c3||c2||c4;
  T_34=c1||c2||c4||c3;
run;

* this is the dataset to be overwritten with each loop ;
data new2;
  set new1;
run;

proc sort data=new2;
  by all;
run;

%macro loopy (start, stop);
  %do i=&start. %to &stop.;
    data select1;
      set new2;
      if _N_=&i. and ineligible=0;
      * keep the combinations with potential transcription errors;
      TRANSPOSE12=T_12;
      TRANSPOSE23=T_23;
      TRANSPOSE34=T_34;
      keep ALL TRANSPOSE12 TRANSPOSE23 TRANSPOSE34;
    run;

    data new2;
      merge NEW2 (in=in1) SELECT1;
      by all;
      if in1;
      * if CHAR4 is the same as any of the transposed combinations, then indicate as INELIGIBLE=1;
      if CHAR4=TRANSPOSE12 or CHAR4=TRANSPOSE23 or CHAR4=TRANSPOSE34 then INELIGIBLE=1;
      drop TRANSPOSE12 TRANSPOSE23 TRANSPOSE34;
    run;
  %end;
%mend;
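For readers who want to cross-check the set-up step outside SAS, here is a minimal Python sketch of the same idea: build every 4-letter string and, for a given string, the three transposed versions that the SAS code calls T_12, T_23 and T_34. This is an illustration only, not a translation of the macro, and the function names `all_strings` and `transposes` are made up for the example.

```python
from itertools import product
from string import ascii_uppercase


def all_strings(alphabet=ascii_uppercase):
    """Every 4-character string over the alphabet, in alphabetical order."""
    return ["".join(p) for p in product(alphabet, repeat=4)]


def transposes(s):
    """Return (T_12, T_23, T_34): s with each neighbouring pair swapped in turn."""
    c1, c2, c3, c4 = s
    return (
        c2 + c1 + c3 + c4,  # positions 1 and 2 swapped
        c1 + c3 + c2 + c4,  # positions 2 and 3 swapped
        c1 + c2 + c4 + c3,  # positions 3 and 4 swapped
    )


if __name__ == "__main__":
    strings = all_strings()
    print(len(strings))        # 456976
    print(transposes("ABCD"))  # ('BACD', 'ACBD', 'ABDC')
```

The list is small enough to hold in memory, so the transposed strings can be looked up directly rather than found by merging the full list once per row.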

Re: Generate all valid combinations of 4 length character strings
