Solved: Random character string with no duplicates

KatLinden · Posted 08-18-2022 12:11 PM

Hi,

I need to create a variable with a random 7 character string with no duplicates. This is what I've tried so far but there are some duplicates. I'm not sure how to do this. Can anyone help?

data a(keep=final);
length final $7;
do i=1 to 500000;
do j=1 to 7;
substr(final,j)=byte(int(65+26*ranuni(0)));
end;
output;
end;
run;

ballardw · Posted 08-18-2022 06:47 PM

Ugly brute force: Create all 7 character strings allowed in a data set. (18 lines of code should do it using nested DO loops)

Proc Survey select to select a desired number of them.

Merge with the original data.

Result is random values.

Small example with 3 letter combos:

data junk;
  length x $ 7;
  do i=65 to 90;
  do j=65 to 90;
  do k=65 to 90;
    x=cats(byte(i),byte(j),byte(k));
    output;
  end;
  end;
  end;
  keep x;
run; 

proc surveyselect data=junk out=rand noprint
     sampsize=25;
run;

/* if rand has the same number of records as 
   have a 1-1 match
*/
data want;
   merge have
         rand
   ;
run;

Note that you can query the metadata about your data set to get a macro variable to hold the number of records to request for a 1-1 sampsize of an existing data set.

View solution in original post

smantha · Posted 08-18-2022 12:17 PM

data a(keep=final);
length final $7;
do i=1 to 500000;
do j=1 to 7;

bytep=byte(int(65+26*ranuni(0)));

if index(final,bytep) <=0 then do;
substr(final,j)=byte(int(65+26*ranuni(0)));

end;

else;

i = i-1;

end;
end;
output;
end;
run;

ballardw · Posted 08-18-2022 01:21 PM

Define "no duplicates". No duplicates of character within single value? No adjacent characters with same character? No overall value duplicating another?

KatLinden · Posted 08-18-2022 01:23 PM

No overall value duplicating another. Same character within a value is OK. Thanks!

PGStats · Posted 08-18-2022 01:50 PM

To avoid replicated characters within a value use the permutation (k out of n) routine:

data want;
array a $1 a_1-a_26;
do i = 1 to dim(a);
    a{i} = byte(i - 1 + rank("A"));
    end;
seed = 87896;
length final $7;
do i = 1 to 50;
    call ranperk(seed, 7, of a{*});
    final = cats(of a_1-a_7);
    output;
    end;
stop;
keep i final;
run;

PG

KatLinden · Posted 08-18-2022 03:52 PM

I need unique values with no duplicates, not unique characters within a value.

Rick_SAS · Posted 08-18-2022 02:01 PM

I'm confused by the response that was accepted as the solution. When I submit that code, it produces a syntax error.

KatLinden · Posted 08-18-2022 02:14 PM

Yes, highlighted in red needs to be removed.

data a(keep=final);
length final $7;
do i=1 to 500000;
do j=1 to 7;

bytep=byte(int(65+26*ranuni(0)));

if index(final,bytep) <=0 then do;
substr(final,j)=byte(int(65+26*ranuni(0)));

end;

else;

i = i-1;

end;
end; This needs to be removed.
output;
end;
run;

Rick_SAS · Posted 08-18-2022 02:58 PM

It seems to me that the program results in an infinite loop because the looping variable i is decremented and never reaches 500000.

KatLinden · Posted 08-18-2022 03:55 PM

I think you're right. I cancelled the solution. Thank you.

mkeintz · Posted 08-18-2022 04:37 PM

As you generate random strings, try to store them as the key in a hash object (the h.add() method below). If the key is already in the hash object (left at default behavior of never accommodating duplicates), then h.add() will return a non-zero. Skip those non-zero instances and generate another random string. But when h.add() returns a zero, then output the string and increment variable i.

The variable i below in the sequence number of the series of unique random strings generated.

data want (drop=_:);
  _characters ='ABCDEFGHIJKLMNOPQRSTUVWXYZ';

  length string $7;
  declare hash h ();
    h.definekey('string');
    h.definedone();

  call streaminit(105986);
  do i=1 by 0 until (i>500000);
    string=' ';
    do _c=1 to 7;
      string=cats(string,char(_characters,rand('integer',26)));
    end;
    if h.add()=0 then do; /*If h.add()=0 this is not a duplicate*/
      output;
      i=i+1;
    end;
  end;
run;

Note: there are 26**7 possible random strings. This program would become an infinite loop if you tried to generate more than 26**7 unique strings (=8,031,810,176). And it would become might slow long before than sample size.

But you're only asking for 500,000.

BTW, the do loop construction

  do i=1 by 0 until (i>500000);
    .... other code ...;
    if some condition then i=i+1;
  end;

is equivalent to the more common:

  i=1;
  do until (i>500000);
    .... other code ...;
    if some condition then i=i+1;
  end;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Rick_SAS · Posted 08-18-2022 05:14 PM

My question is: Do the strings really have to be unique and random? Often people want unique identifiers for items or transactions or whatever. If that is the case, can you use

0000001, 0000002, 0000003, ..., 500000.

These are obviously unique, and can be easily mapped to 7-character strings.

I'd like to interject a little probability theory to explain why "random" and "unique" are at odds with each other.

The OP proposes a set of S = 26**7 = 8 billion possible 7-character strings and asks for N=500k unique IDs chosen from within that set.

Several solutions attempt to generate a random string, check to see if it was already generated, and then regenerate a new string if necessary. This algorithm becomes inefficient when N is greater than about 100k.

Why? Because this formulation is equivalent to the famous Birthday Problem in probability theory. When you start drawing elements uniformly at random from the set S, it doesn't take long before you start drawing duplicates. In fact, after you have drawn 105,520 items, the probability that the next item has been seen before is greater than 50%. By the time you have drawn 200,000 items, the probability of drawing a duplicate value is about 97.5%. For more about the Birthday Problem, see

https://blogs.sas.com/content/iml/2012/04/09/vectorized-computations-and-the-birthday-matching-probl...

and

https://blogs.sas.com/content/iml/2013/07/03/duplicates-in-random-numbers.html

Therefore, you reach a point at which most of your time is spent generating random numbers that have not been seen before. The "randomness" and "uniqueness" are at odds.

I suggest that the OP generate an arithmetic sequence of integers less than 26**7, which is easy to do, and then map those integers to 7-character strings in the natural way. Because 26**7 is larger than constant('EXACTINT'), the OP might want to limit the integers to the interval [1, 2**31-1], which still is very large.

FreelanceReinh · Posted 08-18-2022 05:59 PM

@Rick_SAS wrote:

Therefore, you reach a point at which most of your time is spent generating random numbers that have not been seen before.

Thanks for elaborating on the mathematical background. Luckily, if the sample size is as small as 500,000 out of 26**7, only a few (~10 - 20) hits of already selected items occur, so that performance is hardly impacted.

@mkeintz wrote:

The variable i below in the sequence number of the series of unique random strings generated.

If variable i is not needed, one could use

do until(h.num_items=500000);
  ...
end;

mkeintz · Posted 08-18-2022 06:11 PM

@FreelanceReinh wrote:

@Rick_SAS wrote:

Therefore, you reach a point at which most of your time is spent generating random numbers that have not been seen before.

Thanks for elaborating on the mathematical background. Luckily, if the sample size is as small as 500,000 out of 26**7, only a few (~10 - 20) hits of already selected items occur, so that performance is hardly impacted.

@mkeintz wrote:

The variable i below in the sequence number of the series of unique random strings generated.

If variable i is not needed, one could use
do until(h.num_items=500000);
  ...
end;

Agreed: the variable i in my response is not necessary, but I wanted a means to restore original order, should any sorting occur. BTW, with the streaminit value I used, there were only 11 instances of duplicates to be skipped.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ballardw · Posted 08-18-2022 06:47 PM

Ugly brute force: Create all 7 character strings allowed in a data set. (18 lines of code should do it using nested DO loops)

Proc Survey select to select a desired number of them.

Merge with the original data.

Result is random values.

Small example with 3 letter combos:

data junk;
  length x $ 7;
  do i=65 to 90;
  do j=65 to 90;
  do k=65 to 90;
    x=cats(byte(i),byte(j),byte(k));
    output;
  end;
  end;
  end;
  keep x;
run; 

proc surveyselect data=junk out=rand noprint
     sampsize=25;
run;

/* if rand has the same number of records as 
   have a 1-1 match
*/
data want;
   merge have
         rand
   ;
run;

Note that you can query the metadata about your data set to get a macro variable to hold the number of records to request for a 1-1 sampsize of an existing data set.

Random character string with no duplicates

Re: Random character string with no duplicates

Re: Random character string with no duplicates

Re: Random character string with no duplicates

Re: Random character string with no duplicates

Re: Random character string with no duplicates

Re: Random character string with no duplicates

Re: Random character string with no duplicates

Re: Random character string with no duplicates

Re: Random character string with no duplicates

Re: Random character string with no duplicates

Re: Random character string with no duplicates

Re: Random character string with no duplicates

Re: Random character string with no duplicates

Re: Random character string with no duplicates

Re: Random character string with no duplicates

Classroom Training Available!