BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
KatLinden
Obsidian | Level 7

Hi,

I need to create a variable with a random 7 character string with no duplicates. This is what I've tried so far but there are some duplicates. I'm not sure how to do this.  Can anyone help?  


data a(keep=final);
length final $7;
do i=1 to 500000;
do j=1 to 7;
substr(final,j)=byte(int(65+26*ranuni(0)));
end;
output;
end;
run;

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Ugly brute force: Create all 7 character strings allowed in a data set. (18 lines of code should do it using nested DO loops)

Proc Survey select to select a desired number of them.

Merge with the original data.

 

Result is random values.

Small example with 3 letter combos:

data junk;
  length x $ 7;
  do i=65 to 90;
  do j=65 to 90;
  do k=65 to 90;
    x=cats(byte(i),byte(j),byte(k));
    output;
  end;
  end;
  end;
  keep x;
run; 

proc surveyselect data=junk out=rand noprint
     sampsize=25;
run;

/* if rand has the same number of records as 
   have a 1-1 match
*/
data want;
   merge have
         rand
   ;
run;

Note that you can query the metadata about your data set to get a macro variable to hold the number of records to request for a 1-1 sampsize of an existing data set.

 

 

View solution in original post

18 REPLIES 18
smantha
Lapis Lazuli | Level 10

data a(keep=final);
length final $7;
do i=1 to 500000;
do j=1 to 7;

bytep=byte(int(65+26*ranuni(0)));

if index(final,bytep) <=0 then do;
substr(final,j)=byte(int(65+26*ranuni(0)));

end;

else;

i = i-1;

end;
end;
output;
end;
run;

ballardw
Super User

Define "no duplicates". No duplicates of character within single value? No adjacent characters with same character? No overall value duplicating another?

KatLinden
Obsidian | Level 7

No overall value duplicating another.  Same character within a value is OK.  Thanks!

PGStats
Opal | Level 21

To avoid replicated characters within a value use the permutation (k out of n) routine:

 

data want;
array a $1 a_1-a_26;
do i = 1 to dim(a);
    a{i} = byte(i - 1 + rank("A"));
    end;
seed = 87896;
length final $7;
do i = 1 to 50;
    call ranperk(seed, 7, of a{*});
    final = cats(of a_1-a_7);
    output;
    end;
stop;
keep i final;
run;

 

PG
KatLinden
Obsidian | Level 7

I need unique values with no duplicates, not unique characters within a value.

Rick_SAS
SAS Super FREQ

I'm confused by the response that was accepted as the solution. When I submit that code, it produces a syntax error.

KatLinden
Obsidian | Level 7

Yes,  highlighted in red needs to be removed.

 

data a(keep=final);
length final $7;
do i=1 to 500000;
do j=1 to 7;

bytep=byte(int(65+26*ranuni(0)));

if index(final,bytep) <=0 then do;
substr(final,j)=byte(int(65+26*ranuni(0)));

end;

else;

i = i-1;

end;
end;  This needs to be removed.
output;
end;
run;

 

Rick_SAS
SAS Super FREQ

It seems to me that the program results in an infinite loop because the looping variable i is decremented and never reaches 500000.

KatLinden
Obsidian | Level 7

I think you're right.  I cancelled the solution.  Thank you.

mkeintz
PROC Star

As you generate random strings, try to store them as the key in a hash object (the h.add() method below).  If the key is already in the hash object (left at default behavior of never accommodating duplicates), then h.add() will return a non-zero.  Skip those non-zero instances and generate another random string.  But when h.add() returns a zero, then output the string and increment variable i.

 

The variable i below in the sequence number of the series of unique random strings generated.

 

 

data want (drop=_:);
  _characters ='ABCDEFGHIJKLMNOPQRSTUVWXYZ';

  length string $7;
  declare hash h ();
    h.definekey('string');
    h.definedone();

  call streaminit(105986);
  do i=1 by 0 until (i>500000);
    string=' ';
    do _c=1 to 7;
      string=cats(string,char(_characters,rand('integer',26)));
    end;
    if h.add()=0 then do; /*If h.add()=0 this is not a duplicate*/
      output;
      i=i+1;
    end;
  end;
run;

Note: there are 26**7 possible random strings.  This program would become an infinite loop if you tried to generate more than 26**7 unique strings (=8,031,810,176).   And it would become might slow long before than sample size.

 

But you're only asking for 500,000.

 

BTW, the do loop construction

  do i=1 by 0 until (i>500000);
    .... other code ...;
    if some condition then i=i+1;
  end;

is equivalent to the more common:

  i=1;
  do until (i>500000);
    .... other code ...;
    if some condition then i=i+1;
  end;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Rick_SAS
SAS Super FREQ

My question is: Do the strings really have to be unique and random? Often people want unique identifiers for items or transactions or whatever. If that is the case, can you use

0000001, 0000002, 0000003, ..., 500000.

These are obviously unique, and can be easily mapped to 7-character strings.

 

I'd like to interject a little probability theory to explain why "random" and "unique" are at odds with each other.

 

The OP proposes a set of S = 26**7 = 8 billion possible 7-character strings and asks for N=500k unique IDs chosen from within that set.

 

Several solutions attempt to generate a random string, check to see if it was already generated, and then regenerate a new string if necessary. This algorithm becomes inefficient when N is greater than about 100k.

 

Why? Because this formulation is equivalent to the famous Birthday Problem in probability theory. When you start drawing elements uniformly at random from the set S, it doesn't take long before you start drawing duplicates. In fact, after you have drawn 105,520 items, the probability that the next item has been seen before is greater than 50%. By the time you have drawn 200,000 items, the probability of drawing a duplicate value is about 97.5%. For more about the Birthday Problem, see

https://blogs.sas.com/content/iml/2012/04/09/vectorized-computations-and-the-birthday-matching-probl...

and

https://blogs.sas.com/content/iml/2013/07/03/duplicates-in-random-numbers.html

 

Therefore, you reach a point at which most of your time is spent generating random numbers that have not been seen before. The "randomness" and "uniqueness" are at odds.

 

I suggest that the OP generate an arithmetic sequence of integers less than 26**7, which is easy to do, and then map those integers to 7-character strings in the natural way. Because 26**7 is larger than constant('EXACTINT'), the OP might want to limit the integers to the interval [1, 2**31-1], which still is very large.

FreelanceReinh
Jade | Level 19

@Rick_SAS wrote:

Therefore, you reach a point at which most of your time is spent generating random numbers that have not been seen before.


Thanks for elaborating on the mathematical background. Luckily, if the sample size is as small as 500,000 out of 26**7, only a few (~10 - 20) hits of already selected items occur, so that performance is hardly impacted.

 

@mkeintz wrote:

The variable i below in the sequence number of the series of unique random strings generated.


If variable i is not needed, one could use

do until(h.num_items=500000);
  ...
end;

 

mkeintz
PROC Star

@FreelanceReinh wrote:

@Rick_SAS wrote:

Therefore, you reach a point at which most of your time is spent generating random numbers that have not been seen before.


Thanks for elaborating on the mathematical background. Luckily, if the sample size is as small as 500,000 out of 26**7, only a few (~10 - 20) hits of already selected items occur, so that performance is hardly impacted.

 

@mkeintz wrote:

The variable i below in the sequence number of the series of unique random strings generated.


If variable i is not needed, one could use

do until(h.num_items=500000);
  ...
end;

 


Agreed: the variable i in my response is not necessary, but I wanted a means to restore original order, should any sorting occur.  BTW, with the streaminit value I used, there were only 11 instances of duplicates to be skipped.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ballardw
Super User

Ugly brute force: Create all 7 character strings allowed in a data set. (18 lines of code should do it using nested DO loops)

Proc Survey select to select a desired number of them.

Merge with the original data.

 

Result is random values.

Small example with 3 letter combos:

data junk;
  length x $ 7;
  do i=65 to 90;
  do j=65 to 90;
  do k=65 to 90;
    x=cats(byte(i),byte(j),byte(k));
    output;
  end;
  end;
  end;
  keep x;
run; 

proc surveyselect data=junk out=rand noprint
     sampsize=25;
run;

/* if rand has the same number of records as 
   have a 1-1 match
*/
data want;
   merge have
         rand
   ;
run;

Note that you can query the metadata about your data set to get a macro variable to hold the number of records to request for a 1-1 sampsize of an existing data set.

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 18 replies
  • 4680 views
  • 10 likes
  • 8 in conversation