Calcite | Level 5

## Create unique, random identifiers

Hello,

I am attempting to create a unique, random ID variable for all existing IDs within dataset y. I have played around with a bunch of code however, I get a few duplicate IDs each time. How do I transform the below code to produce a unique, random number between 1 and 10**8? I have used streaminit(1) to ensure the same random numbers are generated on each run.

Data x;

Set y;

Call streaminit(1);
ID = RAND('Integer',(10 ** 8));

Run;

6 REPLIES 6
Super User

## Re: Create unique, random identifiers

You could try PROC SURVEYSELECT:

``````data have;
set sashelp.class;
run;

data id;
do id=1 to 10**8;
output;
end;
run;
%let dsid=%sysfunc(open(have));
%let nobs=%sysfunc(attrn(&dsid.,nlobs));
%let dsid=%sysfunc(close(&dsid.));
proc surveyselect data=id out=id2 seed=123 noprint sampsize=&nobs. outrandom;
run;
data want;
merge have id2;
run;
``````
Super User

## Re: Create unique, random identifiers

Here is another way:

``````Data x;
if _n_=1 then do;
length id 8;
declare hash h(hashexp:20);
h.definekey(key:'id');
h.definedone();
end;
Set sashelp.class;

Call streaminit(1);
do until(rc ne 0);
ID = RAND('Integer',1,10**8);
rc=h.check();
end;
drop rc;
Run; ``````

## Re: Create unique, random identifiers

Hello @ALEXIA1,

If your new IDs just need to "look like" random numbers, you can also resort to the (deprecated) RANUNI function, which creates up to 2,147,483,646 unique pseudo-random  numbers by default:

``````data x;
set y;
do until(ID <= 1e8);
ID = ranuni(2718)*(2**31-1);
end;
run;``````
Super User

## Re: Create unique, random identifiers

Is there already an ID variable in the dataset which you want to mask with the new random ID? If yes, @Ksharp 's code needs to be expanded to take care of possible duplicate entries of the already existing ID.

## Re: Create unique, random identifiers

@Kurt_Bremser wrote:

Is there already an ID variable in the dataset which you want to mask with the new random ID? If yes, @Ksharp 's code needs to be expanded to take care of possible duplicate entries of the already existing ID.

Good point from @Kurt_Bremser. In the case of an existing ID in the same range (1, 2, ..., 10**8) the RANUNI approach is particularly simple thanks to the CALL RANUNI routine, which can compute the new ID from the old one:

``````/* Create sample data for demonstration */

data have;
input ID othervar \$;
cards;
12345678 A1
11111111 B1
100000000 C1
1 D1
87654321 E1
12345678 A2
11111111 B2
100000000 C2
1 D2
87654321 E2
;

/* Create unique pseudo-random identifier ID2 between 1 and 10**8 for each existing ID in the same range */

data want;
set have;
ID2=ID;
do until(ID2<=1e8);
call ranuni(ID2, _n_);
end;
run;``````

Result:

```Obs           ID    othervar       ID2

1     12345678       A1       69598248
2     11111111       B1       13830595
3    100000000       C1       60730215
4            1       D1       48774337
5     87654321       E1        2209083
6     12345678       A2       69598248
7     11111111       B2       13830595
8    100000000       C2       60730215
9            1       D2       48774337
10     87654321       E2        2209083```

Edit: Moreover, you can compute ("reconstruct") the original ID from the new ID2. There is no SAS-supplied function or CALL routine for that, but the necessary calculations can be done in a DATA step:

``````/* Reconstruct the original ID from the new ID2 */

%let c=58743242 ; /* multiplicative inverse of the "multiplier" 397204094 found in RANUNI documentation, modulo 2**31-1 */
%let d=%sysevalf(2**31-1); /* =2147483647 */

data check(drop=_:);
/* Preparation: compute 10**k * 58743242 modulo 2**31-1, k=1, ..., 9 */
array _a[9] _temporary_;
if _n_=1 then do;
_a[1]=mod(10*&c, &d);
do _k=2 to 9;
_a[_k]=mod(10*_a[_k-1], &d);
end;
end;
set want;
_m=ID2;
do until(orig_ID<=1e8);
_t=put(_m,10.);
_s=input(char(_t,10), 1.)*&c;
do _k=1 to length(left(_t))-1;
_s+input(char(_t,10-_k), 1.)*_a[_k];
end;
orig_ID=mod(_s,&d);
_m=orig_ID;
end;
run;``````

PROC Star

## Re: Create unique, random identifiers

Here's a program that leverages the hash approach suggested by @Ksharp .  It assumes there is an old ID variable, and also that multiple observations can have the same old ID value, and therefore should be assigned the same randomly determined NEWID value.

It creates dataset WANT with the NEWID (but not the old ID), and a second dataset LOOKUP with all the ID/NEWID pairs:

``````data want (drop=id);  /*But keep NEWID*/

set have  end=end_of_have;

if _n_=1 then do;
declare hash LNK (ordered:'A');
LNK.definekey('id');
LNK.definedata('id','newid');
LNK.definedone();
declare hash NEW ();
NEW.definekey('newid');
NEW.definedone();
end;

call streaminit(20240717);
if LNK.find()^=0 then do;
do until (NEW.check()^=0);
newID = RAND('Integer',(10 ** 8));
end;
end;

if end_of_have then LNK.output(dataset:'lookup');
run;``````

Since the LOOKUP dataset is a 1:1 pairing, it can be used to find NEWID for each ID, ... and the reverse.

In the case of very large cardinality of the old ID variable.

1. This also assumes that you have no more than 10**8 unique old ID values.
2. The closer the cardinality of your old ID gets to 10**8, the slower this process will be.
3. And the greater risk there will be that the hash object will require more memory than available.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Discussion stats
• 6 replies
• 519 views
• 1 like
• 5 in conversation