topic Re: Assign new random number to data by date in SAS Procedures

Assign new random number to data by date

saf_nadia — Tue, 18 Jun 2019 23:03:17 GMT

Hi, I want to assign new random number to a master data by date. The random number could be the same but not within the same date. Appreciate your help!

Example input:
Date Id Entries
1/2/19 A 1
1/2/19 B 2
1/2/19 B 2
2/3/19 C 3
2/3/19 C 3
2/3/19 C 3
2/3/19 A 2
2/3/19 A 2

Output that I want:
Date Id Entries Randomnumber
1/2/19 A 1 0.287
1/2/19 B 2 0.758
1/2/19 B 2 0.958
2/3/19 C 3 0.286
2/3/19 C 3 0.346
2/3/19 C 3 0.183
2/3/19 A 2 0.983
2/3/19 A 2 0.758

Re: Assign new random number to data by date

ballardw — Tue, 18 Jun 2019 23:26:56 GMT

data want;

set have;

randomnumber = rand('uniform');

run;

unless you have a pretty large number of records for the same date there's not likely to be much problem with a simple call.

Re: Assign new random number to data by date

saf_nadia — Tue, 18 Jun 2019 23:38:04 GMT

I actually have 93 different dates. How to do 93 sets of random number by dates?

Re: Assign new random number to data by date

Patrick — Wed, 19 Jun 2019 00:32:50 GMT

@saf_nadia

The rand() function will return a number with 8 byte precision (around 15 digits). The risk that you get the exactly same random number twice is very very small unless you've got a really big data set. For this reason you could just create random numbers without having to care about the date groups.

If you want to create random numbers with lower precision where you would have the risk of repetition then you would need to maintain a sort of black list to always check if a new generated number has been used already. Below code is a possible way to do something like that.

data have;
  input Date :ddmmyy. Id $ Entries;
  format date date9.;
  datalines;
1/2/19 A 1
1/2/19 B 2
1/2/19 B 2
2/3/19 C 3
2/3/19 C 3
2/3/19 C 3
2/3/19 A 2
2/3/19 A 2
;

data want(drop=_:);

  if _n_=1 then
    do;
      dcl hash h1();
      h1.defineKey('randomnumber');
      h1.defineDone();
    end;

  set have;
  by date;
  if first.date then h1.clear();

  do while(1);
    randomnumber = round(rand('uniform'),0.001);
    if h1.check() ne 0 then 
      do;
        h1.add();
        leave;
      end;

    /* avoid endless loop */
    _n=sum(_n,1);
    if _n=10000 then 
      do;
        put 'aborting job after 10000 trials to create a new unused random number';
        abort;
      end;
  end;

run;

Re: Assign new random number to data by date

saf_nadia — Wed, 19 Jun 2019 01:34:38 GMT

Patrick, ny data has 1 million observations although only 93 different dates. Have tried to generate the random number using simple code by ballard and checked there are around 233 dupout random number.

Re: Assign new random number to data by date

Patrick — Wed, 19 Jun 2019 05:29:22 GMT

@saf_nadia

Yes, you are absolutely right and what you observed is actually fully documented.

Duplicate Values

The RNG algorithms used by the RAND function have extremely long periods, but this does not imply that large random samples are devoid of duplicate values. With the default 32-bit Mersenne Twister algorithm, the RAND function returns at most 2³² distinct values. In a random uniform sample of size 10⁵, the chance of drawing at least one duplicate is greater than 50%. The expected number of duplicates in a random uniform sample of size M is approximately M²/2³³ when M is much less than 2³². For example, you should expect about 115 duplicates in a random uniform sample of size M=10⁶. These results are consequences of the famous “birthday matching problem” in probability theory.

https://go.documentation.sas.com/?docsetId=lefunctionsref&docsetTarget=p0fpeei0opypg8n1b06qe4r040lv.htm&docsetVersion=9.4&locale=en#p04b3euoqp9oewn1g8ckgx2ljawl

The code I've posted previously should work. You just need to allow for more digits (=remove the round() function). Below the amended code.

data have;
  input Date :ddmmyy. Id $ Entries;
  format date date9.;
  datalines;
1/2/19 A 1
1/2/19 B 2
1/2/19 B 2
2/3/19 C 3
2/3/19 C 3
2/3/19 C 3
2/3/19 A 2
2/3/19 A 2
;

data want(drop=_:);

  if _n_=1 then
    do;
      dcl hash h1();
      h1.defineKey('randomnumber');
      h1.defineDone();
    end;

  set have;
  by date;
  if first.date then h1.clear();

  do while(1);
    randomnumber = rand('uniform');
    if h1.check() ne 0 then 
      do;
        h1.add();
        leave;
      end;
    /* avoid endless loop */
    _n=sum(_n,1);
    if _n=10000 then 
      do;
        put 'aborting job after 10000 trials to create a new unused random number';
        abort;
      end;
  end;

run;

The ranuni() function seems to work differently and there I don't get duplicates.

data sample(keep=randomnumber);
  format randomnumber best32.;
  do i=1 to 10**7;
/*    randomnumber = rand('uniform');*/
    randomnumber = ranuni(1);
    output;
  end;
  stop;
run;

proc sort data=sample nodupkey dupout=duplicates;
  by randomnumber;
run;

proc print data=duplicates;
run;

NOTE: No observations in data set WORK.DUPLICATES.

Re: Assign new random number to data by date

ballardw — Wed, 19 Jun 2019 15:09:23 GMT

@saf_nadia wrote:
Patrick, ny data has 1 million observations although only 93 different dates. Have tried to generate the random number using simple code by ballard and checked there are around 233 dupout random number.

Can you tell us what the purpose of the random number? Perhaps there is another method that would do what you need.

Re: Assign new random number to data by date

ErikLund_Jensen — Wed, 19 Jun 2019 17:04:09 GMT

Hi @saf_nadia

I agree with @ballardw - Do you really need random numbers and not just some sort of unique observarion ID? - It happens to all of us that we are stuck in a coding problem and focus on that instead of considering alternative solutions, so we ask for help to climb the downpipe instead of asking for help to find the staircase.

But if you need random numbers, and need them in random order too, you can cheat by making more random numbers than necessary and keeping distinct values, before they are added to the data set. I have an example here:

* Make some test data - data set HAVE;
data have; 
	do var1 = 1 to 1000000; 
		var2 = mod(var1,10)+1;
		output; 
	end; 
run;

* Get observations from have;
proc sql noprint;
	select count(*) into :obs
	from have;
quit;

* Generate 2*obs random numbers;
data tmp1; 
	do i = 1 to %eval(&obs*2);
		randomnumber = round(rand('uniform')*10000000);
		output;
	end;
run;

* Sort by randomnumber;
proc sort data=tmp1; by randomnumber i;
run;

* Get rid of duplicates;
data tmp2; set tmp1; by randomnumber;
	if first.randomnumber;
run;

* Sort back in random order;
proc sort data=tmp2 out=randomlist (drop=i); by i;
run;

* Add random number to input records;
* Keep input sorting of both data sets;
data want; merge have (in=have_exist) randomlist;
	if not have_exist then stop;
	output;
run;

Re: Assign new random number to data by date

saf_nadia — Thu, 20 Jun 2019 10:30:13 GMT

Hi Erik. I guess I just need some unique Id without any duplication. I find this work best to my problem. Thanks Erik!