Solved: Re: Assign new random number to data by date

saf_nadia · Posted 06-18-2019 07:03 PM

Hi, I want to assign new random number to a master data by date. The random number could be the same but not within the same date. Appreciate your help!

Example input:
Date Id Entries
1/2/19 A 1
1/2/19 B 2
1/2/19 B 2
2/3/19 C 3
2/3/19 C 3
2/3/19 C 3
2/3/19 A 2
2/3/19 A 2

Output that I want:
Date Id Entries Randomnumber
1/2/19 A 1 0.287
1/2/19 B 2 0.758
1/2/19 B 2 0.958
2/3/19 C 3 0.286
2/3/19 C 3 0.346
2/3/19 C 3 0.183
2/3/19 A 2 0.983
2/3/19 A 2 0.758

ErikLund_Jensen · Posted 06-19-2019 01:04 PM

Hi @saf_nadia

I agree with @ballardw - Do you really need random numbers and not just some sort of unique observarion ID? - It happens to all of us that we are stuck in a coding problem and focus on that instead of considering alternative solutions, so we ask for help to climb the downpipe instead of asking for help to find the staircase.

But if you need random numbers, and need them in random order too, you can cheat by making more random numbers than necessary and keeping distinct values, before they are added to the data set. I have an example here:

* Make some test data - data set HAVE;
data have; 
	do var1 = 1 to 1000000; 
		var2 = mod(var1,10)+1;
		output; 
	end; 
run;

* Get observations from have;
proc sql noprint;
	select count(*) into :obs
	from have;
quit;

* Generate 2*obs random numbers;
data tmp1; 
	do i = 1 to %eval(&obs*2);
		randomnumber = round(rand('uniform')*10000000);
		output;
	end;
run;

* Sort by randomnumber;
proc sort data=tmp1; by randomnumber i;
run;

* Get rid of duplicates;
data tmp2; set tmp1; by randomnumber;
	if first.randomnumber;
run;

* Sort back in random order;
proc sort data=tmp2 out=randomlist (drop=i); by i;
run;

* Add random number to input records;
* Keep input sorting of both data sets;
data want; merge have (in=have_exist) randomlist;
	if not have_exist then stop;
	output;
run;

View solution in original post

ballardw · Posted 06-18-2019 07:26 PM

data want;

set have;

randomnumber = rand('uniform');

run;

unless you have a pretty large number of records for the same date there's not likely to be much problem with a simple call.

saf_nadia · Posted 06-18-2019 07:38 PM

I actually have 93 different dates. How to do 93 sets of random number by dates?

Patrick · Posted 06-18-2019 08:32 PM

@saf_nadia

The rand() function will return a number with 8 byte precision (around 15 digits). The risk that you get the exactly same random number twice is very very small unless you've got a really big data set. For this reason you could just create random numbers without having to care about the date groups.

If you want to create random numbers with lower precision where you would have the risk of repetition then you would need to maintain a sort of black list to always check if a new generated number has been used already. Below code is a possible way to do something like that.

data have;
  input Date :ddmmyy. Id $ Entries;
  format date date9.;
  datalines;
1/2/19 A 1
1/2/19 B 2
1/2/19 B 2
2/3/19 C 3
2/3/19 C 3
2/3/19 C 3
2/3/19 A 2
2/3/19 A 2
;

data want(drop=_:);

  if _n_=1 then
    do;
      dcl hash h1();
      h1.defineKey('randomnumber');
      h1.defineDone();
    end;

  set have;
  by date;
  if first.date then h1.clear();

  do while(1);
    randomnumber = round(rand('uniform'),0.001);
    if h1.check() ne 0 then 
      do;
        h1.add();
        leave;
      end;

    /* avoid endless loop */
    _n=sum(_n,1);
    if _n=10000 then 
      do;
        put 'aborting job after 10000 trials to create a new unused random number';
        abort;
      end;
  end;

run;

saf_nadia · Posted 06-18-2019 09:34 PM

Patrick, ny data has 1 million observations although only 93 different dates. Have tried to generate the random number using simple code by ballard and checked there are around 233 dupout random number.

Patrick · Posted 06-19-2019 01:28 AM

@saf_nadia

Yes, you are absolutely right and what you observed is actually fully documented.

Duplicate Values

The RNG algorithms used by the RAND function have extremely long periods, but this does not imply that large random samples are devoid of duplicate values. With the default 32-bit Mersenne Twister algorithm, the RAND function returns at most 2³² distinct values. In a random uniform sample of size 10⁵, the chance of drawing at least one duplicate is greater than 50%. The expected number of duplicates in a random uniform sample of size M is approximately M²/2³³ when M is much less than 2³². For example, you should expect about 115 duplicates in a random uniform sample of size M=10⁶. These results are consequences of the famous “birthday matching problem” in probability theory.

https://go.documentation.sas.com/?docsetId=lefunctionsref&docsetTarget=p0fpeei0opypg8n1b06qe4r040lv....

The code I've posted previously should work. You just need to allow for more digits (=remove the round() function). Below the amended code.

data have;
  input Date :ddmmyy. Id $ Entries;
  format date date9.;
  datalines;
1/2/19 A 1
1/2/19 B 2
1/2/19 B 2
2/3/19 C 3
2/3/19 C 3
2/3/19 C 3
2/3/19 A 2
2/3/19 A 2
;

data want(drop=_:);

  if _n_=1 then
    do;
      dcl hash h1();
      h1.defineKey('randomnumber');
      h1.defineDone();
    end;

  set have;
  by date;
  if first.date then h1.clear();

  do while(1);
    randomnumber = rand('uniform');
    if h1.check() ne 0 then 
      do;
        h1.add();
        leave;
      end;
    /* avoid endless loop */
    _n=sum(_n,1);
    if _n=10000 then 
      do;
        put 'aborting job after 10000 trials to create a new unused random number';
        abort;
      end;
  end;

run;

The ranuni() function seems to work differently and there I don't get duplicates.

data sample(keep=randomnumber);
  format randomnumber best32.;
  do i=1 to 10**7;
/*    randomnumber = rand('uniform');*/
    randomnumber = ranuni(1);
    output;
  end;
  stop;
run;

proc sort data=sample nodupkey dupout=duplicates;
  by randomnumber;
run;

proc print data=duplicates;
run;

NOTE: No observations in data set WORK.DUPLICATES.

ballardw · Posted 06-19-2019 11:09 AM

@saf_nadia wrote:
Patrick, ny data has 1 million observations although only 93 different dates. Have tried to generate the random number using simple code by ballard and checked there are around 233 dupout random number.

Can you tell us what the purpose of the random number? Perhaps there is another method that would do what you need.

ErikLund_Jensen · Posted 06-19-2019 01:04 PM

Hi @saf_nadia

I agree with @ballardw - Do you really need random numbers and not just some sort of unique observarion ID? - It happens to all of us that we are stuck in a coding problem and focus on that instead of considering alternative solutions, so we ask for help to climb the downpipe instead of asking for help to find the staircase.

But if you need random numbers, and need them in random order too, you can cheat by making more random numbers than necessary and keeping distinct values, before they are added to the data set. I have an example here:

* Make some test data - data set HAVE;
data have; 
	do var1 = 1 to 1000000; 
		var2 = mod(var1,10)+1;
		output; 
	end; 
run;

* Get observations from have;
proc sql noprint;
	select count(*) into :obs
	from have;
quit;

* Generate 2*obs random numbers;
data tmp1; 
	do i = 1 to %eval(&obs*2);
		randomnumber = round(rand('uniform')*10000000);
		output;
	end;
run;

* Sort by randomnumber;
proc sort data=tmp1; by randomnumber i;
run;

* Get rid of duplicates;
data tmp2; set tmp1; by randomnumber;
	if first.randomnumber;
run;

* Sort back in random order;
proc sort data=tmp2 out=randomlist (drop=i); by i;
run;

* Add random number to input records;
* Keep input sorting of both data sets;
data want; merge have (in=have_exist) randomlist;
	if not have_exist then stop;
	output;
run;

saf_nadia · Posted 06-20-2019 06:30 AM

Hi Erik. I guess I just need some unique Id without any duplication. I find this work best to my problem. Thanks Erik!

Duplicate Values

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away