Hi SAS Users,
I would like your help with the following.
Raw data:
data have;
input ID ;
cards;
11
11
11
11
11
12
12
12
12
12
;
run;
I would like to assign ranks 1 - 5 to the above rows. Each ID should have all the ranks but randomly assigned.
Any help is appreciated.
Thank you for your time.
data temp;
set have;
rnd = rand("uniform");
run;
proc rank data=temp out=want;
by id;
var rnd;
run;
data temp;
set have;
by ID;
if first.ID then order=1; else order+1;
x=rand('normal', 50, 4);
run;
proc sort data=temp;
by ID x;
run;
data want;
set have;
by id;
if first.id then rank=1; else rank+1;
run;
proc sort data=want;
by id order;
run;
Or essentially use PROC SURVEYSELECT with a 100% samprate.
Thank you for your time. This worked too.
data temp;
set have;
rnd = rand("uniform");
run;
proc rank data=temp out=want;
by id;
var rnd;
run;
Nice one Sir @PGStats . I am just wondering what are chances of duplication caused by rand('uniform')? I did the read the docs, but something is not registering in my head.
@Reeza, Please chime in too. I need this education plz
An exact duplicate that would cause RANK to consider them the same value?
Very, very small IMO to the point of negligible.
This wouldn't be an issue with my solution, which is more code than @PGStats , his is definitley more elegant. I'm pretty sure that there should be a way with PROC SURVEYSELECT.
The likelyhood of getting exactly the same random value within 5 consecutive calls is very very small and depends on the chosen RNG. Read
Marvelous documentation link. Jeez! how I wish I had read that earlier. Hmm, There's a time for everything. @PGStats Needless to mention my usual i.e "Prodigygeniusstats" as you have cycled faster to grasp the cycle lengths. Kudos! and Thank you!!!
@novinosrin wrote:
I am just wondering what are chances of duplication caused by rand('uniform')? I did the read the docs, but something is not registering in my head.
Hi @novinosrin: Good question. According to section Duplicate Values in the RAND function documentation (which I think is even more relevant to this than the information about cycle lengths) the default 32-bit Mersenne Twister algorithm used here would result in at least one duplicate rank for the OP's sample data with an approximate probability of 1-(1-25/2**33)**2≈5.82E-9 (assuming independence between BY groups), which is indeed "very, very small."
However, if the real HAVE dataset was much larger, say, 10,000 IDs with 50 observations each, the corresponding probability would be 1-(1-2500/2**33)**10000≈0.002906 and merely ceil(-0.0001/log2(1-2500/2**33))=239 independent experiments of this size would suffice to observe a duplicate rank with a probability >=50%.
Let's try it:
data temp;
call streaminit(27182818);
do r=1 to 239;
do id=1 to 10000;
do _n_=1 to 50;
rnd = rand("uniform");
output;
end;
end;
end;
run;
proc rank data=temp out=want;
by r id;
var rnd;
run;
proc freq data=want;
tables rnd;
run;
Partial PROC FREQ output shows that it really happened (and, no, I did not tinker with the random seed):
Cumulative Cumulative rnd Frequency Percent Frequency Percent -------------------------------------------------------- 1 2390000 2.00 2390000 2.00 2 2390000 2.00 4780000 4.00 3 2389999 2.00 7169999 6.00 3.5 2 0.00 7170001 6.00 4 2389999 2.00 9560000 8.00 5 2390000 2.00 11950000 10.00 6 2390000 2.00 14340000 12.00
Find the culprit:
proc print data=want;
where rnd=3.5;
run;
proc print data=temp;
where r=147 & id=8807;
format rnd best16.;
run;
Voilà:
Obs r id rnd 73440301 147 8807 0.91091977874748 73440302 147 8807 0.04118569614366 73440303 147 8807 0.14874891191721 73440304 147 8807 0.93303100718185 73440305 147 8807 0.04243680997752 73440306 147 8807 0.23993166047148 73440307 147 8807 0.46066979528405 73440308 147 8807 0.32410094887018 73440309 147 8807 0.09882239229046 73440310 147 8807 0.44069946813397 73440311 147 8807 0.56899999175221 73440312 147 8807 0.04243680997752 73440313 147 8807 0.79902740451507 ...
The HEX16. representations of the two highlighted numbers are 3FA5BA470DFFFFFF and in fact the representations of all 119.5 million RND values end in "FFFFF" showing their 32-bit nature. So, one of the available 64-bit RNGs would be a better choice if the risk of duplicates is an issue, e.g.,
call streaminit('PCG', 27182818);
Good morning/afternoon Sir @FreelanceReinh Thank you and I really feel obliged for that beautiful description. I should pay for it : ) Understood well and clear. Cheers!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.