Hi all,
Can someone please advise on the question below?
How do I select 10 random rows from a data set?
Thanks .
If you have SAS/STAT, the SURVEYSELECT procedure is the best option: https://documentation.sas.com/?docsetId=statug&docsetTarget=statug_surveyselect_syntax.htm&docsetVer...
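For reference, a minimal sketch of the SURVEYSELECT route, assuming SAS/STAT is licensed. The output name WANT and the seed value 7 are arbitrary choices for illustration:

```sas
/* Simple random sample of 10 rows, without replacement */
proc surveyselect data=sashelp.class
                  out=want
                  method=srs      /* METHOD=URS samples with replacement instead */
                  sampsize=10
                  seed=7;         /* arbitrary seed, only for reproducibility */
run;
```

With METHOD=URS, adding the OUTHITS option writes one output row per hit rather than one row per distinct record.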
Without SAS/STAT, something like this works (no checks for duplicates here):
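data T;
  if 0 then set SASHELP.CLASS nobs=NOBS;  /* never executes; NOBS is filled at compile time */
  do I=1 to 10;
    N=rand('uniform',1,NOBS);
    OBS=N;                                /* the POINT= variable is not written out, so keep a copy */
    set SASHELP.CLASS point=N;
    output;
  end;
  stop;                                   /* needed with POINT= to prevent an endless loop */
run;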
It can be expressed more concisely, e.g.:
data want ;
  do _n_ = 1 to 10 ;
    p = rand ("integer", n) ;
    set sashelp.class point=p nobs=n ;
    output ;
  end ;
  stop ;
run ;
Kind regards
Paul D.
Ah! My version of SAS is so old that I don't have the integer distribution!
Sad ;). But you still have RAND (meaning yours is 9.4), so you can get exactly the same effect by using:
ceil (rand ("uniform") * N)
In fact, this expression has a certain edge over "integer" since it can be used to generate an integer variate in the range [-2**53:+2**53], whereas "integer" is limited to [-2**32:2**32].
Sorry for being late in the discussion.
Hi @ChrisNZ,
Interesting to see that the POINT= option happily accepts non-integer values. To avoid discrimination against the last observation of the input dataset I would suggest a minor change to your code:
N=rand('uniform',1,NOBS+1);
Alternatively, the CEIL function (as suggested by @hashman) could be used:
N=ceil(rand('uniform')*NOBS);
Otherwise, in your example, poor William from the class has no chance of getting into the random sample. 🙂
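To see the effect, here is a rough sketch that simulates the two ways of picking an observation number and tabulates the results. It assumes, as this thread suggests, that POINT= simply drops the fractional part; the data set name DRAWS, the variable names and the seed are made up for the illustration:

```sas
data draws ;
  call streaminit (7) ;               /* arbitrary seed */
  nobs = 19 ;                         /* SASHELP.CLASS has 19 observations */
  do i = 1 to 100000 ;
    /* same range as rand('uniform',1,NOBS), then truncated */
    p_trunc = int (1 + (nobs - 1) * rand ('uniform')) ;
    /* the CEIL variant */
    p_ceil = ceil (nobs * rand ('uniform')) ;
    output ;
  end ;
  keep p_trunc p_ceil ;
run ;

proc freq data = draws ;
  tables p_trunc p_ceil / nocum ;
run ;
```

The P_TRUNC table should have no row for 19, since the uniform variate stays strictly below NOBS, while P_CEIL should cover 1 through 19 about evenly.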
You haven't said whether you want a sample with replacement or without replacement. If it's the former (i.e. you may get duplicate records in the output), then use the version by @ChrisNZ or the slightly shorter version below:
data want ;
  call streaminit (7) ;
  do _n_ = 1 to 10 ;
    p = rand ("integer", n) ;
    set sashelp.class point=p nobs=n ;
    output ;
  end ;
  stop ;
run ;
OTOH, if you want a sample without replacement (i.e. to pick records that all have different observation numbers), you'll have to select from a dynamic pool, implemented below by the key-indexed array RR (999 is just a "big enough" number greater than the number of records in the input data set):
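data want (drop = _:) ;
  call streaminit (7) ;
  array rr [999] _temporary_ (1:999) ;  /* pool of candidate observation numbers */
  _h = n ;                              /* current pool size; N is set at compile time */
  do _n_ = 1 to 10 ;
    _x = rand ("integer", _h) ;         /* random slot within the live part of the pool */
    p = rr (_x) ;
    set sashelp.class point=p nobs=n ;
    output ;
    rr[_x] = rr[_h] ;                   /* overwrite the used slot with the last live one */
    _h +- 1 ;                           /* shrink the pool by one */
  end ;
  stop ;
run ;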
The "standard" technique of doing this, though, is the so-called classic K/N method:
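data want (drop = _:) ;
  retain _k 10 ;                          /* records still to be selected */
  if _n_ = 1 then call streaminit (7) ;
  set sashelp.class nobs = n ;            /* N starts as the total record count */
  if rand ("uniform") < divide (_k, n) then do ;  /* select with probability K/N */
    output ;
    _k +- 1 ;
  end ;
  n +- 1 ;                                /* one fewer record left to read */
run ;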
Since it reads the entire data set, it is more efficient than the POINT= method only when K/N exceeds roughly 0.5. If the sample is small compared to the input file (and especially if it is very small), POINT= is the winner because it reads exactly K records.
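As a side note on why the K/N step is fair, here is a small worked check for the first two records only, with K the sample size and N the record count; the notation p_1, p_2 is introduced just for this sketch, and the general case (not shown) follows the same pattern:

```latex
% Record 1 is selected with probability K/N by construction.
p_1 = \frac{K}{N}

% Record 2 sees K-1 picks left if record 1 was taken, K otherwise,
% with N-1 records remaining in both cases.
p_2 = \frac{K}{N}\cdot\frac{K-1}{N-1} + \left(1-\frac{K}{N}\right)\cdot\frac{K}{N-1}
    = \frac{K(K-1) + (N-K)K}{N(N-1)}
    = \frac{K(N-1)}{N(N-1)}
    = \frac{K}{N}
```

Also, once the number of records left to read equals the number still to be selected, the ratio becomes 1, so every remaining record is taken and the sample never falls short of K.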
Kind regards
Paul D.
Hi @hashman ,
Can you please explain what "sample with replacement or without replacement" means?
Thanks
varmalh
For comprehensive info, Google is your friend.
In short, imagine that you have 10 balls numbered 1 to 10 in a hat and take 5 balls out of it blindly one at a time, shaking the hat after every draw. "Blindly" and "shaking" effectively mean "randomly". Now you can do it two ways:
1. After each draw, write down the number and put the ball back in the hat.
2. After each draw, write down the number and leave the ball out of the hat.
In the first case, you have a chance to draw a ball you've already drawn, so the numbers you've written down may contain duplicates. That means you've sampled with replacement (literally, as you're replacing the drawn balls in the hat). In the second case, since a ball you've already taken out is no longer in the hat, you can never draw it again, and so the numbers you've written down cannot contain duplicates. That means you've sampled without replacement.
In terms of your task, if you use the program by @ChrisNZ (or my version similar to it), you'll see that the output data set can contain duplicate records, since it picks a record number and doesn't eliminate it from the pool before the next pick - i.e. you get a sample with replacement. The rest of my programs are constructed in such a way that a record picked once is eliminated from the selection pool - i.e. you get a sample without replacement.
Whether you choose one or the other depends on what kind of random sample you want, which is dictated by the purpose of selecting the sample. Both have their uses, and this is the reason I asked. Obviously, selecting with replacement is simpler because you need no provisions to handle selected duplicates or to make sure they aren't selected in the first place.
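If you want to check which kind of sample you ended up with, something along these lines may help. It is only a sketch: the data set name WITH_REPL and the column aliases are made up, and it relies on NAME being unique in SASHELP.CLASS (the POINT= variable itself is not written to the output, so it can't be used for the check):

```sas
/* Draw 10 records with replacement, as in the POINT= programs above */
data with_repl ;
  call streaminit (7) ;                  /* arbitrary seed */
  do _n_ = 1 to 10 ;
    p = ceil (rand ("uniform") * n) ;    /* random observation number 1..N */
    set sashelp.class point = p nobs = n ;
    output ;
  end ;
  stop ;
run ;

/* Fewer distinct names than rows means some record was picked more than once */
proc sql ;
  select count (*)             as n_rows
       , count (distinct name) as n_distinct_names
  from with_repl ;
quit ;
```

Run against the output of a without-replacement program, the two counts should always be equal.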
Kind regards
Paul D.
thanks for that info @hashman ...