Solved: How to select 10 random rows in a data set...?

varmalh · Posted 09-16-2019 07:39 PM

Hi all,

Can someone please advise on the question below?

How to select 10 random rows in a data set?

Thanks.

SASKiwi · Posted 09-16-2019 07:58 PM

If you have SAS/STAT, the SURVEYSELECT procedure is the best option: https://documentation.sas.com/?docsetId=statug&docsetTarget=statug_surveyselect_syntax.htm&docsetVer...
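
A minimal sketch of that approach, assuming SAS/STAT is licensed. The output data set name WANT and the seed value 7 are illustrative choices, not part of the post above:

```sas
/* Simple random sample of 10 rows, drawn without replacement. */
/* WANT and seed=7 are placeholder choices for illustration.   */
proc surveyselect data=sashelp.class
                  out=want
                  method=srs     /* simple random sampling, no duplicates */
                  sampsize=10    /* number of rows to select              */
                  seed=7;        /* fixed seed so the sample is repeatable */
run;
```

Replacing method=srs with method=urs should give a sample with replacement instead.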

ChrisNZ · Posted 09-16-2019 08:42 PM

Without SAS/STAT, something like this works (no checks for duplicates here):

data T; 
  if 0 then set SASHELP.CLASS nobs=NOBS;
  do I=1 to 10;
    N=rand('uniform',1,NOBS); 
    OBS=N;
    set SASHELP.CLASS point=N;
    output; 
  end ;
  stop  ;
run;

High-Performance SAS Coding - Third Edition

hashman · Posted 09-16-2019 11:37 PM

@ChrisNZ:

It can be expressed more concisely:

1. Two SET statements aren't needed, since the POINT= and NOBS= options can both be applied to the same SET statement.
2. The "integer" distribution for RAND appears to be the more apt choice.

E.g.:

data want ;                           
  do _n_ = 1 to 10 ;                  
    p = rand ("integer", n) ;         
    set sashelp.class point=p nobs=n ;
    output ;                          
  end ;                               
  stop ;                              
run ;

Kind regards

Paul D.

ChrisNZ · Posted 09-17-2019 12:26 AM

Ah! My version of SAS is so old that I don't have the integer distribution!

hashman · Posted 09-17-2019 12:51 AM

@ChrisNZ:

Sad ;). But you still have RAND (meaning yours is 9.4), so you can get exactly the same effect by using:

ceil (rand ("uniform") * N)

In fact, this expression has a certain edge over "integer" since it can be used to generate an integer variate in the range [-2**53:+2**53], whereas "integer" is limited to [-2**32:2**32].

FreelanceReinh · Posted 09-21-2019 09:13 AM

Sorry for being late in the discussion.

Hi @ChrisNZ,

Interesting to see that the POINT= option happily accepts non-integer values. To avoid discrimination against the last observation of the input dataset I would suggest a minor change to your code:

N=rand('uniform',1,NOBS+1);

Alternatively, the CEIL function (as suggested by @hashman) could be used:

N=ceil(rand('uniform')*NOBS);

Otherwise, in your example, poor William from the class has no chance of getting into the random sample. 🙂
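
The effect can be checked with a quick simulation. This is only a sketch: it assumes POINT= drops the fractional part of a non-integer value (modelled here with FLOOR), and the variable names, the seed, and the 100,000 draws are arbitrary choices for illustration:

```sas
data _null_;
  call streaminit(7);
  nobs = 19;                                    /* SASHELP.CLASS has 19 rows */
  do i = 1 to 100000;
    /* Original approach: uniform on (1, nobs), fractional part dropped */
    p_trunc = floor(rand('uniform', 1, nobs));
    /* CEIL approach: uniform on (0, 1) scaled up to 1..nobs */
    p_ceil  = ceil(rand('uniform') * nobs);
    /* MAX ignores missing values, so no initialisation is needed */
    max_trunc = max(max_trunc, p_trunc);
    max_ceil  = max(max_ceil, p_ceil);
  end;
  put max_trunc= max_ceil=;                     /* largest row number each method produced */
run;
```

If the truncation assumption holds, MAX_TRUNC can never reach 19, while MAX_CEIL should reach it almost surely over that many draws.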

ChrisNZ · Posted 09-21-2019 06:41 PM

I never liked William so that serves him right! ;o) Good point.

hashman · Posted 09-17-2019 12:31 AM

@varmalh:

You haven't said whether you want a sample with replacement or without replacement. If it's the former (i.e. you may get duplicate records in the output), then use the version by @ChrisNZ or the slightly shorter version below:

data want ;                                                                                                                             
  call streaminit (7) ;                                                                                                                 
  do _n_ = 1 to 10 ;                                                                                                                    
    p = rand ("integer", n) ;                                                                                                           
    set sashelp.class point=p nobs=n ;                                                                                                  
    output ;                                                                                                                            
  end ;                                                                                                                                 
  stop ;                                                                                                                                
run ;

OTOH, if you want a sample without replacement (i.e. every selected record has a different observation number), you'll have to select from a dynamic pool, implemented below by the key-indexed array RR (999 is just a "big enough" number, greater than the number of records in the input data set):

data want (drop = _:) ;                                                                                                                             
  call streaminit (7) ;                                                                                                                 
  array rr [999] _temporary_ (1:999) ;                                                                                                  
  _h = n ;                                                                                                                              
  do _n_ = 1 to 10 ;                                                                                                                    
    _x = rand ("integer", _h) ;                                                                                                         
    p = rr (_x) ;                                                                                                                       
    set sashelp.class point=p nobs=n ;                                                                                                  
    output ;                                                                                                                            
    rr[_x] = rr[_h] ;                                                                                                                   
    _h +- 1 ;                                                                                                                           
  end ;                                                                                                                                 
  stop ;                                                                                                                                
run ;

The "standard" technique of doing this, though, is the so-called classic K/N method:

data want (drop = _:) ;                                                                                                                 
  retain _k 10 ;                                                                                                                        
  if _n_ = 1 then call streaminit (7) ;                                                                                                 
  set sashelp.class nobs = n ;                                                                                                          
  if rand ("uniform") < divide (_k, n) then do ;                                                                                        
    output ;                                                                                                                            
    _k +- 1 ;                                                                                                                           
  end ;                                                                                                                                 
  n +- 1 ;                                                                                                                              
run ;

Since it reads the entire data set, it is more efficient than the POINT= method only when K/N is above approximately 0.5. If the sample is small compared to the input file (and especially if it is very small), POINT= is the winner because it reads exactly K records and no more.

Kind regards

Paul D.

varmalh · Posted 09-17-2019 08:45 PM

Hi @hashman ,

Can you please explain what "sample with replacement or without replacement" means?

Thanks

varmalh

hashman · Posted 09-17-2019 11:48 PM

@varmalh:

For comprehensive info, Google is your friend.

In short, imagine that you have 10 balls numbered 1 to 10 in a hat and take 5 balls out of it blindly one at a time, shaking the hat after every draw. "Blindly" and "shaking" effectively mean "randomly". Now you can do it two ways:

1. Write down the number of the ball you've just taken out and return the ball to the hat.
2. Don't return it to the hat, meaning that when you take the next ball out, you're drawing from the remaining balls only.

In the first case, you have a chance to draw a ball you've already drawn, so the numbers you've written down may contain duplicates. That means you've sampled with replacement (literally, as you're replacing the drawn balls in the hat). In the second case, since the ball you've already taken out is no longer in the hat, you can never draw it again, and so the numbers you've written down cannot contain duplicates. That means you've sampled without replacement.

In terms of your task, if you use the program by @ChrisNZ (or my version similar to it), the output data set can contain duplicate records, since it picks a record number and doesn't eliminate it from the pool before the next pick, i.e. you get a sample with replacement. The rest of my programs are constructed in such a way that a record picked once is eliminated from the selection pool, i.e. you get a sample without replacement.

Whether you choose one or the other depends on what kind of random sample you want, which is dictated by the purpose of selecting the sample. Both have their uses, and this is the reason I asked. Obviously, selecting with replacement is simpler because you need no provisions to handle selected duplicates or to make sure they aren't selected in the first place.
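
One way to see which kind of sample a given program produced is to compare the row count with the number of distinct rows. The sketch below assumes the output data set is named WANT and was drawn from SASHELP.CLASS, where NAME is unique per row; the column aliases are made up for illustration:

```sas
/* If N_ROWS exceeds N_DISTINCT, some record was picked more than once, */
/* which can only happen when sampling with replacement.                */
proc sql;
  select count(*)             as n_rows,
         count(distinct name) as n_distinct
  from want;
quit;
```

For a sample without replacement the two counts should always match. For a sample with replacement they can differ, although any single run may happen to contain no duplicates.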

Kind regards

Paul D.

varmalh · Posted 09-18-2019 09:21 PM

thanks for that info @hashman ...
