varmalh
Fluorite | Level 6

Hi all,

Can someone please advise on the question below?

How do I select 10 random rows from a data set?

Thanks.

Accepted Solution
ChrisNZ
Tourmaline | Level 20

Without SAS/STAT, something like this works (no checks for duplicates here):

data T; 
  if 0 then set SASHELP.CLASS nobs=NOBS;
  do I=1 to 10;
    N=rand('uniform',1,NOBS); 
    OBS=N;
    set SASHELP.CLASS point=N;
    output; 
  end ;
  stop  ;
run;
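
For comparison, here is a minimal sketch of the route the answer above deliberately avoids. It assumes SAS/STAT is licensed, so that PROC SURVEYSELECT is available; the output data set names WANT_SRS and WANT_URS and the seed value are arbitrary choices made for illustration.

```sas
/* Simple random sampling without replacement: 10 distinct rows. */
proc surveyselect data=sashelp.class out=want_srs
                  method=srs sampsize=10 seed=7 ;
run ;

/* Unrestricted random sampling, i.e. with replacement.         */
/* OUTHITS writes one row per draw, so a row picked twice is    */
/* written twice instead of being collapsed into a hit count.   */
proc surveyselect data=sashelp.class out=want_urs
                  method=urs sampsize=10 seed=7 outhits ;
run ;
```

The DATA step versions in this thread remain the option when SAS/STAT is not installed.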

hashman
Ammonite | Level 13

@ChrisNZ:

It can be expressed more concisely:

  • Two SET statements aren't needed, since both the POINT= and NOBS= options can be applied to the same SET.
  • The "integer" distribution of RAND is a better fit here.

E.g.:

data want ;                           
  do _n_ = 1 to 10 ;                  
    p = rand ("integer", n) ;         
    set sashelp.class point=p nobs=n ;
    output ;                          
  end ;                               
  stop ;                              
run ;                                 

Kind regards

Paul D.

 

ChrisNZ
Tourmaline | Level 20

Ah! My version of SAS is so old that I don't have the integer distribution!

hashman
Ammonite | Level 13

@ChrisNZ:

Sad ;). But you still have RAND (meaning yours is 9.4), so you can get exactly the same effect by using:

 

  ceil (rand ("uniform") * N)

 

In fact, this expression has a certain edge over "integer" since it can be used to generate an integer variate in the range [-2**53:+2**53], whereas "integer" is limited to [-2**32:2**32].  
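
As a rough check that the expression above covers the whole range 1..N, the sketch below tallies a large number of draws. The value N=19 (the row count of SASHELP.CLASS), the seed, the number of draws, and the array name CNT are all assumptions chosen for illustration.

```sas
/* Sketch: tally 100,000 draws of ceil(rand("uniform") * n) for n = 19. */
data _null_ ;
  call streaminit (7) ;                  /* arbitrary seed, for repeatability */
  array cnt [19] _temporary_ (19 * 0) ;  /* one counter per possible value    */
  n = 19 ;
  do _i = 1 to 100000 ;
    p = ceil (rand ("uniform") * n) ;    /* integer draw in 1..n */
    cnt[p] + 1 ;                         /* count how often each value occurs */
  end ;
  lo = min (of cnt[*]) ;
  hi = max (of cnt[*]) ;
  put "smallest count: " lo "largest count: " hi ;
run ;
```

If every value from 1 to N is reachable with equal probability, the smallest and largest counts written to the log should both be nonzero and fairly close to each other.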
 

FreelanceReinh
Jade | Level 19

Sorry for being late in the discussion.

 

Hi @ChrisNZ,

 

Interesting to see that the POINT= option happily accepts non-integer values. To avoid discrimination against the last observation of the input dataset I would suggest a minor change to your code:

N=rand('uniform',1,NOBS+1);

Alternatively, the CEIL function (as suggested by @hashman) could be used:

N=ceil(rand('uniform')*NOBS);

Otherwise, in your example, poor William from the class has no chance of getting into the random sample. 🙂
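
The effect can be checked without reading any data. The sketch below assumes that POINT= simply truncates a non-integer value, which is what the suggested fix implies, and mimics that truncation with FLOOR. The variable names MAX_OLD and MAX_NEW, the seed, and the number of draws are illustrative.

```sas
/* Sketch: largest observation number each formula can reach,  */
/* assuming POINT= truncates non-integer values (see above).   */
data _null_ ;
  call streaminit (7) ;   /* arbitrary seed */
  nobs = 19 ;             /* row count of SASHELP.CLASS */
  do _i = 1 to 100000 ;
    /* original formula: values in (1, nobs), truncated */
    max_old = max (max_old, floor (rand ('uniform', 1, nobs))) ;
    /* suggested fix: values in (1, nobs + 1), truncated */
    max_new = max (max_new, floor (rand ('uniform', 1, nobs + 1))) ;
  end ;
  put max_old= max_new= ;
run ;
```

MAX_OLD can never exceed NOBS-1, because every draw is strictly below NOBS before truncation, whereas MAX_NEW should reach NOBS after this many draws.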

ChrisNZ
Tourmaline | Level 20
I never liked William so that serves him right! ;o) Good point.
hashman
Ammonite | Level 13

@varmalh:

You haven't said whether you want a sample with replacement or without replacement. If it's the former (i.e. you may get duplicate records in the output), then use the version by @ChrisNZ or the slightly shorter version below:

data want ;                                                                                                                             
  call streaminit (7) ;                                                                                                                 
  do _n_ = 1 to 10 ;                                                                                                                    
    p = rand ("integer", n) ;                                                                                                           
    set sashelp.class point=p nobs=n ;                                                                                                  
    output ;                                                                                                                            
  end ;                                                                                                                                 
  stop ;                                                                                                                                
run ;       

OTOH, if you want a sample without replacement (i.e. records that all have different observation numbers), you'll have to select from a dynamic pool, implemented below by the key-indexed array RR (999 is just a "big enough" number, no smaller than the number of records in the input data set):

data want (drop = _:) ;                                                                                                                             
  call streaminit (7) ;                                                                                                                 
  array rr [999] _temporary_ (1:999) ;                                                                                                  
  _h = n ;                                                                                                                              
  do _n_ = 1 to 10 ;                                                                                                                    
    _x = rand ("integer", _h) ;                                                                                                         
    p = rr (_x) ;                                                                                                                       
    set sashelp.class point=p nobs=n ;                                                                                                  
    output ;                                                                                                                            
    rr[_x] = rr[_h] ;                                                                                                                   
    _h +- 1 ;                                                                                                                           
  end ;                                                                                                                                 
  stop ;                                                                                                                                
run ;                                   

The "standard" technique of doing this, though, is the so-called classic K/N method:

data want (drop = _:) ;                                                                                                                 
  retain _k 10 ;                                                                                                                        
  if _n_ = 1 then call streaminit (7) ;                                                                                                 
  set sashelp.class nobs = n ;                                                                                                          
  if rand ("uniform") < divide (_k, n) then do ;                                                                                        
    output ;                                                                                                                            
    _k +- 1 ;                                                                                                                           
  end ;                                                                                                                                 
  n +- 1 ;                                                                                                                              
run ;         

Since it reads the entire data set, it is more efficient than the POINT= method only when the sample is a large fraction of the input, at approximately K/N > 0.5. If the sample is small compared to the input file (and especially if it is very small), POINT= is the winner because it reads exactly K records.

 

Kind regards

Paul D.

varmalh
Fluorite | Level 6

Hi @hashman ,

Can you please explain what "sample with replacement or without replacement" means?

 

Thanks 

varmalh

hashman
Ammonite | Level 13

@varmalh:

For comprehensive info, Google is your friend.

In short, imagine that you have 10 balls numbered 1 to 10 in a hat and take 5 balls out of it blindly one at a time, shaking the hat after every draw. "Blindly" and "shaking" effectively mean "randomly". Now you can do it two ways:

  1. write down the number of the ball you've just taken out and return the ball into the hat
  2. don't return it to the hat, meaning that when you take the next ball out, you're drawing out of the remaining balls only

In the first case, you have a chance to draw a ball you've already drawn, so the numbers you've written down may contain duplicates. That means you've sampled with replacement (literally, as you're replacing the drawn balls in the hat). In the second case, since the ball you've already taken out is no longer in the hat, you can never draw it again, and so the numbers you've written down cannot contain duplicates. That means you've sampled without replacement.

 

In terms of your task, if you use the program by @ChrisNZ (or my version similar to it), the output data set may contain duplicate records, since it picks a record number and doesn't eliminate it from the pool before the next pick - i.e. you get a sample with replacement. The rest of my programs are constructed in such a way that a record picked once is eliminated from the selection pool - i.e. you get a sample without replacement.

 

Whether you choose one or the other depends on what kind of random sample you want, which is dictated by the purpose of selecting the sample. Both have their uses, and this is the reason I asked. Obviously, selecting with replacement is simpler because you need no provisions to handle selected duplicates or to make sure they aren't selected in the first place.
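
One way to see which kind of sample a program produced is to keep the selected observation number and count its distinct values. The sketch below does this for a with-replacement draw; the data set name WANT_WR, the variable OBS, and the seed are illustrative assumptions. OBS is needed because the POINT= variable itself is not written to the output data set.

```sas
/* Sketch: draw 10 rows with replacement, keeping the row number. */
data want_wr ;
  call streaminit (7) ;                /* arbitrary seed */
  do _n_ = 1 to 10 ;
    p = ceil (rand ("uniform") * n) ;  /* n is set by NOBS= at compile time */
    set sashelp.class point=p nobs=n ;
    obs = p ;                          /* copy the pointer so it is kept */
    output ;
  end ;
  stop ;
run ;

/* Fewer distinct values than draws means duplicates were selected. */
proc sql ;
  select count (*)            as draws,
         count (distinct obs) as distinct_obs
  from want_wr ;
quit ;
```

Running the same query against the output of a without-replacement program (after adding a comparable OBS variable to it) should always give equal counts.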

 

Kind regards

Paul D.

  

varmalh
Fluorite | Level 6

Thanks for that info, @hashman.
