BookmarkSubscribeRSS Feed
abc_xyz
Calcite | Level 5

Hi,

 

I want to select some observations from a dataset(let's call it as dataset2) based on average value of a column in another dataset(let's call it as dataset1). Also the no of observations should be equal to 20% of the no of observations in the dataset1.

Taking the example of sashelp.class so that the problem is clear.

1) Calculate the average of Height of Male students(let's call it as "y") and count of Male students(let's call it as "z").

2) Select the Female students such that

   i) Their average height should be within 2% range that of average height of male students(i.e. between 0.98 * y and 1.02*y)

   ii) Their count should be 20% of the count of Male students ( i.e. 0.2 *z)

  iii) In case we are unable to find the required number of female students then the preference should be given to satisfying the 1st condition. In these case, the maximum no of female students should be selected who satisfy the 1st criteria.

 

I will be doing this on a dataset which consists of around 100 K observations.

 

Thanks in advance for the help !!!!

 

 

 

 

 

7 REPLIES 7
PeterClemmensen
Tourmaline | Level 20

Can you post some example data that resembles your actual data (datset1 and dataset2) ? Don't post 100K observations 😉

 

Also, post an example of what your want your desired outcome to look like based on those example data sets.

 

 

That makes it much easier to help you. You seem to have a good idea of what you want to do, but as a reader it can be complicated to read 4-5 requirements written as plain text and come up with a code solution.

 

 

abc_xyz
Calcite | Level 5

Hi,

Using the same example of sashelp.class data.

 

1) Take the male students and find out their average height.(it is coming to be 63.91)

2) Find the values which are in 2% range of 63.91 (i.e. 0.98* 63.91 = 62.63 and 1.02* 63.91 = 65.19)

3) Find out count of male students and take 20% of that. ( There are 10 Male students hence required female student count would be 10*0.2 = 2)

4) Hence we need any 2 female students with height between 62.63 and 65.19

 

5) In some case, we might face the situation, where we are unable to get 20% of population ( i.e. there are less than 2 female students with height in the given required range of 62.63 and 65.19) In such case, the priority should be given to lowering the no of students and not to the lowering the cut off's for the height( i.e if there is only one female student with height in the given range, then we will select that female student).

 

Thanks!!!

 

hashman
Ammonite | Level 13

@abc_xyz:

 

Well, you've spelled the algorithm in detail yourself. It can be replicated in DATA step code verbatim. Only I've added some extra females with somewhat varying heights in order to increase their eligible population compared to SASHELP.CLASS, thus creating the test data set HAVE below.

 

data have (keep=name sex height) ;                           
  set sashelp.class ;                                        
  output ;                                                   
  if sex = "F" ;                                             
  _nm = name ;                                               
  _ht = height ;                                             
  do _n_ = 1 to rand ("integer", 9) ;                        
    name = cats (_nm, _n_) ;                                 
    height = _ht + rand ("integer", -2, 2) ;                 
    output ;                                                 
  end ;                                                      
run ;                                                        
                                                             
data want (keep = name sex height) ;                         
  retain rH 0.02 rN 0.2 ;                                    
  call streaminit (7) ;                                      
  do until (e1) ;                                            
    set have end = e1 ;                                      
    where sex = "M" and N (height) ;                         
    nM + 1 ;                                                 
    hM + height ;                                            
  end ;                                                      
  aM = divide (hM, nM) ;                                     
  do until (e2) ;                                            
    set have end = e2 ;                                      
    where sex = "F" and N (height) ;                         
    if aM * (1 - rH) <= height <= aM * (1 + rH) then nF + 1 ;
  end ;                                                      
  kF = nM * rN ;                                             
  do until (e3) ;                                            
    set have end = e3 ;                                      
    where sex = "F" and N (height) ;                         
    if rand ("uniform") < divide (kF, nF) then do ;          
      output ;                                               
      kF +- 1 ;                                              
    end ;                                                    
    nF +- 1 ;                                                
  end ;                                                      
run ;                                                        

If you run this code straight against SASHELP.CLASS (rather than HAVE), you'll still get 2 females out of the only 2 females eligible. Note that it's not the most efficient code, and there're a myriad other ways to skin this kitty, but it follows your stream-of-the-consciousness. In the end, random selection follows the simplest "K/N" algorithm. If you have 2 data sets instead of one as above, simply concatenate them first - just make sure to keep the field identifying M/F. Alternatively, you can plug the "male" data set in the DO loop #1 and the "female" data set - in the loops #2 and #3. (Also note that the RAND function in the form shown above requires version 9.4.)  

 

HTH

Paul D.

abc_xyz
Calcite | Level 5

Hi. 

Not the exact same way how i want since it is taking height of the last Male candidate instead of the average. But I got an overall idea. Thanks for the algorithms.

 

 

PaigeMiller
Diamond | Level 26

If you read this thread, the bottom line is that there is a method to select a random sample with a specified mean, which might work in many situations, but no guarantees either.

 

https://communities.sas.com/t5/Base-SAS-Programming/How-to-random-sample-with-desired-aggregate-stat...

--
Paige Miller
ballardw
Super User

Do you have access to SAS/OR and/or SAS/IML or do expect an answer using only "base" SAS data steps, Proc sql and SAS/STAT procedures?

abc_xyz
Calcite | Level 5

No. I don't have access to any of the above tool. Need to solve this issue using usual Proc sql, SAS/STAT steps only.

 

Thanks!!!

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 1173 views
  • 1 like
  • 5 in conversation