topic Re: split a data set to multiple data sets in SAS Programming

split a data set to multiple data sets

Ronein — Sun, 29 Dec 2019 17:51:31 GMT

Hello

What is the way to split a data set to multiple data sets by value of a categorical variable?

Let's say that my data set called RawTbl has a categorical field called group.

LEt's say that the number of the values that group field get is not fixed (It means that each day that RawTbl data set is published there are different number of categories).

The target is to split data set RawTbl into multiple data sets (for each value of firld group there will be a seprate data set).

The name of each data set will be new_name_of_the_group

Re: split a data set to multiple data sets

novinosrin — Sun, 29 Dec 2019 18:25:03 GMT

Please refer to

https://blogs.sas.com/content/sasdummy/2015/01/26/how-to-split-one-data-set-into-many/

http://www.sascommunity.org/wiki/Split_Data_into_Subsets

Re: split a data set to multiple data sets

hashman — Mon, 30 Dec 2019 04:34:13 GMT

@Ronein: Beating a dead horse, eh?

If your input is sorted by your categorical variable, cool. If not, index it to facilitate BY processing. Then use hash extrusion to get what you want. Example (the categorical variable below being ORIGIN):

data have (index = (origin)) ;                                                                                                          
  set sashelp.cars ;                                                                                                                    
run ;                                                                                                                                   
                                                                                                                                        
data _null_ ;                                                                                                                           
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h (dataset: "have (obs=0)", multidata:"y") ;                                                                               
    h.definekey ("origin") ;                                                                                                            
    h.definedata (all: "y") ;                                                                                                           
    h.definedone () ;                                                                                                                   
  end ;                                                                                                                                 
  do until (last.origin) ;                                                                                                              
    set have ;                                                                                                                          
    by origin ;                                                                                                                         
    h.add() ;                                                                                                                           
  end ;                                                                                                                                 
  h.output (dataset: catx ("_", "ds", origin)) ;                                                                                        
  h.clear() ;                                                                                                                           
run ;

The alternative is to write code to pre-read the input and assemble a DATA statement with the requisite output file names plus a SELECT block or bunch of IF-THEN-ELSE statements for each distinct value of the categorical variable; then execute the generated code. The generation can be done by a macro, SQL, CALL EXECUTE, PUT/%INC, etc. Example (SQL):

proc sql noprint ;                                                                                                                      
  select distinct                                                                                                                       
    catx ("_", "ds", origin)                                                                                                            
  , catx (" ", "when (", quote (trim (origin)), ") output", catx ("_", "ds", origin))                                                   
  into :filelist separated by ' '                                                                                                       
     , :whenlist separated by ';'                                                                                                       
 from  have                                                                                                                             
 ;                                                                                                                                      
quit ;                                                                                                                                  
                                                                                                                                        
data &filelist ;                                                                                                                        
  set have ;                                                                                                                            
  select (origin) ;                                                                                                                     
    &whenlist ;                                                                                                                         
    otherwise ;                                                                                                                         
  end ;                                                                                                                                 
run ;

Both methods have their pros and cons:

In the hash case, the largest BY group with all its data must fit into the hash table memory. If the source file is split into a large number of small files, not a problem.
But it can become a problem with the other method, as too many data sets in the DATA statement may run the step out of buffer memory. Besides, the assembled code pieces may not fit into the FILELIST and WHENLIST macro variables, in which case a different way of assembling code devoid of this limitation (e.g., PUT/%INC) can be used.

Note that in the hash case, the splitting can be done using hash-of-hashes without the source file having to be either sorted or indexed (the method invented by @RichardDeVen in 2004 in the wake of SUGI in Montreal). However, in this case, memory must be plentiful enough to fit the entire file in the hash table memory.

Kind regards

Paul D.

Re: split a data set to multiple data sets

PaigeMiller — Mon, 30 Dec 2019 12:47:56 GMT

@novinosrin wrote:

http://www.sascommunity.org/wiki/Split_Data_into_Subsets

Hey, @Ronein , please read the above wiki article, where it says

"Best Practice: Just Don't Do It"

The reason it says "Don't Do it" is that there are rarely good reasons to do this, and plenty of drawbacks. Don't make your life harder than it has to be.