BookmarkSubscribeRSS Feed
Ronein
Meteorite | Level 14

Hello

What is the way to split a data set to multiple data sets by value of a categorical variable?

Let's say that my data set called RawTbl has a categorical field called group.

LEt's say that the number of the values that group field get is not fixed (It means that each day that RawTbl  data set is published there are different number of categories).

The target is to split data set RawTbl  into multiple data sets (for each value of firld group there will be a seprate data set).

The name of  each data set will be  new_name_of_the_group

 

3 REPLIES 3
PaigeMiller
Diamond | Level 26

@novinosrin wrote:

 

http://www.sascommunity.org/wiki/Split_Data_into_Subsets


Hey, @Ronein , please read the above wiki article, where it says

"Best Practice: Just Don't Do It"

 

The reason it says "Don't Do it" is that there are rarely good reasons to do this, and plenty of drawbacks. Don't make your life harder than it has to be.

--
Paige Miller
hashman
Ammonite | Level 13

@Ronein: Beating a dead horse, eh?

 

If your input is sorted by your categorical variable, cool. If not, index it to facilitate BY processing. Then use hash extrusion to get what you want. Example (the categorical variable below being ORIGIN):

data have (index = (origin)) ;                                                                                                          
  set sashelp.cars ;                                                                                                                    
run ;                                                                                                                                   
                                                                                                                                        
data _null_ ;                                                                                                                           
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h (dataset: "have (obs=0)", multidata:"y") ;                                                                               
    h.definekey ("origin") ;                                                                                                            
    h.definedata (all: "y") ;                                                                                                           
    h.definedone () ;                                                                                                                   
  end ;                                                                                                                                 
  do until (last.origin) ;                                                                                                              
    set have ;                                                                                                                          
    by origin ;                                                                                                                         
    h.add() ;                                                                                                                           
  end ;                                                                                                                                 
  h.output (dataset: catx ("_", "ds", origin)) ;                                                                                        
  h.clear() ;                                                                                                                           
run ;                       

The alternative is to write code to pre-read the input and assemble a DATA statement with the requisite output file names plus a SELECT block or bunch of IF-THEN-ELSE statements for each distinct value of the categorical variable; then execute the generated code. The generation can be done by a macro, SQL, CALL EXECUTE, PUT/%INC, etc. Example (SQL):

proc sql noprint ;                                                                                                                      
  select distinct                                                                                                                       
    catx ("_", "ds", origin)                                                                                                            
  , catx (" ", "when (", quote (trim (origin)), ") output", catx ("_", "ds", origin))                                                   
  into :filelist separated by ' '                                                                                                       
     , :whenlist separated by ';'                                                                                                       
 from  have                                                                                                                             
 ;                                                                                                                                      
quit ;                                                                                                                                  
                                                                                                                                        
data &filelist ;                                                                                                                        
  set have ;                                                                                                                            
  select (origin) ;                                                                                                                     
    &whenlist ;                                                                                                                         
    otherwise ;                                                                                                                         
  end ;                                                                                                                                 
run ;                           

Both methods have their pros and cons:

  • In the hash case, the largest BY group with all its data must fit into the hash table memory. If the source file is split into a large number of small files, not a problem.
  • But it can become a problem with the other method, as too many data sets in the DATA statement may run the step out of buffer memory. Besides, the assembled code pieces may not fit into the FILELIST and WHENLIST macro variables, in which case a different way of assembling code devoid of this limitation (e.g., PUT/%INC) can be used.  

Note that in the hash case, the splitting can be done using hash-of-hashes without the source file having to be either sorted or indexed (the method invented by @RichardDeVen in 2004 in the wake of SUGI in Montreal). However, in this case, memory must be plentiful enough to fit the entire file in the hash table memory. 

 

Kind regards

Paul D.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 1532 views
  • 1 like
  • 4 in conversation