topic Re: Split a large dataset efficiently in SAS Programming

Split a large dataset efficiently

prad001 — Tue, 20 Aug 2019 21:59:32 GMT

Hi All,

Please help me to split the dataset efficiently (preferably hash). Since the data set TEST has duplicates, getting error while splitting using hash.. All i wanted is not to split by record count but by NEW, which is not working with the below code.

LIBNAME one 'H:\';
DATA TEST;
DO new = 1 TO 10000000; OUTPUT;
OUTPUT;END;
proc sort; by new;
RUN;

data _null_;
if 0 then
set test;
declare hash h_out();
h_out.definekey('new');
h_out.definedata('new');
h_out.definedone();

do filenum=1 by 100 until(eof);
do new=1 to 100 until(eof);
set test end=eof;
h_out.add();
by new;
end;

h_out.output(dataset:cats('one.out_',filenum));
h_out.clear();
end;

stop;
run;

%macro srt;
%DO i=1 %to 10000000 %BY 100;
proc sort data=one.out_&i;
by new;
run;
%end;
%MEND;
%srt;

Re: Split a large dataset efficiently

ChrisNZ — Tue, 20 Aug 2019 22:54:16 GMT

Have you tried using the option multidata to allow for duplicate keys?

See here.

Re: Split a large dataset efficiently

novinosrin — Tue, 20 Aug 2019 23:12:10 GMT

Hello @prad001 Are you asking how to split 10 million records by NEW which is sets of 2. So 10million/2 =5million datasets your requirement?

Re: Split a large dataset efficiently

ChrisNZ — Tue, 20 Aug 2019 23:23:14 GMT

@novinosrin +1

I didn't look at the code!

And how many times is proc sort called?

Re: Split a large dataset efficiently

novinosrin — Tue, 20 Aug 2019 23:30:51 GMT

Sir, I didn't particularly look into the code beyond

All i wanted is not to split by record count but by NEW, which is not working with the below code. 

LIBNAME one 'H:\';
DATA TEST;
DO new = 1 TO 10000000; OUTPUT;
OUTPUT;END;
proc sort; by new;
RUN;

So for every iteration , two output statements making it rather 20 million records. 11,22,33 up to 20e6 records. And OP wants to split each by group into a dataset i.e 10 million datasets.

Of course for you and me et al regulars let alone splitting logic code is all over the internet , it's not the coding part that is concerning but the objective doesn't make sense at all. Well perhaps learning practice? Even so why anybody would practice with such large splits if i understand correctly

Re: Split a large dataset efficiently

SASKiwi — Tue, 20 Aug 2019 23:31:50 GMT

Why do you think you need to use hash processing? There are lots of easier ways to split data.

It would help if you could explain your requirements in words rather than us trying to confirm what you are trying to do from your code.

Re: Split a large dataset efficiently

hashman — Tue, 20 Aug 2019 23:33:32 GMT

@prad001 :

Try this, for example:

data have ;                                          
  do new = 1 to 1234 ;                               
    retain N1 1 C1 "C1" N2 2 C2 "C22" ;              
    output ;                                         
  end ;                                              
run ;                                                
                                                     
%let new_incr = 100 ;                                
                                                     
data _null_ ;                                        
  if _n_ = 1 then do ;                               
    dcl hash h (dataset:"have(obs=0)", ordered:"a") ;
    h.definekey ("new") ;                            
    h.definedata (all:"y") ;                         
    h.definedone () ;                                
  end ;                                              
  do until (mod (new, &new_incr) = 0 or lr) ;        
    set have end = lr ;                              
    h.add() ;                                        
  end ;                                              
  h.output (dataset: catx ("_", "work.out", _n_)) ;  
  h.clear () ;                                       
run ;

Kind regards

Paul D.

Re: Split a large dataset efficiently

hashman — Tue, 20 Aug 2019 23:43:36 GMT

@SASKiwi;

I'd rather question the very idea of the need to split a data set instead of creating a BY variable (for example, using MOD).

But as long as the need is justified, the hash approach looks fairly easy to me ... as long as the largest split group fits in memory comfortably.

Kind regards

Paul D.

Re: Split a large dataset efficiently

SASKiwi — Tue, 20 Aug 2019 23:58:29 GMT

@hashman - agreed, why do you need to split your data in the first place? Yes, hash techniques are really useful, but I'd argue there are more common techniques that would work equally well here and be easier for others to support when you are in a team of SAS developers.

Re: Split a large dataset efficiently

hashman — Wed, 21 Aug 2019 03:55:43 GMT

@SASKiwi :

All right, I'm game! You've seen my hash approach for the OP's task. To simplify, HAVE has a variable NEW ranging in order from 1 to some integer N:

%let N = 10011 ;    
                    
data have ;         
  do new = 1 to &n ;
    output ;        
  end ;             
run ;

HAVE needs to be split into an a priori unknown number of data sets with sequentially numbered names, N_INCR records each. If N isn't divisible by N_INCR, the last split file will contain fewer than N_INCR output records. Here's the hash approach:

%let N_incr = 100 ;                                          
                                                             
data _null_ ;                                                
  if _n_ = 1 then do ;                                       
    dcl hash h (ordered:"A") ;                               
    h.definekey ("new") ;                                    
    h.definedone () ;                                        
  end ;                                                      
  do until (mod (new, &n_incr) = 0 or lr) ;                  
    set have end = lr ;                                      
    h.add() ;                                                
  end ;                                                      
  h.output (dataset: catx ("_", "work.out", put (_n_,z3.))) ;
  h.clear() ;                                                
run ;

Could you please show one of "more common techniques that would work equally well here and be easier for others to support when you are in a team of SAS developers"? Am asking since though I do know a few, none can touch the simplicity of the code above in terms of logic, ease of coding, self-automation, and - as a corollary - ease of support.

On a different note, the hash object has been around for 16 years. The part of its overall functionality used in the DATA step above is marginal at best. The rest relies on two statements (DATA and SET), one function (MOD), one DO loop, and yes, pretty firm understanding of what _N_ really is and how the DATA step really works. Being a "SAS developer" should more than cover this territory, shouldn't it?

Kind regards

Paul D.

Re: Split a large dataset efficiently

prad001 — Thu, 22 Aug 2019 21:59:09 GMT

Thank you for helping Paul (Hashman)..

Your code works fine when the "NEW" variable is unique., what if it has duplicates ??

What I am trying is..

data have ;
do new = 1 to &n ;
output ; Output; ************************************************* 2 OUTPUTS here to create duplicates;
end ;
run ;

All I want is split the dataset into 100 by NEW. If the first 100 contains duplicates then it will have more than 100.

In the above example...

1st dataset will contain..

NEW

100

And the second dataset should contain...

NEW

101

102

200

Thanks alot in advance..

Pradeep

Re: Split a large dataset efficiently

hashman — Fri, 23 Aug 2019 00:38:02 GMT

@prad001 :

Just a few subtle alterations:

Create a view into HAVE with a dummy key unique for each range. Thus, below, _KNEW=1 for 1-100, 2 for 101-200, and so on. This simplifies the splitting task, as it reduces it to simple BY processing.
Code the argument tag MULTIDATA:"Y" to allow hash items with identical key-values.
Code for dropping _KNEW from the output data sets.

In sum:

data have ;                                                         
  do new = 1 to 1234 ;                                              
    retain N1 1 C1 "C1" N2 2 C2 "C22" ;                             
    do _n_ = 1 to ceil (ranuni(1) * 4) ; /*create dupes*/                            
      output ;                                                      
    end ;                                                           
  end ;                                                             
run ;                                                               
                                                                    
%let new_incr = 100 ;                                               
                                                                    
data knew / view = knew ;                                           
  set have ;                                                        
  _knew = ceil (new / &new_incr) ;                                  
run ;                                                               
                                                                    
data _null_ ;                                                       
  if _n_ = 1 then do ;                                              
    dcl hash h (dataset:"knew(obs=0)", multidata:"Y", ordered:"A") ;
    h.definekey ("_knew") ;                                         
    h.definedata (all:"Y") ;                                        
    h.definedone () ;                                               
  end ;                                                             
  do until (last._knew) ;                                           
    set knew ;                                                      
    by _knew ;                                                      
    h.add() ;                                                       
  end ;                                                             
  h.output (dataset: cats ("work.out_", _n_, "(drop=_knew)")) ;     
  h.clear () ;                                                      
run ;

Kind regards

Paul D.

Re: Split a large dataset efficiently

prad001 — Fri, 23 Aug 2019 17:30:17 GMT

Thank you Paul (Hashman). Thanks alot for a spontaneous and perfect reply.