Solved: Re: Split a large dataset efficiently

prad001 · Posted 08-20-2019 05:59 PM

Hi All,

Please help me to split the dataset efficiently (preferably hash). Since the data set TEST has duplicates, getting error while splitting using hash.. All i wanted is not to split by record count but by NEW, which is not working with the below code.

LIBNAME one 'H:\';
DATA TEST;
DO new = 1 TO 10000000; OUTPUT;
OUTPUT;END;
proc sort; by new;
RUN;

data _null_;
if 0 then
set test;
declare hash h_out();
h_out.definekey('new');
h_out.definedata('new');
h_out.definedone();

do filenum=1 by 100 until(eof);
do new=1 to 100 until(eof);
set test end=eof;
h_out.add();
by new;
end;

h_out.output(dataset:cats('one.out_',filenum));
h_out.clear();
end;

stop;
run;

%macro srt;
%DO i=1 %to 10000000 %BY 100;
proc sort data=one.out_&i;
by new;
run;
%end;
%MEND;
%srt;

hashman · Posted 08-22-2019 08:38 PM

@prad001 :

Just a few subtle alterations:

Create a view into HAVE with a dummy key unique for each range. Thus, below, _KNEW=1 for 1-100, 2 for 101-200, and so on. This simplifies the splitting task, as it reduces it to simple BY processing.
Code the argument tag MULTIDATA:"Y" to allow hash items with identical key-values.
Code for dropping _KNEW from the output data sets.

In sum:

data have ;                                                         
  do new = 1 to 1234 ;                                              
    retain N1 1 C1 "C1" N2 2 C2 "C22" ;                             
    do _n_ = 1 to ceil (ranuni(1) * 4) ; /*create dupes*/                            
      output ;                                                      
    end ;                                                           
  end ;                                                             
run ;                                                               
                                                                    
%let new_incr = 100 ;                                               
                                                                    
data knew / view = knew ;                                           
  set have ;                                                        
  _knew = ceil (new / &new_incr) ;                                  
run ;                                                               
                                                                    
data _null_ ;                                                       
  if _n_ = 1 then do ;                                              
    dcl hash h (dataset:"knew(obs=0)", multidata:"Y", ordered:"A") ;
    h.definekey ("_knew") ;                                         
    h.definedata (all:"Y") ;                                        
    h.definedone () ;                                               
  end ;                                                             
  do until (last._knew) ;                                           
    set knew ;                                                      
    by _knew ;                                                      
    h.add() ;                                                       
  end ;                                                             
  h.output (dataset: cats ("work.out_", _n_, "(drop=_knew)")) ;     
  h.clear () ;                                                      
run ;

Kind regards

Paul D.

View solution in original post

ChrisNZ · Posted 08-20-2019 06:54 PM

Have you tried using the option multidata to allow for duplicate keys?

See here.

High-Performance SAS Coding - Third Edition

novinosrin · Posted 08-20-2019 07:12 PM

Hello @prad001 Are you asking how to split 10 million records by NEW which is sets of 2. So 10million/2 =5million datasets your requirement?

ChrisNZ · Posted 08-20-2019 07:23 PM

@novinosrin +1

I didn't look at the code!

And how many times is proc sort called?

High-Performance SAS Coding - Third Edition

novinosrin · Posted 08-20-2019 07:30 PM

Sir, I didn't particularly look into the code beyond

All i wanted is not to split by record count but by NEW, which is not working with the below code. 

LIBNAME one 'H:\';
DATA TEST;
DO new = 1 TO 10000000; OUTPUT;
OUTPUT;END;
proc sort; by new;
RUN;

So for every iteration , two output statements making it rather 20 million records. 11,22,33 up to 20e6 records. And OP wants to split each by group into a dataset i.e 10 million datasets.

Of course for you and me et al regulars let alone splitting logic code is all over the internet , it's not the coding part that is concerning but the objective doesn't make sense at all. Well perhaps learning practice? Even so why anybody would practice with such large splits if i understand correctly

SASKiwi · Posted 08-20-2019 07:31 PM

Why do you think you need to use hash processing? There are lots of easier ways to split data.

It would help if you could explain your requirements in words rather than us trying to confirm what you are trying to do from your code.

hashman · Posted 08-20-2019 07:43 PM

@SASKiwi;

I'd rather question the very idea of the need to split a data set instead of creating a BY variable (for example, using MOD).

But as long as the need is justified, the hash approach looks fairly easy to me ... as long as the largest split group fits in memory comfortably.

Kind regards

Paul D.

SASKiwi · Posted 08-20-2019 07:58 PM

@hashman - agreed, why do you need to split your data in the first place? Yes, hash techniques are really useful, but I'd argue there are more common techniques that would work equally well here and be easier for others to support when you are in a team of SAS developers.

hashman · Posted 08-20-2019 11:55 PM

@SASKiwi :

All right, I'm game! You've seen my hash approach for the OP's task. To simplify, HAVE has a variable NEW ranging in order from 1 to some integer N:

%let N = 10011 ;    
                    
data have ;         
  do new = 1 to &n ;
    output ;        
  end ;             
run ;

HAVE needs to be split into an a priori unknown number of data sets with sequentially numbered names, N_INCR records each. If N isn't divisible by N_INCR, the last split file will contain fewer than N_INCR output records. Here's the hash approach:

%let N_incr = 100 ;                                          
                                                             
data _null_ ;                                                
  if _n_ = 1 then do ;                                       
    dcl hash h (ordered:"A") ;                               
    h.definekey ("new") ;                                    
    h.definedone () ;                                        
  end ;                                                      
  do until (mod (new, &n_incr) = 0 or lr) ;                  
    set have end = lr ;                                      
    h.add() ;                                                
  end ;                                                      
  h.output (dataset: catx ("_", "work.out", put (_n_,z3.))) ;
  h.clear() ;                                                
run ;

Could you please show one of "more common techniques that would work equally well here and be easier for others to support when you are in a team of SAS developers"? Am asking since though I do know a few, none can touch the simplicity of the code above in terms of logic, ease of coding, self-automation, and - as a corollary - ease of support.

On a different note, the hash object has been around for 16 years. The part of its overall functionality used in the DATA step above is marginal at best. The rest relies on two statements (DATA and SET), one function (MOD), one DO loop, and yes, pretty firm understanding of what _N_ really is and how the DATA step really works. Being a "SAS developer" should more than cover this territory, shouldn't it?

Kind regards

Paul D.

hashman · Posted 08-20-2019 07:33 PM

@prad001 :

Try this, for example:

data have ;                                          
  do new = 1 to 1234 ;                               
    retain N1 1 C1 "C1" N2 2 C2 "C22" ;              
    output ;                                         
  end ;                                              
run ;                                                
                                                     
%let new_incr = 100 ;                                
                                                     
data _null_ ;                                        
  if _n_ = 1 then do ;                               
    dcl hash h (dataset:"have(obs=0)", ordered:"a") ;
    h.definekey ("new") ;                            
    h.definedata (all:"y") ;                         
    h.definedone () ;                                
  end ;                                              
  do until (mod (new, &new_incr) = 0 or lr) ;        
    set have end = lr ;                              
    h.add() ;                                        
  end ;                                              
  h.output (dataset: catx ("_", "work.out", _n_)) ;  
  h.clear () ;                                       
run ;

Kind regards

Paul D.

prad001 · Posted 08-22-2019 05:59 PM

Thank you for helping Paul (Hashman)..

Your code works fine when the "NEW" variable is unique., what if it has duplicates ??

What I am trying is..

data have ;
do new = 1 to &n ;
output ; Output; ************************************************* 2 OUTPUTS here to create duplicates;
end ;
run ;

All I want is split the dataset into 100 by NEW. If the first 100 contains duplicates then it will have more than 100.

In the above example...

1st dataset will contain..

NEW

1

2

.

100

And the second dataset should contain...

NEW

101

102

.

200

Thanks alot in advance..

Pradeep

hashman · Posted 08-22-2019 08:38 PM

@prad001 :

Just a few subtle alterations:

Create a view into HAVE with a dummy key unique for each range. Thus, below, _KNEW=1 for 1-100, 2 for 101-200, and so on. This simplifies the splitting task, as it reduces it to simple BY processing.
Code the argument tag MULTIDATA:"Y" to allow hash items with identical key-values.
Code for dropping _KNEW from the output data sets.

In sum:

data have ;                                                         
  do new = 1 to 1234 ;                                              
    retain N1 1 C1 "C1" N2 2 C2 "C22" ;                             
    do _n_ = 1 to ceil (ranuni(1) * 4) ; /*create dupes*/                            
      output ;                                                      
    end ;                                                           
  end ;                                                             
run ;                                                               
                                                                    
%let new_incr = 100 ;                                               
                                                                    
data knew / view = knew ;                                           
  set have ;                                                        
  _knew = ceil (new / &new_incr) ;                                  
run ;                                                               
                                                                    
data _null_ ;                                                       
  if _n_ = 1 then do ;                                              
    dcl hash h (dataset:"knew(obs=0)", multidata:"Y", ordered:"A") ;
    h.definekey ("_knew") ;                                         
    h.definedata (all:"Y") ;                                        
    h.definedone () ;                                               
  end ;                                                             
  do until (last._knew) ;                                           
    set knew ;                                                      
    by _knew ;                                                      
    h.add() ;                                                       
  end ;                                                             
  h.output (dataset: cats ("work.out_", _n_, "(drop=_knew)")) ;     
  h.clear () ;                                                      
run ;

Kind regards

Paul D.

prad001 · Posted 08-23-2019 01:30 PM

Thank you Paul (Hashman). Thanks alot for a spontaneous and perfect reply.

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away