BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
prad001
Obsidian | Level 7

Hi All,

Please help me to split the dataset efficiently (preferably hash). Since the data set TEST has duplicates, getting error while splitting using hash.. All i wanted is not to split by record count but by NEW, which is not working with the below code. 

 

LIBNAME one 'H:\';
DATA TEST;
DO new = 1 TO 10000000; OUTPUT;
OUTPUT;END;
proc sort; by new;
RUN;

data _null_;
if 0 then
set test;
declare hash h_out();
h_out.definekey('new');
h_out.definedata('new');
h_out.definedone();

do filenum=1 by 100 until(eof);
do new=1 to 100 until(eof);
set test end=eof;
h_out.add();
by new;
end;

h_out.output(dataset:cats('one.out_',filenum));
h_out.clear();
end;

stop;
run;

 

%macro srt;
%DO i=1 %to 10000000 %BY 100;
proc sort data=one.out_&i;
by new;
run;
%end;
%MEND;
%srt;

1 ACCEPTED SOLUTION

Accepted Solutions
hashman
Ammonite | Level 13

@prad001 :

 

Just a few subtle alterations:

  1. Create a view into HAVE with a dummy key unique for each range. Thus, below, _KNEW=1 for  1-100, 2 for 101-200, and so on. This simplifies the splitting task, as it reduces it to simple BY processing.
  2. Code the argument tag MULTIDATA:"Y" to allow hash items with identical key-values.
  3. Code for dropping _KNEW from the output data sets.

In sum:

data have ;                                                         
  do new = 1 to 1234 ;                                              
    retain N1 1 C1 "C1" N2 2 C2 "C22" ;                             
    do _n_ = 1 to ceil (ranuni(1) * 4) ; /*create dupes*/                            
      output ;                                                      
    end ;                                                           
  end ;                                                             
run ;                                                               
                                                                    
%let new_incr = 100 ;                                               
                                                                    
data knew / view = knew ;                                           
  set have ;                                                        
  _knew = ceil (new / &new_incr) ;                                  
run ;                                                               
                                                                    
data _null_ ;                                                       
  if _n_ = 1 then do ;                                              
    dcl hash h (dataset:"knew(obs=0)", multidata:"Y", ordered:"A") ;
    h.definekey ("_knew") ;                                         
    h.definedata (all:"Y") ;                                        
    h.definedone () ;                                               
  end ;                                                             
  do until (last._knew) ;                                           
    set knew ;                                                      
    by _knew ;                                                      
    h.add() ;                                                       
  end ;                                                             
  h.output (dataset: cats ("work.out_", _n_, "(drop=_knew)")) ;     
  h.clear () ;                                                      
run ;                                                               

Kind regards

Paul D.

 

View solution in original post

12 REPLIES 12
ChrisNZ
Tourmaline | Level 20

Have you tried using the option multidata to allow for duplicate keys?

See here.

novinosrin
Tourmaline | Level 20

Hello @prad001  Are you asking how to split 10 million  records by NEW which is sets of 2. So 10million/2 =5million datasets your requirement?

ChrisNZ
Tourmaline | Level 20

@novinosrin +1

I didn't look at the code! 

And how many times is proc sort called?

novinosrin
Tourmaline | Level 20

Sir, I didn't particularly look into the code beyond 

All i wanted is not to split by record count but by NEW, which is not working with the below code. 

LIBNAME one 'H:\';
DATA TEST;
DO new = 1 TO 10000000; OUTPUT;
OUTPUT;END;
proc sort; by new;
RUN;

So for every iteration , two output statements making it rather 20 million records. 11,22,33 up to 20e6 records. And OP wants to split each by group into a dataset i.e 10 million datasets.

 

Of course for you and me et al regulars let alone splitting logic code is all over the internet , it's not the coding part that is concerning but the objective doesn't make sense at all. Well perhaps learning practice? Even so why anybody would practice with such large splits if i understand correctly

SASKiwi
PROC Star

Why do you think you need to use hash processing? There are lots of easier ways to split data.

 

It would help if you could explain your requirements in words rather than us trying to confirm what you are trying to do from your code. 

hashman
Ammonite | Level 13

@SASKiwi;

 

I'd rather question the very idea of the need to split a data set instead of creating a BY variable (for example, using MOD).

But as long as the need is justified, the hash approach looks fairly easy to me ... as long as the largest split group fits in memory comfortably.

 

Kind regards

Paul D. 

SASKiwi
PROC Star

@hashman  - agreed, why do you need to split your data in the first place? Yes, hash techniques are really useful, but I'd argue there are more common techniques that would work equally well here and be easier for others to support when you are in a team of SAS developers.

hashman
Ammonite | Level 13

@SASKiwi :

All right, I'm game! You've seen my hash approach for the OP's task. To simplify, HAVE has a variable NEW ranging in order from 1 to some integer N:

%let N = 10011 ;    
                    
data have ;         
  do new = 1 to &n ;
    output ;        
  end ;             
run ;               

HAVE needs to be split into an a priori unknown number of data sets with sequentially numbered names, N_INCR records each. If N isn't divisible by N_INCR, the last split file will contain fewer than N_INCR output records. Here's the hash approach:

%let N_incr = 100 ;                                          
                                                             
data _null_ ;                                                
  if _n_ = 1 then do ;                                       
    dcl hash h (ordered:"A") ;                               
    h.definekey ("new") ;                                    
    h.definedone () ;                                        
  end ;                                                      
  do until (mod (new, &n_incr) = 0 or lr) ;                  
    set have end = lr ;                                      
    h.add() ;                                                
  end ;                                                      
  h.output (dataset: catx ("_", "work.out", put (_n_,z3.))) ;
  h.clear() ;                                                
run ;                                                        

Could you please show one of  "more common techniques that would work equally well here and be easier for others to support when you are in a team of SAS developers"? Am asking since though I do know a few, none can touch the simplicity of the code above in terms of logic, ease of coding, self-automation, and - as a corollary - ease of support.   

 

On a different note, the hash object has been around for 16 years. The part of its overall functionality used in the DATA step above is marginal at best. The rest relies on two statements (DATA and SET), one function (MOD), one DO loop, and yes, pretty firm understanding of what _N_ really is and how the DATA step really works. Being a "SAS developer" should more than cover this territory, shouldn't it?

 

Kind regards

Paul D. 

hashman
Ammonite | Level 13

@prad001 :

Try this, for example:

data have ;                                          
  do new = 1 to 1234 ;                               
    retain N1 1 C1 "C1" N2 2 C2 "C22" ;              
    output ;                                         
  end ;                                              
run ;                                                
                                                     
%let new_incr = 100 ;                                
                                                     
data _null_ ;                                        
  if _n_ = 1 then do ;                               
    dcl hash h (dataset:"have(obs=0)", ordered:"a") ;
    h.definekey ("new") ;                            
    h.definedata (all:"y") ;                         
    h.definedone () ;                                
  end ;                                              
  do until (mod (new, &new_incr) = 0 or lr) ;        
    set have end = lr ;                              
    h.add() ;                                        
  end ;                                              
  h.output (dataset: catx ("_", "work.out", _n_)) ;  
  h.clear () ;                                       
run ;                                                

Kind regards

Paul D.

prad001
Obsidian | Level 7

Thank you for helping Paul (Hashman)..

 

Your code works fine when the "NEW" variable is unique., what if it has duplicates ??

 

What I am trying is..

 

data have ;
do new = 1 to &n ;
output ; Output; ************************************************* 2 OUTPUTS here to create duplicates;
end ;
run ;

 

All I want is split the dataset into 100 by NEW. If the first 100 contains duplicates then it will have more than 100.

In the above example...

1st dataset will contain..

NEW

1

1

2

2

.

.

.

.

100

100

 

And the second dataset should contain...

 

NEW

101

101

102

102

.

.

.

.

200

200

 

 

Thanks alot in advance..

Pradeep

hashman
Ammonite | Level 13

@prad001 :

 

Just a few subtle alterations:

  1. Create a view into HAVE with a dummy key unique for each range. Thus, below, _KNEW=1 for  1-100, 2 for 101-200, and so on. This simplifies the splitting task, as it reduces it to simple BY processing.
  2. Code the argument tag MULTIDATA:"Y" to allow hash items with identical key-values.
  3. Code for dropping _KNEW from the output data sets.

In sum:

data have ;                                                         
  do new = 1 to 1234 ;                                              
    retain N1 1 C1 "C1" N2 2 C2 "C22" ;                             
    do _n_ = 1 to ceil (ranuni(1) * 4) ; /*create dupes*/                            
      output ;                                                      
    end ;                                                           
  end ;                                                             
run ;                                                               
                                                                    
%let new_incr = 100 ;                                               
                                                                    
data knew / view = knew ;                                           
  set have ;                                                        
  _knew = ceil (new / &new_incr) ;                                  
run ;                                                               
                                                                    
data _null_ ;                                                       
  if _n_ = 1 then do ;                                              
    dcl hash h (dataset:"knew(obs=0)", multidata:"Y", ordered:"A") ;
    h.definekey ("_knew") ;                                         
    h.definedata (all:"Y") ;                                        
    h.definedone () ;                                               
  end ;                                                             
  do until (last._knew) ;                                           
    set knew ;                                                      
    by _knew ;                                                      
    h.add() ;                                                       
  end ;                                                             
  h.output (dataset: cats ("work.out_", _n_, "(drop=_knew)")) ;     
  h.clear () ;                                                      
run ;                                                               

Kind regards

Paul D.

 

prad001
Obsidian | Level 7

Thank you Paul (Hashman). Thanks alot for a spontaneous and perfect reply. 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1927 views
  • 0 likes
  • 5 in conversation