Hi All,
Please help me to split the dataset efficiently (preferably hash). Since the data set TEST has duplicates, getting error while splitting using hash.. All i wanted is not to split by record count but by NEW, which is not working with the below code.
LIBNAME one 'H:\';
DATA TEST;
DO new = 1 TO 10000000; OUTPUT;
OUTPUT;END;
proc sort; by new;
RUN;
data _null_;
if 0 then
set test;
declare hash h_out();
h_out.definekey('new');
h_out.definedata('new');
h_out.definedone();
do filenum=1 by 100 until(eof);
do new=1 to 100 until(eof);
set test end=eof;
h_out.add();
by new;
end;
h_out.output(dataset:cats('one.out_',filenum));
h_out.clear();
end;
stop;
run;
%macro srt;
%DO i=1 %to 10000000 %BY 100;
proc sort data=one.out_&i;
by new;
run;
%end;
%MEND;
%srt;
@prad001 :
Just a few subtle alterations:
In sum:
data have ; do new = 1 to 1234 ; retain N1 1 C1 "C1" N2 2 C2 "C22" ; do _n_ = 1 to ceil (ranuni(1) * 4) ; /*create dupes*/ output ; end ; end ; run ; %let new_incr = 100 ; data knew / view = knew ; set have ; _knew = ceil (new / &new_incr) ; run ; data _null_ ; if _n_ = 1 then do ; dcl hash h (dataset:"knew(obs=0)", multidata:"Y", ordered:"A") ; h.definekey ("_knew") ; h.definedata (all:"Y") ; h.definedone () ; end ; do until (last._knew) ; set knew ; by _knew ; h.add() ; end ; h.output (dataset: cats ("work.out_", _n_, "(drop=_knew)")) ; h.clear () ; run ;
Kind regards
Paul D.
Have you tried using the option multidata to allow for duplicate keys?
See here.
Hello @prad001 Are you asking how to split 10 million records by NEW which is sets of 2. So 10million/2 =5million datasets your requirement?
Sir, I didn't particularly look into the code beyond
All i wanted is not to split by record count but by NEW, which is not working with the below code.
LIBNAME one 'H:\';
DATA TEST;
DO new = 1 TO 10000000; OUTPUT;
OUTPUT;END;
proc sort; by new;
RUN;
So for every iteration , two output statements making it rather 20 million records. 11,22,33 up to 20e6 records. And OP wants to split each by group into a dataset i.e 10 million datasets.
Of course for you and me et al regulars let alone splitting logic code is all over the internet , it's not the coding part that is concerning but the objective doesn't make sense at all. Well perhaps learning practice? Even so why anybody would practice with such large splits if i understand correctly
Why do you think you need to use hash processing? There are lots of easier ways to split data.
It would help if you could explain your requirements in words rather than us trying to confirm what you are trying to do from your code.
I'd rather question the very idea of the need to split a data set instead of creating a BY variable (for example, using MOD).
But as long as the need is justified, the hash approach looks fairly easy to me ... as long as the largest split group fits in memory comfortably.
Kind regards
Paul D.
@hashman - agreed, why do you need to split your data in the first place? Yes, hash techniques are really useful, but I'd argue there are more common techniques that would work equally well here and be easier for others to support when you are in a team of SAS developers.
@SASKiwi :
All right, I'm game! You've seen my hash approach for the OP's task. To simplify, HAVE has a variable NEW ranging in order from 1 to some integer N:
%let N = 10011 ;
data have ;
do new = 1 to &n ;
output ;
end ;
run ;
HAVE needs to be split into an a priori unknown number of data sets with sequentially numbered names, N_INCR records each. If N isn't divisible by N_INCR, the last split file will contain fewer than N_INCR output records. Here's the hash approach:
%let N_incr = 100 ;
data _null_ ;
if _n_ = 1 then do ;
dcl hash h (ordered:"A") ;
h.definekey ("new") ;
h.definedone () ;
end ;
do until (mod (new, &n_incr) = 0 or lr) ;
set have end = lr ;
h.add() ;
end ;
h.output (dataset: catx ("_", "work.out", put (_n_,z3.))) ;
h.clear() ;
run ;
Could you please show one of "more common techniques that would work equally well here and be easier for others to support when you are in a team of SAS developers"? Am asking since though I do know a few, none can touch the simplicity of the code above in terms of logic, ease of coding, self-automation, and - as a corollary - ease of support.
On a different note, the hash object has been around for 16 years. The part of its overall functionality used in the DATA step above is marginal at best. The rest relies on two statements (DATA and SET), one function (MOD), one DO loop, and yes, pretty firm understanding of what _N_ really is and how the DATA step really works. Being a "SAS developer" should more than cover this territory, shouldn't it?
Kind regards
Paul D.
@prad001 :
Try this, for example:
data have ;
do new = 1 to 1234 ;
retain N1 1 C1 "C1" N2 2 C2 "C22" ;
output ;
end ;
run ;
%let new_incr = 100 ;
data _null_ ;
if _n_ = 1 then do ;
dcl hash h (dataset:"have(obs=0)", ordered:"a") ;
h.definekey ("new") ;
h.definedata (all:"y") ;
h.definedone () ;
end ;
do until (mod (new, &new_incr) = 0 or lr) ;
set have end = lr ;
h.add() ;
end ;
h.output (dataset: catx ("_", "work.out", _n_)) ;
h.clear () ;
run ;
Kind regards
Paul D.
Thank you for helping Paul (Hashman)..
Your code works fine when the "NEW" variable is unique., what if it has duplicates ??
What I am trying is..
data have ;
do new = 1 to &n ;
output ; Output; ************************************************* 2 OUTPUTS here to create duplicates;
end ;
run ;
All I want is split the dataset into 100 by NEW. If the first 100 contains duplicates then it will have more than 100.
In the above example...
1st dataset will contain..
NEW
1
1
2
2
.
.
.
.
100
100
And the second dataset should contain...
NEW
101
101
102
102
.
.
.
.
200
200
Thanks alot in advance..
Pradeep
@prad001 :
Just a few subtle alterations:
In sum:
data have ; do new = 1 to 1234 ; retain N1 1 C1 "C1" N2 2 C2 "C22" ; do _n_ = 1 to ceil (ranuni(1) * 4) ; /*create dupes*/ output ; end ; end ; run ; %let new_incr = 100 ; data knew / view = knew ; set have ; _knew = ceil (new / &new_incr) ; run ; data _null_ ; if _n_ = 1 then do ; dcl hash h (dataset:"knew(obs=0)", multidata:"Y", ordered:"A") ; h.definekey ("_knew") ; h.definedata (all:"Y") ; h.definedone () ; end ; do until (last._knew) ; set knew ; by _knew ; h.add() ; end ; h.output (dataset: cats ("work.out_", _n_, "(drop=_knew)")) ; h.clear () ; run ;
Kind regards
Paul D.
Thank you Paul (Hashman). Thanks alot for a spontaneous and perfect reply.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.