Hi i am having a final dataset having observations of 7 crores(71245666) i want to split them in to 50 lac obs each how can i do...
You asked a similar question last month. Did the link provided ( www2.sas.com/proceedings/sugi28/075-28.pdf ) solve your problem then? If so, and if a lac is equal to about 100,000, then couldn't you use it combined with the mod function?
e.g.,if mod(_n_,5000000) eq 0 I think would provide what you are looking for if combined with the code in that paper for splitting a file.
This program assumes that NOBS will return the correct number of observations.
12686 filename FT99F001 temp;
12687 data _null_;
12688 file FT99F001;
12689 if 0 then set sashelp.shoes(drop=_all_) nobs=nobs;
12690 putlog 'NOTE: ' nobs=;
12691 size = 95;
12692 do firstobs = 1 to nobs+size by size while(firstobs le nobs);
12693 put 'data s' firstobs +(-1) ';';
12694 obs = min(nobs,firstobs+size-1);
12695 put +3 'set sashelp.shoes(' firstobs= obs= +(-1) ');';
12696 put +3 'run;';
12697 end;
12698 stop;
12699 run;
NOTE: The file FT99F001 is:
(system-specific pathname),
(system-specific file attributes)
NOTE: nobs=395
NOTE: 15 records were written to the file (system-specific pathname).
The minimum record length was 7.
The maximum record length was 43.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
12700 %inc FT99F001 / source2;
NOTE: %INCLUDE (level 1) file FT99F001 is (system-specific pathname).
12701 +data s1;
12702 + set sashelp.shoes(firstobs=1 obs=95);
12703 + run;
NOTE: The data set WORK.S1 has 95 observations and 7 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
12704 +data s96;
12705 + set sashelp.shoes(firstobs=96 obs=190);
12706 + run;
NOTE: The data set WORK.S96 has 95 observations and 7 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
12707 +data s191;
12708 + set sashelp.shoes(firstobs=191 obs=285);
12709 + run;
NOTE: The data set WORK.S191 has 95 observations and 7 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
12710 +data s286;
12711 + set sashelp.shoes(firstobs=286 obs=380);
12712 + run;
NOTE: The data set WORK.S286 has 95 observations and 7 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
12713 +data s381;
12714 + set sashelp.shoes(firstobs=381 obs=395);
12715 + run;
NOTE: The data set WORK.S381 has 15 observations and 7 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
NOTE: %INCLUDE (level 1) ending.
Hi R_Win
try this macro. The variable num is the max observation in the splitted file.
%macro splitfile(num);
data _null_;
if 0 then
set largefile nobs=count;
call symput('numobs',put(count,8.));
run;
%let m=%sysevalf(&numobs/&num,ceil);
data %do j=1 %to &m;
splittedfile_&j
%end;
;
set largefile;
%do i=1 %to &m;
if %eval(&num*(&i-1)) <_n_ <= %eval(&num*&i) then
output splittedfile_&i;
%end;
run;
%mend splitfile;
Another way is Hash Table,If you would like.
Ksharp
I' not sure if this is what Ksharp had in mind, but it would work. The following example splits sashelp.class into sub files with n le 5:
data temp (drop=g index = (group));
set sashelp.class;
g+(mod(_n_,5)=1);
group=catt('g',g);
run;
data _null_ ;
dcl hash hh ( ) ;
hh.definekey ('k' ) ;
hh.definedata ('group','sex', 'name', 'age', 'height', 'weight') ;
hh.definedone () ;
do k = 1 by 1 until ( last.group ) ;
set temp;
by group ;
hh.add () ;
end ;
hh.output (dataset:group) ;
run ;
Hi. Art.T your code is very good ,I did not even think it so.
This is another version of Art.T's code.
data _null_ ; dcl hash hh ( ) ; hh.definekey ('k' ) ; hh.definedata ('sex', 'name', 'age', 'height', 'weight') ; hh.definedone () ; do until(mod(k,5)=0 or last); k+1; set sashelp.class end=last ; hh.add(); end; hh.output(dataset: 'a'||strip(k)); run;
Ksharp
Ksharp,
I definitely like your proposed code better than what I had suggested, but I would add a group counter to keep the file names a bit more manageable. While it adds a little more overhead, doing so had almost no negative effect on performance:
data _null_ ;
dcl hash hh ( ) ;
hh.definekey ('k' ) ;
hh.definedata ('sex', 'name', 'age', 'height', 'weight') ;
hh.definedone () ;
do until(mod(k,5)=0 or last);
k+1;
set sashelp.class end=last ;
hh.add();
end;
gp+1;
hh.output(dataset: 'a'||strip(gp));
run;
Art.T
Yes. This makes code looks better if add a gp variable to flag a output dataset.
Ksharp
does this method using hash tables not require more memory than the size of that huge data set ?
Peter,
I am definitely NOT a hash expert (where is Paul Dorfman when you really need him?), but I tend to agree that a hash method may not be appropriate for the OP. He/She was originally offered suggestions that WOULD work, but never indicated any response as being correct. Regardless, I'm glad that KSharp suggested the method, as I never realized that one could use the method in this manner without first creating an index.
Yes. The Hash Table only has constant size of memory(i.e. only contains five observations).
Each Iteration of data step will clear the last Hash Table and re-build a new Hash Table, then continue to accept
next five observations. So scan table only once ,you can split it all.
This code I wrote is also based on Art.T's code. So I will thank Art.T ,who make me to think it further about Hash Table.
Ksharp
This paper provides some insight into the splitting utilizing hash object method as well another example from the pre-v9 era. A nice read.
http://www2.sas.com/proceedings/sugi30/236-30.pdf
And to Art's point it is authored in part by Paul Dorfman.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.