Re: split file into multiple files

helloSAS · Posted 04-25-2011 03:59 PM

Hello all,

I'm trying to split a 50,000 record file into 5 different files with 10,000 records in each file.

Can anyone tell me how to do this?

Thanks

SPR · Posted 04-25-2011 04:22 PM

Hello HelloSAS,

This is a solution:
[pre]
data i;
do i=1 to 50000;
output;
end;
run;
data r1 r2 r3 r4 r5;
set i;
if _n_ LE 10000 then output r1;
else if 10000 LT _n_ LE 20000 then output r2;
else if 20000 LT _n_ LE 30000 then output r3;
else if 30000 LT _n_ LE 40000 then output r4;
else if 40000 LT _n_ then output r5;
run;
[/pre]
Sincerely,
SPR

helloSAS · Posted 04-25-2011 05:10 PM

Thank you for the response. I'm sorry but I wasnt very clear in my question above. Actually my end of file is not always 50,000 records. It can vary time to time.
My real need is, I have a huge file about 30 million records that I want to break into pieces probably in 10 pieces. I do not want to hard code record numbers as this number might increase from 30 millioin to 32 milion in very near future and so on. I would probably like to SAS automatically calculate its break points and break the file into 10 peices.

I was trying to do something as simple as this, but this is not what i exactly want.

data r1 r2;
set readxtra;
if _n_ LE 50000 then output r1;
else if _n_ GT 50000 then output r2;
run;

DBailey · Posted 04-25-2011 06:26 PM

without regard to run times...or syntax for that matter...

proc sql;
select (1+count(*)/10) into :recs into from bigfile;
quit;

data
out1 out2 out3 out4 out5 out6 out7 out8 out9 out10;
set bigfile;
if _N_ < &recs then output out1;
else if _N_ < 2*&recs then output out2;
else if _N_ < 3 * &recs then output out3;
.
.
.
else output out10;
run;

helloSAS · Posted 04-25-2011 06:29 PM

Thank you. Exactly what I'm looking for.

Peter_C · Posted 04-26-2011 12:14 PM

Like this technique!
but seek "less"

> without regard to run times...or syntax for that
> matter...
>
> proc sql;
> select (1+count(*)/10) into :recs into from bigfile;
> quit;
>
> data
> out1 out2 out3 out4 out5 out6 out7 out8 out9
> out10;
> set bigfile;
> if _N_ < &recs then output out1;
> else if _N_ < 2*&recs then output out2;
> else if _N_ < 3 * &recs then output out3;
> .
> .
> .
> else output out10;
> run;
[pre]
* peter approach ;
%macro genP( outs=10, prefix= peter_D, from= bigFile );
%local i ;
data %* generate the list of output data set names ;
%do i= 1 %to &outs ; &prefix.&i %end ;
;
%* derive the number of obs in each block (before last);
If _n_ = 1 then blocks + ceil( nobs/&outs ) ;
drop blocks ;

set &from nobs= nobs ;
%* now generate the lines that output to each data set;
%do i = 1 %to &outs ;
if _n_ LE blocks*&i then output &prefix.&i ; else
%end ;
put _all_/ 'E' "RROR: what's left!?" ; %* executed when &outs=0 ;
%put _user_ ;
run ;
%mend genP ;

option mprint nosymbolgen noMlogic ;
%genP( outs=3, prefix= class, from= sashelp.class ) [/pre] This seems to be fairly flexible so validation can be made on small sets before risking a test on the large data set.
My log from the above test shows the following MPRINT and notes[pre]MPRINT(GENP): data class1 class2 class3 ;
MPRINT(GENP): If _n_ = 1 then blocks + ceil( nobs/3 ) ;
MPRINT(GENP): drop blocks ;
MPRINT(GENP): set sashelp.class nobs= nobs ;
MPRINT(GENP): if _n_ LE blocks*1 then output class1 ;
MPRINT(GENP): else if _n_ LE blocks*2 then output class2 ;
MPRINT(GENP): else if _n_ LE blocks*3 then output class3 ;
MPRINT(GENP): else put _all_/ 'E' "RROR: what's left!?" ;
GENP OUTS 3
GENP I 4
GENP PREFIX class
GENP FROM sashelp.class
MPRINT(GENP): run ;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.CLASS1 has 7 observations and 5 variables.
NOTE: The data set WORK.CLASS2 has 7 observations and 5 variables.
NOTE: The data set WORK.CLASS3 has 5 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
user cpu time 0.01 seconds
system cpu time 0.00 seconds[/pre]
think that achieves the result with re-useable process.
More interesting than dividing up by blocks of _N_, might be the (1+mod(_n_, &outs)) and random distribution among the output data sets,
but that did not seem to be a requirement.. (this time)

peterC

split file into multiple files