Re: Reg:Splitting a Huge dataset to smaller datasets

R_Win · Posted 08-10-2011 08:31 AM

Hi i am having a final dataset having observations of 7 crores(71245666) i want to split them in to 50 lac obs each how can i do...

art297 · Posted 08-10-2011 08:57 AM

You asked a similar question last month. Did the link provided ( www2.sas.com/proceedings/sugi28/075-28.pdf ) solve your problem then? If so, and if a lac is equal to about 100,000, then couldn't you use it combined with the mod function?

e.g.,if mod(_n_,5000000) eq 0 I think would provide what you are looking for if combined with the code in that paper for splitting a file.

data_null__ · Posted 08-10-2011 10:09 AM

This program assumes that NOBS will return the correct number of observations.

12686 filename FT99F001 temp;

12687 data _null_;

12688 file FT99F001;

12689 if 0 then set sashelp.shoes(drop=_all_) nobs=nobs;

12690 putlog 'NOTE: ' nobs=;

12691 size = 95;

12692 do firstobs = 1 to nobs+size by size while(firstobs le nobs);

12693 put 'data s' firstobs +(-1) ';';

12694 obs = min(nobs,firstobs+size-1);

12695 put +3 'set sashelp.shoes(' firstobs= obs= +(-1) ');';

12696 put +3 'run;';

12697 end;

12698 stop;

12699 run;

NOTE: The file FT99F001 is:

(system-specific pathname),

(system-specific file attributes)

NOTE: nobs=395

NOTE: 15 records were written to the file (system-specific pathname).

The minimum record length was 7.

The maximum record length was 43.

NOTE: DATA statement used (Total process time):

real time 0.03 seconds

cpu time 0.03 seconds

12700 %inc FT99F001 / source2;

NOTE: %INCLUDE (level 1) file FT99F001 is (system-specific pathname).

12701 +data s1;

12702 + set sashelp.shoes(firstobs=1 obs=95);

12703 + run;

NOTE: The data set WORK.S1 has 95 observations and 7 variables.

NOTE: DATA statement used (Total process time):

real time 0.01 seconds

cpu time 0.01 seconds

12704 +data s96;

12705 + set sashelp.shoes(firstobs=96 obs=190);

12706 + run;

NOTE: The data set WORK.S96 has 95 observations and 7 variables.

NOTE: DATA statement used (Total process time):

real time 0.01 seconds

cpu time 0.01 seconds

12707 +data s191;

12708 + set sashelp.shoes(firstobs=191 obs=285);

12709 + run;

NOTE: The data set WORK.S191 has 95 observations and 7 variables.

NOTE: DATA statement used (Total process time):

real time 0.00 seconds

cpu time 0.00 seconds

12710 +data s286;

12711 + set sashelp.shoes(firstobs=286 obs=380);

12712 + run;

NOTE: The data set WORK.S286 has 95 observations and 7 variables.

NOTE: DATA statement used (Total process time):

real time 0.01 seconds

cpu time 0.01 seconds

12713 +data s381;

12714 + set sashelp.shoes(firstobs=381 obs=395);

12715 + run;

NOTE: The data set WORK.S381 has 15 observations and 7 variables.

NOTE: DATA statement used (Total process time):

real time 0.01 seconds

cpu time 0.01 seconds

NOTE: %INCLUDE (level 1) ending.

Domenico · Posted 08-11-2011 02:38 AM

Hi R_Win

try this macro. The variable num is the max observation in the splitted file.

%macro splitfile(num);

data _null_;

if 0 then

set largefile nobs=count;

call symput('numobs',put(count,8.));

run;

%let m=%sysevalf(&numobs/&num,ceil);

data %do j=1 %to &m;

splittedfile_&j

%end;

;

set largefile;

%do i=1 %to &m;

if %eval(&num*(&i-1)) <_n_ <= %eval(&num*&i) then

output splittedfile_&i;

%end;

run;

%mend splitfile;

Ksharp · Posted 08-11-2011 04:33 AM

Another way is Hash Table,If you would like.

Ksharp

art297 · Posted 08-13-2011 06:38 PM

I' not sure if this is what Ksharp had in mind, but it would work. The following example splits sashelp.class into sub files with n le 5:

data temp (drop=g index = (group));

set sashelp.class;

g+(mod(_n_,5)=1);

group=catt('g',g);

run;

data _null_ ;

dcl hash hh ( ) ;

hh.definekey ('k' ) ;

hh.definedata ('group','sex', 'name', 'age', 'height', 'weight') ;

hh.definedone () ;

do k = 1 by 1 until ( last.group ) ;

set temp;

by group ;

hh.add () ;

end ;

hh.output (dataset:group) ;

run ;

Ksharp · Posted 08-14-2011 10:08 PM

Hi. Art.T your code is very good ,I did not even think it so.

This is another version of Art.T's code.

data _null_ ;
  dcl hash hh   (             ) ;
  hh.definekey  ('k'          ) ;
  hh.definedata ('sex', 'name', 'age', 'height', 'weight') ;
  hh.definedone () ;

  do until(mod(k,5)=0 or last);
   k+1;
set sashelp.class end=last ;
            hh.add();
  end;
  hh.output(dataset: 'a'||strip(k));
run;

Ksharp

art297 · Posted 08-15-2011 11:36 AM

Ksharp,

I definitely like your proposed code better than what I had suggested, but I would add a group counter to keep the file names a bit more manageable. While it adds a little more overhead, doing so had almost no negative effect on performance:

data _null_ ;

dcl hash hh ( ) ;

hh.definekey ('k' ) ;

hh.definedata ('sex', 'name', 'age', 'height', 'weight') ;

hh.definedone () ;

do until(mod(k,5)=0 or last);

k+1;

set sashelp.class end=last ;

hh.add();

end;

gp+1;

hh.output(dataset: 'a'||strip(gp));

run;

Ksharp · Posted 08-15-2011 11:54 PM

Art.T

Yes. This makes code looks better if add a gp variable to flag a output dataset.

Ksharp

Peter_C · Posted 08-16-2011 02:31 AM

does this method using hash tables not require more memory than the size of that huge data set ?

art297 · Posted 08-16-2011 09:03 AM

Peter,

I am definitely NOT a hash expert (where is Paul Dorfman when you really need him?), but I tend to agree that a hash method may not be appropriate for the OP. He/She was originally offered suggestions that WOULD work, but never indicated any response as being correct. Regardless, I'm glad that KSharp suggested the method, as I never realized that one could use the method in this manner without first creating an index.

Ksharp · Posted 08-17-2011 04:10 AM

Yes. The Hash Table only has constant size of memory(i.e. only contains five observations).

Each Iteration of data step will clear the last Hash Table and re-build a new Hash Table, then continue to accept

next five observations. So scan table only once ,you can split it all.

This code I wrote is also based on Art.T's code. So I will thank Art.T ,who make me to think it further about Hash Table.

Ksharp

FriedEgg · Posted 08-17-2011 02:48 PM

This paper provides some insight into the splitting utilizing hash object method as well another example from the pre-v9 era. A nice read.

http://www2.sas.com/proceedings/sugi30/236-30.pdf

And to Art's point it is authored in part by Paul Dorfman.

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!