Help using Base SAS procedures

Reg:Splitting a Huge dataset to smaller datasets

Regular Contributor
Posts: 229

Reg:Splitting a Huge dataset to smaller datasets

Hi, I have a final dataset of about 7 crore observations (71,245,666) and I want to split it into datasets of 50 lac observations each. How can I do that?

PROC Star
Posts: 7,467

Reg:Splitting a Huge dataset to smaller datasets

You asked a similar question last month.  Did the link provided ( www2.sas.com/proceedings/sugi28/075-28.pdf ) solve your problem then?  If so, and if a lac equals 100,000, couldn't you combine that approach with the MOD function?

E.g., I think a test like mod(_n_,5000000) eq 0 would provide what you are looking for if combined with the code in that paper for splitting a file.
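A minimal sketch of that idea in a single DATA step (the dataset name work.big, the chunk size, and the output names piece1-piece3 are all illustrative placeholders, not the OP's actual names):

```sas
/* Route each observation to an output dataset based on a chunk
   counter that advances every 5,000,000 observations. */
data piece1 piece2 piece3;            /* list as many as you need */
   set work.big;                      /* placeholder input name   */
   if mod(_n_ - 1, 5000000) = 0 then chunk + 1;  /* new chunk boundary */
   select (chunk);
      when (1) output piece1;
      when (2) output piece2;
      otherwise output piece3;
   end;
   drop chunk;
run;
```

For 71,245,666 observations in 5,000,000-row pieces you would list 15 output datasets; the macro posted below generates that list automatically.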

Respected Advisor
Posts: 3,799

Reg:Splitting a Huge dataset to smaller datasets

This program assumes that NOBS will return the correct number of observations.

12686  filename FT99F001 temp;

12687  data _null_;

12688     file FT99F001;

12689     if 0 then set sashelp.shoes(drop=_all_) nobs=nobs;

12690     putlog 'NOTE: ' nobs=;

12691     size = 95;

12692     do firstobs = 1 to nobs+size by size while(firstobs le nobs);

12693        put 'data s' firstobs +(-1) ';';

12694        obs = min(nobs,firstobs+size-1);

12695        put +3 'set sashelp.shoes(' firstobs= obs= +(-1) ');';

12696        put +3 'run;';

12697        end;

12698     stop;

12699     run;

NOTE: The file FT99F001 is:

      (system-specific pathname),

      (system-specific file attributes)

NOTE: nobs=395

NOTE: 15 records were written to the file (system-specific pathname).

      The minimum record length was 7.

      The maximum record length was 43.

NOTE: DATA statement used (Total process time):

      real time           0.03 seconds

      cpu time            0.03 seconds

12700  %inc FT99F001 / source2;

NOTE: %INCLUDE (level 1) file FT99F001 is (system-specific pathname).

12701 +data s1;

12702 +   set sashelp.shoes(firstobs=1 obs=95);

12703 +   run;

NOTE: The data set WORK.S1 has 95 observations and 7 variables.

NOTE: DATA statement used (Total process time):

      real time           0.01 seconds

      cpu time            0.01 seconds

12704 +data s96;

12705 +   set sashelp.shoes(firstobs=96 obs=190);

12706 +   run;

NOTE: The data set WORK.S96 has 95 observations and 7 variables.

NOTE: DATA statement used (Total process time):

      real time           0.01 seconds

      cpu time            0.01 seconds

12707 +data s191;

12708 +   set sashelp.shoes(firstobs=191 obs=285);

12709 +   run;

NOTE: The data set WORK.S191 has 95 observations and 7 variables.

NOTE: DATA statement used (Total process time):

      real time           0.00 seconds

      cpu time            0.00 seconds

12710 +data s286;

12711 +   set sashelp.shoes(firstobs=286 obs=380);

12712 +   run;

NOTE: The data set WORK.S286 has 95 observations and 7 variables.

NOTE: DATA statement used (Total process time):

      real time           0.01 seconds

      cpu time            0.01 seconds

12713 +data s381;

12714 +   set sashelp.shoes(firstobs=381 obs=395);

12715 +   run;

NOTE: The data set WORK.S381 has 15 observations and 7 variables.

NOTE: DATA statement used (Total process time):

      real time           0.01 seconds

      cpu time            0.01 seconds

NOTE: %INCLUDE (level 1) ending.

Occasional Contributor
Posts: 10

Reg:Splitting a Huge dataset to smaller datasets

Hi R_Win

Try this macro. The parameter num is the maximum number of observations in each split file.

%macro splitfile(num);

    data _null_;

        if 0 then

            set largefile nobs=count;

        call symputx('numobs', count); /* symputx trims the value */

    run;

    %let m=%sysevalf(&numobs/&num,ceil);

    data %do j=1 %to &m;

        splittedfile_&j

        %end;

        ;

        set largefile;

        %do i=1 %to &m;

            if %eval(&num*(&i-1)) <_n_ <= %eval(&num*&i) then

                output splittedfile_&i;

        %end;

    run;

%mend splitfile;

Super User
Posts: 10,020

Reg:Splitting a Huge dataset to smaller datasets

Another way is a hash table, if you would like.

Ksharp

PROC Star
Posts: 7,467

Reg:Splitting a Huge dataset to smaller datasets

I'm not sure if this is what Ksharp had in mind, but it would work.  The following example splits sashelp.class into subfiles with n le 5:

data temp (drop=g index = (group));

  set sashelp.class;

  g+(mod(_n_,5)=1);

  group=catt('g',g);

run;

data _null_ ;

  dcl hash hh   (             ) ;

  hh.definekey  ('k'          ) ;

  hh.definedata ('group','sex', 'name', 'age', 'height', 'weight') ;

  hh.definedone () ;

  do k = 1 by 1 until ( last.group ) ;

    set temp;

    by group ;

    hh.add () ;

  end ;

  hh.output (dataset:group) ;

run ;

Super User
Posts: 10,020

Re: Reg:Splitting a Huge dataset to smaller datasets

Hi Art.T, your code is very good; I did not even think of it that way.

This is another version of Art.T's code.  :)

data _null_ ;
  dcl hash hh   (             ) ;
  hh.definekey  ('k'          ) ;
  hh.definedata ('sex', 'name', 'age', 'height', 'weight') ;
  hh.definedone () ;

  do until(mod(k,5)=0 or last);
    k+1;
    set sashelp.class end=last ;
    hh.add();
  end;
  hh.output(dataset: 'a'||strip(k));
run;

Ksharp

PROC Star
Posts: 7,467

Re: Reg:Splitting a Huge dataset to smaller datasets

Ksharp,

I definitely like your proposed code better than what I had suggested, but I would add a group counter to keep the file names a bit more manageable.  While it adds a little more overhead, doing so had almost no negative effect on performance:

data _null_ ;

  dcl hash hh   (             ) ;

  hh.definekey  ('k'          ) ;

  hh.definedata ('sex', 'name', 'age', 'height', 'weight') ;

  hh.definedone () ;

 

  do until(mod(k,5)=0 or last);

   k+1;

   set sashelp.class end=last ;

   hh.add();

  end;

  gp+1;

  hh.output(dataset: 'a'||strip(gp));

run;

Super User
Posts: 10,020

Re: Reg:Splitting a Huge dataset to smaller datasets

Art.T

Yes, the code looks better if you add a gp variable to name each output dataset.

Ksharp

:)

Valued Guide
Posts: 2,177

Re: Reg:Splitting a Huge dataset to smaller datasets

Does this method using hash tables not require more memory than the size of that huge data set?

PROC Star
Posts: 7,467

Re: Reg:Splitting a Huge dataset to smaller datasets

Peter,

I am definitely NOT a hash expert (where is Paul Dorfman when you really need him?), but I tend to agree that a hash method may not be appropriate for the OP.  He/She was originally offered suggestions that WOULD work, but never indicated any response as being correct.  Regardless, I'm glad that KSharp suggested the method, as I never realized that one could use the method in this manner without first creating an index.

Super User
Posts: 10,020

Re: Reg:Splitting a Huge dataset to smaller datasets

Yes. The hash table only uses a constant amount of memory (it holds at most five observations at a time).

Each iteration of the data step discards the previous hash table and builds a new one, which then accepts the

next five observations. So you scan the data set only once and can split it all.

The code I wrote is also based on Art.T's code, so I thank Art.T, who made me think further about the hash table.

Ksharp

Trusted Advisor
Posts: 1,301

Re: Reg:Splitting a Huge dataset to smaller datasets

This paper provides some insight into the splitting-with-a-hash-object method, as well as another example from the pre-V9 era.  A nice read.

http://www2.sas.com/proceedings/sugi30/236-30.pdf

And to Art's point it is authored in part by Paul Dorfman.

Discussion stats
  • 12 replies
  • 678 views
  • 0 likes
  • 7 in conversation