BookmarkSubscribeRSS Feed
R_Win
Calcite | Level 5

Hi i am having a final dataset having observations of 7 crores(71245666) i want to split them in to 50 lac obs each how can i do...

12 REPLIES 12
art297
Opal | Level 21

You asked a similar question last month.  Did the link provided ( www2.sas.com/proceedings/sugi28/075-28.pdf ) solve your problem then?  If so, and if a lac is equal to about 100,000, then couldn't you use it combined with the mod function?

e.g.,if mod(_n_,5000000) eq 0 I think would provide what you are looking for if combined with the code in that paper for splitting a file.

data_null__
Jade | Level 19

This program assumes that NOBS will return the correct number of observations.

12686  filename FT99F001 temp;

12687  data _null_;

12688     file FT99F001;

12689     if 0 then set sashelp.shoes(drop=_all_) nobs=nobs;

12690     putlog 'NOTE: ' nobs=;

12691     size = 95;

12692     do firstobs = 1 to nobs+size by size while(firstobs le nobs);

12693        put 'data s' firstobs +(-1) ';';

12694        obs = min(nobs,firstobs+size-1);

12695        put +3 'set sashelp.shoes(' firstobs= obs= +(-1) ');';

12696        put +3 'run;';

12697        end;

12698     stop;

12699     run;

NOTE: The file FT99F001 is:

      (system-specific pathname),

      (system-specific file attributes)

NOTE: nobs=395

NOTE: 15 records were written to the file (system-specific pathname).

      The minimum record length was 7.

      The maximum record length was 43.

NOTE: DATA statement used (Total process time):

      real time           0.03 seconds

      cpu time            0.03 seconds

12700  %inc FT99F001 / source2;

NOTE: %INCLUDE (level 1) file FT99F001 is (system-specific pathname).

12701 +data s1;

12702 +   set sashelp.shoes(firstobs=1 obs=95);

12703 +   run;

NOTE: The data set WORK.S1 has 95 observations and 7 variables.

NOTE: DATA statement used (Total process time):

      real time           0.01 seconds

      cpu time            0.01 seconds

12704 +data s96;

12705 +   set sashelp.shoes(firstobs=96 obs=190);

12706 +   run;

NOTE: The data set WORK.S96 has 95 observations and 7 variables.

NOTE: DATA statement used (Total process time):

      real time           0.01 seconds

      cpu time            0.01 seconds

12707 +data s191;

12708 +   set sashelp.shoes(firstobs=191 obs=285);

12709 +   run;

NOTE: The data set WORK.S191 has 95 observations and 7 variables.

NOTE: DATA statement used (Total process time):

      real time           0.00 seconds

      cpu time            0.00 seconds

12710 +data s286;

12711 +   set sashelp.shoes(firstobs=286 obs=380);

12712 +   run;

NOTE: The data set WORK.S286 has 95 observations and 7 variables.

NOTE: DATA statement used (Total process time):

      real time           0.01 seconds

      cpu time            0.01 seconds

12713 +data s381;

12714 +   set sashelp.shoes(firstobs=381 obs=395);

12715 +   run;

NOTE: The data set WORK.S381 has 15 observations and 7 variables.

NOTE: DATA statement used (Total process time):

      real time           0.01 seconds

      cpu time            0.01 seconds

NOTE: %INCLUDE (level 1) ending.

Domenico
Fluorite | Level 6

Hi R_Win

try this macro. The variable num is the max observation in the splitted file.

%macro splitfile(num);

    data _null_;

        if 0 then

            set largefile nobs=count;

        call symput('numobs',put(count,8.));

    run;

    %let m=%sysevalf(&numobs/&num,ceil);

    data %do j=1 %to &m;

        splittedfile_&j

        %end;

        ;

        set largefile;

        %do i=1 %to &m;

            if %eval(&num*(&i-1)) <_n_ <= %eval(&num*&i) then

                output splittedfile_&i;

        %end;

    run;

%mend splitfile;

Ksharp
Super User

Another way is Hash Table,If you would like.

Ksharp

art297
Opal | Level 21

I' not sure if this is what Ksharp had in mind, but it would work.  The following example splits sashelp.class into sub files with n le 5:

data temp (drop=g index = (group));

  set sashelp.class;

  g+(mod(_n_,5)=1);

  group=catt('g',g);

run;

data _null_ ;

  dcl hash hh   (             ) ;

  hh.definekey  ('k'          ) ;

  hh.definedata ('group','sex', 'name', 'age', 'height', 'weight') ;

  hh.definedone () ;

  do k = 1 by 1 until ( last.group ) ;

    set temp;

    by group ;

    hh.add () ;

  end ;

  hh.output (dataset:group) ;

run ;

Ksharp
Super User

Hi. Art.T your code is very good ,I  did not even think it so.

This is another version of Art.T's code.  Smiley Happy

data _null_ ;
  dcl hash hh   (             ) ;
  hh.definekey  ('k'          ) ;
  hh.definedata ('sex', 'name', 'age', 'height', 'weight') ;
  hh.definedone () ;

  do until(mod(k,5)=0 or last);
   k+1;
set sashelp.class end=last ;
            hh.add();
  end;
  hh.output(dataset: 'a'||strip(k));
run;

Ksharp

art297
Opal | Level 21

Ksharp,

I definitely like your proposed code better than what I had suggested, but I would add a group counter to keep the file names a bit more manageable.  While it adds a little more overhead, doing so had almost no negative effect on performance:

data _null_ ;

  dcl hash hh   (             ) ;

  hh.definekey  ('k'          ) ;

  hh.definedata ('sex', 'name', 'age', 'height', 'weight') ;

  hh.definedone () ;

 

  do until(mod(k,5)=0 or last);

   k+1;

   set sashelp.class end=last ;

   hh.add();

  end;

  gp+1;

  hh.output(dataset: 'a'||strip(gp));

run;

Ksharp
Super User

Art.T

Yes. This makes code looks better if add a gp variable to flag a output dataset.

Ksharp

Smiley Happy

Peter_C
Rhodochrosite | Level 12

does this method using hash tables not require more memory than the size of that  huge  data set ?

art297
Opal | Level 21

Peter,

I am definitely NOT a hash expert (where is Paul Dorfman when you really need him?), but I tend to agree that a hash method may not be appropriate for the OP.  He/She was originally offered suggestions that WOULD work, but never indicated any response as being correct.  Regardless, I'm glad that KSharp suggested the method, as I never realized that one could use the method in this manner without first creating an index.

Ksharp
Super User

Yes. The Hash Table only has constant size of memory(i.e. only contains five observations).

Each Iteration of data step will clear the last Hash Table and re-build a new Hash Table, then continue to accept

next five observations. So scan table only once ,you can split it all.

This code I wrote is also based on Art.T's code. So I will thank Art.T ,who make me to think it further about Hash Table.

Ksharp

FriedEgg
SAS Employee

This paper provides some insight into the splitting utilizing hash object method as well another example from the pre-v9 era.  A nice read.

http://www2.sas.com/proceedings/sugi30/236-30.pdf

And to Art's point it is authored in part by Paul Dorfman.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 2195 views
  • 0 likes
  • 7 in conversation