DATA Step, Macro, Functions and more

how to split a dataset into 300 sub datasets

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 13
Accepted Solution

how to split a dataset into 300 sub datasets

currently I have a working code, whereI split a dataset(of 25 years) into 25 different sub datasets and process each of the sub dataset in a macro(working code below;repetitive code indicated by ...);so far so good. My requirement is I need to split this dataset by monthly and my question is what is the best way to do it. If it were C++, I'd use a for loop to go over the dataset and store each sub dataset as an array element and pass each array element to the macro; how would I do it in SAS. Clearly, the approach that I followed below is inefficient, because that would require having 300 sub datasets and a code repeted over 300 times; (PS:I am new to SAS); thank you in advance


*split the time series for each year to accomodate LM12;
data  yr1989mktdata ....... yr2000mktdata yr2001mktdata yr2002mktdata yr2003mktdata yr2004mktdata
yr2005mktdata yr2006mktdata yr2007mktdata yr2008mktdata  yr2009mktdata yr2010mktdata yr2011mktdata  yr2012mktdata  ;
set mktdata;

  if date > '1jan1989'd and date <= '31dec1989'd then
output yr1989mktdata;

....

..

...

...

else if date > '1jan2000'd and date <= '31dec2000'd then
output yr2000mktdata;

else if date > '1jan2001'd and date <= '31dec2001'd then
output yr2001mktdata;

else if date > '1jan2002'd and date <= '31dec2002'd then
output yr2002mktdata;

else if date > '1jan2003'd and date <= '31dec2003'd then
output yr2003mktdata;

else if date > '1jan2004'd and date <= '31dec2004'd then
output yr2004mktdata;

else if date > '1jan2005'd and date <= '31dec2005'd then
output yr2005mktdata;

else if date > '1jan2006'd and date <= '31dec2006'd then
output yr2006mktdata;

else if date > '1jan2007'd and date <= '31dec2007'd then
output yr2007mktdata;

.............

...........

..........

run;

%macro LMGenerate(submktdata,yeartag);

%mend  LMGenerate;

%LMGenerate(yr1989mktdata,1989)

...

....

....

%LMGenerate(yr2000mktdata,2000)

%LMGenerate(yr2001mktdata,2001)

%LMGenerate(yr2002mktdata,2002)

%LMGenerate(yr2003mktdata,2003)

...

...

%LMGenerate(yr2004mktdata,2012)


Accepted Solutions
Solution
‎02-24-2013 12:25 PM
Super User
Posts: 5,083

Re: how to split a dataset into 300 sub datasets

First, note that Reeza's comment may be 100% right.  We don't see the processing that takes place inside the macro for each data set, but it's possible that following Reeza's suggestion would eliminate macro language entirely.  It would look somethig like this:

data want;

   set have;

   year = year(date);

   month = month(date);

   if 1989 <= year <=  2012;

run;

proc sort data=want;

   by year;

run;

Then skip running a macro and instead process the complete data set using the statement BY YEAR.  We would need to know more about what's in the macro to know if that approach is feasiable.

If you conclude that you really do need to process each subset separately, here are a few recommendations.

1. Create the variables YEAR and MONTH.  It is cheaper to add extra variables than to call the YEAR()  and MONTH() functions multiple times.

2. Macro language will let you generate 300 data set names without too much trouble, but it's a little tricky to get the leading 0 in the first 9 months.  You would need a macro along these lines:

data  %do year=1989 %to 2012;

                 %do month=1 %to 12;

                         yr&year._m%sysfunc(putn(&month, z2))

                 %end;

         %end;

3.  To output observations, it sometimes becomes easier to delete the early records first.  That way, you only need to make one comparison instead of two.  In this case, however, you might group the ELSE logic a little differently.  For example:

year = year(date);

month = month(date);

if year < 1989 then delete;

else if year=1989 then do;

   if month=1 then output yr1989_m01;

   else if month=2 then output yr1989_m02;

   ...

   else if month=12 then output yr1989_m12;

end;

else if year=1990 then do;

4. Getting macro language to generate those IF/THEN statements is mildly complex.  Also consider whether SELECT logic would be easier for you:

select (year);

   when (1989) select (month);

                        when (01) output yr1989_m01;

                        when (02) output yr1989_m02;

                         ...

                        when (12) output yr1989_m12;

                        end;

   ...

   when (2012) select (month);

                        when (01) output yr2012_m01;

                        ...

                        end;

   otherwise delete;

end;

So the first decision is to look at the subsequent processing and determine whether it truly requires separate data sets or not.  If so, then the code here may come in handy.  But it's not a given.

Good luck.

View solution in original post


All Replies
Super User
Posts: 17,831

Re: how to split a dataset into 300 sub datasets

Generally splitting the dataset up isn't recommended. Instead you should look into by group processing for SAS. 

SAS doesn't pull a dataset into memory, so the size of a dataset is less of an issue. Instead, it processes a dataset line by line, however sorting a dataset could take more time and memory depending on the size.

http://www.lexjansen.com/wuss/2008/ess/ess11.pdf

Super Contributor
Posts: 1,636

Re: how to split a dataset into 300 sub datasets

The example below may save you some typing:

data mktdata;

input date mmddyy10.;

format date mmddyy10.;

cards;

01/02/2000

01/02/2001

01/02/2002

01/02/2003

01/02/2004

;

%macro test;

  %do year=2000 %to 2004;

   data yr&year.mktdata;

       set mktdata(where=(year(date)=&year));

  run;

  %end;

%mend;

%test

Solution
‎02-24-2013 12:25 PM
Super User
Posts: 5,083

Re: how to split a dataset into 300 sub datasets

First, note that Reeza's comment may be 100% right.  We don't see the processing that takes place inside the macro for each data set, but it's possible that following Reeza's suggestion would eliminate macro language entirely.  It would look somethig like this:

data want;

   set have;

   year = year(date);

   month = month(date);

   if 1989 <= year <=  2012;

run;

proc sort data=want;

   by year;

run;

Then skip running a macro and instead process the complete data set using the statement BY YEAR.  We would need to know more about what's in the macro to know if that approach is feasiable.

If you conclude that you really do need to process each subset separately, here are a few recommendations.

1. Create the variables YEAR and MONTH.  It is cheaper to add extra variables than to call the YEAR()  and MONTH() functions multiple times.

2. Macro language will let you generate 300 data set names without too much trouble, but it's a little tricky to get the leading 0 in the first 9 months.  You would need a macro along these lines:

data  %do year=1989 %to 2012;

                 %do month=1 %to 12;

                         yr&year._m%sysfunc(putn(&month, z2))

                 %end;

         %end;

3.  To output observations, it sometimes becomes easier to delete the early records first.  That way, you only need to make one comparison instead of two.  In this case, however, you might group the ELSE logic a little differently.  For example:

year = year(date);

month = month(date);

if year < 1989 then delete;

else if year=1989 then do;

   if month=1 then output yr1989_m01;

   else if month=2 then output yr1989_m02;

   ...

   else if month=12 then output yr1989_m12;

end;

else if year=1990 then do;

4. Getting macro language to generate those IF/THEN statements is mildly complex.  Also consider whether SELECT logic would be easier for you:

select (year);

   when (1989) select (month);

                        when (01) output yr1989_m01;

                        when (02) output yr1989_m02;

                         ...

                        when (12) output yr1989_m12;

                        end;

   ...

   when (2012) select (month);

                        when (01) output yr2012_m01;

                        ...

                        end;

   otherwise delete;

end;

So the first decision is to look at the subsequent processing and determine whether it truly requires separate data sets or not.  If so, then the code here may come in handy.  But it's not a given.

Good luck.

PROC Star
Posts: 7,363

Re: how to split a dataset into 300 sub datasets

If you decide to go with 's suggestion #4, you can automate that quite a bit with a little bit of proc sql code.  e.g.:

data have;

  informat date date9.;

  input date;

  cards;

02jan2012

05jan2012

04feb2012

04apr2012

01may2012

02may2012

;

proc sql noprint;

  select distinct "when ('"||put(date,monyy7.)||

   "') output "||put(date,monyy7.)||";",

   put(date,monyy7.)

    into :dates separated by " ",

         Smiley Surpriseduts separated by " "

      from have

  ;

quit;

data &outs. (drop=monthyear);

  set have;

  monthyear=put(date,monyy7.);

  select (monthyear);

    &dates.

    otherwise delete;

  end;

run;

Regular Contributor
Posts: 198

Re: how to split a dataset into 300 sub datasets

Since this is a one-time solution

meaning cost does not matter,

I'll go with Art's SQL solution.

see also:

http://www.sascommunity.org/wiki/Category:Making_subsets

and a Work in Progress:

http://www.sascommunity.org/wiki/Macro_Loops_with_Dates

Ron Fehd  list processing maven

Trusted Advisor
Posts: 1,300

Re: how to split a dataset into 300 sub datasets

I cannot remember the maximum number of datasets a single data statement can open at a time, but in case 300 breaches that limit here are two variations on the same method to use a hash object to split out your subgroups.

/* test data */

data have;

format id best. dt date9.;

drop min;

min=intnx('year',today(),-25);

do id=1 to 10000;

  dt= min + floor((1+today()-min)*ranuni(1234));

  output;

end;

run;

/* method 1, no treatment */

data _null_;

  if 0 then set have;

  split=intnx('year',today(),-25,'s');

  do while(split<=today());

   declare hash mon(ordered: 'a');

    mon.definekey('id','dt','_n_');

    mon.definedata('id','dt');

   mon.definedone();

   dsname=cats('have(where=(',intnx('month',split,0,'b'),'<=dt<=',intnx('month',split,0,'e'),'))');

   dsid=open( dsname );

   do _n_=1 by 1 until(rc ne 0);

    call set (dsid) ;

    rc=fetch(dsid);

    mon.add();

   end;

   dsid=close(dsid);

   mon.output (dataset: cats('yr',year(split),'m',month(split),'_mkdata'));

   split=intnx('month',split,1,'s');

  end;

  stop;

run;

/* method 2 - treat with view */

proc sql;

create view _have as

      select id,

             dt,

             cats('yr',year(dt),'m',month(dt),'_mkdata') as split

        from have

    order by split,id,dt

;

quit;

data _null_;

declare hash mon();

  mon.definekey('id','dt','_n_');

  mon.definedata('id','dt');

mon.definedone();

do _n_=1 by 1 until(last.split);

  set _have;

  by split;

  mon.add();

end;

mon.output(dataset:split);

run;

PROC Star
Posts: 7,363

Re: how to split a dataset into 300 sub datasets

: If there is a limit to the number of files one can output, other than memory, I haven't found it.

I just ran some test code where I created 10,400 files in a single datastep and, if I weren't so tired, would have tried to expand it. 

Occasional Contributor
Posts: 13

Re: how to split a dataset into 300 sub datasets

thank you folks...i learnt from this thread.

I removed macro and rewrote using 'by' clause. code look simple and elegant..thanks again

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 645 views
  • 10 likes
  • 7 in conversation