Solved: how to split a dataset into 300 sub datasets

kashili · Posted 02-23-2013 07:00 PM

currently I have a working code, whereI split a dataset(of 25 years) into 25 different sub datasets and process each of the sub dataset in a macro(working code below;repetitive code indicated by ...);so far so good. My requirement is I need to split this dataset by monthly and my question is what is the best way to do it. If it were C++, I'd use a for loop to go over the dataset and store each sub dataset as an array element and pass each array element to the macro; how would I do it in SAS. Clearly, the approach that I followed below is inefficient, because that would require having 300 sub datasets and a code repeted over 300 times; (PS:I am new to SAS); thank you in advance

*split the time series for each year to accomodate LM12;
data yr1989mktdata ....... yr2000mktdata yr2001mktdata yr2002mktdata yr2003mktdata yr2004mktdata
yr2005mktdata yr2006mktdata yr2007mktdata yr2008mktdata yr2009mktdata yr2010mktdata yr2011mktdata yr2012mktdata ;
set mktdata;

if date > '1jan1989'd and date <= '31dec1989'd then
output yr1989mktdata;

....

..

...

else if date > '1jan2000'd and date <= '31dec2000'd then
output yr2000mktdata;

else if date > '1jan2001'd and date <= '31dec2001'd then
output yr2001mktdata;

else if date > '1jan2002'd and date <= '31dec2002'd then
output yr2002mktdata;

else if date > '1jan2003'd and date <= '31dec2003'd then
output yr2003mktdata;

else if date > '1jan2004'd and date <= '31dec2004'd then
output yr2004mktdata;

else if date > '1jan2005'd and date <= '31dec2005'd then
output yr2005mktdata;

else if date > '1jan2006'd and date <= '31dec2006'd then
output yr2006mktdata;

else if date > '1jan2007'd and date <= '31dec2007'd then
output yr2007mktdata;

.............

...........

..........

run;

%macro LMGenerate(submktdata,yeartag);

%mend LMGenerate;

%LMGenerate(yr1989mktdata,1989)

...

....

%LMGenerate(yr2000mktdata,2000)

%LMGenerate(yr2001mktdata,2001)

%LMGenerate(yr2002mktdata,2002)

%LMGenerate(yr2003mktdata,2003)

...

%LMGenerate(yr2004mktdata,2012)

Astounding · Posted 02-24-2013 12:25 PM

First, note that Reeza's comment may be 100% right. We don't see the processing that takes place inside the macro for each data set, but it's possible that following Reeza's suggestion would eliminate macro language entirely. It would look somethig like this:

data want;

set have;

year = year(date);

month = month(date);

if 1989 <= year <= 2012;

run;

proc sort data=want;

by year;

run;

Then skip running a macro and instead process the complete data set using the statement BY YEAR. We would need to know more about what's in the macro to know if that approach is feasiable.

If you conclude that you really do need to process each subset separately, here are a few recommendations.

1. Create the variables YEAR and MONTH. It is cheaper to add extra variables than to call the YEAR() and MONTH() functions multiple times.

2. Macro language will let you generate 300 data set names without too much trouble, but it's a little tricky to get the leading 0 in the first 9 months. You would need a macro along these lines:

data %do year=1989 %to 2012;

%do month=1 %to 12;

yr&year._m%sysfunc(putn(&month, z2))

%end;

3. To output observations, it sometimes becomes easier to delete the early records first. That way, you only need to make one comparison instead of two. In this case, however, you might group the ELSE logic a little differently. For example:

year = year(date);

month = month(date);

if year < 1989 then delete;

else if year=1989 then do;

if month=1 then output yr1989_m01;

else if month=2 then output yr1989_m02;

...

else if month=12 then output yr1989_m12;

end;

else if year=1990 then do;

4. Getting macro language to generate those IF/THEN statements is mildly complex. Also consider whether SELECT logic would be easier for you:

select (year);

when (1989) select (month);

when (01) output yr1989_m01;

when (02) output yr1989_m02;

...

when (12) output yr1989_m12;

end;

...

when (2012) select (month);

when (01) output yr2012_m01;

...

end;

otherwise delete;

end;

So the first decision is to look at the subsequent processing and determine whether it truly requires separate data sets or not. If so, then the code here may come in handy. But it's not a given.

Good luck.

View solution in original post

Reeza · Posted 02-23-2013 07:11 PM

Generally splitting the dataset up isn't recommended. Instead you should look into by group processing for SAS.

SAS doesn't pull a dataset into memory, so the size of a dataset is less of an issue. Instead, it processes a dataset line by line, however sorting a dataset could take more time and memory depending on the size.

http://www.lexjansen.com/wuss/2008/ess/ess11.pdf

Linlin · Posted 02-23-2013 09:35 PM

The example below may save you some typing:

data mktdata;

input date mmddyy10.;

format date mmddyy10.;

cards;

01/02/2000

01/02/2001

01/02/2002

01/02/2003

01/02/2004

;

%macro test;

%do year=2000 %to 2004;

data yr&year.mktdata;

set mktdata(where=(year(date)=&year));

run;

%end;

%mend;

%test

Astounding · Posted 02-24-2013 12:25 PM

First, note that Reeza's comment may be 100% right. We don't see the processing that takes place inside the macro for each data set, but it's possible that following Reeza's suggestion would eliminate macro language entirely. It would look somethig like this:

data want;

set have;

year = year(date);

month = month(date);

if 1989 <= year <= 2012;

run;

proc sort data=want;

by year;

run;

Then skip running a macro and instead process the complete data set using the statement BY YEAR. We would need to know more about what's in the macro to know if that approach is feasiable.

If you conclude that you really do need to process each subset separately, here are a few recommendations.

1. Create the variables YEAR and MONTH. It is cheaper to add extra variables than to call the YEAR() and MONTH() functions multiple times.

2. Macro language will let you generate 300 data set names without too much trouble, but it's a little tricky to get the leading 0 in the first 9 months. You would need a macro along these lines:

data %do year=1989 %to 2012;

%do month=1 %to 12;

yr&year._m%sysfunc(putn(&month, z2))

%end;

3. To output observations, it sometimes becomes easier to delete the early records first. That way, you only need to make one comparison instead of two. In this case, however, you might group the ELSE logic a little differently. For example:

year = year(date);

month = month(date);

if year < 1989 then delete;

else if year=1989 then do;

if month=1 then output yr1989_m01;

else if month=2 then output yr1989_m02;

...

else if month=12 then output yr1989_m12;

end;

else if year=1990 then do;

4. Getting macro language to generate those IF/THEN statements is mildly complex. Also consider whether SELECT logic would be easier for you:

select (year);

when (1989) select (month);

when (01) output yr1989_m01;

when (02) output yr1989_m02;

...

when (12) output yr1989_m12;

end;

...

when (2012) select (month);

when (01) output yr2012_m01;

...

end;

otherwise delete;

end;

So the first decision is to look at the subsequent processing and determine whether it truly requires separate data sets or not. If so, then the code here may come in handy. But it's not a given.

Good luck.

art297 · Posted 02-24-2013 03:44 PM

If you decide to go with 's suggestion #4, you can automate that quite a bit with a little bit of proc sql code. e.g.:

data have;

informat date date9.;

input date;

cards;

02jan2012

05jan2012

04feb2012

04apr2012

01may2012

02may2012

;

proc sql noprint;

select distinct "when ('"||put(date,monyy7.)||

"') output "||put(date,monyy7.)||";",

put(date,monyy7.)

into :dates separated by " ",

:outs separated by " "

from have

;

quit;

data &outs. (drop=monthyear);

set have;

monthyear=put(date,monyy7.);

select (monthyear);

&dates.

otherwise delete;

end;

run;

Ron_MacroMaven · Posted 02-25-2013 08:54 AM

Since this is a one-time solution

meaning cost does not matter,

I'll go with Art's SQL solution.

how to split a dataset into 300 sub datasets

Re: how to split a dataset into 300 sub datasets

Re: how to split a dataset into 300 sub datasets

Re: how to split a dataset into 300 sub datasets

Re: how to split a dataset into 300 sub datasets

Re: how to split a dataset into 300 sub datasets

Re: how to split a dataset into 300 sub datasets

Re: how to split a dataset into 300 sub datasets

Re: how to split a dataset into 300 sub datasets

Re: how to split a dataset into 300 sub datasets

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away