currently I have a working code, whereI split a dataset(of 25 years) into 25 different sub datasets and process each of the sub dataset in a macro(working code below;repetitive code indicated by ...);so far so good. My requirement is I need to split this dataset by monthly and my question is what is the best way to do it. If it were C++, I'd use a for loop to go over the dataset and store each sub dataset as an array element and pass each array element to the macro; how would I do it in SAS. Clearly, the approach that I followed below is inefficient, because that would require having 300 sub datasets and a code repeted over 300 times; (PS:I am new to SAS); thank you in advance
*split the time series for each year to accomodate LM12;
data yr1989mktdata ....... yr2000mktdata yr2001mktdata yr2002mktdata yr2003mktdata yr2004mktdata
yr2005mktdata yr2006mktdata yr2007mktdata yr2008mktdata yr2009mktdata yr2010mktdata yr2011mktdata yr2012mktdata ;
set mktdata;
if date > '1jan1989'd and date <= '31dec1989'd then
output yr1989mktdata;
....
..
...
...
else if date > '1jan2000'd and date <= '31dec2000'd then
output yr2000mktdata;
else if date > '1jan2001'd and date <= '31dec2001'd then
output yr2001mktdata;
else if date > '1jan2002'd and date <= '31dec2002'd then
output yr2002mktdata;
else if date > '1jan2003'd and date <= '31dec2003'd then
output yr2003mktdata;
else if date > '1jan2004'd and date <= '31dec2004'd then
output yr2004mktdata;
else if date > '1jan2005'd and date <= '31dec2005'd then
output yr2005mktdata;
else if date > '1jan2006'd and date <= '31dec2006'd then
output yr2006mktdata;
else if date > '1jan2007'd and date <= '31dec2007'd then
output yr2007mktdata;
.............
...........
..........
run;
%macro LMGenerate(submktdata,yeartag);
%mend LMGenerate;
%LMGenerate(yr1989mktdata,1989)
...
....
....
%LMGenerate(yr2000mktdata,2000)
%LMGenerate(yr2001mktdata,2001)
%LMGenerate(yr2002mktdata,2002)
%LMGenerate(yr2003mktdata,2003)
...
...
%LMGenerate(yr2004mktdata,2012)
First, note that Reeza's comment may be 100% right. We don't see the processing that takes place inside the macro for each data set, but it's possible that following Reeza's suggestion would eliminate macro language entirely. It would look somethig like this:
data want;
set have;
year = year(date);
month = month(date);
if 1989 <= year <= 2012;
run;
proc sort data=want;
by year;
run;
Then skip running a macro and instead process the complete data set using the statement BY YEAR. We would need to know more about what's in the macro to know if that approach is feasiable.
If you conclude that you really do need to process each subset separately, here are a few recommendations.
1. Create the variables YEAR and MONTH. It is cheaper to add extra variables than to call the YEAR() and MONTH() functions multiple times.
2. Macro language will let you generate 300 data set names without too much trouble, but it's a little tricky to get the leading 0 in the first 9 months. You would need a macro along these lines:
data %do year=1989 %to 2012;
%do month=1 %to 12;
yr&year._m%sysfunc(putn(&month, z2))
%end;
%end;
3. To output observations, it sometimes becomes easier to delete the early records first. That way, you only need to make one comparison instead of two. In this case, however, you might group the ELSE logic a little differently. For example:
year = year(date);
month = month(date);
if year < 1989 then delete;
else if year=1989 then do;
if month=1 then output yr1989_m01;
else if month=2 then output yr1989_m02;
...
else if month=12 then output yr1989_m12;
end;
else if year=1990 then do;
4. Getting macro language to generate those IF/THEN statements is mildly complex. Also consider whether SELECT logic would be easier for you:
select (year);
when (1989) select (month);
when (01) output yr1989_m01;
when (02) output yr1989_m02;
...
when (12) output yr1989_m12;
end;
...
when (2012) select (month);
when (01) output yr2012_m01;
...
end;
otherwise delete;
end;
So the first decision is to look at the subsequent processing and determine whether it truly requires separate data sets or not. If so, then the code here may come in handy. But it's not a given.
Good luck.
Generally splitting the dataset up isn't recommended. Instead you should look into by group processing for SAS.
SAS doesn't pull a dataset into memory, so the size of a dataset is less of an issue. Instead, it processes a dataset line by line, however sorting a dataset could take more time and memory depending on the size.
The example below may save you some typing:
data mktdata;
input date mmddyy10.;
format date mmddyy10.;
cards;
01/02/2000
01/02/2001
01/02/2002
01/02/2003
01/02/2004
;
%macro test;
%do year=2000 %to 2004;
data yr&year.mktdata;
set mktdata(where=(year(date)=&year));
run;
%end;
%mend;
%test
First, note that Reeza's comment may be 100% right. We don't see the processing that takes place inside the macro for each data set, but it's possible that following Reeza's suggestion would eliminate macro language entirely. It would look somethig like this:
data want;
set have;
year = year(date);
month = month(date);
if 1989 <= year <= 2012;
run;
proc sort data=want;
by year;
run;
Then skip running a macro and instead process the complete data set using the statement BY YEAR. We would need to know more about what's in the macro to know if that approach is feasiable.
If you conclude that you really do need to process each subset separately, here are a few recommendations.
1. Create the variables YEAR and MONTH. It is cheaper to add extra variables than to call the YEAR() and MONTH() functions multiple times.
2. Macro language will let you generate 300 data set names without too much trouble, but it's a little tricky to get the leading 0 in the first 9 months. You would need a macro along these lines:
data %do year=1989 %to 2012;
%do month=1 %to 12;
yr&year._m%sysfunc(putn(&month, z2))
%end;
%end;
3. To output observations, it sometimes becomes easier to delete the early records first. That way, you only need to make one comparison instead of two. In this case, however, you might group the ELSE logic a little differently. For example:
year = year(date);
month = month(date);
if year < 1989 then delete;
else if year=1989 then do;
if month=1 then output yr1989_m01;
else if month=2 then output yr1989_m02;
...
else if month=12 then output yr1989_m12;
end;
else if year=1990 then do;
4. Getting macro language to generate those IF/THEN statements is mildly complex. Also consider whether SELECT logic would be easier for you:
select (year);
when (1989) select (month);
when (01) output yr1989_m01;
when (02) output yr1989_m02;
...
when (12) output yr1989_m12;
end;
...
when (2012) select (month);
when (01) output yr2012_m01;
...
end;
otherwise delete;
end;
So the first decision is to look at the subsequent processing and determine whether it truly requires separate data sets or not. If so, then the code here may come in handy. But it's not a given.
Good luck.
If you decide to go with 's suggestion #4, you can automate that quite a bit with a little bit of proc sql code. e.g.:
data have;
informat date date9.;
input date;
cards;
02jan2012
05jan2012
04feb2012
04apr2012
01may2012
02may2012
;
proc sql noprint;
select distinct "when ('"||put(date,monyy7.)||
"') output "||put(date,monyy7.)||";",
put(date,monyy7.)
into :dates separated by " ",
:outs separated by " "
from have
;
quit;
data &outs. (drop=monthyear);
set have;
monthyear=put(date,monyy7.);
select (monthyear);
&dates.
otherwise delete;
end;
run;
Since this is a one-time solution
meaning cost does not matter,
I'll go with Art's SQL solution.
see also:
http://www.sascommunity.org/wiki/Category:Making_subsets
and a Work in Progress:
http://www.sascommunity.org/wiki/Macro_Loops_with_Dates
Ron Fehd list processing maven
I cannot remember the maximum number of datasets a single data statement can open at a time, but in case 300 breaches that limit here are two variations on the same method to use a hash object to split out your subgroups.
/* test data */
data have;
format id best. dt date9.;
drop min;
min=intnx('year',today(),-25);
do id=1 to 10000;
dt= min + floor((1+today()-min)*ranuni(1234));
output;
end;
run;
/* method 1, no treatment */
data _null_;
if 0 then set have;
split=intnx('year',today(),-25,'s');
do while(split<=today());
declare hash mon(ordered: 'a');
mon.definekey('id','dt','_n_');
mon.definedata('id','dt');
mon.definedone();
dsname=cats('have(where=(',intnx('month',split,0,'b'),'<=dt<=',intnx('month',split,0,'e'),'))');
dsid=open( dsname );
do _n_=1 by 1 until(rc ne 0);
call set (dsid) ;
rc=fetch(dsid);
mon.add();
end;
dsid=close(dsid);
mon.output (dataset: cats('yr',year(split),'m',month(split),'_mkdata'));
split=intnx('month',split,1,'s');
end;
stop;
run;
/* method 2 - treat with view */
proc sql;
create view _have as
select id,
dt,
cats('yr',year(dt),'m',month(dt),'_mkdata') as split
from have
order by split,id,dt
;
quit;
data _null_;
declare hash mon();
mon.definekey('id','dt','_n_');
mon.definedata('id','dt');
mon.definedone();
do _n_=1 by 1 until(last.split);
set _have;
by split;
mon.add();
end;
mon.output(dataset:split);
run;
thank you folks...i learnt from this thread.
I removed macro and rewrote using 'by' clause. code look simple and elegant..thanks again
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.