(My post here may seem verbose, but I hope that it will at least be clearly stated with the help of the context, the example dataset, and my codes.)
CONTEXT: I have a very large dataset that contains a date variable, and I am trying to come up with a way to subset this large dataset BY the year of this date variable (i.e. so that one subset contains ALL rows for 2018, and another contains ALL rows for 2019. So on and so forth).
IDEA: Instead of using an if ... then output for each of the years in my data, I have attempted (and failed) to use a do loop and macro variables. (I'd also welcome other approaches that will get me the same results.)
Just to illustrate my idea, assume that we have an imaginary table called 'test' that contains the birth year of a few kids:
data test;
input yr;
datalines;
2019
2019
2018
2018
2018
2017
2016
2016
2015
2015
;
The table above contains years from 2015 to 2019 (spanning 5 years). And I've written the following codes in an attempt to programmatically to create 5 subsets by the birth year:
** To save the max and min year as macro var **;
proc sql noprint;
select min(yr), max(yr)
into :min_yr,
:max_yr
from test;
quit;
%let int = %eval(&max_yr - &min_yr); **int denotes the spanning interval**;
** Subset using do loop **;
data _null_;
set test;
do i=0 to &int.;
call symput('iter', i);
if yr = %eval(&min_yr. + i) then output "data&iter.";
end;
run;
Unfortunately, my codes did not work, because inside the %eval( ) function, the "i" was not assigned a value. And SAS log had an error message that reads: A character operand was found in the %eval function ... where a numeric operand is required.
Sorry for making this so lengthy, but I'd really appreciate some help.
if yr = %eval(&min_yr. + i) then output "data&iter.";
The %eval function needs macro variables or numbers to work properly. It doesn't know what to do with the data step variable i
When trying to get macros like this to work, you first need to have a data step that works without macros and without macro variables. Then, you should be able to substitute the macro variables and macro code to get it to work. Then, also, you could see that putting a name of a data set inside double quotes, as you did above, is not necessary, and will not work.
So, this works without macro variables and without macros (and I have saved myself some typing by assuming the only years are 2015 2016 and 2017).
data data2015 data2016 data2017;
set test;
if yr = 2015 then output data2015;
if yr = 2016 then output data2016;
if yr = 2017 then output data2017;
run;
So now you can turn this into a macro by placing macro loops in the places where the years are shown above. The macro processor really just replaces text upon program execution, and so you want a macro variable (let's call it &y) to have values (in the loop) going from 2015 to 2017.
%macro dothis;
data %do y=&min_yr %to &max_yr; data&y %end; ;
set test;
%do y=&min_yr %to &max_yr;
if yr=&y then output data&y;
%end;
run;
%mend;
%dothis
Finally, as a general comment, but usually an important concept that probably applies here ... it is inefficient to split up data sets into yearly data sets. If you need analyses by year, and you keep the entire data set whole (instead of splitting it up), you can get yearly analyses by using the BY command in your SAS code. If I were you, I would see if this works satisfactorily before considering splitting the data up into data sets by year. In other words, my advice to you is: DON'T DO THIS unless you are convinced that there's no way to get what you want with a single large data set.
if yr = %eval(&min_yr. + i) then output "data&iter.";
The %eval function needs macro variables or numbers to work properly. It doesn't know what to do with the data step variable i
When trying to get macros like this to work, you first need to have a data step that works without macros and without macro variables. Then, you should be able to substitute the macro variables and macro code to get it to work. Then, also, you could see that putting a name of a data set inside double quotes, as you did above, is not necessary, and will not work.
So, this works without macro variables and without macros (and I have saved myself some typing by assuming the only years are 2015 2016 and 2017).
data data2015 data2016 data2017;
set test;
if yr = 2015 then output data2015;
if yr = 2016 then output data2016;
if yr = 2017 then output data2017;
run;
So now you can turn this into a macro by placing macro loops in the places where the years are shown above. The macro processor really just replaces text upon program execution, and so you want a macro variable (let's call it &y) to have values (in the loop) going from 2015 to 2017.
%macro dothis;
data %do y=&min_yr %to &max_yr; data&y %end; ;
set test;
%do y=&min_yr %to &max_yr;
if yr=&y then output data&y;
%end;
run;
%mend;
%dothis
Finally, as a general comment, but usually an important concept that probably applies here ... it is inefficient to split up data sets into yearly data sets. If you need analyses by year, and you keep the entire data set whole (instead of splitting it up), you can get yearly analyses by using the BY command in your SAS code. If I were you, I would see if this works satisfactorily before considering splitting the data up into data sets by year. In other words, my advice to you is: DON'T DO THIS unless you are convinced that there's no way to get what you want with a single large data set.
@aaronh wrote:
Thank you so much, Paige! The only reason I was trying to use a do loop here, instead of the if then output, is because I will likely use the same procedure again, sometimes without knowing exactly what years are present in my data. But I can always use the data step if then output if I have no better ways to do this. Thanks again!
Except I also advised you to NOT break up the data set into data sets by year, except as a last resort. I think what you are trying to do is unnecessary extra work that only slows things down in the long run, IN MOST CASES.
You are close, but as you saw you cannot use mix data step and macro logic in that way.
It is easier to just have the data step generate the code. One way to generate code from a data step is to use CALL EXECUTE().
So first get the range of years. Do you want all years from min to max, even those without any records?
proc summary data=test ;
var yr ;
output out=range min=min_yr max=max_yr;
run;
data _null_;
set range;
call execute('data');
do yr=min_yr to max_yr ;
call execute(' '||cats('data',yr));
end;
call execute(';set test;');
do yr=min_yr to max_yr;
call execute(catx(' ','if yr=',yr,'then output',cats('data',yr),';'));
end;
call execute('run;');
run;
Or just the years that exist?
proc summary data=test nway;
class yr ;
output out=range;
run;
data _null_;
call execute('data');
do until(eof1);
set range end=eof1;
call execute(' '||cats('data',yr));
end;
call execute(';set test;');
do until(eof2);
set range end=eof2;
call execute(catx(' ','if yr=',yr,'then output',cats('data',yr),';'));
end;
call execute('run;');
run;
Hi @aaronh
Index your yr column and that will do an implicit sort and then hash is a breeze
data test;
input yr;
datalines;
2019
2019
2018
2018
2018
2017
2016
2016
2015
2015
;
proc sql ;
create index yr on test (yr) ;
quit ;
data _null_ ;
if _n_ = 1 then do ;
dcl hash h () ;
h.definekey ("_n_") ;
h.definedata ('yr') ;
h.definedone () ;
end ;
do _n_ = 1 by 1 until (last.yr) ;
set test ;
by yr ;
h.add() ;
end ;
h.output (dataset: catx ("_", "year", yr)) ;
h.clear() ;
run ;
If your date variable is an actual SAS date variable for a very large number of purposes you can create groups directly as you describe without adding any variable just use the correct format:
data example; do date = '01Jan1960'd to '18Jul2020'd; output; end; run; proc freq data=example; tables date; format date year4.; run;
or
proc freq data=example; tables date; format date yyq6.; run;
or
proc freq data=example; tables date; format date yymon.; run;
Groups created by a format applied to a SAS date variable are honored by most analysis and graphing procedures as well as reporting and summarizing such as Proc Freq , Tabulate, Report.
Or leave the data together and only subset as needed:
proc freq data=example; where year(date) > 2010; tables date; format date yymon.; run;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.