Dear SAS Community,
I'm trying to compile multiple datasets into single superset - the same data for different months for many years (jan2001_data .... dec2016). Multiple SET doesn't work here because I only need subset of variables from original datasets. Thus I created a dataset which I expand using sql union with every dataset:
%macro appendData(month=, year=); proc sql; create table anto.temp1 as select a.* from anto.dataaccum a union select b.var1, b.var2, b.var3 from data&year..&month.&year._data b; data anto.dataaccum; set anto.temp1; run; %mend; %appendData(month=jan, year=2001); ...... %appendData(month=dec, year=2016);
That would have been much more efficient if I could call %appendData from inside %do loop. However, any use of ARRAY, DO or DO_OVER outside of PROC causes error. Is it possible at all? Or do I call it incorrectly?
%do_over(values=jan feb mar apr may jun jul aug sep oct nov dec, phrase=%appendData(month=?,year=2016)); WARNING: Apparent invocation of macro DO_OVER not resolved. ERROR 180-322: Statement is not valid or it is used out of proper order.
Instead of using macro coding you could also query the SAS dictionary tables and create and populate a macro variable with all the source datasets you need (and which exist).
"Multiple SET doesn't work here because I only need subset of variables from original datasets."
That works: You can use a KEEP option on source datasets: SET mysource(keep=<variable list>);
And to read from multiple datasets only use a single SET statement and just list all your datasets like:
SET mysource1(keep=<variable list>) mysource2(keep=<variable list>);
Here a code option which allows you to create the SET statement dynamically using a start and end date.
The naming convention of your source datasets made things a bit more complicated than necessary - that's the reason for the regular expression and prxmatch() - but things work.
options symbolgen;
/* create sample source data */
data data2017_jan2017_data data2018_feb2018_data;
set sashelp.class;
run;
/* set start and end date for monthly datasets */
%let startdt=jan2017;
%let enddt=mar2018;
/* create macro variable with value containing all source datasets in date range */
/* - only datasets which actually exists will be added */
%let ds_list=_dummy_;
proc sql noprint;
select cats(libname,'.',memname,"(keep=name age)") into :ds_list separated by ' '
from dictionary.tables
where
libname="WORK"
and prxmatch('/^data\d{4}_\w{3}\d{4}_data\s*$/oi',memname)=1
and input(scan(memname,2,'_'),monyy7.) between input("&startdt",monyy7.) and input("&enddt",monyy7.)
;
quit;
/* create table want with all selected source datasets */
data want;
length sourceDS _sourceDS $41;
set &ds_list indsname=_sourceDS;
sourceDS=_sourceDS;
run;
...
where libname="WORK"
...
Change libref "WORK" to the libref where your source datasets reside (i.e. ANTO) and things should work (name of libref must be in upper case).
Instead of using macro coding you could also query the SAS dictionary tables and create and populate a macro variable with all the source datasets you need (and which exist).
"Multiple SET doesn't work here because I only need subset of variables from original datasets."
That works: You can use a KEEP option on source datasets: SET mysource(keep=<variable list>);
And to read from multiple datasets only use a single SET statement and just list all your datasets like:
SET mysource1(keep=<variable list>) mysource2(keep=<variable list>);
Here a code option which allows you to create the SET statement dynamically using a start and end date.
The naming convention of your source datasets made things a bit more complicated than necessary - that's the reason for the regular expression and prxmatch() - but things work.
options symbolgen;
/* create sample source data */
data data2017_jan2017_data data2018_feb2018_data;
set sashelp.class;
run;
/* set start and end date for monthly datasets */
%let startdt=jan2017;
%let enddt=mar2018;
/* create macro variable with value containing all source datasets in date range */
/* - only datasets which actually exists will be added */
%let ds_list=_dummy_;
proc sql noprint;
select cats(libname,'.',memname,"(keep=name age)") into :ds_list separated by ' '
from dictionary.tables
where
libname="WORK"
and prxmatch('/^data\d{4}_\w{3}\d{4}_data\s*$/oi',memname)=1
and input(scan(memname,2,'_'),monyy7.) between input("&startdt",monyy7.) and input("&enddt",monyy7.)
;
quit;
/* create table want with all selected source datasets */
data want;
length sourceDS _sourceDS $41;
set &ds_list indsname=_sourceDS;
sourceDS=_sourceDS;
run;
...
where libname="WORK"
...
Change libref "WORK" to the libref where your source datasets reside (i.e. ANTO) and things should work (name of libref must be in upper case).
This is an untested code but could work if you try
I am in the assumption to use the temp: the dataset name followed with colon in the below step and the same could be kept within the macro.
data anto.dataaccum;
set anto.temp:;
run;
could you please test and let me know
It's extremely tricky to navigate, if you are trying to pull different months for different years. But if you are willing to give a single macro call for each year, here is a possibility. This may not produce what your original program is intended to produce, but it is a good starting point for discussion. The idea of selecting a.* in your current program worries me so I want to simplify it here:
%macro appendData (year=, month_list=);
%local i next_month;
%do i=1 %to %sysfunc(countw(&month_list));
%let next_month = scan(&month_list, &i);
proc sql;
select var1 var2 var3 from data&year..&month.&year._data;
quit;
proc append data=temp base=auto.dataaccum;
run;
%end;
%mend appendData;
%appendData (year=2017, month_list=jan feb mar apr may jun)
This still doesn't rule out using a SET statement, but I'm trying to change as little as possible in your original program at this point. You could try to use macro language to generate this final program as well:
data auto.dataaccum;
do until (done1);
set data2017.jan2017_data (keep=var1 var2 var3) end=done1;
output;
end;
do until (done2);
set data2017.feb2017_data (keep=var1 var2 var3) end=done2;
output;
end;
do until (done3);
set data2017.mar2017_data (keep=var1 var2 var3) end=done3;
output;
end;
run;
It's just not clear at this point whether this produces the outcome you are looking for.
Here is a datastep approach using the call execute function:
data _null_; length sendtox $255; call execute('data anto.dataaccum; set '); year=2000; do i=1 to 192; /*total number of months*/ if mod(i,12) eq 1 then year+1; sendtox=catt('data',year,'.',put(mdy(i,1,year),monname3.),year,'_data (keep=var1 var2 var3)'); call execute(sendtox); end; call execute(';run;'); run;
Art, CEO, AnalystFinder.com
And just for fun here yet another a bit unusual approach:
/* create sample source data */
data data2017_jan2017_data data2018_feb2018_data;
set sashelp.class;
run;
/* set start and end date for monthly datasets */
%let startdt=jan2017;
%let enddt=mar2018;
/* define vars to keep */
%let keepVar=name age height;
/* create DS with source datasets in date range with required naming pattern */
data work.ds(keep=sourceDS &keepVar);
length sourceDS _sourceDS $41;
set
data:(obs=1) indsname=_sourceDS;
if
prxmatch('/^data\d{4}_\w{3}\d{4}_data\s*$/oi',scan(_sourceDS,-1,'.'))=1
and input("&startdt",monyy7.) <= input(scan(_sourceDS,3,'._'),monyy7.) <= input("&enddt",monyy7.)
;
sourceDS=_sourceDS;
run;
/* create table want with all selected source datasets */
data want(keep=sourceDS &keepVar);
set work.ds;
_dsid=open(catx(' ',sourceDS,"(keep=&keepVar)"),"I");
call set(_dsid);
do _nobs=1 by 1;
_rc=fetchobs(_dsid,_nobs);
if _rc = 0 then output;
else leave;
end;
_rc=close(_dsid);
run;
First, make a dataset with dataset names.
Then do
data _null_;
set dataset_names end=done;
if _n_ = 1 then call execute('data want; set ';);
call execute(trim(ds_name) !! ' (keep=var1 var2 var3) ');
if done then call execute('; run;');
run;
Of course there is a catch to this: with a sufficient number of datasets to concatenate, one could exceed the maximum length of a single SAS statement (32767 characters).
In addition to the useful comments sent by others, consider appending the "open=defer" option to the SET statement. I suspect, that all the data sets you are concatenatiing have the same variables. If so (and only if so), then instead of SAS setting up a separate input buffer for each incoming data set (the default behavior), you can tell SAS to reuse the same buffer for each data set in turn. Saves a lot of memory, and probably some cpu time and clock time.
I.e. the SET command would look like
set
... list of data sets here ....
open=defer;
The tradeoff - you can't have a BY statement accompany the SET statement - because that would require SAS to "look ahead" to the next record (to determine last.by status) - which in turn would require more than one buffer.
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.