BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
pAckmAn
Calcite | Level 5

Dear SAS Community, 

 

I'm trying to compile multiple datasets into single superset - the same data for different months for many years (jan2001_data .... dec2016). Multiple SET doesn't work here because I only need subset of variables from original datasets. Thus I created a dataset which I expand using sql union with every dataset:

%macro appendData(month=, year=);
proc sql;
     create table anto.temp1 as
     select
           a.*
     from
           anto.dataaccum a
     union
     select
           b.var1,
           b.var2,
           b.var3
     from
           data&year..&month.&year._data b;
 
data anto.dataaccum;
     set anto.temp1;
run;
%mend;
 
%appendData(month=jan, year=2001);
......
%appendData(month=dec, year=2016);

That would have been much more efficient if I could call %appendData from inside %do loop. However, any use of ARRAY, DO or DO_OVER outside of PROC causes error. Is it possible at all? Or do I call it incorrectly? 

%do_over(values=jan feb mar apr may jun jul aug sep oct nov dec, phrase=%appendData(month=?,year=2016));

WARNING: Apparent invocation of macro DO_OVER not resolved.

ERROR 180-322: Statement is not valid or it is used out of proper order.
1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

@pAckmAn

Instead of using macro coding you could also query the SAS dictionary tables and create and populate a macro variable with all the source datasets you need (and which exist).

 

"Multiple SET doesn't work here because I only need subset of variables from original datasets."

That works: You can use a KEEP option on source datasets: SET mysource(keep=<variable list>);

And to read from multiple datasets only use a single SET statement and just list all your datasets like:

SET mysource1(keep=<variable list>) mysource2(keep=<variable list>);

 

Here a code option which allows you to create the SET statement dynamically using a start and end date.

The naming convention of your source datasets made things a bit more complicated than necessary - that's the reason for the regular expression and prxmatch() - but things work.

options symbolgen;

/* create sample source data */
data data2017_jan2017_data data2018_feb2018_data; 
  set sashelp.class;
run;


/* set start and end date for monthly datasets */
%let startdt=jan2017;
%let enddt=mar2018;


/* create macro variable with value containing all source datasets in date range */
/* - only datasets which actually exists will be added                           */
%let ds_list=_dummy_;
proc sql noprint;
  select cats(libname,'.',memname,"(keep=name age)") into :ds_list separated by ' '
  from dictionary.tables
  where 
    libname="WORK" 
    and prxmatch('/^data\d{4}_\w{3}\d{4}_data\s*$/oi',memname)=1
    and input(scan(memname,2,'_'),monyy7.) between input("&startdt",monyy7.) and input("&enddt",monyy7.)
    ;
quit;

/* create table want with all selected source datasets */
data want;
  length sourceDS _sourceDS $41;
  set &ds_list indsname=_sourceDS;
  sourceDS=_sourceDS;
run;

 

...  
where libname="WORK"
...

Change libref "WORK" to the libref where your source datasets reside (i.e. ANTO) and things should work (name of libref must be in upper case).

View solution in original post

8 REPLIES 8
Patrick
Opal | Level 21

@pAckmAn

Instead of using macro coding you could also query the SAS dictionary tables and create and populate a macro variable with all the source datasets you need (and which exist).

 

"Multiple SET doesn't work here because I only need subset of variables from original datasets."

That works: You can use a KEEP option on source datasets: SET mysource(keep=<variable list>);

And to read from multiple datasets only use a single SET statement and just list all your datasets like:

SET mysource1(keep=<variable list>) mysource2(keep=<variable list>);

 

Here a code option which allows you to create the SET statement dynamically using a start and end date.

The naming convention of your source datasets made things a bit more complicated than necessary - that's the reason for the regular expression and prxmatch() - but things work.

options symbolgen;

/* create sample source data */
data data2017_jan2017_data data2018_feb2018_data; 
  set sashelp.class;
run;


/* set start and end date for monthly datasets */
%let startdt=jan2017;
%let enddt=mar2018;


/* create macro variable with value containing all source datasets in date range */
/* - only datasets which actually exists will be added                           */
%let ds_list=_dummy_;
proc sql noprint;
  select cats(libname,'.',memname,"(keep=name age)") into :ds_list separated by ' '
  from dictionary.tables
  where 
    libname="WORK" 
    and prxmatch('/^data\d{4}_\w{3}\d{4}_data\s*$/oi',memname)=1
    and input(scan(memname,2,'_'),monyy7.) between input("&startdt",monyy7.) and input("&enddt",monyy7.)
    ;
quit;

/* create table want with all selected source datasets */
data want;
  length sourceDS _sourceDS $41;
  set &ds_list indsname=_sourceDS;
  sourceDS=_sourceDS;
run;

 

...  
where libname="WORK"
...

Change libref "WORK" to the libref where your source datasets reside (i.e. ANTO) and things should work (name of libref must be in upper case).

Jagadishkatam
Amethyst | Level 16

This is an untested code but could work if you try

 

I am in the assumption to use the temp: the dataset name followed with colon in the below step and the same could be kept within the macro.

 

data anto.dataaccum;
     set anto.temp:;
run;

 

could you please test and let me know

Thanks,
Jag
Astounding
PROC Star

It's extremely tricky to navigate, if you are trying to pull different months for different years.  But if you are willing to give a single macro call for each year, here is a possibility.  This may not produce what your original program is intended to produce, but it is a good starting point for discussion.  The idea of selecting a.* in your current program worries me so I want to simplify it here:

 

%macro appendData (year=, month_list=);

   %local i next_month;

   %do i=1 %to %sysfunc(countw(&month_list));

      %let next_month = scan(&month_list, &i);

      proc sql;

      select var1 var2 var3 from data&year..&month.&year._data;

      quit;

      proc append data=temp base=auto.dataaccum;

      run;

   %end;

%mend appendData;

 

%appendData (year=2017, month_list=jan feb mar apr may jun)

 

This still doesn't rule out using a SET statement, but I'm trying to change as little as possible in your original program at this point.  You could try to use macro language to generate this final program as well:

 

data auto.dataaccum;

  do until (done1);

      set data2017.jan2017_data (keep=var1 var2 var3) end=done1;

      output;

   end;

  do until (done2);

      set data2017.feb2017_data (keep=var1 var2 var3) end=done2;

      output;

   end;

  do until (done3);

      set data2017.mar2017_data (keep=var1 var2 var3) end=done3;

      output;

   end;

run;

 

It's just not clear at this point whether this produces the outcome you are looking for.

art297
Opal | Level 21

Here is a datastep approach using the call execute function:

data _null_;
  length sendtox $255;
  call execute('data anto.dataaccum; set ');
  year=2000;
  do i=1 to 192; /*total number of months*/
    if mod(i,12) eq 1 then year+1;
    sendtox=catt('data',year,'.',put(mdy(i,1,year),monname3.),year,'_data (keep=var1 var2 var3)');
    call execute(sendtox);
  end;
  call execute(';run;');
run;

Art, CEO, AnalystFinder.com

 

Patrick
Opal | Level 21

And just for fun here yet another a bit unusual approach:

 

 

/* create sample source data */
data data2017_jan2017_data data2018_feb2018_data;
  set sashelp.class;
run;

/* set start and end date for monthly datasets */
%let startdt=jan2017;
%let enddt=mar2018;

/* define vars to keep */
%let keepVar=name age height;


/* create DS with source datasets in date range with required naming pattern */
data work.ds(keep=sourceDS &keepVar);
  length sourceDS _sourceDS $41;
  set 
    data:(obs=1) indsname=_sourceDS;

  if
    prxmatch('/^data\d{4}_\w{3}\d{4}_data\s*$/oi',scan(_sourceDS,-1,'.'))=1
    and input("&startdt",monyy7.) <= input(scan(_sourceDS,3,'._'),monyy7.)  <= input("&enddt",monyy7.)
    ;

  sourceDS=_sourceDS;
run;

/* create table want with all selected source datasets */
data want(keep=sourceDS &keepVar);
  set work.ds;

  _dsid=open(catx(' ',sourceDS,"(keep=&keepVar)"),"I");
  call set(_dsid);

  do _nobs=1 by 1;
    _rc=fetchobs(_dsid,_nobs);
    if _rc = 0 then output;
    else leave;
  end;

  _rc=close(_dsid);
run;

 

 

Kurt_Bremser
Super User

First, make a dataset with dataset names.

Then do

data _null_;
set dataset_names end=done;
if _n_ = 1 then call execute('data want; set ';);
call execute(trim(ds_name) !! ' (keep=var1 var2 var3) ');
if done then call execute('; run;');
run;

 

Kurt_Bremser
Super User

Of course there is a catch to this: with a sufficient number of datasets to concatenate, one could exceed the maximum length of a single SAS statement (32767 characters).

mkeintz
PROC Star

@pAckmAn

 

In addition to the useful comments sent by others, consider appending the "open=defer" option to the SET statement.  I suspect, that all the data sets you are concatenatiing have the same variables.  If so (and only if so), then instead of SAS setting up a separate input buffer for each incoming data set (the default behavior), you can tell SAS to reuse the same buffer for each data set in turn.  Saves a lot of memory, and probably some cpu time and clock time.

 

I.e. the SET command would look like

    set

       ... list of data sets here ....

      open=defer;

 

 

The tradeoff - you can't have a BY statement accompany the SET statement - because that would require SAS to "look ahead" to the next record (to determine last.by status) - which in turn would require more than one buffer. 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 2866 views
  • 5 likes
  • 7 in conversation