Dear All,
I have to pull data from several datasets into one. The following code works perfectly:
Data Final;
Set
D3 D4 D5 D6 D7 D8 D9 D10;
Run;
As the number of datasets changes each time, I wonder if there is a way to call them in one step (similar to Array D3-D10).
If it is not possible, is there a way to use a macro...?
Regards,
JAR
I think I read that this is possible in 9.3.
Of course, even without 9.3, you could always use something like:
data d1;
x=1;
output;
run;
data d2;
x=2;
output;
run;
data all;
set d:;
run;
And, although I had never tried it before, it works in 9.2 as well:
data all;
set d1-d2;
run;
What I must have read is the new ability to do the same thing in the data statement itself.
I am using the Learner's Edition of Enterprise Guide. The engine is still 9.1, and your code does not work in it:
data all;
set d1-d2;
run;
Regards,
JAR
You could always approximate it using a combination of PROC SQL and a DATA step, e.g.:
proc sql noprint;
select memname into : files
separated by " "
from dictionary.tables
where libname="WORK" and
memname like 'D%'
;
quit;
data want;
set &files.;
run;
This should work as well:
%macro combine;
data final;
set
%do i=3 %to 10;
d&i
%end;
;
run;
%mend;
%combine;
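A possible refinement (just a sketch; the parameter names out=, prefix=, first=, and last= are my own invention, not from the original post) is to parameterize the macro so the dataset prefix and range don't have to be edited each time:

%macro combine(out=final, prefix=d, first=3, last=10);
data &out;
set
%do i=&first %to &last;
&prefix&i
%end;
;
run;
%mend combine;

* same result as the hard-coded version above;
%combine(out=final, prefix=d, first=3, last=10)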
Hi ... as more and more data sets get added, would PROC APPEND be faster for concatenating data sets ...
%macro fakedata;
%do j=1 %to 10;
data d&j;
do j=1 to 1e6;
output;
end;
run;
%end;
%mend;
* make 10 data sets ... d1 through d10;
%fakedata;
data _null_;
do ds=1 to 10;
call execute(catt('proc append base=final data=d',ds,';run;'));
end;
run;
Interestingly, yes, proc append (and/or probably using append in proc datasets) is quite a bit more efficient. I wonder why the same operation uses a different algorithm in a datastep. There shouldn't be any need to re-read each file when appending additional files, but the processing time indicates otherwise.
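For reference, the "append in proc datasets" variant mentioned above would look something like this (a sketch, not benchmarked here; APPEND in PROC DATASETS takes the same BASE= and DATA= options as PROC APPEND, and several APPEND statements can run in one PROC step):

proc datasets lib=work nolist;
append base=final data=d1;
append base=final data=d2;
quit;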
APPEND is a specialized tool, and that allows a degree of optimization (block operations, etc.). The DATA step is a very flexible thing, but at a cost. It drags all of the data, an observation at a time, through the program data vector. That adds overhead.
I'm pretty sure there's no re-reading. That would have to be deliberately contrived.
Also: OPEN=DEFER may help in the DATA step, if the data sets meet the requirements.
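A minimal sketch of what that would look like (assuming all the members have identical variables and attributes, which OPEN=DEFER requires):

data all;
* OPEN=DEFER: members are opened at execution time and share the input buffer of the first member;
set d3 d4 d5 d6 d7 d8 d9 d10 open=defer;
run;

With OPEN=DEFER, SAS defers opening each member until execution and reuses the first member's input buffer, which trims some of the per-member setup overhead in a pure concatenation step.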
art297 wrote:
Interestingly, yes, proc append (and/or probably using append in proc datasets) is quite a bit more efficient. I wonder why the same operation uses a different algorithm in a datastep. There shouldn't be any need to re-read each file when appending additional files, but the processing time indicates otherwise.